I don't quite understand the split brain part of the Redlock algorithm



The "Retry on failure" section of Distributed locks with Redis – Redis says:

When a client is unable to acquire the lock, it should try again after a random delay in order to try to desynchronize multiple clients trying to acquire the lock for the same resource at the same time (this may result in a split brain condition where nobody wins). Also the faster a client tries to acquire the lock in the majority of Redis instances, the smaller the window for a split brain condition (and the need for a retry), so ideally the client should try to send the SET commands to the N instances at the same time using multiplexing.

My questions

1. "this may result in a split brain condition where nobody wins": what exactly causes the split brain condition?
2. "Also the faster a client tries to acquire the lock in the majority of Redis instances, the smaller the window for a split brain condition (and the need for a retry)": why does a split brain condition become less likely the faster a client tries to acquire the lock?
3. "so ideally the client should try to send the SET commands to the N instances at the same time using multiplexing": doesn't the "The Redlock algorithm" section of Distributed locks with Redis – Redis say "It tries to acquire the lock in all the N instances sequentially"? Why does it become "at the same time" here?

Supplement 1 · Oct 21, 2021

I had misunderstood split brain. For a brief introduction, see What is split-brain in distributed systems? - Quora.

6 replies • 2021-09-30 15:35:34 +08:00

ALLLi

Jul 29, 2020

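1. If three clients are competing for the lock, they may end up holding 2, 2, and 1 instances respectively, so none of them reaches a majority and nobody gets the lock. If all clients then retry immediately, without a random delay, they will most likely all fail again.
2. The less time a client takes to acquire a lock, the smaller the chance that another client cuts in.
3. "at the same time" here means a single client requesting the lock from all nodes simultaneously, whereas the first point is about different clients adding a random delay.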
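
The 2, 2, 1 outcome in point 1 can be checked with a minimal sketch. It assumes five instances and that each instance grants the lock to whichever client's SET reaches it first; `majority_holder` and the client names are hypothetical, used only for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_holder(winners: Sequence[str]) -> Optional[str]:
    """winners[i] is the client whose SET reached instance i first.

    Returns the client holding a majority of instances, or None when
    the votes are split and nobody wins.
    """
    quorum = len(winners) // 2 + 1  # 3 out of 5
    client, count = Counter(winners).most_common(1)[0]
    return client if count >= quorum else None


# Three clients end up with 2, 2, and 1 instances: nobody reaches 3.
print(majority_holder(["A", "A", "B", "B", "C"]))  # None
# One client reaches three instances and wins.
print(majority_holder(["A", "A", "A", "B", "C"]))  # A
```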

JasonLaw

Jul 29, 2020

@ALLLi #1 Honestly, I feel you didn't quite understand my question, and you didn't answer it either.

duwan

Jul 29, 2020

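I don't know Redlock well, but here are my thoughts; I'm not sure they are right.
1. Split brain is caused by a network failure. For example, with 5 nodes, a network failure splits them into two partitions, one with 2 nodes and the other with 3. When a client tries to acquire the lock, it cannot reach the Redis nodes in the other partition because of the failure, which amounts to failing to acquire the lock.
2. The shorter the interval between lock requests to the individual Redis nodes, the easier it is to satisfy the condition: total elapsed time < the lock's expiry time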
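
The condition in point 2 can be written as a small check. This is a simplified sketch: it ignores the clock drift allowance, and `lock_validity_ms` and its parameters are hypothetical names, not part of any Redis client.

```python
def lock_validity_ms(granted: int, n: int, elapsed_ms: float, ttl_ms: float) -> float:
    """Return the remaining validity in ms, or 0.0 if the attempt failed.

    The attempt succeeds only if a majority of the n instances granted
    the lock AND the time spent acquiring is less than the lock's TTL.
    """
    quorum = n // 2 + 1
    remaining = ttl_ms - elapsed_ms
    if granted >= quorum and remaining > 0:
        return remaining
    return 0.0


print(lock_validity_ms(3, 5, 40, 1000))    # 960: majority, acquired quickly
print(lock_validity_ms(3, 5, 1200, 1000))  # 0.0: majority, but too slow
print(lock_validity_ms(2, 5, 40, 1000))    # 0.0: fast, but no majority
```

The shorter `elapsed_ms` is, the more of the TTL is left for actually using the lock.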

ppyybb

Jul 29, 2020

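1. If the locks are acquired in strict order, I can't immediately think of a split-brain case. But the algorithm has a timeout (described in its step 2), so two callers may end up not acquiring the locks in strict order, and that leads to split brain.
2. The wording here is a bit ambiguous: does it mean trying sooner, or acquiring faster? The two are different. If it is the former, then as long as you don't try to acquire, this state won't end any time soon; only a retry can end it.
But judging from the rest of the sentence, it seems to be about the time it takes to acquire all the locks. Consider the extreme case: if acquiring all the locks were an atomic operation (taking very little time), there would be no conflict visible at the application level, and the split brain would necessarily end immediately.
3. I think sending at the same time is a non-routine operation that is performed only when acquire has failed. Its purpose is what point 2 discusses: the faster the remaining resources are acquired, the shorter the split brain lasts.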
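
For reference, a single client sending its requests to all N instances at once might look like the sketch below. It is only an illustration: `FakeInstance` is a hypothetical in-memory stand-in rather than a real Redis client, and `asyncio.gather` stands in for the multiplexing the documentation mentions.

```python
import asyncio


class FakeInstance:
    """Hypothetical in-memory stand-in for one Redis instance."""

    def __init__(self, latency_s: float):
        self.latency_s = latency_s
        self.owner = None

    async def set_nx(self, token: str) -> bool:
        await asyncio.sleep(self.latency_s)  # simulated network round trip
        if self.owner is None:               # set only if not already held
            self.owner = token
            return True
        return False


async def acquire_concurrently(instances, token: str) -> int:
    """Send the request to every instance at once; return how many granted it."""
    results = await asyncio.gather(*(inst.set_nx(token) for inst in instances))
    return sum(results)


instances = [FakeInstance(0.01 * (i + 1)) for i in range(5)]
granted = asyncio.run(acquire_concurrently(instances, "client-A"))
print(granted >= len(instances) // 2 + 1)  # True: majority reached
```

With concurrent requests the acquisition takes roughly as long as the slowest instance, instead of the sum of all round trips as in the sequential case.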

farseeraliens

Jul 29, 2020

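@JasonLaw Reply #1 is completely correct; you didn't understand what he meant. The split brain in the documentation does not mean multiple proposers/leaders each acting on their own; it means a livelock has been reached. If you don't see why random backoff avoids livelock, you may want to look up "Buridan's ass".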

shenyangno1

Sep 30, 2021

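I'd like to try answering question 1. I was also very confused when I read this passage in the original text; the word "desynchronize" puzzled me. I found this discussion online, and the "Buridan's ass" mentioned above gave me a hint.

The key point here is "retry after a random delay".

If the "split brain" described in this passage occurs, there must have been contention: every client tried to take the same lock at the same time, leaving no client holding a majority of the Redis instances. After each client finds that it failed to acquire the lock, the subsequent lock releases also happen more or less in sync (at the same time), and then the clients have to try again. If they retry immediately, or after waiting a fixed amount of time, the next attempts are again synchronized (at the same time), and contention will most likely occur again.

This is where "retry after a random delay" is clever: it breaks the rhythm (the lockstep) of the "synchronized retries" and lowers the probability of contention on retry. Perhaps this is what the original text means by "desynchronize".

"Retry after a random delay" breaks the "split brain" stalemate.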
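
The effect of the random delay can be seen in a toy model. This is a rough sketch under a strong simplification: a client wins a round only if it retries strictly earlier than every other client, and a tie for the earliest slot counts as another collision. `rounds_until_winner` and the slot range are hypothetical choices for illustration.

```python
import random
from typing import Callable, Optional


def rounds_until_winner(pick_delay: Callable[[], int],
                        clients: int = 3,
                        max_rounds: int = 50) -> Optional[int]:
    """Return the first round in which exactly one client retries earliest.

    Returns None if every round up to max_rounds ends in a collision.
    """
    for round_no in range(1, max_rounds + 1):
        delays = [pick_delay() for _ in range(clients)]
        if delays.count(min(delays)) == 1:  # a single earliest client
            return round_no
    return None


rng = random.Random(0)  # seeded only to make the sketch reproducible

# Fixed delay: all clients stay in lockstep and collide in every round.
print(rounds_until_winner(lambda: 5))  # None

# Random delay: the clients desynchronize and one of them soon retries alone.
print(rounds_until_winner(lambda: rng.randrange(10)))
```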