
Cluster Split-Brain explained [part 3]

As you may remember, part 2 of the cluster Split-Brain article series ended with some open questions, which will be answered in this post. You can read the first part here and the second part here.

So, where is this mechanism that prevents Split-Brain? Why didn't it work in this case, and can it be made more secure?

Protection against a cluster split, or Split-Brain, is included in Open-E JovianDSS and works well, but in the example from the previous post it simply did not get a chance to work. Why? Because we deliberately used an example that shows that the effectiveness of this mechanism depends on the cluster environment. In our case, there was only one connection between the cluster nodes, used both for node-to-node communication and for data synchronization between the disks (cluster path / mirror path). When that connection was lost, the mechanism had no chance to act. So let's expand the environment to see exactly how this protection mechanism works.

Let's add a second connection between the nodes and assign a separate function to each of them:

| node A | <— cluster path —> | node B |
| disks | <— mirror path —> | disks |

Before we begin to destroy our cluster, I’d like to quickly explain what this mysterious Cluster Split-Brain Protection is.

ZFS on the node that currently manages the pool (i.e. has the pool imported) performs so-called micro writes on this pool, and thus on its disks. Thanks to this, if any other node "wants" to import the pool, it first checks whether micro writes that it did not perform itself are appearing on the disks. If such micro writes are detected, the import of the pool fails with a corresponding error code and message. However, if no micro writes are detected within 10 seconds, the node takes over pool management and starts performing such writes itself (both the writes and the check for their occurrence are a native ZFS mechanism).
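The detection loop described above can be sketched as a simplified simulation in Python. This is not Open-E's or ZFS's actual code – the class names, the probe count, and the shared-disk object are all illustrative – but it shows the decision logic: refuse the import while someone else's micro writes keep appearing, take over only when the disks stay quiet for the whole check window.

```python
class SharedDisk:
    """Stands in for the on-disk area where the micro writes land."""
    def __init__(self):
        self.mmp_sequence = 0  # bumped on every micro write

class ActiveNode:
    """The node that currently has the pool imported keeps touching the disk."""
    def __init__(self, disk):
        self.disk = disk

    def micro_write(self):
        self.disk.mmp_sequence += 1

def try_import(disk, other_node_alive=None, probes=10):
    """Watch the disk for the whole check window (the article's ~10 seconds,
    compressed here to `probes` loop iterations). The import succeeds only
    if no foreign micro writes show up in the meantime."""
    seen = disk.mmp_sequence
    for _ in range(probes):
        if other_node_alive is not None:
            other_node_alive()  # in reality this happens on the other node
        if disk.mmp_sequence != seen:
            return False  # pool is in use elsewhere: abort, no split-brain
    return True  # disks stayed quiet: safe to take over

disk = SharedDisk()
node_a = ActiveNode(disk)

# Mirror path intact, node A alive: node B's import attempt must fail.
print(try_import(disk, other_node_alive=node_a.micro_write))  # False

# Node A really dead (no writes during the window): takeover is allowed.
print(try_import(disk))  # True
```

Note that the decision is made entirely from what is visible on the shared disks – the nodes do not need a working cluster path to reach it, which is exactly why it still works in case 1 below.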

Here are some examples of what will happen in case of individual failures and a series of failures in the cluster.

Initial status: Node A manages Pool0 and provides shareX via VIP1

1. The cluster path line is damaged, so the nodes no longer communicate and each of them considers the other one "not working". Node A still exposes its resources, while node B assumes that node A has failed and, after some time without a response, starts taking over the resources. As Pool0 is being taken over, the ZFS mechanism begins to check whether micro writes prove that the pool is still in use. Since the mirror path line is still working, node A keeps writing data (including the micro writes) to node B's disks. Node B therefore detects the micro writes and aborts the takeover attempt – there is no cluster split. If, in this state, the mirror path line is also damaged, the cluster will split as described in part 1 of this series, and after the communication paths are repaired, the mechanism from part 2 (the one that prevents destroying the data on the disks) will take effect. We return to the initial state: node A manages Pool0 and shares the shareX resource via VIP1.

2. The mirror path line is damaged. The nodes still communicate on the cluster level, but node A no longer synchronizes data to the remote drives of node B. If the cluster ran in this state for, say, one day, a takeover would lead to a rather strange cluster split: node A would still hold current data, while node B would serve data that is a day old, since it has not been synchronized. But this will not happen, because there is another safety mechanism that, on a mirror path error, marks the cluster resource as unmanageable. Such a resource can then be served only on the system where its disks are local – there we can be sure the data is current. Thanks to this additional mechanism (the unmanageable state), node B will not take over the resources. After the mirror path line is repaired, the data is synchronized to the remote drives, and only then is the resource marked as manageable again. The repair can also be done the other way round.
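The unmanageable-state logic from case 2 can also be sketched in a few lines of Python. Again, this is a hypothetical model – the class and method names are invented for illustration, not taken from JovianDSS – but it captures the rule: once the mirror path is down, failover to the node with stale disks is forbidden until the data has been resynchronized.

```python
class ClusterResource:
    """Hypothetical model of a clustered resource with an 'unmanageable' flag."""
    def __init__(self):
        self.remote_in_sync = True   # remote mirror holds current data
        self.manageable = True       # failover between nodes is allowed

    def on_mirror_path_down(self):
        # Writes stop reaching the remote drives, so their data goes stale.
        self.remote_in_sync = False
        self.manageable = False      # resource may now run only on local disks

    def can_fail_over(self):
        # Node B consults this before attempting a takeover.
        return self.manageable

    def on_mirror_path_repaired(self):
        self.remote_in_sync = True   # mirror path resyncs the remote drives...
        self.manageable = True       # ...and only then is failover re-enabled

res = ClusterResource()
res.on_mirror_path_down()
print(res.can_fail_over())   # False: node B must not serve day-old data
res.on_mirror_path_repaired()
print(res.can_fail_over())   # True: cluster is back to normal operation
```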

But wait! There's one more question, the one posed at the end of part 2 – can it be made more secure?

Well, it always can be. But no amount of added security gives you 100% certainty that a cluster split will never happen. Adding more communication paths that duplicate the cluster path / mirror path functions only reduces the likelihood of a Cluster Split-Brain; it does not eliminate it. Even if we add 5 mirror paths, 5 cluster paths and some other communication lines, we cannot guarantee that the problem will never occur, because a customer can still set up a minimalist environment like the one in the first part. This also answers the question of why we need a mechanism that protects against data loss after a cluster split when we already have a mechanism that counteracts the split itself: because we cannot assume that a Cluster Split-Brain will never happen.

