
Cluster Split-Brain explained [part 1]

Today we’re starting a new series of technical articles. This time the topic is Cluster Split-Brain protection with Open-E JovianDSS. We’ve decided to cover it on the blog because in update 24 of the ZFS-based Open-E JovianDSS our developers added an additional, Pool activity-based Cluster Split-Brain Prevention feature. What exactly does this feature do? What is a cluster split-brain? How can it be handled with Open-E JovianDSS? Answers to these and many more questions will be provided in this series. Read on!

This is how it starts

Let’s start with two servers based on Open-E JovianDSS that will become cluster nodes, node A and node B.

Each server has its own disks. When setting up a cluster over Ethernet we have to connect the machines in a way that they can “see” each other. We configure the connection between them – for now let’s assume that it will be one local network line with a direct connection, let’s name it a cluster path.

In practice the path between the nodes should not be a single link. It should be a bond in active-backup mode, which is more resilient than a single Ethernet interface because it uses at least two interfaces. Even so, the bond can still fail; in fact, anything can happen. To keep this article focused, we will not provide an in-depth analysis of the scenarios in which successive paths fail.
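In JovianDSS the bond is configured through the product's own interface. Purely as an illustration of what active-backup bonding means, here is how such a bond could be set up on a plain Linux system with iproute2; the interface names `eth0`/`eth1`, the bond name, and the address are assumptions for this sketch:

```shell
# Create a bond in active-backup mode: one link carries traffic,
# the other stays on standby and takes over if the active link fails.
# miimon 100 = check link state every 100 ms.
ip link add bond0 type bond mode active-backup miimon 100

# Interfaces must be down before they can be enslaved to the bond.
ip link set eth0 down
ip link set eth0 master bond0
ip link set eth1 down
ip link set eth1 master bond0
ip link set bond0 up

# Assign the cluster-path address to the bond (example address).
ip addr add 10.0.0.1/24 dev bond0
```

The point of active-backup mode is that a single cable or NIC failure no longer severs the cluster path, though a failure of both links (or of the switch between them) still can.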

Now, as the nodes are able to “see” each other, we can configure the cluster. Our cluster will make one resource available, for example via the SMB protocol. To do this, we need to share the disks on node A with node B, and the other way around. For this disk sharing, Open-E JovianDSS uses a mirror path. We will run it over the same connection that the nodes already use to communicate.

Thus, we have two nodes in the cluster:

| node A with disks | <— cluster path / mirror path —> | node B with disks |

But this is the least secure solution. Why? Let us set up cluster resources – it will be a file network resource called ShareX. We also need the virtual IP (VIP) that will be connected on the configuration level with the file resource ShareX. Let’s call this virtual IP the VIP1.

We create a pool, Pool0, where ShareX will be located. To keep the data on the disks of both nodes, the pool is built so that its lowest disk-level structure consists of mirror containers (VDEVs): each mirror VDEV holds one disk from node A and one disk from node B. This way all drives are used to build the containers.
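JovianDSS builds the pool through its own interface, but since it is ZFS-based, the resulting layout corresponds to an ordinary mirrored `zpool` topology. As a sketch under assumed device names (`sdb`/`sdc` standing in for node A's local disks, `sdd`/`sde` for node B's disks made visible over the mirror path):

```shell
# Pool0 built from two mirror VDEVs; each mirror pairs one local
# disk with one remote disk, so every write lands on both nodes.
zpool create Pool0 \
  mirror /dev/sdb /dev/sdd \
  mirror /dev/sdc /dev/sde

# Inspect the resulting mirror-0 / mirror-1 topology.
zpool status Pool0
```

Because ZFS stripes data across the VDEVs and each VDEV is a cross-node mirror, every block of ShareX data ends up on a disk in node A and on a disk in node B.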

Now, let’s assume that Pool0 with such a configuration is created on node A, and that the pool is imported on this node. We configure VIP1 (it must have an equivalent physical network adapter on both nodes) and assign it to Pool0. On Pool0 we configure a Dataset, which will also be the network resource, so we name it ShareX. We now have a cluster that shares a file resource on node A through the IP address VIP1.

We connect from Windows via VIP1 to the ShareX resource and we save the data. The data goes to node A where ZFS saves them to VDEV disks locally and also to their mirrored copies on node B via mirror path.

As a test, we move Pool0 from node A to node B. The resource ShareX and VIP1 “disappear” from node A and “appear” on node B. Windows loses access to ShareX for a moment, but after a while it continues to copy.
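In the JovianDSS interface this is a single move-resource action; conceptually, the failover corresponds to exporting the pool on one node and importing it on the other, with the VIP moving along with it. The commands below are an illustrative sketch, with an assumed VIP address and interface name:

```shell
# On node A: release the virtual IP and the pool.
ip addr del 192.168.10.100/24 dev eth2
zpool export Pool0

# On node B: take over the pool and bring up the same VIP,
# so clients reconnect to ShareX at the unchanged address.
zpool import Pool0
ip addr add 192.168.10.100/24 dev eth2
```

The brief interruption Windows sees is the window between the export on one node and the import plus VIP activation on the other.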

Now imagine that the communication line between nodes is damaged (there are of course other security mechanisms, e.g. ping nodes but we will skip them for now, as we know that users might configure Open-E clusters in such a way).

What is the consequence?

From node A’s point of view, node B does not work, while from node B’s point of view, node A does not work either. According to the cluster algorithms, each node therefore concludes that the other has crashed, and both try to take over the resources. Node A imports Pool0 with ShareX and VIP1. Node B, on the other hand, already has those resources, so it does nothing.

A Cluster Split-Brain has taken place, and both nodes are serving the same cluster resources. Note that their mirror VDEVs now use only local disks, because the mirror path has been severed.

Now imagine that a Windows client saves part of its data to one node and part to the other, since two identical IPs are available on the network. There may also be many Windows clients performing independent writes on both nodes. The result is two ShareX resources with completely different data.

What next? That’s what we’ll discuss in the second part. Stay tuned!

