Detecting and Handling Network Segmentation ("Split Brain")

When network segmentation occurs, a distributed system that does not handle the partition condition properly allows multiple subgroups to form. This condition can lead to numerous problems, including distributed applications operating on inconsistent data.

For example, because thin clients connecting to a server cluster are not tied into the membership system, a client might communicate with servers from multiple subgroups. Or, one set of clients might see one subgroup of servers while another set of clients cannot see that subgroup but can see another one.

SQLFire handles this problem by allowing only one subgroup to form and survive. The distributed systems and caches of the other subgroups are shut down as quickly as possible, and the SQLFire logging system raises alerts so that administrators can take action.

Network partition detection in SQLFire is based on the concept of a lead member and a group management coordinator. The coordinator is a member that manages entry and exit of other members of the distributed system. For network partition detection, the coordinator is always a SQLFire locator. The lead member is always the oldest member of the distributed system that does not have a locator running in the same process. Given this, two situations will cause SQLFire to declare a network partition:

You enable network partition detection by setting the enable-network-partition-detection distributed system property to true. Enable it in all locators and in any other process that should be sensitive to network partitioning. Processes that do not have network partition detection enabled are not eligible to be the lead member, so their failure does not trigger declaration of a network partition.
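The lead member rule described above can be illustrated with a small, self-contained Java sketch. It is not SQLFire code: the Member class, its fields, and the leadMember method are hypothetical names introduced only for this illustration, and join order stands in for member age. The sketch picks the oldest member that does not host a locator and that has network partition detection enabled.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class LeadMemberSketch {

    /** Hypothetical, simplified view of a member; not a SQLFire class. */
    static final class Member {
        final String name;
        final long joinOrder;              // assumption: lower value = joined earlier (older)
        final boolean hostsLocator;        // a locator runs in the same process
        final boolean partitionDetection;  // enable-network-partition-detection=true

        Member(String name, long joinOrder, boolean hostsLocator, boolean partitionDetection) {
            this.name = name;
            this.joinOrder = joinOrder;
            this.hostsLocator = hostsLocator;
            this.partitionDetection = partitionDetection;
        }
    }

    /**
     * Returns the oldest member that has no locator in its process and has
     * partition detection enabled, or empty if no member is eligible.
     */
    static Optional<Member> leadMember(List<Member> members) {
        return members.stream()
                .filter(m -> !m.hostsLocator)        // locator processes cannot be the lead member
                .filter(m -> m.partitionDetection)   // detection must be enabled to be eligible
                .min(Comparator.comparingLong(m -> m.joinOrder));
    }

    public static void main(String[] args) {
        List<Member> members = List.of(
                new Member("locator1", 1, true, true),
                new Member("serverA", 2, false, false),  // older, but detection is disabled
                new Member("serverB", 3, false, true),
                new Member("serverC", 4, false, true));

        // serverA is skipped because detection is off, so serverB is selected.
        System.out.println(leadMember(members).map(m -> m.name).orElse("none"));  // prints "serverB"
    }
}
```

In this model, the failure of serverA would not count toward a partition declaration, because serverA was never eligible to be the lead member.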

Note: The distributed system must contain locators to enable network partition detection.
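As a rough illustration of the two requirements above (the property set to true, and locators present), the following Java sketch builds a property set and checks it for consistency. The class and method names are hypothetical, the host name and port are placeholders, and the use of a locators property in host[port] form is an assumption of this sketch. How the properties are passed to a SQLFire process is not shown here.

```java
import java.util.Properties;

public class PartitionDetectionConfigSketch {

    /** Builds an illustrative property set; the helper itself is hypothetical. */
    static Properties bootProperties(String locators, boolean detectPartitions) {
        Properties props = new Properties();
        if (locators != null && !locators.trim().isEmpty()) {
            // Assumption: locators are listed as host[port], comma-separated.
            props.setProperty("locators", locators.trim());
        }
        props.setProperty("enable-network-partition-detection",
                Boolean.toString(detectPartitions));
        return props;
    }

    /**
     * Returns false when partition detection is enabled but no locators are
     * configured, since detection requires locators in the distributed system.
     */
    static boolean isConsistent(Properties props) {
        // Assumption of this sketch: an absent property is treated as "not enabled".
        boolean detect = Boolean.parseBoolean(
                props.getProperty("enable-network-partition-detection", "false"));
        String locators = props.getProperty("locators", "").trim();
        return !detect || !locators.isEmpty();
    }

    public static void main(String[] args) {
        // "locator-host" and 10334 are placeholder values, not taken from the text.
        Properties withLocator = bootProperties("locator-host[10334]", true);
        Properties withoutLocator = bootProperties("", true);

        System.out.println(isConsistent(withLocator));     // prints "true"
        System.out.println(isConsistent(withoutLocator));  // prints "false"
    }
}
```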