Network Partitioning Scenarios

This topic describes network partitioning scenarios and what happens to the partitioned sides of the distributed system.

What the Losing Side Does

In a network partitioning scenario, the "losing side" is the cluster partition whose membership coordinator has determined that it lacks a sufficient quorum of members to continue.

The membership coordinator calculates the change in membership weight after sending out its view preparation message. If a quorum of members does not remain after the view preparation phase, the coordinator on the "losing side" declares a network partition event, sends a network-partition-detected UDP message to all members, and then closes its distributed system with a ForcedDisconnectException. If a member fails to receive the message before the coordinator closes the connection, it is responsible for detecting the event on its own.
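The quorum decision described above can be sketched as a simple weight calculation. The following is a self-contained illustration, not GemFire code: the class and method names are hypothetical, and it assumes the default member weights (10 per cache member, an extra 5 for the lead member, 3 per locator) and a 51% quorum threshold.

```java
import java.util.List;

// Hypothetical sketch of the coordinator's quorum check after view preparation.
// Weights follow the documented defaults: cache member 10, lead member 15, locator 3.
public class QuorumCheck {

    // Simplified stand-in for a member's role in the membership view.
    enum Role { CACHE_MEMBER, LEAD_MEMBER, LOCATOR }

    static int weightOf(Role role) {
        switch (role) {
            case LEAD_MEMBER: return 15;  // cache member weight 10 + lead bonus 5
            case LOCATOR:     return 3;
            default:          return 10;  // ordinary cache member
        }
    }

    static int totalWeight(List<Role> view) {
        return view.stream().mapToInt(QuorumCheck::weightOf).sum();
    }

    // A partition keeps quorum only if its remaining weight is at least
    // 51% of the previous view's total weight; otherwise it is the losing side.
    static boolean quorumRemains(List<Role> previousView, List<Role> remaining) {
        return totalWeight(remaining) * 100 >= totalWeight(previousView) * 51;
    }

    public static void main(String[] args) {
        // Previous view: 2 locators, the lead member, and 1 cache member (weight 31).
        List<Role> previous = List.of(Role.LOCATOR, Role.LOCATOR,
                Role.LEAD_MEMBER, Role.CACHE_MEMBER);
        // Partition keeping both locators and the lead member (weight 21): quorum holds.
        List<Role> survivors = List.of(Role.LOCATOR, Role.LOCATOR, Role.LEAD_MEMBER);
        // Partition keeping only the non-lead cache member (weight 10): quorum lost.
        List<Role> losers = List.of(Role.CACHE_MEMBER);
        System.out.println("survivors keep quorum: " + quorumRemains(previous, survivors));
        System.out.println("losers keep quorum: " + quorumRemains(previous, losers));
    }
}
```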

When the losing side discovers that a network partition has occurred, all peer members receive a RegionDestroyedException with Operation: FORCED_DISCONNECT.

If a CacheListener is installed, the afterRegionDestroy callback is invoked with a RegionDestroyedEvent, as shown in this example logged by the losing side. The peer member process IDs are 14291 (lead member) and 14296, and the locator is 14289.
[info 2008/05/01 11:14:51.853 PDT <CloserThread> tid=0x4a] 
Invoked splitBrain.SBListener: afterRegionDestroy in client1 whereIWasRegistered: 14291 
event.isReinitializing(): false 
event.getDistributedMember(): thor(14291):40440/34132 
event.getCallbackArgument(): null 
event.getRegion(): /TestRegion 
event.isDistributed(): false 
event.isExpiration(): false 
event.isOriginRemote(): false 
Operation: FORCED_DISCONNECT 
Operation.isDistributed(): false 
Operation.isExpiration(): false 
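A listener that distinguishes a forced disconnect from an ordinary region destroy can be sketched as follows. This is a minimal, self-contained illustration: the RegionDestroyedEvent class and Operation enum below are simplified stand-ins for the GemFire API types of the same names, so the file compiles on its own.

```java
// Minimal sketch of a callback that reacts to a forced disconnect.
public class ForcedDisconnectListener {

    // Stand-in for GemFire's Operation; only the values used here are modeled.
    enum Operation { FORCED_DISCONNECT, REGION_DESTROY }

    // Stand-in for the event passed to afterRegionDestroy.
    static class RegionDestroyedEvent {
        private final Operation operation;
        private final String regionPath;
        RegionDestroyedEvent(Operation op, String path) {
            operation = op;
            regionPath = path;
        }
        Operation getOperation() { return operation; }
        String getRegionPath() { return regionPath; }
    }

    // Returns a short status string describing how the callback classified the event.
    String afterRegionDestroy(RegionDestroyedEvent event) {
        if (event.getOperation() == Operation.FORCED_DISCONNECT) {
            // This member is on the losing side of a partition: release any
            // external resources here; the cache itself is already closing.
            return "forced-disconnect:" + event.getRegionPath();
        }
        return "normal-destroy:" + event.getRegionPath();
    }

    public static void main(String[] args) {
        ForcedDisconnectListener listener = new ForcedDisconnectListener();
        RegionDestroyedEvent event =
                new RegionDestroyedEvent(Operation.FORCED_DISCONNECT, "/TestRegion");
        System.out.println(listener.afterRegionDestroy(event));
    }
}
```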

Peers still actively performing operations on the cache may see ShutdownExceptions or CacheClosedExceptions with Caused by: ForcedDisconnectException.
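Because the ForcedDisconnectException arrives wrapped inside another exception, application code typically has to walk the cause chain to recognize it. The sketch below is self-contained: the exception classes are simplified stand-ins for the GemFire types of the same names.

```java
// Sketch of how a peer performing cache operations might classify an
// exception as a forced disconnect by inspecting its cause chain.
public class DisconnectCheck {

    // Stand-ins for the GemFire exception types.
    static class ForcedDisconnectException extends RuntimeException {
        ForcedDisconnectException(String msg) { super(msg); }
    }

    static class CacheClosedException extends RuntimeException {
        CacheClosedException(String msg, Throwable cause) { super(msg, cause); }
    }

    // True if any cause in the chain is a ForcedDisconnectException.
    static boolean causedByForcedDisconnect(Throwable t) {
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            if (cause instanceof ForcedDisconnectException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RuntimeException e = new CacheClosedException("cache is closed",
                new ForcedDisconnectException("member forced out of the distributed system"));
        System.out.println(causedByForcedDisconnect(e));
    }
}
```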

If a member using the Admin interface on the losing side has an AlertListener configured, its alert callback is invoked for all system logging above the configured alertLevel:
[info 2008/05/01 11:14:42.126 PDT <Pooled Message Processor2> tid=0x41] 
Invoked splitBrain.SBAlertListener in client with vmID 1, pid 14289 
alert.getConnectionName(): gemfire1_thor_14291 
alert.getDate(): Thu May 01 11:14:42 PDT 2008 
alert.getLevel(): WARNING 
alert.getMessage(): unable to send message 
to biscuit.gemstone.com/10.80.10.70:50972 (128 bytes); 
Operation was not permitted by datagram socket. 
alert.getSourceId(): TimeScheduler.Thread tid=0x1d 
alert.getSystemMember(): gemfire1_thor_14291

If a member shuts down due to loss of all locators, it logs an alert like this:
This member has been forced out of the distributed system. 
Reason=Unable to contact any locators and network partition detection is enabled

What the Surviving Side Does

If a locator on the surviving side has an AlertListener configured, its alert callback is invoked for all system logging above the level returned by AdminDistributedSystem.getAlertLevel. On the surviving side, the peer member process IDs are 7435 and 7438, and the locator is 7430.

[info 2008/05/01 11:14:55.807 PDT <Pooled Message Processor2> tid=0x40] 
Invoked splitBrain.SBAlertListener in client with vmID 2, pid 7430 
alert.getConnectionName(): gemfire4_biscuit_7438 
alert.getDate(): Thu May 01 11:14:55 PDT 2008 
alert.getLevel(): WARNING 
alert.getMessage(): 15 sec have elapsed while waiting for replies: 
<ReplyProcessor21 2688 waiting for 2 replies from [thor(14291):40440/34132, 
thor(14296):40442/55944]> on biscuit(7438):50975/57267 whose current 
membership list is: [[biscuit(7435):50978/50626, thor(14291):40440/34132, 
thor(14296):40442/55944, biscuit(7438):50975/57267]] 
alert.getSourceId(): vm_6_thr_10_client2_biscuit_7438 tid=0x48 
alert.getSystemMember(): gemfire4_biscuit_7438
If a member using the Admin interface on the surviving side has a SystemMembershipListener configured, it processes memberCrashed events for the peer members on the losing side:
[info 2008/05/01 11:15:22.742 PDT <DM-MemberEventInvoker> tid=0x1b] 
Invoked splitBrain.SBSystemMembershipListener: memberCrashed in admin2 
event.getDistributedMember(): thor(14291):40440/34132 
event.getMemberId(): thor(14291):40440/34132 

[info 2008/05/01 11:15:27.790 PDT <DM-MemberEventInvoker> tid=0x1b] 
Invoked splitBrain.SBSystemMembershipListener: memberCrashed in admin2 
event.getDistributedMember(): thor(14296):40442/55944 
event.getMemberId(): thor(14296):40442/55944
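A membership listener that records crashed peers, as in the log entries above, can be sketched like this. The SystemMembershipEvent interface below is a simplified stand-in for the GemFire Admin API interface of the same name, so the sketch compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a membership listener on the surviving side that records
// the member IDs of crashed losing-side peers.
public class CrashTracker {

    // Stand-in for the GemFire Admin API event; only getMemberId is modeled.
    interface SystemMembershipEvent {
        String getMemberId();
    }

    private final List<String> crashedMembers = new ArrayList<>();

    // Invoked once per losing-side peer, as in the example log entries.
    void memberCrashed(SystemMembershipEvent event) {
        crashedMembers.add(event.getMemberId());
    }

    List<String> getCrashedMembers() {
        return crashedMembers;
    }

    public static void main(String[] args) {
        CrashTracker tracker = new CrashTracker();
        // Simulate the two memberCrashed events from the example log.
        tracker.memberCrashed(() -> "thor(14291):40440/34132");
        tracker.memberCrashed(() -> "thor(14296):40442/55944");
        System.out.println(tracker.getCrashedMembers());
    }
}
```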

What Isolated Members Do

When a member is isolated from all locators, it is unable to receive membership view changes. It cannot know whether the current coordinator is still present or, if the coordinator has left, whether other members are available to take over that role. In this condition, the member eventually detects the loss of all other members and uses the loss threshold to determine whether it should shut itself down. For example, in a distributed system with two locators and two cache servers, loss of communication with the non-lead cache server plus both locators would put the remaining cache server in this situation, and it would eventually shut itself down.
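The two-locator, two-cache-server example can be worked through numerically. This is an illustration only, assuming the default member weights (10 per cache member, an extra 5 for the lead member, 3 per locator) and a 51% quorum threshold; the class and constant names are hypothetical, not GemFire API.

```java
// Working out the isolated-member example with illustrative default weights.
public class IsolationExample {

    static final int CACHE_MEMBER = 10;  // ordinary cache member
    static final int LEAD_MEMBER = 15;   // cache member weight 10 + lead bonus 5
    static final int LOCATOR = 3;

    public static void main(String[] args) {
        // Previous view: 2 locators + lead cache server + non-lead cache server.
        int previousViewWeight = LOCATOR + LOCATOR + LEAD_MEMBER + CACHE_MEMBER; // 31
        // The isolated lead cache server is all that remains on its side.
        int remainingWeight = LEAD_MEMBER; // 15, which is about 48% of 31
        boolean quorumLost = remainingWeight * 100 < previousViewWeight * 51;
        System.out.println("previous view weight = " + previousViewWeight);
        System.out.println("remaining weight = " + remainingWeight);
        System.out.println("quorum lost, member shuts down: " + quorumLost);
    }
}
```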