The default recovery settings for DRBD on the Chef Server will take care of split brain scenarios most of the time. The following are steps that you can take to manually recover from a split brain. Before following any of the steps, read the entire document through.
The Chef Server ships with helper commands to determine the current ha-status of the Server. DRBD ships with utilities to determine and control the state of DRBD. First you'll want to check the DRBD status and ha-status of both backend nodes and determine which steps you'll need to take for remediation.
$: chef-server-ctl ha-status
The output of this should be several lines of '[OK]'. If you see any '[ERROR]' lines that are not related to chef-mover then you'll need to fix the HA.
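The check described above can be sketched as a small filter. This is a minimal sketch, not part of the Chef Server tooling: the function name `ha_errors` is hypothetical, and it assumes that ha-status writes its '[OK]' and '[ERROR]' lines to standard output and that chef-mover lines contain the text "chef-mover".

```shell
#!/bin/sh
# Hypothetical helper: read ha-status output on stdin and print only the
# '[ERROR]' lines that do not mention chef-mover.
# Returns 1 if any such line was printed, 0 otherwise.
ha_errors() {
    # The pipeline status is that of the last grep, which succeeds only
    # when at least one relevant '[ERROR]' line was printed.
    if grep -F '[ERROR]' | grep -v 'chef-mover'; then
        return 1
    fi
    return 0
}
```

It would be used as `chef-server-ctl ha-status | ha_errors`, with any printed line pointing at something that needs fixing.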
$: cat /proc/drbd
The primary should show something like "cs:Connected ro:Primary/Secondary". The secondary should show something like "cs:Connected ro:Secondary/Primary".
If either node is not connected you'll need to fix DRBD.
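To compare the two backends quickly, the connection state and roles can be pulled out of the status text. The following is a sketch only: `drbd_state` is a hypothetical name, and it assumes the "cs:... ro:..." fields appear on one line separated by a single space, as in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/drbd-style text on stdin and print
# "<connection state> <roles>" for each resource line that has both fields.
drbd_state() {
    sed -n 's/.*cs:\([^ ]*\) ro:\([^ ]*\).*/\1 \2/p'
}
```

Run it on each backend as `drbd_state < /proc/drbd`. Given the examples above, a healthy primary would be expected to print "Connected Primary/Secondary".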
The first step here is to stop the services on all machines in the cluster, including the frontends. After the services have been stopped you'll need to recover DRBD and then recover HA. After you've recovered, restart the Chef Server service on all the frontends.
- $: chef-server-ctl stop
- recover from broken DRBD
- recover from broken HA
- $: chef-server-ctl start (on the primary backend and all frontends)
When recovering broken DRBD you'll first want to determine which backend should be the primary. This is normally the backend node that was last running as primary. It's critical that you determine this correctly to prevent data loss and corruption. After you've determined which node should be primary, treat the other node as the split brain victim.
If the connection state of either node is "cs:WFConnection" or "cs:StandAlone" then you'll need to reconnect them.
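That decision can be expressed as a single test on the status text. This is a hedged sketch: `needs_reconnect` is a hypothetical name, and it only looks for the two connection states named above.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/drbd-style text on stdin and succeed
# (exit status 0) if any resource is in WFConnection or StandAlone.
needs_reconnect() {
    grep -Eq 'cs:(WFConnection|StandAlone)([[:space:]]|$)'
}
```

For example, `needs_reconnect < /proc/drbd && echo "reconnect required"` on each backend.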
When recovering from a split brain, the victim always needs to be in the 'StandAlone' state.
On the victim:
- $: drbdadm disconnect pc0
- $: drbdadm secondary pc0
After you've ensured that the victim is secondary, promote the primary.
On the primary:
- $: drbdadm connect pc0
- $: drbdadm primary pc0
Now connect the secondary to the primary.
On the victim:
- $: drbdadm connect pc0
Now verify the health on both:
- $: cat /proc/drbd
Both should be "cs:Connected".
When recovering from broken HA you'll first want to determine which node should be the primary and which should be the secondary.
On the node that you'd prefer to be primary, do a master recover:
$: chef-server-ctl master-recover
It's common for the recovery to take a few seconds. You can verify the status of the master by checking ha-status:
$: chef-server-ctl ha-status
After the master has recovered you can ensure the backup is healthy:
$: chef-server-ctl backup-recover
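Because recovery can take a few seconds, it may help to poll rather than check once. The following is a generic retry sketch, not a Chef Server feature: `wait_for` is a hypothetical helper that reruns whatever check command you give it until that command succeeds or the attempts run out.

```shell
#!/bin/sh
# Hypothetical helper: wait_for <tries> <delay-seconds> <command> [args...]
# Runs the command up to <tries> times, sleeping <delay-seconds> between
# attempts. Returns 0 as soon as the command succeeds, 1 if it never does.
wait_for() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        # Only sleep when another attempt is still to come.
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}
```

For example, `wait_for 10 3 check_ha` would try a check of your own (here the hypothetical `check_ha`, such as a wrapper that inspects the ha-status output) up to ten times, three seconds apart.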