Chef Server HA and DRBD Split Brain Recovery

The default DRBD recovery settings on the Chef Server handle most split-brain scenarios automatically. The following steps show how to recover from a split brain manually. Read the entire document before following any of the steps.

Determine the cluster health

The Chef Server ships with helper commands to determine the current HA status of the server, and DRBD ships with utilities to inspect and control the state of DRBD. First, check the DRBD status and HA status of both backend nodes to determine which remediation steps you'll need.

Determine the HA status

$: chef-server-ctl ha-status

The output should be several lines of '[OK]'. If you see any '[ERROR]' lines that are not related to chef-mover, you'll need to fix the HA.
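
The check above can be sketched as a small filter. This is only an illustration: `ha_errors` is a hypothetical helper name, and it assumes that failing lines contain the literal text `[ERROR]` and that chef-mover failures mention `chef-mover` on the same line.

```shell
#!/bin/sh
# ha_errors: read ha-status output on stdin and print only the [ERROR]
# lines that do not mention chef-mover.
# Hypothetical helper; the output format it expects is an assumption.
ha_errors() {
    awk '/\[ERROR\]/ && !/chef-mover/'
}

# Usage (on a backend node):
#   chef-server-ctl ha-status | ha_errors
```

If the filter prints nothing, the only errors (if any) are chef-mover ones and HA does not need repair.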

Determine the DRBD status

$: cat /proc/drbd

The primary should show something like "cs:Connected ro:Primary/Secondary". The secondary should show something like "cs:Connected ro:Secondary/Primary".

If either node is not connected, you'll need to fix DRBD.
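
To read those fields without eyeballing the output, something like the following could be used. It is a rough sketch: `drbd_field` is a hypothetical helper name, and it assumes the `key:value` tokens in /proc/drbd are separated by spaces, as in the examples above.

```shell
#!/bin/sh
# drbd_field KEY: print the value of the first "KEY:value" token
# (e.g. cs or ro) found in /proc/drbd-style text on stdin.
# Hypothetical helper; assumes space-separated tokens.
drbd_field() {
    # Put each token on its own line, then strip the "KEY:" prefix.
    tr -s ' ' '\n' | sed -n "s/^$1://p" | head -n 1
}

# Usage (on each backend):
#   cs=$(drbd_field cs < /proc/drbd)
#   ro=$(drbd_field ro < /proc/drbd)
#   [ "$cs" = "Connected" ] || echo "DRBD needs attention: cs=$cs ro=$ro"
```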

Recovering from broken HA and DRBD Split Brains

The first step is to stop the services on all machines in the cluster, including the frontends (FEs). After the services have been stopped, recover DRBD and then recover HA. Once both have recovered, restart the Chef Server services on all the frontends.

$: chef-server-ctl stop (on every machine in the cluster)
Recover from broken DRBD (see below)
Recover from broken HA (see below)
$: chef-server-ctl start (on the primary backend and all frontends)

Recovering from broken DRBD

When recovering broken DRBD, first determine which backend should be the primary. This is normally the backend node that was last running as primary. It's critical that you determine this correctly to prevent data loss and corruption. After you've determined which node should be primary, the other node is the victim in the steps below.

If the connection state of either node is "cs:WFConnection" or "cs:StandAlone", you'll need to reconnect them.

When recovering from a split brain the victim always needs to be in the 'StandAlone' state.

On the victim

$: drbdadm disconnect pc0
$: drbdadm secondary pc0

After you've ensured that the victim is secondary, promote the primary.

On the primary

$: drbdadm connect pc0
$: drbdadm primary pc0

Now connect the secondary to the primary

On the victim

$: drbdadm connect pc0

Now verify the health on both

$: cat /proc/drbd

Both should show "cs:Connected".
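
The nodes may not report "cs:Connected" immediately after reconnecting (DRBD can pass through states such as SyncSource or SyncTarget while it resynchronizes), so a short polling loop can save repeated manual checks. This is a minimal sketch: `wait_connected` is a hypothetical helper, and the one-second interval and default of 30 attempts are arbitrary assumptions.

```shell
#!/bin/sh
# wait_connected FILE [TRIES]: poll FILE (normally /proc/drbd) once a
# second until it contains "cs:Connected". Gives up after TRIES attempts
# (default 30, an arbitrary choice) and returns non-zero.
wait_connected() {
    file=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        if grep -q 'cs:Connected' "$file"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage (run on both backends):
#   wait_connected /proc/drbd 60 || echo "DRBD is still not connected"
```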

Recovering from broken HA

When recovering from broken HA, first determine which node should be the primary and which should be the secondary.

On the node that you'd prefer to be primary, do a master recover:

$: chef-server-ctl master-recover

It's common for the recovery to take a few seconds. You can verify the status of the master by checking ha-status:

$: chef-server-ctl ha-status

After the master has recovered, you can ensure the backup is healthy:

$: chef-server-ctl backup-recover

ryancragun/split_brain_recovery.md