This document covers the steps involved in recovering the master from the slave when the geo-rep master suffers either a partial failure (one of the GlusterFS brick processes stops functioning) or a full failure (the master node shuts down).
Notation used in this document:
- failover - Switching from geo-rep master to slave
- failback - Populating the master with slave data and switching back to master
Switching implies transferring control to the slave, thereby allowing write operations on it.
The mechanism is twofold and requires user intervention.
The provided interface is the Gluster CLI, executed either from the master or the slave. Executing from the slave requires the slave's SSH keys to be copied to the master.
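One way to distribute the keys, assuming plain OpenSSH access between the nodes (hostnames below are placeholders, not part of any defined interface), is the standard OpenSSH tooling:

# on the slave node: generate a key pair if one does not already exist
ssh-keygen -t rsa
# copy the slave's public key to the master so commands can be relayed to it
ssh-copy-id root@master.example.com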
gluster> volume geo-replication <master> <slave> recover failover start
In case the command is executed on the slave, the semantics of master <--> slave are reversed. In either case, the existing geo-rep session between the master and slave is terminated. At this point the user can switch their application over to the slave and continue as usual.
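For example, one reading of this is that the same recovery, when initiated from the slave, would be invoked with the operands swapped (illustrative only; the exact operand order is an assumption):

gluster> volume geo-replication <slave> <master> recover failover start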
Do we need status for failover?
Similar to Step 0, the provided interface is the Gluster CLI. This phase populates the real master with the data present on the slave (now acting as master).
gluster> volume geo-replication <master> <slave> recover failback start
start initiates the data transfer without the end user facing any downtime. The preferred utility and mechanism are discussed further ahead in this document; please refer to that. We can do a one-shot invocation of the sync utility and allow it to sync as much as it can.
Status of the sync can be observed with
gluster> volume geo-replication <master> <slave> recover failback status
Sync in Progress
Once the initial sync is done
gluster> volume geo-replication <master> <slave> recover failback status
Sync Completed
We move on to the next step: the final sync.
gluster> volume geo-replication <master> <slave> recover failback commit
commit requires the user to experience a downtime while it initiates the final sync action. Again, this can be done using Rsync or gsync.
To prevent Rsync from crawling and check-summing each and every file, the index translator can be used to keep track of which files were updated. The index translator needs to be loaded before start and keeps track of modified files by creating a hardlink to each one in a configured directory. Then, the list of files can be obtained via readdir(2) and fed to Rsync. Additionally, we would need to modify the index translator to create negative entries to track delete operations.
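As a rough sketch of how such an index-assisted sync could be wired up with Rsync (the index directory location, the mount points and the entry-to-path resolution step are assumptions, not a defined interface; resolve-entry-to-path is a hypothetical helper):

# hypothetical locations
INDEX_DIR=/bricks/brick1/.glusterfs/indices   # directory populated by the index translator
SLAVE_MNT=/mnt/slave                          # slave volume (acting as master)
MASTER_MNT=/mnt/master                        # real master volume

# readdir(2) the index directory to get the modified entries, resolve each
# entry back to a volume-relative path (hypothetical helper), and hand the
# resulting list to rsync so it only touches the changed files
ls -1 "$INDEX_DIR" | resolve-entry-to-path > /tmp/changed.list
rsync -aHAX --files-from=/tmp/changed.list "$SLAVE_MNT"/ "$MASTER_MNT"/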
Utilities for the initial sync, i.e. failback (start mode):
- Rsync: Allow Rsync to determine which files need to be sync'd (its usual rolling check-summing algorithm); see the sketch after this list.
- Gsync: Gsync can efficiently determine which files to sync using its xtime checking approach (provided that the real master is not completely empty). Hence, this would be significantly faster than the Rsync method in determining which files to sync.
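As an illustration of the Rsync option above, a one-shot initial sync could look roughly like the following (mount points and hostnames are placeholders; rsync's rolling check-sum decides what actually gets transferred):

# mount the slave (acting as master) and the real master locally
mount -t glusterfs slave.example.com:/slavevol /mnt/slave
mount -t glusterfs master.example.com:/mastervol /mnt/master

# one-shot sync; -aHAX preserves hardlinks, ACLs and extended attributes
rsync -aHAX /mnt/slave/ /mnt/master/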
Just a quick thought experiment on how the to-do list would be affected if active-active were in place. I'm assuming here that the basic scenario is a unidirectional M -> S synchronization (so the basic scenario is the same old thing, not intended to be extended by the richer possibilities available with active-active).
With active-active it's almost the case that recovery boils down to setting up a reversed M <- S geo-rep link, without any special geo-rep subcommand. The only issue I see is ending up with a lot of conflicts between M and S instead of a good consistent state if local modification is not restricted properly. I.e., until the data syncs back, ongoing modifications should happen on the slave; once the original data set has been synced back, a full read-only period should follow to reach a fully synced state; when that's done, modifications should be directed to M (and the backward geo-rep can optionally be stopped, as it won't do anything if the constraint of "modification only on M" is in place).
So the only thing to be added is some logic/UI that tracks/coordinates these phases (probably in the same way as we discussed in the context of the current codebase).
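A rough sketch of how these phases could look from the CLI under the active-active assumption (the operand order of the reverse link and the use of the read-only volume option are assumptions, not a defined interface):

# phase 1: reverse link syncs data back while writes land on the slave
gluster> volume geo-replication <slave> <master> start
# phase 2: freeze writes on the slave for the final catch-up
gluster> volume set <slave> features.read-only on
# phase 3: fully synced; direct clients back to the master and
# optionally stop the reverse link
gluster> volume geo-replication <slave> <master> stop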