This document covers the steps involved in recovering the master from the slave when the geo-rep master suffers either a partial failure (one of the GlusterFS brick processes stops functioning) or a full failure (the master node shuts down).
Notation used in this document:
- failover - Switching from geo-rep master to slave
- failback - Populating the master with slave data and switching back to master
Switching implies transferring control to the slave, thereby allowing write operations on it.
The mechanism is twofold and requires user intervention.
The provided interface is the Gluster CLI, executed either from the master or the slave. Executing from the slave requires the slave's SSH keys to be copied to the master.
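One way to distribute the keys, assuming plain OpenSSH access between the nodes (hostnames below are placeholders, not part of any defined interface), is the standard OpenSSH tooling:

# on the slave node: generate a key pair if one does not already exist
ssh-keygen -t rsa
# copy the slave's public key to the master so commands can be relayed to it
ssh-copy-id root@master.example.com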
gluster> volume geo-replication <master> <slave> recover failover start
In case the command is executed on the slave, the semantics of master <--> slave are reversed. In either case, the existing geo-rep session between the master and slave is terminated. At this point the user can switch their application over to the slave and continue as usual.
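For example, one reading of this is that the same recovery, when initiated from the slave, would be invoked with the operands swapped (illustrative only; the exact operand order is an assumption):

gluster> volume geo-replication <slave> <master> recover failover start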
Do we need status for failover?
Similar to Step 0, the provided interface is the Gluster CLI. This phase populates the real master with the data present on the slave (now acting as master).
gluster> volume geo-replication <master> <slave> recover failback start
start initiates the data transfer without the end user facing any downtime. The preferred utility and mechanism are discussed further ahead in this document; please refer to that. We can do a one-shot invocation of the sync utility and allow it to sync as much as it can.
Status of the sync can be observed with
gluster> volume geo-replication <master> <slave> recover failback status
Sync in Progress
Once the initial sync is done
gluster> volume geo-replication <master> <slave> recover failback status
Sync Completed
We move on to the next step: the final sync.
gluster> volume geo-replication <master> <slave> recover failback commit
commit requires the user to experience a downtime while it initiates the final sync action. Again, this can be done using Rsync or gsync.
To prevent Rsync from crawling and check-summing each and every file, the index translator can be used to keep track of which files were updated. The index translator needs to be loaded before start and keeps track of modified files by creating a hardlink to each one in a configured directory. Then, the list of files can be obtained via readdir(2) and fed to Rsync. Additionally, we would need to modify the index translator to create negative entries to track delete operations.
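As a rough sketch of how such an index-assisted sync could be wired up with Rsync (the index directory location, the mount points and the entry-to-path resolution step are assumptions, not a defined interface; resolve-entry-to-path is a hypothetical helper):

# hypothetical locations
INDEX_DIR=/bricks/brick1/.glusterfs/indices   # directory populated by the index translator
SLAVE_MNT=/mnt/slave                          # slave volume (acting as master)
MASTER_MNT=/mnt/master                        # real master volume

# readdir(2) the index directory to get the modified entries, resolve each
# entry back to a volume-relative path (hypothetical helper), and hand the
# resulting list to rsync so it only touches the changed files
ls -1 "$INDEX_DIR" | resolve-entry-to-path > /tmp/changed.list
rsync -aHAX --files-from=/tmp/changed.list "$SLAVE_MNT"/ "$MASTER_MNT"/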
Utilities for the initial sync, i.e. failback (start mode):
- Rsync: Allow Rsync to determine which files need to be sync'd (its usual rolling check-summing algorithm); see the sketch after this list.
- Gsync: Gsync can efficiently determine which files to sync using its xtime checking approach (provided that the real master is not completely empty). Hence, this would be significantly faster than the Rsync method in determining which files to sync.
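As an illustration of the Rsync option above, a one-shot initial sync could look roughly like the following (mount points and hostnames are placeholders; rsync's rolling check-sum decides what actually gets transferred):

# mount the slave (acting as master) and the real master locally
mount -t glusterfs slave.example.com:/slavevol /mnt/slave
mount -t glusterfs master.example.com:/mastervol /mnt/master

# one-shot sync; -aHAX preserves hardlinks, ACLs and extended attributes
rsync -aHAX /mnt/slave/ /mnt/master/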
Just a quick thought experiment on how the to-do list would be affected if active-active were in place. I'm assuming here that the basic scenario is a unidirectional M -> S synchronization (so the basic scenario is the same old thing, not intended to be extended by the richer possibilities available with active-active).
With active-active it's almost the case that recovery boils down to setting up a reversed M <- S geo-rep link, without any special geo-rep subcommand. The only issue I see is ending up with a lot of conflicts between M and S instead of a good consistent state if local modification is not restricted properly. I.e., until the data syncs back, ongoing modifications should happen on the slave; once the original data set has been synced back, a full read-only period should follow to reach a fully synced state; when that's done, modifications should be directed to M (and the backward geo-rep can optionally be stopped, as it won't do anything if the constraint of "modification only on M" is in place).
So the only thing to be added is some logic/UI that tracks/coordinates these phases (probably in the same way as we discussed in the context of the current codebase).
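A rough sketch of how these phases could look from the CLI under the active-active assumption (the operand order of the reverse link and the use of the read-only volume option are assumptions, not a defined interface):

# phase 1: reverse link syncs data back while writes land on the slave
gluster> volume geo-replication <slave> <master> start
# phase 2: freeze writes on the slave for the final catch-up
gluster> volume set <slave> features.read-only on
# phase 3: fully synced; direct clients back to the master and
# optionally stop the reverse link
gluster> volume geo-replication <slave> <master> stop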