- Mnesia schema
  - db_nodes - nodes of the schema: either disc nodes or nodes that tables are replicated to.
  - extra_db_nodes - configuration telling mnesia which nodes to connect to on startup.
  - running_db_nodes - nodes which mnesia is currently connected to. [1]
  - table nodes - nodes on which tables are replicated. Each table keeps a list of "all nodes" and a list of "active" nodes: "all nodes" is a subset of db_nodes, "active nodes" is a subset of running_db_nodes. In a way, db_nodes and running_db_nodes are the "all nodes" and "active nodes" of the schema table.
- nodes_running_at_shutdown - a list of nodes which are currently running. It is similar to running_db_nodes, but is maintained by the node_monitor: it's modified when a node starts, joins/leaves the cluster, or when the rabbit process stops on a node.
- cluster_nodes.config - two lists, one containing all clustered nodes, the other containing the disc nodes. Modified when a node joins/leaves the cluster.
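For illustration: as far as I can tell both of these are Erlang term files in the node's mnesia directory. Assuming a three-node cluster rabbit@a/rabbit@b/rabbit@c with rabbit@a as the only disc node (the node names and exact layout here are my assumptions, not taken from a real broker), cluster_nodes.config would contain something roughly like:

```erlang
%% Hypothetical contents of cluster_nodes.config:
%% {AllClusteredNodes, DiscNodes}.
{['rabbit@a','rabbit@b','rabbit@c'],
 ['rabbit@a']}.
```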
- mnesia_monitor - a process linked to the other monitors on all db_nodes.
- rabbit_node_monitor - monitors nodes (net_kernel:monitor_nodes/2) and rabbit processes on remote nodes.
- All the queues/channels/gm can monitor state across nodes.
- nodedown - a message from the Erlang internal node monitor. Handled by mnesia_monitor to keep track of down nodes (it does not directly remove them from running_db_nodes), and by rabbit_node_monitor to track how many nodes are running for pause_minority and pause_if_all_down. Triggers check_partial_partition.
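The pause_minority decision above is essentially a strict-majority check over the known cluster membership. A minimal sketch (a toy model with made-up names, not the actual rabbit_node_monitor code):

```python
def in_majority(alive_nodes, cluster_nodes):
    """True if the nodes we can see (including ourselves) form a strict
    majority of the cluster; a pause_minority node that loses the
    majority pauses itself rather than risk a split brain."""
    return 2 * len(alive_nodes) > len(cluster_nodes)

cluster = ["a", "b", "c"]
print(in_majority(["a", "b"], cluster))  # True  - sees 2 of 3, keeps running
print(in_majority(["a"], cluster))       # False - sees 1 of 3, pauses
```

pause_if_all_down is similar, but a node pauses only when every node in a configured list is unreachable.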
- nodeup - counterpart of nodedown. Handled by mnesia_monitor to check the cluster status; this handler may send an inconsistent_database event. rabbit_node_monitor logs the event and does nothing else.
- Link EXIT signal from mnesia_monitor - updates running_db_nodes and the active nodes for all tables.
- notify_node_up - notifies all nodes from running_db_nodes (except self) by sending node_up to them.
- DOWN from the rabbit process - updates the cluster status (removes the stopped node), cleans up transient queues, listeners and alarms, and updates partition tracking (handle_dead_rabbit).
- node_up (not to be confused with nodeup) - sent by the node monitor on a started remote node to notify the cluster (in a boot step). Updates the cluster status, updates alarms, and removes the started node from the recoverable slaves of mirrored queues (handle_live_rabbit).
- joined_cluster/left_cluster - update the cluster status.
- {mnesia_system_event, {inconsistent_database, running_partitioned_network, Node}} - this message is treated as a reconnect after a partial partition: it updates alarms, removes the started node from the recoverable slaves of mirrored queues (handle_live_rabbit), and records the partitioned state. I'm not sure this is the right message to report a reconnect: it may be emitted multiple times and does not necessarily mean that a node has rejoined.
- check_partial_partition - this message is sent by a node handling a nodedown message to all running nodes except the sender and the node which is "down". The message contains the GUIDs of these two nodes. A node which receives this message will check whether the "down" node is actually down, both by checking its status (in the node_monitor data) and by sending an RPC request calling rabbit:is_running/0. If the "down" node is running, the "checker" node responds to the "reporter" node with a partial_partition message containing the "checker" node and the "down" node. The RPC request is sent in a one-off process. This feels dangerous intuitively and is not that easy to reason about.
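The checker-side logic described above can be sketched as follows (a toy model; the function and parameter names are invented, and the real code performs the RPC from a one-off process):

```python
def handle_check_partial_partition(reporter, down_node, checker,
                                   node_status, rpc_is_running):
    """Toy model of the "checker" side: the reporter claims down_node
    is down; we only contradict it if our own records say the node is
    up AND an RPC (rabbit:is_running/0 in the real code) confirms it
    is actually running."""
    locally_up = node_status.get(down_node) == "up"
    if locally_up and rpc_is_running(down_node):
        # Tell the reporter it is only partially partitioned.
        return ("partial_partition", checker, down_node)
    return None  # we agree the node is down; do nothing

status = {"a": "up", "b": "up", "c": "up"}
# "a" reported "b" down, but we ("c") can still reach "b":
msg = handle_check_partial_partition("a", "b", "c", status, lambda n: True)
print(msg)  # ('partial_partition', 'c', 'b')
```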
- partial_partition - this message tells a node that there is a partial partition. It contains the "checker" node and the "not_really_down" node. On this message the node monitor will force a disconnect from the "checker" node and send it a partial_partition_disconnect message. The node may instead pause if it's in pause_minority or pause_if_all_down mode.
- partial_partition_disconnect - this message tells a node to disconnect from another node. The assumption here is that a partial partition should be promoted to a full partition, disconnecting from the "checker" node and leaving the "checker" and the "down" nodes in a partition together. But because DOWN messages are symmetric and there is no additional coordination, this process may leave the entire cluster disconnected or keep disconnecting nodes for some time.
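The failure mode above can be shown with a tiny simulation (a toy model under the stated assumptions, not RabbitMQ code): the a-b link flaps, both sides report each other as down, and both end up force-disconnecting from the healthy checker c, so the whole cluster fragments even though only one link was bad.

```python
def simulate_flapping_link(nodes, flapped):
    """Toy model: `flapped` is a pair of nodes whose link briefly
    dropped. Each side reports the other as down; every other node can
    still reach the "down" node, so it replies partial_partition, and
    the reporter force-disconnects from that checker. Returns the set
    of severed links."""
    severed = set()
    x, y = flapped
    for reporter, down in ((x, y), (y, x)):  # DOWN is symmetric
        for checker in nodes:
            if checker in (reporter, down):
                continue
            # checker can still reach `down` -> partial_partition reply,
            # so the reporter disconnects from the checker.
            severed.add(frozenset((reporter, checker)))
    return severed

links = simulate_flapping_link(["a", "b", "c"], ("a", "b"))
# Both a-c and b-c get severed: c is now cut off from everyone.
print(sorted(sorted(l) for l in links))  # [['a', 'c'], ['b', 'c']]
```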
- a note on disconnect: when disconnecting, the nodes will disable reconnection for 1 second. While some nodes are down, the node monitor will ping the entire cluster every second. It will also send a cast keepalive message to all running nodes every 10 seconds.
[1] running_db_nodes:
This value is maintained by internal mnesia monitors.
A node is removed from this list when the mnesia_monitor process detects that another mnesia_monitor is "down".
When the node is rediscovered it will not be automatically re-added unless the schema is merged.
The merge can be triggered explicitly with mnesia:change_config(extra_db_nodes, [Node]), or happens when the node restarts.
You may need to set the same extra_db_nodes configuration which is already there to reconnect the cluster.
When nodes are discovered, mnesia sends a message like this:
{mnesia_system_event, {inconsistent_database, running_partitioned_network, Node}}
to all processes subscribed to such events.
This may happen every time mnesia checks schema consistency, both when the node
is discovered to be up (e.g. a message is sent between nodes) and when connecting
with mnesia:change_config(extra_db_nodes, ...).