@rwenz3l
Created September 18, 2020 09:39
Documentation of my testing on the NATS Streaming Server PR1090 that adds the option to add and remove cluster nodes.

NATS Streaming Server - Test Add/Remove Peers

For the nats-streaming-server PR that allows nodes to be added and removed while the cluster is running, we prepared a small testing environment to verify its behaviour.

Setup

# Create some folders to store all the data in
mkdir nats-testing && cd $_
mkdir -p ./bin
mkdir -p ./conf
mkdir -p ./data/ws{1,2,3,4,5}
mkdir -p ./logs/ws{1,2,3,4,5}
mkdir -p ./logs/cluster/ws{1,2,3,4,5}
# Clone the Repository
git clone https://github.com/nats-io/nats-streaming-server.git

# enter dir
pushd nats-streaming-server

# checkout feature-branch
git checkout add_remove_cluster_nodes

# build binary
go build

# verify:
ls -lh nats-streaming-server
# -rwxr-xr-x  1 red  staff    19M 18 Sep 07:50 nats-streaming-server

# check for the new flag:
./nats-streaming-server -h | grep allow_add_remove_node
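# (no output from the grep means the binary was not built from the feature branch)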

# move binary to bin folder
mv ./nats-streaming-server ../bin/
popd

→ Now we need some config files for our cluster:

# ------------------------------------
cat > ./conf/ws1.conf << EOF
port: 4221
http_port: 8221
cluster {
  listen: "0.0.0.0:6221"
  routes: [ 
    "nats://localhost:6222",
    "nats://localhost:6223",
    "nats://localhost:6224",
    "nats://localhost:6225"]
}
streaming {
  id: "ws_test"
  store: "file"
  dir: "./data/ws1"
  log: "./logs/ws1"
  cluster_log_path: "./logs/cluster/ws1"
  cluster {
    node_id: "ws1"
    peers: ["ws2","ws3","ws4","ws5"]
    allow_add_remove_node: true
  }
}
EOF

# ------------------------------------
cat > ./conf/ws2.conf << EOF
port: 4222
http_port: 8222
cluster {
  listen: "0.0.0.0:6222"
  routes: [
    "nats://localhost:6221", 
    "nats://localhost:6223",
    "nats://localhost:6224",
    "nats://localhost:6225"]
}
streaming {
  id: "ws_test"
  store: "file"
  dir: "./data/ws2"
  log: "./logs/ws2"
  cluster_log_path: "./logs/cluster/ws2"
  cluster {
    node_id: "ws2"
    peers: ["ws1","ws3","ws4","ws5"]
    allow_add_remove_node: true
  }
}
EOF

# ------------------------------------
cat > ./conf/ws3.conf << EOF
port: 4223
http_port: 8223
cluster {
  listen: "0.0.0.0:6223"
  routes: [
    "nats://localhost:6221", 
    "nats://localhost:6222",
    "nats://localhost:6224",
    "nats://localhost:6225"]
}
streaming {
  id: "ws_test"
  store: "file"
  dir: "./data/ws3"
  log: "./logs/ws3"
  cluster_log_path: "./logs/cluster/ws3"
  cluster {
    node_id: "ws3"
    peers: ["ws1","ws2","ws4","ws5"]
    allow_add_remove_node: true
  }
}
EOF

# ------------------------------------
cat > ./conf/ws4.conf << EOF
port: 4224
http_port: 8224
cluster {
  listen: "0.0.0.0:6224"
  routes: [
    "nats://localhost:6221",
    "nats://localhost:6222",
    "nats://localhost:6223",
    "nats://localhost:6225"]
}
streaming {
  id: "ws_test"
  store: "file"
  dir: "./data/ws4"
  log: "./logs/ws4"
  cluster_log_path: "./logs/cluster/ws4"
  cluster {
    node_id: "ws4"
    peers: ["ws1","ws2","ws3","ws5"]
    allow_add_remove_node: true
  }
}
EOF

# ------------------------------------
cat > ./conf/ws5.conf << EOF
port: 4225
http_port: 8225
cluster {
  listen: "0.0.0.0:6225"
  routes: [
    "nats://localhost:6221",
    "nats://localhost:6222",
    "nats://localhost:6223",
    "nats://localhost:6224"]
}
streaming {
  id: "ws_test"
  store: "file"
  dir: "./data/ws5"
  log: "./logs/ws5"
  cluster_log_path: "./logs/cluster/ws5"
  cluster {
    node_id: "ws5"
    peers: ["ws1","ws2","ws3","ws4"]
    allow_add_remove_node: true
  }
}
EOF
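
→ Optional sanity check (a small sketch, not part of the original run): confirm that every generated config enables the new flag.

grep -c allow_add_remove_node ./conf/ws*.conf
# each of the five files should report a count of 1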

→ Start all the servers.

(I used a tmux window to have them all in one terminal screen so I can observe the logs too.)

# optional tmux window
tmux new -s nats
for _ in {0..4}; do tmux splitw -v -p 80; done
tmux select-layout main-vertical
clear
# Server 1:
./bin/nats-streaming-server --config "./conf/ws1.conf"

# Server 2:
./bin/nats-streaming-server --config "./conf/ws2.conf"

# Server 3:
./bin/nats-streaming-server --config "./conf/ws3.conf"

# Server 4:
./bin/nats-streaming-server --config "./conf/ws4.conf"

# Server 5:
./bin/nats-streaming-server --config "./conf/ws5.conf"
# We can check the current setup with this command:
curl -XGET -s http://localhost:822{1,2,3,4,5}/streaming/serverz | jq '.role'
# Example Output:
"Follower"
"Follower"
"Follower"
"Follower"
"Leader"

# So in our case ws5 is the Leader.
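# Optional helper (sketch; assumes jq is installed): print each monitoring
# port next to its role so the leader is easier to spot:
for p in 8221 8222 8223 8224 8225; do
  printf '%s: ' "$p"; curl -s "http://localhost:$p/streaming/serverz" | jq -r '.role'
done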
wget https://raw.githubusercontent.com/nats-io/nats.go/master/examples/nats-pub/main.go -O nats-pub.go

go get -u github.com/nats-io/nats.go
go build nats-pub.go
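# Note (sketch, not part of the original run): on newer module-aware Go
# toolchains the standalone build above may fail; in that case build inside
# a throwaway module ("natspub" is an arbitrary name):
#   go mod init natspub && go mod tidy && go build -o nats-pub nats-pub.go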

# Let's Remove the Leader:
./nats-pub '_STAN.raft.ws_test.node.remove' 'ws5'
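# The subject is assumed to follow the pattern _STAN.raft.<cluster id>.node.remove
# ("ws_test" is our cluster id from the configs); the payload is the node ID to remove.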

# the node should be removed now and a new leader election has very likely taken place;
# check again with the curl command used earlier.
# Example Output:
"Leader"
"Follower"
"Follower"
"Follower"
# So ws1 became the new leader.
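# Optional check (sketch): ws5 shut itself down after the removal, so its
# monitoring endpoint should no longer respond:
curl -s http://localhost:8225/streaming/serverz || echo "ws5 is down"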
# Now remove ws5 from all configurations
# ------------------------------------
cat > ./conf/ws1.conf << EOF
port: 4221
http_port: 8221
cluster {
  listen: "0.0.0.0:6221"
  routes: [ 
    "nats://localhost:6222",
    "nats://localhost:6223",
    "nats://localhost:6224"]
}
streaming {
  id: "ws_test"
  store: "file"
  dir: "./data/ws1"
  log: "./logs/ws1"
  cluster_log_path: "./logs/cluster/ws1"
  cluster {
    node_id: "ws1"
    peers: ["ws2","ws3","ws4"]
    allow_add_remove_node: true
  }
}
EOF

# ------------------------------------
cat > ./conf/ws2.conf << EOF
port: 4222
http_port: 8222
cluster {
  listen: "0.0.0.0:6222"
  routes: [
    "nats://localhost:6221", 
    "nats://localhost:6223",
    "nats://localhost:6224"]
}
streaming {
  id: "ws_test"
  store: "file"
  dir: "./data/ws2"
  log: "./logs/ws2"
  cluster_log_path: "./logs/cluster/ws2"
  cluster {
    node_id: "ws2"
    peers: ["ws1","ws3","ws4"]
    allow_add_remove_node: true
  }
}
EOF

# ------------------------------------
cat > ./conf/ws3.conf << EOF
port: 4223
http_port: 8223
cluster {
  listen: "0.0.0.0:6223"
  routes: [
    "nats://localhost:6221", 
    "nats://localhost:6222",
    "nats://localhost:6224"]
}
streaming {
  id: "ws_test"
  store: "file"
  dir: "./data/ws3"
  log: "./logs/ws3"
  cluster_log_path: "./logs/cluster/ws3"
  cluster {
    node_id: "ws3"
    peers: ["ws1","ws2","ws4"]
    allow_add_remove_node: true
  }
}
EOF

# ------------------------------------
cat > ./conf/ws4.conf << EOF
port: 4224
http_port: 8224
cluster {
  listen: "0.0.0.0:6224"
  routes: [
    "nats://localhost:6221",
    "nats://localhost:6222",
    "nats://localhost:6223"]
}
streaming {
  id: "ws_test"
  store: "file"
  dir: "./data/ws4"
  log: "./logs/ws4"
  cluster_log_path: "./logs/cluster/ws4"
  cluster {
    node_id: "ws4"
    peers: ["ws1","ws2","ws3"]
    allow_add_remove_node: true
  }
}
EOF
# Signal a reload to nats:
./bin/nats-streaming-server --signal reload
# with multiple nats-streaming-server processes running, each PID has to be signalled explicitly:
# 12488
# 12489
# 12490
# 12491
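# (sketch: the PIDs can be looked up with e.g. `pgrep -f nats-streaming-server`)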
./bin/nats-streaming-server --signal reload=12488
./bin/nats-streaming-server --signal reload=12489
./bin/nats-streaming-server --signal reload=12490
./bin/nats-streaming-server --signal reload=12491

# Configurations have been reloaded and ws5 is no longer part of the cluster.

Results

  • The pub/sub message is received correctly, and the node is shut down as well as removed from the (Raft) cluster.
  • The other peers keep trying to reconnect to the removed peer until the configuration has been changed and reloaded.
  • Restarting the removed node does not add it back to the cluster; it stays in the Candidate state indefinitely unless it is explicitly added back (a sketch of the add call follows after this list). Once added, it turns from Candidate to Follower.
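
For reference, a minimal sketch of adding the removed node back in: the add subject is assumed to mirror the remove subject used above; this exact subject name is an assumption, not quoted from the PR.

# restart ws5 with its original config, then publish its node ID to the assumed add subject:
./nats-pub '_STAN.raft.ws_test.node.add' 'ws5'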
@kozlovic

Note that reload only works for the NATS-related configuration (when embedded), not for the streaming configuration. There is no plan to support that at this point.
If the server is running, the RAFT configuration that matters is the one from the raft state, so that is fine. Changing the config file needs to be done before the node is restarted, but a reload won't help.

Also, note that you can include the current node in the "peers" list; it will be ignored. That can simplify your config with copy/paste :-).
For instance:

cluster {
  node_id: "ws1"
  peers: ["ws1","ws2","ws3","ws4","ws5"]
  allow_add_remove_node: true
}

is a valid configuration.
