Chef High Availability: the Backend Cluster and Its Not-So-Common Problems
- In my case, the Chef HA setup is entirely on AWS, but this can be translated to other vendors too
- `chef-backend-ctl` commands are for backend nodes
- `chef-server-ctl` commands are for frontend nodes
Chef is known to be delicate about hostname configuration, so I put together this list of actions you can take to sort out hostname issues.
This should be done on all your Chef nodes.
- Manually check whether the hostname matches its FQDN
hostname; hostname -A
- Query the AWS API for the instance's hostname and set the hostname to the returned value
AWS_HOSTNAME=`curl http://169.254.169.254/latest/meta-data/hostname`
echo ${AWS_HOSTNAME}
hostname ${AWS_HOSTNAME}
- Add the entry to /etc/hosts if it's missing
EXT_IP=`ifconfig eth0 | grep "inet addr" | awk '{print $2}' | awk -F: '{print $2}'`
grep -qE "${EXT_IP}\s+${AWS_HOSTNAME}" /etc/hosts &&\
echo "hostname already in there" || echo "${EXT_IP} ${AWS_HOSTNAME}" >> /etc/hosts
- Check hostnames again
hostname; hostname -A
- Reconfigure Chef with `chef-backend-ctl reconfigure` (backend) or `chef-server-ctl reconfigure` (frontend)
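If you manage many nodes, the steps above can be rolled into a single idempotent script. This is a minimal sketch of the same commands; it assumes eth0, the older "inet addr" ifconfig output format, and the EC2 metadata endpoint, and the hostname change does not persist across reboots.

```bash
#!/bin/bash
# Sketch only: combines the hostname checks above into one script.
# Run as root on each node.
set -euo pipefail

# Ask the EC2 metadata service for the instance's hostname
AWS_HOSTNAME=$(curl -s http://169.254.169.254/latest/meta-data/hostname)

# Set the hostname only if it differs from what AWS reports
if [ "$(hostname)" != "${AWS_HOSTNAME}" ]; then
  hostname "${AWS_HOSTNAME}"
fi

# Grab eth0's IP and append the entry to /etc/hosts if it's missing
EXT_IP=$(ifconfig eth0 | grep "inet addr" | awk '{print $2}' | awk -F: '{print $2}')
grep -qE "${EXT_IP}\s+${AWS_HOSTNAME}" /etc/hosts \
  || echo "${EXT_IP} ${AWS_HOSTNAME}" >> /etc/hosts

# Hostname and FQDN should now match
hostname; hostname -A
```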
In the scenario where you had a total follower failure (all follower nodes crashing) causing a loss of quorum, but you still have the leader operational, you can recover with the following process.
Note: my follower nodes are part of an AWS ASG, so to get the cluster operational again I have to set the ASG size to 1 first, and only then scale it back to the desired number of nodes; this makes for a smooth join-cluster procedure and avoids a race condition.
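For reference, resizing the group with the AWS CLI looks like this; the group name below is a placeholder for your own ASG.

```bash
# Shrink the follower ASG to a single node before recovering (group name is hypothetical)
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name chef-backend-followers \
  --min-size 1 --desired-capacity 1

# After the cluster is healthy again, scale back out (2 followers in my case)
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name chef-backend-followers \
  --min-size 2 --desired-capacity 2
```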
Leader
rm /var/opt/chef-backend/leaderl/data/no-start-pgsql
chef-backend-ctl create-cluster --quorum-loss-recovery
Follower1
chef-backend-ctl join-cluster; chef-backend-ctl reconfigure
Leader
chef-backend-ctl reconfigure
FollowerN
chef-backend-ctl join-cluster ...
chef-backend-ctl reconfigure
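Once the last follower has rejoined, you can wait for the cluster to settle before moving on. A small sketch that polls `chef-backend-ctl status` until the expected topology shows up (the counts below match my 3-node cluster; adjust them to yours):

```bash
# Poll until the leader reports 1 leader + 2 followers (example values for a 3-node cluster)
until chef-backend-ctl status | grep -q 'leader: 1; waiting: 0; follower: 2'; do
  echo "waiting for followers to rejoin..."
  sleep 10
done
```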
Then you'll most certainly need to run through the next section of this document (fixing ES).
A broken Chef ES index can cause all sorts of funky things when you try and query data via `knife` or the UI. Some common symptoms are:
- Searching for `windows` in the UI returns no or mixed results
- `knife node search 'platform:redhat'` returns all results
- Running `curl localhost:9200/chef/_search -d '{"query":{"query_string":{"lowercase_expanded_terms":false,"query":"content:platform__=__redhat"}}}'` on any of your Chef servers will return all results too
Additionally, checking the elasticsearch status with `chef-backend-ctl status elasticsearch` would return red:
Role: Leader
Local Status: running (pid 32744)
Logging: running (pid 1892)
Time up: 0d 0h 1m 17s
Cluster Status: red
Active Shards: 70.0%
** Nodes **
It seems that the way to fix this issue is to wipe out the index and rebuild it from scratch, but a note of warning: YOU WILL NEED TO BLOCK ALL ACCESS to Chef for a short while, which means downtime and inability to use the service during that time.
This is the only way to ensure a full and healthy rebuild of the index!
The access point is usually the Frontend node, and you could try one of the following methods to block traffic there:
- Block incoming traffic via `iptables` on the machine (sketched below)
- Use a firewall/Security Group vendor service to block incoming traffic
- If you're using a Load Balancer, unassign the nodes from it
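As an example, the `iptables` option could look like the following; it assumes the Chef API is served on ports 80/443.

```bash
# Block incoming Chef API traffic on the Frontend (assumes HTTP/HTTPS on 80/443)
iptables -I INPUT -p tcp --dport 443 -j DROP
iptables -I INPUT -p tcp --dport 80 -j DROP

# ...rebuild the index, then delete the same rules to resume traffic
iptables -D INPUT -p tcp --dport 443 -j DROP
iptables -D INPUT -p tcp --dport 80 -j DROP
```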
Do not stop the Frontend node because you'll need it to rebuild the index.
In a nutshell, we will be doing the following:
- Block traffic on the Frontend node
- Delete the index from one of the Backends (ideally the leader)
- Reconfigure services and reindex all data on the Frontend node
- Resume traffic on the Frontend node
Check that the cluster is still operative even if ES has a red status; if not, have a look at Restoring Chef cluster after node failure.
$ chef-backend-ctl cluster-status
Name IP GUID Role PG ES
ip-10-10-181-87 10.10.181.87 c5bbb54df8f74213cac49b605404583e follower follower not_master
ip-10-10-183-242 10.10.183.242 92bcc24ea62b8c2a492205ead2770eeb leader leader not_master
ip-10-10-183-76 10.10.183.76 701581deb012cbcdbcca1a1c2e7f8edd follower follower master
$ chef-backend-ctl status
Service Local Status Time in State Distributed Node Status
leaderl running (pid 9889) 0d 1h 37m 19s leader: 1; waiting: 0; follower: 2; total: 3
etcd running (pid 9685) 0d 1h 37m 52s health: green; healthy nodes: 3/3
postgresql running (pid 9974) 0d 1h 37m 16s leader: 1; offline: 0; syncing: 0; synced: 2
elasticsearch running (pid 10184) 0d 0h 22m 13s state: red; nodes online: 3/3
List the indices on Elasticsearch; the `chef` index should come up red:
$ curl 'http://localhost:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
red open chef 5 1 575 5 48.9mb 24.4mb
Additionally, you can query Elasticsearch for other relevant information with the following commands:
- Chef cluster health
  - For a detailed output: `curl 'localhost:9200/_cluster/health/chef?pretty&level=shards'`
  - For a simplified output: `curl 'localhost:9200/_cat/health?v'`
- Cluster state
  - `curl 'http://localhost:9200/_cluster/state?pretty'`
- Shard states
  - List all shards: `curl 'localhost:9200/_cat/shards?v'`
  - Filter by unassigned shards: `curl 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED`
Remember that traffic should be blocked on the Frontend node.
- Delete the chef index
curl -XDELETE 'http://localhost:9200/chef'
{"acknowledged":true}
- Reconfigure the Frontend node
chef-server-ctl reconfigure
- Reindex all the data
chef-server-ctl reindex -a
- Run a final check to see the size of the index
curl 'http://localhost:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open chef 5 1 1574 58 547.3mb 273.9mb
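`chef-server-ctl reindex -a` can take a while on larger installs; one simple way to wait for the rebuild to finish is to poll the health endpoint shown earlier until the `chef` index goes green. A sketch:

```bash
# Poll the chef index health until Elasticsearch reports green
until curl -s 'localhost:9200/_cluster/health/chef' | grep -q '"status":"green"'; do
  echo "index still rebuilding..."
  sleep 10
done
```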