I no longer recommend the below method. For my updated recommendation, see this gist
The Linux watchdog service can be configured to run certain 'liveness' tests periodically and to take some action (such as a reboot) if a test keeps failing, without recovering, for some period of time.
There is a built-in test for pinging an IP address (the 'ping' directive), but you may instead need to test access to a hostname (e.g. api.importantserver.com), and there is no guarantee that the name will always resolve to the same IP. Here we describe how to write a custom script that tests ping connectivity to a hostname, and how to plug this script into the Linux watchdog(8) service.
sudo apt install watchdog
systemctl status watchdog
sudo mkdir /etc/watchdog.d
sudo nano /etc/watchdog.d/pinghost.sh
Now paste the following contents into the pinghost.sh file:
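#!/bin/bash
# Liveness test for watchdog: ping a hostname, fail with a user-reserved code.
readonly EUSERVALUE=246   # logged by watchdog as 'user-reserved code'
readonly TARGET=io.adafruit.com

# Send 3 pings quietly; ping exits 0 only if at least one reply came back.
if ping -c3 -q "$TARGET" > /dev/null; then
    exit 0
else
    echo "failed to ping $TARGET"
    exit "$EUSERVALUE"
fi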
Change the value of 'TARGET' to match your desired ping destination. Now make the script executable by root:
sudo chmod u+x /etc/watchdog.d/pinghost.sh
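Before handing the script to the watchdog, it may be worth running it once by hand and checking its exit status. The following is a minimal sketch; `run_check` is a hypothetical helper name, not something shipped with the watchdog package.

```shell
# run_check is a hypothetical helper (not part of the watchdog package).
# It runs the given command and prints the exit status the caller would see.
run_check() {
    "$@"
    status=$?
    echo "exit status: $status"
    return "$status"
}

# Example (sudo is used on the assumption that the script is owned by root
# and executable only by its owner):
# run_check sudo /etc/watchdog.d/pinghost.sh
```

An exit status of 0 means the ping succeeded; 246 is the failure code the script reports to the watchdog.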
Lastly, edit the watchdog config file and tune the values of 'interval' and 'retry-timeout' if desired. The 'interval' setting is how often (in seconds) to run the test script (suggestion: 120 or higher). The 'retry-timeout' setting is how long (in seconds) to wait in a test-failing state before rebooting (suggestion: 1800).
sudo nano /etc/watchdog.conf
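For illustration, with the suggested values the relevant lines of /etc/watchdog.conf might look like the excerpt below. The 'test-directory' line is shown commented out on the assumption that /etc/watchdog.d is already the default on your build; check watchdog.conf(5) if your script does not seem to be picked up.

```
# /etc/watchdog.conf (excerpt, suggested values from this guide)
interval        = 120
retry-timeout   = 1800

# Assumed to be the default already; uncomment only if needed.
# test-directory = /etc/watchdog.d
```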
Once you're happy with the config settings, stop and start the service to activate the changes.
sudo systemctl stop watchdog
sudo systemctl start watchdog
Now the watchdog is running. You can check the status of the watchdog service with this command:
systemctl status watchdog
If you see that the service fails with the following error:
This interval length (59) might reboot the system while the process sleeps! Try 59 or less
then you can add '--force' to the command line in the service file (/lib/systemd/system/watchdog.service) like this:
[Unit]
Description=watchdog daemon
Conflicts=wd_keepalive.service
After=multi-user.target
OnFailure=wd_keepalive.service
[Service]
Type=forking
EnvironmentFile=/etc/default/watchdog
ExecStartPre=/bin/sh -c '[ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module'
ExecStart=/bin/sh -c '[ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options --force'
ExecStopPost=/bin/sh -c '[ $run_wd_keepalive != 1 ] || false'
[Install]
WantedBy=default.target
After making the above change, you'll need to run:
sudo systemctl stop watchdog
sudo systemctl daemon-reload
sudo systemctl start watchdog
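The unit file shown above reads /etc/default/watchdog and passes $watchdog_options to the daemon, so a possible alternative to editing the unit file is to put the flag there instead. This is only a sketch, assuming your /etc/default/watchdog already defines 'watchdog_options' (as the unit's ExecStart line suggests); the exact layout of that file may differ between distributions.

```shell
# Excerpt of /etc/default/watchdog (assumed layout; check your own file).
# The unit's ExecStart line appends $watchdog_options to the watchdog command.
watchdog_options="--force"
```

With this approach the unit file itself stays untouched, so stopping and starting the service should be enough to pick up the change.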
You can get the whole log history from the watchdog service with:
journalctl -u watchdog
or you can just watch the end of the log as it grows by adding the -f (follow) parameter:
journalctl -u watchdog -f
You should test that the watchdog script is working by disconnecting a network cable (or otherwise disabling network connectivity). In the log file, you should see output similar to this to indicate a watchdog-triggered reboot:
Dec 29 11:17:04 miketinkerpi watchdog[633]: test binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:17:24 miketinkerpi watchdog[633]: test binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:17:44 miketinkerpi watchdog[633]: test binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:18:04 miketinkerpi watchdog[633]: test binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:18:24 miketinkerpi watchdog[633]: test binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:18:44 miketinkerpi watchdog[633]: Retry timed-out at 80 seconds for /etc/watchdog.d/pinghost.sh
Dec 29 11:18:44 miketinkerpi watchdog[633]: repair binary /etc/watchdog.d/pinghost.sh returned 246 = 'user-reserved code'
Dec 29 11:18:44 miketinkerpi watchdog[633]: shutting down the system because of error 246 = 'user-reserved code'
Dec 29 11:18:44 miketinkerpi watchdog[975]: /usr/lib/sendmail does not exist or is not executable (errno = 2)
-- Boot 9ebb8dd9c90b4219a3784acf8e4361e5 --
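When the journal is long, it can help to count how many times the test script failed before the reboot. The sketch below does this by matching the 'test binary ... returned 246' lines shown above; `count_failures` is a hypothetical helper name, and the match string assumes the script path and exit code used in this guide.

```shell
# count_failures is a hypothetical helper: it reads watchdog log lines on
# stdin and prints how many of them report a failed test run (code 246).
count_failures() {
    grep -cF "test binary /etc/watchdog.d/pinghost.sh returned 246"
}

# Example:
# journalctl -u watchdog | count_failures
```

Only the 'test binary' lines are counted; the single 'repair binary' line logged just before the shutdown is deliberately left out.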