Thoughts and approaches when monitoring cronjobs
In my experience there's often a better way to do things than with cronjobs; however, for some use cases it's the right tool for the job.
- If it's possible to modify the job itself to push a metric to a metrics system, this often reduces systems setup, coupling, and moving parts.
- The metrics system can then alert on a lack of recent data points.
- I've successfully used this method to monitor database backups with both Prometheus's Pushgateway and InfluxDB with Grafana.
- This also enables sending along other data points, such as duration information for the job so it can be graphed.
- Care must be taken not to allow failed metrics code to cause the job to fail.
- Where it's not reasonable to modify the job, here are a couple of approaches that can be considered:
- Replacing the cron entry with a wrapper script that records duration and sends the metric.
- Signal handling should be implemented and passed through to the child process (the job).
- Creating a custom metrics exporter that checks on the results of the actions taken by the job.
- E.g. checking S3 for recent files in the backups directory.
- Different metrics systems would require different paradigms.
- With Prometheus, it could be a daemon exposing a `/metrics` endpoint which, when hit, reaches out to S3 and formats the returned data for consumption by Prometheus. Prometheus could then alert on both missing backups and an unreachable metrics exporter.
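
As a rough illustration of the custom exporter idea, here is a minimal sketch using only the Python standard library. The function `list_backup_timestamps`, the metric names, and the port are all hypothetical; the placeholder returns an empty list, and a real exporter would list the S3 backups directory there instead.

```python
"""Hypothetical exporter: report backup freshness on /metrics."""
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def list_backup_timestamps():
    """Placeholder: return modification times (Unix seconds) of backup files.

    A real exporter would list the backups directory in S3 here.
    """
    return []


def render_metrics(timestamps, now):
    """Format backup freshness in the Prometheus text exposition format."""
    lines = [f"backup_file_count {len(timestamps)}"]
    if timestamps:
        newest = max(timestamps)
        lines.append(f"backup_newest_timestamp_seconds {newest:.0f}")
        lines.append(f"backup_newest_age_seconds {now - newest:.0f}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # The lookup runs on every scrape, so the data is always current.
        body = render_metrics(list_backup_timestamps(), time.time()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9105):  # the port number is an arbitrary assumption
    """Serve /metrics until the process is stopped."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the daemon. An alert on `backup_newest_age_seconds` exceeding the backup interval would cover missing backups, while a failed scrape would cover an unreachable exporter.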
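
The wrapper script approach could look something like the sketch below. It assumes a Prometheus Pushgateway reachable at `http://localhost:9091` and a URL-safe job name; the metric names and the helpers `format_metrics`, `push_metrics`, `run_job`, and `main` are made up for illustration. It records the duration, passes SIGTERM and SIGINT through to the job, and treats a failed push as a warning rather than a job failure.

```python
"""Hypothetical cron wrapper: run a job, time it, push metrics, pass signals on."""
import signal
import subprocess
import sys
import time
import urllib.request

# Assumed Pushgateway address; adjust for your environment.
PUSHGATEWAY_URL = "http://localhost:9091"


def format_metrics(duration_seconds, exit_code, finished_at):
    """Render the data points in the Prometheus text exposition format."""
    return (
        f"cronjob_duration_seconds {duration_seconds:.3f}\n"
        f"cronjob_exit_code {exit_code}\n"
        f"cronjob_last_run_timestamp_seconds {finished_at:.0f}\n"
    )


def push_metrics(job_name, body, base_url=PUSHGATEWAY_URL):
    """Push to the Pushgateway; report success without ever raising."""
    try:
        request = urllib.request.Request(
            f"{base_url}/metrics/job/{job_name}",
            data=body.encode("utf-8"),
            method="PUT",
        )
        with urllib.request.urlopen(request, timeout=5):
            return True
    except Exception as exc:  # metrics problems must never fail the job
        print(f"warning: could not push metrics: {exc}", file=sys.stderr)
        return False


def run_job(command):
    """Run the job, forwarding SIGTERM/SIGINT to it; return (exit_code, duration)."""
    start = time.monotonic()
    child = subprocess.Popen(command)

    def forward(signum, _frame):
        child.send_signal(signum)

    previous = {s: signal.signal(s, forward) for s in (signal.SIGTERM, signal.SIGINT)}
    try:
        exit_code = child.wait()
    finally:
        # Restore whatever handlers were installed before the job ran.
        for signum, handler in previous.items():
            signal.signal(signum, handler)
    return exit_code, time.monotonic() - start


def main(argv):
    if len(argv) < 3:
        print("usage: cron_wrapper.py JOB_NAME COMMAND [ARGS...]", file=sys.stderr)
        return 2
    job_name, command = argv[1], argv[2:]
    exit_code, duration = run_job(command)
    push_metrics(job_name, format_metrics(duration, exit_code, time.time()))
    return exit_code  # the job's own result, whatever happened to the metrics
```

To use it as a script, the file would end with `sys.exit(main(sys.argv))`, and the cron entry would then call the wrapper with a job name followed by the original command. Alerting on a stale `cronjob_last_run_timestamp_seconds` is one way to catch a job that has stopped running.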