- Distributed systems are complex, and things fail in unexpected ways
- Monitoring gives you visibility into the system
- Monitoring tells you something is broken before the customer notices
- Blackbox monitoring: monitoring from the outside in, i.e. the customer's view of the system, e.g. is the user service up and reachable from the Internet (Monitis), can customers log in (Test Service)
- Whitebox monitoring: monitoring from inside the system using data the system itself provides, e.g. livestats, CPU load, memory usage, disk IOPS
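
The two viewpoints above can be illustrated with a minimal Python sketch. It is not how Monitis or Sensu implement their checks. The function names, the `opener` parameter (there only so the probe can be exercised without network access), and the returned dictionary keys are all assumptions made for illustration.

```python
import os
import shutil
import urllib.request


def blackbox_check(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe a URL from the outside; True if it answers with a 2xx/3xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts are OSError subclasses.
        return False


def whitebox_snapshot(path="/"):
    """Collect host-level data from inside the system (hypothetical key names)."""
    disk = shutil.disk_usage(path)
    try:
        load_1m = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1m = None  # load average is not available on this platform
    used_fraction = disk.used / disk.total if disk.total else 0.0
    return {"load_1m": load_1m, "disk_used_fraction": used_fraction}
```

The blackbox function knows nothing about the host it probes, while the whitebox function only works when run on the host itself. That difference is the distinction the two definitions draw.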
- Monitor everything that could break
- Alert on things that will break, or that are broken but have low impact
- Notify (page) on things that are broken and have customer impact, or that will break very soon
- Things that expire: domains, SSL certs
- Things that can be slow or fail: latency, an increase in 500 errors, exceptions
- Things that can grow: queues, disk space
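
As one example of watching "things that expire", the sketch below reads a server certificate's expiry date and converts it to days remaining. It uses only the Python standard library. The function names are hypothetical, and the notes do not say how certificate expiry is actually checked.

```python
import socket
import ssl


def cert_expiry_epoch(host, port=443, timeout=5.0):
    """Fetch the server certificate and return its notAfter time as epoch seconds."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return ssl.cert_time_to_seconds(cert["notAfter"])


def days_left(expiry_epoch, now_epoch):
    """Days between now and expiry; negative once the certificate has expired."""
    return (expiry_epoch - now_epoch) / 86400
```

A check built on this would call `days_left(cert_expiry_epoch("example.com"), time.time())` and compare the result with whatever lead time the team needs to renew.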
- Things that could be a problem or will be a problem, e.g. a queue is growing
- Things that are currently a problem and could affect customers, e.g. a queue is really big, events aren't getting ingested, customers can't log in
- What not to notify on: a service being down on a single host, anything that isn't directly customer-impacting (sleep is good), and (usually) CPU, memory, or network utilization
- Anything that moves or could move in the future
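
The queue example can be made concrete with a small sketch: a growing queue is an alert, and a very large queue is a page. The function name, the "strictly increasing samples" definition of growth, and the `page_depth` threshold of 10,000 are assumptions for illustration, not values from these notes.

```python
from collections.abc import Sequence


def classify_queue(depths: Sequence[int], page_depth: int = 10_000) -> str:
    """Return "page", "alert" or "ok" for a series of queue depth samples.

    depths is ordered oldest to newest; page_depth is a hypothetical threshold.
    """
    if not depths:
        return "ok"
    if depths[-1] >= page_depth:
        # Already big enough to affect customers: wake someone up.
        return "page"
    # Treat the queue as growing if every sample is larger than the previous one.
    growing = len(depths) >= 2 and all(b > a for a, b in zip(depths, depths[1:]))
    return "alert" if growing else "ok"
```

For instance, `classify_queue([100, 200, 300])` returns `"alert"`, while `classify_queue([5000, 12000])` returns `"page"`.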
- Whitebox monitoring: Sensu
- Blackbox monitoring: Monitis
- Sensu clients + servers
- Uchiwa (Sensu UI)
- Grafana (Dashboards)
- Monitis (External service checks)
- Telegraf (Metrics agent)
- OpsGenie (On-call paging)
- Metrics are collected with Sensu and Telegraf
- Stored in InfluxDB
- Viewed in Grafana
- Some alerting is based on data in InfluxDB, e.g. timing
- CloudWatch metrics are also available in Grafana, e.g. RDS and ALB metrics
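
Alerting on timing data could look roughly like the sketch below. It assumes the timing samples have already been fetched from the metrics store and deliberately leaves out any InfluxDB client calls. The function names, the nearest-rank percentile method, and the 500 ms threshold are all assumptions made for illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def timing_breached(samples_ms, threshold_ms=500, pct=95):
    """True if the chosen percentile of the timing samples exceeds the threshold."""
    return percentile(samples_ms, pct) > threshold_ms
```

Checking a high percentile rather than the mean keeps one slow request from triggering an alert, while a sustained slowdown still shows up.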