eheydrick/monitoring.md

Last active August 8, 2018 17:34

Monitoring talk

Monitoring Overview

Why monitoring

Distributed systems are complex, and things fail in unexpected ways
Monitoring gives you visibility into the system
Monitoring tells you stuff is broken before the customer notices

Types of monitoring

Blackbox monitoring - monitoring from the outside in. The customer's view of the system, e.g. is the user service up and reachable from the Internet (Monitis), can customers log in (Test Service)
Whitebox monitoring - monitoring from inside the system using data provided by the system, e.g. livestats, CPU load, memory usage, disk IOPS
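
A blackbox probe can be as small as one HTTP request made from outside the network. This is a minimal sketch using only the Python standard library, not how Monitis does it; the function name `check_http`, the 5 second timeout, and the `opener` parameter (there so the probe can be exercised without a network) are assumptions for illustration.

```python
import urllib.error
import urllib.request


def check_http(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe a URL from the outside and return (ok, detail)."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The service answered, but with a 4xx/5xx status.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # URLError is an OSError: DNS failure, refused connection, timeout.
        return False, f"unreachable: {exc}"
    return 200 <= status < 400, f"HTTP {status}"
```

The probe only knows what a customer would see: the service answered with a good status, answered with an error, or did not answer at all.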

Monitoring vs alerting vs notifying

Monitor everything that could break
Alert on things that will break or are broken but have low impact
Notify (page) on things that are broken and have a customer impact or will break very soon.
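
The three tiers above can be written down as a small decision function. This is only a sketch of the rules as stated; the names `Action` and `decide` and the boolean inputs are hypothetical, and a real system would derive them from check results.

```python
from enum import Enum


class Action(Enum):
    MONITOR = "monitor"  # record it, nobody is told
    ALERT = "alert"      # non-urgent, look at it during working hours
    NOTIFY = "notify"    # page the on-call


def decide(broken, customer_impact, will_break=False, very_soon=False):
    """Map the state of one monitored thing to monitor / alert / notify."""
    if (broken and customer_impact) or (will_break and very_soon):
        return Action.NOTIFY
    if broken or will_break:
        return Action.ALERT
    return Action.MONITOR
```

For example, `decide(broken=True, customer_impact=False)` gives `Action.ALERT`, while `decide(broken=True, customer_impact=True)` gives `Action.NOTIFY`.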

What to monitor

Things that expire: domains, SSL certs
Things that can be slow or fail: latency, an increase in 500 errors, exceptions
Things that can grow: queues, disk space
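
For things that expire, the check is just date arithmetic. A minimal sketch for SSL certs, assuming the expiry is available as the `notAfter` string that Python's `ssl` module reports for a peer certificate; the function name `days_until_expiry` and the injectable `now` parameter are assumptions for illustration.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a cert expires.

    not_after is a string such as 'Jan  5 09:34:43 2018 GMT'.
    now is seconds since the epoch; defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400
```

A check built on this would alert when the result drops below some comfortable margin and page only when expiry is imminent; the exact thresholds are a judgment call.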

What to alert on

Things that could be a problem or will be a problem, e.g. a queue is growing
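
"Queue is growing" can be turned into a number by projecting when the queue will hit a limit. This is a rough sketch using a straight line between the first and last sample; the function name `seconds_until_limit`, the sample format, and the limit are all hypothetical.

```python
def seconds_until_limit(samples, limit):
    """Project when a growing queue reaches limit.

    samples is a time-ordered list of (timestamp_seconds, depth) pairs.
    Returns seconds from the last sample, or None if the queue is not growing.
    """
    if len(samples) < 2:
        return None
    (t0, d0), (t1, d1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (d1 - d0) / (t1 - t0)  # items per second
    if rate <= 0:
        return None
    if d1 >= limit:
        return 0.0
    return (limit - d1) / rate
```

Alerting on the projected time rather than on the raw depth gives a warning while there is still time to act.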

What to notify on

Things that are currently a problem that could affect customers, e.g. a queue is really big, events aren't getting ingested, customers can't log in
What not to notify on: a service down on a single host; anything that isn't directly customer impacting (sleep is good); CPU, memory, and network utilization (usually)
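
One way to avoid paging for a single host is to look at the fleet as a whole. A minimal sketch, assuming per-host up/down results are already collected; the name `should_page`, the 50% threshold, and the choice to treat "no hosts reporting" as page-worthy are assumptions.

```python
def should_page(host_up, min_healthy_fraction=0.5):
    """Decide whether a service outage is worth waking someone up.

    host_up maps hostname -> True if the service is up on that host.
    """
    if not host_up:
        # Assumption: nothing reporting at all is treated as an outage.
        return True
    healthy = sum(1 for up in host_up.values() if up)
    return healthy / len(host_up) < min_healthy_fraction
```

With three hosts and one down, `should_page` returns `False`; the failure is still visible on a dashboard and can raise a non-paging alert.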

What metrics to collect

anything that moves or could move in the future

How we do monitoring + metrics

Whitebox monitoring: Sensu
Blackbox monitoring: Monitis
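
Sensu checks follow the Nagios plugin convention: print a line of output and exit 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. A minimal sketch of a whitebox disk check in that style; the check name `DiskCheck`, the function names, and the 80/90 percent thresholds are made up for illustration and are not taken from an actual Sensu plugin.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Turn a disk usage percentage into (exit_status, message)."""
    if percent_used >= crit:
        return CRITICAL, f"DiskCheck CRITICAL: {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DiskCheck WARNING: {percent_used:.1f}% used"
    return OK, f"DiskCheck OK: {percent_used:.1f}% used"


def main(path="/"):
    """Check one filesystem and return the exit status for the check."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DiskCheck UNKNOWN: {exc}")
        return UNKNOWN
    status, message = evaluate(usage.used / usage.total * 100)
    print(message)
    return status

# A real check script would finish with: sys.exit(main())
```

Keeping the threshold logic in `evaluate` separate from the system call makes the check easy to test without filling a disk.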

Components

Sensu clients + servers
Uchiwa (Sensu UI)
Grafana (Dashboards)
Monitis (External service checks)
Telegraf (Metrics agent)
OpsGenie (On-call paging)

Metrics

Metrics are collected with Sensu and Telegraf
Stored in InfluxDB
Accessed with Grafana
Some alerting is based on data in InfluxDB, e.g. timing
CloudWatch metrics are also available in Grafana, e.g. RDS metrics, ALB metrics
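
Points written to InfluxDB use its line protocol: a measurement name, optional tags, one or more fields, and a timestamp. A minimal sketch of formatting one point by hand, to show what a stored metric looks like; the helper names and the example measurement, tags, and values are hypothetical, and in practice Telegraf produces this.

```python
def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Format a field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line-protocol point with a nanosecond timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys are the recommended order
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"
```

Tags (such as host or queue name) are indexed and are what Grafana queries group and filter by; fields hold the measured values.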
