Amanda Sopkin (amsopkin at- gmail.com)
- causes may not actually affect users
- e.g. disk full is a cause of the app being down, not a symptom; you do need to fix it but because of the user impact, not because disks inherently need to be empty
- only the minimum of details needed to be useful
- good graphs need labels and other good practices
- no more than 5 graphs per console (dashboard) and no more than 5 plots/lines per graph (from monitoring w/ Prometheus)
- escalation procedure
- what is the appropriate chain of alerting
- Console graphs? not particularly effective
- Latency? somewhat effective
- raw? hard to see average/impact
- average? useless, doesn't move much
- use p99 instead, plus p100 (the maximum)
- heatmap of latency
- latency of failures vs. successes is important; erroring early is not better, it's still erroring
- error rates
- good for rare response codes
- not particularly good for other patterns
- request size may also indicate an error
- Traffic demand
- RequestPerSecond
- not useful for new issues
- saturation? fullness
- users vs. CPU/mem/etc
- evaluate trends in your alerts
- outages may need more detailed evaluation of alerts
- user metrics (google analytics/new relic/APM)
- monitor the monitoring
- batch job completions
- allow 2 failures before alerting; if that is too slow, run the job more often until 2 failed runs in a row are acceptable
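
The batch job rule above can be sketched roughly as follows. This is a minimal illustration, not something from the talk: the names `BatchJobAlert` and `max_run_interval` are hypothetical, and it assumes "allow 2 failures" means the alert fires on the third consecutive failed run. It also assumes the job has some freshness deadline (how stale its output may get), which is what determines how often it has to run for two failures to be tolerable.

```python
from dataclasses import dataclass


@dataclass
class BatchJobAlert:
    """Hypothetical helper: alert only after more than `allowed_failures`
    consecutive failed runs (assumed reading of "allow 2 failures")."""
    allowed_failures: int = 2
    consecutive_failures: int = 0

    def record(self, succeeded: bool) -> bool:
        """Record one run and return True if an alert should fire."""
        if succeeded:
            # Any success resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures > self.allowed_failures


def max_run_interval(freshness_deadline_s: float, allowed_failures: int = 2) -> float:
    """Longest interval between runs (seconds) such that `allowed_failures`
    consecutive failed runs still leave one more run before the deadline."""
    return freshness_deadline_s / (allowed_failures + 1)


if __name__ == "__main__":
    alert = BatchJobAlert()
    for outcome in (False, False, False, True):
        print("succeeded" if outcome else "failed", "-> alert:", alert.record(outcome))
    # Assumed example: output must be refreshed at least once an hour.
    print("run at least every", max_run_interval(3600), "seconds")
```

Under these assumptions, a job whose output must be no more than an hour old would run every 20 minutes, so two failed runs in a row still leave one attempt before the deadline.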