- the impact of downtime can be large; even 1 minute is costly
-
monitoring
- collect. Define what data you want to collect, for example TPS or response time. The collection method can be push or pull
- process. Decide how you use the metric data you collect, for example by parsing logs
- observability. Gives you a quick glance at the state of your system.
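A minimal sketch of the collect and process steps, assuming a hypothetical log format of `<timestamp> <endpoint> <duration in ms>` with one-second timestamps. The sample lines, field layout, and metric names are illustrative only, not taken from any real system.

```python
from collections import Counter
from statistics import mean

# Hypothetical access-log lines: "<ISO timestamp> <endpoint> <duration in ms>"
SAMPLE_LOG = """\
2024-01-01T10:00:00 /pay 120
2024-01-01T10:00:00 /pay 80
2024-01-01T10:00:01 /pay 100
2024-01-01T10:00:01 /topup 300
"""

def process(log_text):
    """Turn raw log lines into two metrics: peak TPS and average response time."""
    per_second = Counter()
    durations = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines instead of failing the whole batch
        timestamp, _endpoint, duration_ms = parts
        per_second[timestamp] += 1  # timestamps have one-second resolution
        durations.append(float(duration_ms))
    return {
        "peak_tps": max(per_second.values(), default=0),
        "avg_response_ms": mean(durations) if durations else 0.0,
    }

metrics = process(SAMPLE_LOG)
print(metrics)
```

Here the log is pulled and parsed in one batch; a push setup would send the same fields from the service itself.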
-
monitoring from different perspectives
- user view. What matters to the user: transaction time, transaction flow
- infra view. The condition of your current infrastructure: CPU load, memory usage, disk usage
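For the infra view, a rough sketch using only the Python standard library. The function name `infra_snapshot` and the dictionary keys are made up for illustration. Memory usage is left out because the standard library has no portable call for it; a third-party library such as psutil is commonly used for that.

```python
import os
import shutil

def infra_snapshot(path="/"):
    """Collect a few host-level readings using only the standard library."""
    disk = shutil.disk_usage(path)
    snapshot = {"disk_used_pct": round(100 * disk.used / disk.total, 1)}
    # os.getloadavg exists only on Unix-like systems, so guard the call.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass  # load average could not be obtained on this host
    return snapshot

snapshot = infra_snapshot()
print(snapshot)
```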
-
each team has to have visibility into its product. Enable it. Maybe place a monitor screen at each squad's table
-
alerting is where the action starts
-
an alert notifies you of a state change that might indicate a problem
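One way to read "state change" is to notify only when a metric crosses a threshold, not on every bad reading. The sketch below assumes a fixed threshold and two states, `ok` and `alert`; both are simplifications chosen for the example.

```python
def state_changes(readings, threshold):
    """Yield (index, old_state, new_state) only when the state flips."""
    previous = "ok"
    for i, value in enumerate(readings):
        current = "alert" if value > threshold else "ok"
        if current != previous:
            yield i, previous, current
        previous = current

# Hypothetical response times in milliseconds.
response_ms = [120, 130, 900, 950, 910, 140]
events = list(state_changes(response_ms, threshold=500))
print(events)
```

Three consecutive slow readings produce a single notification when the problem starts and one more when it recovers, which keeps the alert volume down.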
-
acting on that alert is important. Every minute counts
-
to help with that, automate.
- looping process of alerting :
- define your baseline. What does a well-performing system look like? The baseline can be volatile, depending on the condition of the system
- implementation. Implement the alert
- alerting
- action
- learning. Can you automate the action taken?
- new alert or new baseline
- deleting. Clean up old data to reduce noise. Too much noise can make you ignore your alerts
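Because the baseline can be volatile, one option is to derive it from recent healthy readings instead of hard-coding it. The sketch below is an assumption about how that could look: the class name `RollingBaseline`, the window size, and the "mean plus N standard deviations" rule are illustrative choices, not something the notes prescribe.

```python
from collections import deque
from statistics import mean, stdev

class RollingBaseline:
    """Baseline learned from the last `window` healthy readings."""

    def __init__(self, window=5, tolerance=3.0):
        self.samples = deque(maxlen=window)
        self.tolerance = tolerance

    def check(self, value):
        """Return True if value is abnormal against the current baseline."""
        abnormal = False
        if len(self.samples) == self.samples.maxlen:
            base, spread = mean(self.samples), stdev(self.samples)
            abnormal = value > base + self.tolerance * spread
        if not abnormal:
            self.samples.append(value)  # only healthy readings move the baseline
        return abnormal

baseline = RollingBaseline(window=5, tolerance=3.0)
readings = [100, 102, 98, 101, 99, 100, 400, 103]
results = [baseline.check(r) for r in readings]
print(results)
```

The first five readings only fill the window, so nothing can alert until a baseline exists. Abnormal readings are kept out of the window so that one incident does not raise the baseline for the next one.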
- know what your alert tells you:
- level of severity. Different severity levels can go to different alert channels
- what is happening. Give enough information about what is actually happening, which services are impacted, who the PIC (person in charge) is, etc
- action to take. You can also give a hint or guideline about what to do. If the action is repeatable, you can automate it to enable a self-healing system
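The three things an alert should tell you can be captured in a small data structure. This is only a sketch: the `Alert` class, its field names, and the severity-to-channel mapping in `CHANNELS` are hypothetical and would need to match your own tooling.

```python
from dataclasses import dataclass

# Hypothetical severity-to-channel routing; adjust to your own tooling.
CHANNELS = {"critical": "phone call", "warning": "chat", "info": "email"}

@dataclass
class Alert:
    severity: str   # level of severity
    summary: str    # what is happening
    services: list  # which services are impacted
    pic: str        # person in charge
    hint: str = ""  # suggested action to take

    def channel(self):
        """Pick the notification channel from the severity level."""
        return CHANNELS.get(self.severity, "email")

    def render(self):
        """Build a message that answers: how bad, what, where, who, what next."""
        lines = [
            f"[{self.severity.upper()}] {self.summary}",
            f"impacted: {', '.join(self.services)}",
            f"PIC: {self.pic}",
        ]
        if self.hint:
            lines.append(f"action: {self.hint}")
        return "\n".join(lines)

alert = Alert(
    severity="critical",
    summary="response time above baseline",
    services=["payment-api"],
    pic="on-call engineer",
    hint="check the most recent deployment",
)
print(alert.channel())
print(alert.render())
```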
- on call rotation :
- proactive. Based on early warning
- reactive. Based on a problem that is already occurring
- scheduling and shifting task
- provide the necessary guidelines and tools
- knowledge sharing
- mentoring and on-call shadowing for new members
- build a culture of empathy with shared responsibility and ownership
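Scheduling the rotation can start as a plain round-robin. The sketch below assumes weekly shifts with one primary engineer per week; the member names and the function `build_rotation` are invented for the example.

```python
from datetime import date, timedelta
from itertools import cycle

def build_rotation(members, start, weeks):
    """Assign one primary on-call engineer per week, round-robin."""
    schedule = []
    for week, person in zip(range(weeks), cycle(members)):
        schedule.append((start + timedelta(weeks=week), person))
    return schedule

# Hypothetical squad members.
rotation = build_rotation(["ana", "budi", "citra"], date(2024, 1, 1), weeks=4)
for shift_start, person in rotation:
    print(shift_start.isoformat(), person)
```

A shadowing new member could be added as a second column paired with each primary, without changing the round-robin itself.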
- measures of success :
- MTTA. Mean time to acknowledge, in minutes
- MTTR. Mean time to resolve, in minutes
- SLA. Define one even between services
- current problems :
- false positives
- too few or too many alerts
- analysis and decision making
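The acknowledge and resolve measures listed above can be computed from incident timestamps. This sketch assumes each incident records when the alert fired, when someone acknowledged it, and when it was resolved, and that both measures are counted from the moment the alert fired. The incident data is made up.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents: (triggered, acknowledged, resolved)
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 2), datetime(2024, 1, 2, 14, 10)),
]

def minutes_between(start, end):
    """Elapsed time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60

# Average time from alert to acknowledgement, and from alert to resolution.
mtta = mean(minutes_between(trig, ack) for trig, ack, _ in incidents)
mttr = mean(minutes_between(trig, res) for trig, _, res in incidents)
print(f"time to acknowledge: {mtta:.1f} min, time to resolve: {mttr:.1f} min")
```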