
@luqmansungkar
Last active February 25, 2020 03:56

How Being a Freak about Alert and Monitoring Saves Tokopedia from Millions of Dollars in Downtime

  • the impact of downtime can be huge; even 1 minute of downtime is costly

1

  • monitoring

    • collect. Decide what data you want to collect, for example TPS or response time. Collection can be push-based or pull-based
    • process. Decide how to use the metric data you collect, for example by parsing logs
    • observability. Gives you a quick view of the state of your system
  • monitoring from different perspectives

    • user view. What matters to the user: transaction time, transaction flow
    • infra view. The condition of your current infrastructure: CPU load, memory usage, disk usage
  • each team has to have visibility into its own product. Enable it, for example by placing a 'monitor' at each squad's table

  • alerting is where the action starts

  • an alert notifies you of a state change that might indicate a problem

  • the action taken on that alert is important. Every minute counts

  • to help with that, automate.
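
The idea that an alert should notify on a state change, rather than on every bad sample, can be sketched as below. This is only an illustration: the class name `StateChangeAlert`, the 500 ms threshold, and the sample response times are all assumptions, not anything from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StateChangeAlert:
    """Hypothetical alert that fires only when the state flips."""

    threshold_ms: float  # assumed baseline for an acceptable response time
    state: str = "ok"

    def observe(self, response_time_ms: float) -> Optional[str]:
        # Classify the new sample against the threshold.
        new_state = "alerting" if response_time_ms > self.threshold_ms else "ok"
        if new_state == self.state:
            # No state change, so stay quiet to avoid noise.
            return None
        previous, self.state = self.state, new_state
        return f"{previous} -> {new_state} (response time {response_time_ms:.0f} ms)"


# Made-up response times in milliseconds.
alert = StateChangeAlert(threshold_ms=500)
notifications = [alert.observe(v) for v in [120, 130, 900, 950, 140]]
```

With these samples only two notifications are produced: one when the response time first crosses the threshold and one when it recovers. The second slow sample stays silent, which keeps the alert channel quiet.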

2

  • the alerting loop:
    • define your baseline. What does a well-performing system look like? The baseline can be volatile, depending on the condition of the system
    • implementation. Implement the alert
    • alerting
    • action
    • learning. Can you automate the action taken?
    • new alert or new baseline
    • deleting. Clean up old data to reduce noise. Too much noise can make you ignore your alerts
  • know what your alert tells you:
    • level of severity. Different severity levels can go to different alert channels
    • what is happening. Give enough information about what is actually happening, which services are impacted, who the PIC (person in charge) is, etc.
    • action to take. You can also give a hint or guideline about what to do. If the action is repeatable, automate it to enable a self-healing system
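
As a rough sketch of what such an alert could carry, the snippet below models the three pieces above (severity, what is happening, action to take) and routes by severity. The channel names, the field names, and the `restart_service` helper are hypothetical placeholders, not part of any real alerting tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Severity(Enum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Assumed mapping: a different severity goes to a different channel.
CHANNELS = {
    Severity.INFO: "chat",
    Severity.WARNING: "chat+email",
    Severity.CRITICAL: "phone-call",
}


@dataclass
class Alert:
    severity: Severity
    summary: str                  # what is happening
    impacted_services: List[str]  # which services are impacted
    pic: str                      # person in charge
    hint: str                     # guideline about what to do
    auto_action: Optional[Callable[[], str]] = None  # repeatable fix, if any


def dispatch(alert: Alert) -> Tuple[str, str, str]:
    """Pick a channel, build the message, and run the automated action if set."""
    channel = CHANNELS[alert.severity]
    services = ", ".join(alert.impacted_services)
    message = (
        f"[{alert.severity.name}] {alert.summary} | services: {services} "
        f"| PIC: {alert.pic} | hint: {alert.hint}"
    )
    outcome = alert.auto_action() if alert.auto_action else "manual action required"
    return channel, message, outcome


def restart_service() -> str:
    # Hypothetical stand-in for a real self-healing step.
    return "service restarted"


channel, message, outcome = dispatch(
    Alert(Severity.CRITICAL, "checkout latency high", ["checkout", "payment"],
          "on-call engineer", "check the recent deploy", restart_service)
)
```

Keeping the automated action as an optional field means an alert starts out as a manual hint and can be promoted to self-healing once the fix has proven repeatable.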

3

  • on-call rotation:
    • proactive. Based on early warnings
    • reactive. Based on a problem that is already occurring
    • scheduling and shifting tasks
    • provide the necessary guidelines and tools
    • knowledge sharing
    • mentoring and on-call shadowing for new members
    • build a culture of empathy by sharing responsibility and ownership
  • measures of success:
    • MTTA. Mean time to acknowledge, in minutes
    • MTTR. Mean time to resolve, in minutes
    • SLA. Define one even between internal services
  • current problems:
    • false positives
    • too few or too many alerts
    • analysis and decision making
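
A minimal sketch of how MTTA and MTTR could be computed from incident timestamps follows. The incident records are made-up sample data, and the field names `triggered`, `acknowledged`, and `resolved` are assumptions. Both metrics are measured from the moment the alert triggered.

```python
from datetime import datetime
from statistics import mean

# Made-up incidents, for illustration only.
incidents = [
    {
        "triggered": datetime(2020, 1, 1, 10, 0),
        "acknowledged": datetime(2020, 1, 1, 10, 4),
        "resolved": datetime(2020, 1, 1, 10, 30),
    },
    {
        "triggered": datetime(2020, 1, 2, 22, 0),
        "acknowledged": datetime(2020, 1, 2, 22, 2),
        "resolved": datetime(2020, 1, 2, 22, 20),
    },
]


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


# Average over all incidents, measured from the trigger time.
mtta = mean([minutes_between(i["triggered"], i["acknowledged"]) for i in incidents])
mttr = mean([minutes_between(i["triggered"], i["resolved"]) for i in incidents])
```

Tracking both numbers per service over time makes it easier to see whether changes to the rotation or to automation are actually paying off.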

4 5
