Please, no More Minutes, Milliseconds, Monoliths... or Monitoring Tools! - Adrian Cockcroft
gave 5 good rules for monitoring systems, showed what cloud / microservices monitoring looks like @ Netflix
Simple math to get some signal out of your noisy sea of data - Toufic Boubez
explains why static alert thresholds don't work and gave 3 techniques to use instead
Car Alarms and Smoke Alarms - Dan Slimmon
how to use sensitivity and specificity in monitoring, some good math
Metrics 2.0 - Dieter Plaetinck
metrics20.org = redesign of graphite that fixes a bunch of stuff, keep an eye on this project
StatsG at New York Times - Eric Buth
the first half of the talk on ops philosophy was really interesting, the second half about StatsG is not so useful
Best talks day 2:
"Auditing all the things": The future of smarter monitoring and detection - Jen Andre
really awesome security talk, lots of good practical steps for us
Is There An Echo In Here?: Applying Audio DSP algorithms to monitoring - Noah Kantrowitz
shows how to use audio processing techniques on monitoring data, good math, very interesting
The Lifecycle of an Outage - Scott Sanders
github's tools & procedures & culture around resolving outages
A whirlwind tour of Etsy's monitoring stack - Daniel Schauenberg
practical walkthrough of Etsy's (extensive) monitoring system
Web performance observability - Mike McLane & Joseph Crim
not sure we can directly use the tool they made, but this is a good idea of what a web performance benchmark suite looks like, also see canary.io lightning talk
Good lightning talks:
serverspec + sensu: interesting approach to testing & monitoring, if you write serverspecs for testing / CI, you can also run them on your production servers and get even better coverage
monitoring & inadvertent spam traps: anecdote from a developer on how developers can use monitoring to solve problems
Expanding Context to Facilitate Correlation: showed 3 open source tools that improve on graphite/nagios web interfaces
canary.io: project from github ops for doing web performance testing, still in the early stages, but looks promising
Distributed Operational Responsibility: some tips from Spotify on why ops responsibilities (like monitoring) should be shared with developers
Semi-interesting sponsor plugs:
VividCortex: MySQL performance analysis tool (SaaS) from ex-percona guys
Pagerduty: we should start using multi-user alerting (new feature, they gave 2 good use-cases)
Elasticsearch: ~70% of the people attending were using Elasticsearch
Big Panda: building a smarter "inbox" for ops (to replace email + jira)
Recurring themes / big takeaways:
monitoring must scale ahead of the underlying system
you need high frequency monitoring: it's not OK to wait minutes for a check result or alert
collect data on everything with graphite
data collection should be a default on everything from the beginning, it should not be a time-consuming / reactive / after-the-fact process
only alert when work isn't getting done, RAM / swap / CPU / etc. are not something you should directly alert on
Please, no More Minutes, Milliseconds, Monoliths... or Monitoring Tools! - Adrian Cockcroft
from left to right: ignore, ignore, ignore, no, no, I said No dammit, oh no, oh fuck
rest of world = half way through cloud adoption
you are here = trying to play catch up
20 years exp:
94 "SE Toolkit"
98 Sun Perf. Tuning
99 Resource Mgmt.
00 Capacity Planning for Web Services
07 Outstanding Contrib. to Computer Metrics
04-08 Capacity Planning Workshops
14 Monitorama!
state of the art in 2008:
cacti, ganglia, nagios, zenoss, mrtg, Wireshark
low number of machines
it was subversive to think that open source could replace expensive enterprise tools
created "SE", a C interpeter which could extract solaris performance information and output it all in a standard format
created "virtual adrian", a simple rule based system for automated monitoring of disk, memory, etc. in solaris (to watch systems while he was on vacation)
why no more monitoring tools?
we have too many
we need more analysis tools, can we get an analysorama conference?
rule #1: we spend too much time collecting, storing, and displaying metrics
if you spend 50% of your time on this it's too much
we need more automation, more analysis
monitoring should not be tacked on, it should be a default
what's wrong with minutes?
not enough resolution to catch problems
it takes 5-8 minutes before you start seeing alerts
if you had second resolution, you can see the difference in 5 seconds
if your rollbacks are quick, you can revert a bad change in 5 seconds
compare a 10 second outage to a 10 minute outage
from continuous delivery we know that small incremental changes are best
so we need the same from monitoring
instant detection and rollback within seconds should be a goal
SaaS tools that do this: VividCortex, boundary
how does netflix do it? hystrix and turbine, websockets, streaming metrics, 1 second resolution & 15 seconds of history, circuit breakers, pages go to who is directly responsible for a specific component or change
rule #2: metric collection -> display latency should be < human attention span (10s)
what's wrong with milliseconds?
in a lot of JVM instrumentation, ms is the standard
the problem with ms is that a lot of datacenter and hardware communication needs nanosecond resolution
rule #3: validate your measurement system has enough accuracy and precision
if there's a difference between something taking X and Y nanoseconds in your system, and all you have are a bunch of 1ms data points, you can't identify the problem
what's wrong with monoliths?
monolithic monitoring tools are easy to deploy, but when they go down, you then have no monitoring
there needs to be a pool of aggregators, displayers, etc.
easier to do upgrades, more resilient to downtime
anything monolithic has performance problems, scalability problems, SPOFs, can't tell the difference between monitoring system going down vs. actual system going down
in-band monitoring: running monitoring on the same process, server, data center, etc. as the system itself
SaaS monitoring: send to a third party
both: an outage can't take out both monitoring systems, HA monitoring
they might not be monitoring exactly the same stuff, but they should have some overlap
rule #4: monitoring needs to be as (or more) available & scalable than the underlying system
continuous delivery:
high rate of change
new machines being spun up and shut down all the time (in netflix's case)
short baselines for alert threshold analysis
ephemeral configuration
short lifetimes make it hard to aggregate historical data
hand tweaked solutions do not work, it would take too much effort
microservices:
complex flow of requests
how do you monitor end-to-end when the dependencies and flow of requests is so complex and dynamic?
Gilt Groupe: went from a handful of services to 450 services over the course of a year
"death star" microservice pattern: everything is calling everything else in one big tangled graph of dependencies
how do you visualize this? we need more hierarchy & grouping
closed loop control systems:
how did netflix do autoscaling?
on every deploy during peak time, double the number of servers
using load average, which is not the best metric to use
lots of overshoots
new solution: scryer
predictive autoscaler, FFT based algorithm, builds a forward predicted model to set the autoscale level
scales ahead of time, then corrects as necessary
using the old method it was hard to do this analysis, because the data was so chunky (from the doubling)
code canaries:
ramp up of deployment, looks for errors, if there are problems it emails the responsible team and stops rolling out the code
monitoring tools for developers:
most monitoring tools are built for ops / sysadmin (DBA vs. network admin vs. sysadmin vs. storage admin)
fiefdoms of different teams and tools, different levels of access, hard to collaborate, hard to integrate and extend
state of the art is to move towards APM, analytics, integrated tools for all teams
deep linking & embedding, extensible tools
business transactions, response time, runtime (e.g. JVM) metrics
challenges with dynamic ephemeral cloud apps:
dedicated hardware: arrives infrequently, disappears infrequently, sticks around for years, unique IPs and MAC addresses
cloud assets: arrive in bursts, stick around for a few hours, recycle the IPs and MACs of machines that were just shut down!
in the cloud model, you need to have a historical record of everything that ever happened in your infrastructure (Netflix Edda)
traditional arch:
business logic
DB master & slave
some fabric in between
storage
new cloud systems:
business logic
NoSQL nodes
cloud object store
not all hosted cloud services have detailed monitoring / metrics exposed
you depend on web services to integrate with cloud services
span zones & regions, monitoring now needs to span zones & regions too
NoSQL introduces new failure modes
5 rules:
analysis > collection
key business metric monitoring should be at one-second resolution
precision and accuracy -> more confidence
monitoring must be more scalable than the underlying system
start building distributed, ephemeral cloud native applications
Q&A:
Q: you mentioned better visualization for microservices, like what?
A: a user hits the homepage -> what services are hit? there is no arch. diagram anymore; part of viz. involves seeing which zones and regions are hit, plus manual tagging & hierarchy of components, owners, etc.; it's useful, for instance, to limit the view to just the services my team owns or depends on (an aspect-oriented view); it's not a solved problem though, most OSS monitoring tools have good backends but less good UIs; cloudweaver looks interesting
Q: canary system, what types of checks are you running?
A: error rate, CPU time, response time, jmeter functional tests, business metrics, and you need to do the comparison on freshly spun up nodes (e.g. 3 old vs. 3 new copies of the code on freshly spun up machines)
Computers are a Sadness, I am the Cure - James Mickens
(this talk was just entertainment, no practical information)
i'm here to take you on a quest
everything i'm going to tell you is 100% true
bla bla
distributed systems send messages back and forth
most messages fail because god hates us
so we send more
10 years ago the MapReduce paper was like alien technology
it was so simple and seductive, you just specified a map and reducer function, ran it on commodity machines, it was amazing
that was 10 years ago
let's stop talking about MapReduce
say "word count" one more time
let's also stop talking about "the cloud"
the problem with all this social cloud stuff is that i hate most people
there are two kinds of people: people who have actually built cloud software and others
others: cloud is great!, 99.9999999%!, everyone is happy, everything is a solved problem!
real cloud people: it's a nightmare, hardware fails, SLAs are misleading, IO is queued up, packets get sent to a black hole, it's madness
why does anything happen at all in the cloud?
it's like an old timey map with dragons in the middle
this is why we need monitoring & analysis
a message of hope: give up
look at the CAP theorem, you can't have it all
if your email goes down, then your reaction should be to want to use email less, go do something else
can't take your test at your MOOC? take it later, your MOOC degree will be just as worthless
let's be serious though
some things we do need to care about
(nosql rant i didn't fully write down, nosql = bane from batman, throw out all the rules and laws, chaos)
conventional wisdom: america needs more programmers
reality: we need fewer programmers
technology is not the future, no more stupid apps, painting is the future, go do that, leave me alone
if you are a VC who funds this kind of stuff, i hope you become poor
let's be serious about security
threat model: mossad or not-mossad
either you are being attacked by mossad or you're not
"not attacked by mossad" = where you want to be, just keep using strong passwords and don't click on weird links
"you are being attacked by mossad" = no defenses, you're going to die
america's mental model of the CIA, FBI, etc. is that they are a bunch of boy scouts
in reality: drones, exoskeletons, cable splicing submarines
they're not going to send boy scouts, they're not going to fight close range musket battles, they're going to use their advantage of having access to all the infrastructure you depend on
how do you defend against that with rocks and pencils and leaves?
easy attacks are easy
"Mary" from "Central University" working as a "Rectuier" with an attactive profile picture wants to be my friend on Facebook
obviously i don't know mary
BUT WHAT IF I DO KNOW MARY
most important goal in security: eliminate men as a gender
possible solution: dude overflow detected -> trigger bear trap and the guy from the SAW movie
summary:
ozzy osbourne crazy train = cloud computing
bane = nosql
bla bla
Q&A:
Q: can i be your friend on facebook?
A: there is a background check, and i will wait 2-3 days to show i'm not desperate, i encourage you to submit an application though, i love judging people
Simple math to get some signal out of your noisy sea of data - Toufic Boubez
i lied! there are no simple tricks
too good to be true = it probably is
background:
CTO Metafor Software
CTO Layer 7 Technologies
CTO Saffron Technologies
let's start with the "Wall of Charts"
hire a new guy: shove him in front of the wall of charts
we collect 1000s of metrics, pick 10, and put them in a dashboard
this is meaningless
WoC leads to alert fatigue
alert fatigue is one of the largest problems in ops
watching WoCs cannot scale
at some point, you will need a person or a team dedicated to watching the WoCs
so we need to turn this work over to the machines
to the rescue: anomaly detection
definition: detect events or patterns which do not match expectation
definition for devops: alert when one of our graphs starts looking wonky
who else is doing anomaly detection?
manufacturing QC has been doing this for a long time
measure the diameter, weight, etc. of the flux capacitors and throw the outliers away
assumptions: normal, gaussian distribution; data is "stationary", it doesn't change much over time
the "three-sigma rule": 68% of the values lie within 1 std dev of mean, 95% lie within 2, 99.7% lie within 3
mark those percentages as the "red lines" on the graphs and take action when a value falls outside of a red line
if you implement 3-sigma rule alerts in the data center:
a. you get alerted all the time, or
b. you don't get alerted when there's a real problem
the assumptions from manufacturing (gaussian, stationary) don't apply to the data center
static thresholds are ineffective
if data is moving, we need a moving threshold, that's a smart idea
the "big idea" of moving averages: the next value should be consistent with the recent trend
finite window of past values, ignore the whole history
calculate a predicted value
"smoothed" version of time series
compare squared error rates between smooth vs. raw data
now you can compute the 3-sigma values based on that smoothed data
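a rough python sketch (with made-up data and window size) of that moving-average 3-sigma band:
    import numpy as np
    def moving_band(series, window=30, n_sigma=3):
        # sliding-window mean and std dev -> a "moving" 3-sigma band
        series = np.asarray(series, dtype=float)
        upper = np.full_like(series, np.nan)
        lower = np.full_like(series, np.nan)
        for i in range(window, len(series)):
            recent = series[i - window:i]        # finite window, ignore older history
            mu, sigma = recent.mean(), recent.std()
            upper[i] = mu + n_sigma * sigma
            lower[i] = mu - n_sigma * sigma
        return upper, lower
    # alert when the raw value escapes the band:
    # upper, lower = moving_band(values); alerts = (values > upper) | (values < lower)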
what about spikes, outliers, etc.? windows can be skewed
ok, now we use a weighted moving average, less weight on data that is further away
not good enough, doesn't handle trends, exponential smoothing
double exponential smoothing (DES)
triple exponential smoothing (TES)
Holt-Winters (seasonal effects)
result:
a. you are woken up a lot less, but still woken up
b. it still doesn't catch some problems
are we doomed?
no
smoothing works on certain kinds of data
smoothing works when deviations are normally distributed
there are lots of non-gaussian techniques, we're only going to scratch the surface in this talk
trick #1: histograms
(better: kernel densities, but histograms work and are simple)
if you have a bunch of different time series of the same metric, build a histogram for each series
start by looking at the distribution of your data, understand what it looks like before you start your analysis
trick #2: kolmogorov-smirnov test
it sounds cool and it works
compares two probability distributions
requires no assumptions about the underlying distribution
measures max dist. between two cumulative dists.
good for comparing day-to-day, week-to-week, seasonal effects
"are these two series similar or not?"
KS with windowing
example: KS for week 1 vs. week 2 and week 2 vs. week 3 (where week 3 is during christmas and we experienced a problem)
1 vs. 2: small distance
2 vs. 3: huge distance
the case where 3-sigma static threshold failed is now extremely clear with KS
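a minimal sketch of the windowed KS comparison with SciPy (synthetic stand-in data, not the talk's):
    import numpy as np
    from scipy import stats
    rng = np.random.default_rng(0)
    week1 = rng.normal(100, 10, 168)   # one week of hourly samples, normal behavior
    week2 = rng.normal(101, 10, 168)   # similar behavior
    week3 = rng.normal(140, 30, 168)   # the "christmas" week with a problem
    print(stats.ks_2samp(week1, week2))  # small KS statistic: the distributions look alike
    print(stats.ks_2samp(week2, week3))  # large KS statistic: something changed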
trick #3: diffing / derivatives
often when your data is not stationary, the derivative is
e.g. random walks
most frequently, the first difference is sufficient: dS(t) <- S(t+1) - S(t)
once you have the stationary data set, gaussian techniques work better
real example: CPU time
the distribution is totally non-gaussian, very noisy and random looking
but.. first difference, it totally is gaussian!
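a small numpy sketch of the differencing trick (a synthetic random walk, not the CPU data):
    import numpy as np
    rng = np.random.default_rng(1)
    series = np.cumsum(rng.normal(0, 1, 1000))  # non-stationary random walk
    diff = np.diff(series)                      # first difference: S(t+1) - S(t)
    print(series.std(), diff.std())             # the differenced series is far better behaved
    # once the data is (roughly) stationary, the gaussian 3-sigma machinery applies again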
you're not doomed if you know your data
understand the statistical properties of your data
data center data is typically non gaussian
so don't use smoothing
use histograms, KD, and derivatives instead
Q&A:
Q: is your point to make everything gaussian?
A: no! sorry if i conveyed this message, KS does not involve gaussian, there are lots of good non-gaussian techniques
The Care and Feeding of Monitoring - Katherine Daniels
a story
pagerduty tells us our site is down
so we checked, and it was down
then... a minute later, it's back
hmm. ok.
then.. a few minutes later
down again
and up again
this is.. The Blip, a randomly occurring outage that fixes itself
so what's happening?
500 rate.. nothing
API errors.. nothing
error rate... nothing
what are we missing from our monitoring?
monitor all the things!
we're missing something, just start randomly adding metrics until we find it
then you get.. this..
zenoss screenshot that's all red from down checks
we're trying to find a needle in a haystack and just added more hay
this is why you don't do a full body diagnostic scan for medical patients, the more you look for, the more you might find, and they might not all be actual issues
so, we need to monitor only some of the things..
first looked at the load balancers, because everything dropped out of the LB at once
tried provisioning a new ELB, switching availability zones
looked at access logs
everything worked the same, still getting the blip
how about the healthcheck?
the healthcheck was hitting something called "healthD", a healthcheck service that failed when one or both of two important backend components went down
and there weren't any logs or monitoring for healthD itself
looking inside healthD showed that one of the two services, api2, had a problem
it seems a certain misbehaving user was triggering bad requests
so we went into api2 and added metrics per response type
found the response type that stood out
decreased timeouts from 60 seconds to 5 seconds
optimized some slow queries
deleted some old slow / unused API methods
now the site was back to normal
why didn't we have monitoring for this?
black boxes, mysteries
any X-as-a-Service that you depend on (e.g. ELBs) are black boxes and need some special care for monitoring
technical debt / bad technical decision
why did the healthcheck require both services to be up?
why did we even have two separate APIs?
long ago someone decided to do a rewrite, but the old system remained
we can only move forward at this point, we can't shut down either system, so we need to monitor both
what to monitor:
monitor all services
monitor responsiveness (network, API, web server)
system metrics (memory used, CPU used, disk space)
application metrics (read lock time, write lock time, error rate, API response time)
don't get into a situation where you have to say "oh yeah that check is red but it's OK, don't worry"
as someone mentioned earlier, your monitoring needs to scale above your application
load test your monitoring, make sure it can keep up and responds properly with increased load
monitoring should not be a silo, it shouldn't be an ops problem
monitoring should be built in to the application from the beginning
work with developers
ask: "what does it mean for this application to work properly? what does it look like when it breaks?"
monitoring shouldn't be a reactive last minute thing
Car Alarms and Smoke Alarms - Dan Slimmon
Sr. Plat Engineer at Exosite, which does internet of things
we recently made a better mousetrap that texts you when it goes off, so if you have a building full of mouse traps you only need to check the one that was tripped
we wear many hats in ops
but data science is becoming a very important hat
people believe you when you have graphs
signal to noise ratio
example: plagiarism detection
let's say we make a system that has a 90% chance of positive plagiarism detection
20% chance of a false positive (a non-plagiarized paper gets flagged), i.e. 80% specificity
and 30% of kids currently plagiarize
some questions:
given a random paper, what's the prob you get a negative result?
59%
what's the probability that the system will catch a plagiarized answer?
90%, duh, we already knew that, why'd i ask you that?
if you get a positive result, what's the probability the result really is plagiarized?
65.8%
this is an unintuitively terrible result
we originally heard 90% chance
but now in the real world it's down to 65.8%, that's pretty useless
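a quick python sanity check of those numbers (this assumes the 20% figure above is the false-positive rate, i.e. 80% specificity):
    p_plag = 0.30       # 30% of papers are plagiarized
    sensitivity = 0.90  # P(positive | plagiarized)
    specificity = 0.80  # P(negative | not plagiarized)
    p_negative = p_plag * (1 - sensitivity) + (1 - p_plag) * specificity
    ppv = (sensitivity * p_plag) / (sensitivity * p_plag + (1 - specificity) * (1 - p_plag))
    print(p_negative)   # 0.59   -> 59% of random papers test negative
    print(ppv)          # 0.658... -> only ~65.8% of positives are real plagiarism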
sensitivity and specificity
sensitivity: % of actual positives that are identified as such
specificity: % of actual negatives that are identified as such
high sensitivity: freaks the fuck out when anything might be considered slightly bad
high specificity: if it says you cheated, sorry, you definitely cheated
Metrics 2.0 - Dieter Plaetinck
metrics20 allows you to use more characters to label your metrics (e.g. "/" for "Req/s")
metrics20 allows you to add extra metadata to your metrics
for example, src/from parameters, so you can track where a metric is being submitted from
conceptual model -> wire protocol (compatible with graphite/statsd/carbon) -> storage
metrics20.org
units are extremely useful:
MB/s, Err/d, Req/h, ...
B Err Warn Conn Job File Req ...
we allow you to use SI + IEEE standard units
easier to learn, more flexible
Carbon-tagger:
middleware between old graphite instance and new metrics20 instance
adapts old format to new format (adding metadata, units, etc.)
Statsdaemon:
similar to etsy statsd, drop-in compatible
if you send a bunch of bytes B over time, it automatically figures out this is B/s
if you send a bunch of milliseconds ms over time, it automatically calculates percentiles/min/max/mean/etc.
Graph-Explorer:
dashboard system with a new query syntax
New query syntax:
proxy-server swift server:regex unit=ms
automatically does group-by based on metadata
automatic legends, axes, tagging (these are all manual in graphite)
stat=upper_90
from datetime to datetime
avg over (5M, 1h, 1d, ...)
Some examples:
Which is slower, PUT or GET?
stack ...
http_method:(PUT|GET)
swift_type=object
Show http performance per server:
http_method:(PUT|GET)
group by unit, server
grab all job stats (note how no timeseries names are explicitly given, this finds all timeseries that have a unit of "Jobs/second"):
transcode unit=Job/s
avg over <time>
from <datetime> to <datetime>
another example:
...didn't catch it...
another example, but now grouped by zone:
...
group by zone
network bandwidth by server:
unit=MB/s network dfvimeorpc sum by server[]
cumulative total of bandwidth over time
(automatic integration)
rate of change:
(automatic derivatives)
bonus features:
graphs are interactive (inspect, zoom)
set up rules & alerts
imagine a disk space check which can alert you on both individual machines and cluster-wide
email alerts (with embedded graphs)
emit events (see anthracite), add notes / events to graphs, events have full text search
better dashboards: allow you to dynamically append a fragment of a query to every query in the dashboard (e.g. switching between different group-by clauses)
easier to define colors
future work:
these three features are all about condensing series into smaller sets of data:
aggregation rules
graphite API functions like summarize, etc.
consolidateBy & graph renderers (i.e. at the pixel level to generate images)
a lot of mistakes show up from these operations
with metrics20 we shouldn't need to do this anymore, the graphs themselves should know how to do this
maybe we can automatically display mean/lower/upper/upper90/lower90 on graphs
facet based suggestions
imagine if you consistently emitted metrics with "unit=Err/s" across your entire stack, i.e. this was a standard in every piece of infrastructure / system / application, if you did this, you could have complete visibility into errors across your entire infrastructure, plus super easy drill-down
Q&A:
Q: openstack has a technology called "cata"(?), used by ceilometer, it's a standard, has 5 W's metadata, etc. have you looked at that?
A: i haven't, i tried searching for something like this but didn't find anything, sounds interesting, definitely will look at it
Q: does carbon-tagger cause performance problems?
A: we have 170k metrics at vimeo and it's performed fine. both tools i mentioned are written in go
whites and asians are overrepresented in tech vs. the general population
women, black, and hispanic are underrepresented
56% of women leave tech after entering, twice the attrition rate of men, and we don't have stats on other groups
why is it a wickedly hard problem?
incomplete or contradictory knowledge
not enough research
people & opinions involved
people have different opinions on this subject
economic problems
not all schools can get computers & internet access & teachers for tech
there is a pay difference between certain groups
there is no solution
just like poverty, the problem can never be totally solved
there's no right or wrong solution
we don't even know what the solution is yet
the solvers of this problem can also be the creators of the problems
what contributes? society, class, family & community, education, industry
what can i do?
if you're a parent, raise your children to be respectful of others
get involved in education
listen to the people who are affected
have empathy
collaborate
change your behavior
use your power & influence to change things, talk to your boss, talk to your colleagues, talk to strangers, reach out, speak out on behalf of others
Q&A:
Q: i'm a pro-feminist man, and i understand why you can't depend on the repressed group to solve the problem, but if i use my voice then i'm going to be speaking for women and reinforce the problem, what can i do?
A: instead of speaking on behalf of others, speak for yourself to create space for others
Q: what is low hanging fruit in this problem?
A: talk to your friends, if someone says something that doesn't sound right to you, that sounds harmful, say something to them, and explain to them instead of criticize them
Q: is it difficult because success has no definition for this problem?
A: yes
StatsG at New York Times - Eric Buth
works at the New York Times in the interactive news department
what does our department do?
i sometimes can't do a good job of explaining it, maybe some examples would be better
"The Guantanamo Docket"
interactive timeline showing what has happened to the gitmo detainees from 2002 to 2014
click on detainee's name to bring up their bio, documents, articles, etc.
"Watching Syria's War"
timeline of video clips & articles
Sochi 2014
neat tables and graphs of olympic results (medal counts, etc.)
haiku.nytimes.com
finds accidental haikus written in articles
Blackout Poetry
article starts off completely redacted, then you click on words to reveal them and create a poem
and lots more...
what's in common?
i don't know actually, we're kind of responsible for whatever we say yes to doing
we're separate from the larger NYTimes organization
we have our own infrastructure, we don't have to deal with the larger more "corporate" parts like the CMS, mobile app, etc.
we don't have as much traditional releases, milestones, etc.
heterogeneity
over 100 active apps
short turnarounds
collaborations with other departments
everything is different, for a good reason
another example: the Dialect Quiz
someone threw together a node.js app last minute
ended up being their highest traffic feature ever
if you work in systems, this might lead you to become an embittered jerk
everyone tells you their project is the most important thing ever and then it launches and you're stuck maintaining it forever
if you are in the position to say "no", you start to say "no" all the time
no new technologies, no new languages, more conservative choices
ops is vaguely managerial, you are partially in charge of leading technology projects, to make sure projects succeed, to give technical advice, to help organize the systems and keep them running
so if you have a bad run, if you have some bad experiences, you tend to start saying no to everything
a year ago i tried to make a change in this behavior
what if your relationship was the opposite?
what if you tried to say "yes" to everything?
this is actually the reason behind having an interactive news dept., to do this kind of stuff
even though it can be a pain in the ass
if someone's enthusiastic about something, and you shut them down, that's not good for either side
wasted enthusiasm is a very bad thing
if you don't embrace that enthusiasm, they will go elsewhere
so how do you handle so many heterogenous systems?
have preferences and offer alternatives (e.g. nginx instead of apache)
pick technologies that are widely applicable (e.g. varnish works in front of everything)
what are you logging? how are you logging?
can you set this up without my help?
everything needs to be self-serve
including metrics gathering
old way: boilerplate / sample code / examples
new way: be reasonable, follow a few guidelines, and you're free to run whatever you want
we had an old log aggregation system, which was unmaintained
statsd replaced that system
because statsd is:
self reporting, zero config
get what you asked for
easy to integrate with everything
easy to explain
doesn't over-solve the problem
well.. we did decide to over-solve the problem a bit.. and wrote statsG
easier to run
automate data retention
eliminate flushing
safely expose self-serve data retrieval
go is a good choice for this kind of application
running binaries is a big advantage
(gave a few other reasons i missed)
redis also sounded like a good fit
redis is good at sets, this sounds like a set management problem
redis has automatic expiration
lua for scripting redis
having a scripting language inside the DB allows you to do aggregation inside the DB itself, which is very easy and super fast
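a minimal sketch of that idea using redis-py with an embedded lua script (key names are made up, not statsG's actual schema):
    import redis
    r = redis.Redis()
    # sum every counter whose key is listed in a set, entirely inside redis
    SUM_COUNTERS = """
    local total = 0
    for _, k in ipairs(redis.call('SMEMBERS', KEYS[1])) do
        local v = redis.call('GET', k)
        if v then total = total + tonumber(v) end
    end
    return total
    """
    print(r.eval(SUM_COUNTERS, 1, "metrics:request_counters"))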
result:
consumes JSON data
interactive graphs with 10 second resolution
dashboards are totally driven by developers
Winter Olympics was a big success story, the developers wrote all their own monitoring by themselves
problems:
UDP is awesome ("free" message sending), but is incredibly difficult to debug, filling up buffers/queues and dropping messages is always a worry
redis is very powerful, but redundancy and scaling are a problem
rolling your own solution is OK, but it's not for everyone
if you feel enthusiastic about something, and you want to put the time into it, then you can roll your own
this allows you to get to the root of the problem and you might learn something really valuable
for us, it was having the ability to make metrics completely driven by developers
cool bonus:
nytlabs.github.io/streamtools/
this project is going back to using log data and building up subscribe-able streams of log events
using a visual interface
Q&A:
Q: for that streamtools project, once you consume the data, what can you do with it?
A: you can do anything, different plugins for sending to redis, sending to console, forwarding the message along to another service
The cost and complexity of reactive monitoring - Chris Baker
(this talk was mostly just a war story, not much real info to take away)
data guy @ Dyn
how many people have ever been in the situation where they were staring at a pile data wondering "how did this problem happen?"
how did we get there?
scale 1: how much money do we have? (money to buy infrastructure & tools vs. extremely strapped)
scale 2: cutting edge vs. classic (new and shiny vs. nagios)
scale 3: neckbeard vs. handwaver (refusal to work with new tools vs. oh please new tools save me)
scale 4: time (lots of time budgeted vs. project manager hovering over you)
scale 5: legacy (totes cloud brah vs. you down with PDP & ancient pyramids?)
cost = price & manhours
probability of user churn (customer leaves) vs. problem duration vs. problem severity
time to identify
time to mitigate
time to resolve
impact vs. identification vs. diagnosis vs. resolution
if you fix a problem before it occurs, there is no customer impact, this is where you want to be
make more metrics to track this
metrics all the way down!
have metrics to track your metrics
but the end goal is to solve problems in CI / testing instead of production
time to identify: time motion study (cool industrial study, makes us feel good to compare ourselves to industry)
first you have to realize there is an issue
you should notice before your customer does
where do you look first?
example: customer reports that API is unavailable
so, the customer knew about this before we did
when did the problem really start?
here's where the complexity begins
when you're under pressure, your problem solving ability changes
humans are fallible, under pressure you're likely to latch onto whatever idea comes up first, then start investigating and building evidence for that idea
if you started using some brand new database monitoring software, and then something breaks, you're going to start being suspicious of that new monitoring software... even though in this case it's not the cause
all the while time is still ticking
vendor plug / shout out to VividCortex, this actually solved the problem! it highlighted the problem for us!
we found the problem! or did we???
(i guess this is turning into a war story now?)
well, vividcortex showed us problems, but it didn't fix the customer's problem
so.. back to square one
reactive monitoring is the result of a bigger problem
humans are not good at this kind of problem solving
the crunch to provide an answer often leads you to the wrong answer
part 2
i work in DNS
and we know there's a certain traffic pattern during the holidays, traffic increases, we run into new problems every year because of this
but this year.. hmm.. everything is green, no pages, all graphs look amazing, everyone is relaxed & off-guard because things are going so well
we're handling huge spikes of traffic with no problem
when everything looks this good then something is probably wrong
you need someone on your team to be the pessimist, to think that everything is broken all the time...
who is driving these spikes? CDNs? marketing campaigns? botnets? round up the usual suspects
how are we collecting this data? how does this data go from the real world into our monitoring system?
your dashboard is the sausage produced by the sum of your monitoring
if there's sawdust and rats in the input, it's going to show up in the output
interesting aspects of DNS traffic:
recursive resolution (series of misses & lookups, terminating at the root)
TTL = time to live
RCODE = response codes, 0 = good, 1 = format error, 2 = server failure, 3 = name error, 4 = not impl., 5 = refused, 6-15 = bla bla
if you're not monitoring RCODEs, you don't know whether there's rat bits in your sausage
certain RCODEs don't use TTL/caching
TTLs are a rule, people, and we have rules for a reason!
why monitor RCODE 5? it tells you all kinds of useful stuff
well.. we weren't monitoring RCODE 5
pretty obvious in retrospect
(i'm not quite sure what the main point of this talk is, it was more of a fun war story i guess)
Q&A:
Q: is it difficult carrying all this weight as a devops thought leader on your shoulders? (some kind of in-joke in the DevOps twitter community?)
A: when i think about it.. atlas shrugged
From Zero To Visibility - Bridget Kromhout
having aspect ratio problems
yes, definitely aspect ratio problems
I work at 8thbridge
small dev team, one person ops team (me)
joined the startup in progress
twisty maze of shell scripts
time consuming
easy to break
cron jobs which rewrote the crontab
in portland we have bespoke artisanal everything
we also used new relic
pros:
nice graphs
application level view
good error analysis
cons:
slow to update
many false-positive alerts (not totally their fault)
we couldn't afford it (has changed some since then)
so those were our motivating reasons to change
but the main motivator was not getting enough sleep
so i changed our monitoring to nagios
nagios: every bit as hideous as you remember
yes it's hideous, but everything is right where you left it in 1912
the new shinies are great, e.g. sensu
but if we started using sensu it would have been the most complicated thing in our stack
hating on nagios: the middle years
this is when nagios starts getting chatty
as soon as you see a problem, you write a new check and ratchet up the chattiness
everyone hates you when you write spammy checks
how do i monitor something like HBase / hadoop?
best way to monitor HBase: hbck, the hbase consistency checker
nagios -> hbck bash script -> parse output
the most awesome tool in the world won't be able to monitor stuff like this out of the box
the only way you get that is by writing a custom check, which is the same no matter what technology you use
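a minimal python sketch of what such a nagios check might look like (the hbck invocation and output strings are assumptions and vary by HBase version):
    import subprocess
    import sys
    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # nagios exit codes
    try:
        out = subprocess.check_output(["hbase", "hbck"], stderr=subprocess.STDOUT,
                                      universal_newlines=True)
    except subprocess.CalledProcessError:
        print("CRITICAL: hbck exited non-zero")
        sys.exit(CRITICAL)
    except OSError:
        print("UNKNOWN: hbase binary not found")
        sys.exit(UNKNOWN)
    if "Status: OK" in out:
        print("OK: hbck reports no inconsistencies")
        sys.exit(OK)
    print("CRITICAL: hbck reports inconsistencies")
    sys.exit(CRITICAL)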
mongoDB:
much like stumbling upon a robbery, i walked into a mongoDB in progress, with zero monitoring
found nagios-plugin-mongodb
worked pretty well, made a few fixes & improvements
and they accepted my pull request!
but.. mongoDB gave us trouble on cybermonday
our traffic spiked and our response time went to crap
"a single write operation holds the lock exclusively, and no other read or write operations may share the lock"
the write lock always seemed sketchy, but it couldn't be that big of a problem, right? it was
so.. next step.. we need to measure everything
we had an old unused, unmaintained graphite install
running something inside screen does not make it a daemon!
so, get that into shape
statsd chef cookbook worked great
graphite cookbook.. not so good, chef 11 only (we're dragging our feet on chef 10) and we run nginx, not apache
had to use tcpdump to debug why statsd/graphite didn't work
but got it working eventually
shout out to carbonate
whisper-fill.py: backfills data between whisper files
very useful for the cutover
how to detect real outages vs. deliberate drop-offs in traffic?
we provide a third party cookie
some people enable/disable our cookie on purpose (e.g. because they think it's causing a problem)
and some people disable it accidentally (pushing bad code)
this is difficult to catch without constantly looking at the graphs
we didn't have money for new relic so we used sentry (open source error reporting system)
this was really helpful in catching API errors from third parties trying to integrate with us
showed a diagram of all their monitoring tools and the way the data flows
when we explain this to non-ops people, they usually ask "why do you guys use so many tools? can't you use just one?"
no! there is no one tool, there is some overlap, but you can't survive with just one monitoring tool
what's next? wishlist for what i want to do next
logstash, kibana, elasticsearch
etsy/skyline - anomaly detection
etsy/oculus - metric correlation for etsy's "kale" system
zorkian/nagios-api - REST-like JSON interface to nagios
grafana - better graphite interface
hubot - want to use this to interact with nagios via chat
what is the ideal monitoring system?
finds real problems
actionable alerts
usable by everyone
Q&A:
Q: why did you choose nagios if everyone hates it?
A: i've done sysadmin before, quite a few years ago, i've never set it up from scratch, but i had a feeling it would work, it wasn't too bad to set it up manually, we needed a solution ASAP, and it worked
Q: have you looked at check_mk?
A: i'm aware of it but i haven't looked closely at it, right now a lot of our nagios checks are alerting on data in graphite, what would you suggest using it for?
Q: uhhhh monitoring (?)
Q: what do you want to get out of the nagios API?
A: scheduling downtime and acknowledging alerts via hubot
"Auditing all the things": The future of smarter monitoring and detection - Jen Andre
files (watch all modifications to /etc/shadow):
-w /etc/shadow -p wa
syscalls (watch all kernel module changes):
-a always,exit -F arch=ARCH -S init_module -S delete_module -k modules
follow executable:
-w /sbin/insmod -p x
cons:
the logging is very obtuse
logged values are a mishmash of strings, decimal integers, hex, etc.
lots of manual matching up of cryptic names and values to other log lines for context
it can crash your box
if the auditor is slower than the rate of incoming messages, buffers fill up and stuff starts crashing
enable rate limiting to help prevent this
performance...
one alternative is to connect directly to the auditing socket and write your own listener
for example, we wrote a listener that emits JSON instead of the obtuse text logs
we also wrote a luajit listener that can do super fast filtering, transformation, and alerts
libevent + filtering + state machine parser
reduced CPU usage from 120% to 10%, greatly increased throughput
logins:
wtmp / "last" command
fairly easy to parse and turn into json
auditd also records login info
you can configure SSH to emit login events to audit
what about tracking "sudo su -"? how do I track commands that are run once someone becomes root?
use pam_loginuid
this adds a session ID to every audit event so you can track everything from the user login -> running commands as root
network traffic:
src/dst ips
src/dst ports & protocol type
use the netfilter & conntrack systems
netfilter = used by iptables
conntrack = tracks connections
turn this on: sysctl -w net.netfilter.nf_conntrack_acct=1
the conntrack tool will show you raw packets and byte counts, very ugly
use libnetfilter_conntrack to emit JSON
it's hard to directly tie a process to conntrack data
but you can correlate using port numbers
putting it all together:
someone logs in
you can view all the commands they run (as their user or as root)
you can view all their network connections
all this information is stored in a database that can be queried or accessed through a web interface
bonus: detection
so i am collecting all this information now, how can i use it for detection?
most attacks typically aren't very sophisticated
many attacks use valid credentials (obtained through weak human targets, social engineering, malware)
what to look for:
"is this user running commands they shouldn't be?"
"why is a user running gcc?"
"why is a user account running a command that only root or system user should run?"
"where are my users connecting from?" (china? eastern europe?)
"what are my users connecting to?" (again, any outlying places like china, eastern europe)
you can create simple rules for these
Q&A:
Q: something about conntrack
A: capturing raw data is very large, you need to filter, another option is to have a NAT box / router that all machines connect through and track everything there
Q: are you saying it's ever OK to be hacked?
A: no, but your response is different depending on what industry you're in, e.g. the medical industry you must respond within a certain number of days and disclose the information in a certain way according to the law, hacking is only going to be more common, everyone will eventually be hacked
Q: something about standards, are there any tools to help achieve standard compliance?
A: (she lost her voice and couldn't continue)
Is There An Echo In Here?: Applying Audio DSP algorithms to monitoring - Noah Kantrowitz
math ahead!
metrics have value @ a certain time
we can put them into graphs, we look at them all day every day
but you can also put this data into a .wav file
have you ever seen a visualizer / EQ?
it looks kinda like our graphs
but they have a frequency domain
value over time vs. value over frequency
x axis frequency: 0Hz -> 20Hz
y axis decibel value: +0dB -> +50dB
you can use the fourier transform to turn (time, value) data into frequency data
(gave the formal definition)
sine wave
add multiple sine waves together
add some noise
and this starts to look like one of our graphs in systems land
you can convert this graph to frequency space to get the underlying components
this reveals new information
instead of the mathy formal definition of FT (with integrals and infinity signs, which computers are bad at)
we use DFT and DTFT, discrete fourier transforms
one problem with this is that we have to do an O(N^2) calculation on the entire data set
there is an algorithm called Fast Fourier Transform
which is O(NlogN) instead of O(N^2)
an IFT does the opposite process, it turns frequency data into time series data
low-pass filter:
say we have a series with a threshold
and it's constantly flapping in nagios terms
use FFT to convert to frequency, run a low-pass filter, use IFT to get back to time series
then apply your threshold
this gets rid of the noise
e.g. it allows you to catch longer term rampups instead of short term blips
there are also high-pass filters (keep the high frequencies, remove the low) and band-pass filters (keep only frequencies within a range)
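a rough numpy sketch of the FFT -> drop high frequencies -> inverse FFT round trip (the cutoff and data are made up):
    import numpy as np
    def low_pass(series, keep=10):
        spectrum = np.fft.rfft(series)   # frequency-domain view of a real-valued series
        spectrum[keep:] = 0              # crude low-pass: zero everything above the cutoff bin
        return np.fft.irfft(spectrum, n=len(series))
    t = np.arange(600)
    noisy = np.sin(2 * np.pi * t / 300) + 0.5 * np.random.randn(600)
    smooth = low_pass(noisy)
    # apply the alert threshold to `smooth` instead of `noisy` to avoid flapping on short blips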
windowing:
chops off data that you aren't concerned with
rectangular window function - very simple to implement
need to be careful of spectral leakage when using a small window size
which gives you "mushy" peaks, less clear signal
triangular window function - better, but not perfect, also easy to implement
blackman harris window function - best result
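a small sketch of applying a window before the FFT (numpy ships a plain blackman window; blackman-harris lives in scipy.signal.windows):
    import numpy as np
    segment = np.sin(2 * np.pi * np.arange(256) / 37) + 0.1 * np.random.randn(256)
    plain = np.abs(np.fft.rfft(segment))                       # implicit rectangular window
    tapered = np.abs(np.fft.rfft(segment * np.blackman(256)))  # tapered window
    print(plain[:10])
    print(tapered[:10])  # far less energy leaks into neighboring bins (at the cost of a slightly wider main peak)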
how do you do this?
NumPy is the one-stop shop, all of these functions are built-in
FFTW for C
go-dsp for Go
nothing in ruby, there isn't much scientific / numeric software for ruby
go forth and find the signals!
bonus content:
discrete cosine transform (DCT)
how most audio/video compression works
this is why MP3 files are smaller than WAV files
WAV stores all the frequency data
MP3 stores the DCT, much smaller to store, then uses IFT to decompress
someone, please write a metrics database that uses DCT!
wavelets
next generation compression systems (e.g. H264)
someone should build something using this too
???
(something i missed)
hysteresis
use input to predict output
control theory
goes hand in hand with signal analysis
signal analysis gives you tools to analyze data, but control theory gives you tools to act on the data
for example autoscaling
PID control loops
Q&A:
Q: can you demo some of the numpy code?
A: sorry, no, it's too much to get into right now
Q: any monitoring tools using these techniques?
A: no! I don't know of any, nagios flap detection is a poor reinvention of the most basic form of signal analysis, but it sucks, there's a thousand years of research on this subject and nobody is reading it or implementing it!
Q: is our data amenable to this approach? is our data really all built out of sine waves?
A: most of the data we look at has periodic components, at the very least you have a daily cycle; and there are a lot more cycles e.g. timeouts, response times, user activity, etc. all contribute to periodic rhythms
Q: is your code on github?
A: no it's all homegrown hacky python code, not releaseable yet
Q: if we added FFT to graphite would that solve a bunch of problems?
A: yea that'd be helpful, but would be better in a streaming system like riemann
Q: something about high frequency data
A: it's the same problem as audio, audio needs to be sampled, you might need to do the same thing with your data, sample it
Q: how do you deal with noise in data? what about the colored noises?
A: haven't run into this much, i'm using data i know to be periodic
A Melange of Methods for Manipulating Monitored Data - Dr Neil J. Gunther
no more plane crash analogies? (monitorama berlin joke)
too bad, it's a useful analogy
asiana flight 214
report found that asiana pilots are too focused on instrumentation
they didn't do basics like... look out the window
monitoring is not about pretty pictures / graphs / tools / fancy math
it's all about the data
what story is the data trying to tell you?
you need to have a consistent interpretation of data, across all the data
how do we converge on consistency? i'll show some examples
The Greatest Scatter Plot
(shows strip charts of metric1 and metric2)
if we were good at looking at data the stock market would be a solved problem
is there a relation between metric1 and metric2?
put both sets of data into a scatter plot
does it show anything interesting? a trend in any direction?
linear regression
Least Squares Fit
LSQ fit and R^2 value (what percent of the data matches up with the model?)
are we done now? no, this is just the beginning
is linear fit the best choice?
what is the meaning of the slope?
are you comfortable extrapolating this model into the future?
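a minimal numpy sketch of the LSQ fit and R^2 (synthetic metric1/metric2 pairs, not the talk's data):
    import numpy as np
    rng = np.random.default_rng(2)
    metric1 = rng.uniform(0, 100, 200)
    metric2 = 3.0 * metric1 + rng.normal(0, 15, 200)     # roughly linear relation plus noise
    slope, intercept = np.polyfit(metric1, metric2, 1)   # degree-1 least squares fit
    predicted = slope * metric1 + intercept
    r_squared = 1 - np.sum((metric2 - predicted) ** 2) / np.sum((metric2 - metric2.mean()) ** 2)
    print(slope, intercept, r_squared)                   # R^2: what fraction of the variance the model explains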
the most important scatter plot in history
1929
Edwin Hubble's plot of the distance of galaxies from us & their velocity
what does the slope mean? v/r, Hubble's constant
from this slope we can calculate the age of the universe!
one small problem, hubble's calculation of the age of the universe (2B years) was lower than age of the earth (3-5B)
how did the earth get here before the universe?
what could he do?
(answers from the crowd: "look out the window", "fudge the data")
well, the earth is not stationary, so he compensated for earth's velocity
and... the data got worse!
nonetheless, he published the data
some thought he was crazy, it's obvious something is not right
70 years later, Hubble is now vindicated
Hubble's plot was a tiny area of what we can now see
telescopes weren't good enough in Hubble's time
the data was wrong, but his model was correct
lesson: treating data as divine is a sin
i am fond of saying that all data is wrong
irregular time series:
regular samples: like a metronome, every time has a value
irregular samples: missing data
you use the arithmetic mean on regular series
you use the harmonic mean on irregular series
with unequal intervals you need to scale the mean based on how long the intervals are between data points
use HM on aggregate monitored data when the following apply:
R - rate metric (y axis)
A - something i didn't catch
T - something i didn't catch
E - something i didn't catch
this doesn't come up too often in our systems
Power Laws and the Law of Words:
Zipf's law
plot the frequency of words in the english language
words like "the" are many many magnitudes higher than more exotic words
what function describes this data? it's hard to say from looking at the graph
the trick is to use logarithmic axes
check if a linear regression works on the data with logarithmic axes
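a sketch of the log-log trick: a power law y = c * x^(-a) becomes a straight line in log space (synthetic zipf-like data):
    import numpy as np
    ranks = np.arange(1, 1001)
    freqs = 1000.0 * ranks ** -1.1                       # made-up rank-frequency data
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    print(slope)                                         # ~ -1.1, the power-law exponent
    # if the straight line fits the log-log plot well, a power law is a plausible model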
power laws imply persistent correlations that need to be explained
what is the explanation in Zipf's case?
the rules of english grammar require certain words to be more frequent than others
example: DB query times
rank by time (histogram)
put on loglog axes
hmm this data looks weird now, it's not linear
it has three different behaviors
1st part: power law decay
2nd part: exponential decay
3rd part: exponential decay
is that enough?
no, we must determine why each of those correlations fit
example: in Australia all business were required to register an ABN number for tax purposes, with a hard deadline
very similar to the healthcare.gov problems
at the 11th hour, people rushed to finish, and the system crashed
could that peak have been predicted?
yes, it's complicated, but a power law can do this
lesson: rank data by frequency (histogram) and try using log / loglog axes
you can use this technique to predict spikes in noisy data
this allows you see a strong correlation, the explanation is more difficult
conclusion: aim for consistency
learn to listen to your data
Q&A:
Q: have you seen people fudging data in the operations world?
A: physicists are notorious for this, i haven't seen it as much in the operations world, i have been guilty of ignoring or overlooking strange noises or inconsistencies, also, be careful of making really complicated models (unless you know what you're doing), at some point you may feel a conviction about your model like Hubble did, and Hubble was correct in the end, important question for science: "how do I convince myself this model is true?", use this approach when making your models, look at Einstein's first 5 papers, everything is written in a way that anyone can understand, using very broad statements, then gradually narrows down and paints you into a corner of accepting his claim, and these were outrageous claims at the time, as simple as possible but no simpler, and this is now a rambling answer but it was fun to give
Q: hubble's estimate was wrong because his data wasn't accurate, it seems in our world that our measurements are very accurate, does that change our approach?
A: so, do we need to do something differently from Hubble? i'm fond of saying that all measurements are wrong, you don't have his exact problem, but you should never trust the data, you can have completely accurate measurement of the wrong thing, (relays an anecdote about LHC measurements that were accurate to 6-sigma, but a 50 cent connector was not attached properly, so the data was super accurate garbage that was misleading people)
Q: a comment - we can measure time accurately in computing, but most data in operations is very inaccurate and noisy
Q: another comment - i'm struggling with eventual consistency of the cloud, as such you have to deal with eventual consistency, even in your monitoring
A: sure, that's a different concept, but yes if you're using a distributed system, the "consistency" of your models will have to take these distributed computing problems into account
Q: in your last example with the power laws, you found the peak after the fact, does it work ahead of time?
A: yes, you can construct a power law prediction, it's not always correct, but it's another tool, requires more math
Q: would human behavior play into your prediction? i.e. you're counting on people to wait to the last minute?
A: no, i might point to human behavior as the explanation, but the prediction does not depend on that fact
The Final Crontab - Selena Deckelmann
works at Mozilla on the Socorro team
Socorro is a crash reporting system
about:crashes
click on a crash there and it takes you to socorro's web interface
depends on python 2.6 or higher and postgres 9.2 or higher
Q&A:
Q: no question but just want to say that it looks awesome
A: thanks!
Q: have you had problems with circular dependencies?
A: not sure, we only have 4 levels of dependencies, so i don't think we've run into that yet
Q: how is the JSON postgres performance?
A: awesome, document size per row is tiny, main write DB is 1.5TB, half of that is probably JSON, way faster than hadoop, 1 hour for hadoop query -> 10 minutes for same query in postgres
Q: you're trying to get rid of shell scripts, did you rewrite in python or wrap them in python?
A: rewrite in python, bash is OK to start, but gets too crufty
Q: did you look at pgAgent? (job scheduling agent for postgres)
A: no we didn't look at that
Q: can it do cross-node dependencies?
A: what do you mean
Q: like if a job on machineA depends on a job on machineB?
A: no... right now it only runs on one machine
Q: is there a reason you didn't look into marathon or cronos for distributed cron?
A: we didn't need a distributed tool, crontabber is more about the framework for jobs, and all these jobs seemed pretty critical to the product so we wrote our own system to handle them
Q: do you handle timeouts & stuck jobs?
A: timeouts are built into the jobs themselves when necessary
Q: how do you determine what jobs are currently running? any visualization?
A: no visualization, but that info is in the crontabber logs
This One Weird Time-Series Math Trick - Baron Schwartz
more math...
this was going to be about math, but other people already covered it!
works at VividCortex - New Relic for the database
formerly worked at Percona
author of: High Performance MySQL & Web Operations
"anomalies" vs. "typical data"
anomaly = not typical
my worldview:
monitoring tools are not enough
monitoring = healthchecks, metrics, graphs
we need performance management
work-getting-done is top priority
we need more than recipes or functions to grab and apply, we need to know the right techniques to use
fault detection = work is not getting done, true/false
anomaly detection = something is not normal, uses probability & statistics
just because something is anomalous doesn't mean it's bad
what is the holy grail?
determine normal behavior
predict how metrics "should" behave
quantify deviations from prediction
do useful stuff with that data
at 1 second resolution, your systems are anomalous all the time
that holy grail is very practical, too practical for this talk
sometimes i want to do something fun
like use fun math
high level math is difficult to do at scale, it's better suited to academic papers
timeseries metrics are not always best displayed in strip charts
how many of you know these statistical / probability methods? (shows big list of methods)
how many of you have used the kolmogorov-smirnov test? (mentioned in Toufic's talk)
how many of you know these descriptive statistics methods? (wikipedia page on descriptive stats)
i don't know any of these
but basic statistics is good for quite a bit
learn the simplest, most effective approaches first
anomaly -> deviation -> forecast/prediction -> central tendency/trend -> characterization of historical data
these are all separate problems with different techniques
dumb systems don't produce good results
if a system is getting work done, it's not faulty, no matter what a fancy technique says
control charts
draw lines for 3 sigmas
is the process within normal limits?
control charts assume a stationary mean
most data is not normally distributed
lots of problems at smaller time scales
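a minimal sketch of the idea (limits computed from a "known good" baseline window, assuming it's roughly stationary and normal, which is exactly the assumption being criticized):

```python
# Minimal 3-sigma control chart sketch: compute limits from a baseline window
# of "known good" history, then flag later points outside mean +/- 3*stddev.
# Assumes the data is roughly stationary and normal-ish, which real ops data
# usually isn't.
import statistics

def control_limits(baseline, sigmas=3):
    mean = statistics.mean(baseline)
    sd = statistics.pstdev(baseline)
    return mean - sigmas * sd, mean + sigmas * sd

def out_of_control(baseline, new_points, sigmas=3):
    lo, hi = control_limits(baseline, sigmas)
    return [(i, x) for i, x in enumerate(new_points) if not lo <= x <= hi]

baseline = [12, 11, 13, 12, 14, 11, 12, 13, 12, 11]
print(out_of_control(baseline, [12, 13, 55, 12]))   # -> [(2, 55)]
```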
first idea: moving averages
gives us a moving control chart
somewhat expensive to compute
current values are influenced by values in the past
a spike in data causes an inverse spike in the sigma values once that spike drops out of the window
exponential moving averages
more biased to recent history
cheaper to compute, only need to remember one value at each step and apply a decay factor
EWMA is a form of a low-pass filter
we can do the same thing we did earlier and make EWMA for control charts
which is a little better than moving average control charts or plain control charts
one place where EWMA falls down is trends
the EWMA lags behind the actual trend
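a small sketch of the EWMA control chart idea; only a smoothed mean and a smoothed variance are carried between steps, which is what makes it cheap:

```python
# EWMA control chart sketch: exponentially weighted mean and variance,
# updated incrementally with a decay factor alpha. Only two numbers of state
# are kept per series, which is why this is cheap at scale.
def ewma_anomalies(samples, alpha=0.3, sigmas=3, warmup=5):
    mean, var = samples[0], 0.0
    flagged = []
    for i, x in enumerate(samples[1:], start=1):
        diff = x - mean
        # flag before updating, so an anomaly doesn't inflate its own limits;
        # skip the first few points while the estimates are still warming up
        if i >= warmup and abs(diff) > sigmas * (var ** 0.5):
            flagged.append((i, x))
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flagged

print(ewma_anomalies([10, 11, 10, 12, 11, 10, 40, 11, 10, 12]))  # -> [(6, 40)]
```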
double exponential smoothing
tries to solve the lagging by adding a prediction
once you do this, the alpha and beta factors become very sensitive
it's easy to way undershoot or overshoot the trend
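double exponential smoothing (Holt's method) in a few lines; the over/undershoot is easy to see if you play with alpha and beta:

```python
# Double exponential smoothing (Holt's method) sketch: maintain a level and a
# trend, and the one-step-ahead forecast is level + trend. Small changes to
# alpha/beta can badly over- or under-shoot, as noted above.
def double_exponential_smoothing(samples, alpha=0.5, beta=0.5):
    level, trend = samples[0], samples[1] - samples[0]
    forecasts = [samples[0]]
    for x in samples[1:]:
        forecasts.append(level + trend)      # forecast made before seeing x
        last_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
    return forecasts

print(double_exponential_smoothing([10, 12, 14, 17, 20, 24, 28]))
```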
holt-winters forecasting
DES plus seasonal indexes
more complex, slow to train, previous anomalies start getting built into the predictions
MACD - moving average convergence-divergence
comes from the finance world
finance is probably the most advanced application of these techniques, look there for inspiration
seems to be the most accurate
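MACD is just a fast EWMA minus a slow EWMA, plus an EWMA of that difference as a "signal" line; a sketch using the conventional 12/26/9 periods from finance:

```python
# MACD sketch: fast EWMA minus slow EWMA, plus an EWMA of that difference
# (the "signal" line). 12/26/9 are the conventional periods from finance.
def ewma(samples, span):
    alpha = 2.0 / (span + 1)
    out, value = [], samples[0]
    for x in samples:
        value = alpha * x + (1 - alpha) * value
        out.append(value)
    return out

def macd(samples, fast=12, slow=26, signal=9):
    fast_e, slow_e = ewma(samples, fast), ewma(samples, slow)
    macd_line = [f - s for f, s in zip(fast_e, slow_e)]
    signal_line = ewma(macd_line, signal)
    # divergence between the two lines is what you'd watch for
    return macd_line, signal_line
```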
Q&A:
Q: what happens when you subtract current timeseries data from previous week's data?
A: yea i've tried that sort of thing, this is similar to holt-winters, what happens if you had an outage last week? then you will be predicting an outage next week, also, is week the right period? should you combine weekly/daily/hourly? should you use multiple "seasons" (i.e. if using weekly data, use 3 weeks in the past)?
The Lifecycle of an Outage - Scott Sanders
operations at github
tools + process = confidence
take any business metric and multiply it by your downtime
while you have downtime, you have no registrations, no revenue, etc.
human error is not random, it is systematically connected to people, tools, tasks, and operating environment
triggers:
detection & notification of a problem, get a human involved
alert fatigue is real
people tune out notifications
human fatigue is also a problem
if you are paged in the middle of the night
keep shifts as short as possible, right now github has 24 hour shifts
simplify overrides and give them out freely
be persistent, don't page every 15 minutes, page every 60 seconds until a problem is ack'ed
escalate quickly, don't let a dead battery cause your downtime to go on longer
be loud
create handoff reports for every on-call shift, spot trends
github has a chat command called "handoff" which generates a report & graphs of all incidents during an on-call shift
initial response:
establish command & identify severity, quickly
graphs are a great way to determine severity
chat bots are a great way to signal to both systems & teammates what is happening during an incident
github's monitoring stack:
graphite, 175k updates/sec
collectd (system level metrics), 1200 metrics per host
statsd (app level metrics), 4 million events/sec
and.. sFlow, SNMP, HTTP, etc.
logging: scrolls, splunk, syslog-ng
1TB of logs indexed per day
special purpose monitoring directly covers business concerns
we don't consider a tool production ready until we can interact with it via chat
because that interface fits our culture
you should do the same for your culture
accept the processes that emerge and adapt your tools to augment those processes
don't force your team into processes
corrective action
collective knowledge & feedback loops
real example: last year, github was hit by a string of DDOS attacks
hubot: nagios critical - ddos detected via splunk search
(this also generates a github issue with the check result and a link to the DDoS-mitigation.md playbook)
tmm1: oh?
tmm1: /arbor graph -1h @application
hubot: <graph of incoming traffic>
tmm1: /pager me incoming ddos
tmm1: ...more steps to determine what's happening...
other people join in
jssjr: going to enable protection now
jssjr: /shields enable w.x.y.z/24
hubot: please respond with the magic word, today's word is knight
jssjr: /shields enable w.x.y.z/24 knight
jssjr: /graph me -1h @network.border.cp1.in
hubot: <graph of incoming traffic at the router to verify the change>
playbooks are awesome
they allow you to distribute knowledge
as you come across a new problem or missing knowledge, add more to your documentation
tools make software less horrible
nobody should have to know everything about your entire infrastructure
make things safe for your less experienced engineers
create issues for postmortems
dedicate a repository for postmortems, for github this private repo is: github/availability
identify problems
involve many people
propose solutions
some incidents require a public postmortem to be released the same day
but the private postmortem can be open for weeks, to make sure we got it right and are completely satisfied the issue is fixed
this is how we close the loop on outages and make progress towards prevention
for example, some improvements for DDoS are: automatic mitigation, better monitoring, etc.
study the lifecycle of your outages
tools are complementary to your process, not the other way around
communication is the cornerstone of incident management
tools & process enable confidence
never stop iterating
Q&A:
Q: do you have problems with availability of your tools during outages?
A: absolutely, for example we keep the playbooks off-site and on-site to make sure they're always available
Q: you mentioned a huge graphite instance, what backend are you using? i don't think whisper would work?
A: we are using whisper
Q: tell us about the "shields up" command, what does it do? does it get logged somewhere?
A: well, our chat is logged, that gives us the timeline
Q: if you're fixing an outage and you need to clone something from github, what do you do?
A: ha ha well we work very hard to make sure that doesn't happen
A whirlwind tour of Etsy's monitoring stack - Daniel Schauenberg
software engineer on infrastructure team @ etsy
25 million members
18 million items listed
60 million monthly visitors
1.5 billion page views per month
all with a single monolithic PHP app
master-master mysql
we have some smaller services in java
and image service is not in PHP
we deploy a lot
the actual number doesn't matter much
what matters is how comfortable are you deploying a change right now?
when you start at etsy the first thing you do is deploy the site (team section)
and then you watch the graphs
what are in the graphs?
ganglia:
system level metrics, everything specific to a node (requests per second, jobs queued, CPU, memory, etc.)
one instance per DC/environment
220k RRD files
fully configured through chef roles
automatically runs all files in a certain directory to generate these stats
StatsD:
single instance, one server
traffic mostly comes from 70 web servers & 24 API servers
heavily sampled (10%)
graphite as backend
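the sampling idea, illustrated with the python statsd client (etsy's web tier is PHP so this isn't their actual client, and the host name is made up):

```python
# Sketch of client-side sampling with the Python "statsd" package: rate=0.1
# sends roughly 1 in 10 events, and statsd scales counters back up server-side.
from statsd import StatsClient

statsd = StatsClient(host="statsd.example.internal", port=8125)  # hypothetical host

statsd.incr("web.checkout.requests", rate=0.1)          # counter, 10% sampled
statsd.timing("web.checkout.render_ms", 42, rate=0.1)   # timer, 10% sampled
```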
graphite:
application level metrics (not system level)
2 machines: 96G RAM, 20 cores, 7.3T SSD RAID 10
500k metrics per minute
mirrored master/master setup
sharded setup, 7 relays running per box, replicating data to the other server
the sharded setup also helps isolate problems (when something blows up, only one of the two servers is affected)
things to monitor when running graphite:
disk writes, disk reads, # of keys being written, # of values being written, cache vs. relay stats
syslog:
fully configured via chef roles (to determine which log files to send for a node)
rule ordering is important
syslog boxes also run a web frontend called supergrep which is a node.js app that basically runs "tail -f *.log | grep ..." over the web
syslog boxes also run etsy/logster
extracts metrics from log files
written in python
runs once per minute via cron
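the log-to-metrics idea in miniature (a generic sketch, not etsy/logster's actual parser API; assumes an access-log-ish format):

```python
# Generic log-to-metrics sketch (not etsy/logster's actual parser API): scan
# the lines since the last run, count by HTTP status class, and print
# graphite-style "name value timestamp" lines. The log format is assumed.
import re
import sys
import time

STATUS = re.compile(r'" (\d{3}) ')   # assumes an access-log-like format

def count_status_classes(lines):
    counts = {"1xx": 0, "2xx": 0, "3xx": 0, "4xx": 0, "5xx": 0}
    for line in lines:
        m = STATUS.search(line)
        if m:
            counts[m.group(1)[0] + "xx"] += 1
    return counts

if __name__ == "__main__":
    now = int(time.time())
    for klass, count in count_status_classes(sys.stdin).items():
        print(f"web.responses.{klass} {count} {now}")
```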
splunk:
supergrep only shows the last ~1 minute of data, how about longer?
splunk indexes all your log files
easy & powerful search syntax
saved searches
glorified grep
logstash:
experiment to replace splunk
easier to integrate with
easy to set up in dev environment (can't do this with splunk)
can logstash give our developers more insight while they are developing?
eventinator:
tracks all events in the infrastructure
chef runs & changes
DNS changes
network changes
deploys
server provisioning and decommissioning (we use dedicated hardware, no cloud)
12 million events in the last 2 years
originally stored in one mysql table, now using elasticsearch (free search)
chef:
everything is configured with chef
same cookbooks in dev & prod
every node runs chef every 10 minutes
tons of custom knife plugins & handlers
we use spork for our workflow, which notifies IRC of changes / promotions, also kicks off a CI build
mentioned git repo vs. chef server being out of sync
"knife node lastrun web0200.ny4.etsy.com"
120 recipes successfully run in 20 seconds
there's also a handler for failures, chef failures are automatically sent to a pastebin and posted in chat
nagios:
raise your hand if you have a strong feeling about nagios (everyone raised their hand)
raise your other hand if that feeling is love (only a few people)
well, too bad for most of you, computers don't care about your emotions
nagios works really well for us
2 instances per DC/environment
we use nagdash to aggregate results across all instances, our main view of the world
interact via IRC, set downtime, see check results
used to have a manual deploy process (ssh into box, etc.)
why do that? we have a good way to test & deploy software
now there's a real deployment process and a real CI process
feels just like working on the web app, that's a good thing
nagios herald:
adds context to nagios alerts
what are the first 5 things you do when you get paged?
you already have your phone in your hand, wouldn't it be great to get this information in the alert?
now our alert emails contain graphs, tables, output of shell commands, alert thresholds, alert frequency (# of times alert has been triggered in the past 7 days)
this is awesome, on-call is so much better now
ops weekly:
we have weekly rotations
at the end of your shift, you are given a survey
you have to specify which alerts were actionable, which were ignorable
# of pages during sleep vs. awake time
amount of time kept awake by alerts
can also scrape data from fitbit to get actual sleep times
and these results are discussed at the weekly ops meeting
summary:
use a set of trusted tools
enhance tools when they come up short
keep trying new things
write your own tools where applicable
See our blog, github, and other talks for more detail.
Q&A:
Q: how do you feel about kale?
A: kale is our anomaly detection stack, it's still an experiment, we're trying to figure out how and where to use it, it was recently broken by a graphite upgrade
Q: how self-service is your nagios setup? do you provide tools for devs to build monitoring?
A: not very self-service, still need to write your own checks & configs, but every team has an ops person, and all those people are excited about writing checks that make developers lives better
Q: elaborate on logstash & elasticsearch?
A: right now it's an experiment, also using kibana, side-by-side with splunk, what parts of splunk work better in logstash? how useful is it for developers in their dev environment? those are the main points
Q: how many syslog servers? do you split the logs between multiple hosts for performance reasons?
A: two, and I think they both get the same data for redundancy purposes
Wiff: The Wayfair Network Sniffer - Dan Rowe
wayfair.com
leads the infrastructure tools team at Wayfair
two sub-teams: internal tools (customers are employees) and dev tools (customers are engineers)
wayfair is an online retailer
7 million products
16 million visitors per month
in a lot of these kinds of presentations, someone presents a homegrown tool and everyone asks
"why did you do it that way? why didn't you use X?"
i'm going to try to cover those questions ahead of time
our setup:
active/active DC setup
main sites -> loadbalancer -> PHP web server farm
java / ASP.net for other stuff
logging overview:
syslog, app log, network traffic, commits
logstash
elasticsearch
kibana, dashboards, graphite, zabbix, ad hoc querying & alerting
what is wiff?
out of band traffic sniffer and analyzer
wireshark as a service
packet processing pipeline
feed in packets -> process -> output -> report / analyze -> profit
how do you feed in the packets?
wireshark / NIC level
pcap files (ring buffer or tcpdump files)
rabbit mq
once you feed in the packets, configure which protocols, ports, etc. you are interested in
currently HTTP, HTTPS (needs private keys to decrypt, take care not to log the request/response bodies anywhere..), and TCP are supported
showed a typical HTTP processing workflow (big diagram)
reporters output the data somewhere
JSON, elasticsearch, rabbitmq
wiff is the beginning of the pipeline
we have some example kibana queries to get started with
once it's in elasticsearch it's up to you to do the analysis
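the out-of-band sniffing idea in miniature, using scapy (not wiff's actual code; the interface name is made up, and sniffing needs root):

```python
# Rough sketch of out-of-band HTTP sniffing with scapy (not wiff's code):
# watch port 80 on a mirror/span interface and print HTTP request lines,
# without touching or blocking the application's traffic.
from scapy.all import sniff, TCP, Raw

def handle(pkt):
    if pkt.haslayer(Raw):
        first_line = bytes(pkt[Raw].load).split(b"\r\n", 1)[0]
        if first_line.startswith((b"GET ", b"POST ", b"HEAD ")):
            print(pkt[TCP].sport, "->", pkt[TCP].dport,
                  first_line.decode(errors="replace"))

# store=False keeps memory flat during long captures
sniff(iface="eth1", filter="tcp port 80", prn=handle, store=False)
```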
alerting: doesn't exist yet, want to build an alerting system for ES
pessimism:
if we already have web server logs and application logs, why do we need this?
this is just another vantage point to gather this data
it's a companion tool
where does it fit?
you tell me, it can track both inbound & outbound traffic
it can spot problems before the request hits a given layer
what if your LB or webserver is misconfigured?
what if the request never reaches where you expect it to reach?
what if your server segfaults?
can spot problems that don't show up in logs
real world example: Set-Cookie was being specified multiple times per response, but their logging was only showing it as set once
because it's out of band, it doesn't matter if it crashes, it doesn't matter if it goes down
it doesn't require you to make changes to your application
very little performance overhead
(i think all of these arguments apply to using plain old tcpdump?)
MOAWSL: mother of all web server logs
we have this layer that aggregates all web requests in a single log file, standard format
but if you didn't have this layer, wiff could be used to do that
other benefits:
runs on windows
can be used to watch network traffic of proprietary / third party software
packet RTT
obtain network timing information
call frequency (how often is this web API getting called?)
showed screenshots of command line tool & kibana dashboard
todo:
improve SSL decryption performance (do it in the background)
better reporting
notes:
needs some monitoring
watch for dropped packets, un-stitchable requests
no support for SPDY or websockets
YMMV, it works for us, not used by anyone else yet
github.com/wayfair/wiff
Q&A:
Q: do you instrument wiff before & after the load balancer? to track requests through the system?
A: uhh we can see the source/destination and track them that way, but that isn't done automatically
Q: anything on the roadmap for SIP traffic?
A: no, but we have a big call center, i can see it being useful there
Q: what is the throughput?
A: we have 10G NICs, it's only using ~1G in testing, depends on tcpdump buffer settings and how much your NIC can handle
Web performance observability - Mike McLane & Joseph Crim
work at Godaddy
we went full prezi, so bring some dramamine
measure performance
is it good enough?
if not, look for bottlenecks
how are people using our hosting?
setting up blogs, PHP apps
what are the common use cases?
know your customer
so... lots of PHP benchmarks
wordpress, joomla, drupal
response time is very important for your customers and their customers
people leave and/or complain when things are slow
imagine loading screens in video games, nobody likes loading screens
google has shown that page load time has a direct impact on how likely a person is to make a purchase
google ranks your site based on the load time
webrockit:
webrockit is our performance testing stack
how long does page load time take in a real browser?
data collected has to be real, match up with real users' experience
it needs to be understandable by our internal users
webrockit uses headless browsers to calculate page load time
time to first byte
number of assets
time to complete loading assets
100 different stats related to page load time
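just to make "time to first byte" concrete, a crude socket-level version (this is not how webrockit measures it; webrockit drives a real headless browser, so it also captures rendering and asset timing):

```python
# Crude time-to-first-byte check: time from opening the connection until the
# first response byte arrives. A real-browser measurement will differ.
import socket
import time

def time_to_first_byte(host, path="/", port=80, timeout=10):
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode())
        sock.recv(1)                     # block until the first response byte
    return time.monotonic() - start

print(round(time_to_first_byte("example.com") * 1000, 1), "ms")
```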
why not use a commercial offering?
too expensive for the amount of traffic we want to pump through
data resolution wasn't good enough
didn't include all the stats we wanted
we wanted to feed data into graphite
no commercial offering gave us all the features we wanted
how about open source?
similar to commercial offerings
we looked at: casperjs, selenium, watir, ghost.py
none of them had all the parts we wanted
so we decided to build our own and open source it
working prototype in 3 days
using phantomjs, wraps headless webkit with an API
and it was spot on with how real browsers work, gave accurate measurements
the API lets you do some cool stuff like overriding which IP to use for host
and exposes all the internal timing / metrics in the browser
example:
let's say we want to benchmark performance across changes to our app
let's use a standard LAMP stack, running wordpress, using stock versions of everything
no optimization ahead of time
let's point webrockit at it
start by focusing on time to first byte
test #1: enable compression
this made time to first byte slightly worse
that's useful to know
test #2: switch from mod_php to FastCGI + PHP-FPM
no speed change, but more stable looking graphs
test #3: enable APC
APC is an opcode cache for PHP, so source doesn't need to be compiled for each request
gave a great improvement in response time
test #4: upgrade package versions
php 5.3 to 5.5, apache 2.2 to 2.4, FastCGI -> mod_proxy_fcgi
another good improvement
The end result is that we had a nice workflow for testing and iterating on performance changes.
how does webrockit work?
we decided to use sensu
which is normally used for monitoring
but had all the basic pieces we needed for building a performance testing suite
we wanted the design to be API-first, REST API
written in jruby & sinatra (jruby = easier deployment)
uses Riak as the main source of truth, storing results
the data structures used are really simple, would be easy to port to other data stores
checksync API, webrockit API -> write checks to disk for sensu
all metrics go into graphite
web UI:
uses rails
set up a poller, e.g.: AWS east & west, digital ocean, internal network, etc.
then set up a check: name, run interval, which poller to use, URL, ip address override (to skip DNS lookup)
you can view a queue of all the jobs, each job has some debugging info in case there's a problem
wait for the job to run for a while then you can view results
graphite dashboards (high level overview of a few metrics)
cubism graphs (condensed strip charts, very easy to spotcheck)
explorer view (drill down into those 100 different finegrained metrics, add multiple targets to a graph to visualize better)
todo:
better analysis (see all the previous talks on math & anomaly detection)
heatmaps
events & errors (200 expected and now it's 404 or 301, page size drastically changed, etc.)
better dashboards, what is the state of the art? can we use or feed into those systems
better debian support (we're a RH/centos/fedora shop)
real configuration management (we are both a puppet & chef shop, which drew applause from the crowd, they are using bash scripts to install everything right now)
Lightning talk: serverspec + sensu
serverspec = rspec testing framework for server configurations, platform agnostic, 26 resource types
very fast, example: 266 tests in 2.78 seconds
when do you want to write serverspecs? when you're writing infrastructure as code to validate your code
you can also run your serverspecs on your live servers, why? because it's a quick and cheap way to verify everything is working
great addition to your monitoring system
let's put this data into sensu
first attempt: wow we have a lot of failures, and i have no idea what's broken
use rspec's json output format
sensu has a feature to send check results over a socket
these two features allow you to split the checks up, instead of one huge summary check for all servers you now have a bunch of separate checks, easy to see failures
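the socket trick looks roughly like this (the sensu client listens on localhost:3030 by default; the check name and output below are made up):

```python
# Minimal sketch of pushing an external check result into the local Sensu
# client socket (localhost:3030 by default). Check name and output are made up.
import json
import socket

result = {
    "name": "serverspec",   # hypothetical per-suite check name
    "status": 2,            # 0=OK, 1=WARNING, 2=CRITICAL
    "output": "3 examples, 1 failure: nginx should be running",
}

with socket.create_connection(("127.0.0.1", 3030)) as sock:
    sock.sendall(json.dumps(result).encode())
```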
summary:
write tests for your systems / infrastructure code
don't duplicate your effort, run your serverspecs on production
(next lightning talk, on monitoring client-side JavaScript errors)
error messages get translated, "Syntax error" vs. "Erreur de syntaxe", they get reported differently
his site had no AJAX, but there were a ton of AJAX errors
what is this stuff?
the majority are toolbars, malware, etc.
browser extensions, google translate, etc.
some are pretty nasty, "Skype click-to-call" got into an infinite loop and triggered tens of thousands of errors
Sponsor plug: Elastic Search - Rashid
who uses ES? show of hands
70% use it vs. 30% don't (hmm... interesting..)
i'm going to give a workshop on wednesday, so i'll demo a lot more then
but if anyone has any questions, feel free to ask me now
Q: why do we need log searching? why elasticsearch?
A: a graph shows you when something might be wrong, but logs allow you to go back to the original event and see what exactly happened
Q: what did you have for breakfast?
A: yogurt, granola, melon
Q: do you want to buy a musket?
A: yes, to defend myself from the government
Q: did you know you can 3d print a musket?
A: yes, i'm terrified of this
Q: does ZK cluster discovery work?
A: haven't used it, but zen discovery works
Q: can you talk about jepsen and ES?
A: there's a recent blog post about it, it's a tough subject, distributed is hard, we don't have an answer for everything but we're doing pretty good
Q: roadmap?
A: for what?
Q: kibana?
A: will talk more on wed, better aggregations / facets, which are useful for turning logs into charts, "top N query" reduced from N queries to 1
Q: when is ES going to learn how to reindex something something without something?
A: push harder if you want this feature
Sponsor plug: Librato - Joe
CTO of librato
librato is a platform for storing, monitoring, and alerting on custom metrics
composable monitoring system tailored to you
in the past that meant building your own solution from scratch with a bunch of OSS
librato lets you correlate arbitrary time series with each other
marking events like deploys & config changes
no proprietary agent, everything works over HTTP
80-100 products (middleware, web servers, databases, etc.) know how to speak to librato via open source plugins
if you can write to stdout, you can capture that log output and send to librato as metrics
new features:
more integrations
better alerts - tune the sensitivity of alerts using historical data
better on-call information - associate URLs / documentation with alerts, find all previous occurrences of an alert
"composite metrics" - custom query language to manipulate raw data, calculate ratios, aggregates (looks like graphite's URL/function interface)
Sponsor plug: Pagerduty
pagerduty sits between your monitoring systems and your on-call people
we integrate with everyone
we send SMS/email to the right person
we take reliability seriously, full end-to-end tests
we have 4 android phones in our lab constantly receiving texts to ensure deliverability!
new stuff:
multi-user alerting
on-call handoff notifications
SSO
outbound webhooks
multi-user alerting:
we found this is a great way to do onboarding for new ops people
put the new guy on-call alongside a veteran so they can get trained up in being on-call
multi-user alerting is also good for higher levels of escalation
for example if two people sleep through the alert, then set up your third escalation level to alert everyone instead of continuing to retry people one-by-one
handoff notifications:
notify by email, sms, and push when you go on or off call
outbound webhooks:
now has integration with slack, hipchat, flowdock, etc.
live demo of webhook FAILED, kinda awkward... lolz
oh wait he just yelled from the crowd that it worked (sure it did)
Sponsor plug: Dataloop.io - David
lots of teams spend a lot of time building monitoring solutions using OSS
but as soon as you try to get developers or QA to use it, you run into problems
high learning curve, confusing documentation, difficult interfaces
we want to un-silo the monitoring tools
as we move to microservices, traditional monitoring gets more difficult
we are building the monitoring tool for microservices
easy to use
flexibility of nagios / graphite, but with drag & drop
easy to create alerts
use existing nagios check scripts
speaks graphite/statsd/carbon protocol
create hierarchies with drag & drop
use tags
write plugins in any language
another thing we do besides config is visualization
nagios, collectd, and statsd all in one place
create dashboards via drag & drop, resize
send dashboard reports via email (good for weekly / monthly reports to management teams)
embeddable widgets
next, alerting:
big feature is multiple triggers for alerts
build context for your alerts
condition A and condition B and condition C
e.g. both web performance & service up/down check must trigger before alert goes off
this decreases alert spam
actions:
email / SMS / phone
send to jira
trigger event handlers (any language)
driven by API, command line tool, or github
launching later this year, beta testing now
Sponsor plug: Salesforce
no-show
Sponsor plug: Puppet
who doesn't know what puppet is?
we have commercial & open source offerings
who's coming to the puppet party tonight?
it's really hard to get there, left then right
we're hiring, a lot
(scrolls through dozens of job listings)
can everyone from puppet labs stand up?
(like 20 people stood up)
come to puppetconf in SF, september 20-24
all kinds of presenters, lots of topics
early bird pricing ends this month
Sponsor plug: pingdom
interesting numbers from our customers:
14 billion checks per month
9.4 million detected outages per month
8 million alerts sent per month
total downtime adds up to 500 million minutes, across 450k customers
what can we do at pingdom to help with this?
#1 most requested feature: alert management
new feature: BeepManager
pingdom.com/beepmanager
team members can customize their method of contact
automated escalations
integrate with other systems (nagios, new relic, rackspace cloud monitoring)
alert flood protection
access levels
alert templates
most important feature of monitoring system is that it works for your team
we are committed to making our tool work for your team
Sponsor plug: Grok - Jared
numenta.com/grok
we do anomaly detection
we've heard all about it these two days
how do we solve it? science
years of research, we've made some breakthroughs
automatic & unsupervised machine learning on timeseries data
open source at numenta.org
first product: grok
mobile app
automated model creation & monitoring for AWS instances
showed some examples
automatic anomaly detection in CPU load
they used this to catch someone running manual builds on a build server
required no setup / training
free trial: simple to get running, 10 servers, no time limit
Sponsor plug: Big Panda
we launched our private beta yesterday
we spend a lot of time tweaking tools, building thousands of alerts
what do you use to manage your response to issues?
jira, zendesk, email
those tools are meant for humans
they were not built for responding to tons of automatically created incidents, flapping alerts, etc.
bigpanda is basically jira for ops
live demo
home page "OpsBox" shows all alerts
UI should be very familiar to gmail users
star alerts, mute alerts
how do i rise above the noise of alerts?
shows a timeline of alerts, when did it start warning, when did it reach critical, when did it go back to normal
(pretty cool looking)
shows a lot more data in context
"Changes" view: event log of changes in your infrastructure
we're already helping people today respond to alerts in a much more intelligent manner
Sponsor plug: Datadog - Alexei
cofounder and CTO of Datadog
hosted monitoring service
easily monitor from 5 to 50,000 hosts
what have we been working on the past year?
better graphs
better visualizations, histograms
better counts & counters
heatmaps
better alerts, more sophisticated alerting
the ability to embed disturbing images into your dashboards (nicholas cage meme pics)
more integrations: fastly, google cloud, slack, new relic, 50-60 integrations total
monitoring is fun!
who here has learned a lot these past two days? (everyone)
who here wants to work on monitoring more? (still everyone)