We sometimes see one or more nodes running hot. What to do about it depends on the rest of the cluster. If it’s one node out of 50 that’s running hot, we can chalk it up to a “bad node” and kill it. However, if more than 50% of the nodes are running hot, we should seek to understand what’s happening within the service. To that end, I wanted to create an alert on outlier nodes.
Let’s start with cumulative idle on a host:
sum without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m]))
Idle measures unused CPU, so querying it is a shortcut compared with summing every mode that matches mode != "idle". If we cared about consumed CPU, we’d subtract the per-CPU idle from 1 (or, for this summed value, from the number of CPUs on the host). Also of note: we are accumulating idle across all CPUs on the host, not averaging per CPU.
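As a rough sketch of that subtraction, assuming the same node_cpu metric and host label as the query above, consumed CPU per host (in units of cores) could be written as the CPU count minus the summed idle. Both sides aggregate away cpu and job, so their remaining labels should line up one-to-one:
count without(cpu, job) (node_cpu{host=~"prod-foo.*",mode="idle"}) - sum without(cpu, job) (irate(node_cpu{host=~"prod-foo.*",mode="idle"}[3m]))
Dividing the result by the same count expression would turn it into a 0 to 1 utilization fraction, which is easier to compare across hosts with different core counts. We don’t need it for the alert, so the rest of this post sticks with idle.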
Now, let’s get the average idle for the cluster: