https://blog.svedr.in/posts/prometheus-quick-start.html
https://blog.svedr.in/posts/prometheus-quick-start.rst

.. title: Prometheus quick start
.. slug: prometheus-quick-start
.. date: 2016-07-05 15:50:00 UTC+02:00
.. tags: linux, prometheus
.. link:
.. description:
.. type: text

Here's a little quick start procedure I like to use to get the `Prometheus <https://prometheus.io/>`_ monitoring system up and running. It covers Prometheus itself, the node and SNMP exporters, separate directories for configured and discovered targets, and a couple of basic alerts.

So far, I have used this on Ubuntu 14.04 and Debian Jessie.

.. TEASER_END

Prometheus
==========

First, install Prometheus, promtool, the Node exporter and Alert Manager, and create the directories used below::

    add-apt-repository ppa:ubuntu-lxc/lxd-stable
    apt-get update
    apt-get install golang build-essential
    echo 'GOPATH="/opt/gocode"' >> /etc/environment
    source /etc/environment
    export GOPATH
    go get github.com/prometheus/prometheus/cmd/prometheus
    go get github.com/prometheus/prometheus/cmd/promtool
    go get github.com/prometheus/node_exporter
    go get github.com/prometheus/alertmanager
    mkdir -p /etc/prometheus/targets/node
    mkdir -p /etc/prometheus/targets/snmp
    mkdir -p /var/lib/prometheus/data
    mkdir -p /var/lib/prometheus/amdata
    mkdir -p /var/lib/prometheus/discovery/node
    mkdir -p /var/lib/prometheus/discovery/snmp

Put the following into ``/etc/prometheus/prometheus.yml``::

    scrape_configs:
      - job_name: "node"
        scrape_interval: "15s"
        file_sd_configs:
          - files:
              - '/etc/prometheus/targets/node/*.json'
              - '/var/lib/prometheus/discovery/node/*.json'

      - job_name: 'snmp'
        params:
          module: [default]
        file_sd_configs:
          - files:
              - '/etc/prometheus/targets/snmp/*.json'
              - '/var/lib/prometheus/discovery/snmp/*.json'
        relabel_configs:
          - source_labels: [instance]
            target_label: hostname
          - source_labels: [__address__]
            target_label: __param_address
          - source_labels: [__param_address]
            target_label: instance
          - target_label: __address__
            replacement: '127.0.0.1:9116'

    rule_files:
      - /etc/prometheus/alert.rules

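The ``snmp`` job's ``relabel_configs`` are easier to follow once you trace a single target through them. Below is a minimal Python model of Prometheus' default ``replace`` relabel action. It is a simplified sketch, not Prometheus' own code: it ignores the other actions and writes ``$1`` as Python's ``\1``. The target ``10.0.0.2`` with instance ``switch01`` is a hypothetical example. Labels beginning with ``__`` are dropped after relabeling, so the stored series end up with ``instance`` and ``hostname``.

```python
import re
from urllib.parse import urlencode

# The relabel_configs of the 'snmp' job. No 'action' is given, so every rule
# uses the default "replace" action with regex "(.*)", separator ";" and
# replacement "$1" (spelled \1 in Python's re syntax).
SNMP_RELABEL = [
    {"source_labels": ["instance"], "target_label": "hostname"},
    {"source_labels": ["__address__"], "target_label": "__param_address"},
    {"source_labels": ["__param_address"], "target_label": "instance"},
    {"source_labels": [], "target_label": "__address__",
     "replacement": "127.0.0.1:9116"},
]


def relabel(labels, rules):
    """Simplified model of the "replace" relabel action."""
    labels = dict(labels)
    for rule in rules:
        # Join the source label values; missing labels count as "".
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # Prometheus anchors relabel regexes, hence fullmatch.
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        if match is None:
            continue
        result = match.expand(rule.get("replacement", r"\1"))
        if result:
            labels[rule["target_label"]] = result
        else:
            # An empty result removes the target label instead of setting it.
            labels.pop(rule["target_label"], None)
    return labels


def scrape_url(labels, params, metrics_path="/metrics"):
    """Build the URL that gets scraped: __param_* labels become query parameters."""
    query = dict(params)
    for name, value in labels.items():
        if name.startswith("__param_"):
            query[name[len("__param_"):]] = value
    return "http://%s%s?%s" % (labels["__address__"], metrics_path,
                              urlencode(sorted(query.items())))
```

For the hypothetical switch, ``relabel({"__address__": "10.0.0.2", "instance": "switch01"}, SNMP_RELABEL)`` keeps ``switch01`` as ``hostname``, turns the address into ``instance`` and points ``__address__`` at the exporter. The scrape therefore goes to ``http://127.0.0.1:9116/metrics?address=10.0.0.2&module=default``.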
For starters, you can monitor the Prometheus node itself using the Node exporter by
putting the following into ``/etc/prometheus/targets/node/localhost.json``::

    [
      {
        "targets": ["127.0.0.1:9100"],
        "labels": {
          "instance": "localhost"
        }
      }
    ]

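Typos in hand-written target files like this one are easy to miss. The sketch below is a hypothetical checker that looks for the structure shown above: a JSON list of objects, each with a ``targets`` list of strings and an optional ``labels`` map. Label names are checked against Prometheus' ``[a-zA-Z_][a-zA-Z0-9_]*`` rule. The function name and messages are my own.

```python
import json
import re

# Prometheus label names must match this pattern.
LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def check_file_sd(text):
    """Return a list of problems in file_sd JSON text (empty list = looks valid)."""
    try:
        groups = json.loads(text)
    except ValueError as err:  # json.JSONDecodeError subclasses ValueError
        return ["not valid JSON: %s" % err]
    if not isinstance(groups, list):
        return ["top level must be a list of target groups"]
    problems = []
    for i, group in enumerate(groups):
        if not isinstance(group, dict):
            problems.append("group %d is not an object" % i)
            continue
        targets = group.get("targets")
        if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
            problems.append("group %d: 'targets' must be a list of strings" % i)
        labels = group.get("labels", {})
        if not isinstance(labels, dict):
            problems.append("group %d: 'labels' must be an object" % i)
            continue
        for name, value in labels.items():
            if not LABEL_NAME.fullmatch(name):
                problems.append("group %d: invalid label name %r" % (i, name))
            if not isinstance(value, str):
                problems.append("group %d: label %r must be a string" % (i, name))
    return problems
```

Running ``check_file_sd(open(path).read())`` over each file in ``/etc/prometheus/targets/*/`` before relying on it should flag the common mistakes. Examples include a bare string instead of a ``targets`` list, or a numeric label value.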
SNMP Exporter
=============

Next, install the SNMP Exporter. You get to choose between the official branch::

    apt-get install python-netsnmp python-dev python-pip
    pip install snmp_exporter

or my own branch, which adds a more modular config and fixes an infinite loop that occurred in our network for some reason::

    apt-get install python-netsnmp python-dev python-pip
    cd /opt
    git clone https://github.com/Svedrin/snmp_exporter.git
    cd snmp_exporter
    git checkout svedrin-master
    python setup.py install
    cp -r snmp.yml.d /etc/prometheus

(So, you only need to run *one* of the two sections above.)

It totally helps if all your nodes are configured to use the same SNMP community
and you have a discovery tool that can generate a JSON file that knows them all.
This way, you can literally get up and running in minutes.

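The discovery tool itself isn't shown here, so the following is only a hypothetical stand-in for its output step. It takes a ``hostname -> address`` mapping (which you would build from your own inventory) and writes it into ``/var/lib/prometheus/discovery/snmp/`` in the same format as ``localhost.json``. The temp file plus ``os.replace`` keeps Prometheus from ever reading a half-written file. The ``.tmp`` suffix keeps the temp file out of the ``*.json`` glob.

```python
import json
import os
import tempfile


def write_file_sd(path, hosts):
    """Atomically write a file_sd JSON file.

    hosts maps a hostname (used as the "instance" label, which the snmp job
    copies into "hostname") to the address the SNMP exporter should query.
    """
    groups = [
        {"targets": [address], "labels": {"instance": hostname}}
        for hostname, address in sorted(hosts.items())
    ]
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one file system;
    # the ".tmp" suffix is not matched by the "*.json" glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
        # mkstemp creates the file as 0600; Prometheus may run as another user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise
```

For example, ``write_file_sd("/var/lib/prometheus/discovery/snmp/switches.json", {"switch01": "10.0.0.2"})`` would produce one target group for a hypothetical switch.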
Alerting
========

We already installed Alert Manager, so it's time to configure it. The config file is
``/etc/prometheus/alertmanager.conf``::

    global:
      # The smarthost and SMTP sender used for mail notifications.
      # Alert Manager expects the smarthost as host:port; adjust the port.
      smtp_smarthost: 'smtp.derpyherp.com:25'
      smtp_from: 'prometheus@herpyderp.com'
      smtp_auth_username: 'derpity'
      smtp_auth_password: 'derpington'

    route:
      receiver: 'team-X-mails'
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 6h

    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        # Apply inhibition if the alertname is the same.
        equal: ['alertname']

    receivers:
      - name: 'team-X-mails'
        email_configs:
          - to: 'svedrin@herpyderp.com'

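The ``inhibit_rules`` entry is easy to misread. Alert Manager matches it against alert *labels*, and because of ``equal: ['alertname']`` a firing critical alert only mutes warnings that share its ``alertname``. The following is a simplified Python model of that single rule, not Alert Manager's code. The ``disk_full`` and ``cpu_load_high`` alerts in the usage note are hypothetical.

```python
def matches(alert, matchers):
    """True if the alert's labels contain every name=value pair in matchers."""
    return all(alert.get(name) == value for name, value in matchers.items())


def is_inhibited(alert, firing, rule):
    """Simplified model of one inhibit rule: labels only, no regex matchers."""
    if not matches(alert, rule["target_match"]):
        return False
    # Some *other* firing alert must match source_match and agree on the
    # "equal" labels (a label missing on both sides counts as equal).
    return any(
        source is not alert
        and matches(source, rule["source_match"])
        and all(source.get(name) == alert.get(name) for name in rule["equal"])
        for source in firing
    )


# The inhibit rule from alertmanager.conf.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname"],
}
```

With a critical and a warning ``disk_full`` alert firing together, only the warning is muted. A warning ``cpu_load_high`` still gets through, because its ``alertname`` differs.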
Rules go into ``/etc/prometheus/alert.rules``. The severity is set as a label
rather than an annotation, because Alert Manager's routes and inhibit rules
only match on labels::

    ALERT node_down
      IF up{job="node"} == 0
      FOR 5m
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "Node is down",
        description = "Node has been unreachable for more than 5 minutes."
      }

    ALERT snmp_down
      IF up{job="snmp"} == 0
      FOR 5m
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "SNMP is down",
        description = "SNMP has been unreachable for more than 5 minutes."
      }

    ALERT fs_at_80_percent
      IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.8
      FOR 15m
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "File system {{$labels.hrStorageDescr}} is at 80%",
        description = "{{$labels.hrStorageDescr}} has been at 80% for more than 15 minutes."
      }

    ALERT fs_at_90_percent
      IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.9
      FOR 15m
      LABELS { severity = "average" }
      ANNOTATIONS {
        summary = "File system {{$labels.hrStorageDescr}} is at 90%",
        description = "{{$labels.hrStorageDescr}} has been at 90% for more than 15 minutes."
      }

    ALERT disk_load_mostly_random_reads
      IF rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) > 20 AND
         rate(diskIONReadX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) < 10000
      FOR 15m
      LABELS { severity = "info" }
      ANNOTATIONS {
        summary = "Disk {{$labels.diskIODevice}} reads are mostly random.",
        description = "{{$labels.diskIODevice}} reads have been mostly random for the past 15 minutes."
      }

    ALERT disk_load_mostly_random_writes
      IF rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) > 20 AND
         rate(diskIONWrittenX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) < 10000
      FOR 15m
      LABELS { severity = "info" }
      ANNOTATIONS {
        summary = "Disk {{$labels.diskIODevice}} writes are mostly random.",
        description = "{{$labels.diskIODevice}} writes have been mostly random for the past 15 minutes."
      }

    ALERT disk_load_high
      IF diskIOLA1{diskIODevice=~"[sv]d[a-z]+"} > 30
      FOR 15m
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "Disk {{$labels.diskIODevice}} load is above 30%",
        description = "{{$labels.diskIODevice}} load has exceeded 30% over the past 15 minutes."
      }

    ALERT cpu_load_high
      IF ssCpuIdle < 70
      FOR 15m
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "CPU load is above 30%",
        description = "CPU load has constantly exceeded 30% over the past 15 minutes."
      }

    ALERT linux_load_high
      IF laLoad1 > 50
      FOR 15m
      LABELS { severity = "average" }
      ANNOTATIONS {
        summary = "Linux load is above 50",
        description = "Linux load has constantly exceeded 50 over the past 15 minutes."
      }

    # changes() also catches a port that went down and came back up,
    # which delta() would report as 0.
    ALERT if_operstatus_changed
      IF changes(ifOperStatus[15m]) > 0
      LABELS { severity = "info" }
      ANNOTATIONS {
        summary = "Port {{$labels.ifDescr}} changed status",
        description = "Port {{$labels.ifDescr}} went up or down in the past 15 minutes."
      }

    # ifInOctets counts bytes while ifSpeed is in bits per second, hence "* 8".
    ALERT if_traffic_at_30_percent
      IF ifSpeed > 10000000 AND
         ifOperStatus == 1 AND
         rate(ifInOctets[5m]) * 8 > ifSpeed * 0.3
      FOR 15m
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "Port {{$labels.ifDescr}} is at 30%",
        description = "Port {{$labels.ifDescr}} has had at least 30% inbound traffic over the past 15 minutes."
      }

    ALERT if_traffic_at_70_percent
      IF ifSpeed > 10000000 AND
         ifOperStatus == 1 AND
         rate(ifInOctets[5m]) * 8 > ifSpeed * 0.7
      FOR 15m
      LABELS { severity = "average" }
      ANNOTATIONS {
        summary = "Port {{$labels.ifDescr}} is at 70%",
        description = "Port {{$labels.ifDescr}} has had at least 70% inbound traffic over the past 15 minutes."
      }

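Two of these rule families depend on a bit of arithmetic that is easier to check with concrete numbers. The random I/O rules divide bytes per second (``diskIONReadX``, ``diskIONWrittenX``) by operations per second (``diskIOReads``, ``diskIOWrites``), which gives the average request size in bytes. "Mostly random" then means more than 20 operations per second averaging under 10000 bytes each. The traffic rules have to convert ``ifInOctets`` (bytes) into bits before comparing against ``ifSpeed`` (bits per second). The sketch below uses a simplified ``rate()`` over two samples 300 seconds (the rules' ``[5m]`` window) apart, ignoring counter resets and Prometheus' extrapolation. The sample values in the usage note are made up.

```python
def per_second(first, last, seconds):
    """Simplified rate(): counter increase per second between two samples.

    Unlike Prometheus' rate(), this ignores counter resets and extrapolation.
    """
    return (last - first) / seconds


def bytes_per_read(reads, read_bytes, seconds=300):
    """rate(diskIONReadX) / rate(diskIOReads): average bytes per read.

    reads and read_bytes are (first, last) samples of diskIOReads and diskIONReadX.
    """
    return per_second(*read_bytes, seconds) / per_second(*reads, seconds)


def mostly_random(reads, read_bytes, seconds=300):
    """Mirrors disk_load_mostly_random_reads: > 20 reads/s averaging < 10000 bytes."""
    return (per_second(*reads, seconds) > 20
            and bytes_per_read(reads, read_bytes, seconds) < 10000)


def if_utilization(in_octets, if_speed, seconds=300):
    """Inbound utilization as a fraction of ifSpeed (octets * 8 = bits)."""
    return per_second(*in_octets, seconds) * 8 / if_speed
```

For instance, 9000 reads moving 36864000 bytes in five minutes are 30 reads/s of 4096 bytes each, so they count as mostly random. On a 100 Mbit port (``ifSpeed`` 100000000), 1125000000 octets in five minutes amount to exactly 30% utilization.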
.. note::

   Apart from the two ``up`` checks, these rules only cover SNMP data, and
   most of them rely on metrics the upstream SNMP exporter doesn't even
   scrape.

You could also put the instance name into the alert summary and/or
description, but I'd advise against it. If you omit that info, you
can more easily group alerts by their summary.

Upstart configs
===============

All this stuff has to be started somehow. If you're on Ubuntu 14.04,
you may want to (or find yourself forced to) use upstart. So, here goes:

``/etc/init/prometheus.conf``::

    # Run prometheus
    start on startup

    script
        cd /opt/gocode/src/github.com/prometheus/prometheus
        /opt/gocode/bin/prometheus \
            -storage.local.path="/var/lib/prometheus/data" \
            -config.file=/etc/prometheus/prometheus.yml \
            -alertmanager.url=http://localhost:9093/alert-manager/ \
            -web.external-url=http://192.168.0.1/prometheus
    end script

``/etc/init/alertmanager.conf``::

    # Run alert manager
    start on startup

    script
        /opt/gocode/bin/alertmanager \
            -log.level=debug \
            -storage.path="/var/lib/prometheus/amdata" \
            -config.file=/etc/prometheus/alertmanager.conf \
            -web.external-url=http://192.168.0.1/alert-manager/
    end script

``/etc/init/node-exporter.conf``::

    # Run node_exporter
    start on startup

    script
        /opt/gocode/bin/node_exporter
    end script

``/etc/init/snmp-exporter.conf``::

    # Run snmp_exporter
    start on startup

    script
        # This is only relevant for the Svedrin edition. Omit it for upstream.
        cat /etc/prometheus/snmp.yml.d/*.yml > /var/lib/prometheus/snmp.yml
        /usr/local/bin/snmp_exporter /var/lib/prometheus/snmp.yml
    end script

Systemd configs
===============

If you're fortunate enough to be on a platform that supports Systemd, the following configs may come in handy.
The Prometheus and Alert Manager units run as a ``prometheus`` user, so that user has to exist and own
``/var/lib/prometheus`` (for example ``useradd --system prometheus`` followed by ``chown -R prometheus: /var/lib/prometheus``).

``/etc/systemd/system/prometheus.service``::

    [Unit]
    Description=Prometheus server
    After=network.target

    [Service]
    WorkingDirectory=/opt/gocode/src/github.com/prometheus/prometheus/
    ExecStart=/opt/gocode/bin/prometheus \
        -storage.local.path=/var/lib/prometheus/data \
        -config.file=/etc/prometheus/prometheus.yml \
        -alertmanager.url=http://localhost:9093/alert-manager \
        -web.external-url=http://192.168.0.1/prometheus/
    User=prometheus

    [Install]
    WantedBy=multi-user.target

``/etc/systemd/system/alertmanager.service``::

    [Unit]
    Description=Prometheus Alert Manager
    After=network.target

    [Service]
    ExecStart=/opt/gocode/bin/alertmanager \
        -log.level=debug \
        -storage.path=/var/lib/prometheus/amdata \
        -config.file=/etc/prometheus/alertmanager.conf \
        -web.external-url=http://192.168.0.1/alert-manager/
    User=prometheus

    [Install]
    WantedBy=multi-user.target

``/etc/systemd/system/node-exporter.service``::

    [Unit]
    Description=Prometheus Node Exporter
    After=network.target

    [Service]
    ExecStart=/opt/gocode/bin/node_exporter
    User=nobody

    [Install]
    WantedBy=multi-user.target

``/etc/systemd/system/snmp-exporter.service``::

    [Unit]
    Description=Prometheus SNMP Exporter
    After=network.target

    [Service]
    WorkingDirectory=/opt/snmp_exporter
    Environment=PYTHONPATH=.
    ExecStart=/usr/bin/python scripts/snmp_exporter snmp.yml
    User=nobody

    [Install]
    WantedBy=multi-user.target

(I haven't ported my ``snmp.yml.d`` mechanism to my systemd machine yet, so there is no config for that here.)

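Apache2 Reverse Proxy
=====================

Prometheus and Alert Manager speak plain HTTP, so Apache proxies to ``http://`` URLs.
This needs ``mod_proxy`` and ``mod_proxy_http`` (``a2enmod proxy proxy_http``), and the site has to be
enabled with ``a2ensite prometheus`` before reloading Apache.

``/etc/apache2/sites-available/prometheus.conf``::

    ProxyPass        /prometheus/    http://localhost:9090/prometheus/
    ProxyPassReverse /prometheus/    http://localhost:9090/prometheus/
    ProxyPass        /alert-manager/ http://localhost:9093/alert-manager/
    ProxyPassReverse /alert-manager/ http://localhost:9093/alert-manager/
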
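Once the proxy works, a quick end-to-end check is to ask Prometheus for its ``up`` metric through the HTTP API at ``/api/v1/query`` and list every target reporting 0. The following is a minimal Python 3 sketch. It assumes the ``http://192.168.0.1/prometheus`` external URL used above, and ``down_targets`` and ``query_up`` are hypothetical helper names. Parsing is kept separate from fetching so it can be tried on a saved response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def down_targets(payload):
    """Return sorted (job, instance) pairs whose `up` sample is 0.

    payload is the JSON text of an /api/v1/query response for the query `up`.
    """
    response = json.loads(payload)
    if response.get("status") != "success":
        raise RuntimeError("query failed: %s" % response.get("error"))
    down = []
    for sample in response["data"]["result"]:
        _, value = sample["value"]  # [timestamp, "value as string"]
        if float(value) == 0:
            metric = sample["metric"]
            down.append((metric.get("job", ""), metric.get("instance", "")))
    return sorted(down)


def query_up(base_url):
    """Run the instant query `up` via the reverse proxy and return the raw JSON."""
    url = base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": "up"})
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling ``down_targets(query_up("http://192.168.0.1/prometheus"))`` should return an empty list once every node and SNMP target is scraped successfully.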
Summary
=======

This config illustrates a quick way to get started. I consider it more
of a guideline than a production-ready setup, so please don't forget to
adapt it to your needs. The alert rules in particular will need some tuning.