Getting started with server monitoring and alerting

This short article loosely follows my previous adventures in self-hosting services such as a mail server, a file server and a document server; this time, I’ll show you how I handled server monitoring and alerting so that whatever happens to the server, I’m alerted quickly.

First, let’s introduce today’s actors:

  • Prometheus, for scraping data from various sources
  • Node Exporter, for providing Prometheus data about the host machine
  • Node Exporter Smartmon, for appending S.M.A.R.T. data to Node Exporter output
  • cAdvisor, for providing Prometheus data about running containers (if you remember, everything on the server runs in a Docker container)
  • Alertmanager, for sending alerts to Slack based on data in Prometheus
  • Grafana, for visualising data from Prometheus
  • Netdata, just because we can 🤣 And it looks so cool! 🤩

In this article, I’ll show how I’m monitoring a single server, although the entire setup can be easily scaled nearly to infinity – you’d have one (or more) node(s) scraping, collecting, analysing and visualising all the data from all the nodes.

As before, I deployed everything to Dokku, so I’ll be showing everything in terms of Dokku commands, but these can be easily converted to other methods of deployment, like docker-compose.yml (🤮🤣), or likely into a Kubernetes deployment. I’m fairly confident that Dokku commands are easily readable and understandable even to those not familiar with Dokku, but if you’re interested in a docker-compose.yml version, let me know in the comments and I’ll try adding it later.

Stealing fire with Prometheus 🔥

Let’s start by creating a new app for Prometheus and pulling the required image:

dokku apps:create prometheus

docker image pull prom/prometheus:v2.15.2
docker image tag prom/prometheus:v2.15.2 dokku/prometheus:2.15.2

Prometheus and Node Exporter are somewhat annoying in that they can’t be configured using environment variables at all – most of the configuration can be adjusted only in the config file, and some settings can be changed only by appending a parameter to the executed binary. That seemingly poses a problem in a Dokku environment because there’s no intuitive way of passing the arguments – until you discover two crucial pieces of information:

  • Prometheus’ Dockerfile correctly separates ENTRYPOINT and CMD directives, nice 👌
  • Despite Dokku’s documentation not being really clear on this point, you can override the value of CMD for Docker image-based deployments

With that in mind, let’s adjust the default CMD directive with --web.enable-lifecycle option, allowing us to instruct Prometheus to reload its configuration file remotely, without the need for restarting the entire container:

dokku config:set prometheus DOKKU_DOCKERFILE_START_CMD="--config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --web.console.libraries=/usr/share/prometheus/console_libraries --web.console.templates=/usr/share/prometheus/consoles --web.enable-lifecycle"

The above command points to two directories, one for configuration and one for persistent data, so let’s create & mount them to have easy access to them from the host machine:

dokku storage:mount prometheus /var/lib/dokku/data/storage/prometheus/config:/etc/prometheus
dokku storage:mount prometheus /var/lib/dokku/data/storage/prometheus/data:/prometheus

sudo mkdir -p /var/lib/dokku/data/storage/prometheus/{config,data}
sudo touch /var/lib/dokku/data/storage/prometheus/config/{alert.rules,prometheus.yml}
sudo chown -R nobody:nogroup /var/lib/dokku/data/storage/prometheus

The above commands should also fix any permission issues you might run into, as the Prometheus container runs under the nobody user. As for the alert.rules and prometheus.yml configuration files, this is the setup I currently use (I plan to replace the external domains with Docker’s internal networking once Dokku 0.20.0 is released). Feel free to use it, but bear in mind that until we finish deploying all pieces of this setup, scraping data from cAdvisor and Node Exporter, as well as alerting via Alertmanager, obviously won’t work.
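For orientation, here is a minimal sketch of what a prometheus.yml for this setup could look like. It is not the full configuration referenced above: the example.com default domain, the monitor/changeme credentials, the 15s intervals and the write_prometheus_config helper are all placeholder assumptions, and the targets simply follow the <app>.<your-domain>.com naming used throughout this article.

```shell
#!/bin/sh
# Sketch: write a minimal prometheus.yml into the directory given as $1.
# $2 is the base domain (assumed default: example.com).
# Credentials and intervals below are placeholders, not real values.
write_prometheus_config() {
    config_dir="$1"
    domain="${2:-example.com}"
    cat > "$config_dir/prometheus.yml" <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alert.rules

alerting:
  alertmanagers:
    - scheme: https
      basic_auth:
        username: monitor
        password: changeme
      static_configs:
        - targets: ['alertmanager.$domain:443']

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node-exporter
    scheme: https
    basic_auth:
      username: monitor
      password: changeme
    static_configs:
      - targets: ['node-exporter.$domain:443']

  - job_name: cadvisor
    scheme: https
    basic_auth:
      username: monitor
      password: changeme
    static_configs:
      - targets: ['cadvisor.$domain:443']
EOF
}
```

Calling write_prometheus_config /var/lib/dokku/data/storage/prometheus/config <your-domain>.com (as a user allowed to write there) would produce the file; the basic_auth entries have to match whatever you later pass to http-auth:on for each app.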

Finally, deploy Prometheus, fix proxy configuration and secure the installation with Dokku’s HTTP Auth plugin:

dokku checks:disable prometheus
dokku tags:deploy prometheus 2.15.2
dokku proxy:ports-set prometheus http:80:9090
dokku letsencrypt prometheus
dokku http-auth:on prometheus <username> <password>

After accessing Prometheus at https://prometheus.<your-domain>.com, you should be welcomed by Prometheus Dashboard. Under Status ↝ Targets, you should see prometheus up, while cadvisor and node-exporter will be down for now. That’s fine, let’s bring them up now.
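Since we started Prometheus with --web.enable-lifecycle, a configuration reload can be triggered with an HTTP POST to the /-/reload endpoint. The following is a small sketch, assuming the prometheus.<your-domain>.com naming from above and the credentials set via http-auth:on; both helper names are made up for this example.

```shell
#!/bin/sh
# Build the reload URL for a given base domain (hypothetical helper).
prometheus_reload_url() {
    printf 'https://prometheus.%s/-/reload\n' "$1"
}

# Ask Prometheus to re-read its configuration file.
# Arguments: base domain, HTTP auth username, HTTP auth password.
prometheus_reload() {
    curl --fail --silent --show-error -X POST \
        --user "$2:$3" \
        "$(prometheus_reload_url "$1")"
}
```

After editing prometheus.yml or alert.rules on the host, running prometheus_reload <your-domain>.com <username> <password> should be enough; a non-zero exit status from curl suggests the reload was rejected, for example because of a syntax error in the config.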

Raising Node Exporter up 🙌

As Node Exporter is made by the same team as Prometheus itself, the deployment process is eerily similar:

dokku apps:create node-exporter

docker image pull prom/node-exporter:v0.18.1
docker image tag prom/node-exporter:v0.18.1 dokku/node-exporter:0.18.1

dokku config:set node-exporter DOKKU_DOCKERFILE_START_CMD="--collector.textfile.directory=/data --path.procfs=/host/proc --path.sysfs=/host/sys"

dokku storage:mount node-exporter /proc:/host/proc:ro 
dokku storage:mount node-exporter /:/rootfs:ro 
dokku storage:mount node-exporter /sys:/host/sys:ro 
dokku storage:mount node-exporter /var/lib/dokku/data/storage/node-exporter:/data

sudo mkdir -p /var/lib/dokku/data/storage/node-exporter
sudo chown nobody:nogroup /var/lib/dokku/data/storage/node-exporter

dokku docker-options:add node-exporter deploy "--net=host"
dokku checks:disable node-exporter

There are two things worth noting here:

First, the /data directory. Node Exporter Smartmon will periodically write HDD’s S.M.A.R.T. status into that directory, and Node Exporter will append that status into its own output. That’s why there’s no separate scraping of Node Exporter Smartmon necessary (nor possible, really) – Smartmon does not expose an HTTP service, it simply transforms S.M.A.R.T. status into a text file in Prometheus format.

Second, to properly inspect the network traffic, Node Exporter must have access to all network interfaces (for some reason 🧐 Interestingly, Netdata does not have this requirement…). This is accomplished with the --net=host Docker option. Because the container will operate on the host network, zero-downtime deploy checks must unfortunately be disabled as well; otherwise, you’d probably run into an issue where two containers try to bind to the same port.
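To illustrate the textfile mechanism from the first point: Node Exporter’s textfile collector picks up files ending in .prom from the configured directory and appends their contents to its own output. Here is a minimal sketch of a script producing such a file; write_textfile_metric and the demo_heartbeat metric are hypothetical names, not something Smartmon itself uses.

```shell
#!/bin/sh
# Write one gauge sample into a *.prom file in the textfile directory.
# Arguments: directory, metric name, value.
write_textfile_metric() {
    dir="$1"
    name="$2"
    value="$3"
    # Temporary name does not end in .prom, so the collector ignores it.
    tmp_file="$dir/$name.prom.$$"
    {
        printf '# HELP %s Example gauge written by a script.\n' "$name"
        printf '# TYPE %s gauge\n' "$name"
        printf '%s %s\n' "$name" "$value"
    } > "$tmp_file"
    # Rename last so Node Exporter never reads a half-written file.
    mv "$tmp_file" "$dir/$name.prom"
}
```

For example, write_textfile_metric /var/lib/dokku/data/storage/node-exporter demo_heartbeat "$(date +%s)" would make a demo_heartbeat line show up in Node Exporter’s /metrics output on the next scrape.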

Before deploying Node Exporter, let’s prepare Node Exporter Smartmon as well:

dokku apps:create node-exporter-smartmon

docker image pull galexrt/node-exporter-smartmon:v1.1.0
docker image tag galexrt/node-exporter-smartmon:v1.1.0 dokku/node-exporter-smartmon:1.1.0

dokku storage:mount node-exporter-smartmon /var/lib/dokku/data/storage/node-exporter:/var/lib/node_exporter
dokku docker-options:add node-exporter-smartmon deploy,run "--privileged"

dokku tags:deploy node-exporter-smartmon 1.1.0

dokku proxy:disable node-exporter-smartmon
dokku domains:disable node-exporter-smartmon

As mentioned above, Node Exporter Smartmon does not expose an HTTP service, so there’s no need for Dokku’s proxy to interact with it. As smartmon requires root privileges to run, the --privileged argument must be supplied to the Docker container.

Finally, let’s finish Node Exporter’s deployment:

dokku tags:deploy node-exporter 0.18.1
dokku proxy:ports-set node-exporter http:80:9100
dokku letsencrypt node-exporter
dokku http-auth:on node-exporter <username> <password>

If everything went well (and I did not forget to mention any step here 😅), you should be able to access https://node-exporter.<your-domain>.com/metrics after a short while. At the bottom of the output, a lot of entries starting with smartmon_ should be present. And in Prometheus’ Dashboard, node-exporter should now be showing as UP 🎉
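If you’d rather not scroll through the whole output, a quick check could look like the sketch below. It only counts sample lines whose name starts with smartmon_; count_smartmon_metrics is a made-up helper name.

```shell
#!/bin/sh
# Count metric samples whose name starts with "smartmon_".
# Reads Prometheus text-format output on stdin; comment lines
# (# HELP / # TYPE) are not counted because they start with "#".
count_smartmon_metrics() {
    # grep -c exits non-zero on zero matches but still prints 0.
    grep -c '^smartmon_' || true
}
```

Usage: curl -fsS -u <username>:<password> https://node-exporter.<your-domain>.com/metrics | count_smartmon_metrics. A result of 0 would suggest that Smartmon hasn’t written its file into the shared directory yet.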

Bringing cAdvisor to the party 🥳

Let’s quickly deploy the last scraper:

dokku apps:create cadvisor

docker image pull gcr.io/google-containers/cadvisor:v0.35.0
docker image tag gcr.io/google-containers/cadvisor:v0.35.0 dokku/cadvisor:0.35.0

dokku config:set cadvisor DOKKU_DOCKERFILE_START_CMD="--docker_only --housekeeping_interval=10s --max_housekeeping_interval=60s --disable_metrics=disk"

dokku storage:mount cadvisor /:/rootfs:ro 
dokku storage:mount cadvisor /sys:/sys:ro 
dokku storage:mount cadvisor /var/lib/docker:/var/lib/docker:ro 
dokku storage:mount cadvisor /var/run:/var/run:rw

dokku tags:deploy cadvisor 0.35.0
dokku proxy:ports-set cadvisor http:80:8080
dokku letsencrypt cadvisor
dokku http-auth:on cadvisor <username> <password>

As before, if things went well, you should be able to access https://cadvisor.<your-domain>.com/metrics after a while, and cAdvisor should show as UP in Prometheus’ dashboard.

Nice, we have all the data we need, now let’s finally have a look at them!

Visualising with Grafana 📈

All of this should really feel like second nature to you by now:

dokku apps:create grafana

docker image pull grafana/grafana:6.6.1
docker image tag grafana/grafana:6.6.1 dokku/grafana:6.6.1

dokku storage:mount grafana /var/lib/dokku/data/storage/grafana:/var/lib/grafana
sudo mkdir /var/lib/dokku/data/storage/grafana
sudo chown 472:472 /var/lib/dokku/data/storage/grafana

dokku tags:deploy grafana 6.6.1
dokku proxy:ports-set grafana http:80:3000
dokku letsencrypt grafana

After accessing https://grafana.<your-domain>.com, you’ll be prompted to create a first user account, and after that, to add your first data source.

Our data source is, of course, Prometheus, so let’s choose that:

Since we enabled HTTP Basic Auth for our Prometheus installation, be sure to mention it in the data source configuration:

Hopefully, once you press the Save & Test button, the connection will be established 👌

Now, I’m both too lazy and too unskilled to actually create my own Grafana dashboards, but that’s fortunately not a problem as there’s a ton of publicly available dashboards – some of them are even good! Here are my top recommendations (bear 🐻 in mind 🤔, these will look slightly different after you import them yourself as I made small modifications to each):

1 Node Exporter for Prometheus Dashboard English Version UPDATE 1102 (Grafana dashboard)
Support Node Exporter v0.16 and above. Optimize the main metrics display. Includes: CPU, memory, disk IO, network, temperature and other monitoring metrics. https://github.com/starsliao/Prometheus

This is my default dashboard, it shows all the interesting data from Node Exporter.

Docker and system monitoring (Grafana dashboard)
A simple overview of the most important Docker host and container metrics. (cAdvisor/Prometheus)

This dashboard shows everything related to Docker containers, but unfortunately, it’s a bit buggy. And it also highlights an issue with cAdvisor itself: cAdvisor refuses to track container renames 😡 The relevant issue has been open for many years, but a fix is still nowhere to be seen…

I have no idea why that Load monitor is alerting all the time, but I gave up on trying to fix it 😢

Smartmon Textfile (Grafana dashboard)
Smartmon Textfile node_exporter Dashboard

Finally, this is the dashboard for keeping track of the health of your drives.

Going proactive with Alertmanager 🔔

Looking at graphs is fine and dandy (and if you’re anything like me, you’ll probably spend the next few days just randomly looking at the pretty charts 😅), but getting actually notified when something unusual happens with your system – that’s even more useful.

If you used the alert.rules file from my listing above, you should already see many alert rule definitions in the Alerts section of Prometheus’s dashboard (you’re welcome 🙏) – but there’s nobody to deliver the notifications yet! Let’s fix that quickly:

dokku apps:create alertmanager

docker image pull prom/alertmanager:v0.20.0
docker image tag prom/alertmanager:v0.20.0 dokku/alertmanager:0.20.0

dokku storage:mount alertmanager /var/lib/dokku/data/storage/alertmanager/config:/etc/alertmanager
dokku storage:mount alertmanager /var/lib/dokku/data/storage/alertmanager/data:/alertmanager
sudo mkdir -p /var/lib/dokku/data/storage/alertmanager/{config,data}
sudo touch /var/lib/dokku/data/storage/alertmanager/{config/alertmanager.yml,data/notifications.tmpl}

dokku tags:deploy alertmanager 0.20.0
dokku proxy:ports-set alertmanager http:80:9093
dokku letsencrypt alertmanager
dokku http-auth:on alertmanager <username> <password>

As before, the listing for config/alertmanager.yml and data/notifications.tmpl is included in the GitHub Gist. If you wish to output the alerts to a dedicated Slack channel, you’ll need to create a Slack app, enable incoming webhooks for it, and create a webhook. The process definitely got a bit lengthier than I remember it 🧐
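For reference, a bare-bones alertmanager.yml that routes everything to one Slack channel could be generated like this. This is only a sketch under assumptions: the grouping and timing values are illustrative, the default #alerts channel is a placeholder, write_alertmanager_config is a hypothetical helper, and the custom notifications.tmpl template is left out entirely.

```shell
#!/bin/sh
# Sketch: write a minimal alertmanager.yml with a single Slack receiver.
# Arguments: config directory, Slack webhook URL, channel (optional).
write_alertmanager_config() {
    config_dir="$1"
    webhook_url="$2"
    channel="$3"
    # Assumed default channel when none is given.
    [ -n "$channel" ] || channel='#alerts'
    cat > "$config_dir/alertmanager.yml" <<EOF
route:
  receiver: slack
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

receivers:
  - name: slack
    slack_configs:
      - api_url: '$webhook_url'
        channel: '$channel'
        send_resolved: true
EOF
}
```

The webhook URL is the one Slack shows after you enable incoming webhooks for your app; with send_resolved enabled, a follow-up message is posted once an alert stops firing.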

And that should be about it! If everything went well, you should start getting alerts on high CPU usage (both system-wide and when a single container goes crazy), high RAM usage, swap usage, and when the disk starts filling up or malfunctioning.
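To give an idea of what one of these rules looks like, here is a sketch of a system-wide CPU alert in Prometheus’ rule file format. The 90% threshold, the 2m duration, the HighCpuUsage name and the write_example_alert_rule helper are assumptions chosen for illustration, not the values from my own file. The function writes to whatever path you pass it, so it won’t overwrite an existing alert.rules unless you point it there.

```shell
#!/bin/sh
# Sketch: write a single example alerting rule to the file given as $1.
write_example_alert_rule() {
    rules_file="$1"
    # Quoted delimiter keeps $labels, $value and {{ }} literal,
    # so Prometheus (not the shell) expands them.
    cat > "$rules_file" <<'EOF'
groups:
  - name: host
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 90% for 2 minutes (current: {{ $value }}%)."
EOF
}
```

The expression takes the share of time the CPUs spent idle over the last five minutes, as reported by Node Exporter, and turns it into a usage percentage; the for clause keeps short spikes from firing the alert.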

One way of testing that notifications are working is to put the CPU under artificial load with a busy loop, for example with docker run --rm -it busybox sh -c "while true; do :; done". Depending on your server’s hardware, you might need more than one; I think my server needed seven of these running at the same time before a barrage of alerts bombarded me two minutes later.
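Starting several of these by hand gets tedious, so here is a sketch that launches the same busybox loop in the background a given number of times and cleans up afterwards. The cpu-burner-N container names and both function names are assumptions made up for this example.

```shell
#!/bin/sh
# Start N detached busy-loop containers named cpu-burner-1..N.
start_cpu_burners() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        docker run --rm -d --name "cpu-burner-$i" busybox \
            sh -c 'while true; do :; done'
        i=$((i + 1))
    done
}

# Force-remove the containers again once the alerts have arrived.
stop_cpu_burners() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        docker rm -f "cpu-burner-$i"
        i=$((i + 1))
    done
}
```

Something like start_cpu_burners 7, followed by stop_cpu_burners 7 once Slack lights up, should do; each container keeps roughly one CPU core busy until it is removed.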

Bonus round: going pretty with Netdata 💄

A few days after finishing this setup, I discovered Netdata. It’s fast, it’s pretty, it works well without any configuration – it’s really quite an amazing piece of software 🤩 Theoretically, I think it’s possible to scrape Netdata’s data into Prometheus and visualise them with Grafana, but for now, I’m not really interested in that – I’ll be keeping my Prometheus for longer-term monitoring, and Netdata for visualising the last few minutes – at least, that’s the plan for now.

Let’s get down to terminal one last time:

dokku apps:create netdata

docker image pull netdata/netdata:v1.19.0
docker image tag netdata/netdata:v1.19.0 dokku/netdata:1.19.0

dokku config:set netdata DO_NOT_TRACK=1 PGID=999
dokku docker-options:add netdata deploy,run "--cap-add=SYS_PTRACE"
dokku docker-options:add netdata deploy,run "--security-opt apparmor=unconfined"

dokku storage:mount netdata /etc/group:/host/etc/group:ro 
dokku storage:mount netdata /etc/os-release:/host/etc/os-release:ro 
dokku storage:mount netdata /etc/passwd:/host/etc/passwd:ro 
dokku storage:mount netdata /proc:/host/proc:ro 
dokku storage:mount netdata /sys:/host/sys:ro 
dokku storage:mount netdata /var/lib/dokku/data/storage/netdata:/etc/netdata 
dokku storage:mount netdata /var/run/docker.sock:/var/run/docker.sock:ro

dokku tags:deploy netdata 1.19.0
dokku proxy:ports-set netdata http:80:19999
dokku letsencrypt netdata
dokku http-auth:on netdata <username> <password>

After that, this nearly never-ending list of graphs should be your reward:

There’s a way to configure Netdata more to your liking (hint: look at the storage:mount command above), and have it send automatic alerts to Slack (or anywhere else) as well, but this article is already dangerously close to reaching 2500 words, so I think I’ll cut it short for now 😅
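As a starting point for the Slack part, Netdata reads its notification settings from health_alarm_notify.conf in its configuration directory, which is the directory mounted to /etc/netdata above. The sketch below writes only the Slack-related settings and assumes that Netdata’s stock defaults cover everything else; write_netdata_slack_config is a hypothetical helper, and I have not tested this against this exact image version.

```shell
#!/bin/sh
# Sketch: write a minimal health_alarm_notify.conf enabling Slack.
# Arguments: Netdata config directory on the host, webhook URL, channel.
write_netdata_slack_config() {
    config_dir="$1"
    webhook_url="$2"
    channel="$3"
    cat > "$config_dir/health_alarm_notify.conf" <<EOF
# Minimal override: only the Slack-related settings.
SEND_SLACK="YES"
SLACK_WEBHOOK_URL="$webhook_url"
DEFAULT_RECIPIENT_SLACK="$channel"
EOF
}
```

With the mount from above, the target directory on the host would be /var/lib/dokku/data/storage/netdata, and the container most likely needs a restart before the change is picked up.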

Happy monitoring! 🎉