Alerting - prometheus/prometheus

Prometheus is an open-source systems monitoring and alerting toolkit. Alerting in Prometheus is separated into two parts: alerting rules in Prometheus servers send alerts to an Alertmanager, which then manages those alerts and sends out notifications via methods such as email, on-call notification systems, and chat platforms.

The main steps to setting up alerting and notifications with Prometheus are:

  1. Setting up and configuring the Alertmanager
  2. Configuring Prometheus to talk to the Alertmanager
  3. Creating alerting rules in Prometheus

Alertmanager

The Alertmanager is a standalone service that receives alerts from Prometheus and routes them to the correct receivers. It provides the following features:

  • Alert grouping: Alerts that share the same values for the configured grouping labels are batched into a single notification, so you don’t get spammed when many related alerts fire at once.
  • Silencing: Alerts can be silenced to mute notifications for a specified time.
  • Inhibition: Notifications for an alert can be suppressed while another, related alert is already firing.
  • Aggregation: Alerts in the same group are combined into one notification instead of being sent individually.
  • Notification methods: Alerts can be sent via email, on-call notification systems, chat platforms, and more.

Configuring the Alertmanager

Prometheus can be pointed at Alertmanager instances either statically or through service discovery. The alerting section of the Prometheus configuration file specifies the Alertmanager instances and the parameters that control how Prometheus communicates with them.

Static Configuration

To statically configure the Alertmanager, use the static_configs parameter in an entry under alerting.alertmanagers in the Prometheus configuration file. The configuration reference calls such an entry an alertmanager_config section.
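As a minimal sketch of a static setup, the fragment below lists one Alertmanager target in the Prometheus configuration file. The host name alertmanager.example.org is a placeholder, and 9093 is assumed to be the port the Alertmanager listens on.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Placeholder address; replace with your Alertmanager host:port.
            - 'alertmanager.example.org:9093'
```

Several targets can be listed under the same static_configs entry when more than one Alertmanager instance is running.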

Dynamic Configuration

To discover Alertmanager instances dynamically, use one of the service discovery parameters, such as kubernetes_sd_configs, dns_sd_configs, or file_sd_configs, in an entry under alerting.alertmanagers in the Prometheus configuration file.
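One possible shape for a discovered setup is sketched below, using Kubernetes service discovery in the alerting section of the Prometheus configuration file. The monitoring namespace is an assumption made for illustration. Discovery with the pod role returns every pod in that namespace, so in practice it is usually combined with relabeling to keep only the Alertmanager pods.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - kubernetes_sd_configs:
        - role: pod
          namespaces:
            names:
              # Assumed namespace; adjust to where Alertmanager runs.
              - monitoring
```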

Relabeling

The relabel_configs parameter allows selecting Alertmanagers from discovered entities and provides advanced modifications to the used API path.
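The following sketch shows how relabeling might narrow a discovered set of pods down to Alertmanager instances. It assumes the Alertmanager pods carry a Kubernetes label app=alertmanager and expose container port 9093; both are illustrative choices, not requirements.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        # Keep only pods labeled app=alertmanager (assumed label).
        - source_labels: [__meta_kubernetes_pod_label_app]
          regex: alertmanager
          action: keep
        # Keep only the container port Alertmanager is assumed to listen on.
        - source_labels: [__meta_kubernetes_pod_container_port_number]
          regex: '9093'
          action: keep
```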

Timeout

The timeout parameter specifies the per-target Alertmanager timeout when pushing alerts.
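For illustration, the timeout is set next to the target list in the same entry of the Prometheus configuration file. The 10s value and the target address below are example values only.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - timeout: 10s    # give up on a push to a target after 10 seconds
      scheme: http    # protocol used to reach the Alertmanager
      static_configs:
        - targets:
            - 'alertmanager.example.org:9093'  # placeholder address
```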

Prometheus

Prometheus sends alerts to the Alertmanager using the Alertmanager API. Alerts are sent as a list of alerts in the request body.

Alerting Rules

Prometheus’s alerting rules are good at figuring out what is broken right now, but they are not a fully-fledged notification solution. Another layer is needed to add summarization, notification rate limiting, silencing, and alert dependencies on top of the simple alert definitions. In Prometheus’s ecosystem, the Alertmanager takes on this role.

Prometheus can be configured to periodically send information about alert states to an Alertmanager instance, which then takes care of dispatching the right notifications.

Configuring Alerting Rules

Alerting rules are defined in separate YAML rule files, which Prometheus loads through the rule_files parameter of its configuration file. A rule file contains one or more groups of rules in the following format:

groups:
  - name: example
    rules:
      - alert: ExampleAlert
        expr: vector(1)
        for: 5m
        labels:
          severity: critical
        annotations:
          description: This is an example alert

The alert field specifies the name of the alert. The expr field specifies the PromQL expression to evaluate; each time series it returns becomes an alert. The for field specifies how long the expression must keep returning a result before the alert moves from pending to firing. The labels field specifies additional labels to attach to the alert. The annotations field specifies informational annotations, such as a description, to attach to the alert.
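Rule files only take effect once the Prometheus configuration file references them. A minimal sketch, assuming the rules are stored under a rules/ directory next to the configuration file (the path and the 30s interval are illustrative):

```yaml
# prometheus.yml (fragment)
global:
  # How often rules are evaluated; example value.
  evaluation_interval: 30s

rule_files:
  # Glob pattern; the directory and file names are assumptions.
  - 'rules/*.yml'
```

A rule file can be checked for syntax errors before loading with promtool check rules followed by the file name.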

Reloading Alerting Rules

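Prometheus reloads its configuration and rule files at runtime when it receives a SIGHUP signal or an HTTP POST request to the /-/reload endpoint. The HTTP endpoint is only available when Prometheus is started with the --web.enable-lifecycle flag.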

Receivers

Receivers are the endpoints that receive notifications from the Alertmanager. Receivers can be configured to send notifications via email, on-call notification systems, chat platforms, and more.

Configuring Receivers

Receivers are configured in the Alertmanager configuration file.

Email Receiver

To configure an email receiver, use the email_configs parameter in an entry of the receivers section of the Alertmanager configuration file.
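A sketch of an email receiver entry in the Alertmanager configuration file is shown below. Every address, host name, and credential is a placeholder. The SMTP settings are given per receiver here, although they can also be set once in the global section.

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: 'oncall@example.org'            # placeholder recipient
        from: 'alertmanager@example.org'    # placeholder sender
        smarthost: 'smtp.example.org:587'   # placeholder SMTP relay
        auth_username: 'alertmanager'       # placeholder credentials
        auth_password: '<secret>'
        send_resolved: true                 # also notify when alerts resolve
```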

PagerDuty Receiver

To configure a PagerDuty receiver, use the pagerduty_configs parameter in an entry of the receivers section of the Alertmanager configuration file.
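As an illustration, a PagerDuty receiver entry in the Alertmanager configuration file might look like the following. The receiver name is hypothetical, and the key is a placeholder for the integration key of a PagerDuty service that uses the Events API v2.

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'team-X-pager'   # hypothetical receiver name
    pagerduty_configs:
      # Placeholder; use the integration key from your PagerDuty service.
      - routing_key: '<integration key>'
```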

Slack Receiver

To configure a Slack receiver, use the slack_configs parameter in an entry of the receivers section of the Alertmanager configuration file.
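A Slack receiver entry in the Alertmanager configuration file could be sketched like this. The receiver name and channel are hypothetical, and the webhook URL is a placeholder for an incoming webhook created in your Slack workspace.

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'team-X-slack'   # hypothetical receiver name
    slack_configs:
      - api_url: '<incoming webhook URL>'  # placeholder
        channel: '#alerts'                 # hypothetical channel
        send_resolved: true                # also notify when alerts resolve
```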

Webhook Receiver

To configure a webhook receiver, use the webhook_configs parameter in an entry of the receivers section of the Alertmanager configuration file.
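A webhook receiver entry in the Alertmanager configuration file needs little more than a URL. In the sketch below the receiver name and the endpoint are assumptions; the Alertmanager sends an HTTP POST with a JSON payload describing the alerts to that address.

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'team-X-webhook'   # hypothetical receiver name
    webhook_configs:
      # Hypothetical local service that accepts the notification payload.
      - url: 'http://127.0.0.1:5001/alerts'
        send_resolved: true
```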

Examples

Here are some examples of alerting rules and Alertmanager configurations.

Alerting Rules

High CPU Usage Alert

This alert triggers when the CPU usage of a node exceeds 80% for 5 minutes.

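groups:
  - name: node_exporter
    rules:
      - alert: NodeHighCPUUsage
        # node_cpu_seconds_total is a counter, so take its per-second rate first.
        # Usage in percent is 100 minus the average idle share across all CPUs.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        annotations:
          description: The CPU usage of node {{ $labels.instance }} exceeded 80% for 5 minutes.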

High Memory Usage Alert

This alert triggers when the memory usage of a node exceeds 80% for 5 minutes.

groups:
  - name: node_exporter
    rules:
      - alert: NodeHighMemoryUsage
        # Used memory (total minus free, buffers, and cache) as a fraction of total memory.
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes > 0.8
        for: 5m
        annotations:
          description: The memory usage of node {{ $labels.instance }} exceeded 80% for 5 minutes.

Alertmanager Configuration

Static Configuration

This Alertmanager configuration file routes all alerts to a single email receiver and defines one inhibition rule.

global:
  resolve_timeout: 5m

route:
  receiver: 'team-X-mails'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 3h

receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: '[email protected]'

inhibit_rules:
  # Mute the warning-level InstanceDown alert while the critical one is firing.
  - source_match:
      alertname: 'InstanceDown'
      severity: 'critical'
    target_match:
      severity: 'warning'
    # The alertnames must match
    equal: ['alertname']

Dynamic Configuration

This Alertmanager configuration dynamically discovers Alertmanager instances using Kubernetes service discovery.

global:
  resolve_timeout: 5m

route:
  receiver: 'team-X-mails'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 3h

receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: '[email protected]'

alertmanager_config: