Prometheus is an open-source systems monitoring and alerting toolkit. Alerting in Prometheus is separated into two parts: alerting rules in Prometheus servers send alerts to an Alertmanager, which then manages those alerts and sends out notifications via methods such as email, on-call notification systems, and chat platforms.
The main steps in setting up alerting and notifications with Prometheus are:
- Setting up and configuring the Alertmanager
- Configuring Prometheus to talk to the Alertmanager
- Creating alerting rules in Prometheus
Alertmanager
The Alertmanager is a standalone service that receives alerts from Prometheus and routes them to the correct receivers. It provides the following features:
- Alert grouping: Alerts that share the same values for the configured grouping labels are combined into one notification, so you don’t get spammed when many related alerts fire at once.
- Silencing: Alerts can be silenced to mute notifications for a specified time.
- Inhibition: Notifications for an alert can be suppressed while another, related alert is firing.
- Aggregation: Alerts that belong to the same group are batched, so a single notification covers all of them instead of one notification per alert.
- Notification methods: Alerts can be sent via email, on-call notification systems, chat platforms, and more.
Configuring the Alertmanager
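Prometheus needs to know where its Alertmanager instances are. They can be listed statically or discovered dynamically using service discovery. Both are set in the alertmanager_config entries of the Prometheus configuration file (listed under alerting.alertmanagers), which specify the Alertmanager instances and the parameters that control how Prometheus communicates with them.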
Static Configuration
To statically configure the Alertmanager, use the static_configs parameter in an alertmanager_config entry (under alerting.alertmanagers) of the Prometheus configuration file.
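As a minimal sketch of the static approach, the fragment below lists one Alertmanager target in the Prometheus configuration file. The host name alertmanager.example.internal is a placeholder assumed for illustration; 9093 is the port Alertmanager listens on by default.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Hypothetical address; replace with your own Alertmanager host
            - 'alertmanager.example.internal:9093'
```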
Dynamic Configuration
To dynamically configure the Alertmanager, use one of the service discovery parameters, such as kubernetes_sd_configs or dns_sd_configs, in an alertmanager_config entry (under alerting.alertmanagers) of the Prometheus configuration file.
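One possible shape for dynamic discovery, assuming Alertmanager runs inside a Kubernetes cluster, is sketched below. The namespace name monitoring is a hypothetical value, not something the original specifies.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - kubernetes_sd_configs:
        # Discover candidates from Kubernetes endpoints
        - role: endpoints
          namespaces:
            names:
              - monitoring   # hypothetical namespace
```

Discovery on its own returns every endpoint in the namespace, so it is usually combined with relabeling to keep only the Alertmanager endpoints.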
Relabeling
The relabel_configs parameter allows selecting Alertmanagers from discovered entities and provides advanced modifications to the used API path.
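The selection step could look like the following sketch, which keeps only discovered pods that carry a particular label. The label app=alertmanager is an assumption made for illustration; use whatever labels your deployment actually sets.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        # Keep only pods labelled app=alertmanager (hypothetical label);
        # all other discovered pods are dropped as alert targets.
        - source_labels: [__meta_kubernetes_pod_label_app]
          regex: alertmanager
          action: keep
```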
Timeout
The timeout parameter specifies the per-target Alertmanager timeout when pushing alerts.
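For illustration, the timeout sits next to the target definition inside the same alertmanager_config entry. The 10s value and the host name below are example values only.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - timeout: 10s   # give up on a push to one Alertmanager after 10 seconds
      static_configs:
        - targets:
            - 'alertmanager.example.internal:9093'   # hypothetical address
```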
Prometheus
Prometheus sends alerts to the Alertmanager using the Alertmanager API. Alerts are sent as a list of alerts in the request body.
Alerting Rules
Prometheus’s alerting rules are good at figuring out what is broken right now, but they are not a fully-fledged notification solution. Another layer is needed to add summarization, notification rate limiting, silencing, and alert dependencies on top of the simple alert definitions. In Prometheus’s ecosystem, the Alertmanager takes on this role.
Prometheus can be configured to periodically send information about alert states to an Alertmanager instance, which then takes care of dispatching the right notifications.
Configuring Alerting Rules
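Alerting rules are defined in separate YAML rule files, which Prometheus loads through the rule_files setting of its configuration file. No particular extension is required; names such as alerts.rules.yml are a common convention. A rule file contains a list of rule groups in the following format: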
groups:
  - name: example
    rules:
      - alert: ExampleAlert
        expr: vector(1)
        for: 5m
        labels:
          severity: critical
        annotations:
          description: This is an example alert
The alert field specifies the name of the alert. The expr field specifies the PromQL expression to evaluate; every time series it returns becomes an alert. The for field specifies how long the expression must keep returning a result before the alert moves from pending to firing. The labels field specifies labels to add to the alert. The annotations field specifies annotations, such as a description, to add to the alert.
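Prometheus only evaluates rule files that are listed under rule_files in its main configuration. A minimal sketch follows; the directory and file pattern are assumed for illustration.

```yaml
# prometheus.yml (fragment)
rule_files:
  # Hypothetical location; glob patterns are allowed in the file name
  - 'rules/*.rules.yml'
```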
Reloading Alerting Rules
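Prometheus reloads its configuration and alerting rules at runtime when it receives a SIGHUP signal or a POST request to the /-/reload endpoint. The endpoint is only available when Prometheus is started with the --web.enable-lifecycle flag.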
Receivers
Receivers are the endpoints that receive notifications from the Alertmanager. Receivers can be configured to send notifications via email, on-call notification systems, chat platforms, and more.
Configuring Receivers
Receivers are configured in the Alertmanager configuration file.
Email Receiver
To configure an email receiver, add an email_configs list to an entry in the receivers section of the Alertmanager configuration file.
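A sketch of an email receiver is shown below. Every address, host, and credential is a placeholder assumed for illustration, and the SMTP settings could instead be set once in the global section.

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'team-email'                        # hypothetical receiver name
    email_configs:
      - to: 'oncall@example.org'              # placeholder recipient
        from: 'alertmanager@example.org'      # placeholder sender
        smarthost: 'smtp.example.org:587'     # placeholder SMTP relay
        auth_username: 'alertmanager'
        auth_password: 'REPLACE_ME'
        send_resolved: true                   # also notify when the alert clears
```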
PagerDuty Receiver
To configure a PagerDuty receiver, add a pagerduty_configs list to an entry in the receivers section of the Alertmanager configuration file.
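A PagerDuty receiver might look like this sketch, assuming an Events API v2 integration. The receiver name and key are placeholders.

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'team-pager'                 # hypothetical receiver name
    pagerduty_configs:
      # Integration key of the PagerDuty service (placeholder value)
      - routing_key: 'REPLACE_WITH_INTEGRATION_KEY'
```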
Slack Receiver
To configure a Slack receiver, add a slack_configs list to an entry in the receivers section of the Alertmanager configuration file.
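As an illustration, a Slack receiver that posts through an incoming webhook could be written as follows. The webhook URL and channel name are placeholders.

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'team-slack'                 # hypothetical receiver name
    slack_configs:
      # Placeholder incoming-webhook URL; use the one issued for your workspace
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/TOKEN'
        channel: '#alerts'             # hypothetical channel
        send_resolved: true
```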
Webhook Receiver
To configure a webhook receiver, add a webhook_configs list to an entry in the receivers section of the Alertmanager configuration file.
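A webhook receiver needs little more than the URL that Alertmanager should POST notifications to. The endpoint below is a hypothetical internal service.

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'team-webhook'               # hypothetical receiver name
    webhook_configs:
      # Hypothetical endpoint that accepts the JSON notification payload
      - url: 'http://alert-handler.example.internal:8080/alerts'
        send_resolved: true
```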
Examples
Here are some examples of alerting rules and Alertmanager configurations.
Alerting Rules
High CPU Usage Alert
This alert triggers when the CPU usage of a node exceeds 80% for 5 minutes.
groups:
  - name: node_exporter
    rules:
      - alert: NodeHighCPUUsage
        # CPU usage in percent: 100 minus the average idle rate over all cores of an instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        annotations:
          description: The CPU usage of node {{ $labels.instance }} exceeded 80% for 5 minutes.
High Memory Usage Alert
This alert triggers when the memory usage of a node exceeds 80% for 5 minutes.
groups:
  - name: node_exporter
    rules:
      - alert: NodeHighMemoryUsage
        # Used memory (total minus free, buffers, and cache) as a fraction of total memory
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes > 0.8
        for: 5m
        annotations:
          description: The memory usage of node {{ $labels.instance }} exceeded 80% for 5 minutes.
Alertmanager Configuration
Static Configuration
This Alertmanager configuration file routes all alerts to a single email receiver and inhibits warning-level notifications while a matching critical alert is firing.
global:
  resolve_timeout: 5m
route:
  receiver: 'team-X-mails'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 3h
receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: '[email protected]'
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'InstanceDown'
    target_match:
      severity: 'warning'
    # The alertnames must match
    equal: ['alertname']
Dynamic Configuration
This Alertmanager configuration dynamically discovers Alertmanager instances using Kubernetes service discovery.
global:
  resolve_timeout: 5m
route:
  receiver: 'team-X-mails'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 3h
receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: '[email protected]'
alertmanager_config: