Prometheus Alertmanager Integration Guide
Prometheus is an open-source monitoring and alerting toolkit that you run on your own infrastructure.
What can Zenduty do for Prometheus users?
With the Prometheus integration, Zenduty sends new Prometheus alerts to the right team and notifies them according to on-call schedules via email, text messages (SMS), phone calls (voice), Slack, Microsoft Teams, and iOS and Android push notifications, escalating each alert until it is acknowledged or closed. Zenduty provides your NOC, SRE, and application engineers with detailed context around the Prometheus alert, along with playbooks and a complete incident command framework to triage, remediate, and resolve incidents quickly.
Whenever a Prometheus alert rule condition is triggered, an alert is created in Zenduty, which creates an incident. When the condition returns to normal levels, Zenduty auto-resolves the incident.
You can also use Alert Rules to route specific Prometheus alerts to specific users, teams, or escalation policies, write suppression rules, and automatically add notes, responders, and incident tasks.
To integrate Prometheus with Zenduty, complete the following steps:
In Zenduty:
- To add a new Prometheus integration, go to Teams on Zenduty and click on the team you want to add the integration to.
- Next, go to Services and click on the relevant Service.
- Go to Integrations and then Add New Integration. Give it a name and select the application Prometheus from the dropdown menu.
- Go to Configure under your Integrations and copy the Webhook URL generated.
In Prometheus:
- Ensure that both Prometheus and Prometheus Alertmanager are downloaded and accessible locally on your system. To download them, visit the Prometheus download page.
- In the Alertmanager folder, open alertmanager.yml. Add the Webhook URL (copied in the earlier steps) under webhook_configs.
Your alertmanager.yml file should now look like this:
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'https://events.zenduty.com/integration/<unique key>/prometheus/<integration-key>/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
- Tip: To generate alerts across multiple Zenduty Services, you can define your alert rules in separate files, for example first_rules.yml, second_rules.yml, and so on, with each file's alerts routed to a different integration endpoint.
- In the Prometheus folder, open prometheus.yml. Add the rule files you just created under rule_files and set the targets. Zenduty groups Prometheus alerts by the alertname label.
Your prometheus.yml file should look like this:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
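The configuration above loads first_rules.yml, but this guide does not show that file. A minimal sketch of what it might contain follows. The group name, alert name, duration, and label values are hypothetical and should be adapted to your setup; only the overall rule-file structure is standard Prometheus.

```yaml
# first_rules.yml (illustrative sketch; names and thresholds are assumptions)
groups:
  - name: example-availability      # hypothetical group name
    rules:
      - alert: InstanceDown         # becomes the alertname label on the alert
        # `up` is 0 when Prometheus fails to scrape a target
        expr: up == 0
        for: 5m                     # condition must hold for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 5 minutes."
```

Because alerts are grouped by alertname, choosing one descriptive alert name per failure condition keeps related notifications together.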
- Run Prometheus and Alertmanager using commands like:
Prometheus: ./prometheus --config.file=prometheus.yml
Alertmanager: ./alertmanager --config.file=alertmanager.yml
- Once Prometheus is running, you will be able to see the alert rules you configured.
When an alert fires, Zenduty will automatically create an incident.
- Prometheus is now integrated.
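For the multi-service tip above, one possible way to send different alerts to different Zenduty Services is to give each Service its own receiver in alertmanager.yml and route on a label set in the rule files. This is a sketch under assumptions, not the only approach: the zenduty_service label, its values, and the receiver names are hypothetical, and the URLs are placeholders for the Webhook URLs copied from each integration.

```yaml
# alertmanager.yml (sketch; label, receiver names, and URLs are assumptions)
route:
  receiver: 'zenduty-default'           # fallback when no child route matches
  group_by: ['alertname']
  routes:
    - match:
        zenduty_service: payments       # hypothetical label set on rules in first_rules.yml
      receiver: 'zenduty-payments'
    - match:
        zenduty_service: platform       # hypothetical label set on rules in second_rules.yml
      receiver: 'zenduty-platform'
receivers:
  - name: 'zenduty-default'
    webhook_configs:
      - url: '<default-service-webhook-url>'
        send_resolved: true             # also send a notification when the alert resolves
  - name: 'zenduty-payments'
    webhook_configs:
      - url: '<payments-service-webhook-url>'
        send_resolved: true
  - name: 'zenduty-platform'
    webhook_configs:
      - url: '<platform-service-webhook-url>'
        send_resolved: true
```

The match key is used here to stay consistent with the other examples in this guide; recent Alertmanager releases also offer matchers for the same purpose.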
For Prometheus Docker installations
To scrape data from multiple services or pods, you need to write custom scrape configurations in Prometheus. Refer to the example below.
prometheus.yml: |-
  global:
    scrape_interval: 10s
    evaluation_interval: 10s
  rule_files:
    - /etc/prometheus/prometheus.rules
  alerting:
    alertmanagers:
      - scheme: http
        static_configs:
          - targets:
              - "alertmanager.monitoring.svc:9093"
  scrape_configs:
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
        - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
In the above example, scrape_configs defines where the data is scraped from, which in this case is the Kubernetes API server. You can define multiple jobs to scrape data from different services or pods. For Prometheus to scrape a service or pod, define prometheus.io/scrape: 'true' and prometheus.io/port: '9100' within the annotations section for that service or pod. These annotations are a convention and take effect only when a scrape job's relabel_configs act on them.
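As an illustration of the annotations described above, the sketch below shows a Kubernetes Service carrying them, followed by a scrape job that honors them. The Service name, namespace, selector, and job name are hypothetical. The relabeling follows the widely used annotation convention and is an assumption about your setup, since the example configuration in this guide only scrapes the API server.

```yaml
# Service manifest (hypothetical name, namespace, and selector)
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  annotations:
    prometheus.io/scrape: 'true'   # opt this Service in to scraping
    prometheus.io/port: '9100'     # port where metrics are exposed
spec:
  selector:
    app: node-exporter
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
---
# Additional entry for scrape_configs in prometheus.yml (hypothetical job name)
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    # Keep only endpoints whose Service is annotated prometheus.io/scrape: 'true'
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Rewrite the scrape address to use the annotated port
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
```

Without a job like the second document, the annotations on the Service are ignored by Prometheus.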
/etc/prometheus/prometheus.rules is the location of the Prometheus rule file, an example of which is shown below:
prometheus.rules: |-
  groups:
    - name: Host-related-AZ1
      rules:
        - alert: HostOutOfMemory
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20
          for: 10m
          labels:
            slack: "true"
            zenduty: "true"
            severity: warning
            team: devops
          annotations:
            summary: Host out of memory (instance {{ $labels.instance }})
            description: "Node memory is filling up (< 20% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
In the above example, alerting rules are organized into groups by resource area, and each group has its own rules. Make sure you add the appropriate labels to your rules, because the Alertmanager settings match on these labels to route alerts to Zenduty.
When a rule fires and Prometheus sends the alert to Alertmanager, Alertmanager must have an appropriate receiver to notify. To configure Alertmanager with Zenduty or Slack, see the example below:
config.yml: |-
  global:
    resolve_timeout: 5m
  templates:
    - '/etc/alertmanager-templates/*.tmpl'
  route:
    group_by: ['alertname', 'cluster', 'service']
    group_wait: 20s
    group_interval: 2m
    repeat_interval: 5m
    receiver: default # this is the default receiver
    routes:
      - receiver: zen_hook # condition-based receiver; zen_hook is notified only if the conditions below are met
        match:
          team: devops
          zenduty: "true"
        group_wait: 20s
        repeat_interval: 2m
  receivers:
    - name: zen_hook # zen_hook receiver definition
      webhook_configs:
        - url: <Zenduty_integration_url>
          send_resolved: true
    - name: 'default' # default receiver definition
      slack_configs:
        - channel: '#default-infra-logs'
          send_resolved: true
          title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
          text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
If needed, you can add a proxy in the global settings, as in the snippet below.
config.yml: |-
  global:
    resolve_timeout: 5m
    http_config:
      proxy_url: 'http://127.0.0.1:1025'
For more information, see the Alertmanager documentation.