Prometheus Integration Guide

Prometheus is an open-source monitoring and alerting toolkit that you run on your own infrastructure.

What can Zenduty do for Prometheus users?

With the Prometheus integration, Zenduty sends new Prometheus alerts to the right team and notifies on-call responders based on their schedules via email, text messages (SMS), phone calls (voice), Slack, Microsoft Teams, and iOS and Android push notifications, escalating alerts until they are acknowledged or closed. Zenduty gives your NOC, SRE, and application engineers detailed context around each Prometheus alert, along with playbooks and a complete incident command framework to triage, remediate, and resolve incidents quickly.

Whenever a Prometheus alert rule condition is triggered, an alert is sent to Zenduty, which creates an incident. When the condition returns to normal, Zenduty automatically resolves the incident.
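
Under the hood, Alertmanager delivers alerts to the Zenduty webhook URL using its standard webhook payload. The sketch below shows the rough shape of that payload (the actual payload is JSON; it is rendered here as YAML for readability, and the label values are illustrative). When the status changes from firing to resolved, Zenduty auto-resolves the corresponding incident.

        # Rough shape of the payload Alertmanager POSTs to the Zenduty webhook URL
        # (the real payload is JSON; values below are illustrative)
        version: "4"
        status: firing                   # becomes "resolved" once the rule condition clears
        receiver: web.hook
        groupLabels:
          alertname: HostOutOfMemory
        alerts:
        - status: firing
          labels:
            alertname: HostOutOfMemory
            severity: warning
            instance: node-1
          annotations:
            summary: Host out of memory (instance node-1)
          startsAt: "2024-01-01T10:00:00Z"
          endsAt: "0001-01-01T00:00:00Z" # set to the resolution time when the alert resolves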

You can also use Alert Rules to route specific Prometheus alerts to specific users, teams, or escalation policies, write suppression rules, and automatically add notes, responders, and incident tasks.

To integrate Prometheus with Zenduty, complete the following steps:

In Zenduty:

  1. To add a new Prometheus integration, go to Teams on Zenduty and click on the team you want to add the integration to.

  2. Next, go to Services and click on the relevant Service.

  3. Go to Integrations and then Add New Integration. Give it a name and select the application Prometheus from the dropdown menu.

  4. Go to Configure under your integration and copy the generated webhook URL.

In Prometheus:

  1. Ensure that both Prometheus and Prometheus Alertmanager are downloaded and accessible locally on your system. You can download both from the official Prometheus downloads page.

  2. Go to the Alertmanager folder and open alertmanager.yml. Add the webhook URL (copied in the earlier steps) under webhook_configs.
    Your alertmanager.yml file should now look like this:

		global:
		  resolve_timeout: 5m
		route:
		  group_by: ['alertname', 'cluster', 'service']
		  group_wait: 30s
		  group_interval: 5m
		  repeat_interval: 3h
		  receiver: 'web.hook'
		receivers:
		- name: 'web.hook'
		  webhook_configs:
		  - url: 'https://www.zenduty.com/api/integration/prometheus/8a02aa3b-4289-4360-9ad4-f31f40aea5ed/'
		inhibit_rules:
		  - source_match:
		      severity: 'critical'
		    target_match:
		      severity: 'warning'
		    equal: ['alertname', 'dev', 'instance']
  Tip: If you want to send alerts to multiple Zenduty services, you can define your alert rules in different files, for example first_rules.yml, second_rules.yml, and so on, and route each set of alerts to a different integration endpoint (see the routing sketch below).
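
A minimal Alertmanager routing sketch for that setup, assuming each Zenduty service has its own integration URL and your alert rules carry a distinguishing label (the service label values and receiver names here are illustrative):

        route:
          receiver: 'zenduty-default'
          routes:
          - match:
              service: payments              # illustrative label set in your alert rules
            receiver: 'zenduty-payments'
        receivers:
        - name: 'zenduty-default'
          webhook_configs:
          - url: '<first Zenduty integration URL>'
        - name: 'zenduty-payments'
          webhook_configs:
          - url: '<second Zenduty integration URL>'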

  3. In the Prometheus folder, open prometheus.yml. Add the new rule files you just created under rule_files and set the Alertmanager target. Zenduty groups Prometheus alerts based on the alertname parameter. A sample rule file is sketched after the snippet below.
    Your prometheus.yml file should look like this:

		# my global config
		global:
		  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
		  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
		  # scrape_timeout is set to the global default (10s).
		# Alertmanager configuration
		alerting:
		  alertmanagers:
		  - static_configs:
		    - targets: ["localhost:9093"]
		# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
		rule_files:
		  - "first_rules.yml"
		  # - "second_rules.yml"
		# A scrape configuration containing exactly one endpoint to scrape:
		# Here it's Prometheus itself.
		scrape_configs:
		  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
		  - job_name: 'prometheus'
		    # metrics_path defaults to '/metrics'
		    # scheme defaults to 'http'.
		    static_configs:
		    - targets: ['localhost:9090']
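
The rule file referenced above (first_rules.yml) could look like the following minimal sketch; the alert name, expression, and threshold are illustrative and should be replaced with your own rules:

        groups:
        - name: example
          rules:
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} is down"
              description: "{{ $labels.instance }} has been unreachable for more than 5 minutes."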
  4. Run Prometheus and Alertmanager using commands like:

    Prometheus: ./prometheus --config.file=prometheus.yml

    Alertmanager: ./alertmanager --config.file=alertmanager.yml

  5. Once Prometheus is running, you will be able to see the alert rules you configured in the Prometheus UI (by default at http://localhost:9090/alerts).

    When an alert fires, Zenduty will automatically create an incident.

  6. Prometheus is now integrated.

For Prometheus Docker installations

To scrape data from multiple services or pods, you need to write custom scrape configurations in Prometheus. Refer to the example below.

prometheus.yml: |-
    global:
      scrape_interval: 10s
      evaluation_interval: 10s
    rule_files:
      - /etc/prometheus/prometheus.rules
    alerting:
      alertmanagers:
      - scheme: http
        static_configs:
        - targets:
          - "alertmanager.monitoring.svc:9093"

    scrape_configs:

      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

In the above example, scrape_configs defines where data is scraped from, which in this case is the Kubernetes API server. You can define multiple jobs to scrape data from different services or pods. For annotation-based Prometheus scraping, you need to define prometheus.io/scrape: 'true' and prometheus.io/port: '9100' within the annotations section of the service or pod, as shown in the sketch below.
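
For example, a pod (or service) exposing metrics on port 9100 would be annotated roughly as below, and a matching scrape job using kubernetes_sd_configs would keep only annotated targets. This is a sketch of a common pattern; the job name is illustrative:

    # Pod (or service) metadata: opt in to annotation-based scraping
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"

    # Corresponding job under scrape_configs in prometheus.yml
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # scrape the port named in the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__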

/etc/prometheus/prometheus.rules is the location of the Prometheus rule file, an example of which is shown below:

prometheus.rules: |-
    groups:
    - name: Host-related-AZ1
      rules:
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20
        for: 10m
        labels:
          slack: "true"
          zenduty: "true"
          severity: warning
          team: devops
        annotations:
          summary: Host out of memory (instance {{ $labels.instance }})
          description: "Node memory is filling up (< 20% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

In the above example, alerts are organized into resource-related groups, and each group defines its own alerting rules. Make sure you add the appropriate labels to your rules, because Alertmanager will match on these labels when routing alerts to Zenduty.

Now, if a rule fires and Prometheus sends the alert to Alertmanager, Alertmanager must have an appropriate receiver configured to notify. For configuring Alertmanager with Zenduty and Slack, see the example below:

config.yml: |-
    global:
      resolve_timeout: 5m
    templates:
    - '/etc/alertmanager-templates/*.tmpl'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 20s
      group_interval: 2m
      repeat_interval: 5m
      receiver: default  # the default receiver
      routes:
      - receiver: zen_hook # condition-based receiver: alerts go to zen_hook only if the match conditions below are met
        match:
          team: devops
          zenduty: "true"
        group_wait: 20s
        repeat_interval: 2m
    receivers:
    - name: zen_hook    # zen_hook receiver definition
      webhook_configs:
      - url: <Zenduty_integration_url>
        send_resolved: true
    - name: 'default'   # default receiver definition
      slack_configs:
      - channel: '#default-infra-logs'
        send_resolved: true
        title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

You can add a proxy in the global settings if needed, as in the snippet below.

config.yml: |-
    global:
      resolve_timeout: 5m
      http_config:
        proxy_url: 'http://127.0.0.1:1025'

For more information, see the Alertmanager documentation.
