Managing alerts

Overview

This document describes how to configure monitoring alerts.

Access Alertmanager UI

To navigate to Alertmanager UI:

  • For the platform, on the left menu, click Centralized Monitoring, then click Alert Manager.
  • For a cluster with self-hosted monitoring, on the cluster overview page, under Features → Monitoring, click Alert Manager.

The Alertmanager page is displayed. To view the current Alertmanager configuration, go to the Status section and scroll down to Config.

Configuring Alertmanager via cluster specification

You can pass values to the Alertmanager configuration via the cluster specification:

  • For centralized monitoring, this is the specification of the platform’s core cluster (the one in the kublr-system space).
  • For self-hosted monitoring, this is the specification of the cluster itself.

Receivers

In the specification, use the spec → features → monitoring → values → alertmanager section. Under it, in the config section, you can set the alert receivers:

spec:
    ...
    features:
        ...
        monitoring:
            chart:
                version: 1.22.2-14
            enabled: true
            platform: ...
            values:
                alertmanager:
                    alerts: ...
                    config:
                        default_receiver: name-001
                        inhibit_rules: ...
                        receivers: |
                          - name: name-001
                            slack_configs:
                              - api_url: 'https://hooks.slack.com/services/{address}'
                                channel: '#{channelname}'
                                send_resolved: true
                grafana: ...
                kubestatemetrics: ...
                prometheus: ...

Note on the configuration above: the default_receiver receives alerts that do not match any of the configured routes (see the Routes section below).

See information about receivers in the Prometheus Alertmanager documentation.

You can configure any number of receivers of any type. Below is an example:

config:
  default_receiver: stg
  receivers: |
    - name: pagerduty
      pagerduty_configs:
      - routing_key: '*************'
        send_resolved: false
    - name: stg
      slack_configs:
        - api_url: 'https://hooks.slack.com/services/********'
          channel: '#alerts-stg'
          send_resolved: true
    - name: prod
      slack_configs:
        - api_url: 'https://hooks.slack.com/services/********'
          channel: '#alerts'
          send_resolved: true
    - name: email
      email_configs:
      - to: fake@kublr.com
        from: fake@kublr.com
        smarthost: smtp.gmail.com:587
        auth_username: fake@kublr.com
        auth_identity: fake@kublr.com
        auth_password: "********"
        html: '{{ template "email.google.html" . }}'

Routes

Every alert enters the routing tree at the configured top-level route, which must match all alerts (i.e. not have any configured matchers). It then traverses the child nodes. If continue is set to false, it stops after the first matching child. If continue is true on a matching node, the alert will continue matching against subsequent siblings. If an alert does not match any children of a node (no matching child nodes, or none exist), the alert is handled based on the configuration parameters of the current node.
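
For illustration, below is a minimal plain Alertmanager route tree, independent of the Kublr cluster specification; the receiver names are hypothetical:

route:
  receiver: default            # top-level route: no matchers, catches every alert
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true           # keep evaluating the following sibling routes
    - match:
        team: frontend
      receiver: frontend-slack # continue defaults to false: stop at the first match

In the Kublr configuration, the child routes above appear to correspond to the routes value described below, and the receiver of the top-level route to default_receiver.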

See information about routes in the Prometheus Alertmanager documentation.

Below is an example:

routes: |
    - match:
        kublr_cluster: production
        severity: 'critical'
      continue: true
      receiver: pagerduty
    - match_re:
        alertname: "CMTargetDown|TargetDown"
        kublr_cluster: "kcp-kublr-ovh-local"
      continue: false
      receiver: stg
    - match:
        alertname: "clusterMemoryUsageLow"
      repeat_interval: '168h'
      continue: false
      receiver: qa
    - match_re:
        kublr_cluster: "kcp-kublr-ovh-local|kublr-prod|mirror-0-prod|mirror-1-prod"
      receiver: prod
    - match:
        kublr_cluster: "kublr-build"
      receiver: build
    - match_re:
        kublr_cluster: "kublr-qa|kublr-dev"
      receiver: qa

In the example above:

  • The match elements define a set of labels and their values. If all label values of an alert match, the alert is routed to the corresponding receiver.
  • The match_re elements work like match but compare label values against regular expressions.
  • The continue element defines whether the subsequent routes should still be evaluated after the current one has matched.

Alerts

Kublr provides a default set of predefined alerts. They are stored in Prometheus as alert rules and are listed in the Alerts section of the Prometheus UI.

You can additionally create your own alerts via the cluster specification. In the specification, use the spec → features → monitoring → values → alertmanager → alerts section. For example:

monitoring:
    ...
    values:
        alertmanager:
            alerts:
                - alert: LoggingNoNewDocuments
                  annotations:
                      description: No new documents in ELK for 15 min!
                      summary: '{{$labels.kublr_space}}.{{$labels.kublr_cluster}} - LoggingNoNewDocuments'
                  expr: 'sum by (kublr_space, kublr_cluster, cluster) (increase(elasticsearch_indices_docs{es_data_node="true"}[15m])) < 1'
                  labels:
                      feature: centralizeLogging
                      severity: warning
            config:
                ...
The specification above adds the following alert rule to Prometheus:

Prometheus - alert configured from cluster specification
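
For reference, in Prometheus rule file format, the alert defined above corresponds roughly to the following rule; the group name shown here is an assumption:

groups:
  - name: kublr-custom-alerts          # group name is an assumption
    rules:
      - alert: LoggingNoNewDocuments
        expr: sum by (kublr_space, kublr_cluster, cluster) (increase(elasticsearch_indices_docs{es_data_node="true"}[15m])) < 1
        labels:
            feature: centralizeLogging
            severity: warning
        annotations:
            description: No new documents in ELK for 15 min!
            summary: '{{$labels.kublr_space}}.{{$labels.kublr_cluster}} - LoggingNoNewDocuments'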

Prometheus specific parameters

In the spec → features → monitoring → values → prometheus section of the cluster specification, you can edit various Prometheus parameters.

Below is an example:

monitoring:
      ...
      values:
        ...
        prometheus:
          config:
            retentionPeriod: 5d
            extraRulesConfigmaps:
                - fileMask: '*.yaml'
                  name: prometheus-sloth-rules
          persistence:
            enabled: true
            preconfiguredPersistentVolumeClaim: prometheus
            size: 128Gi
          resources:
            limits:
              memory: 10G
            requests:
              memory: 2G
          rules:
            overrides:
              - _disable: true
                _match:
                  alertRegexp: clusterMemoryUsageLowInfo|clusterMemoryUsageWarning|instanceMemoryUsageWarning

In the configuration above, retentionPeriod defines how long the Prometheus metrics are stored; the default value is 7d (7 days). See detailed information about extraRulesConfigmaps in the section below.

Extra rule configmaps

The extraRulesConfigmaps section allows customizing Prometheus using Kubernetes config maps.

See how the prometheus-sloth-rules config map from the configuration in the previous section looks in the Kubernetes dashboard:

Prometheus - alert configured via Kubernetes config map

The content of the config map is added as a file to the Prometheus pod of the cluster. Note that several files can be specified within one config map, and several config maps may be added via the cluster specification.

The content of a config map is custom and should be prepared by you. Note that it must be in a format compatible with Prometheus rule files.
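
As a sketch, assuming the rules should be supplied via the prometheus-sloth-rules config map referenced in the example above, the config map could look as follows; the namespace, file name, and rule contents are placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-sloth-rules         # must match the name in extraRulesConfigmaps
  namespace: kublr                     # assumption: the namespace where the Prometheus pod runs
data:
  slo-rules.yaml: |                    # picked up by the '*.yaml' fileMask
    groups:
      - name: example-slo              # hypothetical rule group
        rules:
          - alert: ExampleSLOBurnRateHigh
            expr: vector(0) > 0        # placeholder expression; replace with a real query
            labels:
                severity: warning
            annotations:
                summary: Example SLO burn rate is too high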

See an extended example on the Kublr support portal: Monitoring: using slok/sloth for SLOs

Rule overrides

Starting from Kublr version 1.22, the cluster specification contains the spec → features → monitoring → values → prometheus → rules → overrides section. This section allows overriding the predefined alert rules provided by Kublr.

Below are two examples:

monitoring:
      ...
      values:
        ...
        prometheus:
          ...
          rules:
            overrides:
              - _disable: true
                _match:
                  alertRegexp: clusterMemoryUsageLowInfo|clusterMemoryUsageWarning|instanceMemoryUsageWarning

In the first example above, all alert rules matching the regular expression (which lists three alert rule names) will be disabled.

prometheus:
  rules:
    overrides:

      # disable all default Kublr rules except one
      - _disable: true
        _match:
          alertNot: RabbitmqUnactiveExchange

      # override specific rule using templating (with access to the original rule properties)
      - alert: RabbitmqUnactiveExchange
        for: 35m-{{.Rule.for}}
        hjsfdk: '{{ .Rule }}'
        _template: true

Inhibition rules

Kublr supports Alert Manager inhibition rules.

The inhibition rules mute alerts (target) that match one set of matchers when alerts (source) matching another set of matchers exist. Both the target and source alerts must have the same label values for the label names listed in equal.

To find the default Kublr inhibition rules, open the Alert Manager UI, go to the Status tab, and find the inhibit_rules section in the displayed Config.

There are three default inhibition rules:

  1. The one saying that any “critical” severity alert inhibits the non-critical ones (for example, warnings) inside a specific Kublr cluster/space if the source and target alerts belong to the same silence group.

    - source_matchers:
      - kublr_cluster=~".+"
      - kublr_space=~".+"
      - severity="critical"
      - silence_group=~".+"
      target_matchers:
      - severity!="critical"
      equal:
      - silence_group
      - kublr_space
      - kublr_cluster
    

    Note that silence_group is an alert label. You can find alerts with specified silence groups in (Centralized) Monitoring → Prometheus → Alerts by entering silence_group in the search field.

  2. The one saying that any “warning” severity alert inhibits the “info” alerts inside a specific Kublr cluster/space if the source and target alerts belong to the same silence group.

    - source_matchers:
      - kublr_cluster=~".+"
      - kublr_space=~".+"
      - severity="warning"
      - silence_group=~".+"
      target_matchers:
      - severity="info"
      equal:
      - silence_group
      - kublr_space
      - kublr_cluster
    
  3. The one saying that any “workload” level alert inhibits the non-critical severity alerts from the “replicaset” and “pod” levels inside a specific Kublr cluster/space.

    - source_matchers:
      - kublr_cluster=~".+"
      - kublr_space=~".+"
      - level="workload"
      target_matchers:
      - level=~"replicaset|pod"
      - severity!="critical"
      equal:
      - kublr_space
      - kublr_cluster
    
  • You can use the Kublr default inhibition rules for your own alerts by assigning the corresponding set of labels (see the sketch after this list).
  • To learn how to change the default inhibition rules or add new ones, refer to the Kublr support portal → “Monitoring frequently requested configuration changes” → Add inhibition rules.
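
For example, a pair of custom alerts could rely on the first default inhibition rule by sharing a silence_group label. Below is a minimal sketch, assuming the kublr_cluster and kublr_space labels are attached to the alerts by Kublr; the alert names, expressions, and metric names are hypothetical:

alertmanager:
    alerts:
        - alert: MyServiceDown                     # hypothetical critical (source) alert
          expr: 'up{job="my-service"} == 0'
          labels:
              severity: critical
              silence_group: my-service
        - alert: MyServiceErrorRateHigh            # hypothetical warning (target) alert
          expr: 'rate(my_service_errors_total[5m]) > 0.1'
          labels:
              severity: warning
              silence_group: my-service            # same silence group as the critical alert

When both alerts fire in the same cluster and space, the first default inhibition rule mutes the warning for as long as the critical alert is active.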

See also