Alert Name | Description |
---|---|
authGetTokenFail | Failed-to-get-token error rate for service is heightened on instance over last 3 min
clusterCpuUsageCritical | CPU usage on instance is higher than 90% for 8 min |
clusterCpuUsageWarning | CPU usage on instance is higher than 70% for 8 min |
clusterMemoryUsageCritical | Memory usage on the cluster is higher than 90% for 5 min |
clusterMemoryUsageWarning | Memory usage on the cluster is higher than 75% for 5 min |
clusterMemoryUsageLowInfo | Memory usage on the cluster is lower than 60% for 15 min |
daemonSetMisscheduledNumberHigh | Number of misscheduled pods is high in cluster for DaemonSet for 7 min |
daemonSetReadyNumberLow | Number of ready pods is low in cluster for DaemonSet for 7 min |
daemonSetScheduledNumberLow | Number of scheduled pods is low in cluster for DaemonSet for 7 min |
deploymentAvailableNumberLow | Number of available replicas is low in cluster for Deployment for 7 min |
deploymentReplicaNumberLow | Number of replicas is low in cluster for Deployment for 7 min |
deploymentUnavailableNumberHigh | Number of unavailable replicas is high in cluster for Deployment for 7 min |
etcdInstanceIsLagging | etcd instance in cluster has significant difference between applied and committed transactions |
etcdInstanceDown | etcd instance is down for over 1 min in cluster |
etcdLeaderChangesTooOften | etcd cluster leader changed more than 3 times in 1 hour |
etcdHighNumberOfFailedProposals | etcd cluster has more than 3 failed transaction proposals in 1 hour |
etcdFdExhaustionClose | etcd cluster will run out of file descriptors in 4 hours or less |
etcdHighNumberOfFailedGRPCRequests | Number of failed gRPC requests in etcd cluster is high
ElasticsearchHeapUsageTooHigh | The Elasticsearch heap usage is over 95% for 5 min
ElasticsearchHeapUsageWarning | The Elasticsearch heap usage is over 85% for 5 min
ElasticsearchDiskOutOfSpace | The Elasticsearch disk usage is over 85%
ElasticsearchDiskSpaceLow | The Elasticsearch disk usage is over 80% for 2 min
ElasticsearchClusterRed | The Elasticsearch cluster has been in Red status for 5 min
ElasticsearchClusterYellow | The Elasticsearch cluster has been in Yellow status for 6 hours
ElasticsearchRelocatingShards | Elasticsearch is relocating shards
ElasticsearchRelocatingShardsTooLong | Elasticsearch has been relocating shards for 12 hours
ElasticsearchInitializingShards | Elasticsearch is initializing shards
ElasticsearchInitializingShardsTooLong | Elasticsearch has been initializing shards for 4 hours
ElasticsearchUnassignedShards | Elasticsearch has unassigned shards
ElasticsearchPendingTasks | Elasticsearch has pending tasks; the cluster is working slowly
ElasticsearchNoNewDocuments | No new documents in Elasticsearch for 15 min
ElasticsearchCountOfJVMGCRuns | The Elasticsearch node has more than 10 JVM GC runs per second
ElasticsearchGCRunTime | The Elasticsearch node GC run time is longer than 0.5 sec
ElasticsearchJsonParseFailures | The Elasticsearch node has JSON parse failures
ElasticsearchBreakersTripped | The Elasticsearch node has tripped circuit breakers (> 0)
ElasticsearchClusterHealthDown | The Elasticsearch cluster health is degrading
FluentbitProblem | Fluent Bit pod in Kublr cluster has not processed any bytes for at least 15 minutes |
FluentdProblem | Fluentd pod in Kublr cluster has not processed any records for at least 15 minutes |
instanceDiskSpaceWarning | Device on instance in cluster has little space left for over 10 min |
instanceDiskSpaceCritical | Device on instance in cluster has little space left for over 10 min |
instanceDiskInodesWarning | Device on instance in cluster has few inodes left for over 10 min |
instanceDiskInodesCritical | Device on instance in cluster has few inodes left for over 10 min |
instanceDown | Instance is down for over 1 min in cluster |
instanceMemoryUsageWarning | Memory usage on instance in cluster is higher than 85% for 5 min |
instanceMemoryUsageCritical | Memory usage on instance in cluster is higher than 95% for 5 min |
instanceCpuUsageWarning | CPU usage on instance in cluster is higher than 80% for 8 min |
instanceCpuUsageCritical | CPU usage on instance in cluster is higher than 95% for 8 min |
k8sApiServerDown | Kubernetes API server is down for over 1 min in cluster |
KubeApiServerAbsent | No kube-apiservers are available in cluster for 1 min |
kubeletDockerOperationErrors | Docker operation error rate is heightened on instance over last 10 min |
KubeMetricServerFailure | Kube-metric-server is unavailable in cluster for 1 min |
kublrStatusNotReady | Node condition is on for 5 min in cluster; the condition status is reported in the alert
LogMoverProblem | Logs mover for cluster stopped sending messages to ELK
LogstashProblem | No logs filtered for last 10 minutes on Logstash. Check the centralized logging system!
LogstashDeadLetterQueue | Logstash failed to pass messages to Elasticsearch. If there are a lot of messages, *.log files will be removed in 10 minutes
nodeStatusCondition | Node condition is on for 7 min in cluster
nodeStatusNotReady | Node status is not ready for 7 min in cluster |
PersistentVolumeUsageForecast | Based on recent sampling, the persistent volume claimed by PVC in namespace is expected to fill up within four days in cluster
PersistentVolumeUsageWarning | The persistent volume claimed by PVC in namespace has reached the warning usage threshold in cluster
PersistentVolumeUsageCritical | The persistent volume claimed by PVC in namespace has reached the critical usage threshold in cluster
podPhaseIncorrect | Pod is stuck in the wrong phase for 7 min in cluster
podStatusNotReady | Pod is not ready for 7 min in cluster
podStatusNotScheduled | Pod is not scheduled for 7 min in cluster
podContainerWaiting | Pod container is waiting for 7 min in cluster
podContainerRestarting | Pod container is restarting for 7 min in cluster
podPhaseEvictedCount | Pods with Evicted status detected for 5 min in cluster
promRuleEvalFailures | Prometheus failed to evaluate a rule in cluster
replicaSetReplicaNumberLow | Number of replicas is low in cluster for ReplicaSet for 7 min |
replicaSetFullyLabeledNumberLow | Number of fully labeled replicas is low in cluster for ReplicaSet for 7 min |
replicaSetReadyNumberLow | Number of ready replicas is low in cluster for ReplicaSet for 7 min |
replicationControllerReplicaNumberLow | Number of replicas is low in cluster for ReplicationController for 7 min |
replicationControllerFullyLabeledNumberLow | Number of fully labeled replicas is low in cluster for ReplicationController for 7 min |
replicationControllerReadyNumberLow | Number of ready replicas is low in cluster for ReplicationController for 7 min |
replicationControllerAvailableNumberLow | Number of available replicas is low in cluster for ReplicationController for 7 min |
RabbitmqDown | RabbitMQ node down in cluster |
RabbitmqTooManyMessagesInQueue | RabbitMQ Queue is filling up (> 500000 msgs) in cluster |
RabbitmqNoConsumer | RabbitMQ Queue has no consumer in cluster |
RabbitmqNodeDown | Less than 1 node is running in RabbitMQ cluster |
RabbitmqNodeNotDistributed | RabbitMQ distribution link state is not 'up' in cluster
RabbitmqInstancesDifferentVersions | Running different versions of RabbitMQ in the same cluster can lead to failures
RabbitmqMemoryHigh | RabbitMQ node uses more than 90% of allocated RAM in cluster
RabbitmqFileDescriptorsUsage | RabbitMQ node uses more than 90% of file descriptors in cluster
RabbitmqTooManyUnackMessages | RabbitMQ has too many unacknowledged messages in cluster
RabbitmqTooManyConnections | The total number of connections on a RabbitMQ node is too high in cluster
RabbitmqNoQueueConsumer | RabbitMQ queue has less than 1 consumer in cluster |
RabbitmqUnroutableMessages | RabbitMQ queue has unroutable messages in cluster |
SSLCertExpiredWarning | SSL certificate for host in cluster will expire in less than 7 days |
SSLCertExpiredCritical | SSL certificate for host in cluster has expired |
statefulSetReadyNumberLow | Number of ready replicas is low in cluster for StatefulSet for 7 min
TargetDown | Prometheus targets are down: some of the job's targets in cluster are down
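
The trigger conditions above are implemented as Prometheus alerting rules. As a minimal sketch of what such a rule can look like (based on standard node_exporter metrics; this is an illustration, not the exact rule shipped with Kublr), an alert similar to `instanceCpuUsageCritical` could be defined as follows:

```yaml
# Illustrative sketch only; not the exact rule shipped with Kublr.
groups:
  - name: instance.rules
    rules:
      - alert: instanceCpuUsageCritical
        # Percentage of non-idle CPU time per instance, averaged over 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 8m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage on {{ $labels.instance }} is higher than 95% for 8 min"
```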
Fired alerts may be found in the Prometheus Alerts menu or in the Grafana Alerts dashboard.

In order to send alert notifications to a Slack channel, use the following Alertmanager configuration:
```yaml
alertmanager:
  config:
    default_receiver: slack
    receivers: |
      - name: slack
        slack_configs:
          - api_url: '<slack_api_url>'
            channel: '<channel_name>'
```
or deploy the Kublr Platform by adding the above code to the `spec.features.monitoring.values` section of the cluster specification:
```yaml
spec:
  features:
    monitoring:
      enabled: true
      platform:
        enabled: true
        grafana:
          enabled: true
          persistent: true
          size: 128G
        prometheus:
          persistent: true
          size: 128G
        alertmanager:
          enabled: true
      values:
        alertmanager:
          config:
            default_receiver: slack
            receivers: |
              - name: slack
                slack_configs:
                  - api_url: '<slack_api_url>'
                    channel: '<channel_name>'
```
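The `default_receiver` and `receivers` values above are intended to yield an Alertmanager configuration roughly equivalent to the sketch below; the actual rendering is performed by the Kublr monitoring chart, and the `route` block is shown here only for illustration:

```yaml
# Approximate effective Alertmanager configuration (illustration only).
route:
  receiver: slack            # corresponds to default_receiver: slack
receivers:
  - name: slack
    slack_configs:
      - api_url: '<slack_api_url>'
        channel: '<channel_name>'
```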
Please follow the Alertmanager receiver configuration documentation for more information.
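For example, a `slack_configs` entry accepts standard Alertmanager Slack options such as `send_resolved`, `title`, and `text`. Assuming the `receivers` block is passed through to Alertmanager verbatim (as the literal block scalar above suggests), it could be extended like this:

```yaml
receivers: |
  - name: slack
    slack_configs:
      - api_url: '<slack_api_url>'
        channel: '<channel_name>'
        # Also send a notification when an alert is resolved
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
```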