Default AlertManager rules in Kublr

Alert Name: Description

authGetTokenFail: Failed-to-get-token error rate for a service is heightened on an instance over the last 3 min
clusterCpuUsageCritical: CPU usage on an instance is higher than 90% for 8 min
clusterCpuUsageWarning: CPU usage on an instance is higher than 70% for 8 min
clusterMemoryUsageCritical: Memory usage on the cluster is higher than 90% for 5 min
clusterMemoryUsageWarning: Memory usage on the cluster is higher than 75% for 5 min
clusterMemoryUsageLowInfo: Memory usage on the cluster is lower than 60% for 15 min
daemonSetMisscheduledNumberHigh: Number of misscheduled pods for a DaemonSet in the cluster is high for 7 min
daemonSetReadyNumberLow: Number of ready pods for a DaemonSet in the cluster is low for 7 min
daemonSetScheduledNumberLow: Number of scheduled pods for a DaemonSet in the cluster is low for 7 min
deploymentAvailableNumberLow: Number of available replicas for a Deployment in the cluster is low for 7 min
deploymentReplicaNumberLow: Number of replicas for a Deployment in the cluster is low for 7 min
deploymentUnavailableNumberHigh: Number of unavailable replicas for a Deployment in the cluster is high for 7 min
etcdInstanceIsLagging: An etcd instance in the cluster has a significant difference between applied and committed transactions
etcdInstanceDown: An etcd instance in the cluster is down for over 1 min
etcdLeaderChangesTooOften: The etcd cluster leader changed more than 3 times in 1 hour
etcdHighNumberOfFailedProposals: The etcd cluster has more than 3 failed transaction proposals in 1 hour
etcdFdExhaustionClose: The etcd cluster will run out of file descriptors in 4 hours or less
etcdHighNumberOfFailedGRPCRequests: The number of failed gRPC requests in the etcd cluster is high
ElasticsearchHeapUsageTooHigh: Elasticsearch heap usage is over 95% for 5 min
ElasticsearchHeapUsageWarning: Elasticsearch heap usage is over 85% for 5 min
ElasticsearchDiskOutOfSpace: Elasticsearch disk usage is over 85%
ElasticsearchDiskSpaceLow: Elasticsearch disk usage is over 80% for 2 min
ElasticsearchClusterRed: The Elasticsearch cluster has been in Red status for 5 min
ElasticsearchClusterYellow: The Elasticsearch cluster has been in Yellow status for 6 hours
ElasticsearchRelocatingShards: Elasticsearch is relocating shards
ElasticsearchRelocatingShardsTooLong: Elasticsearch has been relocating shards for 12 hours
ElasticsearchInitializingShards: Elasticsearch is initializing shards
ElasticsearchInitializingShardsTooLong: Elasticsearch has been initializing shards for 4 hours
ElasticsearchUnassignedShards: Elasticsearch has unassigned shards
ElasticsearchPendingTasks: Elasticsearch has pending tasks; the cluster is working slowly
ElasticsearchNoNewDocuments: No new documents in Elasticsearch for 15 min
ElasticsearchCountOfJVMGCRuns: An Elasticsearch node has more than 10 JVM GC runs per second
ElasticsearchGCRunTime: An Elasticsearch node has a GC run time longer than 0.5 sec
ElasticsearchJsonParseFailures: An Elasticsearch node has JSON parse failures
ElasticsearchBreakersTripped: An Elasticsearch node has tripped breakers (> 0)
ElasticsearchClusterHealthDown: The Elasticsearch cluster health is degrading
FluentbitProblem: A Fluent Bit pod in the Kublr cluster has not processed any bytes for at least 15 minutes
FluentdProblem: A Fluentd pod in the Kublr cluster has not processed any records for at least 15 minutes
instanceDiskSpaceWarning: A device on an instance in the cluster has little space left for over 10 min
instanceDiskSpaceCritical: A device on an instance in the cluster has little space left for over 10 min
instanceDiskInodesWarning: A device on an instance in the cluster has few inodes left for over 10 min
instanceDiskInodesCritical: A device on an instance in the cluster has few inodes left for over 10 min
instanceDown: An instance in the cluster is down for over 1 min
instanceMemoryUsageWarning: Memory usage on an instance in the cluster is higher than 85% for 5 min
instanceMemoryUsageCritical: Memory usage on an instance in the cluster is higher than 95% for 5 min
instanceCpuUsageWarning: CPU usage on an instance in the cluster is higher than 80% for 8 min
instanceCpuUsageCritical: CPU usage on an instance in the cluster is higher than 95% for 8 min
k8sApiServerDown: The Kubernetes API server in the cluster is down for over 1 min
KubeApiServerAbsent: No kube-apiservers are available in the cluster for 1 min
kubeletDockerOperationErrors: The Docker operation error rate is heightened on an instance over the last 10 min
KubeMetricServerFailure: Kube-metric-server is unavailable in the cluster for 1 min
kublrStatusNotReady: A node condition has been active for 5 min in the cluster (the alert reports the condition status)
LogMoverProblem: The logs mover for the cluster has stopped sending messages to ELK
LogstashProblem: No logs have been filtered for the last 10 minutes on Logstash; check the centralized logging system
LogstashDeadLetterQueue: Logstash failed to pass messages to Elasticsearch; if there are a lot of messages, *.log files will be removed in 10 minutes
nodeStatusCondition: A node condition is not ready for 7 min in the cluster
nodeStatusNotReady: Node status is not ready for 7 min in the cluster
PersistentVolumeUsageForecast: Based on recent sampling, the persistent volume claimed by a PVC in a namespace is expected to fill up within four days in the cluster
PersistentVolumeUsageWarning: The persistent volume claimed by a PVC in a namespace reports high usage in the cluster (warning level)
PersistentVolumeUsageCritical: The persistent volume claimed by a PVC in a namespace reports high usage in the cluster (critical level)
podPhaseIncorrect: A pod has been stuck in a wrong phase for 7 min in the cluster
podStatusNotReady: A pod is not ready (the alert reports the failing condition) for 7 min in the cluster
podStatusNotScheduled: A pod is not scheduled (the alert reports the failing condition) for 7 min in the cluster
podContainerWaiting: A pod container has been waiting for 7 min in the cluster
podContainerRestarting: A pod container has been restarting for 7 min in the cluster
podPhaseEvictedCount: Pods with Evicted status have been detected for 5 min in the cluster
promRuleEvalFailures: Prometheus failed to evaluate a rule in the cluster
replicaSetReplicaNumberLow: Number of replicas for a ReplicaSet in the cluster is low for 7 min
replicaSetFullyLabeledNumberLow: Number of fully labeled replicas for a ReplicaSet in the cluster is low for 7 min
replicaSetReadyNumberLow: Number of ready replicas for a ReplicaSet in the cluster is low for 7 min
replicationControllerReplicaNumberLow: Number of replicas for a ReplicationController in the cluster is low for 7 min
replicationControllerFullyLabeledNumberLow: Number of fully labeled replicas for a ReplicationController in the cluster is low for 7 min
replicationControllerReadyNumberLow: Number of ready replicas for a ReplicationController in the cluster is low for 7 min
replicationControllerAvailableNumberLow: Number of available replicas for a ReplicationController in the cluster is low for 7 min
RabbitmqDown: A RabbitMQ node is down in the cluster
RabbitmqTooManyMessagesInQueue: A RabbitMQ queue is filling up (> 500000 messages) in the cluster
RabbitmqNoConsumer: A RabbitMQ queue has no consumer in the cluster
RabbitmqNodeDown: Less than 1 node is running in the RabbitMQ cluster
RabbitmqNodeNotDistributed: The RabbitMQ distribution link state is not 'up' in the cluster
RabbitmqInstancesDifferentVersions: Different versions of RabbitMQ are running in the same cluster, which can lead to failures
RabbitmqMemoryHigh: A RabbitMQ node uses more than 90% of its allocated RAM in the cluster
RabbitmqFileDescriptorsUsage: A RabbitMQ node uses more than 90% of its file descriptors in the cluster
RabbitmqTooManyUnackMessages: RabbitMQ has too many unacknowledged messages in the cluster
RabbitmqTooManyConnections: The total number of connections on a RabbitMQ node in the cluster is too high
RabbitmqNoQueueConsumer: A RabbitMQ queue has less than 1 consumer in the cluster
RabbitmqUnroutableMessages: A RabbitMQ queue has unroutable messages in the cluster
SSLCertExpiredWarning: The SSL certificate for a host in the cluster will expire in less than 7 days
SSLCertExpiredCritical: The SSL certificate for a host in the cluster has expired
statefulSetReadyNumberLow: Number of ready replicas for a StatefulSet in the cluster is low for 7 min
TargetDown: Prometheus targets are down: a percentage of a job's targets in the cluster are down
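
Each of these alerts is defined as a Prometheus alerting rule: an expression, a "for" duration, and labels and annotations that Alertmanager uses for routing and notifications. As a rough illustration only, not the exact rule shipped with Kublr, the instanceDown alert above could look roughly like the sketch below; the expression, label names, and annotation text are assumptions:

groups:
  - name: example-instance-alerts
    rules:
      # Fire when a scrape target has been unreachable for more than 1 minute
      # (matches the "instance is down for over 1 min" description above).
      - alert: instanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."

The actual expressions, thresholds, and labels used by Kublr may differ; the rules loaded into the cluster's Prometheus instance can be inspected in its web UI under Status | Rules.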

Customizing Alerts

Fired alerts may be found in the Prometheus | Alerts menu

(Screenshot: fired alerts in Prometheus)

or in the Grafana | Alerts dashboard.

(Screenshot: fired alerts in Grafana)

To send alert notifications to a Slack channel:

  • create a webhook (Slack Settings | Add Web App | Incoming Webhooks | Add Incoming Webhooks Integration)
  • deploy/redeploy the kublr-monitoring package with the following values:
alertmanager:
  config:
    default_receiver: slack
    receivers: |
      - name: slack
        slack_configs:
          - api_url: '<slack_api_url>'
            channel: '<channel_name>'

or deploy the Kublr platform by adding the above code to the spec.features.monitoring.values section of the cluster specification:

spec:
  features:
    monitoring:
      enabled: true
      platform:
        enabled: true
      grafana:
        enabled: true
        persistent: true
        size: 128G
      prometheus:
        persistent: true
        size: 128G
      alertmanager:
        enabled: true
      values:
        alertmanager:
          config:
            default_receiver: slack
            receivers: |
              - name: slack
                slack_configs:
                  - api_url: '<slack_api_url>'
                    channel: '<channel_name>'

Please follow the Alertmanager receiver configuration documentation for more information.
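
The receivers value is passed through to Alertmanager, so any receiver type from the standard Alertmanager receiver configuration can be listed there, not only Slack. The sketch below is an assumption based on that standard configuration, with placeholder addresses; it adds an email receiver alongside the Slack one, while default_receiver still selects which receiver gets alerts by default:

alertmanager:
  config:
    default_receiver: slack
    receivers: |
      - name: slack
        slack_configs:
          - api_url: '<slack_api_url>'
            channel: '<channel_name>'
      - name: email
        email_configs:
          - to: '<alerts_mailbox>'
            from: '<sender_address>'
            smarthost: '<smtp_host>:587'
            auth_username: '<smtp_user>'
            auth_password: '<smtp_password>'

Routing specific alerts to different receivers is done with Alertmanager route configuration, which may require additional chart values not shown here.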