Alert Name | Description |
---|---|
authGetTokenFail | Failed-to-get-token error rate for service is heightened on instance over last 3 min
clusterCpuUsageCritical | CPU usage on instance is higher than 90% for 8 min |
clusterCpuUsageWarning | CPU usage on instance is higher than 70% for 8 min |
clusterMemoryUsageCritical | Memory usage on the cluster is higher than 90% for 5 min |
clusterMemoryUsageWarning | Memory usage on the cluster is higher than 75% for 5 min |
clusterMemoryUsageLowInfo | Memory usage on the cluster is lower than 60% for 15 min |
daemonSetMisscheduledNumberHigh | Number of misscheduled pods is high in cluster for DaemonSet for 7 min |
daemonSetReadyNumberLow | Number of ready pods is low in cluster for DaemonSet for 7 min |
daemonSetScheduledNumberLow | Number of scheduled pods is low in cluster for DaemonSet for 7 min |
deploymentAvailableNumberLow | Number of available replicas is low in cluster for Deployment for 7 min |
deploymentReplicaNumberLow | Number of replicas is low in cluster for Deployment for 7 min |
deploymentUnavailableNumberHigh | Number of unavailable replicas is high in cluster for Deployment for 7 min |
etcdInstanceIsLagging | etcd instance in cluster has significant difference between applied and committed transactions |
etcdInstanceDown | etcd instance is down for over 1 min in cluster |
etcdLeaderChangesTooOften | etcd cluster leader changed more than 3 times in 1 hour |
etcdHighNumberOfFailedProposals | etcd cluster has more than 3 failed transaction proposals in 1 hour |
etcdFdExhaustionClose | etcd cluster will run out of file descriptors in 4 hours or less |
etcdHighNumberOfFailedGRPCRequests | Number of failed gRPC requests in etcd cluster is high
ElasticsearchHeapUsageTooHigh | The Elasticsearch heap usage is over 95% for 5 min
ElasticsearchHeapUsageWarning | The Elasticsearch heap usage is over 85% for 5 min
ElasticsearchDiskOutOfSpace | The Elasticsearch disk usage is over 85%
ElasticsearchDiskSpaceLow | The Elasticsearch disk usage is over 80% for 2 min
ElasticsearchClusterRed | The Elasticsearch cluster has been in Red status for 5 min
ElasticsearchClusterYellow | The Elasticsearch cluster has been in Yellow status for 6 hours
ElasticsearchRelocatingShards | Elasticsearch is relocating shards
ElasticsearchRelocatingShardsTooLong | Elasticsearch has been relocating shards for 12 hours
ElasticsearchInitializingShards | Elasticsearch is initializing shards
ElasticsearchInitializingShardsTooLong | Elasticsearch has been initializing shards for 4 hours
ElasticsearchUnassignedShards | Elasticsearch has unassigned shards
ElasticsearchPendingTasks | Elasticsearch has pending tasks; the cluster is working slowly
ElasticsearchNoNewDocuments | No new documents in Elasticsearch for 15 min
ElasticsearchCountOfJVMGCRuns | The Elasticsearch node has more than 10 JVM GC runs per second
ElasticsearchGCRunTime | The Elasticsearch node GC run time is longer than 0.5 sec
ElasticsearchJsonParseFailures | The Elasticsearch node has JSON parse failures
ElasticsearchBreakersTripped | The Elasticsearch node has tripped circuit breakers (> 0)
ElasticsearchClusterHealthDown | The Elasticsearch cluster health is degrading
FluentbitProblem | Fluent Bit pod in Kublr cluster has not processed any bytes for at least 15 minutes |
FluentdProblem | Fluentd pod in Kublr cluster has not processed any records for at least 15 minutes |
instanceDiskSpaceWarning | Device on instance in cluster has little space left for over 10 min |
instanceDiskSpaceCritical | Device on instance in cluster has little space left for over 10 min |
instanceDiskInodesWarning | Device on instance in cluster has few inodes left for over 10 min |
instanceDiskInodesCritical | Device on instance in cluster has few inodes left for over 10 min |
instanceDown | Instance is down for over 1 min in cluster |
instanceMemoryUsageWarning | Memory usage on instance in cluster is higher than 85% for 5 min |
instanceMemoryUsageCritical | Memory usage on instance in cluster is higher than 95% for 5 min |
instanceCpuUsageWarning | CPU usage on instance in cluster is higher than 80% for 8 min |
instanceCpuUsageCritical | CPU usage on instance in cluster is higher than 95% for 8 min |
k8sApiServerDown | Kubernetes API server is down for over 1 min in cluster |
KubeApiServerAbsent | No kube-apiservers are available in cluster for 1 min |
kubeletDockerOperationErrors | Docker operation error rate is heightened on instance over last 10 min |
KubeMetricServerFailure | Kube-metric-server is unavailable in cluster for 1 min |
kublrStatusNotReady | Node condition is on for 5 min in cluster; the condition status is reported in the alert
LogMoverProblem | Logs mover for cluster stopped sending messages to ELK
LogstashProblem | No logs filtered for last 10 minutes on Logstash. Check the centralized logging system!
LogstashDeadLetterQueue | Logstash failed to pass messages to Elasticsearch. If there are a lot of messages, *.log files will be removed in 10 minutes
nodeStatusCondition | Node condition is on for 7 min in cluster
nodeStatusNotReady | Node status is not ready for 7 min in cluster |
PersistentVolumeUsageForecast | Based on recent sampling, the persistent volume claimed by PVC in namespace is expected to fill up within four days in cluster
PersistentVolumeUsageWarning | The persistent volume claimed by PVC in namespace has reached the warning usage threshold in cluster
PersistentVolumeUsageCritical | The persistent volume claimed by PVC in namespace has reached the critical usage threshold in cluster
podPhaseIncorrect | Pod is stuck in the wrong phase for 7 min in cluster
podStatusNotReady | Pod is not ready for 7 min in cluster
podStatusNotScheduled | Pod is not scheduled for 7 min in cluster
podContainerWaiting | Pod container is waiting for 7 min in cluster
podContainerRestarting | Pod container is restarting for 7 min in cluster
podPhaseEvictedCount | Pods with Evicted status detected for 5 min in cluster
promRuleEvalFailures | Prometheus failed to evaluate a rule in cluster
replicaSetReplicaNumberLow | Number of replicas is low in cluster for ReplicaSet for 7 min |
replicaSetFullyLabeledNumberLow | Number of fully labeled replicas is low in cluster for ReplicaSet for 7 min |
replicaSetReadyNumberLow | Number of ready replicas is low in cluster for ReplicaSet for 7 min |
replicationControllerReplicaNumberLow | Number of replicas is low in cluster for ReplicationController for 7 min |
replicationControllerFullyLabeledNumberLow | Number of fully labeled replicas is low in cluster for ReplicationController for 7 min |
replicationControllerReadyNumberLow | Number of ready replicas is low in cluster for ReplicationController for 7 min |
replicationControllerAvailableNumberLow | Number of available replicas is low in cluster for ReplicationController for 7 min |
RabbitmqDown | RabbitMQ node down in cluster |
RabbitmqTooManyMessagesInQueue | RabbitMQ Queue is filling up (> 500000 msgs) in cluster |
RabbitmqNoConsumer | RabbitMQ Queue has no consumer in cluster |
RabbitmqNodeDown | Less than 1 node is running in RabbitMQ cluster |
RabbitmqNodeNotDistributed | RabbitMQ distribution link state is not 'up' in cluster
RabbitmqInstancesDifferentVersions | Running different versions of RabbitMQ in the same cluster can lead to failures
RabbitmqMemoryHigh | RabbitMQ node uses more than 90% of allocated RAM in cluster
RabbitmqFileDescriptorsUsage | RabbitMQ node uses more than 90% of file descriptors in cluster
RabbitmqTooManyUnackMessages | RabbitMQ has too many unacknowledged messages in cluster
RabbitmqTooManyConnections | The total number of connections on a RabbitMQ node is too high in cluster
RabbitmqNoQueueConsumer | RabbitMQ queue has less than 1 consumer in cluster |
RabbitmqUnroutableMessages | RabbitMQ queue has unroutable messages in cluster |
SSLCertExpiredWarning | SSL certificate for host in cluster will expire in less than 7 days |
SSLCertExpiredCritical | SSL certificate for host in cluster has expired |
statefulSetReadyNumberLow | Number of ready replicas is low in cluster for StatefulSet for 7 min
TargetDown | Prometheus targets are down: some of the job's targets in cluster are down
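
The trigger conditions above are implemented as Prometheus alerting rules. As a minimal sketch of what such a rule can look like (based on standard node_exporter metrics; this is an illustration, not the exact rule shipped with Kublr), an alert similar to `instanceCpuUsageCritical` could be defined as follows:

```yaml
# Illustrative sketch only; not the exact rule shipped with Kublr.
groups:
  - name: instance.rules
    rules:
      - alert: instanceCpuUsageCritical
        # Percentage of non-idle CPU time per instance, averaged over 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 8m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage on {{ $labels.instance }} is higher than 95% for 8 min"
```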
Fired alerts may be found in the Prometheus Alerts menu or in the Grafana Alerts dashboard.

In order to send alert notifications to a Slack channel, use the following Alertmanager configuration:
```yaml
alertmanager:
  config:
    default_receiver: slack
    receivers: |
      - name: slack
        slack_configs:
          - api_url: '<slack_api_url>'
            channel: '<channel_name>'
```
or deploy the Kublr Platform by adding the above code to the `spec.features.monitoring.values` section of the cluster specification:
```yaml
spec:
  features:
    monitoring:
      enabled: true
      platform:
        enabled: true
        grafana:
          enabled: true
          persistent: true
          size: 128G
        prometheus:
          persistent: true
          size: 128G
        alertmanager:
          enabled: true
      values:
        alertmanager:
          config:
            default_receiver: slack
            receivers: |
              - name: slack
                slack_configs:
                  - api_url: '<slack_api_url>'
                    channel: '<channel_name>'
```
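The `default_receiver` and `receivers` values above are intended to yield an Alertmanager configuration roughly equivalent to the sketch below; the actual rendering is performed by the Kublr monitoring chart, and the `route` block is shown here only for illustration:

```yaml
# Approximate effective Alertmanager configuration (illustration only).
route:
  receiver: slack            # corresponds to default_receiver: slack
receivers:
  - name: slack
    slack_configs:
      - api_url: '<slack_api_url>'
        channel: '<channel_name>'
```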
Please follow the Alertmanager receiver configuration documentation for more information.
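For example, a `slack_configs` entry accepts standard Alertmanager Slack options such as `send_resolved`, `title`, and `text`. Assuming the `receivers` block is passed through to Alertmanager verbatim (as the literal block scalar above suggests), it could be extended like this:

```yaml
receivers: |
  - name: slack
    slack_configs:
      - api_url: '<slack_api_url>'
        channel: '<channel_name>'
        # Also send a notification when an alert is resolved
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
```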