
Kubernetes Alerting: 10 Must-Have Alerts for Proactive Monitoring

May 21, 2024


Running a Kubernetes cluster includes keeping an eye on it to make sure your apps and services are healthy.

You don’t want to be staring at a bunch of Kubernetes dashboards all day, though. You want to set up Kubernetes alerting with appropriate alerts instead, right?

With k8s alerts, you will spot problems in your Kubernetes cluster quickly and hopefully fix them quickly as well. But what should you alert on? Here are the 10 most important alerts you should set up for your Kubernetes cluster. All Kubernetes alerts you see here are included in Kubernetes monitoring in Sematext.

1. High CPU Limit Usage

Why is this important

This alert triggers when a Kubernetes pod’s CPU usage runs high relative to its CPU limit, indicating potential resource contention or inefficient resource allocation. A container cannot actually exceed its CPU limit; it gets throttled instead, so if the limit keeps getting hit, expect slower application response times and potential service disruptions. In short, you don’t want to see this happening.

Action

Investigate the affected pod and consider adjusting resource limits or optimizing the application.

Use the following command to inspect the CPU usage of the affected pod:

kubectl top pod <pod_name> -n <namespace>

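Resource limits on a running pod generally can’t be changed in place, so edit the workload that owns the pod (for example, its Deployment). Kubernetes then rolls out new pods with the updated limits:

kubectl edit deployment <deployment_name> -n <namespace>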

Within the YAML file, modify the ‘resources’ section to adjust CPU limits:

resources:
  limits:
    cpu: <new_cpu_limit>

Replace <new_cpu_limit> with the desired CPU limit value in CPU units.
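To judge how close a pod is to its limit, it helps to express the usage reported by kubectl top as a percentage of the configured limit. The following is a minimal sketch, not part of Sematext or kubectl: the function names to_millicores and cpu_pct_of_limit are hypothetical, and the values are assumed to be Kubernetes CPU quantities such as 450m or 1.

```shell
# Hypothetical helpers (not part of kubectl): compare CPU usage to a CPU limit.

# Convert a Kubernetes CPU quantity ("450m", "2", "0.5") to whole millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Print usage as an integer percentage of the limit (rounded down).
# Assumes the limit is non-zero.
cpu_pct_of_limit() {
  usage=$(to_millicores "$1")
  limit=$(to_millicores "$2")
  echo $(( usage * 100 / limit ))
}

# Example: cpu_pct_of_limit 450m 500m
```

Here the first argument would come from the CPU column of kubectl top pod and the second from the limit in the pod spec. A value that stays near 100 suggests the container is being throttled.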

2. CPU Limit Usage Reached

Why is this important

Similar to the previous alert, this one notifies you when a pod actually reaches its CPU limit, at which point its containers are throttled.

Action

Analyze the workload’s resource demands and scaling requirements to prevent performance degradation.

Run the following command to get CPU usage metrics for the workload associated with the affected pod:

kubectl top pod --selector=<selector> -n <namespace>

Replace <selector> with the appropriate label selector for the workload, such as app=<app_name>, to aggregate CPU usage across all pods belonging to the workload.
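kubectl top prints one row per pod, so getting a single figure for the workload means adding the rows up yourself. A small sketch of that aggregation follows. The function name sum_cpu_millicores is hypothetical, and the input is assumed to be the output of kubectl top pod with --no-headers, where the second column is CPU.

```shell
# Hypothetical helper: total the CPU column of `kubectl top pod --no-headers`.
# Reads rows of "NAME CPU MEMORY" on stdin and prints the sum in millicores.
sum_cpu_millicores() {
  awk '{
    cpu = $2
    if (cpu ~ /m$/) { sub(/m$/, "", cpu); total += cpu }   # "120m" -> 120
    else            { total += cpu * 1000 }                # "2"    -> 2000
  } END { printf "%dm\n", total }'
}

# Usage (requires metrics-server in the cluster):
#   kubectl top pod --selector=<selector> -n <namespace> --no-headers | sum_cpu_millicores
```

Comparing this total with the sum of the pods' CPU limits gives a rough idea of whether the workload needs higher limits or more replicas.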

3. Kubelet Volume Manager Unavailable

Why is this important

A kubelet volume manager failure can affect pod storage, potentially causing data loss. When the volume manager is unavailable, pod volumes may fail to mount, which impacts applications that rely on persistent storage.

Action

Investigate the kubelet service and underlying storage infrastructure for issues and restart affected components if necessary.

The kubelet runs as a system service on each node rather than as a pod, so check it on the affected node itself. The commands below assume a systemd-based node:

systemctl status kubelet

Check the kubelet logs to identify any errors or warnings related to the volume manager:

journalctl -u kubelet --since "1 hour ago"

Check the status of the storage volumes and claims:

kubectl get pv,pvc -n <namespace>

Make sure that storage volumes are correctly attached and accessible to the kubelet.

If necessary, restart the kubelet service on the node and any related components. Restarting the service reinitializes the volume manager, hopefully resolving the issue:

sudo systemctl restart kubelet

4. Kubernetes API Server Errors

Why is this important

Monitors for client errors (4XX) and server errors (5XX) from the Kubernetes API server, which could signify communication problems or internal server issues.

Action

Check network connectivity and API server logs to identify and resolve underlying issues.

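Verify the status of the Kubernetes API server. This works on self-managed clusters, where the API server runs as a pod in kube-system; managed Kubernetes services usually don’t expose the control plane as pods: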

kubectl get pods -n kube-system | grep kube-apiserver

Examine logs for errors or warnings related to the API server:

kubectl logs <kube-apiserver_pod_name> -n kube-system

Check network connectivity to the API server:

kubectl cluster-info
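The API server also reports its own request counts in Prometheus text format, which gives a quick view of how many server errors it has returned. The sketch below assumes the apiserver_request_total metric with a code label, as exposed by current Kubernetes releases. The function name count_5xx is hypothetical.

```shell
# Hypothetical helper: sum all 5XX responses from API server metrics on stdin.
# Expects Prometheus text lines such as:
#   apiserver_request_total{code="500",verb="GET"} 3
count_5xx() {
  awk 'index($0, "apiserver_request_total{") == 1 && $0 ~ /code="5[0-9][0-9]"/ {
         total += $NF     # last field is the counter value
       }
       END { printf "%d\n", total }'
}

# Usage (needs permission to read the /metrics endpoint):
#   kubectl get --raw /metrics | count_5xx
```

The counter is cumulative since the API server started, so what matters for alerting is how fast it grows between two readings, not the absolute number.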

5. Node Under Pressure

Why is this important

Alerts when a Kubernetes node experiences resource pressure, potentially affecting pod scheduling and performance.

Action

Monitor resources on the Kubernetes node to identify areas under pressure:

kubectl describe node <node_name>

Look for high CPU, memory, or disk usage that could indicate resource pressure.
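The Conditions table in the kubectl describe node output already states whether the kubelet considers the node under memory, disk, or PID pressure. A minimal sketch for pulling out only the active pressure conditions follows. The function name pressure_conditions is hypothetical, and the input is assumed to follow the usual describe layout, with the condition type in the first column and its status in the second.

```shell
# Hypothetical helper: print node pressure conditions whose status is True.
# Reads `kubectl describe node` output on stdin.
pressure_conditions() {
  awk '$1 ~ /^(MemoryPressure|DiskPressure|PIDPressure)$/ && $2 == "True" { print $1 }'
}

# Usage:
#   kubectl describe node <node_name> | pressure_conditions
```

Empty output means none of the three pressure conditions is currently reported for that node.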

Review the workloads running on the node to identify resource-hungry applications or containers:

kubectl get pods --all-namespaces -o wide

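If resource pressure is persistent, consider adding CPU, memory, or storage capacity. kubectl has no command for resizing a node, so this is done through your cloud provider or infrastructure tooling, for example by adding nodes to the node pool or by moving to a larger instance type.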

Adjust resource limits based on the workload requirements and available node capacity.

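To shift workloads away from a node under pressure, drain it. Draining cordons the node and evicts its pods so they are rescheduled onto other nodes, so first make sure the rest of the cluster has room for them:

kubectl drain <node_name> --ignore-daemonsets

Once the node has recovered, kubectl uncordon <node_name> makes it schedulable again.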

6. Anomalous Node CPU/Memory Capacity

Why is this important

Detects when Kubernetes nodes use more CPU or memory than usual, which might mean they’re running out of resources or not working efficiently.

Action

Check how resources are being used over time and make changes to node capacity or how workload is distributed if necessary.

Monitor CPU and memory usage on Kubernetes nodes:

kubectl top nodes

Review resource usage trends over time to identify anomalies or spikes in CPU and memory usage.

If nodes consistently run close to their CPU or memory capacity, consider scaling up node capacity. kubectl cannot change a node's capacity, so do this through your cloud provider or infrastructure tooling, for example by adding nodes to the node pool or by replacing the affected nodes with a larger instance type.

7. Missing Pod Replicas for Deployments/StatefulSets

Why is this important

Notifies when pods controlled by Deployments or StatefulSets are missing, indicating deployment failures or pod evictions. When pods are missing, essential components of your applications are not running as expected, which can lead to downtime, bad performance, and data loss. For example, if you configured your cluster to have 2 replicas of a pod and one of those replicas is missing, you are effectively running a single instance of that pod. That instance is a single point of failure (SPOF), and you have no fault tolerance left.

Action

Inspect deployment/statefulset configurations and cluster events to diagnose and address deployment issues.

Inspect the configuration of the affected Deployment or StatefulSet:

kubectl describe deployment <deployment_name> -n <namespace>

or

kubectl describe statefulset <statefulset_name> -n <namespace>

Review the desired and current replica counts to determine if there is any difference.

Review cluster events to identify any events related to the missing pod replicas:

kubectl get events -n <namespace>

Look for events indicating pod evictions or deployment failures.
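When a namespace holds many workloads, it can be quicker to list only those with fewer ready replicas than desired. The sketch below assumes the default kubectl get output for Deployments or StatefulSets, where the second column is READY in the form ready/desired. The function name under_replicated is hypothetical.

```shell
# Hypothetical helper: list workloads with fewer ready replicas than desired.
# Reads `kubectl get deployments --no-headers` (or statefulsets) on stdin.
under_replicated() {
  awk '{
    split($2, r, "/")                      # READY column, e.g. "1/2"
    if (r[1] + 0 < r[2] + 0)
      printf "%s ready=%d desired=%d\n", $1, r[1], r[2]
  }'
}

# Usage:
#   kubectl get deployments -n <namespace> --no-headers | under_replicated
#   kubectl get statefulsets -n <namespace> --no-headers | under_replicated
```

Each line of output names a workload worth inspecting with the describe and events commands above.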

8. Pod Status and State Issues

Why is this important

Checking pod status is important for spotting problems like application errors, not enough resources, or scheduling trouble. For example, if a pod is stuck in a “waiting” state, it might indicate that it’s unable to start due to a missing resource or configuration issue.

Action

Analyze pod logs, events, and configuration to troubleshoot and resolve underlying issues affecting pod stability and performance.

Review the logs of pods to identify potential errors or issues:

kubectl logs <pod_name> -n <namespace>

Examine pod events to understand recent changes or events affecting pod status:

kubectl describe pod <pod_name> -n <namespace>

Review the pod’s full configuration to verify the settings and resource allocations:

kubectl get pod <pod_name> -n <namespace> -o yaml

9. Pod Restart and Failure Scenarios

Why is this important

Frequent pod restarts, container crashes, image pull failures, or out-of-memory (OOM) errors will likely impact your application’s reliability and user experience, making effective Kubernetes pod monitoring essential. Imagine a critical microservice repeatedly encountering out-of-memory errors. This can lead to extended downtime and, in turn, to revenue loss and customer dissatisfaction. Nobody wants that.

Action

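If a pod keeps restarting because it’s running out of memory, look into its logs to see why. Add --previous to read the logs of the container instance that crashed rather than the freshly restarted one:

kubectl logs <pod_name> -n <namespace> --previous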

Look for patterns in pod restarts or failures, such as out-of-memory errors, container crashes, or image pull failures.
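To find which pods to look at first, you can filter the pod list by restart count. This is a rough sketch that assumes the default kubectl get pods columns, where the fourth column is RESTARTS. The function name restarting_pods and the default threshold of 5 are hypothetical choices.

```shell
# Hypothetical helper: print pods whose restart count is at or above a threshold.
# Reads `kubectl get pods --no-headers` on stdin; threshold defaults to 5.
restarting_pods() {
  awk -v min="${1:-5}" '$4 + 0 >= min { print $1, $4 }'
}

# Usage:
#   kubectl get pods -n <namespace> --no-headers | restarting_pods 5
```

Newer kubectl versions append text such as "(3m ago)" after the restart count. That lands in later columns, so the count in the fourth column is still read correctly.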

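If the pod is restarting due to resource constraints, consider increasing its resource limits. Edit the workload that owns the pod (for example, its Deployment) rather than the pod itself, so that the change rolls out to new pods:

kubectl edit deployment <deployment_name> -n <namespace>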

10. High ETCD Leader Change/No Leader

Why is this important

Monitors ETCD cluster health, alerting on frequent leader changes or absence of a leader, which can impact cluster consistency and resilience. Frequent leader changes or absence of a leader indicate potential issues with cluster communication or stability. Like in romantic relationships, good communication in the Kubernetes cluster is key to its happiness. 😉

Action

Investigate ETCD cluster health, network connectivity, and disk performance to ensure proper operation and stability.

Check the health of the ETCD cluster endpoints:

etcdctl endpoint health --endpoints=<etcd_endpoints>

Replace <etcd_endpoints> with the actual endpoints of your ETCD cluster.

Identify the current leader and watch for leader changes:

etcdctl endpoint status --endpoints=<etcd_endpoints> --write-out=table

The IS LEADER column shows which member currently leads. A RAFT TERM value that keeps climbing between runs points to repeated leader elections.

Verify network connectivity between ETCD nodes and check disk performance on ETCD nodes to prevent I/O-related issues.
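ETCD also counts leader changes itself, in the etcd_server_leader_changes_seen_total counter on its Prometheus metrics endpoint. A minimal sketch for reading that counter follows. The function name leader_changes is hypothetical, and how you reach the metrics endpoint (URL, port, TLS options) depends on your setup and is left as a placeholder.

```shell
# Hypothetical helper: extract the leader-change counter from etcd metrics.
# Reads Prometheus text format on stdin; exits non-zero if the metric is absent.
leader_changes() {
  awk '$1 == "etcd_server_leader_changes_seen_total" { print $2 + 0; found = 1 }
       END { if (!found) exit 1 }'
}

# Usage (metrics URL is a placeholder; add TLS flags if your cluster needs them):
#   curl -s <etcd_metrics_url>/metrics | leader_changes
```

Taking two readings a few minutes apart and comparing them shows whether leader elections are still happening or the counter reflects past events.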

Sematext Kubernetes Alerts

Sematext comes with a number of pre-configured alert rules that are ready to use right away. These alerts keep an eye on your infrastructure and Kubernetes cluster without needing any setup. 

Here’s why they’re handy:

No Setup Needed – saves you time

You don’t have to spend time setting up alerts. Sematext’s default alerts are already set to watch key services in your Kubernetes clusters. When you start using Sematext, you’ll start getting alerts for important events like high CPU usage or pod failures whenever they happen. This helps you catch problems quickly.

Made for Kubernetes – no need to be a Kubernetes expert

Sematext’s default K8S alerts are designed specifically for Kubernetes. They cover all sorts of things that can go wrong in Kubernetes, like when a pod breaks or a server runs out of resources.

Easy Troubleshooting

When you receive an alert, you’ll also get helpful insights and guidance on where to look and what to do next. This makes it easy to identify the root cause of issues and resolve them quickly, saving you time and effort.

Always Fine-tuned

Sematext continuously fine-tunes these alerts based on real-world feedback and insights, as well as when new versions of Kubernetes are released, making sure they still spot problems reliably and alert you quickly about potential issues.

Anomaly Detection

Some of Sematext’s default alerts for Kubernetes include anomaly detection capabilities. These alerts can identify unusual patterns or behaviors in your infrastructure or Kubernetes cluster, helping you to catch problems early before they get big.

Customizable Notifications

You can customize how you receive alerts. Whether it’s via email, text message, Slack, or another channel, you can choose the notification method that works best for you and your team. This ensures that you stay informed about critical events in your infrastructure and Kubernetes cluster, no matter where you are.

By using Sematext’s default alerts, you can set up monitoring for your infrastructure and Kubernetes quickly and without deep Kubernetes monitoring experience.

Summary

Configuring these alerts as part of robust Kubernetes monitoring and alerting covers the most common Kubernetes cluster issues and lets you resolve problems before they impact application availability and performance.

Moreover, it’s important to keep improving these alerts by fine-tuning alert thresholds and regularly reviewing the actions to take, so your infrastructure stays healthy. With Sematext Alerts and Kubernetes monitoring best practices, managing alerts is easier, so you can stay on top of your Kubernetes cluster’s health and fix issues quickly.
