7.1 : Kubernetes Troubleshooting Guide: Fixing Common Pod and Cluster Errors

Kubernetes (K8s) is a powerful platform for managing containerized applications. But let's be honest, things don't always go as planned. Pods crash, clusters misbehave, and you find yourself scratching your head. Don't worry! This guide is here to help you troubleshoot common Kubernetes errors like a pro.

Think of Kubernetes as a well-oiled machine, a factory floor if you will. Your application is the product, the pods are the individual workers, and the cluster is the entire factory. If a worker (pod) suddenly stops working (crashes), or the factory (cluster) has a power outage, production stops. Troubleshooting is about finding out why the worker stopped or why the power went out.

This guide focuses on practical solutions, explained in simple language with real-world examples.

1. Pods Stuck in Pending State: The "Out of Resources" Scenario

Problem: Your pod refuses to start and is stuck in the Pending state.

Analogy: Imagine you're trying to hire a new worker (pod), but the factory floor (cluster) is already crowded. There's no space or resources available.

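Why it happens: The scheduler can't find a node that satisfies the pod's requirements. Most often no node has enough unreserved CPU or memory to cover the pod's resource requests, but taints, node affinity rules, or an unbound persistent volume claim can have the same effect.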

How to fix it:

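Check available resources: Run kubectl describe pod <pod-name> first and read the Events section, where the scheduler states why it could not place the pod. Then use kubectl describe node to compare each node's Allocatable capacity with its "Allocated resources" section. The scheduler works from resource requests rather than live usage, so a node can be full on paper even when its actual CPU and memory utilization is low.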
```
  kubectl describe node <node-name>
```
Scale up your cluster: Add more nodes to your cluster. This increases the overall resources available.

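Optimize resource requests and limits: Review your pod's resources.requests and resources.limits in your YAML file. Only the requests count toward scheduling, so check whether any single node can actually provide them. A pod that requests more CPU or memory than any one node's allocatable capacity will never be scheduled, no matter how many nodes you add.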

  apiVersion: v1
  kind: Pod
  metadata:
    name: my-pod
  spec:
    containers:
    - name: my-container
      image: my-image
      resources:
        requests:
          cpu: 100m  # Request 100 millicores (0.1 CPU core)
          memory: 128Mi # Request 128 MiB of memory
        limits:
          cpu: 200m  # Limit to 200 millicores (0.2 CPU core)
          memory: 256Mi # Limit to 256 MiB of memory

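Node Affinity/Taints: Check that node selectors, node affinity rules, or taints aren't excluding every node. A pod stays Pending if no node matches its affinity rules, or if the otherwise suitable nodes carry taints that the pod doesn't tolerate.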
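
The scheduler records why it could not place a pod, so reading the pod's scheduling events is usually quicker than inspecting nodes one by one. Here is a minimal sketch, assuming the pod lives in the current namespace. The helper names why_pending and node_taints are made up for illustration, and they are wrapped in functions only so they can be pasted into a shell session.

```shell
# Hypothetical helper: show the scheduler's FailedScheduling events for a pod.
# $1 = pod name (assumed to be in the current namespace).
why_pending() {
  kubectl get events \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling"
}

# Hypothetical helper: list every node together with the keys of its taints.
node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Usage:
#   why_pending my-pod
#   node_taints
```

The event message typically names the unmet constraint, such as insufficient CPU or an untolerated taint, which tells you which of the fixes above to try first.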

2. Pod Crashing in a Loop: The "Buggy Worker"

Problem: Your pod keeps crashing and restarting. kubectl get pods shows a CrashLoopBackOff status and a climbing count in the RESTARTS column.

Analogy: The worker (pod) keeps making the same mistake, causing the machine (application) to break down repeatedly.

Why it happens: The application inside the pod has an error, a bug, or a misconfiguration.

How to fix it:

Check pod logs: This is your primary detective work! Use kubectl logs <pod-name> to examine the application logs. Look for error messages, exceptions, and stack traces.
```
  kubectl logs my-pod
```
Check events: Use kubectl describe pod <pod-name> to examine the pod's state and events. The Last State field of each container and the Events section at the bottom show why the container stopped, for example OOMKilled (killed for running out of memory).
```
  kubectl describe pod my-pod
```
Debugging: If the logs are not sufficient, you might need to debug the application directly, for example by opening a shell in the container with kubectl exec while it is still running and using your usual debugging tools.
Resource Limits: Ensure the pod has enough resources allocated. An OOMKilled status means the container exceeded its memory limit and was killed. Raise resources.limits.memory in your pod's YAML, or reduce the application's memory use.
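
One catch with crash loops: once the container has restarted, kubectl logs shows the new instance, which may not have failed yet. The sketch below reads the crashed instance instead and prints why each container last terminated. The helper names crashed_logs and last_exit are hypothetical, and the pod is assumed to be in the current namespace.

```shell
# Hypothetical helper: print the last 50 log lines of the previous
# (crashed) container instance. $1 = pod name.
crashed_logs() {
  kubectl logs "$1" --previous --tail=50
}

# Hypothetical helper: for each container in the pod, print its name, the
# reason it last terminated (e.g. OOMKilled or Error) and its exit code.
last_exit() {
  kubectl get pod "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Usage:
#   crashed_logs my-pod
#   last_exit my-pod
```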

Real-world Example:

Let's say you have a simple Python web server that reads data from a database. The server keeps crashing. Checking the logs reveals a psycopg2.OperationalError: could not connect to server: Connection refused. This indicates a problem connecting to the database – perhaps the database is down, the credentials are incorrect, or there's a network issue. Fixing the database connection resolves the crashing pod.

3. DNS Resolution Failures: The "Lost Address"

Problem: Your application can't resolve internal service names.

Analogy: The worker (pod) is trying to contact another department (service), but the address book is missing or incorrect.

Why it happens: Kubernetes relies on CoreDNS for internal service discovery. If CoreDNS is malfunctioning, pods won't be able to resolve service names.

How to fix it:

Check CoreDNS pods: Ensure the CoreDNS pods are running and healthy. Use kubectl get pods -n kube-system | grep coredns.
```
  kubectl get pods -n kube-system | grep coredns
```
Check CoreDNS logs: Examine the logs of the CoreDNS pods for errors.
```
  kubectl logs -n kube-system coredns-<pod-id>
```
Test DNS resolution: Run nslookup <service-name>.<namespace>.svc.cluster.local from inside a pod whose image includes nslookup. Replace <service-name> and <namespace> with your actual values, and cluster.local with your cluster domain if it has been customized.
```
  kubectl exec -it <pod-name> -- nslookup my-service.my-namespace.svc.cluster.local
```
Restart CoreDNS: In extreme cases, restarting the CoreDNS deployment might resolve the issue.
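
If you do decide to restart CoreDNS, a rolling restart replaces the pods one at a time instead of deleting them all at once. This sketch assumes CoreDNS runs as a Deployment named coredns in the kube-system namespace, which is common but not universal, so check your own cluster first. The function name restart_coredns is hypothetical.

```shell
# Hypothetical helper: rolling restart of CoreDNS, then wait for it to finish.
# Assumes a Deployment called "coredns" in the kube-system namespace.
restart_coredns() {
  kubectl -n kube-system rollout restart deployment/coredns &&
    kubectl -n kube-system rollout status deployment/coredns --timeout=120s
}

# Usage:
#   restart_coredns
```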

4. Service Not Accessible: The "Blocked Door"

Problem: You can't access your service from outside the cluster.

Analogy: The factory (cluster) has a product (service) ready, but the loading dock (ingress/load balancer) is blocked.

Why it happens: There are several potential causes:

Incorrect Service Type: Your service might be set to ClusterIP instead of NodePort or LoadBalancer. ClusterIP makes the service only accessible from within the cluster.
Ingress Configuration: Your ingress controller might not be properly configured to route traffic to your service.
Firewall Rules: Firewall rules might be blocking traffic to the nodes or the service.
Cloud Provider Issues: Load balancer provisioning might be failing in your cloud provider.

How to fix it:

Check Service Type: Use kubectl describe service <service-name> to verify the service type. Change it to NodePort or LoadBalancer if needed.
```
  kubectl describe service my-service
```
Check Ingress: Examine the ingress configuration to ensure it's correctly routing traffic to your service. Check the logs of the ingress controller pods for errors.
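Verify Firewall Rules: Ensure firewall rules allow traffic to the nodes on the ports you expose: typically 80 and 443 for an ingress or load balancer, or the assigned node port (30000-32767 by default) for a NodePort service.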
Cloud Provider Dashboard: Check your cloud provider's console for any issues with load balancer provisioning.
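
Before digging into ingress or firewalls, it helps to confirm that the Service has any backends at all. A Service whose selector matches no ready pods has an empty endpoints list and can't serve traffic whatever its type. This is a minimal sketch: the helper name service_backends is hypothetical, and the ports in the port-forward example are assumptions.

```shell
# Hypothetical helper: print a Service's type and selector, then its endpoints.
# $1 = service name (assumed to be in the current namespace).
service_backends() {
  kubectl get service "$1" \
    -o jsonpath='{.spec.type}{"\t"}{.spec.selector}{"\n"}'
  kubectl get endpoints "$1"
}

# Usage:
#   service_backends my-service
#
# To test the Service from your workstation while bypassing ingress and load
# balancers, forward a local port to it (8080 and 80 are assumed ports):
#   kubectl port-forward service/my-service 8080:80
```

If the port-forward works but external access doesn't, the problem lies in the ingress, load balancer, or firewall layer rather than in the application.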

5. ImagePullBackOff: The "Missing Component"

Problem: Your pod is stuck in the ImagePullBackOff (or ErrImagePull) state.

Analogy: The worker (pod) is missing a crucial component (image) needed to do its job.

Why it happens: Kubernetes can't pull the container image specified in your pod definition.

How to fix it:

Verify Image Name and Tag: Ensure the image name and tag in your pod definition are correct. Typos are common!

Image Pull Secret: If the image is in a private registry, make sure you have configured an image pull secret and referenced it in your pod definition.

  apiVersion: v1
  kind: Pod
  metadata:
    name: my-private-pod
  spec:
    containers:
    - name: my-container
      image: my-private-registry/my-image:latest
    imagePullSecrets:
    - name: my-registry-secret

Registry Authentication: Verify that the credentials in your image pull secret are valid.
Network Connectivity: Ensure your nodes have network connectivity to the container registry.
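
If the secret is missing or its credentials are stale, recreating it is often the quickest test. The sketch below creates a secret named my-registry-secret, matching the pod definition above. The function name and the registry, user, and password values are placeholders.

```shell
# Hypothetical helper: create the image pull secret referenced by the pod.
# $1 = registry server, $2 = username, $3 = password or access token.
create_registry_secret() {
  kubectl create secret docker-registry my-registry-secret \
    --docker-server="$1" \
    --docker-username="$2" \
    --docker-password="$3"
}

# Usage (placeholder values):
#   create_registry_secret my-private-registry my-user "$REGISTRY_PASSWORD"
```

The secret must exist in the same namespace as the pod that references it.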

6. Liveness/Readiness Probe Failures: The "Unresponsive Worker"

Problem: Your container keeps being restarted because its liveness probe is failing, or your pod never receives traffic because its readiness probe is failing.

Analogy: The supervisor checks if the worker (pod) is alive and ready. If the worker doesn't respond, the supervisor assumes something is wrong and restarts it.

Why it happens: Liveness and readiness probes are used to monitor the health of your application. If these probes fail, Kubernetes assumes the application is unhealthy and restarts it (liveness probe) or stops sending traffic to it (readiness probe).

How to fix it:

Check Probe Configuration: Review the configuration of your liveness and readiness probes in your pod definition. Ensure they are correctly configured to check the health of your application.

  apiVersion: v1
  kind: Pod
  metadata:
    name: my-pod
  spec:
    containers:
    - name: my-container
      image: my-image
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 3
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 15

Verify Application Health: Make sure your application is actually healthy and responding correctly to the probes. Check the application logs for errors.
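Adjust Probe Parameters: Increase the initialDelaySeconds, periodSeconds, timeoutSeconds, or failureThreshold parameters if your application takes longer to start or respond. For applications with a long or variable startup time, consider adding a startupProbe, which holds off the liveness and readiness probes until it succeeds.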
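
To judge whether the probe settings give your application enough time, you can estimate how soon a container that never becomes healthy would be restarted. The first probe runs after initialDelaySeconds, and the restart follows failureThreshold consecutive failures spaced periodSeconds apart. This is a rough sketch that ignores probe timeouts and kubelet scheduling jitter. The function name is hypothetical, and it assumes the default failureThreshold of 3, which the example manifest doesn't set.

```shell
# Rough estimate, in seconds after container start, of when a container that
# never passes its liveness probe gets restarted.
# $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
probe_restart_after() {
  echo $(( $1 + ($3 - 1) * $2 ))
}

# livenessProbe from the example manifest: 3s delay, 10s period, threshold 3
probe_restart_after 3 10 3   # prints 23
```

If your application needs longer than that to start listening, it will be killed before it ever becomes healthy.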

7. Persistent Volume Claim (PVC) Issues: The "Storage Problem"

Problem: Pods are failing to start because they can't claim persistent volumes.

Analogy: The worker (pod) needs a specific tool (persistent volume) to do its job, but the tool isn't available.

Why it happens: There are several reasons a PVC might fail to bind to a PV:

No Matching PV: There's no Persistent Volume (PV) available that meets the PVC's requirements (size, access mode, storage class).
PV Already Bound: The PV that the PVC is trying to claim is already bound to another PVC.
Storage Class Issues: There's a problem with the storage class configuration.
Cloud Provider Issues: Volume provisioning might be failing in your cloud provider.

How to fix it:

Check PVC and PV: Use kubectl describe pvc <pvc-name> and kubectl describe pv <pv-name> to examine the PVC and PV. Verify that the PVC's requirements match the PV's capabilities.
```
  kubectl describe pvc my-pvc
  kubectl describe pv my-pv
```
Check Storage Class: Verify that the storage class is correctly configured and that the provisioner is working.
Cloud Provider Dashboard: Check your cloud provider's console for any issues with volume provisioning.
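
When a namespace holds many claims, a quick filter shows which ones are stuck. The sketch below is a small awk filter over the output of kubectl get pvc, where the second column is STATUS. The function name unbound_pvcs is hypothetical.

```shell
# Hypothetical helper: print the name and status of every PVC that is not Bound.
# Reads the output of `kubectl get pvc --no-headers` on stdin.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1, $2 }'
}

# Usage:
#   kubectl get pvc --no-headers | unbound_pvcs
#   kubectl get storageclass    # confirm the class the PVC names exists
```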

Challenge: Intermittent Network Connectivity Issues

Problem: You're experiencing intermittent network connectivity issues between pods. Sometimes they can communicate, and sometimes they can't. This is notoriously difficult to debug.

Solution:

DNS investigation: Start by verifying DNS resolution as described earlier. Intermittent DNS issues are common.
NetworkPolicy examination: Review your NetworkPolicies to ensure they're not inadvertently blocking traffic. Incorrectly configured policies can cause sporadic connectivity problems. Use kubectl get networkpolicy --all-namespaces -o yaml to inspect them.
CNI Plugin: The CNI (Container Network Interface) plugin is responsible for pod networking. Investigate the CNI plugin's logs. For example, if you're using Calico, check the Calico node logs (kubectl logs -n kube-system calico-node-<pod-id>). Depending on how the plugin was installed, its pods may run in a different namespace, such as calico-system.
Network Tools: Use tools like tcpdump or ping from inside the pods to diagnose network connectivity. You may need to install these tools in your container images.
MTU Issues: Maximum Transmission Unit (MTU) mismatches can cause intermittent packet loss. Check the MTU settings on your network interfaces and ensure they're consistent.
Monitoring: Implement comprehensive network monitoring to track packet loss, latency, and other network metrics. This can help you identify patterns and pinpoint the source of the problem.
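
Two of these checks can be run without rebuilding your images. The sketch below reads a pod's MTU from inside the container and shows how to attach a throwaway debug container that shares the pod's network namespace. It assumes the pod's interface is named eth0, that the image provides cat, and that your cluster supports ephemeral containers. The function name pod_mtu is hypothetical.

```shell
# Hypothetical helper: print the MTU of a pod's primary network interface.
# Assumes the interface is eth0 and the container image provides `cat`.
pod_mtu() {
  kubectl exec "$1" -- cat /sys/class/net/eth0/mtu
}

# Usage: compare the values for pods on different nodes
#   pod_mtu my-pod
#   pod_mtu my-other-pod
#
# Attach a temporary busybox container to a running pod to get tools such as
# ping and nslookup in the pod's own network namespace:
#   kubectl debug -it my-pod --image=busybox -- sh
```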

Diagram (Simplified Kubernetes Architecture Relevant to Troubleshooting):

+---------------------+     +---------------------+     +---------------------+
|   User (kubectl)    | --> |   kube-apiserver    | <-> |   etcd (storage)    |
+---------------------+     +---------------------+     +---------------------+
                                 ^           ^
                                 |           |
           +---------------------+           |
           |                                 |
+---------------------+     +---------------------+     +---------------------+
|   kube-scheduler    |     |   kubelet (node)    | --> |  Container runtime  |
+---------------------+     +---------------------+     +---------------------+
                                                        (e.g., containerd, CRI-O)

+---------------------+     +---------------------+
|       CoreDNS       |     |   Network plugin    |
+---------------------+     +---------------------+

The scheduler and each kubelet watch the API server rather than talking to each other directly: the scheduler assigns Pending pods to nodes, and the kubelet on the chosen node starts the containers through the container runtime. CoreDNS and the network plugin run as cluster add-ons.

Key Takeaways:

Logs are your best friend: Always start by checking the logs of the affected pods and components.
Describe everything: Use kubectl describe to get detailed information about pods, services, nodes, and other resources.
Break down the problem: Simplify the problem by isolating components and testing them individually.
Document your findings: Keep track of the steps you've taken and the results you've obtained. This will help you avoid repeating mistakes and share your knowledge with others.
Stay Calm! Troubleshooting can be frustrating, but a systematic approach will eventually lead you to the solution.

By following this guide and building your Kubernetes troubleshooting skills, you'll be well-equipped to handle common pod and cluster errors and keep your applications running smoothly. Good luck!