AKS Pods crashing after turning on load balancer service

Mir Majeed 20 Reputation points
2025-08-12T16:53:55.2433333+00:00

We are encountering an issue with our AKS workload where all pods for the WebUI service enter a CrashLoopBackOff state after configuring a Horizontal Pod Autoscaler and a LoadBalancer service.

Please see the attached AKS Ticket.pdf for implementation details.

Azure Kubernetes Service

Accepted answer
  1. Himanshu Shekhar 255 Reputation points Microsoft External Staff Moderator
    2025-08-12T21:00:30.9233333+00:00

    Hi Mir Majeed

    Welcome to Microsoft Q&A Platform. Thank you for reaching out & hope you are doing well. 

    Please begin by gathering information about the failing pods:

    1. Check overall pod status: kubectl get pods -n <namespace> -o wide
    2. Get detailed pod information: kubectl describe pod <pod-name> -n <namespace>
    3. Check pod events for specific error messages: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

    Collect and analyze logs from both current and previous container instances:

    1. Current container logs: kubectl logs <pod-name> -n <namespace>
    2. Previous container logs (crucial for crashed pods): kubectl logs <pod-name> --previous -n <namespace>
    3. Real-time log monitoring: kubectl logs -f <pod-name> -n <namespace>
    4. For multi-container pods: kubectl logs <pod-name> -c <container-name> -n <namespace>

    Verify resource consumption and limits:

    1. Check current resource usage: kubectl top pods -n <namespace>
    2. Examine resource requests and limits: kubectl describe pod <pod-name> -n <namespace> | grep -E -A 5 -B 5 "Limits|Requests"
    3. Check node capacity: kubectl describe node <node-name>

    Inspect the Load Balancer service configuration:

    1. Check service status: kubectl get svc -n <namespace> -o wide
    2. Get detailed service information: kubectl describe svc <service-name> -n <namespace>
    3. Verify service endpoints: kubectl get endpoints -n <namespace>
    4. Check external IP assignment: kubectl get svc <service-name> -n <namespace> -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
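    For reference, a minimal Service manifest of type LoadBalancer (the name, selector, and ports below are illustrative, not taken from your workload). The targetPort must match the container's listening port, or the Azure health probes will fail and traffic will never reach the pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webui              # illustrative name
spec:
  type: LoadBalancer
  selector:
    app: webui             # must match the pod labels of the WebUI deployment
  ports:
    - port: 80             # port exposed on the load balancer
      targetPort: 80       # must match the container's listening port
```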

    Examine health probe settings and responses:

    1. Check for health probe annotations: kubectl get svc <service-name> -n <namespace> -o yaml | grep -i probe
    2. Test health probe endpoints manually: curl -I http://<pod-ip>:<port>/<health-path>
    3. Validate probe configuration in deployment: kubectl get deployment <deployment-name> -n <namespace> -o yaml | grep -A 10 -B 10 "Probe"

    Analyze HPA configuration and metrics collection:

    1. Check HPA status: kubectl get hpa -n <namespace>
    2. Get detailed HPA information: kubectl describe hpa <hpa-name> -n <namespace>
    3. Verify metrics server functionality:
    • kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
    • kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"

    Check for HPA events:

    1. kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler -n <namespace>
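    For reference, a minimal autoscaling/v2 HPA manifest to compare against your configuration (the target name and thresholds are illustrative; the target Deployment must declare CPU requests for a utilization metric to work):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webui-hpa          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webui            # must reference the WebUI deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative threshold
```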

    Verify network security group (NSG) and firewall configurations:

    1. Check NSG rules affecting load balancer traffic: az network nsg rule list --resource-group <resource-group> --nsg-name <nsg-name>
    2. Verify that the Azure Load Balancer health probe IP (168.63.129.16) is allowed.

    Please check internal connectivity:

    1. kubectl exec -it <pod-name> -n <namespace> -- netstat -an

    Test service connectivity from within the cluster:

    1. kubectl run test-pod --image=busybox --rm -it -- wget -O- <service-name>.<namespace>:80

    Check for pods failing due to insufficient CPU/memory resources:

    1. kubectl describe pod <pod-name> | grep -E -i "insufficient|exceeded|limit"
    2. kubectl top pods --sort-by=cpu
    3. kubectl top pods --sort-by=memory
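    Note that alternation patterns such as "insufficient|exceeded|limit" require grep -E; with grep's default basic-regex mode the | is treated as a literal character and the filter silently matches nothing. A quick local check:

```shell
# Default (BRE) grep treats | literally; -E enables extended-regex alternation
printf 'Insufficient cpu\nok line\nmemory limit exceeded\n' \
  | grep -E -i 'insufficient|exceeded|limit'
# matches the first and third lines only
```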

    Reference YAML file:

    resources:
      requests:
        cpu: "250m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "1Gi"
    

    Missing or Misconfigured Health Probes

    1. kubectl describe pod <pod-name> | grep -A 5 -B 5 "Probe"
    2. kubectl logs <pod-name> | grep -E -i "probe|health"

    Reference YAML file:

    spec:
      containers:
        - name: webui
          livenessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 80
            initialDelaySeconds: 15
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
    
    

    HPA Metrics Collection Failures (HPA unable to scale due to missing metrics)

    1. kubectl describe hpa <hpa-name> | grep -E -i "fail|error|unable"
    2. kubectl get deployment <deployment-name> -o yaml | grep resources

    For reference, ensure resource requests are properly configured:

    spec:
      containers:
        - name: webui
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
    
    

     Load Balancer Health Probe Conflicts

    1. kubectl get svc <service-name> -o yaml | grep annotations
    2. kubectl describe endpoints <service-name>

    For reference, disable health probes for specific ports if needed:

    metadata:
      annotations:
        service.beta.kubernetes.io/port_8080_no_probe_rule: "true"
    
    
    

    Refer to the documentation below for details:

    1. Health Probes: Configure comprehensive liveness and readiness probes : https://learn.microsoft.com/en-us/azure/aks/best-practices-app-cluster-reliability
    2. Gradual Scaling: Use HPA behavior policies to control scaling velocity : https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
    3. Monitoring: Implement comprehensive monitoring and alerting : https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-livedata-overview
    4. Pod is stuck in CrashLoopBackOff mode : https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/pod-stuck-crashloopbackoff-mode
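    As an example of point 2 above, autoscaling/v2 behavior policies can limit scaling velocity so that the HPA does not add or remove replicas faster than the load balancer can adjust; the windows and values below are illustrative:

```yaml
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2               # add at most 2 pods...
          periodSeconds: 60      # ...per minute
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50              # remove at most half the current replicas per minute
          periodSeconds: 60
```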

    Please let us know if you need any further assistance.

    Regards

    Himanshu

    1 person found this answer helpful.

1 additional answer

  1. hossein jalilian 11,905 Reputation points Volunteer Moderator
    2025-08-12T17:45:25.69+00:00

    Thanks for posting your question in the Microsoft Q&A forum.

    CrashLoopBackOff usually means the containers keep failing right after starting, so Kubernetes keeps restarting them with increasingly long waits in between. Common causes include application misconfiguration, missing environment variables, bad ConfigMaps/Secrets, low resource limits, probe failures, or networking issues after exposing the service. The quickest way to troubleshoot is to check the pod logs and kubectl describe output for errors, review the probes, verify the configs, and make sure resource limits and dependencies match the new scaling and traffic setup.


    Please don't forget to close the thread here by upvoting and accepting this as the answer if it is helpful.

