Hi Mir Majeed
Welcome to the Microsoft Q&A Platform. Thank you for reaching out, and we hope you are doing well.
Please begin by gathering information about the failing pods:
- Check overall pod status: kubectl get pods -n <namespace> -o wide
- Get detailed pod information: kubectl describe pod <pod-name> -n <namespace>
- Check pod events for specific error messages: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
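The triage steps above can be bundled into a small helper. This is only a sketch; the namespace argument and the `tail` length are placeholders you may adjust:

```shell
# Sketch: run the basic pod triage commands for one namespace.
# Usage: pod_triage <namespace>
pod_triage() {
  ns="$1"
  # Wide output shows node placement and restart counts at a glance.
  kubectl get pods -n "$ns" -o wide
  # Recent events usually name the failure reason (OOMKilled, probe failure, scheduling).
  kubectl get events -n "$ns" --sort-by=.metadata.creationTimestamp | tail -n 20
}
```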
Collect and analyze logs from both current and previous container instances:
- Current container logs: kubectl logs <pod-name> -n <namespace>
- Previous container logs (crucial for crashed pods): kubectl logs <pod-name> --previous -n <namespace>
- Real-time log monitoring: kubectl logs -f <pod-name> -n <namespace>
- For multi-container pods: kubectl logs <pod-name> -c <container-name> -n <namespace>
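For multi-container pods, the log commands above can be looped over every container automatically. A sketch, with namespace and pod name as placeholders:

```shell
# Sketch: dump current and previous logs for every container in a pod.
# Usage: collect_logs <namespace> <pod-name>
collect_logs() {
  ns="$1"; pod="$2"
  # Enumerate container names from the pod spec.
  for c in $(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].name}'); do
    echo "--- $c: current ---"
    kubectl logs "$pod" -c "$c" -n "$ns" --tail=100
    echo "--- $c: previous ---"
    # --previous fails harmlessly if the container has never restarted.
    kubectl logs "$pod" -c "$c" -n "$ns" --previous --tail=100 2>/dev/null || true
  done
}
```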
Verify resource consumption and limits:
- Check current resource usage: kubectl top pods -n <namespace>
- Examine resource requests and limits: kubectl describe pod <pod-name> -n <namespace> | grep -E -A 5 -B 5 "Limits|Requests"
- Check node capacity: kubectl describe node <node-name>
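A quick sketch combining these resource checks (the namespace is a placeholder; `kubectl top` requires metrics-server to be running):

```shell
# Sketch: show the heaviest pods and each node's allocation summary.
# Usage: resource_check <namespace>
resource_check() {
  ns="$1"
  # Sort by CPU to surface the biggest consumers first.
  kubectl top pods -n "$ns" --sort-by=cpu
  # "Allocated resources" compares summed requests against node allocatable capacity.
  kubectl describe nodes | grep -A 8 "Allocated resources"
}
```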
Inspect the Load Balancer service configuration:
- Check service status: kubectl get svc -n <namespace> -o wide
- Describe the service: kubectl describe svc <service-name> -n <namespace>
- Verify service endpoints: kubectl get endpoints -n <namespace>
- Check external IP assignment: kubectl get svc <service-name> -n <namespace> -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
Examine health probe settings and responses:
- Check for health probe annotations: kubectl get svc <service-name> -n <namespace> -o yaml | grep -i probe
- Test health probe endpoints manually: curl -I http://<pod-ip>:<port>/<health-path>
- Validate probe configuration in deployment: kubectl get deployment <deployment-name> -n <namespace> -o yaml | grep -A 10 -B 10 "Probe"
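To test a probe endpoint manually, a small helper like the following can report the HTTP status code the platform would see. The URL (pod IP, port, and health path) is a placeholder from your own probe configuration:

```shell
# Sketch: hit a health endpoint the way a probe would and report the status code.
# Usage: probe_check <url>
probe_check() {
  url="$1"
  # -s silences progress, -o discards the body, -w prints only the status code.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  echo "HTTP $code from $url"
}
```

Anything other than a 2xx/3xx response within the probe timeout will be treated as a failure by the kubelet.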
Analyze HPA configuration and metrics collection:
- Check HPA status: kubectl get hpa -n <namespace>
- Describe the HPA: kubectl describe hpa <hpa-name> -n <namespace>
- Verify metrics server functionality:
- kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
- kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"
Check for HPA events:
- kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler -n <namespace>
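The HPA checks above can be combined into one sketch; the namespace and HPA name are placeholders:

```shell
# Sketch: check whether the metrics pipeline the HPA depends on is serving data.
# Usage: hpa_check <namespace> <hpa-name>
hpa_check() {
  ns="$1"; hpa="$2"
  # Surface any failure conditions reported by the HPA controller.
  kubectl describe hpa "$hpa" -n "$ns" | grep -iE "fail|error|unable" || true
  # If this raw call errors, metrics-server (not the HPA itself) is the problem.
  if kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/$ns/pods" >/dev/null 2>&1; then
    echo "metrics API reachable"
  else
    echo "metrics API NOT reachable"
  fi
}
```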
Verify network security group (NSG) and firewall configurations:
- Check NSG rules affecting load balancer traffic: az network nsg rule list --resource-group <resource-group> --nsg-name <nsg-name>
Verify Azure Load Balancer probe IP (168.63.129.16) is allowed
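A sketch that lists only the inbound rules so a Deny rule hitting the Azure Load Balancer probe (168.63.129.16, or the AzureLoadBalancer service tag) stands out. The resource group and NSG names are placeholders:

```shell
# Sketch: list inbound NSG rules in a compact table for probe-blocking review.
# Usage: nsg_check <resource-group> <nsg-name>
nsg_check() {
  rg="$1"; nsg="$2"
  # JMESPath query keeps only the fields relevant to probe traffic.
  az network nsg rule list --resource-group "$rg" --nsg-name "$nsg" \
    --query "[?direction=='Inbound'].{name:name, priority:priority, access:access, source:sourceAddressPrefix}" \
    -o table
}
```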
Check internal connectivity:
- kubectl exec -it <pod-name> -n <namespace> -- netstat -an
Test service connectivity from within the cluster:
- kubectl run test-pod --image=busybox --restart=Never --rm -it -- wget -qO- http://<service-name>.<namespace>:80
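The in-cluster test above can be wrapped in a helper; the service name, namespace, and port are placeholders:

```shell
# Sketch: test a Service from inside the cluster with a throwaway busybox pod.
# Usage: svc_check <service-name> <namespace> [port]
svc_check() {
  svc="$1"; ns="$2"; port="${3:-80}"
  # --rm deletes the pod after the request; -T 5 is busybox wget's timeout flag.
  kubectl run conn-test --image=busybox --restart=Never --rm -it -- \
    wget -qO- -T 5 "http://$svc.$ns.svc.cluster.local:$port"
}
```

Using the full `<service>.<namespace>.svc.cluster.local` name also confirms cluster DNS is resolving correctly.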
Pods failing due to insufficient CPU/memory resources:
- kubectl describe pod <pod-name> | grep -iE "insufficient|exceeded|limit"
- kubectl top pods --sort-by=cpu
- kubectl top pods --sort-by=memory
Reference YAML file:
    resources:
      requests:
        cpu: "250m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "1Gi"
Missing or Misconfigured Health Probes
- kubectl describe pod <pod-name> | grep -A 5 -B 5 "Probe"
- kubectl logs <pod-name> | grep -iE "probe|health"
Reference YAML file:
    spec:
      containers:
      - name: webui
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
HPA Metrics Collection Failures (HPA unable to scale due to missing metrics)
- kubectl describe hpa <hpa-name> | grep -iE "fail|error|unable"
- kubectl get deployment <deployment-name> -o yaml | grep resources
For reference, ensure resource requests are properly configured (the HPA cannot compute utilization without them):
    spec:
      containers:
      - name: webui
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
Load Balancer Health Probe Conflicts
- kubectl get svc <service-name> -o yaml | grep annotations
- kubectl describe endpoints <service-name>
For reference, to disable health probes for specific ports if needed:
    metadata:
      annotations:
        service.beta.kubernetes.io/port_8080_no_probe_rule: "true"
Refer to the documentation below for details:
- Health Probes: Configure comprehensive liveness and readiness probes : https://learn.microsoft.com/en-us/azure/aks/best-practices-app-cluster-reliability
- Gradual Scaling: Use HPA behavior policies to control scaling velocity : https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
- Monitoring: Implement comprehensive monitoring and alerting : https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-livedata-overview
- Pod is stuck in CrashLoopBackOff mode : https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/pod-stuck-crashloopbackoff-mode
Please let us know if you need any further assistance.
Regards
Himanshu