AKS Node Pool Upgrade - Disk Attach Failure
After an automated node pool upgrade (system and workload) on our production AKS cluster (see attachment - Node Upgrade), a number of workloads hosted in the cluster encountered persistent volume attach errors (see attachment - PVC Error). The RocketChat namespace is used as the example in this report.
After some troubleshooting, the suspected root cause is a ghost attach/detach issue in which stale VolumeAttachment objects remained bound to the old nodes. The upgrade created a new node pool, but the VolumeAttachment objects still pointed at the old node pool. The Azure Disk CSI driver was unable to reconcile state, and new attach attempts remained stuck. This left the application pods stuck in ContainerCreating.
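To illustrate the suspected condition, here is a minimal sketch of how stale VolumeAttachment objects could be spotted, assuming the JSON output of `kubectl get volumeattachment -o json` and `kubectl get nodes -o json` has been saved to files. The function name `find_stale_attachments` and the file names are hypothetical; this is not the exact procedure we ran, only a way to express the check.

```python
import json


def find_stale_attachments(volume_attachments, nodes):
    """Return VolumeAttachments whose spec.nodeName is not a current node.

    Both arguments are parsed `kubectl get ... -o json` list objects.
    """
    live_nodes = {n["metadata"]["name"] for n in nodes.get("items", [])}
    stale = []
    for va in volume_attachments.get("items", []):
        spec = va.get("spec", {})
        node = spec.get("nodeName")
        if node not in live_nodes:
            stale.append({
                "name": va["metadata"]["name"],
                "node": node,
                "pv": spec.get("source", {}).get("persistentVolumeName"),
                "attached": va.get("status", {}).get("attached", False),
            })
    return stale


if __name__ == "__main__":
    # Hypothetical file names; produce them with:
    #   kubectl get volumeattachment -o json > volumeattachments.json
    #   kubectl get nodes -o json > nodes.json
    try:
        with open("volumeattachments.json") as f:
            vas = json.load(f)
        with open("nodes.json") as f:
            nodes = json.load(f)
    except FileNotFoundError:
        print("Save the kubectl JSON output first (see comments above).")
    else:
        for item in find_stale_attachments(vas, nodes):
            print(item["name"], "->", item["node"], "PV:", item["pv"])
```

Any object this reports refers to a node that no longer exists in the cluster, which is the state we believe the CSI driver could not reconcile.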
Steps Taken to Resolve
- Identified PVC → PV → VolumeAttachment mappings (kubectl get pvc/pv/volumeattachment).
- Patched the stale VolumeAttachment objects to remove their finalizers, then deleted them.
- Scaled down the workload to release the PVC.
- Restarted the csi-azuredisk-controller deployment in kube-system.
- Scaled the workload back up, which triggered a clean attach and restored service.
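The steps above can be sketched as a small script that only builds the kubectl command lines, in the same order we performed them, so they can be reviewed before anything is run. The helper name `recovery_commands`, the workload and namespace names, the replica count, and the assumption that the workload is a Deployment are all hypothetical placeholders; the `--ignore-not-found` flag is added in case removing the finalizers already lets a pending deletion complete.

```python
import json
import shlex

CSI_CONTROLLER = "deployment/csi-azuredisk-controller"


def recovery_commands(stale_attachments, namespace, workload, replicas,
                      kind="deployment"):
    """Build (but do not run) the kubectl commands for the recovery steps."""
    # Merge patch that clears all finalizers on the object.
    clear_finalizers = json.dumps({"metadata": {"finalizers": None}})
    target = f"{kind}/{workload}"
    cmds = []
    for name in stale_attachments:
        cmds.append(["kubectl", "patch", "volumeattachment", name,
                     "--type=merge", "-p", clear_finalizers])
        cmds.append(["kubectl", "delete", "volumeattachment", name,
                     "--ignore-not-found"])
    # Scale down so the workload releases the PVC.
    cmds.append(["kubectl", "scale", target, "-n", namespace, "--replicas=0"])
    # Restart the CSI controller and wait for it to become ready again.
    cmds.append(["kubectl", "rollout", "restart", CSI_CONTROLLER,
                 "-n", "kube-system"])
    cmds.append(["kubectl", "rollout", "status", CSI_CONTROLLER,
                 "-n", "kube-system"])
    # Scale back up to trigger a fresh attach.
    cmds.append(["kubectl", "scale", target, "-n", namespace,
                 f"--replicas={replicas}"])
    return cmds


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    plan = recovery_commands(["csi-example-attachment"], "rocketchat",
                             "rocketchat", replicas=1)
    for cmd in plan:
        print(shlex.join(cmd))
```

Printing the plan rather than executing it is deliberate: removing finalizers by hand bypasses the driver's own detach handling, so each command should be checked against the cluster state first.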
Component Versions
- Kubernetes version (AKS): 1.32.6
- Node pool OS: Azure Linux
- Node image version: AKSAzureLinux-V3gen2-202508.06.0
- VM size: Standard_D8ds_v5
- Azure Disk CSI driver: v1.33.2
Scope / Impact
This occurred only in the Production cluster during the node pool upgrade process; it did not impact any workloads in the Non-Production cluster, which was upgraded in the same time window.
The behaviour closely matches the following upstream issue:
AttachVolume.Attach failed for volume "xyz" after node image update · Issue #2357 · kubernetes-sigs/azuredisk-csi-driver
That upstream issue is still open and unresolved.
We would appreciate some guidance on:
- Whether a permanent fix for this issue is planned.
- What the recommended recovery path is when this issue is encountered. Was the process used in this case appropriate, or is there a better alternative that ensures a timely and successful recovery?
- Whether any mitigation strategies are available to prevent this issue from occurring in the future.