AKS Node Pool Upgrade - Disk Attach Failure

Question

Nikhil Singh 25

After an automated node pool upgrade (system and workload pools) on our production AKS cluster (see attachment - Node Upgrade), a number of workloads hosted in the cluster hit persistent volume attach errors (see attachment - PVC Error). The RocketChat namespace is used as the example here.

After some troubleshooting, the suspected root cause is a ghost attach/detach issue in which stale VolumeAttachment objects remained bound to the old nodes. The upgrade created a new node pool, but the VolumeAttachment objects still pointed at nodes in the old node pool. The Azure Disk CSI driver was unable to reconcile state, and new attach attempts remained stuck. As a result, the application pods stayed in ContainerCreating.
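To illustrate the suspected condition, here is a minimal sketch that flags VolumeAttachment objects whose spec.nodeName no longer matches any node in the cluster. It assumes the input is the parsed JSON from `kubectl get volumeattachment -o json` and `kubectl get nodes -o json`; the function name and the file names in the usage lines are hypothetical, and this is not a confirmed diagnostic from the driver maintainers.

```python
import json


def find_stale_attachments(volume_attachments: dict, nodes: dict) -> list:
    """Return VolumeAttachments that reference a node which no longer exists.

    Both arguments are the parsed JSON "List" objects printed by
    `kubectl get ... -o json` (assumption: standard storage.k8s.io/v1 fields).
    """
    live_nodes = {item["metadata"]["name"] for item in nodes.get("items", [])}
    stale = []
    for va in volume_attachments.get("items", []):
        spec = va.get("spec", {})
        node = spec.get("nodeName")
        if node not in live_nodes:
            stale.append({
                "name": va["metadata"]["name"],
                "node": node,
                "pv": spec.get("source", {}).get("persistentVolumeName"),
                "attached": va.get("status", {}).get("attached", False),
                "finalizers": va["metadata"].get("finalizers", []),
            })
    return stale


if __name__ == "__main__":
    # Hypothetical file names, produced beforehand with:
    #   kubectl get volumeattachment -o json > volumeattachments.json
    #   kubectl get nodes -o json > nodes.json
    with open("volumeattachments.json") as f:
        vas = json.load(f)
    with open("nodes.json") as f:
        current_nodes = json.load(f)
    for entry in find_stale_attachments(vas, current_nodes):
        print(entry)
```

Any entry reported here would be a candidate for manual review rather than automatic deletion, since a VolumeAttachment can legitimately exist for a short time while a node is draining.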

Steps Taken to Resolve

Identified PVC → PV → VolumeAttachment mappings (kubectl get pvc/pv/volumeattachment).
Patched the stale VolumeAttachment objects to remove their finalizers, then deleted them.
Scaled down the workload to release the PVC.
Restarted the csi-azuredisk-controller deployment in kube-system.
Scaled the workload back up, which triggered a clean attach and restored service.
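The steps above can be sketched as a sequence of kubectl invocations. The helper below only builds the command lists, in the same order as the steps, so they can be reviewed before anything is run. The function name, the workload reference (`statefulset/rocketchat`) and the replica count are assumptions for illustration; the thread does not state the workload kind or the exact commands that were used.

```python
import json


def recovery_commands(stale_attachments, namespace, workload, replicas):
    """Build kubectl commands mirroring the recovery steps, in order.

    stale_attachments: names of VolumeAttachment objects judged stale.
    workload: "kind/name" reference, e.g. "statefulset/rocketchat" (assumed).
    replicas: replica count to restore after the controller restart (assumed).
    """
    # A merge patch that sets finalizers to null clears them from the object.
    clear_finalizers = json.dumps({"metadata": {"finalizers": None}})
    cmds = []
    for name in stale_attachments:
        cmds.append(["kubectl", "patch", "volumeattachment", name,
                     "--type=merge", "-p", clear_finalizers])
        cmds.append(["kubectl", "delete", "volumeattachment", name])
    # Scale down so the pod releases the PVC.
    cmds.append(["kubectl", "scale", workload, "-n", namespace, "--replicas=0"])
    # Restart the CSI controller (deployment name and namespace as given in the thread).
    cmds.append(["kubectl", "rollout", "restart",
                 "deployment/csi-azuredisk-controller", "-n", "kube-system"])
    # Scale back up to trigger a fresh attach.
    cmds.append(["kubectl", "scale", workload, "-n", namespace,
                 f"--replicas={replicas}"])
    return cmds


if __name__ == "__main__":
    # Hypothetical attachment name; print the commands for review only.
    for cmd in recovery_commands(["csi-0123abcd"], "rocketchat",
                                 "statefulset/rocketchat", 1):
        print(" ".join(cmd))
```

Printing rather than executing is deliberate: removing finalizers bypasses the driver's own detach bookkeeping, so each command is worth checking against the actual cluster state first.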

Component Versions

Kubernetes version (AKS): 1.32.6
Node pool OS: Azure Linux
Node image version: AKSAzureLinux-V3gen2-202508.06.0
VM size: Standard_D8ds_v5
Azure Disk CSI driver: v1.33.2

Scope / Impact

This occurred only in the Production cluster during the node pool upgrade process. It did not impact any workloads in the Non-Production cluster, which was upgraded in the same time window.

The behavior closely matches the following GitHub issue:

AttachVolume.Attach failed for volume "xyz" after node image update · Issue #2357 · kubernetes-sigs/azuredisk-csi-driver

This issue is still open and unresolved.

We would appreciate some guidance on:

Whether a permanent fix for this issue is planned.
What the recommended recovery path is when this issue is encountered. Was the process used in this case appropriate, or is there a better alternative that ensures a timely and successful recovery?
Whether any mitigation strategies are available to prevent this issue from occurring in the future.

Ankit Yadav 410 Reputation points Microsoft External Staff Moderator

2025-08-28T09:54:00.3366667+00:00

Hello Nikhil Singh,
I've reached out to you in Private messages for the attachments you mentioned in the issue description.

While we're checking the issue, please re-share those in the private message so that we can understand the situation better.
Additionally, if you tried to add any hyperlinks to the issue that are not visible in the thread, please share them in the private message too.
