Azure Local Deployment Failed - How do I recover the nodes to try again?
After nearly 100 hours of just awfulness, I finally got to the point where I cleared validation and began the deployment. However, it hung after each of the nodes was adopted into AD. Now each node is stuck autobooting into deployment and won't let me delete the VM switches or anything else. Before I start over and reimage each node, is there any way to reset the nodes to "factory", or at least to a state where I can attempt a redeployment? There is literally zero documentation, and since Broadcom ruined the only semi-decent product out there, I'm stuck figuring this out.
Any help is very appreciated!
Azure Local
-
Mounika Reddy Anumandla • 6,970 Reputation points • Moderator
2025-05-22T04:12:49.53+00:00 Hi Mike Jackman,
It sounds like you're facing quite a challenge with your Azure local deployment getting stuck. I understand how frustrating it can be after investing such a significant amount of time. Here are a few steps you can take to help reset your nodes and possibly avoid starting over.
Restart the Device Management Service
Restart-Service DeviceManagementService
This service manages device provisioning and status updates with Azure Stack HCI. Restarting it helps refresh the device state, particularly in the Azure portal or Windows Admin Center. After this, you can check Get-ClusterNode, Get-VMNetworkAdapter, or look for stale provisioning states.
Check for Unexpected VM Switches
During cluster setup, unexpected or auto-created switches (e.g., ConvergedSwitch(managementcompute)) can interfere with the expected network layout.
- List switches: Get-VMSwitch
- Remove unexpected ones: Remove-VMSwitch -Name "ConvergedSwitch(managementcompute)" -Force
https://learn.microsoft.com/en-us/powershell/module/hyper-v/remove-vmswitch?view=windowsserver2025-ps
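As a hedged sketch of the switch cleanup step above: rather than force-removing every switch, you can compute which switches are unexpected and remove only those. Get-VMSwitch and Remove-VMSwitch are the standard Hyper-V cmdlets; the Select-SwitchesToRemove helper and the "Keep" list are illustrative additions, not part of any Microsoft tooling.

```powershell
# Hypothetical helper: given the switch names present on the node and an
# allow-list of switches you created on purpose, return the ones to delete.
function Select-SwitchesToRemove {
    param(
        [string[]]$Present,   # names reported by Get-VMSwitch
        [string[]]$Keep       # switches you intend to preserve
    )
    $Present | Where-Object { $Keep -notcontains $_ }
}

# On a node with Hyper-V (run elevated; commented out here as a sketch):
# $names  = (Get-VMSwitch).Name
# $doomed = Select-SwitchesToRemove -Present $names -Keep @('MyIntentionalSwitch')
# $doomed | ForEach-Object { Remove-VMSwitch -Name $_ -Force }
```

This avoids the blanket `Get-VMSwitch | Remove-VMSwitch -Force`, which would also delete switches you still need.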
Restart Deployment via the Azure Portal
This can work if the node still shows in the Azure portal under Arc-integrated infrastructure. Go to Azure > Azure Stack HCI > Your cluster, use "Customize", uncheck the nodes that were already attempted or failed, then revalidate and redeploy.
- Note: You can only do this if Azure Arc agent is still running and reporting correctly on the nodes.
Repair the Nodes
$Cred = Get-Credential
Repair-Server -Name "NodeName" -LocalAdminCredential $Cred
This command is valid if you're using Windows Admin Center (WAC) or the Azure Stack HCI deployment scripts. It repairs a node that failed during deployment or has a partial configuration (such as an AD join but no cluster role), and helps avoid a full reimage.
Monitor Operation Progress
$ID = "<Operation ID>"
Start-MonitoringActionplanInstanceToComplete -actionPlanInstanceID $ID
- This is part of the Azure Stack HCI "action plan" automation framework.
- It applies in environments using Azure Stack HCI deployments, especially when runbooks or orchestration are involved.
- Helps track and wait for long-running operations (like cluster creation, SDN config, etc.).
Document for reference: https://learn.microsoft.com/en-us/azure/azure-local/manage/repair-server?view=azloc-24113
- Which version of Azure Local are you currently using, and did you notice any issues or errors during the validation stage?
- Are there any specific error messages, either during the booting process or when trying to access the nodes?
Let me know if you have any further queries!
If the information is helpful, please click "upvote" to let us know!
-
Mike Jackman • 0 Reputation points
2025-05-22T06:15:17.35+00:00 I appreciate the effort put into this response. Unfortunately I already tried everything here and read all documents cited. None worked. I am hour 5 into reinstalling Azure Stack HCI on all nodes. This really is the worst.
-
ArkoSen-6842 • 4,165 Reputation points • Moderator
2025-05-22T11:05:10.4266667+00:00 if deployment fails post-AD join but pre-cluster config, the only supported remediation path is to reimage and use Repair-Server.
-
ArkoSen-6842 • 4,165 Reputation points • Moderator
2025-05-22T11:13:07.9366667+00:00 Hello Mike Jackman,
When the nodes successfully join Active Directory, the operating system applies domain policies, firewall rules, and trust settings that harden the configuration. If the deployment fails at this point, due to a timeout, orchestration crash, or partial provisioning, the node is left in an inconsistent state. It will autoboot into a half-provisioned environment and block retry attempts, such as deleting virtual switches, restarting the deployment from WAC, or running repair scripts without cleanup.
In the official Microsoft Learn article Repair a node on Azure Local, Microsoft provides detailed guidance for what to do when a node ends up in a broken or failed state. It states that “Repairing a node reimages a node and brings it back to the system with the previous name and configuration.”
This tells us that recovery assumes a fresh OS install.
It also explains that only the system volume is deleted and newly provisioned during deployment, which confirms that system cleanup, not rollback, is the supported path. Under node replacement, it allows "Current node (reimaged)" scenarios, showing that reusing hardware is supported only after a clean OS install. Therefore, recovery is expected to happen only via reimage + repair. The use of the command
Repair-Server -Name "<Node>"
assumes that the OS has already been freshly installed and re-registered with Azure Arc before invoking the repair process. If you're still mid-deployment and haven't reimaged yet, you can try the following manual cleanup approach to bring the nodes back into a usable state:
Clear cluster and deployment state
Stop-Service ClusSvc -Force
Remove-Item -Recurse -Force "C:\Windows\Cluster"
Remove-Item -Recurse -Force "C:\ProgramData\Microsoft\Windows\Cluster"
Remove-Item -Recurse -Force "C:\ProgramData\Microsoft\AzureStackHCI\HciDeployment" -ErrorAction SilentlyContinue
If available, run:
Clear-HCIConfigurationState
Then check and remove stuck network switches:
Get-VMSwitch | Remove-VMSwitch -Force
If that is blocked, run:
Disable-NetAdapterBinding -Name "vEthernet (ConvergedSwitch*)" -ComponentID ms_tcpip
Disable-NetAdapterBinding (NetAdapter) | Microsoft Learn
From your domain controller, delete the node's computer object and flush stale DNS entries.
If the node is registered with Arc, run azcmagent disconnect.
Install the same version of the Azure Stack HCI OS on the node. Then register the node with Azure Arc using the same resource group, subscription, and region, and assign the appropriate roles (Azure Local Device Management, Key Vault Secrets User, etc.).
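The re-registration step above can be sketched as follows. `azcmagent connect` with `--resource-group`, `--subscription-id`, `--tenant-id`, and `--location` is the standard Azure Connected Machine agent command; every value in angle brackets is a placeholder for your environment, not taken from this thread, and the argument-array wrapper is just an illustration.

```powershell
# Sketch with placeholder values: build the azcmagent connect command
# that re-registers a freshly reimaged node with Azure Arc.
$connectArgs = @(
    'connect',
    '--resource-group',  '<same-resource-group-as-the-cluster>',
    '--subscription-id', '<subscription-id>',
    '--tenant-id',       '<tenant-id>',
    '--location',        '<same-region-as-the-other-nodes>'
)

# On the node, elevated (requires interactive or service-principal auth):
# & azcmagent @connectArgs

# Show the command line that would be executed:
$connectArgs -join ' '
```

Using the same resource group and region as the surviving nodes matters because Repair-Server expects the node to reappear under the original cluster's Arc resources.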
Run the repair commands from another working node in the cluster
$cred = Get-Credential
Repair-Server -Name "<NodeName>" -LocalAdminCredential $cred
You can even monitor the operation
Start-MonitoringActionplanInstanceToComplete -actionPlanInstanceID "<OperationID>"
Verify using Get-VirtualDisk | Get-StorageJob; if there is no output, the storage rebalance is complete.
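The "wait until no storage jobs remain" check above can be wrapped in a small polling loop. Get-StorageJob is the real Storage-module cmdlet; the Wait-ForNoJobs function, its parameters, and the timeout values are illustrative additions.

```powershell
# Illustrative polling loop: repeatedly invoke a job-listing scriptblock
# until it returns nothing (rebalance done) or we give up (timeout).
function Wait-ForNoJobs {
    param(
        [scriptblock]$GetJobs,      # e.g. { Get-StorageJob }
        [int]$PollSeconds = 30,
        [int]$MaxPolls = 120        # ~1 hour at the default interval
    )
    for ($i = 0; $i -lt $MaxPolls; $i++) {
        $jobs = & $GetJobs
        if (-not $jobs) { return $true }   # no output: rebalance complete
        Start-Sleep -Seconds $PollSeconds
    }
    return $false                           # timed out; jobs still running
}

# On a cluster node:
# Wait-ForNoJobs -GetJobs { Get-StorageJob }
```

Separating the polling logic from the cmdlet call keeps the loop testable and lets you swap in Get-VirtualDisk | Get-StorageJob or any other job source.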
-
Mike Jackman • 0 Reputation points
2025-05-22T13:53:46.0533333+00:00 I stayed up until 4am, reimaged all the hosts and started completely from scratch. Same behavior, stuck after the nodes are adopted into the AD. Unless someone has an idea better than spending 6 hours to reimage all the hosts and get them in Arc again, I do believe Proxmox is my future.
-
Mike Jackman • 0 Reputation points
2025-05-22T14:37:56.0166667+00:00 I'd settle for someone telling me how to get the logs. This seems to just silently fail with no feedback.
-
Mike Jackman • 0 Reputation points
2025-05-23T17:35:34.47+00:00 Well they merged my questions to one...
The horror show of this deployment continues into its 27th day. After hanging on the join-domain step for 10 hours, it times out. The "resume deployment" button un-greys. Clicking it does in fact resume the deployment and join the domain.
It then got to the "create cluster" step and failed. Thank god it at least failed with an error message and log file, which is the exception, not the rule, with Azure Local. I had to go into each node and manually disable IPv6 on each NIC. Not sure why it's enabled; at no point is it mentioned in the guide or in sconfig. Clicking "resume deployment" resumes the deployment.
Now I'm hung at "configure networking." No log, no feedback. Just hung. Really inspires confidence that we will have a multi-million dollar setup potentially running on this.
So, question: do I just wait 10 hours again, then click "resume deployment"? Is there someone at Microsoft who wants to explain why we would spend hundreds of thousands of dollars on this? Is Azure Local even a real thing? Am I stuck in a 27-day-long coma?
-
M Luthfi Mukhlis • 0 Reputation points
2025-08-02T03:56:45.7066667+00:00 Hi,
I have the same error, stuck at "Join node to a domain," waiting almost 10+ hours only to get a failed deployment.
May I know what you did to get past the AD task?
-
Tarun Kohli • 0 Reputation points
2025-08-22T21:07:15.32+00:00 Hi @Mike Jackman Since you spent hours getting past validation, I just wanted to check whether you ran into the issue we are having right now with our deployment. I have attached the snapshot. Please see if you can help here.
-
Mike Jackman • 0 Reputation points
2025-08-23T14:03:01.5533333+00:00 Don't. Seriously, just use regular WSFC or Proxmox. I got the entire cluster setup, it took weeks and the first AKS deployment borked the entire system, with Microsoft telling me to start over AGAIN. It doesn't work. This product is years, if ever, away from being production ready.