Azure Local Deployment Failed - How do I recover the nodes to try again?
After nearly 100 hours of just awfulness, I finally got to the point where I cleared validation and began the deployment. However, it hung after each of the nodes was adopted into AD. Now each node is stuck autobooting into deployment and won't let me delete the VM switches or anything else. Before I start over and reimage each node, is there any way to reset the nodes to "factory", or at least to a state where I can attempt a redeployment? There is literally zero documentation, and since Broadcom ruined the only semi-decent product out there, I'm stuck figuring this out.
Any help is very appreciated!
Azure Local
-
Mounika Reddy Anumandla • 6,970 Reputation points • Moderator
2025-05-22T04:12:49.53+00:00 Hi Mike Jackman,
It sounds like you're facing quite a challenge with your Azure local deployment getting stuck. I understand how frustrating it can be after investing such a significant amount of time. Here are a few steps you can take to help reset your nodes and possibly avoid starting over.
Restart the Device Management Service
Restart-Service DeviceManagementService
This service manages device provisioning and status updates with Azure Stack HCI. Restarting it helps refresh the device state, particularly in the Azure portal or Windows Admin Center. After this, you can check Get-ClusterNode, Get-VMNetworkAdapter, or look for stale provisioning states.
Check for Unexpected VM Switches
During cluster setup, unexpected or auto-created switches (e.g., ConvergedSwitch(managementcompute)) can interfere with the expected network layout.
- List switches: Get-VMSwitch
- Remove unexpected ones: Remove-VMSwitch -Name "ConvergedSwitch(managementcompute)" -Force
https://learn.microsoft.com/en-us/powershell/module/hyper-v/remove-vmswitch?view=windowsserver2025-ps
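As a hedged sketch of the switch cleanup step above: rather than force-removing every switch, you can compute which switches are unexpected and remove only those. Get-VMSwitch and Remove-VMSwitch are the standard Hyper-V cmdlets; the Select-SwitchesToRemove helper and the "Keep" list are illustrative additions, not part of any Microsoft tooling.

```powershell
# Hypothetical helper: given the switch names present on the node and an
# allow-list of switches you created on purpose, return the ones to delete.
function Select-SwitchesToRemove {
    param(
        [string[]]$Present,   # names reported by Get-VMSwitch
        [string[]]$Keep       # switches you intend to preserve
    )
    $Present | Where-Object { $Keep -notcontains $_ }
}

# On a node with Hyper-V (run elevated; commented out here as a sketch):
# $names  = (Get-VMSwitch).Name
# $doomed = Select-SwitchesToRemove -Present $names -Keep @('MyIntentionalSwitch')
# $doomed | ForEach-Object { Remove-VMSwitch -Name $_ -Force }
```

This avoids the blanket `Get-VMSwitch | Remove-VMSwitch -Force`, which would also delete switches you still need.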
Restart Deployment via the Azure Portal
This can work if the node still shows in the Azure portal under Arc-integrated infrastructure. Go to Azure > Azure Stack HCI > Your cluster, use "Customize", uncheck the nodes that were already attempted or failed, then revalidate and redeploy.
- Note: You can only do this if Azure Arc agent is still running and reporting correctly on the nodes.
Repair the Nodes
$Cred = Get-Credential
Repair-Server -Name "NodeName" -LocalAdminCredential $Cred
This command is valid if you're using Windows Admin Center (WAC) or the Azure Stack HCI deployment scripts. It repairs a node that failed during deployment or has a partial configuration (such as an AD join but no cluster role), and helps avoid a full reimage.
Monitor Operation Progress
$ID = "<Operation ID>"
Start-MonitoringActionplanInstanceToComplete -actionPlanInstanceID $ID
- This is part of the Azure Stack HCI "action plan" automation framework.
- It applies in environments using Azure Stack HCI deployments, especially when runbooks or orchestration are involved.
- Helps track and wait for long-running operations (like cluster creation, SDN config, etc.).
Document for reference: https://learn.microsoft.com/en-us/azure/azure-local/manage/repair-server?view=azloc-24113
- Which version of Azure Local are you currently using, and did you notice any issues or errors during the validation stage?
- Are there any specific error messages, either during the booting process or when trying to access the nodes?
Let me know if you have any further queries!
If the information is helpful, please click "upvote" to let us know!
-
Mike Jackman • 0 Reputation points
2025-05-22T06:15:17.35+00:00 I appreciate the effort put into this response. Unfortunately I already tried everything here and read all documents cited. None worked. I am hour 5 into reinstalling Azure Stack HCI on all nodes. This really is the worst.
-
ArkoSen-6842 • 4,165 Reputation points • Moderator
2025-05-22T11:05:10.4266667+00:00 if deployment fails post-AD join but pre-cluster config, the only supported remediation path is to reimage and use Repair-Server.
-
ArkoSen-6842 • 4,165 Reputation points • Moderator
2025-05-22T11:13:07.9366667+00:00 Hello Mike Jackman,
When the nodes successfully join Active Directory, the operating system applies domain policies, firewall rules, and trust settings that harden the configuration. If the deployment fails at this point, due to a timeout, orchestration crash, or partial provisioning, the node is left in an inconsistent state. It will autoboot into a half-provisioned environment and block retry attempts, such as deleting virtual switches, restarting the deployment from WAC, or running repair scripts without cleanup.
In the official Microsoft Learn article Repair a node on Azure Local, Microsoft provides detailed guidance for what to do when a node ends up in a broken or failed state. It states that “Repairing a node reimages a node and brings it back to the system with the previous name and configuration.”
This tells us that recovery assumes a fresh OS install.
It also explains that only the system volume is deleted and newly provisioned during deployment, which confirms that system cleanup, not rollback, is the supported path. Under node replacement, it allows "Current node (reimaged)" scenarios, showing that reusing hardware is supported only after a clean OS install. Therefore, recovery is expected to happen only via reimage + repair. The use of the command
Repair-Server -Name "<Node>"
assumes that the OS has already been freshly installed and re-registered with Azure Arc before invoking the repair process. If you're still mid-deployment and haven't reimaged yet, you can try the following manual cleanup approach to bring the nodes back into a usable state:
Clear cluster and deployment state
Stop-Service ClusSvc -Force
Remove-Item -Recurse -Force "C:\Windows\Cluster"
Remove-Item -Recurse -Force "C:\ProgramData\Microsoft\Windows\Cluster"
Remove-Item -Recurse -Force "C:\ProgramData\Microsoft\AzureStackHCI\HciDeployment" -ErrorAction SilentlyContinue
If available, run:
Clear-HCIConfigurationState
Then check and remove stuck network switches:
Get-VMSwitch | Remove-VMSwitch -Force
If that is blocked, run:
Disable-NetAdapterBinding -Name "vEthernet (ConvergedSwitch*)" -ComponentID ms_tcpip
Disable-NetAdapterBinding (NetAdapter) | Microsoft Learn
From your domain controller, delete the node's computer object and flush stale DNS entries.
If the node is registered with Arc, run azcmagent disconnect.
Install the same version of the Azure Stack HCI OS on the node. Then register the node with Azure Arc using the same resource group, subscription, and region, and assign the appropriate roles (Azure Local Device Management, Key Vault Secrets User, etc.).
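The re-registration step above can be sketched as follows. `azcmagent connect` with `--resource-group`, `--subscription-id`, `--tenant-id`, and `--location` is the standard Azure Connected Machine agent command; every value in angle brackets is a placeholder for your environment, not taken from this thread, and the argument-array wrapper is just an illustration.

```powershell
# Sketch with placeholder values: build the azcmagent connect command
# that re-registers a freshly reimaged node with Azure Arc.
$connectArgs = @(
    'connect',
    '--resource-group',  '<same-resource-group-as-the-cluster>',
    '--subscription-id', '<subscription-id>',
    '--tenant-id',       '<tenant-id>',
    '--location',        '<same-region-as-the-other-nodes>'
)

# On the node, elevated (requires interactive or service-principal auth):
# & azcmagent @connectArgs

# Show the command line that would be executed:
$connectArgs -join ' '
```

Using the same resource group and region as the surviving nodes matters because Repair-Server expects the node to reappear under the original cluster's Arc resources.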
Run the repair commands from another working node in the cluster
$cred = Get-Credential
Repair-Server -Name "<NodeName>" -LocalAdminCredential $cred
You can even monitor the operation
Start-MonitoringActionplanInstanceToComplete -actionPlanInstanceID "<OperationID>"
Verify using Get-VirtualDisk | Get-StorageJob; if there is no output, the storage rebalance is complete.
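The "wait until no storage jobs remain" check above can be wrapped in a small polling loop. Get-StorageJob is the real Storage-module cmdlet; the Wait-ForNoJobs function, its parameters, and the timeout values are illustrative additions.

```powershell
# Illustrative polling loop: repeatedly invoke a job-listing scriptblock
# until it returns nothing (rebalance done) or we give up (timeout).
function Wait-ForNoJobs {
    param(
        [scriptblock]$GetJobs,      # e.g. { Get-StorageJob }
        [int]$PollSeconds = 30,
        [int]$MaxPolls = 120        # ~1 hour at the default interval
    )
    for ($i = 0; $i -lt $MaxPolls; $i++) {
        $jobs = & $GetJobs
        if (-not $jobs) { return $true }   # no output: rebalance complete
        Start-Sleep -Seconds $PollSeconds
    }
    return $false                           # timed out; jobs still running
}

# On a cluster node:
# Wait-ForNoJobs -GetJobs { Get-StorageJob }
```

Separating the polling logic from the cmdlet call keeps the loop testable and lets you swap in Get-VirtualDisk | Get-StorageJob or any other job source.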
-
Mike Jackman • 0 Reputation points
2025-05-22T13:53:46.0533333+00:00 I stayed up until 4am, reimaged all the hosts and started completely from scratch. Same behavior, stuck after the nodes are adopted into the AD. Unless someone has an idea better than spending 6 hours to reimage all the hosts and get them in Arc again, I do believe Proxmox is my future.
-
Mike Jackman • 0 Reputation points
2025-05-22T14:37:56.0166667+00:00 I'd settle for someone telling me how to get the logs. This seems to just silently fail with no feedback.
-
Mike Jackman • 0 Reputation points
2025-05-23T17:35:34.47+00:00 Well they merged my questions to one...
The horror show of this deployment continues into its 27th day. After hanging on the join-domain step for 10 hours, it times out. The "resume deployment" button un-greys. Clicking it does in fact resume the deployment and join the domain.
It then got to the "create cluster" step and failed. Thank god it at least failed with an error message and log file, which is the exception, not the rule, with Azure Local. I had to go into each node and manually disable IPv6 on each NIC. Not sure why it's enabled; at no point is it mentioned in the guide or in sconfig. Clicking "resume deployment" resumes the deployment.
Now I'm hung at "configure networking." No log, no feedback. Just hung. Really inspires confidence that we will have a multi-million dollar setup potentially running on this.
So, question: do I just wait 10 hours again, then click "resume deployment"? Is there someone at Microsoft who wants to explain why we would spend hundreds of thousands of dollars on this? Is Azure Local even a real thing? Am I stuck in a 27-day-long coma?
-
M Luthfi Mukhlis • 0 Reputation points
2025-08-02T03:56:45.7066667+00:00 Hi,
I have the same error, stuck at "Join node to a domain," waiting almost 10+ hours only to get a failed deployment.
May I know what you did to get past the AD task?
-
Tarun Kohli • 0 Reputation points
2025-08-22T21:07:15.32+00:00 Hi @Mike Jackman Since you spent hours getting past validation, I just wanted to check whether you ran into the issue we are having right now with our deployment. I have attached the snapshot. Please see if you can help here.
-
Mike Jackman • 0 Reputation points
2025-08-23T14:03:01.5533333+00:00 Don't. Seriously, just use regular WSFC or Proxmox. I got the entire cluster setup, it took weeks and the first AKS deployment borked the entire system, with Microsoft telling me to start over AGAIN. It doesn't work. This product is years, if ever, away from being production ready.