Using Cloud Shell to Fix a Dead VM

So for the first time, I get to blog about an Azure-related experience that might be worth a read. As an MSFT employee, I have a small lab in Azure that I use to test out changes to Security Monitoring. That lab happens to have an offline CA with an enterprise subordinate CA, allowing me to play around with ADCS. As it happens, my enterprise CA’s certificate was due to expire, so I went to fire up my root CA and get that process started. I was a bit surprised to find that I had left it on, as I normally turn it off, but when I RDPed to it, I got nothing. Ping and any other attempt to connect got nothing… as did a reboot… nothing. I was locked out of this VM.

I initially followed the advice of a colleague: I deleted the VM and attached the drive as a data disk to another server… Looking at the event logs, this supposedly running server hadn’t logged a thing since February… well, that’s not good. I ran a check disk on the drive and attempted to re-attach and reboot… and nothing… That left me with a completely dead root CA that I could not access to save my life… I did some checking around and stumbled on this cool feature. The guide I used wasn’t the only one I found detailing the steps, but it had the most correct information.

Step 1, turn on Cloud Shell.

That’s pretty straightforward: from your Azure console, simply click the shell button:

[Image: the Cloud Shell button in the Azure console]

This is going to create a separate resource group with its own storage account, if that matters to you.

Step 2, and missing from most of the guides I found, is to install the repair commands.

They aren’t installed by default, so if you find the wrong guide, you’re going to get an error saying the command doesn’t exist. Run the following line:

az extension add -n vm-repair
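If you end up doing this more than once, a small check saves you from re-running the install. This is just a sketch: the function name ensure_vm_repair is made up for this post, and it assumes "az extension show" returns a non-zero exit code when the extension is not installed.

```shell
# Sketch: install the vm-repair extension only when it is missing.
# ensure_vm_repair is a hypothetical helper name, not part of the az CLI.
ensure_vm_repair() {
  if az extension show --name vm-repair >/dev/null 2>&1; then
    echo "vm-repair already installed"
  else
    az extension add --name vm-repair
  fi
}
```

Paste the function into Cloud Shell and then run ensure_vm_repair before any of the repair commands.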

Step 3, create a repair VM.

This is pretty straightforward from the guide. What it will do is create a new resource group with a single virtual machine named repair-<name of dead VM>. It will copy the OS disk of the dead VM (the copy will be located in the dead VM’s resource group) and attach it as a data disk to your repair VM. The repair VM has a public IP and RDP access, so if you want to do some sleuthing once it is created, you can RDP to it.

az vm repair create -g <RG of Dead VM> -n <VM Name of Dead VM> --repair-username <Local admin name of your choosing> --repair-password <local admin password of your choosing> --verbose
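If you would rather not type the password straight into the command line, one option is a small wrapper like the sketch below. The function name create_repair_vm and the REPAIR_PASSWORD variable are both made up for this example; the az flags are the same ones used above.

```shell
# Sketch: wrap "az vm repair create" so the password comes from an
# environment variable instead of being typed into the command itself.
# create_repair_vm and REPAIR_PASSWORD are hypothetical names.
create_repair_vm() {
  rg="$1"; vm="$2"; user="$3"
  if [ -z "${REPAIR_PASSWORD:-}" ]; then
    echo "Set REPAIR_PASSWORD first" >&2
    return 1
  fi
  az vm repair create -g "$rg" -n "$vm" \
    --repair-username "$user" \
    --repair-password "$REPAIR_PASSWORD" \
    --verbose
}
```

You would set REPAIR_PASSWORD first and then call create_repair_vm with the dead VM’s resource group, the dead VM’s name and the admin name you picked.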

Step 4, start doing repairs.

The run IDs are not well documented, and I’d add that the example they give you doesn’t really do anything… So I’d start by running the list-scripts command:

az vm repair list-scripts

This was quite useful, as it showed me all of the Windows and Linux options available. This is what they are as of this post:

[Image: output of az vm repair list-scripts, with the useful run IDs highlighted]

I highlighted the useful ones… in my case, I’m pretty sure the sfc script is the one I needed to run, but I also ran the bcdedit script:

az vm repair run -g <RG of Dead VM> -n <VM Name of Dead VM> --run-on-repair --run-id win-sfc-sf-corruption --verbose

There was a bit of a surprise in this step… SFC takes a while to run, and your Cloud Shell session only stays connected for 20 minutes. I didn’t find an easy way to change that timeout (admittedly, I didn’t spend much time looking), so the session times out while the script is running, defeating the purpose of watching it in the console. Fortunately, that does not kill the command: I could still see the system file checker running in Task Manager on the repair VM after the shell timed out. My only real way of knowing that it finished was to try another run ID… that fails as long as the current task is running. That’s not ideal, but it works.
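The "try another run ID until it stops failing" trick can be turned into a retry loop. Treat this as a rough sketch: run_when_free is a made-up helper name, the 60 second interval and 30 tries are arbitrary defaults, and it can’t tell "the previous script is still running" apart from any other failure. It also won’t outlive a Cloud Shell session, so it probably makes more sense from a locally installed az CLI.

```shell
# Sketch: keep trying a run ID until the repair VM accepts it.
# run_when_free is a hypothetical helper; interval and max_tries are
# arbitrary defaults, and any failure (not only "busy") triggers a retry.
run_when_free() {
  rg="$1"; vm="$2"; run_id="$3"
  interval="${4:-60}"    # seconds to wait between attempts
  max_tries="${5:-30}"   # give up after this many attempts
  try=1
  while [ "$try" -le "$max_tries" ]; do
    if az vm repair run -g "$rg" -n "$vm" --run-on-repair --run-id "$run_id"; then
      echo "run-id $run_id finished after $try attempt(s)"
      return 0
    fi
    echo "attempt $try failed, retrying in ${interval}s" >&2
    sleep "$interval"
    try=$((try + 1))
  done
  return 1
}
```

Called with the dead VM’s resource group, its name and a run ID, it returns as soon as one attempt succeeds and gives up with a non-zero exit code after the last try.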

Repeat for any additional run ID that you want to run, changing only the value after --run-id.
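If you have several run IDs to get through, a loop saves some retyping. Again, only a sketch: run_repair_scripts is a name I made up, and the run IDs you pass in should come from your own list-scripts output.

```shell
# Sketch: run several repair scripts one after another, stopping at the
# first failure. run_repair_scripts is a hypothetical helper name.
run_repair_scripts() {
  rg="$1"; vm="$2"
  shift 2                # everything left over is a run ID
  for run_id in "$@"; do
    echo "running $run_id"
    az vm repair run -g "$rg" -n "$vm" --run-on-repair \
      --run-id "$run_id" --verbose || return 1
  done
}
```

You pass the dead VM’s resource group and name first, then as many run IDs as you like, for example win-sfc-sf-corruption followed by whichever others you picked from the list.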

Step 5, restore your VM.

This also didn’t work quite right. It failed on the first restore attempt. You can do this manually by detaching the data drive from the repair VM and then swapping the OS drive on the dead VM with the newly repaired disk… I was eventually able to get the restore command working, which was nice because it cleaned up the repair resource group along with it… That’s kind of important if you care about managing costs, which I’m guessing most people do. It also swaps the OS disk for you on the dead VM, so the old OS disk will be detached and you’ll be booting off of the disk copy:

az vm repair restore -g <RG of Dead VM> -n <VM Name of Dead VM> --verbose
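For the manual route mentioned above, the steps would look roughly like the sketch below. I did not end up needing it, so take it as an untested outline: manual_os_disk_swap is a made-up name, every argument is a placeholder you would fill in from your own environment, and it assumes the repaired disk copy lives in the dead VM’s resource group (as described in step 3) and that the VM has to be deallocated before its OS disk can be swapped.

```shell
# Sketch: manual version of the restore. Detach the repaired disk from
# the repair VM, then swap it in as the OS disk of the dead VM.
# manual_os_disk_swap is a hypothetical helper; all arguments are placeholders.
manual_os_disk_swap() {
  repair_rg="$1"; repair_vm="$2"; dead_rg="$3"; dead_vm="$4"; disk_name="$5"
  # 1. Detach the repaired copy from the repair VM.
  az vm disk detach -g "$repair_rg" --vm-name "$repair_vm" -n "$disk_name" || return 1
  # 2. Deallocate the dead VM before touching its OS disk.
  az vm deallocate -g "$dead_rg" -n "$dead_vm" || return 1
  # 3. Look up the full resource ID of the repaired disk.
  disk_id=$(az disk show -g "$dead_rg" -n "$disk_name" --query id -o tsv) || return 1
  # 4. Swap the OS disk and boot the VM again.
  az vm update -g "$dead_rg" -n "$dead_vm" --os-disk "$disk_id" || return 1
  az vm start -g "$dead_rg" -n "$dead_vm"
}
```

Unlike the restore command, this leaves the repair resource group in place, so you would still have to delete that yourself.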

One last thing though, it does leave the old OS drive behind… so back to that cost thing… once you get your restored VM up and running, you may want to get rid of the old disk.
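To find the leftover disk, you can list the managed disks in the resource group that are not attached to any VM. This is a sketch with a made-up function name (list_unattached_disks); it relies on the managedBy property being empty for unattached disks, and you should double check that a disk really is the old OS disk before deleting anything.

```shell
# Sketch: list managed disks in a resource group that no VM is using.
# list_unattached_disks is a hypothetical helper name.
list_unattached_disks() {
  az disk list -g "$1" --query '[?managedBy==`null`].name' -o tsv
}

# Once you are sure which disk is the old OS disk, it can be removed with:
#   az disk delete -g <RG of Dead VM> -n <name of old OS disk> --yes
```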

Anyways, after that I had a bootable RootCA… so crisis averted.