Microsoft Releases Workaround for RC4 Installer Bug

A couple of years ago, I published this article about an RC4 bug with the SCOM installer that effectively kills the installation of SCOM when RC4 is disabled on a system. MSFT has now published a workaround here. It’s not ideal in my opinion, as I think the installer itself should be fixed, but it does get you past the problem.

Speaking of installers, at SCOMathon this week another bug with the SCOM installer was brought to my attention, this one involving group managed service accounts. Simply put, the installer doesn’t handle them out of the box, which makes it impossible to add additional management servers after switching to group managed service accounts. In my opinion, this defeats the purpose of making the switch, since the old service accounts are still required.

For those with access to UserVoice, I have published a request at this link. Please upvote it if you have votes remaining.

Thank you.

Update on Security Monitoring and UR3

I’ve been working with Kevin Holman, as well as spending some time working through my MP, to see if we could determine the cause of the problems related to UR3 and Security Monitoring. At this point, there is good news and bad news.

First, the good news:

It is related to a change that the product team made in UR3 affecting how event log filtering works in the security logs. Since Security Monitoring does A LOT of event log filtering in the security logs, it’s naturally affected, as is (from what I understand) any MP that’s doing any kind of security log monitoring with parameter filtering:

  • Fixed the monitoring agent-related issue that affected formatted strings. These are now read from the provider DLLs to show a localized string

This is the crux of the issue: Due to the nature of the OS, the XML view of some logs differs from the friendly view of the same log. Here is an example as it affects Security Monitoring:

Note the 4624 XML view in the screenshot below:

[Screenshot: XML view of a 4624 event]

Now take a look at the general view of the same event:

[Screenshot: General (friendly) view of the same 4624 event]

This has been a bit of a thorn in the flesh, so to speak, for SCOM administrators over the years. I know when I first started working with SCOM, I spent much more time than I’d have liked troubleshooting this issue, and I even filed a bug report with the Windows product team in my first year at MSFT. It was not fixed because of the considerable amount of code that would have to be rewritten. Right or wrong, that’s what I was told. The SCOM product team recognized this problem and chose to fix it. Ultimately, I think that’s a good thing, though I will need to make some changes to Security Monitoring, as will anyone using these %% codes in custom MPs. As of SCOM 2019 UR3 (even after the current bugs are fixed), the rules and monitors they write will no longer match on those codes but on the values shown in the event description.
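If you want to see the discrepancy for yourself, this rough PowerShell sketch (run from an elevated prompt so the Security log is readable) pulls the most recent 4624 event and shows both views side by side:

# Grab the most recent 4624 (logon) event from the Security log
$evt = Get-WinEvent -FilterHashtable @{ LogName = 'Security'; Id = 4624 } -MaxEvents 1
# Raw XML view: several fields are stored as %%nnnn placeholder codes
([xml]$evt.ToXml()).Event.EventData.Data | Format-Table Name, '#text'
# Friendly view: the same fields are rendered as localized strings from the provider DLL
$evt.FormatDescription()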

With that out of the way, there were a couple of bugs introduced by this change. One is relatively easy to fix, I’m told, while the other will take a bit more time. In terms of an ETA, it’s likely that a hotfix will be released at some point, and UR4, as I understand it, will hopefully have both of these addressed.

There’s more good news. First, the product team considers this a functional bug, which means it’s a high priority to fix.

Second, this has pushed the product team to revisit how they handle high volume logs and see if there’s a better way to do it. This has been enough of an ongoing issue that the general guidance out there is to avoid monitoring the security logs, which of course I’ve ignored. :) Over the years, this hasn’t been too much of a problem for Security Monitoring, but as anyone reading this knows, Security Monitoring looks almost exclusively at the Security log, which in most environments is a high volume log. Because of that, performance is always a concern in large environments. Hopefully, this leads to some improvements in how SCOM parses these logs, allowing for better performance in general.

Now for the bad news:

SCOM 2019 with UR3 will not work with Security Monitoring without said hotfix. It’s quite possible, given the nature of what was exposed, that it never will work quite the same way. I’m also going to need to go through the MP, look for any filtering I’m doing based on those %% values, and correct it. I’ve been told that a simple OR statement looking for both values may not work either. I’m not quite clear on why that is yet, but I will likely have to do some rewriting to take it into account. I’m not sure what the final product will look like (overridable parameters, the OR statement, or simply a separate version of this MP for 2019 UR3 and later) until I have a better idea of what the finished state will be.

For now, I’d simply say that if you have to go to UR3, you should probably remove Security Monitoring until the hotfix is published.

SCOM 2019 UR 3 and Security Monitoring

This is an FYI regarding this management pack: there does appear to be some sort of issue between the UR3 agent on domain controllers and the Security Monitoring management pack. I’m looking into it at the moment. The main impact is on domain controllers, but there seem to be less obvious issues with other UR3 agents as well. The issue at hand is that monitoringhost.exe consumes a large amount of CPU, and you may also see agent restarts as a result. I’m not sure yet whether the issue is on my end, on the product side of things, or both.
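If you want a quick way to check whether MonitoringHost is the culprit on a given agent, something like this rough sketch works:

# Illustrative check: list MonitoringHost processes by accumulated CPU time
Get-Process -Name MonitoringHost -ErrorAction SilentlyContinue |
    Sort-Object CPU -Descending |
    Select-Object Id, StartTime, CPU, WorkingSet64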

If there is an issue that is correctable on this side, I’ll issue an update as soon as it is fixed.

Thank you.

Security Update to SCOM 2019

I somehow managed to miss this, but last month Microsoft released an update to SCOM 2019 to patch an elevation of privilege vulnerability. It’s worth noting that you must be running UR2 to apply this patch; it sits on top of UR2. UR2 also provides some nice auditing capabilities, so I highly recommend going to it for that reason alone. The details can be found here:

CVE-2021-1728 – Security Update Guide – Microsoft – System Center Operations Manager Elevation of Privilege Vulnerability

AutoPilot, Timeouts, and PowerShell Scripting

I don’t normally write about Autopilot, and I’m not going to try to crowd the space that Michael Niehaus has done an excellent job documenting, but when I find myself spending large amounts of time troubleshooting something that is ultimately not documented, or not well documented, it’s probably worth a quick post.

The scenario was pretty straightforward. I had to automate the configuration of some OEM images for a customer, involving configuration profiles, compliance policies, app installs, and PowerShell scripts. What I found during the initial build was that there were a lot of timeouts. Troubleshooting wasn’t much better, as the logs I was able to obtain weren’t showing any errors. It was as if Autopilot had simply stopped. All I saw on the enrollment status page was a 0x800705B4 error indicating that it had exceeded the time limit configured by the administrator.

In order to troubleshoot this, I created a separate group, added a test device to it, and slowly added my targeted items. It eventually came down to a couple of PowerShell scripts that removed Windows features requiring a reboot (Internet Explorer and Internet Printing, if you’re curious). For whatever reason, even with the -NoRestart switch on the Disable-WindowsOptionalFeature command, Autopilot simply didn’t know what to do with it. I would note that removing optional features that didn’t require reboots processed normally. I’m not going to pretend to be an Autopilot expert, but I’m guessing anything that could potentially force a restart could be in play here. So if there’s a moral to the story, pay close attention to PowerShell scripts whose results may require reboots.
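For reference, the offending scripts boiled down to something like this (a rough sketch; the exact feature names are illustrative and vary by Windows build):

# Remove optional features as part of Autopilot provisioning.
# Even with -NoRestart, features that require a reboot left Autopilot hanging;
# features that do not require a reboot processed normally.
Disable-WindowsOptionalFeature -Online -FeatureName 'Internet-Explorer-Optional-amd64' -NoRestart
Disable-WindowsOptionalFeature -Online -FeatureName 'Printing-Foundation-InternetPrinting-Client' -NoRestart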

Using Cloud Shell to Fix a Dead VM

So for the first time, I get to blog about an Azure-related experience that might be worth a read. As an MSFT employee, I have a small lab in Azure that I use to test out changes to Security Monitoring. That lab happens to have an offline root CA with an enterprise subordinate CA, which lets me play around with ADCS. As it happens, my enterprise CA’s certificate was due to expire, so I went to fire up my root CA to get that process started. I was a bit surprised to find that I had left it on, as I normally turn it off, but when I RDPed to it, I got nothing. Ping and any other attempt to connect got nothing, as did a reboot. I was locked out of this VM.

I initially followed the advice of a colleague: I deleted the VM and attached its drive as a data disk to another server. Looking at the event logs on that disk, the supposedly running server hadn’t logged a thing since February, which is not a good sign. I did a check disk on the drive, attempted to re-attach and reboot, and got nothing. That left me with a completely dead root CA that I could not access to save my life. I did some checking around and stumbled on this cool feature. The guide I found wasn’t the only one detailing the steps, but it had the most accurate information.

Step 1, turn on Cloud Shell.

That’s pretty straightforward: from the Azure portal, simply click the Cloud Shell button:

[Screenshot: Cloud Shell button in the Azure portal]

This will create a separate resource group, if that matters to you, with its own storage account.

Step 2, install the repair commands (the step missing from most of the guides I found).

They aren’t installed by default, so if you follow the wrong guide, you’re going to get an error saying the command doesn’t exist. Run the following line:

az extension add -n vm-repair

Step 3, create a repair VM.

This is pretty straightforward from the guide. What it will do is create a new resource group containing a single virtual machine named repair-<name of dead VM>. It will copy the OS disk of the dead VM (the copy will be located in the dead VM’s resource group) and attach it as a data disk to your repair VM. You can RDP to this VM if you want, as it will have a public IP and RDP access, so if you want to do some sleuthing once it’s created, you can.

az vm repair create -g <RG of dead VM> -n <name of dead VM> --repair-username <local admin name of your choosing> --repair-password <local admin password of your choosing> --verbose
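For example, with purely illustrative names:

az vm repair create -g RootCA-RG -n RootCA01 --repair-username repairadmin --repair-password '<a strong password>' --verbose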

Step 4, start doing repairs.

The run IDs are not well documented, and I’d add that the example the documentation gives you doesn’t really do anything, so I’d start by listing the available scripts:

az vm repair list-scripts

This was quite useful, as it showed me all of the Windows and Linux options available. This is what they are as of this post:

[Screenshot: output of az vm repair list-scripts, showing the available Windows and Linux run IDs]

I highlighted the useful ones. In my case, I’m pretty sure the SFC script was the one I needed to run, but I ran the bcdedit script as well:

az vm repair run -g <RG of dead VM> -n <name of dead VM> --run-on-repair --run-id win-sfc-sf-corruption --verbose

There was a bit of a surprise in this step: SFC takes a while to run, and your Cloud Shell session only stays connected for 20 minutes. I did not find an easy way to adjust that timeout (admittedly, I didn’t spend much time looking), so the shell disconnects while the script is running, defeating the purpose of watching it in the console. Fortunately, the disconnect does not kill the command; I could see the System File Checker still running in Task Manager on the repair VM after the shell timed out. My only real way of knowing that it had finished was to try to run another run ID, which fails as long as the current task is running. That’s not ideal, but it works.

Repeat for any additional run ID that you want to run, changing only the --run-id value.

Step 5, restore your VM.

This also didn’t work quite right; it failed on the first restore attempt. You can do it manually by detaching the data disk from the repair VM and then swapping the OS disk on the dead VM with the newly repaired disk. I was eventually able to get the restore command working, which was nice because it cleaned up the repair resource group along with it. That’s kind of important if you care about managing costs, which I’m guessing most people do. It also swaps the OS disk for you on the dead VM, so the old OS disk will be detached and you’ll be booting off of the repaired copy:

az vm repair restore -g <RG of dead VM> -n <name of dead VM> --verbose

One last thing, though: it does leave the old OS disk behind, so back to that cost thing. Once your restored VM is up and running, you may want to get rid of the old disk.
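The leftover disk can be found and removed with the standard disk commands, for example (resource group and disk names are placeholders):

az disk list -g <RG of dead VM> -o table
az disk delete -g <RG of dead VM> -n <name of old, unattached OS disk> --yes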

Anyway, after that I had a bootable root CA, so crisis averted.

SCOM 2019 and Later Versions of 2016 No Longer Need FIPS Configuration

I’m a bit surprised by this, as our documentation does not imply that it’s the case, and I know I’ve personally had to set up FIPS for SCOM 2016 on numerous occasions. But I recently ran into a couple of situations where newer builds of SCOM 2016 configured for FIPS were working in spite of what the instructions said needed to be done.

I decided to test this in my lab and enable FIPS on my web console server without going through the process I detailed on my blog. To my surprise, the console continued to work. I did get an authentication screen asking for credentials at first, which doesn’t always happen, so that may be worth watching. It also seems that later builds of 2016 will work with FIPS enabled as well; I’m not sure when that transition was made.
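If you want to confirm which state a given server is actually in, the effective FIPS setting can be read from the registry (a quick sketch; this value is normally set via the "System cryptography: Use FIPS compliant algorithms" policy rather than edited directly):

# A value of 1 means FIPS mode is enforced on this machine
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy' -Name Enabled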

Security Monitoring: New Account Lockout Report

This was a customer request; there’s not much to it. A customer asked if they could get an account lockout report displaying locked-out accounts, so I’ve added a collection rule and a report. The report lists the accounts that were locked out, the source of the lockout, and the date the account was locked out. This will be in the 1.8.x release of the MP. As always, for any questions, feedback, or feature requests, feel free to reach out to me on LinkedIn, and I will gladly do what I can to improve this product.

It looks like this:

[Screenshot: account lockout report]
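The report surfaces essentially the same information found in the 4740 lockout events on a domain controller. A rough PowerShell sketch like this shows the equivalent data ad hoc (purely an illustration of the data, not how the collection rule is implemented, and the property indexes may differ in your environment):

Get-WinEvent -FilterHashtable @{ LogName = 'Security'; Id = 4740 } |
    Select-Object TimeCreated,
        @{ Name = 'LockedAccount'; Expression = { $_.Properties[0].Value } },
        @{ Name = 'SourceComputer'; Expression = { $_.Properties[1].Value } }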

Cyber-Security for the IT Professional, Part 4: Asking the Right Questions and Implementing Easy Wins

You can find Part 1 of this series here.

You can find Part 2 of this series here.

You can find Part 3 of this series here.

I want to start by noting that I don’t want to give the impression that mitigating against exploits is a bad thing. I’d note, though, that technical exploits typically fall on the vendor to mitigate, and to that extent, understanding the vendor’s formal guidance for patching and design is critical. As such, a robust patching process is a given. Simple mitigations against PtH are also a given, since there’s nothing in place from a technology standpoint that can stop it on Server 2008 or 2012 R1. A lot of mitigations to technical vulnerabilities are as simple as staying current, which IT organizations often struggle to do.

That said, the anatomy of a modern attack follows a fairly straightforward plan:

  1. Compromise an asset via some means (usually, but not exclusively, phishing).
  2. Add some sort of persistence mechanism to make it easy to return to the asset.
  3. Harvest any credentials they can.
  4. Use those credentials on other systems, repeating steps 2 and 3 as needed.
  5. Continue until they have the credentials they need (i.e., administrative ones).
  6. Do whatever they initially set out to do (steal your data, deploy ransomware, whatever).

Let’s revisit our assumptions for a second. Assumed breach means that we cannot prevent step 1; it’s going to happen. I’m not saying don’t educate end users, but I am saying that ultimately we need to prepare well beyond step 1. We have to keep the attackers on that system and that system alone. Keep in mind that most attackers are organizations with a limited amount of capital, just like our organizations. We cannot necessarily stop them from doing anything, but what we can do cheaply and easily, as a first step to securing our environment, is make it very expensive for them to do what they set out to do. Keep that in mind, because with the right measures in place, they won’t waste their time on you. I’ll admit that if the organization is determined and has deep enough pockets, you will likely have a long road ahead, but this is also a rare scenario. Commoditizing a zero day vulnerability, for example, is very expensive, but a nation state could have the pockets to do it if they thought it would achieve their goal. The average attacker, however, will not be willing to undertake those costs and instead will be quite happy to continue exploiting the same old vulnerabilities.

The key to stopping a bad guy is to address the design vulnerabilities listed in part 2 of this series. Restricting movement at any tier can be done cheaply and with relative ease:

  • Randomize your local admin passwords using a tool like LAPS. This way the attacker can no longer reuse the local admin hash. There are also GPO settings that can be configured to restrict local admin usage so that those accounts aren’t being used across the network.
  • Block all inbound connections at the local firewall and only allow administrative connections from designated administrative addresses (i.e., your PAWs). A rough example follows this list.
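As a sketch of that second bullet (the ports and the PAW subnet are illustrative; you would obviously pilot this before pushing it broadly):

# Default-deny inbound, then allow management ports (RDP/WinRM here) only from PAW addresses
Set-NetFirewallProfile -Profile Domain,Private,Public -DefaultInboundAction Block
New-NetFirewallRule -DisplayName 'Allow admin traffic from PAWs' -Direction Inbound -Action Allow -Protocol TCP -LocalPort 3389,5985,5986 -RemoteAddress 10.10.50.0/24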

Dealing with administrative credentials is a bit harder to do, but there are still some things we can implement that can greatly improve security posture.

  • With all accounts, we really need to identify which administrative accounts we actually need and make a conscious effort to limit them to least privilege. There should not, for instance, be very many people who need domain admin accounts. Very few accounts should need to be admins across the entire data tier (T1) either. A few administrators might need those rights, and perhaps a deployment account for software deployment, but by and large, there shouldn’t be many accounts with these types of needs. For the most part, service accounts should not need these rights, and accounts that need to run in memory should only have rights to log on to the systems that require them. A good strategy here will have a measurable effect on reducing the attack surface of most organizations, because the bad guys have fewer accounts they can compromise. This can be a political issue across many companies, and it’s a training issue as well, but if security is a focus, this is a great place to start, and it needs to start with management.
  • Likewise, immediately implementing a PAW structure is another thing that can prevent lateral movement. It’s worth noting that you don’t necessarily need a separate PAW device for each tier (one PAW can be configured to boot from multiple VHDs, for instance), but a dedicated PAW is a very good idea. Administration of any tier should be done from the tier-specific PAW. That allows you to restrict administration to the IP addresses of the PAW machines and prevent administrator logons from the Tier 2 environment. The PAW should be hardened, with no productivity applications on it. Internet access should be restricted if at all possible, and ideally the devices should be built from known good media (i.e., download the media from the vendor site and validate the hashes; do not grab it off of your software share). At this point, you’ve split your tiers cleanly. If a bad guy is on a T2 system, even if a T2 admin signs on to it, those credentials won’t be useful, because they can only be reused from a PAW; if that PAW is hardened properly, the attacker won’t be able to get into it. They’ll also never be able to scrape a T1 or T0 credential off of that asset, because those credentials are likewise restricted to their PAWs.

Those are the easy wins. Long term, implementing tools such as just-in-time administration, Credential Guard, identity management, and more advanced monitoring will all prove beneficial to varying degrees. Microsoft has plenty of guidance on this subject, and while getting everything in place in a short time is a tall order, you can focus on the items that mitigate the most pressing vulnerabilities facing your organization. The first steps are not foolproof, obviously, as attackers can still try other things from that compromised machine, but their ability to move will be greatly restricted, so much so that they are much more likely to give up. So what do we do to continue hardening our environment? The question I think we need to ask is probably more philosophical: what problem are we attempting to solve? Are we closing a vulnerability that will make it difficult for attackers to move through the environment? Are we simply hardening a particular system? Oftentimes, security purchases or decisions are made to solve issues around a particular exploit. Pass the hash is a good example of this. It’s without question one of the most exploited technical vulnerabilities in use today. While there is some sense in mitigating it, I would argue that resources would be better dedicated toward eliminating credential leakage and unrestricted movement. Ultimately, if an organization were to take care of those issues, mitigating pass the hash in particular is not nearly as important, because the attacker won’t have as many credentials to steal, nor will they be able to easily move through your environment with the ones they’ve acquired.

It’s also worth asking how common a particular exploit is before you mitigate against it. There are probably tens of thousands (if not more) of exploits out there, and many are simply theoretical. The real question that needs to be answered is whether or not a given exploit is actually in use. A commonly used exploit makes much more sense to mitigate against because it has effectively been commoditized, and the more of those that are mitigated, the more an attacker is forced to go elsewhere. Rushing to stop a zero day may have merit if you’re particularly vulnerable to it or have adversaries with deep enough pockets to exploit it, but not surprisingly, attackers are usually exploiting technical vulnerabilities that have had patches available for years, which means your patching strategy is something that should be heavily scrutinized.

Another place I would start is simply asking which (if not all) of these design vulnerabilities your particular organization faces and whether or not the solution is addressing them. I’m going to pick on admin password randomization software for a minute (note: not local admin randomization). Will something like this make it harder to brute force your passwords? Yes. But how often have we seen a bad guy on the inside of an organization brute forcing passwords? It doesn’t happen. Attackers don’t need to brute force your password; they have a number of means to get it without guessing. If you’ve seen password randomization systems in use, you’ll understand some of the other problems as well. What typically happens when they get implemented? I’ve seen it before: an administrator opens up Notepad and pastes their password in clear text onto their machine so that they can use it as needed, since memorizing said password is usually out of the question. Attackers can still get that info if they want it, and they can still install a key logger and capture it that way. This type of approach fails in large part because the password still exists on Tier 2. That would be true, I might add, even without these systems if PAWs are not in use. The bottom line is that password randomization systems don’t secure Tier 2, which means you’re still exposed to the underlying design vulnerability that allows the bad guys to steal your credentials.

The point is that we need to do a better job of asking questions and understanding where we are exposed. We aren’t going to be able to mitigate every vulnerability, but understanding the important ones and how to mitigate them will do wonders for improving our posture.