I ran into a rather peculiar issue with a SCOM agent, and after speaking to Ainsley Blackmon in SCOM support, it was pretty clear that this hasn’t been seen before. Hopefully that means that it is something you won’t ever see, but it did have enough similarities to the TLS/Schannel issues that I’d occasionally observe with a SCOM agent that it’s worth writing it down, especially since all of the log information was rather cryptic about what was actually going on.
First, the scenario. It was straightforward. We were deploying SCOM 2016 as part of a migration from a 2012 R2 environment. All systems checked in except for one, which remained stuck on “Not Monitored”. A quick trip to the system showed the standard authentication errors you see in this kind of situation: the connection was immediately being closed. The server was joined to the same domain as the management server, so there was nothing to troubleshoot with authentication. Reinstalling the agent, both manually and via the console, and repairing the agent all reported success, but the end result was the same. On the management server, I did dig up a rather cryptic error about the agent not having what it needs to open communication, and after some digging the cause was very obvious. The agent was missing its self-signed certificate.
For background, that self-signed cert is something SCOM uses to (as I understand it) encrypt communication between an agent and a management server so that things such as Run As account passwords can be securely transmitted between them. You don’t need to do anything with this particular certificate. The health service will generate it when it starts. It’s self-signed, and it just sits in your certificate store. The below screenshot is an example of this from my 2012 environment. Note that in 2016, the folder name changes from “Operations Manager” (as shown below) to “Microsoft Monitoring Agent”.
In this particular agent’s case, the Microsoft Monitoring Agent folder and certificate were missing. That seemed odd. The logs weren’t very helpful on this issue either. There was nothing in the application or system logs. There were, however, a bunch of 5061 audit failures in the security log. I couldn’t get a screenshot here, so I grabbed an example off the internet.
The major difference was the Operation field (highlighted above). During a health service restart, this event would be shown as pictured above. During an agent install, the install would generate the same event, but the value in the Operation field was “Create Key”.
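If you want to see how the 5061 failures split by Operation without clicking through each event, something like the following rough Python sketch can tally them from a Security log exported as XML. It rests on a few assumptions. The export is assumed to use the standard Windows event schema namespace. The function name `tally_operations` and the sample data are mine and purely illustrative. Only the “Create Key” value comes from what I saw; the other value is a placeholder. Depending on how the log is exported, the Operation field may show up as a message code rather than readable text.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by Windows event XML exports (assumed standard schema).
EVT_NS = "http://schemas.microsoft.com/win/2004/08/events/event"
NS = {"e": EVT_NS}


def tally_operations(xml_text, event_id="5061"):
    """Count events with the given ID, grouped by their Operation field.

    Hypothetical helper for illustration; not part of any SCOM tooling.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    # iter() also yields the root itself, so a single <Event> document works.
    for event in root.iter("{%s}Event" % EVT_NS):
        eid = event.find("e:System/e:EventID", NS)
        if eid is None or (eid.text or "").strip() != event_id:
            continue
        op = event.find("e:EventData/e:Data[@Name='Operation']", NS)
        value = (op.text or "").strip() if op is not None else "(missing)"
        counts[value] += 1
    return counts


# Made-up sample data. "Placeholder Operation" is a stand-in value, not a
# real Operation name taken from the events described in this post.
SAMPLE_XML = """
<Events>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System><EventID>5061</EventID></System>
    <EventData><Data Name="Operation">Create Key</Data></EventData>
  </Event>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System><EventID>5061</EventID></System>
    <EventData><Data Name="Operation">Placeholder Operation</Data></EventData>
  </Event>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System><EventID>5061</EventID></System>
    <EventData><Data Name="Operation">Placeholder Operation</Data></EventData>
  </Event>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System><EventID>4624</EventID></System>
    <EventData><Data Name="LogonType">3</Data></EventData>
  </Event>
</Events>
"""

if __name__ == "__main__":
    for operation, count in tally_operations(SAMPLE_XML).items():
        print(operation, count)
```

Running the tally once after a health service restart and again after an agent install would make the change in the Operation value easy to spot.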
We eventually had to turn to Procmon (Process Monitor) to figure this one out, but ultimately, LSASS was being denied access to a single folder in the OS:
In this case, the permissions on this folder were corrupted. Again, I don’t think this will be a common issue, but I suspect that with the move away from TLS, these might pop up from time to time. For the record, this is what the permissions on that folder should be: