Monitors vs. Rules and How They Affect Alert Management

I’m going to do a back-to-the-basics article here, not because nothing has been written on the subject of monitors, rules, and SCOM, but because I don’t think the topic has been fleshed out well, and to non-seasoned SCOM engineers it is not exactly intuitive. As such, I want to walk through these two mechanisms that SCOM uses to monitor your environment and how they affect your alert management process, as there are some significant differences between the two that SCOM users need to be aware of (and I do mean users, the people who close and acknowledge alerts, not just the engineers dedicated to SCOM). Alert management is a difficult thing to do.  It’s the hardest part of managing SCOM due to noise, internal politics, management issues, and a whole host of other things.  A colleague of mine has put together a nice chart on the subject.  There is not one right way to do alert management, but there are a lot of wrong ways.  Today, I’d like to discuss some things that can cause problems with alert management, namely monitors and rules.

So first, let’s start with the basics.  SCOM has four main data types that get stored in the Data Warehouse: event, state, performance, and alert data.  Of these four, rules are responsible for collecting event and performance data.  Collection rules are not alert generating; they are simply mechanisms to collect and store data for the purpose of reporting. Want to know the processor utilization of server X over the last three months? It’s a rule that collects that data and tosses it in the DW.  The same goes for event collection rules: their purpose in this scenario is solely to collect.

State data, however, is a different animal. This is the health of the objects monitored in SCOM. You view this data, which lives in the OperationsManager database, through Health Explorer.  Of note, only monitors can change state, so when you view Health Explorer you will only see monitors in a red or yellow state (or all states if you unscope it); you will never see alerts generated by a rule.  This is why on occasion you’ll open Health Explorer from an alert only to see a completely healthy object underneath it: a rule generated the alert (and note that on occasion a management pack designer will create both a monitor and a rule with the same name… yeah… just to confuse you).  I’ll get to more on this in a minute.
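
If you like to see this outside the console, here is a minimal PowerShell sketch, assuming the SCOM 2012 (or later) OperationsManager module and an existing management group connection, that pulls the same state data Health Explorer visualizes:

```powershell
# Sketch: list monitored objects that are currently unhealthy.
# Note: Get-SCOMClassInstance with no filter can return a very large set in big environments.
Import-Module OperationsManager

Get-SCOMClassInstance |
    Where-Object { $_.HealthState -in 'Error', 'Warning' } |
    Select-Object DisplayName, FullName, HealthState |
    Sort-Object HealthState
```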

That brings us to alerts, as both monitors and rules generate them.  Because of this, you need to view alerts as simply alerts and recognize that SCOM has two alert generating mechanisms that behave a bit differently.  Neither monitors nor rules are required to generate alerts (monitors almost always do, but it is not a given).

So let’s start with rules:

  1. They don’t change state.  I covered this before, but a rule just tells you something happened.  It is not definitive proof that your environment is unhealthy.  A common rule would be something like “SQL DB backup failed to complete”.  There’s nothing wrong with the health of your SQL environment, but this might be something you want to look at. If you don’t care about the backup, then turn it off (the backup itself, not the rule), as at this point it is clutter.
  2. Rules like to talk.  They tell you something needs to be looked at.  They tell you again.  They tell you again and again and again.  A poorly constructed rule can generate thousands of alerts in a short period of time… and sadly, yes, I know this from experience 🙂  You can build alert flood protection into rules, but that has to be done when the rule is created, and if it’s created in a sealed management pack, the only way to add it is to disable the rule and re-create it yourself.  With alert flood protection, rules increment a “repeat count” in SCOM.  This counter is not visible by default in the console: right-click your columns, select “Personalize view”, and check the box that says “Repeat Count” (usually near the bottom).  A rule with a high repeat count means the condition has been detected on numerous occasions, and you might want to investigate what is going on, as closing the alert only means it will come back in a few minutes.
  3. Rules do not auto-close.  This means that if you’ve fixed the problem, you need to close the alert if it was generated by a rule (the PowerShell sketch after this list shows one way to clean these up in bulk).
  4. Alert generation cannot be turned off for rules in sealed management packs.  You can only disable the rule and/or re-create it if needed.
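
Since rule-generated alerts never close themselves (point 3 above), many shops script a periodic cleanup. Here is a minimal sketch, assuming the OperationsManager module and a connected management group; the seven-day window and comment text are arbitrary examples:

```powershell
# Sketch: bulk-close stale alerts that came from rules (monitor alerts are left alone,
# because their health should be reset instead). Resolution state 255 = Closed.
Import-Module OperationsManager

Get-SCOMAlert -ResolutionState 0 |
    Where-Object { -not $_.IsMonitorAlert -and $_.TimeRaised -lt (Get-Date).AddDays(-7) } |
    Set-SCOMAlert -ResolutionState 255 -Comment "Closed by scheduled rule-alert cleanup"
```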

Now on to monitors:

  1. Monitors do change state.  This means that when a monitor detects an unhealthy condition, the object that contains it will go unhealthy.  Be careful, though: not all objects roll up through Windows Computer, so opening Health Explorer from that view may surprise you, as you won’t necessarily see what you expect.  There is no super-class in SCOM, and I don’t believe that will change in 2016.
  2. Monitors have a mechanism for detecting a healthy condition.  You can do this via a timer, an event ID, or a script. The bottom line is that the premise behind monitors is health; as such, they need to have both healthy and unhealthy conditions.
  3. Alerts generated by monitors usually auto-resolve.  This means that when a monitor returns to healthy, it will close the alert it generated.  This is an overridable parameter, so by all means check it, but it’s pretty rare to see auto-resolve turned off by default; I cannot think of an example, though I’m sure someone has seen it.  That said, this can be a cause of noise in your environment if you have a monitor that is flip-flopping back and forth between healthy and unhealthy.  Monitors doing this need to be addressed, and since the alerts can go away on their own, you can miss them if you don’t have a good alert management process in place.  I typically like to use the Most Common Alerts report in the generic report library, as it tells you which monitors and rules are generating said alerts.  Every now and then you’ll see items on that report that have no alerts in the console, and a flip-flopping monitor can be the cause.
  4. Closing an alert generated by a monitor does not reset the health.  This means that if the underlying condition is still present, the health of the object in question will not go back to green.  Therefore, DO NOT CLOSE AN ALERT GENERATED BY A MONITOR. INSTEAD, USE HEALTH EXPLORER TO RESET THE HEALTH.  Sorry that I had to be loud there. This is probably the most common mistake someone new to SCOM makes (myself included, by the way).  It is an even bigger problem for organizations that use SCOM in their NOC (network operations center), as the helpdesk/NOC tends to be staffed with less experienced people, and the natural tendency is to close the alert when you think you’ve fixed it.  Sadly, some close it because they don’t know what to do with it.  When that happens, that particular monitor will not generate another alert until it goes healthy first; other monitors will continue to generate alerts, but the one in question will not until it resets.  This also causes grooming problems in the OperationsManager DB, as state data isn’t groomed while the object is unhealthy.  It would be nice if the product team shipped an update that resets health in this scenario, but to my knowledge that has not happened.  (See the sketch after this list for a scripted way to do the reset.)
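
For point 4, here is a minimal sketch of doing the reset from PowerShell instead of Health Explorer. It assumes the OperationsManager module and a connected management group; the alert name is a placeholder, and ResetMonitoringState is a method on the SDK object that Get-SCOMClassInstance returns, so verify it behaves as expected in your environment:

```powershell
# Sketch: reset the monitor behind an alert instead of closing the alert.
Import-Module OperationsManager

$alert = Get-SCOMAlert -Name "Some monitor-generated alert" -ResolutionState 0 |
         Select-Object -First 1                                   # placeholder alert name

if ($alert -and $alert.IsMonitorAlert) {
    $monitor  = Get-SCOMMonitor -Id $alert.MonitoringRuleId       # for monitor alerts this ID is the monitor
    $instance = Get-SCOMClassInstance -Id $alert.MonitoringObjectId
    $instance.ResetMonitoringState($monitor) | Out-Null           # health resets; the monitor then closes its own alert
}
```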

OK, so that’s the scoop.  The big thing is to identify which mechanism generated the alert.  From there, you can craft a process for how to deal with it.  To answer one question I get asked a lot: yes, the alert itself tells you which mechanism generated it.

[screenshot]
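
If you’d rather see this at scale than alert by alert, a small sketch (same assumptions as before: OperationsManager module, connected management group) lists open alerts along with which mechanism raised them and how noisy they have been:

```powershell
# Sketch: triage view - which open alerts came from monitors vs. rules, and their repeat counts.
Import-Module OperationsManager

Get-SCOMAlert -ResolutionState 0 |
    Select-Object Name, MonitoringObjectDisplayName, IsMonitorAlert, RepeatCount, TimeRaised |
    Sort-Object IsMonitorAlert, RepeatCount -Descending |
    Format-Table -AutoSize
```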

Using SCOM to Detect Successful Pass the Hash Attacks (Part 1)

Part 2 is here.

Those who know me know I’ve been using my free time to mess around with the idea of using SCOM to help identify when an advanced persistent threat is active in your environment.  This is a problem most IT organizations have, given that the average attacker isn’t discovered until more than 250 days after they first compromised the environment, and many are never found.  Part of the problem is the massive amount of log information that needs to be parsed in order to determine an active presence in the environment. There are products you can buy, such as Microsoft ATA or Splunk, or you can forward log information to Azure and use OMS.  Products like these can be expensive, but by the same token they are much better at log analytics than a tool like SCOM. That said, my goal is to create a poor man’s solution for identifying a possible pass-the-hash (PtH) attack in progress. I’ll do so by seeing what is generated when I reproduce the attack in my lab.  This entry will cover successful elevation attempts.  My next entry will cover my attempt to detect an attacker who is attempting to elevate with a non-DA account.

To start, I downloaded all the necessary tools to a machine and made a standard domain account a local admin on it. This is because a pass-the-hash attack requires local admin rights on the machine in order to read from the LSA. To be clear, for the average attacker, getting local access to a machine, any machine, is easy to do. Typically this starts at tier 2 with a targeted phishing attack, and despite the fact that we try to educate users to never open that email from a non-trusted source, they do it anyway, roughly to the tune of about 11% of users. I’m not bothering with this piece, as I’m assuming that an attacker can get to this point fairly easily… the reality is they can.

Step 1 – switching credentials:

The first thing I’ve done is simply execute mimikatz and launch a local command shell under a different set of creds than the ones I’m running under. The user account I’m signed in with is “test”.  I have a domain admin signed in to another session, and unbeknownst to this DA, my test account is compromised.  This is straightforward:

[screenshot]

I grabbed the hash and launched a command shell. It turns out that this activity does leave a trail.  Using mimikatz to launch a command line under a domain admin’s credentials generates this event:

[screenshot]

Each of these items is parameterized, which makes it fairly easy to craft a rule in SCOM. The trick is making sure the events in question are unique to this type of attack; if they aren’t, all I’ve done is create a whole bunch of noise that will be ignored.  As luck would have it, the LogonProcessName and LogonType fields are distinctly different from the average 4624 event in my environment.  Let’s hold on to that thought for the moment.
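
Before building anything in SCOM, it’s worth confirming that claim in your own environment. Here is a quick sketch, run locally on the box where the credential swap happened, that tallies the LogonType / LogonProcessName combinations in recent 4624 events so anything unusual stands out (the field names follow the standard 4624 event schema; verify them against your own event XML):

```powershell
# Sketch: baseline recent 4624 logon events by LogonType and LogonProcessName.
$events = Get-WinEvent -FilterHashtable @{ LogName = 'Security'; Id = 4624 } -MaxEvents 5000

$events | ForEach-Object {
    $data = ([xml]$_.ToXml()).Event.EventData.Data
    [pscustomobject]@{
        LogonType        = ($data | Where-Object { $_.Name -eq 'LogonType' }).'#text'
        LogonProcessName = ($data | Where-Object { $_.Name -eq 'LogonProcessName' }).'#text'
    }
} |
    Group-Object LogonType, LogonProcessName |
    Sort-Object Count -Descending |
    Select-Object Count, Name
```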

Step 2 – Lateral Movement:

Now I’m going to use those credentials to hit another machine; this is why PtH attacks are of such concern.  It’s easy: from the shell that mimikatz opened, I simply launched psexec against a remote system.  My logged-on user has no rights to that system.  However, I’m in.

[screenshot]

I’d note that I’m trying this against my SCOM server, but moving to my DC in step 3 was just as easy, though as an unexpected side effect it did kick off my logged-on user.  Same command, different machine. My generic user account now owns my environment.  I found this event in my SCOM machine’s security log:

[screenshot]

The XML view is a bit more complex, as the impersonation level for whatever reason doesn’t translate to a friendly string. Instead of seeing “Impersonation” in the XML, I simply see a code (%%1833).

[screenshot]

That’s fine.  Unfortunately, that code is not unique to this type of movement.

Step 3 – The DC:

On the DC, I see the events pictured below.  The problem is that I see a bunch of them, so at this point I’m going to have to configure some sort of alert flood protection.  The other odd behavior is that the impersonation level on the DC is set to Delegation, whereas on the member server it was simply Impersonation.

[screenshot] [screenshot]

The other problem is that there aren’t many breadcrumbs. The impersonation level is “Delegation”, but that by itself is hardly uncommon for 4624 events. It does, however, at least in my admittedly small sample, appear to be unusual for a domain admin to sign on to a machine with an impersonation level of Delegation.  I could be wrong, and that’s part of why I’m publishing this.  Computer accounts commonly log on with this impersonation level, but user accounts do not. This will (hopefully) give me something unique to key on in SCOM.
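
To test that hunch before wiring it into SCOM, here is a small sketch you can run on a DC. The %%-style code for Delegation is a placeholder (copy the actual value from the XML of one of your own step 3 events first), and the machine-account exclusion mirrors the $ filter used in the rules below:

```powershell
# Sketch: find 4624 events where a non-machine account logged on with
# ImpersonationLevel = Delegation. $delegationCode is an assumption - pull the
# real %%-style code from your own event XML before trusting the results.
$delegationCode = '%%1840'   # placeholder

Get-WinEvent -FilterHashtable @{ LogName = 'Security'; Id = 4624 } -MaxEvents 10000 |
    Where-Object {
        $data = ([xml]$_.ToXml()).Event.EventData.Data
        $imp  = ($data | Where-Object { $_.Name -eq 'ImpersonationLevel' }).'#text'
        $user = ($data | Where-Object { $_.Name -eq 'TargetUserName' }).'#text'
        $imp -eq $delegationCode -and $user -notmatch '\$$'
    } |
    Select-Object TimeCreated, MachineName, Id
```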

Now that we can see the behavior of this attack, we can potentially monitor for it.  DISCLAIMER: this is me in my lab. I’m writing this partly for my own benefit, and my lab is not a production environment. My goal is to monitor for an attack in progress in a way that does not generate noise. I cannot emphasize enough that your organization will need a good alert management process in order to actually respond properly to these alerts. I’m hoping that people with better lab environments and a security background can reproduce this and verify that the noise level in a quiet state is low.

So on to the rules.  I want to test this in environments other than my lab to see if it holds up; it’s quite possible that it does not and instead either generates a lot of noise or doesn’t fire in certain circumstances.  Feel free to add comments with your own results.  The rules themselves are straightforward.

Rule 1: Monitoring the DC for Step 3 related events:

Target: Active Directory 2008 DC Computers

The rule type is NT Event.  Here’s a screenshot of the parameters:

[screenshot]

Parameter 9 is the logon type, parameter 21 is the impersonation level, and parameter 6 is used to ignore these events if there’s a $ symbol in them (which is the case when a machine account is doing the impersonating).  I configured alert suppression on the source network address (parameter 19), as well as a filter to make sure the rule isn’t catching any local authentication. I’m not sure that’s the right answer, but it should keep the suppression to one IP address: if someone hops from destination to destination, this shows as multiple alerts, while if they sit on one system and hit many, it shows as only one. I’m not sure there’s an easy way to configure flood protection for this, given that the event shows up multiple times on a DC for a single login attempt.
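
SCOM’s positional event parameters can be opaque, so for reference, here is a small sketch of how the parameter numbers used throughout these rules map to named fields in the 4624 event data. The mapping follows the standard 4624 schema; treat it as something to verify against your own event XML rather than gospel:

```powershell
# Reference sketch: positional parameter -> named EventData field for event 4624.
# Confirm against your own event XML before building rules on it.
$param4624 = @{
    6  = 'TargetUserName'            # the account logging on; machine accounts end in $
    9  = 'LogonType'
    10 = 'LogonProcessName'
    11 = 'AuthenticationPackageName'
    19 = 'IpAddress'                 # source network address, used here for suppression
    21 = 'ImpersonationLevel'        # stored as a %%-style code, not the friendly string
}
$param4624
```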

By the same token, we can configure something similar against the server OS to capture the events seen when an account moves laterally within a tier. If you’re forwarding security events to an event collector, we should be able to create a similar rule there.

One observation in my lab is that domain admin logons via RDP will generate this alert, while standard user logons via RDP do not.  As a rule, you probably shouldn’t be using a DA account for much of anything, but this can potentially generate false positives. I’d love additional feedback on this particular rule.

Rule 2: Monitoring the Member Servers for Lateral Walk (step 2):

Target:  Windows Server Operating System

Like the other rule, it is an alert generating NT event rule targeting the security log.

[screenshot]

Parameter 9 is the logon type and parameter 21 is the impersonation code.  Parameter 19 filters out the local IP address.  Due to noise, I had to filter out a few additional things: I excluded anything with ANONYMOUS in it, as DCs see this type of logon event for the SYSTEM account under normal conditions, and I filtered on the $ character, as local machine accounts authenticate in this manner. My SQL server also lit this rule up with normal traffic, so I created an override to turn it off for the SQL Computers group created by the SQL management pack (you must have the SQL MP installed in order to use that group for the override). Unfortunately, this means you cannot detect this condition on a SQL server; however, we have plenty of other events to target. I also had to disable this against domain controllers for the same reason, though they weren’t nearly as noisy. I needed to require Kerberos as the authentication package as well, since RDP sessions will generate this event over an NTLM connection. Finally, I configured alert suppression for this rule via parameter 19, as this event appears more than once on a targeted system.
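
Before pushing Rule 2 out to a group of servers, you can approximate its criteria with a local query against a member server’s security log and eyeball the noise. The logon type value below is a placeholder (use the one from your own step 2 event), and the $ / ANONYMOUS exclusions are applied in PowerShell because the event log’s XPath subset doesn’t support substring matching:

```powershell
# Sketch: approximate Rule 2's filter locally to gauge noise before deploying it in SCOM.
# The LogonType value '3' is a placeholder; substitute the value from your own captured event.
$xpath = "*[System[EventID=4624]]" +
         " and *[EventData[Data[@Name='LogonType']='3']]" +
         " and *[EventData[Data[@Name='AuthenticationPackageName']='Kerberos']]"

Get-WinEvent -LogName Security -FilterXPath $xpath -MaxEvents 2000 |
    Where-Object {
        $data = ([xml]$_.ToXml()).Event.EventData.Data
        $user = ($data | Where-Object { $_.Name -eq 'TargetUserName' }).'#text'
        $user -notmatch '\$$' -and $user -notmatch 'ANONYMOUS'
    } |
    Select-Object TimeCreated, MachineName
```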

Rule 3: Monitoring for a credential swap (step 1):

Target:  Windows Server Operating System.

As with the other rules, we are targeting the security log.

[screenshot]

Parameter 9 is logon type.  Parameter 10 is the process name.  Parameter 11 is the authentication package.
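
The same sanity check works for Rule 3. LogonType 9 (NewCredentials), logon process “seclogo”, and authentication package “Negotiate” are the values commonly documented for this kind of credential swap (mimikatz sekurlsa::pth, and also a legitimate runas /netonly), but treat them as assumptions and confirm them against the event you captured in step 1:

```powershell
# Sketch: look for 4624 events matching the credential-swap pattern from step 1.
# All three values below are assumptions to confirm against your own captured event.
$xpath = "*[System[EventID=4624]]" +
         " and *[EventData[Data[@Name='LogonType']='9']]" +
         " and *[EventData[Data[@Name='LogonProcessName']='seclogo']]" +
         " and *[EventData[Data[@Name='AuthenticationPackageName']='Negotiate']]"

Get-WinEvent -LogName Security -FilterXPath $xpath -MaxEvents 50 |
    Select-Object TimeCreated, MachineName, Id
```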

The end result at this point in my lab is a very quiet set of targeted rules that can detect the crumbs left behind when an attacker penetrates the environment. This has only been tested in my lab, so please feel free to let me know via the comments if you can replicate it, or if your production environment picks up noise that I’m not seeing. The goal is to leave a user with alerts that are actionable. I can provide the MP I’m developing (though note that I’m doing other things in it as well). If this is something you are interested in testing, please hit me up on LinkedIn.