Using SCOM to Detect Failed Pass the Hash attacks (Part 2)

A couple of weeks back, I wrote a piece on creating some rules to potentially detect pass the hash attacks in your environment. This is the second article in that series and, time permitting, one of many more I hope to do over the next year or so on using SCOM to detect active threats in an environment. To start, I want to apologize for the delay; my lab crashed on me and I had to spend far too much time fixing it.

Today we will discuss looking at failure attempts. When an attacker compromises a machine, they can use Windows Credential Editor or Mimikatz to enumerate the users that are logged on as well as the hashes stored inside the LSA. What they do not necessarily know is what permissions those accounts have; as such, the only thing they can do is attempt to log in with these stored accounts until they find one that works, eventually giving them complete access to the victim's environment. They accomplish this by moving laterally throughout the organization, desktop to desktop or server to server, until they finally steal the credentials of a domain admin account, giving them the keys to the kingdom. That reason alone is why domain admin accounts should never sign on to a desktop, nor should they sign on to a server or run as a service. On average, it takes an attacker approximately two days to go from infiltrating a desktop to stealing the keys to the kingdom. It also takes the victim the better part of a year to realize that they've been owned.

As a reminder, the goal here is to avoid noise generated by normal events. I've seen several implementations of security monitoring in SCOM which do nothing but generate thousands of alerts that no one will ever look at. Alert management is the most difficult aspect of a SCOM environment, and even with good processes in place, asking a person or team to sift through hundreds, if not thousands, of alerts generated by standard, everyday activity is a losing proposition. It accomplishes nothing but reinforcing the check-the-monitoring-box mentality that so many organizations already have.

As for the experiment, I'm going to use my compromised system to steal accounts, much like I did in part 1. The big difference is that instead of the account having the domain admin rights that the attacker desires, it will be nothing more than a standard account with no access to other machines in the environment. The process is pretty much the same as in part 1, and as such the rule I created to look for a credential swap flagged immediately. That's good. And, as expected, I see the following result when launching a remote PsExec against my server.

[Image: output of the failed remote PsExec attempt]

Essentially, I could not move laterally in this scenario, but how this looks surprised me. I expected it to lead to some failures (4625 events) that could be tracked. However, that turns out not to be the case. For one, the tier 1 monitor that I set up in part 1 of this series flagged. I was not expecting that, and sure enough, I see 4624 events on my SCOM server for the user account whose credentials I used. Each was followed immediately by a 4634 event indicating the session was destroyed. There were no 4625 events on this server, nor were there any on my domain controller. I'm not quite sure why authentication is handled this way, but then again, I've never really looked at it this closely. Unfortunately, this does me little good as it is a very common sequence of events, but since my other rules still alerted in this scenario, I have visibility into the attempt. I tried a few other methods with similar results. I'm still generating a 4624, with the only difference being the 4634 that follows immediately after it. Attacking a DC with a non-DA account yielded the same results: a 4624 followed immediately by a 4634 event, and the 4624 was picked up by my previous rules, so there's no need to create another rule.
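
If you want to see this sequence for yourself, it is easy to pull out of the Security log with PowerShell. The snippet below is only a sketch: it assumes you run it elevated on the target server, and 'victimuser' is a placeholder for the account whose credentials were replayed.

```powershell
# Pull the last hour of 4624 (logon) and 4634 (logoff) events from the local
# Security log and list them in time order, so the logon-immediately-followed-
# by-logoff pattern described above is visible. Requires an elevated prompt.
# 'victimuser' is a placeholder account name.
$since = (Get-Date).AddHours(-1)

Get-WinEvent -FilterHashtable @{ LogName = 'Security'; Id = 4624, 4634; StartTime = $since } |
    Where-Object { $_.Message -match 'victimuser' } |
    Sort-Object TimeCreated |
    Select-Object TimeCreated, Id,
        @{ Name = 'Summary'; Expression = { ($_.Message -split "`r?`n")[0] } } |
    Format-Table -AutoSize
```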

I suspect that Rule 2 in my previous post will likely be the noisiest. I had to turn it off in my lab for SQL servers and DCs due to standard traffic. That's fine, as it still has some value, but I suspect that some types of normal front-end-to-back-end communication will trigger it. That doesn't do any favors to the person or team responsible for security monitoring. An occasional false positive is acceptable, but at the end of the day, I prefer my alerts to be actionable every time. If they are not, SCOM quickly turns into that tool that no one uses because everything that comes out of it is worthless. This brings me to another point. If you're having problems with Rule 2 in particular (though Rule 3 may come into play here too), then consider Parameter 19 a bit more closely. That parameter contains the source IP address. While I wouldn't consider it a best practice, a tier 1 environment may have service accounts and applications that connect to other systems within tier 1. A lot of that is probably not a good thing, but it is somewhat normal for a front end/back end configuration. Any type of web server/DB pairing is likely to trigger false positives. However, the anatomy of an attack is a bit different.

Attackers rarely get to start at the server level or the DC level. They almost always start in tier 2. Security professionals refer to this as "assumed breach." Simply put, no matter how much you train people, roughly 10% of your environment is not going to verify the source but will instead click on the crazy cat link or whatever the popular meme of the day is. Security teams unfortunately cannot stop this. That said, this is also your attacker's entry point into the environment. Chances are good that someone is sitting in your tier 2 environment right now, because getting there through one of the myriad Flash or Java vulnerabilities is pretty easy to do. But that also gives us a unique way to search for attackers, because SCOM can use wildcards. Since Parameter 19 contains an IP address, you can build these rules with wildcards to filter the results so that you only get alerts when these anomalies are detected from a tier 2 IP address accessing a tier 1 or tier 0 system. Other than your IT staff, this shouldn't be happening at all. This would have to be customized to an environment, but it is not something that would be terribly difficult to do.
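
To illustrate the idea outside of SCOM, here is a rough PowerShell sketch that checks recent 4624 events on a server for a source address in the workstation range. The '10.2.*' subnet is made up for the example; in a SCOM rule, the same test would be a wildcard expression against Parameter 19.

```powershell
# Flag recent 4624 events whose source address falls within a (hypothetical)
# tier 2 workstation subnet. The IpAddress field in the event XML is what SCOM
# surfaces as Parameter 19. Requires an elevated prompt.
$tier2Pattern = '10.2.*'   # placeholder workstation subnet

Get-WinEvent -FilterHashtable @{ LogName = 'Security'; Id = 4624 } -MaxEvents 500 |
    Where-Object {
        $xml = [xml]$_.ToXml()
        $ip  = ($xml.Event.EventData.Data | Where-Object { $_.Name -eq 'IpAddress' }).'#text'
        $ip -like $tier2Pattern
    } |
    Select-Object TimeCreated, Id, @{ Name = 'SourceIp'; Expression = {
        (([xml]$_.ToXml()).Event.EventData.Data | Where-Object { $_.Name -eq 'IpAddress' }).'#text'
    } } |
    Format-Table -AutoSize
```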

One final note: I'm working on a management pack that contains these rules, additional security rules, and other security-related items that SCOM can provide. My only test environment is my lab, which is hardly a production-grade environment, so I do welcome feedback. While I cannot support this management pack, I can provide it if you are interested in trying it out. My main goal is to keep the noise down to a minimum so that each alert is actionable. While that is not always easy to do, trying this out in other environments will go a long way toward getting it to that point. If this is something you are interested in testing, please hit me up on LinkedIn.

The Anatomy of a Good SCOM Alert Management Process – Part 3: Completing the Alert Management Life Cycle.

This is my final article in a 3-part series about Alert Management. Part 1 is here. Part 2 is here.

In the first two parts, we have already discussed why alert management is necessary and what tends to get in the way.  The final article in this series will cover what processes need to change or be added in order to facilitate good alert management.

The information below can be found in a number of places. It is in the health check we provide for SCOM, I've seen it in presentations by a number of different Microsoft PFEs, and it shows up on some blogs too. Simply put, there's plenty out there that can point you in the right direction, though sometimes the WHY gets left out.

Daily Tasks

  • Check, using the Operations Manager management packs, that the Operations Manager components themselves are healthy
  • Check that new alerts from the previous day are not still in a state of 'New' (a PowerShell sketch of this check and the agent check follows this list)
  • Check for any unusual alert or event noise; investigate further if required (e.g., failing scripts, WMI issues, etc.)
  • Check all agents' status for any that are in anything other than a green state
  • Review nightly backup jobs and database space allocation
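
For the alert and agent checks above, a bit of PowerShell against the OperationsManager module makes them quick to script. This is a minimal sketch, assuming the console/module is installed and you are connected to your management group.

```powershell
# Minimal sketch of two daily checks: alerts still in the 'New' (0) resolution
# state from the previous day, and agents whose health state is not green.
Import-Module OperationsManager

# Alerts raised since yesterday that are still 'New'
$yesterday = (Get-Date).AddDays(-1)
Get-SCOMAlert -ResolutionState 0 |
    Where-Object { $_.TimeRaised -ge $yesterday } |
    Sort-Object TimeRaised |
    Select-Object TimeRaised, Severity, MonitoringObjectDisplayName, Name

# Agents in anything other than a healthy (green) state
Get-SCOMAgent |
    Where-Object { $_.HealthState -ne 'Success' } |
    Select-Object DisplayName, HealthState, Version
```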

Weekly Tasks

  • Schedule a weekly meeting with IT operational stakeholders and technical staff to review the previous week's most common alerts
  • Run the 'Most Common Alerts' report; investigate where necessary (see the bullet above; a rough PowerShell equivalent follows this list)
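
If you'd rather script it than open the report, here is a rough PowerShell stand-in for the 'Most Common Alerts' view. It's a sketch, not a replacement for the report; it simply groups the last week of alerts by name and shows the noisiest.

```powershell
# Roughly approximate the 'Most Common Alerts' report: group the last seven
# days of alerts by name and show the ten most frequent.
Import-Module OperationsManager

$since = (Get-Date).AddDays(-7)

Get-SCOMAlert |
    Where-Object { $_.TimeRaised -ge $since } |
    Group-Object Name |
    Sort-Object Count -Descending |
    Select-Object -Property Count, Name -First 10
```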

Monthly Tasks

  • Check for new versions of the management packs you have installed, and check newly released management packs for suitability for your monitored environment (a quick PowerShell listing of installed MP versions follows this list)
  • Run the baseline counters to assess the ongoing performance of the Operations Manager environment as new agents and new management packs are added
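
For the management pack version check, a quick listing like the one below (again assuming the OperationsManager module) gives you something to compare against the latest releases.

```powershell
# List installed management packs with their versions so they can be compared
# against the current releases.
Import-Module OperationsManager

Get-SCOMManagementPack |
    Sort-Object Name |
    Select-Object Name, Version, Sealed |
    Format-Table -AutoSize
```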

The task list doesn't necessarily say WHO is responsible for completing these items, but I can say with reasonable certainty that if the SCOM administrator is the only one expected to do them, he or she will fail. Alert noise in particular is a team effort. It needs to be handled directly by the people whose responsibility it is to maintain the systems being monitored. That means your AD guys should be watching the AD management pack, the SQL guys need to be watching for SQL alerts, and so on and so forth. They know their products better than the SCOM administrator ever will.

Tier one (and by proxy tier two) can certainly be the eyes and ears on the alerts that come through, but they need clearly defined escalation paths so that issues that aren't easily resolved can be sent on to the correct tier three teams. SCOM does a lot of self-alerting, so that escalation needs to include the SCOM administrators, as issues such as WMI scripts not running, failing workflows, and various management-group-related alerts need to eventually make it to the SCOM administrator. Issues such as health service heartbeats (and by proxy gray agents when that heartbeat threshold is exceeded) need to be looked at right away. Those indicate that an agent is, at the very least, not being monitored. There are a number of reasons why that could be the case, ranging from down systems (which you want to address), to bad processes, to some sort of client issue preventing communication.
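
Gray agents in particular are easy to spot from PowerShell. Below is a minimal sketch, assuming the OperationsManager module, that lists agents whose health service is no longer reporting.

```powershell
# Find agents that have gone gray, i.e. their health service is no longer
# reporting availability to the management group.
Import-Module OperationsManager

$agentClass = Get-SCOMClass -Name "Microsoft.SystemCenter.Agent"

Get-SCOMClassInstance -Class $agentClass |
    Where-Object { $_.IsAvailable -eq $false } |
    Select-Object DisplayName, HealthState
```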

Finally, all of this requires some sort of accountability.  Management doesn’t necessarily need to know why system X is red.  That’s usually the wrong question.  What management needs to be ensuring is that when there’s an alert from SCOM, SOMEONE is addressing it, and that someone also has a clear escalation path when they get to a point where they aren’t sure what’s going on.  To be clear, there’s going to be A LOT of this at first. That’s normal, and that also gets us into other key processes that need to be formed or adjusted in order to make this work.

  1. Server commission/decommission:  The most common cause of gray agents in SCOM is the failure to remove the agent from SCOM when the server is being retired.  It's a simple change, but it has to be worked into your organization's current process.  On the flip side, ensuring that new servers are promptly added to SCOM is also important.  How that is managed is more organization specific.  You can auto-deploy via SCCM or AD (though don't forget to change the remotely manageable setting if you do), or you can manually deploy through the SCOM console. You can also pre-install the agent in the image and use AD assignment if that is preferred.  Keep in mind that systems in a DMZ will require certificates or a gateway to authenticate, which will further affect these processes.  You may also want to think about whether or not your development systems should be monitored by the production environment (as these will usually generate more noise).  You may want to consider putting those systems in a dev SCOM environment (there is likely no additional cost).
  2. Development Environment:  The dev SCOM environment is also something that will have its own processes.  It will be used more for testing new MP rollouts, and rather than being watched by your day-to-day support operations, it really is only watched by the engineers responsible for their products and by the SCOM administrator.
  3. Maintenance:  Server maintenance will need to be adjusted as well.  This might be the biggest process change (or in most cases, a new process altogether).  Rebooting a DC during production hours (for example) is somewhat normal since it really won't cause an outage, but if that DC is, say, the PDC emulator, each DC in SCOM will generate an alert when it goes down.  Domain controllers aren't the only example here; the same applies any time a server is rebooted.  Reboots can generate a health service heartbeat alert if the server misses its ping, or even a gray server if the reboot takes a while.  Application-specific alerts can be generated as well, and SCOM-specific alerts will fire when workflows are suddenly terminated.  This process is key, as it's a direct contributor to the daily noise that SCOM typically generates.  SCOM isn't smart enough to know which outages are acceptable to your organization and which ones aren't.  It's up to the org to tell it.  SCOM includes a nice tool called Maintenance Mode to assist with this (though it's worth noting that this is a workflow that the management server orders a client to execute, so it can take a few minutes to go into effect).  System Center 2016 has also added the ability to schedule maintenance mode, so that noisy objects can be put in MM automatically when that 2:00 AM backup job is running.  Maintenance mode can also be scripted (see the first sketch after this list).  If there's a place for accountability, this one is key, as the actions of the person doing the maintenance rarely get back to him or her, since that same person is often not responsible for the alert that is generated.  Don't assume this one will define itself organically.  It probably won't, and it may need some sort of management oversight to get it working well.
  4. Updates: The update process is also one that will need adjusting.  It's a bit of a dirty little secret in the SCOM world, but simply using WSUS and/or SCCM will not suffice.  There's a manual piece too, involving running SQL scripts and importing SCOM's updated internal MPs.  The process hasn't changed for as long as I've been doing it, but if you aren't sure, Kevin Holman writes an updated guide with just about every release (such as this one).
  5. Meeting with key teams:  This is specified as a weekly task, though as the environment is tuned (see below) and better maintained, it can happen less frequently.  The bottom line is that SCOM will generate alerts.  Some are easy to fix, such as the SQL SPN alerts that usually show up in a new deployment.  Some not so much.  If the SQL team doesn't watch SQL alerts, they won't know what is legit and what isn't.  If they aren't meeting with the SCOM admin on a somewhat consistent basis, then the tuning process won't happen.  The tier 1 and 2 guys start ignoring alerts when they see the same ones over and over again with no guidance or attempt to fix them.  This process is key, as that communication doesn't always happen organically.  SCOM also gives us some very nice reports in the 'Generic Report Library' to help facilitate these meetings.  The 'Most Common Alerts' report mentioned above is a great example, as you can configure it to give you a good top-down analysis of what is generating the most noise and which management packs are generating it.  Most importantly, what invariably happens is that the top 3-4 items usually account for 50-70% of your alert volume, so much of the tuning process can be accomplished by simply running this report and sitting down with the key teams.
  6. Tuning:  This really ties into those meetings, but at the same time, the tuning process needs its own process flow.  Noise needs to be escalated by the responsible teams to the SCOM administrator so that it can be addressed, whether by threshold changes or by turning off certain rules/monitors.  To an extent, the SCOM administrator should push back on this as well.  In a highly functional team this isn't the case, but the default reaction so many people have is just 'turn it off.'  That's not always the right answer, though it certainly can be in the right situation: for example, SCOM will tell you that website X or app pool Y is not running, and that can be normal in a lot of organizations.  But a lot of alerts aren't that simple, and all of them need to be investigated, as some can be caused by events such as reboots, and many (such as the SQL SPN alerts) are being ignored because the owner isn't sure what to do.  This is not always readily apparent, and some back and forth here is healthy.
  7. Documentation:  In any health check, Microsoft asks if SCOM changes are documented.  I've yet to see a 'yes' answer here.  Truthfully, most organizations don't handle change control that well, and IT people seem to be rather averse to documentation.  I'm sure part of that is that there's already so much of it that it rarely gets read or ever makes sense. Another part is that change management isn't usually a daily event, while SCOM alert changes need to happen frequently. You really don't need a change management meeting to facilitate those types of changes, as the only people really affected are the SCOM admin and whoever owns the system/process in question, and waiting for those meetings can be painful for everyone responsible for dealing with said alerts.  I've always used a poor man's implementation here.  Each management pack comes with a description and a version field that is easily editable.  Each time I make a change to a customization MP, I increment the version.  I put the new version number in the description field with a list of the change(s) made, who made them, why, and who else was involved.  This is worthwhile for CYA, as management may occasionally ask if SCOM picked up on specific events, and you don't want to have to explain why the alert for said event was turned off.  It's also useful for role changes. Whenever a new SCOM administrator starts, they tend to want to redo the environment because they have no clue what their predecessor(s) did and why.  That little bit of history provides a quick rundown of the what and why, which a new SCOM admin can use.  This assumes, of course, that best practices are followed for customizations (don't use the default MP, and by all means, do not simply dump all your changes into one MP).  It also assumes this is communicated.
  8. Backups:  This can be org specific, as spinning up a new SCOM environment might be preferable to maintaining terabytes of backup space.  That is certainly reasonable, but the org needs to actually make a decision here (and this one is a management decision, in my opinion).  That said, if the other practices are being followed, suddenly those customizations are more important. Customized MPs can be backed up via a script or an MP, and they are usually the most important item to back up, as they take the most work to restore manually (a sketch of scripted MP export also follows this list).
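
As referenced in the maintenance item above, maintenance mode can be driven from a script so it can be baked into reboot and patching routines. This is a sketch only: 'SRV01.contoso.com' is a placeholder name, and it assumes the OperationsManager module and SCOM 2012 or later.

```powershell
# Put a monitored Windows computer into maintenance mode for two hours ahead of
# planned work. 'SRV01.contoso.com' is a placeholder name.
Import-Module OperationsManager

$computerClass = Get-SCOMClass -Name "Microsoft.Windows.Computer"
$instance = Get-SCOMClassInstance -Class $computerClass |
    Where-Object { $_.DisplayName -eq 'SRV01.contoso.com' }

Start-SCOMMaintenanceMode -Instance $instance `
    -EndTime (Get-Date).AddHours(2) `
    -Reason PlannedOther `
    -Comment "Monthly patching window"
```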
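
And for the backup item, the unsealed (customized) management packs can be exported on a schedule with a few lines of PowerShell. Again a sketch, assuming the OperationsManager module; 'C:\MPBackup' is a placeholder path.

```powershell
# Export all unsealed (custom) management packs so overrides and customizations
# can be restored without a full database restore. 'C:\MPBackup' is a placeholder.
Import-Module OperationsManager

$backupPath = 'C:\MPBackup'
if (-not (Test-Path $backupPath)) {
    New-Item -ItemType Directory -Path $backupPath | Out-Null
}

Get-SCOMManagementPack |
    Where-Object { -not $_.Sealed } |
    Export-SCOMManagementPack -Path $backupPath
```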

I hope at this point it is clear that rolling out SCOM is an organizational commitment.  A 'check the box' mentality won't work here (though that's probably true of all software).  There's too much that needs to be discussed, and there are too many processes that will require change.  If anything, this should provide any SCOM admin or member of management a good starting point for making these changes.