The Anatomy of a Good SCOM Alert Management Process – Part 2: Roadblocks to good alert management.

This is my second article in a 3 part series.  Part 1 is here. Part 3 is here.

What is the goal of an Alert Management Process?

The question at hand seems rather obvious: the goal of an alert management process should, in some capacity, be to fix technology problems in the organization.  Yet when it comes to implementation, rarely do I see good alert management in place, and that has a lot to do with a product being put in place without any reasonable goals.  Organizations spend money to purchase System Center.  They spend countless hours planning their SCOM layout, sizing their environment, deploying agents, and installing management packs.  Yet virtually no time is spent planning how they will use it, what they will use it for, what processes need to change, and what new processes need to be implemented.

The lack of a clear goal is problem number 1.  Regardless of the organization, I tend to think that the goal of alert management should be the same.  The primary goal of SCOM alert management should be to make EVERY alert actionable.  The secondary goal is to ensure that actions are being taken on every alert.

Out of the box, SCOM will not accomplish either task, but it does provide you with the tools to accomplish them.  The primary goal is hindered by a product that can generate plenty of noise, as default management pack settings are rarely perfect for an environment.  Likewise, standard actions taken by an IT staff (such as rebooting a server) can generate alerts as well if there isn’t a process in place to put the server into maintenance mode.  These can only be addressed by changing or adding processes, and that has to start with management.

In the next article, I will go over these problems in a bit more detail.  In this article, I’m going to cover the main roadblocks that I’ve observed.

Management Issues

The largest failure in an alert management lifecycle, in my opinion, is the lack of a clear directive from senior IT management to embrace it.  In most larger organizations, SCOM is owned by the monitoring team, which has a directive to monitor.  That directive rarely has any depth to it, nor is it clearly articulated to other teams.  Because of this, the team implementing SCOM is often doing it on an island.  I’d add that while my expertise is in SCOM, I would be willing to bet that other products have the exact same issues.  The following items should be supplied by management:

  1. A clear directive to all teams that a monitoring solution will be implemented.  Accountability needs to be established with each team as to what their responsibilities will be with the monitoring system that is in place.
  2. Management should define what is to be monitored.  Perhaps we don’t want to put an agent on that dev box (since it will generate noise).  Or perhaps instead we have a dev monitoring environment where we put our development systems so that we can test and tune new MPs in an environment that isn’t going to interfere with our administrator’s daily jobs.
  3. Management needs to define which business processes are most critical and as such deserve the most attention.  I think it goes without saying, but just because Microsoft makes a Print Server management pack does not mean the monitoring team should be installing it (and if by chance you do, please, for the sake of your DataWarehouse, turn off the performance collection rules). Perhaps there’s a good reason to do it, but items like this are typically much lower priority than, say, that critical business application that will be a major crisis when it’s down.  We IT guys are task oriented, and we don’t always have the same opinion as management as to what is and isn’t important.  If your team isn’t empowered to make and enforce the necessary decisions, then it would be wise to at the very least weigh in on what you want monitored and slowly expand the environment once the key business processes are being monitored in a way that the business deems appropriate.
  4. Management needs to clearly state that they expect their IT organization to change its processes to respond to SCOM alerts.  This will not happen organically, no matter how much we want it to.  Quite frankly, if this isn’t done, implementing SCOM or any other monitoring solution will be a complete waste of time. In order for this to work in the SCOM environment, all IT teams have to actually be using SCOM.  This means using the console.  The SCOM administrator can create custom views, dashboards, roles, etc. to reduce the clutter inside of the environment, but the SQL team needs to be watching SQL alerts daily. The SharePoint guys need to be looking at SharePoint alerts daily. The Exchange guys need to be looking at Exchange alerts daily. And so on.  This will not be solved by turning alerts into tickets (not initially at least), nor will it be solved by creating an email alert. If SCOM is to be rolled out, and you want the benefits of rolling it out, then your IT staff needs to be using it in the way it was intended.  That means being in the console daily and managing what you are responsible for.

What you have if these four items aren’t accomplished is a “check the box” solution instead of a solution that will actually facilitate good ITSM processes.  Management said to roll out monitoring, so we did.  Never mind the fact that management never told us what to monitor or how to use it, so we just tossed in a few management packs because they seemed like good ideas and never bothered addressing the items those packs brought up. And as such, that really cool PKI management pack that we rolled out told us that certificate X was about to expire, but since no one bothered to do anything with the alert for the 21 days it sat in the console, the cert still expired and the system still went down (and yes, I’ve seen this happen more than once).
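A simple safeguard against alerts rotting in the console like that expiring certificate is to regularly flag anything that has sat open past an agreed age. Here is a minimal sketch of the idea, assuming alerts have been exported from SCOM (the field names, dates, and the 14-day cutoff below are illustrative, not a built-in SCOM feature):

```python
from datetime import datetime, timedelta

# Hypothetical open alerts; a real export would carry the alert's
# TimeRaised property and many other fields.
now = datetime(2016, 6, 30)
open_alerts = [
    {"Name": "Certificate about to expire", "TimeRaised": datetime(2016, 6, 5)},
    {"Name": "SQL DB backup failed", "TimeRaised": datetime(2016, 6, 28)},
]

# Escalate anything that has sat unresolved longer than the cutoff.
cutoff = timedelta(days=14)
stale = [a["Name"] for a in open_alerts if now - a["TimeRaised"] > cutoff]
print(stale)  # the certificate alert is 25 days old, so it is flagged
```

A report like this handed to management weekly makes the "no one touched it for 21 days" failure mode visible before the outage, not after.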

People Issues

People are oftentimes just as much of a problem as management, and when there’s no clear directive from management, that essentially gives people permission to do what they want.  Often, that will be nothing.

As a general rule of thumb, people don’t appreciate being told that what they are doing is wrong.  We are proud of our work, we would like to believe that we know exactly what we are doing, and some cultures (especially the unhealthy ones) relish making people pay for their mistakes.  The reality, though, is that we aren’t perfect, we do make mistakes, and having something tell us that is a good thing.  Unfortunately, we don’t appreciate that advice.  As such, people love to do the following:

  1. Refuse to use the product or decide to use it in a way that is counterproductive.
  2. Blow off the Tier 1 people responsible for triaging alerts (if Tier 1 is doing this).
  3. Blow off the SCOM administrator responsible for tuning SCOM.

That tends to discourage any real organizational change from happening. The Tier 1 guys eventually stop calling.  If you have a really good SCOM administrator, he or she finds a new job at Microsoft or chooses to focus their efforts elsewhere (since monitoring is rarely their only responsibility).  These people are being asked to solve technical problems, but most IT guys aren’t very good at solving people problems.

Technical Issues

I don’t want to pretend that SCOM is a perfect solution.  It’s not, but it does work very well when used in the way it was intended to be used.  The reality, however, is that out of the box, SCOM will generate A LOT of noise.  The tuning process will be covered more in the next article, but the bottom line is that it MUST be done. SCOM isn’t smart enough to know that the web application you have turned off is supposed to be turned off.  It just knows that it’s off and tells you (repeatedly in some scenarios).  It doesn’t know what thresholds are appropriate for your organization.  You have to set them.  It is smart enough to tell you things that it needs to do its job, but you have to do those things.

Out of the box, SCOM is usually going to tell you a lot of things. That will be overwhelming to most people, as they often see hundreds of alerts, and not all of them are intuitive (for the most part they are, or can be researched pretty easily, but it will require effort).  Alerts generally require a certain amount of troubleshooting, and they can take a lot of time to fix.  Not to mention, the skillset and/or permissions needed aren’t always going to be centered on the SCOM administrator.  If it’s going to work, it has to be a team effort, and that will mean lots of change.
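When facing those hundreds of alerts, the practical first step in tuning is finding the repeat offenders, since a handful of rules and monitors usually generate most of the noise. A minimal sketch of that triage, assuming the active alerts have been exported to a list of records (the alert names below are illustrative samples, not a real export):

```python
from collections import Counter

# Illustrative sample of exported alerts; a real export would
# carry severity, source, repeat count, and many other fields.
alerts = [
    {"Name": "Logical Disk Free Space is low"},
    {"Name": "Health Service Heartbeat Failure"},
    {"Name": "Logical Disk Free Space is low"},
    {"Name": "Logical Disk Free Space is low"},
    {"Name": "Web Application Unavailable"},
]

# Count occurrences per alert name to surface tuning candidates.
noise = Counter(a["Name"] for a in alerts)
for name, count in noise.most_common(3):
    print(f"{count:>4}  {name}")
```

The top of that list is where overrides, threshold changes, or process fixes buy the most quiet for the least effort.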

The Anatomy of a Good SCOM Alert Management Process – Part 1: Why is alert management necessary?

I’ve had the luxury of doing SCOM work for several years now, across many different client types and infrastructures, and one of the constants that I see in nearly every environment I’ve worked in is a lack of planning surrounding how SCOM will be used.  There are a number of reasons why this is, ranging from a lack of understanding of how the tool works, to internal political issues, to administrative problems such as lazy or secretive administrators, to many other things as well.  The solutions to these problems are not always easy and require a bit more than just tossing in a piece of technology so that we can check a box and say “we have monitoring.”  The reality, though, is that this is precisely how many SCOM environments are designed.

We at Microsoft have often been very good at solving technical problems, and the SCOM community as a whole has a number of fantastic blogs, ranging from frequent bloggers such as Kevin Holman, Stefan Stranger, and Marnix Wolf to folks such as myself who are far less intelligent and do far less blogging.  In all, if you need to solve a technical problem in SCOM, it’s not hard to do, as someone has already done it.  Unfortunately, I think this is where we can often fall short, as we leave it up to our customers to use our products, and they can have some very interesting ways of using them.  SCOM is no different.  The biggest problem with SCOM that I see is that organizations never address the people or processes surrounding the technology that they purchased.

First, let’s start with the obvious. SCOM is very good at telling users that something is wrong.  It’s not hard to spin up, and after tossing in a few management packs, you will quickly start seeing alerts ranging from simple noise to real problems.  SCOM engineers quickly realize that there are lots of really cool Microsoft and non-Microsoft MPs out there, along with some really good ideas (and bad ones).  It does not take long for a customer to deploy a bunch of agents, import some management packs, and the next thing they know, their alerts screen is full of red and yellow alerts indicating that something may be wrong.

What I find, though, is that the alert management process pretty much stops at this point.  Yes, there are organizations that truly do try to have an end to end alert lifecycle, but just about every organization that I’ve visited is stuck at this point, even while thinking they aren’t.  These orgs have a SCOM administrator, who often wears multiple hats, and perhaps a Tier 1 staff watching the active alerts to some extent. Usually, Tiers 2 and 3 are completely disengaged, never touching the SCOM console or perhaps going so far as actively resisting monitoring attempts. In an attempt to bring monitoring issues to light, orgs decide to send emails or generate tickets on alert generation.  Generating tickets usually frustrates the help desk, as SCOM can quite literally generate thousands of alerts a day, which essentially also makes SCOM our own private spam server.  Administrators create rules and put SCOM alerts in folders, and in the end, nothing gets changed while the same alert generates tens, hundreds, or even thousands of emails that go unanswered, and the problem is never actually solved.

The problem is never solved because of a fundamental lack of understanding of what needs to be done and why.  There are a few reasons why alert management is necessary:

  1. There are technology issues with the product which require it.  State changes are not groomed from the database while the state of the associated object is unhealthy.  A failure to manage alerts and fix the issues associated with them leaves data in the database beyond its grooming requirement. Likewise, state and performance data can be stored in the DataWarehouse for a long period of time. Failure to manage alerts can lead to a very large DW, oftentimes containing lots of data that the customer couldn’t care less about, eventually leading to performance issues if it is not managed.
  2. All environments are different.  This should go without saying, but it means that it is impossible for SCOM to meet the exact needs of your organization OUT OF THE BOX.  Thresholds for alerts in one organization may be too high, while in others too low.  In some orgs, the monitor or rule is not applicable, and in some cases, what they really want to monitor is turned off by default.  As such, the SCOM administrator’s primary job is to tune alerts.
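Since no default threshold fits every environment, one hedged way to pick an override value is to baseline the performance data you have already collected and set the threshold relative to observed behavior. A small sketch of the idea (the sample values and the mean-plus-two-standard-deviations rule of thumb are illustrative assumptions, not a SCOM feature):

```python
import statistics

# Hypothetical CPU % samples collected over a baseline period.
samples = [22, 25, 31, 28, 24, 35, 30, 27, 26, 29]

mean = statistics.mean(samples)
stdev = statistics.pstdev(samples)

# Rule of thumb: alert when utilization exceeds the baseline mean
# plus two standard deviations, i.e. well outside normal behavior.
threshold = round(mean + 2 * stdev)
print(f"Suggested override threshold: {threshold}%")
```

The point isn’t the specific formula; it’s that the override value comes from your environment’s data rather than from whatever number shipped in the management pack.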

Tuning, while seeming like a simple job, requires teamwork. While it would be nice if your SCOM administrator were a technology guru, the reality is that this engineer likely knows bits and pieces about AD, Platforms, Clustering, Skype, DNS, IIS, SharePoint, Azure, Exchange, PKI, Cisco, SAN, and whatever else you happen to have in your environment.  He or she will likely not know these products in detail and as such relies on Tiers 1 and 2 to investigate issues, as well as Tier 3’s input as problems are uncovered.  That problem is further complicated by processes which need to change, as actions by other IT administrators/engineers can lead to additional alerts for reasons as simple as not putting an object into maintenance mode before rebooting it.

As such, an alert management lifecycle is necessary to handle the end to end life of an alert, whether that be creation to resolution in the event of real problems or the tuning of alerts to reduce noise. 

Part 2:  Process Blockers to Good Alert Management.

Part 3:  Completing the Alert Management Life Cycle.