Monitors vs. Rules and How They Affect Alert Management

I’m going to do a back-to-basics article here, and it’s not because things haven’t been written on the subject of monitors, rules, and SCOM, but because I don’t think they have been fleshed out well, and to non-seasoned SCOM engineers they are not exactly intuitive.  As such, I want to walk through these two mechanisms that SCOM uses to monitor your environment and how they affect your alert management process, as there are some significant differences between the two that SCOM users need to be aware of (and note that this means users, the people who close and acknowledge alerts, not just the engineers dedicated to SCOM).  Alert management can be a difficult thing to do.  It’s the hardest part of managing SCOM due to noise, internal politics, management issues, and a whole host of other things.  A colleague of mine has put together a nice chart on the subject.  There is not one right way to do alert management, but there are a lot of wrong ways.  Today, I’d like to discuss some things that can cause problems with alert management, namely monitors and rules.

So first, let’s start with the basics.  SCOM has four main data types that get stored in the DataWarehouse: event, state, performance, and alert data.  Of these four, rules are responsible for collecting event and performance data.  Collection rules are not alert generating; they are just mechanisms to collect and store data for the purpose of reporting.  You want to know the processor utilization of server X over the last 3 months?  It’s a rule that collects this and tosses it in the DW.  The same goes for event collection rules.  Their purpose in this scenario is solely to collect.

State data, however, is a different animal.  This is the health of the objects monitored in SCOM.  You view this data from the OperationsManager database through Health Explorer.  Of note, only monitors can change state, so when you view Health Explorer, you will only see monitors in a red or yellow state (or all states if you unscope it), but you will never see alerts generated by a rule.  This is why on occasion you’ll open Health Explorer off of an alert only to see a completely healthy object underneath it: a rule generated the alert (and note that on occasion a management pack designer will create both a monitor and a rule with the same name… yeah… just to confuse you).  I’ll get to more about this in a minute.

That brings us to alerts, as both monitors and rules generate alerts.  Because of this, you need to view alerts as simply alerts and recognize that SCOM has two alert generating mechanisms, both of which behave a bit differently.  Both can generate alerts, though neither monitors nor rules are required to (monitors almost always do generate alerts, but it is not a given).

So let’s start with rules:

  1. They don’t change state.  I covered this before, but a rule just tells you something happened.  It is not definitive proof that your environment is unhealthy.  A common rule would be something like “SQL DB backup failed to complete”.  There’s nothing wrong with the health of your SQL environment, but this just might be something you want to look at.  If you don’t care about the backup, then turn it off (and not the rule, the backup itself), as at this point it is clutter.
  2. Rules like to talk.  They tell you something needs to be looked at.  They tell you again.  They tell you again and again and again.  A poorly constructed rule can generate thousands of alerts in a short period of time… and sadly yes, I know this from experience 🙂  You can build alert flood protection into rules, but that has to be done when the rule is created, and if the rule lives in a sealed management pack, the only way to add alert flood protection is to disable the rule and re-create it.  With alert flood protection, rules will increment a “repeat counter” in SCOM.  It is worth noting that this counter is not visible by default in your console; you’ll want to right click on your columns, select “Personalize view”, and check the box that says “Repeat Count” (usually near the bottom).  A rule with a high repeat count means that the condition has been detected on numerous occasions.  It means you might want to investigate what is going on, as closing the alert will only mean that it comes back in a few minutes.
  3. Rules do not auto-close.  This means that if you’ve fixed the problem, you need to close the alert if it was generated by a rule. 
  4. Alert Generation cannot be turned off for rules in sealed management packs.  You can only disable the rule and/or recreate it if needed.
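To make the repeat counter behavior concrete, here’s a little Python sketch.  This is a toy model of my own, not anything SCOM actually runs internally: the point is simply that with flood protection, repeated detections from the same rule and source bump a counter on the one open alert instead of creating a new alert every time.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    source: str
    repeat_count: int = 0  # SCOM shows this as the "Repeat Count" column


class AlertStore:
    """Toy model of rule-generated alerts with flood protection."""

    def __init__(self):
        self.open_alerts = {}

    def raise_alert(self, name, source, suppress=True):
        key = (name, source)
        if suppress and key in self.open_alerts:
            # Flood protection: same rule + same source -> bump the counter
            self.open_alerts[key].repeat_count += 1
        else:
            # No suppression (or first detection): a brand new alert
            self.open_alerts[key] = Alert(name, source)
        return self.open_alerts[key]


store = AlertStore()
for _ in range(500):  # a noisy rule firing 500 times
    alert = store.raise_alert("SQL DB backup failed to complete", "SQL01")

print(len(store.open_alerts))  # 1 alert, not 500
print(alert.repeat_count)      # 499
```

The takeaway: one alert with a repeat count of 499 is the flood-protected version of 500 separate alerts cluttering your console, which is why that hidden column is worth turning on.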

Now on to monitors:

  1. Monitors do change state.  This means that when a monitor detects an unhealthy condition, the object that contains it will go unhealthy.  Not all objects roll up through Windows Computer, so be careful there, as simply using Health Explorer from that view may surprise you: you won’t necessarily see what you expect.  There is no super-class in SCOM, and I don’t believe that will change in 2016.
  2. Monitors have a mechanism for detecting a healthy condition.  This can be done via a timer, via an event ID, or via some script.  The bottom line is that the premise behind monitors is health, and as such, they need to have both healthy and unhealthy conditions.
  3. Alerts generated by monitors usually auto-resolve.  This means that when a monitor generates an alert, it will also close that same alert when it returns to healthy.  This is an overridable parameter, so by all means check it, but it’s pretty rare to see auto-resolve turned off by default.  I cannot think of an example, though I’m sure someone has seen it…  That said, this can be a cause of noise in your environment if you have a monitor that is flip flopping back and forth between healthy and unhealthy.  Monitors doing this need to be addressed, and since the alerts can go away on their own, if you don’t have a good alert management process in place, you can miss them.  I typically like to use the ‘Most Common Alerts’ report in the generic report library, as this can tell you which monitors/rules are generating said alerts.  Every now and then, you’ll see items on that report that have no alerts in the console, and a flip flopping monitor can be the cause of this.
  4. Closing an alert generated by a monitor does not reset the health.  This means that if the underlying condition is still present, the health of the object in question will not go back to green.  Therefore, DO NOT CLOSE AN ALERT GENERATED BY A MONITOR.  INSTEAD, USE HEALTH EXPLORER TO RESET THE HEALTH.  Sorry that I had to be loud there.  This is probably the most common mistake someone new to SCOM makes (myself included, by the way).  It is an even bigger problem in organizations that use SCOM in their NOC (network operations center), as the helpdesk/NOC tends to be made up of people with less experience, and the natural tendency is to close the alert when you think you’ve fixed it.  Sadly, some close it because they don’t know what to do with it.  When that happens, that particular monitor will not generate another alert until it goes healthy first.  Other monitors will continue to generate alerts, but the one in question will not until it resets.  This also causes grooming problems in the OperationsManager DB, as state data isn’t groomed while the object is unhealthy.  It would be nice if the product team released an update to generate a health reset in this scenario, but to my knowledge this has not happened.
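That flip flopping behavior is easier to reason about if you think of it as counting state transitions over a recent window.  Here’s an illustrative Python sketch (my own toy logic, not a SCOM feature; the threshold is an arbitrary assumption) that flags a monitor as a likely flip flopper:

```python
def count_flips(state_history):
    """Count health-state transitions in a monitor's recent history."""
    return sum(
        1 for prev, cur in zip(state_history, state_history[1:]) if prev != cur
    )


def is_flip_flopping(state_history, threshold=3):
    """More than `threshold` transitions in the window suggests flapping."""
    return count_flips(state_history) > threshold


history = ["Healthy", "Critical", "Healthy", "Critical", "Healthy"]
print(count_flips(history))       # 4
print(is_flip_flopping(history))  # True -> tune the monitor, don't just close alerts
```

In practice, the ‘Most Common Alerts’ report gives you the same signal: a monitor near the top of that report with few or no open alerts in the console is a strong hint that it’s cycling between healthy and unhealthy.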

OK, so that’s the scoop.  The big thing is to identify which mechanism generated the alert.  From there, you can craft a process on how to deal with it.  And to answer one question I get asked a lot: yes, the alert itself does tell you which mechanism generated it.


So You Want to Roll Out SCOM: Decisions that you should make before you click Install.

I’ve been on a process kick lately, in large part because the issues that I encounter in SCOM environments aren’t related to the technology but to the processes that surround it.  From there, I decided to put together a nice primer of all of the things that need to be done, or at the very least considered, in advance.

The Technical:

Kevin Holman has a great quick start guide to setting up SCOM.  That really covers most of the technical side of the deployment, though there are a few key things you may want to consider before you start installing:

  • SQL Server location:  Are you going to keep the SQL server local to the SCOM server?  Do you have an enterprise cluster?  How are you going to carve out the DBs?  A default SCOM install uses two databases, and reporting as well as ACS add more.  From a best practice standpoint, you may want to consider carving out separate volumes for the operational DB, its log file, the DW, its log file, and the temp DB.  SCOM is very disk intensive, and isolating these onto their own volumes will help with performance.
  • Account naming convention, and whether you want to create all of them:  Here’s what is required.  Kevin mentions the accounts in his quick start guide as well.  Oh, and for the sake of all things security, DO NOT use a domain admin account for any of these accounts.  Also, don’t forget to configure your SPNs.
  • Sizing:  I get asked a lot as to how big to make the environment.  The answer is rather vague.  It really depends.  Microsoft provides a nice sizing guide that can help answer those questions.  The size of your environment really depends on what you want to monitor, as well as how much availability you want.
  • Backup strategy:  Do you want to back up just the databases?  That’s the traditional method, though I strongly recommend having a restore procedure in place and validated if that’s the plan.  I’d note that if space is at a premium, you may want to consider grabbing just your unsealed customizations.  That takes up a lot less space (at the cost of a total loss of historical data in a disaster), and there are easy ways to do this, either by management pack or by script.
  • Data Retention:  The SCOM DB doesn’t keep data all that long, though in a larger environment, you may want to reduce those settings, while in a smaller environment, you may want to consider increasing them.  DataWarehouse retention is a bigger deal, as it is not configurable via the SCOM console.  The DW can also get rather large, particularly the state and performance hourly data, which have a default retention of 400 days.  This can lead to a very large DW and a very angry storage administrator who wants to know why you need all that space.  I personally recommend keeping the daily aggregations for 365 days and the hourly aggregations for about 120 days.  That really is an organizational decision, but one that should be made early on.
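If you want a feel for what those retention numbers mean in practice, the grooming arithmetic is just date math.  The values below follow my recommendations above; the dataset names are shorthand of my own, and in real life you’d change DW retention with a tool like dwdatarp rather than a script like this:

```python
from datetime import date, timedelta

# Hypothetical retention settings (days) per DW dataset, following the
# recommendations above; the real settings live in the DataWarehouse itself.
retention = {
    "Perf hourly":  120,
    "Perf daily":   365,
    "State hourly": 120,
    "State daily":  365,
}


def grooming_cutoff(today, days):
    """Data older than this date is eligible for grooming."""
    return today - timedelta(days=days)


today = date(2016, 12, 31)
for dataset, days in sorted(retention.items()):
    print(dataset, grooming_cutoff(today, days))
```

Run against the 400-day default for hourly data instead, and it becomes obvious why the hourly aggregations dominate DW growth: you’re keeping 24 rows per counter per day for well over a year.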

The quasi-technical:

This is still somewhat technical, but there are procedural considerations to be had here.

  • Naming convention for SCOM customizations as well as custom management packs:  I strongly recommend using some sort of organizational name to lead all of your customizations.  The reason is that six months from now, you won’t likely remember what you named that custom monitor.  Save yourself the time and make sure you have some sort of consistent naming convention, if for no other reason than to allow for an easier search.
  • What do you want to monitor:  Let’s start with the obvious: SCOM is a framework that can monitor a lot of things.  Microsoft makes management packs for most (if not all) of their products.  There are a lot of 3rd party MPs out there as well (note: not all are free, and not all are that good).  Oh, and most importantly, don’t make the mistake that I and many others have made in rolling all of them out at once.  You’ll want to tune each MP, so roll them out progressively (preferably in a QA environment first) so as to identify noise before you roll them out into production and end up with an angry monitoring team.
  • Alert Management:  I’ve written a 3 part piece on this subject.  The first part is here (and it links to the other two).  Suffice to say, most organizations don’t really sit down and think about how they plan on responding to alerts.  The end result is that the organization has purchased a monitoring tool, but it does not monitor. 
  • What to monitor:  Are you going to throw your development systems into your production SCOM environment?  Do you really only care about a few core systems?  The bottom line is that SCOM is going to tell you lots of things about your environment.  It’s great at detecting bad IT hygiene, but it doesn’t know which items are by design and which aren’t.  If you want to actually have a good process for responding to alerts, then you probably want to sit down and decide which systems are important to alert on.  If you throw everything into one environment, you are going to make it very difficult for those who are supposed to do the monitoring.
  • What processes need changing:  This goes back to alert management, but the bottom line is that plenty of organizational processes will need to change to account for SCOM.  A short list includes maintenance, server commissioning and decommissioning, and responding to alerts.
  • Who needs access:  Unlike a lot of systems, most of your IT staff really only needs to be an operator.  Their job is to close alerts, reset state, and view dashboards/reports.  You probably don’t want to give them the rights to start customizing your environment.
  • Custom Views:  This involves meeting with the various teams in your organization, but you’re going to want to get them using SCOM.  This means that they should likely have a scoped role so as not to be exposed to items that they don’t need to see.  It may also involve creating custom dashboards for them.  There’s a lot of really cool things you can do with dashboards.  Here’s one for a start.
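On the naming convention point, it helps to make the convention checkable rather than tribal knowledge.  Here’s a hypothetical example in Python where the leading org name is “Contoso” (an assumption for illustration; substitute your own organization name and pattern):

```python
import re

# Hypothetical convention: every custom MP name starts with the org name,
# e.g. "Contoso.SQL.Customizations". The "Contoso" prefix is made up here.
MP_NAME = re.compile(r"^Contoso\.[A-Za-z0-9]+(\.[A-Za-z0-9]+)*$")


def follows_convention(mp_name):
    """True if a management pack name matches the org's naming convention."""
    return bool(MP_NAME.match(mp_name))


print(follows_convention("Contoso.SQL.Customizations"))  # True
print(follows_convention("MyRandomOverrides"))           # False
```

Even if you never automate the check, writing the convention down as a pattern like this forces it to be unambiguous, and it makes searching the console for your own customizations trivial.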


The procedural:

I’m married to a Quality Manager, so I get to hear about this every day in the manufacturing world, and I happen to have a degree in Manufacturing Engineering to go with it.  Needless to say, IT doesn’t do documentation that well, especially when it comes to break fix.

  • Customizations:  This is a big one.  I typically recommend implementing a basic versioning system for your custom MPs and using the built-in description field to record the version number, what changed, who changed it, and why.  It doesn’t need a committee or special change management, but it can be very useful in keeping a running history of what was done in the environment.  As a bonus, when the SCOM owner leaves, his or her replacement will be able to pick up the changes much more easily.  What often happens when there’s little documentation is that the new administrator is very tempted to simply start over.
  • Health Check:  Microsoft offers an excellent service for many of their products known as a health check.  You may want to consider doing something like this within the first few months of rolling out SCOM.  A health check will determine if there are any performance bottlenecks in your environment as well as identify potential issues that you may need to address.  It will help you see where your best practices might be falling a bit short and allow you to maximize your use of the tool.  It’s not a requirement by any means, but it will provide you a very nice picture of your environment as well as a direction in terms of what needs to be addressed going forward.  (Shameless self promotion, but if by chance someone reads this and decides to purchase one, please be so kind as to let your account manager know that you read it here.  Those types of things look great on reviews.)

The Anatomy of a Good SCOM Alert Management Process – Part 3: Completing the Alert Management Life Cycle.

This is my final article in a 3 part series about Alert Management.  Part 1 is here.  Part 2 is here.

In the first two parts, we have already discussed why alert management is necessary and what tends to get in the way.  The final article in this series will cover what processes need to change or be added in order to facilitate good alert management.

The information below can be found in a number of places.  It is found in our health check that is provided for SCOM.  I’ve seen it in a number of presentations by a number of different Microsoft PFEs as well.  It shows up on some blogs too.  Simply put, there’s plenty out there that can point you in the right direction, though sometimes the WHY gets left out.

Daily Tasks

  • Check, using the Operations Manager management packs, that Operations Manager components are healthy
  • Check that alerts from the previous day are not still in a state of ‘New’
  • Check for any unusual alert or event noise; investigate further if required (e.g., failing scripts, WMI issues)
  • Check all agents’ status for any that may be in a state other than green
  • Review nightly backup jobs and database space allocation
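The second daily check above (alerts still ‘New’ from the previous day) is really just an age filter.  This Python sketch uses made-up alert records to show the logic; in reality you’d pull the alerts from SCOM itself, for example via the Operations Manager PowerShell module or a scoped console view:

```python
from datetime import datetime, timedelta

# Toy alert records for illustration; real ones come from SCOM.
alerts = [
    {"name": "Logical Disk Free Space is low", "state": "New",
     "raised": datetime(2016, 5, 1, 8, 0)},
    {"name": "Health Service Heartbeat Failure", "state": "Closed",
     "raised": datetime(2016, 5, 1, 10, 0)},
    {"name": "SQL DB backup failed to complete", "state": "New",
     "raised": datetime(2016, 5, 2, 8, 0)},
]


def stale_new_alerts(alerts, now, max_age_hours=24):
    """Alerts still 'New' after a day mean nobody is working them."""
    cutoff = now - timedelta(hours=max_age_hours)
    return [a for a in alerts if a["state"] == "New" and a["raised"] < cutoff]


now = datetime(2016, 5, 2, 9, 0)
for a in stale_new_alerts(alerts, now):
    print(a["name"])  # only the disk-space alert is stale
```

The point of the daily check isn’t the filter itself; it’s that a nonzero result means your escalation process is leaking, and that is what the weekly meeting should dig into.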

Weekly Tasks

  • Schedule a weekly meeting with IT operational stakeholders and technical staff to review the previous week’s most common alerts
  • Run the ‘Most Common Alerts’ report; investigate where necessary (see above bullet)

Monthly Tasks

  • Check for new Management Pack versions of those installed. Also check newly released management packs for suitability for your monitored environment
  • Run the baseline counters to assess the ongoing performance of the Operations Manager environment as new agents and new management packs are added

The task list doesn’t necessarily say WHO is responsible for completing these items, but I can say with reasonable certainty that if the SCOM administrator is the only one expected to do these tasks, he or she will fail.  Alert noise in particular is a team effort.  It needs to be handled directly by the people whose responsibility it is to maintain the systems being monitored.  That means that your AD guys should be watching the AD management pack, the SQL guys need to be watching for SQL alerts, and so on and so forth.  They know their products better than the SCOM administrator ever will.

Tier one (and by proxy tier two) can certainly be the eyes and ears on the alerts that come through, but they need clearly defined escalation paths to the appropriate teams so that issues that aren’t easily resolved can be sent on to the correct tier three teams.  SCOM does a lot of self-alerting, so that escalation needs to include the SCOM administrators, as issues such as WMI scripts not running, failing workflows, and various management group related alerts need to eventually make it to the SCOM administrator.  Issues such as health service heartbeat failures (and by proxy gray agents, when that heartbeat threshold is exceeded) need to be looked at right away.  Those indicate that an agent is not being monitored (at the least).  There are a number of reasons why that could be the case, ranging from down systems (which you want to address), to bad processes, to some sort of client issue preventing communication.

Finally, all of this requires some sort of accountability.  Management doesn’t necessarily need to know why system X is red.  That’s usually the wrong question.  What management needs to be ensuring is that when there’s an alert from SCOM, SOMEONE is addressing it, and that someone also has a clear escalation path when they get to a point where they aren’t sure what’s going on.  To be clear, there’s going to be A LOT of this at first. That’s normal, and that also gets us into other key processes that need to be formed or adjusted in order to make this work.

  1. Server commission/decommission:  The most common cause of gray agents in SCOM is the failure to remove a server from SCOM when it is retired.  It’s a simple change, but it has to be worked into your organization’s current process.  On the flip side, ensuring that new servers are promptly added to SCOM is also important.  How that is managed is more organization specific.  You can auto-deploy via SCCM or AD (though don’t forget to change the remotely manageable settings if you do), or you can manually deploy through the SCOM console.  You can also pre-install the agent in the image and use AD assignment if that is preferred.  Keep in mind that systems in a DMZ will require certificates or a gateway to authenticate, which will further affect these processes.  You may also want to think about whether or not your development systems should be monitored in the production environment (as these will usually generate more noise).  You may want to consider putting these systems in a dev SCOM environment (you’ll likely have no additional cost).
  2. Development Environment:  The dev SCOM environment is also something that will have its own processes.  It will be used more for testing new MP rollouts, and in terms of being watched by your day to day support operations, it really only needs to be watched by the engineers responsible for their products as well as the SCOM administrator.
  3. Maintenance:  Server maintenance will need to be adjusted as well.  This might be the biggest process change (or in most cases, a new process altogether).  Rebooting a DC during production hours (for example) is somewhat normal since it really won’t cause an outage, but if that DC is, say, the PDC emulator, each DC in SCOM will generate an alert when it goes down.  Domain controllers aren’t the only example here, as alerts can be generated any time a server is rebooted.  Reboots can generate a health service heartbeat alert if the server misses its ping, or even a gray server if the reboot takes a while.  Application specific alerts can be generated as well, and SCOM specific alerts will generate when workflows are suddenly terminated.  This process is key, as it’s a direct contributor to the daily amount of noise that SCOM typically generates.  SCOM isn’t smart enough to know which outages are acceptable to your organization and which ones aren’t.  It’s up to the org to tell it.  SCOM includes a nice tool called Maintenance Mode to assist with this (though it’s worth noting that this is a workflow that the management server orders a client to execute, so it can take a few minutes to go into effect).  System Center 2016 has also added the ability to schedule maintenance mode, so that noisy objects can be put in MM automatically when that 2:00 AM backup job is running.  If there’s a place for accountability, this one is key, as the actions of the person doing the maintenance rarely get back to him or her, since that same person is often not responsible for the alert that is generated.  Don’t assume this one will define itself organically.  It probably won’t, and it may need some sort of management oversight to get it working well.
  4. Updates:  The update process is also one that will need adjusting.  It’s a bit of a dirty little secret in the SCOM world, but simply using WSUS and/or SCCM will not suffice.  There’s a manual piece too, involving running SQL scripts and importing SCOM’s updated internal MPs.  The process hasn’t changed as long as I’ve been doing it, but if you aren’t sure, Kevin Holman writes an updated guide with just about every release (such as this one).
  5. Meeting with key teams:  This is specified as a weekly task, though as the environment is tuned (see below) and better maintained, this one can happen less frequently.  The bottom line is that SCOM will generate alerts.  Some are easy to fix, such as the SQL SPN alerts that usually show up in a new deployment.  Some not so much.  If the SQL team doesn’t watch SQL alerts, they won’t know what is legit and what isn’t.  If they aren’t meeting with the SCOM admin on a somewhat consistent basis, then the tuning process won’t happen.  The tier 1 and 2 guys start ignoring alerts when they see the same alerts over and over again with no guidance or attempts to fix them.  This process is key, as that communication doesn’t always happen organically.  SCOM also gives us some very nice reports in the ‘Generic Report Library’ to help facilitate these meetings.  The ‘Most Common Alerts’ report mentioned above is a great example, as you can configure the report to give you a good top down analysis of what is generating the most noise, and it will tell you which management packs are generating it.  Most importantly, what invariably happens is that the top 3-4 items usually account for 50-70% of your alert volume, so much of the tuning process can be accomplished by simply running this report and sitting down with the key teams.
  6. Tuning:  This really ties into those meetings, but at the same time, the tuning process needs to have its own process flow.  Noise needs to be escalated by the responsible teams to the SCOM administrator so that it can be addressed, whether by threshold changes or by turning off certain rules/monitors.  To an extent, the SCOM administrators should push back on this as well.  In a highly functional team, this isn’t the case, but the default reaction that so many people have is just ‘turn it off.’  That’s not always the right answer, though it certainly can be in the right situation.  For example, SCOM will tell you that website X or app pool Y is not running, and this can be normal in a lot of organizations.  But a lot of alerts aren’t that simple, and all of them need to be investigated, as some can be caused by events such as reboots, and many (such as SQL SPN alerts) are being ignored because the owner isn’t sure what to do.  This is not always readily apparent, and some back and forth here is healthy.
  7. Documentation:  In any health check, Microsoft asks if SCOM changes are documented.  I’ve yet to see a ‘yes’ answer here.  Truthfully, most organizations don’t handle change control that well, and IT people seem to be rather averse to documentation.  I’m sure part of that is that there’s already so much of it that it rarely gets read or ever makes sense.  Another part is that change management isn’t usually a daily event, and SCOM alert changes need to happen frequently.  You really don’t need a change management meeting to facilitate those types of changes, as the only real people affected are the SCOM admin and whoever owns the system/process in question, and waiting for those meetings can be painful to everyone responsible for dealing with said alerts.  I’ve always used a poor man’s implementation here.  Each management pack comes with a description and a version field that is easily editable.  Each time I make a change to a customization MP, I increment the version.  I put the new version number in the description field with a list of the change(s) made, who made them, why, and who else was involved.  This is worthwhile for CYA, as management may occasionally ask if SCOM picked up on specific events, and you don’t want to try and explain why the alert for said event was turned off.  It’s also useful for role changes.  Whenever a new SCOM administrator starts, he or she tends to want to redo the environment for lack of any clue as to what their predecessor(s) did and why.  A little history here can provide a quick rundown of the what and why which a new SCOM admin can use.  This assumes, of course, that best practices are followed for customizations (don’t use the default MP, and by all means, do not simply dump all your changes into one MP).  It also assumes this is communicated.
  8. Backups:  This can be org specific, as spinning up a new SCOM environment might be preferable to maintaining terabytes of backup space.  That is certainly reasonable, but the org needs to actually make a decision here (and this one is a management decision, in my opinion).  That said, if the other practices are being followed, suddenly those customizations are more important.  Customized MPs can be backed up via a script or an MP, and they are usually the most important item needed for backups, as they take the most work to restore manually.
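The poor man’s versioning from the documentation item above can be sketched in a few lines.  This is illustrative Python with a made-up changelog line format; in practice the edits happen in the MP’s version and description fields in the console, not in a script:

```python
def bump_version(version, changed_by, change, reason, description=""):
    """Increment a 4-part MP version and prepend a changelog entry to the
    description field -- the 'poor man's' change history described above."""
    parts = [int(p) for p in version.split(".")]
    parts[-1] += 1  # bump the last part, e.g. 1.0.0.0 -> 1.0.0.1
    new_version = ".".join(str(p) for p in parts)

    # One line per change: version | who | what | why (a made-up format)
    entry = f"v{new_version} | {changed_by} | {change} | {reason}"
    new_description = entry + ("\n" + description if description else "")
    return new_version, new_description


ver, desc = bump_version("1.0.0.0", "jdoe", "Disabled SQL SPN monitor", "noise")
print(ver)                   # 1.0.0.1
print(desc.splitlines()[0])  # newest change on top
```

The exact format doesn’t matter; what matters is that the history travels with the MP itself, so the next SCOM admin can read it instead of starting over.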

I hope at this point that it is clear that rolling out SCOM is an org commitment.  A ‘check the box’ mentality won’t work here (though that’s probably true for all software).  There’s too much that needs to be discussed, and there are too many processes that will require change.  If anything, this should provide any SCOM admin or member of management a good starting point for making these changes.

The Anatomy of a Good SCOM Alert Management Process – Part 2: Road blocks to good alert management.

This is my second article in a 3 part series.  Part 1 is here. Part 3 is here.

What is the goal of an Alert Management Process?

The question at hand seems rather obvious, as the goal of an alert management process should, in some capacity, be to fix technology problems in the organization.  Yet when it comes to implementation, rarely do I see good alert management in place, and that has a lot to do with a product that is put in place without any reasonable goals.  Organizations spend money to purchase System Center.  They spend countless hours planning their SCOM layout, sizing their environment, deploying agents, and installing management packs.  Yet virtually no time is spent planning how they will use it, what they will use it for, what processes need to be changed, and what new processes need to be implemented.

The lack of a clear goal is problem number 1.  Regardless of the organization, I tend to think that the goal of alert management should be the same.  The primary goal of SCOM alert management should be to make EVERY alert actionable.  The secondary goal is to ensure that actions are being taken on every alert.

Out of the box, SCOM will not accomplish either task, but it does provide you with the tools to accomplish them.  The primary goal is hindered by a product that can generate plenty of noise, as default management pack settings are rarely perfect for an environment.  Likewise, standard actions taken by an IT staff (such as rebooting a server) can generate alerts as well if there isn’t a process in place to put the server into maintenance mode.  These can only be addressed by changing or adding processes, and that has to start with management.

In the next article, I will go over these problems in a bit more detail.  In this article, I’m going to cover the main road blocks that I’ve observed.

Management Issues

The largest failure in an alert management lifecycle, in my opinion, is the lack of a clear directive from senior IT management to embrace it.  In most larger organizations, SCOM is owned by the monitoring team, who has a directive to monitor.  That directive rarely has any depth to it, nor is it clearly articulated to other teams.  Because of this, the team implementing SCOM is often doing it on an island.  I’d add that while my expertise is in SCOM, I would be willing to bet that other products have the exact same issues.  The following items are what should be supplied by management:

  1. A clear directive to all teams that a monitoring solution will be implemented.  Accountability needs to be established with each team as to what their responsibilities will be with the monitoring system that is in place.
  2. Management should define what is to be monitored.  Perhaps we don’t want to put an agent on that dev box (since it will generate noise).  Or perhaps instead we have a dev monitoring environment where we put our development systems so that we can test and tune new MPs in an environment that isn’t going to interfere with our administrator’s daily jobs.
  3. Management needs to define which business processes are most critical and as such deserve the most attention.  I think it goes without saying, but just because Microsoft makes a Print Server management pack does not mean the monitoring team should be installing it (and if by chance you do, please, for the sake of your DataWarehouse, turn off the performance collection rules).  Perhaps there’s a good reason to do it, but items like this are typically much lower priority than, say, that critical business application that will be a major crisis when it’s down.  We IT guys are task oriented, and we don’t always have the same opinion as management as to what is and isn’t important.  If your team isn’t empowered to make and enforce the necessary decisions, then it would be wise to at the very least weigh in on what you want monitored and slowly expand the environment once the key business processes are being monitored in a way that the business deems appropriate.
  4. Management needs to clearly state that they expect their IT organization to change its processes to respond to SCOM alerts.  This will not happen organically, no matter how much we want it to.  Quite frankly, if this isn’t done, implementing SCOM or any other monitoring solution will be a complete waste of time. In order for this to work, all IT teams have to actually be using SCOM, and that means using the console.  The SCOM administrator can create custom views, dashboards, roles, etc. to reduce the clutter inside of the environment, but the SQL team needs to be watching SQL alerts daily, the SharePoint guys need to be looking at SharePoint alerts daily, the Exchange guys need to be looking at Exchange alerts daily, and so on.  This will not be solved by turning alerts into tickets (not initially at least), nor will it be solved by creating an email alert. If SCOM is to be rolled out, and you want the benefits of rolling it out, then your IT staff needs to use it the way it was intended: being in the console daily and managing what you are responsible for.

What you have if these four items aren’t accomplished is a “check the box” solution instead of a solution that will actually facilitate good ITSM processes.  Management said to roll out monitoring, so we did.  Never mind that management never told us what to monitor or how to use it, so we just tossed in a few management packs because they seemed like good ideas and never bothered addressing the alerts they raised. And as such, that really cool PKI management pack that we rolled out told us that certificate X was about to expire, but since no one bothered to do anything with the alert for the 21 days it sat in the console, the cert still expired and the system still went down (and yes, I’ve seen this happen more than once).

People Issues

People are oftentimes just as much of a problem as management, and when there’s no clear directive from management, that essentially gives people permission to do what they want.  Often, that will be nothing.

As a general rule of thumb, people don’t appreciate being told that what they are doing is wrong.  We are proud of our work, we would like to believe that we know exactly what we are doing, and some cultures (especially the unhealthy ones) relish making people pay for their mistakes.  The reality, though, is that we aren’t perfect, we do make mistakes, and having something tell us that is a good thing.  Unfortunately, we don’t appreciate that advice.  As such, people love to do the following:

  1. Refuse to use the product or decide to use it in a way that is counterproductive.
  2. Blow off the Tier 1 people responsible for triaging alerts (if Tier 1 is doing this).
  3. Blow off the SCOM administrator responsible for tuning SCOM.

That tends to discourage any real organizational change from happening. The Tier 1 guys eventually stop calling.  If you have a really good SCOM administrator, he or she finds a new job at Microsoft or chooses to focus their efforts elsewhere (since rarely is monitoring their only responsibility).  These people are being asked to solve people problems, but most IT guys are much better at solving technical ones.

Technical Issues

I don’t want to pretend that SCOM is a perfect solution.  It’s not, but it does work very well when used in the way it was intended to be used.  The reality, however, is that out of the box, SCOM will generate A LOT of noise.  The tuning process will be covered more in the next article, but the bottom line is that it MUST be done. SCOM isn’t smart enough to know that the web application you have turned off is supposed to be turned off.  It just knows that it’s off and tells you (repeatedly in some scenarios).  It doesn’t know what thresholds are appropriate for your organization; you have to set them.  It is smart enough to tell you the things it needs to do its job, but you have to do those things.

Out of the box, SCOM is usually going to tell you a lot of things. That will be overwhelming to most people, as they often see hundreds of alerts, and not all of them are intuitive (for the most part they are, or can be researched pretty easily, but it will require effort).  Alerts generally require a certain amount of troubleshooting, and they can take a lot of time to fix.  Not to mention, the skillset and/or permissions needed aren’t always going to be centered on the SCOM administrator.  If it’s going to work, it has to be a team effort, and that will mean lots of change.
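One way to get a handle on that noise is to find the repeat offenders first, since a handful of alerts usually account for most of the volume. Here’s a minimal sketch of that idea; it assumes you’ve exported your active alert names to a simple list (say, from a console export or Get-SCOMAlert), and the sample alert names below are hypothetical:

```python
from collections import Counter

# Hypothetical sample of alert names, standing in for a real export
# from the SCOM console or the OperationsManager PowerShell module.
alerts = [
    "Health Service Heartbeat Failure",
    "Logical Disk Free Space is low",
    "Health Service Heartbeat Failure",
    "Web Application Unavailable",
    "Health Service Heartbeat Failure",
    "Logical Disk Free Space is low",
]

# Rank alerts by volume; the top few entries are usually where
# tuning effort pays off first.
for name, count in Counter(alerts).most_common(3):
    print(f"{count:>4}  {name}")
```

The exact mechanics matter less than the habit: tune the loudest alert, then the next, rather than trying to address hundreds at once.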

The Anatomy of a Good SCOM Alert Management Process – Part 1: Why is alert management necessary?

I’ve had the luxury of doing SCOM work for several years now, across many different client types and infrastructures, and one of the constants I see in almost every environment I’ve worked in is a lack of planning around how SCOM will be used.  There are a number of reasons for this, ranging from a lack of understanding about how the tool works, to internal political issues, to administrative problems such as lazy or secretive administrators, to many other things as well.  The solutions to these problems are not always easy and require a bit more than just tossing in a piece of technology so that we can check a box and say “we have monitoring.”  The reality, though, is that this is precisely how many SCOM environments are designed.

We at Microsoft have often been very good at solving technical problems, and the SCOM community as a whole has a number of fantastic blogs, ranging from frequent bloggers such as Kevin Holman, Stefan Stranger, and Marnix Wolf to folks such as myself who are far less intelligent and do far less blogging.  In all, if you need to solve a technical problem in SCOM, it’s not hard, as someone has already done it.  Unfortunately, I think this is where we can often fall short: we leave it up to our customers to use our products, and they can have some very interesting ways of using them.  SCOM is no different.  The biggest problem with SCOM that I see is that organizations never address the people or processes surrounding the technology that they purchased.

First, let’s start with the obvious. SCOM is very good at telling users that something is wrong.  It’s not hard to spin up, and after tossing in a few management packs, you will quickly start seeing alerts ranging from simple noise to real problems.  SCOM engineers quickly realize that there are lots of really cool Microsoft and Non-Microsoft MPs out there along with some really good ideas (and bad ones).  It does not take long for a customer to deploy a bunch of agents, import some management packs, and next thing they know, their alerts screen is full of red and yellow alerts indicating that something may be wrong.

What I find, though, is that the alert management process pretty much stops at this point.  Yes, there are organizations that truly do try to have an end-to-end alert lifecycle, but just about every organization I’ve visited is stuck here, even while thinking they aren’t.  These orgs have a SCOM administrator, who often wears multiple hats, and perhaps a tier 1 staff watching the active alerts to some extent. Usually, tier 2 and 3 are completely disengaged, never touching the SCOM console or even actively resisting monitoring attempts. In an attempt to bring monitoring issues to light, orgs decide to send emails or generate tickets when alerts are created.  Generating tickets usually frustrates the help desk, as SCOM can quite literally generate thousands of alerts a day, which essentially makes SCOM our own private spam server.  Administrators create rules that put SCOM alerts in folders, and in the end, nothing gets changed while the same alert generates tens, hundreds, or even thousands of emails that go unanswered, and the problem is never actually solved.

The problem is never solved because of a fundamental lack of understanding of what needs to be done and why.  There are a few reasons why alert management is necessary:

  1. There are technology issues with the product which require it.  State changes are not groomed from the database while the associated object’s state is unhealthy.  A failure to manage alerts and fix the issues associated with them leaves data in the database beyond its grooming requirement. Likewise, state and performance data can be stored in the DataWarehouse for a long period of time. Failure to manage alerts can lead to a very large DW, often containing lots of data that the customer couldn’t care less about, eventually leading to performance issues if it is not managed.
  2. All environments are different.  This should go without saying, but it means that it is impossible for SCOM to meet the exact needs of your organization OUT OF THE BOX.  Thresholds for alerts in one organization may be too high, while in others too low.  In some orgs, a monitor or rule is not applicable, and in some cases, what they really want to monitor is turned off by default.  As such, the SCOM administrator’s primary job is to tune alerts.

Tuning, while it seems like a simple job, requires teamwork. While it would be nice if your SCOM administrator were a technology guru, the reality is that this engineer likely knows bits and pieces about AD, Platforms, Clustering, Skype, DNS, IIS, SharePoint, Azure, Exchange, PKI, Cisco, SAN, and whatever else you happen to have in your environment.  He or she will likely not know these products in detail and as such relies on tier 1 and 2 to investigate issues, as well as on tier 3’s input as problems are uncovered.  That problem is further complicated by processes which need to change, as actions by other IT administrators and engineers can lead to additional alerts for reasons as simple as not putting an object into maintenance mode before rebooting it.

As such, an alert management lifecycle is necessary to handle the end-to-end life of an alert, whether that be creation to resolution in the event of a real problem or the tuning of alerts to reduce noise.

Part 2:  Process Blockers to Good Alert Management.

Part 3:  Completing the Alert Management Life Cycle.