I’m going to do a back-to-basics article here. It’s not that nothing has been written on monitors and rules in SCOM, but I don’t think the subject has been fleshed out well, and to non-seasoned SCOM engineers these two mechanisms are not exactly intuitive. As such, I wanted to walk through the two mechanisms that SCOM uses to monitor your environment and how they affect your alert management process, as there are some significant differences between them that SCOM users need to be aware of (note this is users, the guys who close/acknowledge alerts, not just the engineers dedicated to SCOM). Alert management can be a difficult thing to do. It’s the hardest part of managing SCOM due to noise, internal politics, management issues and a whole host of other things. A colleague of mine has put together a nice chart on the subject. There is not one right way to do alert management, but there are a lot of wrong ways. Today, I’d like to discuss some things that can cause problems with alert management, namely monitors and rules.
So first, let’s start with the basics. SCOM has 4 main data types that get stored in the DataWarehouse: event, state, performance, and alert data. Of these four types, rules are responsible for collecting event and performance data. Collection rules like these do not generate alerts; they are just mechanisms to collect and store data for the purpose of reporting. You want to know the processor utilization of server X over the last 3 months? It’s a rule that collects this and tosses it in the DW. The same goes for certain event collection rules. Their purpose in this scenario is solely to collect.
State data, however, is a different animal. This is the health of the objects monitored in SCOM. You view this data in the OperationsManager database through Health Explorer. Of note, only monitors can change state, so when you view Health Explorer, you will only see monitors (those in a red or yellow state by default, or in all states if you unscope it); you will never see alerts generated by a rule. This is why on occasion you’ll open Health Explorer from an alert only to see a completely healthy object underneath it: a rule generated the alert (note that on occasion a management pack designer will create both a monitor and a rule with the same name… yeah… just to confuse you). I’ll get to more about this in a minute.
That brings us to alerts, as both monitors and rules can generate them. Because of this, you need to view alerts as simply alerts and recognize that SCOM has two alert generating mechanisms, which behave a bit differently. Neither monitors nor rules are required to generate alerts (monitors almost always do, but this is not necessarily a given).
So let’s start with rules:
- They don’t change state. I covered this before, but a rule just tells you something happened. It is not definitive proof that your environment is unhealthy. A common rule would be something like “SQL DB backup failed to complete”. There’s nothing wrong with the health of your SQL environment, but this just might be something you want to look at. If you don’t care about the backup, then turn it off (and not the rule, the backup itself), as at this point it is clutter.
- Rules like to talk. They tell you something needs to be looked at. They tell you again. They tell you again and again and again. A poorly constructed rule can generate thousands of alerts in a short period of time… and sadly yes, I know this from experience 🙂 You can build alert flood protection into rules, but that has to be done when the rule is created, and if the rule lives in a sealed management pack, the only way to add alert flood protection is to disable the rule and re-create it. With alert flood protection, a rule increments a “repeat counter” on the existing alert in SCOM instead of raising a new one. It is worth noting that this counter is not visible by default in your console, and you’ll want to right click on your columns, select “personalize view”, and check the box that says “Repeat Count” (usually near the bottom). A rule with a high repeat count means that the condition has been detected on numerous occasions. It means you might want to investigate what is going on, as closing the alert will only mean that it comes back in a few minutes.
- Rules do not auto-close. This means that if you’ve fixed the problem, you need to close the alert if it was generated by a rule.
- Alert Generation cannot be turned off for rules in sealed management packs. You can only disable the rule and/or recreate it if needed.
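To make the repeat counter idea concrete, here is a minimal sketch in Python. It is a toy model, not SCOM code: the `Alert` and `RuleAlertStore` names and the name-plus-source matching are assumptions made up for illustration. It only shows the difference between a rule that raises a fresh alert on every detection and one that bumps a repeat count on the alert that is already open.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    source: str
    repeat_count: int = 0
    resolved: bool = False


class RuleAlertStore:
    """Toy model of alert flood protection for rule alerts (not SCOM code)."""

    def __init__(self, suppress: bool):
        self.suppress = suppress
        self.alerts = []

    def raise_alert(self, name: str, source: str) -> Alert:
        if self.suppress:
            for alert in self.alerts:
                # Assumption: an open alert with the same name and source
                # counts as "the same condition".
                if not alert.resolved and (alert.name, alert.source) == (name, source):
                    alert.repeat_count += 1
                    return alert
        # No suppression, or no matching open alert: create a new one.
        alert = Alert(name, source)
        self.alerts.append(alert)
        return alert


noisy = RuleAlertStore(suppress=False)
quiet = RuleAlertStore(suppress=True)
for _ in range(1000):
    noisy.raise_alert("SQL DB backup failed to complete", "SQL01")
    quiet.raise_alert("SQL DB backup failed to complete", "SQL01")

print(len(noisy.alerts))             # 1000 separate alerts
print(len(quiet.alerts))             # 1 alert
print(quiet.alerts[0].repeat_count)  # 999 repeats counted on that one alert
```

In this model, closing the one suppressed alert while the condition persists just produces a new alert on the next detection, which is the "it is coming back in a few minutes" behavior described above.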
Now on to monitors:
- Monitors do change state. This means that when a monitor detects an unhealthy condition, the object that contains it will go unhealthy. Not all objects roll up through Windows Computer, so be careful there, as simply using health explorer from that view may surprise you as you won’t necessarily see what you want. There is no super-class in SCOM, and I don’t believe that will change in 2016.
- Monitors have a mechanism for detecting a healthy condition. You can do this via a timer, an event ID, or a script. The bottom line is that the premise behind monitors is health, and as such they need to have both healthy and unhealthy conditions.
- Alerts generated by monitors usually auto-resolve. This means that when a monitor generates an alert, it will also close that alert once it detects the healthy condition. This is an overridable parameter, so by all means check it, but it’s pretty rare to see auto-resolve turned off by default. I cannot think of an example, though I’m sure someone has seen it… That said, this can be a cause for noise in your environment if you have a monitor that is flip flopping back and forth between healthy and unhealthy. Monitors doing this need to be addressed, and since the alerts can go away on their own, you can miss them if you don’t have a good alert management process in place. I typically like to use the most common alerts report in the generic report library, as this can tell you which monitors/rules are generating said alerts. Every now and then, you’ll see items on that report that have no alerts in the console, and a flip flopping monitor can be the cause of this.
- Closing an alert generated by a monitor does not reset the health. This means that if the underlying condition is still present, the health of the object in question will not go back to green. Therefore, DO NOT CLOSE AN ALERT GENERATED BY A MONITOR. INSTEAD, USE HEALTH EXPLORER TO RESET THE HEALTH. Sorry that I had to be loud there. This is probably the most common mistake someone new to SCOM makes (myself included by the way). It is an even bigger problem with organizations that use SCOM in their NOC (network operations center), as typically the helpdesk/NOC tends to be made up of people with less experience, and the natural tendency is to close the alert when you think you’ve fixed it. Sadly, some close it because they don’t know what to do with it. When that happens, that particular monitor will not generate an alert again unless it goes healthy first. Other monitors will continue to generate alerts, but the one in question will stay silent until its health resets. This also causes grooming problems with the OperationsManager DB, as state data isn’t groomed while the object is unhealthy. It would be nice if the product team put out an update to generate a health reset in this scenario, but to my knowledge this has not happened.
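The last point trips up enough people that a small sketch may help. The `ToyMonitor` class below is a hypothetical Python model, not anything from the SCOM SDK. It assumes a monitor that raises an alert only on the healthy-to-unhealthy transition and auto-resolves it when it goes healthy, which matches the behavior described above.

```python
class ToyMonitor:
    """Toy model of a monitor and its alert (an illustration, not SCOM code)."""

    def __init__(self):
        self.healthy = True
        self.open_alert = False
        self.alerts_raised = 0

    def detect_unhealthy(self):
        # An alert is raised only on the healthy -> unhealthy transition.
        if self.healthy:
            self.healthy = False
            self.open_alert = True
            self.alerts_raised += 1

    def detect_healthy(self):
        # Going healthy auto-resolves the alert (the usual default).
        self.healthy = True
        self.open_alert = False

    def close_alert(self):
        # Closing the alert does NOT touch the health state.
        self.open_alert = False

    def reset_health(self):
        # Stand-in for resetting health from Health Explorer.
        self.detect_healthy()


monitor = ToyMonitor()
monitor.detect_unhealthy()    # first alert is raised
monitor.close_alert()         # operator closes the alert by hand
monitor.detect_unhealthy()    # condition is still there, but no new alert
print(monitor.healthy, monitor.open_alert, monitor.alerts_raised)  # False False 1

monitor.reset_health()        # health is reset instead
monitor.detect_unhealthy()    # the next detection alerts again
print(monitor.healthy, monitor.open_alert, monitor.alerts_raised)  # False True 2
```

After the manual close, the object is still unhealthy with nothing in the alert view to show for it. That is the silent state you want your NOC to avoid.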
OK, so that’s the scoop. The big thing is to identify which mechanism generated the alert. From there, you can craft a process on how to deal with it. Just to answer one question I get asked a lot, the alert itself does tell you which mechanism generated it.
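As a starting point for such a process, here is a rough Python sketch of the triage logic from this article. Everything in it is an assumption for illustration: the alerts are plain dictionaries from a hypothetical export, the `is_monitor_alert` and `repeat_count` field names are made up (to my knowledge, alert objects in the SCOM PowerShell module expose a similar `IsMonitorAlert` property, but check your own environment), and the threshold of 10 is arbitrary.

```python
REPEAT_THRESHOLD = 10  # hypothetical cut-off for "this keeps recurring"


def next_step(alert: dict) -> str:
    """Suggest how to handle an alert based on what generated it."""
    if alert["is_monitor_alert"]:
        # Monitor alerts: closing the alert does not reset the health.
        return "fix the cause, then reset health in Health Explorer"
    if alert.get("repeat_count", 0) >= REPEAT_THRESHOLD:
        # Rule alerts with a high repeat count just come back if closed.
        return "investigate the recurring condition before closing"
    # Rule alerts never auto-close.
    return "fix the cause, then close the alert manually"


# Illustrative sample data, not real console output.
alerts = [
    {"name": "Example monitor alert", "is_monitor_alert": True},
    {"name": "SQL DB backup failed to complete", "is_monitor_alert": False, "repeat_count": 42},
    {"name": "Example rule alert", "is_monitor_alert": False},
]
for alert in alerts:
    print(f"{alert['name']}: {next_step(alert)}")
```

The point is not the exact wording of each step but that the first branch in any alert handling process should be "monitor or rule?".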