So You Want to Roll Out SCOM: Decisions that you should make before you click Install.

I’ve been on a process kick lately, in large part because the issues I encounter in SCOM environments usually aren’t related to the technology but to the processes that surround it. With that in mind, I decided to put together a primer on the things that need to be done, or at the very least considered, before the deployment starts.

The Technical:

Kevin Holman has a great quick start guide to setting up SCOM.  That really covers most of the technical side of the deployment, though there are a few key things you may want to consider before you start installing:

  • SQL Server location:  Are you going to keep SQL Server local to the SCOM management server?  Do you have an enterprise cluster?  How are you going to carve out the databases?  A default SCOM install uses two databases (the operational DB and the data warehouse), and reporting and ACS add more.  From a best-practice standpoint, consider carving out separate volumes for the operational DB, its log file, the DW, its log file, and tempdb.  SCOM is very disk intensive, and isolating these five onto their own volumes will help with performance.
  • Account naming convention, and whether you want to create all of them:  Here’s what is required.  Kevin mentions the accounts in his quick start guide as well.  Oh, and for the sake of all things security, DO NOT use a domain admin account for any of these.  Also, don’t forget to configure your SPNs (there’s a sketch of that after this list).
  • Sizing:  I get asked a lot how big to make the environment, and the answer is vague: it really depends.  Microsoft provides a nice sizing guide that can help answer those questions.  The size of your environment really depends on what you want to monitor as well as how much availability you want.
  • Backup strategy:  Do you want to back up just the databases?  That’s the traditional method, though I strongly recommend having a restore procedure in place and validated if that’s the plan.  If space is at a premium, you may want to consider grabbing just your unsealed customizations instead.  That takes up a lot less space (at the cost of a total loss of historical data in a disaster), and there are easy ways to do it, either by management pack or by script (see the export sketch after this list).
  • Data Retention:  The operational database doesn’t keep data all that long; in a larger environment you may want to reduce those settings, while in a smaller environment you may want to increase them.  Data warehouse retention is a bigger deal, as it is not configurable from the SCOM console.  The DW can also get rather large, particularly the state and performance hourly data, which have a default retention of 400 days.  That can lead to a very large DW and a very angry storage administrator who wants to know why you need all that space.  I personally recommend keeping the daily aggregations for 365 days and the hourly aggregations for about 120 days.  It really is an organizational decision, but one that should be made early on (see the retention sketch after this list).
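
On the SPN note, the usual guidance when the Data Access (SDK) service runs under a domain account is to register the MSOMSdkSvc SPNs against that account.  A minimal sketch from an elevated prompt; the management server name and service account below are placeholders for your own:

# Register the SDK service SPNs (short name and FQDN) to the Data Access account
setspn -S MSOMSdkSvc/SCOM01 CONTOSO\svc-scom-das
setspn -S MSOMSdkSvc/SCOM01.contoso.com CONTOSO\svc-scom-das

# Confirm what ended up registered
setspn -L CONTOSO\svc-scom-das

The -S switch checks for duplicates before adding, which is exactly what you want here.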
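
For the script route on backups, here is a rough sketch using the OperationsManager PowerShell module.  It grabs every unsealed MP, which is where all of your customizations live; the destination path is just a placeholder, and you would want to schedule it:

# Export all unsealed management packs to a dated folder
Import-Module OperationsManager
$path = "D:\MPBackup\$(Get-Date -Format yyyy-MM-dd)"   # assumed backup location
New-Item -ItemType Directory -Path $path -Force | Out-Null
Get-SCOMManagementPack | Where-Object { -not $_.Sealed } | Export-SCOMManagementPack -Path $path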
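
For the data warehouse retention settings, the tool most people reach for is Microsoft’s dwdatarp.exe, which is a separate download rather than part of the SCOM install.  A sketch of typical usage; the SQL server name is a placeholder, and the dataset and aggregation names should match whatever the listing command shows in your environment:

# List the current retention settings for every dataset in the DW
.\dwdatarp.exe -s SQLDW01 -d OperationsManagerDW

# Trim performance hourly aggregations from the 400-day default down to 120 days
.\dwdatarp.exe -s SQLDW01 -d OperationsManagerDW -ds "Performance data set" -a "Hourly aggregations" -m 120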

The quasi-technical:

This is still somewhat technical, but there are procedural considerations to be had here.

  • Naming convention for SCOM customizations as well as custom management packs:  I strongly recommend leading all of your customizations with some sort of organizational prefix.  The reason is that six months from now, you likely won’t remember what you named that custom monitor.  Save yourself the time and settle on a consistent naming convention, if for no other reason than to make searching easier.
  • What do you want to monitor:  Let’s start with the obvious: SCOM is a framework that can monitor a lot of things.  Microsoft makes management packs for most (if not all) of its products, and there are a lot of 3rd party MPs out there as well (note: not all are free, and not all are good).  Most importantly, don’t make the mistake that I and many others have made of rolling all of them out at once.  You’ll want to tune each MP, so roll them out progressively (preferably in a QA environment first) to identify the noise before it reaches production and you have an angry monitoring team.
  • Alert Management:  I’ve written a 3 part piece on this subject.  The first part is here (and it links to the other two).  Suffice it to say, most organizations don’t really sit down and think about how they plan to respond to alerts.  The end result is an organization that has purchased a monitoring tool but does not monitor.
  • Which systems to monitor:  Are you going to throw your development systems into your production SCOM environment?  Do you really only care about a few core systems?  The bottom line is that SCOM is going to tell you a lot about your environment.  It’s great at detecting bad IT hygiene, but it doesn’t know which findings are by design.  If you want a good process for responding to alerts, you should sit down and decide which systems are important enough to alert on.  If you throw everything into one environment, you are going to make life very difficult for the people who are supposed to do the monitoring.
  • What processes need changing:  This goes back to alert management, but the bottom line is that plenty of organizational processes will need to change to account for SCOM.  A short list includes the maintenance process (there’s a maintenance mode sketch after this list), decommissioning servers, commissioning servers, and responding to alerts.
  • Who needs access:  Contrary to a lot of systems, most of your IT staff really only needs to be an operator.  Their job is to close alerts, reset health state, and use dashboards and reports.  You probably don’t want to give them the rights to start customizing your environment (a quick way to review who holds what is sketched after this list).
  • Custom Views:  This involves meeting with the various teams in your organization, but you’re going to want to get them using SCOM.  That means they should likely have a scoped role so they aren’t exposed to items they don’t need to see.  It may also involve creating custom dashboards for them.  There are a lot of really cool things you can do with dashboards.  Here’s one for a start.
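
On the maintenance process point, the change usually boils down to adding a maintenance mode step before anyone patches or reboots a monitored server.  A minimal sketch using the OperationsManager module; the server name and window length are placeholders:

# Put a monitored computer into maintenance mode for a two-hour patch window
Import-Module OperationsManager
$instance = Get-SCOMClassInstance -Name "app01.contoso.com"   # placeholder FQDN
Start-SCOMMaintenanceMode -Instance $instance -EndTime (Get-Date).AddHours(2) -Reason PlannedOther -Comment "Monthly patching"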
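
And for the access question, it’s worth reviewing periodically who sits in which role.  A read-only sketch that simply lists each SCOM user role and its members:

# List every SCOM user role and who belongs to it
Import-Module OperationsManager
Get-SCOMUserRole | Select-Object Name, @{ Name = "Members"; Expression = { $_.Users -join "; " } }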

Documentation:

I’m married to a Quality Manager, so I get to hear about this every day from the manufacturing world, and I happen to have a degree in Manufacturing Engineering to go with it.  Needless to say, IT doesn’t do documentation that well, especially when it comes to break/fix.

  • Customizations:  This is a big one.  I typically recommend implementing a basic versioning system for your custom MPs and using the built-in description field to record the version number, what changed, who changed it, and why.  It doesn’t need a committee or special change management, but it is very useful for keeping a running history of what was done in the environment.  As a bonus, when the SCOM owner leaves, his or her replacement will be able to pick up the changes much more easily.  What often happens when there’s little documentation is that the new administrator is very tempted to simply start over (a quick inventory sketch follows this list).
  • Health Check:  Microsoft offers an excellent service for many of its products known as a health check.  You may want to consider doing one within the first few months of rolling out SCOM.  A health check will determine whether there are any performance bottlenecks in your environment and identify potential issues that you may need to address.  It will show you where your best practices might be falling a bit short and help you get the most out of the tool.  It’s not a requirement by any means, but it will give you a very clear picture of your environment as well as a direction in terms of what needs to be addressed going forward.  (Shameless self promotion: if by chance someone reads this and decides to purchase one, please be so kind as to let your account manager know that you read it here.  Those types of things look great on reviews.)
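
To keep that running history honest, it also helps to dump the unsealed MPs with their version, last change date, and description field every so often.  A small sketch along those lines; where you send the output is up to you:

# Inventory the unsealed (customization) MPs: version, last change, and description
Import-Module OperationsManager
Get-SCOMManagementPack | Where-Object { -not $_.Sealed } |
    Select-Object Name, Version, LastModified, Description |
    Sort-Object LastModified -Descending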

Certificate CRL Check Fails When Deploying SCOM Agents to a Unix/Linux Server

I spent the better part of a day and a half working with a client on a rather frustrating issue deploying the SCOM agent to Linux machines.  I ended up working with a few people internally until we were all able to narrow down what it was (special thanks to Kris Bash, Steve Webber, and Ken Engelhardt on this one).

Anyway, our deployment was relatively straightforward: we could install the agent, but we were unable to sign the certificate.  The error we got is not a particularly unusual one in SCOM, but the common solutions for it did not apply:

“The SSL certificate could not be checked for revocation. The server used to check for revocation might be unreachable.”

Generally with this error, the issue is a mismatch between the machine name the certificate was issued to and the value listed in DNS.  It can also occur when there are multiple management servers in the resource pool and their certificates haven’t been exchanged.  This is fairly well documented, but in our case it was something different.
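
As an aside, the resource pool scenario is normally handled by exporting each management server’s SCX certificate and importing it on the other pool members with the scxcertconfig tool that ships with SCOM; roughly like this, with placeholder file names:

# On management server A, export its SCX signing certificate...
scxcertconfig.exe -export \\fileshare\scx_MS-A.cer
# ...then on management server B, import it (and repeat in the other direction)
scxcertconfig.exe -import \\fileshare\scx_MS-A.cer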

There were other symptoms as well: manually signing the certificate and then adding the system as managed still failed.

After quite a bit of digging, it was apparent that the breakdown was with WSMan.  You could see this in the WSMan operational log in Event Viewer, with errors each time we attempted to deploy the agent through the SCOM console.  Manually running the WSMan piece failed as well.  For the record, this is the PowerShell syntax that SCOM uses to connect via WSMan:

Test-WSMan -ComputerName <xxxxxx> -Authentication Basic -Credential (Get-Credential) -Port 1270 -UseSSL

The error was exactly the same, citing a CRL lookup failure, and likewise an event showed up in the WSMan logs.  WSMan, however, doesn’t appear to have a native way to skip the revocation check (or at least not one that I could figure out).  WinRM does, and SCOM will also use WinRM to communicate with the agent.  I’ve added the skip parameters to the end of the command below; running it without them generates the same CRL error.  I didn’t need all of them, either: the -skiprevocationcheck parameter alone was enough to make everything work.

winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx -username:<UNIX/Linux user> -password:<UNIX/Linux password> -r:https://<UNIX/Linux system>:1270/wsman -auth:basic -skipCACheck -skipCNCheck -skiprevocationcheck -encoding:utf-8

We were eventually able to trace this down to a 3rd party product called Axway, also known as Tumbleweed.  This product is used in high-security environments and aids in CRL checking and certificate authentication.  It has an option to bypass CRL checking for self-signed certificates, but that wasn’t working here.  I suspect this is because the SCOM/Unix cert, while technically self-signed, is really being issued by scxadmin (a tool on the Unix machine).  The certificate’s issuer is listed as SCX-Certificate rather than the machine name of the Unix/Linux machine.  As such, Tumbleweed was forcing a CRL check, and when it couldn’t look up the CRL, it would fail.


Uninstalling Tumbleweed from the management server fixed the problem.  That leaves us with a few options:

  1. Uninstall the product.  Not the best choice.
  2. See if you can get the certificate excluded.  This is ideal, though in my case, no one seemed to know who could actually do that.
  3. Temporarily bypass it.

Tumbleweed updates the following registry key with its own information:

HKEY_LOCAL_MACHINE\Software\Microsoft\Cryptography\OID\EncodingType 1\CertDllVerifyRevocation\Default

The stock Windows DLL for that key is cryptnet.dll.

You can swap Tumbleweed’s DLL out of that key, put cryptnet.dll back in, deploy the agent, and then restore Tumbleweed’s entry.  Everything works.
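
Here’s a rough sketch of that swap in PowerShell.  I’m assuming the DLL path lives in a value named Dll under that key, which is how CryptoAPI revocation providers are typically registered; check what your system actually shows with Get-ItemProperty before changing anything:

# Default revocation provider registration (the key above)
$key = "HKLM:\SOFTWARE\Microsoft\Cryptography\OID\EncodingType 1\CertDllVerifyRevocation\DEFAULT"

# Note Tumbleweed's current entry so it can be restored afterward
# (the value name "Dll" is an assumption; confirm it on your system)
$original = (Get-ItemProperty -Path $key).Dll
Write-Host "Current provider: $original"

# Temporarily point the provider back at the Windows default
Set-ItemProperty -Path $key -Name Dll -Value "cryptnet.dll"

# ...deploy and sign the Unix/Linux agents here...

# Put Tumbleweed's entry back when finished
Set-ItemProperty -Path $key -Name Dll -Value $original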