Thursday, October 29, 2009

Stopping alert floods when branch office connectivity is lost OR simple opsmgr Event Rule and Monitor Creation


We had a problem in System Center Operations Manager 2007 where we would get floods of heartbeat and computer-not-responding alerts whenever we lost the network link to a branch site.  After posting a question on the social groups I was directed to an excellent post by Steve Ross that basically solved my problem.  I have extended his solution with a few very minor changes that allowed it to work better for me.  Note that Ross has updated his script a few times and I probably won’t B) so make sure to check his site first.  I will detail my changes and how I set up my alerting below.  Again, my MP-fu is still developing; I should probably put this together in a nicely packaged MP, but that will have to come as a future iteration.

After setting up his script, I created a simple Windows Event Reset monitor that watched for event 18041 and fired an alert.  The monitor reset on 18040 events but the alert stayed so I could see what had happened.  One problem was that I didn’t have an easy way to see how long the outage was without digging into the logs on the machine running the script.  I also had an issue that if more than one branch went down, I didn’t get a second alert.  Not all branches are created equal: if a small office branch goes down, I may decide it can wait for a more appropriate time to fix; if a major site branch goes down, I need to know now.

Remember, this all extends the work done by Steve Ross, (which extends yet others, ain’t teh interwebs grand?).  Go there, get his script if you are trying to suppress alert floods.  Then come back and read through the below to see if it makes sense in your env.  If you just want a rule/monitor walkthrough, well… enjoy… and skip to Items 3 and 4 below.

Update: as of Nov 23, 2012, it looks like the Steve Ross site is no longer responding.  I have put my modified version of his script here.  

Here are the things I did to make it more workable for me.


1.       We have a management server that is dedicated to branch gateway traffic, and I am running my script there as a scheduled task.  We have notifications go out for alerts that are older than 5 minutes so I set this to run every 2 minutes.
2.  I changed the alerting logic slightly.  I wanted to have down events show as Error events.  I also wanted different branches to show as different event numbers.  I did this by adding the For loop variable (‘i’) to the event number.
I did both by changing this call in the script (the severity argument goes from 0 to 1, and ‘i’ is added to badEvt):

Call oAPI.LogScriptEvent("BranchSiteMonitoring.vbs", badEvt, 0, "Branch " & _
"site router for the " & var3(1) & " computer group is unavailable.  " & _
"Computers in this group should be in maintenance mode currently.")

to

Call oAPI.LogScriptEvent("BranchSiteMonitoring.vbs", badEvt + i, 1, "Branch " & _
"site router for the " & var3(1) & " computer group is unavailable.  " & _
"Computers in this group should be in maintenance mode currently.")

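To make the numbering concrete, here is a small Python sketch (separate from the VBScript itself) of how adding the loop variable to the base event number spreads branches across event IDs.  The base value of 18041, the zero-based loop index, and the branch names are all assumptions for illustration, so check the values of badEvt and i in your own copy of the script.

```python
# Illustrative only: mirrors the "badEvt + i" arithmetic from the script.
# ASSUMPTIONS: BAD_EVT = 18041, the loop index starts at 0, and the branch
# names are made up. Verify all three against your own copy of the script.
BAD_EVT = 18041


def branch_event_id(bad_evt: int, i: int) -> int:
    """Return the event ID logged when the branch at loop index i is down."""
    return bad_evt + i


branches = ["BranchA", "BranchB", "BranchC"]  # hypothetical computer groups
event_ids = {name: branch_event_id(BAD_EVT, i) for i, name in enumerate(branches)}

for name, event_id in event_ids.items():
    print(f"{name}: event {event_id}")
```

Because each branch gets its own event ID, each one can have its own monitor and its own alert severity later on.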
3.       Then I created an Event Collection Rule
a.       Under the Authoring panel -> Rules –> create a rule.
b.      We are going to create a simple NT Event Log Collection rule.  This rule will watch the event log and, whenever one of our specified events happens, it will grab it and put it in the database.  I am not alerting here; I have a separate alert for that below.  This is just to record the data for posterity’s sake.  Actually, this is to create a view so I can see at a glance whether we are up or down and where.   Don’t forget to pick a management pack to store this in; all the guidance warns away from using the default MP.  Generally, we put overrides for a specific MP in an MP-specific management pack, _SQL_Overrides or similar.  This rule doesn’t override anything specific, so we have a catchall MP for other types of rules.



c.       Enter your rule name, category, etc.  I am targeting this at the Availability health of a Windows Computer.  I am effectively saying that if my branches are not in communication, my gateway server is unhealthy.  I am also setting the rule to disabled.  I will override and enable it on certain machines (the gateway) that I want to watch.


d.      Set your log to ‘Operations Manager’ as this is where the script logs to.



e.      Set your Event source to Health Service Script.  If you are making your own rules, you can get the details from any event in the event logs, see the pic below.  For the Event ID, I wanted to match events for any of my branches.  According to the logic we set up earlier, ‘All clear’ events default to 18040.  The script will then loop through the branches and increment the event ID: branch 1 will be 18041, branch 2 will be 18042, etc.  In order to catch all my branches, I put in the regex ‘1804[0-9]’.  This will match ‘1804’ followed by any single digit.
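If you want to sanity-check the pattern before typing it into the rule, a quick Python sketch like the one below will do.  It is only an approximation: it uses Python's re module with a whole-string match, which is this sketch's choice and may not be exactly how OpsMgr evaluates the expression.  Note that the pattern covers 18040 through 18049 only, so it has room for the ‘All clear’ event plus nine branch IDs.

```python
import re

# Pattern from the rule: '1804' followed by exactly one digit.
EVENT_ID_PATTERN = re.compile(r"1804[0-9]")


def is_branch_event(event_id: int) -> bool:
    """True when the event ID is one the collection rule should pick up."""
    # fullmatch is an assumption of this sketch: the whole ID must match,
    # so longer numbers that merely start with 1804x are rejected.
    return EVENT_ID_PATTERN.fullmatch(str(event_id)) is not None


for candidate in (18040, 18041, 18049, 18050, 1804, 180412):
    print(candidate, is_branch_event(candidate))
```

If you have more than nine branches, the pattern would need to be widened to match the extra event IDs.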



4.       Next, I created a Windows Event Reset Monitor for each branch that I wanted to alert on
a.       Under the Authoring console -> create Monitor

b.      Select Windows Events -> Simple Event Detection -> Windows Event Reset.  Make sure to select the MP you want to save the Monitor to.


c.       Fill in a Name and Description.  I set the monitor target to be Windows Computer and the parent monitor to be Availability.  Note that we do not enable the monitor in this instance.  I am leaving it disabled and I will override it to enable it on certain machines.  This prevents all my machines from wasting cycles on something I only need to watch on one.


d.      Specify ‘Operations Manager’ as the log we are watching.


e.      Enter the Event source and the Event ID of the event we are trying to catch.  These can be gathered from the event itself.  See the highlighted pic below.



f.        For the ‘good’ event, we do the same as above. Specify ‘Operations Manager’ as the log we are watching.


g.       Enter the Event source and the Event ID of the event we are trying to catch.  These can be gathered from the event itself.


h.      We want this monitor to be a critical alert so we set the states accordingly.



i.         I want to generate an alert with a specific Alert name as below.


j.        Hit Create.  Then do this again for each branch (I couldn’t find a way to create monitors in PowerShell or I would have scripted this.  B)
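For anyone new to this monitor type, the behaviour of one per-branch monitor can be pictured as a tiny two-state machine.  The Python below is a toy model only, not how OpsMgr implements it, and the event IDs 18041 (branch down) and 18040 (all clear) are example values taken from the numbering described earlier.

```python
# Toy model of a Windows Event Reset monitor; not the OpsMgr implementation.
# ASSUMPTIONS: 18041 = this branch is down, 18040 = all clear (example values).
class EventResetMonitor:
    def __init__(self, bad_event_id: int, good_event_id: int):
        self.bad_event_id = bad_event_id
        self.good_event_id = good_event_id
        self.state = "Healthy"

    def on_event(self, event_id: int) -> str:
        """Update and return the monitor state for one incoming event."""
        if event_id == self.bad_event_id:
            self.state = "Critical"   # the 'bad' event trips the monitor
        elif event_id == self.good_event_id:
            self.state = "Healthy"    # the 'good' event resets it
        # Any other event ID (another branch, say) is ignored.
        return self.state


monitor = EventResetMonitor(bad_event_id=18041, good_event_id=18040)
for event_id in (18040, 18041, 18042, 18040):
    print(event_id, monitor.on_event(event_id))
```

Events for other branches leave this monitor alone, which is why each branch needs its own monitor.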
5.       Now that you have all of your Monitors and your rule created, you will want to enable them for specific computers you want to monitor.  In my case, we have the script running in one place so I only enabled it on one machine. 
a.       Find the rule in the authoring pane. 
b.      Click overrides – Override the rule – for a specific object of class: …


c.       Find your monitor server in the object selector.
d.      Click Enabled and change the value to true.


e.      Now find your monitors in the authoring pane.  Repeat steps b-d for each.
6.       Now you should get alerts when branch sites are down and they should clear when they are up. 
7.       I also created a view to be able to see the events coming in.  From here I can see what the state is now and what, if anything, has been down recently.
a.       In the monitoring pane, right click on your MP folder and select New -> Event View
b.      Specify that you only want events with a specific event number and specify that number to include your ‘All clear’ event and all of your branch events.


c.       Save the view.
8.       Enjoy!

6 comments:

  1. Very cool! This closes a serious functionality gap in OpsManager. Thanks for sharing.

  2. Thank you very much for this post! It really help me do my job right. Thank you very much for sharing. (Quick tip, step numebr 4a it says Create rule, when it you be create Monitor).

  3. Great article, I have implemented this solution at my scom installation, it works like a charm.
    Thank you very mutch for this :o)

  4. Do you have the original article from Ross there website dosnt contain it anymore :(

  5. @Kenneth, I have updated the post to contain a version of my modified script. The Steve Ross site doesn't seem to be responding any longer.

