OpsMgr R2 by Example: The Windows Server Management Pack

The Windows Server management pack is available as a single download that contains separate libraries for monitoring the Windows 2000 Server, Windows Server 2003, and Windows Server 2008 operating systems.

How to Install the Windows Server MP

  1. Download the Windows Server Operating System management pack from the Management Pack Catalog. The Windows Server Operating System Management Pack Guide is included in the download and labeled “OM2007_MP_WinSerBas.doc.”
  2. Read the Management Pack guide, which covers topics such as how to activate monitoring for physical disks and disk partitions.
  3. Import the Windows Server Operating System Management Pack (using either the Operations console or PowerShell).
  4. Create a WindowsServer_Overrides management pack to contain any overrides required for the MP.

Windows Server MP Tuning / Alerts to look for

The following alerts were encountered and resolved while tuning the various Windows Server management packs (these are listed in alphabetical order by Alert name):

Alert: Disk transfer (reads and writes) latency is too high

Issue: This monitor checks for high values on the performance counter every 60 seconds over a 5-minute timeframe.

Resolution: Determined that spikes were occurring on a specific drive on the system. The drive needs to be replaced with a higher-speed drive, or some of the workload on this drive should be moved to another physical drive.
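To see why short spikes on one drive can be enough to trip this kind of monitor, the evaluation can be sketched as follows. This is a minimal illustration, not the management pack's actual logic: the class name, the 50 ms threshold, and the rule of averaging the last five 60-second samples are all assumptions.

```python
from collections import deque


class LatencyMonitor:
    """Hypothetical model of a monitor that samples once per minute
    and evaluates the most recent five samples (a 5-minute window)."""

    def __init__(self, threshold_ms, samples=5):
        self.threshold_ms = threshold_ms        # assumed threshold, not the MP default
        self.window = deque(maxlen=samples)     # keeps only the newest samples

    def add_sample(self, latency_ms):
        """Record one sample; return True when the window is full
        and its average exceeds the threshold."""
        self.window.append(latency_ms)
        if len(self.window) < self.window.maxlen:
            return False                        # not enough data yet
        average = sum(self.window) / len(self.window)
        return average > self.threshold_ms
```

Under these assumptions, four samples of 10 ms followed by a single 300 ms spike average 68 ms, which exceeds a 50 ms threshold, and the monitor stays unhealthy until the spike leaves the window.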

Alert: Event log is full

Issue: Alert generated by the Windows Server 2003 management pack from the Event Log File is Full alert rule. The alert description contains information about which event log is full (in this case it was the PowerShell log file).

Resolution: Logged on to the server and verified that the log size was set to a maximum of 512KB and to overwrite events older than 7 days. Reconfigured the log to increase the size to 2048KB and to overwrite events as needed. Closed the alert.
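The reason this change clears the alert can be shown with a small simulation. This is a simplified sketch, not how the Windows event log service is implemented: the function name, the policy labels, and the event sizes are hypothetical. With an "older than N days" policy, a full log cannot evict events that are still inside the retention period, so new events are discarded. With "overwrite as needed", the oldest events are always evicted first.

```python
from collections import deque


def simulate_log(events, max_bytes, policy, retention_days=7):
    """Return how many events are discarded because the log is full.

    events: iterable of (day, size_bytes) tuples in chronological order.
    policy: "older_than" (only overwrite events older than retention_days)
            or "as_needed" (overwrite the oldest events whenever space is short).
    """
    log = deque()      # (day, size) entries currently stored, oldest first
    used = 0
    discarded = 0
    for day, size in events:
        # Evict from the oldest end while space is short and the policy allows it.
        while log and used + size > max_bytes:
            oldest_day, oldest_size = log[0]
            if policy == "older_than" and day - oldest_day <= retention_days:
                break  # oldest event is still protected by the retention period
            log.popleft()
            used -= oldest_size
        if used + size <= max_bytes:
            log.append((day, size))
            used += size
        else:
            discarded += 1
    return discarded


# Hypothetical load: 100 events per day for 3 days, 2 KB each.
events = [(day, 2048) for day in range(3) for _ in range(100)]
print(simulate_log(events, 512 * 1024, "older_than"))   # some events are lost
print(simulate_log(events, 2048 * 1024, "as_needed"))   # nothing is lost
```

In this made-up workload, the 512KB log with the 7-day policy fills before any event is old enough to be overwritten, while the larger log with "overwrite as needed" never discards a new event.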

Alert: Logical Disk Free Space is Low

Issue: Low disk space on the drive identified in the alert.

Resolution: Either free up disk space on the drive or configure an override to change the monitoring thresholds. Overrides can be configured for system drives or for non-system drives. In this configuration, there is a C drive, a D drive, and a Q drive. The Q drive was critical, and free space could not be made available on it. Without modifying the script (which is not viable in a sealed management pack), the only option is to set an override for non-system drives at a level where the Q drive is no longer critical. This means that the D drive on the same system will not alert until it reaches the new critical level. The other option is to acknowledge the alert and not resolve it at this point. The script that performs this check is called FreeSpace.vbs, and it is automatically distributed into a temporary directory located under %ProgramFiles%\System Center Operations Manager 2007\Health Service State.
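The side effect of a non-system drive override can be illustrated with a short sketch. This is not the logic of FreeSpace.vbs: the function name and all threshold values are hypothetical, and the model is simplified to a percentage-only check. It shows only that one shared pair of thresholds governs every non-system drive on the server.

```python
def disk_state(free_pct, is_system_drive, thresholds):
    """Classify a drive using one threshold pair per drive category.

    thresholds: {"system": (warning_pct, critical_pct),
                 "non_system": (warning_pct, critical_pct)}
    """
    kind = "system" if is_system_drive else "non_system"
    warning_pct, critical_pct = thresholds[kind]
    if free_pct <= critical_pct:
        return "critical"
    if free_pct <= warning_pct:
        return "warning"
    return "healthy"


# Hypothetical starting thresholds (percent free space).
defaults = {"system": (10, 5), "non_system": (10, 5)}

# Override applied to quiet the Q drive, which sits at 4% free.
overridden = {**defaults, "non_system": (3, 2)}

print(disk_state(4, False, defaults))     # Q drive before the override
print(disk_state(4, False, overridden))   # Q drive after the override
print(disk_state(4, False, overridden))   # a D drive at 4% is now quiet as well
print(disk_state(4, True, overridden))    # the C drive is unaffected
```

Because D and Q share the non-system category, lowering the threshold far enough to silence Q also delays any alert for D, which is the trade-off described above.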

Alert: Network Interface failed.

Issue: Network interface on a system was no longer online.

Resolution: The system in question had been accidentally unplugged from the network. Closed the alert after the network interface was online.

Alert: The device has a bad block.

Issue: Bad block on the drive on the system.

Resolution: Ran chkdsk /F to scan for bad blocks. This required a reboot because the bad block was identified on the boot partition.

Alert: The event log file is full. New event instances will be discarded.

Issue: The event log was set to overwrite events older than 7 days.

Resolution: Increased the event log size from 512KB to 2048KB, and set to overwrite events as needed.

Alert: The service terminated unexpectedly.

Issue: The service identified in the alert failed.

Resolution: Verified that the server could be pinged using the tasks on the right, and used the Computer Management task to verify that the service was in a started state. Closed the alert after recording information in the company knowledge, so that recurrences can be tracked for a pattern that shows what is causing the service to fail. In one case the service was actually down, and the Computer Management task was used to restart it.

Alert: The share configuration was invalid. The share is unavailable.

Issue: The share within the alert was a user share on a system.

Resolution: Determined that the user still existed in Active Directory (checked in AD Users and Computers and validated that the user name was the same). Recreated the user folder per the product knowledge. If the user no longer existed, the share would have been removed using the net share /delete option presented in the product knowledge.

Alert: Too many requests for performance counter data have timed out

Issue: In this environment, this only seems to occur on Windows 2000 systems running Diskeeper. Diskeeper starts just after 9 pm, an alert follows just after 10:15 pm (perflib event ID 1015 in the application log for the PerfDisk performance data counter), and Diskeeper completes its run just after this event is logged.

Resolution: Disabled this alert for the specific servers that are Windows 2000 systems running Diskeeper. Stored the override in the MicrosoftWindowsServer_Overrides management pack kept for overrides on the various Operating System related management packs. If there were a large enough number of systems, it would be recommended instead to upgrade the version of Diskeeper (or the operating systems).

Alert: Total CPU Utilization Percentage is too high

Issue: Most likely, the processor on the system is currently over-utilized and indicating a bottleneck condition. Common potential causes for this include:

  • Misconfigured anti-virus can cause high processor utilization if files which should be excluded from scanning are not (such as for Exchange databases, logs, and the bin directory).
  • Hardware failure is another possibility that should be considered and researched through the hardware vendor.
  • A hung process may be consuming resources to the exclusion of all others.
  • A large portion of the time the system actually is bottlenecked. This can be verified by checking the processor performance counters gathered by OpsMgr for a consistent bottleneck, or by logging on to the system and using Task Manager to determine what is using up CPU cycles. Most likely, a process running on the system is using too much processor time.
  • A great Microsoft discussion on Processor Bottlenecks is available at http://technet.microsoft.com/en-us/library/aa995907.aspx.

Resolution: Add more processing resources (faster processors, additional processors), replace the system with one that has stronger processor(s), split the load through network load balancing, or move the programs or services creating the load off the system. Until the processing bottleneck can be addressed, use the trend of the performance counters to determine an acceptable level for this particular system in your organization, and set an override so that alerts are generated only when the system goes beyond that level.
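One way to turn a performance trend into an override value is to take a high percentile of the collected samples and add some headroom. The sketch below is only an illustration of that idea: the function names, the nearest-rank percentile method, the 90th percentile, and the 5-point headroom are all assumptions, and the sample values are made up rather than taken from OpsMgr.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def suggest_threshold(samples, pct=95, headroom=5.0):
    """Suggest an alert threshold: a high percentile plus headroom, capped at 100%."""
    return min(100.0, percentile(samples, pct) + headroom)


# Hypothetical CPU utilization samples (percent) for one server.
cpu_samples = [20, 25, 30, 35, 40, 45, 50, 55, 60, 90]
print(suggest_threshold(cpu_samples, pct=90))
```

With these made-up samples, the single reading of 90% is treated as an outlier, and the suggested threshold sits just above the level the server normally reaches. The resulting number is a starting point for the override, not a substitute for reviewing the trend.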

Alert: Total Percentage Interrupt Time is too high

Issue: Most likely, the processor on the system is currently over-utilized and indicating a bottleneck condition. Common potential causes for this include:

  • Misconfigured anti-virus can cause high processor utilization if files which should be excluded from scanning are not (such as for Exchange databases, logs, and the bin directory).
  • Hardware failure is another possibility that should be considered and researched through the hardware vendor.
  • A hung process may be consuming resources to the exclusion of all others.
  • A large portion of the time the system actually is bottlenecked. This can be verified by checking the processor performance counters gathered by OpsMgr for a consistent bottleneck, or by logging on to the system and using Task Manager to determine what is using up CPU cycles. Most likely, a process running on the system is using too much processor time.
  • A great Microsoft discussion on Processor Bottlenecks is available at http://technet.microsoft.com/en-us/library/aa995907.aspx.

Resolution: Add more processing resources (faster processors, additional processors), replace the system with one that has stronger processor(s), split the load through network load balancing, or move the programs or services creating the load off the system. Until the processing bottleneck can be addressed, use the trend of the performance counters to determine an acceptable level for this particular system in your organization, and set an override so that alerts are generated only when the system goes beyond that level.

Alert: Windows Event 2008 – Unable to read an event log

Issue: The application log file had become corrupted in one instance, and the server application log in another instance.

Resolution: Verified that the server was not in some way restricting access to the log file. Used the Computer Management task to fix the corrupt event log by right-clicking the event log, choosing the option to Clear all Events, and then re-opening the event log that had been corrupt.

Windows Server Management Pack Evolution

Overall, the Windows Server management pack provides a very strong set of functionality for Windows operating systems. One useful improvement would be additional diagnostics and recoveries, such as one to run the disk cleanup utility in low disk space situations and one to report where space is being used on a drive that is running low.
