Operations Manager Server Provides Crucial Infrastructure Support
Aug. 20, 2001

Appropriately nicknamed "MOM," the new Microsoft Operations Manager is a powerful and scalable event and performance monitoring system that watches over and helps maintain the availability of a family of Windows servers and server applications. Although Microsoft has provided basic server monitoring in other products, it counted on third-party vendors to help customers build enterprise-scale monitoring and management systems. As part of its new .NET Management Services initiative and with the urging of its customers, Microsoft has brought serious system monitoring in-house. Rather than reinvent the wheel, Microsoft acquired MOM’s base technology from NetIQ and focused on building considerable "knowledge" into the product, allowing customers to see benefits faster.

System Monitoring Challenges

Because most system monitoring solutions are distributed across many machines and require a high degree of built-in intelligence to sort out the trivial from the critical, modern monitoring systems are only slightly less complicated than the systems they manage. Furthermore, the lack of industry standards means that firms implementing monitoring solutions must make large commitments to proprietary technologies. However, these obstacles must be overcome, because distributed systems will never have the reliability and availability of their mainframe predecessors unless they have effective automated monitoring systems. (See "The Need for Monitoring and Management".)

Applications Are What Count

Although detecting low-level failures or depleted resources (such as disk space or memory) is instrumental in keeping systems up and running, users ultimately care about their applications. To verify that applications are working within agreed-upon parameters, developers must "instrument" these applications so that monitoring software can collect relevant health and performance information.

Another means of monitoring applications is through "synthetic transactions." The monitoring system should be able to periodically create application-level transactions that prove that the application is working properly, and if not, raise an alert. However, these transactions must be harmless and must not affect the integrity of real data.
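A synthetic transaction check can be sketched in a few lines. The following is a minimal illustration, not MOM's implementation: the names `Alert` and `run_synthetic_check` are hypothetical, and the `transaction` callable is assumed to be a harmless, read-only test operation (for example, a lookup against a dedicated test account) supplied by the application owner.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Alert:
    source: str
    message: str


def run_synthetic_check(
    name: str,
    transaction: Callable[[], bool],
    max_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Alert]:
    """Run one harmless test transaction; return an Alert on failure, else None."""
    start = clock()
    try:
        ok = transaction()
    except Exception as exc:
        # The application could not complete the transaction at all.
        return Alert(name, f"transaction raised {exc!r}")
    elapsed = clock() - start
    if not ok:
        return Alert(name, "transaction returned an unexpected result")
    if elapsed > max_seconds:
        # Working, but outside the agreed-upon response time.
        return Alert(name, f"transaction took {elapsed:.1f}s (limit {max_seconds:.1f}s)")
    return None
```

A scheduler would call `run_synthetic_check` at a fixed interval and pass any returned alert to the notification system. The `clock` parameter exists only so the timing logic can be tested without waiting.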

Instrumentation Standards Needed

Until recently, no standards existed for instrumenting operating system software or applications, and the existing technologies are narrowly focused and incompatible with one another. Network devices, including computers, can use Simple Network Management Protocol (SNMP) to transmit network-related information to a monitoring system such as HP OpenView Network Node Manager. However, SNMP is too limited to perform the advanced monitoring tasks required by computer hardware, operating systems, and applications.

Operating system status information is logged differently depending on its type; Unix operating systems have a text-based event log service called Syslog, while Microsoft Windows NT/2000 systems have the binary-formatted Event Log service. Application developers can also avail themselves of these facilities, but many do not.

The traditional way for developers to instrument their applications is to record activity in an external log, such as a file that records the time, the username, and an event identifier each time someone completes a transaction with the application. The problems with this approach are that each application log uses its own schema, there is no easy way to know when new data gets written to them, and application monitoring programs don’t know where to find them.
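To make the schema problem concrete, here is a sketch of one such private log format. The tab-separated layout and the names `LogEntry`, `format_entry`, and `parse_entry` are assumptions made for illustration; a different application would just as plausibly choose another layout, which is exactly why a monitoring program cannot read these logs without per-application knowledge.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime
    username: str
    event_id: int


def format_entry(entry: LogEntry) -> str:
    """Write one transaction as a single tab-separated line (hypothetical schema)."""
    return f"{entry.timestamp.isoformat()}\t{entry.username}\t{entry.event_id}"


def parse_entry(line: str) -> LogEntry:
    """Parse a line written by format_entry; only works for this one schema."""
    timestamp, username, event_id = line.rstrip("\n").split("\t")
    return LogEntry(datetime.fromisoformat(timestamp), username, int(event_id))
```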

From its beginning, Microsoft Windows NT/2000 has had a general-purpose performance monitoring system that can collect operating system and application performance data (assuming developers take advantage of it), but Unix systems have no comparable facilities.

Because of the lack of instrumentation standards, developers of system and server-based application software have not put as much effort into instrumentation as users might desire. Developers are not inclined to add instrumentation for every possible monitoring system, particularly since most shops do not have application-level monitoring software anyway.

The major operating system, hardware, and network vendors have approached this lack of standardization by collaborating to define an abstraction layer through which access to all management data can be unified, independent of the environment and underlying management protocols. The driving force behind this effort is the Web-Based Enterprise Management (WBEM) initiative. WBEM defines a single schema (called the Common Information Model, or CIM) that, through specific modules called "providers," can access various types of instrumentation data. However, WBEM does not define which protocols vendors must use to provide an interface to the CIM. Microsoft’s WBEM implementation is called Windows Management Instrumentation (WMI) and has come with Windows since Windows 2000. It currently uses the COM+ protocols to link to the CIM, but its stated direction is to build a Simple Object Access Protocol (SOAP) interface to it. (See the "Resources" section below for links to more information on WBEM and WMI.)

Event Storms and Root Causes

Because of the many dependencies that applications have on operating system services, the network, and other applications, failures rarely happen in isolation, but instead come in a flood and can continue to stream in as long as the underlying problem exists. For example, if a farm of Web servers can no longer communicate with a back-end database server, each of the Web servers will repeatedly generate an event for each failed attempt to reconnect to the database. This scenario illustrates several common monitoring problems:

  • Operations personnel do not want to receive individual alerts from each Web server; all they want to know is that all of them have a common problem related to the database server.
  • Operators do not want to be bombarded with annoying repeat messages about the same problem, especially once someone has acknowledged the initial alert and is working on its resolution.
  • An application failure could have many root problems—in the network, in the database software, in the database server’s operating system, in hardware, or in a service running on a third machine on which the database or Web servers depend, such as the Domain Name System (DNS).

Effective monitoring software must be able to aggregate alerts and suppress repeat alerts. Ideally, it must also successfully analyze the origin of any problems—a challenging process called "root-cause analysis." Skilled operators can sometimes correlate the timing and pattern of events, and by coupling this information with their knowledge of the system’s history and topology, can quickly arrive at the root cause. However, getting monitoring software to perform root-cause determination as well as a skilled operator is extremely difficult, and the process may involve elements of artificial intelligence. Vendors such as BMC and Micromuse have software designed to perform these functions, but their products are expensive and require extensive configuration and modeling before they can work effectively.
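Aggregation and repeat suppression, the simpler of these requirements, can be sketched as follows. This is an illustrative sketch only: `AlertAggregator` and the idea of keying events by a `problem` string are assumptions, and the sketch makes no attempt at root-cause analysis.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class AggregatedAlert:
    problem: str
    sources: Set[str] = field(default_factory=set)
    count: int = 0


class AlertAggregator:
    """Collapse events that share a problem key into one open alert."""

    def __init__(self) -> None:
        self.open_alerts: Dict[str, AggregatedAlert] = {}

    def handle(self, source: str, problem: str) -> bool:
        """Record an event; return True only if operators should be notified."""
        alert = self.open_alerts.get(problem)
        is_new = alert is None
        if is_new:
            alert = self.open_alerts[problem] = AggregatedAlert(problem)
        alert.sources.add(source)   # remember every affected machine
        alert.count += 1            # repeats are counted, not re-sent
        return is_new

    def resolve(self, problem: str) -> None:
        """Close the alert so that a recurrence notifies operators again."""
        self.open_alerts.pop(problem, None)
```

In the Web farm example, the first failed database connection from any Web server would open a single alert; later failures from that server or its peers would only update the alert's source list and counter.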

A Poor Track Record

Because of the problems listed above, monitoring systems are often imperfect and expensive to implement. The challenge is particularly acute in projects involving all-encompassing management "framework" products, such as CA Unicenter or Tivoli TME.

Smaller-scale "point solutions" that abandon the goal of monitoring an entire enterprise full of heterogeneous servers and network devices are often more successful at meeting lesser goals, but they may lack scalability and may have limited ability to monitor certain large enterprise-scale applications, such as SAP R/3.

"Knowledge" Is Key

Although the software itself can be costly, the principal reason monitoring solutions are so expensive is that they require considerable knowledge about the whole system, including the network, hardware, operating system, system services, and applications. Then, someone with that knowledge must determine which events and performance metrics are important to monitor, without flooding operators with irrelevant or routine information, and then configure the monitoring system’s rules and filters to achieve those goals. Since the number of different events, performance metrics, and configuration parameters is almost unlimited, creating a beneficial rather than annoying monitoring system may take considerable time and money.

Furthermore, simply forwarding the important events in raw form to system operators may be of little help. Raw event messages can be exceedingly cryptic, even to system experts. The monitoring system should be able to translate these messages into more meaningful terms and give operators some prescriptive information on what to do next.

The more out-of-box knowledge that can be embedded in the monitoring system, the quicker and easier it will be to begin getting value out of it. Once a monitoring system starts to ease the jobs of system operators, they are more likely to be supportive and cooperative with efforts to fine-tune or broaden the scope of the monitoring system.

Why Microsoft Introduced MOM

Creating a viable system-monitoring product is a challenging task, with no guarantee of success. But Microsoft recognized that it had to take on this task because leaving monitoring services to third parties was ultimately hurting acceptance of its server products, which, in turn, was harming its chance of becoming a dominant vendor of server operating systems and applications.

By offering its own monitoring solution, Microsoft can

  • Ensure that a common instrumentation infrastructure is available to its customers, channel partners, and ISV community, which in turn will encourage development of an extensive body of well-instrumented applications, management knowledge, and integrated management products.
  • Drive down the total cost of monitoring systems so that more customers will implement them, thus improving the perception that its server products are "enterprise ready."
  • Ensure coordination between its operating system and application instrumentation efforts and the management group, and make sure that Microsoft’s application and operating system developers also help develop the knowledge needed to effectively process their instrumentation data.

Last October, when Microsoft announced its ".NET Management Services" vision (see "Microsoft Launches .NET Management Services Initiative" on page 12 of the Dec. 2000 Update), the company also announced that it had licensed NetIQ’s Operations Manager technology as the basis for MOM.

Microsoft already had a modest server-monitoring tool named HealthMon that was first bundled with Systems Management Server (SMS) 2.0 and later integrated into Application Center Server. Although Microsoft could have built HealthMon into a more full-featured monitoring solution, doing so would have taken years of development to equal the performance, robustness, and functionality of NetIQ’s OnePoint Operations Manager product, and Microsoft wanted to put its emphasis into building management knowledge, not infrastructure. The Operations Manager technology was already in its third generation, having first been Serverware SeNTry, then Mission Critical SeNTry Enterprise Event Manager, and lastly Mission Critical/NetIQ OnePoint Operations Manager Version 3.3. (Mission Critical and NetIQ merged in May 2000.)

As part of the deal with NetIQ, Microsoft took over further development of Operations Manager for Windows 2000, while NetIQ can continue to sell and support OnePoint Operations Manager for the Windows NT 4.0 platform. NetIQ engineers assisted Microsoft in developing the first release of MOM, and Microsoft has an option to use them for future MOM development.

Microsoft’s management lineup also includes SMS, Application Center Server, and a suite of management services and technologies such as WMI, Microsoft Management Console (MMC), Active Directory–based Group Policy, Windows Installer, Remote Installation Service, Windows Update, and Terminal Services. Microsoft has reorganized all of these management technologies under a single engineering and marketing team, so customers and partners should expect better focus and coordination in the management area than previously. (For comparison information on MOM, SMS, and Application Center, see "Microsoft’s Other Management Products".)

MOM Product Overview

MOM records relevant events on monitored computers and notifies operators of problems by sending messages (alerts) to consoles and by sending notifications to e-mail addresses or pagers. A MOM system consists of management agents that run on all monitored computers, and an infrastructure that configures the agents to monitor and filter data, collects the filtered data into a central database, and then performs further processing and reporting on it. Rules and scripts running on each agent determine which events are relevant. Management consoles connect to the database to view system status.

MOM implementations require three different server roles:

  • SQL Server stores all collected MOM data and configuration information.
  • Consolidators/Agent Managers are intermediary servers that install MOM agents on groups of managed machines, push rules out to those agents, and then receive alerts and events from agents.
  • Data Access Servers (DASs) coordinate all transactions with SQL Server, receive events from Consolidators, and translate MOM-formatted data into SQL commands. DASs also act as the back end to MOM’s operator and administrator consoles.

For small installations, these three roles can all reside on the same machine, but for larger installations, customers will want SQL Server on a dedicated server. The mid-tier Consolidator/Agent Manager and Data Access Server roles then also run on a dedicated server and can even be implemented on redundant servers to allow load balancing and fault tolerance. MOM scales to manage thousands of servers and process hundreds of thousands of daily events.

For additional information on MOM’s architecture, see the illustration "MOM Architecture".

MOM Agents

The MOM agent is a Windows service that runs on both NT 4.0 and Windows 2000 Servers. The agent is automatically installed if the target server meets the selection criteria configured on a Consolidator/Agent Manager server.

Agents can perform the following tasks:

  • Send data to a Consolidator
  • Apply rules to filter events
  • Execute a script
  • Collapse multiple events into one event
  • Send an alert based on an event
  • Send an alert when a performance threshold is crossed
  • Generate an SNMP message to a remote SNMP monitoring console
  • Run any noninteractive or batch program
  • Change a state variable (data stored in memory for later use by a script to compare changes over time)

The agents perform these functions according to rules and scripts that are pushed to them from MOM Agent Managers and run with domain administrative authority.

The agents have access to many system health data sources, called "providers." These providers include the following:

  • A "heartbeat" that indicates a healthy connection with each agent's Consolidator/Agent Manager
  • All Windows NT/2000 event logs
  • All Windows NT/2000 performance counters
  • Microsoft Internet Information Server (IIS) log files
  • SQL Server trace log files
  • SNMP traps from itself or other network devices
  • Unix Syslogs (each agent can listen for Syslog data sent to it over the network from Unix or Linux servers)
  • Single-line application text logs (where each log entry appends only one line of text; this allows many legacy applications to be monitored)
  • Timed events (jobs that MOM executes at predetermined times)
  • Missing events (events that don't occur within a specified interval, such as a backup job that doesn't start within an hour of its scheduled time)
  • Changes to the system services list (to see if a service has stopped running)
  • Any WMI data providers (eventually this will be the means by which all monitoring providers are accessed, but in this release it is mostly used to provide access to hardware management interfaces, allowing MOM to perform tasks such as checking to see that the internal temperature is within tolerance)
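Of the providers listed above, the "missing events" provider is the least intuitive, because it raises a condition when nothing happens. The logic can be sketched as below; the function name `is_event_missing` and its parameters are hypothetical and do not reflect how MOM itself is implemented.

```python
from datetime import datetime, timedelta
from typing import Iterable


def is_event_missing(
    scheduled: datetime,
    grace: timedelta,
    seen: Iterable[datetime],
    now: datetime,
) -> bool:
    """Return True if no expected event arrived within `grace` of `scheduled`."""
    deadline = scheduled + grace
    if now < deadline:
        return False  # the window is still open, so it is too early to say
    # The window has closed: the event is missing unless one was seen inside it.
    return not any(scheduled <= t <= deadline for t in seen)
```

For the backup example, `scheduled` would be the job's start time, `grace` one hour, and `seen` the timestamps of "backup started" events collected from the event log.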

If a developer has properly instrumented an application by linking its "vital sign" information with the Windows 2000/NT Application Event Log and Performance Counters, or by enabling it to write structured status information to a text-based log, then MOM will be able to monitor that application’s health.

MOM’s Knowledge

A MOM agent does nothing unless instructed to do so. MOM’s "knowledge" comes in two forms: rules and scripts to tell each agent what to do, and a knowledge database of articles that help operators know how to interpret and respond to alerts.

Rules. A rule is essentially a fast and efficient filter that looks at some designated data provider, compares it to some criteria, and takes some designated action if it meets or fails to meet the criteria. A typical rule might be "watch the System Event Log, and if Event ID 5719 occurs, forward it to a Consolidator."
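The provider, criteria, and action structure of a rule can be modeled roughly as follows. This is a sketch under stated assumptions: the `Event` and `Rule` classes are hypothetical, and the `forwarded` list merely stands in for sending an event to a Consolidator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    log: str          # which data provider produced the event
    event_id: int
    message: str = ""


@dataclass
class Rule:
    provider: str                       # data source to watch
    matches: Callable[[Event], bool]    # criteria to compare against
    action: Callable[[Event], None]     # what to do when the criteria are met

    def evaluate(self, event: Event) -> bool:
        """Apply the rule to one event; return True if the action fired."""
        if event.log == self.provider and self.matches(event):
            self.action(event)
            return True
        return False


forwarded: List[Event] = []  # stand-in for "forward it to a Consolidator"
rule_5719 = Rule("System", lambda e: e.event_id == 5719, forwarded.append)
```

Calling `rule_5719.evaluate(Event("System", 5719))` appends the event to `forwarded`; events from other logs or with other IDs are ignored, which is what keeps a rule cheap to run on every agent.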

Scripts. MOM agents are capable of triggering and running any script supported by Windows 2000 Active Scripting, including VBScript, JavaScript, Perl, and REXX (the latter two require the appropriate interpreter to be installed on the server being monitored). Although they are slower and require more system resources than rules, scripts can perform more complex actions and can track state information to measure behavior over time. For example, a script could sample processor utilization over time, and if the average over a 15-minute window exceeded 90%, it would raise an alert.
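The processor utilization example shows why scripts need state. A sketch of the windowed average follows, written in Python rather than a Windows scripting language purely for illustration; the class name `WindowedAverage` is hypothetical, and the decision to stay silent until a full window of history exists is an assumption, not documented MOM behavior.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class WindowedAverage:
    """Track samples over a sliding time window and test their average."""

    def __init__(self, window_seconds: float = 15 * 60, threshold: float = 90.0) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.samples: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self.first_seen: Optional[float] = None

    def add(self, timestamp: float, value: float) -> bool:
        """Add one sample; return True if the window average exceeds the threshold."""
        if self.first_seen is None:
            self.first_seen = timestamp
        self.samples.append((timestamp, value))
        # Discard samples that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.samples[0][0] < cutoff:
            self.samples.popleft()
        # Assumption: do not alert until a full window has been observed.
        if timestamp - self.first_seen < self.window_seconds:
            return False
        average = sum(v for _, v in self.samples) / len(self.samples)
        return average > self.threshold
```

With one sample per minute at 95% utilization, `add` first returns True at the 15-minute mark; a single low reading afterward can pull the average back under the threshold.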

Rules and scripts can apply not only to individual MOM agents but also to MOM Consolidators. Some rules look at whole groups of servers at once and thus need to apply a rule or script to a collection of events or data. For example, a rule that watches failed logon attempts must monitor events from all possible domain controllers.
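The failed-logon case illustrates why such a rule must run centrally. In the sketch below, the function `accounts_over_limit`, the `(domain_controller, username)` event shape, and the limit of five are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def accounts_over_limit(
    events: Iterable[Tuple[str, str]],
    limit: int = 5,
) -> List[str]:
    """Return usernames whose failed logons, summed over all domain controllers, reach `limit`.

    `events` is an iterable of (domain_controller, username) pairs.
    """
    totals = Counter(username for _controller, username in events)
    return sorted(user for user, count in totals.items() if count >= limit)
```

If an account fails three logons against one domain controller and three against another, neither controller's agent sees enough failures to act, but the combined total of six crosses the limit when the events are evaluated together.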

What makes MOM unique is that it comes, out of the box, with a base Management Pack of over 5,000 rules and scripts. MOM administrators can add new rules or scripts or modify existing ones.

Knowledge base. Alerts generated by the preconfigured rules and scripts are in many cases linked to a built-in knowledge base of relevant Microsoft technical support articles, giving operations center personnel both descriptive and prescriptive information on what to do next.

MOM's base Management Pack contains rules and scripts for Windows 2000 system services, and Microsoft will sell an add-on Application Pack that covers many of the .NET Servers and contains another 4,000 rules and scripts. The Application Pack is currently in beta and is scheduled for release this fall. NetIQ also offers a wide suite of Extended Management Packs (or XMPs) that cover both Microsoft and third-party applications, as well as facilities to integrate MOM with other management products. For more information, see "Third-Party Support for MOM" and the chart "MOM Management Packs and Integration Products".

Consoles and Notification

MOM comes with two console types that provide a user interface to the system: a Microsoft Management Console (MMC) snap-in and a Web console. Both link to MOM’s Data Access Server, which in turn is linked to the SQL Server database. There is no direct user interface to the MOM agents; administrators install and push rules out to agents by changing parameters stored in the MOM database.

MMC console. The MMC-based version of the MOM console is intended for the MOM administrator and server specialists. Although this console is fully capable of displaying system status and alerts, its primary function is configuration. Multiple instances of the console can be open at the same time.

Web console. The Web console is used by operators primarily to view the status of the managed environment. Operations personnel can examine the details and history of alerts and even acknowledge them and add comments, but they cannot change MOM’s rules, scripts, or configuration. MOM Web console users have full access to MOM’s knowledge base and can also append additional comments to the articles.

MOM also includes notification rules for sending e-mails or pages or for executing external programs, such as activating a voice notification, escalation, and response system.

Reports

MOM uses the run-time version of Access to query its SQL Server database, and it can generate about 150 canned reports viewable from both the MOM MMC and Web consoles. These reports are not customizable and MOM includes no facilities to construct new reports, but organizations can use a full version of Access or a third-party reporting product like Seagate’s Crystal Reports to produce custom reports. Crystal’s Web site (www.crystaldecisions.com/products/crystalreports/) has downloadable MOM report examples to help report developers get started.

Configuration Still Daunting

Microsoft’s strategy of focusing on building knowledge into MOM significantly lowers the bar for customers to get value from their system monitoring investment. However, configuration and customization are still not trivial and require personnel who understand the monitored programs deeply enough to fine-tune the rules and scripts.

The preconfigured rules and scripts make assumptions about how customers will be using their systems and applications, and a threshold that is too high in one customer’s environment may be too low in another’s. Even with MOM’s built-in "knowledge," organizations will undoubtedly have to tune and tweak the system on an ongoing basis.

Although MOM discovers servers and the applications installed on those servers, it does not discover the topology of the system and does not detect most inter-computer and inter-application dependencies. Thus, most of the canned rules and scripts may still not clearly identify root-cause events. Organizations that want to correlate even simple events from different applications or servers will frequently have to create custom scripts that perform the desired logic.

What’s Missing?

Because MOM is a new Microsoft product, it still lacks the third-party integration enjoyed by competing products such as BMC Patrol or HP OpenView. NetIQ is addressing these needs with its Extended Management Packs, but at press time there were no products to monitor some major ERP and CRM systems such as SAP R/3 or Siebel, nor were management agents available for any variety of Unix or Linux system.

The first release of the MOM Application Pack is still missing support for important .NET servers, such as BizTalk Server, Content Management Server, Mobile Information Server, and SharePoint Portal Server. Customers that want to monitor these applications will have to wait for an update to the Application Pack, find a third-party management pack, or build their own rules and scripts for them.

As part of Microsoft’s .NET initiative, the company is betting that organizations will begin exposing B2B or B2C functions as public XML/SOAP Web Services. However, MOM is currently not capable of monitoring the availability and performance of Web services.

On the other hand, Microsoft has indicated that it will be adding native SOAP interfaces to WMI as part of the .NET Management Services initiative. This would allow a monitored computer to make its data available as a Web service, which would then allow direct monitoring over the Internet. This method would be particularly useful for monitoring computers belonging to another organization.

SOAP support could also help Microsoft enable MOM to better monitor heterogeneous systems. Currently, this is difficult since MOM uses Microsoft proprietary technologies such as COM+ to install and communicate with its agents. If SOAP becomes MOM’s standard agent communication protocol and if Unix vendors adapt their WBEM services to support SOAP, heterogeneous system monitoring will become much simpler.

Now that the first release of MOM is shipping, Microsoft plans to focus its efforts on making MOM’s architecture more open with well-documented APIs. This will be especially important if Microsoft is to entice other firms to integrate related products, such as event correlation engines or trouble-ticketing systems. The latter category is especially important, because most medium and large organizations expect to turn most alerts into trouble tickets and then track those tickets until they are closed. Trouble-ticketing products should be able to get information on an alert from MOM, keep MOM updated with the status of the repair and who is performing it, and then close the alert when the repair is complete.

Packaging, Licensing, Pricing, System Requirements

Customers can obtain MOM through the normal channels and through Microsoft’s volume licensing programs. MOM is licensed on a per-processor basis for each machine running a MOM agent, including the MOM infrastructure servers. MOM’s estimated retail price is US$849 per processor, which includes the base Management Pack.

The estimated retail price of the Application Management Pack is US$949 per processor on all MOM-managed servers hosting applications supported by the pack.
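Because both prices are per processor, a rough budget is simple arithmetic. The sketch below uses only the two estimated retail prices quoted above; the function name `estimate_license_cost` and the example server mix are hypothetical, and the figure excludes SQL Server licenses and any volume discounts.

```python
from typing import Iterable, Tuple

MOM_PRICE_PER_CPU = 849        # US$, estimated retail, includes base Management Pack
APP_PACK_PRICE_PER_CPU = 949   # US$, estimated retail, Application Management Pack


def estimate_license_cost(servers: Iterable[Tuple[int, bool]]) -> int:
    """Sum MOM license costs for servers given as (processor_count, uses_application_pack)."""
    total = 0
    for processors, uses_application_pack in servers:
        total += processors * MOM_PRICE_PER_CPU
        if uses_application_pack:
            total += processors * APP_PACK_PRICE_PER_CPU
    return total


# Hypothetical site: ten dual-processor servers, four of which host
# applications covered by the Application Management Pack.
example_site = [(2, True)] * 4 + [(2, False)] * 6
example_total = estimate_license_cost(example_site)
```

For this example site, 20 processors at US$849 come to US$16,980 and 8 processors at US$949 add US$7,592, for an estimated total of US$24,572.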

MOM infrastructure servers must be running Windows 2000 Service Pack (SP) 2, and MOM also requires SQL Server 2000 Standard or Enterprise Edition. (MOM comes with the SQL Server–based Microsoft Data Engine, but a limit of 2GB of data makes it impractical for all but small organizations.) If SQL Server is licensed in per-seat mode rather than per-processor mode, SQL Client Access Licenses are needed for all MOM Data Access Servers and for each computer used to access the database from either the MMC or Web consoles. In most cases, especially in larger installations with console connections from many different computers, per-processor SQL Server licenses will be more economical.

MOM agents require any version of Windows 2000 or Windows NT 4.0 SP4 or later. Although MOM can monitor workstations, it is not intended to monitor thousands of desktops in addition to the servers (nor is it priced for that type of use). However, MOM could be used to monitor workstations such as remote unattended kiosks or batch processing machines.

Resources

For more information on MOM, see the Microsoft Operations Manager home page at www.microsoft.com/mom.

ISVs interested in developing Management Packs for MOM or in integrating other products with MOM can download the MOM SDK from the Microsoft Management Alliance Web site at www.microsoft.com/mom/mma.

For more information on WBEM and WMI, see "Web-Based Enterprise Management (WBEM) SDK 1.0 Ships" on page 15 of the Oct. 1998 Update. Also see the article "Windows Management Instrumentation: Background and Overview" at msdn.microsoft.com/library/default.asp?url=/library/en-us/wmeother/wmi_wp_4bla.asp.

For more information on the NetIQ Extended Management Packs, see www.microsoft.com/mom/partners/netiq.asp and www.netiq.com/products/xmp/catalog.asp.