Alerts and Alert Definitions

Topics marked with * relate to features available only in vFabric Hyperic.

This page is a high-level summary of alerting functionality in Hyperic HQ and VMware vFabric™ Hyperic®. See the last section, Advanced Alert Functionality in vFabric Hyperic, for a summary of vFabric Hyperic-only features.

Alerts

IT teams can use Hyperic's alerting system to automate and manage IT problem detection and response processes. Hyperic alerting features allow you to:

  • Fire and report an alert for a resource when a condition you specify occurs.

  • Notify designated personnel or stakeholders of alert events.

  • Execute resource control operations when an alert fires.

  • Track the resolution status of problems revealed by alerts.

  • Analyze alert and alert action history.

Alert Definition Process

To create an alert for a resource, you define an alert definition for it. An alert definition specifies the condition that should initiate alert firing. An alert condition relates to either a metric Hyperic collects or an event Hyperic tracks for the resource. A metric condition specifies a particular metric and the value or behavior that should initiate alert firing, for example "Availability < 100%". An event condition specifies an event (a log event, a configuration file change, or a control action) whose occurrence should initiate alert firing. An alert definition can also specify actions for Hyperic to perform when an alert is fired. You set up alert definitions from the Hyperic user interface, using dialogs and selector lists to specify the condition and actions. The "minimum" alert definition simply specifies the rules for firing; actions are optional. The alert definition process is described in Define an Alert for a Resource Type.
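As a rough illustration of the metric condition described above, the following Python sketch models an alert definition and checks it against a metric sample. The names `MetricCondition`, `AlertDefinition`, and `should_fire` are hypothetical and chosen for this example only; they are not part of Hyperic or HQApi, and the sketch says nothing about how Hyperic's alerting engine is actually implemented.

```python
from dataclasses import dataclass, field
import operator

# Comparison operators a metric condition may use (illustrative set only).
_OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq, "!=": operator.ne}


@dataclass
class MetricCondition:
    """Hypothetical model of a metric condition such as "Availability < 100%"."""
    metric: str       # name of the collected metric
    op: str           # one of the keys in _OPS
    threshold: float  # value the metric is compared against

    def is_met(self, value: float) -> bool:
        return _OPS[self.op](value, self.threshold)


@dataclass
class AlertDefinition:
    """Hypothetical alert definition: one condition plus optional actions."""
    name: str
    condition: MetricCondition
    actions: list = field(default_factory=list)  # empty list = "minimum" definition

    def should_fire(self, metrics: dict) -> bool:
        """Return True when the latest sample of the watched metric meets the condition."""
        value = metrics.get(self.condition.metric)
        return value is not None and self.condition.is_met(value)


definition = AlertDefinition(
    "Availability below 100%",
    MetricCondition("Availability", "<", 100.0),
)
print(definition.should_fire({"Availability": 100.0}))  # False
print(definition.should_fire({"Availability": 75.0}))   # True
```

The definition above carries no actions, which corresponds to the "minimum" case: it only states when to fire.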

Note: For information about using Hyperic's web services API to create alert definitions, see HQApi alertdefinition command.

Alerts in the Hyperic User Interface

Any fired alert shows up immediately in Hyperic pages that present alert status and history, including the Recent Alerts portlet in the dashboard and the Alerts tab for a resource. Additional alert views are described in Advanced Alert Functionality in vFabric Hyperic.

Fixing and Acknowledging Alerts

When an alert is fired, its status is "unfixed", and it is shown as such in Hyperic pages until its status is changed to "fixed". Hyperic provides several mechanisms for marking an alert fixed. You can explicitly mark an alert fixed from the Hyperic user interface. If multiple alerts have fired for the same alert definition, you can do a "fix all". Additional alert management capabilities are described in Advanced Alert Functionality in vFabric Hyperic.

An alert with an escalation also has an "acknowledgment" status, to indicate that responsible or concerned parties are aware of the problem. When an alert with an escalation is fired, it is "unacknowledged", and remains so until it is explicitly acknowledged from the Hyperic user interface.

Enabling and Disabling Alert Definitions

At any given point in time, an alert definition is either enabled or disabled. When an alert definition is enabled, Hyperic's alerting engine evaluates the alert condition and rules, and fires alerts accordingly. Alerts will not fire for a disabled alert definition. Hyperic provides several mechanisms for enabling and disabling alert definitions.

An alert definition can be disabled:

  • by a user explicitly disabling it from the Hyperic user interface

  • automatically, if it is configured to disable itself each time it fires, and re-enable itself when the fired alert is marked "Fixed".

  • as a result of an authorized user globally disabling all alert definitions from the HQ Server Settings page.

An alert definition can be disabled:

  • temporarily, as a step in an escalation

  • automatically upon firing, if it is configured to disable itself each time it fires, and re-enable itself when the fired alert is marked "Fixed".

  • as a result of an authorized user globally enabling all alert definitions from the HQ Server Settings page.

Introduction to Escalation Schemes

An escalation is a type of alert action; it is a predefined sequence of notification steps that starts automatically when an alert fires. An escalation can define numerous steps to perform over whatever duration you choose. When the alert is marked "fixed", Hyperic stops the escalation. You create an escalation in the Hyperic Administration tab. You assign an escalation to an alert definition using the Escalation tab on the Alert Definition page.

There are several benefits to using escalation:

  • Prevent redundant alerts — When an alert kicks off an escalation, Hyperic effectively disables the associated alert definition - preventing a sequence of additional alerts for the same problem. The alert definition remains inactive until the escalation ends. An escalation configured to repeat itself ensures that redundant alerts will be prevented even if the escalation ends before the triggering problem is resolved.

  • Automate issue management processes — An escalation automates the monitoring and management of problem resolution. Thoughtfully configured escalations call attention to "long-running" or broken response processes, and make it harder for issues to fall through the cracks.

  • Reduce the effort of managing alert response rules — Unlike other types of notifications that are defined within an alert definition (for example, the Notify Hyperic Users and Notify Other Recipients actions) an escalation is defined and updated separately. When policies, procedures, or staff assignments change, it is less effort to update one escalation than many alert definitions.

  • Add flexibility to automation — An escalation has an "acknowledgment" status that makes the automated response more flexible by taking into account whether or not someone is attending to the problem. You can specify steps to perform based on whether an alert is or is not acknowledged, or based on how long it has been unacknowledged.
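The interaction between escalation steps, acknowledgment, and the "fixed" status can be sketched as follows. This is a simplified model under stated assumptions, not Hyperic's implementation: `EscalationStep`, `due_steps`, the field names, and the recipient labels are all hypothetical, and real escalations are configured in the Hyperic user interface rather than in code.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    """Hypothetical escalation step: notify someone after a delay."""
    after_minutes: int                    # delay measured from when the alert fired
    notify: str                           # recipient label (illustrative)
    only_if_unacknowledged: bool = False  # skip this step once the alert is acknowledged


def due_steps(steps, elapsed_minutes, acknowledged, fixed):
    """Return the recipients of every step whose delay has elapsed."""
    if fixed:  # marking the alert fixed stops the escalation
        return []
    return [
        s.notify
        for s in steps
        if s.after_minutes <= elapsed_minutes
        and not (s.only_if_unacknowledged and acknowledged)
    ]


steps = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(15, "team lead", only_if_unacknowledged=True),
    EscalationStep(60, "operations manager"),
]

print(due_steps(steps, 20, acknowledged=False, fixed=False))
# ['on-call engineer', 'team lead']
print(due_steps(steps, 20, acknowledged=True, fixed=False))
# ['on-call engineer']
print(due_steps(steps, 90, acknowledged=False, fixed=True))
# []
```

Because the steps live in one place, a change in staff assignments means editing one list rather than every alert definition that uses it.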

Options for Controlling Alert and Notification Volume

The purpose of alerting is to speed the process of detecting and resolving problems. Rapid detection and response can be compromised when multiple alerts fire as a result of the same problem, or if responders are inundated by repetitive alert notifications. Excessive alerts and notifications are less likely when:

  • A given problem or root cause results in one, rather than many, alerts.

  • An alert status of "unfixed" indicates a problem that still exists and needs attention, rather than a transient issue that occurred and then went away.

  • A single problem doesn't result in a firestorm of redundant notifications.

Hyperic provides a range of options for reducing the volume of alerts, and for taking action when alert volume exceeds a manageable level. Prevention is the best strategy.

The best way to prevent redundant alerts is to assign a repeating escalation to every alert definition. An escalation is a series of notifications and a schedule for sending them. When the alert fires, Hyperic issues notifications according to the escalation schedule, and for the duration of the escalation, the alert will not fire again. Only after the escalation ends - because all steps are complete or the alert was marked fixed - can the alert definition fire again. You can set your escalations to repeat until the initiating alert is fixed to prevent redundant alerts for the same triggering condition.

An alternative approach for preventing redundant alerts is to configure each alert definition to disable itself upon firing. If you do, the alert will fire once, disable itself, and re-enable itself when the alert is fixed.
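The "fire once, disable, re-enable when fixed" behavior can be illustrated with a small state model. The class and method names below are hypothetical and exist only for this sketch; they do not correspond to any Hyperic API.

```python
class SelfDisablingDefinition:
    """Hypothetical model of an alert definition that disables itself on firing."""

    def __init__(self):
        self.enabled = True
        self.unfixed_alerts = 0

    def evaluate(self, condition_met: bool) -> bool:
        """Fire only if the definition is enabled and the condition is met."""
        if not (self.enabled and condition_met):
            return False
        self.enabled = False  # disable upon firing
        self.unfixed_alerts += 1
        return True

    def mark_fixed(self) -> None:
        """Marking the fired alert fixed re-enables the definition."""
        self.unfixed_alerts = 0
        self.enabled = True


definition = SelfDisablingDefinition()

# The condition stays true for three evaluations, but only the first one fires.
fired = [definition.evaluate(True) for _ in range(3)]
print(fired)  # [True, False, False]

definition.mark_fixed()
print(definition.evaluate(True))  # True
```

In this model a persistent problem produces one unfixed alert instead of one alert per evaluation, which is the effect the self-disabling option is meant to achieve.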

Responding to Alert and Notification Storms

If for some reason the volume of alerts or notifications gets out of control, you can use options on the HQ Server Settings page to immediately and globally:

  • Disable all alert definitions — No alerts will fire for any resources. Notifications defined in escalations in progress will be completed.

  • Disable all notifications — No alert notifications will be sent. Any escalations currently in progress stop - any remaining notification steps are not performed.

vFabric Hyperic offers additional features for managing alert and notification volume, as described in the following section.

Advanced Alert Functionality in vFabric Hyperic

vFabric Hyperic provides all the features described in the previous sections, plus these additional alert definition and management features:

  • Multi-condition resource alerts — In vFabric Hyperic you can define up to three conditions for a resource alert.

  • Additional alert actions — vFabric Hyperic provides additional alert actions, including

    • SNMP trap generation

    • Script action — you can configure a script that does custom alert processing or notification, for instance, to share alert information with another management system

    • Control action — operation on a resource, either the resource where the alert fired or a related resource

  • Recovery alerts — In vFabric Hyperic, you can create recovery alerts to streamline your process for responding to alerts. First you create an alert definition that is configured to fire once and then disable itself until fixed. Then you define a recovery alert that fires when the condition that fired the primary alert is no longer true. When the recovery alert fires, it sets the primary alert's status to "fixed" and re-enables the primary alert definition.

  • Resource type alerts — In vFabric Hyperic you can create an alert definition for a resource type that will be inherited by all resources of that type. Resource type alerts are useful if you want to assign exactly the same alert rules to every resource of the same type, and to be able to enable and disable the alert definition for all of them in one fell swoop.

Best Practice for Resource Type Alert Definitions

Tailoring an inherited alert definition at the resource level is not recommended. A resource type alert definition applies to all resources of that type. If you modify the inherited alert definition for an individual resource, a subsequent update to the resource type alert definition will override the changes made at the resource level.

  • Resource group alerts — In vFabric Hyperic you can create an alert definition for a compatible group - a group you have defined that contains selected resources, all of which have the same resource type. A resource group alert is useful when you are concerned about how many of a set of resources are having a particular problem - you want to know if 2 out of 10 platforms have high disk utilization, for instance. A resource group alert is evaluated differently than resource alerts or resource type alerts. A resource alert or resource type alert fires for a specific resource based on monitoring results for that resource only. A resource group alert fires when a metric condition is true for a specified number or percentage of the resources in the group.

  • Notification throttling — Notification throttling allows you to limit the number of notifications that can be issued over a specified interval; when notification volume reaches the limit, Hyperic stops sending individual notifications, and instead sends periodic rollup notifications, until the volume of alerts with notification actions goes down.

  • Advanced Views for Alert Monitoring and Analysis — In vFabric Hyperic, the Alert Center presents filterable views of alerts and alert definitions. The Operations Center presents filterable views of unfixed alerts.
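The difference in how a resource group alert is evaluated, as described in the list above, can be sketched in a few lines. The function `group_alert_fires`, its parameters, and the platform names are hypothetical; treating the threshold as "at least" the specified number or percentage is an assumption made for this example, not documented Hyperic behavior.

```python
def group_alert_fires(values, condition, min_count=None, min_percent=None):
    """Fire when the condition holds for enough resources in the group.

    values:      mapping of resource name -> latest metric value
    condition:   callable returning True when a value is in the problem state
    min_count:   threshold as a number of resources (assumed to mean "at least")
    min_percent: threshold as a percentage of resources (assumed "at least")
    """
    if not values:
        return False
    matching = sum(1 for v in values.values() if condition(v))
    if min_count is not None:
        return matching >= min_count
    if min_percent is not None:
        return 100.0 * matching / len(values) >= min_percent
    raise ValueError("specify min_count or min_percent")


# Disk utilization (%) for ten hypothetical platforms; two of them are high.
disk_used = {f"platform-{i}": 50.0 for i in range(10)}
disk_used["platform-3"] = 95.0
disk_used["platform-7"] = 97.0


def high(value):
    return value > 90.0


print(group_alert_fires(disk_used, high, min_count=2))     # True
print(group_alert_fires(disk_used, high, min_percent=30))  # False
```

A resource alert would look at one entry of `disk_used` in isolation; the group alert instead counts how many entries satisfy the condition before deciding whether to fire.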