Clustering Hyperic Servers for Failover

Available only in vFabric Hyperic

Overview

To avoid interruption of Hyperic Server operation in the case of failure, you can configure a cluster of Hyperic Servers. The failover configuration uses:

  • EHCache's distributed caching for replicating changes throughout the cluster.

  • The nodeStatus.hqu plugin for monitoring the availability of nodes.

  • A hardware load balancer for managing failover when a node becomes unavailable. The load balancer checks the status of each node every 10 seconds by issuing an HTTP request to the node's nodeStatus.hqu plugin. The check returns a response of master=true for the primary node, and master=false for the other nodes in the cluster.

A Hyperic Server cluster contains multiple nodes; two are generally sufficient. One Hyperic Server, automatically selected by Hyperic, serves as the primary node. The other node or nodes serve as hot backups---they do not share the workload with the primary node.

A failover configuration is transparent to users and Hyperic administrators; it is not apparent that the active Hyperic server instance is clustered, or which node is currently active.

Requirements for a Failover Deployment

  • A hardware-based load balancer.

  • Only one Hyperic Server in a Hyperic Server cluster should receive agent communications at a time. The load balancer should not direct agent connections to a Hyperic Server instance that serves as a secondary node.

  • Database Considerations---All nodes in the Hyperic cluster must share the same database. You cannot use Hyperic's internal PostgreSQL database in a failover configuration. You must use an external database; MySQL, Oracle, and PostgreSQL are supported.

Configuring a Server Cluster

These instructions assume that you do not already have a Hyperic Server installation.

Step 1 - Install the First Hyperic Server Instance

Run the full installer in the appropriate mode for the type of database server you will use (-mysql, -postgresql, or -oracle).  You must choose one of these options, because clustering requires the use of an external Hyperic database. The installer will create the Hyperic database schema.
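
For example, to install the first server instance against an external PostgreSQL database, you might run something like the following from the expanded installer directory. This is a sketch only; the setup.sh script name and its location are assumptions about your installer package.

./setup.sh -postgresql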

Step 2 - Install Additional Hyperic Server Nodes

For each additional node:

  • Run the full installer in the appropriate mode for the type of database created during installation of the first server instance (-mysql, -postgresql, or -oracle). 

  • When the installer prompts for the location of the Hyperic database, specify the location of the database created for the first server instance.

  • When the installer asks if you want to upgrade, overwrite, or exit the process, select the choice for "upgrade".

Step 3 - Configure Cluster Name and Communications Properties

Configure the cluster-related properties on each of the Hyperic Servers in the cluster, in the "Cluster Settings" section of its conf/hq-server.conf file.

Default hq-server.conf File

# Cluster Settings
################################################################################
#
# Property: ha.partition
#  
# This property defines the name of the HQ cluster. Each HQ server with the
# same ha.partition name will join the same cluster. This property is required
# for proper cluster initialization.
#
#ha.partition=

#
# Property: ha.node.address
#
# This property defines the IP address or hostname to bind the multicast listener
# to. This property is required for proper cluster initialization.
#
#ha.node.address=

#
# Property: ha.node.mcast_addr
#
# This property defines the multicast address to use. This property is not required
# and defaults to 238.1.2.3.
#
#ha.node.mcast_addr=238.1.2.3

#
# Property ha.node.mcast_port
#
# This property defines the multicast port to use. This property is not required
# and defaults to 45566.
#
#ha.node.mcast_port=45566

#
# Property ha.node.cacheListener.port
#
# This property defines the multicast port that is used to discover cache peers. This
# property is not required and defaults to 45567
#ha.node.cacheListener.port=45567

#
# Property ha.node.cacheProvider.port
#
# This property defines the multicast port that is used to synchronize caches throughout
# the HQ cluster. This property is not required and defaults to 45568.
#ha.node.cacheProvider.port=45568

Required Cluster Properties

For each Hyperic Server in the cluster you must specify:

ha.partition

Name of the cluster---this value must be identical for each node in the cluster.

ha.node.address

Multicast listen address---the IP address or hostname on which the node listens for multicast traffic; this value must be unique to each node in the cluster.

Note: If you are upgrading from a pre-v3.0 failover configuration, each server's .conf file will contain obsolete cluster properties, including the server.cluster.mode and server.ha.bind_addr properties. Delete these properties and replace them with the current failover properties described in this section.

Optional Cluster Properties

If desired, you can control these communication behaviors for the nodes in the cluster:

ha.node.mcast_addr and ha.node.mcast_port

Address and port for sending multicast messages to other nodes. Note: ha.node.mcast_addr must be the same on each node.

ha.node.cacheListener.port and ha.node.cacheProvider.port

Ports used for discovering and synchronizing with cache peers.
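
For example, in a two-node cluster whose nodes have the IP addresses 10.0.0.1 and 10.0.0.2, the Cluster Settings on each node might look like the following sketch. The cluster name hq-cluster and the addresses are placeholders; the multicast defaults are left unchanged.

# Node 1 (10.0.0.1)
ha.partition=hq-cluster
ha.node.address=10.0.0.1

# Node 2 (10.0.0.2)
ha.partition=hq-cluster
ha.node.address=10.0.0.2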

Step 4 - Configure the Load Balancer

Configure the load balancer according to the vendor's instructions. Procedures vary, but at a minimum you will identify the Hyperic Server nodes in the cluster and configure the failover behavior.

  1. Identify the Hyperic Server nodes in the cluster.

  2. Configure the load balancer to check the nodeStatus.hqu URL every 10 seconds. For example, in a 2-node cluster, if the IP addresses of the nodes are 10.0.0.1 and 10.0.0.2, configure the load balancer to check these URLs every 10 seconds (see the example check after this list):

    • http://hqadmin:hqadmin@10.0.0.1:7080/hqu/health/status/nodeStatus.hqu
    • http://hqadmin:hqadmin@10.0.0.2:7080/hqu/health/status/nodeStatus.hqu
  3. Configure the load balancer to direct all traffic to the node whose status is master=true.
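
You can verify what the load balancer will see by requesting the status URL yourself. The sketch below assumes the default hqadmin/hqadmin credentials shown in the URLs above and that 10.0.0.1 is currently the primary node; the exact response body may vary by Hyperic version.

curl http://hqadmin:hqadmin@10.0.0.1:7080/hqu/health/status/nodeStatus.hqu
master=true

curl http://hqadmin:hqadmin@10.0.0.2:7080/hqu/health/status/nodeStatus.hqu
master=false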

Step 5 - Configure Agents to Communicate with Hyperic Server Cluster

The Hyperic Agents in your environment communicate with the Hyperic Server cluster through the load balancer. When you start a newly installed agent, either supply the load balancer's listen address and port interactively, or specify the connection information in agent.properties.
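
For example, to point an agent at the load balancer rather than at an individual node, you might set properties like the following in agent.properties before the agent's first start. This is a sketch only: the agent.setup.* property names are those used by recent Hyperic Agent versions, and the load balancer address 10.0.0.100 and the ports are placeholders.

agent.setup.camIP=10.0.0.100
agent.setup.camPort=7080
agent.setup.camSSLPort=7443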

For existing agents, you can run hq-agent.sh setup to force the setup dialog.
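
For example, from the agent installation's bin directory (the path is an assumption about your agent layout):

./hq-agent.sh setup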

Step 6 - Start the Nodes

Start the Hyperic Servers.
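
For example, on each node, from the server installation's bin directory (the hq-server.sh script name and path are assumptions about your server layout):

./hq-server.sh start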

Troubleshooting a Failover Configuration

This section describes the most common sources of problems in a failover configuration.

  • Multicast blocking — Cluster detection and cache peer detection rely on multicast. Make sure your router isn't blocking multicast packets; otherwise the Hyperic cluster will fail to initialize properly. It's also common for virtualization technologies like VMware and Xen to leave multicast disabled by default. A quick way to check multicast group membership is shown after this list.

  • Don't register agents using the loopback address — If you install a Hyperic Agent on the same machine as a Hyperic Server node, when you specify the IP address the server should use to contact the agent, don't specify the loopback address (127.0.0.1).

  • Alerts that were firing or in escalation were "lost" — A failover to another cluster node occurred while the alerts were firing or being escalated; in that case, the alert state can be lost.
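
To check whether a node has actually joined the cluster's multicast group, you can list the multicast group memberships on that node. This is a sketch that assumes a Linux node and the default multicast address 238.1.2.3 (or whatever you set for ha.node.mcast_addr):

ip maddr show
(or: netstat -g)

Look for the configured multicast address on the interface that corresponds to ha.node.address. If it is missing, the node never joined the group, which usually points to a router or virtualization layer that is filtering multicast.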