Troubleshoot Agent and Server Problems

This page has tips for troubleshooting problems in a VMware vFabric™ Hyperic® deployment.

Looking for Clues

This section describes options for getting information that might help you diagnose problems in a Hyperic deployment.

HQ Health

The HQ Health page, available in the "Plugins" section of the Administration page, displays a variety of metrics and status about your Hyperic deployment, including server host statistics and Hyperic Server process information.

HQ Health provides views, queries, and diagnostic tools that give visibility into metric loads, caches, the Hyperic database, and agents across your deployment. For more information, see ui-HQHealth in vFabric Hyperic User Interface.

hqstats and agentstats

Mainly of use to Hyperic Support (or others with an internals-level knowledge of Hyperic), the files in the hq-server/logs/hqstat and AgentHome/logs/agentstats folders contain a variety of system and subsystem performance, resource usage, and other statistics for the Hyperic Server and Hyperic Agent, respectively.

The statistics files in the hqstat and agentstats directories are in .csv format and can be viewed in a spreadsheet program or other .csv viewer.
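
As a rough illustration of inspecting these files outside a spreadsheet, the following Python sketch loads CSV text and reports the largest value in one column. It assumes the file has a header row. The column name EXAMPLE_COUNTER and the sample values are hypothetical, because the actual column names depend on your Hyperic version.

```python
import csv
import io


def load_stats(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def column_max(rows, column):
    """Return the largest numeric value in a column, or None if it is empty."""
    values = [float(row[column]) for row in rows if row.get(column)]
    return max(values) if values else None


# Hypothetical sample data; real hqstat and agentstats files use their own columns.
SAMPLE = "timestamp,EXAMPLE_COUNTER\n1000,5\n2000,12\n3000,7\n"

rows = load_stats(SAMPLE)
print(column_max(rows, "EXAMPLE_COUNTER"))
```

To examine a real statistics file, read its contents with open(path, newline="").read() and pass the text to load_stats.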

Agent Metrics

Hyperic Agent metrics are helpful in diagnosing many problems that can occur. By default, these metrics are reported:

  • Availability

  • JVM Free Memory - Indicator

  • JVM Total Memory - Indicator

  • Number of Metrics Collected Per Minute - Indicator

  • Number of Metrics Sent to the Server Per Minute

  • Server Offset

  • Total Time Spent Fetching Metrics per Minute

In addition to the default metrics, you may find it useful to track other agent metrics, depending on your environment, such as:

  • Number of Metrics which Failed to be Collected 

  • Number of Metrics which Failed to be Collected per Minute

  • Maximum Time Spent Processing a Request

  • Number of Connection Failures

  • Total Time Spent Fetching Metrics per Minute

For more information about default and available agent metrics, see View Hyperic Agent Metrics.

Log Files

The following log files can be a useful source of information in the event that a problem occurs in a Hyperic deployment:

  • ServerHome/logs/wrapper.log

  • ServerHome/logs/bootstrap.log

  • ServerHome/logs/server.log

  • ServerHome/logs/hqdb.log (only available for deployments using the built-in PostgreSQL database)

  • AgentHome/logs/wrapper.log

  • AgentHome/logs/agent.log

  • AgentHome/logs/agent.log.startup

You can increase the level of logging an agent performs in its agent.properties file. Note that debug logging is very verbose and uses more system resources. Hyperic recommends configuring debug logging only when troubleshooting problems, and only at the subsystem level. For more information, see Configure Agent Logging.
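
When reviewing a large log file, it can help to tally entries by severity before reading individual messages. The Python sketch below assumes that each entry begins with a timestamp followed by the level, as in the sample agent message shown later on this page. Files with a different layout, such as wrapper.log, are not matched by this pattern.

```python
import re
from collections import Counter

# Assumed entry layout: "YYYY-MM-DD HH:MM:SS,mmm LEVEL rest of message".
ENTRY_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+)\s")


def count_levels(lines):
    """Count log entries per level, ignoring lines that do not match the layout."""
    counts = Counter()
    for line in lines:
        match = ENTRY_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_lines = [
    "2010-04-20 11:04:26,640 ERROR [Thread-1|Thread-1] [AutoinventoryCommandsServer] "
    "Unable to send autoinventory platform data to server",
    "    a continuation line, such as part of a stack trace",
]
print(count_levels(sample_lines))
```

To tally a real file, pass an open file object to count_levels, because iterating over a file yields its lines.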

Thread Dumps

This section has instructions for generating thread dumps for the Hyperic Server and Hyperic Agent.

Generate a Hyperic Server Thread Dump from User Interface

Follow these steps to output a server thread dump to your browser.

  1. Click the Administration tab.

  2. Click HQ Health in the Plugins section of the Administration page.

  3. Click Print in the HQ Process Information section on the HQ Health page.

Generate a Hyperic Server Thread Dump from Command Line

To generate a thread dump on Windows:

  • If Hyperic Server is running in a terminal window, press Ctrl+Break in that window.

  • If Hyperic Server is running as a service, use a tool such as StackTrace.

To generate a thread dump on Unix-like systems, run jstack (http://download.oracle.com/javase/1.5.0/docs/tooldocs/share/jstack.html) on the Hyperic Server process. For example, if the server's PID is 215:

jstack 215 > mydumpfile.txt

How to find the Hyperic Server PID

You can run jps in a shell to determine the Hyperic Server's process ID — look for the process named "Bootstrap". For example:

$ jps
187 WrapperStartStopApp
408 Jps
215 Bootstrap

On Unix-like systems, you can also run kill -3 on the Hyperic Server process. Note, however, that the thread dump is then written to wrapper.log, where it is difficult to parse.
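
The jps lookup and the jstack call described above can be combined in a script. The following Python sketch is one possible approach, not a Hyperic-provided tool: find_pid and write_thread_dump are hypothetical helper names, and the sketch assumes that the JDK's jps and jstack commands are on the PATH of the user running the server.

```python
import subprocess


def find_pid(jps_output, process_name="Bootstrap"):
    """Return the PID of the first jps entry whose name matches, or None."""
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] == process_name and parts[0].isdigit():
            return int(parts[0])
    return None


def write_thread_dump(pid, dump_path):
    """Run jstack against pid and save its output; requires a JDK on the PATH."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    with open(dump_path, "w") as dump_file:
        dump_file.write(result.stdout)


# The jps output shown above, used here as sample input.
SAMPLE_JPS = "187 WrapperStartStopApp\n408 Jps\n215 Bootstrap\n"
print(find_pid(SAMPLE_JPS))
```

On a live host, you could capture the output of jps with subprocess.run(["jps"], capture_output=True, text=True).stdout, pass it to find_pid, and hand the resulting PID to write_thread_dump.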

Generate Agent Thread Dump from Command Line

Run the agent launcher with the dump option. For more information, see Start, Stop, and Other Agent Operations.

Check Port Availability

The Hyperic Server must be able to establish a connection with the agent, and vice versa.

To verify that the server can access the agent's listen port, run the following from the server platform:

$ telnet AgentIP AgentPort

For example:

$ telnet 192.168.1.114 2144

For a successful connection, the results are similar to:

Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET

Connection closed by foreign host.

To verify that the agent can access the server's listen port, run the following from the agent platform:

$ telnet ServerIP ServerPort

For example:

$ telnet 192.168.1.114 7080
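
On hosts where telnet is not installed, a comparable check can be scripted. The Python sketch below only tests whether a TCP connection can be opened; it does not speak the agent or server protocol. The function name can_connect and the five-second default timeout are assumptions made for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Example calls, using the addresses from the telnet examples above:
#   can_connect("192.168.1.114", 2144)   # agent listen port, run on the server host
#   can_connect("192.168.1.114", 7080)   # server listen port, run on the agent host
```

A result of False does not distinguish between a blocked port, a stopped process, and a wrong address, so check each of those in turn.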

Hyperic Server Problems

This section describes problems that could prevent the Hyperic Server from starting up.

Large Event Table

A message like the following in server.log indicates that the EAM_EVENT_LOG is so large that trigger initialization is taking more than 15 minutes.

License Issues

In vFabric Hyperic, the number of platforms you can manage is limited by your license. For more information, see Install or Configure vFabric Hyperic License.

Backlogged Hyperic Server

When the Hyperic Server starts after a period of downtime, it can be inundated with metric reports from agents that continued to run while the server was down. When the server is processing a large metric backlog from many agents, the maximum size of the agent-server connection pool can become a bottleneck and affect server performance.

You can check the "Current Thread Busy" metric for the Hyperic Server's internal Tomcat server to determine whether connection pool size is an issue.

To enable a restarted server to catch up with a metric backlog, you can increase the maximum size of the agent-server connection pool. Typically, this is the only situation in which increasing the size of the connection pool is indicated. The maximum number of agent-server connections is configured with the org.hyperic.lather.maxConns property in web.xml. Note, however, that the number of connections is effectively limited by the maximum number of server execution threads, which is configured using the tomcat.maxthreads property in server.conf. So, when you increase the value of org.hyperic.lather.maxConns, it may be necessary to increase the value of tomcat.maxthreads as well.

To enable a backlogged Hyperic Server to catch up, enable 5% to 10% more connections than there are agents reporting to the server. For example, if you have 1000 agents, enable 1050 to 1100 agent-server connections.
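
The sizing guideline above is simple arithmetic. The following Python sketch computes the range for a given number of agents. The function name recommended_max_conns is hypothetical, and rounding up to the next whole connection is an assumption, since the guideline does not say how to round.

```python
def recommended_max_conns(agent_count):
    """Return (low, high): agent_count plus 5% and plus 10%, rounded up."""
    if agent_count < 0:
        raise ValueError("agent_count must not be negative")
    # Integer arithmetic avoids floating-point rounding surprises.
    low = (agent_count * 105 + 99) // 100   # ceil(agent_count * 1.05)
    high = (agent_count * 110 + 99) // 100  # ceil(agent_count * 1.10)
    return low, high


print(recommended_max_conns(1000))  # prints (1050, 1100)
```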

To change the size of the agent-server connection pool:

  1. As necessary, update the maximum number of agent-server connections (org.hyperic.lather.maxConns) in the <Server installation directory>/hq-engine/hq-server/webapps/ROOT/WEB-INF/web.xml file, in the stanza shown below. Ensure that the value of maxConns is 5% to 10% greater than the number of agents that report to the server.

    <init-param>
    <param-name>org.hyperic.lather.maxConns</param-name>
    <param-value>3000</param-value>
    </init-param>

  2. As necessary, configure the maximum number of Tomcat threads (tomcat.maxthreads) in server.conf. Ensure that the value of maxThreads is greater than the value of maxConns.

  3. Restart the Hyperic Server to enable the changes to take effect.
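
Before restarting, you may want to confirm that the two settings agree. The Python sketch below reads org.hyperic.lather.maxConns from web.xml text and tomcat.maxthreads from server.conf text, then checks that the thread limit is the larger value. It assumes that server.conf uses key=value lines. The helper names and the sample value 3500 are hypothetical.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip an XML namespace prefix such as '{uri}init-param'."""
    return tag.rsplit("}", 1)[-1]


def read_init_param(xml_text, name):
    """Return the param-value of the init-param called name, or None."""
    for elem in ET.fromstring(xml_text).iter():
        if _local(elem.tag) != "init-param":
            continue
        fields = {_local(child.tag): (child.text or "").strip() for child in elem}
        if fields.get("param-name") == name:
            return fields.get("param-value")
    return None


def read_property(conf_text, key):
    """Return the value of key from key=value text, or None."""
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        found_key, value = line.split("=", 1)
        if found_key.strip() == key:
            return value.strip()
    return None


def threads_cover_conns(web_xml_text, server_conf_text):
    """True if tomcat.maxthreads is greater than org.hyperic.lather.maxConns."""
    max_conns = int(read_init_param(web_xml_text, "org.hyperic.lather.maxConns"))
    max_threads = int(read_property(server_conf_text, "tomcat.maxthreads"))
    return max_threads > max_conns


WEB_XML = """<servlet>
  <init-param>
    <param-name>org.hyperic.lather.maxConns</param-name>
    <param-value>3000</param-value>
  </init-param>
</servlet>"""
SERVER_CONF = "tomcat.maxthreads=3500\n"  # hypothetical value for illustration
print(threads_cover_conns(WEB_XML, SERVER_CONF))
```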

Agent Startup or Connection Problems

This section describes problems that could prevent the Hyperic Agent from starting up or connecting to the Hyperic Server.

Agent Failed to Connect to Server at First Startup

Every time an agent starts up it attempts to contact the Hyperic Server. If, the first time you start up an agent, it cannot connect to the server, the agent will continue to have problems connecting to the server, even after the server is reachable.

The first time a Hyperic Agent successfully connects with the Hyperic Server, the agent saves the server connection settings in its /data directory. If the server is not available (because the wrong address/port was configured for the agent, or because the server hasn't been started or is still in the process of starting up) the agent will fail to connect, and hence fail to persist the server connection data. Upon agent restart, the agent will not be able to find the connection data it requires, and fail to connect to the server. In this case, server.log will contain a message similar to:

2010-04-20 11:04:26,640 ERROR [Thread-1|Thread-1] [AutoinventoryCommandsServer] Unable to send autoinventory platform data to server, sleeping for 33 secs before retrying. Error: Unable to communicate with server -- provider not yet setup

To solve this problem:

  1. Delete the agent's /data directory.

    • This forces the agent to obtain new agent-server communication properties.

  2. Verify the address and listen port for the Hyperic Server.

  3. Verify the Hyperic Server is up.

  4. Start the agent, supplying the correct server connection properties, either in agent.properties or interactively. See the instructions referenced in step 2 above.

Server Does Not Have Agent Token

Starting in 4.6.5, if all platforms managed by an agent are removed from Hyperic inventory, the Hyperic Server also removes the saved authentication token for that agent from the Hyperic database. So after you delete a platform that is managed by an agent that does not manage any other platforms (as in the case of an agent that manages only the platform it runs on), the Hyperic Server will not accept connections from that agent.

If you want the agent to rediscover the platform, you must repeat the initial agent setup process. To force the agent setup dialog, do one of the following:

  • While the agent is running, run the agent launcher with the setup option, or

  • Delete the agent's /data directory and restart the agent.

Agent Start Script Timeout

By default, the agent start script times out after five minutes if the startup sequence is not successful. Check the agent.log.startup file for messages.

If desired, you can configure a longer timeout period, to give the agent more time to connect to the server, by adding the agent.startupTimeOut property, defined below, to the agent.properties file.

agent.startupTimeOut

Description

The number of seconds that the agent startup script will wait before determining that the agent did not start up successfully. If the agent is not determined to be listening for requests within this period of time, an error is logged, and the startup script times out.

Default

As installed, agent.properties does not contain a line that defines the value of this property. The default behavior of the agent is to time out after 300 seconds.

After editing the agent.properties file, save your changes and restart the agent.

Java Service Wrapper Timeout

Under high load, the agent may become unresponsive. If this occurs and there are no coincident errors or warnings in the agent log that indicate another explanation, it may be that the agent JVM was starved for memory, and unresponsive to a ping from the Java Service Wrapper (JSW).

In that case the wrapper.log file will contain an entry like this:

ERROR | wrapper | 2009/01/15 02:15:18 | JVM appears hung: Timed out waiting for signal from JVM.

To resolve the problem, you can configure the JSW to give the agent more time to respond to startup and ping requests.

Increase the JSW's ping timeout from 30 seconds to 300 by adding this property to AgentHome/bundles/agent-4.x.xxxx/conf/wrapper.conf:

wrapper.ping.timeout=300

This causes the JSW to wait longer for a ping response before determining that the JVM is hung.

Increase the agent's startup timeout from 30 seconds to 300 by adding this property to the same wrapper.conf file:

wrapper.startup.timeout=300

This gives the agent more time to start up before the wrapper gives up on it and kills the process.
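
To see how often the wrapper has reported a hung JVM, you can search wrapper.log for the entry shown earlier in this section. The Python sketch below assumes the pipe-separated layout of that sample entry (level, source, timestamp, message). The function name find_hung_jvm_entries is hypothetical.

```python
def find_hung_jvm_entries(lines):
    """Return (timestamp, message) pairs for 'JVM appears hung' entries."""
    hits = []
    for line in lines:
        # Assumed layout: LEVEL | source | timestamp | message
        parts = [part.strip() for part in line.split("|", 3)]
        if len(parts) == 4 and parts[3].startswith("JVM appears hung"):
            hits.append((parts[2], parts[3]))
    return hits


sample = [
    "ERROR | wrapper | 2009/01/15 02:15:18 | JVM appears hung: "
    "Timed out waiting for signal from JVM.",
]
print(find_hung_jvm_entries(sample))
```

Frequent hits clustered around periods of high load are consistent with the memory starvation scenario described above, although other causes remain possible.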

Problem Running Start Script setup Option

If agent.scu is missing...

If a Hyperic Agent's AgentHome/conf/agent.scu file is missing, subsequent attempts to run the agent start script (hq-agent.sh or hq-agent.bat) with the setup option will fail. To resolve this problem, you must either:

  • Reinstall the agent, or

  • Perform these steps:

    1. Stop the agent.

    2. Delete its /data directory.

    3. Set agent.setup.camPword in AgentHome/conf/agent.properties to a plain text value.

    4. Start the agent.

Invalid or Unknown Availability

This section describes reasons that Hyperic might incorrectly show an agent (or agent-managed resource) as unavailable, show its availability status as "Unknown" (availability icon is grey), or show "flapping" availability values.

Out-of-Sync Agent and Server Clocks

If Hyperic erroneously indicates that resources are unavailable, it may be because the system clocks on the agent and server hosts are out-of-sync. By default, Hyperic monitors the offset between the server and an agent — see the "Server Offset" metric on the Monitor tab. An offset of less than one minute is unlikely to pose problems; with a larger offset, problems may occur.
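
The one-minute guideline can be expressed as a simple check. The Python sketch below compares two clock readings taken at the same moment on the agent and server hosts. The function names are hypothetical, and the readings are assumed to be seconds since the epoch; the units that Hyperic uses for the "Server Offset" metric are not assumed here.

```python
ONE_MINUTE_SECONDS = 60


def offset_seconds(agent_epoch_seconds, server_epoch_seconds):
    """Absolute difference between the two clock readings, in seconds."""
    return abs(agent_epoch_seconds - server_epoch_seconds)


def offset_may_cause_problems(agent_epoch_seconds, server_epoch_seconds):
    """True when the clocks differ by one minute or more."""
    return offset_seconds(agent_epoch_seconds, server_epoch_seconds) >= ONE_MINUTE_SECONDS


# Hypothetical readings: the agent clock runs 90 seconds ahead of the server.
print(offset_may_cause_problems(1_000_000_090, 1_000_000_000))
```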

To solve an offset problem, install NTP and synchronize system clocks on the agent and server hosts.

To prevent the agent and server clocks from becoming significantly out-of-sync, you can run NTP on each system, use Hyperic to monitor the offset on each system, and set alerts based on the offset metric. To do so, configure a platform service of type "NTP" to monitor each NTP service, and set alerts to fire when an offset from the time authority grows unacceptably high. For more information, see the "Network Services" section in vFabric Hyperic Resource Configuration and Metrics.

Overloaded Agent

If an agent's queue of metrics grows to a certain level over a period of time, the following warning message is written to the agent.log file:

The Agent is having a hard time keeping up with the frequency of metrics taken. Consider increasing your collection interval.

To investigate, you can configure the agent to report the "Total Time Spent Fetching Metrics per Minute" metric. If the agent spends more than half its time fetching metrics, it is overloaded.
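
The rule of thumb above amounts to comparing the fetch time against half a minute. The Python sketch below assumes that the metric value is reported in milliseconds per minute, which is an assumption you should verify against the units shown in the Hyperic user interface. The function names are hypothetical.

```python
MS_PER_MINUTE = 60_000


def fetch_time_fraction(fetch_ms_per_minute):
    """Fraction of each minute the agent spends fetching metrics."""
    return fetch_ms_per_minute / MS_PER_MINUTE


def is_overloaded(fetch_ms_per_minute):
    """True when more than half of each minute goes to fetching metrics."""
    return fetch_time_fraction(fetch_ms_per_minute) > 0.5


print(is_overloaded(42_000))  # 42 s of every 60 s spent fetching
```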

You can alleviate the problem as follows:

  • Increase metric collection intervals — For most metrics, the default is every 5 or 10 minutes. You can change the collection interval for all metrics collected for a resource type on the Administration > Monitoring Defaults page for the resource type.

Want to Change a Whole Bunch of Metric Collection Intervals?

If you want to change metric templates in bulk, or without using the Hyperic user interface, you can change metric collection settings, including collection intervals, for a resource type with the HQApi metricTemplate command. You can use the metricTemplate sync option from the command line or in a script, as desired. For more information, see the HQApi metricTemplate command section in vFabric Hyperic Web Services API.

  • Redistribute load: If an agent that logs metric volume warnings is monitoring a large number of remote services over the network (for example, HTTP, FTP, SNMP, or another service type whose protocol the agent supports), you can spread the load by configuring a different agent to monitor some of the network services. You can compare the agent loads on the Agents tab of HQ Health.

Slow User Interface

If Hyperic's web user interface is slow, the cause may be an overloaded backend: the Hyperic database or the Hyperic Server itself. See Overloaded Backend.

Warning Messages in the Agent Log

This section has information about the significance of selected warning messages that might be written to the agent.log file.

Connection Timeout Messages

Lather, the connection protocol for agent-to-server communication, is configured such that agent connections time out after five minutes, at which point a timeout message is written to agent.log. You can increase the timeout period from 300000 to 900000 milliseconds (from five to fifteen minutes) in this file:

hq-engine/server/default/deploy/lather-jboss.sar/jboss-lather.war/WEB-INF/web.xml

in this stanza:

<init-param>
<param-name>org.hyperic.lather.execTimeout</param-name>
<param-value>900000</param-value>
</init-param>