Diagnosing System Problems

Locator does not start

Locator startup fails with an error like this:

ERROR: Operation "start-locator" failed because: Start of locator failed.
The end of "/gemfire/GemFire65/bin/start_locator.log"
contained this message: "[severe 2010/10/14 11:49:49.119 CEST <main>
tid=0x1] Could not start locator
com.gemstone.gemfire.GemFireConfigException: Unable to contact a Locator
service.  Operation either timed out or Locator does not exist.
Configured list of locators is "[192.168.2.1<v0>:41111]".
	at com.gemstone.org.jgroups.protocols.TCPGOSSIP.sendGetMembersRequest(TCPGOSSIP.java:189)
	at com.gemstone.org.jgroups.protocols.PingSender.run(PingSender.java:86)
	at java.lang.Thread.run(Thread.java:637) "..

This indicates a mismatch somewhere in the address and port pairs used for locator startup and configuration. The address you use for locator startup must match the address you list for the locator in the gemfire.properties locators specification. Every member of the locator’s distributed system, including the locator itself, must have the complete locators specification in its gemfire.properties file.

Response:
  • Check that your locators specification includes the address you are using to start your locator.
  • If you use a bind address, you must use numeric addresses for the locator specification. The bind address will not resolve to the machine’s default address.
  • If you are using a 64-bit Linux system, check whether your system is experiencing the leap second bug. See Java applications on 64-bit platforms hang or use 100% CPU for more information.
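
The first check in the list above can be scripted. The following is a minimal sketch, not part of GemFire: LocatorSpecCheck is a hypothetical helper, and it assumes the locators specification uses the comma-separated host[port] form. It compares strings only, so a host name and its numeric address count as different entries, which mirrors the bind address caveat above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a GemFire class): checks a locators specification. */
final class LocatorSpecCheck {

    /** Splits a specification such as "192.168.2.1[41111],192.168.2.2[41111]" into entries. */
    static List<String> entries(String locatorsSpec) {
        List<String> result = new ArrayList<>();
        for (String entry : locatorsSpec.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    /** Returns true if the specification lists exactly this address and port (assumed host[port] form). */
    static boolean lists(String locatorsSpec, String address, int port) {
        String wanted = address + "[" + port + "]";
        return entries(locatorsSpec).contains(wanted);
    }
}
```

For example, LocatorSpecCheck.lists(spec, "192.168.2.1", 41111) returns false when the specification names the locator by host name instead of by the numeric address used at startup.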

Application or cache server process does not start

Possible Cause 1: GemFire will not start because it has detected an invalid license property (either license-data-management or license-application-cache) in the gemfire.properties file.

Response:
  1. Check the GemFire output for a com.gemstone.gemfire.LicenseException message. The content of the message will indicate which license property contains an invalid license.
    For example, if you have specified an invalid serial number in license-data-management, the following message will appear:
    The specified serial number "#####-#####-#####-#####-#####" may be expired or invalid for Data Management Node license. Remove serial number from configuration in order to use the default evaluation license.
    In this case, you can either:
    • remove the serial number from gemfire.properties and restart the server. When the server restarts, it will use the default evaluation license; or
    • replace the serial number in gemfire.properties with a valid and non-expired serial number. Restart the server and check the logs to make sure the license is valid.
  2. If you have specified dynamic in one of the license properties, a message similar to the following may appear if GemFire cannot locate a valid dynamic license:
    Failed to dynamically acquire a Data Management Node license within the 10 second timeout. Consider increasing license-server-timeout or remove "dynamic" from configuration in order to use the default evaluation license.
    In this case, do the following:
    • Check to see if there is a serial number file in the serial number directory. If the file exists, verify that the serial number or serial numbers located in the file are valid. See Local VMware vFabric Directories for the appropriate directory on your operating system.
    • Make sure that the GemFire process is running on a vSphere virtual machine that is part of a vSphere installation that includes a vFabric License Server.
    • If you are using a vFabric License Server to manage dynamic licenses, verify that the vFabric License Server is up and running and reachable by the GemFire process.
    • If the vFabric License Server is functioning, try increasing the timeout value in the license-server-timeout property of gemfire.properties. Restart the GemFire process.
    • If all else fails, remove the keyword dynamic from the license property and restart the server. When the server restarts, it will use the default evaluation license.

Possible Cause 2: The process tries to start and then silently disappears. On Windows, this indicates a memory problem.

Response:
  • On a Windows host, decrease the maximum JVM heap size. This property is specified on the command line:
    cacheserver start -J-Xmx1024m
    For details, see JVM Memory Settings and System Performance.
  • If this doesn’t work, try rebooting.

Application or cache server does not join the distributed system

Response: Check these possible causes.
  • Network problem—the most common cause. First, try to ping the other hosts.
  • Firewall problems. If members of your distributed GemFire system are located outside the LAN, check whether the firewall is blocking communication. GemFire is a network-centric distributed system, so if you have a firewall running on your machine, it could cause connection problems. For example, your connections may fail if your firewall places restrictions on inbound or outbound permissions for Java-based sockets. You may need to modify your firewall configuration to permit traffic to Java applications running on your machine. The specific configuration depends on the firewall you are using.
  • Wrong multicast port (when using multicast for membership and discovery). Check the gemfire.properties file of this application or cache server to see that the mcast-port is configured correctly. If you are running multiple distributed systems at your site, each distributed system must use a unique multicast port.
  • Can’t connect to locator (when using TCP for discovery).
    • Check for an error message that includes this string:
      [severe 2005/10/24 11:21:02.908 PDT nameFromGemfireProperties DownHandler
      		(FD_SOCK) nid=0xf] GossipClient.getInfo(): exception connecting to host
      		localhost:30303: java.net.ConnectException: Connection refused
      This error means that the application or cache server is configured to connect to a non-existent locator.
    • Check that the locators attribute in this process’s gemfire.properties has the correct IP address for the locator.
    • Check that the locator process is running. If not, see instructions for related problem, Data distribution has stopped, though member processes are running.
    • Bind address set incorrectly on a multi-homed host. When you specify the bind address, use the IP address rather than the host name. Sometimes multiple network adapters are configured with the same hostname. See Using Bind Addresses.
  • Wrong version of GemFire. A version mismatch can cause the process to hang or crash. Check the software version with the gemfire version command.
  • Bad IP address in the system hosts file. Check that the addresses in your hosts file are valid. If this is the problem, the failing member’s log file may contain a message of this type:
    com.gemstone.gemfire.ForcedDisconnectException: Attempt to
    connect to distributed system timed out
    at com.gemstone.org.jgroups.protocols.pbcast.GMS.down(GMS.java:786)
    at . . .
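
To separate a refused locator connection from a wider network or firewall problem, it can help to attempt a plain TCP connection to the locator's address and port from the failing host. The following minimal sketch uses only the JDK; LocatorProbe and canConnect are illustrative names, not GemFire API. A successful connection shows only that something is listening at that address, not that it is a GemFire locator.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Illustrative probe (not a GemFire class): tests whether a TCP port accepts connections. */
final class LocatorProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, timeouts, and unresolvable host names.
            return false;
        }
    }
}
```

With the address from the error message above, the call would be LocatorProbe.canConnect("localhost", 30303, 2000). A false result from the failing host combined with a true result from the locator's own host points toward the network or a firewall rather than the locator process.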

Member process seems to hang

Response:
  • During initialization—For persistent regions, the member may be waiting for another member with more recent data to start and load from its disk stores. See Disk Storage. Wait for the initialization to finish or time out. The process could be busy—some caches have millions of entries, and they can take a long time to load. Look for this especially with cache servers, because their regions are typically replicas and therefore store all the entries in the region. Applications, on the other hand, typically store just a subset of the entries. For partitioned regions, if the initialization eventually times out and produces an exception, the system architect needs to repartition the data.
  • For a running process—Investigate whether another member is initializing. Under some optional distributed system configurations, a process can be required to wait for a response from other processes before it proceeds.

Member process does not read settings from the gemfire.properties file

Either the process can’t find the configuration file or, if it is an application, it may be doing programmatic configuration.

Response:
  • Check that the gemfire.properties file is in the right directory.
  • Make sure the process is not picking up settings from another gemfire.properties file earlier in the search path. GemFire looks for a gemfire.properties file in the current working directory, the home directory, and the CLASSPATH, in that order.
  • For an application, check the documentation to see whether it does programmatic configuration. If so, the properties that are set programmatically cannot be reset in a gemfire.properties file. See your application’s customer support group for configuration changes.
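
To see which gemfire.properties file a process would pick up first, the search order described above can be reproduced outside GemFire. This is a minimal sketch under that stated order (current working directory, home directory, then CLASSPATH); PropertiesFileSearch is a hypothetical helper, and it does not account for any other way an application might point GemFire at a properties file.

```java
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper (not a GemFire class): mimics the documented search order. */
final class PropertiesFileSearch {

    /** Returns the first directory entry that contains fileName, in list order. */
    static Optional<Path> firstMatch(List<Path> directories, String fileName) {
        for (Path dir : directories) {
            Path candidate = dir.resolve(fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    /** Current working directory first, then the user's home directory. */
    static List<Path> defaultDirectories() {
        return Arrays.asList(
                Paths.get(System.getProperty("user.dir")),
                Paths.get(System.getProperty("user.home")));
    }

    /** Describes where fileName would be found, falling back to the CLASSPATH last. */
    static String locate(String fileName) {
        Optional<Path> onDisk = firstMatch(defaultDirectories(), fileName);
        if (onDisk.isPresent()) {
            return onDisk.get().toAbsolutePath().toString();
        }
        URL onClasspath = ClassLoader.getSystemResource(fileName);
        return onClasspath != null ? onClasspath.toString() : "not found";
    }
}
```

Running PropertiesFileSearch.locate("gemfire.properties") from the process's working directory, as the same user, shows whether a stray file earlier in the search path is shadowing the one you intended.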

Cache creation fails - must match DOCTYPE root

System member startup fails with an error like one of these:

Exception in thread "main" com.gemstone.gemfire.cache.CacheXmlException: 
While reading Cache XML file:/C:/gemfire/client_cache.xml. 
Error while parsing XML, caused by org.xml.sax.SAXParseException: 
Document root element "client-cache", must match DOCTYPE root "cache".

Exception in thread "main" com.gemstone.gemfire.cache.CacheXmlException: 
While reading Cache XML file:/C:/gemfire/cache.xml. 
Error while parsing XML, caused by org.xml.sax.SAXParseException: 
Document root element "cache", must match DOCTYPE root "client-cache".

GemFire declarative cache creation uses one of two DOCTYPE/root element pairs: cache or client-cache. The name must be the same in both places.

Response:
  • Modify your cache.xml file so it has the proper DOCTYPE/root element matching.

For peers and servers:

<?xml version="1.0"?>
<!DOCTYPE cache PUBLIC 
   "-//GemStone Systems, Inc.//GemFire Declarative Caching 6.6//EN" 
   "http://www.gemstone.com/dtd/cache6_6.dtd">
<cache> 
   ...
</cache>

For clients:

<?xml version="1.0"?>
<!DOCTYPE client-cache PUBLIC 
   "-//GemStone Systems, Inc.//GemFire Declarative Caching 6.6//EN" 
   "http://www.gemstone.com/dtd/cache6_6.dtd">
<client-cache> 
   ...
</client-cache>
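
The mismatch can also be caught before startup. The sketch below is a rough pre-check, not GemFire's own validation: DoctypeRootCheck is a hypothetical helper that uses regular expressions instead of an XML parser, so it only compares the DOCTYPE name with the first element name and does not validate the file against the DTD.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-check (not a GemFire class): compares DOCTYPE name and root element. */
final class DoctypeRootCheck {

    private static final Pattern DOCTYPE = Pattern.compile("<!DOCTYPE\\s+([A-Za-z_][\\w.-]*)");
    // Matches the first start tag; "<?xml", "<!DOCTYPE" and "</..." do not match.
    private static final Pattern START_TAG = Pattern.compile("<([A-Za-z_][\\w.-]*)");

    /** Returns the name declared in the DOCTYPE, or null if there is none. */
    static String doctypeName(String xml) {
        Matcher m = DOCTYPE.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the name of the first element, ignoring comments, or null if there is none. */
    static String rootElementName(String xml) {
        String withoutComments = xml.replaceAll("(?s)<!--.*?-->", "");
        Matcher m = START_TAG.matcher(withoutComments);
        return m.find() ? m.group(1) : null;
    }

    /** Returns true only if both names are present and identical. */
    static boolean matches(String xml) {
        String doctype = doctypeName(xml);
        return doctype != null && doctype.equals(rootElementName(xml));
    }
}
```

A file that declares DOCTYPE cache but opens with a client-cache element makes matches return false, which corresponds to the first error message shown above.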

Cache isn’t configured properly

An empty cache can be a normal condition. Some applications start with an empty cache and populate it programmatically, but others are designed to bulk load data during initialization.

Response:

If your application should start with a full cache but it comes up empty, check these possible causes:
  • No regions—If the cache has no regions, the process isn’t reading the cache configuration file. Check that the name and location of the cache configuration file match those configured in the cache-xml-file attribute in gemfire.properties. If they match, the process may not be reading gemfire.properties. See Member process does not read settings from the gemfire.properties file.
  • Regions without data—If the cache starts with regions, but no data, this process may not have joined the correct distributed system. Check the log file for messages that indicate other members. If you don’t see any, the process may be running alone in its own distributed system. In a process that is clearly part of the correct distributed system, regions without data may indicate an implementation design error. Contact the application’s customer support group.

Unexpected results for keySetOnServer and containsKeyOnServer

Client calls to keySetOnServer and containsKeyOnServer can return incomplete or inconsistent results if your server regions are not configured as partitioned or replicated regions.

A non-partitioned, non-replicate server region may not hold all data for the distributed region, so these methods would operate on a partial view of the data set.

In addition, the client methods use the least loaded server for each method call, so two calls may go to different servers. If the servers do not have a consistent view in their local data sets, responses to client requests will vary.

The consistent view is only guaranteed by configuring the server regions with partitioned or replicate data-policy settings. Non-server members of the server system can use any allowable configuration as they are not available to take client requests.

The following server region configurations give inconsistent results. These configurations allow different data on different servers. There is no additional messaging on the servers, so no union of keys across servers or checking other servers for the key in question.
  • Normal
  • Mix (replicated, normal, empty) for a single distributed region. Results depend on which server receives the client request.

These configurations provide consistent results:
  • Partitioned server region
  • Replicated server region
  • Empty server region: keySetOnServer returns the empty set and containsKeyOnServer returns false

Response: Use a partitioned or replicate data-policy for your server regions. This is the only way to provide a consistent view to clients of your server data set. See Region Data Storage and Distribution Options.

Data operation returns PartitionOfflineException

In partitioned regions that are persisted to disk, if you have any members offline, the partitioned region will still be available but may have some buckets represented only in offline disk stores. In this case, methods that access the bucket entries return a PartitionOfflineException, similar to this:

com.gemstone.gemfire.cache.persistence.PartitionOfflineException: 
Region /__PR/_B__root_partitioned__region_7 has persistent data that is no 
longer online stored at these locations: 
[/10.80.10.64:/export/straw3/users/jpearson/bugfix_Apr10/testCL/hostB/backupDirectory 
created at timestamp 1270834766733 version 0]

Response: Bring the missing member online, if possible. This restores the buckets to memory and you can work with them again. If the missing member cannot be brought back online, or the disk stores for the member are corrupt, you may need to revoke the member, which will allow the system to create the buckets in new members and resume operations with the entries. See Handling Missing Disk Stores.

Entries are not being evicted or expired as expected

Check these possible causes.
  • Transactions—Entries that are old enough for eviction may remain in the cache if they are involved in a transaction. Further, transactions never time out, so if a transaction hangs, the entries involved in the transaction will remain stuck in the cache. If you have a process with a hung transaction, you may need to end the process to remove the transaction. In your application programming, do not leave transactions open ended. Program all transactions to end with a commit or a rollback. See Using Eviction and Expiration Operations.
  • Partitioned regions—For performance reasons, eviction and expiration behave differently in partitioned regions and can cause entries to be removed before you expect. See Eviction and Expiration.
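
The advice in the Transactions item, to end every transaction with a commit or a rollback, can be sketched as follows. TxManager is a stand-in interface defined only so that the example compiles on its own; in GemFire the corresponding begin, commit, and rollback calls belong to CacheTransactionManager. TxTemplate is a hypothetical wrapper, not an existing GemFire class.

```java
/** Stand-in for the transaction manager; defined here only for the example. */
interface TxManager {
    void begin();
    void commit();
    void rollback();
}

/** Hypothetical wrapper that never leaves a transaction open ended. */
final class TxTemplate {

    /** Runs work inside a transaction: rollback if the work fails, commit otherwise. */
    static void run(TxManager tx, Runnable work) {
        tx.begin();
        try {
            work.run();
        } catch (RuntimeException | Error e) {
            // The work failed, so release the entries held by the transaction.
            tx.rollback();
            throw e;
        }
        // Reached only when the work completed normally.
        tx.commit();
    }
}
```

Keeping the commit outside the try block is a deliberate choice in this sketch: a failure raised by the commit itself propagates to the caller without a second attempt to roll back.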

Can’t find the log file

Operating without a log file can be a normal condition, so the process does not log a warning.

Response:

OutOfMemoryError

An application gets an OutOfMemoryError if it needs more object memory than the process is able to give. The messages include java.lang.OutOfMemoryError.

Response:

The process may be hitting its virtual address space limits. The virtual address space has to be large enough to accommodate the heap, code, data, and dynamic link libraries (DLLs).
  • If your application is out of memory frequently, you may want to profile it to determine the cause.
  • If you suspect your heap size is set too low, you can increase the available heap memory by raising the maximum heap size, using -Xmx. For details, see JVM Memory Settings and System Performance.
  • You may need to lower the thread stack size. The default thread stack size is quite large: 512 KB on SPARC and 256 KB on Intel for 1.3 and 1.4 32-bit JVMs, 1 MB with the 64-bit SPARC 1.4 JVM, and 128 KB for 1.2 JVMs. If you have thousands of threads, you might be wasting a significant amount of stack space. If this is your problem, the error may be this:
    OutOfMemoryError: unable to create new native thread
    The minimum setting in 1.3 and 1.4 is 64 KB, and in 1.2 is 32 KB. You can change the stack size using the -Xss flag, like this: -Xss64k
  • You can also control memory use by setting entry limits for the regions.
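
To judge whether thread stacks are a plausible cause, a rough estimate of the address space they reserve is thread count times stack size. The sketch below is only that arithmetic; ThreadStackEstimate is a hypothetical helper, and the thread count of 5000 is an assumed figure for illustration. The result is an upper-bound estimate, since the operating system may not commit all reserved stack memory.

```java
/** Hypothetical helper: estimates address space reserved for thread stacks. */
final class ThreadStackEstimate {

    /** Returns threadCount * stackSizeKb, expressed in bytes. */
    static long reservedStackBytes(int threadCount, long stackSizeKb) {
        return threadCount * stackSizeKb * 1024L;
    }

    /** Converts a byte count to megabytes (1 MB = 1024 * 1024 bytes). */
    static double toMegabytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }
}
```

For 5000 threads, a 256 KB stack reserves about 1250 MB, while -Xss64k reduces that to about 312.5 MB.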

PartitionedRegionDistributionException

The com.gemstone.gemfire.cache.PartitionedRegionDistributionException appears when GemFire fails after many attempts to complete a distributed operation. This exception indicates that no data store member can be found to perform a destroy, invalidate, or get operation.

Response:
  • Check the network for traffic congestion or a broken connection to a member.
  • Look at the overall installation for problems, such as operations at the application level set to a higher priority than the GemFire processes.
  • If you keep seeing PartitionedRegionDistributionException, you should evaluate whether you need to start more members.

PartitionedRegionStorageException

The com.gemstone.gemfire.cache.PartitionedRegionStorageException appears when GemFire can’t create a new entry. This exception arises from a lack of storage space for put and create operations or for get operations with a loader. PartitionedRegionStorageException often indicates data loss or impending data loss.

The text string indicates the cause of the exception, as in these examples:

Unable to allocate sufficient stores for a bucket in the partitioned region....
Ran out of retries attempting to allocate a bucket in the partitioned region....

Response:
  • Check the network for traffic congestion or a broken connection to a member.
  • Look at the overall installation for problems, such as operations at the application level set to a higher priority than the GemFire processes.
  • If you keep seeing PartitionedRegionStorageException, you should evaluate whether you need to start more members.

Application crashes without producing an exception

If an application crashes without any exception, this may be caused by an object memory problem. The process is probably hitting its virtual address space limits. For details, see OutOfMemoryError.

Response: Control memory use by setting entry limits for the regions.

Timeout alert

If a distributed message does not get a response within a specified time, the sender raises an alert to signal that something might be wrong with the system member that hasn’t responded. The alert is logged in the sender’s log as a warning.

A timeout alert can be considered normal.

Response:
  • If you’re seeing a lot of timeouts and you haven’t seen them before, check whether your network is flooded.
  • If you see these alerts constantly during normal operation, consider raising the ack-wait-threshold above the default 15 seconds.

Member produces SocketTimeoutException

A client, server, gateway, or gateway hub produces a SocketTimeoutException when it stops waiting for a response from the other side of the connection and closes the socket. This exception typically happens on the handshake or when establishing a callback connection.

Response:

Increase the default socket timeout setting for the member. This timeout is set separately for the client Pool and for the Gateway and GatewayHub, either in the cache.xml file or through the API. For a client/server configuration, adjust the "read-timeout" value as described in Client Configuration Properties or use the com.gemstone.gemfire.cache.client.PoolFactory.setReadTimeout method. For the gateway, see Gateway Configuration Properties.

Member logs ForcedDisconnectException, Cache and DistributedSystem forcibly closed

The system membership coordinator forcibly closes a distributed system member’s Cache and DistributedSystem if the member becomes sick or too slow to respond to heartbeat requests. When this happens, listeners receive a RegionDestroyed notification with an opcode of FORCED_DISCONNECT. The GemFire log file for the member shows a ForcedDisconnectException with this message:

This member has been forced out of the distributed system because it did not respond 
within member-timeout milliseconds

Response:

To minimize the chances of this happening, you can increase the DistributedSystem property member-timeout. Take care, however, as this setting also controls the length of time required to notice a network failure. It should not be set too high.

Members cannot see each other

Suspect a network problem or a problem in the configuration of transport for membership and discovery.

Response:
  • Check your network monitoring tools to see whether the network is down or flooded.
  • If you are using multi-homed hosts, make sure a bind address is set and consistent for all system members. For details, see Using Bind Addresses.
  • If you are using TCP for discovery, check that all the applications and cache servers are using the same locator address.
  • If you are using multicast:
    • Check that all the applications and cache servers are using the same multicast IP address and port.
    • Confirm that the multicast IP address and port are a valid combination.
    • Confirm that multicast is enabled on the network. For details, see How Member Discovery Works.

Some new members are not seen by existing members

If your application creates a large number of short-lived members, the system may fail to recognize some new members as they appear. When a member departs the distributed system, GemFire ignores all messages from that member’s address for a period of time called the shun sunset. This keeps the system from trying to process a dead member’s spurious messages. If new members join and reuse departed members’ addresses before the shun sunset has passed, the system will not recognize them.

Response:

Set the shun sunset low enough to allow the system to recognize your new members. The default sunset is 90 seconds. You can change it using the system property JGroups.SHUN_SUNSET, which is specified in seconds.

Note that the available pool of “wildcard” ports on Windows is much smaller than on Linux or Solaris, so this problem is more likely to be seen on Windows.

One part of the distributed system cannot see another part

This situation can leave your caches in an inconsistent state. In networking circles, this kind of network outage is called the “split brain problem.”

Response:
  • Restart all the processes to ensure data consistency.
  • Going forward, set up network monitoring tools to detect these kinds of outages quickly.
  • Enable network partition detection.

Data distribution has stopped, though member processes are running

Suspect a problem with the network, the locator, or the multicast configuration, depending on the transport your distributed system is using.

Response:
  • Check the health of your system members. Search the logs for this string:
    Uncaught exception
    An uncaught exception means a severe error, often an OutOfMemoryError. See OutOfMemoryError.
  • Check your network monitoring tools to see whether the network is down or flooded.
  • If you are using multicast, check whether the existing configuration is no longer appropriate for the current network traffic.
  • If you are using locators for membership and discovery, check whether the locators have stopped. For a list of the locators in use, check the locators property in one of the application gemfire.properties files.
    • Restart the locator processes on the same hosts, if possible. The distributed system begins normal operation, and data distribution restarts automatically.
    • If a locator must be moved to another host or a different IP address, complete these steps:
      1. Shut down all the members of the distributed system in the usual order.
      2. Restart the locator process in its new location.
      3. Edit all the gemfire.properties files to change this locator’s IP address in the locators attribute.
      4. Restart the applications and cache servers in the usual order.
  • Create a watchdog daemon or service on each locator host to restart the locator process when it stops.
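
The log search in the first item above can be automated across member logs. The following minimal sketch uses only the JDK; LogScan is a hypothetical helper, and it assumes the log file is readable as UTF-8 text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a GemFire class): finds log lines containing a string. */
final class LogScan {

    /** Returns "lineNumber: text" for every line that contains needle (1-based numbering). */
    static List<String> matchingLines(List<String> lines, String needle) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(needle)) {
                hits.add((i + 1) + ": " + lines.get(i));
            }
        }
        return hits;
    }

    /** Reads the whole log file (assumed UTF-8) and scans it for needle. */
    static List<String> scanFile(Path logFile, String needle) {
        try {
            return matchingLines(Files.readAllLines(logFile), needle);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling LogScan.scanFile(path, "Uncaught exception") for each member's log gives the line numbers to inspect first. Because readAllLines loads the entire file into memory, very large logs may call for a streaming approach instead.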

Distributed-ack operations take a very long time to complete

This problem can occur in systems with a great number of distributed-no-ack operations. That is, the presence of many no-ack operations can cause ack operations to take a long time to complete.

Response:

For information on alleviating this problem, see Slow distributed-ack Messages.

Slow system performance

Slow system performance is sometimes caused by a buffer size that is too small for the objects being distributed.

Response:

If you are experiencing slow performance and are sending large objects (multiple megabytes), try increasing the socket buffer size settings in your system. For more information, see Socket Communication.

Can’t get Windows performance data

Attempting to run performance measurements for GemFire on Windows can produce this error message:

Can't get Windows performance data. RegQueryValueEx returned 5

This error can occur because incorrect information is returned when a Win32 application calls the ANSI version of RegQueryValueEx Win32 API with HKEY_PERFORMANCE_DATA. This error is described in Microsoft KB article ID 226371 at http://support.microsoft.com/kb/226371/en-us.

Response:

To successfully acquire Windows performance data, you need to verify that you have the proper registry key access permissions in the system registry. In particular, make sure that Perflib in the following registry path is readable (KEY_READ access) by the GemFire process:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Perflib

An example of reasonable security on the performance data would be to grant administrators KEY_ALL_ACCESS access and interactive users KEY_READ access. This particular configuration would prevent non-administrator remote users from querying performance data.

See http://support.microsoft.com/kb/310426 and http://support.microsoft.com/kb/146906 for instructions about how to ensure that GemFire processes have access to the registry keys associated with performance.

Java applications on 64-bit platforms hang or use 100% CPU

If your Java applications suddenly start to use 100% CPU, you may be experiencing the leap second bug. This bug is found in the Linux kernel and can severely affect Java programs. In particular, you may notice that method invocations using Thread.sleep(n), where n is a small number, actually sleep for a much longer period of time than the method specifies. To verify that you are experiencing this bug, check the host's dmesg output for the following message:

[10703552.860274] Clock: inserting leap second 23:59:60 UTC

To fix this problem, issue the following commands on your affected Linux machines:

prompt> /etc/init.d/ntp stop
prompt> date -s "$(date)"

See the following web site for more information:

http://blog.wpkg.org/2012/07/01/java-leap-second-bug-30-june-1-july-2012-fix/
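
The Thread.sleep symptom described above can be observed directly. The following is a minimal sketch using only the JDK; SleepCheck is a hypothetical helper, and what counts as "much longer" is left to the reader, since some overshoot is normal on any loaded system. The dmesg message remains the more reliable indicator.

```java
/** Hypothetical helper: measures how long a short Thread.sleep actually takes. */
final class SleepCheck {

    /** Sleeps for requestedMillis and returns the elapsed wall-clock time in milliseconds. */
    static long measureSleepMillis(long requestedMillis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(requestedMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report the time elapsed so far.
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

Repeating SleepCheck.measureSleepMillis(1) a few times on the affected host and on a healthy host gives a simple side-by-side comparison.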