Troubleshooting Common Problems

Solve common problems that may be encountered when using VMware vFabric SQLFire.


Licensing Problems

If you install an invalid serial number or if SQLFire cannot obtain a dynamic license from the vFabric License Server, SQLFire fails to start and throws an exception. In this case, depending on your licensing configuration, you may need to perform one of the following fixes:
  • Replace the invalid serial number or serial numbers specified either in sqlfire.properties or in serial number files with valid (non-expired) serial numbers, and restart the server. (A sample sqlfire.properties sketch follows this list.)
  • Remove the invalid serial number or serial numbers from sqlfire.properties and restart the server. When the server restarts, it will use the default evaluation license.
  • Remove the keyword "dynamic" from sqlfire.properties and restart the server. When the server restarts, it will use the default evaluation license.
  • Increase the amount of time specified in license-server-timeout. This option is only applicable if you are running SQLFire on a vSphere virtual machine and using the vFabric License Server to acquire licenses dynamically.
  • Configure the license locally in sqlfire.properties. Keep in mind that the vFabric License Server cannot provide a standalone license for SQLFire Professional or SQLFire Enterprise. You can only use a license server configuration if you are running SQLFire on a vSphere virtual machine as part of vFabric Suite.
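
For reference, the following is a minimal sqlfire.properties sketch. It assumes that serial numbers are configured with the license-serial-number property and that license-server-timeout is specified in milliseconds; the serial number and timeout shown are placeholders, so confirm the exact property names, values, and units against the documentation for your SQLFire release.
# Static licensing: one or more comma-separated, non-expired serial numbers
license-serial-number=XXXXX-XXXXX-XXXXX-XXXXX-XXXXX

# Dynamic licensing (vSphere virtual machines running vFabric Suite only):
# request a license from the vFabric License Server and allow more time
# for the request to complete (timeout value is illustrative)
#license-serial-number=dynamic
#license-server-timeout=20000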

Member Startup Problems

When you start SQLFire members, startup delays can occur if specific disk store files on other members are unavailable. This can occur in a healthy system, depending on the order in which members are started. For example, consider the following startup message for a locator ("locator2"):
SQLFire Locator pid: 23537 status: waiting
Waiting for DataDictionary (DiskId: 531fc5bb-1720-4836-a468-3d738a21af63, Location: /vmware/locator2/./datadictionary) on: 
 [DiskId: aa77785a-0f03-4441-84f7-6eb6547d7833, Location: /vmware/server1/./datadictionary]
 [DiskId: f417704b-fff4-4b99-81a2-75576d673547, Location: /vmware/locator1/./datadictionary]
Here, the startup messages indicate that locator2 is waiting for the persistent datadictionary files on locator1 and server1 to become available. SQLFire always persists the data dictionary for the indexes and tables that you create, even if you do not configure those tables to persist their stored data. The startup messages above indicate that locator2 was shut down before the rest of the distributed system consisting of itself, locator1, and server1, and that locator1 or server1 might therefore store a newer copy of the data dictionary for the distributed system.
Continuing the startup by booting the server1 data store yields:
Starting SQLFire Server using locators for peer discovery: localhost[10337],localhost[10338]
Starting network server for SQLFire Server at address localhost/127.0.0.1[1529]
Logs generated in /vmware/server1/sqlfserver.log
The server is still starting. 15 seconds have elapsed since the last log message: 
 Region /_DDL_STMTS_META_REGION has potentially stale data. It is waiting for another member to recover the latest data.
My persistent id:

  DiskStore ID: aa77785a-0f03-4441-84f7-6eb6547d7833
  Name: 
  Location: /10.0.1.31:/vmware/server1/./datadictionary

Members with potentially new data:
[
  DiskStore ID: f417704b-fff4-4b99-81a2-75576d673547
  Name: 
  Location: /10.0.1.31:/vmware/locator1/./datadictionary
]
Use the "sqlf list-missing-disk-stores" command to see all disk stores that are being waited on by other members.
The data store startup messages indicate that locator1 has "potentially new data" for the data dictionary. In this case, both locator2 and server1 were shut down before locator1 in the system, so those members are waiting on locator1 to ensure that they have the latest version of the data dictionary.

The above messages for data stores and locators are common when individual members are shut down one by one rather than with sqlf shut-down-all, which allows all members to synchronize and shut down gracefully. If the indicated disk store persistence files are still available on the missing member, simply start that member and allow the waiting members to recover. For example, in the above system you would simply start locator1 and allow locator2 and server1 to synchronize their data.
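
For instance, a hedged sketch of restarting locator1 from its original working directory is shown below. The peer discovery address and port are assumptions based on the locator addresses in the earlier startup message (localhost[10337],localhost[10338]) and must match the values with which the locator was originally started.
# Restart the missing locator so that locator2 and server1 can recover
# the latest data dictionary from its intact disk store files
sqlf locator start -dir=/vmware/locator1 -peer-discovery-address=localhost -peer-discovery-port=10337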

To avoid this type of delayed startup and recovery:
  1. Use sqlf shut-down-all to gracefully shut down all data store members after synchronizing disk stores with the available locators.
  2. Use sqlf locator stop to shut down the remaining locator members after the data stores have stopped (an example shutdown sequence follows these steps).
  3. If a member cannot be restarted and it is preventing other data stores from starting, use sqlf revoke-missing-disk-store to revoke the disk stores that are preventing startup. This can cause some loss of data if the revoked disk store actually contains recent changes to the data dictionary or to table data.
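
As a sketch, the commands below follow steps 1 and 2 for the example system above; the locator addresses and working directories are reused from that example and should be adjusted for your deployment.
# Step 1: shut down all data stores, using the running locators for discovery
sqlf shut-down-all -locators=localhost[10337],localhost[10338]

# Step 2: stop each locator from its working directory after the data stores have stopped
sqlf locator stop -dir=/vmware/locator1
sqlf locator stop -dir=/vmware/locator2
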
If the persistent disk store files for the data dictionary are deleted, moved, or modified, further complications can occur during startup. These problems are generally indicated by a ConflictingPersistentDataException while starting up other members of the system. For example:
ConflictingPersistentDataException: Region /_DDL_STMTS_META_REGION remote member curwen(23695)<v1>:4505 with 
persistent data /10.0.1.31:/vmware/locator1/./datadictionary created at timestamp 1373667883741 version 0 diskStoreId 
9cf5aea67c6c4374-9d7205f72fecd47c name  was not part of the same distributed system as the local data from 
/10.0.1.31:/vmware/server1/./datadictionary created at timestamp 1373649392112 version 0 diskStoreId aa77785a0f034441-84f76eb6547d7833 
name  - See log file for details.
If the datadictionary directory is deleted or moved, then SQLFire creates a new data dictionary upon startup of that member. (Remember, all SQLFire locators and data stores maintain a persistent data dictionary for the distributed system, even if you do not persist data in the tables.) However, other members in the distributed system may expect the locator or data store to have the previous, deleted version of the data dictionary in order to recover more recent operations. If this occurs, the newly-created data dictionary conflicts with the other members' view of the distributed system, and members that start up throw a ConflictingPersistentDataException.
To resolve a ConflictingPersistentDataException:
  1. Shut down the member that is causing the exception. In the above example, you would shut down remote member curwen(23695).
  2. Restore the original datadictionary directory in the shut down member, if possible. Then restart the member with the expected data dictionary files.
  3. If you cannot restore the original datadictionary directory, use sqlf revoke-missing-disk-store to revoke the missing data dictionary disk store files.
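
As an illustration, the commands below list the disk stores that other members are waiting on and then revoke one of them by ID. The locator addresses are reused from the earlier example, and <missing-disk-store-id> stands for an ID reported by list-missing-disk-stores; revoke a disk store only as a last resort, because any changes recorded solely in that disk store are lost.
# Show the disk stores that running members are waiting on
sqlf list-missing-disk-stores -locators=localhost[10337],localhost[10338]

# Revoke the unrecoverable data dictionary disk store, using an ID
# reported by list-missing-disk-stores
sqlf revoke-missing-disk-store <missing-disk-store-id> -locators=localhost[10337],localhost[10338]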

If you cannot resolve startup problems associated with missing or conflicting data dictionary files, you can force the SQLFire member to complete its startup by using the sqlfire.datadictionary.allow-startup-errors property. This property enables you to start a SQLFire member even if the volume or directory in which a disk store was created no longer exists; you can recreate the disk store manually after forcing the member to restart.
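
As a sketch, one way to supply this property is to pass it as a JVM system property when starting the member, assuming your sqlf launcher accepts -J-prefixed JVM arguments; the directory and locator addresses below are reused from the earlier example.
# Force the member to complete startup despite missing or conflicting
# data dictionary disk store files; recreate the disk store afterward
sqlf server start -dir=/vmware/server1 -locators=localhost[10337],localhost[10338] \
  -J-Dsqlfire.datadictionary.allow-startup-errors=true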

Connection Problems

These are common problems that occur when connecting to a SQLFire distributed system:
  • You receive SQL State 08001 Error: 'Failed after trying all available servers: []'

    This problem can occur if you specify null values for the username and password connection properties in the JDBC connection URL. Some third-party tools automatically supply null values for these connection properties if you do not specify user credentials.

    If authentication is disabled in your distributed system, then you can specify any temporary user name and password value when connecting, as in the example URL below. Connect to vFabric SQLFire with JDBC Tools provides more details.
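
    For example, a thin-client connection URL of the following form supplies explicit placeholder credentials as connection properties; the host, port, and credential values are assumptions, so substitute your own network server address and any required user account.
    jdbc:sqlfire://localhost:1527/;user=tempuser;password=temppass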

WAN Replication Problems

In WAN deployments, tables may fail to synchronize between two SQLFire distributed systems if the tables are not identical to one another (see Create Tables with Gateway Senders). If you have configured WAN replication between sites but a table fails to synchronize because of schema differences, follow these steps to correct the situation:
  1. Stop the gateway senders and gateway receivers in each SQLFire distributed system. See Start and Stop Gateway Senders.
  2. Use ALTER TABLE to add or drop columns on the problem table, to ensure that both tables have the same column definitions. Compare the output of the describe command for each table to ensure that the tables are the same. Or, use sqlf write-schema-to-sql in each distributed system to compare the DDL statements used to create each table.
  3. Use the SYS.GET_TABLE_VERSION Function to verify that both tables have the same version in the data dictionary of each SQLFire cluster. If the versions do not match, use SYS.INCREMENT_TABLE_VERSION on the table having the smaller version to make both table versions equal (see the example statements after these steps).
  4. Restart gateway senders and gateway receivers for the distributed systems. See Start and Stop Gateway Senders.
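
As an illustration of steps 2 and 3, the statements below use a hypothetical APP.CUSTOMERS table and assume that SYS.GET_TABLE_VERSION takes the schema and table names and that SYS.INCREMENT_TABLE_VERSION additionally takes the amount by which to increment; confirm the exact signatures in the reference documentation for your release.
-- Step 2: make the column definitions identical in both distributed systems
-- (APP.CUSTOMERS and LOYALTY_TIER are placeholders)
ALTER TABLE APP.CUSTOMERS ADD COLUMN LOYALTY_TIER INT;

-- Step 3: check the table version recorded in each cluster's data dictionary
VALUES SYS.GET_TABLE_VERSION('APP', 'CUSTOMERS');

-- If the versions differ, increment the smaller one until the versions match
CALL SYS.INCREMENT_TABLE_VERSION('APP', 'CUSTOMERS', 1);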