You might need to troubleshoot issues that are adversely affecting the functioning of your fault tolerant virtual machines.

When attempting to power on a virtual machine with VMware Fault Tolerance enabled, an error message might appear. This is often the result of Hardware Virtualization (HV) not being available on the ESX/ESXi server on which you are attempting to power on the virtual machine. HV might not be available either because it is not supported by the ESX/ESXi server hardware or because HV is not enabled in the BIOS.

If the ESX/ESXi server hardware supports HV, but HV is not currently enabled, enable HV in the BIOS on that server. The process for enabling HV varies among BIOSes. See the documentation for your hosts' BIOSes for details on how to enable HV.

If the ESX/ESXi server hardware does not support HV, switch to hardware that uses processors that support Fault Tolerance.

After powering on a virtual machine with Fault Tolerance enabled, an error message might appear in the Recent Tasks pane:

Secondary VM could not be powered on as there are no compatible hosts that can accommodate it.

This can occur for a variety of reasons: there are no other hosts in the cluster, no other hosts have HV enabled, datastores are inaccessible, there is no available capacity, or hosts are in maintenance mode. If there are insufficient hosts, add more hosts to the cluster. If there are hosts in the cluster, ensure that they support HV and that HV is enabled. The process for enabling HV varies among BIOSes, so see the documentation for your hosts' BIOSes for details. Also check that the hosts can access the datastores, that they have sufficient capacity, and that they are not in maintenance mode.
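The checks above can be organized as a simple checklist. The following Python sketch is illustrative only: the `Host` record, its fields, and the host names are hypothetical values you would fill in by hand from your own inventory, not objects or calls from any VMware API.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical summary of one cluster host (not a VMware API object)."""
    name: str
    hv_enabled: bool             # HV supported by the hardware and enabled in the BIOS
    in_maintenance_mode: bool
    datastores_accessible: bool  # host can reach the virtual machine's datastores
    free_cpu_mhz: int
    free_memory_mb: int


def rejection_reasons(host, needed_cpu_mhz, needed_memory_mb):
    """Return the reasons this host cannot accommodate the Secondary VM."""
    reasons = []
    if not host.hv_enabled:
        reasons.append("HV not available or not enabled")
    if host.in_maintenance_mode:
        reasons.append("host is in maintenance mode")
    if not host.datastores_accessible:
        reasons.append("datastores are inaccessible")
    if host.free_cpu_mhz < needed_cpu_mhz or host.free_memory_mb < needed_memory_mb:
        reasons.append("insufficient capacity")
    return reasons


def diagnose(hosts, primary_host_name, needed_cpu_mhz, needed_memory_mb):
    """Map every host other than the Primary VM's host to its rejection reasons."""
    candidates = [h for h in hosts if h.name != primary_host_name]
    return {
        h.name: rejection_reasons(h, needed_cpu_mhz, needed_memory_mb)
        for h in candidates
    }


if __name__ == "__main__":
    cluster = [
        Host("esx01", True, False, True, 8000, 32768),  # runs the Primary VM
        Host("esx02", False, False, True, 8000, 32768),
        Host("esx03", True, True, True, 8000, 32768),
    ]
    for name, reasons in diagnose(cluster, "esx01", 2000, 4096).items():
        print(name, "->", reasons or "compatible")
```

An empty result means there are no other hosts in the cluster. A host mapped to an empty list passes every check in this sketch, so the cause is probably elsewhere.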

If a Primary VM appears to be executing slowly, even though its host is lightly loaded and retains idle CPU time, check whether the host where the Secondary VM is running is heavily loaded. A Secondary VM running on a host that is overcommitted for CPU resources might not get the same amount of CPU resources as the Primary VM. When this occurs, the Primary VM must frequently slow down to allow the Secondary VM to keep up, effectively reducing its execution speed to the slower speed of the Secondary VM.

Further evidence of this problem is a yellow or red vLockstep Interval on the Primary VM's Fault Tolerance panel. This color indicates that the Secondary VM is running several seconds behind the Primary VM, in which case Fault Tolerance slows down the Primary VM. If the vLockstep Interval remains yellow or red for an extended period of time, this strongly indicates that the Secondary VM is not getting enough CPU resources to keep up with the Primary VM.

To resolve this problem, set an explicit CPU reservation for the Primary VM at a MHz value sufficient to run its workload at the desired performance level. This reservation is applied to both the Primary and Secondary VMs, ensuring that both can execute at the specified rate. For guidance on setting this reservation, view the performance graphs of the virtual machine (from before Fault Tolerance was enabled) to see how much CPU it used under normal conditions.
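As one possible way to turn performance-graph readings into a reservation value, the following Python sketch picks a high percentile of CPU usage samples that you have noted down in MHz. The function name, the sample values, and the choice of the 95th percentile are assumptions made for illustration. The source does not prescribe a formula, so treat the result as a starting point rather than a recommended setting.

```python
def suggest_cpu_reservation_mhz(usage_samples_mhz, percentile=95):
    """Return the nearest-rank percentile of the observed CPU usage (MHz).

    usage_samples_mhz: readings taken from the performance graphs before
    Fault Tolerance was enabled (hypothetical, manually collected input).
    percentile: whole number from 1 to 100; 95 is an arbitrary default.
    """
    if not usage_samples_mhz:
        raise ValueError("at least one usage sample is required")
    if not isinstance(percentile, int) or not 1 <= percentile <= 100:
        raise ValueError("percentile must be a whole number from 1 to 100")

    ordered = sorted(usage_samples_mhz)
    # Nearest-rank method: smallest rank covering `percentile` percent
    # of the samples, computed with integer ceiling division.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    samples = [400, 450, 500, 520, 480, 1200, 510, 470, 490, 530]
    print(suggest_cpu_reservation_mhz(samples, percentile=50))  # prints 490
    print(suggest_cpu_reservation_mhz(samples))                 # prints 1200
```

A lower percentile ignores short spikes but might leave the Secondary VM short of CPU during bursts. A higher percentile reserves more CPU on both hosts.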

Enabling Fault Tolerance or migrating a running fault tolerant virtual machine using VMotion can fail if the virtual machine's memory is too large (greater than 15 GB) or if memory is changing at a rate faster than VMotion can copy it over the network. The failure occurs when, due to the virtual machine’s memory size, there is not enough bandwidth to complete the VMotion switchover operation within the default timeout window (8 seconds).

To resolve this problem, before you enable Fault Tolerance, power off the virtual machine and increase its timeout window by adding the following line to the vmx file of the virtual machine:

ft.maxSwitchoverSeconds = "30"

where 30 is the timeout window in seconds. Enable Fault Tolerance and power the virtual machine back on. This solution should work except under conditions of very high network activity.

Note

If you increase the timeout to 30 seconds, the fault tolerant virtual machine might become unresponsive for a longer period of time (up to 30 seconds) when enabling FT or when a new Secondary VM is created after a failover.
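If you prefer to script the vmx change described above rather than edit the file by hand, the following Python sketch shows one way to do it on the file's text. The helper name `set_vmx_option` and the sample file contents are hypothetical. The sketch assumes the vmx file uses one `key = "value"` entry per line, and it should only be applied to a copy of the file while the virtual machine is powered off.

```python
def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with `key = "value"` set.

    An existing entry for `key` is replaced in place; otherwise the
    entry is appended as a new last line.
    """
    new_line = f'{key} = "{value}"'
    lines = vmx_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Compare only the part before the first "=" with the key.
        if "=" in line and line.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample contents for illustration; a real vmx file has many more entries.
    original = 'memsize = "16384"\nnumvcpus = "1"\n'
    print(set_vmx_option(original, "ft.maxSwitchoverSeconds", 30), end="")
```

Running the function a second time with a different value updates the existing line instead of adding a duplicate entry.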

In some cases, you might notice that the CPU usage for a Secondary VM is higher than for its associated Primary VM. This is because replaying events (such as timer interrupts) on the Secondary VM can be slightly more expensive than recording them on the Primary VM. This additional overhead is small. When the Primary VM is idle, this relative difference between the Primary and Secondary VMs might seem large, but examining the actual CPU usage shows that very little CPU resource is being consumed by the Primary VM or the Secondary VM.
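The difference between relative and absolute CPU usage can be made concrete with a little arithmetic. The following Python sketch uses made-up MHz figures purely for illustration. It shows how the same small absolute gap looks large in relative terms when the Primary VM is idle and negligible when it is busy.

```python
def compare_cpu_usage(primary_mhz, secondary_mhz):
    """Return (absolute gap in MHz, gap as a percentage of Primary usage)."""
    if primary_mhz <= 0:
        raise ValueError("primary_mhz must be positive")
    absolute_gap = secondary_mhz - primary_mhz
    relative_gap_pct = absolute_gap * 100 / primary_mhz
    return absolute_gap, relative_gap_pct


if __name__ == "__main__":
    # Hypothetical readings: an idle pair and a busy pair of VMs.
    print(compare_cpu_usage(20, 30))      # prints (10, 50.0)
    print(compare_cpu_usage(2000, 2010))  # prints (10, 0.5)
```

In both cases the Secondary VM uses only 10 MHz more than the Primary VM. The 50 percent figure in the idle case reflects how little CPU either virtual machine is consuming, not a large overhead.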