If VM Component Protection (VMCP) is enabled, vSphere HA can detect datastore accessibility failures and provide automated recovery for affected virtual machines.

VMCP provides protection against datastore accessibility failures that can affect a virtual machine running on a host in a vSphere HA cluster. When a datastore accessibility failure occurs, the affected host can no longer access the storage path for a specific datastore. You can configure how vSphere HA responds to such a failure, ranging from the creation of event alarms to virtual machine restarts on other hosts.

Note

When you use the VM Component Protection feature, your ESXi hosts must be version 6.0 or higher.

There are two types of datastore accessibility failure:

PDL

PDL (Permanent Device Loss) is an unrecoverable loss of accessibility that occurs when a storage device reports the datastore is no longer accessible by the host. This condition cannot be reverted without powering off virtual machines.

APD

APD (All Paths Down) represents a transient or unknown accessibility loss or any other unidentified delay in I/O processing. This type of accessibility issue is recoverable.

VM Component Protection is enabled and configured in the vSphere Web Client. To enable this feature, you must select the Protect against Storage Connectivity Loss checkbox in the edit cluster settings wizard. The storage protection levels you can choose and the virtual machine remediation actions available differ depending on the type of datastore accessibility failure.

PDL failures

A virtual machine is automatically failed over to a new host unless you have configured VMCP to Issue events only.

APD events

The response to APD events is more complex and accordingly the configuration is more fine-grained.

After the user-configured Delay for VM failover for APD period has elapsed, the action taken depends on the policy you selected. An event is issued and the virtual machine is restarted either conservatively or aggressively. The conservative approach does not terminate the virtual machine if the success of the failover is unknown, for example in a network partition. The aggressive approach does terminate the virtual machine under these conditions. Neither approach terminates the virtual machine if there are insufficient resources in the cluster for the failover to succeed.

If APD recovers before the user-configured Delay for VM failover for APD period has elapsed, you can choose to reset the affected virtual machines, which recovers the guest applications that were impacted by the I/O failures.

Note

If either the Host Monitoring or the VM Restart Priority setting is disabled, VMCP cannot perform virtual machine restarts. However, storage health can still be monitored and events can still be issued.
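The response rules described above can be summarized in a small decision function. The following Python sketch is only an illustrative model of the documented behavior, not a vSphere API: the names `Failure`, `Policy`, `ClusterState`, and `vmcp_response` are hypothetical, and for simplicity the sketch treats either restart policy as "fail over" when the failure is a PDL. For APD, it describes the decision taken after the Delay for VM failover for APD period has elapsed.

```python
from dataclasses import dataclass
from enum import Enum

EVENTS_ONLY = "issue events only"
RESTART = "restart on another host"


class Failure(Enum):
    PDL = "pdl"
    APD = "apd"


class Policy(Enum):
    ISSUE_EVENTS = "issue events"
    RESTART_CONSERVATIVE = "restart conservative"
    RESTART_AGGRESSIVE = "restart aggressive"


@dataclass
class ClusterState:
    # True only if Host Monitoring and VM Restart Priority are both enabled.
    restarts_enabled: bool = True
    # False when the success of a failover is unknown (e.g. a network partition).
    failover_outcome_known: bool = True
    # False when the cluster lacks resources for the failover to succeed.
    sufficient_resources: bool = True


def vmcp_response(failure: Failure, policy: Policy, state: ClusterState) -> str:
    """Model the VMCP response for one affected virtual machine."""
    # Without restarts enabled, VMCP can only monitor and issue events.
    if policy is Policy.ISSUE_EVENTS or not state.restarts_enabled:
        return EVENTS_ONLY
    # PDL: the virtual machine is failed over to a new host.
    if failure is Failure.PDL:
        return RESTART
    # APD: neither policy terminates the VM without sufficient resources.
    if not state.sufficient_resources:
        return EVENTS_ONLY
    # APD: only the aggressive policy proceeds when the outcome is unknown.
    if policy is Policy.RESTART_CONSERVATIVE and not state.failover_outcome_known:
        return EVENTS_ONLY
    return RESTART


if __name__ == "__main__":
    partitioned = ClusterState(failover_outcome_known=False)
    print(vmcp_response(Failure.APD, Policy.RESTART_CONSERVATIVE, partitioned))
    print(vmcp_response(Failure.APD, Policy.RESTART_AGGRESSIVE, partitioned))
```

In this model, the only difference between the conservative and aggressive policies is the case where the outcome of the failover is unknown. All other conditions lead to the same result for both.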

For more information on configuring VMCP, see Configure Virtual Machine Responses.

The following timeline shows how VMCP recovers from a storage failure.

T=0s: A storage failure is detected. vSphere HA starts the recovery process. For a PDL event, the recovery workflow starts immediately and VMs are restarted on healthy hosts in the cluster. If the storage loss is due to an APD event, the APD Timeout timer starts (the default is 140 seconds).

T=140s: The host declares an APD Timeout and begins to fail non-VM I/O to the unresponsive storage device.

Between T=140s and T=320s: This is the time period defined by the Delay for VM failover for APD setting, which is 3 minutes by default. The guest applications might become unstable after losing access to storage for an extended period of time. If an APD is cleared in this time period, the option to reset the VMs is available.

T=320s: The Delay for VM failover for APD elapses (3 minutes after the APD Timeout is reached), and vSphere HA starts the APD recovery response.
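The timeline arithmetic above can be reproduced with a short Python sketch. It assumes only the two default values stated in this topic (a 140-second APD Timeout and a 3-minute Delay for VM failover for APD). The names `ApdTimers` and `apd_phase` are hypothetical and exist only for illustration; they do not correspond to any vSphere API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApdTimers:
    apd_timeout_s: int = 140      # default APD Timeout
    failover_delay_s: int = 180   # default Delay for VM failover for APD (3 minutes)

    @property
    def response_starts_at_s(self) -> int:
        # The failover delay begins only once the APD Timeout is declared.
        return self.apd_timeout_s + self.failover_delay_s


def apd_phase(t_s: int, timers: ApdTimers = ApdTimers()) -> str:
    """Return the timeline phase at t_s seconds after an APD is detected."""
    if t_s < timers.apd_timeout_s:
        return "APD Timeout timer running"
    if t_s < timers.response_starts_at_s:
        return "APD Timeout declared, failover delay running"
    return "APD recovery response started"


if __name__ == "__main__":
    timers = ApdTimers()
    print(timers.response_starts_at_s)  # 140 + 180 = 320
    for t in (0, 140, 320):
        print(t, apd_phase(t, timers))
```

With a different delay, for example `ApdTimers(failover_delay_s=60)`, the recovery response in this model would start at T=200s instead of T=320s.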