VMware Virtual SAN (vSAN) 6.6 introduces the following new features and enhancements:
vSAN 6.6 is a major new release that requires a full upgrade. Perform the following tasks to complete the upgrade to vSAN 6.6:
During an upgrade of the vSAN on-disk format, a disk group evacuation is performed. The disk group is removed and upgraded to on-disk format version 5.0, and the disk group is added back to the cluster. For two-node or three-node clusters, or clusters without enough capacity to evacuate each disk group, you must use the following RVC command to upgrade the on-disk format: vsan.ondisk_upgrade --allow-reduced-redundancy
When you allow reduced redundancy, your VMs are unprotected for the duration of the upgrade, because this method does not evacuate data to the other hosts in the cluster. It removes each disk group, upgrades the on-disk format, and adds the disk group back to the cluster. All objects remain available, but with reduced redundancy.
If you enable deduplication and compression during the upgrade to vSAN 6.6, you can select Allow Reduced Redundancy from the vSphere Web Client.
Using VMware Update Manager to upgrade hosts in parallel might result in the witness host being upgraded in parallel with one of the data hosts in a stretched cluster. To avoid upgrade problems, do not configure VMware Update Manager to upgrade a witness host in parallel with the data hosts in a stretched cluster. Upgrade the witness host after all data hosts have been successfully upgraded and have exited maintenance mode.
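The ordering rule above can be sketched as a small planning helper. This is a minimal illustration in Python and is not part of VMware Update Manager. The function name, the host names, and the batch size are hypothetical.

```python
from typing import List


def plan_upgrade_batches(data_hosts: List[str], witness_host: str,
                         batch_size: int = 2) -> List[List[str]]:
    """Group data hosts into parallel batches and put the witness host last.

    Hypothetical helper for illustration only: it encodes the rule that the
    witness host is upgraded alone, after every batch of data hosts.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if witness_host in data_hosts:
        raise ValueError("witness host must not be listed as a data host")

    batches = [data_hosts[i:i + batch_size]
               for i in range(0, len(data_hosts), batch_size)]
    # The witness host always gets its own final batch.
    batches.append([witness_host])
    return batches


if __name__ == "__main__":
    plan = plan_upgrade_batches(
        ["esx-a1", "esx-a2", "esx-b1", "esx-b2"], "witness-01")
    for step, batch in enumerate(plan, start=1):
        print(f"Batch {step}: {', '.join(batch)}")
```

The sketch only fixes the order. Confirming that every data host has finished upgrading and exited maintenance mode before the final batch starts is still a manual check.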
During upgrades of the vSAN on-disk format, the Physical Disk Health – Metadata Health check can fail intermittently. These failures can occur if the destaging process is slow, most likely because vSAN must allocate physical blocks on the storage devices. Before you take action, verify the status of this health check after the period of high activity, such as multiple virtual machine deployments, is complete. If the health check is still red, the warning is valid. If the health check is green, you can ignore the previous warning. For more information, see Knowledge Base article 2108690.
The following issues are known to occur in vSAN 6.6:
Cluster consistency health check fails during deep rekey operation
The deep rekey operation on an encrypted vSAN cluster can take several hours. During the rekey, the following health check might indicate a failure: Cluster configuration consistency. The cluster consistency check does not detect the deep rekey operation, so the reported failure might not indicate an actual problem.
Workaround: Retest the vSAN cluster consistency health check after the deep rekey operation is complete.
VM OVF deploy fails if DRS is disabled
If you deploy an OVF template on the vSAN cluster, the operation fails if DRS is disabled on the vSAN cluster. You might see a message similar to the following: The operation is not allowed in the current state.
Workaround: Enable DRS on the vSAN cluster before you deploy an OVF template.
vSAN stretched cluster configuration lost after you disable vSAN on a cluster
If you disable vSAN on a stretched cluster, the stretched cluster configuration is not retained. The stretched cluster, witness host, and fault domain configuration is lost.
Workaround: Reconfigure the stretched cluster parameters when you re-enable the vSAN cluster.
Orphaned or inaccessible VMs after total cluster failure
After total cluster failure, some powered off or suspended VMs might become orphaned or inaccessible, especially when vSAN encryption is enabled.
Workaround: Use the following procedure to re-register orphaned or inaccessible VMs.
- Use RVC to connect to vCenter Server.
- Navigate to the name of the cluster where orphaned VMs exist and re-register them. For example, if the name of the cluster is "vsan," run the following command: vsan.check_state -ref /localhost/Datacenter/computers/vsan
vsan.check_state -ref /localhost/Datacenter/computers/vsan
2017-03-03 18:54:04 +0000: Step 1: Check for inaccessible vSAN objects
2017-03-03 18:54:10 +0000: Step 1b: Check for inaccessible vSAN objects, again
2017-03-03 18:54:11 +0000: Step 2: Check for invalid/inaccessible VMs
2017-03-03 18:54:11 +0000: Step 2b: Check for invalid/inaccessible VMs again
2017-03-03 18:54:11 +0000: Step 3: Check for VMs for which VC/hostd/vmx are out of sync
Did not find VMs for which VC/hostd/vmx are out of sync
On-disk format version for witness host is later than version for data hosts
When you change the witness host during an upgrade to vSAN 6.6, the new witness host receives the latest on-disk format version. The on-disk format version of the witness host might be later than the on-disk format version of the data hosts. In this case, the witness host cannot store components.
Workaround: Use the following procedure to change the on-disk format to an earlier version.
- Delete the disk group on the new witness host.
- Set the advanced parameter to enable formatting of disk groups with an earlier on-disk format. For more information, see Knowledge Base article 2146221.
- Recreate a new disk group on the witness host with a vSAN on-disk format version that matches the data hosts.
Powered off VMs appear as inaccessible during witness host replacement
When you change a witness host in a stretched cluster, VMs that are powered off appear as inaccessible in the vSphere Web Client for a brief time. After the process is complete, powered off VMs appear as accessible. All running VMs appear as accessible throughout the process.
Cannot place hosts in maintenance mode if they have faulty boot media
vSAN cannot place hosts with faulty boot media into maintenance mode. The task to enter maintenance mode might fail with an internal vSAN error, due to the inability to save configuration changes. You might see log events similar to the following: Lost Connectivity to the device xxx backing the boot filesystem
Workaround: Remove disk groups manually from each host, using the Full data evacuation option. Then place the host in maintenance mode.
Health check times out if a host fails
If one host in the cluster fails, the health check might time out. You might see the following message: a back-end task took more than 120 seconds. When the vSAN health service detects that the host has failed, it restarts. The health check automatically resumes after ten minutes.
Health service does not work if vSAN cluster has ESXi hosts with vSphere 6.0 Update 1 or earlier
The vSAN 6.6 health service does not work if the cluster has ESXi hosts running vSphere 6.0 Update 1 or earlier releases.
Workaround: Do not add ESXi hosts with vSphere 6.0 Update 1 or earlier software to a vSAN 6.6 cluster.
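A pre-check for this workaround could look like the following minimal sketch. It assumes that host versions have already been collected by some other means and encoded as (major, minor, update) tuples. That encoding, the function name, and the host names are hypothetical.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int, int]  # (major, minor, update), a hypothetical encoding

# vSphere 6.0 Update 1 is the latest release that breaks the health service.
LAST_UNSUPPORTED: Version = (6, 0, 1)


def hosts_blocking_health_service(hosts: Dict[str, Version]) -> List[str]:
    """Return the names of hosts running vSphere 6.0 Update 1 or earlier."""
    return sorted(name for name, version in hosts.items()
                  if version <= LAST_UNSUPPORTED)


if __name__ == "__main__":
    inventory = {"esx-01": (6, 5, 0), "esx-02": (6, 0, 1), "esx-03": (6, 0, 2)}
    print(hosts_blocking_health_service(inventory))
```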
After stretched cluster failover, VMs on the preferred site register alert: Failed to failover
If the secondary site in a stretched cluster fails, VMs fail over to the preferred site. VMs already on the preferred site might register the following alert: Failed to failover. Ignore this alert. It does not impact the behavior of the failover.
During network partition, components in the active site appear to be absent
During a network partition in a two-host vSAN cluster or a stretched cluster, the vSphere Web Client might display a view of the cluster from the perspective of the non-active site. You might see active components in the primary site displayed as absent.
Workaround: Use RVC commands to query the state of objects in the cluster. For example: vsan.vm_object_info
vCenter Server Appliance Installer accepts cluster name greater than 80 characters
If you enter a vSAN cluster name that is more than 80 characters, the vCenter Server Appliance Installer accepts the name, but the configuration is invalid. The vCenter Server Appliance fails when it is booted.
Workaround: Enter a vSAN cluster name that is 80 characters or fewer.
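Because the installer does not enforce the limit, the name can be checked before it is entered. The following is a minimal sketch. The function name is hypothetical, and rejecting empty names is an added assumption that the text above does not state.

```python
MAX_CLUSTER_NAME_LENGTH = 80  # limit stated in the workaround above


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name is non-empty and at most 80 characters long.

    Rejecting empty names is an assumption made for this sketch.
    """
    return 0 < len(name) <= MAX_CLUSTER_NAME_LENGTH


if __name__ == "__main__":
    for candidate in ("vsan", "x" * 81):
        print(len(candidate), is_valid_cluster_name(candidate))
```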
vCenter Server Appliance Installer accepts mix of flash and magnetic drives for capacity
The vCenter Server Appliance Installer allows you to select a mix of flash devices and magnetic disks for the capacity tier of a disk group in a new vSAN cluster. The capacity tier of each disk group can support either all-flash or all-magnetic devices.
Workaround: Do not mix flash devices and magnetic disks on the capacity tier of the vSAN cluster.
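The rule can be expressed as a simple check on a planned disk group. This is a minimal sketch. The device type labels ("flash", "magnetic") and the function name are hypothetical and do not correspond to any vSAN API.

```python
from typing import Iterable


def capacity_tier_is_uniform(device_types: Iterable[str]) -> bool:
    """Return True if every capacity device in a disk group has the same type.

    device_types holds one hypothetical label per capacity device, such as
    "flash" or "magnetic". An empty group is treated as uniform.
    """
    return len(set(device_types)) <= 1


if __name__ == "__main__":
    print(capacity_tier_is_uniform(["flash", "flash", "flash"]))
    print(capacity_tier_is_uniform(["flash", "magnetic"]))
```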
Temporary Update configuration tasks visible if hosts are disconnected when you change vSAN encryption configurations
When you change the configuration of an encrypted vSAN cluster (such as turning encryption on or off, or changing the KMS key) while some hosts are disconnected, an Update vSAN configuration task runs on each host every 3 seconds until all hosts reconnect or until 5 minutes have passed. These tasks are not harmful and rarely impact performance.
Some objects are non-compliant after force repair
After you perform a force repair, some objects might not be repaired because the ownership of the objects was transferred to a different node during the process. The force repair might be delayed for those objects.
Workaround: Attempt the force repair operation again after all other objects are repaired and resynchronized. Alternatively, you can wait until vSAN repairs the objects.
When you move a host from one encrypted cluster to another, and then back to the original cluster, the task fails
When you move a host from an encrypted vSAN cluster to another encrypted vSAN cluster, then move the host back to the original encrypted cluster, the task might fail. You might see the following message: A general system error occurred: Invalid fault. This error occurs because vSAN cannot re-encrypt data on the host using the original encryption key. After a short time, vCenter Server restores the original key on the host, and all unmounted disks in the vSAN cluster are mounted.
Workaround: Reboot the host and wait for all disks to get mounted.
Cluster becomes partitioned if vCenter Server and ESXi hosts reboot
If both the vCenter Server and ESXi hosts of a vSAN cluster are rebooted, the cluster can become partitioned.
Workaround: Restart the vSAN health service.
Stretched cluster imbalance after a site recovers
When you recover a failed site in a stretched cluster, sometimes hosts in the failed site are brought back sequentially over a long period of time. vSAN might overuse some hosts when it begins repairing the absent components.
Workaround: Recover all of the hosts in a failed site together within a short time window.
VM operations fail due to HA master issue with stretched clusters
Under certain failure scenarios in stretched clusters, certain VM operations such as vMotion or powering on a VM might be impacted. These failure scenarios include a partial or a complete site failure, or the failure of the high-speed network between the sites. This problem is caused by the dependency on VMware HA being available for normal operation of stretched cluster sites.
Workaround: Disable vSphere HA before performing vMotion, VM creation, or powering on VMs. Then re-enable vSphere HA.
Updated Restoring or replacing vCenter Server can cause cluster partition
If the vCenter Server is replaced or recovered from backup, the host membership list might become out-of-date. This can cause ESXi hosts to become partitioned from the cluster.
Workaround: Use the following procedure to make sure all hosts are added to the vSAN cluster as the vCenter Server reboots.
- Before you reboot vCenter Server, configure hosts to ignore cluster member list updates. Run the following command on each host in the vSAN cluster:
esxcfg-advcfg -s1 /VSAN/IgnoreClusterMemberListUpdates
- After vCenter Server is running and all hosts are present in the cluster, configure hosts to use cluster member list updates. Run the following command on each host in the cluster:
esxcfg-advcfg -s0 /VSAN/IgnoreClusterMemberListUpdates
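The two phases of this procedure can be written out as a per-host runbook. The sketch below only builds the command strings from the procedure above. It does not connect to any host, and the host names and function names are hypothetical.

```python
from typing import Dict, List

ADVANCED_OPTION = "/VSAN/IgnoreClusterMemberListUpdates"


def member_list_command(ignore_updates: bool) -> str:
    """Build the esxcfg-advcfg command shown in the procedure above."""
    value = 1 if ignore_updates else 0
    return f"esxcfg-advcfg -s{value} {ADVANCED_OPTION}"


def build_runbook(hosts: List[str]) -> Dict[str, List[str]]:
    """Return the commands to run on each host, grouped by phase."""
    return {
        # Phase 1: before vCenter Server is rebooted.
        "before_vcenter_reboot": [
            f"{host}: {member_list_command(True)}" for host in hosts],
        # Phase 2: after vCenter Server is running and all hosts are present.
        "after_vcenter_is_up": [
            f"{host}: {member_list_command(False)}" for host in hosts],
    }


if __name__ == "__main__":
    for phase, lines in build_runbook(["esx-01", "esx-02"]).items():
        print(phase)
        for line in lines:
            print("  " + line)
```

Keeping both phases in one runbook makes it harder to forget the second step, which returns the hosts to using cluster member list updates.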
Disk decommission or disk unmount task fails
A disk decommission or disk unmount task might fail due to a conflict between the data write commit task and the virtual disk delete task. This problem might occur during upgrades that require a new vSAN on-disk format. You might see the following message in vmkernel.log:
4724 2017-04-10T18:46:51.309Z cpu30:67232)LSOM: LSOMFreeMDDispatch:3797: Throttled: Waiting for component cleanup
Workaround: Reboot the host to clear the conflict and retry the operation.
vMotion network connectivity test incorrectly reports ping failures
The vMotion network connectivity test (Cluster > Monitor > vSAN > Health > Network) reports ping failures if the vMotion stack is used for vMotion. The vMotion network connectivity (ping) check only supports vmknics that use the default network stack. The check fails for vmknics using the vMotion network stack. These reports do not indicate a connectivity problem.
Workaround: Configure the vmknic to use the default network stack. You can disable the vMotion ping check using RVC commands. For example: vsan.health.silent_health_check_configure -a vmotionpingsmall
Cannot perform deep rekey if a disk group is unmounted
Before vSAN performs a deep rekey, it performs a shallow rekey. The shallow rekey fails if an unmounted disk group is present. The deep rekey process cannot begin.
Workaround: Remount or remove the unmounted disk group.
Log entries state that firewall configuration has changed
A new firewall entry appears in the security profile when vSAN encryption is enabled: vsanEncryption. This rule controls how hosts communicate directly to the KMS. When it is triggered, log entries are added to /var/log/vobd.log. You might see the following messages:
Firewall configuration has changed. Operation 'addIP4' for rule set vsanEncryption succeeded.
Firewall configuration has changed. Operation 'removeIP4' for rule set vsanEncryption succeeded.
These messages can be ignored.
Updated Limited support for First Class Disks with vSAN datastores
vSAN 6.6 does not fully support First Class Disks in vSAN datastores. You might experience the following problems if you use First Class Disks in a vSAN datastore:
- vSAN health service does not display the health of First Class Disks correctly.
- The Used Capacity Breakdown includes the used capacity for First Class Disks in the following category: Other
- The health status of VMs that use First Class Disks is not calculated correctly.
HA failover does not occur after setting Traffic Type option on a vmknic to support witness traffic
If you set the traffic type option on a vmknic to support witness traffic, vSphere HA does not automatically discover the new setting. You must manually disable and then re-enable HA so it can discover the vmknic. If you configure the vmknic and the vSAN cluster first, and then enable HA on the cluster, it does discover the vmknic.
Workaround: Manually disable vSphere HA on the cluster, and then re-enable it.
After you disable and delete the iSCSI target service, some iSCSI objects remain in the vSAN datastore
If you use the Web Client to remove all iSCSI targets and LUNs, and disable the iSCSI target service, the iSCSI home object still exists in the vSAN datastore.
Workaround: To delete the iSCSI home object and all metadata associated with the iSCSI target service, run the following command on any host in the cluster: esxcli vsan iscsi homeobject delete
iSCSI I/O operation might be interrupted during iSCSI target failover
During iSCSI target failover, the iSCSI I/O operations might be interrupted. A host failure or a host reboot might trigger an iSCSI target failover.
Workaround: Retry the session from the iSCSI initiator.
iSCSI MCS is not supported
vSAN iSCSI target service does not support Multiple Connections per Session (MCS).
Any iSCSI initiator can discover iSCSI targets
vSAN iSCSI target service allows any initiator on the network to discover iSCSI targets.
Workaround: You can isolate your ESXi hosts from iSCSI initiators by placing them on separate VLANs.
After resolving network partition, some VM operations on linked clone VMs might fail
Some VM operations on linked clone VMs that are not producing I/O inside the guest operating system might fail. The operations that might fail include taking snapshots and suspending the VMs. This problem can occur after a network partition is resolved, if the parent base VM's namespace is not yet accessible. When the parent VM's namespace becomes accessible, HA is not notified to power on the VM.
Workaround: Power cycle VMs that are not actively running I/O operations.
When you log out of the Web Client after using the Configure vSAN wizard, some configuration tasks might fail
The Configure vSAN wizard might require up to several hours to complete the configuration tasks. You must remain logged in to the Web Client until the wizard completes the configuration. This problem usually occurs in clusters with many hosts and disk groups.
Workaround: If some configuration tasks failed, perform the configuration again.
New policy rules ignored on hosts with older versions of ESXi software
This might occur when you have two or more vSAN clusters, with one cluster running the latest software and another cluster running an older software version. The vSphere Web Client displays policy rules for the latest vSAN software, but those new policies are not supported on the older hosts. For example, RAID-5/6 (Erasure Coding) – Capacity is not supported on hosts running 6.0U1 or earlier software. You can configure the new policy rules and apply them to any VMs and objects, but they are ignored on hosts running the older software version.
Snapshot memory objects are not displayed in the Used Capacity Breakdown of the vSAN Capacity monitor
For Virtual Machines created with hardware version lower than 10, the snapshot memory is included in the Vmem objects on the Used Capacity Breakdown.
Workaround: To view snapshot memory objects in the Used Capacity Breakdown, create Virtual Machines with hardware version 10 or higher.
Storage Usage reported in VM Summary page might appear larger after upgrading to vSAN 6.5 or later
In previous releases of vSAN, the value reported for VM Storage Usage was the space used by a single copy of the data. For example, if the guest wrote 1 GB to a thin-provisioned object with two mirrors, the Storage Usage was shown as 1 GB. In vSAN 6.5 and later, the Storage Usage field displays the actual space used, including all copies of the data. So if the guest writes 1 GB to a thin-provisioned object with two mirrors, the Storage Usage is shown as 2 GB. The reported storage usage on some VMs might appear larger after upgrading to vSAN 6.5, but the actual space consumed did not increase.
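The change in reporting can be shown as a short calculation. This is a simplified model of the example above, not the actual vSAN accounting: it ignores witness components and metadata overhead, and the function and parameter names are hypothetical.

```python
def reported_storage_usage_gb(written_gb: float, copies: int,
                              counts_all_copies: bool) -> float:
    """Return the value shown in the VM Storage Usage field.

    counts_all_copies=False models releases before vSAN 6.5 (one copy).
    counts_all_copies=True models vSAN 6.5 and later (all copies).
    """
    if copies < 1:
        raise ValueError("an object has at least one copy")
    return written_gb * copies if counts_all_copies else written_gb


if __name__ == "__main__":
    # Guest writes 1 GB to a thin-provisioned object with two mirrors.
    print(reported_storage_usage_gb(1, 2, counts_all_copies=False))
    print(reported_storage_usage_gb(1, 2, counts_all_copies=True))
```

In both cases the space consumed on the datastore is the same. Only the reported figure differs.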
Cannot place a witness host in Maintenance Mode
When you attempt to place a witness host in Maintenance Mode, the host remains in the current state and you see the following notification: A specified parameter was not correct.
Workaround: When placing a witness host in Maintenance Mode, choose the No data migration option.
Moving the witness host into and then out of a stretched cluster leaves the cluster in a misconfigured state
If you place the witness host in a vSAN-enabled vCenter cluster, an alarm notifies you that the witness host cannot reside in the cluster. But if you move the witness host out of the cluster, the cluster remains in a misconfigured state.
Workaround: Move the witness host out of the vSAN stretched cluster, and reconfigure the stretched cluster. For more information, see Knowledge Base article 2130587.
When a network partition occurs in a cluster which has an HA heartbeat datastore, VMs are not restarted on the other data site
When the preferred or secondary site in a vSAN cluster loses its network connection to the other sites, VMs running on the site that loses network connectivity are not restarted on the other data site, and the following error might appear: vSphere HA virtual machine HA failover failed.
This is expected behavior for vSAN clusters.
Workaround: Do not select HA heartbeat datastore while configuring vSphere HA on the cluster.
Unmounted vSAN disks and disk groups displayed as mounted in the vSphere Web Client Operational Status field
After vSAN disks or disk groups are unmounted, either by running the esxcli vsan storage diskgroup unmount command or by the vSAN Device Monitor service when disks show persistently high latencies, the vSphere Web Client incorrectly displays the Operational Status field as mounted.
Workaround: Use the Health field to verify disk status, instead of the Operational Status field.
On-disk format upgrade displays disks not on vSAN
When you upgrade the disk format, vSAN might incorrectly display disks that were removed from the cluster. The UI also might show the version status as mixed. This display issue usually occurs after one or multiple disks are manually unmounted from the cluster. It does not affect the upgrade process. Only the mounted disks are checked. The unmounted disks are ignored.
All vSAN clusters share the same external proxy settings
All vSAN clusters share the same external proxy settings, even if you set the proxy at the cluster level. vSAN uses external proxies to connect to Support Assistant, the Customer Experience Improvement Program, and the HCL database, if the cluster does not have direct Internet access.
VMs in a stretched cluster become inaccessible when preferred site is isolated, then regains connectivity only to the witness host
When the preferred site becomes unavailable or loses its network connection to the secondary site and the witness host, the secondary site forms a cluster with the witness host and continues storage operations. Data on the preferred site might become outdated over time. If the preferred site then reconnects to the witness host but not to the secondary site, the witness host leaves the cluster it is in and forms a cluster with the preferred site, and some VMs might become inaccessible because they do not have access to the most recent data in this cluster.
Workaround: Before you reconnect the preferred site to the cluster, mark the secondary site as the preferred site. After the sites are resynchronized, you can mark the site you want to use as the preferred site.
Storage Consumption Model for VM Storage Policy wizard shows incorrect information
If one or more hosts in a vSAN cluster is not running software version 6.0 Update 2 or later, the Storage Consumption Model for the VM Storage Policy wizard might show incorrect information when you select RAID 5/6 as the failure tolerance method.
Workaround: Upgrade all hosts to the latest software version.