
HA Advanced Settings

Below are some of the advanced HA settings available in vSphere 5 and prior

Please note that each bullet details the versions that support that advanced setting:

  • das.maskCleanShutdownEnabled – 5.0 only
    Whether the clean shutdown flag will default to false for an inaccessible and poweredOff VM. Enabling this option will trigger VM failover if the VM’s home datastore isn’t accessible when it dies or is intentionally powered off.
  • das.ignoreInsufficientHbDatastore – 5.0 only
    Suppress the host config issue that the number of heartbeat datastores is less than das.heartbeatDsPerHost. Default value is “false”. Can be configured as “true” or “false”.
  • das.heartbeatDsPerHost – 5.0 only
    The number of required heartbeat datastores per host. The default value is 2; value should be between 2 and 5.
  • das.failuredetectiontime – 4.1 and prior
    Timeout, in milliseconds, before the isolation response action is triggered (default: 15000 milliseconds). Pre-vSphere 4.0 it was a general best practice to increase the value to 60000 when an active/standby Service Console setup was used; this is no longer needed. For a host with two Service Consoles or a secondary isolation address, a failure detection time of 15000 is recommended.
  • das.isolationaddress[x] – 5.0 and prior
    IP address the ESX host uses to check for isolation when no heartbeats are received, where [x] = 0 ‐ 9. VMware HA will use the default gateway as an isolation address and the provided value as an additional checkpoint. I recommend adding an isolation address when a secondary service console is being used for redundancy purposes. Start at das.isolationaddress1 when adding a second gateway.
  • das.usedefaultisolationaddress – 5.0 and prior
    Value can be “true” or “false” and needs to be set to false in case the default gateway, which is the default isolation address, should not or cannot be used for this purpose. In other words, if the default gateway is a non-pingable address, set the “das.isolationaddress0” to a pingable address and disable the usage of the default gateway by setting this to “false”.
  • das.isolationShutdownTimeout – 5.0 and prior
    Time in seconds to wait for a VM to become powered off after initiating a guest shutdown, before forcing a power off.
  • das.allowNetwork[x] – 5.0 and prior
    Enables the use of port group names to control the networks used for VMware HA, where [x] = 0 – ?. You can set the value to “Service Console 2” or “Management Network” to use (only) the networks associated with those port group names in the networking configuration.
  • das.bypassNetCompatCheck – 4.1 and prior
    Disable the “compatible network” check for HA that was introduced with ESX 3.5 Update 2. Disabling this check will enable HA to be configured in a cluster which contains hosts in different subnets, so-called incompatible networks. Default value is “false”; setting it to “true” disables the check.
  • das.ignoreRedundantNetWarning – 5.0 and prior
    Remove the error icon/message from your vCenter when you don’t have a redundant Service Console connection. Default value is “false”, setting it to “true” will disable the warning. HA must be reconfigured after setting the option.
  • das.vmMemoryMinMB – 5.0 and prior
    The minimum default slot size used for calculating failover capacity. Higher values will reserve more space for failovers. Do not confuse with “das.slotMemInMB”.
  • das.slotMemInMB – 5.0 and prior
    Sets the slot size for memory to the specified value. This advanced setting can be used when a virtual machine with a large memory reservation skews the slot size, as this will typically result in an artificially conservative number of available slots.
  • das.vmCpuMinMHz – 5.0 and prior
    The minimum default slot size used for calculating failover capacity. Higher values will reserve more space for failovers. Do not confuse with “das.slotCpuInMHz”.
  • das.slotCpuInMHz – 5.0 and prior
    Sets the slot size for CPU to the specified value. This advanced setting can be used when a virtual machine with a large CPU reservation skews the slot size, as this will typically result in an artificially conservative number of available slots.
  • das.sensorPollingFreq – 4.1 and prior
    Set the time interval for HA status updates. As of vSphere 4.1, the default value of this setting is 10. It can be configured between 1 and 30, but it is not recommended to decrease this value as it might lead to less scalability due to the overhead of the status updates.
  • das.perHostConcurrentFailoversLimit – 5.0 and prior
    By default, HA will issue up to 32 concurrent VM power-ons per host. This setting controls the maximum number of concurrent restarts on a single host. Setting a larger value will allow more VMs to be restarted concurrently but will also increase the average latency to recover as it adds more stress on the hosts and storage.
  • das.config.log.maxFileNum – 5.0 only
    Desired number of log rotations.
  • das.config.log.maxFileSize – 5.0 only
    Maximum file size in bytes of the log file.
  • das.config.log.directory – 5.0 only
    Full directory path used to store log files.
  • das.maxFtVmsPerHost – 5.0 and prior
    The maximum number of primary and secondary FT virtual machines that can be placed on a single host. The default value is 4.
  • das.includeFTcomplianceChecks – 5.0 and prior
    Controls whether vSphere Fault Tolerance compliance checks should be run as part of the cluster compliance checks. Set this option to false to avoid cluster compliance failures when Fault Tolerance is not being used in a cluster.
  • das.iostatsinterval (VM Monitoring) – 5.0 and prior
    The I/O stats interval determines if any disk or network activity has occurred for the virtual machine. The default value is 120 seconds.
  • das.failureInterval (VM Monitoring) – 5.0 and prior
    The polling interval for failures. Default value is 30 seconds.
  • das.minUptime (VM Monitoring) – 5.0 and prior
    The minimum uptime in seconds before VM Monitoring starts polling. The default value is 120 seconds.
  • das.maxFailures (VM Monitoring) – 5.0 and prior
    Maximum number of virtual machine failures within the specified “das.maxFailureWindow”. If this number is reached, VM Monitoring doesn’t restart the virtual machine automatically. Default value is 3.
  • das.maxFailureWindow (VM Monitoring) – 5.0 and prior
    The window of time, in seconds, within which failures are counted. Default value is 3600 seconds. If a virtual machine fails more than “das.maxFailures” times within 3600 seconds, VM Monitoring doesn’t restart the machine. See the sketch after this list for how the VM Monitoring settings interact.
  • das.vmFailoverEnabled (VM Monitoring) – 5.0 and prior
    If set to “true”, VM Monitoring is enabled. When it is set to “false”, VM Monitoring is disabled.
  • das.config.fdm.deadIcmpPingInterval – 5.0 only
    Default value is 10. ICMP pings are used to determine whether a slave host is network accessible when the FDM on that host is not connected to the master. This parameter controls the interval (expressed in seconds) between pings.
  • das.config.fdm.icmpPingTimeout – 5.0 only
    Default value is 5. Defines the time to wait in seconds for an ICMP ping reply before assuming the host being pinged is not network accessible.
  • das.config.fdm.hostTimeout – 5.0 only
    Default is 10. Controls how long a master FDM waits in seconds for a slave FDM to respond to a heartbeat before declaring the slave host not connected and initiating the workflow to determine whether the host is dead, isolated, or partitioned.
  • das.config.fdm.stateLogInterval – 5.0 only
    Default is 600. Frequency in seconds to log cluster state.
  • das.config.fdm.ft.cleanupTimeout – 5.0 only
    Default is 900. When a vSphere Fault Tolerance VM is powered on by vCenter Server, vCenter Server informs the HA master agent that it is doing so. This option controls how many seconds the HA master agent waits for the power on of the secondary VM to succeed. If the power on takes longer than this time (most likely because vCenter Server has lost contact with the host or has failed), the master agent will attempt to power on the secondary VM.
  • das.config.fdm.storageVmotionCleanupTimeout – 5.0 only
    Default is 900. When a Storage vMotion is done in an HA-enabled cluster using pre-5.0 hosts and the home datastore of the VM is being moved, HA may interpret the completion of the Storage vMotion as a failure and may attempt to restart the source VM. To avoid this issue, the HA master agent waits the specified number of seconds for a Storage vMotion to complete. When the Storage vMotion completes or the timer expires, the master will assess whether a failure occurred.
  • das.config.fdm.policy.unknownStateMonitorPeriod – 5.0 only
    Defines the number of seconds the HA master agent waits after it detects that a VM has failed before it attempts to restart the VM.
  • das.config.fdm.event.maxMasterEvents – 5.0 only
    Default is 1000. Defines the maximum number of events cached by the master.
  • das.config.fdm.event.maxSlaveEvents – 5.0 only
    Default is 600. Defines the maximum number of events cached by a slave.
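
The four VM Monitoring settings above work together as a single policy. Below is a minimal illustrative model in Python of how das.minUptime, das.maxFailures, and das.maxFailureWindow interact when deciding whether to restart a failed VM. This is my own sketch of the behavior described above, not VMware’s actual implementation:

```python
# Illustrative model only; VMware's real VM Monitoring logic is internal to HA.
def should_restart(uptime, failure_times, now,
                   min_uptime=120,           # das.minUptime (seconds)
                   max_failures=3,           # das.maxFailures
                   max_failure_window=3600): # das.maxFailureWindow (seconds)
    """Decide whether VM Monitoring would restart a VM that just failed.

    uptime: seconds the VM has been running since its last power-on.
    failure_times: timestamps (in seconds) of earlier failures of this VM.
    now: current timestamp (in seconds).
    """
    # Polling only starts once the VM has been up for das.minUptime.
    if uptime < min_uptime:
        return False
    # Count earlier failures that fall inside the rolling window.
    recent = [t for t in failure_times if now - t <= max_failure_window]
    # Once das.maxFailures is reached within the window, stop restarting.
    return len(recent) < max_failures
```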

Basic design principle: Avoid using advanced settings as much as possible, as they increase complexity.

Always disable and re-enable HA to activate any changes.
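
If you do need to apply one of these options, you can do so in the vSphere Client (cluster settings, HA, Advanced Options) or programmatically. Below is a minimal sketch using Python and the pyVmomi SDK; the vCenter address, credentials, cluster name, and the chosen option values are placeholders for illustration, not a definitive implementation:

```python
# Sketch only: sets two HA advanced options on a cluster via pyVmomi.
# All names and credentials below are placeholders.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com",
                  user="administrator", pwd="secret")
try:
    content = si.RetrieveContent()
    # Find the target cluster by name (hypothetical name "Cluster01").
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster01")
    view.DestroyView()

    # HA advanced settings are plain key/value pairs on the das config.
    das = vim.cluster.DasConfigInfo()
    das.option = [
        vim.option.OptionValue(key="das.isolationaddress1",
                               value="192.168.1.1"),
        vim.option.OptionValue(key="das.usedefaultisolationaddress",
                               value="false"),
    ]
    spec = vim.cluster.ConfigSpecEx(dasConfig=das)
    # modify=True merges these settings into the existing cluster config.
    cluster.ReconfigureComputeResource_Task(spec, modify=True)
    # Remember: disable and re-enable HA afterwards to activate the changes.
finally:
    Disconnect(si)
```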

Useful KB Links

Advanced Configuration options for VMware High Availability for pre-5.0

Setting Multiple Isolation Response Addresses for VMware High Availability


Enabling Host Monitoring in HA Clusters

VMware HA clusters enable a collection of ESX/ESXi hosts to work together so that, as a group, they provide higher levels of availability for virtual machines than each ESX/ESXi host could provide individually. When you plan the creation and usage of a new VMware HA cluster, the options you select affect the way that cluster responds to failures of hosts or virtual machines.
Before creating a VMware HA cluster, you should be aware of how VMware HA identifies host failures and isolation and responds to these situations. You also should know how admission control works so that you can choose the policy that best fits your failover needs. After a cluster has been established, you can customize its behavior with advanced attributes and optimize its performance by following recommended best practices.

How VMware HA works

VMware HA provides high availability for virtual machines by pooling them and the hosts they reside on into a cluster. Hosts in the cluster are monitored and in the event of a failure, the virtual machines on a failed host are restarted on alternate hosts.

Primary and Secondary Hosts in a VMware HA Cluster

When you add a host to a VMware HA cluster, an agent is uploaded to the host and configured to communicate with other agents in the cluster. The first five hosts added to the cluster are designated as primary hosts, and all subsequent hosts are designated as secondary hosts. The primary hosts maintain and replicate all cluster state and are used to initiate failover actions. If a primary host is removed from the cluster, VMware HA promotes another host to primary status.
Any host that joins the cluster must communicate with an existing primary host to complete its configuration (except when you are adding the first host to the cluster). At least one primary host must be functional for VMware HA to operate correctly. If all primary hosts are unavailable (not responding), no hosts can be successfully configured for VMware HA.

Failure Detection and Host Network Isolation

Agents communicate with each other and monitor the liveness of the hosts in the cluster. This is done through the exchange of heartbeats, by default, every second. If a 15-second period elapses without the receipt of heartbeats from a host, and the host cannot be pinged, it is declared failed. In the event of a host failure, the virtual machines running on that host are failed over, that is, restarted on the alternate hosts with the most available unreserved capacity (CPU and memory).

Note: In the event of a host failure, VMware HA does not fail over any virtual machines to a host that is in maintenance mode, because such a host is not considered when VMware HA computes the current failover level. When a host exits maintenance mode, the VMware HA service is re-enabled on that host, so it becomes available for failover again.

Host network isolation occurs when a host is still running, but it can no longer communicate with other hosts in the cluster. With default settings, if a host stops receiving heartbeats from all other hosts in the cluster for more than 12 seconds, it attempts to ping its isolation addresses. If this also fails, the host declares itself as isolated from the network.
When the isolated host’s network connection is not restored for 15 seconds or longer, the other hosts in the cluster treat it as failed and attempt to fail over its virtual machines. However, when an isolated host retains access to the shared storage it also retains the disk lock on virtual machine files. To avoid potential data corruption, VMFS disk locking prevents simultaneous write operations to the virtual machine disk files, so attempts to fail over the isolated host’s virtual machines fail. By default, the isolated host shuts down its virtual machines, but you can change the host isolation response to Leave powered on or Power off.

Redundancy and Reducing Isolation

If you ensure that your network infrastructure is sufficiently redundant and that at least one network path is available at all times, host network isolation should be a rare occurrence.

Which setting should I use?

  • Shut down

It depends. Some people prefer “Shut down” because they do not want to keep running VMs on a deprecated host, and because it shuts the VMs down in a clean manner.

The isolation response is a setting that needs to be taken into account when you create your design. For instance, when using an iSCSI array or NFS, choosing “Leave powered on” as your default isolation response might lead to a split-brain situation, depending on the version of ESX used. The reason is that the disk lock times out if the iSCSI network is also unavailable, so the VM is restarted on a different host while it has not been powered off on the original host. In a normal situation this should not lead to problems, as the VM is restarted and the host on which it runs owns the lock on the VMDK; but when disaster strikes you will rarely end up in a normal situation, and you might well end up in an exceptional one.

  • Leave Powered On

Many people prefer to use “Leave powered on” because it reduces the chances of a false positive. A false positive in this case is an isolated heartbeat network but a non-isolated VM network and a non-isolated iSCSI / NFS network.

How does HA know if the host is isolated or completely unavailable when you have selected “Leave powered on”?

HA actually does not know the difference. HA will try to restart the affected VMs in both cases. When the host has failed, a restart will take place, but if a host is merely isolated the non-isolated hosts will not be able to restart the affected VMs. This is because of the VMDK file lock; no other host will be able to boot a VM while the files are locked. When a host actually fails, this lock expires and a restart can occur.

Isolation Response Considerations

The default value for isolation/failure detection is 15 seconds. In other words the failed or isolated host will be declared dead by the other hosts in the HA cluster on the fifteenth second and a restart will be initiated by one of the primary hosts.

For now let’s assume the isolation response is “power off”. The “power off” (isolation response) will be initiated by the isolated host one second before the das.failuredetectiontime elapses. A “power off” will be initiated on the fourteenth second and a restart will be initiated on the fifteenth second.

Does this mean that you can end up with your VMs being down and HA not restarting them?
Yes. When the heartbeat returns between the 14th and 15th second, the “power off” could already have been initiated. The restart, however, will not be initiated, because the heartbeat indicates that the host is no longer isolated.
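
To make the timing concrete, here is a quick sketch (my own illustration of the pre-5.0 behavior described above) showing where the power off and the restart fall relative to das.failuredetectiontime:

```python
# Illustrative timeline, in seconds after the last received heartbeat,
# based on the pre-5.0 behavior described above.
def isolation_timeline(failure_detection_time_ms=15000):  # das.failuredetectiontime
    fdt = failure_detection_time_ms / 1000.0
    return {
        "isolation response (e.g. power off) on the isolated host": fdt - 1,
        "restart initiated by a primary host": fdt,
    }

# Defaults: power off at second 14, restart at second 15. A heartbeat that
# returns inside that one-second gap leaves the VM powered off with no restart.
print(isolation_timeline())       # {...: 14.0, ...: 15.0}

# With 30000 ms, a transient outage must last 29 seconds before the isolation
# response even fires, making such a false positive far less likely.
print(isolation_timeline(30000))  # {...: 29.0, ...: 30.0}
```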

How can you avoid this?

Pick “Leave VM powered on” as an isolation response. Increasing the das.failuredetectiontime will also decrease the chances of running into issues like these.

Basic design principle: Increase “das.failuredetectiontime” to 30 seconds (30000) to decrease the likelihood of a false positive.

Please see the link below for further information:

http://rickardnobel.se/vmware-ha-das-failuredetectiontime/