Archive for July 2014

HA in VMware vSphere 5.x – What actually happens?


The HA Question?

We were asked what actually happens to the hosts and VMs in vSphere 5.5 if an isolation event is triggered and the host Management Network is completely lost. (Which I have seen happen in the past!) I have written several blog posts about HA in the HA Category so I am not going to go back over these. I am just going to focus on this question with our settings, which are shown below.

It is important to note that the restarting by VMware HA of virtual machines on other hosts in the cluster in the event of a host isolation or host failure is dependent on the “Host Monitoring” setting. If Host Monitoring is disabled, the restart of virtual machines on other hosts following a host failure or isolation is also disabled.

On our Non-Production Cluster and our Production Cluster we have HA enabled and Enable Host Monitoring turned on, with Leave Powered On as our default isolation response.
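
These settings can also be checked or set from PowerCLI. A minimal sketch, assuming PowerCLI is installed, an existing Connect-VIServer session, and that the cluster name matches your environment:

# Show the HA-related settings for every cluster
Get-Cluster | Select-Object Name, HAEnabled, HARestartPriority, HAIsolationResponse

# "DoNothing" is the PowerCLI name for the "Leave Powered On" isolation response
# (the Enable Host Monitoring toggle itself is not a Set-Cluster parameter;
# it lives in the cluster's HA settings in the client)
Set-Cluster -Cluster 'Production' -HAEnabled:$true -HAIsolationResponse DoNothing -Confirm:$false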


The vSphere HA architecture comprises Master and Slave HA (FDM) agents. Except during network partitions, there is one master per cluster. The master agent is responsible for monitoring the health of virtual machines and restarting any that fail. The slaves are responsible for sending information to the master and restarting virtual machines as instructed by the master.
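
To see which role each host currently holds, the FDM runtime state can be queried. A minimal PowerCLI sketch, assuming an existing Connect-VIServer session and a cluster named 'Production' (a lab name):

# Report the HA (FDM) role of every host in the cluster
Get-Cluster 'Production' | Get-VMHost | ForEach-Object {
    # DasHostState is only populated while HA is enabled on the cluster;
    # typical values are master, connectedToMaster, networkIsolated and election
    '{0}: {1}' -f $_.Name, $_.ExtensionData.Runtime.DasHostState.State
}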


When an HA cluster is created, it begins by electing a master, which will try to gain ownership of all the datastores it can access directly, or by proxying requests to one of the slaves over the management network. It does this by locking a file called protectedlist that is stored on the datastores. The master will also try to take ownership of any datastores it discovers along the way and will periodically retry any datastores it could not access previously.

The master uses the protectedlist file to store the inventory and keep track of the virtual machines protected by HA. It then distributes the inventory across all the datastores.
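
These files live in a hidden .vSphere-HA folder at the root of each datastore and can be browsed with PowerCLI's datastore provider. A minimal sketch; the datastore name is a lab example:

# Map a PowerShell drive onto the datastore and list the HA files
$ds = Get-Datastore 'Datastore1'
New-PSDrive -Name hads -PSProvider VimDatastore -Root '\' -Datastore $ds | Out-Null
Get-ChildItem 'hads:\.vSphere-HA' -Recurse   # protectedlist, poweron files, etc.
Remove-PSDrive -Name hads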


There is also a file called poweron located on a shared datastore which contains a list of powered-on virtual machines. This file is used by slaves to inform the master that they are isolated: the top line of the file contains a 0 or a 1, with 1 meaning isolated.
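
To inspect the isolation flag, a host's poweron file can be copied off the datastore and its first line read. A hedged sketch: the FDM folder and file names below are illustrative placeholders (the real folder name includes the cluster UUID, and each file is named after a host ID):

# Copy a poweron file locally and read the isolation flag on its first line
$ds = Get-Datastore 'Datastore1'
New-PSDrive -Name hads -PSProvider VimDatastore -Root '\' -Datastore $ds | Out-Null
Copy-DatastoreItem -Item 'hads:\.vSphere-HA\FDM-<cluster-uuid>\host-123-poweron' -Destination 'C:\temp\'
Get-Content 'C:\temp\host-123-poweron' -TotalCount 1   # 1 = isolated, 0 = not isolated
Remove-PSDrive -Name hads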


Datastore Heartbeating

In vSphere versions prior to 5.0, machine restarts were always attempted, even if it was only the Management network which went down and the rest of the VM networks were running fine. This was not a desirable situation. VMware therefore introduced the concept of Datastore Heartbeating, which adds much more resiliency and prevents the false positives that previously resulted in VMs restarting unnecessarily.

Datastore Heartbeating is used when a master has lost network connectivity with a slave. The Datastore Heartbeating mechanism is then used to determine whether the host has failed or is merely isolated/network partitioned, which is validated through the poweron file as mentioned previously. By default HA picks 2 heartbeat datastores. To see which datastores have been selected, click on the cluster name and select Cluster Status.
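
The selected heartbeat datastores can also be pulled from the API. A minimal PowerCLI sketch, assuming an existing Connect-VIServer session; the cluster name is from my lab:

# Retrieve the datastores HA has chosen for heartbeating
$clusterView = Get-View -ViewType ClusterComputeResource -Filter @{ 'Name' = 'Production' }
$dasInfo = $clusterView.RetrieveDasAdvancedRuntimeInfo()
$dasInfo.HeartbeatDatastoreInfo | ForEach-Object { (Get-View $_.Datastore).Name }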


Isolation and Network Partitioning

A host is considered to be either isolated or network partitioned when it loses network access to a master but has not completely failed.

Isolation

  • A host is not receiving any heartbeats from the master
  • A host is not receiving any election traffic
  • A host cannot ping the isolation address (see the sketch after this list for how this address is configured)
  • Virtual machines may be restarted depending on the isolation response
  • A VM will only be shut down or powered off when the isolated host knows there is a master out there that has taken ownership of the VM, or when the isolated host loses access to the home datastore of the VM
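
By default the isolation address is the default gateway of the management network, but extra addresses can be added through the cluster's HA advanced options. A minimal PowerCLI sketch; the IP address is an example:

# Add an extra isolation address and stop HA using the default gateway for the check
$cluster = Get-Cluster 'Production'
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name 'das.isolationaddress0' -Value '10.1.1.1' -Confirm:$false
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name 'das.usedefaultisolationaddress' -Value 'false' -Confirm:$false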

Network Partitioning

  • A host is not receiving any heartbeats from the master
  • A host is receiving election traffic
  • An election process will take place, the state will be reported to vCenter, and virtual machines may be restarted depending on the isolation response

What happens if? 

  • The Master fails

If the slaves have not received any network heartbeats from the master, the slaves will try to elect a new master. The new master will gather the required information and restart the VMs. The datastore lock will expire, and the newly elected master will relock the file if it has access to the datastore.

  • A Slave fails

As well as monitoring the slave hosts, the master receives heartbeats from the slaves every second. If a slave fails or becomes isolated, the master will check for connectivity for 15 seconds, then check whether the host is still heartbeating to the datastore, and then try to ping the host's management address. If the datastore and ping checks both prove negative, the host will be declared failed; the master will then determine which VMs need to be restarted and will try to distribute them fairly across the remaining hosts.

  • Power Outage

If there is a power outage and all hosts power down suddenly, then as soon as power is restored to the hosts, an election process will be kicked off and a master will be elected. The master reads the protectedlist file, which contains all VMs protected by HA, and then initiates restarts for those VMs which are listed as protected but not running.

  • Complete Management Network failure

First of all, it is a very rare scenario for the Management Network to become unavailable on all the running hosts in the cluster at the same time. VMware recommends configuring redundant vmnics for each host's management VMkernel port, with each vmnic going to a different physical switch for full redundancy.


If all the ESXi hosts lose the Management Network, the master and the slaves will remain in their current states, as no election can happen: the FDM agents communicate through the Management Network. Because the master can still read the protectedlist file and the poweron file on the datastores, it can tell whether there is a complete failure of the Management Network, a failure of itself or a slave, or an isolation/network partition event. Each host will fail to ping the isolation address and will declare itself isolated. It will then trigger the isolation response, which in our case is to leave VMs powered on.

A host remains isolated until it observes HA network traffic, such as election messages, or it starts getting a response from an isolation address. This means that as long as the host is in an “isolated state” it will continue to validate its isolation by pinging the isolation address. As soon as the isolation address responds, it will initiate or join an election process and the cluster will return to a normal state.

Useful Link

Thanks to Iwan Rahabok 🙂

http://virtual-red-dot.blogspot.co.uk/2012/02/vsphere-ha-isolation-partition-and.html


DFS Troubleshooting on Windows Server 2008 R2


DFS Troubleshooting

The DFS Management MMC is the tool for most common administration activities related to DFS Namespaces. It shows up under “Administrative Tools” after you add the DFS role service in Server Manager. You can also add just the MMC for remote management of a DFS namespace server. You can find this in Server Manager, under Add Feature, Remote Server Administration Tools (RSAT), Role Administration Tools, File Services Tools.

Another option for managing DFS is DFSUTIL.EXE, a command-line tool. There are many options, and you can perform almost any DFS-related activity, from creating a namespace and adding links to exporting the entire configuration and troubleshooting. This can be very handy for automating tasks by writing scripts or batch files. DFSUTIL.EXE is an in-box tool in Windows Server 2008.
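
As a quick example of the command-line route, run from PowerShell against my lab namespace (the export path is arbitrary):

# Display the namespace root and its links
dfsutil root \\dacmt.local\shared

# Export the namespace configuration for backup or offline inspection
dfsutil root export \\dacmt.local\shared C:\temp\shared-export.xml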

What can go wrong?

  • Access to the DFS namespace
  • Finding shared folders
  • Access to DFS links and shared folders
  • Security-related issues
  • Replication latency
  • Failure to connect to a domain controller to obtain a DFSN namespace referral
  • Failure to connect to a DFS server
  • Failure of the DFS server to provide a folder referral

Methods of Troubleshooting

I have a very basic lab set up with DFS running on 2 servers. I will be using this to demonstrate the troubleshooting methods.

My DFS Namespace is \\dacmt.local\shared

Troubleshooting Commands

  • dfsutil.exe /spcinfo

Determine whether the client was able to connect to a domain controller for domain information by using the DFSUtil.exe /spcinfo command. The output of this command describes the trusted domains and their domain controllers that are discovered by the client through DFSN referral queries. This is known as the “Domain Cache”.


  • start \\10.1.1.160 (where 10.1.1.160 is your DC)

This should pop up an Explorer window listing the shares hosted by your Domain Controller.


  • net view \\10.1.1.160 (where 10.1.1.160 is your DC)

A successful connection lists all shares that are hosted by the domain controller.


  • net view \\10.1.1.200 (Where 10.1.1.200 is your DFS Server)

This shows you your namespace and the shares held on your DFS server.


  • dfsutil.exe /pktinfo 

If the above connection tests are successful, determine whether a valid DFSN referral is returned to the client after it accesses the namespace. You can do this by viewing the referral cache (also known as the PKT cache) by using the DFSUtil.exe /pktinfo command

If you cannot find an entry for the desired namespace, this is evidence that the domain controller did not return a referral.
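
Tying the connection tests above together, here is the whole client-side triage sequence as one runnable sketch, using the lab addresses from this post:

# 1. Domain cache: did the client reach a domain controller?
dfsutil.exe /spcinfo
# 2. Can we enumerate shares on the DC and on the namespace server?
net view \\10.1.1.160
net view \\10.1.1.200
# 3. Referral (PKT) cache: is the namespace listed after accessing it?
dfsutil.exe /pktinfo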


  • dfsutil.exe cache domain flush
  • dfsutil.exe cache referral flush
  • dfsutil.exe cache provider flush


  • ipconfig /flushdns
  • dfsutil.exe /pktflush
  • dfsutil.exe /spcflush

By default, DFSN stores NetBIOS names for root servers. DFSN can also be configured to use DNS names for environments without WINS servers. For more information, see Microsoft Knowledge Base article 244380, linked at the end of this post.


  •  DFS and System Configuration

Even when connectivity and name resolution are functioning correctly, DFS configuration problems may cause errors on a client. DFS relies on up-to-date DFS configuration data, correctly configured service settings, and Active Directory site configuration.

First, verify that the DFS service is started on all domain controllers and on DFS namespace/root servers. If the service is started in all locations, make sure that no DFS-related errors are reported in the system event logs of the servers.
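
A quick way to check the service on several machines at once is a PowerShell one-liner; the computer names below are lab examples:

# Check the DFS Namespace service (service name 'Dfs') on the DCs and root servers
Get-Service -ComputerName dc1, dfs1 -Name Dfs | Select-Object MachineName, Name, Status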


  • repadmin /showrepl * dc=dacmt,dc=local

When an administrator makes a change to the domain-based namespace, the change is made on the Primary Domain Controller (PDC) emulator. Domain controllers and DFS root servers periodically poll the PDC for configuration information. If the PDC is unavailable, or if “Root Scalability Mode” is enabled, Active Directory replication latencies and failures may prevent servers from issuing correct referrals.


  • DFS and NTFS Permissions

If a client cannot gain access to a shared folder specified by a DFS link, check the following:

  • Use the DFS administrative tool to identify the underlying shared folder.
  • Check status to confirm that the DFS link and the shared folder (or replica set) to which it points are valid.
  • The user should go to the Windows Explorer DFS property page to determine the actual shared folder that he or she is attempting to connect to.
  • The user should attempt to connect to the shared folder directly by way of the physical namespace. By using a command such as ping, net view or net use, you can establish connectivity with the target computer and shared folder (see the example after this list).
  • If the DFS link has a replica set configured, be aware of the latency involved in content replication. Files and folders that have been modified on one replica might not yet have replicated to other replicas.
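
For example, if the DFS tab shows the client has been referred to a target server (the server and share names below are hypothetical), the direct checks look like this:

# Test basic reachability of the target server
ping fileserver1.dacmt.local
# List the shares it is offering
net view \\fileserver1.dacmt.local
# Map the underlying share directly, bypassing the namespace
net use Z: \\fileserver1.dacmt.local\Users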

It is also worth checking you do not have any general networking issues on the server you are connecting from and also that there are no firewall rules or Group Policies blocking File and Printer Sharing!

  • DFS Tab on DFS folders accessed through the DFS Namespace

When tracking an access-related issue with DFS, one of the first things to determine is the name of the underlying shared folder that the client has been referred to. Since Windows 2000, there has been a shell extension to Windows Explorer for precisely this purpose. When you right-click a folder that is in the DFS namespace, there is a DFS tab available in the Properties window. From the DFS tab, you can see which shared folder you are referencing for the DFS link. In addition, you can see the list of replicas that refer to the DFS link, so you can disconnect from one replica and select another. Finally, you can also refresh the referral cache for the specified DFS link. This makes the client obtain a new referral for the link from the DFS server.


  • Replication Latency

Because the topology knowledge is stored in the domain’s Active Directory, there is some latency before any modification to the DFS namespace is replicated to all domain controllers.

From an administrator's perspective, remember that the DFS administrative console connects directly to a domain controller. Therefore, the information that you see in one DFS administrative console might not be identical to the information in another DFS administrative console (which might be obtaining its information from a different domain controller).

From a client’s perspective, you have the additional possibility that the client itself might have cached the information before it was modified. So, even though the information about the modification might have replicated to all the domain controllers, and even if the DFS servers have obtained updates about the modification, the client might still be using an older cached copy. The ability to manually flush the cache before the referral time-out has expired, which is done from the DFS tab in the Properties window in Windows Explorer, can be useful in this situation.

  • DFSDiag /testdcs /domain:dacmt.local
  • DFSDiag /testsites /dfspath:"\\dacmt.local\Shared\Folder 1" /full
  • DFSDiag /testsites /dfspath:\\dacmt.local\Shared /recurse /full
  • DFSDiag /testdfsconfig /dfsroot:\\dacmt.local\Shared
  • DFSDiag /testdfsintegrity /dfsroot:\\dacmt.local\Shared
  • DFSDiag /testreferral /dfspath:\\dacmt.local\Shared

With the /testdcs option you can check the configuration of the domain controllers from your DFS server. It verifies that the DFS Namespace service is running on all the DCs and that its Startup Type is set to Automatic, checks for support of site-costed referrals for NETLOGON and SYSVOL, and verifies the consistency of site association by hostname and IP address on each DC.
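
To capture the namespace-level checks in one pass for later review, a small sketch (the output path is arbitrary):

# Run the namespace-level DFSDiag checks and keep the output
dfsdiag /testdfsconfig /dfsroot:\\dacmt.local\Shared > C:\temp\dfsdiag.log
dfsdiag /testdfsintegrity /dfsroot:\\dacmt.local\Shared >> C:\temp\dfsdiag.log
dfsdiag /testreferral /dfspath:\\dacmt.local\Shared >> C:\temp\dfsdiag.log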


DFSR and File Locking

DFS lacks a central feature important for a collaborative environment where inter-office file servers are mirrored and data is shared: File Locking. Without integrated file locking, using DFS to mirror file servers exposes live documents to version conflicts. For example, if a colleague in Office A can open and edit a document at the same time that a colleague in Office B is working on the same document, then DFS will only save the changes made by the person closing the file last.

There is also another potential for version conflict, which arises even when the two colleagues are not working on the same file at the same time. DFS Replication is a single-threaded, “pull” process. As a result, synchronisation tasks can quite easily queue up and create a backlog, so changes made at one location are not immediately replicated to the other side. It is this time delay which creates yet another opportunity for file version conflicts to occur.

http://blogs.technet.com/b/askds/archive/2009/02/20/understanding-the-lack-of-distributed-file-locking-in-dfsr.aspx

NETBIOS Considerations

In terms of NetBIOS, the default behaviour of DFS is to use NetBIOS names for all target servers in the namespace. This allows clients that support only NetBIOS name resolution to locate and connect to targets in a DFS namespace. Administrators can use NetBIOS names when specifying target names, and those exact paths are added to the DFS metadata. For example, an administrator can specify a target \\dacmt\Users, where dacmt is the NetBIOS name of a server whose DNS FQDN is dacmt.local.
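
To switch a root server to DNS-based (FQDN) referrals, the KB article below describes a registry change. A hedged sketch from memory of that article; verify against the KB and test before using in production:

# Per KB244380: have this root server issue FQDN-based referrals
# (run elevated, then restart the DFS Namespace service)
reg add HKLM\SYSTEM\CurrentControlSet\Services\Dfs /v DfsDnsConfig /t REG_DWORD /d 1 /f
Restart-Service -Name Dfs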

http://support.microsoft.com/kb/244380