Archive for June 2017

Rolling back and recovering from failed vSphere 6 and external multisite PSC upgrades

So what do you do when a vSphere upgrade fails partway through in a multisite environment like the following?

The Scenario

The requirement was to upgrade 3 vCenters from 5.1 U3 to 6.0 U2. The vCenter servers previously had embedded SSO but have since been repointed to external 6.0 U2 PSCs. Note that there was an intermediate stage: we first had to repoint the vCenters to external 5.5 U3 SSO servers before we could upgrade those to 6.0 U2 PSCs. Once the PSCs are in a multisite 6.0 U2 configuration, the upgrade of the vCenters can start, and this is where we need to take care because the systems are interlinked at this point.

So we now have:

  • 3 x vSphere 6.0 U2 PSCs set up in a multisite configuration
  • 3 x 5.1 U3 vCenters, each pointing to a vSphere 6.0 U2 PSC
  • 3 x SQL Server 2008 R2 instances hosting the vCenter databases and the Update Manager databases (not shown in the example above)

Why did it fail initially?

Never assume anything. We had already upgraded 2 environments without any issue, so we overlooked checking the SQL environments, which were meant to be replicas but clearly weren’t!

  • The DB password contained special characters.
  • We needed to grant db owner rights on the vCenter DB and the MSDB database prior to upgrading.
  • The ODBC connection was not using the right driver version. We needed SQL Native Client 10 or 11.
  • Additional database permissions were needed for VMware (these can be found in the vSphere 6.0 Documentation Center).
  • SQL Server 2008 R2 needed Service Pack 2 installing (no problem with this).
  • An informational message regarding the vCenter FQDN.
  • An informational message about the ALL_SERVICES accounts. (This is new in vSphere and depends on whether you want to use Local System for all your vCenter services or the multiple vSphere accounts for individual services.)
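
The first item is easy to check before you start. The snippet below is only a rough sketch: the post does not say which characters upset the installer, so it assumes that anything outside plain ASCII letters and digits is worth flagging. The function name is made up for illustration.

```python
import string

# Assumption: only ASCII letters and digits are treated as "safe".
# The actual set of characters the installer rejects is not known here.
SAFE_CHARS = set(string.ascii_letters + string.digits)


def special_characters(password):
    """Return the distinct characters in `password` that are not letters or digits."""
    return sorted({ch for ch in password if ch not in SAFE_CHARS})


if __name__ == "__main__":
    # Hypothetical password, for illustration only.
    flagged = special_characters("P@ss;w0rd!")
    if flagged:
        print("Consider changing the DB password before upgrading; found:", flagged)
```

An empty list means nothing was flagged under this assumption. It does not prove the password is acceptable to the installer.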

So how do you recover from one failed vCenter Upgrade in this scenario?

We need to work out at which step the upgrade failed and check the logs to verify this. Is it an issue that can be worked through and fixed forward, or do you need to roll back? How far the installer got before failing also affects which rollback options are open to you. For example, if it failed to authenticate to the database, it is unlikely to have made any DB changes. The installer logs will give us this information.

You can retrieve the installation log files manually for examination.

Procedure

1. Navigate to the installation log file locations.

  • %PROGRAMDATA%\VMware\vCenterServer\logs directory, usually C:\ProgramData\VMware\vCenterServer\logs
  • %TEMP% directory, usually C:\Users\username\AppData\Local\Temp

The files in the %TEMP% directory include vminst.log, pkgmgr.log, pkgmgr-comp-msi.log, and vim-vcs-msi.log.

2. Open the installation log files in a text editor for examination.
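
If you would rather not open each file by hand, a small script can do a first pass. This is a minimal sketch, not a VMware tool: it looks in the two locations listed above and prints any line containing "error" or "fail". Those keywords are my assumption for a first triage, not a documented log format, and the script assumes the logs are readable in the default text encoding (some MSI logs may be UTF-16 and need a different encoding).

```python
import os
from pathlib import Path

# Log names listed above for the %TEMP% directory.
TEMP_LOGS = ["vminst.log", "pkgmgr.log", "pkgmgr-comp-msi.log", "vim-vcs-msi.log"]

# Assumption: these keywords are only a starting point for triage.
KEYWORDS = ("error", "fail")


def find_suspect_lines(text, keywords=KEYWORDS):
    """Return (line_number, line) pairs whose text contains any keyword."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append((number, line.strip()))
    return hits


def candidate_logs():
    """Yield installer log paths that exist on this machine."""
    program_data = os.environ.get("PROGRAMDATA", r"C:\ProgramData")
    temp = os.environ.get("TEMP", "")
    log_dir = Path(program_data) / "VMware" / "vCenterServer" / "logs"
    if log_dir.is_dir():
        yield from sorted(log_dir.rglob("*.log"))
    if temp:
        for name in TEMP_LOGS:
            path = Path(temp) / name
            if path.is_file():
                yield path


if __name__ == "__main__":
    for log in candidate_logs():
        # errors="replace" keeps the scan going if a file has odd bytes.
        text = log.read_text(errors="replace")
        for number, line in find_suspect_lines(text):
            print(f"{log}:{number}: {line}")
```

Run it on the vCenter server as the user who ran the installer, so that %TEMP% resolves to the right profile. The output is only a pointer to where to start reading. You still need to read around each hit to see which step failed.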

Steps

  • Make sure you have been through all the prerequisites prior to starting the upgrade, and test in a lab as well. Give yourself the best chance of not failing.
  • Snapshot the whole environment (all VCs, all DBs and all PSCs). Make sure the environment is quiesced or, if you want a consistent image rather than a crash-consistent one, shut down the whole environment and snapshot it cold.
  • If the upgrade of any VC 5.x to 6.0 fails in this environment, you must roll back ALL PSCs. The reason is that the solution user format differs between 5.x and 6.x. Early in the VC 6.x upgrade (during vmafd firstboot, I believe), the VC 5.x solution users are removed from the PSC and replaced with 6.x solution users. If you have a failure and only roll back the VC, you are left with a VC with 5.x solution users talking to a PSC that is no longer aware of those users. You must roll back ALL PSCs in the SSO domain because they replicate.
  • If you encounter an unrecoverable upgrade installer error and have to roll back to snapshots, the order is all PSCs, then the vCenter DB, then the vCenter Server.
  • If the rollback above fails, all servers should be rolled back again, with the power-on order being all PSCs, then all vCenter DBs, then all vCenter servers.
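
The ordering in the last two steps is easy to get wrong under pressure, so it can help to write it down as data. The following is a sketch only: the host names are hypothetical, the role labels are my own, and it does not talk to vCenter or revert any snapshots. It just sorts a list of machines into the PSC, DB, vCenter order described above and warns if a PSC in the SSO domain has been left out.

```python
from dataclasses import dataclass

# Rollback / power-on priority taken from the steps above:
# PSCs first, then vCenter databases, then vCenter servers.
ROLE_ORDER = {"psc": 0, "vcenter_db": 1, "vcenter": 2}


@dataclass(frozen=True)
class Node:
    name: str  # hypothetical host name
    role: str  # one of the keys in ROLE_ORDER


def rollback_plan(nodes):
    """Return nodes sorted into the order they should be reverted and powered on."""
    unknown = [n.name for n in nodes if n.role not in ROLE_ORDER]
    if unknown:
        raise ValueError(f"unknown role for: {', '.join(unknown)}")
    return sorted(nodes, key=lambda n: (ROLE_ORDER[n.role], n.name))


def missing_pscs(nodes, sso_domain_pscs):
    """Return PSCs in the SSO domain that the plan does not cover."""
    planned = {n.name for n in nodes if n.role == "psc"}
    return sorted(set(sso_domain_pscs) - planned)


if __name__ == "__main__":
    # Hypothetical inventory for one site's failed upgrade.
    nodes = [
        Node("vc-01", "vcenter"),
        Node("psc-02", "psc"),
        Node("sql-01", "vcenter_db"),
        Node("psc-01", "psc"),
    ]
    for step, node in enumerate(rollback_plan(nodes), start=1):
        print(step, node.role, node.name)
    left_out = missing_pscs(nodes, ["psc-01", "psc-02", "psc-03"])
    if left_out:
        print("WARNING: PSCs not in the plan:", left_out)
```

The `missing_pscs` check exists because of the point made above: rolling back only some of the PSCs in a replicating SSO domain is not enough.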

Questions we have been asked

During an upgrade, could we stop/break the multisite replication agreements between PSCs to avoid any replication of issues in the event of an upgrade problem on one vCenter? 

There’s no issue with breaking replication agreements generally, but it is not something that should be done for an upgrade. The vdcrepadmin command line tool does allow the breaking and creating of agreements, and it actually protects the customer by not allowing them to delete the only agreement available. This prevents a customer from inadvertently creating an isolated PSC. What you would do is create the new agreements first and, once they are in place, delete the extra ones. There is nothing special about the replication agreements that are created during PSC deployment. They are the same as the ones created with vdcrepadmin.
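
To make the "create first, then delete" rule concrete, here is a small model of it. This is an illustration of the idea only, not how vdcrepadmin is implemented, and it does not call vdcrepadmin. The PSC names are hypothetical and agreements are treated as simple unordered pairs.

```python
def partners(agreements, psc):
    """Return the set of PSCs that share a replication agreement with `psc`."""
    return {other for pair in agreements if psc in pair for other in pair if other != psc}


def can_remove(agreements, a, b):
    """Return True if removing the a<->b agreement leaves neither PSC isolated."""
    pair = frozenset((a, b))
    if pair not in agreements:
        return False
    remaining = agreements - {pair}
    return bool(partners(remaining, a)) and bool(partners(remaining, b))


if __name__ == "__main__":
    # Hypothetical chain: psc-a <-> psc-b <-> psc-c
    agreements = {
        frozenset(("psc-a", "psc-b")),
        frozenset(("psc-b", "psc-c")),
    }
    # psc-a only has one agreement, so removing it would isolate psc-a.
    print(can_remove(agreements, "psc-a", "psc-b"))

    # Create the new agreement first...
    agreements = agreements | {frozenset(("psc-a", "psc-c"))}
    # ...and now the old one is an "extra" that can go.
    print(can_remove(agreements, "psc-a", "psc-b"))
```

In the first check the removal is refused because psc-a would be left with no partner. After the psc-a to psc-c agreement is added, the same removal is allowed.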