Dirmann Technology Consultants

VMware VCF 3.9.0.0 to 3.10.1.2 Upgrade Troubles

Team!

It feels good to be able to type up a post! It’s been a bit, and most of my free time has been allocated to the VCDX process. Shout out to my study group friends Jesper, Kyle, and Gio! Soon, I’ll start writing up the experiences that we have all encountered so that I can share them with all of you! It’s quite a journey, but that’s a story for another time. Also, I want to give a shout out to my buddy, Sandeep! He was in the weeds with me on this one. Anyways, let’s jump right into the meat of this one because it could really be a time saver.

What’s the deally?

Here’s the scoop: a customer of ours wants to upgrade to VCF 4.x. Through their partnership with VMware they have acquired services to assist with the upgrade, but since VMware still provides no direct upgrade path from 3.x to 4.x, I have to assume that they are still perfecting the process. VMware has told them straight up that they need to be at version 3.10.1.2 at a minimum, which makes me believe that this is going to be a hard requirement when the upgrade path becomes available. Anyhow, they are still sitting on VCF 3.9.0.0. We’ve got some upgrading to do. That’s where my team comes into play.

The Challenge

If you’re not familiar with VCF or version 3.x, in its most basic form we have a management domain with its own VCSA and PSC, backed by NSX-V on the network side and vSAN on the storage end. The workload domain, again, has its own VCSA and PSC, but it has NSX-T providing the network infrastructure, and also vSAN on the storage side. The entire instance is overseen by the SDDC Manager. What happens when you patch? Well, patching is executed through the SDDC Manager via its Lifecycle Manager. The first thing that occurs is that the SDDC Manager applies its own patches. Next comes the management domain’s NSX-V, followed by the PSC, the VCSA, and finally the hosts. Rinse and repeat for the workload domain (minus the SDDC Manager, of course).
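To make that ordering easier to eyeball, here is a minimal Python sketch of the sequence described above. It is only an illustration, not anything SDDC Manager exposes: the function name and step labels are made up for this post, and it assumes the workload domain repeats the same order with NSX-T in place of NSX-V.

```python
from typing import List

# Order described above for the management domain.
MANAGEMENT_ORDER = ["SDDC Manager", "NSX-V", "PSC", "VCSA", "hosts"]


def upgrade_sequence(include_workload_domain: bool = True) -> List[str]:
    """Return the patch order as a flat list of 'domain: component' steps."""
    steps = [f"management: {component}" for component in MANAGEMENT_ORDER]
    if include_workload_domain:
        # Assumption: the workload domain repeats the same order, skipping
        # SDDC Manager and using NSX-T in place of NSX-V.
        workload_order = [
            "NSX-T" if component == "NSX-V" else component
            for component in MANAGEMENT_ORDER
            if component != "SDDC Manager"
        ]
        steps += [f"workload: {component}" for component in workload_order]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(upgrade_sequence(), start=1):
        print(f"{number}. {step}")
```

The point of writing it down this way is simply that SDDC Manager goes first, so a failure there blocks everything behind it.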

To try to make life easier, I went for the gusto and decided to do a skip-level upgrade from 3.9.0.0 to 3.10.1.2. When VCF says “skip-level upgrade,” they really mean “apply every patch from 3.9.0.0+ to 3.10.1.2” instead of what you would think a skip-level upgrade really means. So I downloaded the Bundle Transfer Utility and Skip-Level Upgrade Tool from My VMware and got to work. We transferred the updates successfully, verified them, and were good to go. Simple. Let’s upgrade this bad boy.

Right out of the gate, we hit a snag -_- . The SDDC Manager update not only failed to complete, it failed to even start.

SLUCLI Error

“Failed to run date command”. So naturally, the first thing we do is check the log file (trimmed to save you some time)…

2021-09-28 21:19:54.590 [main] INFO [console-logger]

2021-09-28 21:20:10.786 [main] ERROR [com.vmware.evo.sddc.lcm.common.utils.SshCommandRunner]
Command '[date, +%s]' execution failed:
com.jcraft.jsch.JSchException: Auth fail
at com.jcraft.jsch.Session.connect(Session.java:519)
at com.jcraft.jsch.Session.connect(Session.java:183)
at com.vmware.evo.sddc.lcm.common.utils.RemoteCommandService.executeSshCommand(RemoteCommandService.java:42)
at com.vmware.evo.sddc.lcm.common.utils.SshCommandRunner.executeWithResult(SshCommandRunner.java:34)
at com.vmware.evo.sddc.lcm.tools.slu.util.RemoteFileOperations.runCmdOverSSH(RemoteFileOperations.java:1268)
at com.vmware.evo.sddc.lcm.tools.slu.util.RemoteFileOperations.getSddcManagerEpochTime(RemoteFileOperations.java:1241)
at com.vmware.evo.sddc.lcm.tools.slu.SkipLevelUpgradeCoordinator.performSkipLevelUpgrade(SkipLevelUpgradeCoordinator.java:1779)
at com.vmware.evo.sddc.lcm.tools.slu.SkipLevelUpgradeApplication.process(SkipLevelUpgradeApplication.java:293)
at com.vmware.evo.sddc.lcm.tools.slu.SkipLevelUpgradeApplication.main(SkipLevelUpgradeApplication.java:338)
2021-09-28 21:20:10.786 [main] ERROR [com.vmware.evo.sddc.lcm.tools.slu.util.RemoteFileOperations]
DATE_COMMAND_FAILED

2021-09-28 21:20:12.895 [main] ERROR [com.vmware.evo.sddc.lcm.tools.slu.SkipLevelUpgradeApplication]
Skip Level Upgrade Tool error
com.vmware.evo.sddc.lcm.tools.slu.error.SkipLevelUpgradeException: Failed to run date command.
at com.vmware.evo.sddc.lcm.tools.slu.util.RemoteFileOperations.runCmdOverSSH(RemoteFileOperations.java:1275)
at com.vmware.evo.sddc.lcm.tools.slu.util.RemoteFileOperations.getSddcManagerEpochTime(RemoteFileOperations.java:1241)
at com.vmware.evo.sddc.lcm.tools.slu.SkipLevelUpgradeCoordinator.performSkipLevelUpgrade(SkipLevelUpgradeCoordinator.java:1779)
at com.vmware.evo.sddc.lcm.tools.slu.SkipLevelUpgradeApplication.process(SkipLevelUpgradeApplication.java:293)
at com.vmware.evo.sddc.lcm.tools.slu.SkipLevelUpgradeApplication.main(SkipLevelUpgradeApplication.java:338)
Caused by: java.lang.NullPointerException: null
at com.vmware.evo.sddc.lcm.tools.slu.util.RemoteFileOperations.runCmdOverSSH(RemoteFileOperations.java:1269)
... 4 common frames omitted

2021-09-28 21:20:12.911 [main] ERROR [console-logger]
SDDC Manager Skip Level Upgrade Tool failed with error: Failed to run date command.
Caused by: java.lang.NullPointerException

Well, the thing that stands out the most is the glaring “Auth fail” message from the SSH connection. Again, naturally, the first thing we tried was to make sure that the credentials for the Primary User, the Basic authentication user, and root on the SDDC Manager were all what we thought they were. Sure enough, after testing them we deemed them valid. Valid credentials? Check! What else?
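If you would rather not scroll through stack traces by hand, a small script can pull out just the exception lines. This is a rough sketch that assumes the log format shown above (timestamp, thread, level, logger); the function name and regular expressions are my own and are not part of the SLU tool.

```python
import re
from typing import List, Tuple

# Entry header, e.g. "2021-09-28 21:20:10.786 [main] ERROR [logger.name]".
ENTRY_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \S+ \[[^\]]+\] (\w+) \[")
# Exception line, e.g. "com.jcraft.jsch.JSchException: Auth fail" or
# "Caused by: java.lang.NullPointerException".
EXC_RE = re.compile(r"^(?:Caused by: )?([\w.$]+(?:Exception|Error))(?::\s*(.*))?$")


def summarize_errors(log_text: str) -> List[Tuple[str, str]]:
    """Return (exception class, message) pairs found inside ERROR entries."""
    findings: List[Tuple[str, str]] = []
    in_error_entry = False
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        entry = ENTRY_RE.match(line)
        if entry:
            # A new log entry starts; remember whether it is an ERROR.
            in_error_entry = entry.group(1) == "ERROR"
            continue
        if in_error_entry:
            exc = EXC_RE.match(line)
            if exc:
                findings.append((exc.group(1), exc.group(2) or ""))
    return findings


if __name__ == "__main__":
    sample = "\n".join([
        "2021-09-28 21:20:10.786 [main] ERROR [com.vmware.evo.sddc.lcm.common.utils.SshCommandRunner]",
        "Command '[date, +%s]' execution failed:",
        "com.jcraft.jsch.JSchException: Auth fail",
        "at com.jcraft.jsch.Session.connect(Session.java:519)",
    ])
    for exc_class, message in summarize_errors(sample):
        print(f"{exc_class}: {message}")
```

Run against the trimmed log above, the first hit is the JSch authentication failure, which is the real lead; the NullPointerException entries further down are just fallout from it.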

The Resolution

We tried everything. I mean everything. Well, everything except one thing, clearly. We checked everything from DNS resolution on the utility VM (to ensure we were in fact hitting the SDDC Manager) to SSH versions, and we even tried to execute the upgrade from a RHEL 8 box instead of a Windows machine. I’ll save you all dozens of minutes of reading and myself hours of typing by leaving out the rest of the troubleshooting details. Eventually, a white flag was raised and a support case was opened with VMware. After a few hours of troubleshooting with support, the ticket was escalated to engineering, who then escalated it internally. The response finally came back, and it wasn’t an actual resolution so much as one of those things that makes the lightbulb go on. The engineer mentioned that the SLU CLI tool “makes direct reference to the Primary User, Basic authentication user, and root passwords that are entered.” Hmmm.


I had an idea. I took a look at the passwords that they were using. They were complex, with a myriad of letters, numbers, and…special characters. The root password, in particular, caught my eye because it had a ‘#’ in it. On a hunch, we changed the root password to something basic and re-ran the SLU CLI upgrade process…
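If you want to check your own passwords before kicking off the tool, something like the following Python sketch would flag the usual suspects. Big caveat: only ‘#’ was implicated in our case. The rest of the character set below is my own guess at characters that shells and config parsers commonly treat specially, and the function name and example passwords are made up.

```python
from typing import List

# Assumption: illustrative set only. '#' is the one character we actually saw
# cause trouble; the others are commonly special to shells and parsers.
SUSPECT_CHARS = set("#$`\"'\\!&|;<> ")


def suspect_characters(password: str) -> List[str]:
    """Return the distinct suspect characters, in order of first appearance."""
    found: List[str] = []
    for character in password:
        if character in SUSPECT_CHARS and character not in found:
            found.append(character)
    return found


if __name__ == "__main__":
    # Hypothetical example passwords, not real credentials.
    for candidate in ["Vmw@re#123", "PlainPassw0rd"]:
        hits = suspect_characters(candidate)
        status = "review before upgrade: " + " ".join(hits) if hits else "ok"
        print(f"{candidate!r}: {status}")
```

It reads the password from a literal here purely for demonstration; in real life you would not want credentials sitting in a script or your shell history.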

Post PW Change

BOOM! The upgrade is finally underway! I don’t want to hold you in suspense, so I’ll tell you that the SDDC Manager upgrade wound up completing successfully after this!

Conclusion

There’s apparently a bug in 3.9.0.0 related to how passwords are handled: a hash character (#) in the root password seems to cause a problem with the SLU CLI tool. We reset the root password to a basic one to initiate the upgrade, and once the upgrade completed we set it right back to a complex password. Thanks for reading. If you enjoyed the post, make sure you check us out at dirmann.tech and follow us on LinkedIn, Twitter, Instagram, and Facebook!

 
