Updating VMware Cloud Foundation (VCF) from 5.1.0 to 5.1.1

I recently upgraded a nested lab from VMware Cloud Foundation (VCF) 5.1.0 to 5.1.1. I learned a few lessons along the way and have documented them below.

Step 0 – Validate the environment

My VCF lab had been powered down for a couple of months, and during that time many of the credentials expired. SDDC Manager has the ability to rotate credentials automatically, but I do not have that configured, and it likely would not have worked anyway since everything was powered off.

One expired credential was the root user for the VCSA. Typically, when I log into my lab and the root password has expired, I change it temporarily to something complex, delete the /etc/security/opasswd file (which contains password history), and then set the password back to my default lab value. However, on this vCenter Server 8 appliance I was unable to reuse my trusty lab password due to complexity requirements. I found a couple of places (https://www.reddit.com/r/vmware/comments/186lrdc/comment/kicbdxm/, https://virtuallythatguy.co.uk/how-to-change-vcsa-root-password-and-bypass-bad-password-it-is-based-on-a-dictionary-word-for-vcenter-vcsa-root-account/) which mentioned settings in /etc/security/pwquality.conf. This file did exist on my VCSA, so I changed these parameters:
dictcheck = 0 # changed from 1 to 0
enforcing = 0 # was commented out, uncommented and changed from 1 to 0
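
For reference, the end-to-end reset looked roughly like this when run as root on the VCSA shell. Consider this a sketch of my process: the sed patterns assume pwquality.conf is formatted exactly as shown above, so it is worth opening the file afterwards to confirm the changes took effect.

# Set a temporary complex password, then clear the password history
passwd root
rm /etc/security/opasswd

# Relax the dictionary and enforcement checks
sed -i 's/^dictcheck = 1/dictcheck = 0/' /etc/security/pwquality.conf
sed -i 's/^# enforcing = 1/enforcing = 0/' /etc/security/pwquality.conf

# Set the password back to the standard lab value
passwd root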

This allowed me to reuse my standard password. After doing so, I restarted the VCSA and SDDC Manager for good measure and confirmed that both UIs were responsive.

Once the SDDC Manager was online, it showed that several other credentials (NSX Manager & Edge admin/audit/root, backup SFTP, etc.) had become disconnected and were no longer working. I cycled through the various endpoints, changing each password and then remediating it from within SDDC Manager.
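
Most of this I did through the SDDC Manager UI, but the same remediation can be driven with the SDDC Manager public API. The sketch below is my rough approximation: the passwords are lab placeholders, and the resourceType/credentialType values should be verified against the VCF API reference for your build.

# Request an API token from SDDC Manager (placeholder credentials)
TOKEN=$(curl -sk -X POST https://vcf-sddcm-01.lab.enterpriseadmins.org/v1/tokens \
  -H 'Content-Type: application/json' \
  -d '{"username":"administrator@vsphere.local","password":"<sso password>"}' \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["accessToken"])')

# Tell SDDC Manager about a credential already changed on the endpoint
curl -sk -X PATCH https://vcf-sddcm-01.lab.enterpriseadmins.org/v1/credentials \
  -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' \
  -d '{"operationType":"REMEDIATE",
       "elements":[{"resourceName":"vcf-nsxm.lab.enterpriseadmins.org",
                    "resourceType":"NSXT_MANAGER",
                    "credentials":[{"credentialType":"SSH",
                                    "username":"root",
                                    "password":"<new password>"}]}]}'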

For some NSX credentials, remediation failed with an error stating the credentials were incorrect, even though I knew they had been updated. I found this article: https://kb.vmware.com/s/article/88561, which describes an NSX API lockout period, and I followed the resolution in the KB article to finish the password remediation.
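
The gist of the fix is to temporarily relax the API lockout so remediation retries are not rejected while the lockout timer runs. A rough sketch against the NSX node auth-policy API follows; the credentials are placeholders, the PUT may need to include the other fields returned by the GET, and the exact values to use come from the KB (restore the defaults once remediation succeeds).

# View the current authentication policy on the NSX Manager node
curl -sk -u 'admin:<password>' \
  https://vcf-nsxm.lab.enterpriseadmins.org/api/v1/node/aaa/auth-policy

# Temporarily shorten the API lockout period, then retry remediation
curl -sk -u 'admin:<password>' -X PUT \
  -H 'Content-Type: application/json' \
  https://vcf-nsxm.lab.enterpriseadmins.org/api/v1/node/aaa/auth-policy \
  -d '{"api_failed_auth_lockout_period": 1}'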

Step 1 – Complete pending tasks

Running the upgrade precheck showed a compatibility validation error for the SDDC Manager, NSX Manager, and vCenter Server components. The details of one error said: "Check the operationsmanager logs and if this issue persists contact VMware support. Reference token: UNVO1U."

Since the issue was related to compatibility, I first ran /opt/vmware/sddc-support/sos --version-health to confirm that the running versions were the expected versions. (Command output included below for reference.)

Welcome to Supportability and Serviceability(SoS) utility!
Performing SoS operation for vcf-sddc-01 domain components
Health Check : /var/log/vmware/vcf/sddc-support/healthcheck-2024-04-16-18-33-30-6966
Health Check log : /var/log/vmware/vcf/sddc-support/healthcheck-2024-04-16-18-33-30-6966/sos.log
SDDC Manager : vcf-sddcm-01.lab.enterpriseadmins.org
+-------------------------+-----------+
|          Stage          |   Status  |
+-------------------------+-----------+
|         Bringup         | Completed |
| Management Domain State | Completed |
+-------------------------+-----------+
+--------------------+---------------+
|     Component      |    Identity   |
+--------------------+---------------+
|    SDDC-Manager    | 192.168.10.29 |
| Number of Servers  |       4       |
+--------------------+---------------+
Version Check Status : GREEN
+-----+------------------------------------------------+---------------------------+----------------------+-----------------------+-------+
| SL# |                   Component                    | BOM Version (lcmManifest) |   Running version    | VCF Inventory Version | State |
+-----+------------------------------------------------+---------------------------+----------------------+-----------------------+-------+
|  1  |   ESXI: vcf-vesx-01.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  2  |   ESXI: vcf-vesx-02.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  3  |   ESXI: vcf-vesx-03.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  4  |   ESXI: vcf-vesx-04.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  5  | NSX_MANAGER: vcf-nsxm.lab.enterpriseadmins.org |     4.1.2.1.0-22667789    |  4.1.2.1.0-22667789  |   4.1.2.1.0-22667789  | GREEN |
|  6  |  SDDC: vcf-sddcm-01.lab.enterpriseadmins.org   |          5.1.0.0          |       5.1.0.0        |        5.1.0.0        | GREEN |
|  7  |  VCENTER: vcf-vc-01.lab.enterpriseadmins.org   |    8.0.2.00100-22617221   | 8.0.2.00100-22617221 |  8.0.2.00100-22617221 | GREEN |
+-----+------------------------------------------------+---------------------------+----------------------+-----------------------+-------+
Progress : 100%, Completed tasks : [VCF-SUMMARY, VERSION-CHECK]
Legend:

 GREEN - No attention required, health status is NORMAL
 YELLOW - May require attention, health status is WARNING
 RED - Requires immediate attention, health status is CRITICAL


Health Check completed successfully for : [VCF-SUMMARY, VERSION-CHECK]

I then tried the workaround from this KB article: https://kb.vmware.com/s/article/90074, updating the vcf.compatibility.controllers.compatibilityCheckEnabled property and restarting LCM. This did not correct the issue either.
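
For anyone retracing this, the workaround amounts to adding a property to the LCM application configuration and restarting the service. A sketch, assuming the default configuration path on the SDDC Manager appliance (back up the file first and confirm the property name and value against the KB):

# Append the property from KB 90074 to the LCM configuration
echo 'vcf.compatibility.controllers.compatibilityCheckEnabled=false' \
  >> /opt/vmware/vcf/lcm/lcm-app/conf/application-prod.properties

# Restart LCM so the new property is read
systemctl restart lcm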

As a last resort, I followed the instructions in the original error message and reviewed the SDDC Manager logs with tail -f /var/log/vmware/vcf/operationsmanager/operationsmanager.log. I saw several errors related to a version check, similar to:

2024-04-17T14:08:46.217+0000 ERROR [vcf_lcm,661fd7ecb6286abb2c0f18d6e8bd8c95,50f1] [c.v.e.s.l.a.i.InventoryClientHelper,Scheduled-3] Failed to compare VRSLCM version with VCF 4.0.0.0 BOM version in domain ed6e5e6e-c775-44c2-9853-3d85ee90c0cc

This helped me find https://kb.vmware.com/s/article/95790. The KB article provided some SQL commands for updating the vRealize Suite Lifecycle Manager (vRSLCM) build number in the SDDC Manager database. After restarting the lcm service on the SDDC Manager, a failed task to integrate Aria Operations for Logs with VCF restarted automatically and completed. Once complete, SDDC Manager had no pending/failed tasks and the prechecks completed successfully.
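
I won't reproduce the KB's SQL here, but the general shape of the fix is below. The database name is my assumption about where SDDC Manager keeps this inventory; the actual table, column, and build number must come from KB 95790.

# Connect to the SDDC Manager Postgres database (database name assumed)
psql -h localhost -U postgres -d platform

# Inside psql, run the UPDATE from the KB, which has the general shape:
#   UPDATE <table> SET <version column> = '<expected vRSLCM build>' WHERE ...;
# (names intentionally omitted here; use the KB statement verbatim)

# Back in the shell, restart the lcm service to pick up the change
systemctl restart lcm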

Update SDDC Manager

After the password and version check issues were sorted out, the VMware Cloud Foundation SDDC Manager Update was successful. This step took about 30 minutes to complete.

Update NSX

After the SDDC Manager update, the next step was updating the NSX Edge clusters. The edge node upgrades initially failed with an error similar to:

2024-04-17T14:41:34.279+0000 ERROR [vcf_lcm,661fdf9ecca507a4f0fd8b24fb8032bf,feea] [c.v.evo.sddc.lcm.model.task.SubTask,http-nio-127.0.0.1-7400-exec-1] Upgrade error occured: Check for open alarms on edge node.: [Edge node 926f00f0-c544-40e8-b5e7-da8a9037bc10 has 1 open alarm(s) present. Kindly resolve the open alarm(s) before proceeding with the upgrade.]: vcf-edge-02,
Check for open alarms on edge node.: [Edge node cf6f0b2d-e84d-4953-aa98-166e2a8a40c4 has 1 open alarm(s) present. Kindly resolve the open alarm(s) before proceeding with the upgrade.]: vcf-edge-01
Reference token EC0PAR

Looking at the edge nodes in NSX Manager, they had the error "The datapath mempool usage for malloc_heap_socket_0 on Edge node cf6f0b2d-e84d-4953-aa98-166e2a8a40c4 has reached 93% which is at or above the high threshold value of 85%". This is likely triggered because my edge nodes use the small form factor while running services for which medium is the suggested minimum size. I marked the alarms as resolved in NSX Manager and tried again. The upgrade failed with the same error a second time, so I went back to the NSX Manager UI and suppressed the alarm for 2 hours. The next attempt completed as expected.
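
Suppressing the alarm can also be done outside the UI with the NSX alarms API. A rough sketch with placeholder credentials; I used the UI, so the action and duration parameters below should be checked against the NSX API guide for your release.

# List open alarms and note the id of the mempool usage alarm
curl -sk -u 'admin:<password>' \
  'https://vcf-nsxm.lab.enterpriseadmins.org/api/v1/alarms?status=OPEN'

# Suppress that alarm for two hours so the upgrade precheck can pass
curl -sk -u 'admin:<password>' -X POST \
  'https://vcf-nsxm.lab.enterpriseadmins.org/api/v1/alarms/<alarm-id>?action=SUPPRESS&duration=2'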

After the NSX Edge cluster was upgraded, each host was placed in maintenance mode and the NSX updates were applied. This step completed without incident. I did not capture the timing, but it took roughly 45 minutes and ran unattended.

During the NSX Manager upgrade, one failure did occur. I did not capture the error message from SDDC Manager, but looking at the NSX Manager virtual machine (there is only one in this environment) in vCenter Server, there was a message of "This virtual machine reset by vSphere HA. Reason: VMware Tools heartbeat failure." This has happened to this NSX Manager before, likely because it runs as a nested VM on a heavily loaded host. I edited the vSphere HA VM override for this VM and disabled VM Monitoring. Once all of the services were back online after the HA restart, I retried the NSX upgrade from SDDC Manager and it completed without further incident.

Update vCenter Server

Updating the vCenter Server was uneventful. Based on timings recorded by SDDC Manager, it took about an hour to complete this vCenter Server upgrade.

Update ESXi Hosts

The final step of this workflow was updating ESXi to 8.0 Update 2b. This step completed without incident in about an hour (four nested hosts had to be evacuated and updated).

Lessons Learned

With the exception of some lab/environment-specific issues, the upgrade worked as expected. I need to review overall password management in this lab, either coming up with options to manage passwords better or setting the password policy to not enforce rotation. Since some lab environments may be powered off for fairly long stretches, disabling rotation is likely the better option for this lab. This exercise was also a good reminder that ignoring errors, such as the failed/pending Aria Operations for Logs integration task, can have unintended consequences. In addition, sizing components appropriately would likely result in less painful upgrades, but would require additional hardware investment.
