MongoDB: Test data for performance monitoring

This post will cover loading some test data into our MongoDB instance and generating some queries for performance monitoring. In previous posts we covered creating a MongoDB replica set (here) and configuring the Aria Operations Management Pack for MongoDB (here).

Reviewing the MongoDB website, there is a good article about some sample datasets: https://www.mongodb.com/developer/products/atlas/atlas-sample-datasets/. The MongoDB post covers importing the data using Atlas, then describes each data set. At the very end of the article, they cover importing this data with the mongorestore command line utility. As we do not have a GUI available with this Mongo instance, this is what we’ll do in this post.

The first step is to SSH into the primary node of our MongoDB replica set. We can find this value on the MongoDB Replica Set Details dashboard in Aria Operations (it's in the MongoDB Replica Sets widget at the top right, in the 'Primary Replication' column) or by using the rs.status() command in the Mongo Shell, as discussed earlier in this series.
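
If you'd rather grab this from the command line, a quick one-liner works too (a hedged sketch; it prompts for the root password and assumes you are connected to one of the replica set members). It asks rs.status() for the member whose stateStr is PRIMARY:

# print the name (host:port) of the current primary
mongosh --username root --quiet --eval 'rs.status().members.filter(m => m.stateStr == "PRIMARY").forEach(m => print(m.name))'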

From the /tmp directory, we'll download the sampledata archive using the curl command line utility:

curl https://atlas-education.s3.amazonaws.com/sampledata.archive -o sampledata.archive

The download will be about 372MB. Once we have the file, we will use the mongorestore command line utility with the following syntax:

mongorestore --archive=sampledata.archive -u root -p 'password'

We can get the root password from the console of the first VM in our cluster, the one where we ran rs.initiate earlier. The restore should complete rather quickly. Progress is written to the screen during the restore, but the final line in my output was:

2024-05-11T18:21:40.888+0000    425367 document(s) restored successfully. 0 document(s) failed to restore.

A few hundred thousand records should be enough for our needs, since we primarily want to make sure our monitoring dashboards are working.
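
To spot-check that the restore landed, we can list the database names with a quick mongosh one-liner (again prompting for the root password); the sample_* databases should appear in the output:

# list database names after the restore
mongosh --username root --quiet --eval 'db.adminCommand({ listDatabases: 1, nameOnly: true }).databases.forEach(d => print(d.name))'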

Having data in our database isn't really enough; we also need some queries running. I'm sure there are more complete load-generation tools (such as YCSB), but after a quick search I found a couple of PowerShell examples for connecting to MongoDB (https://stackoverflow.com/questions/45010964/how-to-connect-mongodb-with-powershell). One is a module available in the PowerShell Gallery. It was easy to install with Install-Module Mdbc, so I gave it a shot. One of the first issues I encountered was with the default root password I was using: it contained a colon, which is the character used to separate username:password in the connection string. I found a quick way to escape the special characters, and with a little more trial and error I was able to build a connection string. Another thing I ran into was the default readPreference, which assumes all reads should come from the primary node, so neither of my secondary nodes was really doing anything. I ended up using 'secondaryPreferred' so that I could see load on multiple nodes in the cluster.

$mongoPass = [uri]::EscapeDataString('wj:dFDgb6tom')
$mongoConnectString = "mongodb://root:$mongoPass@svcs-mongo-01.lab.enterpriseadmins.org,svcs-mongo-02.lab.enterpriseadmins.org,svcs-mongo-03.lab.enterpriseadmins.org/?readPreference=secondaryPreferred"

With the password escaped and the connection string built, it is easy to connect to the database. For example, to return a list of databases from the Mongo instance, I can run the following command:

Connect-Mdbc $mongoConnectString *

# List returned:
admin
config
local
sample_airbnb
sample_analytics
sample_geospatial
sample_guides
sample_mflix
sample_restaurants
sample_supplies
sample_training
sample_weatherdata

Running Connect-Mdbc $mongoConnectString sample_analytics * (adding a specific database name to the command) will return the three collections in that database. A few quick foreach loops later, we have a workload that will run for a fairly long time, and we could easily give the loop more iterations. It prints some basic output so you know it is working, and CTRL+C will exit the loop at any point.

$randomCounts = 2
1..1000 | %{
  $myResults = @()
  foreach ($thisDB in (Connect-Mdbc $mongoConnectString * |?{$_ -match 'sample'} | Get-Random -Count $randomCounts)) {
    foreach ($thisTable in (Connect-Mdbc $mongoConnectString $thisDB * | Get-Random -Count $randomCounts)) {
      Connect-Mdbc $mongoConnectString $thisDB $thisTable | Get-Random -Count $randomCounts
      $myResults += [pscustomobject][ordered]@{
        "Database" = $thisDB
        "Table"    = $thisTable
        "RowCount" = (Get-MdbcData -as PS | Measure-Object).Count
      } # end outputobject
    } # end table loop
  } # end db loop
  $rowsReturned = ($myResults | Measure-Object -Property rowcount -sum).Sum
  "Completed iteration $_ and returned $rowsReturned rows"
} # end counter loop

While running the above loop, I also went through and messed with cluster nodes, rebooting them to see what would happen and whether queries failed. The cluster was more resilient than I had expected. This worked well to generate some CPU load on my Mongo VMs to populate an Aria Operations dashboard.

Posted in Lab Infrastructure, Virtualization

Optimizing Operations: Aria Operations Management Pack for MongoDB

In a previous post (here), I covered how to set up a MongoDB replica set to be monitored by Aria Operations in a vSphere based lab. This article will cover the installation & configuration of the Aria Operations Management Pack for MongoDB. The article will be divided into three sections:

  • Installing the Management Pack – which will cover the Aria Suite Lifecycle Marketplace workflow
  • Configuring a service account – this will be the monitoring user for the MongoDB instances, configured inside Mongo Shell.
  • Adding a MongoDB adapter instance in Aria Operations – configuring Aria Operations to use the new management pack and service account.

Installing the Management Pack

In Aria Suite Lifecycle (formerly known as vRealize Suite Lifecycle Manager), I typically stay in the Lifecycle Operations, Locker, or Content Management tiles. In this post we’ll use the Marketplace tile to add a management pack for MongoDB. For details on creating a MongoDB Replica Set cluster, feel free to check out this previous post.

At the top of the Marketplace screen, there is a search box. When I search for Mongo I get three results. Hovering over each result, we can see that one of them is the Aria Operations Management Pack for MongoDB Version 9.0:

After selecting DOWNLOAD in the bottom right of the tile, we are presented with the EULA. After reading the document, we check the box and select next. On the last page of this wizard, we enter our name, email address, company name, and country, then click DOWNLOAD.

This will show the message 'Success. Download for VMware Aria Operations Management Pack for MongoDB 9.0 is initiated. For more details visit request page.' The 'request page' text in that message is a link to the specific request, which should contain just one stage and complete fairly quickly. Here is a screenshot of the expected results of the download task:

Back in the Marketplace, we switch to the ‘Available’ tab, which shows the subset of management packs that we’ve already downloaded. Due to the length of management pack names, it may be helpful to search for mongo in the search bar.

When we select ‘View Details’ a page will appear that provides some details about the selected management pack. If we scroll to the end of the page, there are buttons for INSTALL and DELETE. Since we are installing this for the first time, we’ll select INSTALL.

We must then pick our Datacenter and Environment and finally select the INSTALL button in the lower right.

This will result in a banner stating 'Installation in progress. Check request status.' The 'request status' text is again a link that takes us to the 'Requests' page. This should also be a single-stage task, but it will likely take longer than the download.

Once this task completes, our management pack will be available in the Aria Operations instance. We’ll just need to give it hostnames and credentials to begin collecting data.

Configuring a service account

From an SSH session to our primary MongoDB node, we’ll open a Mongo Shell with the following command:

mongosh --username root

We'll enter our root password when prompted. One important thing to note is that the root password for MongoDB is the root password of the node where we initiated the replica set.

// specify that we want to use the admin database instead of the default test
use admin

// Create our limited access service account for monitoring.  
db.createUser(
   {
     user: "svc-ariaopsmp",
     pwd: "VMware1!",
     roles: [ { role: "clusterMonitor", db: "admin" } ]
   }
)

// Confirm that it worked
db.getUser('svc-ariaopsmp')

When the above command completes, the final JSON object presented to the screen should look like this:

{
  _id: 'admin.svc-ariaopsmp',
  userId: UUID('2ee886ee-f700-46e7-b022-950fbb36034f'),
  user: 'svc-ariaopsmp',
  db: 'admin',
  roles: [ { role: 'clusterMonitor', db: 'admin' } ],
  mechanisms: [ 'SCRAM-SHA-1', 'SCRAM-SHA-256' ]
}

This shows that we have a user 'svc-ariaopsmp' with the 'clusterMonitor' role, a limited permission set intended for monitoring operations, which is exactly what we are doing with Aria Operations.
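
Before leaving the SSH session, we can also confirm the new account works end to end. This is just a sanity check (it prompts for the svc-ariaopsmp password) that the clusterMonitor role is enough to read replica set status, which is the same kind of data the management pack will collect:

# authenticate as the monitoring account and read replica set member states
mongosh --username svc-ariaopsmp --authenticationDatabase admin --quiet --eval 'db.adminCommand({ replSetGetStatus: 1 }).members.forEach(m => print(m.name + " " + m.stateStr))'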

Adding a MongoDB adapter instance in Aria Operations

From Aria Operations, we'll navigate to Data Sources > Integrations > Repository and verify we can see our newly installed integration. If we click the tile, details come up about the management pack, including the metrics collected and the content that is now available. We can select 'ADD ACCOUNT' from the top of the screen to create an adapter instance.

When adding an account there are only a few fields required:

The 'Name' field is used in some of the dashboards to identify which MongoDB instance is being described. It is important to include a short, descriptive name; in my case I'm going to call this 'svcs-mongo-replicaset' to denote that it is deployed in my services network, is running Mongo, and is a replica set cluster.

The 'Description' field is not shown by default on the MongoDB dashboards added by the management pack, but it can contain additional details about the adapter instance. If you have various MongoDB instances managed by different teams, this might be a good place to store a contact, for example in case credentials expire or stop working.

The 'Host' field is where we enter the hostname of our MongoDB instance. Since we've configured a replica set, I'm entering a comma-separated list of all the nodes in the MongoDB cluster. This helps ensure we get monitoring data even if one node fails.

The 'Credential' field allows us to specify whether authentication is required and to enter username/password details. In my case authentication is required, and I'm only running mongod, not mongos, but I'm going to enter the same account in both sets of fields in case something changes in the future. Here is what my new credential looks like:

Finally, I ‘VALIDATE CONNECTION’ and wait for the ‘Test connection successful’ notice to appear, then click ‘ADD’ at the bottom of the screen.

Additional advanced settings are available, like which service to connect to, specifics around SSL, timeouts, and autodiscovery. Those settings did not need to be tweaked for this simple environment.

After 5-10 minutes, we should have some initial data. Browsing to Visualize > Dashboards, we can filter the list of dashboards for Mongo. There are 7 MongoDB-related dashboards available from the management pack we installed. Not all of them will have data; for example, we don't have mongos deployed. However, if we select MongoDB Replica Set Details we should see some information about our environment.

To make the most of this, we’ll really need some sample data and a way to run queries as needed to have something to monitor. We’ll cover that in the next post (here).

Posted in Lab Infrastructure, Virtualization

Creating a MongoDB Replica Set

I was recently looking at the Aria Operations Management Pack for MongoDB (https://docs.vmware.com/en/VMware-Aria-Operations-for-Integrations/9.0/Management-Pack-for-MongoDB/GUID-73744E17-88DD-49A1-8B86-5BD896C874D8.html) and wanted to kick the tires in my vSphere-based lab. To be able to test this management pack, I wanted to deploy a three node MongoDB replica set.

To begin, I found a solid starting point, the Bitnami MongoDB appliance: https://bitnami.com/stack/mongodb/virtual-machine. I downloaded this appliance which included MongoDB 7.0.9 pre-installed.

I deployed three copies of the appliance, specifying static IPs during the deployment. I pre-created forward and reverse DNS records for these IPs. The VMs come configured with 1 vCPU and 1GB of RAM; when building the replica set I ran into a few issues with only 1GB, and 2GB of RAM seemed to work better. Once the VMs were powered on, I logged in with the bitnami/bitnami username/password combination and changed the bitnami password. I then made a few changes in the OS using the vSphere web console.

sudo nano /etc/ssh/sshd_config
# find PasswordAuthentication no and change to yes

sudo rm /etc/ssh/sshd_not_to_be_run
sudo systemctl start sshd

This allowed me to log in over SSH to make the remaining changes. Over SSH I made some additional changes to allow MongoDB to be accessed remotely:

sudo nano /etc/nftables.conf  # find and add "tcp dport { 27017-27019 } accept" in the section below accepting tcp22
sudo nft -f /etc/nftables.conf -e

I also wanted to clean up a couple of general networking issues, namely setting the hostname of the VM and removing an extra secondary IP address that might get pulled via DHCP.

# set the OS hostname
echo mongo-xx.lab.example.com | sudo tee /etc/hostname

# disable the DHCP interface if using static addresses
sudo nano /etc/network/interfaces
# find the line at the end of the file for 'iface ens192 inet dhcp' and remove it or make it a comment.

# confirm DNS resolution is configured
sudo nano /etc/resolv.conf  # review for nameserver entries

sudo systemctl restart networking

Next up, we will make some changes to the MongoDB configuration to support our replica set. The nodes need a security key to talk with each other. Instead of using a short password, I created a new GUID from my local machine running PowerShell to use as the key for node communication, and removed the hyphens.

[guid]::NewGuid() -Replace '-', ''

For me, this created the string 88157a33a9dc499ea6b05c504daa36f8, which I'll reference throughout this document. There are other ways to create longer/more complex keys, but this was quick. We need to put this key in a file that will be referenced in the Mongo config file. To create the security key file we'll use the following commands:

echo '88157a33a9dc499ea6b05c504daa36f8' | sudo tee /opt/bitnami/mongodb/conf/security.key
sudo chmod 400 /opt/bitnami/mongodb/conf/security.key
sudo chown mongo:mongo /opt/bitnami/mongodb/conf/security.key
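
If you want something longer and more random than a GUID, the MongoDB documentation generates the keyfile with openssl instead; this would replace the echo above (apply the same chmod/chown afterwards):

# generate a 756-byte random value, base64 encoded, and store it as the keyfile
openssl rand -base64 756 | sudo tee /opt/bitnami/mongodb/conf/security.key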

Now that we have that security key, we’ll update our mongo config.

sudo nano /opt/bitnami/mongodb/conf/mongodb.conf

# Find and remove the comments for replication and replSetName, and set a replication set name if desired.

# uncomment #keyFile at the end of the file and set value to /opt/bitnami/mongodb/conf/security.key

# save the file, restart services
sudo systemctl restart bitnami.mongodb.service
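
A quick grep is an easy sanity check that both values ended up uncommented and populated before moving on (the exact layout of the file may differ slightly from mine):

# confirm the replica set name and keyfile are set
grep -E 'replSetName|keyFile' /opt/bitnami/mongodb/conf/mongodb.conf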

Now that the cluster nodes are all prepared, we can turn this into a replica set cluster. Since I'm not familiar with this process, I decided to create snapshots of all the nodes (PowerCLI makes this easy: Get-VM mongo-0* | New-Snapshot -Name 'Pre RS config') and reboot them for good measure. I'll log back into the first node of the cluster over SSH and enter the Mongo Shell interface, using the root password found on the virtual appliance console.

mongosh --username root
# enter the root password

# initiate the cluster
rs.initiate( {
   _id : "replicaset",
   members: [
      { _id: 0, host: "mongo-01.lab.example.com:27017" },
      { _id: 1, host: "mongo-02.lab.example.com:27017" },
      { _id: 2, host: "mongo-03.lab.example.com:27017" }
   ]
})

# check the status to ensure cluster is working
rs.status()

The rs.status() command should return the status of the cluster. We should see that one of our nodes is the primary and the other two are secondaries (based on the stateStr property).
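
The full rs.status() document is fairly long. For a condensed view, a one-liner like this (run from the node's shell, outside mongosh; it will prompt for the root password again) prints just each member's name and state:

# summarize replica set member states
mongosh --username root --quiet --eval 'rs.status().members.forEach(m => print(m.name + "  " + m.stateStr))'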

With our replica set working, we can now start monitoring. The next post (link) will cover installing and configuring the management pack. The final post in this series (link) will show how to import some sample data and run sample queries to put read load on the replica set.

Posted in Lab Infrastructure, Scripting, Virtualization

Ubuntu 24.04, Packer, and vCenter Server Customization Specifications

I've heard from several people who highly recommend Packer (https://packer.io) for creating standardized images. I've been wanting to dig into this for some time, and with the recent release of Ubuntu 24.04, decided now would be a good time. I plan on using this template for most Linux VMs deployed in my vSphere-based home lab. In addition to the base install of Ubuntu, there are a handful of agents/customizations that I wanted to have available:

  • Aria Operations for Logs agent
  • Aria Automation Salt Stack Config minion
  • Trusts my internal root CA
  • Joined to Active Directory for centralized authentication

I ended up with a set of packer configuration files & a customization spec that did exactly what I wanted. With each install or customization, I tried to decide if it would be best to include the automation in the base image (executed by Packer) or the customization spec (executed by the customization script). Some of this came down to personal preference, and I might revisit the choices in the future. For example, I’ve placed the code to trust my internal CA into the base template. I may want to evaluate removing that from the template and having multiple customization specs to have an option where that certificate is not trusted automatically.

For those interested, I’ve summarized the final output in the next two sections, but also tried to document notes and troubleshooting steps toward the end of the article.

Packer Configuration

The Packer Configuration spans several files. I’ve described each file below and attached a zip file with the working configuration.

  • http\meta-data – this is an empty file but is expected by Packer.
  • http\user-data – this file contains a listing of packages installed automatically and some commands run automatically during the template creation. For example, these commands will allow VMware Tools customization to execute custom scripts.
  • setup\setup.sh.txt – this is a script which runs in the template right before it is powered off. It contains some cleanup code and agent installs. You’ll need to rename this file to remove the .txt extension if you want it to execute.
  • ubuntu.auto.pkr.hcl – contains variable declarations and then defines all the virtual machine settings which are created.
  • variables.pkrvars.hcl – contains shared code (vCenter Server details, credentials, Datacenter, Datastore, etc) which may be consumed by multiple templates.

Download: https://enterpriseadmins.org/files/Packer-Ubuntu2404-Public.zip

With these files present in a directory, I downloaded the Packer binary for my OS (from: https://developer.hashicorp.com/packer/install?product_intent=packer) and placed it in the same directory. From there I only needed to run two commands.

./packer.exe init .
./packer.exe build .

The first command initializes Packer, which downloads the vSphere plugin we've specified. The second command actually kicks off the template build; in my lab this took ~6 minutes to complete. Once finished, I had a new vSphere template in my inventory which could be deployed easily.
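
If you end up iterating on the HCL files, Packer also has a validate subcommand that catches syntax and variable errors without waiting on a full build:

./packer.exe validate .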

vSphere Customization Specification > Customization script

The customization spec includes things like how to name the VM, the time zone, network settings, etc. The part of this spec which really helps complete the desired customizations is the customization script. Getting it right took a bit of trial and error, described in the notes section at the end of this article. I've included the final script below for reference. This code runs as part of the virtual machine deployment and is unique to each VM.

#!/bin/sh
if [ x$1 = x"precustomization" ]; then
    echo "Do Precustomization tasks"
    # append group to sudoers with no password
    echo '%lab\ linux\ sudoers ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
elif [ x$1 = x"postcustomization" ]; then
    echo "Do Postcustomization tasks"
    # generate new openssh-server key
    test -f /etc/ssh/ssh_host_dsa_key || dpkg-reconfigure openssh-server
    # make home directories automatically at login
    /usr/sbin/pam-auth-update --enable mkhomedir
    # do a domain join and then modify the sssd config
    echo "VMware1!" | /usr/sbin/realm join lab.enterpriseadmins.org -U svc-windowsjoin --computer-ou "ou=services,ou=lab servers,dc=lab,dc=enterpriseadmins,dc=org"
    sed -i -e 's/^#\?use_fully_qualified_names.*/use_fully_qualified_names = False/g' /etc/sssd/sssd.conf
    systemctl restart sssd.service
fi

Notes / troubleshooting

Since I wanted to make this process as low touch as possible, I needed to automate several agent installations and other customizations.

I had previously saved some sample configuration files for Ubuntu 22.04 (unfortunately I didn’t bookmark the original source). I cleaned up the files a bit, removing some declared variables that weren’t in use. I downloaded the Ubuntu 24.04 ISO image, placed it on a vSphere datastore, and updated the iso_paths property in the ubuntu.auto.pkr.hcl file and other credential/environmental values in the variables.pkrvars.hcl accordingly.

The initial build completed without incident, creating a vSphere template. The first deployment failed, however. Reviewing the /var/log/vmware-imc/toolsDeployPkg.log file, I observed the message ERROR: Path to hwclock not found. There was a KB article for this (https://kb.vmware.com/s/article/95091) related to Ubuntu 23.10, which mentioned that the util-linux-extra package was needed. I added this package to the list of packages in the user-data file and rebuilt the template using packer build. This resolved the issue and future deployments were successful.
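
As a quick check that the fix took, a VM deployed from the rebuilt template should now have the hwclock binary on the path; something like the following confirms both the binary and the package are present:

# confirm the hwclock binary and the util-linux-extra package exist
which hwclock
dpkg -s util-linux-extra | grep Status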

One thing I noticed was that the resulting virtual machine had two CD-ROM devices. I looked around and found a PR (link) stating that an option existed to control this behavior as of version 1.2.4 of the vSphere plugin. I updated the required_plugins mapping in the ubuntu.auto.pkr.hcl file to state that 1.2.4 is the minimum required version. I then added reattach_cdroms = 1 later in the file with the other CD-ROM related settings.

One other thing that I noticed in this process was that it would have been helpful to have a date/time stamp either in the VM name or the notes field, just to know when that instance of a template was created. I looked around and found out how to get a timestamp and used that syntax to add a notes = "Template created ${formatdate("YYYY-MM-DD", timestamp())}" property to my ubuntu.auto.pkr.hcl file.

After making the above fixes, I deployed a VM from the latest template and applied a customization spec containing a customization script to do a few final customization tasks (update /etc/sudoers, generate a new openssh-server key, complete the domain join, make a change to the sssd configuration, and finally restart the sssd service). This script failed to execute. Reviewing /var/log/vmware-imc/toolsDeployPkg.log, I noticed the message user defined scripts execution is not enabled. To enable it, please have vmware tools v10.1.0 or later installed and execute the following cmd with root privilege: 'vmware-toolbox-cmd config set deployPkg enable-custom-scripts true'. Back in my user-data configuration file, in the late-commands section, I added this command to enable custom scripts in the template.
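
For reference, this is the command from the log message. In my case it went into the late-commands section of user-data, but it can also be run manually inside an existing template to confirm the setting:

# enable VMware Tools customization scripts (command taken from the toolsDeployPkg.log message)
vmware-toolbox-cmd config set deployPkg enable-custom-scripts true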

After rebuilding the template to enable custom scripts, I deployed a new VM. This did not complete the domain join as I had hoped: all of my commands were running during precustomization, before the virtual machine was on the network. I found the following KB article, https://kb.vmware.com/s/article/74880, which described how to run some commands during precustomization and others during postcustomization. Moving the domain join to postcustomization solved this issue, as the VM was on the network by the time the join ran.

I wanted the templates to trust my internal CA, so I added a few commands to the setup.sh script to download the certificate file from an internal webserver and run update-ca-certificates.
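
A minimal sketch of what that looks like in setup.sh, assuming the certificate is hosted on an internal web server (the URL and filename below are placeholders for my lab values) and relying on the standard Ubuntu behavior of picking up .crt files from /usr/local/share/ca-certificates:

# download the internal root CA and add it to the system trust store (run as root or via sudo)
wget -O /usr/local/share/ca-certificates/lab-root-ca.crt http://webserver.lab.example.com/lab-root-ca.crt
update-ca-certificates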

The next task I wanted to complete was the installation of the Aria Automation Config (aka Salt Stack Config) minion. In the past I had used the salt-project version of the minion, but reviewing VMware Tools documentation (https://docs.vmware.com/en/VMware-Tools/12.4.0/com.vmware.vsphere.vmwaretools.doc/GUID-373CD922-AF80-4B76-B19B-17F83B8B0972.html) I found an alternative way. I added the open-vm-tools-salt-minion as a package in the user-data file and had Packer add additional configuration_parameters to the template to specify the salt_minion.desiredstate and salt_minion.args values.

I also wanted the template to include the Aria Operations for Logs (aka Log Insight) agent. The product documentation showed how to pass configuration during install (https://docs.vmware.com/en/VMware-Aria-Operations-for-Logs/8.16/Agents-Operations-for-Logs/GUID-B0299481-23C1-482D-8014-FAC1727D515D.html). However, I had problems automating the download of the agent: doing a wget of the link from the agent section of the Aria Ops for Logs console, the resulting file was an HTML redirect. I found this article, https://michaelryom.dk/getting-log-insight-agent, which provided an API link to download the package, and I was able to wget this file. I placed the wget and install commands in the setup.sh script that runs right before the new template is powered down.

After rebuilding the template with packer, I deployed another test VM. I confirmed that:

  • SSH worked
  • AD Authentication worked
  • The Aria Ops for Logs agent sent logs
  • My internal CA was trusted
  • The Aria Automation Config minion was reporting (the key needed to be accepted in the console)

Repackaging the template VM takes about 6 minutes. Deploying & customizing the template takes about 2 minutes, and everything I wanted in the VM is ready to go.

Posted in Lab Infrastructure, Scripting, Virtualization

Updating VMware Cloud Foundation (VCF) from 5.1.0 to 5.1.1

I recently upgraded a nested lab from VMware Cloud Foundation (VCF) 5.1.0 to 5.1.1. A few lessons were learned along the way, and I’m going to document those below.

Step 0 – Validate the environment

My VCF lab had been powered down for a couple of months and during that time many of the credentials expired. The SDDC Manager has the ability to rotate credentials automatically, but I do not have that configured, and it likely would not have worked (since everything was powered off). One expired credential was the root user for VCSA. Typically, when I log into my lab and the root password has expired, I’ll change it temporarily to something complex, delete the /etc/security/opasswd file (which contains password history) and then set the password back to my default lab value. However, with this vCenter Server 8 I was unable to use my trusty lab password due to complexity requirements. I found a couple places (https://www.reddit.com/r/vmware/comments/186lrdc/comment/kicbdxm/, https://virtuallythatguy.co.uk/how-to-change-vcsa-root-password-and-bypass-bad-password-it-is-based-on-a-dictionary-word-for-vcenter-vcsa-root-account/) which mentioned settings in /etc/security/pwquality.conf. This file did exist on my VCSA, so I changed these parameters:
dictcheck = 0 # changed from 1 to 0
enforcing = 0 # was commented out, uncommented and changed from 1 to 0

This allowed me to reuse my standard password. After doing so, I restarted the VCSA and SDDC Manager for good measure and confirmed that both UIs were responsive.

Once the SDDC Manager was online, it showed that several other credentials (NSX Manager & Edge admin/audit/root, backup SFTP, etc) had been disconnected & were no longer working. I cycled through the various endpoints, changing passwords and ‘remediating’ the password from within SDDC Manager.

For some NSX credentials I was running into an error with remediation stating credentials were incorrect, even though I knew they had been updated. I found this article: https://kb.vmware.com/s/article/88561 which described an NSX API lockout-period, so I followed the resolution in the KB article to finish the password remediation.

Step 1 – Complete pending tasks

Running a precheck for upgrade showed a compatibility validation error for the SDDC Manager, NSX Manager, and vCenter Server components. The details of one error said Check the operationsmanager logs and if this issue persists contact VMware support. Reference token: UNVO1U.

Since the issue was related to compatibility, I first ran /opt/vmware/sddc-support/sos --version-health, as I wanted to confirm that the running versions were the expected versions (command output included below for reference).

Welcome to Supportability and Serviceability(SoS) utility!
Performing SoS operation for vcf-sddc-01 domain components
Health Check : /var/log/vmware/vcf/sddc-support/healthcheck-2024-04-16-18-33-30-6966
Health Check log : /var/log/vmware/vcf/sddc-support/healthcheck-2024-04-16-18-33-30-6966/sos.log
SDDC Manager : vcf-sddcm-01.lab.enterpriseadmins.org
+-------------------------+-----------+
|          Stage          |   Status  |
+-------------------------+-----------+
|         Bringup         | Completed |
| Management Domain State | Completed |
+-------------------------+-----------+
+--------------------+---------------+
|     Component      |    Identity   |
+--------------------+---------------+
|    SDDC-Manager    | 192.168.10.29 |
| Number of Servers  |       4       |
+--------------------+---------------+
Version Check Status : GREEN
+-----+------------------------------------------------+---------------------------+----------------------+-----------------------+-------+
| SL# |                   Component                    | BOM Version (lcmManifest) |   Running version    | VCF Inventory Version | State |
+-----+------------------------------------------------+---------------------------+----------------------+-----------------------+-------+
|  1  |   ESXI: vcf-vesx-01.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  2  |   ESXI: vcf-vesx-02.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  3  |   ESXI: vcf-vesx-03.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  4  |   ESXI: vcf-vesx-04.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  5  | NSX_MANAGER: vcf-nsxm.lab.enterpriseadmins.org |     4.1.2.1.0-22667789    |  4.1.2.1.0-22667789  |   4.1.2.1.0-22667789  | GREEN |
|  6  |  SDDC: vcf-sddcm-01.lab.enterpriseadmins.org   |          5.1.0.0          |       5.1.0.0        |        5.1.0.0        | GREEN |
|  7  |  VCENTER: vcf-vc-01.lab.enterpriseadmins.org   |    8.0.2.00100-22617221   | 8.0.2.00100-22617221 |  8.0.2.00100-22617221 | GREEN |
+-----+------------------------------------------------+---------------------------+----------------------+-----------------------+-------+
Progress : 100%, Completed tasks : [VCF-SUMMARY, VERSION-CHECK]
Legend:

 GREEN - No attention required, health status is NORMAL
 YELLOW - May require attention, health status is WARNING
 RED - Requires immediate attention, health status is CRITICAL


Health Check completed successfully for : [VCF-SUMMARY, VERSION-CHECK]

I then tried the workaround from this KB article: https://kb.vmware.com/s/article/90074, updating the vcf.compatibility.controllers.compatibilityCheckEnabled property and restarting LCM. This did not correct the issue either.

As a last resort, I followed the instructions in the original error message and reviewed SDDC Manager logs with tail -f /var/log/vmware/vcf/operationsmanager/operationsmanager.log. I saw several errors related to a version check similar to:

2024-04-17T14:08:46.217+0000 ERROR [vcf_lcm,661fd7ecb6286abb2c0f18d6e8bd8c95,50f1] [c.v.e.s.l.a.i.InventoryClientHelper,Scheduled-3] Failed to compare VRSLCM version with VCF 4.0.0.0 BOM version in domain ed6e5e6e-c775-44c2-9853-3d85ee90c0cc

This helped me find https://kb.vmware.com/s/article/95790. The KB article provided some SQL commands for updating the vRealize Lifecycle Manager (vRSLCM) build number in the SDDC Manager database. After restarting the lcm service on the SDDC Manager, a failing task to integrate Aria Operations for Logs with VCF restarted automatically and completed. Once complete the SDDC Manager had no pending/failing tasks and the prechecks completed successfully.

Update SDDC Manager

After the password and version check issues were sorted out, the VMware Cloud Foundation SDDC Manager Update was successful. This step took about 30 minutes to complete.

Update NSX

After the SDDC Manager update, the next step was updating NSX Edge clusters. Edge node upgrades were failing with an error similar to:

2024-04-17T14:41:34.279+0000 ERROR [vcf_lcm,661fdf9ecca507a4f0fd8b24fb8032bf,feea] [c.v.evo.sddc.lcm.model.task.SubTask,http-nio-127.0.0.1-7400-exec-1] Upgrade error occured: Check for open alarms on edge node.: [Edge node 926f00f0-c544-40e8-b5e7-da8a9037bc10 has 1 open alarm(s) present. Kindly resolve the open alarm(s) before proceeding with the upgrade.]: vcf-edge-02,
Check for open alarms on edge node.: [Edge node cf6f0b2d-e84d-4953-aa98-166e2a8a40c4 has 1 open alarm(s) present. Kindly resolve the open alarm(s) before proceeding with the upgrade.]: vcf-edge-01
Reference token EC0PAR

Looking at the edge nodes in NSX Manager, they had the error The datapath mempool usage for malloc_heap_socket_0 on Edge node cf6f0b2d-e84d-4953-aa98-166e2a8a40c4 has reached 93% which is at or above the high threshold value of 85%. This is likely triggered because I have small edge nodes but am using services for which medium is the suggested minimum size. I marked the alarms as resolved in NSX Manager and tried again. The upgrade failed with the same error the second time, so I went back to the NSX Manager UI and suppressed this alarm for 2 hours. The next attempt completed as expected.

After the NSX Edge cluster was upgraded, the hosts were each placed in maintenance mode and NSX updates applied. This step completed without incident. I did not capture the timing, but believe it took about 45 minutes; it ran unattended.

During the NSX Manager upgrade, one failure did occur. I did not capture the error message from SDDC Manager, but looking at the NSX Manager virtual machine (there is only 1 in this environment) in vCenter Server, there was one message of “This virtual machine reset by vSphere HA. Reason: VMware Tools heartbeat failure.” This has happened before to this NSX Manager, likely due to the fact it is running as a nested VM on a heavily loaded host. I edited the High Availability VM override for this VM and disabled vSphere HA VM Monitoring. Once all the services were back online from the HA restart, I retried the NSX upgrade from SDDC Manager and it completed without further incident.

Update vCenter Server

Updating the vCenter Server was uneventful. Based on timings recorded by SDDC Manager, it took about an hour to complete this vCenter Server upgrade.

Update ESXi Hosts

The final step of this workflow was to update ESXi to 8.0 Update 2b. This step completed without incident in about an hour (4 nested hosts had to be evacuated & updated).

Lessons Learned

With the exception of some lab/environment specific issues, the upgrade worked as expected. I need to review overall password management in this lab, either coming up with options to manage passwords better or setting password policy to not enforce rotation. As some lab environments may be powered off for fairly long stretches, disabling rotation is likely the better option for this lab. This exercise was also a good reminder that ignoring errors, such as the failed/pending Aria Operations for Logs integration task, can cause unintended consequences. In addition, sizing components appropriately would likely result in less painful upgrades, but would require additional hardware investment.

Posted in Lab Infrastructure, Virtualization