Ubuntu 24.04, Packer, and vCenter Server Customization Specifications

Several people have highly recommended Packer (https://packer.io) to me for creating standardized images. I’ve wanted to dig into it for some time, and with the recent release of Ubuntu 24.04, now seemed like a good time. I plan to use this template for most Linux VMs deployed in my vSphere-based home lab. In addition to the base install of Ubuntu, there are a handful of agents/customizations that I wanted to have available:

  • Aria Operations for Logs agent
  • Aria Automation Config (SaltStack Config) minion
  • Trusts my internal root CA
  • Joined to Active Directory for centralized authentication

I ended up with a set of Packer configuration files and a customization spec that did exactly what I wanted. For each install or customization, I tried to decide whether it belonged in the base image (executed by Packer) or in the customization spec (executed by the customization script). Some of this came down to personal preference, and I might revisit the choices in the future. For example, I’ve placed the code to trust my internal CA into the base template. I may want to remove that from the template and instead create multiple customization specs, including one where that certificate is not trusted automatically.

For those interested, I’ve summarized the final output in the next two sections and documented notes and troubleshooting steps toward the end of the article.

Packer Configuration

The Packer Configuration spans several files. I’ve described each file below and attached a zip file with the working configuration.

  • http\meta-data – this is an empty file but is expected by Packer.
  • http\user-data – this file lists the packages installed automatically and some commands run automatically during template creation. For example, one of these commands allows VMware Tools guest customization to execute custom scripts.
  • setup\setup.sh.txt – this is a script which runs in the template right before it is powered off. It contains some cleanup code and agent installs. You’ll need to rename this file to remove the .txt extension if you want it to execute.
  • ubuntu.auto.pkr.hcl – contains variable declarations and then defines all the settings for the virtual machine that is created.
  • variables.pkrvars.hcl – contains shared values (vCenter Server details, credentials, Datacenter, Datastore, etc.) which may be consumed by multiple templates.

Download: https://enterpriseadmins.org/files/Packer-Ubuntu2404-Public.zip

With these files present in a directory, I downloaded the Packer binary for my OS (from: https://developer.hashicorp.com/packer/install?product_intent=packer) and placed it in the same directory. From there I only needed to run two commands.

./packer.exe init .
./packer.exe build .

The first command initializes Packer, which downloads the vSphere plugin we’ve specified. The second command kicks off the template build. In my lab this took ~6 minutes to complete. Once finished, I had a new vSphere template in my inventory which could be deployed easily.
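If you rebuild often, the two commands can be wrapped in a small script. The sketch below is a minimal example, not the configuration from the zip file. It assumes the Packer binary is in the current directory (./packer.exe on Windows, ./packer elsewhere) and adds a validate step before the build. Packer only auto-loads variable files named *.auto.pkrvars.hcl, so the sketch can optionally pass a variables file explicitly with -var-file. The function name build_template is hypothetical.

```shell
#!/bin/sh
# build_template: hypothetical wrapper around the init/build steps from this post.
#   $1 = path to the Packer binary (default ./packer)
#   $2 = optional variables file to pass with -var-file
build_template() {
  packer_bin="${1:-./packer}"
  var_file="${2:-}"
  # Download the plugins declared in required_plugins (e.g. the vSphere plugin).
  "$packer_bin" init . || return 1
  if [ -n "$var_file" ]; then
    # Catch syntax/variable errors before any VM is created in vCenter.
    "$packer_bin" validate -var-file="$var_file" . || return 1
    "$packer_bin" build -var-file="$var_file" .
  else
    "$packer_bin" validate . || return 1
    "$packer_bin" build .
  fi
}

# Example: build_template ./packer.exe variables.pkrvars.hcl
```

Validating first keeps a typo in the HCL from surfacing only after a build has started in vCenter.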

vSphere Customization Specification > Customization script

The customization spec includes things like how to name the VM, the time zone, network settings, etc. The part of the spec that handles most of the desired customizations is the customization script. Getting it right took a bit of trial and error, described in the notes section at the end of this article. I’ve included the final script below for reference. This code runs as part of the virtual machine deployment and is unique to each VM.

#!/bin/sh
if [ x$1 = x"precustomization" ]; then
    echo "Do Precustomization tasks"
    # append group to sudoers with no password
    echo '%lab\ linux\ sudoers ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
elif [ x$1 = x"postcustomization" ]; then
    echo "Do Postcustomization tasks"
    # generate new openssh-server key
    test -f /etc/ssh/ssh_host_dsa_key || dpkg-reconfigure openssh-server
    # make home directories automatically at login
    /usr/sbin/pam-auth-update --enable mkhomedir
    # do a domain join and then modify the sssd config
    echo "VMware1!" | /usr/sbin/realm join lab.enterpriseadmins.org -U svc-windowsjoin --computer-ou "ou=services,ou=lab servers,dc=lab,dc=enterpriseadmins,dc=org"
    sed -i -e 's/^#\?use_fully_qualified_names.*/use_fully_qualified_names = False/g' /etc/sssd/sssd.conf
    systemctl restart sssd.service
fi

Notes / troubleshooting

Since I wanted to make this process as low-touch as possible, I needed to automate several agent installations and other customizations.

I had previously saved some sample configuration files for Ubuntu 22.04 (unfortunately I didn’t bookmark the original source). I cleaned up the files a bit, removing some declared variables that weren’t in use. I downloaded the Ubuntu 24.04 ISO image, placed it on a vSphere datastore, and updated the iso_paths property in the ubuntu.auto.pkr.hcl file and other credential/environmental values in the variables.pkrvars.hcl accordingly.

The initial build completed without incident, creating a vSphere template. The first deployment failed. Reviewing the /var/log/vmware-imc/toolsDeployPkg.log file, I found the message ERROR: Path to hwclock not found. hwclock. There was a KB article for this (https://kb.vmware.com/s/article/95091) related to Ubuntu 23.10, which mentioned that the util-linux-extra package was needed. I added this package to the list of packages in the user-data file and rebuilt the template using packer build. This resolved the issue, and future deployments were successful.

One thing I noticed was that the resulting virtual machine had two CD-ROM devices. I looked around and found a pull request stating that an option to control this behavior existed as of version 1.2.4 of the vSphere plugin. I updated the required_plugins block in the ubuntu.auto.pkr.hcl file to require at least version 1.2.4. I then added reattach_cdroms = 1 later in the file with the other CD-ROM related settings.

One other thing that I noticed in this process was that it would have been helpful to have a date/time stamp either in the VM name or the notes field, just to know when that instance of a template was created. I looked around and found out how to get a timestamp and used that syntax to add a notes = "Template created ${formatdate("YYYY-MM-DD", timestamp())}" property to my ubuntu.auto.pkr.hcl file.

After making the above fixes, I deployed a VM from the latest template and applied a customization spec. That spec contained a customization script to perform a few final tasks: update /etc/sudoers, generate a new openssh-server key, complete the domain join, change the sssd configuration, and finally restart the sssd service. This script failed to execute. Reviewing /var/log/vmware-imc/toolsDeployPkg.log, I noticed the message user defined scripts execution is not enabled. To enable it, please have vmware tools v10.1.0 or later installed and execute the following cmd with root privilege: 'vmware-toolbox-cmd config set deployPkg enable-custom-scripts true'. Back in my user-data configuration file, I added this command to the late-commands section to enable custom scripts in the template.
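Both failures described so far were found by reading toolsDeployPkg.log, so a quick filter over that log can save some scrolling after each test deployment. This is a minimal sketch. The function name check_deploypkg_log is hypothetical. The patterns are simply any line containing ERROR (which covers the hwclock message) plus the custom-scripts message quoted above. The fix shown in the comment is the command the log message itself recommends.

```shell
#!/bin/sh
# check_deploypkg_log: hypothetical helper that prints the guest customization log
# lines relevant to the failures described in this post.
#   $1 = log path (defaults to the location used by VMware Tools guest customization)
# Returns 1 if matching lines were found, 0 if none, 2 if the log is unreadable.
check_deploypkg_log() {
  log="${1:-/var/log/vmware-imc/toolsDeployPkg.log}"
  [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
  # -n prefixes line numbers; -i matches regardless of case.
  if grep -niE 'ERROR|user defined scripts execution is not enabled' "$log"; then
    return 1
  fi
  echo "no customization errors found in $log"
}

# If the custom-scripts message appears, the log suggests running (as root, in the template):
#   vmware-toolbox-cmd config set deployPkg enable-custom-scripts true
```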

After rebuilding the template to enable custom scripts, I deployed a new VM. This did not complete the domain join as I had hoped. All of my commands were running in a precustomization period, before the virtual machine was on the network. I found the following KB article: https://kb.vmware.com/s/article/74880 which described how to run some commands in precustomization and others during postcustomization. Moving the domain join to postcustomization solved this issue, as the VM was on the network when the domain join ran.

I wanted the templates to trust my internal CA, so I added a few commands to the setup.sh script to download the certificate file from an internal webserver and run update-ca-certificates.
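The CA step in setup.sh is only described in prose here, so the following is a minimal sketch of what it could look like on Ubuntu, not the author's script. On Ubuntu, update-ca-certificates picks up PEM files ending in .crt from /usr/local/share/ca-certificates. The URL, the file name, and the function name trust_internal_ca are hypothetical. This sketch uses curl (the original may use a different download tool) because curl can also read file:// URLs, which makes local testing easier.

```shell
#!/bin/sh
# trust_internal_ca: hypothetical helper; downloads a PEM root CA and adds it to the
# system trust store.
#   $1 = URL of the PEM-encoded CA certificate (e.g. on an internal web server)
#   $2 = destination directory (default: Ubuntu's local CA directory)
#   $3 = file name; update-ca-certificates only picks up files ending in .crt
trust_internal_ca() {
  url="$1"
  dest_dir="${2:-/usr/local/share/ca-certificates}"
  name="${3:-internal-root-ca.crt}"
  # UPDATE_CA_CMD is overridable so the download step can be tested without root.
  update_cmd="${UPDATE_CA_CMD:-update-ca-certificates}"

  mkdir -p "$dest_dir" || return 1
  # -f: fail on HTTP errors instead of saving an error page as the "certificate".
  curl -fsSL -o "$dest_dir/$name" "$url" || return 1
  # Refuse to continue if the download does not look like a PEM certificate.
  grep -q 'BEGIN CERTIFICATE' "$dest_dir/$name" || { rm -f "$dest_dir/$name"; return 1; }
  "$update_cmd"
}

# Example (hypothetical URL), run as root inside the template:
#   trust_internal_ca http://webserver.example.com/certs/root-ca.pem
```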

The next task I wanted to complete was the installation of the Aria Automation Config (aka Salt Stack Config) minion. In the past I had used the salt-project version of the minion, but the VMware Tools documentation (https://docs.vmware.com/en/VMware-Tools/12.4.0/com.vmware.vsphere.vmwaretools.doc/GUID-373CD922-AF80-4B76-B19B-17F83B8B0972.html) describes an alternative. I added the open-vm-tools-salt-minion package to the user-data file and had Packer add configuration_parameters to the template to specify the salt_minion.desiredstate and salt_minion.args values.

I also wanted the template to include the Aria Operations for Logs (aka Log Insight) agent. The product documentation showed how to pass configuration during install (https://docs.vmware.com/en/VMware-Aria-Operations-for-Logs/8.16/Agents-Operations-for-Logs/GUID-B0299481-23C1-482D-8014-FAC1727D515D.html). However, I had problems automating the download of the agent. When I tried to wget the link from the agent section of the Aria Ops for Logs console, the resulting file was an HTML redirect. I found this article: https://michaelryom.dk/getting-log-insight-agent which provided an API link to download the package, and I was able to wget that file. I placed the wget and install commands in the setup.sh script that runs right before the new template is powered down.

After rebuilding the template with packer, I deployed another test VM. I confirmed that:

  • SSH worked
  • AD Authentication worked
  • The Aria Ops for Logs agent sent logs
  • My internal CA was trusted
  • The Aria Automation Config minion was reporting (after its key was accepted in the console)

Rebuilding the template VM takes about 6 minutes. Deploying and customizing a VM from the template takes about 2 minutes, and everything I wanted in the VM is ready to go.

Posted in Lab Infrastructure, Scripting, Virtualization

Updating VMware Cloud Foundation (VCF) from 5.1.0 to 5.1.1

I recently upgraded a nested lab from VMware Cloud Foundation (VCF) 5.1.0 to 5.1.1. A few lessons were learned along the way, and I’m going to document those below.

Step 0 – Validate the environment

My VCF lab had been powered down for a couple of months, and during that time many of the credentials expired. SDDC Manager can rotate credentials automatically, but I do not have that configured, and it likely would not have worked anyway (since everything was powered off). One expired credential was the VCSA root user. Typically, when I log into my lab and the root password has expired, I’ll change it temporarily to something complex. I then delete the /etc/security/opasswd file (which contains password history) and set the password back to my default lab value. However, with vCenter Server 8 I was unable to use my trusty lab password due to complexity requirements. I found a couple of places (https://www.reddit.com/r/vmware/comments/186lrdc/comment/kicbdxm/, https://virtuallythatguy.co.uk/how-to-change-vcsa-root-password-and-bypass-bad-password-it-is-based-on-a-dictionary-word-for-vcenter-vcsa-root-account/) which mentioned settings in /etc/security/pwquality.conf. This file did exist on my VCSA, so I changed these parameters:
dictcheck = 0 # changed from 1 to 0
enforcing = 0 # was commented out, uncommented and changed from 1 to 0

This allowed me to reuse my standard password. After doing so, I restarted the VCSA and SDDC Manager for good measure and confirmed that both UIs were responsive.
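The two pwquality.conf changes above could also be made non-interactively, for example when several lab appliances need the same treatment. This is a minimal sketch under some assumptions. Each key is assumed to appear either as an active `key = value` line or as a commented-out `# key = value` line. GNU sed is assumed for `sed -i`. The function name relax_pwquality is hypothetical. The file is backed up first because these settings weaken password checks, which only makes sense in a lab.

```shell
#!/bin/sh
# relax_pwquality: hypothetical helper that sets dictcheck = 0 and enforcing = 0
# in a pwquality.conf-style file, uncommenting the keys if needed (GNU sed assumed).
#   $1 = path to the file (default /etc/security/pwquality.conf)
relax_pwquality() {
  conf="${1:-/etc/security/pwquality.conf}"
  cp -p "$conf" "$conf.bak" || return 1     # keep the original for rollback
  for key in dictcheck enforcing; do
    if grep -qE "^[#[:space:]]*$key[[:space:]]*=" "$conf"; then
      # Rewrite every active or commented-out line for this key.
      sed -i -E "s/^[#[:space:]]*$key[[:space:]]*=.*/$key = 0/" "$conf"
    else
      # Key not present at all: append it.
      echo "$key = 0" >> "$conf"
    fi
  done
}
```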

Once the SDDC Manager was online, it showed that several other credentials (NSX Manager & Edge admin/audit/root, backup SFTP, etc) had been disconnected & were no longer working. I cycled through the various endpoints, changing passwords and ‘remediating’ the password from within SDDC Manager.

For some NSX credentials I was running into an error with remediation stating credentials were incorrect, even though I knew they had been updated. I found this article: https://kb.vmware.com/s/article/88561 which described an NSX API lockout-period, so I followed the resolution in the KB article to finish the password remediation.

Step 1 – Complete pending tasks

Running a precheck for upgrade showed a compatibility validation error for the SDDC Manager, NSX Manager, and vCenter Server components. The details of one error said Check the operationsmanager logs and if this issue persists contact VMware support. Reference token: UNVO1U.

Since the issue was related to compatibility, I first ran /opt/vmware/sddc-support/sos --version-health to confirm that the running versions were the expected versions. (Command output is included below for reference.)

Welcome to Supportability and Serviceability(SoS) utility!
Performing SoS operation for vcf-sddc-01 domain components
Health Check : /var/log/vmware/vcf/sddc-support/healthcheck-2024-04-16-18-33-30-6966
Health Check log : /var/log/vmware/vcf/sddc-support/healthcheck-2024-04-16-18-33-30-6966/sos.log
SDDC Manager : vcf-sddcm-01.lab.enterpriseadmins.org
+-------------------------+-----------+
|          Stage          |   Status  |
+-------------------------+-----------+
|         Bringup         | Completed |
| Management Domain State | Completed |
+-------------------------+-----------+
+--------------------+---------------+
|     Component      |    Identity   |
+--------------------+---------------+
|    SDDC-Manager    | 192.168.10.29 |
| Number of Servers  |       4       |
+--------------------+---------------+
Version Check Status : GREEN
+-----+------------------------------------------------+---------------------------+----------------------+-----------------------+-------+
| SL# |                   Component                    | BOM Version (lcmManifest) |   Running version    | VCF Inventory Version | State |
+-----+------------------------------------------------+---------------------------+----------------------+-----------------------+-------+
|  1  |   ESXI: vcf-vesx-01.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  2  |   ESXI: vcf-vesx-02.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  3  |   ESXI: vcf-vesx-03.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  4  |   ESXI: vcf-vesx-04.lab.enterpriseadmins.org   |       8.0.2-22380479      |    8.0.2-22380479    |     8.0.2-22380479    | GREEN |
|  5  | NSX_MANAGER: vcf-nsxm.lab.enterpriseadmins.org |     4.1.2.1.0-22667789    |  4.1.2.1.0-22667789  |   4.1.2.1.0-22667789  | GREEN |
|  6  |  SDDC: vcf-sddcm-01.lab.enterpriseadmins.org   |          5.1.0.0          |       5.1.0.0        |        5.1.0.0        | GREEN |
|  7  |  VCENTER: vcf-vc-01.lab.enterpriseadmins.org   |    8.0.2.00100-22617221   | 8.0.2.00100-22617221 |  8.0.2.00100-22617221 | GREEN |
+-----+------------------------------------------------+---------------------------+----------------------+-----------------------+-------+
Progress : 100%, Completed tasks : [VCF-SUMMARY, VERSION-CHECK]
Legend:

 GREEN - No attention required, health status is NORMAL
 YELLOW - May require attention, health status is WARNING
 RED - Requires immediate attention, health status is CRITICAL


Health Check completed successfully for : [VCF-SUMMARY, VERSION-CHECK]

I then tried the workaround from this KB article: https://kb.vmware.com/s/article/90074, updating the vcf.compatibility.controllers.compatibilityCheckEnabled property and restarting LCM. This did not correct the issue either.

As a last resort, I followed the instructions in the original error message and reviewed the SDDC Manager logs with tail -f /var/log/vmware/vcf/operationsmanager/operationsmanager.log. I saw several errors related to a version check, similar to:

2024-04-17T14:08:46.217+0000 ERROR [vcf_lcm,661fd7ecb6286abb2c0f18d6e8bd8c95,50f1] [c.v.e.s.l.a.i.InventoryClientHelper,Scheduled-3] Failed to compare VRSLCM version with VCF 4.0.0.0 BOM version in domain ed6e5e6e-c775-44c2-9853-3d85ee90c0cc

This helped me find https://kb.vmware.com/s/article/95790. The KB article provided some SQL commands for updating the vRealize Lifecycle Manager (vRSLCM) build number in the SDDC Manager database. After restarting the lcm service on the SDDC Manager, a failing task to integrate Aria Operations for Logs with VCF restarted automatically and completed. Once complete the SDDC Manager had no pending/failing tasks and the prechecks completed successfully.
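When operationsmanager.log is busy, it can help to see which components log errors most often before reading individual lines with tail -f. This is a minimal sketch that assumes the line layout of the sample error above: timestamp, level, a bracketed trace field, then a bracketed logger,thread field. The function name summarize_errors is hypothetical.

```shell
#!/bin/sh
# summarize_errors: hypothetical helper that counts ERROR lines per logger class in an
# SDDC Manager log, assuming the "<time> ERROR [trace] [logger,thread] message" layout.
#   $1 = log path (default: the operationsmanager log referenced above)
summarize_errors() {
  log="${1:-/var/log/vmware/vcf/operationsmanager/operationsmanager.log}"
  # Keep ERROR lines, capture the logger name (second bracketed field, up to the
  # first comma), then count and sort by frequency.
  grep ' ERROR ' "$log" |
    sed -n 's/^[^[]*\[[^]]*] \[\([^],]*\),.*/\1/p' |
    sort | uniq -c | sort -rn
}
```

On the sample line above, this would report c.v.e.s.l.a.i.InventoryClientHelper, which points at the version comparison.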

Update SDDC Manager

After the password and version check issues were sorted out, the VMware Cloud Foundation SDDC Manager Update was successful. This step took about 30 minutes to complete.

Update NSX

After the SDDC Manager update, the next step was updating NSX Edge clusters. Edge node upgrades were failing with an error similar to:

2024-04-17T14:41:34.279+0000 ERROR [vcf_lcm,661fdf9ecca507a4f0fd8b24fb8032bf,feea] [c.v.evo.sddc.lcm.model.task.SubTask,http-nio-127.0.0.1-7400-exec-1] Upgrade error occured: Check for open alarms on edge node.: [Edge node 926f00f0-c544-40e8-b5e7-da8a9037bc10 has 1 open alarm(s) present. Kindly resolve the open alarm(s) before proceeding with the upgrade.]: vcf-edge-02,
Check for open alarms on edge node.: [Edge node cf6f0b2d-e84d-4953-aa98-166e2a8a40c4 has 1 open alarm(s) present. Kindly resolve the open alarm(s) before proceeding with the upgrade.]: vcf-edge-01
Reference token EC0PAR

Looking at the edge nodes in NSX Manager, they had the error The datapath mempool usage for malloc_heap_socket_0 on Edge node cf6f0b2d-e84d-4953-aa98-166e2a8a40c4 has reached 93% which is at or above the high threshold value of 85%. This is likely because I have small edge nodes but am using services for which medium is the suggested minimum size. I marked the alarms as resolved in NSX Manager and tried again. The upgrade failed with the same error the second time, so I went back to the NSX Manager UI and suppressed this alarm for 2 hours. The next attempt completed as expected.

After the NSX Edge cluster was upgraded, the hosts were each placed in maintenance mode and NSX updates applied. This step completed without incident. I did not capture the timing, but I believe it took about 45 minutes and ran unattended.

During the NSX Manager upgrade, one failure did occur. I did not capture the error message from SDDC Manager, but looking at the NSX Manager virtual machine (there is only 1 in this environment) in vCenter Server, there was one message of “This virtual machine reset by vSphere HA. Reason: VMware Tools heartbeat failure.” This has happened before to this NSX Manager, likely due to the fact it is running as a nested VM on a heavily loaded host. I edited the High Availability VM override for this VM and disabled vSphere HA VM Monitoring. Once all the services were back online from the HA restart, I retried the NSX upgrade from SDDC Manager and it completed without further incident.

Update vCenter Server

Updating the vCenter Server was uneventful. Based on timings recorded by SDDC Manager, it took about an hour to complete this vCenter Server upgrade.

Update ESXi Hosts

The final step of this workflow was to update ESXi to 8.0 Update 2b. This step completed without incident in about an hour (4 nested hosts had to be evacuated and updated).

Lessons Learned

With the exception of some lab/environment specific issues, the upgrade worked as expected. I need to review overall password management in this lab, either coming up with options to manage passwords better or setting password policy to not enforce rotation. As some lab environments may be powered off for fairly long stretches, disabling rotation is likely the better option for this lab. This exercise was also a good reminder that ignoring errors, such as the failed/pending Aria Operations for Logs integration task, can cause unintended consequences. In addition, sizing components appropriately would likely result in less painful upgrades, but would require additional hardware investment.

Posted in Lab Infrastructure, Virtualization

TinyCore 15 x64 Virtual Machine – very small VM for testing

I recently made a post about building a new TinyCore 15 virtual machine for testing. As with past VM builds of this distribution, I used the x86 CorePlus ISO image. Someone asked me why I didn’t use the Pure 64 (x86-64) version instead. I didn’t have a good reason. Many years ago I had struggled to get it working, but I don’t remember any details of the problem I encountered. This post covers steps very similar to the previous article, but builds the x86-64 port in a virtual machine.

The Virtual Machine

When creating the virtual machine, I used the following options:

  • Compatible with: ESXi 6.7 U2 and later (vmx-15)
  • Operating System: Linux / Other 4.x or later Linux (64-bit)
  • 1 vCPU
  • 1 GB RAM
  • 1 GB disk (thin provisioned)
  • Expand Video card > Total video memory = 8MB (when using GUI, for CLI only I left it at the default 4MB)
  • VM Options tab > Boot Options > Firmware = BIOS

These are the same options used in the 32-bit version of this article, except for the Operating System selection (the second item above).

The Install

  • Power on VM
  • Open Remote Console (the one that launches VMRC or VMware Workstation, not the web console)
  • Attach to a local TinyCorePure64 ISO image (specifically I used this ISO: http://tinycorelinux.net/15.x/x86_64/release/TinyCorePure64-current.iso)
  • CTRL+ALT+INS to reboot
  • Select Boot TinyCorePure64 (default)
  • Select Apps > Click the Apps button (top left) > ‘Cloud (Remote)’ > Browse
  • Find the Remote Extension tc-install-GUI.tcz
  • Change the toggle in the bottom left to ‘Download + Load’ and click Go. A window showing progress should appear; this will take a minute to complete.
  • Click the installation button on the task bar. Select Frugal > Whole Disk > sda > install boot loader > ext4
  • Install Extensions from this TCE/CDE Directory, left as default /mnt/sr0/cde
  • Proceed
  • When the display says “installation has completed”, Exit > Shutdown.
  • Power On VM (this ensures the CD is no longer connected, so the VM boots into the new install)
  • The VM likely boots to the error failed in waitforX and leaves you at a tc@box:~$ command prompt.

Customization

In the previous article, I created a bit of automation to build a TinyCore appliance. It sets the hostname, includes my CA certificate, installs open-vm-tools, and installs Firefox when building a GUI version. For this x86-64 version, I’m going to use that same script.

After following the above instructions to install, we should be at a failed in waitforX console. Running the script which installs open-vm-tools will resolve this issue. The previous article outlines the script, its placement on a web server, and some other ancillary files. Assuming those dependencies are already in place, we only need to run:

wget http://www.example.com/build/buildscript2.txt
mv buildscript2.txt buildscript2.sh
chmod +x buildscript2.sh
./buildscript2.sh tc-150x64-gui

The above commands will set the hostname to tc-150x64-gui, and since gui is part of the name, will also install graphical components open-vm-tools-desktop as well as Firefox. Our internal CA will also be trusted at the command prompt and in Firefox. Once the script is complete, we can sudo reboot and confirm everything boots up. When we launch Firefox we’ll be able to see that it is the 64-bit version (from Help > About Firefox).

For a CLI version of this VM, I also made changes to /mnt/sda1/tce/onboot.lst to only include ca-certificates.tcz, curl.tcz, pcre.tcz, and open-vm-tools.tcz. I saved these changes by running backup and then rebooting again to confirm success. After exporting these to OVA files, I now have a folder with a variety of VMs that can be quickly deployed as needed.
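The onboot.lst trim for the CLI version can be scripted as well. This is a minimal sketch, assuming the default frugal-install location /mnt/sda1/tce/onboot.lst with extensions stored in the sibling optional directory. The function name set_onboot is hypothetical. It keeps a .bak copy of the previous list and warns about any extension that isn't downloaded. The text above runs backup and reboots afterwards, and the sketch leaves that step as-is.

```shell
#!/bin/sh
# set_onboot: hypothetical helper that rewrites a TinyCore onboot.lst so only the
# listed extensions load at boot (keeps a .bak copy of the previous list).
#   $1   = path to onboot.lst
#   $2.. = extension file names, written one per line
set_onboot() {
  lst="$1"; shift
  [ -f "$lst" ] && cp "$lst" "$lst.bak"
  dir="$(dirname "$lst")/optional"
  for ext in "$@"; do
    # Warn (but continue) if an extension has not been downloaded yet.
    [ -f "$dir/$ext" ] || echo "warning: $dir/$ext not found" >&2
  done
  printf '%s\n' "$@" > "$lst"
}

# CLI-only list from this post:
#   set_onboot /mnt/sda1/tce/onboot.lst ca-certificates.tcz curl.tcz pcre.tcz open-vm-tools.tcz
```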

Posted in Virtualization

Fine-Tuning Updates: Targeted Host Remediation with vSphere Lifecycle Manager

vSphere 7.0 introduced the ability to manage a cluster of ESXi hosts with a single image. In addition to being a bit easier to configure than the prior Update Manager baselines, this feature also integrates with Hardware Support Managers to update host firmware.

PowerCLI 12.1 introduced the ability to remediate against these single images, using syntax like:

Get-Cluster img-test | Set-Cluster -Remediate -AcceptEULA

This and other vLCM cmdlets are discussed in this blog post: https://blogs.vmware.com/PowerCLI/2020/10/new-release-powercli-12-1-vlcm-enhancements.html.

This is a very simple command, and it works very well for programmatically remediating clusters if you have many of them. However, sometimes we need to remediate hosts in a more controlled order, or as part of a larger workflow. For example, perhaps we need to notify an operations management product or team before a host goes into maintenance mode, to prevent after-hours pages or service desk tickets. This can be done, but requires a few extra cmdlets. The code block below provides sample commands to remediate a specific host in a cluster.

$clusterid = (Get-Cluster 'Img-Test').ExtensionData.MoRef.Value
$vmhostid = (Get-VMHost 'h178-vesx-04.lab.enterpriseadmins.org').ExtensionData.MoRef.Value

# Build the apply spec for the host(s) to update; multiple host IDs can be passed comma separated.
$SettingsClustersSoftwareApplySpec = Initialize-SettingsClustersSoftwareApplySpec -Hosts $vmhostid -AcceptEula $true

# Apply the specification object to the cluster; the cmdlet returns a task ID.
$taskId = Invoke-ApplyClusterSoftwareAsync -Cluster $clusterid -SettingsClustersSoftwareApplySpec $SettingsClustersSoftwareApplySpec

# The apply task runs asynchronously, so find it in the task list to watch its progress.
$task = Get-Task | Where-Object { $_.Id -eq "CisTask-$taskId" }

# Loop until the task is no longer queued or running.
While ($task.State -in 'Queued', 'Running') {
    "sleeping..."
    Start-Sleep -Seconds 60
    $task = Get-Task | Where-Object { $_.Id -eq "CisTask-$taskId" }
}

# Report the final state (for example Success or Error).
$task.State

In the example above we get the ID values of the cluster and host. These look like domain-c3146330 and host-3146220 and are the values required by the cmdlets we are using.

We then create a spec to apply. The parameters are described here: https://developer.vmware.com/apis/vsphere-automation/latest/esx/api/esx/settings/clusters/cluster/softwareactionapplyvmw-tasktrue/post/. I was originally confused by the -commit parameter until I found this documentation. I had assumed the property was similar to a git commit message and was passing in random strings of text. The property is optional, so in the above example I do not pass in the argument at all. However, if you’d like to find the current/expected value, you can get it from the cluster object like this: (Invoke-GetClusterSoftwareCompliance -Cluster 'domain-c3146214').commit

Next we pass our clusterid and above spec to the ‘Invoke’ function. The command returns a task ID, so we are capturing the output to a variable so we can use it later. The Invoke command runs asynchronously, so we will use this ID to check on the status until the task completes.

At the end we are checking for the task to complete. The Get-Task cmdlet has an -ID parameter, but in my testing I would get an error of The identifier CisTask-525d3596-4a7a-60ab-df10-ab97999b8511:com.vmware.esx.settings.clusters.software resulted in no objects. when I tried to pass the ID to Get-Task. Using where-object instead worked reliably, so I used it instead.

Hopefully this helps if you have the need to remediate a single host in a cluster using a vLCM Image.

Posted in Lab Infrastructure, Scripting, Virtualization

Video Card Power Consumption & Savings

The primary system in my homelab is a Dell Precision 7920 Tower. I recently had the case open, and looking inside I saw the video card. This is a physically large Nvidia GeForce RTX 2080 Ti GPU, but for my purposes it was overkill. The system typically runs virtual machine workloads that do little to nothing video related. I noticed that this GPU had supplemental power cabling running to the card. That made me wonder how much extra power it would consume, even in an idle state.

I have a Kill A Watt monitor which can measure power usage, so I connected it up and powered on the system. It booted to ESXi and sitting in maintenance mode the system used between 160 and 180 watts of power, typically running at the lower end of that range.

I then removed the GPU and replaced it with a lower-end MSI GeForce 210 (https://www.amazon.com/dp/B003XM568I) for $40 USD. This card gets all its power from the PCI bus; no extra power input is required. Checking this configuration with the same Kill A Watt, from the same maintenance mode state previously tested, I was using between 100 and 120 watts.

Using an electricity calculator, saving 60 watts 24/7 at around $0.15 USD/kWh works out to about $79 USD per year. The ROI for this replacement video card is great, as it pays for itself in about 6 months. I had incorrectly assumed that the GPU wouldn’t consume much power in an idle state. This test confirmed that a minor change could realize significant energy savings without impacting my specific use case.
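The savings figure is simple arithmetic. Watts saved × 24 × 365 ÷ 1000 gives kWh per year, and multiplying by the electricity rate gives the yearly cost. The small awk-based sketch below reproduces the calculation using the 60 W, $0.15/kWh, and $40 figures from this post; the function names are hypothetical. With those inputs it prints yearly savings of $78.84 and a payback of about 6.1 months, in line with the rounded numbers above.

```shell
#!/bin/sh
# annual_savings: hypothetical helper; yearly cost of a constant load.
#   $1 = watts saved, $2 = price per kWh
annual_savings() {
  awk -v w="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", w * 24 * 365 / 1000 * rate }'
}

# payback_months: hypothetical helper; months until a one-time cost is recovered.
#   $1 = purchase price, $2 = yearly savings
payback_months() {
  awk -v cost="$1" -v yearly="$2" 'BEGIN { printf "%.1f\n", cost / yearly * 12 }'
}

savings="$(annual_savings 60 0.15)"
echo "yearly savings: \$$savings"
echo "payback: $(payback_months 40 "$savings") months"
```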

Posted in Lab Infrastructure