Automating Cluster Management with Aria Operations API

As part of routine maintenance, it is sometimes necessary to take an Aria Operations cluster offline. For example, it is recommended to take the cluster offline to perform backups (https://docs.vmware.com/en/VMware-Aria-Operations/8.12/Best-Practices-Operations/GUID-1D058B4A-93BA-44D1-8794-AE8E1B96B3E4.html).

Since most folks want to schedule backups, it is important to be able to leverage automation to take the cluster offline. There is a cluster management API document at https://ops.example.com/casa/api-guide.html that has some details on how to do this.

Authentication

When logging into this API, I provided the admin username/password combination. Here is an example of checking the cluster state using that method:

$creds = Get-Credential
(Invoke-RestMethod -URI https://ops.example.com/casa/sysadmin/cluster/online_state -Credential $creds).cluster_online_state_snapshot

However, I’d prefer to use a centrally managed service account in Active Directory for such tasks. The ability to do this was first introduced in vRealize Operations 8.6 (doc) and still exists in Aria Operations 8.18 (doc). It depends on an Active Directory configuration that is separate from the one in the product UI. The links provided show where and how to configure this identity provider from the /admin interface. Here is a screenshot showing this configuration:

Once Active Directory is configured for admin operations, we need to change our API authentication slightly to be able to use it. In the original example, we provided our username and password as a PowerShell credential object. In this example, we’ll make an extra API call to authenticate, then use the resulting bearer token as a header when checking the status. A code sample is below. Note the Authorization header: it passes vrops-ldap followed by the base64-encoded string of username (as an AD userPrincipalName), colon, and password to the authorize resource. That resource returns a token that we’ll provide as a header to check the cluster status.

$b64 = [System.Convert]::ToBase64String([System.Text.encoding]::ASCII.GetBytes("h267-opsbu@lab.enterpriseadmins.org:VMware1!"))

$authorize = Invoke-RestMethod -Uri 'https://ops.example.com/casa/authorize' -Method Post -ContentType 'application/json' -Headers @{Authorization="vrops-ldap $b64"; Accept='application/json'}

(Invoke-RestMethod -URI https://ops.example.com/casa/sysadmin/cluster/online_state -Headers @{Authorization="Bearer $($authorize.accessToken)"; Accept='application/json'} -ContentType 'application/json').cluster_online_state_snapshot
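If you would rather not type the password into the script, the header can be built from a credential object instead. This is only a minimal sketch: the function name New-CasaLdapAuthHeader is my own invention (it is not part of the casa API), and it encodes with UTF8 rather than the ASCII encoding used above, which is an assumption on my part that should only matter for non-ASCII passwords.

```powershell
# Hypothetical helper (not part of the casa API): builds the 'vrops-ldap'
# Authorization header from a PSCredential instead of a hard-coded string.
function New-CasaLdapAuthHeader {
    param(
        [Parameter(Mandatory)]
        [System.Management.Automation.PSCredential]$Credential
    )
    # UserName is expected to be the AD userPrincipalName (user@domain)
    $user = $Credential.UserName
    $pass = $Credential.GetNetworkCredential().Password

    # Same "username:password" layout as the example above, base64 encoded
    $bytes = [System.Text.Encoding]::UTF8.GetBytes("${user}:${pass}")
    $b64   = [System.Convert]::ToBase64String($bytes)

    @{ Authorization = "vrops-ldap $b64"; Accept = 'application/json' }
}

# Example usage (commented out because Get-Credential prompts interactively):
# $headers   = New-CasaLdapAuthHeader -Credential (Get-Credential)
# $authorize = Invoke-RestMethod -Uri 'https://ops.example.com/casa/authorize' -Method Post -ContentType 'application/json' -Headers $headers
```

The returned hashtable can be passed straight to -Headers on the authorize call, and the rest of the examples continue to use $authorize.accessToken as before.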

Taking the cluster offline

With the authentication sorted out above, we can now post to this API to take the cluster offline. You’ll notice that we set the state to offline and provide a reason why. The example uses the same bearer token that we created in the above example.

$body = @{ 'online_state'='OFFLINE'; 'online_state_reason'='Lets back this thing up.'} | convertto-json
Invoke-RestMethod -URI https://ops.example.com/casa/sysadmin/cluster/online_state -Body $body -Method POST -ContentType 'application/json'  -Headers @{Authorization="Bearer $($authorize.accessToken)"; Accept='application/json'}

The above example submits a request to take the cluster offline but returns immediately after doing so. In the URI we could provide a ?async=false so that our command waits until completion. Another option would be to submit an async request (default), then create a loop to periodically check the cluster state using the prior ‘get’ request until the cluster is offline. I prefer the periodic polling option, as you can code in your own counter/timing/failure logic as needed.
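Here is a rough sketch of what that polling loop could look like. The function name Wait-ClusterState, its parameters, and the default timings are all hypothetical choices of mine. The usage example also assumes that the snapshot object exposes an online_state property holding values like OFFLINE; check the response from your own environment before relying on that.

```powershell
# Hypothetical helper: polls until a state-returning script block reports the desired value.
function Wait-ClusterState {
    param(
        [Parameter(Mandatory)][scriptblock]$GetState,  # returns the current state as a string
        [Parameter(Mandatory)][string]$DesiredState,   # for example 'OFFLINE' or 'ONLINE'
        [int]$TimeoutSeconds = 900,
        [int]$IntervalSeconds = 15
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        try {
            $state = & $GetState
        } catch {
            # The API may be briefly unreachable during a state change; treat it as "not yet"
            Write-Warning "State check failed: $($_.Exception.Message)"
            $state = $null
        }
        if ($state -eq $DesiredState) { return $true }
        if ((Get-Date) -ge $deadline) { return $false }
        Start-Sleep -Seconds $IntervalSeconds
    }
}

# Example usage, reusing $authorize from the token request above.
# ASSUMPTION: the snapshot has an 'online_state' property.
# $headers = @{ Authorization = "Bearer $($authorize.accessToken)"; Accept = 'application/json' }
# $offline = Wait-ClusterState -DesiredState 'OFFLINE' -GetState {
#     (Invoke-RestMethod -Uri 'https://ops.example.com/casa/sysadmin/cluster/online_state' -Headers $headers).cluster_online_state_snapshot.online_state
# }
# if (-not $offline) { throw 'Cluster did not go offline before the timeout.' }
```

Passing the state check in as a script block keeps the timing and failure logic separate from the API call, so the same loop can wait for ONLINE after the backup completes.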

If you check out the docs at /casa/api-guide.html, you’ll also see examples of setting the “Show reason on maintenance page” checkbox via the JSON body.

Bring the cluster back online

After our maintenance / backup task is complete, we’ll want to bring the cluster back online. In this example we don’t need to provide a reason in our body.

$body = @{ 'online_state'='ONLINE'} | convertto-json
Invoke-RestMethod -URI https://ops.example.com/casa/sysadmin/cluster/online_state?async=false -Body $body -Method POST -ContentType 'application/json' -Headers @{Authorization="Bearer $($authorize.accessToken)"; Accept='application/json'}

In this example I’m using the ?async=false so that the API call doesn’t return until the cluster is back online. Again, we could opt to use the default async request and periodically poll the service if we’d like.

Conclusion

The casa API is very useful for automating cluster management tasks. This article focuses on a few examples related to cluster state changes and authentication, but the API supports many other things, like PAK file uploads, NTP & certificate management, and even the configuration of AD authentication. You should check out /casa/api-guide.html on an Aria Operations node for more examples.

Posted in Lab Infrastructure, Scripting, Virtualization

Scaling Your Tests: How to Set Up a vCenter Server Simulator

I’ve recently been testing a handful of reporting tools against vCenter Server endpoints. I have several lab instances for various major releases, which allow me to test a wide range of configurations. However, there are some tests that are hard to simulate, like what if a cluster has 100 hosts, or one host has 200 VMs?

Years ago I remember stumbling on a vCenter Server simulator. I didn’t have a specific need for it at the time but with this recent testing I checked, and the tool still exists and is actively maintained here: https://github.com/vmware/govmomi/blob/main/vcsim/README.md. There is even a container image available. This article will show how to create a single VM that can help simulate many vCenter Servers for our various reporting requirements.

Server setup

I started with an Ubuntu 24.04 virtual machine (deployed from a template previously created with Packer, as described in this previous post). I then installed docker compose and made a few configuration changes to make docker a bit easier to work with.

# install docker compose, add our active directory user & root to the docker group
sudo apt install docker-compose-v2
sudo usermod -aG docker ${USER}
sudo usermod -aG docker root
# Log off/on for new group membership to become effective

# create a folder for some of our configs, make the docker group owner of that folder
sudo mkdir /data
sudo chgrp docker /data
sudo chmod g+srw -R /data 

# configure docker to not overlap network ranges and make ranges smaller than default
echo '{"default-address-pools":[{"base":"172.17.0.0/16","size":26}]}' | sudo tee --append /etc/docker/daemon.json
sudo systemctl restart docker

# create a folder for our simulator under /data (referenced by compose.yml below)
mkdir /data/vcsim
cd /data/vcsim

Sample vCenter Inventories

William Lam maintains a GitHub repo with recordings of some lab inventories, available here: https://github.com/lamw/govc-recordings. I downloaded this repo and extracted the contents to /data/vcsim/sims, which provides 5 different lab environments from about 4 years ago.

This repo also contains instructions on how to save an existing inventory for use in the simulator. Getting ready to do some maintenance or decommission an environment? Might as well save a point-in-time snapshot of the inventory that we can report against later.

Docker compose.yml

Looking at this file, you’ll see 3 instances of the vcsim container running. There are two generic inventories, controlled by parameters passed to command for things like number of hosts per cluster, port groups, and listening port. The third vCenter inventory is one of the saved inventories from the govc-recordings repo. I’m running multiple saved inventories, but in the interest of saving space have only included one sample below. Each vcsim container will get a container IP and listen on port 8989. In this example, nginx handles all of the incoming traffic. We reference a reverse_proxy.conf file as well as a certificate folder; both are discussed later.

In the /data/vcsim/ folder, compose.yml has the following contents:

version: '2'
services:
  # This nginx web server will terminate at a wildcard SSL certificate, then use the first label
  # of the DNS entry to send requests to port 8989 on the associated container, for example:
  # https://small.vcsim.example.com:443 --> https://small:8989
  nginx:
    image: nginx:latest
    volumes:
      - /data/vcsim/reverse_proxy.conf:/etc/nginx/conf.d/default.conf
      - /data/vcsim/cert:/etc/nginx/certs
    ports:
      - "443:443"
    restart: unless-stopped

  # Here are a few simulated VCs with progressively larger inventories
  small:
    image: vmware/vcsim:latest
    command: -api-version "8.0.3" -cluster 1 -dc 1 -folder 5 -host 8 -standalone-host 0 -l "0.0.0.0:8989" -vm 20
    restart: unless-stopped
  medium:
    image: vmware/vcsim:latest
    command: -api-version "8.0.3" -cluster 2 -dc 2 -folder 5 -host 16 -standalone-host 0 -l "0.0.0.0:8989" -vm 100
    restart: unless-stopped

  # The following example is from the govc-recordings GitHub repo
  wlam7:
    image: vmware/vcsim
    command: -load /simdata -l "0.0.0.0:8989"
    restart: unless-stopped
    volumes:
      - /data/vcsim/sims/vcsim-vcsa.primp-industries.local:/simdata

nginx reverse_proxy.conf

This nginx configuration will extract the subdomain from the request, for example the word small from small.vcsim.example.com, and store it in a $sub variable. We will then proxy the request for each subdomain to the appropriate container, for example https://small.vcsim.example.com:443/ to the individual container at https://small:8989/. In DNS we only need to create a single CNAME record for *.vcsim.example.com that points to our container host. Any container that gets created will automatically be available via our wildcard subdomain.

I’ve not included the certificate files, but as you can see from the compose.yml and reverse_proxy.conf files, there is a directory at /data/vcsim/cert containing a PEM formatted certificate and private key named wildcard-vcsim-example-com.pem and wildcard-vcsim-example-com.key. This allows all our requests (from PowerCLI or the like) to have a valid certificate, while we only need to maintain a single certificate file.

server {
   # Listen for any HTTPS request
   listen 443 ssl;

   # Extract the subdomain name from 'SUB.vcsim.example.com'
   server_name ~^(?<sub>[^.]+)\.vcsim\.example\.com$;

   # Define the path to the certificate and key
   ssl_certificate /etc/nginx/certs/wildcard-vcsim-example-com.pem;
   ssl_certificate_key /etc/nginx/certs/wildcard-vcsim-example-com.key;

   # For any request, proxy to the container name on port 8989
   location / {
        resolver 127.0.0.11 valid=1s;
        proxy_pass https://$sub:8989;
   }
}

Bring Up

With our compose.yml, reverse_proxy.conf, certificate files, and simulated inventories in place, we are ready to start up the service. To do this, we only need to run a single command:

docker compose up -d

When the system restarts, these containers will restart automatically. If we want to check stats or logs, we can do so with the following:

docker stats
docker compose logs

When running the docker compose commands we need to be in the /data/vcsim/ folder, or specify that folder via arguments.
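Once the containers are up, a quick sanity check is to connect to each simulator with PowerCLI and count what it returns. This is a sketch under a few assumptions: PowerCLI is installed, the wildcard DNS record and certificate described above are in place, and vcsim is running with its default login behavior, which as far as I know accepts any non-empty username and password unless the -username and -password flags are set. The 'user' and 'pass' values are placeholders.

```powershell
# Service names from compose.yml; each one becomes the first label of the DNS name
$sims = 'small', 'medium', 'wlam7'

foreach ($sim in $sims) {
    # ASSUMPTION: default vcsim login behavior, so these credentials are placeholders
    $server = Connect-VIServer -Server "${sim}.vcsim.example.com" -User 'user' -Password 'pass'

    # Summarize the simulated inventory; @() guards against single-item results
    [pscustomobject]@{
        Simulator = $sim
        Clusters  = @(Get-Cluster -Server $server).Count
        Hosts     = @(Get-VMHost -Server $server).Count
        VMs       = @(Get-VM -Server $server).Count
    }

    Disconnect-VIServer -Server $server -Confirm:$false
}
```

The counts should line up with the -cluster, -host, and -vm values passed in compose.yml, which makes it easy to spot a container that loaded the wrong inventory.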

Conclusion

Using the above steps, I have a single lab VM with ~10 vCenter inventories… using less than 2GB of RAM and 2 vCPUs. There are of course some caveats, like no UI and some methods that don’t exist in the simulator, but for many test scenarios when you just need the SOAP API this works great.

Posted in Lab Infrastructure, Virtualization

Automate workaround for ESX Admins group

In a recent security advisory (VMSA-2024-0013), there is a workaround listed for hosts older than ESXi 8.0u3 (https://knowledge.broadcom.com/external/article/369707). This knowledge base article lists a few advanced settings and an esxcli command which can be run to apply this workaround. Setting advanced settings and invoking esxcli are two things that PowerCLI can do very well. The following code sample highlights those commands and helps automate the process listed in the knowledge base article.

$vmhosts = Get-Cluster h243-cluster | Get-VMHost
foreach ($vmHost in $vmhosts) {
  Write-Host "Processing host $($vmHost.Name)"
  # Get advanced setting, if it is not the desired value, set it to the desired value.
  $vmhost | Get-AdvancedSetting Config.HostAgent.plugins.hostsvc.esxAdminsGroupAutoAdd | ?{$_.Value -ne $false} | Set-AdvancedSetting -Value $false -Confirm:$false
  $vmhost | Get-AdvancedSetting Config.HostAgent.plugins.vimsvc.authValidateInterval | ?{$_.Value -ne 90} | Set-AdvancedSetting -Value 90 -Confirm:$false
  $vmhost | Get-AdvancedSetting Config.HostAgent.plugins.hostsvc.esxAdminsGroup | ?{$_.Value -ne ''} | Set-AdvancedSetting -Value '' -Confirm:$false

  # Find and remove the default admin group if present (ends with \esx admins)
  $esxcli = $vmhost | Get-EsxCli -V2
  $esxcli.system.permission.list.Invoke() | ?{$_.IsGroup -eq $true -AND $_.Principal -match [regex]::escape('\esx^admins')+'$' -AND $_.Role -eq 'Admin' } | %{
    write-host "Found group $($_.Principal) and will attempt to remove."
    $removeGroup = $esxcli.system.permission.unset.CreateArgs()
    $removeGroup.id = $_.Principal
    $removeGroup.group = $_.IsGroup
    $esxcli.system.permission.unset.invoke($removeGroup)
  }

  # List current system permissions for reference
  $esxcli.system.permission.list.Invoke()
} # end vmhosts loop
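After the loop finishes, it can be handy to read the three settings back for every host as a simple report. This is a minimal sketch that reuses the cluster name and setting names from the example above; the VMHost column name is just my choice for the output.

```powershell
# The three advanced settings touched by the workaround
$settingNames = @(
    'Config.HostAgent.plugins.hostsvc.esxAdminsGroupAutoAdd',
    'Config.HostAgent.plugins.vimsvc.authValidateInterval',
    'Config.HostAgent.plugins.hostsvc.esxAdminsGroup'
)

# Read the current values back from each host and show them in one table
Get-Cluster h243-cluster | Get-VMHost | ForEach-Object {
    Get-AdvancedSetting -Entity $_ -Name $settingNames |
        Select-Object @{Name = 'VMHost'; Expression = { $_.Entity.Name }}, Name, Value
} | Sort-Object VMHost, Name | Format-Table -AutoSize
```

Any host still showing a different value stands out, which is a quick way to catch a host that was disconnected or in maintenance when the loop ran.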

For more detail on these PowerCLI cmdlets, check out the documentation links below:
Get-AdvancedSetting
Set-AdvancedSetting
Get-EsxCli

Posted in Scripting, Virtualization

Troubleshooting vCenter permission errors

I was recently helping troubleshoot an issue where a service account was configured with the least privileges possible. When the service attempted to perform a specific operation, an access denied message was encountered. The service performing this action immediately cleaned up after itself, deleting the virtual machine that was created.

Typically in the UI we can see a warning event on an object when a required privilege is missing. For example, in the following screenshot a read only service account attempted to change the CPU Count for a virtual machine. This operation failed due to a missing permission, but we can clearly see the missing privilege is VirtualMachine.Config.CPUCount.

However, in our specific case the affected object was destroyed automatically, and we didn’t have an opportunity to view this event in the UI on our specific VM. We could have likely found this event on a parent object, but the environment had a lot of events occurring, making it difficult to find in the UI. Instead, we used PowerCLI to filter the logs for what we needed. In this sample we are using Get-VIEvent to query for all events in the last 15 minutes, then filter on the client side where the event text contains my service account and the Event Type is the specific NoPermission event we were interested in.

Get-VIEvent -Types Warning -Start (Get-Date).AddMinutes(-15) -Finish (Get-Date) | 
Where-Object {$_.FullFormattedMessage -match 'svc-vspherero' -AND 
       $_.EventTypeId -eq 'com.vmware.vc.authorization.NoPermission'} |
Select-Object FullFormattedMessage, ObjectName

This worked for our case. Another option would be to use the Get-VIEventPlus function from here: https://www.lucd.info/2013/03/31/get-the-vmotionsvmotion-history/. Using this custom function, an EventFilterSpec is created so that vCenter only returns the NoPermission events. This is more efficient than filtering on the client side and lets us return more applicable events. Using the sample below, we can group our NoPermission events and see how many times they occurred.

$allNoPerm = Get-VIEventPlus -EventType com.vmware.vc.authorization.NoPermission
$allNoPerm | Group-Object -Property FullFormattedMessage | 
  select-object Name, Count | Sort-Object -Property Count -Descending

For example, using the above code in my lab I found some additional service accounts that were missing required permissions. I was able to review documentation and confirm that the required privileges were updated between when I initially created the custom role and the current version of the service.
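If you are curious what a server-side filter looks like without the helper function, here is a rough sketch that builds an EventFilterSpec directly against the EventManager view. It assumes an existing Connect-VIServer session, and the one-day window is an arbitrary choice of mine. QueryEvents only returns a single, size-limited batch, so an event history collector is the usual approach when more results are needed.

```powershell
# Assumes an existing Connect-VIServer session
$eventManager = Get-View EventManager

# Ask vCenter to return only NoPermission events from the last day
$filter = New-Object VMware.Vim.EventFilterSpec
$filter.EventTypeId    = @('com.vmware.vc.authorization.NoPermission')
$filter.Time           = New-Object VMware.Vim.EventFilterSpecByTime
$filter.Time.BeginTime = (Get-Date).AddDays(-1)
$filter.Time.EndTime   = Get-Date

# Group the returned events by message to see which denials occur most often
$eventManager.QueryEvents($filter) |
    Group-Object -Property FullFormattedMessage |
    Sort-Object -Property Count -Descending |
    Select-Object Count, Name
```

Because the filtering happens on the vCenter side, only the matching events cross the wire, which matters in a busy environment.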

I hope these sample queries help identify missing privileges if needed.

Posted in Scripting, Virtualization

Linux P2V with vCenter Converter Standalone

I was recently speaking with someone who was encountering an Access Denied issue when trying to P2V an older CentOS physical machine. The specific error they were getting was The thumbprint of the remote host you are connecting to is: Access denied. Screenshot below for reference:

I was able to recreate this error in a lab. In this post we’ll explore the cause of the problem and a couple of possible solutions.

Looking at the documentation (https://docs.vmware.com/en/vCenter-Converter-Standalone/6.6/vcenter-converter/GUID-E6C55568-EE61-4D1F-A3DC-71269790D9FD.html), we see a few prerequisites to prepare the source Linux machine. The first two are:

  • Enable SSH on the source Linux machine.
  • Make sure that you use the root account or account with sudo privileges without password prompt for all commands to convert a powered on Linux machine.

The initial error of “The thumbprint of the remote host you are connecting to is: Access denied.” occurred when using the root account for a host where PermitRootLogin was set to no.

There are two possible solutions for this issue, described in that second bullet. We can either get the root account functional (over SSH), or switch to a user with sudo access.

Enabling SSH for the root user

Many sshd_config files disable root login over SSH by default. You typically need to log in as a regular user and then switch to root (su -) or prefix commands with sudo. To keep things as secure as possible, while temporarily satisfying this requirement, we can add a couple of lines to the very end of our sshd_config file:

Match Address 192.168.127.194,192.168.127.81
     PermitRootLogin yes

Once we’ve saved our config file changes, we’ll reload the SSHD config to make this change active. For this older CentOS test system, the command is service sshd reload.

This allows us to enable root login over SSH, but only for two client systems. The first IP I have listed is the Windows system where I’m running the VMware vCenter Converter Standalone client. This client logs in to view the source machine details while we are configuring the job. The second IP address listed is the temporary IP address assigned to the helper VM. I’m using a static IP address for the helper VM, which allows me to put that specific IP address in my sshd config. If I were using a DHCP address, I could list the subnet instead, for example: 192.168.127.0/24. For reference, failure to list this helper VM address results in the error Permission denied, please try again.

Once the P2V migration is complete, we can edit the /etc/ssh/sshd_config file again, this time removing the two lines added. We’ll reload the config, again with service sshd reload.

sudo without password prompt

If the preference is to designate a specific account to use instead of root, that can be done instead. The account will need to be able to execute all commands with sudo privileges without password. In this example, we will create a specific user just to do the P2V:

sudo useradd myp2vuser
sudo passwd myp2vuser # set a password for this user

Once our specific user is created, we’ll edit the /etc/sudoers file to add the following line:

myp2vuser  ALL=(ALL)  NOPASSWD: ALL

Once we’ve saved the changes to our sudoers file, we’ll attempt to log in over SSH. We should be able to log in without issue. To test that sudo is working, I like to run a command like cat /etc/shadow. It should fail with a permission denied error. We can then run sudo cat /etc/shadow, which should work without being prompted for a password.

Once the migration is complete, we can remove our temporary user with the command userdel myp2vuser and remove the entry added to the /etc/sudoers file.

Conclusion

With either of the above options from the product documentation, we are able to complete a P2V migration without the ‘Access Denied’ error previously reported. Hopefully this helps if you run into a similar issue!

Posted in Virtualization