Cannot configure identity source due to Type or value exists.

On vCenter Server 7.0u3p (aka 7.0.3.01800), I recently experienced an error “Cannot configure identity source due to Type or value exists.” when configuring Active Directory over LDAPS. The issue was caused by a duplicate certificate, but that fact was not immediately obvious.

To configure AD over LDAPS, we must provide the certificate used by the domain controller. The KB article https://kb.vmware.com/s/article/2041378 shows how to use openssl s_client to obtain this certificate from port 636 (LDAPS).
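
The command from that KB looks roughly like the following; the domain controller name here is just a placeholder:

openssl s_client -connect dc01.example.com:636 -showcerts

The PEM block(s) printed in the output can be copied into a .cer file to present to vCenter.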

Obtaining the certificates from each domain controller and presenting both in the “Edit Identity Source” screen (as shown below) would result in the following error:

Tailing the /storage/log/vmware/vmdird/vmdird-syslog.log file, we noticed entries similar to the following when saving the above configuration:

2024-01-23T13:39:26.847703+00:00 err vmdird  t@140567635818240: InternalAddEntry: VdirExecutePostAddCommitPlugins - code(9619)
2024-01-23T13:39:26.848501+00:00 err vmdird  t@140567635818240: VmDirSendLdapResult: Request (Add), Error (LDAP_TYPE_OR_VALUE_EXISTS(20)), Message (Invalid or duplicate (userCertificate)), (0) socket (127.0.0.1)

The “Invalid or duplicate (userCertificate)” part of this error was interesting. After checking with the directory services folks, they confirmed they had placed the same certificate on multiple domain controllers, listing each domain controller name/IP in the subject alternative name (subjectAltName) field. When using openssl s_client to obtain the certificates, each DC returned the exact same value, which explains the duplicate.

To work around this issue, we left both servers listed in the “Edit Identity Source” screen, but only provided a single certificate file. This change saved successfully and didn’t result in the ‘Type or value exists’ error message.

Posted in Lab Infrastructure, Virtualization

vSphere ESXi Host Certificate Status Alarm bulk resolution

In the vSphere UI, some hosts will occasionally trigger an “ESXi Host Certificate Status” alarm. VMware Skyline has a finding for this issue as well: vSphere-HostCertStatusAlarm. The resolution is typically straightforward: right-click the host > Certificates > Renew Certificate. However, if you have hundreds of hosts where this needs to happen, the UI can be tedious. This post will explore why the certificates expire and how to automate their replacement when needed.

By default, these certificates are issued by the VMware Certificate Authority (VMCA). A cert is issued to the host when the host is added to vCenter Server. The validity period for the certificate is configured using a vCenter advanced setting. From Inventory > vCenter > Configure > Settings > Advanced Settings, the value for vpxd.certmgmt.certs.daysValid is the number of days of validity that will be requested for a renewed certificate. The default should be 1825, which is 5 years.

In this view you can also see the vpxd.certmgmt.certs.minutesBefore value. This controls the start of the certificate's validity period. The default value of 1440 (24 hours) ensures that anything validating this certificate doesn't consider it not yet valid because it was too recently issued. I mention this as it'll be relevant in some of the examples below.
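
These settings can also be checked from PowerCLI. A quick sketch, assuming an existing Connect-VIServer session:

# view the current certificate validity settings on the connected vCenter
Get-AdvancedSetting -Entity $global:DefaultVIServer -Name 'vpxd.certmgmt.certs.daysValid','vpxd.certmgmt.certs.minutesBefore'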

From Administration > Certificates > Certificate Management, the ‘VMware Certificate Authority’ tile shows the validity of the VMCA certificate. This is the latest date through which a newly issued certificate can be valid. For example, in my lab this value is Nov 9th, 2028. That is a touch under 5 years away, so even though I would request a 5 year validity period (the vpxd.certmgmt.certs.daysValid setting above), this VMCA cert expires prior to that and can only issue certs through this slightly shorter date. The VMCA certificate is typically valid 10 years from when vCenter Server is first deployed or the cert is replaced. It can be regenerated with certificate-manager (https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.authentication.doc/GUID-E1D35792-ED03-468A-966B-362BED18021A.html), but doing so will restart all vCenter services and is more disruptive than the ESXi host certificate renewal process.

To figure out how to automate this functionality, I enabled code capture (Developer Center > Code Capture) and recorded the action of renewing the certificate from the UI. This showed me what the UI was doing to make the request and gave me a lead on “CertificateManager” being something I should search for.

Armed with that info, I was able to put together a short PowerCLI script to query the cert values for all hosts and store the output in a variable so we can use it later. In the output I included the moref of the host, because that will be needed if we want to call the CertMgrRefreshCertificates_Task method that we identified using Code Capture. CertificateInfo has additional properties such as Issuer (the vCenter) and Subject, but those aren't required for this exercise so I didn't include them.

# create a collection to store ESXi Host & certificate details
$myHostCertStatus = @()
foreach ($thisESX in Get-VMHost -State:Connected | Sort-Object Name) {
  $certMgr = Get-View -Id $thisESX.ExtensionData.ConfigManager.CertificateManager
  $myHostCertStatus += $certMgr.CertificateInfo | Select-Object @{N='VMHost';E={$thisESX.name}}, @{N='moref';E={$thisESX.extensiondata.moref}}, NotBefore, NotAfter, Status
}

With our new variable populated, we can look at the specific host from the UI action to cross reference our timestamps. Here is the event from the UI:

Here is the output from our variable, filtered down to only the host we are interested in.

$myHostCertStatus | Where-Object {$_.VMhost -match 'euc-esx-21'}

VMHost    : euc-esx-21.lab.enterpriseadmins.org
moref     : HostSystem-host-25598
NotBefore : 2/12/2024 8:31:07 PM
NotAfter  : 11/9/2028 6:51:22 PM
Status    : good

We can see that the NotBefore value is Monday the 12th at 8:31 PM (EST is -5 hours, so this is 24 hours prior to the 3:31 time from the screenshot above). The NotAfter is shorter than the 5 years requested because it follows the VMCA expiry date of Nov 9th.

We now have all the info we need: a way to validate current certificate status, the method we need to call to renew a certificate, and a host we are willing to test with. In the example below I'm filtering our variable down to only one test host and passing that item to the method identified from code capture. A task ID is returned by PowerCLI.

# renew the certificate for our single test host using the vCenter-level CertificateManager
$myHostCertStatus | Where-Object {$_.VMhost -match 'euc-esx-21'} | ForEach-Object {
  $thisCertMgr = Get-View -Id 'CertificateManager-certificateManager'
  $thisCertMgr.CertMgrRefreshCertificates_Task($_.moref)
}

Type Value
---- -----
Task task-3269126

If we look in the UI, we can confirm this task was executed and get the timestamp of the request.

Re-running the host query block from above to check this host's output, we can see that the NotAfter value has not changed (it is still constrained by the VMCA validity) but the NotBefore value has been updated.

VMHost    : euc-esx-21.lab.enterpriseadmins.org
moref     : HostSystem-host-25598
NotBefore : 2/13/2024 1:34:23 PM
NotAfter  : 11/9/2028 6:51:22 PM
Status    : good

Now that we've confirmed the certificate has been replaced and the expiration date aligns with our expectations, we can tweak this to look at the NotAfter or Status properties and run the same code against a larger set of hosts.
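
As a rough sketch of that bulk approach, assuming the $myHostCertStatus collection from above and an arbitrary 90-day expiration threshold, something like this would renew only the hosts that need it:

# select hosts whose certificate expires within 90 days or isn't reporting 'good'
$renewTargets = $myHostCertStatus | Where-Object { $_.NotAfter -lt (Get-Date).AddDays(90) -or $_.Status -ne 'good' }

# call the same refresh task identified with code capture for each remaining host
$thisCertMgr = Get-View -Id 'CertificateManager-certificateManager'
foreach ($target in $renewTargets) {
  $thisCertMgr.CertMgrRefreshCertificates_Task($target.moref)
}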

Posted in Lab Infrastructure, Scripting, Virtualization | Leave a comment

Testing Syslog from the command line

From time to time it is helpful to be able to send a syslog message to confirm that things are working correctly: firewall ports are open, nothing in the path is filtering out the traffic, and, by including a timestamp in the message body, that times are being received correctly. I recently saw a post on Twitter showing a way to send a syslog message from the command line (https://twitter.com/nickrusso42518/status/1756711901088698584). The tweet showed the following syntax:

echo "<14>Test UDP syslog message" >> /dev/udp/10.0.0.1/514
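
For reference, the <14> at the start of the message is the syslog priority (PRI) value, calculated as facility * 8 + severity; 14 works out to facility 1 (user-level) with severity 6 (informational).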

Unfortunately, when I tested this on an ESXi host I found the /dev/udp target is not present. However, knowing that sending a message like this was possible, I remembered that ESXi hosts do provide netcat (nc) and wanted to see if the same type of syntax would work with that command. A quick search later, I found an example that did exactly what I wanted:

echo '<14>bwuchner-test-syslog sent at 2024-02-15 9:38:05 EST' | nc -v -u -w 0 192.168.45.80 514

The above worked great, even from an ESXi host. To round out my notes, I wanted to find a similar way of doing this from Windows boxes as well. My go-to shell on Windows is PowerShell, since it comes out of the box on all supported Windows versions. A quick search turned up a function that did exactly what I was hoping for: https://gist.github.com/PeteGoo/21a5ab7636786670e47c. I'll include the function below for reference, along with the syntax to use it to send a syslog message.

function Send-UdpDatagram
{
      Param ([string] $EndPoint,
      [int] $Port,
      [string] $Message)

      # resolve the endpoint name (or IP) and build the target IP endpoint
      $IP = [System.Net.Dns]::GetHostAddresses($EndPoint)
      $Address = [System.Net.IPAddress]::Parse($IP)
      $EndPoints = New-Object System.Net.IPEndPoint($Address, $Port)
      # encode the message as ASCII and send it as a single UDP datagram
      $Socket = New-Object System.Net.Sockets.UDPClient
      $EncodedText = [Text.Encoding]::ASCII.GetBytes($Message)
      $SendMessage = $Socket.Send($EncodedText, $EncodedText.Length, $EndPoints)
      $Socket.Close()
}

Send-UdpDatagram -EndPoint 192.168.45.80 -Port 514 -Message '<14>bwuchner-test-syslog from powershell 2024-02-15 9:41:52 EST'

I was able to confirm each of these methods worked to send a test syslog message to Aria Operations for Logs (formerly known as vRealize Log Insight).

Posted in Scripting

Unlocking Seamless Connectivity with Tailscale

I may be a bit late to the party, but I recently found out about Tailscale. I was looking for a remote access solution for my lab, which is behind my ISP's Carrier-Grade NAT (CGNAT). This means I don't have a publicly accessible IP address, so I needed a solution that could overcome that configuration. I had heard of Tailscale from a colleague and figured I'd give it a spin. In a minimal amount of time I went from no remote access, to having full remote access to my entire network, and then to a site-to-site tunnel to another colleague's lab. This post will outline the various steps along the way.

Step 1: Remote Access

To get started, I wanted to install the Tailscale client on a system in my lab and on a laptop connected to a different network. In this basic configuration, I assumed I could treat the lab system as a jump box: connect to it with SSH or RDP and reach other devices on my network from there. This was super easy… just install the OS-specific application on each system and bam! It was the easy button for setting up remote access; it just worked.

Step 2: Subnet Router

While reading the Tailscale documentation, I noticed a feature called a Subnet Router. This is designed for devices where the Tailscale client can't be installed, like network printers, and allows those devices to be reached from other devices on the Tailscale tailnet. I deployed an Ubuntu 20.04 VM to act as my subnet router. The install was straightforward; on the Linux VM I just needed to run a few commands from the console:

curl -fsSL https://tailscale.com/install.sh | sh

echo 'net.ipv4.ip_forward = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf
echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf
sudo sysctl -p /etc/sysctl.d/99-tailscale.conf

sudo tailscale up --advertise-routes=192.168.0.0/17 --accept-routes=true --snat-subnet-routes=false

The Tailscale documentation at https://tailscale.com/kb/1019/subnets was also very helpful in describing these commands/steps.

Step 3: Site-to-Site Connectivity

While setting up the subnet router, I noticed the docs had some details on site-to-site networking (https://tailscale.com/kb/1214/site-to-site). This looked very interesting, as I had previously wanted to set up cross-site networking to demo VMware Site Recovery Manager. The only caveat I saw in the documentation was:

This scenario will not work on subnets with overlapping CIDR ranges

I pinged a colleague of mine to see if they would be interested in peering networks and, if so, what IP addresses they used in their lab. It turns out we had some minor overlapping segments, but luckily the segments on my side were internal-only/non-routed networks (dedicated to storage and vMotion). I made a few changes to which subnets I advertised on my side, added an iptables statement to clamp the TCP MSS (to account for the tunnel overhead), and added a couple of route statements within the physical network as described in the Tailscale docs. The updated lines on my subnet router look like this:

iptables -t mangle -A FORWARD -i tailscale0 -o eth0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

sudo tailscale up --advertise-routes=192.168.10.0/24,192.168.32.0/20,192.168.127.0/24 --accept-routes=true --snat-subnet-routes=false

My colleague deployed a subnet router with a very similar configuration and also added some routes to his physical network.

curl -fsSL https://tailscale.com/install.sh | sh

echo 'net.ipv4.ip_forward = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf
echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf
sudo sysctl -p /etc/sysctl.d/99-tailscale.conf

iptables -t mangle -A FORWARD -i tailscale0 -o eth0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

tailscale up --advertise-routes=192.168.55.0/24,192.168.60.0/24 --accept-routes=true --snat-subnet-routes=false

With these configurations in place, we could each ping devices on the other's network. I've created a diagram, which includes the routes created in the physical network, as a reference.
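
Exact syntax varies by platform, but on many layer 3 switches the static routes in question look something like the sketch below, which sends traffic for one of the remote subnets to the subnet router's LAN address (here the 192.168.40.31 address that shows up as the second hop in the traceroute that follows):

ip route 192.168.60.0 255.255.255.0 192.168.40.31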

As a test, from a device in my lab, I did a trace route to a device on my colleague's network.

> tracert 192.168.60.10

Tracing route to 192.168.60.10 over a maximum of 30 hops

  1     1 ms     1 ms     1 ms  192.168.127.252
  2    <1 ms    <1 ms    <1 ms  192.168.40.31
  3    30 ms    24 ms    26 ms  100.105.105.105
  4    25 ms    24 ms    27 ms  192.168.60.10

Trace complete.

As we can see, traffic goes to my lab's layer 3 switch, is sent to the Tailscale subnet router, which sends it to a 100.105.x.x address (on the tailnet), and then reaches the IP at the remote site. With IP connectivity established, the next step was to make name resolution work. Since we both have Pi-hole DNS servers on our networks, this was accomplished by adding conditional forwarding on each of our Pi-hole servers. With conditional forwarding in place, we are able to query our own DNS servers for the other's lab domain names, which in turn query the correct server. In the Tailscale admin console, DNS is configured for split DNS in a similar way: requests for my lab domains go to my DNS servers and requests for their domains go to their DNS servers.
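
On the Pi-hole side, a conditional forward like this can be expressed as a dnsmasq 'server' directive in a file under /etc/dnsmasq.d/. A sketch, with a hypothetical remote domain and remote DNS server IP:

server=/remotelab.example.org/192.168.60.5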

What is great is that with this third option configured, we not only have site-to-site connectivity, but can reach both networks even while remote, thanks to the Tailscale client installed on a mobile device. For example, while connected to a mobile hotspot and not connected directly to either lab, I'm able to trace route to devices on each network.

> tracert 192.168.60.10
Tracing route to 192.168.60.10 over a maximum of 30 hops
  1   486 ms    73 ms    76 ms  agb-vpnrtr01.tail1234.ts.net. [100.105.105.105]
  2   239 ms    80 ms    84 ms  192.168.60.10
Trace complete.


> tracert 192.168.127.30
Tracing route to CORE-CONTROL-21 [192.168.127.30]
over a maximum of 30 hops:
  1    56 ms    11 ms     8 ms  net-vpnrtr-01.tail1234.ts.net. [100.123.123.123]
  2    24 ms    37 ms    55 ms  192.168.40.1
  3    61 ms     9 ms    13 ms  CORE-CONTROL-21 [192.168.127.30]
Trace complete.

Summary

I was surprised how easy it was to set up Tailscale, even in a fairly complex network with overlapping address space. The documentation was easy to follow, the setup was quick, and performance has been very good. I set out to solve one specific problem and in short order solved it, expanding the lab to an entirely different site along the way.

Posted in Lab Infrastructure, Virtualization

Jenkins & Java Upgrade on Windows

One service that I use quite frequently in my lab is Jenkins. I have it running on a Windows VM with a variety of tasks that run from there, some on a schedule and others triggered by webhooks. For example, I have a job that Aria Automation calls to create records on my Windows DNS server for Linux/nested ESXi builds.

I recently was looking in the Manage Jenkins section and noticed two issues: one that Jenkins needed an upgrade, and another that stated “You are running Jenkins on Java 11, support for which will end on or after Sep 30, 2024.”

Updating Jenkins was super easy. I created a snapshot of the VM, in case things went sideways, and then pushed the button to update Jenkins from within the web console. This took care of itself, and when Jenkins restarted it was current. I let this sit a few days, ran a variety of tests, and when I was happy that everything was stable I deleted the VM snapshot.

The second task was to update Java, and I decided to do this a few days after the above Jenkins update. That way, if something went wrong, it would be easier to know whether it was a Jenkins or a Java issue. I'm glad I did, as I ran into two issues when updating Java, described below.

To start the upgrade process, I downloaded the latest version of Java 17 JDK from https://adoptium.net/temurin/releases/?os=windows&arch=x64&version=17. I also backed up my D:\Program Files\Jenkins and C:\Users\svc-jenkins\AppData\Local\Jenkins folders. I had done this prior to updating Jenkins and decided it would be wise to do again for the Java update. I then took a snapshot of the virtual machine as one last restore point.

With backups in place, I stopped the Jenkins service (from services.msc on the Windows VM) and uninstalled Java JDK 11 from Add/Remove Programs. This was the only application using Java, so I wasn't worried about other application dependencies. I then installed JDK 17 into D:\Program Files\Eclipse Adoptium\jdk-17.0.9.9-hotspot, selecting the options to add an entry to my PATH, associate JAR files, and set the JAVA_HOME variable.

After the installation completed, I attempted to start the Jenkins service, but it stopped immediately. I then decided to reboot, as I had changed system environment variables and wanted to make sure those were in effect, but the service still did not start on boot. Since I knew that the path to java.exe had changed, I went looking for a Jenkins configuration that pointed at the old file system path. I found such an entry in D:\Program Files\Jenkins\jenkins.xml and updated the <executable> location. After doing so the service started successfully; however, I was only able to access it locally from the server console and not from a remote machine.
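
For reference, the relevant portion of jenkins.xml looks roughly like this after the change (trimmed down, with the rest of the service definition omitted):

<service>
  ...
  <!-- updated from the old JDK 11 path to the new JDK 17 install location -->
  <executable>D:\Program Files\Eclipse Adoptium\jdk-17.0.9.9-hotspot\bin\java.exe</executable>
  ...
</service>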

I checked the Windows firewall and found an inbound rule for Jenkins that was restricted to only one program: the previous java.exe path. I updated the ‘This program’ value on the Programs and Services tab to D:\Program Files\Eclipse Adoptium\jdk-17.0.9.9-hotspot\bin\java.exe, which resolved the remote access issue.
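
If you'd rather script that firewall change than click through the UI, a sketch using the built-in NetSecurity cmdlets is below. The rule display name 'Jenkins' is an assumption; check what the inbound rule is actually called on your system first.

# point the existing inbound Jenkins rule at the new java.exe location
Set-NetFirewallRule -DisplayName 'Jenkins' -Program 'D:\Program Files\Eclipse Adoptium\jdk-17.0.9.9-hotspot\bin\java.exe'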

Now my Jenkins & Java versions are up-to-date and everything is working as expected. Hopefully this article helps someone else who runs into issues with this upgrade.

Posted in Lab Infrastructure, Scripting