I was recently working on an issue where a query in Aria Operations for Logs was not returning an event that I fully expected to be present. After a bit of troubleshooting, I found that the ESXi host was sending the logs to syslog, but a firewall was preventing the logs from being received. Reflecting on this, I realized that there are many failure scenarios in which a host is properly configured but something in the path is causing problems. The image below shows some of the possible failure points: anywhere a log message has to traverse a firewall or forwarder is a suspect.
As we can see above, some syslog topologies can be complex, and that complexity introduces more places to fail. ESXi host firewalls, physical firewalls, and any log forwarding device can be a place where events are lost. I wanted to create a script to help identify some of these gaps, which we’ll outline below.
Part 1 – Sending a Test Message
For this test, I wanted to use the esxcli system syslog mark command to send a message. To make this message easy to find in Aria Operations for Logs, I generated a GUID to include in the message so that I could search for it later. Any unique string will work, but a GUID is easy to generate for each test. Also, in larger environments with good configuration management, I may not need to test every host, so I added a bit of logic to the script to test only a percentage of the available hosts.
# Generate a unique marker so the test message is easy to find later
$newGuid = [guid]::NewGuid().Guid
$message = @{'message'="$newGuid - Test Message"}
$percent = Read-Host -Prompt "What percentage of Hosts should we review? "

# For each random host, send a syslog message with esxcli
$sendResults = @()
$hosts = Get-VMHost -State:Connected
$hostCount = [math]::Ceiling(( $hosts | Measure-Object).Count * ($percent / 100))
$hosts | Get-Random -Count $hostCount | Sort-Object Name | ForEach-Object {
    $esxcli2 = $_ | Get-EsxCli -V2
    $sendResults += $_ | Select-Object Name,
        @{N='SyslogServer';E={($_ | Get-AdvancedSetting -Name Syslog.global.logHost).Value}},
        @{N='SyslogMarkSent';E={$esxcli2.system.syslog.mark.Invoke($message)}}
}
The above code will create an array, $sendResults, with one entry for each host where the test syslog message was sent. In the next section we’ll see which of those events made it to our Aria Operations for Logs instance.
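As a side note on the sampling math, here is a minimal sketch of how a percentage translates into a host count. Get-SampleCount is a hypothetical helper, not part of the script above; it only illustrates that rounding up means at least one host is tested whenever any hosts are available.

```powershell
# Hypothetical helper (an assumption for illustration, not from the original script).
# Converts a percentage into a number of hosts, always rounding up.
function Get-SampleCount {
    param(
        [Parameter(Mandatory)][int]$Total,
        [Parameter(Mandatory)][ValidateRange(1, 100)][int]$Percent
    )
    # Multiply before dividing so whole-number results stay exact
    [int][math]::Ceiling([double]($Total * $Percent) / 100)
}

Get-SampleCount -Total 10 -Percent 25   # 3 (2.5 rounded up)
Get-SampleCount -Total 40 -Percent 1    # 1 (0.4 rounded up)
```

Typing the parameter as [int] also means that the string returned by Read-Host is validated and converted before it is used in the calculation.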
Part 2 – Query the Aria Operations for Logs events API
To make sure our syslog ‘mark’ messages made it from ESXi to our centralized Aria Operations for Logs instance, we’ll use the API to query for logs containing the $newGuid value we sent in Part 1.
The first couple of lines of this script take care of logging into the API. We then send an event query and build a hashtable, keyed on hostname, that holds the timestamp string of each matching event. This allows us to index into our results to see when Aria Operations for Logs received our event. Finally, we’ll loop through all the hosts that were sent a test message in Part 1 and get the event timestamp from our hashtable.
# Log in to the API and capture the session token
$loginBody = @{username='admin'; password='VMware1!'; provider='Local'} | ConvertTo-Json
$loginToken = (Invoke-RestMethod -Uri 'https://syslog.example.com:9543/api/v2/sessions' -Method 'POST' -Body $loginBody).sessionId
# Query for events containing the GUID sent in Part 1
$myEvents = Invoke-RestMethod -Uri "https://syslog.example.com:9543/api/v2/events/text/CONTAINS%20$($newGuid)?limit=1000&timeout=30000&view=SIMPLE&order-by-direction=DESC" -Headers @{Authorization="Bearer $loginToken"}
# Build a hashtable keyed on hostname for quick lookups
$queryHt = $myEvents.results | Select-Object hostname, timestampString | Group-Object -Property hostname -AsHashTable
# Look up the event timestamp for each host that was sent a test message
$finalResults = @()
foreach ($check in $sendResults) {
    $finalResults += $check | Select-Object *, @{N='FoundInLogs';E={ $queryHt[$_.Name].timestampString }}
}
$finalResults
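To illustrate the hashtable lookup on its own, here is a small sketch that runs without any API connection. The host names and timestamps are made up for the example, and the -AsString switch is an addition that is not in the script above.

```powershell
# Made-up events shaped like the hostname/timestampString pairs selected above
$sampleEvents = @(
    [pscustomobject]@{ hostname = 'esx-a.example.com'; timestampString = '2024-01-01 10:00:00' }
    [pscustomobject]@{ hostname = 'esx-b.example.com'; timestampString = '2024-01-01 10:00:02' }
)

# Group by hostname; -AsString forces the keys to be plain strings
$lookup = $sampleEvents | Group-Object -Property hostname -AsHashTable -AsString

# A host with a matching event returns its timestamp
$lookup['esx-a.example.com'].timestampString

# A host with no matching event returns nothing, which leaves FoundInLogs empty
$lookup['esx-c.example.com'].timestampString
```

Because the lookup is an exact match on the key, a host whose events arrive under a different name (for example a short name or an IP address instead of the FQDN) would also come back empty, so an empty value is a prompt to investigate rather than proof that the message was lost.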
If all goes as expected, every test host should have a value in every column, with the ‘FoundInLogs’ column showing a fairly recent timestamp. Instead, we found this in our lab:
Name                        SyslogServer                 SyslogMarkSent FoundInLogs
----                        ------------                 -------------- -----------
h259-vesx-43.example.com    udp://192.168.45.73:514      true           2024-11-17 20
h259-vesx-44.example.com    udp://192.168.45.73:514      true
h259-vsanwit-01.example.com                              true
test-vesx-71.example.com    udp://syslog.example.com:514 true           2024-11-17 20
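With only four hosts the gaps are easy to spot by eye, but a longer result set may benefit from a filter. The following is a minimal sketch that assumes rows shaped like $finalResults; the sample rows copy the lab output (including the timestamp as displayed), and the Reason column is a hypothetical addition.

```powershell
# Sample rows shaped like $finalResults, copied from the lab output
$finalResults = @(
    [pscustomobject]@{ Name = 'h259-vesx-43.example.com'; SyslogServer = 'udp://192.168.45.73:514'; SyslogMarkSent = 'true'; FoundInLogs = '2024-11-17 20' }
    [pscustomobject]@{ Name = 'h259-vesx-44.example.com'; SyslogServer = 'udp://192.168.45.73:514'; SyslogMarkSent = 'true'; FoundInLogs = $null }
    [pscustomobject]@{ Name = 'h259-vsanwit-01.example.com'; SyslogServer = $null; SyslogMarkSent = 'true'; FoundInLogs = $null }
    [pscustomobject]@{ Name = 'test-vesx-71.example.com'; SyslogServer = 'udp://syslog.example.com:514'; SyslogMarkSent = 'true'; FoundInLogs = '2024-11-17 20' }
)

# Keep only the hosts whose test message was not found, with a likely reason
$problems = $finalResults |
    Where-Object { [string]::IsNullOrEmpty($_.FoundInLogs) } |
    Select-Object Name, @{N='Reason';E={
        if ([string]::IsNullOrEmpty($_.SyslogServer)) { 'No syslog destination configured' }
        else { 'Mark sent but not found in logs' }
    }}
$problems
```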
Above we observe two hosts without a value in ‘FoundInLogs’, one of which doesn’t even have a syslog destination configured. The first of these (h259-vesx-44) does have syslog configured, but our test message was not received. Investigating this host specifically, we find that the host firewall rule allowing outbound syslog was not enabled, as seen in the screenshot below (where we’d expect the check box to be selected):
I had deliberately cleared that check box so that the test would fail and we could verify our script logic. The other host (a vSAN witness host) does not have a syslog destination defined at all. This happened to be a gap in how configurations were applied in this environment: the host exists outside of a cluster, and we are managing this setting at the cluster level. It’s an oversight that is easily corrected. However, without testing we may not have uncovered these issues.
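The same firewall rule can also be checked from PowerCLI rather than the UI. This is a rough sketch, assuming an existing Connect-VIServer session and that the outbound syslog ruleset is named 'syslog' on your hosts; it needs a live environment, so treat it as a starting point rather than a tested procedure.

```powershell
# Assumes an active PowerCLI session (Connect-VIServer) and a ruleset named 'syslog'
Get-VMHost -State Connected | ForEach-Object {
    $rule = $_ | Get-VMHostFirewallException -Name 'syslog'
    # Report the host name alongside the state of its syslog firewall rule
    [pscustomobject]@{
        Name    = $_.Name
        Ruleset = $rule.Name
        Enabled = $rule.Enabled
    }
}
```

Any host showing Enabled as False would be a candidate for the kind of silent drop described above.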
Conclusion
Automation can help ensure not only that settings are consistently configured across an environment but also that the end-to-end flow is working. Hopefully this approach can help identify logging problems before those logs are needed.