VM metrics missing from Azure Monitor after scale set upgrade and custom extension deployment

We recently upgraded our VM scale set from Standard_D4s_v3 to Standard_D8s_v4 SKUs and deployed a custom monitoring extension. Since the upgrade, we’re no longer receiving VM-level metrics in Azure Monitor-CPU percentage, memory usage, and disk IOPS are all missing from our dashboards.

The VMs are running fine and our applications are working, but the lack of monitoring visibility is a major concern for our SLA compliance. I’ve verified that the Azure Monitor agent is installed on all instances, and the diagnostic settings appear to be configured correctly at the scale set level. However, when I check individual VM instances, some show the monitoring extension in a ‘transitioning’ state.

Is there a known compatibility issue between the D8s_v4 SKU and certain monitoring extensions? Or could the upgrade process have corrupted the diagnostic settings configuration?

I checked the logs and found this error repeating: ‘Failed to connect to Azure Monitor endpoint - certificate validation failed’. That’s strange because we haven’t changed any certificate configurations. Could this be related to the SKU change?

Great catch on the managed identity! I checked and the system-assigned identity was indeed missing after the upgrade. I’ve re-enabled it and assigned the Monitoring Metrics Publisher role. Should I reinstall the monitoring extension now, or will it automatically detect the identity?

Your issue stems from three interrelated problems that occurred during the scale set upgrade:

1. VMSS SKU Compatibility: The v4 SKU series uses Generation 2 VMs with a different hypervisor architecture than v3. While the Azure Monitor agent is compatible with both, the upgrade process doesn’t automatically migrate extension configurations in a way that accounts for these architectural differences. When you upgraded, the extension settings were copied but not validated against the new SKU’s requirements. This is why you saw the ‘transitioning’ state-the extension was trying to initialize using v3-specific parameters that don’t apply to v4 VMs.

2. Monitoring Extension Logs Analysis: The certificate validation error you found in the logs is the key diagnostic indicator. This error specifically points to an authentication failure between the VM and Azure Monitor’s data ingestion endpoints. The extension uses the VM’s managed identity to obtain authentication tokens, which are then used to establish TLS connections to Azure Monitor. When the managed identity is missing or misconfigured, the extension falls back to certificate-based authentication, which fails because it can’t validate the chain of trust without proper identity credentials.

To properly analyze these logs in the future:

  • For Linux VMs: Check /var/log/azure/Microsoft.Azure.Monitor.AzureMonitorLinuxAgent/
  • For Windows VMs: Check C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.Monitor.AzureMonitorWindowsAgent
  • Look for ‘AuthenticationException’ or ‘IdentityNotFound’ errors
  • Cross-reference timestamps with Azure Activity Logs to correlate extension operations with identity changes

3. Diagnostic Settings Restoration: Here’s the complete remediation process:

First, verify your current configuration:

  • Confirm the system-assigned managed identity is enabled on the VMSS
  • Verify the identity has ‘Monitoring Metrics Publisher’ role at the subscription or resource group level
  • Check that the identity also has ‘Log Analytics Contributor’ if you’re sending logs to a workspace

Next, update the VMSS extension configuration:

  • Remove the existing Azure Monitor extension from the VMSS model
  • Wait 10-15 minutes for all instances to process the removal
  • Add the latest version of the extension (check Azure portal for the current version number)
  • Configure the extension with explicit diagnostic settings pointing to your Log Analytics workspace
  • Trigger a manual update on all VMSS instances to force immediate extension deployment

Finally, validate the fix:

  • After 5 minutes, check that metrics are flowing by querying your Log Analytics workspace
  • Verify all VM instances show the extension in ‘Succeeded’ state
  • Create a test alert rule to confirm end-to-end monitoring functionality

The key lesson here is that VMSS upgrades involving SKU changes require explicit validation of extension compatibility and managed identity preservation. Always snapshot your extension configuration before major upgrades and verify identity assignments immediately after.

The certificate error suggests a networking or managed identity issue rather than SKU compatibility. When you upgraded the VMSS, did you preserve the managed identity configuration? The Azure Monitor agent requires either a system-assigned or user-assigned managed identity to authenticate. If the upgrade process recreated the VMs without properly transferring the identity, that would explain both the certificate error and the missing metrics. Check the Identity blade on your scale set and verify the managed identity is still assigned and has the ‘Monitoring Metrics Publisher’ role.