Firmware update job stuck in pending state due to data-storage queue overflow

We’re running aziot-24 and managing jobs with the Azure CLI. We’ve scheduled firmware updates for approximately 2,000 edge devices, but the jobs stay stuck in ‘pending’ status for hours, sometimes days. The Azure portal shows the jobs as created successfully, but the device modules never receive the update command.

Diagnostics reveal our module queue has over 15,000 messages backed up. We’re using the standard tier IoT Hub with default quota settings. Here’s how we’re creating the jobs:

az iot hub job create --hub-name prod-iot-hub \
  --job-id fw-update-batch-001 --job-type scheduleUpdateTwin \
  --twin-patch '{"properties":{"desired":{"firmware":"v2.1.0"}}}'

Resource allocation seems adequate: IoT Hub shows 70% capacity utilization. Has anyone dealt with firmware job state transitions getting blocked by queue management issues?

Good insights. I checked the throttling metrics and we ARE hitting limits - specifically the ‘twin update operations’ quota. Our devices are mostly online (95%+ connectivity), so that’s not the issue. The queue backlog is growing because we’re sending twin updates faster than the throttle limit allows them to be processed. I’m now looking into chunking the 2,000 devices into smaller batches.
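For the chunking, here is a rough sketch of what I have in mind. It reads device IDs from stdin, groups them into fixed-size batches, and prints one job-create command per batch without running anything. The batch size of 100, the plan_batches and print_job helper names, and the deviceId IN [...] query condition are my own assumptions, not something taken from the docs for this scenario. A very long IN list may run into query length limits, so a tag-based condition might be the better route for large batches.

```shell
#!/bin/sh
# Sketch: dry-run planner that prints one `az iot hub job create` command
# per batch of device IDs. Nothing is executed against the hub.

HUB_NAME="prod-iot-hub"                  # hub name from the original post
BATCH_SIZE="${BATCH_SIZE:-100}"          # assumed value; tune against throttle metrics
TWIN_PATCH='{"properties":{"desired":{"firmware":"v2.1.0"}}}'

# $1 = batch number, $2 = comma-separated, single-quoted device IDs
print_job() {
  printf "az iot hub job create --hub-name %s --job-id fw-update-batch-%03d --job-type scheduleUpdateTwin --query-condition \"deviceId IN [%s]\" --twin-patch '%s'\n" \
    "$HUB_NAME" "$1" "$2" "$TWIN_PATCH"
}

# Reads one device ID per line from stdin and emits one command per batch.
plan_batches() {
  ids="" count=0 batch_no=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue             # skip blank lines
    ids="${ids:+$ids,}'$id'"
    count=$((count + 1))
    if [ "$count" -ge "$BATCH_SIZE" ]; then
      batch_no=$((batch_no + 1))
      print_job "$batch_no" "$ids"
      ids="" count=0
    fi
  done
  # Flush the final, possibly smaller, batch.
  if [ "$count" -gt 0 ]; then
    batch_no=$((batch_no + 1))
    print_job "$batch_no" "$ids"
  fi
}
```

Usage would be something like plan_batches < devices.txt (a hypothetical file with one device ID per line). I would review the printed commands, then run them one at a time and wait for each job to finish before starting the next.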

Have you looked at the firmware job state transitions in detail? Jobs can get stuck in ‘pending’ if the IoT Hub can’t establish the initial connection to enough devices to meet the minimum threshold. By default, the job waits for at least 5% of target devices to be online before transitioning to ‘running’. If your devices are intermittently connected or behind NAT with long keepalive intervals, this could explain the delay. Try adding the --start-time parameter to schedule jobs during peak device connectivity windows.
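A minimal sketch of the scheduling idea, assuming the start time is passed as an ISO 8601 UTC timestamp. The scheduled_job_cmd helper is hypothetical and only prints the command so you can check it first. The "*" query condition (all devices) is also my assumption; as far as I know the CLI expects a query condition for this job type.

```shell
#!/bin/sh
# Sketch: print (not run) a job-create command with a scheduled start time.

HUB_NAME="prod-iot-hub"                  # hub name from the original post
TWIN_PATCH='{"properties":{"desired":{"firmware":"v2.1.0"}}}'

# $1 = job ID, $2 = UTC start time such as 2030-01-01T02:00:00Z
scheduled_job_cmd() {
  # Reject anything that is not a simple ISO 8601 UTC timestamp.
  printf '%s\n' "$2" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$' || {
    echo "start time must look like YYYY-MM-DDTHH:MM:SSZ" >&2
    return 1
  }
  printf "az iot hub job create --hub-name %s --job-id %s --job-type scheduleUpdateTwin --query-condition \"*\" --start-time %s --twin-patch '%s'\n" \
    "$HUB_NAME" "$1" "$2" "$TWIN_PATCH"
}
```

For example, scheduled_job_cmd fw-update-batch-001 2030-01-01T02:00:00Z prints the full command with that placeholder timestamp; swap in whatever window your connectivity data points to.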

The 15,000 message backlog is definitely your bottleneck. Standard tier IoT Hub has a per-device message throttle limit. With 2,000 devices, you’re likely hitting the aggregate throttle ceiling. Check your IoT Hub metrics for throttling errors - they often don’t surface in the portal’s main view. You might need to upgrade to a higher tier or implement batch processing with smaller device groups.
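To size the batches, it helps to estimate how long a backlog takes to drain at a given throttle rate. A back-of-the-envelope sketch follows; the drain_seconds helper is hypothetical, and the rate you pass in is a placeholder you need to replace with the actual limit for your tier and unit count from the quotas documentation.

```shell
#!/bin/sh
# Sketch: estimate how many seconds a backlog needs to drain at a fixed rate.

# $1 = queued operations, $2 = operations per second allowed by the throttle
drain_seconds() {
  backlog="$1" rate="$2"
  [ "$rate" -gt 0 ] || return 1          # avoid division by zero
  # Integer ceiling of backlog / rate.
  echo $(( (backlog + rate - 1) / rate ))
}
```

With an assumed rate of 10 operations per second, drain_seconds 15000 10 gives 1500 seconds, or about 25 minutes, and that only holds if nothing new is queued in the meantime. If new twin updates arrive faster than the limit, the backlog never drains, which matches what you are seeing.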