Firmware update events not reaching devices during network interruptions

jessica_solver · March 13, 2025, 6:12pm

Our edge devices frequently miss firmware update events when network connectivity is unstable. We push updates to 2000+ devices across remote locations, but delivery rate is only 65-70% on first attempt. Devices that miss the event never receive the update until we manually retry.

Current event delivery shows this error pattern:


MQTT Publish Failed: Connection timeout
Device: edge-device-1247
Topic: /firmware/update/v2
QoS: 0

We’re using QoS 0 for performance, but wondering if that’s the issue. Need advice on event persistence, retry mechanisms, and tracking device state during network issues.

amanda_guru · March 16, 2025, 8:28pm

Good points. What about devices that stay offline for extended periods? Don’t want messages queued indefinitely. How do you handle dead-letter scenarios?

paul_sql · March 14, 2025, 1:28am

Beyond QoS, implement persistent sessions with cleanSession=false. This tells the broker to queue messages for disconnected clients. When device reconnects, it receives all missed messages automatically. We also maintain device state tracking in IoT Asset Monitor - tracks last successful firmware version, last contact time, and pending update status. This gives visibility into which devices need manual intervention versus just waiting for reconnection.

michael_builder · April 14, 2025, 2:49am

Implemented the suggested changes. Here’s what worked for us to get from 70% to 98% delivery rate.

helenadmin · April 20, 2025, 8:42pm

Excellent results! Here’s the complete solution that addresses all aspects of reliable firmware delivery:

Event Persistence: Upgraded MQTT configuration to enable persistent sessions and message durability:


cleanSession: false
sessionExpiryInterval: 259200
messagePersistence: true

Broker now retains messages for disconnected clients up to 72 hours (259200 seconds). Persistent sessions survive broker restarts, ensuring no message loss during maintenance windows.

Retry Mechanisms: Implemented multi-layer retry strategy:

Immediate Layer: MQTT QoS 1 provides automatic broker-level retry (up to 3 attempts, 30s intervals)
Application Layer: Custom retry service monitors delivery acknowledgments:
- Retry schedule: 5min, 15min, 1hr, 4hr, 24hr
- Exponential backoff prevents network congestion
- Maximum 5 retry attempts before escalation
Scheduled Recovery: Daily job at 02:00 UTC identifies devices with pending updates >48hrs old, republishes with high-priority flag

Dead-Letter Queues: Created DLQ infrastructure for failed deliveries:


topic: /firmware/dlq/failed
retentionPolicy: 30days
alertThreshold: 50

Events move to DLQ after exhausting all retries. Operations team receives alert when DLQ depth exceeds 50 messages. DLQ analysis revealed 80% of failures were devices decommissioned without proper cleanup - now automated in device lifecycle process.

Device State Tracking: Enhanced IoT Asset Monitor with firmware state machine:

States: PENDING, IN_PROGRESS, DELIVERED, APPLIED, FAILED, EXPIRED
Tracks: currentVersion, targetVersion, lastUpdateAttempt, attemptCount, lastOnlineTime
Device reports state transitions via telemetry topic /device/{id}/firmware/state
Dashboard shows real-time delivery status across entire fleet
Automated alerts for devices stuck in PENDING >24hrs or DELIVERED >72hrs (not applied)

Network Resilience: Implemented intelligent retry logic based on network conditions:

Device reports network quality metrics (signal strength, latency, packet loss)
Retry scheduling adjusts based on network quality: poor network = longer backoff intervals
During network instability (>20% packet loss), firmware updates pause automatically, resume when stable
Edge devices maintain local update queue, self-reconcile with server state on reconnection
Added heartbeat mechanism: devices ping every 5min, missed heartbeats trigger state check

Results: Firmware delivery success rate improved from 70% to 98% first-attempt, 99.5% within 24 hours. Mean time to delivery dropped from 4.2 hours to 18 minutes. Reduced manual intervention by 85% - only truly offline devices require attention. Network bandwidth usage decreased 30% through intelligent retry scheduling. Zero firmware version drift incidents in past 60 days versus 12-15 monthly before implementation.

jessica_expert · March 30, 2025, 8:29am

Set message expiry (TTL) on firmware update events - we use 72 hours. After expiry, events move to dead-letter queue for manual review. For devices offline longer than TTL, we have scheduled job that republishes updates based on device state tracking. This catches devices that were offline during initial push and subsequent retry window.

Topic		Replies	Views
Integration module firmware update fails due to MQTT broker disconnects during device push (cciot-25) Cisco IoT Cloud Connect question , integration , firmware-update , iot-gateway , mqtt , mqtt-broker , cciot-25 , update-failures , mqtt-disconnect	6	0	February 3, 2025
Firmware update fails over-the-air when using MQTT with ThingWorx Edge - device disconnects mid-transfer PTC ThingWorx question , mqtt , edge-device , firmware-mgm , ota-updates , hw-integration , mqtt-disconnect , qos-config , twx-97	7	0	October 25, 2025
Firmware management fails with MQTT connection drops during bulk updates Cumulocity IoT question , connectivity , device-management , mqtt-protocol , mqtt , firmware-mgmt , c8y-1020 , connection-drop , failed-updates	6	1	June 20, 2025
Devices lose connectivity during firmware updates in IoT Operations Dashboard Cisco IoT Cloud Connect question , connectivity , ota-updates , firmware-mgmt , iod-23 , device-availability , heartbeat-monitoring , checksum-validation , rollback-mechanism	6	0	January 24, 2025
Automated firmware rollout to 500+ edge devices in manufacturing plant reduced downtime by 60% Microsoft Azure IoT use-case , manufacturing , cost-reduction , azure-iot-hub , firmware-update , firmware-mgm , edge-devices , aziot-24 , manual-updates	5	0	September 22, 2025
Firmware management event processing delayed in iod-23, causing update failures Cisco IoT Cloud Connect question , event-processing , firmware-mgmt , iod-23 , iot-operations , device-mgmt , queue-optimization , update-failures	4	0	August 26, 2025
Over-the-air firmware updates using Pub/Sub for remote pump stations with cellular connectivity constraints Google Cloud IoT use-case , devops-deploy-auto , cloud-storage , pub-sub , mqtt , firmware-mgmt , iiot-support , ota-update , pubsub-23	4	0	September 27, 2025
OTA firmware update fails for devices marked offline in registry during scheduled rollout Google Cloud IoT question , update-fail , device-registry , firmware-mgmt , ota-update , gcpiot-24 , sys-integration , firmware-compliance , device-status	7	0	August 11, 2025
OTA firmware update API fails silently without retry when network drops Cisco IoT Cloud Connect question , error-handling , network-resilience , retry-logic , exponential-backoff , api-sdk , firmware-mgm , iod-23 , ota-update	4	0	December 18, 2024

Firmware update events not reaching devices during network interruptions

Related topics