Firmware update events not reaching devices during network interruptions

Our edge devices frequently miss firmware update events when network connectivity is unstable. We push updates to 2000+ devices across remote locations, but delivery rate is only 65-70% on first attempt. Devices that miss the event never receive the update until we manually retry.

Current event delivery shows this error pattern:


MQTT Publish Failed: Connection timeout
Device: edge-device-1247
Topic: /firmware/update/v2
QoS: 0

We’re using QoS 0 for performance, but wondering if that’s the issue. Need advice on event persistence, retry mechanisms, and tracking device state during network issues.

Good points. What about devices that stay offline for extended periods? Don’t want messages queued indefinitely. How do you handle dead-letter scenarios?

Beyond QoS, implement persistent sessions with cleanSession=false. This tells the broker to queue messages for disconnected clients. When device reconnects, it receives all missed messages automatically. We also maintain device state tracking in IoT Asset Monitor - tracks last successful firmware version, last contact time, and pending update status. This gives visibility into which devices need manual intervention versus just waiting for reconnection.

Implemented the suggested changes. Here’s what worked for us to get from 70% to 98% delivery rate.

Excellent results! Here’s the complete solution that addresses all aspects of reliable firmware delivery:

Event Persistence: Upgraded MQTT configuration to enable persistent sessions and message durability:


cleanSession: false
sessionExpiryInterval: 259200
messagePersistence: true

Broker now retains messages for disconnected clients up to 72 hours (259200 seconds). Persistent sessions survive broker restarts, ensuring no message loss during maintenance windows.

Retry Mechanisms: Implemented multi-layer retry strategy:

  • Immediate Layer: MQTT QoS 1 provides automatic broker-level retry (up to 3 attempts, 30s intervals)
  • Application Layer: Custom retry service monitors delivery acknowledgments:
    • Retry schedule: 5min, 15min, 1hr, 4hr, 24hr
    • Exponential backoff prevents network congestion
    • Maximum 5 retry attempts before escalation
  • Scheduled Recovery: Daily job at 02:00 UTC identifies devices with pending updates >48hrs old, republishes with high-priority flag

Dead-Letter Queues: Created DLQ infrastructure for failed deliveries:


topic: /firmware/dlq/failed
retentionPolicy: 30days
alertThreshold: 50

Events move to DLQ after exhausting all retries. Operations team receives alert when DLQ depth exceeds 50 messages. DLQ analysis revealed 80% of failures were devices decommissioned without proper cleanup - now automated in device lifecycle process.

Device State Tracking: Enhanced IoT Asset Monitor with firmware state machine:

  • States: PENDING, IN_PROGRESS, DELIVERED, APPLIED, FAILED, EXPIRED
  • Tracks: currentVersion, targetVersion, lastUpdateAttempt, attemptCount, lastOnlineTime
  • Device reports state transitions via telemetry topic /device/{id}/firmware/state
  • Dashboard shows real-time delivery status across entire fleet
  • Automated alerts for devices stuck in PENDING >24hrs or DELIVERED >72hrs (not applied)

Network Resilience: Implemented intelligent retry logic based on network conditions:

  • Device reports network quality metrics (signal strength, latency, packet loss)
  • Retry scheduling adjusts based on network quality: poor network = longer backoff intervals
  • During network instability (>20% packet loss), firmware updates pause automatically, resume when stable
  • Edge devices maintain local update queue, self-reconcile with server state on reconnection
  • Added heartbeat mechanism: devices ping every 5min, missed heartbeats trigger state check

Results: Firmware delivery success rate improved from 70% to 98% first-attempt, 99.5% within 24 hours. Mean time to delivery dropped from 4.2 hours to 18 minutes. Reduced manual intervention by 85% - only truly offline devices require attention. Network bandwidth usage decreased 30% through intelligent retry scheduling. Zero firmware version drift incidents in past 60 days versus 12-15 monthly before implementation.

Set message expiry (TTL) on firmware update events - we use 72 hours. After expiry, events move to dead-letter queue for manual review. For devices offline longer than TTL, we have scheduled job that republishes updates based on device state tracking. This catches devices that were offline during initial push and subsequent retry window.