Integration module firmware update fails due to MQTT broker disconnects during device push (cciot-25)

We’re encountering consistent firmware update failures across our industrial gateway fleet when pushing updates through the integration module. The updates trigger successfully but fail mid-transfer with MQTT broker disconnects.

The pattern we’re seeing:


MQTT session lost during firmware transfer
Connection timeout after 45 seconds
Device reports: CONNACK not received

This happens on about 60% of our devices, especially those in remote locations. The MQTT session persistence seems unreliable during large file transfers, and we’re hitting what looks like broker resource limits. Devices attempt reconnection but the update job times out before they can resume.

Has anyone dealt with MQTT stability issues during firmware updates? We need the edge device reconnection logic to be more resilient. Our current setup uses default MQTT keep-alive settings and QoS 1 for firmware delivery.

For firmware updates via MQTT, you need to treat session persistence differently than regular telemetry. The broker needs to maintain state during long transfers. We implemented chunked transfers with explicit acknowledgment per chunk, and increased our keep-alive interval to 300 seconds during firmware operations. Also critical: enable clean_session=false on your device clients so they can resume after reconnection. The devices should track which chunks they’ve received and request only missing pieces on reconnect.
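As a rough illustration of the chunk-plus-acknowledgment idea, here is a minimal Python sketch. It leaves out the MQTT transport entirely and shows only the bookkeeping. The names `Chunk`, `split_firmware` and `ChunkTracker` are made up for this example, and the per-chunk SHA-256 digest is an assumption, not something our setup necessarily used.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: int   # zero-based position of the chunk in the image
    data: bytes
    sha256: str     # hex digest of this chunk's bytes (assumed integrity check)


def split_firmware(image: bytes, chunk_size: int) -> List[Chunk]:
    """Cut the image into fixed-size chunks, each carrying its own digest."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for offset in range(0, len(image), chunk_size):
        data = image[offset:offset + chunk_size]
        chunks.append(Chunk(offset // chunk_size, data,
                            hashlib.sha256(data).hexdigest()))
    return chunks


@dataclass
class ChunkTracker:
    """Device-side record of which chunks have arrived intact."""
    total_chunks: int
    received: Dict[int, bytes] = field(default_factory=dict)

    def accept(self, chunk: Chunk) -> bool:
        """Store the chunk and return True (send an ack) only if the digest matches."""
        if hashlib.sha256(chunk.data).hexdigest() != chunk.sha256:
            return False  # corrupt chunk: no ack, so it stays in missing()
        self.received[chunk.chunk_id] = chunk.data
        return True

    def missing(self) -> List[int]:
        """Chunk IDs the device should request again after a reconnect."""
        return [i for i in range(self.total_chunks) if i not in self.received]

    def assemble(self) -> bytes:
        """Join the chunks in order once nothing is missing."""
        if self.missing():
            raise RuntimeError("transfer incomplete")
        return b"".join(self.received[i] for i in range(self.total_chunks))
```

On a real device the `received` map would live in flash or on disk rather than in memory, so that `missing()` still gives the right answer after a reboot.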

We had this exact issue last year. The problem was twofold: broker resource limits were too conservative for concurrent firmware pushes, and our edge devices weren’t configured to resume interrupted transfers. Check your broker’s max_connections and max_inflight_messages parameters. We had to increase both significantly for bulk firmware operations.

To answer your question, Mike: we went with max_inflight_messages=100 and max_connections=500 for our production broker cluster. But the real fix was implementing persistent sessions with QoS 2 for firmware chunks. Combined with a persistent session, QoS 2 guarantees exactly-once delivery of each chunk even across disconnects.

I’ve seen similar behavior. First thing to check is your MQTT broker’s connection timeout settings versus the actual firmware transfer time. If your updates take longer than the broker’s idle timeout, you’ll get disconnected mid-transfer. Also verify your QoS settings - QoS 1 should work but you might want to look at message size limits.
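A quick back-of-the-envelope check makes the timeout comparison concrete. The sketch below is plain arithmetic. The 32 MiB image and the 20 KiB/s link are invented numbers for illustration, and the function names and `safety_factor` parameter are hypothetical. The 45 s figure is the timeout quoted in the original post.

```python
def transfer_seconds(image_bytes: int, throughput_bytes_per_s: float) -> float:
    """Best-case transfer time, ignoring protocol overhead and retransmissions."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return image_bytes / throughput_bytes_per_s


def fits_in_timeout(image_bytes: int, throughput_bytes_per_s: float,
                    timeout_s: float, safety_factor: float = 1.0) -> bool:
    """True if the estimated transfer, padded by safety_factor, fits the timeout."""
    estimate = transfer_seconds(image_bytes, throughput_bytes_per_s)
    return estimate * safety_factor <= timeout_s


# Assumed example values: 32 MiB image over a 20 KiB/s link.
IMAGE_BYTES = 32 * 1024 * 1024
LINK_BYTES_PER_S = 20 * 1024

estimate = transfer_seconds(IMAGE_BYTES, LINK_BYTES_PER_S)
print(f"estimated transfer: {estimate / 60:.1f} min")
print("fits a 45 s timeout:", fits_in_timeout(IMAGE_BYTES, LINK_BYTES_PER_S, 45))
```

With those assumed numbers the transfer needs roughly 27 minutes, so any timeout measured in seconds will cut it off unless the transfer is chunked and resumable.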

Thanks for the suggestions. I checked our broker config and we’re running with default max_inflight_messages=20 which seems low. What values did you end up using for large firmware transfers? Also, how did you handle the reconnection logic on the device side?

Let me provide a comprehensive solution that addresses all three critical areas:

MQTT Session Persistence: Configure persistent sessions on both broker and clients. On your MQTT broker (assuming Mosquitto or similar):


persistence true
persistence_location /var/lib/mqtt/
autosave_interval 300

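On device clients, set clean_session=false and use a unique, stable client_id per device. With a persistent session the broker keeps the client’s subscriptions and queues its QoS 1 and QoS 2 messages while the device is offline, then delivers them when the same client_id reconnects.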

Broker Resource Limits: Your current limits are definitely too restrictive for firmware operations. Update your broker configuration:


max_inflight_messages 100
max_queued_messages 1000
max_connections 500
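message_size_limit 268435455

The message_size_limit caps the payload of a single PUBLISH. 268435455 bytes (just under 256MB) is the largest payload the MQTT protocol allows, so this value effectively lifts the limit; with chunked transfers it only has to exceed your chunk size. Also implement connection pooling if you’re updating more than 50 devices simultaneously.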

Edge Device Reconnection Logic: This is where most implementations fail. Your devices need intelligent retry logic:

  1. Implement exponential backoff for reconnection attempts (start at 5s, max at 120s)
  2. Track firmware transfer state locally - store received chunk IDs in persistent storage
  3. On reconnection, request the missing chunks from the update server (via the broker) rather than restarting the entire transfer
  4. Use MQTT topic structure like: fw/update/{device_id}/chunk/{chunk_id}
  5. Implement a resume capability:

// Pseudocode - Resume logic:
1. On reconnect, read local state: lastChunkReceived
2. Subscribe to topic: fw/update/{device_id}/chunk/+
3. Publish resume request: fw/status/{device_id}/resume
4. Include payload: {"lastChunk": lastChunkReceived}
5. Server sends only missing chunks
6. Device validates the checksum of the assembled image after the transfer completes
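The backoff and resume steps above could look roughly like this in Python. Treat it as a sketch under assumptions: the MQTT client calls are left out, `backoff_delays`, `TransferState` and `resume_request` are hypothetical names, the JSON state file is just one possible form of persistent storage, and using -1 to mean "nothing received yet" is a convention chosen for this example.

```python
import json
import os
import tempfile
from typing import Iterator, Tuple


def backoff_delays(initial: float = 5.0, maximum: float = 120.0) -> Iterator[float]:
    """Yield reconnect delays that double on each attempt, capped at maximum."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * 2, maximum)


class TransferState:
    """Last chunk received, kept on disk so it survives a reboot."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> int:
        """Return the last chunk ID received, or -1 if no state exists yet."""
        try:
            with open(self.path, "r", encoding="utf-8") as fh:
                return int(json.load(fh)["lastChunk"])
        except (FileNotFoundError, KeyError, TypeError, ValueError):
            return -1

    def save(self, last_chunk: int) -> None:
        """Write to a temp file, then rename, so a crash cannot leave half a file."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"lastChunk": last_chunk}, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, self.path)


def resume_request(device_id: str, state: TransferState) -> Tuple[str, str, str]:
    """Build (subscribe filter, publish topic, JSON payload) for the resume steps."""
    # '+' is MQTT's single-level wildcard; it matches any one chunk ID.
    subscribe_filter = f"fw/update/{device_id}/chunk/+"
    publish_topic = f"fw/status/{device_id}/resume"
    payload = json.dumps({"lastChunk": state.load()})
    return subscribe_filter, publish_topic, payload
```

The device would sleep for the next value from `backoff_delays()` after each failed connect, then subscribe and publish using the tuple from `resume_request()` once it is back online. Adding random jitter to each delay is a common refinement when many devices reconnect at once; it is left out here to keep the sketch short.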

Additional Recommendations:

  • Increase MQTT keep-alive to 300 seconds during firmware operations
  • Use QoS 2 for firmware chunks to guarantee exactly-once delivery
  • Implement health checks: devices should publish heartbeat every 60s during updates
  • Set firmware job timeout to at least 30 minutes for remote devices
  • Monitor broker metrics: connection count, message queue depth, memory usage

After implementing these changes, our firmware update success rate went from 40% to 98%, with the remaining 2% being actual network outages. The key is treating firmware updates as a special operation with different QoS and persistence requirements than regular telemetry.