Firmware management fails with MQTT connection drops during bulk updates

We’re experiencing MQTT connection drops during firmware updates across our device fleet. When pushing firmware to 200+ devices simultaneously, about 30% lose MQTT connectivity mid-transfer and never complete the update. The keep-alive settings seem insufficient, and we’re not sure if firmware chunking is working correctly. Our current retry logic just marks devices as failed without attempting reconnection.


MQTT keepAlive: 60s
Firmware chunk size: 64KB
Observed behavior: devices drop after 2-3 chunks

Devices show ‘disconnected’ status in the platform but remain online on the network. Has anyone dealt with MQTT stability during large-scale firmware deployments?

Check your tenant’s MQTT broker limits. There’s usually a connection rate limit that affects bulk operations. You might be hitting that threshold. Also, verify that your devices are properly implementing the MQTT reconnection logic with exponential backoff. The platform won’t force reconnection - that’s client-side responsibility.

I’ll provide a comprehensive solution addressing all three key areas:

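MQTT Keep-Alive Configuration: Increase your keep-alive interval to at least 300 seconds for firmware operations. A longer interval gives a device that is busy receiving or flashing a chunk more time before the broker treats it as dead, which reduces premature disconnections during large transfers. Also monitor the client-side keep-alive exchange (PINGREQ/PINGRESP):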


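// Assumes the Eclipse Paho Java client, where these settings live on MqttConnectOptions.
MqttConnectOptions options = new MqttConnectOptions();
options.setKeepAliveInterval(300);   // seconds
options.setConnectionTimeout(30);    // seconds
options.setAutomaticReconnect(true); // client reconnects on its own after a drop
mqttClient.connect(options);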
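
To put numbers on the keep-alive setting: under MQTT 3.1.1, a broker disconnects a client if it receives no control packet from it within one and a half times the keep-alive interval. Here is a rough sketch of that arithmetic. The class and method names are made up for illustration and are not part of any MQTT library, and the 120-second "busy" figure in the usage note is only an assumed example.

```java
/** Illustrative helper; names are hypothetical, not part of any MQTT client library. */
final class KeepAliveWindow {
    private KeepAliveWindow() {}

    /** MQTT 3.1.1 keep-alive rule: the broker drops a client that stays silent for 1.5x keep-alive. */
    static long brokerTimeoutMs(int keepAliveSeconds) {
        return keepAliveSeconds * 1500L;
    }

    /** True if a device that cannot send any packet for busyMs risks being dropped. */
    static boolean risksDisconnect(int keepAliveSeconds, long busyMs) {
        return busyMs >= brokerTimeoutMs(keepAliveSeconds);
    }
}
```

With a 60s keep-alive the window is 90 seconds, so a device that blocks for an assumed 120 seconds while writing a chunk to flash would be dropped. With 300s the window grows to 450 seconds.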

Firmware Chunking Strategy: Reduce chunk size to 32KB or even 16KB for devices with limited memory. Implement adaptive chunking based on device capabilities:


// device and firmwareManager are illustrative names; getMemory() is assumed to return bytes.
int chunkSize = device.getMemory() > 512 * 1024 ? 32768 : 16384;
firmwareManager.setChunkSize(chunkSize);
firmwareManager.setChunkDelay(500); // ms between chunks
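
If it helps to see the chunking in a self-contained form, here is a minimal sketch of picking the chunk size and planning the chunk boundaries for an image. `FirmwareChunker` and its methods are hypothetical names, not a platform API, and the 512 KB threshold is simply the one from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch; not part of any platform SDK. */
final class FirmwareChunker {
    static final int LARGE_CHUNK = 32 * 1024;         // 32 KB
    static final int SMALL_CHUNK = 16 * 1024;         // 16 KB
    static final long MEMORY_THRESHOLD = 512L * 1024; // 512 KB, in bytes

    private FirmwareChunker() {}

    /** Picks the larger chunk only for devices with more than 512 KB of memory. */
    static int chooseChunkSize(long deviceMemoryBytes) {
        return deviceMemoryBytes > MEMORY_THRESHOLD ? LARGE_CHUNK : SMALL_CHUNK;
    }

    /** Returns {offset, length} pairs that cover the whole image in order. */
    static List<long[]> plan(long imageSize, int chunkSize) {
        if (imageSize < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("imageSize must be >= 0 and chunkSize > 0");
        }
        List<long[]> chunks = new ArrayList<>();
        for (long offset = 0; offset < imageSize; offset += chunkSize) {
            long length = Math.min(chunkSize, imageSize - offset); // last chunk may be short
            chunks.add(new long[] {offset, length});
        }
        return chunks;
    }
}
```

Planning the offsets up front also makes resuming easier, because the index of the last acknowledged chunk is enough to know where to restart.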

Retry Logic Implementation: Implement exponential backoff with connection health checks:


// Pseudocode - Retry mechanism:
1. Detect MQTT disconnect during firmware transfer
2. Wait initial_delay (5s) before first retry
3. Attempt reconnection, doubling the wait after each failed attempt (5s, 10s, 20s, ...)
4. Verify connection health before resuming transfer
5. Resume from last successful chunk (not from start)
6. Maximum 5 retry attempts before marking as failed
// Track progress in persistent storage
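
The steps above could be sketched in Java roughly as follows. This is only one way to do it: `ResumableTransfer` is a hypothetical name, the send, reconnect and sleep operations are passed in as plain functional interfaces so the logic is independent of any particular MQTT client, and resetting the backoff after a successful chunk is my own design choice rather than something the steps require.

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntPredicate;
import java.util.function.LongConsumer;

/** Illustrative sketch; class and method names are hypothetical. */
final class ResumableTransfer {
    static final long INITIAL_DELAY_MS = 5_000L; // step 2: first wait is 5s
    static final int MAX_RETRIES = 5;            // step 6

    private ResumableTransfer() {}

    /** Wait before reconnect attempt n (0-based): 5s, 10s, 20s, 40s, 80s. */
    static long backoffDelayMs(int attempt) {
        return INITIAL_DELAY_MS << attempt;
    }

    /**
     * Sends chunks startChunk..totalChunks-1 and returns the index of the next
     * unsent chunk. A result equal to totalChunks means success; a lower value
     * means retries were exhausted and the device should be marked as failed.
     *
     * sendChunk returns false when the connection dropped, reconnect returns true
     * once the connection is healthy again, sleeper waits the given milliseconds.
     */
    static int run(int startChunk, int totalChunks, IntPredicate sendChunk,
                   BooleanSupplier reconnect, LongConsumer sleeper) {
        int next = startChunk;
        int attempt = 0;
        while (next < totalChunks) {
            if (sendChunk.test(next)) {
                next++;      // a real implementation would persist `next` here
                attempt = 0; // progress was made, so reset the backoff
                continue;
            }
            boolean reconnected = false;
            while (!reconnected && attempt < MAX_RETRIES) {
                sleeper.accept(backoffDelayMs(attempt));
                attempt++;
                reconnected = reconnect.getAsBoolean();
            }
            if (!reconnected) {
                return next; // give up; the caller can resume from `next` later
            }
        }
        return next;
    }
}
```

Because the method returns the next unsent chunk index, a later attempt can call it again with that value as `startChunk` instead of restarting the transfer from zero.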

Additional Recommendations:

  1. Batch Deployment: Deploy to 50 devices at a time with 5-minute intervals. This prevents overwhelming the MQTT broker and network infrastructure.

  2. Connection Pool Management: Configure your tenant’s MQTT connection limits appropriately. Contact support if you need higher thresholds.

  3. Monitoring: Implement real-time monitoring of MQTT connection states. Set up alerts for abnormal disconnect rates.

  4. Network Optimization: Work with your network team to ensure firewall rules allow long-lived MQTT sessions. Whitelist Cumulocity MQTT broker IPs.

  5. Device-Side Implementation: Ensure devices implement proper MQTT reconnection logic with persistent session support (cleanSession=false).

  6. Progress Persistence: Store firmware transfer progress locally on devices so updates can resume from the last successful chunk after reconnection.
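
For the batch deployment in point 1, a small sketch of how the fleet could be split and scheduled. `BatchPlanner` is a hypothetical helper, not a platform feature, and the batch size and interval are just the 50 devices and 5 minutes suggested above.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch; not part of any platform SDK. */
final class BatchPlanner {
    private BatchPlanner() {}

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end))); // copy, not a view
        }
        return batches;
    }

    /** Start offset of batch n (0-based) when batches are spaced by a fixed interval. */
    static Duration startOffset(int batchIndex, Duration interval) {
        return interval.multipliedBy(batchIndex);
    }
}
```

For 200 devices this gives four batches, with the last one starting 15 minutes after the first.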

After implementing these changes, you should see connection stability improve dramatically. Start with a pilot group of 10-20 devices to validate the configuration before rolling out to your entire fleet. Monitor connection metrics closely during the pilot phase.

I’ve seen similar behavior with our deployment. The 60s keep-alive is too aggressive for firmware transfers. We increased ours to 300s and saw immediate improvement. Also check your MQTT QoS levels - using QoS 1 for firmware operations helps ensure delivery acknowledgment.

Thanks for the suggestions. We tried increasing keep-alive to 240s but are still seeing drops. The QoS is set to 1. I’m wondering if the platform is throttling connections during high load? We’re also not seeing any reconnection attempts logged.

Your chunk size might be the issue. 64KB chunks can overwhelm devices with limited memory. We reduced to 32KB and implemented exponential backoff in our retry logic. The key is balancing chunk size with device capabilities and network conditions. Also, consider implementing a staged rollout rather than pushing to all 200 devices simultaneously.

Look at your network infrastructure too. We discovered that our firewall was dropping long-lived MQTT connections during firmware transfers due to session timeout policies. After whitelisting the MQTT broker IPs and adjusting firewall rules, stability improved significantly.