Firmware management fails with MQTT connection drops during bulk updates

arjun_expert · May 23, 2025, 11:06pm

We’re experiencing MQTT connection drops during firmware updates across our device fleet. When pushing firmware to 200+ devices simultaneously, about 30% lose MQTT connectivity mid-transfer and never complete the update. The keep-alive settings seem insufficient, and we’re not sure if firmware chunking is working correctly. Our current retry logic just marks devices as failed without attempting reconnection.


MQTT keepAlive: 60s
Firmware chunk size: 64KB
Connection timeout: devices drop after 2-3 chunks

Devices show ‘disconnected’ status in the platform but remain online on the network. Has anyone dealt with MQTT stability during large-scale firmware deployments?

amanda_master · June 8, 2025, 9:00am

Check your tenant’s MQTT broker limits. There’s usually a connection rate limit that affects bulk operations. You might be hitting that threshold. Also, verify that your devices are properly implementing the MQTT reconnection logic with exponential backoff. The platform won’t force reconnection - that’s client-side responsibility.

amanda_master · June 25, 2025, 5:58pm

I’ll provide a comprehensive solution addressing all three key areas:

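MQTT Keep-Alive Configuration: Increase your keep-alive interval to at least 300 seconds for firmware operations. A longer interval gives a device that is busy receiving or writing a chunk more time to send its next packet before the broker treats the connection as dead. Also set a connection timeout, enable automatic reconnect, and monitor ping/pong (PINGREQ/PINGRESP) on the client side: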


mqttClient.setKeepAliveInterval(300);     // seconds
mqttClient.setConnectionTimeout(30);      // seconds
mqttClient.enableAutomaticReconnect(true); // reconnect automatically after a drop
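
For context on why the interval matters: under the MQTT 3.1.1 specification, the broker closes the connection if it receives no control packet from the client within one and a half times the keep-alive period. A minimal sketch of that arithmetic follows; the class and method names are illustrative and not part of any client library or platform SDK.

```java
// Illustrative helper only; not a platform or client-library API.
final class KeepAliveWindow {
    // MQTT 3.1.1: the broker disconnects a client that sends no control
    // packet within 1.5x the keep-alive interval.
    static long brokerDeadlineSeconds(int keepAliveSeconds) {
        return keepAliveSeconds * 3L / 2L;
    }

    public static void main(String[] args) {
        System.out.println(brokerDeadlineSeconds(60));   // 90
        System.out.println(brokerDeadlineSeconds(300));  // 450
    }
}
```

With a 60s keep-alive, a client that stays silent for more than 90 seconds is dropped; at 300s the window grows to 450 seconds.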

Firmware Chunking Strategy: Reduce chunk size to 32KB or even 16KB for devices with limited memory. Implement adaptive chunking based on device capabilities:


int chunkSize = device.getMemory() > 512 * 1024 ? 32768 : 16384; // assumes getMemory() returns bytes
firmwareManager.setChunkSize(chunkSize);
firmwareManager.setChunkDelay(500); // ms between chunks
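
To make the chunk arithmetic concrete, here is a rough sketch of how the chunk size, chunk count, and final chunk length could be derived. ChunkPlan and its methods are hypothetical names for illustration, and the 512 KB threshold is simply the one from the snippet above.

```java
// Illustrative sketch; ChunkPlan is a hypothetical helper, not a platform API.
final class ChunkPlan {
    static final long MEMORY_THRESHOLD_BYTES = 512L * 1024; // 512 KB threshold

    // 32 KB chunks for devices above the threshold, 16 KB otherwise.
    static int chunkSizeFor(long deviceMemoryBytes) {
        return deviceMemoryBytes > MEMORY_THRESHOLD_BYTES ? 32 * 1024 : 16 * 1024;
    }

    // Number of chunks needed to cover the image (ceiling division).
    static long chunkCount(long firmwareBytes, int chunkSize) {
        return (firmwareBytes + chunkSize - 1) / chunkSize;
    }

    // Length of the chunk at a 0-based index; only the last one may be shorter.
    static int chunkLength(long firmwareBytes, int chunkSize, long index) {
        long remaining = firmwareBytes - index * chunkSize;
        return (int) Math.min(chunkSize, remaining);
    }
}
```

For example, a 1 MiB image sent in 32 KB chunks needs 32 chunks, while a 100,000-byte image sent in 16 KB chunks needs 7, with the last chunk carrying 1,696 bytes.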

Retry Logic Implementation: Implement exponential backoff with connection health checks:


// Pseudocode - Retry mechanism:
1. Detect MQTT disconnect during firmware transfer
2. Wait initial_delay (5s) before first retry
3. Attempt reconnection, doubling the wait before each subsequent attempt (5s, 10s, 20s, ...)
4. Verify connection health before resuming transfer
5. Resume from last successful chunk (not from start)
6. Maximum 5 retry attempts before marking as failed
// Track progress in persistent storage
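
The steps above could be sketched roughly as follows. This is not tied to any particular MQTT client: RetryPolicy, the BooleanSupplier that stands in for "reconnect and verify health", and the LongConsumer that stands in for sleeping are all assumptions made for illustration.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

// Illustrative sketch; RetryPolicy and its members are hypothetical names.
final class RetryPolicy {
    static final long INITIAL_DELAY_MS = 5_000L; // step 2: 5s initial delay
    static final int MAX_ATTEMPTS = 5;           // step 6: give up after 5 tries

    // Wait before the given 1-based attempt: 5s, 10s, 20s, 40s, 80s.
    static long delayMillis(int attempt) {
        return INITIAL_DELAY_MS << (attempt - 1);
    }

    // Waits, then tries to reconnect, up to MAX_ATTEMPTS times.
    // Returns the attempt number that succeeded, or -1 if every attempt failed.
    static int reconnect(BooleanSupplier connectAndCheckHealth, LongConsumer sleeper) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            sleeper.accept(delayMillis(attempt));
            if (connectAndCheckHealth.getAsBoolean()) {
                return attempt;
            }
        }
        return -1;
    }

    // Byte offset to resume from, given the last acknowledged 0-based chunk
    // index read from persistent storage (-1 if no chunk was acknowledged).
    static long resumeOffset(long lastAckedChunk, int chunkSize) {
        return (lastAckedChunk + 1) * chunkSize;
    }
}
```

In real use the sleeper would wrap Thread.sleep and the supplier would wrap the client's connect call plus a health check. Passing both in keeps the schedule testable without actually waiting.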

Additional Recommendations:

Batch Deployment: Deploy to 50 devices at a time with 5-minute intervals. This prevents overwhelming the MQTT broker and network infrastructure.
Connection Pool Management: Configure your tenant’s MQTT connection limits appropriately. Contact support if you need higher thresholds.
Monitoring: Implement real-time monitoring of MQTT connection states. Set up alerts for abnormal disconnect rates.
Network Optimization: Work with your network team to ensure firewall rules allow long-lived MQTT sessions. Whitelist Cumulocity MQTT broker IPs.
Device-Side Implementation: Ensure devices implement proper MQTT reconnection logic with persistent session support (cleanSession=false).
Progress Persistence: Store firmware transfer progress locally on devices so updates can resume from the last successful chunk after reconnection.
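
As a rough illustration of the batch deployment recommendation, the fleet can be split into fixed-size groups with a start offset per group. RolloutPlan is a hypothetical helper written for this sketch; actually triggering each batch would go through whatever bulk-operation mechanism you already use.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; RolloutPlan is a hypothetical helper, not a platform API.
final class RolloutPlan {
    // Splits the device list into consecutive batches of at most batchSize.
    static <T> List<List<T>> batches(List<T> devices, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < devices.size(); start += batchSize) {
            int end = Math.min(start + batchSize, devices.size());
            result.add(new ArrayList<>(devices.subList(start, end)));
        }
        return result;
    }

    // Minutes after rollout start at which a 0-based batch begins.
    static long startOffsetMinutes(int batchIndex, long intervalMinutes) {
        return batchIndex * intervalMinutes;
    }
}
```

With 200 devices, a batch size of 50, and a 5-minute interval, this yields four batches starting at 0, 5, 10, and 15 minutes.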

After implementing these changes, you should see connection stability improve dramatically. Start with a pilot group of 10-20 devices to validate the configuration before rolling out to your entire fleet. Monitor connection metrics closely during the pilot phase.

carlospro · May 26, 2025, 1:49pm

I’ve seen similar behavior with our deployment. The 60s keep-alive is too aggressive for firmware transfers. We increased ours to 300s and saw immediate improvement. Also check your MQTT QoS levels - using QoS 1 for firmware operations helps ensure delivery acknowledgment.

ricardoexpert · June 5, 2025, 3:41am

Thanks for the suggestions. We tried increasing keep-alive to 240s but we’re still seeing drops. The QoS is set to 1. I’m wondering if the platform is throttling connections during high load? We’re also not seeing any reconnection attempts logged.

amanda_master · May 29, 2025, 7:15am

Your chunk size might be the issue. 64KB chunks can overwhelm devices with limited memory. We reduced to 32KB and implemented exponential backoff in our retry logic. The key is balancing chunk size with device capabilities and network conditions. Also, consider implementing a staged rollout rather than pushing to all 200 devices simultaneously.

vikramguru · June 20, 2025, 4:24pm

Look at your network infrastructure too. We discovered that our firewall was dropping long-lived MQTT connections during firmware transfers due to session timeout policies. After whitelisting the MQTT broker IPs and adjusting firewall rules, stability improved significantly.
