Your issue is a combination of Edge module restart handling, device auto-reconnect policy, and session management problems. Here’s the complete solution:
1. Edge Module Restart Handling:
First, fix your deployment manifest to ensure graceful shutdowns. Add this to your custom module configuration:
"createOptions": {
  "StopTimeout": 30,
  "HostConfig": {"RestartPolicy": {"Name": "on-failure"}}
}
This gives your module 30 seconds to complete in-flight operations before the container is stopped. Set the restart policy to ‘on-failure’ instead of ‘always’ to prevent cascading restarts.
Ensure the Edge hub is NOT listed as dependent on your custom module in the routing configuration. The Edge hub should be independent so that leaf device connections persist during your module updates.
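In a raw deployment manifest, createOptions is stored as a JSON string rather than a nested object, so the fragment above typically needs to be serialized before deployment. Here is a minimal sketch of that step; the helper name module_settings and the image name are hypothetical placeholders, not part of your setup:

```python
import json

def module_settings(image, create_options):
    """Build a module's 'settings' object, serializing createOptions to a JSON string."""
    return {
        "image": image,
        "createOptions": json.dumps(create_options),
    }

create_options = {
    "StopTimeout": 30,
    "HostConfig": {"RestartPolicy": {"Name": "on-failure"}},
}

# Placeholder image name, for illustration only
settings = module_settings("myregistry.example/custom-module:1.0", create_options)
print(json.dumps(settings, indent=2))
```

Building the options as a dictionary and serializing them once avoids hand-escaping quotes inside the manifest.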
2. Device Auto-Reconnect Policy:
Your leaf devices need aggressive reconnection logic. Update the device code to handle disconnect events:
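import time
def on_disconnect(client, userdata, rc):
    if rc != 0:  # nonzero rc: unexpected disconnect
        reconnect_count = 0
        while reconnect_count < 10:
            # Exponential backoff, capped: 5s, 10s, 20s, 40s, 80s, 80s...
            time.sleep(5 * (2 ** min(reconnect_count, 4)))
            try:
                client.reconnect()  # assumes a paho-mqtt style client
                break
            except OSError:
                reconnect_count += 1  # failed attempt; wait longer next time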
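To see what the sleep expression in the handler above actually produces, the schedule can be computed on its own. This is only an illustrative sketch; the function name backoff_delays and its parameters are made up for the example:

```python
def backoff_delays(max_attempts=10, base_seconds=5, cap_exponent=4):
    """Return the wait (in seconds) before each reconnect attempt.

    The delay doubles per attempt until the exponent reaches cap_exponent,
    then stays constant.
    """
    return [
        base_seconds * (2 ** min(attempt, cap_exponent))
        for attempt in range(max_attempts)
    ]

delays = backoff_delays()
print(delays)       # [5, 10, 20, 40, 80, 80, 80, 80, 80, 80]
print(sum(delays))  # 555
```

With these values a device keeps retrying for a little over nine minutes before giving up, so adjust the attempt count or the cap if your outages can last longer.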
Critically, devices must be configured with ‘cleanSession=false’ in their MQTT connection options. This preserves the session state on the Edge hub during brief disconnects.
3. Session Management:
The Edge hub MQTT broker needs proper session persistence. Update your Edge hub environment variables:
- Set ‘mqttSettings__sessionStatePersistenceEnabled’ to ‘true’
- Set ‘storageFolder’ to a persistent volume mount
- Configure ‘mqttSettings__maxPendingMessages’ to at least 100
This ensures that even if the Edge hub restarts, it can restore device sessions from persisted state.
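In the deployment manifest, each module environment variable is written as a name mapped to an object with a "value" field. As a rough sketch, the settings listed above could be converted into that shape as follows; the helper name edge_hub_env and the mount path /iotedge/storage are assumptions for illustration, and the variable names are taken as given from the list above, so check them against the Edge hub version you run:

```python
import json

def edge_hub_env(settings):
    """Convert plain name/value pairs into the manifest's {"name": {"value": ...}} shape."""
    return {name: {"value": str(value)} for name, value in settings.items()}

env = edge_hub_env({
    "mqttSettings__sessionStatePersistenceEnabled": "true",
    "storageFolder": "/iotedge/storage",  # assumed path inside the container
    "mqttSettings__maxPendingMessages": 100,
})
print(json.dumps({"env": env}, indent=2))
```

The storage path only helps if it is also bound to a host directory or volume in the Edge hub's createOptions; otherwise the state is lost with the container.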
Additionally, implement a connection health check on your leaf devices:
- Send application-level keepalive pings every 30 seconds rather than relying only on the 60-second MQTT keepalive
- If 2 consecutive pings fail, trigger an immediate reconnection attempt
- On reconnection, re-subscribe to all topics to ensure message routing is restored
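The "2 consecutive failures" rule above can be kept separate from the MQTT client so it is easy to test. This is a minimal sketch, assuming your device code reports each ping result as a boolean; the class name ConnectionHealth is hypothetical:

```python
class ConnectionHealth:
    """Track consecutive ping failures and signal when to reconnect."""

    def __init__(self, max_failures=2):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, ping_ok):
        """Record one ping result; return True when a reconnect should be triggered."""
        if ping_ok:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.failures = 0  # start counting afresh after triggering
            return True
        return False

health = ConnectionHealth()
results = [health.record(ok) for ok in [True, False, True, False, False, False]]
print(results)  # [False, False, False, False, True, False]
```

A single failed ping followed by a success resets the counter, so transient packet loss does not force a reconnect. When record returns True, call your reconnect routine and then re-subscribe to your topics.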
For your 40-device fleet, also configure the Edge hub with adequate connection limits:
- Set ‘IotHubConnectionPoolSize’ to at least 50
- Increase ‘amqpSettings__maxFrameSize’ to 65536 if devices send large payloads
With these changes, module restarts should cause less than 10 seconds of downtime, and devices will automatically reconnect without manual intervention. The key is combining persistent sessions, aggressive retry logic, and proper module lifecycle management.