Devices disconnect from gateway after Edge module restart - persistent offline state

Restarting our IoT Edge gateway module causes all downstream leaf devices to disconnect and remain offline until we manually reboot them. The data flow interruption lasts 30+ minutes even though the Edge gateway comes back online within 2 minutes.

Our setup has an Edge gateway managing 40 leaf devices using MQTT protocol. When we restart the custom Edge module for updates, leaf devices show as disconnected in IoT Hub and don’t automatically reconnect. The Edge gateway logs show:


[ERROR] Module restart completed
[WARN] Downstream device sensor_042 connection lost
[INFO] Waiting for device reconnection...
# No reconnection attempts logged

Leaf devices are configured with 60-second keepalive and should auto-reconnect. Is there a specific Edge module restart procedure that preserves downstream device sessions? We can’t afford 30-minute outages every time we deploy module updates.

Your issue is a combination of Edge module restart handling, device auto-reconnect policy, and session management problems. Here’s the complete solution:

1. Edge Module Restart Handling:

First, fix your deployment manifest to ensure graceful shutdowns. Add this to your custom module configuration:

"createOptions": {
  "StopTimeout": 30,
  "HostConfig": {"RestartPolicy": {"Name": "on-failure"}}
}

This gives your module 30 seconds to complete in-flight operations before restart. Set restart policy to ‘on-failure’ instead of ‘always’ to prevent cascade restarts.

Ensure Edge hub is NOT listed as dependent on your custom module in the routing configuration. The Edge hub should be independent so leaf device connections persist during your module updates.

2. Device Auto-Reconnect Policy:

Your leaf devices need aggressive reconnection logic. Update the device code to handle disconnect events:

def on_disconnect(client, userdata, rc):
    if rc != 0:
        reconnect_count = 0
        while reconnect_count < 10:
            time.sleep(5 * (2 ** min(reconnect_count, 4)))
            # Exponential backoff: 5s, 10s, 20s, 40s, 80s...

Critically, devices must be configured with ‘cleanSession=false’ in their MQTT connection options. This preserves the session state on the Edge hub during brief disconnects.

3. Session Management:

The Edge hub MQTT broker needs proper session persistence. Update your Edge hub environment variables:

  • Set ‘mqttSettings__sessionStatePersistenceEnabled’ to ‘true’
  • Set ‘storageFolder’ to a persistent volume mount
  • Configure ‘mqttSettings__maxPendingMessages’ to at least 100

This ensures that even if the Edge hub restarts, it can restore device sessions from persisted state.

Additionally, implement a connection health check on your leaf devices:

  • Send keepalive pings every 30 seconds (not just rely on the 60-second MQTT keepalive)
  • If 2 consecutive pings fail, trigger immediate reconnection attempt
  • On reconnection, re-subscribe to all topics to ensure message routing is restored

For your 40-device fleet, also configure the Edge hub with adequate connection limits:

  • Set ‘IotHubConnectionPoolSize’ to at least 50
  • Increase ‘amqpSettings__maxFrameSize’ to 65536 if devices send large payloads

With these changes, module restarts should cause less than 10 seconds of downtime, and devices will automatically reconnect without manual intervention. The key is combining persistent sessions, aggressive retry logic, and proper module lifecycle management.

The issue is that module restarts break the MQTT broker session state. Leaf devices maintain their connection to the Edge hub module, not the custom module. Check if your custom module is hosting the MQTT broker or if you’re using the built-in Edge hub. If custom, you need to implement session persistence to survive restarts. The Edge hub module has this built-in but requires proper configuration.

We’re using the built-in Edge hub for MQTT brokering. Our custom module handles data transformation and routing. The Edge hub module isn’t being restarted - only our custom module. But somehow the leaf device connections are still affected. Could there be a dependency causing the Edge hub to restart when our module restarts?

Leaf devices won’t auto-reconnect if they received a proper MQTT disconnect packet during the Edge hub restart. They interpret this as an intentional disconnect and wait for explicit instructions. You need to configure aggressive retry logic on the leaf devices with exponential backoff. Set initial retry at 5 seconds, max retry at 60 seconds. Also ensure the Edge hub’s MQTT settings have connection timeout configured correctly.

Found the dependency issue - our module was listed as a dependency for Edge hub routing rules, causing both to restart together. However, even after fixing the deployment manifest to remove the dependency, leaf devices still don’t auto-reconnect. They seem to be stuck waiting for the gateway to initiate reconnection rather than actively trying to reconnect themselves.

Check your module deployment manifest for restart policies and dependencies. If your custom module has a restart policy of ‘always’ and the Edge hub depends on it, a restart of your module could trigger an Edge hub restart as well. Also verify that your leaf devices are configured with the correct gateway hostname in their connection strings - if they’re using the module-specific endpoint instead of the Edge hub endpoint, they’ll lose connection when the module restarts.