Your MQTT connection stability issues require addressing all three key areas:
MQTT Keepalive: The 60-second keepalive interval is too long for edge scenarios. Reduce to 20 seconds to enable faster connection failure detection:
MqttConnectOptions opts = new MqttConnectOptions();
opts.setKeepAliveInterval(20);
opts.setConnectionTimeout(10);
opts.setCleanSession(false);
opts.setAutomaticReconnect(false); // Handle manually
Auto-Reconnect: Implement custom reconnect logic with exponential backoff and jitter to prevent reconnection storms:
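private void reconnectWithBackoff() {
    // Clamp the exponent: an unbounded shift overflows int once attempt grows large,
    // which would produce a negative delay and an immediate retry loop
    int delay = Math.min(1000 * (1 << Math.min(attempt, 6)), 64000);
    delay += random.nextInt(1000); // Add up to 1 s of jitter
    scheduler.schedule(() -> {
        try {
            client.connect(opts); // Blocking connect (synchronous MqttClient assumed)
            attempt = 0; // Reset backoff after a successful connect
        } catch (Exception e) {
            attempt++;
            reconnectWithBackoff();
        }
    }, delay, TimeUnit.MILLISECONDS);
}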
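To check the backoff schedule independently of the MQTT client, the delay calculation can be pulled out into a pure helper. The following is a minimal sketch only; the class name BackoffPolicy and its constants are hypothetical and not part of any SDK. It uses only the Java standard library.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: computes reconnect delays of 1s, 2s, 4s, ... capped at 64s. */
final class BackoffPolicy {
    static final long BASE_MS = 1_000;   // first retry after about 1 s
    static final long MAX_MS = 64_000;   // never wait longer than 64 s (before jitter)
    static final int JITTER_MS = 1_000;  // jitter is drawn from [0, 1000) ms

    private BackoffPolicy() {}

    /** Deterministic part of the delay for a zero-based attempt counter. */
    static long baseDelayMillis(int attempt) {
        // Clamp the exponent to [0, 6] so the shift cannot overflow; 1000 << 6 == 64000.
        int exponent = Math.max(0, Math.min(attempt, 6));
        return Math.min(BASE_MS << exponent, MAX_MS);
    }

    /** Base delay plus uniform random jitter, so a fleet does not reconnect in lockstep. */
    static long delayMillis(int attempt) {
        return baseDelayMillis(attempt) + ThreadLocalRandom.current().nextInt(JITTER_MS);
    }
}
```

With a helper like this, reconnectWithBackoff() would only need to call BackoffPolicy.delayMillis(attempt) and pass the result to the scheduler, and the schedule itself can be unit tested without a broker.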
Network Diagnostics: Enable detailed MQTT client logging and implement connection monitoring:
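client.setCallback(new MqttCallback() {
    @Override
    public void connectionLost(Throwable cause) {
        logger.error("MQTT connection lost: {}", cause.getMessage());
        // Cast only when the cause really is an MqttException
        if (cause instanceof MqttException) {
            logger.debug("Reason code: {}", ((MqttException) cause).getReasonCode());
        }
        reconnectWithBackoff();
    }
    @Override
    public void messageArrived(String topic, MqttMessage message) { /* handle inbound messages */ }
    @Override
    public void deliveryComplete(IMqttDeliveryToken token) { /* no-op */ }
});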
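For the connection monitoring part, one option is a small in-memory tracker that the connectionLost callback feeds. This is a sketch under assumptions: the class DisconnectTracker, its method names, and the free-form reason strings are all hypothetical, and it assumes events are recorded in chronological order.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: tracks how often and why the connection drops. */
final class DisconnectTracker {
    private final Duration window;                                        // sliding window for the frequency metric
    private final Deque<Instant> recent = new ArrayDeque<>();             // timestamps inside the window
    private final Map<String, Integer> countsByReason = new HashMap<>();  // lifetime totals per reason

    DisconnectTracker(Duration window) {
        this.window = window;
    }

    /** Record one disconnect; call this from connectionLost(). */
    synchronized void record(Instant when, String reason) {
        recent.addLast(when);
        countsByReason.merge(reason, 1, Integer::sum);
        evictOlderThan(when.minus(window));
    }

    /** Number of disconnects that fall inside the window ending at 'now'. */
    synchronized int disconnectsInWindow(Instant now) {
        evictOlderThan(now.minus(window));
        return recent.size();
    }

    /** Total number of disconnects seen for a given reason. */
    synchronized int countFor(String reason) {
        return countsByReason.getOrDefault(reason, 0);
    }

    private void evictOlderThan(Instant cutoff) {
        while (!recent.isEmpty() && recent.peekFirst().isBefore(cutoff)) {
            recent.removeFirst();
        }
    }
}
```

In the callback, something like tracker.record(Instant.now(), String.valueOf(cause)) would be enough to start with; a rising disconnectsInWindow value or a single dominant reason is then easy to export as a metric or log line.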
Root cause analysis: The 10-30 minute disconnect pattern combined with no packet loss suggests a NAT timeout or firewall idle-connection closure. Many network devices close idle TCP connections after 15-30 minutes, and some are more aggressive. Your 60-second keepalive may not be frequent enough to keep the connection active through the stricter of these devices, and it delays detection once the connection has died. Additionally, the 30-60 second reconnection gap suggests the SDK's default auto-reconnect is failing on the first attempt and retrying without proper backoff.

Implementation checklist:
- Reduce keepalive to 20 seconds.
- Set connectionTimeout to 10 seconds to fail fast on network issues.
- Disable the SDK's automatic reconnect and implement custom logic with exponential backoff (1s, 2s, 4s, 8s, 16s, 32s, 64s max).
- Add jitter (random 0-1000ms) to prevent synchronized reconnections across the device fleet.
- Set cleanSession=false to preserve subscriptions across reconnects.
- Implement connection health monitoring that tracks disconnect frequency and reasons.
- Enable MQTT client debug logging to capture PINGREQ/PINGRESP exchanges.
- Monitor authentication token expiry and refresh tokens at 80% of their lifetime to prevent auth-related disconnects.

Reason codes: Capture the MQTT reason code from each disconnect event. Assuming the Eclipse Paho client that the snippets above appear to use (its codes mirror the MQTT 3.1.1 CONNACK return codes), code 4 (bad username or password) points to an authentication failure, and code 5 (not authorized) is what an expired or rejected token typically produces. A keepalive timeout does not appear as a normal disconnect: Paho reports it with a client-side code such as 32000 (timed out waiting for a response from the server) or 32109 (connection lost). If your SDK wraps the client, check its documentation for how these codes are surfaced.

Packet capture: Run tcpdump on edge devices during disconnect events to verify whether PINGRESP packets are received. If keepalive packets are sent but not acknowledged, network infrastructure is dropping them. Consider using WebSocket transport (wss://) instead of direct MQTT (mqtts://) if the devices sit behind corporate firewalls, since many firewalls handle WebSocket traffic better.

After implementing these changes, monitor connection uptime metrics. You should see sustained connections lasting hours or days instead of minutes.
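The advice to refresh tokens at 80% of their lifetime comes down to a small time calculation. The sketch below assumes the token's issue and expiry times are known as Instant values; the class TokenRefresh and its method names are hypothetical, and how the token is actually fetched depends on your platform.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: decides when to refresh a token at 80% of its lifetime. */
final class TokenRefresh {
    private TokenRefresh() {}

    /** Point in time at which 80% of the token lifetime has elapsed. */
    static Instant refreshAt(Instant issuedAt, Instant expiresAt) {
        long lifetimeMs = Duration.between(issuedAt, expiresAt).toMillis();
        if (lifetimeMs <= 0) {
            throw new IllegalArgumentException("expiresAt must be after issuedAt");
        }
        // Integer arithmetic avoids floating-point rounding of the 80% mark.
        return issuedAt.plusMillis(lifetimeMs * 80 / 100);
    }

    /** Delay to hand to a scheduler; zero if the refresh point has already passed. */
    static Duration delayUntilRefresh(Instant issuedAt, Instant expiresAt, Instant now) {
        Duration delay = Duration.between(now, refreshAt(issuedAt, expiresAt));
        return delay.isNegative() ? Duration.ZERO : delay;
    }
}
```

The resulting delay can be passed to the same scheduler used for reconnects, for example scheduler.schedule(refreshTask, delay.toMillis(), TimeUnit.MILLISECONDS), where refreshTask is whatever fetches a new token and reconnects with it.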