Monitoring agent loses connection to cloud platform every 4 hours in production environment

michelleapi · July 25, 2025, 3:49pm

Our monitoring agents deployed on edge devices disconnect from Cisco IoT Monitoring platform precisely every 4 hours. The pattern is consistent across all 150+ deployed agents, suggesting a configuration issue rather than network problems.

I suspect SSL certificate chain validation or certificate refresh timing, but I’m not seeing certificate errors in the logs. The agents use persistent connections with keep-alive enabled, but something is forcing reconnection every 240 minutes.


Agent Disconnect Event
Timestamp: 2025-06-11T04:52:33Z
Last Successful Heartbeat: 04:52:15Z
Connection Duration: 14,398 seconds (3h 59m 58s)
SSL Session: Terminated

The monitoring data gaps during reconnection (typically 30-45 seconds) are causing false alerts. Has anyone encountered similar periodic disconnection patterns? I’m running agent version 24.1.3.

marcoace · August 26, 2025, 6:37pm

Your keep-alive configuration might also need tuning. If the TCP keep-alive interval is too long, the platform might not detect stale connections promptly. Set TCP keep-alive to 60 seconds with 3 probes. This ensures dead connections are detected and cleaned up quickly, preventing false disconnection alerts.

kathleenking · August 7, 2025, 9:32pm

Good points. I checked the agent config and found connection_pool_max_lifetime=14400 which matches the disconnect interval. However, I’m not sure if simply increasing this value is the right solution. Should I disable the lifetime limit entirely or set it to something higher? Also concerned about the certificate chain - how do I verify all intermediates are included?

marcoace · July 27, 2025, 4:59pm

I’ve seen this with connection pooling configurations. If your pool has a max connection lifetime set to 14400 seconds (4 hours), it will forcibly close and recreate connections. Check your agent’s connection pool settings. Also, verify the cloud platform’s load balancer isn’t rotating backend connections on a 4-hour schedule.

alessia_ops · August 14, 2025, 3:27am

Don’t disable the connection lifetime entirely - stale connections can accumulate and cause other issues. Set it to 24 hours (86400 seconds) and implement graceful reconnection logic. For certificate chain validation, use openssl s_client to verify the complete chain from your agent’s perspective. The platform endpoint should present the full chain including intermediates.

Topic		Views
Edge device monitoring alerts not triggering after security policy update for IoT gateway nodes SAP IoT question , monitoring , missed-alerts , edge-security , tls-certificate , sapiot-25 , iot-gateway-edge , alerts-not-triggering , outbound-connections	5	July 13, 2025
IoT data stream disconnects frequently due to TCP timeouts in Cloud Functions Google Cloud IoT question , connectivity , data-loss , cloud-functions , data-streaming , keepalive , iot-core , gcpiot-24 , tcp-timeout	5	August 21, 2025
Monitoring alerts not triggering when IoT device goes offline despite correct Pub/Sub subscription setup Google Cloud IoT question , monitoring , alerting , cloud-monitoring , device-connectivity , gcpiot-25 , alert-policy , pubsub-integration , connection-metrics	3	November 16, 2025
Virtual Server CPU metrics not appearing in IBM Cloud Monitoring dashboard after agent deployment IBM Cloud question , compute , ic-2019 , json , network-security , metric-collection , monitoring-mana , ibm-cloud-monit , agent-config	3	June 14, 2025
Integration SDK MQTT connection drops frequently on aziotc edge devices Microsoft Azure IoT question , integration , api-development , connectivity-loss , mqtt , aziotc , integration-sdk , mqtt-conn-drop , network-diagnostics	5	January 27, 2025
Monitoring module reports random MQTT connection resets during high-frequency telemetry IBM Watson IoT question , monitoring , performance-opt , scripting-auto , data-ingestion , mqtt , wiot-ea , data-gaps , mqtt-reset	6	April 9, 2025
CloudMonitor metric collection fails after upgrading to ac-2021 observability module Alibaba Cloud question , compute , observability , ac-2021 , cloudmonitor , metric-collection , monitoring-gaps , agent-configuration , network-troubleshooting	5	November 5, 2025
MQTT connection timeouts when syncing billing records from edge devices to cloud Cisco IoT Cloud Connect question , connectivity , billing-mgmt , firewall-config , connection-timeout , payload-optimization , mqtt-protocol , mqtt , cciot-25	7	May 17, 2025
VPN tunnel between edge gateway and cloud platform drops intermittently causing data loss Cisco IoT Cloud Connect question , connectivity , data-buffering , gateway-mgmt , iod-23 , vpn-tunnel-stability , tunnel-redundancy , health-monitoring , reconnection	5	November 20, 2025

Monitoring agent loses connection to cloud platform every 4 hours in production environment

Related topics