We’re evaluating monitoring strategies for our expanding IoT deployment (currently 600 devices, growing to 2000+ next year). Trying to decide between SNMP polling for device health monitoring versus REST API-based health checks. SNMP has lower overhead and is proven for network device monitoring, but REST API provides richer device context and integrates better with Oracle IoT Cloud Platform. What are the scalability trade-offs? Has anyone successfully deployed either approach at scale (1000+ devices) and can share lessons learned on polling intervals, data collection overhead, and monitoring infrastructure requirements?
We use REST API health checks for 1500 devices. The main advantage is tight integration with Oracle IoT Cloud Platform’s device registry and shadow state. Health check responses include not just connectivity status but also firmware version, last telemetry timestamp, error counts, and custom health metrics. The overhead is manageable if you poll by criticality tier: we check critical devices every 2 minutes, standard devices every 10 minutes, and low-priority devices every 30 minutes.
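A minimal sketch of how tiered polling like this could be scheduled. The intervals are the ones quoted above; the `Device` class, the tier names, and the helper functions are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

# Polling intervals in seconds, taken from the tiers described in the post.
TIER_INTERVAL_S = {"critical": 120, "standard": 600, "low": 1800}


@dataclass
class Device:
    # Hypothetical device record; field names are assumptions.
    device_id: str
    tier: str
    last_checked: float = float("-inf")  # never checked yet


def due_devices(devices, now):
    """Return the devices whose tier interval has elapsed at time `now` (seconds)."""
    return [d for d in devices if now - d.last_checked >= TIER_INTERVAL_S[d.tier]]


def mark_checked(devices, now):
    """Record that the given devices were health-checked at time `now`."""
    for d in devices:
        d.last_checked = now
```

A scheduler loop would call `due_devices` on each tick, issue the health checks, then call `mark_checked` on the devices it just polled.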
SNMP polling has served us well for 800 industrial devices. The key benefit is that SNMP is lightweight and doesn’t require application-layer processing on devices - most embedded systems support SNMP natively. We poll standard MIBs (sysUpTime, ifOperStatus) every 5 minutes and get reliable health indicators with minimal device CPU overhead. However, SNMP doesn’t integrate well with cloud-native monitoring tools, so we had to build custom bridges to feed SNMP data into our monitoring dashboard.
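For illustration, here is one way the two polled values could be turned into a health verdict. It assumes the values have already been fetched by whatever SNMP library or tool you use; the `assess` function and its finding strings are hypothetical. The OIDs and the ifOperStatus enumeration are the standard MIB-II / IF-MIB ones.

```python
# Standard OIDs for the two objects mentioned above.
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"
IF_OPER_STATUS_OID = "1.3.6.1.2.1.2.2.1.8"  # append .<ifIndex> per interface

# ifOperStatus enumeration as defined in IF-MIB.
IF_OPER_STATUS = {
    1: "up", 2: "down", 3: "testing", 4: "unknown",
    5: "dormant", 6: "notPresent", 7: "lowerLayerDown",
}


def assess(prev_ticks, curr_ticks, oper_status):
    """Classify one poll result (hypothetical helper).

    sysUpTime is reported in hundredths of a second. A value lower than the
    previous poll usually means a reboot, although the counter also wraps
    after roughly 497 days.
    """
    findings = []
    if prev_ticks is not None and curr_ticks < prev_ticks:
        findings.append("rebooted")
    status = IF_OPER_STATUS.get(oper_status, "unknown")
    if status != "up":
        findings.append(f"interface {status}")
    return findings or ["healthy"]
```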
Great insights from everyone. Based on the discussion, I’m leaning toward Maria’s hybrid approach. For our deployment profile (mix of industrial gateways and edge sensors), SNMP makes sense for infrastructure health while REST APIs provide application-layer visibility. The key challenge is orchestrating both monitoring streams into a unified view. Has anyone implemented tooling to correlate SNMP and REST API health data?
We built a monitoring aggregator service that collects both SNMP traps and REST API health responses, normalizes them into a common schema, and publishes to our central monitoring platform. The aggregator runs on Kubernetes and scales horizontally based on device count. For correlation, we use device ID as the common key across both data sources. The normalized health events feed into our alerting rules engine, which triggers alerts based on combined SNMP and REST health indicators.
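A rough sketch of what the normalization and correlation step might look like. The raw record shapes, field names, and the "healthy only if every source agrees" rule are all assumptions for illustration, not the actual schema or alerting logic described above.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HealthEvent:
    # Hypothetical common schema shared by both data sources.
    device_id: str
    source: str          # "snmp" or "rest"
    healthy: bool
    details: dict = field(default_factory=dict)


def from_snmp(raw):
    # Assumed SNMP record: {"device_id": ..., "ifOperStatus": 1, "sysUpTime": ...}
    return HealthEvent(raw["device_id"], "snmp", raw["ifOperStatus"] == 1,
                       {"sysUpTime": raw["sysUpTime"]})


def from_rest(raw):
    # Assumed REST payload: {"deviceId": ..., "status": "OK", "errorCount": ...}
    return HealthEvent(raw["deviceId"], "rest", raw["status"] == "OK",
                       {"errorCount": raw["errorCount"]})


def correlate(events):
    """Group events by device ID; a device counts as healthy only if all sources say so."""
    by_device = defaultdict(list)
    for event in events:
        by_device[event.device_id].append(event)
    return {dev: all(e.healthy for e in evs) for dev, evs in by_device.items()}
```

With this shape, a gateway whose interface is up but whose REST status is degraded would be flagged, which is the kind of combined signal a rules engine could alert on.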
Some practical considerations from our implementation:
SNMP Polling: We use 5-minute intervals for infrastructure metrics (CPU, memory, uptime). SNMP is great for detecting low-level issues like resource exhaustion or network connectivity problems. The overhead is minimal - our SNMP polling infrastructure handles 1500 devices with a single monitoring server consuming less than 2 CPU cores.
REST API Health Checks: We poll application health every 10 minutes for standard devices, 2 minutes for critical devices. The REST responses include custom health indicators like message queue depth, last successful telemetry upload timestamp, and device-specific error codes. This gives us application-layer visibility that SNMP can’t provide.
Scalability Trade-offs: The hybrid approach does add complexity, but it scales well. SNMP handles the low-overhead infrastructure monitoring across every device, while REST APIs provide deep health insights, polled less often for everything except critical devices. The key is using the right tool for each monitoring dimension rather than forcing one protocol to do everything.
For your 2000-device deployment, I’d recommend: SNMP for all devices (infrastructure health), REST API for critical devices and gateways (application health), and a monitoring aggregator to unify the data streams. This balances overhead, richness, and scalability effectively.
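The recommendation can be expressed as a small rule that could be used to generate monitoring configuration automatically. This is only a sketch: the device dictionary keys (`kind`, `critical`) and the check names are hypothetical.

```python
def monitoring_plan(device):
    """Return the set of checks for a device dict with 'kind' and 'critical' keys."""
    checks = {"snmp"}  # infrastructure health for every device
    if device.get("critical") or device.get("kind") == "gateway":
        checks.add("rest")  # application health for critical devices and gateways
    return checks
```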
Scalability-wise, REST API health checks can become a bottleneck as you approach 2000+ devices. Each health check is an HTTP request with a TLS handshake, authentication, and JSON parsing overhead. At 2000 devices with 5-minute polling, that’s 400 requests per minute of sustained load (roughly 7 per second). Make sure your API gateway and backend services can handle this. We had to implement connection pooling and request batching to avoid overwhelming the platform. SNMP scales better from a protocol perspective but lacks the semantic richness you need for IoT-specific health metrics.
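The load figure is easy to reproduce, and the same arithmetic shows how much a tiered schedule would change it. The function names are hypothetical, and the 200/1300/500 split across tiers is an invented example, not a figure from this thread.

```python
def requests_per_minute(device_count, interval_minutes):
    """Sustained request rate when every device is polled once per interval."""
    return device_count / interval_minutes


def tiered_requests_per_minute(tiers):
    """Sum the rate over (device_count, interval_minutes) pairs, one per tier."""
    return sum(count / minutes for count, minutes in tiers)


# Uniform 5-minute polling across 2000 devices.
uniform = requests_per_minute(2000, 5)

# Assumed split: 200 critical @ 2 min, 1300 standard @ 10 min, 500 low @ 30 min.
tiered = tiered_requests_per_minute([(200, 2), (1300, 10), (500, 30)])
```

Under that assumed split the tiered schedule comes out at roughly 247 requests per minute, compared with 400 for uniform polling.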
Consider a hybrid approach. Use SNMP for basic connectivity and resource monitoring (CPU, memory, network stats) because it’s efficient and standardized. Layer REST API health checks on top for application-specific health metrics (queue depths, error rates, business logic status). This gives you the efficiency of SNMP for infrastructure monitoring plus the richness of REST APIs for application health. The trade-off is increased monitoring complexity, but it scales well if you automate the monitoring configuration.