Having implemented monitoring architectures for multiple large-scale IoT deployments, I’ll share insights on all three areas:
Edge vs Central Alerting:
The optimal strategy depends on your operational model and failure modes:
Edge alerting advantages:
- Zero dependency on network connectivity for device health visibility
- Sub-second alert latency for critical device failures
- Enables automated local remediation (restart services, failover devices)
- Reduces central monitoring load and network bandwidth
- Maintains visibility during network partitions
Central alerting advantages:
- Single pane of glass for all locations
- Easier correlation of issues across multiple sites
- Simpler operational model (one monitoring stack)
- Better for trend analysis and capacity planning
- Centralized alert routing and escalation
For your 800+ device deployment across 15 sites, I recommend a tiered alerting strategy:
- Critical device failures → Edge alerts with local notification
- Performance degradation → Edge detection, central alerting
- Trend analysis and anomalies → Central monitoring only
- Network connectivity issues → Edge detection (can’t rely on central)
Implement alert correlation at the central level to deduplicate edge-originated alerts that indicate the same root cause.
Network Outage Resilience:
Network resilience is where edge monitoring truly shines. During outages:
Edge capabilities:
- Continue monitoring device health independently
- Store alerts locally with timestamps
- Execute automated remediation playbooks
- Maintain historical metrics for post-outage analysis
- Provide local dashboards for on-site staff
Central limitations during outages:
- Complete loss of real-time visibility
- No ability to trigger remediation actions
- Alert gaps in the timeline
- Delayed incident response
With varying network reliability across your 15 sites, edge monitoring becomes essential. Sites with poor connectivity need autonomous monitoring that doesn’t depend on the central system. Implement:
- Local alert storage with sync-on-reconnect
- Edge-based automated remediation for common failures
- Local metric retention (7-30 days) for troubleshooting
- Heartbeat monitoring from central to detect edge monitoring failures
Operational Complexity:
Yes, hybrid monitoring increases operational complexity, but it’s manageable with the right approach:
Complexity factors:
- Two monitoring stacks to maintain and upgrade
- Configuration management across 15+ edge locations
- Alert routing logic between edge and central
- Training ops team on both systems
- Troubleshooting monitoring issues at edge locations
Mitigation strategies:
- Use the same monitoring tooling at edge and central (Prometheus + Grafana, or Datadog agents everywhere)
- Centralized configuration management - deploy edge monitoring configs from central repository
- Automated edge monitoring deployment - treat monitoring as infrastructure-as-code
- Clear ownership model - define which alerts are handled locally vs centrally
- Comprehensive runbooks for common monitoring issues
Operational model recommendation:
- Edge monitoring: Focused on device health, availability, and immediate issues
- Central monitoring: Aggregated analytics, capacity planning, cross-site correlation
- Shared responsibility: Edge teams handle local alerts, central SRE handles trends and optimization
For your specific scenario with 800+ devices and varying network reliability, the operational complexity of hybrid monitoring is absolutely worth it. The alternative - central-only monitoring - leaves you blind during network outages, which seems to be a recurring issue in your environment.
Implementation recommendation:
- Start with edge monitoring at your 3-4 sites with worst network reliability
- Prove the value before rolling out to all 15 sites
- Use identical tooling at edge and central to minimize operational overhead
- Implement automated alert deduplication and correlation
- Provide clear escalation paths for both edge and central alerts
The resilience and reduced MTTR during network outages will far outweigh the additional operational complexity.