We’re architecting a global IoT platform and debating edge compute vs centralized cloud deployment. The use case involves processing sensor data from 50+ manufacturing sites worldwide with latency requirements under 100ms for real-time decisions.
Centralized approach: All data streams to a single OCI region for processing on Compute instances with autoscaling. Simpler operations, easier monitoring, lower total infrastructure cost.
Edge approach: OCI Compute instances at or near each site, local processing, results aggregated to central cloud. Better latency, continues working during network outages, but 50x the operational complexity.
Hybrid: Critical real-time processing at edge, analytics and ML training centralized. Best of both worlds but most complex architecture.
What patterns have worked for others? How do you calculate the true TCO when factoring in operational overhead? At what scale does edge compute justify the complexity?
We went full edge for 30 production facilities and it’s been worth it despite the complexity. The key factor was HA requirements - manufacturing lines can’t wait for cloud connectivity. Local processing means we maintain 99.99% uptime even when WAN links fail. Yes, managing 30 edge deployments is harder, but downtime costs far exceed operational overhead in our case.
Good point on data egress. We’re generating 2TB/day across all sites. Streaming everything to cloud would be $180K/year in egress alone. Edge processing reduces that to summary data (~50GB/day). That partially offsets the infrastructure cost. Still struggling with the operational complexity though - how do you handle edge deployments at scale?
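Rough math on those figures, for anyone weighing the same tradeoff. The per-GB rate here is back-solved from our $180K/year number (about $0.25/GB); substitute your own negotiated OCI egress rate.

```python
# Back-of-envelope egress comparison using the figures above.
# RATE_PER_GB is an assumption implied by the $180K/year quote, not list pricing.
RATE_PER_GB = 180_000 / (2_000 * 365)  # ~$0.25/GB

def annual_egress_cost(gb_per_day: float, rate_per_gb: float = RATE_PER_GB) -> float:
    """Annual egress spend for a given daily outbound volume."""
    return gb_per_day * 365 * rate_per_gb

full_stream = annual_egress_cost(2_000)  # stream everything: ~$180K/year
summaries = annual_egress_cost(50)       # edge-processed summaries only
print(f"savings: ${full_stream - summaries:,.0f}/year")
```

Even at a much lower rate, the ~40x reduction in outbound volume is what moves the needle.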
Having implemented both patterns across multiple industries, I can share some insights on the tradeoffs:
Edge vs Centralized Compute TCO:
The TCO calculation needs to include these often-overlooked factors:
- Infrastructure costs - Edge is 2.5-4x higher for compute resources, but you save 60-80% on data egress if processing locally
- Operational overhead - Edge requires 40-50% more DevOps resources initially, but this levels off after year one with proper automation
- Downtime costs - This is the wildcard. If 1 hour of downtime costs $50K+ (common in manufacturing), edge pays for itself in the first year through improved availability
- Latency penalties - Hard to quantify but real. Delayed decisions in real-time systems cascade into quality issues, waste, or safety concerns
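The first three factors drop into a simple annual-cost model. A minimal sketch below; every input value is a placeholder to be replaced with your own quotes and incident history, and the latency penalty is left out because it resists a single dollar figure:

```python
from dataclasses import dataclass

# Illustrative TCO comparison covering compute, egress, ops, and downtime.
# All numbers are placeholders, not real quotes.

@dataclass
class TcoInputs:
    compute_cost: float            # annual compute spend
    egress_cost: float             # annual data egress spend
    devops_ftes: float             # FTEs dedicated to operations
    fte_cost: float                # fully loaded annual cost per FTE
    downtime_hours: float          # expected unplanned downtime per year
    downtime_cost_per_hour: float  # business impact per hour down

def annual_tco(t: TcoInputs) -> float:
    return (t.compute_cost + t.egress_cost
            + t.devops_ftes * t.fte_cost
            + t.downtime_hours * t.downtime_cost_per_hour)

# Centralized: cheaper compute and ops, full egress, more WAN-driven downtime.
central = TcoInputs(400_000, 180_000, 2.0, 180_000, 20, 50_000)
# Edge: ~3x compute, minimal egress, heavier ops, far less downtime.
edge = TcoInputs(1_200_000, 5_000, 3.0, 180_000, 1, 50_000)

print(f"centralized: ${annual_tco(central):,.0f}  edge: ${annual_tco(edge):,.0f}")
```

With these placeholder numbers edge comes out ahead purely on avoided downtime, which matches the point above: the downtime term is the wildcard that usually decides the comparison.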
Latency and HA Requirements:
Your 100ms requirement makes centralized cloud infeasible for the critical path. However, not all processing needs a sub-100ms response. A pattern we’ve deployed successfully:
- Edge: Real-time control decisions, anomaly detection, immediate alerts (sub-50ms)
- Regional cloud: Aggregation, reporting, medium-priority analytics (1-5 second latency acceptable)
- Central cloud: ML model training, long-term storage, enterprise integration (latency irrelevant)
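In code, the tiering rule is just a dispatch on latency budget. A sketch using the budgets from the three tiers above (the workload names in the asserts are examples, not a fixed taxonomy):

```python
from enum import Enum

class Tier(Enum):
    EDGE = "edge"          # sub-50ms: control decisions, anomaly detection, alerts
    REGIONAL = "regional"  # 1-5s: aggregation, reporting, medium-priority analytics
    CENTRAL = "central"    # latency-irrelevant: ML training, long-term storage

def place_workload(latency_budget_ms: float) -> Tier:
    """Map a workload's latency budget to the cheapest tier that can meet it."""
    if latency_budget_ms <= 50:
        return Tier.EDGE
    if latency_budget_ms <= 5_000:
        return Tier.REGIONAL
    return Tier.CENTRAL

assert place_workload(20) is Tier.EDGE                 # e.g. anomaly detection
assert place_workload(2_000) is Tier.REGIONAL          # e.g. shift reporting
assert place_workload(float("inf")) is Tier.CENTRAL    # e.g. model training
```

The useful discipline is forcing every workload to declare a budget up front; anything that can tolerate more than a few seconds should never consume scarce edge capacity.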
For HA, edge nodes must be autonomous. Design for “occasionally connected” - they should operate 100% independently for 24-48 hours during network outages. Sync state when connectivity returns.
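The standard building block for "occasionally connected" is a durable local outbox: events always land on local disk first, and a drain loop ships them upstream only when the WAN link is up, deleting each row only after its upload succeeds. A minimal sketch; the `upload` callable is a hypothetical stand-in for your real sync client:

```python
import json
import sqlite3
import time

class StoreAndForward:
    """Durable local event queue for an occasionally-connected edge node."""

    def __init__(self, path: str = "edge_queue.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS outbox "
                        "(id INTEGER PRIMARY KEY, ts REAL, payload TEXT)")

    def record(self, event: dict) -> None:
        # Always durable locally, regardless of connectivity.
        self.db.execute("INSERT INTO outbox (ts, payload) VALUES (?, ?)",
                        (time.time(), json.dumps(event)))
        self.db.commit()

    def drain(self, upload, batch: int = 100) -> int:
        # Called when connectivity returns. A row is deleted only after its
        # upload succeeds, so a mid-drain outage loses nothing.
        rows = self.db.execute(
            "SELECT id, payload FROM outbox ORDER BY id LIMIT ?", (batch,)
        ).fetchall()
        for row_id, payload in rows:
            upload(json.loads(payload))
            self.db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        self.db.commit()
        return len(rows)
```

Size the local disk for your worst-case outage window (the 24-48 hours above) at peak event rate, and make the upstream consumer idempotent, since at-least-once delivery means the occasional duplicate after a crash mid-drain.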
Hybrid Deployment Patterns:
This is the pragmatic approach for your scale. Specific recommendations:
- Deploy lightweight edge compute (2-4 OCPUs per site) for critical real-time workloads only
- Use OCI Object Storage as the data handoff point - edge nodes upload processed results/events, cloud pulls for analytics
- Implement bidirectional sync for ML models - train centrally, deploy to edge automatically
- Centralize monitoring and alerting - every edge node reports health metrics to OCI Monitoring, alerts flow to central NOC
- Standardize edge infrastructure with immutable deployments - no SSH access, updates via blue/green deployment automation
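For the Object Storage handoff, the main design decision is the key scheme: if edge nodes write summaries under site- and date-partitioned names, the central analytics jobs can list by prefix and pull incrementally. A sketch of one such convention (the key layout is an assumption, and `put_object` is an injected callable, e.g. a thin wrapper around the OCI SDK's `ObjectStorageClient.put_object`):

```python
import datetime
import json

def object_name(site_id: str, when: datetime.datetime) -> str:
    """Site/date-partitioned key, e.g. summaries/site-017/2025/06/01/143000.json."""
    return f"summaries/{site_id}/{when:%Y/%m/%d/%H%M%S}.json"

def publish_summary(put_object, site_id: str, summary: dict) -> str:
    """Serialize a processed summary and hand it off under the key convention.

    put_object(name, data) is a stand-in for your real Object Storage client.
    """
    name = object_name(site_id, datetime.datetime.now(datetime.timezone.utc))
    put_object(name, json.dumps(summary).encode("utf-8"))
    return name
```

The same prefix layout works in reverse for model sync: the training pipeline publishes versioned models under a `models/` prefix and edge nodes poll or subscribe for new versions.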
For 50 sites, the operational complexity is manageable but requires investment in automation. Budget 2-3 FTE for the first year building deployment pipelines, monitoring infrastructure, and runbooks. After that, 0.5-1 FTE can manage 50+ edge nodes.
The hybrid pattern we recommend: 80% of compute capacity centralized (cost-efficient, easy to scale), 20% at edge (latency-critical only). This balances cost, complexity, and performance requirements. Start with 5-10 pilot sites, validate the pattern, then roll out globally using infrastructure-as-code.
One final consideration: edge compute makes sense when you have stable, well-understood workloads. If you’re still experimenting with algorithms and processing logic, start centralized for faster iteration, then push proven workloads to edge once they’re mature.
The 100ms latency requirement is your deciding factor. You can’t reliably achieve sub-100ms from manufacturing sites to centralized cloud - network physics won’t allow it for a global deployment. Even with OCI regions strategically placed, you’re looking at 150-300ms round trips for many locations. When latency is a hard requirement, edge compute isn’t optional; it’s the only architecture that works. Focus your debate on hybrid patterns and which workloads stay local vs centralized.