Greengrass firmware updates: comparing OTA vs local push for edge reliability

We’re designing our firmware update strategy for 10,000 Greengrass edge devices deployed across remote locations with varying network reliability. Debating between OTA updates via AWS IoT Jobs versus local push updates using physical access.

OTA updates are attractive for automation, but we're concerned about network reliability at remote sites. Some locations have intermittent connectivity that could interrupt large firmware downloads. Local push updates guarantee completion but don't scale well and require field-technician visits.

What experiences have others had with OTA vs local push methods? How do you handle update automation when network reliability varies significantly across deployment sites? Looking for real-world insights on balancing automation benefits against reliability requirements.

From a field operations perspective, local push updates are extremely expensive at scale. Each site visit costs $500-1000 in technician time and travel, so for 10k devices that's $5-10M for a single update cycle. OTA updates have infrastructure costs, but the per-device cost is essentially zero after initial setup. Even with a 5% failure rate requiring site visits, OTA is vastly more cost-effective.

Let me provide a comprehensive comparison based on our multi-year experience with both approaches:

OTA vs Local Push Analysis:

OTA Updates (AWS IoT Jobs + Greengrass):

Advantages:

  • Full automation capability - zero manual intervention for successful updates
  • Progressive rollout support - staged deployment with automatic pause on failures
  • Built-in resume/retry - handles intermittent connectivity gracefully
  • A/B partition support - automatic rollback on health check failures
  • Real-time monitoring - CloudWatch metrics for update progress and success rates
  • Centralized control - manage 10k devices from single console
  • Cost-effective at scale - near-zero marginal cost per device

Disadvantages:

  • Network dependency - requires sufficient bandwidth and connectivity windows
  • Initial setup complexity - requires proper device partitioning and health checks
  • Security considerations - need secure firmware distribution and verification
  • Monitoring overhead - must track update status across large fleet

Local Push Updates:

Advantages:

  • Network independent - works regardless of connectivity
  • Guaranteed completion - technician verifies successful update
  • Physical verification - can inspect device state directly
  • Immediate rollback - technician can restore from backup on-site
  • No cloud dependency - works even if AWS connection is down

Disadvantages:

  • Doesn’t scale - requires site visits for every device
  • High cost - $500-1000 per site visit × 10k devices = $5-10M per update
  • Slow deployment - months to complete full fleet update
  • Human error risk - incorrect firmware versions or procedures
  • No centralized visibility - hard to track completion status
  • Update fatigue - frequent updates become operationally infeasible

Network Reliability Considerations:

For sites with varying connectivity, implement tiered strategies:

  1. High Reliability Sites (>95% uptime):

    • Full OTA automation
    • Aggressive update schedules
    • Minimal monitoring required
  2. Medium Reliability Sites (80-95% uptime):

    • OTA with extended timeouts
    • Longer download windows (24-72 hours)
    • Staged rollouts per site
    • Automated retry logic
  3. Low Reliability Sites (<80% uptime):

    • OTA with pre-staging during connectivity windows
    • Update execution during next connection
    • Fallback to local push only if OTA fails after 3 attempts
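The tiering above can be sketched as a small classifier. The uptime thresholds come from the tiers listed; the download windows, retry counts, and function names are illustrative assumptions, not AWS defaults:

```python
# Sketch: map a site's observed connectivity uptime (%) to an OTA strategy.
# Thresholds mirror the three tiers above; timeout/retry values are
# illustrative assumptions you would tune from your own telemetry.

def classify_site(uptime_pct: float) -> dict:
    """Return the update-strategy parameters for a site."""
    if uptime_pct > 95:
        return {"tier": "high", "download_window_h": 4,
                "max_ota_attempts": 2, "pre_stage": False}
    if uptime_pct >= 80:
        return {"tier": "medium", "download_window_h": 72,
                "max_ota_attempts": 3, "pre_stage": False}
    # Low-reliability sites: pre-stage the firmware image during
    # connectivity windows, execute the update on the next connection.
    return {"tier": "low", "download_window_h": 168,
            "max_ota_attempts": 3, "pre_stage": True}

def needs_local_push(uptime_pct: float, failed_attempts: int) -> bool:
    """Schedule a technician visit only after OTA retries are exhausted."""
    return failed_attempts >= classify_site(uptime_pct)["max_ota_attempts"]
```

A fleet-management service would run `needs_local_push` per device after each failed job execution, so site visits are batched only for the residual failures rather than planned for the whole fleet.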

Update Automation Best Practices:

Our production implementation:

  1. Pre-Update Validation:

    • Device health check before starting update
    • Verify sufficient storage space
    • Check battery level (for battery-powered devices)
    • Confirm network bandwidth availability
  2. Progressive Rollout Strategy:

    
    Stage 1: Canary group (1% of fleet, 50-100 devices)
    - Wait 24 hours, monitor metrics
    
    Stage 2: Early adopters (10% of fleet)
    - Wait 48 hours, analyze telemetry
    
    Stage 3: Broad deployment (50% of fleet)
    - Wait 24 hours, verify stability
    
    Stage 4: Remaining devices (39% of fleet)
    - Complete rollout
    
  3. Health Check Implementation:

    • Post-update device reboot
    • Automatic connectivity test
    • Application-level health verification
    • Automatic rollback if any check fails
    • Report status to IoT Jobs
  4. Rollback Automation:

    • Maintain previous firmware on separate partition
    • Automatic revert on boot failure (3 consecutive failures)
    • Manual rollback trigger via IoT Jobs
    • Preserve device configuration across rollbacks
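The health-check and rollback rules in steps 3 and 4 reduce to a small decision function. This is a sketch of the logic only; the field names are illustrative, and on a real Greengrass device this would run in a post-boot hook and report the resulting status to IoT Jobs:

```python
# Sketch of the post-update health gate and A/B rollback decision described
# above. "3 consecutive boot failures" is the threshold stated in the text.

ROLLBACK_AFTER_FAILURES = 3  # consecutive boot failures before auto-revert

def health_check(device: dict) -> bool:
    """Post-reboot checks: connectivity test plus app-level health."""
    return device.get("connected", False) and device.get("app_healthy", False)

def next_action(device: dict, boot_failures: int) -> str:
    """Decide what the updater should do after a reboot."""
    if boot_failures >= ROLLBACK_AFTER_FAILURES:
        return "rollback"        # revert to the previous firmware partition
    if health_check(device):
        return "report_success"  # mark the IoT Jobs execution SUCCEEDED
    return "retry_boot"          # increment the counter and reboot again
```

The key design point is that the rollback decision is made on-device: it must work even when the update broke connectivity, which is exactly when the cloud can't reach the device.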

Cost-Benefit Analysis:

For 10,000 devices with quarterly updates:

OTA Approach:

  • Initial setup: $50k one-time (engineering + infrastructure)
  • Per-update cost: $2k (monitoring + AWS services)
  • First-year cost: $58k (setup + 4 update cycles)
  • 95% OTA success rate; the remaining 5% (500 devices × $500/visit) ≈ $250k in site visits
  • Total annual cost: $308k

Local Push Approach:

  • Per-update cost: $5M (10k devices × $500 per visit)
  • Annual cost: $20M (4 updates/year)
  • 100% success rate (by definition)
  • Total annual cost: $20M

Savings with OTA: $19.7M annually (98.5% cost reduction)
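The arithmetic above can be reproduced directly. One assumption implicit in the figures is made explicit here: the 5% of devices that fail OTA are remediated in a single annual sweep at $500/visit (the low end of the quoted $500-1000 range):

```python
# Reproduces the cost-benefit figures above for a 10k-device fleet with
# quarterly updates. Assumption: OTA failures are batched into one annual
# remediation round at $500 per visit.

FLEET = 10_000
UPDATES_PER_YEAR = 4
VISIT_COST = 500  # USD, low end of the quoted range

# OTA: one-time setup + per-update monitoring/AWS costs + failure remediation
ota_setup = 50_000
ota_per_update = 2_000
ota_failure_rate = 0.05
ota_annual = (ota_setup + UPDATES_PER_YEAR * ota_per_update
              + int(FLEET * ota_failure_rate) * VISIT_COST)

# Local push: every device visited on every update cycle
local_annual = FLEET * VISIT_COST * UPDATES_PER_YEAR

savings = local_annual - ota_annual
reduction_pct = round(100 * savings / local_annual, 1)
print(ota_annual, local_annual, savings, reduction_pct)
```

Note how sensitive the local-push side is to update cadence: halving the number of update cycles halves its cost, while the OTA side barely moves.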

Recommended Strategy:

Implement OTA as primary method with local push as fallback:

  1. Deploy OTA infrastructure with proper health checks and rollback
  2. Classify sites by network reliability
  3. Use progressive rollouts starting with high-reliability sites
  4. Monitor update success rates per site
  5. Schedule local push only for repeated OTA failures
  6. Continuously improve OTA success rates based on telemetry
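Steps 1 and 3-4 map onto the rollout and abort configuration of AWS IoT Jobs. The field names below follow the IoT `CreateJob` API; the specific rates and thresholds are illustrative assumptions to be tuned per fleet:

```python
# Sketch of an IoT Jobs configuration implementing staged rollout with an
# automatic pause on failures. Field names follow the AWS IoT CreateJob API;
# numeric values are illustrative, not recommendations.

def build_job_config(max_per_minute: int = 50,
                     abort_failure_pct: float = 10.0) -> dict:
    return {
        "jobExecutionsRolloutConfig": {
            "maximumPerMinute": max_per_minute,
            "exponentialRate": {
                "baseRatePerMinute": 5,    # start slowly, canary-style
                "incrementFactor": 2.0,    # ramp up as devices succeed
                "rateIncreaseCriteria": {"numberOfSucceededThings": 100},
            },
        },
        "abortConfig": {
            "criteriaList": [{
                "failureType": "FAILED",
                "action": "CANCEL",        # auto-pause the whole rollout
                "thresholdPercentage": abort_failure_pct,
                "minNumberOfExecutedThings": 50,
            }],
        },
        # Generous in-progress timeout for low-connectivity sites
        "timeoutConfig": {"inProgressTimeoutInMinutes": 72 * 60},
    }
```

This dict would be passed (spread into the corresponding parameters) to a `create_job` call; the abort criteria give you the "pause on failures" behavior without any operator in the loop.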

This approach captures the large majority of the automation benefit (90%+ of updates fully automated) while maintaining reliability guarantees through strategic use of local push for the small percentage of problematic updates.

The data clearly shows OTA updates are superior for fleet management at scale, with network reliability concerns being addressable through proper implementation of timeouts, retries, and progressive rollouts.

For update automation, we built a progressive rollout system using IoT Jobs dynamic groups. Start with 1% of devices (canary group), monitor for 24 hours, then 10%, then 50%, then remaining devices. Each stage waits for health metrics before proceeding. This catches firmware issues early while still being fully automated. We can pause or rollback at any stage. Local push can’t match this level of control and safety.
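The stage sizing described above (1% canary, then 10%, 50%, and the remainder) is simple to compute per fleet. A minimal sketch, with the percentages taken from the text and the function name being illustrative:

```python
# Sketch: compute per-stage device counts for the progressive rollout
# described above. The final stage absorbs whatever remains of the fleet.

def rollout_stages(fleet_size: int,
                   fractions=(0.01, 0.10, 0.50)) -> list[int]:
    """Return the number of devices targeted at each rollout stage."""
    counts = [int(fleet_size * f) for f in fractions]
    counts.append(fleet_size - sum(counts))  # remainder (39% for 10k)
    return counts
```

Each stage's device count would become the membership of a dynamic thing group targeted by the job, with the health-metric wait gate deciding when to promote to the next group.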