Zero-downtime firmware updates for critical machinery using Greengrass staged rollout

We deployed a zero-downtime firmware update system for critical manufacturing machinery using AWS IoT Greengrass v2 staged rollout capabilities. Our production lines can’t tolerate unexpected downtime, so firmware updates required a robust validation process.

The solution implements staged firmware rollout with automated health checks at each stage. We start with a canary deployment to 5% of devices, validate operational metrics for 24 hours, then expand to the early group (a further 20% of devices) and finally to the remaining 75% of the fleet. If health checks fail at any stage, the system automatically rolls back to the previous firmware version. This approach eliminated production disruptions from bad firmware while enabling continuous improvement of edge capabilities.

Our zero-downtime firmware update system has been operational for 14 months, handling 23 firmware releases across 340 edge devices with zero production incidents. Here’s the architecture:

Staged Firmware Rollout: We use AWS IoT Greengrass v2 deployment configurations with thing groups to control rollout stages. Devices are organized into deployment groups: canary (5%, 17 devices), early (20%, 68 devices), and production (75%, 255 devices). Each stage has a separate thing group in IoT Core.

The deployment process starts with the canary group. We use Greengrass continuous deployments with a 24-hour monitoring window. The deployment includes the new firmware component plus a health monitoring component that runs in parallel. If the canary succeeds, EventBridge triggers a Lambda that promotes the deployment to the early group, waits another 24 hours, then promotes it to the production group.
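In rough Python, the promotion Lambda boils down to re-issuing the same deployment against the next stage's thing group. This is a minimal sketch, not our exact code: the stage names, deployment name, and thing-group ARN are placeholders, and the boto3 greengrassv2 client is injected as a parameter so the logic can be exercised without AWS credentials.

```python
from typing import Optional

# Hypothetical stage order; mirrors the canary -> early -> production flow.
STAGES = ["canary", "early", "production"]


def next_stage(current: str) -> Optional[str]:
    """Return the stage that follows `current`, or None at the end of the rollout."""
    i = STAGES.index(current)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None


def promote(client, current_stage: str, components: dict) -> Optional[str]:
    """Promote the firmware deployment to the next stage's thing group.

    `client` is a boto3 greengrassv2 client (injected for testability).
    The ARN below is a placeholder, not a real account/resource.
    """
    stage = next_stage(current_stage)
    if stage is None:
        return None  # already at full production rollout
    target_arn = f"arn:aws:iot:us-east-1:123456789012:thinggroup/{stage}"
    client.create_deployment(
        targetArn=target_arn,
        deploymentName=f"firmware-rollout-{stage}",
        components=components,
    )
    return stage
```

In practice the EventBridge rule only fires this after the 24-hour monitoring window passes cleanly.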

Automated Health Checks: Each device runs a health monitoring Greengrass component that collects metrics every 60 seconds: MQTT connection status, component state, application cycle times, error counts, CPU/memory utilization, and custom business metrics (units produced, quality checks passed). These metrics publish to device shadow documents with a ‘health’ namespace.
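The shadow update the health component publishes looks roughly like this. The metric names here are illustrative, not our exact field set; the shape (reported state with a 'health' namespace) matches the standard device shadow document format.

```python
import time


def build_health_payload(metrics: dict) -> dict:
    """Wrap collected metrics in a device-shadow update document under a
    'health' namespace. Metric names below are hypothetical examples."""
    required = {"mqtt_connected", "error_count", "cpu_pct", "mem_pct"}
    missing = required - metrics.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    return {
        "state": {
            "reported": {
                "health": {**metrics, "ts": int(time.time())}
            }
        }
    }
```

On the device this document is published every 60 seconds to the shadow update topic for a named shadow.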

A Lambda function triggered every 5 minutes analyzes health metrics across all devices in the current deployment stage. It compares post-update metrics against a 7-day baseline captured before the deployment. Thresholds: connection failures >2%, cycle time degradation >10%, error rate increase >5%, or any component crash. If any threshold is breached, the Lambda immediately cancels the Greengrass deployment and triggers a rollback.
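The threshold logic is a straightforward comparison. A sketch, with the caveat that the metric names are illustrative and the error-rate threshold is interpreted here as a relative increase over baseline:

```python
def breaches_thresholds(baseline: dict, current: dict) -> list:
    """Compare post-update metrics against the 7-day baseline and return
    the list of breached thresholds (empty list means healthy)."""
    breaches = []
    # Connection failures above an absolute 2% rate.
    if current["connection_failure_rate"] > 0.02:
        breaches.append("connection_failures")
    # Cycle time more than 10% slower than baseline.
    if current["cycle_time"] > baseline["cycle_time"] * 1.10:
        breaches.append("cycle_time_degradation")
    # Error rate more than 5% above baseline (relative increase, an assumption).
    if current["error_rate"] > baseline["error_rate"] * 1.05:
        breaches.append("error_rate_increase")
    # Any component crash fails the check outright.
    if current.get("component_crashes", 0) > 0:
        breaches.append("component_crash")
    return breaches
```

If the returned list is non-empty, the Lambda moves straight to cancellation and rollback.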

Rollback on Failure: Rollback is automatic and fast. We maintain the previous firmware version as a Greengrass deployment in ‘archived’ state. When health checks fail, the Lambda reactivates the previous deployment targeting the affected thing groups. Greengrass core devices receive the rollback deployment within 60 seconds and revert to previous firmware within 2-3 minutes. Total detection-to-stable time averages 7-8 minutes.
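The rollback path pairs a cancel with a re-deploy of the archived version. Sketched below with the greengrassv2 client injected (so it can be tested without AWS); the deployment name is a placeholder:

```python
def roll_back(client, failed_deployment_id: str,
              previous_components: dict, target_arn: str) -> str:
    """Cancel the in-flight deployment and re-issue the previous firmware
    version against the affected thing group.

    `client` is a boto3 greengrassv2 client. Returns the new deployment ID.
    """
    # Stop the bad rollout first so no further devices receive it.
    client.cancel_deployment(deploymentId=failed_deployment_id)
    # Re-target the archived previous firmware at the affected group.
    resp = client.create_deployment(
        targetArn=target_arn,
        deploymentName="firmware-rollback",
        components=previous_components,
    )
    return resp["deploymentId"]
```

Connected core devices pick up the new revision of their group's deployment on their next sync, which is what keeps detection-to-stable time in the single-digit minutes.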

To handle schema compatibility, we enforce backward compatibility in firmware development. Every firmware version must successfully process data from N-1 and N+1 versions. Breaking changes require deploying a migration component first. We also version device shadow documents - new firmware writes to shadow version 2.0 while still reading from 1.0 for a transition period.
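The dual-version shadow reading can be sketched as a small normalizer. This is illustrative only: the field names and the exact 1.0-to-2.0 layout change are hypothetical, but the pattern (new firmware accepts both schemas and normalizes to the new one) is what makes rollback safe.

```python
def read_health_shadow(doc: dict) -> dict:
    """Normalize a health shadow document from schema 1.0 or 2.0 into the
    2.0 layout, so firmware reading 2.0 still accepts 1.0 documents."""
    version = doc.get("schema_version", "1.0")  # pre-versioning docs count as 1.0
    if version == "2.0":
        return doc
    if version == "1.0":
        # Hypothetical layout change: 1.0 kept metrics at the top level,
        # 2.0 nests them under a 'metrics' key.
        return {
            "schema_version": "2.0",
            "metrics": {k: v for k, v in doc.items() if k != "schema_version"},
        }
    raise ValueError(f"unsupported shadow schema: {version}")
```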

Deployment Strategy: We spread updates across production lines to minimize line-specific risk. The canary group includes one device from each line. Early group includes 20% of devices from each line. This ensures no single line gets fully updated until we’ve validated across diverse conditions. We also stagger by shift - canary deploys during second shift (lower volume), early during third shift, production during planned maintenance windows.
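The group assignment can be sketched as follows, assuming a simple per-line split: one device per line into canary, roughly 20% of each line into early, the rest into production. The selection order within a line is a placeholder; in practice you would pick representative devices, not just the first ones.

```python
def assign_groups(devices_by_line: dict) -> dict:
    """Spread devices across deployment groups so every production line is
    represented in the canary. `devices_by_line` maps line name -> device IDs."""
    groups = {"canary": [], "early": [], "production": []}
    for line, devices in devices_by_line.items():
        if not devices:
            continue
        groups["canary"].append(devices[0])            # one device per line
        n_early = max(1, round(len(devices) * 0.20))   # ~20% of each line
        groups["early"].extend(devices[1:1 + n_early])
        groups["production"].extend(devices[1 + n_early:])
    return groups
```

The resulting groups map directly onto the canary/early/production thing groups in IoT Core.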

Results: 23 firmware deployments with 2 automatic rollbacks (both caught in the canary stage before impacting production). Zero unplanned production downtime. Average deployment time from canary start to full production: 4 days. Firmware defects caught before widespread deployment: 100%. Operations team confidence: transformed from resisting updates to requesting them more frequently.

Key insight: The 24-hour monitoring windows seem long, but they’re essential for catching time-delayed issues like memory leaks or cumulative errors. We initially tried 4-hour windows and missed subtle degradation that appeared after 12+ hours of operation.

We monitor multiple layers: device connectivity (MQTT connection stability), application-level metrics (cycle times, error rates, throughput), and system metrics (CPU, memory, disk I/O). Each device publishes these to a shadow document every minute. The health check Lambda compares post-update metrics against baseline for 24 hours. If any metric degrades beyond threshold, rollback triggers automatically.

The staged rollout approach sounds ideal for our environment too. What metrics do you monitor for the health checks? We’re concerned about catching subtle issues that might not show up immediately but could cause failures over time.

The automated rollback capability is critical. How quickly can the system detect a problem and execute rollback? In our environment, even 10 minutes of degraded performance can impact product quality. Do you have any metrics on detection-to-rollback time?

How do you handle rollback for firmware that changes data schemas or configuration formats? We’ve had situations where newer firmware writes data that older firmware can’t parse, making rollback risky. Do you maintain backward compatibility in all firmware versions?

Excellent question - backward compatibility is mandatory in our firmware development process. We enforce a rule that firmware version N must be able to read data written by version N-1 and N+1. This means schema changes require multi-version migration. If a breaking change is unavoidable, we deploy a compatibility shim as a separate Greengrass component first, validate it across the fleet, then deploy the firmware update.
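The N-1/N+1 rule itself reduces to a one-line check that we can gate releases on. A trivial sketch, assuming integer firmware version numbers:

```python
def can_read(reader_version: int, writer_version: int) -> bool:
    """Firmware version N must read data written by versions N-1, N, and N+1.
    Anything further apart requires a migration component first."""
    return abs(reader_version - writer_version) <= 1
```

A CI gate can run the actual read-path tests for every (reader, writer) pair this predicate allows, and fail the build if any pair breaks.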