Our zero-downtime firmware update system has been operational for 14 months, handling 23 firmware releases across 340 edge devices with zero production incidents. Here’s the architecture:
Staged Firmware Rollout: We use AWS IoT Greengrass v2 deployment configurations with thing groups to control rollout stages. Devices are organized into deployment groups: canary (5%, 17 devices), early (20%, 68 devices), and production (75%, 255 devices). Each stage has a separate thing group in IoT Core.
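The group sizes above follow directly from the percentages. As a quick check, here is a minimal sketch of that arithmetic; the function name `stage_sizes` is hypothetical and not part of our tooling:

```python
def stage_sizes(fleet_size, fractions):
    """Split a fleet into rollout stages; the last stage absorbs any rounding."""
    names = list(fractions)
    sizes = {}
    remaining = fleet_size
    for name in names[:-1]:
        size = round(fleet_size * fractions[name])
        sizes[name] = size
        remaining -= size
    # Whatever is left goes to the final stage so the sizes always sum to the fleet.
    sizes[names[-1]] = remaining
    return sizes


sizes = stage_sizes(340, {"canary": 0.05, "early": 0.20, "production": 0.75})
print(sizes)
```

Letting the final stage take the remainder means no device is left unassigned if the fleet size stops dividing evenly.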
The deployment process starts with the canary group. We use Greengrass continuous deployments with a 24-hour monitoring window. The deployment includes the new firmware component plus a health monitoring component that runs in parallel. If the canary stage succeeds, EventBridge triggers a Lambda that promotes the deployment to the early group; after another 24-hour window, the deployment is promoted to production.
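The promotion logic can be thought of as a small state machine. The sketch below is illustrative only: `next_action`, `STAGES`, and `SOAK` are hypothetical names, and it assumes each stage records when its deployment started and whether the health checks currently pass.

```python
from datetime import datetime, timedelta, timezone

STAGES = ["canary", "early", "production"]
SOAK = timedelta(hours=24)  # monitoring window per stage


def next_action(stage, started_at, now, healthy):
    """Decide what the promotion Lambda should do for the current stage.

    Returns a tuple (action, stage) where action is one of
    'rollback', 'wait', 'promote', or 'done'.
    """
    if not healthy:
        return ("rollback", stage)
    if now - started_at < SOAK:
        return ("wait", stage)
    idx = STAGES.index(stage)
    if idx == len(STAGES) - 1:
        return ("done", None)
    return ("promote", STAGES[idx + 1])


start = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)
print(next_action("canary", start, start + timedelta(hours=25), healthy=True))
```

Keeping the decision a pure function of (stage, start time, current time, health) makes it easy to unit test separately from the AWS calls that act on the result.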
Automated Health Checks: Each device runs a health monitoring Greengrass component that collects metrics every 60 seconds: MQTT connection status, component state, application cycle times, error counts, CPU/memory utilization, and custom business metrics (units produced, quality checks passed). These metrics are published to device shadow documents under a ‘health’ namespace.
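For illustration, a shadow update carrying these metrics might be built as below. This is a sketch under assumptions: it treats the ‘health’ namespace as a top-level key under `state.reported` (a named shadow would work equally well), and every metric field name is hypothetical.

```python
import json
import time


def build_health_update(metrics, now=None):
    """Wrap one metrics sample in a device shadow update document."""
    ts = int(now if now is not None else time.time())
    doc = {"state": {"reported": {"health": {**metrics, "timestamp": ts}}}}
    return json.dumps(doc).encode("utf-8")


# Hypothetical sample; field names are illustrative, not our actual schema.
payload = build_health_update(
    {
        "mqtt_connected": True,
        "component_state": "RUNNING",
        "cycle_time_ms": 1180,
        "error_count": 0,
        "cpu_percent": 31.5,
        "memory_percent": 48.2,
        "units_produced": 412,
    },
    now=1_700_000_000,
)
print(payload.decode("utf-8"))
```

For a classic (unnamed) shadow, the component would publish this payload to the `$aws/things/<thingName>/shadow/update` topic.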
A Lambda function triggered every 5 minutes analyzes health metrics across all devices in the current deployment stage. It compares post-update metrics against a 7-day baseline from before the deployment. Thresholds: connection failures >2%, cycle time degradation >10%, error rate increase >5%, or any component crash. If any threshold is breached, the Lambda immediately cancels the Greengrass deployment and triggers rollback.
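The threshold comparison can be sketched roughly as follows. The `StageMetrics` type and `find_breaches` function are hypothetical, and the sketch makes two interpretive assumptions: the connection-failure threshold is an absolute rate, while cycle time and error rate are measured relative to the baseline.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    connection_failure_rate: float  # fraction of failed connection attempts
    avg_cycle_time_s: float         # mean application cycle time
    error_rate: float               # errors per cycle
    component_crashes: int          # crashes observed in the window


def relative_increase(baseline, current):
    """Fractional increase over baseline; any rise from zero counts as infinite."""
    if baseline > 0:
        return (current - baseline) / baseline
    return float("inf") if current > 0 else 0.0


def find_breaches(baseline, current):
    """Return the names of all thresholds the current metrics violate."""
    breaches = []
    if current.connection_failure_rate > 0.02:
        breaches.append("connection_failures")
    if relative_increase(baseline.avg_cycle_time_s, current.avg_cycle_time_s) > 0.10:
        breaches.append("cycle_time")
    if relative_increase(baseline.error_rate, current.error_rate) > 0.05:
        breaches.append("error_rate")
    if current.component_crashes > 0:
        breaches.append("component_crash")
    return breaches


baseline = StageMetrics(0.005, 40.0, 0.010, 0)
degraded = StageMetrics(0.010, 46.0, 0.010, 1)
print(find_breaches(baseline, degraded))
```

Returning the full list of breaches, rather than stopping at the first, gives the rollback notification more context about what went wrong.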
Rollback on Failure: Rollback is automatic and fast. We maintain the previous firmware version as a Greengrass deployment in an ‘archived’ state. When health checks fail, the Lambda reactivates the previous deployment targeting the affected thing groups. Greengrass core devices receive the rollback deployment within 60 seconds and revert to the previous firmware within 2-3 minutes. Total detection-to-stable time averages 7-8 minutes.
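One way to approximate the "reactivate" step is to cancel the failing deployment and re-issue the previous deployment's component set to the same target. The sketch below assumes `gg` behaves like `boto3.client("greengrassv2")` and uses its `cancel_deployment`, `get_deployment`, and `create_deployment` operations. The `roll_back` helper and the `rollback-` naming are hypothetical, and this is not necessarily how our Lambda is written.

```python
def roll_back(gg, failed_deployment_id, previous_deployment_id):
    """Cancel the failing deployment and redeploy the previous component set.

    `gg` is expected to behave like boto3.client("greengrassv2").
    Returns the ID of the newly created rollback deployment.
    """
    # Stop the bad deployment from reaching any further devices.
    gg.cancel_deployment(deploymentId=failed_deployment_id)

    # Look up what the last known-good deployment contained.
    previous = gg.get_deployment(deploymentId=previous_deployment_id)

    # Re-issue the same components to the same thing group.
    name = previous.get("deploymentName", previous_deployment_id)
    response = gg.create_deployment(
        targetArn=previous["targetArn"],
        deploymentName=f"rollback-{name}",
        components=previous["components"],
    )
    return response["deploymentId"]
```

Passing the client in as a parameter keeps the function testable with a stub, which matters for a code path that only runs when something has already gone wrong.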
To handle schema compatibility, we enforce compatibility in both directions during firmware development. Every firmware version must successfully process data from N-1 and N+1 versions. Breaking changes require deploying a migration component first. We also version device shadow documents: new firmware writes to shadow version 2.0 while still reading from 1.0 for a transition period.
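As a rough illustration of the read-old, write-new pattern, the sketch below reads a value from either shadow schema version but only ever writes version 2.0. The `schema_version` key, the field names, and the seconds-to-milliseconds change are all invented for the example.

```python
def read_cycle_time_ms(health):
    """Read cycle time from a health document in either schema version."""
    # Documents without a version marker are treated as 1.0 (assumption).
    version = health.get("schema_version", "1.0")
    if version.startswith("2."):
        return float(health["cycle_time_ms"])
    # Hypothetical 1.0 layout stored seconds rather than milliseconds.
    return float(health["cycle_time_s"]) * 1000.0


def write_health_v2(cycle_time_ms):
    """Always write the new layout, tagged with its version."""
    return {"schema_version": "2.0", "cycle_time_ms": cycle_time_ms}


old_doc = {"cycle_time_s": 1.5}
new_doc = write_health_v2(1200)
print(read_cycle_time_ms(old_doc), read_cycle_time_ms(new_doc))
```

Once every device has been on the new firmware for the full transition period, the 1.0 branch can be removed in a later release.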
Deployment Strategy: We spread updates across production lines to minimize line-specific risk. The canary group includes one device from each line. The early group includes 20% of devices from each line. This ensures no single line gets fully updated until we’ve validated across diverse conditions. We also stagger by shift: canary deploys during second shift (lower volume), early during third shift, and production during planned maintenance windows.
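The per-line split can be sketched as a simple stratified assignment. The function `assign_stages` and the device naming are hypothetical. The example fleet of 17 lines with 20 devices each is an assumption chosen because it reproduces the stated group sizes, not a description of the actual plant layout.

```python
from collections import defaultdict


def assign_stages(devices_by_line, early_fraction=0.20):
    """Assign devices to rollout stages so every line appears in every stage."""
    stages = defaultdict(list)
    for line, devices in sorted(devices_by_line.items()):
        ordered = sorted(devices)
        if not ordered:
            continue
        # One canary per line, then a fixed share of the line as early adopters.
        stages["canary"].append(ordered[0])
        n_early = round(len(ordered) * early_fraction)
        stages["early"].extend(ordered[1:1 + n_early])
        stages["production"].extend(ordered[1 + n_early:])
    return dict(stages)


# Hypothetical fleet: 17 lines x 20 devices = 340 devices.
fleet = {
    f"line-{i:02d}": [f"line-{i:02d}-dev-{j:02d}" for j in range(20)]
    for i in range(17)
}
groups = assign_stages(fleet)
print({stage: len(members) for stage, members in groups.items()})
```

Picking the first device alphabetically as the canary is a simplification; in practice the choice would more likely reflect which device on each line is most representative or easiest to recover.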
Results: 23 firmware deployments with 2 automatic rollbacks, which caught issues in the canary stage before they reached production. Zero unplanned production downtime. Average deployment time from canary start to full production: 4 days. Firmware defects caught before widespread deployment: 100%. Operations team attitude toward continuous deployment: transformed from resistance to requesting more frequent updates.
Key insight: The 24-hour monitoring windows seem long, but they’re essential for catching time-delayed issues like memory leaks or cumulative errors. We initially tried 4-hour windows and missed subtle degradation that appeared after 12+ hours of operation.