Let me detail the complete implementation across the three focus areas:
Greengrass Stream Manager: This is the foundation of our resilient architecture. Stream Manager runs as a system component on each Greengrass core and provides persistent message queuing with automatic retry. We configured each stream with a 7-day retention policy and a 10 MB size limit. Each manufacturing line has a dedicated stream named predictions-line-{lineId}, and the ML inference component publishes predictions to it using the Stream Manager SDK. The configuration includes automatic export to IoT Core when connectivity is available, but the critical feature is local persistence: if cloud connectivity drops, predictions accumulate locally and sync when the network returns. As long as an outage stays within the retention and size limits, no predictions are lost, which matters because network outages are common in our factory environments.
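As a rough illustration of the per-line stream settings described above, here is a minimal sketch in plain Python. It is not the production code: `StreamSettings` and `settings_for_line` are hypothetical names, and reading "10 MB" as 10 * 1024 * 1024 bytes is an assumption. In a real component these values would be handed to the Stream Manager SDK's stream definition when the stream is created.

```python
from dataclasses import dataclass

# 7-day retention expressed in milliseconds.
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000

# Assumption: "10 MB" is read as 10 MiB; the original does not say which.
TEN_MB_BYTES = 10 * 1024 * 1024


@dataclass(frozen=True)
class StreamSettings:
    """Hypothetical container for the settings of one prediction stream."""
    name: str
    time_to_live_millis: int
    max_size_bytes: int


def settings_for_line(line_id: str) -> StreamSettings:
    """Build the settings for one manufacturing line's stream."""
    return StreamSettings(
        name=f"predictions-line-{line_id}",
        time_to_live_millis=SEVEN_DAYS_MS,
        max_size_bytes=TEN_MB_BYTES,
    )
```

Keeping the naming rule in one helper means the inference component and any tooling that inspects streams derive the same stream name from a line ID.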
Lambda Integration: We use two Lambda functions in the workflow. The first Lambda subscribes to the IoT Core topics where Stream Manager exports predictions. It enriches the prediction data by querying DynamoDB for equipment metadata (asset ID, location, maintenance history, criticality level). The enrichment adds the context that the ERP system needs to route and prioritize work orders. The second Lambda handles the actual ERP integration. It applies business rules to filter predictions (severity threshold, deduplication, persistence check), constructs the work order payload according to our ERP’s API schema, and makes the REST API call with OAuth authentication. Both Lambdas include error handling, with CloudWatch alarms on failed invocations. The Lambda functions are idempotent: they check whether a work order already exists for the equipment before creating a new one.
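The filtering step in the second Lambda can be sketched roughly as follows. This is an illustration under assumptions rather than the actual rules: the threshold values, the `Prediction` fields, and the reading of "persistence check" as "the same failure was flagged in several consecutive inference windows" are all hypothetical. The set of open work orders stands in for the DynamoDB lookup.

```python
from dataclasses import dataclass
from typing import Set, Tuple

SEVERITY_THRESHOLD = 0.5   # assumed value, not from the original post
MIN_CONSECUTIVE_HITS = 3   # assumed meaning of the "persistence check"


@dataclass
class Prediction:
    equipment_id: str
    failure_type: str
    severity: float
    consecutive_hits: int  # consecutive windows that flagged this failure


def should_create_work_order(
    pred: Prediction, open_work_orders: Set[str]
) -> Tuple[bool, str]:
    """Apply the three filters in order and report which one rejected."""
    if pred.severity < SEVERITY_THRESHOLD:
        return False, "below severity threshold"
    if pred.consecutive_hits < MIN_CONSECUTIVE_HITS:
        return False, "not persistent"
    # Deduplication: skip equipment that already has a work order.
    if pred.equipment_id in open_work_orders:
        return False, "duplicate"
    return True, "ok"
```

Returning the rejection reason alongside the decision makes it easy to log why a prediction was dropped, which helps when tuning the thresholds.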
ERP API Automation: Our ERP exposes a REST API for work order creation. The integration Lambda constructs a JSON payload that includes equipment ID, predicted failure type, severity score, recommended maintenance action, and estimated time to failure. The API call uses the OAuth 2.0 client credentials flow: the Lambda retrieves credentials from Secrets Manager and caches tokens for the 1-hour validity period. For resilience, we implemented exponential backoff retry with jitter (3 attempts over 15 minutes). If all retries fail, the message goes to a dead-letter queue (DLQ) for manual review. The ERP API returns a work order ID, which we store in DynamoDB to prevent duplicate creation. We also update the device shadow with the work order ID, allowing technicians to see which work orders are associated with each piece of equipment.
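The retry behaviour described above can be sketched as exponential backoff with full jitter and a dead-letter fallback. This is a minimal illustration, not the production integration: `TransientError`, `create_with_retry`, the `send` and `on_exhausted` callables, and the base and cap delays are all assumptions. The post does not state the exact delay schedule or whether the waiting happens inside the Lambda or through queue redelivery.

```python
import random
import time
from typing import Any, Callable, Dict, List, Optional


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx)."""


def backoff_delays(attempts: int, base: float, cap: float) -> List[float]:
    """Full jitter: each wait is uniform in [0, min(cap, base * 2**i)]."""
    return [
        random.uniform(0, min(cap, base * 2 ** i))
        for i in range(attempts - 1)  # no wait after the final attempt
    ]


def create_with_retry(
    send: Callable[[Dict[str, Any]], str],
    payload: Dict[str, Any],
    on_exhausted: Callable[[Dict[str, Any]], None],
    attempts: int = 3,
    base: float = 60.0,    # assumed base delay in seconds
    cap: float = 600.0,    # assumed maximum delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call send(payload); on repeated failure hand the payload to the DLQ."""
    delays = backoff_delays(attempts, base, cap)
    for i in range(attempts):
        try:
            return send(payload)  # returns the ERP work order ID
        except TransientError:
            if i < attempts - 1:
                sleep(delays[i])
    on_exhausted(payload)  # e.g. enqueue to the DLQ for manual review
    return None
```

Passing `sleep` as a parameter keeps the function testable without real waits. Only `TransientError` is retried here so that permanent failures, such as a rejected payload, surface immediately instead of consuming retry attempts.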
For operational adoption, we ran a 30-day pilot where automated work orders were created but flagged as ‘AI-Generated’ requiring supervisor approval. This built trust as the team saw the prediction accuracy. After the pilot, we moved to automatic creation for high-confidence predictions (score > 0.8) while medium-confidence predictions (0.5-0.8) still require approval. The maintenance team now reviews a dashboard showing prediction accuracy trends, which has improved their confidence in the system.
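The post-pilot routing rule is simple enough to express as a small function. This is a sketch with hypothetical names. The treatment of scores below 0.5 is an assumption, since the post only describes the high and medium bands. A score of exactly 0.8 falls in the approval band because automatic creation requires a score strictly greater than 0.8.

```python
AUTO_CREATE_ABOVE = 0.8   # from the post: score > 0.8 is created automatically
APPROVAL_FROM = 0.5       # from the post: 0.5 to 0.8 requires approval


def route_prediction(score: float) -> str:
    """Map a confidence score to a handling path."""
    if score > AUTO_CREATE_ABOVE:
        return "auto_create"
    if score >= APPROVAL_FROM:
        return "needs_approval"
    # Assumption: low-confidence predictions produce no work order.
    return "no_work_order"
```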
Performance metrics: average end-to-end latency from prediction to work order creation is 45 seconds (edge inference 5s, Stream Manager export 10s, Lambda processing 15s, ERP API call 15s). During a recent 8-hour network outage, the system queued 127 predictions locally and successfully synced them all when connectivity returned, with no manual intervention required.