Great questions on the ERP integration - that was definitely the most complex piece. Here’s our end-to-end architecture:
Streaming Sensor Data Ingestion:
IoT devices publish to Google Cloud IoT Core via MQTT. Each device sends telemetry bundles every 5 seconds containing temperature, vibration (3-axis), pressure, and operating speed. IoT Core forwards to a dedicated Pub/Sub topic with ~500K messages/hour during production shifts.
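For concreteness, a sketch of what one telemetry bundle might look like before it's published over MQTT. The field names and values here are illustrative assumptions, not our actual device schema:

```python
import json

# Hypothetical shape of one 5-second telemetry bundle
bundle = {
    "device_id": "press-014",
    "ts": "2024-03-01T08:15:05Z",
    "temperature_c": 72.4,
    "vibration_g": {"x": 0.12, "y": 0.09, "z": 0.31},  # 3-axis vibration
    "pressure_kpa": 512.0,
    "speed_rpm": 1480,
}
payload = json.dumps(bundle).encode("utf-8")  # bytes for the MQTT publish
```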
Dataflow pipeline consumes from Pub/Sub using sliding windows (5-minute window, 1-minute slide) to aggregate sensor readings. We calculate statistical features: mean, std deviation, rate of change, and cross-sensor correlations. Watermark delay is set to 30 seconds to handle network latency from factory floor devices.
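The per-window feature math can be sketched in plain Python (the production pipeline computes this inside Beam's `SlidingWindows`; `window_features` and the output field names are hypothetical, and cross-sensor correlations are omitted for brevity):

```python
import statistics

def window_features(readings):
    """Aggregate one window of (timestamp_seconds, value) samples into
    the kind of statistical features fed to the model: mean, standard
    deviation, and rate of change across the window. Sketch only."""
    values = [v for _, v in readings]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # rate of change: slope between first and last sample in the window
    (t0, v0), (t1, v1) = readings[0], readings[-1]
    rate = (v1 - v0) / (t1 - t0) if t1 != t0 else 0.0
    return {"mean": mean, "std": std, "rate_of_change": rate}
```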
ML-based Anomaly Detection:
Our Vertex AI model is a gradient boosting classifier trained on 18 months of labeled data (847 actual failure events). Features include 15-minute rolling statistics across all sensor types. The model achieves 89% precision and 82% recall on the validation set.
For real-time inference, after each window aggregation Dataflow calls a Cloud Function hosting the deployed model. The function returns a failure probability (0-1) and a predicted time-to-failure. We trigger alerts when the probability exceeds 0.75 for critical equipment or 0.85 for non-critical.
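The tiered alerting threshold is simple enough to show directly (function name is just for illustration):

```python
def should_alert(failure_prob: float, is_critical: bool) -> bool:
    """Apply the tiered thresholds described above: alert when the
    probability exceeds 0.75 for critical equipment, 0.85 otherwise."""
    threshold = 0.75 if is_critical else 0.85
    return failure_prob > threshold
```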
Automated ERP Work Order Creation:
This required careful design to avoid overwhelming the ERP system. When anomaly detection triggers an alert, we:
- Write alert details to Firestore (equipment_id, failure_probability, predicted_failure_time, sensor_readings)
- Cloud Function evaluates alert against business rules (maintenance history, existing work orders, equipment priority)
- If work order needed, publish to Cloud Tasks queue with priority-based delay (critical=immediate, high=5min, medium=30min)
- Background worker consumes from Cloud Tasks, calls ERP REST API to create maintenance work order
- ERP API returns work_order_id, which we store in Firestore linked to the alert
The Cloud Tasks queue provides rate limiting (max 50 API calls/minute to ERP) and automatic retries with exponential backoff. Idempotency keys prevent duplicate work orders during retries.
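The enqueue step can be sketched as follows. Cloud Tasks deduplicates tasks that are given an explicit name, so deriving the task name deterministically from the alert is one way to get the idempotency behavior described above; `task_spec` and its fields are assumptions, and a real implementation would pass these values to `google.cloud.tasks_v2` rather than return a dict:

```python
import hashlib

# priority-based delays from the flow above
PRIORITY_DELAY_S = {"critical": 0, "high": 300, "medium": 1800}

def task_spec(alert_id: str, priority: str) -> dict:
    """Build the schedule delay and idempotency key for enqueuing a
    work-order task. The same alert always yields the same task name,
    so a retried enqueue cannot create a duplicate work order."""
    delay = PRIORITY_DELAY_S[priority]
    # stable key derived from the alert, reused on every retry
    idem_key = hashlib.sha256(alert_id.encode()).hexdigest()[:32]
    return {"delay_seconds": delay, "task_name": f"wo-{idem_key}"}
```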
For prioritization, we assign scores based on: equipment criticality (1-10), failure probability (0-1), impact on production line (boolean), and current maintenance backlog. High-priority equipment gets immediate work orders; lower priority batches into scheduled maintenance windows.
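A minimal sketch of how those four inputs might combine into a single score. The weights here are made-up assumptions for illustration; the post lists the inputs but not our actual formula:

```python
def priority_score(criticality: int, failure_prob: float,
                   blocks_line: bool, backlog: int) -> float:
    """Combine equipment criticality (1-10), failure probability (0-1),
    production-line impact (boolean), and maintenance backlog into one
    score. Illustrative weights only."""
    score = criticality * failure_prob * 10   # 0..100 base
    if blocks_line:
        score *= 1.5                          # production-line impact bonus
    score -= min(backlog, 10)                 # penalize a long backlog
    return score
```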
Results After 4 Months:
- Unplanned downtime reduced from an average of 8.2 hours/week to 2.1 hours/week (a 74% reduction)
- Maintenance costs down 28% (fewer emergency repairs, better parts inventory planning)
- 156 equipment failures predicted and prevented
- False positive rate: 12% (acceptable given cost of missed failures)
- Average prediction lead time: 31 hours before actual failure
Key Lessons:
- Start with high-value equipment for initial deployment - we began with 12 critical machines before scaling to 180
- Invest heavily in data quality - garbage sensor data produces garbage predictions
- Build feedback loops - maintenance technicians can mark false positives, which retrains the model monthly
- Don’t underestimate ERP integration complexity - budget 40% of project time for this piece
- Monitor end-to-end latency religiously - we alert if sensor-to-work-order time exceeds 10 minutes
Happy to answer specific technical questions about any component. The streaming ingestion and ML pieces were straightforward compared to the ERP integration and change management aspects.