Predictive maintenance integration between IoT sensor data streams and ERP work orders

Sharing our implementation of predictive maintenance that connects Google Cloud IoT sensor streams directly to our ERP system for automated work order creation. We operate 180 industrial machines across three facilities, and unplanned downtime was costing us $45K per incident.

The challenge was bridging real-time IoT telemetry (temperature, vibration, and pressure readings every 5 seconds) with our ERP's maintenance management module. Traditional reactive maintenance meant work orders were only created after equipment had already failed. We needed anomaly detection that could predict failures 24-48 hours in advance and automatically trigger preventive maintenance workflows in the ERP.

Our solution processes 500K+ sensor readings per hour through Dataflow, applies ML-based anomaly detection, and creates ERP work orders when failure probability exceeds thresholds. Since going live four months ago, we’ve reduced unplanned downtime by 73% and maintenance costs by 28%. Happy to walk through the architecture and lessons learned.

Great questions on the ERP integration - that was definitely the most complex piece. Here’s our end-to-end architecture:

Streaming Sensor Data Ingestion: IoT devices publish to Google Cloud IoT Core via MQTT. Each device sends telemetry bundles every 5 seconds containing temperature, vibration (3-axis), pressure, and operating speed. IoT Core forwards to a dedicated Pub/Sub topic with ~500K messages/hour during production shifts.

Windowed Aggregation: A Dataflow pipeline consumes from Pub/Sub using sliding windows (5-minute window, 1-minute slide) to aggregate sensor readings. We calculate statistical features: mean, standard deviation, rate of change, and cross-sensor correlations. Watermark delay is set to 30 seconds to handle network latency from factory-floor devices.
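A rough stand-in for the per-window feature computation (plain Python rather than our actual Beam transform; the field names and the two-sensor example are illustrative):

```python
import statistics

def _pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def window_features(readings):
    """Per-window features from time-ordered readings, each a dict like
    {"ts": <epoch seconds>, "temp": <degC>, "vib": <RMS amplitude>}.
    Mirrors the kind of aggregation the Dataflow transform performs
    per 5-minute sliding window."""
    temps = [r["temp"] for r in readings]
    vibs = [r["vib"] for r in readings]
    elapsed = (readings[-1]["ts"] - readings[0]["ts"]) or 1.0
    return {
        "temp_mean": statistics.mean(temps),
        "temp_std": statistics.pstdev(temps),
        "temp_rate": (temps[-1] - temps[0]) / elapsed,  # rate of change
        "vib_mean": statistics.mean(vibs),
        "vib_std": statistics.pstdev(vibs),
        "temp_vib_corr": _pearson(temps, vibs),  # cross-sensor correlation
    }
```

In the real pipeline these features are computed once per window fire and passed downstream as a single record per machine.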

ML-based Anomaly Detection: Our Vertex AI model is a gradient boosting classifier trained on 18 months of labeled data (847 actual failure events). Features include 15-minute rolling statistics across all sensor types. Model achieves 89% precision and 82% recall on validation set.

For real-time inference, Dataflow calls a Cloud Function hosting the deployed model after each window aggregation. The function returns failure probability (0-1) and predicted time-to-failure. We trigger alerts when probability exceeds 0.75 for critical equipment or 0.85 for non-critical.
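The tiered thresholding reduces to a small predicate; a minimal sketch (the thresholds are the ones from the post, the function name is illustrative):

```python
def should_alert(failure_probability: float, critical: bool) -> bool:
    """Decide whether a model prediction should raise an alert:
    0.75 threshold for critical equipment, 0.85 for non-critical."""
    threshold = 0.75 if critical else 0.85
    return failure_probability >= threshold
```

Anything below threshold is still logged to Firestore for model-quality tracking, just without triggering the work order path.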

Automated ERP Work Order Creation: This required careful design to avoid overwhelming the ERP system. When anomaly detection triggers an alert, we:

  1. Write alert details to Firestore (equipment_id, failure_probability, predicted_failure_time, sensor_readings)
  2. Cloud Function evaluates alert against business rules (maintenance history, existing work orders, equipment priority)
  3. If work order needed, publish to Cloud Tasks queue with priority-based delay (critical=immediate, high=5min, medium=30min)
  4. Background worker consumes from Cloud Tasks, calls ERP REST API to create maintenance work order
  5. ERP API returns work_order_id, which we store in Firestore linked to the alert

The Cloud Tasks queue provides rate limiting (max 50 API calls/minute to ERP) and automatic retries with exponential backoff. Idempotency keys prevent duplicate work orders during retries.
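The idempotency keys work because they are deterministic: a retried delivery derives the same key as the original, so the ERP side can treat it as a no-op. A minimal sketch of one possible derivation (our exact key fields may differ; this is illustrative):

```python
import hashlib

def idempotency_key(equipment_id: str, predicted_failure_time: str) -> str:
    """Derive a deterministic key so retried Cloud Tasks deliveries map
    to at most one ERP work order per (equipment, predicted failure)
    pair. Sketch only; the field choice is an assumption."""
    raw = f"{equipment_id}:{predicted_failure_time}".encode()
    return hashlib.sha256(raw).hexdigest()[:32]
```

The worker sends this key in a request header; the ERP-facing layer checks Firestore for an existing work_order_id under the same key before calling the ERP API.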

For prioritization, we assign scores based on: equipment criticality (1-10), failure probability (0-1), impact on production line (boolean), and current maintenance backlog. High-priority equipment gets immediate work orders; lower priority batches into scheduled maintenance windows.

Results After 4 Months:

  • Unplanned downtime reduced from avg 8.2 hours/week to 2.1 hours/week (73% reduction)
  • Maintenance costs down 28% (fewer emergency repairs, better parts inventory planning)
  • 156 equipment failures predicted and prevented
  • False positive rate: 12% (acceptable given cost of missed failures)
  • Average prediction lead time: 31 hours before actual failure

Key Lessons:

  1. Start with high-value equipment for initial deployment - we began with 12 critical machines before scaling to 180
  2. Invest heavily in data quality - garbage sensor data produces garbage predictions
  3. Build feedback loops - maintenance technicians can mark false positives, which retrains the model monthly
  4. Don’t underestimate ERP integration complexity - budget 40% of project time for this piece
  5. Monitor end-to-end latency religiously - we alert if sensor-to-work-order time exceeds 10 minutes

Happy to answer specific technical questions about any component. The streaming ingestion and ML pieces were straightforward compared to the ERP integration and change management aspects.

Yes, Pub/Sub handles the ingestion from IoT Core, and Dataflow processes the streams. We use sliding windows of 5 minutes with 1-minute intervals to capture sensor trends. For exactly-once processing, we enabled Dataflow’s streaming deduplication with message IDs from IoT Core. Message ordering wasn’t critical for our use case since we’re aggregating across time windows anyway. The key was setting appropriate watermark delays to handle late-arriving sensor data from network hiccups.
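The logical effect of that deduplication is easy to show with a stand-in (Dataflow does this internally when the Pub/Sub source is configured with an ID attribute; this toy version just drops repeated message IDs):

```python
def dedupe_by_id(messages, seen=None):
    """Drop messages whose ID was already processed. Stand-in for the
    exactly-once dedup the Pub/Sub source performs; `seen` persists
    across calls the way Dataflow's dedup state does within its window."""
    seen = set() if seen is None else seen
    out = []
    for msg in messages:
        if msg["id"] not in seen:
            seen.add(msg["id"])
            out.append(msg)
    return out
```

In practice Dataflow only retains dedup state for a bounded period, which is fine here since redeliveries arrive within seconds.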

The automated ERP work order creation is the part I’m most interested in. How do you handle the integration between your anomaly detection output and the ERP maintenance module? REST API calls, batch imports, or something else? And how do you manage work order prioritization when multiple machines trigger alerts simultaneously?

We use a hybrid approach. The initial model was trained on 18 months of historical sensor data with labeled failure events using Vertex AI. For real-time inference, we deployed the model as a Cloud Function that Dataflow calls after windowed aggregation. This keeps costs reasonable since we're only running inference on aggregated features, not raw sensor readings.

The most predictive features were vibration amplitude changes, temperature differential rates, and bearing pressure anomalies. Combining multiple sensor types improved accuracy significantly; single-sensor models produced too many false positives.

This is exactly what we’re trying to build! How did you handle the streaming sensor data ingestion at that scale? We’re prototyping with Pub/Sub but concerned about message ordering and exactly-once processing guarantees when feeding into the ML pipeline. Did you use Dataflow’s windowing functions to aggregate sensor readings before anomaly detection?

I implemented something similar last year. The ERP integration is tricky because most ERP systems don’t expect real-time work order creation at IoT scale. We had to build a buffering layer with Cloud Tasks to queue work order requests and rate-limit API calls to the ERP. Otherwise you overwhelm the ERP’s API endpoints during high-alert periods. Also recommend implementing idempotency keys to prevent duplicate work orders if the integration retries.