Real-time anomaly detection for energy usage visualized in IoT Analytics dashboard

We’ve implemented a real-time anomaly detection system for our commercial building energy management that’s reduced fault detection time from days to minutes. Sharing our architecture in case it helps others.

We have 200+ buildings with smart meters publishing energy consumption data every 5 minutes to AWS IoT Core. The challenge was detecting anomalies (equipment failures, unusual consumption patterns) quickly enough to prevent energy waste.

Our solution uses IoT Analytics pipeline with an integrated SageMaker ML model for anomaly detection, feeding results into QuickSight dashboards for our facilities team. The system automatically flags anomalies and creates maintenance tickets through our ERP integration.

Key benefits: an 85% reduction in energy waste from equipment failures, automated alerting that replaced manual meter review, and a maintenance team that can prioritize issues by predicted impact. The dashboard shows real-time anomaly scores across all buildings with drill-down to individual equipment.

How did you integrate the ML model with the IoT Analytics pipeline? Did you use a Lambda activity in the pipeline, or is the model invoked separately? I’m particularly interested in the latency - you mentioned real-time detection, so I assume the inference happens inline with data ingestion?

What’s your false positive rate? Anomaly detection can be noisy, and I’m concerned about alert fatigue for the facilities team. Do you have any post-processing to filter out low-confidence anomalies before they hit the dashboard?

Let me provide the detailed implementation addressing all three focus areas:

IoT Analytics Pipeline Design: The pipeline has four activities: Channel (ingests MQTT messages from IoT Core), Lambda (enrichment with temporal features and building metadata), Lambda (ML inference via SageMaker), and Datastore (stores enriched data with anomaly scores). The first Lambda adds day_of_week, hour_of_day, is_holiday, and building_type fields. This enrichment is critical because the ML model needs temporal context to distinguish between normal variations and true anomalies. The pipeline processes data in micro-batches every minute, giving near-real-time detection while managing Lambda costs.
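For anyone curious, here's roughly what our first (enrichment) Lambda activity looks like. The field names (day_of_week, hour_of_day, is_holiday, building_type) are as described above; the inline holiday set and metadata dict are illustrative stand-ins - in production the metadata lookup comes from a real store such as DynamoDB:

```python
from datetime import datetime, timezone

# Illustrative stand-ins; the real lookups live outside the Lambda.
BUILDING_METADATA = {"bldg-042": {"building_type": "office"}}
HOLIDAYS = {"2024-01-01", "2024-07-04"}

def enrich(record):
    """Attach the temporal/context fields the anomaly model expects."""
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    meta = BUILDING_METADATA.get(record["building_id"], {})
    record["day_of_week"] = ts.weekday()   # 0 = Monday
    record["hour_of_day"] = ts.hour
    record["is_holiday"] = ts.strftime("%Y-%m-%d") in HOLIDAYS
    record["building_type"] = meta.get("building_type", "unknown")
    return record

def lambda_handler(event, context):
    # IoT Analytics passes the micro-batch of messages as a list
    # and expects the (modified) list back.
    return [enrich(r) for r in event]
```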

ML Model Integration: The SageMaker Random Cut Forest model is deployed as a real-time endpoint with auto-scaling. The inference Lambda batches up to 50 records per invocation to optimize throughput. Each record includes current consumption, historical 24-hour average, and the enrichment features. The model outputs an anomaly score (0-1) and a confidence level. We’ve tuned the threshold to 0.75 to balance detection sensitivity with false positives - anything above this triggers an alert. The model is versioned using SageMaker Model Registry, and we maintain two endpoints (production and canary) to test new model versions before full deployment.
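The batching and thresholding in the inference Lambda is simple to sketch. In this minimal version the actual endpoint call is abstracted behind a score_fn parameter (in the real Lambda that wraps sagemaker-runtime's InvokeEndpoint against the RCF endpoint); the 0.75 threshold and 50-record batch size are the values described above, everything else is illustrative:

```python
THRESHOLD = 0.75   # tuned alert threshold described above
BATCH_SIZE = 50    # max records per endpoint invocation

def flag_anomalies(records, score_fn, threshold=THRESHOLD, batch_size=BATCH_SIZE):
    """Score records in batches and attach anomaly_score / is_anomaly.

    score_fn takes a batch (list of dicts) and returns one float score
    per record; in production it wraps the SageMaker runtime call.
    """
    scored = []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        for rec, score in zip(batch, score_fn(batch)):
            scored.append(dict(rec, anomaly_score=score,
                               is_anomaly=score > threshold))
    return scored
```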

Dashboard Visualization: QuickSight connects directly to the IoT Analytics dataset using SPICE for fast queries. The main dashboard has three views: fleet overview (heatmap of all buildings colored by anomaly score), building detail (time series of consumption with anomaly flags), and alert queue (table of active anomalies sorted by predicted impact). We calculate predicted impact by multiplying the anomaly score by the building’s typical daily energy cost. The dashboard refreshes every 5 minutes via scheduled SPICE refresh. For critical anomalies (score > 0.9), we trigger SNS notifications to the facilities team via an EventBridge rule that monitors the datastore for high-score records.
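The predicted-impact sort behind the alert queue is just score times typical daily cost. A minimal sketch (the daily_cost_usd and id field names here are placeholders, not our actual schema):

```python
def predicted_impact(anomaly_score, typical_daily_cost):
    """Impact metric used to rank the alert queue."""
    return anomaly_score * typical_daily_cost

def alert_queue(anomalies):
    """Return active anomalies sorted by predicted impact, highest first."""
    return sorted(
        anomalies,
        key=lambda a: predicted_impact(a["anomaly_score"], a["daily_cost_usd"]),
        reverse=True,
    )
```

The effect is that a moderate anomaly in an expensive building outranks a severe anomaly in a cheap one, which matches how the facilities team actually triages.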

To address false positives, we implemented a 15-minute persistence threshold. An anomaly must appear in three consecutive data points before generating an alert. This filters transient spikes while catching sustained issues. We also maintain an exclusion list for buildings undergoing maintenance. The false positive rate is now under 5%, which the facilities team finds manageable.
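The persistence logic is easy to sketch. The class below is an illustrative in-memory version (name and structure are mine); since Lambda invocations are stateless, the real streak counts would need to live in an external store such as DynamoDB:

```python
from collections import defaultdict

PERSISTENCE = 3  # consecutive anomalous points = 15 min at 5-min intervals

class PersistenceFilter:
    """Suppress alerts until an anomaly persists for N consecutive points."""

    def __init__(self, persistence=PERSISTENCE, excluded=()):
        self.persistence = persistence
        self.excluded = set(excluded)       # buildings under maintenance
        self.streaks = defaultdict(int)     # building_id -> current streak

    def observe(self, building_id, is_anomaly):
        """Record one data point; return True when an alert should fire."""
        if building_id in self.excluded:
            return False
        if is_anomaly:
            self.streaks[building_id] += 1
        else:
            self.streaks[building_id] = 0   # any normal point resets the streak
        return self.streaks[building_id] >= self.persistence
```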

The ERP integration uses a Lambda function triggered by EventBridge that creates maintenance work orders in our system via REST API. The work order includes the building ID, equipment suspected (based on meter location), anomaly score, and estimated energy waste rate.
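The shape of that Lambda, roughly - the endpoint URL and payload field names below are placeholders (our ERP API is proprietary, and auth handling is omitted for brevity):

```python
import json
from urllib import request

ERP_URL = "https://erp.example.com/api/workorders"  # placeholder endpoint

def build_work_order(anomaly):
    """Map the EventBridge anomaly detail onto an ERP work-order payload."""
    return {
        "building_id": anomaly["building_id"],
        "suspected_equipment": anomaly.get("meter_location", "unknown"),
        "anomaly_score": anomaly["anomaly_score"],
        "estimated_waste_kw": anomaly["estimated_waste_kw"],
    }

def lambda_handler(event, context):
    # EventBridge delivers the anomaly record under event["detail"].
    order = build_work_order(event["detail"])
    req = request.Request(
        ERP_URL,
        data=json.dumps(order).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:  # real call needs auth + retry handling
        return {"status": resp.status}
```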

Total cost for the system is approximately $800/month for 200 buildings, covering IoT Core messages, Analytics pipeline, SageMaker endpoint, and QuickSight. The ROI is significant - we’re preventing about $15,000/month in energy waste from early detection of equipment failures.

This is impressive. How frequently does your ML model retrain? Anomaly detection for energy can be tricky because normal patterns change seasonally. Are you handling that in the model or through separate baseline adjustments?

Great question. We retrain weekly using the previous 90 days of data, which captures seasonal patterns. The Random Cut Forest model in SageMaker tracks gradual seasonal shifts through that rolling retraining window, and we also maintain separate baseline profiles for weekday vs weekend consumption patterns. The IoT Analytics pipeline enriches incoming data with day-of-week and holiday flags before feeding the model.

We use a Lambda activity in the IoT Analytics pipeline that invokes the SageMaker endpoint. Latency is around 200-300 ms per inference, which is acceptable for our 5-minute data intervals. The Lambda function batches records from the same building to reduce endpoint invocations and costs. Results are written back into the pipeline as an anomaly_score field before the data reaches the datastore.