Best practices for billing engine data ingestion in aziot-25 with high volume

Looking to discuss best practices for implementing billing engine data ingestion in aziot-25. We’re building a usage-based billing system that ingests metering data from 1000+ IoT devices. The challenge is ensuring billing accuracy while handling high-volume data streams.

Key considerations include batch processing for cost efficiency, parallelization to handle peak loads, and data validation to prevent billing errors. We’re using Event Hubs as the ingestion layer but need guidance on optimal configuration for billing workloads. What approaches have worked well for others implementing billing data ingestion at scale?

Based on implementing billing ingestion for multiple aziot-25 deployments, here’s a comprehensive approach addressing batch processing, parallelization, and data validation:

Batch Processing Strategy:

Implement tiered batching based on use case. For billing calculations, use 5-15 minute batches to reduce storage operations while maintaining reasonable freshness. Configure Stream Analytics with tumbling windows:

  • Real-time dashboard: 1-minute windows for immediate visibility
  • Billing aggregation: 15-minute windows for cost efficiency
  • Reconciliation: Hourly windows for audit compliance
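To make the tiered-window idea concrete, here is a minimal in-memory sketch of tumbling-window aggregation (not Stream Analytics itself; the event field names `ts`, `device_id`, and `usage` are assumptions):

```python
from collections import defaultdict

def tumbling_aggregate(events, window_seconds):
    """Group metering events into fixed, non-overlapping (tumbling) windows
    and sum usage per device within each window."""
    windows = defaultdict(float)  # (window_start, device_id) -> total usage
    for ev in events:
        window_start = (ev["ts"] // window_seconds) * window_seconds
        windows[(window_start, ev["device_id"])] += ev["usage"]
    return dict(windows)

events = [
    {"ts": 10,  "device_id": "dev-1", "usage": 2.0},
    {"ts": 70,  "device_id": "dev-1", "usage": 3.0},
    {"ts": 950, "device_id": "dev-1", "usage": 1.0},
]
# 15-minute (900 s) billing windows: the first two events land in the
# window starting at 0, the third in the window starting at 900.
billing = tumbling_aggregate(events, 900)
```

The same function with `window_seconds=60` gives the dashboard view and `3600` the reconciliation view, which is why a single aggregation path can feed all three tiers.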

Use the Event Hubs Capture feature to automatically archive raw events to Blob Storage every 5 minutes. This provides the audit trail without custom dual-write logic. Configure Capture to use the Avro format for efficient storage and downstream processing.

Parallelization Architecture:

Scale Event Hubs to 32 partitions for 1000+ devices. Partition by customer/tenant ID rather than device ID to ensure per-customer ordering. Configure consumer groups:

  • Real-time processing: High priority, 32 consumer instances
  • Billing aggregation: Medium priority, 16 consumer instances
  • Audit/archive: Low priority, 8 consumer instances
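Partition-by-tenant routing can be sketched with a hypothetical helper like this; the key point is using a stable hash (not Python's built-in `hash`, which varies per process) so the tenant-to-partition mapping survives restarts:

```python
import hashlib

def partition_for(tenant_id, partition_count=32):
    """Map a tenant ID to a stable Event Hubs partition so every event for
    one customer lands on the same partition, preserving per-customer order."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % partition_count
```

In practice you would pass this value as the partition key when publishing, so all consumers downstream see a customer's events in order.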

Implement auto-scaling for consumer instances based on Event Hub lag metrics. Scale up when lag exceeds 1 million events or 5 minutes of data. Use Azure Monitor metrics to trigger scaling actions.
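The lag-based trigger can be expressed as a small policy function (thresholds taken from the numbers above; wiring it to Azure Monitor alerts and the actual scale action is left out):

```python
def should_scale_up(lag_events, lag_seconds,
                    max_lag_events=1_000_000, max_lag_seconds=300):
    """Scale out consumer instances when the backlog exceeds either the
    event-count threshold (1M events) or the time threshold (5 minutes)."""
    return lag_events > max_lag_events or lag_seconds > max_lag_seconds
```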

Data Validation Framework:

Implement three-tier validation:

  1. Device validation: Basic schema and range checks before sending
  2. Ingestion validation: Schema enforcement and duplicate detection at Event Hub
  3. Business validation: Usage pattern analysis and anomaly detection post-aggregation
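The ingestion tier (tier 2) might look something like this sketch; the field names and reason strings are assumptions:

```python
def validate_reading(reading, seen_ids):
    """Ingestion-tier validation: schema, range, and duplicate checks.
    Returns (ok, reason). seen_ids is the set of already-processed
    message IDs used for duplicate detection."""
    required = {"message_id", "device_id", "ts", "usage"}
    if not required <= reading.keys():
        return False, "schema: missing fields"
    if not isinstance(reading["usage"], (int, float)) or reading["usage"] < 0:
        return False, "range: usage must be a non-negative number"
    if reading["message_id"] in seen_ids:
        return False, "duplicate message_id"
    seen_ids.add(reading["message_id"])
    return True, "ok"
```

A real deployment would back `seen_ids` with a store that has a TTL rather than an unbounded in-memory set.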

Use Azure Functions with Durable Entities to maintain per-device metering state. Track expected usage patterns and flag anomalies for manual review before billing. Implement circuit breakers that pause billing for devices showing suspicious patterns until validated.
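A minimal sketch of the per-device state with a spike circuit breaker (the 10x spike factor and 3-reading warm-up are assumptions; in a real deployment a Durable Entity would persist this state):

```python
class DeviceBillingState:
    """Per-device metering state. Billing is paused when a reading deviates
    far from the running average, until a human validates the device."""

    def __init__(self, spike_factor=10.0, warmup_readings=3):
        self.spike_factor = spike_factor
        self.warmup_readings = warmup_readings
        self.total = 0.0
        self.count = 0
        self.paused = False

    def record(self, usage):
        """Return True if the reading was accepted for billing."""
        if self.paused:
            return False
        if self.count >= self.warmup_readings:
            avg = self.total / self.count
            if avg > 0 and usage > avg * self.spike_factor:
                self.paused = True  # suspicious spike: pause billing
                return False
        self.total += usage
        self.count += 1
        return True
```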

Billing Accuracy Safeguards:

Implement reconciliation jobs that compare three data sources:

  • Real-time aggregates (Stream Analytics output)
  • Archived raw events (Event Hub capture)
  • Device-reported totals (periodic checksum messages)

Run reconciliation hourly for critical customers, daily for standard customers. Auto-correct discrepancies under 1%, flag larger differences for investigation. Maintain immutable audit log of all billing adjustments.
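The auto-correct/flag decision can be sketched as follows, treating the captured raw events as the source of truth and using the 1% rule above:

```python
def reconcile(stream_total, capture_total, auto_correct_pct=1.0):
    """Compare the real-time aggregate against the archived-events total.
    Returns (action, billed_value): discrepancies under the threshold are
    auto-corrected to the capture total; larger ones are flagged for
    investigation and the stream value is left untouched."""
    if capture_total == 0:
        return ("flag", stream_total) if stream_total else ("ok", 0.0)
    diff_pct = abs(stream_total - capture_total) / capture_total * 100
    if diff_pct == 0:
        return ("ok", stream_total)
    if diff_pct < auto_correct_pct:
        return ("auto_correct", capture_total)
    return ("flag", stream_total)
```

Every `auto_correct` result would also append an entry to the immutable audit log mentioned above.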

Cost Optimization:

Batch processing reduces database writes by 90-95%. With 1000 devices sending events every 60 seconds, you go from 1.44M raw writes/day to ~144K aggregated writes with 10-minute batching. This translates to significant Cosmos DB RU savings. Use the Event Hubs Standard tier with 32 partitions and 4 throughput units as the baseline, and enable auto-inflate to 8 TUs for peak handling.
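The write-reduction arithmetic, for reference (the ~144K figure corresponds to one write per device per 10-minute batch):

```python
def daily_writes(devices, event_interval_s, batch_minutes=None):
    """Writes/day for a device fleet: raw (one write per event) when
    batch_minutes is None, otherwise one aggregated write per batch."""
    if batch_minutes is None:
        return devices * (86_400 // event_interval_s)
    return devices * (1_440 // batch_minutes)

raw = daily_writes(1000, 60)          # one event per device per minute
batched = daily_writes(1000, 60, 10)  # one write per device per 10 minutes
reduction = 1 - batched / raw         # fraction of writes eliminated
```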

Implementation Recommendations:

Start with 16 Event Hubs partitions and 2 throughput units, monitor for 2 weeks, then scale throughput based on actual patterns (note that partition count on the Standard tier is fixed at creation, so size partitions for expected peak up front). Implement the real-time dashboard path first using simple aggregation, then add the more sophisticated validation and reconciliation. This lets customers see usage immediately while you build out the billing accuracy safeguards. Deploy billing calculation as a separate pipeline from real-time display; never let dashboard performance issues impact billing accuracy.

The key insight from production deployments is that billing and real-time display have fundamentally different requirements. Optimize each path independently rather than trying to use a single pipeline for both purposes.

Data validation is where most billing errors occur. Implement validation at multiple stages: device-side pre-validation, ingestion-side schema validation, and post-aggregation business-rule validation. We use Azure Functions with Durable Entities to maintain per-device state and detect anomalies like sudden usage spikes or negative values. Set up alerting when validation failures exceed a 1% threshold. Also implement idempotency: use message IDs to prevent duplicate billing from retry logic.
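The idempotency piece might be sketched as follows (an in-memory stand-in; in production you would key a conditional upsert in your billing store on the message ID instead):

```python
class IdempotentBillingWriter:
    """Apply each usage charge at most once, keyed by message ID, so that
    retry logic can safely re-send a charge without double billing."""

    def __init__(self):
        self.applied = {}  # message_id -> amount already charged
        self.balance = 0.0

    def charge(self, message_id, amount):
        """Return True if the charge was applied, False for a retry of an
        already-applied charge."""
        if message_id in self.applied:
            return False
        self.applied[message_id] = amount
        self.balance += amount
        return True
```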

The dual-write pattern makes sense for audit compliance. How do you handle the reconciliation process? Do you run it per device or across all devices? Also interested in the batch aggregation approach - does 5-minute batching introduce any issues with near-real-time billing dashboards? Our customers expect to see usage within 1-2 minutes.

For near-real-time dashboards with batched billing, use a lambda architecture. Raw events flow through fast path (Event Hubs → Stream Analytics → Cache/Dashboard) for real-time display. Simultaneously, batch path aggregates and validates data for actual billing. Dashboard shows estimated usage in real-time, but billing calculations use validated batch data. This gives customers immediate feedback while ensuring billing accuracy. We use Azure Cache for Redis for the real-time dashboard layer.
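One way to sketch the dashboard read path in this lambda setup: merge the validated batch total with the unvalidated fast-path events that arrived after the last batch (field names are assumptions, and the fast-path events would come from the Redis layer):

```python
def dashboard_view(validated_total, fast_path_events, last_batch_ts):
    """Return what the customer dashboard shows: the billed (validated)
    total, plus a labelled estimate from fast-path events newer than the
    last completed batch."""
    estimate = sum(e["usage"] for e in fast_path_events
                   if e["ts"] > last_batch_ts)
    return {"billed": validated_total, "estimated_pending": estimate}
```

Keeping the estimate in a separate field makes it easy to render it as provisional in the UI, so a later correction in the batch path never looks like a billing change.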

For billing accuracy, implement a dual-write pattern where metering data goes to both real-time processing (Event Hubs) and durable storage (Blob Storage) simultaneously. This gives you an audit trail and allows reconciliation if real-time processing has issues. We run nightly reconciliation jobs comparing both sources to catch any discrepancies. Also, partition your Event Hub by customer ID or tenant ID to ensure per-customer ordering guarantees.