The data-stream module is ingesting duplicate telemetry messages through our Pub/Sub to Dataflow pipeline, causing data quality issues in our analytics. We're seeing a 5-10% duplicate rate across device events, which inflates our metrics and produces incorrect billing calculations.
I suspect this is related to Pub/Sub message deduplication behavior and how we've configured Dataflow windowing. We need to implement exactly-once delivery semantics but aren't sure whether this should be handled at the Pub/Sub level, the Dataflow level, or both:
Duplicate events detected:
device_id: sensor-2847
timestamp: 2025-01-15T14:23:45Z
occurrences: 3 (same message_id)
This is affecting our data analytics accuracy and downstream billing processes. How do you handle message deduplication in Pub/Sub + Dataflow streaming pipelines?
Use a combination of device_id and event_timestamp as your deduplication key, not just message_id. Pub/Sub assigns message_id once per publish, so it stays the same across redeliveries of that message - but a device or gateway that republishes after a network timeout gets a fresh message_id for what is logically the same event, so message_id alone can't catch those duplicates. Implement stateful processing in Dataflow to track seen events within a time window. For telemetry data, a 5-minute deduplication window is usually sufficient.
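To make the idea concrete, here's a minimal sketch of that time-bounded dedup logic as a plain Python class. This is illustrative only - in a real Dataflow pipeline the seen-keys state would live in Beam's per-key state (so it survives worker restarts and scales per device), not in a process-local dict, and the class/field names here are made up for the example:

```python
import time

class DedupCache:
    """Time-bounded cache keyed on (device_id, event_timestamp).

    Hypothetical illustration of the dedup logic; inside Dataflow this
    state would live in Beam's per-key state, not a process-local dict.
    """

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds          # 5-minute dedup window by default
        self.clock = clock              # injectable for testing
        self.seen = {}                  # dedup key -> first-seen time

    def is_duplicate(self, device_id, event_timestamp):
        now = self.clock()
        # Evict entries older than the dedup window so memory stays bounded.
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        key = (device_id, event_timestamp)
        if key in self.seen:
            return True
        self.seen[key] = now
        return False
```

The clock is injected so the expiry behavior can be unit-tested without sleeping; the same trick applies when testing a Beam timer-based implementation.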
For device-level duplicates, have devices include a unique event_id (UUID) in each message, generated once per event rather than once per send attempt. In Dataflow, use stateful processing with a time-bounded cache (5-10 minute window) that tracks seen event_ids. This handles both device retries and Pub/Sub redelivery. Make sure your windowing strategy allows for late data - use an allowed lateness of 5-10 minutes to catch stragglers.
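The device-side half of this matters as much as the pipeline half: the event_id must be minted before the first send attempt and reused verbatim on every retry. A sketch of that, with illustrative function and field names (the send callable stands in for whatever publish client the gateway uses):

```python
import json
import uuid

def make_event(device_id, event_timestamp, payload):
    """Build a telemetry message with a client-generated event_id.

    The event_id is created once, before any send attempt, so every
    retry of the same event carries the same id and can be deduplicated
    downstream. Field names here are illustrative.
    """
    return {
        "event_id": str(uuid.uuid4()),
        "device_id": device_id,
        "event_timestamp": event_timestamp,
        "payload": payload,
    }

def publish_with_retries(event, send, max_attempts=3):
    """Retry the *same* serialized event on failure.

    Rebuilding the event per attempt would mint a fresh event_id each
    time and defeat downstream dedup.
    """
    body = json.dumps(event).encode("utf-8")
    for _ in range(max_attempts):
        try:
            send(body)
            return True
        except OSError:
            continue  # network timeout: retry with the identical body
    return False
```

The common bug this guards against is calling the event-builder inside the retry loop, which silently turns every timeout into a "new" event.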
Looking at the logs, it's a mix: some duplicates come from device retries (network timeouts causing republish), and some from Pub/Sub redelivery during Dataflow pipeline restarts. So we need multi-level deduplication. What's the best practice for implementing this in Dataflow?
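Since the logs show both duplicate classes, the two levels can be reasoned about separately: same message_id means a Pub/Sub redelivery, while same event_id under a new message_id means a device republish. A small batch-style sketch of that two-stage filter (not Beam code - in Dataflow, stage 1 is effectively what the Pub/Sub source's built-in message_id dedup does, and stage 2 is the stateful event_id check from the answers above):

```python
def dedup_two_stage(messages, transport_seen, event_seen):
    """Two-stage dedup sketch over an in-order list of messages.

    Stage 1 drops Pub/Sub redeliveries (same message_id reappearing);
    stage 2 drops device republishes (same event_id, new message_id).
    The seen-sets stand in for time-bounded state in a real pipeline.
    """
    out = []
    for msg in messages:
        if msg["message_id"] in transport_seen:
            continue  # Pub/Sub redelivered this exact message
        transport_seen.add(msg["message_id"])
        if msg["event_id"] in event_seen:
            continue  # device retry republished the same logical event
        event_seen.add(msg["event_id"])
        out.append(msg)
    return out
```

In production both seen-sets would need the same time-bounded eviction as the dedup window discussed above, otherwise state grows without bound.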