The data-stream module is ingesting duplicate telemetry messages through our Pub/Sub to Dataflow pipeline, causing data quality issues in our analytics. We're seeing a 5-10% duplicate rate across device events, which inflates our metrics and produces incorrect billing calculations.
I suspect this is related to Pub/Sub message deduplication behavior and how we've configured Dataflow windowing. We need to implement exactly-once delivery semantics but aren't sure whether this should be handled at the Pub/Sub level, the Dataflow level, or both:
Duplicate events detected:
device_id: sensor-2847
timestamp: 2025-01-15T14:23:45Z
occurrences: 3 (same message_id)
This is affecting our data analytics accuracy and downstream billing processes. How do you handle message deduplication in Pub/Sub + Dataflow streaming pipelines?
Use a combination of device_id and event_timestamp as your deduplication key, not just message_id. Pub/Sub assigns message_id once per publish, so it stays the same across redeliveries of that message - but a device or gateway that republishes after a network timeout gets a fresh message_id for what is logically the same event, so message_id alone can't catch those duplicates. Implement stateful processing in Dataflow to track seen events within a time window. For telemetry data, a 5-minute deduplication window is usually sufficient.
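To make the idea concrete, here's a minimal sketch of that time-bounded dedup logic as a plain Python class. This is illustrative only - in a real Dataflow pipeline the seen-keys state would live in Beam's per-key state (so it survives worker restarts and scales per device), not in a process-local dict, and the class/field names here are made up for the example:

```python
import time

class DedupCache:
    """Time-bounded cache keyed on (device_id, event_timestamp).

    Hypothetical illustration of the dedup logic; inside Dataflow this
    state would live in Beam's per-key state, not a process-local dict.
    """

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds          # 5-minute dedup window by default
        self.clock = clock              # injectable for testing
        self.seen = {}                  # dedup key -> first-seen time

    def is_duplicate(self, device_id, event_timestamp):
        now = self.clock()
        # Evict entries older than the dedup window so memory stays bounded.
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        key = (device_id, event_timestamp)
        if key in self.seen:
            return True
        self.seen[key] = now
        return False
```

The clock is injected so the expiry behavior can be unit-tested without sleeping; the same trick applies when testing a Beam timer-based implementation.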
For device-level duplicates, have devices include a unique event_id (UUID) in each message, generated once per event rather than once per send attempt. In Dataflow, use stateful processing with a time-bounded cache (5-10 minute window) that tracks seen event_ids. This handles both device retries and Pub/Sub redelivery. Make sure your windowing strategy allows for late data - use an allowed lateness of 5-10 minutes to catch stragglers.
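The device-side half of this matters as much as the pipeline half: the event_id must be minted before the first send attempt and reused verbatim on every retry. A sketch of that, with illustrative function and field names (the send callable stands in for whatever publish client the gateway uses):

```python
import json
import uuid

def make_event(device_id, event_timestamp, payload):
    """Build a telemetry message with a client-generated event_id.

    The event_id is created once, before any send attempt, so every
    retry of the same event carries the same id and can be deduplicated
    downstream. Field names here are illustrative.
    """
    return {
        "event_id": str(uuid.uuid4()),
        "device_id": device_id,
        "event_timestamp": event_timestamp,
        "payload": payload,
    }

def publish_with_retries(event, send, max_attempts=3):
    """Retry the *same* serialized event on failure.

    Rebuilding the event per attempt would mint a fresh event_id each
    time and defeat downstream dedup.
    """
    body = json.dumps(event).encode("utf-8")
    for _ in range(max_attempts):
        try:
            send(body)
            return True
        except OSError:
            continue  # network timeout: retry with the identical body
    return False
```

The common bug this guards against is calling the event-builder inside the retry loop, which silently turns every timeout into a "new" event.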
Looking at the logs, it's a mix: some duplicates come from device retries (network timeouts causing republish), and some from Pub/Sub redelivery during Dataflow pipeline restarts. So we need multi-level deduplication. What's the best practice for implementing this in Dataflow?
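Since the logs show both duplicate classes, the two levels can be reasoned about separately: same message_id means a Pub/Sub redelivery, while same event_id under a new message_id means a device republish. A small batch-style sketch of that two-stage filter (not Beam code - in Dataflow, stage 1 is effectively what the Pub/Sub source's built-in message_id dedup does, and stage 2 is the stateful event_id check from the answers above):

```python
def dedup_two_stage(messages, transport_seen, event_seen):
    """Two-stage dedup sketch over an in-order list of messages.

    Stage 1 drops Pub/Sub redeliveries (same message_id reappearing);
    stage 2 drops device republishes (same event_id, new message_id).
    The seen-sets stand in for time-bounded state in a real pipeline.
    """
    out = []
    for msg in messages:
        if msg["message_id"] in transport_seen:
            continue  # Pub/Sub redelivered this exact message
        transport_seen.add(msg["message_id"])
        if msg["event_id"] in event_seen:
            continue  # device retry republished the same logical event
        event_seen.add(msg["event_id"])
        out.append(msg)
    return out
```

In production both seen-sets would need the same time-bounded eviction as the dedup window discussed above, otherwise state grows without bound.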