Duplicate Pub/Sub messages from billing engine causing unexpected cost spikes

Our billing engine is processing duplicate messages from Pub/Sub, resulting in customers being charged multiple times for the same IoT device usage. We’re seeing approximately 3-8% duplicate rate during high-traffic periods.

The flow is: IoT devices → IoT Core → Pub/Sub → Billing Engine → Cloud SQL. Our billing subscriber uses this acknowledgment pattern:

subscriber.subscribe(subscription_path, callback=process_billing_event)
# In callback:
message.ack()  # Called after DB insert

We’ve confirmed duplicates by checking message_id values - the same message_id appears multiple times in our billing database. This suggests Pub/Sub is redelivering messages we’ve already acknowledged. Our ack deadline is 60 seconds and processing typically takes 5-10 seconds.

Is there a way to enforce exactly-once delivery, or do we need to implement application-level deduplication? The financial impact is significant.

Here’s a comprehensive solution addressing all three focus areas:

Pub/Sub Message Deduplication: Implement database-level deduplication with unique constraints:

CREATE TABLE billing_events (
  message_id VARCHAR(255) PRIMARY KEY,
  device_id VARCHAR(100) NOT NULL,
  usage_amount DECIMAL(10,2),
  processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

In your application:

# Assumes psycopg2; other drivers raise an equivalent IntegrityError
from psycopg2 import IntegrityError

try:
    cursor.execute(
        "INSERT INTO billing_events (message_id, device_id, usage_amount) VALUES (%s, %s, %s)",
        (message.message_id, device_id, amount)
    )
    connection.commit()
    message.ack()
except IntegrityError:  # Duplicate message_id rejected by the primary key
    connection.rollback()  # Clear the aborted transaction before continuing
    message.ack()  # Already processed, safe to ack
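To try this pattern end to end without a Cloud SQL instance, here is a self-contained sketch using Python's built-in sqlite3 (the table and message values are illustrative; with psycopg2 the same flow raises its own IntegrityError):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE billing_events (
        message_id   TEXT PRIMARY KEY,
        device_id    TEXT NOT NULL,
        usage_amount REAL
    )
""")

def record_event(message_id, device_id, amount):
    """Insert once; return False if this message_id was already processed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO billing_events (message_id, device_id, usage_amount) "
                "VALUES (?, ?, ?)",
                (message_id, device_id, amount),
            )
        return True  # first delivery: apply charges, then ack
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: skip charges, still ack

print(record_event("msg-001", "device-42", 1.25))  # True
print(record_event("msg-001", "device-42", 1.25))  # False (primary key rejects it)
```

Either way the message gets acked; the constraint only decides whether the charge is applied.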

Ack/Nack Handling: Your current code acks after the DB insert, which is the right order, but you also need to distinguish permanent from transient failures so each message is acked or nacked deliberately rather than timing out:

def process_billing_event(message):
    # ValidationError / DatabaseError stand in for your parser's and
    # DB driver's actual exception types
    try:
        # Parse and validate
        billing_data = parse_message(message.data)
        # Insert with deduplication
        insert_billing_record(message.message_id, billing_data)
        message.ack()
    except ValidationError as e:
        # Invalid data - a retry can never succeed
        log_error(f"Invalid message: {e}")
        message.ack()  # Ack to prevent infinite redelivery
    except DatabaseError as e:
        # Transient DB error - retry
        log_error(f"DB error: {e}")
        message.nack()  # Explicit nack for prompt redelivery

Exactly-Once Delivery: standard Pub/Sub subscriptions are at-least-once (a subscription-level exactly-once delivery option now exists for pull subscriptions, but application-level idempotency remains the robust approach for billing), so enforce it in your application:

  1. Use transactions for atomic deduplication:
# psycopg2-style: the connection context manager wraps the block in one transaction
with connection:
    cursor.execute(
        "INSERT INTO billing_events ... ON CONFLICT (message_id) DO NOTHING"
    )
    if cursor.rowcount > 0:
        # First time processing this message
        cursor.execute("INSERT INTO customer_charges ...")
  2. Implement distributed locking for critical sections using Cloud Memorystore (Redis):
import redis

redis_client = redis.Redis(host="10.0.0.3", port=6379)  # your Memorystore endpoint
lock = redis_client.lock(f"billing_lock:{message.message_id}", timeout=30)
if lock.acquire(blocking=False):
    try:
        process_billing(message)
    finally:
        lock.release()
# else: another worker holds the lock and is already processing this message
  3. Monitor duplicate rates (prometheus_client shown for illustration):
from prometheus_client import Counter

total_messages = Counter('billing_messages_total', 'Billing messages received')
duplicate_messages = Counter('billing_duplicates_total', 'Duplicate billing messages')

if is_duplicate:
    duplicate_messages.inc()
total_messages.inc()
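The transactional dedup from step 1 can also be exercised locally: sqlite3 supports the same ON CONFLICT ... DO NOTHING clause, and gating the charge on rowcount makes both inserts atomic. The customer_charges schema and amounts below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE billing_events   (message_id TEXT PRIMARY KEY, amount REAL);
    CREATE TABLE customer_charges (message_id TEXT, amount REAL);
""")

def charge_once(message_id, amount):
    """Atomically record the event and the charge; a no-op on redelivery."""
    with conn:  # single transaction: both inserts commit together, or neither
        cur = conn.execute(
            "INSERT INTO billing_events (message_id, amount) VALUES (?, ?) "
            "ON CONFLICT (message_id) DO NOTHING",
            (message_id, amount),
        )
        if cur.rowcount > 0:  # first time processing this message
            conn.execute(
                "INSERT INTO customer_charges (message_id, amount) VALUES (?, ?)",
                (message_id, amount),
            )

charge_once("msg-7", 0.50)
charge_once("msg-7", 0.50)  # duplicate delivery: conflict, no second charge
print(conn.execute("SELECT COUNT(*) FROM customer_charges").fetchone()[0])  # 1
```

Because the dedup insert and the charge share one transaction, a crash between them rolls both back and the redelivered message starts clean.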

Set up alerts when the duplicate rate exceeds 1%. Your 3-8% rate suggests ack handling issues or worker instability. Check Cloud Monitoring for:

  • subscription/num_undelivered_messages (unacked backlog)
  • subscription/oldest_unacked_message_age
  • High worker restart rates
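The alert condition itself is just a ratio against the threshold; a minimal helper using the 1% figure suggested above:

```python
def duplicate_rate(total, duplicates):
    """Fraction of deliveries that were duplicates (0.0 when no traffic)."""
    return duplicates / total if total else 0.0

def should_alert(total, duplicates, threshold=0.01):
    """True when the duplicate rate exceeds the alerting threshold."""
    return duplicate_rate(total, duplicates) > threshold

print(should_alert(10_000, 50))   # 0.5% -> False
print(should_alert(10_000, 400))  # 4% -> True, inside the 3-8% range reported
```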

Finally, increase your ack deadline to 120 seconds to provide more buffer for processing variations. This reduces redelivery due to timeout.
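Note that the modern Python client library automatically extends ack deadlines while your callback runs, so deadline timeouts usually point to crashed or stalled workers rather than merely slow processing. Still, to see why headroom matters, here is a toy model of timeout-driven redelivery; the processing-time samples (occasional stalls from GC pauses, DB failover, or restarts) are invented for illustration:

```python
def count_timeout_redeliveries(processing_times, ack_deadline):
    """A message whose processing outlasts the deadline gets redelivered."""
    return sum(1 for t in processing_times if t > ack_deadline)

# Typical 5-10 s processing with occasional multi-minute stalls (seconds)
samples = [5, 7, 9, 6, 8, 75, 10, 95, 7, 130]

print(count_timeout_redeliveries(samples, 60))   # 3 redeliveries
print(count_timeout_redeliveries(samples, 120))  # 1
```

Raising the deadline shrinks the timeout-redelivery window, but only idempotent processing eliminates duplicate charges.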

We tried adding a message_id check, but we’re hitting race conditions where two concurrent workers process the same duplicate message before either completes the DB insert. We’re using Cloud SQL with default isolation levels. Should we be using database-level locking or is there a better pattern?

I’d add that you should also investigate why you’re getting 3-8% duplicates - that seems high. Are you properly handling ack/nack? If your processing fails partway through, you should nack the message, not let it timeout and get redelivered. Also check if your workers are being terminated mid-processing, which would cause redelivery of unacked messages.

Enable Pub/Sub message ordering if you need guaranteed ordering within a key (like device_id). This can reduce duplicate processing in some scenarios. Also monitor your ack latency - if you’re close to the 60-second deadline, network hiccups could cause late acks that lead to redelivery.

Pub/Sub provides at-least-once delivery by design, not exactly-once. You must implement idempotency in your application. Check if the message_id already exists in your database before processing. This is standard practice for financial systems.