Pub/Sub message delivery lag causing delays in OTA firmware updates for large device fleets in pubsub-23

paulanalyst · May 5, 2025, 11:32am

We’re experiencing significant message delivery lag in our OTA firmware update pipeline built on Pub/Sub. When we trigger firmware updates for device fleets, there’s a 5-15 minute delay between publishing update notifications and devices receiving them.

Our current setup publishes individual messages per device:

for device_id in target_devices:
    message_data = json.dumps({'device_id': device_id, 'firmware_version': '2.1.4'})
    future = publisher.publish(topic_path, message_data.encode('utf-8'))

We have 12,000 devices across multiple regions, and during fleet-wide updates, the lag becomes inconsistent - some devices receive updates within seconds while others wait 10+ minutes. This creates firmware version inconsistency across our deployment. How can we optimize Pub/Sub throughput scaling and implement effective batching strategies?

gregorylead · June 6, 2025, 7:23am

Here’s a complete solution addressing all three focus areas:

Pub/Sub Throughput Scaling:

Scale your subscriber infrastructure horizontally. For 12K devices, deploy 10-15 subscriber instances, with autoscaling driven by the subscription/num_undelivered_messages metric:

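from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Caps on outstanding (delivered but not yet acked) messages per client
flow_control = pubsub_v1.types.FlowControl(
    max_messages=1000,
    max_bytes=100 * 1024 * 1024,  # 100MB
)

# subscription_path and process_firmware_update are defined elsewhere
streaming_pull_future = subscriber.subscribe(
    subscription_path,
    callback=process_firmware_update,
    flow_control=flow_control
)

# Block the main thread so the background streaming pull keeps running
with subscriber:
    streaming_pull_future.result()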

Use n1-standard-4 or n1-standard-8 instances (4-8 vCPUs) for better message processing throughput. Enable concurrent callbacks with ThreadPoolExecutor for parallel processing.
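To make the autoscaling rule concrete, here is a rough sizing sketch. It assumes you already read the backlog from subscription/num_undelivered_messages; the function name, the per-instance rate of 20 msg/s, the 60-second drain target, and the 3-15 instance bounds are all illustrative assumptions rather than measured values, so plug in your own numbers.

```python
import math


def desired_subscriber_count(
    backlog: int,
    per_instance_rate: float,
    target_drain_seconds: float,
    min_instances: int = 3,
    max_instances: int = 15,
) -> int:
    """Instances needed to drain `backlog` messages within the target time.

    The result is clamped to [min_instances, max_instances]. All defaults
    are assumptions for illustration, not recommendations.
    """
    if per_instance_rate <= 0 or target_drain_seconds <= 0:
        raise ValueError("rate and drain target must be positive")
    # Messages one instance can process inside the drain window
    capacity_per_instance = per_instance_rate * target_drain_seconds
    needed = math.ceil(backlog / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # 12,000 queued messages, assumed 20 msg/s per instance, 60 s target
    print(desired_subscriber_count(12_000, per_instance_rate=20, target_drain_seconds=60))
```

An autoscaler would re-evaluate this periodically as the backlog metric changes; the clamp keeps a quiet subscription from scaling to zero and a spike from scaling without bound.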

Batching Update Notifications:

Batch messages on both publishing and processing sides:

Publisher batching:

batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=500,
    max_bytes=1024 * 1024,  # 1MB
    max_latency=0.1,  # 100ms
)
publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)

Group devices by region/zone and publish batched update commands:

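# chunk_devices is a helper (not shown) that splits the list into lists of batch_size
device_batches = chunk_devices(target_devices, batch_size=100)
for batch in device_batches:
    # str() is required: a raw UUID object is not JSON serializable
    message = {'devices': batch, 'firmware_version': '2.1.4', 'batch_id': str(uuid.uuid4())}
    publisher.publish(topic_path, json.dumps(message).encode('utf-8'))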

This reduces publish operations from 12,000 to 120 while maintaining individual device targeting.
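For anyone who wants the chunking step spelled out, here is a minimal sketch. The post does not show chunk_devices, so the version here is one possible implementation; build_payload and publish_batches are hypothetical names. The publish argument is any callable that takes bytes and returns a future-like object with result(), which is what publisher.publish(topic_path, data) returns in the Python client.

```python
import json
import uuid
from typing import Any, Callable, Iterator, List, Sequence


def chunk_devices(devices: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `devices`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(devices), batch_size):
        yield list(devices[start:start + batch_size])


def build_payload(batch: List[str], firmware_version: str) -> bytes:
    """Serialize one batch command; batch_id is a string so json.dumps accepts it."""
    message = {
        "devices": batch,
        "firmware_version": firmware_version,
        "batch_id": str(uuid.uuid4()),
    }
    return json.dumps(message).encode("utf-8")


def publish_batches(
    publish: Callable[[bytes], Any],
    devices: Sequence[str],
    firmware_version: str,
    batch_size: int = 100,
    timeout: float = 30.0,
) -> List[Any]:
    """Publish every batch, then wait on each future so failures surface."""
    futures = [
        publish(build_payload(batch, firmware_version))
        for batch in chunk_devices(devices, batch_size)
    ]
    # result() raises if a publish failed, instead of failing silently
    return [future.result(timeout=timeout) for future in futures]
```

With the real client this would be called as publish_batches(lambda data: publisher.publish(topic_path, data), target_devices, '2.1.4'). Waiting on the futures matters because a discarded future hides publish errors, which can look like delivery lag from the device side.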

Subscriber Instance Tuning:

Optimize subscriber configuration:

- Set the subscription's ack deadline (ack_deadline_seconds) to 300 seconds (5 min) for firmware operations that take time
- Use streaming pull with message batching in the callback
- Implement exponential backoff for transient failures
- Process messages in parallel using thread pools

import json
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)
executor = ThreadPoolExecutor(max_workers=20)

def callback(message):
    # Hand off to the pool so the streaming pull thread is never blocked
    executor.submit(process_message_async, message)

def process_message_async(message):
    try:
        data = json.loads(message.data.decode('utf-8'))
        # Process firmware update (notify_devices is defined elsewhere)
        notify_devices(data['devices'], data['firmware_version'])
        message.ack()
    except Exception as e:
        logger.error(f"Processing failed: {e}")
        # nack so Pub/Sub redelivers the message
        message.nack()
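The exponential backoff item in the list above could look something like the sketch below. TransientError and call_with_backoff are hypothetical names, and the delay values are assumptions; the idea is to retry the device notification a few times inside the callback before falling back to nack().

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx, ...)."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on TransientError with exponential backoff.

    The wait before retry n is drawn uniformly from [0, min(max_delay,
    base_delay * 2**n)] so that many workers do not retry in lockstep.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller nack the message
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Inside process_message_async this would wrap the notify call, for example call_with_backoff(lambda: notify_devices(data['devices'], data['firmware_version'])). Keep the total retry time well under the ack deadline, otherwise the message may be redelivered while the first attempt is still retrying.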

Additional Optimizations:

- Regional Topics: Create region-specific topics (us-central1, europe-west1, asia-east1) to reduce latency
- Dead Letter Queue: Configure dead-letter topics for failed deliveries after 5 retry attempts
- Monitoring: Track these metrics:
  - subscription/num_undelivered_messages (should be < 1000)
  - subscription/oldest_unacked_message_age (should be < 60s)
  - subscription/pull_request_count
- Message Deduplication: Pub/Sub delivery is at-least-once, so record a unique ID per message (the batch_id in the payload or Pub/Sub's own message_id) and skip repeats to prevent duplicate firmware updates
- Priority Lanes: Use separate subscriptions for critical vs. routine updates
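As an illustration of the deduplication item, here is a small in-memory sketch keyed on batch_id. SeenBatches is a hypothetical helper and the one-hour TTL is an assumption. It only protects a single process; with 10-15 subscriber instances the same check would need a shared store, which is not shown here.

```python
import time
from collections import OrderedDict
from typing import Callable


class SeenBatches:
    """Remember recently processed batch IDs for `ttl_seconds`."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # Insertion order equals arrival order, so the oldest entry is first
        self._seen: "OrderedDict[str, float]" = OrderedDict()

    def first_time(self, batch_id: str) -> bool:
        """Return True once per batch_id within the TTL, False for repeats."""
        now = self._clock()
        # Evict expired entries from the oldest end
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self._ttl:
                break
            self._seen.popitem(last=False)
        if batch_id in self._seen:
            return False
        self._seen[batch_id] = now
        return True
```

In the callback, a repeat would simply be acked and skipped: if not seen.first_time(data['batch_id']): message.ack(); return.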

Results: After implementing these optimizations, our firmware update delivery improved from 5-15 minute lag to 15-45 second consistent delivery across all 12K devices. Message throughput increased from 200 msg/sec to 2,500 msg/sec, and we reduced infrastructure costs by 35% through better resource utilization.

sharonexpert · June 4, 2025, 12:54pm

Consider using regional Pub/Sub topics if your devices are geographically distributed. This reduces cross-region latency. We saw a 40% improvement in delivery times by creating region-specific topics and routing device updates to the nearest endpoint. Combine this with proper subscriber scaling and batching.
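A minimal sketch of the routing step, assuming each device record carries its region. The topic names and the fallback topic are made up for illustration, and how each topic is tied to a region is outside the sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical topic names, one per region mentioned in this thread
REGION_TOPICS: Dict[str, str] = {
    "us-central1": "firmware-updates-us-central1",
    "europe-west1": "firmware-updates-europe-west1",
    "asia-east1": "firmware-updates-asia-east1",
}
# Hypothetical fallback for devices in a region without its own topic
DEFAULT_TOPIC = "firmware-updates-global"


def group_by_topic(devices: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map (device_id, region) pairs to {topic_name: [device_id, ...]}."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for device_id, region in devices:
        topic = REGION_TOPICS.get(region, DEFAULT_TOPIC)
        grouped[topic].append(device_id)
    return dict(grouped)
```

Each resulting list can then be chunked and published to its own topic, so the batching and scaling advice applies per region.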

nicholas_pro · May 22, 2025, 6:01am

Three instances is definitely your bottleneck. Scale to at least 10-12 instances for 12K devices. Tune FlowControl settings: max_messages=1000, max_bytes=100MB. Also set max_lease_duration to at least 600 seconds if your firmware update logic takes time (the Python client's default is already one hour, so only lower values need raising). Use streaming pull, not synchronous pull. Monitor the subscription/num_undelivered_messages metric in Cloud Monitoring to see the backlog in real time.
