Device telemetry data stream delays in Pub/Sub delivery impact real-time dashboard

We’re experiencing significant delays in our telemetry data pipeline affecting real-time monitoring dashboards. Devices publish MQTT messages to IoT Core every 30 seconds, but our Pub/Sub subscribers are receiving messages with 2-5 minute delays during peak hours (8AM-6PM). This lag makes our operational dashboards nearly useless for real-time decision making.

Our setup has 3,500 active devices publishing to a single Pub/Sub topic with three subscribers processing different analytics workloads. The Pub/Sub ack deadline is currently set to the default (10 seconds), and we’re seeing subscriber throughput drop to about 200 messages/sec during peak times despite much higher publishing rates.

I’m particularly concerned about the MQTT message flow from IoT Core to Pub/Sub and whether our subscriber configuration is causing bottlenecks. Has anyone dealt with similar telemetry delays and found effective tuning strategies?

Thanks for the suggestions. We’re using pull subscriptions but haven’t implemented flow control properly. I checked our subscriber logs and found that processing time averages 8-12 seconds per message during analytics operations, which explains the ack deadline issues. I’ll increase the deadline and implement batching. What about the MQTT message flow from IoT Core - could that be a bottleneck too?

One more thing to check - are your subscribers running on adequately sized instances? We had similar delays that disappeared when we upgraded from n1-standard-2 to n1-standard-4 instances. The CPU overhead of deserializing and processing IoT telemetry can be significant, especially if you’re doing any transformation before storing data.

I’ve seen this pattern before. Your ack deadline of 10 seconds is likely too short for processing complex analytics. When subscribers can’t acknowledge within the deadline, Pub/Sub redelivers messages, creating a cascading backlog. Try increasing your ack deadline to 60-120 seconds based on your actual processing time. Also check if your subscribers are CPU-bound during peak hours.
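To see why the cascade forms, here’s a back-of-the-envelope model (the function and its ceil() approximation are my own sketch, not anything from the Pub/Sub client). With a 10-second deadline and 12-second processing, every message is delivered at least twice, roughly doubling the load your subscribers must absorb:

```python
import math

def effective_load(publish_rate, proc_time_s, ack_deadline_s):
    """Estimate the delivered-message rate once redeliveries are included.

    If processing outlasts the ack deadline, Pub/Sub redelivers the message;
    simplified model: deliveries per message = ceil(proc_time / ack_deadline).
    """
    deliveries_per_msg = math.ceil(proc_time_s / ack_deadline_s)
    return publish_rate * deliveries_per_msg

# ~117 msg/s published, 12 s processing, 10 s deadline:
# every message is delivered twice, so subscribers see ~234 msg/s.
print(effective_load(117, 12, 10))  # 234
# With a 90 s deadline the redeliveries disappear:
print(effective_load(117, 12, 90))  # 117
```

Real redelivery behavior is messier (modack extensions, jitter), but the model shows how a too-short deadline multiplies the load rather than just delaying it.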

IoT Core to Pub/Sub handoff is generally very fast (under 100ms) unless you’re hitting IoT Core quotas. Check your IoT Core metrics in Cloud Monitoring for any throttling. The real issue is usually subscriber-side. With 3,500 devices at 30-second intervals, you’re looking at roughly 117 messages/second baseline. Note that subscribers pull from subscriptions, not directly from the topic: if your three subscribers share a single subscription, they’re load-balancing (competing for) the same messages, and one slow workload drags down the others. Give each analytics workload its own subscription on the topic - each subscription receives its own full copy of every message - or use subscription filters so each workload only sees the messages it needs.
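For reference, the baseline arithmetic:

```python
devices = 3500
publish_interval_s = 30
baseline_rate = devices / publish_interval_s  # messages/second across the fleet
print(round(baseline_rate, 1))  # 116.7

# Note the observed 200 msg/s subscriber throughput already exceeds this,
# so the growing backlog is likely driven by redeliveries and competing
# subscribers, not by raw publish volume.
```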

I had almost identical symptoms last year. Here’s what worked:

First, increase your ack deadline to match actual processing time plus a buffer (60-90 seconds for your 8-12 second processing). Second, configure proper flow control on subscribers - set maxOutstandingMessages to limit concurrent processing based on your instance capacity. Third, enable Pub/Sub message ordering only if you actually need it; be aware it can reduce throughput.
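One way to size maxOutstandingMessages is to keep every worker busy while staying under what can be acked within the deadline. This helper and its safety factor are my own rule of thumb, not a Pub/Sub API - treat the numbers as a starting point:

```python
def max_outstanding_messages(workers, avg_proc_time_s, ack_deadline_s, safety=0.8):
    """Rough sizing rule: enough messages in flight to keep all workers busy,
    capped by how many one instance can process within the ack deadline."""
    busy = workers  # at minimum, one message per worker
    deadline_budget = int(workers * (ack_deadline_s / avg_proc_time_s) * safety)
    return max(busy, min(deadline_budget, 1000))  # hard cap to bound memory

# 8 worker threads, 10 s average processing, 90 s deadline:
print(max_outstanding_messages(8, 10, 90))  # 57
```

The hard cap of 1000 is arbitrary; what matters is that outstanding messages times average message size stays comfortably inside your instance’s memory.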

For the MQTT flow, verify your IoT Core registry isn’t hitting the 4000 messages/second per registry limit. If you’re close, consider sharding devices across multiple registries. Also check that your MQTT QoS settings align with your delivery requirements - QoS 1 provides at-least-once delivery, which is usually sufficient for telemetry.
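At the quoted limit, the sharding threshold is easy to work out (the 4,000 msg/s figure is taken from above - verify it against the current quota page):

```python
registry_limit_mps = 4000   # per-registry messages/second, as quoted above
publish_interval_s = 30     # each device publishes every 30 seconds

# Maximum devices one registry can sustain at this cadence:
max_devices_per_registry = registry_limit_mps * publish_interval_s
print(max_devices_per_registry)  # 120000

# 3,500 devices is ~3% of that, so registry sharding is unlikely to be
# the bottleneck at this fleet size.
```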

Most importantly, implement these Pub/Sub subscription settings:

  • Ack deadline: 90 seconds
  • Flow control: maxOutstandingMessages=1000, maxOutstandingBytes=100MB
  • Pull batch size: 500 messages
  • Number of concurrent pull streams: 4-8 depending on instance size
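Of those settings, only the ack deadline lives on the subscription itself; flow control, pull batch size, and stream count are configured in your subscriber client code. The server-side part can be applied with gcloud (the subscription name here is a placeholder):

```shell
# Ack deadline is a subscription property - set it server-side.
# "telemetry-analytics" is a placeholder for your subscription name.
gcloud pubsub subscriptions update telemetry-analytics --ack-deadline=90

# Verify the change took effect:
gcloud pubsub subscriptions describe telemetry-analytics
```

Keep in mind the client library can also extend deadlines per-message while a message is being processed, so the subscription-level value is a floor, not the whole story.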

Monitor your subscription backlog metrics in Cloud Monitoring (num_undelivered_messages and oldest_unacked_message_age) closely. If lag remains high after these changes, you need to scale horizontally by adding more subscriber instances. We went from 2 to 6 subscriber instances and our p99 latency dropped from 4 minutes to under 30 seconds.
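You can sanity-check the instance count with Little’s law (in-flight work = arrival rate × processing time). The per-instance concurrency figure below is a placeholder for whatever you measure on your own hardware:

```python
import math

def instances_needed(arrival_rate_mps, avg_proc_time_s, concurrency_per_instance):
    """Little's law: steady-state in-flight messages L = lambda * W.
    Divide by per-instance concurrency to get the instance count."""
    in_flight = arrival_rate_mps * avg_proc_time_s
    return math.ceil(in_flight / concurrency_per_instance)

# ~117 msg/s arriving, 10 s average processing, and (hypothetically)
# 200 concurrent messages per instance:
print(instances_needed(117, 10, 200))  # 6
```

That this lands on 6 matches our experience, but the concurrency-per-instance input is the number you have to measure, not assume.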

Finally, consider implementing a separate fast-path subscriber for real-time dashboard updates that does minimal processing, while heavy analytics run on a different subscription with its own throughput limits. This architecture pattern isolates critical real-time needs from batch processing workloads.
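A minimal sketch of that split, assuming a topic named device-telemetry (all names here are placeholders). Each subscription independently receives every message published to the topic, so the fast path is never blocked by the analytics backlog:

```shell
# Fast path: short deadline, minimal processing, feeds the dashboard.
gcloud pubsub subscriptions create dashboard-fast-path \
  --topic=device-telemetry \
  --ack-deadline=10

# Slow path: long deadline for heavy analytics, its own backlog.
gcloud pubsub subscriptions create analytics-heavy \
  --topic=device-telemetry \
  --ack-deadline=90
```

If the dashboard only needs a subset of messages, subscription filters can trim its copy further - but note filters can only be set at subscription creation time.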