Monitoring alerts not triggering when IoT device goes offline despite correct Pub/Sub subscription setup

We’ve configured Cloud Monitoring alert policies to notify us when devices disconnect from our IoT Core registry, but alerts aren’t firing consistently. Our fleet has 500+ industrial sensors sending telemetry every 30 seconds. When devices go offline (power loss, network issues), we expect immediate alerts but often discover outages hours later through manual checks.

Our current setup uses the device connection state metric with a threshold condition. The Pub/Sub topic for telemetry data is linked to the registry, and we have a subscription processing messages. However, the alert policy doesn’t seem to detect when connection state changes from CONNECTED to DISCONNECTED. We’ve verified the metric exists in Cloud Monitoring but the policy remains silent during actual device failures.

Is there a specific way to configure alert policies for IoT device connection states? Are we missing linkage between the Pub/Sub subscription and the monitoring metric?

That’s interesting about absence-based alerts. So instead of monitoring the connection state metric directly, I should monitor the Pub/Sub message flow? How would I configure that in Cloud Monitoring?

The connection state metric alone might not be sufficient. IoT Core’s device connection state is tracked differently than regular telemetry metrics. You need to ensure you’re monitoring the right signal - typically iot.googleapis.com/device/connected_state but this metric has a sampling interval that might not catch brief disconnections. Have you checked if your alert policy is using the correct metric name and aggregation window?

Adding to the excellent suggestions above - your alert policy configuration needs to address all three components systematically:

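Cloud Monitoring Alert Policy Configuration: First, create a metric-based alert policy specifically for device connectivity, and base it on Pub/Sub message flow rather than connection state. Be careful with pubsub.googleapis.com/subscription/num_unacked_messages_by_region: it reports backlog size, so it can stay flat or even grow while devices are silent. A throughput metric such as pubsub.googleapis.com/topic/send_message_operation_count follows publishes more directly. Use a 1-minute alignment period with an aligner of ALIGN_RATE, and pair it with a metric-absence condition, because a stream that stops may produce no data points at all rather than a zero. Also note that Pub/Sub metrics are reported per topic or subscription, not per device, so on their own they detect a silent topic or subscription; per-device detection needs the filtered subscriptions or custom metrics described in the next two sections.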
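To make the absence-based condition concrete, here is a minimal sketch that builds an alert policy body as plain JSON using the Cloud Monitoring AlertPolicy fields (conditionAbsent, duration, aggregations). The topic name device-telemetry and the choice of the send_message_operation_count metric are assumptions for illustration, not values from the original setup, and the 120-second floor on the absence duration is what I recall from the API reference, so confirm it before relying on it.

```python
import json


def build_absence_policy(topic_id: str, metric_type: str, absent_seconds: int = 180) -> dict:
    """Return an AlertPolicy body that fires when the metric stops reporting."""
    # Assumption: the API's documented minimum for metric absence is 120 seconds.
    if absent_seconds < 120:
        raise ValueError("metric-absence duration must be at least 120 seconds")
    metric_filter = (
        f'metric.type="{metric_type}" '
        f'AND resource.type="pubsub_topic" '
        f'AND resource.labels.topic_id="{topic_id}"'
    )
    return {
        "displayName": f"No telemetry on {topic_id}",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "Telemetry stream absent",
                "conditionAbsent": {
                    "filter": metric_filter,
                    "duration": f"{absent_seconds}s",
                    "aggregations": [
                        {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_RATE"}
                    ],
                },
            }
        ],
    }


# Hypothetical topic name; substitute the registry's real telemetry topic.
policy = build_absence_policy(
    "device-telemetry",
    "pubsub.googleapis.com/topic/send_message_operation_count",
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and supplied when creating the policy (for example through the Monitoring API or a gcloud policy-from-file workflow). Notification channels are left out here and would need to be added.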

Pub/Sub Topic and Subscription Linkage: Your IoT Core device registry’s telemetry topic must have an active subscription with proper configuration:

  • Set ackDeadlineSeconds to 60 seconds minimum
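  • Enable retainAckedMessages and set messageRetentionDuration to 10 minutes (retainAckedMessages is a boolean; the retention window is a separate field). This keeps recent messages available for replay, but it does not change what Cloud Monitoring evaluates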
  • Use message filtering on the subscription level with attributes like deviceId to enable per-device monitoring
  • Create a separate subscription dedicated to monitoring (don’t rely on your processing subscription)
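The subscription settings listed above can be sketched as a request body for the Pub/Sub Subscription resource. This is only an illustration under assumptions: the project, topic, and device names are hypothetical, and the filter relies on telemetry messages carrying a deviceId attribute.

```python
def monitoring_subscription_body(project: str, topic: str, device_id: str) -> dict:
    """Build a Subscription body for a monitoring-only, per-device subscription."""
    return {
        "topic": f"projects/{project}/topics/{topic}",
        # Minimum ack deadline suggested in this thread.
        "ackDeadlineSeconds": 60,
        # Boolean flag; the retention window is the separate field below.
        "retainAckedMessages": True,
        "messageRetentionDuration": "600s",  # 10 minutes
        # Attribute filter so only this device's messages are delivered.
        "filter": f'attributes.deviceId = "{device_id}"',
    }


# Hypothetical identifiers, for illustration only.
body = monitoring_subscription_body("my-project", "device-telemetry", "sensor-001")
print(body)
```

A subscription filter is fixed at creation time, so changing the device a subscription watches means creating a new subscription. With 500+ devices, one filtered subscription per device also adds up, which is worth weighing against the custom-metric approach.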

Device Connection State Metrics: While IoT Core does expose connection state, it’s not real-time. Instead, implement a custom metric approach:

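  1. Create a log-based metric from Cloud Logging that captures device authentication events
  2. Filter for `resource.type="cloudiot_device"` together with the field that identifies those events in your logs. Note that `protoPayload.methodName="google.cloud.iot.v1.DeviceManager.SendCommandToDevice"` matches commands sent to a device, not authentication, so check the exact field and value in Logs Explorer before saving the metric
  3. Extract device ID as a label: `labels.device_id = EXTRACT(resource.labels.device_id)`
  4. Set up an alert policy on this custom metric that triggers when a device's series reports 0 or no data for > 2 minutes. A log-based counter typically writes no points when nothing matches, so a metric-absence condition is more reliable than a value = 0 threshold. The counted event must also recur while the device is healthy; a one-off connect event goes quiet on healthy devices too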
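As a rough sketch of steps 1 to 3, the log-based metric can be described as a LogMetric body with a label extractor. The metric name and the example filter clause (jsonPayload.eventType="PUBLISH") are assumptions for illustration, and the extractor path is copied from this thread, so verify both against real log entries first.

```python
def device_log_metric(name: str, log_filter: str, device_field: str) -> dict:
    """Build a LogMetric body: a counter with one label holding the device ID."""
    return {
        "name": name,
        "description": "Per-device activity events from IoT device logs",
        "filter": log_filter,
        "metricDescriptor": {
            "metricKind": "DELTA",
            "valueType": "INT64",
            "labels": [{"key": "device_id", "valueType": "STRING"}],
        },
        # EXTRACT(field) copies the value of the named log field into the label.
        "labelExtractors": {"device_id": f"EXTRACT({device_field})"},
    }


# Assumed filter clause; confirm the event field in Logs Explorer.
metric = device_log_metric(
    "iot_device_activity",
    'resource.type="cloudiot_device" AND jsonPayload.eventType="PUBLISH"',
    "resource.labels.device_id",
)
print(metric["labelExtractors"])
```

Counting an event that repeats with every telemetry publish, rather than a connect event, means a healthy device keeps producing points and a silent one stops, which is what an absence condition needs.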

For your 500+ device fleet, consider implementing a Cloud Function that periodically (every 5 minutes) queries the device registry for connection states and publishes custom metrics to Cloud Monitoring. This gives you more control over alert timing and can include device-specific metadata in alert notifications.
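The core check such a function would run can be sketched without any cloud calls. The version below assumes the function has already collected a last-seen timestamp per device (for example from the registry's device listing) and flags devices that have missed several 30-second telemetry intervals. The function name, the device IDs, and the four-interval tolerance are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Devices in this fleet send telemetry every 30 seconds.
TELEMETRY_INTERVAL = timedelta(seconds=30)


def find_silent_devices(
    last_seen: Dict[str, Optional[datetime]],
    now: datetime,
    missed_intervals: int = 4,
) -> List[str]:
    """Return IDs of devices with no telemetry for `missed_intervals` periods."""
    cutoff = now - TELEMETRY_INTERVAL * missed_intervals
    # A device that has never reported (None) is treated as silent.
    return sorted(
        device_id
        for device_id, seen in last_seen.items()
        if seen is None or seen < cutoff
    )


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fleet = {
    "sensor-a": now - timedelta(seconds=10),  # reporting normally
    "sensor-b": now - timedelta(minutes=5),   # missed ten intervals
    "sensor-c": None,                         # never reported
}
print(find_silent_devices(fleet, now))  # ['sensor-b', 'sensor-c']
```

The result could then be written as a custom metric or sent straight to a notification channel. Tolerating a few missed intervals avoids alerting on a single dropped message.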

The key issue with your current setup is likely that you’re monitoring a lagging indicator (connection state) rather than the actual data flow (Pub/Sub messages). Switch to message-based monitoring and you’ll get alerts within 2-3 minutes of device disconnection rather than hours later.