Cloud Monitoring alerts not triggering for custom metrics, missing critical incidents in production

We’ve configured alerting policies in Cloud Monitoring for custom metrics from our application, but alerts aren’t triggering even when metric values clearly exceed our defined thresholds. This has caused us to miss several critical incidents over the past week.

Our setup involves custom metrics written via the Monitoring API:

metric = monitoring_v3.TimeSeries()
metric.metric.type = 'custom.googleapis.com/app/error_rate'
metric.resource.type = 'gce_instance'
metric.points = [monitoring_v3.Point({
    'interval': {'end_time': {'seconds': int(time.time())}},
    'value': {'double_value': error_rate}
})]

The metrics are appearing in Metrics Explorer and the values are correct, but our alerting policy (threshold > 5.0 for 5 minutes) never fires. We’ve verified the notification channels are working by testing them manually. The policy shows as “Active” in the console but no incidents are created.

Has anyone encountered issues with Cloud Monitoring alerting policies not triggering for custom metrics? We’re on GCP and this is becoming a major operational risk.

There are three areas to check here: custom metric ingestion, alerting policy configuration, and notification channel setup.

Custom Metric Ingestion: The root cause of your issue is incomplete resource labels in your metric writes. For gce_instance resource type, you must provide all required labels. Here’s the corrected ingestion code:

import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

# Create time series with complete resource labels
series = monitoring_v3.TimeSeries()
series.metric.type = 'custom.googleapis.com/app/error_rate'
series.resource.type = 'gce_instance'
series.resource.labels['instance_id'] = instance_id
series.resource.labels['zone'] = zone  # REQUIRED - was missing
series.resource.labels['project_id'] = project_id

point = monitoring_v3.Point({
    'interval': {'end_time': {'seconds': int(time.time())}},
    'value': {'double_value': error_rate}
})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])

Before writing metrics, ensure your custom metric descriptor is properly created with the correct value type and metric kind. For error rates, use GAUGE metric kind with DOUBLE value type.
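If you haven't explicitly registered the descriptor, this is roughly the shape it needs. A sketch in plain-dict form (the description text is my wording; with google-cloud-monitoring installed you'd pass this to client.create_metric_descriptor):

```python
# Sketch of the metric descriptor to register before writing points.
# With the client library installed you would call:
#   client.create_metric_descriptor(name=f"projects/{project_id}",
#                                   metric_descriptor=descriptor)
descriptor = {
    "type": "custom.googleapis.com/app/error_rate",
    "metric_kind": "GAUGE",   # point-in-time value, not a running total
    "value_type": "DOUBLE",   # must match the double_value you write
    "unit": "1",              # dimensionless ratio
    "description": "Application error rate (errors per request)",
}

print(descriptor["metric_kind"], descriptor["value_type"])
```

If the descriptor was auto-created from your first write, fetch it with client.get_metric_descriptor and confirm the kind and value type match what you intend.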

Alerting Policy Configuration: Your alerting policy needs alignment and aggregation settings that match your metric ingestion frequency. Assuming you're writing roughly every 30 to 60 seconds, configure your policy with these parameters:

  • Alignment Period: 1 minute (60 seconds)
  • Per-series aligner: ALIGN_MEAN or ALIGN_MAX (use MAX for error rates to catch spikes)
  • Cross-series reducer: REDUCE_MEAN (if aggregating across multiple instances)
  • Condition Duration: 5 minutes (as you specified)
  • Threshold: > 5.0
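If you manage policies in code rather than the console, the condition portion would look roughly like this. A sketch in the dict shape the Cloud Monitoring API accepts (the display name is made up; you'd embed this in an AlertPolicy passed to AlertPolicyServiceClient.create_alert_policy):

```python
# Hypothetical condition block mirroring the parameters above
# (AlertPolicy.Condition with a MetricThreshold).
condition = {
    "display_name": "error_rate above 5.0",  # hypothetical name
    "condition_threshold": {
        "filter": (
            'resource.type = "gce_instance" AND '
            'metric.type = "custom.googleapis.com/app/error_rate"'
        ),
        "comparison": "COMPARISON_GT",
        "threshold_value": 5.0,
        "duration": {"seconds": 300},  # 5-minute condition duration
        "aggregations": [{
            "alignment_period": {"seconds": 60},    # ~1-min alignment
            "per_series_aligner": "ALIGN_MAX",      # catch spikes
            "cross_series_reducer": "REDUCE_MEAN",  # across instances
        }],
    },
}

print(condition["condition_threshold"]["threshold_value"])
```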

Critically, ensure your policy filter matches the exact resource labels you’re using in metric ingestion. Use Metrics Explorer to construct the filter by selecting your custom metric and examining the available labels. The filter should look like:

resource.type = "gce_instance"
AND metric.type = "custom.googleapis.com/app/error_rate"

If you want to alert on specific instances, add instance-level filters. For alerting across all instances, use the cross-series reducer.

Notification Channel Setup: Verify your notification channels are properly configured and linked:

  1. Navigate to Monitoring > Alerting > Edit Policy > Notifications
  2. Confirm your notification channels (email, PagerDuty, Slack, etc.) are explicitly listed
  3. Check notification channel settings for any filtering rules that might suppress alerts
  4. Set an “Auto-close duration” so incidents resolve automatically once the condition clears
  5. Consider setting up multiple notification channels for redundancy

Validation and Troubleshooting: After implementing these fixes:

  1. Use Metrics Explorer to verify metrics are being ingested with complete labels
  2. Check the alerting policy’s “Incidents” page and its evaluation history; if it shows a “No data” state, that should disappear
  3. Temporarily lower your threshold to trigger a test alert and verify the entire pipeline
  4. Monitor the “Alerting Policies” dashboard for policy health and evaluation status
  5. Set up a synthetic test that deliberately exceeds thresholds to validate alerting
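For step 5, the synthetic test can be as simple as writing one point well above the threshold. A minimal sketch with hypothetical instance/zone/project values; with the client library installed you'd pass the result to client.create_time_series:

```python
import time

THRESHOLD = 5.0  # the policy's threshold from above

def make_spike_point(value=THRESHOLD * 2):
    # Build the TimeSeries payload for a synthetic spike; with
    # google-cloud-monitoring installed, pass it to
    # client.create_time_series(name=project_name, time_series=[series]).
    return {
        "metric": {"type": "custom.googleapis.com/app/error_rate"},
        "resource": {
            "type": "gce_instance",
            "labels": {                      # hypothetical values
                "instance_id": "test-instance",
                "zone": "us-central1-a",
                "project_id": "my-project",
            },
        },
        "points": [{
            "interval": {"end_time": {"seconds": int(time.time())}},
            "value": {"double_value": value},
        }],
    }

series = make_spike_point()
print(series["points"][0]["value"]["double_value"] > THRESHOLD)  # True
```

Keep writing the spike for longer than the condition duration (5 minutes here), otherwise the policy will never accumulate enough above-threshold data to fire.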

Additional Best Practices:

  • Write metrics at consistent intervals (every 60 seconds is optimal for most use cases)
  • Implement retry logic with exponential backoff for metric writes
  • Monitor the Monitoring API quota usage to ensure you’re not hitting rate limits
  • Use structured logging to capture metric write failures for debugging
  • Consider using OpenTelemetry for standardized metric collection and export
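For the retry bullet, a minimal backoff wrapper might look like this. A sketch: write_fn stands in for whatever function actually calls create_time_series, and flaky_write below is only a stand-in to show the behavior:

```python
import random
import time

def write_with_retry(write_fn, max_attempts=5, base_delay=1.0):
    """Call write_fn (e.g. a wrapper around create_time_series),
    retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return write_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

# Stand-in writer that fails twice, then succeeds.
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

print(write_with_retry(flaky_write, base_delay=0.01))  # ok
```

In production you'd catch only retryable errors (e.g. deadline exceeded, unavailable) rather than bare Exception, and log each failed attempt.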

By ensuring complete resource labels in your metric ingestion, properly configuring alignment and aggregation in your alerting policies, and verifying notification channel linkage, your alerts should start triggering correctly. A missing required label (here, the zone) prevents the policy from matching any time series, which typically surfaces as a “No data” condition and silently missed incidents.

If the policy shows a “No data” condition, that’s definitely a red flag. It usually means the alerting policy can’t find any time series matching your filter criteria. Double-check that the resource labels in your metric ingestion match exactly what the alerting policy is filtering for.

For gce_instance resource type, you need to provide instance_id, zone, and project_id labels. If any of these are missing or incorrect in your metric writes, the alerting policy won’t be able to match the time series. Use the Metrics Explorer to verify the exact labels on your ingested metrics and ensure your alerting policy filter matches them precisely.

Beyond resource labels, watch out for metric descriptor mismatches. Make sure your custom metric is registered with the correct value type (DOUBLE, INT64, etc.) and metric kind (GAUGE vs CUMULATIVE). If you write a value that doesn’t match the registered descriptor, the API rejects the write; if your code never checks the response, that failure goes unnoticed and looks like silent data loss.
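One cheap guard is to derive the value field from the registered descriptor (fetchable via client.get_metric_descriptor) before writing, so a mismatch fails loudly in your own code instead of at the API. A hypothetical sketch:

```python
# Hypothetical guard: pick the TypedValue field that matches the
# registered descriptor's value_type; a KeyError here means you are
# about to write a value kind the descriptor does not accept.
descriptor = {"value_type": "DOUBLE", "metric_kind": "GAUGE"}

def value_field_for(descriptor):
    return {
        "DOUBLE": "double_value",
        "INT64": "int64_value",
        "BOOL": "bool_value",
    }[descriptor["value_type"]]

print(value_field_for(descriptor))  # double_value
```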

Also, be careful with metric naming: use the custom.googleapis.com/ prefix for custom metrics and avoid special characters. And remember that Cloud Monitoring rate-limits custom metric writes (roughly one data point per time series every 10 seconds; check the current quota documentation), so points written too frequently to the same time series are rejected.
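If your writers can produce points faster than that limit, a simple per-series throttle avoids rejected writes. A sketch, assuming a 10-second minimum interval (adjust to whatever the current quota says):

```python
import time

MIN_INTERVAL = 10.0  # assumed per-series limit; verify against current quotas
_last_write = {}

def may_write(series_key, now=None):
    """Return True if enough time has passed since the last point for
    this time series; otherwise the point should be skipped or queued."""
    now = time.monotonic() if now is None else now
    last = _last_write.get(series_key)
    if last is not None and now - last < MIN_INTERVAL:
        return False
    _last_write[series_key] = now
    return True

print(may_write("error_rate/instance-1", now=0.0))   # True
print(may_write("error_rate/instance-1", now=5.0))   # False (too soon)
print(may_write("error_rate/instance-1", now=12.0))  # True
```

The series key should include everything that distinguishes a time series (metric type plus resource labels), since the limit applies per time series, not per metric.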

You were right about the resource labels! I went back and checked our metric ingestion code. We’re setting instance_id and project_id, but we were missing the zone label entirely. That explains why the alerting policy couldn’t match the time series.

I’m going to update our metric writing code to include all required resource labels. Are there any other common pitfalls with custom metric ingestion that I should be aware of?

One more thing about notification channels: even if they test successfully, check that they’re actually associated with your alerting policy. I’ve seen cases where the notification channel was created but not linked to the policy, so alerts were evaluating correctly but notifications weren’t being sent.

Go to your alerting policy and verify in the “Notifications” section that your channels are explicitly listed. Also check the notification channel settings for any filtering or routing rules that might be suppressing alerts.
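Programmatically, the thing to check is the notification_channels field on the policy (fetchable via AlertPolicyServiceClient.get_alert_policy); it should list the full resource names of your channels. A sketch with a hypothetical policy dict:

```python
# Hypothetical policy as returned by get_alert_policy: an empty
# notification_channels list means the policy evaluates and opens
# incidents, but nobody is notified.
policy = {
    "display_name": "error_rate > 5.0",
    "notification_channels": [],  # e.g. "projects/p/notificationChannels/123"
}

print("linked" if policy["notification_channels"] else "no channels linked")
```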

This sounds like a metric alignment or aggregation issue. When you create an alerting policy for custom metrics, Cloud Monitoring needs to align and aggregate the time series data before evaluating conditions. If your alignment period doesn’t match your metric ingestion frequency, you might not get the expected behavior.

Check your alerting policy configuration: what’s your alignment period and aggregation method? For custom metrics written every 60 seconds, you typically want an alignment period of 60 seconds with an aggregation like ALIGN_MEAN or ALIGN_MAX depending on your use case.