Cloud Monitoring alerts not triggering for custom metrics, missing critical incidents in production

We’ve configured alerting policies in Cloud Monitoring for custom metrics from our application, but alerts aren’t triggering even when metric values clearly exceed our defined thresholds. This has caused us to miss several critical incidents over the past week.

Our setup involves custom metrics written via the Monitoring API:

metric = monitoring_v3.TimeSeries()
metric.metric.type = 'custom.googleapis.com/app/error_rate'
metric.resource.type = 'gce_instance'
metric.points = [monitoring_v3.Point({
    'interval': {'end_time': {'seconds': int(time.time())}},
    'value': {'double_value': error_rate}
})]

The metrics are appearing in Metrics Explorer and the values are correct, but our alerting policy (threshold > 5.0 for 5 minutes) never fires. We’ve verified the notification channels are working by testing them manually. The policy shows as “Active” in the console but no incidents are created.

Has anyone encountered issues with Cloud Monitoring alerting policies not triggering for custom metrics? We’re on GCP and this is becoming a major operational risk.

There are three areas to check here: custom metric ingestion, alerting policy configuration, and notification channel setup.

Custom Metric Ingestion: The root cause of your issue is incomplete resource labels in your metric writes. For gce_instance resource type, you must provide all required labels. Here’s the corrected ingestion code:

import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

# Create time series with complete resource labels
series = monitoring_v3.TimeSeries()
series.metric.type = 'custom.googleapis.com/app/error_rate'
series.resource.type = 'gce_instance'
series.resource.labels['instance_id'] = instance_id
series.resource.labels['zone'] = zone  # REQUIRED - was missing
series.resource.labels['project_id'] = project_id

point = monitoring_v3.Point({
    'interval': {'end_time': {'seconds': int(time.time())}},
    'value': {'double_value': error_rate}
})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])

Before writing metrics, ensure your custom metric descriptor is properly created with the correct value type and metric kind. For error rates, use GAUGE metric kind with DOUBLE value type.
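If you haven't explicitly registered the descriptor, this is roughly the shape it needs. A sketch in plain-dict form (the description text is my wording; with google-cloud-monitoring installed you'd pass this to client.create_metric_descriptor):

```python
# Sketch of the metric descriptor to register before writing points.
# With the client library installed you would call:
#   client.create_metric_descriptor(name=f"projects/{project_id}",
#                                   metric_descriptor=descriptor)
descriptor = {
    "type": "custom.googleapis.com/app/error_rate",
    "metric_kind": "GAUGE",   # point-in-time value, not a running total
    "value_type": "DOUBLE",   # must match the double_value you write
    "unit": "1",              # dimensionless ratio
    "description": "Application error rate (errors per request)",
}

print(descriptor["metric_kind"], descriptor["value_type"])
```

If the descriptor was auto-created from your first write, fetch it with client.get_metric_descriptor and confirm the kind and value type match what you intend.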

Alerting Policy Configuration: Your alerting policy needs alignment and aggregation settings that match your metric ingestion frequency. Assuming you're writing roughly every 30 to 60 seconds, configure your policy with these parameters:

  • Alignment Period: 1 minute (60 seconds)
  • Per-series aligner: ALIGN_MEAN or ALIGN_MAX (use MAX for error rates to catch spikes)
  • Cross-series reducer: REDUCE_MEAN (if aggregating across multiple instances)
  • Condition Duration: 5 minutes (as you specified)
  • Threshold: > 5.0
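If you manage policies in code rather than the console, the condition portion would look roughly like this. A sketch in the dict shape the Cloud Monitoring API accepts (the display name is made up; you'd embed this in an AlertPolicy passed to AlertPolicyServiceClient.create_alert_policy):

```python
# Hypothetical condition block mirroring the parameters above
# (AlertPolicy.Condition with a MetricThreshold).
condition = {
    "display_name": "error_rate above 5.0",  # hypothetical name
    "condition_threshold": {
        "filter": (
            'resource.type = "gce_instance" AND '
            'metric.type = "custom.googleapis.com/app/error_rate"'
        ),
        "comparison": "COMPARISON_GT",
        "threshold_value": 5.0,
        "duration": {"seconds": 300},  # 5-minute condition duration
        "aggregations": [{
            "alignment_period": {"seconds": 60},    # ~1-min alignment
            "per_series_aligner": "ALIGN_MAX",      # catch spikes
            "cross_series_reducer": "REDUCE_MEAN",  # across instances
        }],
    },
}

print(condition["condition_threshold"]["threshold_value"])
```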

Critically, ensure your policy filter matches the exact resource labels you’re using in metric ingestion. Use Metrics Explorer to construct the filter by selecting your custom metric and examining the available labels. The filter should look like:

resource.type = "gce_instance"
AND metric.type = "custom.googleapis.com/app/error_rate"

If you want to alert on specific instances, add instance-level filters. For alerting across all instances, use the cross-series reducer.

Notification Channel Setup: Verify your notification channels are properly configured and linked:

  1. Navigate to Monitoring > Alerting > Edit Policy > Notifications
  2. Confirm your notification channels (email, PagerDuty, Slack, etc.) are explicitly listed
  3. Check notification channel settings for any filtering rules that might suppress alerts
  4. Set an “Auto-close duration” so incidents resolve automatically once the condition clears
  5. Consider setting up multiple notification channels for redundancy

Validation and Troubleshooting: After implementing these fixes:

  1. Use Metrics Explorer to verify metrics are being ingested with complete labels
  2. Check the alerting policy’s “Incidents” page and its evaluation history; if it shows a “No data” state, that should disappear
  3. Temporarily lower your threshold to trigger a test alert and verify the entire pipeline
  4. Monitor the “Alerting Policies” dashboard for policy health and evaluation status
  5. Set up a synthetic test that deliberately exceeds thresholds to validate alerting
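For step 5, the synthetic test can be as simple as writing one point well above the threshold. A minimal sketch with hypothetical instance/zone/project values; with the client library installed you'd pass the result to client.create_time_series:

```python
import time

THRESHOLD = 5.0  # the policy's threshold from above

def make_spike_point(value=THRESHOLD * 2):
    # Build the TimeSeries payload for a synthetic spike; with
    # google-cloud-monitoring installed, pass it to
    # client.create_time_series(name=project_name, time_series=[series]).
    return {
        "metric": {"type": "custom.googleapis.com/app/error_rate"},
        "resource": {
            "type": "gce_instance",
            "labels": {                      # hypothetical values
                "instance_id": "test-instance",
                "zone": "us-central1-a",
                "project_id": "my-project",
            },
        },
        "points": [{
            "interval": {"end_time": {"seconds": int(time.time())}},
            "value": {"double_value": value},
        }],
    }

series = make_spike_point()
print(series["points"][0]["value"]["double_value"] > THRESHOLD)  # True
```

Keep writing the spike for longer than the condition duration (5 minutes here), otherwise the policy will never accumulate enough above-threshold data to fire.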

Additional Best Practices:

  • Write metrics at consistent intervals (every 60 seconds is optimal for most use cases)
  • Implement retry logic with exponential backoff for metric writes
  • Monitor the Monitoring API quota usage to ensure you’re not hitting rate limits
  • Use structured logging to capture metric write failures for debugging
  • Consider using OpenTelemetry for standardized metric collection and export
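For the retry bullet, a minimal backoff wrapper might look like this. A sketch: write_fn stands in for whatever function actually calls create_time_series, and flaky_write below is only a stand-in to show the behavior:

```python
import random
import time

def write_with_retry(write_fn, max_attempts=5, base_delay=1.0):
    """Call write_fn (e.g. a wrapper around create_time_series),
    retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return write_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

# Stand-in writer that fails twice, then succeeds.
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

print(write_with_retry(flaky_write, base_delay=0.01))  # ok
```

In production you'd catch only retryable errors (e.g. deadline exceeded, unavailable) rather than bare Exception, and log each failed attempt.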

By ensuring complete resource labels in your metric ingestion, properly configuring alignment and aggregation in your alerting policies, and verifying notification channel linkage, your alerts should start triggering correctly. A missing required label (here, the zone) prevents the policy from matching any time series, which typically surfaces as a “No data” condition and silently missed incidents.

If the policy shows a “No data” condition, that’s definitely a red flag. It usually means the alerting policy can’t find any time series matching your filter criteria. Double-check that the resource labels in your metric ingestion match exactly what the alerting policy is filtering for.

For gce_instance resource type, you need to provide instance_id, zone, and project_id labels. If any of these are missing or incorrect in your metric writes, the alerting policy won’t be able to match the time series. Use the Metrics Explorer to verify the exact labels on your ingested metrics and ensure your alerting policy filter matches them precisely.

Beyond resource labels, watch out for metric descriptor mismatches. Make sure your custom metric is registered with the correct value type (DOUBLE, INT64, etc.) and metric kind (GAUGE vs CUMULATIVE). If you write a value that doesn’t match the registered descriptor, the API rejects the write; if your code never checks the response, that failure goes unnoticed and looks like silent data loss.
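One cheap guard is to derive the value field from the registered descriptor (fetchable via client.get_metric_descriptor) before writing, so a mismatch fails loudly in your own code instead of at the API. A hypothetical sketch:

```python
# Hypothetical guard: pick the TypedValue field that matches the
# registered descriptor's value_type; a KeyError here means you are
# about to write a value kind the descriptor does not accept.
descriptor = {"value_type": "DOUBLE", "metric_kind": "GAUGE"}

def value_field_for(descriptor):
    return {
        "DOUBLE": "double_value",
        "INT64": "int64_value",
        "BOOL": "bool_value",
    }[descriptor["value_type"]]

print(value_field_for(descriptor))  # double_value
```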

Also, be careful with metric naming: use the custom.googleapis.com/ prefix for custom metrics and avoid special characters. And remember that Cloud Monitoring rate-limits custom metric writes (roughly one data point per time series every 10 seconds; check the current quota documentation), so points written too frequently to the same time series are rejected.
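If your writers can produce points faster than that limit, a simple per-series throttle avoids rejected writes. A sketch, assuming a 10-second minimum interval (adjust to whatever the current quota says):

```python
import time

MIN_INTERVAL = 10.0  # assumed per-series limit; verify against current quotas
_last_write = {}

def may_write(series_key, now=None):
    """Return True if enough time has passed since the last point for
    this time series; otherwise the point should be skipped or queued."""
    now = time.monotonic() if now is None else now
    last = _last_write.get(series_key)
    if last is not None and now - last < MIN_INTERVAL:
        return False
    _last_write[series_key] = now
    return True

print(may_write("error_rate/instance-1", now=0.0))   # True
print(may_write("error_rate/instance-1", now=5.0))   # False (too soon)
print(may_write("error_rate/instance-1", now=12.0))  # True
```

The series key should include everything that distinguishes a time series (metric type plus resource labels), since the limit applies per time series, not per metric.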

You were right about the resource labels! I went back and checked our metric ingestion code. We’re setting instance_id and project_id, but we were missing the zone label entirely. That explains why the alerting policy couldn’t match the time series.

I’m going to update our metric writing code to include all required resource labels. Are there any other common pitfalls with custom metric ingestion that I should be aware of?

One more thing about notification channels: even if they test successfully, check that they’re actually associated with your alerting policy. I’ve seen cases where the notification channel was created but not linked to the policy, so alerts were evaluating correctly but notifications weren’t being sent.

Go to your alerting policy and verify in the “Notifications” section that your channels are explicitly listed. Also check the notification channel settings for any filtering or routing rules that might be suppressing alerts.
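Programmatically, the thing to check is the notification_channels field on the policy (fetchable via AlertPolicyServiceClient.get_alert_policy); it should list the full resource names of your channels. A sketch with a hypothetical policy dict:

```python
# Hypothetical policy as returned by get_alert_policy: an empty
# notification_channels list means the policy evaluates and opens
# incidents, but nobody is notified.
policy = {
    "display_name": "error_rate > 5.0",
    "notification_channels": [],  # e.g. "projects/p/notificationChannels/123"
}

print("linked" if policy["notification_channels"] else "no channels linked")
```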

This sounds like a metric alignment or aggregation issue. When you create an alerting policy for custom metrics, Cloud Monitoring needs to align and aggregate the time series data before evaluating conditions. If your alignment period doesn’t match your metric ingestion frequency, you might not get the expected behavior.

Check your alerting policy configuration: what’s your alignment period and aggregation method? For custom metrics written every 60 seconds, you typically want an alignment period of 60 seconds with an aggregation like ALIGN_MEAN or ALIGN_MAX depending on your use case.