Let me provide a comprehensive solution to fix your Cloud Monitoring alerting for ERP CPU spikes.
Root Cause Analysis:
Based on your configuration and symptoms, the most likely issues are:
- Metric Filter Specificity: Your regex filter may not be matching all intended instances
- Alignment Window Configuration: Aggregation settings may be smoothing out spikes
- Condition Duration: 300-second requirement may be too long for spike detection
- Notification Channel State: Channels may be in failed state without obvious indication
Complete Solution:
Part 1: Cloud Monitoring Metric Filters
Understanding Metric Filtering:
Cloud Monitoring filters use MQL (Monitoring Query Language) or filter expressions. Your current filter:
resource.labels.instance_name=~"erp-app-.*"
This filter has several likely problems:
- instance_name is not a resource label on gce_instance (its resource labels are only project_id, instance_id, and zone); on Compute metrics it is a metric label, so resource.labels.instance_name matches nothing
- The =~ shorthand is not part of the documented Monitoring filter syntax; use monitoring.regex.full_match() for anchored regex matching
- Managed instance group members get generated name suffixes that the pattern must allow for
Corrected Filter Configuration:
resource.type="gce_instance" AND
metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")
(If a metric lacks the instance_name label, metadata.system_labels.name=monitoring.regex.full_match("erp-app-.*") is an alternative.)
Or add zone-based filtering if the instances live in specific zones:
resource.type="gce_instance" AND
resource.labels.zone=starts_with("us-central1-") AND
metric.labels.instance_name=starts_with("erp-app")
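The anchored-match semantics matter here, and they can be illustrated with plain Python regular expressions (a rough analogy; Cloud Monitoring uses RE2, but for this pattern the behavior matches `re.fullmatch`):

```python
import re

pattern = "erp-app-.*"
# Hypothetical instance names for illustration
names = ["erp-app-01", "erp-app-frontend-2", "erp-app", "staging-erp-app-01"]

# monitoring.regex.full_match anchors at both ends, like re.fullmatch
full = [n for n in names if re.fullmatch(pattern, n)]
# An unanchored search would also accept names that merely contain the pattern
partial = [n for n in names if re.search(pattern, n)]

print(full)     # "erp-app" (no trailing dash) does not full-match
print(partial)  # "staging-erp-app-01" sneaks in with unanchored matching
```

Note that a bare "erp-app" instance would be missed by "erp-app-.*" entirely, while unanchored matching would wrongly pull in staging instances.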
Verify Filter Matches Instances:
Test in Metrics Explorer first:
# List instances that should match
gcloud compute instances list \
--filter="name~'erp-app-.*'" \
--format="table(name,zone,status)"
Then verify metrics exist for these instances:
metric.type="compute.googleapis.com/instance/cpu/utilization"
resource.type="gce_instance"
metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")
If no data appears, either the filter is wrong or the metric is not being written for those instances.
Part 2: Alerting Policy Configuration
Optimal Alert Policy Structure:
Create a new alerting policy with proper configuration:
displayName: "ERP CPU Utilization High"
combiner: OR
conditions:
  - displayName: "CPU > 80% for 2 minutes"
    conditionThreshold:
      filter: >-
        metric.type="compute.googleapis.com/instance/cpu/utilization"
        resource.type="gce_instance"
        metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")
      comparison: COMPARISON_GT
      thresholdValue: 0.8
      duration: 120s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_MEAN
          crossSeriesReducer: REDUCE_NONE
      trigger:
        count: 1
Key Configuration Decisions:
Alignment Period (60s):
- Balances spike detection vs noise filtering
- Shorter periods (30s) catch brief spikes but may cause false positives
- Longer periods (300s) smooth data too much, missing real incidents
- 60s is a sensible default for CPU monitoring
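The smoothing effect is easy to quantify. A sketch with hypothetical 10-second CPU samples containing a 2-minute spike, mean-aligned (ALIGN_MEAN) into 60s versus 300s buckets:

```python
# Hypothetical 10s CPU samples: baseline 0.30, with a 2-minute spike at 0.95
samples = [0.30] * 18 + [0.95] * 12 + [0.30] * 18  # 8 minutes of data

def align_mean(series, period_s, sample_interval_s=10):
    """Average consecutive samples into buckets of period_s seconds (ALIGN_MEAN)."""
    n = period_s // sample_interval_s
    return [sum(series[i:i + n]) / n for i in range(0, len(series) - n + 1, n)]

aligned_60 = align_mean(samples, 60)    # 60s buckets: spike minutes stay at 0.95
aligned_300 = align_mean(samples, 300)  # 300s buckets: spike diluted to ~0.56

print(max(aligned_60), max(aligned_300))
```

With 300s alignment the aligned maximum never crosses the 0.8 threshold, so the policy cannot fire on this spike no matter what duration is configured.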
Per-Series Aligner (ALIGN_MEAN):
- ALIGN_MEAN: Average CPU over alignment period (recommended for CPU)
- ALIGN_MAX: Highest CPU in period (catches brief spikes, more false positives)
- ALIGN_MIN: Lowest CPU (not useful for high CPU alerts)
Cross-Series Reducer (REDUCE_NONE):
- REDUCE_NONE: Alert on each instance independently (recommended)
- REDUCE_MEAN: Alert only if average across all instances exceeds threshold (misses individual spikes)
- REDUCE_MAX: Alert when any instance exceeds, collapsed into one fleet-level series (you lose which instance breached)
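The reducer trade-off in miniature, with hypothetical aligned values for one hot instance out of four:

```python
# Hypothetical aligned CPU values for four instances at one evaluation point
instances = {"erp-app-1": 0.95, "erp-app-2": 0.35, "erp-app-3": 0.40, "erp-app-4": 0.30}
threshold = 0.8

reduce_mean = sum(instances.values()) / len(instances)  # REDUCE_MEAN: fleet average
reduce_max = max(instances.values())                    # REDUCE_MAX: hottest instance

# REDUCE_NONE: each series is evaluated independently, so the breaching
# instance is identified by name
breaching = [name for name, cpu in instances.items() if cpu > threshold]

print(reduce_mean, reduce_max, breaching)
```

Here REDUCE_MEAN sees only 0.50 and stays silent even though erp-app-1 is pinned at 95%, which is exactly the failure mode to avoid for per-instance CPU alerts.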
Duration (120s vs 300s):
Your 300s duration requires 5 consecutive minutes above threshold. This is too long for spike detection:
- Brief spikes (1-3 minutes) won’t trigger alerts
- By the time alert fires, incident may be over
- Users already experiencing impact
Recommended: 120s (2 minutes)
- Catches sustained issues
- Filters out momentary spikes
- Alerts before significant user impact
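The duration check requires every aligned point across the window to breach. A sketch of that evaluation over hypothetical 60s-aligned values, comparing the 120s and 300s settings:

```python
def sustained_breach(points, threshold, duration_points):
    """Return the index at which the last duration_points consecutive aligned
    values all exceed threshold (i.e., when the alert would fire), else None."""
    run = 0
    for i, v in enumerate(points):
        run = run + 1 if v > threshold else 0
        if run >= duration_points:
            return i
    return None

# Hypothetical 60s-aligned CPU: a 3-minute breach starting at minute 2
points = [0.4, 0.5, 0.85, 0.9, 0.88, 0.5, 0.45]

fires_120s = sustained_breach(points, 0.8, 2)  # 120s duration = 2 aligned points
fires_300s = sustained_breach(points, 0.8, 5)  # 300s duration = 5 aligned points

print(fires_120s, fires_300s)  # the 300s policy never fires on this breach
```

The 120s policy fires on the second breaching minute; the 300s policy never fires at all, because the 3-minute breach ends before five consecutive points accumulate.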
Part 3: Compute Engine CPU Metrics Understanding
CPU Utilization Metric Details:
metric.type: compute.googleapis.com/instance/cpu/utilization
value_type: DOUBLE
metric_kind: GAUGE
unit: 10^2.%
range: 0.0 to 1.0 (0% to 100%)
Critical Understanding:
- Reported as decimal (0.8 = 80%)
- Sampled every 60 seconds by default
- Represents utilization across all vCPUs (averaged)
- Includes user, system, and steal time
CPU Metric Collection Delay:
- Metric ingestion lag: 60-120 seconds typical
- Alignment processing: 10-30 seconds
- Alert evaluation: 30-60 seconds
- Total alert delay: 2-4 minutes from actual spike start
This means your 20-minute spike should definitely have triggered alerts.
Alternative CPU Metrics for Better Detection:
Consider also monitoring:
- CPU Reserved Cores (for autoscaling scenarios):
metric.type="compute.googleapis.com/instance/cpu/reserved_cores"
- CPU Usage Time (CPU-seconds per sampling interval, a DELTA metric rather than a percentage):
metric.type="compute.googleapis.com/instance/cpu/usage_time"
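If you work with cpu/usage_time, you can derive a utilization ratio yourself: CPU-seconds consumed divided by CPU-seconds available. A sketch with hypothetical numbers:

```python
def utilization_from_usage_time(cpu_seconds_delta, interval_s, vcpus):
    """Approximate utilization ratio: CPU-seconds consumed in the sampling
    interval, divided by the CPU-seconds available (interval * vCPU count)."""
    return cpu_seconds_delta / (interval_s * vcpus)

# Hypothetical: 192 CPU-seconds consumed over a 60s interval on a 4-vCPU instance
util = utilization_from_usage_time(192.0, 60, 4)
print(util)  # 0.8, equivalent to the 80% threshold on cpu/utilization
```

This conversion is useful when comparing usage_time-based dashboards against the 0.8 threshold used on cpu/utilization.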
Complete Working Alert Policy:
Save the policy YAML above as erp-cpu-policy.yaml, adding notification channels and documentation to it (flag support varies by gcloud release; check gcloud alpha monitoring policies create --help):
notificationChannels:
  - projects/PROJECT_ID/notificationChannels/CHANNEL_ID_1
  - projects/PROJECT_ID/notificationChannels/CHANNEL_ID_2
documentation:
  mimeType: text/markdown
  content: >-
    ERP application server CPU utilization exceeded 80%.
    Check application logs and consider scaling up.
    Runbook: https://wiki.company.com/erp-cpu-high
Then create the policy from the file:
gcloud alpha monitoring policies create \
  --policy-from-file=erp-cpu-policy.yaml
Notification Channel Verification:
# List all notification channels
gcloud alpha monitoring channels list
# Inspect a specific channel, including its verification status
gcloud alpha monitoring channels describe CHANNEL_ID
# Email and SMS channels must show verificationStatus: VERIFIED before they deliver
If notifications still don’t work:
- Check email spam filters
- Verify Slack webhook is active
- Check PagerDuty integration key
- Review notification rate limits (Cloud Monitoring throttles repeated notifications per channel)
- Check Cloud Logging audit logs for Monitoring API errors (a starting-point query; exact fields depend on which audit logs are enabled):
protoPayload.serviceName="monitoring.googleapis.com"
severity>=ERROR
Incident History Debugging:
Check whether incidents were created:
gcloud alpha monitoring policies list \
  --format="table(name,displayName,enabled)"
gcloud does not expose incident history; review it in the console under Monitoring > Alerting > Incidents.
If incidents exist but notifications weren’t sent:
- Notification channel configuration issue
- Channel in failed state
- Rate limiting applied
If no incidents exist:
- Metric filter doesn’t match instances
- Threshold/duration configuration prevents triggering
- Metric data not being collected
Testing Your Alert Policy:
Simulate CPU Load:
SSH to an ERP instance and run:
# Generate CPU load for 5 minutes (install stress first, e.g. sudo apt-get install stress)
stress --cpu 8 --timeout 300s
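If stress isn't available on the image, a minimal Python stand-in can pin every vCPU with busy loops (a sketch; run it directly on the instance):

```python
# Fallback CPU load generator for instances without the stress tool.
import multiprocessing
import time

def burn(seconds):
    """Busy-wait to pin one core for the given number of seconds."""
    end = time.time() + seconds
    while time.time() < end:
        pass

def generate_load(seconds):
    """Spin one busy-loop process per vCPU, then wait for them to finish."""
    workers = [multiprocessing.Process(target=burn, args=(seconds,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

# generate_load(300)  # ~5 minutes, mirroring the stress command above
```

Uncomment the final line (or call generate_load from a shell one-liner) to produce a sustained spike comparable to the stress invocation.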
Monitor alert status in the console under Monitoring > Alerting > Incidents (gcloud does not currently list incidents).
You should see:
- Incident created within 3-4 minutes
- Notification sent to all channels
- Incident documented in Cloud Monitoring UI
Recommended Multi-Tier Alerting Strategy:
- Warning Alert (70% CPU, 5 min duration) → Email/Slack
- Critical Alert (80% CPU, 2 min duration) → PagerDuty
- Emergency Alert (95% CPU, 1 min duration) → SMS + PagerDuty
This provides escalating awareness before SLA impact.
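The three tiers can be kept as data and expanded into policy configs so thresholds stay consistent across environments. A sketch (tier values from the list above; channel names are placeholders):

```python
# Tiered alerting spec expanded into alert policy dicts (channel IDs are placeholders)
TIERS = [
    {"name": "Warning",   "threshold": 0.70, "duration_s": 300, "channels": ["email", "slack"]},
    {"name": "Critical",  "threshold": 0.80, "duration_s": 120, "channels": ["pagerduty"]},
    {"name": "Emergency", "threshold": 0.95, "duration_s": 60,  "channels": ["sms", "pagerduty"]},
]

FILTER = ('metric.type="compute.googleapis.com/instance/cpu/utilization" '
          'resource.type="gce_instance" '
          'metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")')

def build_policy(tier):
    """Build one AlertPolicy-shaped dict for a tier (same shape as the YAML above)."""
    return {
        "displayName": f"ERP CPU {tier['name']}",
        "combiner": "OR",
        "conditions": [{
            "displayName": f"CPU > {tier['threshold']:.0%} for {tier['duration_s']}s",
            "conditionThreshold": {
                "filter": FILTER,
                "comparison": "COMPARISON_GT",
                "thresholdValue": tier["threshold"],
                "duration": f"{tier['duration_s']}s",
                "aggregations": [{
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN",
                    "crossSeriesReducer": "REDUCE_NONE",
                }],
            },
        }],
    }

policies = [build_policy(t) for t in TIERS]
print([p["displayName"] for p in policies])
```

Each generated dict can be dumped to YAML and fed to gcloud alpha monitoring policies create --policy-from-file, keeping the three tiers in lockstep.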
Ongoing Monitoring:
Set up dashboard to visualize:
- CPU utilization per instance
- Alert policy evaluation status
- Notification delivery success rate
- Incident open/close timeline
With these corrections, your Cloud Monitoring alerts should trigger reliably on ERP CPU spikes, reducing SLA breaches and user impact.