Cloud Monitoring alerts not triggering on ERP Compute Engine CPU spikes after metric filter change

We configured Cloud Monitoring alerting policies to notify us when ERP application servers exceed 80% CPU utilization, but alerts aren’t triggering even though we can see CPU spikes in the metrics explorer that clearly exceed the threshold.

Our alert policy configuration:


metric: compute.googleapis.com/instance/cpu/utilization
filter: resource.labels.instance_name=~"erp-app-.*"
threshold: 0.8
duration: 300s

Last week we had a production incident where ERP servers hit 95% CPU for over 20 minutes during batch processing, but no alerts were sent. We only discovered the issue when users reported slow response times. The metrics explorer shows the spike clearly, so the data is being collected.

This is a serious SLA risk - we need reliable alerting to catch performance issues before users are impacted. What could prevent Cloud Monitoring alerts from triggering when thresholds are clearly exceeded?

Another possibility: check whether you have multiple conditions in your policy combined with AND logic. If one condition isn’t met, the entire policy won’t trigger. Also, even if notifications aren’t sent, incidents should still be created in Cloud Monitoring. Check the incident history to see whether incidents were created but notifications failed, or whether no incidents were created at all - that tells you whether you have a detection problem or a notification problem.

Let me provide a comprehensive solution to fix your Cloud Monitoring alerting for ERP CPU spikes.

Root Cause Analysis:

Based on your configuration and symptoms, the most likely issues are:

  1. Metric Filter Specificity: Your regex filter may not be matching all intended instances
  2. Alignment Window Configuration: Aggregation settings may be smoothing out spikes
  3. Condition Duration: 300-second requirement may be too long for spike detection
  4. Notification Channel State: Channels may be in failed state without obvious indication

Complete Solution:

Part 1: Cloud Monitoring Metric Filters

Understanding Metric Filtering:

Cloud Monitoring alerting conditions use the Monitoring filter language (MQL is a separate, more expressive query language). Your current filter:


resource.labels.instance_name=~"erp-app-.*"

This filter has two real problems, plus one thing to verify:

  • For the gce_instance resource type, instance_name is a metric label, not a resource label (the resource labels are project_id, instance_id, and zone), so resource.labels.instance_name matches no time series - and a policy whose filter matches nothing never fires
  • The =~ operator is not part of the Monitoring filter language; regex matching uses monitoring.regex.full_match(), which must match the entire label value, not just a substring
  • Managed instance group members get generated name suffixes, so confirm the pattern still covers every instance
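To see the full-match semantics concretely, here is a small sketch using Python's re module, whose re.fullmatch is anchored the same way as monitoring.regex.full_match (the instance names are made up):

```python
import re

# monitoring.regex.full_match requires the pattern to cover the whole
# label value; Python's re.fullmatch behaves the same way.
pattern = r"erp-app-.*"

names = ["erp-app-01", "erp-app-batch-02", "ERP-APP-03", "prod-erp-app-01"]
matched = [n for n in names if re.fullmatch(pattern, n)]

print(matched)  # case-sensitive and anchored: the last two names don't match
```

Matching is case-sensitive, so ERP-APP-03 is skipped, and prod-erp-app-01 fails because the pattern must match from the first character onward.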

Corrected Filter Configuration:


resource.type="gce_instance" AND
metric.type="compute.googleapis.com/instance/cpu/utilization" AND
metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")

Or use zone-based filtering if instances are in specific zones:


resource.type="gce_instance" AND
resource.labels.zone=starts_with("us-central1-") AND
metric.labels.instance_name=starts_with("erp-app")

Verify Filter Matches Instances:

Test in Metrics Explorer first:

# List instances that should match
gcloud compute instances list \
  --filter="name~'erp-app-.*'" \
  --format="table(name,zone,status)"

Then verify metrics exist for these instances:


metric.type="compute.googleapis.com/instance/cpu/utilization"
resource.type="gce_instance"
metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")

If no data appears, your filter is incorrect.

Part 2: Alerting Policy Configuration

Optimal Alert Policy Structure:

Create a new alerting policy with proper configuration:

displayName: "ERP CPU Utilization High"
conditions:
  - displayName: "CPU > 80% for 2 minutes"
    conditionThreshold:
      filter: |
        metric.type="compute.googleapis.com/instance/cpu/utilization"
        resource.type="gce_instance"
        metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")
      comparison: COMPARISON_GT
      thresholdValue: 0.8
      duration: 120s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_MEAN
          crossSeriesReducer: REDUCE_NONE
      trigger:
        count: 1

Key Configuration Decisions:

Alignment Period (60s):

  • Balances spike detection vs noise filtering
  • Shorter periods (30s) catch brief spikes but may cause false positives
  • Longer periods (300s) smooth data too much, missing real incidents
  • 60s is a good default for CPU monitoring
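The smoothing trade-off is easy to demonstrate numerically. A minimal sketch in plain Python (sample values are hypothetical) comparing a 60s and a 300s alignment window over the same two-minute spike:

```python
# One CPU sample per minute: a 2-minute spike to 95% on a 30% baseline.
samples = [0.30, 0.30, 0.30, 0.95, 0.95, 0.30, 0.30]

def align_mean(series, window):
    """ALIGN_MEAN: average consecutive samples over the alignment window."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

aligned_60s = align_mean(samples, 1)   # 60s alignment = 1 sample per point
aligned_300s = align_mean(samples, 5)  # 300s alignment = 5 samples per point

print(max(aligned_60s))   # 0.95 -> clears the 0.8 threshold
print(max(aligned_300s))  # ~0.56 -> the spike is averaged away, no breach
```

With a 300s alignment period, the two-minute spike never produces an aligned point above the threshold, so the condition can never fire regardless of the duration setting.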

Per-Series Aligner (ALIGN_MEAN):

  • ALIGN_MEAN: Average CPU over alignment period (recommended for CPU)
  • ALIGN_MAX: Highest CPU in period (catches brief spikes, more false positives)
  • ALIGN_MIN: Lowest CPU (not useful for high CPU alerts)

Cross-Series Reducer (REDUCE_NONE):

  • REDUCE_NONE: Alert on each instance independently (recommended)
  • REDUCE_MEAN: Alert only if average across all instances exceeds threshold (misses individual spikes)
  • REDUCE_MAX: Alert if any instance exceeds (good for fleet-wide awareness)
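A quick numeric sketch of why REDUCE_MEAN can hide a single hot instance (the per-instance values are hypothetical):

```python
# Aligned CPU values for four ERP instances at the same timestamp.
cpu_by_instance = {
    "erp-app-01": 0.95,  # one saturated instance
    "erp-app-02": 0.30,
    "erp-app-03": 0.30,
    "erp-app-04": 0.30,
}
THRESHOLD = 0.8

# REDUCE_NONE: each time series is evaluated independently.
breaching = [name for name, cpu in cpu_by_instance.items() if cpu > THRESHOLD]

# REDUCE_MEAN: only the fleet-wide average is evaluated.
fleet_mean = sum(cpu_by_instance.values()) / len(cpu_by_instance)

print(breaching)   # ['erp-app-01'] -> REDUCE_NONE raises an incident
print(fleet_mean)  # ~0.46 -> REDUCE_MEAN stays below the threshold
```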

Duration (120s vs 300s): Your 300s duration requires 5 consecutive minutes above threshold. This is too long for spike detection:

  • Brief spikes (1-3 minutes) won’t trigger alerts
  • By the time alert fires, incident may be over
  • Users already experiencing impact

Recommended: 120s (2 minutes)

  • Catches sustained issues
  • Filters out momentary spikes
  • Alerts before significant user impact
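Condition evaluation can be sketched as a consecutive-breach counter over 60s-aligned points - a simplification of the real evaluator, using hypothetical data:

```python
def alert_fires(points, threshold=0.8, duration_points=2):
    """True once `duration_points` consecutive aligned points exceed the
    threshold (a 120s duration = 2 points at a 60s alignment period)."""
    streak = 0
    for p in points:
        streak = streak + 1 if p > threshold else 0
        if streak >= duration_points:
            return True
    return False

one_minute_spike = [0.3, 0.95, 0.3, 0.3]
three_minute_spike = [0.3, 0.95, 0.95, 0.95, 0.3]

print(alert_fires(one_minute_spike))    # False: shorter than the duration
print(alert_fires(three_minute_spike))  # True: sustained breach

# With a 300s duration (5 points), even a 3-minute breach stays silent:
print(alert_fires(three_minute_spike, duration_points=5))  # False
```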

Part 3: Compute Engine CPU Metrics Understanding

CPU Utilization Metric Details:


metric.type: compute.googleapis.com/instance/cpu/utilization
value_type: DOUBLE
metric_kind: GAUGE
unit: 10^2.%
range: 0.0 to 1.0 (0% to 100%)

Critical Understanding:

  • Reported as decimal (0.8 = 80%)
  • Sampled every 60 seconds by default
  • Represents utilization across all vCPUs (averaged)
  • Includes user, system, and steal time

CPU Metric Collection Delay:

  • Metric ingestion lag: 60-120 seconds typical
  • Alignment processing: 10-30 seconds
  • Alert evaluation: 30-60 seconds
  • Total alert delay: 2-4 minutes from actual spike start

This means your 20-minute spike should definitely have triggered alerts.

Alternative CPU Metrics for Better Detection:

Consider also monitoring:

  1. CPU Reserved Cores (for autoscaling scenarios):

metric.type="compute.googleapis.com/instance/cpu/reserved_cores"

  2. CPU Usage Time (absolute seconds, not percentage):

metric.type="compute.googleapis.com/instance/cpu/usage_time"

Complete Working Alert Policy:

Create this policy via gcloud. The per-condition flags of gcloud alpha monitoring policies create vary between releases, so the most reliable approach is --policy-from-file. Extend the policy from Part 2 with a combiner, notification channels, and documentation, save it as erp-cpu-policy.yaml, and create it:

combiner: OR
notificationChannels:
  - projects/PROJECT_ID/notificationChannels/CHANNEL_ID_1
  - projects/PROJECT_ID/notificationChannels/CHANNEL_ID_2
documentation:
  content: "ERP application server CPU utilization exceeded 80%. Check application logs and consider scaling up. Runbook: https://wiki.company.com/erp-cpu-high"
  mimeType: text/markdown

gcloud alpha monitoring policies create --policy-from-file=erp-cpu-policy.yaml

Notification Channel Verification:

# List all notification channels
gcloud beta monitoring channels list

# Inspect a specific channel; check the enabled and verificationStatus fields
gcloud beta monitoring channels describe CHANNEL_ID

If notifications still don’t work:

  1. Check email spam filters
  2. Verify Slack webhook is active
  3. Check PagerDuty integration key
  4. Review notification channel rate limits (repeated notifications to the same channel can be throttled)
  5. Open the incident in the Cloud Monitoring UI (Alerting > Incidents) and check which channels it records as notified

Incident History Debugging:

Check if incidents were created:

# Confirm the policy exists and is enabled
gcloud alpha monitoring policies list --format="table(name,displayName,enabled)"

gcloud does not expose incident history; review it in the Cloud Monitoring UI under Alerting > Incidents, filtered to this policy.

If incidents exist but notifications weren’t sent:

  • Notification channel configuration issue
  • Channel in failed state
  • Rate limiting applied

If no incidents exist:

  • Metric filter doesn’t match instances
  • Threshold/duration configuration prevents triggering
  • Metric data not being collected

Testing Your Alert Policy:

Simulate CPU Load:

SSH to an ERP instance and run:

# Generate CPU load for 5 minutes (install stress first, e.g. sudo apt-get install stress)
stress --cpu 8 --timeout 300s

Monitor alert status in the Cloud Monitoring UI under Alerting > Incidents; an incident for the policy should open while the load runs.

You should see:

  1. Incident created within 3-4 minutes
  2. Notification sent to all channels
  3. Incident documented in Cloud Monitoring UI

Recommended Multi-Tier Alerting Strategy:

  1. Warning Alert (70% CPU, 5 min duration) → Email/Slack
  2. Critical Alert (80% CPU, 2 min duration) → PagerDuty
  3. Emergency Alert (95% CPU, 1 min duration) → SMS + PagerDuty
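The tiers can be captured as data, which makes it easy to sanity-check which channel a given load pattern would page (tier names and targets here are illustrative):

```python
# (tier, CPU threshold as a fraction, required duration in seconds, target)
TIERS = [
    ("emergency", 0.95, 60,  "SMS + PagerDuty"),
    ("critical",  0.80, 120, "PagerDuty"),
    ("warning",   0.70, 300, "Email/Slack"),
]

def highest_tier(cpu, sustained_seconds):
    """Return the most severe tier whose threshold and duration are both met."""
    for name, threshold, duration, target in TIERS:
        if cpu >= threshold and sustained_seconds >= duration:
            return name, target
    return None

print(highest_tier(0.95, 1200))  # the 20-minute incident -> emergency page
print(highest_tier(0.75, 400))   # mild sustained load -> warning only
```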

This provides escalating awareness before SLA impact.

Ongoing Monitoring:

Set up dashboard to visualize:

  • CPU utilization per instance
  • Alert policy evaluation status
  • Notification delivery success rate
  • Incident open/close timeline

With these corrections, your Cloud Monitoring alerts will reliably trigger on ERP CPU spikes, preventing SLA breaches and user impact.

Check that you’re using the correct comparison operator. CPU utilization is reported as a fraction between 0 and 1, not as a percentage, so your threshold of 0.8 is correct for 80%, but make sure the condition uses a greater-than comparison (COMPARISON_GT), not equality. Also verify the condition duration: if set to 300 seconds, the metric must stay above the threshold continuously for 5 minutes before the alert fires, so transient spikes shorter than that won’t trigger.