Let me provide a comprehensive solution to fix your Cloud Monitoring alerting for ERP CPU spikes.
Root Cause Analysis:
Based on your configuration and symptoms, the most likely issues are:
- Metric Filter Specificity: Your regex filter may not be matching all intended instances
- Alignment Window Configuration: Aggregation settings may be smoothing out spikes
- Condition Duration: 300-second requirement may be too long for spike detection
- Notification Channel State: Channels may be in failed state without obvious indication
Complete Solution:
Part 1: Cloud Monitoring Metric Filters
Understanding Metric Filtering:
Cloud Monitoring filters use MQL (Monitoring Query Language) or filter expressions. Your current filter:
resource.labels.instance_name=~"erp-app-.*"
This filter has several likely problems:
- instance_name is not a resource label on gce_instance (its resource labels are only project_id, instance_id, and zone); on Compute metrics it is a metric label, so resource.labels.instance_name matches nothing
- The =~ shorthand is not part of the documented Monitoring filter syntax; use monitoring.regex.full_match() for anchored regex matching
- Managed instance group members get generated name suffixes that the pattern must allow for
Corrected Filter Configuration:
resource.type="gce_instance" AND
metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")
(If a metric lacks the instance_name label, metadata.system_labels.name=monitoring.regex.full_match("erp-app-.*") is an alternative.)
Or add zone-based filtering if the instances live in specific zones:
resource.type="gce_instance" AND
resource.labels.zone=starts_with("us-central1-") AND
metric.labels.instance_name=starts_with("erp-app")
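The anchored-match semantics matter here, and they can be illustrated with plain Python regular expressions (a rough analogy; Cloud Monitoring uses RE2, but for this pattern the behavior matches `re.fullmatch`):

```python
import re

pattern = "erp-app-.*"
# Hypothetical instance names for illustration
names = ["erp-app-01", "erp-app-frontend-2", "erp-app", "staging-erp-app-01"]

# monitoring.regex.full_match anchors at both ends, like re.fullmatch
full = [n for n in names if re.fullmatch(pattern, n)]
# An unanchored search would also accept names that merely contain the pattern
partial = [n for n in names if re.search(pattern, n)]

print(full)     # "erp-app" (no trailing dash) does not full-match
print(partial)  # "staging-erp-app-01" sneaks in with unanchored matching
```

Note that a bare "erp-app" instance would be missed by "erp-app-.*" entirely, while unanchored matching would wrongly pull in staging instances.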
Verify Filter Matches Instances:
Test in Metrics Explorer first:
# List instances that should match
gcloud compute instances list \
--filter="name~'erp-app-.*'" \
--format="table(name,zone,status)"
Then verify metrics exist for these instances:
metric.type="compute.googleapis.com/instance/cpu/utilization"
resource.type="gce_instance"
metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")
If no data appears, either the filter is wrong or the metric is not being written for those instances.
Part 2: Alerting Policy Configuration
Optimal Alert Policy Structure:
Create a new alerting policy with proper configuration:
displayName: "ERP CPU Utilization High"
combiner: OR
conditions:
  - displayName: "CPU > 80% for 2 minutes"
    conditionThreshold:
      filter: >-
        metric.type="compute.googleapis.com/instance/cpu/utilization"
        resource.type="gce_instance"
        metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")
      comparison: COMPARISON_GT
      thresholdValue: 0.8
      duration: 120s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_MEAN
          crossSeriesReducer: REDUCE_NONE
      trigger:
        count: 1
Key Configuration Decisions:
Alignment Period (60s):
- Balances spike detection vs noise filtering
- Shorter periods (30s) catch brief spikes but may cause false positives
- Longer periods (300s) smooth data too much, missing real incidents
- 60s is a sensible default for CPU monitoring
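The smoothing effect is easy to quantify. A sketch with hypothetical 10-second CPU samples containing a 2-minute spike, mean-aligned (ALIGN_MEAN) into 60s versus 300s buckets:

```python
# Hypothetical 10s CPU samples: baseline 0.30, with a 2-minute spike at 0.95
samples = [0.30] * 18 + [0.95] * 12 + [0.30] * 18  # 8 minutes of data

def align_mean(series, period_s, sample_interval_s=10):
    """Average consecutive samples into buckets of period_s seconds (ALIGN_MEAN)."""
    n = period_s // sample_interval_s
    return [sum(series[i:i + n]) / n for i in range(0, len(series) - n + 1, n)]

aligned_60 = align_mean(samples, 60)    # 60s buckets: spike minutes stay at 0.95
aligned_300 = align_mean(samples, 300)  # 300s buckets: spike diluted to ~0.56

print(max(aligned_60), max(aligned_300))
```

With 300s alignment the aligned maximum never crosses the 0.8 threshold, so the policy cannot fire on this spike no matter what duration is configured.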
Per-Series Aligner (ALIGN_MEAN):
- ALIGN_MEAN: Average CPU over alignment period (recommended for CPU)
- ALIGN_MAX: Highest CPU in period (catches brief spikes, more false positives)
- ALIGN_MIN: Lowest CPU (not useful for high CPU alerts)
Cross-Series Reducer (REDUCE_NONE):
- REDUCE_NONE: Alert on each instance independently (recommended)
- REDUCE_MEAN: Alert only if average across all instances exceeds threshold (misses individual spikes)
- REDUCE_MAX: Alert when any instance exceeds, collapsed into one fleet-level series (you lose which instance breached)
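The reducer trade-off in miniature, with hypothetical aligned values for one hot instance out of four:

```python
# Hypothetical aligned CPU values for four instances at one evaluation point
instances = {"erp-app-1": 0.95, "erp-app-2": 0.35, "erp-app-3": 0.40, "erp-app-4": 0.30}
threshold = 0.8

reduce_mean = sum(instances.values()) / len(instances)  # REDUCE_MEAN: fleet average
reduce_max = max(instances.values())                    # REDUCE_MAX: hottest instance

# REDUCE_NONE: each series is evaluated independently, so the breaching
# instance is identified by name
breaching = [name for name, cpu in instances.items() if cpu > threshold]

print(reduce_mean, reduce_max, breaching)
```

Here REDUCE_MEAN sees only 0.50 and stays silent even though erp-app-1 is pinned at 95%, which is exactly the failure mode to avoid for per-instance CPU alerts.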
Duration (120s vs 300s):
Your 300s duration requires 5 consecutive minutes above threshold. This is too long for spike detection:
- Brief spikes (1-3 minutes) won’t trigger alerts
- By the time alert fires, incident may be over
- Users already experiencing impact
Recommended: 120s (2 minutes)
- Catches sustained issues
- Filters out momentary spikes
- Alerts before significant user impact
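The duration check requires every aligned point across the window to breach. A sketch of that evaluation over hypothetical 60s-aligned values, comparing the 120s and 300s settings:

```python
def sustained_breach(points, threshold, duration_points):
    """Return the index at which the last duration_points consecutive aligned
    values all exceed threshold (i.e., when the alert would fire), else None."""
    run = 0
    for i, v in enumerate(points):
        run = run + 1 if v > threshold else 0
        if run >= duration_points:
            return i
    return None

# Hypothetical 60s-aligned CPU: a 3-minute breach starting at minute 2
points = [0.4, 0.5, 0.85, 0.9, 0.88, 0.5, 0.45]

fires_120s = sustained_breach(points, 0.8, 2)  # 120s duration = 2 aligned points
fires_300s = sustained_breach(points, 0.8, 5)  # 300s duration = 5 aligned points

print(fires_120s, fires_300s)  # the 300s policy never fires on this breach
```

The 120s policy fires on the second breaching minute; the 300s policy never fires at all, because the 3-minute breach ends before five consecutive points accumulate.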
Part 3: Compute Engine CPU Metrics Understanding
CPU Utilization Metric Details:
metric.type: compute.googleapis.com/instance/cpu/utilization
value_type: DOUBLE
metric_kind: GAUGE
unit: 10^2.%
range: 0.0 to 1.0 (0% to 100%)
Critical Understanding:
- Reported as decimal (0.8 = 80%)
- Sampled every 60 seconds by default
- Represents utilization across all vCPUs (averaged)
- Includes user, system, and steal time
CPU Metric Collection Delay:
- Metric ingestion lag: 60-120 seconds typical
- Alignment processing: 10-30 seconds
- Alert evaluation: 30-60 seconds
- Total alert delay: 2-4 minutes from actual spike start
This means your 20-minute spike should definitely have triggered alerts.
Alternative CPU Metrics for Better Detection:
Consider also monitoring:
- CPU Reserved Cores (for autoscaling scenarios):
metric.type="compute.googleapis.com/instance/cpu/reserved_cores"
- CPU Usage Time (CPU-seconds per sampling interval, a DELTA metric rather than a percentage):
metric.type="compute.googleapis.com/instance/cpu/usage_time"
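If you work with cpu/usage_time, you can derive a utilization ratio yourself: CPU-seconds consumed divided by CPU-seconds available. A sketch with hypothetical numbers:

```python
def utilization_from_usage_time(cpu_seconds_delta, interval_s, vcpus):
    """Approximate utilization ratio: CPU-seconds consumed in the sampling
    interval, divided by the CPU-seconds available (interval * vCPU count)."""
    return cpu_seconds_delta / (interval_s * vcpus)

# Hypothetical: 192 CPU-seconds consumed over a 60s interval on a 4-vCPU instance
util = utilization_from_usage_time(192.0, 60, 4)
print(util)  # 0.8, equivalent to the 80% threshold on cpu/utilization
```

This conversion is useful when comparing usage_time-based dashboards against the 0.8 threshold used on cpu/utilization.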
Complete Working Alert Policy:
Save the policy YAML above as erp-cpu-policy.yaml, adding notification channels and documentation to it (flag support varies by gcloud release; check gcloud alpha monitoring policies create --help):
notificationChannels:
  - projects/PROJECT_ID/notificationChannels/CHANNEL_ID_1
  - projects/PROJECT_ID/notificationChannels/CHANNEL_ID_2
documentation:
  mimeType: text/markdown
  content: >-
    ERP application server CPU utilization exceeded 80%.
    Check application logs and consider scaling up.
    Runbook: https://wiki.company.com/erp-cpu-high
Then create the policy from the file:
gcloud alpha monitoring policies create \
  --policy-from-file=erp-cpu-policy.yaml
Notification Channel Verification:
# List all notification channels
gcloud alpha monitoring channels list
# Inspect a specific channel, including its verification status
gcloud alpha monitoring channels describe CHANNEL_ID
# Email and SMS channels must show verificationStatus: VERIFIED before they deliver
If notifications still don’t work:
- Check email spam filters
- Verify Slack webhook is active
- Check PagerDuty integration key
- Review notification rate limits (Cloud Monitoring throttles repeated notifications per channel)
- Check Cloud Logging audit logs for Monitoring API errors (a starting-point query; exact fields depend on which audit logs are enabled):
protoPayload.serviceName="monitoring.googleapis.com"
severity>=ERROR
Incident History Debugging:
Check whether incidents were created:
gcloud alpha monitoring policies list \
  --format="table(name,displayName,enabled)"
gcloud does not expose incident history; review it in the console under Monitoring > Alerting > Incidents.
If incidents exist but notifications weren’t sent:
- Notification channel configuration issue
- Channel in failed state
- Rate limiting applied
If no incidents exist:
- Metric filter doesn’t match instances
- Threshold/duration configuration prevents triggering
- Metric data not being collected
Testing Your Alert Policy:
Simulate CPU Load:
SSH to an ERP instance and run:
# Generate CPU load for 5 minutes (install stress first, e.g. sudo apt-get install stress)
stress --cpu 8 --timeout 300s
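If stress isn't available on the image, a minimal Python stand-in can pin every vCPU with busy loops (a sketch; run it directly on the instance):

```python
# Fallback CPU load generator for instances without the stress tool.
import multiprocessing
import time

def burn(seconds):
    """Busy-wait to pin one core for the given number of seconds."""
    end = time.time() + seconds
    while time.time() < end:
        pass

def generate_load(seconds):
    """Spin one busy-loop process per vCPU, then wait for them to finish."""
    workers = [multiprocessing.Process(target=burn, args=(seconds,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

# generate_load(300)  # ~5 minutes, mirroring the stress command above
```

Uncomment the final line (or call generate_load from a shell one-liner) to produce a sustained spike comparable to the stress invocation.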
Monitor alert status in the console under Monitoring > Alerting > Incidents (gcloud does not currently list incidents).
You should see:
- Incident created within 3-4 minutes
- Notification sent to all channels
- Incident documented in Cloud Monitoring UI
Recommended Multi-Tier Alerting Strategy:
- Warning Alert (70% CPU, 5 min duration) → Email/Slack
- Critical Alert (80% CPU, 2 min duration) → PagerDuty
- Emergency Alert (95% CPU, 1 min duration) → SMS + PagerDuty
This provides escalating awareness before SLA impact.
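The three tiers can be kept as data and expanded into policy configs so thresholds stay consistent across environments. A sketch (tier values from the list above; channel names are placeholders):

```python
# Tiered alerting spec expanded into alert policy dicts (channel IDs are placeholders)
TIERS = [
    {"name": "Warning",   "threshold": 0.70, "duration_s": 300, "channels": ["email", "slack"]},
    {"name": "Critical",  "threshold": 0.80, "duration_s": 120, "channels": ["pagerduty"]},
    {"name": "Emergency", "threshold": 0.95, "duration_s": 60,  "channels": ["sms", "pagerduty"]},
]

FILTER = ('metric.type="compute.googleapis.com/instance/cpu/utilization" '
          'resource.type="gce_instance" '
          'metric.labels.instance_name=monitoring.regex.full_match("erp-app-.*")')

def build_policy(tier):
    """Build one AlertPolicy-shaped dict for a tier (same shape as the YAML above)."""
    return {
        "displayName": f"ERP CPU {tier['name']}",
        "combiner": "OR",
        "conditions": [{
            "displayName": f"CPU > {tier['threshold']:.0%} for {tier['duration_s']}s",
            "conditionThreshold": {
                "filter": FILTER,
                "comparison": "COMPARISON_GT",
                "thresholdValue": tier["threshold"],
                "duration": f"{tier['duration_s']}s",
                "aggregations": [{
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN",
                    "crossSeriesReducer": "REDUCE_NONE",
                }],
            },
        }],
    }

policies = [build_policy(t) for t in TIERS]
print([p["displayName"] for p in policies])
```

Each generated dict can be dumped to YAML and fed to gcloud alpha monitoring policies create --policy-from-file, keeping the three tiers in lockstep.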
Ongoing Monitoring:
Set up dashboard to visualize:
- CPU utilization per instance
- Alert policy evaluation status
- Notification delivery success rate
- Incident open/close timeline
With these corrections, your Cloud Monitoring alerts should trigger reliably on ERP CPU spikes, reducing SLA breaches and user impact.