CloudWatch alarms not triggering for EC2 ERP instance CPU spikes above threshold

Our production ERP application runs on EC2 instances, and I’ve configured CloudWatch alarms to alert when CPUUtilization exceeds 80% for 5 minutes. However, we’ve had two incidents where CPU spiked to 95%+ for 10-15 minutes without any alarm notifications.

Alarm configuration:

{
  "MetricName": "CPUUtilization",
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold",
  "EvaluationPeriods": 1,
  "Period": 300
}

The SNS topic is configured and I’ve verified my email subscription is active. When I manually test the SNS topic, emails arrive fine. The alarm shows as “OK” in the console even during the spike periods. I’ve checked CloudWatch metrics and the CPU data points are definitely there showing the spikes. What’s causing the alarm to miss these critical events? The evaluation period and metric frequency seem aligned, but something isn’t working.

I see another potential issue - missing data handling. Check your alarm’s ‘treat missing data’ setting. If it’s set to ‘notBreaching’, missing data points are treated as within threshold, so gaps in metric data can keep the alarm in OK state. For production systems, set this to ‘breaching’ or ‘ignore’ depending on your needs. Basic monitoring has 5-minute granularity, which creates gaps if you’re evaluating shorter windows.
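If it helps, you can check the current setting from the CLI before changing anything (the alarm name “ERP-CPU-High” is an assumption - substitute yours):

```shell
# Show how the alarm currently treats missing data.
# Alarm name is a placeholder - use your actual alarm name.
aws cloudwatch describe-alarms \
  --alarm-names "ERP-CPU-High" \
  --query 'MetricAlarms[0].TreatMissingData' \
  --output text
```

Note that changing it means re-issuing `put-metric-alarm` with the full alarm definition plus `--treat-missing-data`, since `put-metric-alarm` replaces the existing alarm of the same name.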

Check your alarm’s Statistic setting - is it using Average, Maximum, or Sum? If you’re using Average and the CPU spikes are brief, the 5-minute average might stay below 80% even though instantaneous values hit 95%. For CPU monitoring, Maximum is often more appropriate for catching spikes. Also verify the alarm is in ALARM state during incidents by checking alarm history in the console.
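Both checks can be done quickly from the CLI (alarm name “ERP-CPU-High” is assumed):

```shell
# Confirm which statistic the alarm evaluates
aws cloudwatch describe-alarms \
  --alarm-names "ERP-CPU-High" \
  --query 'MetricAlarms[0].[Statistic,Period,EvaluationPeriods]'

# List recent state transitions - did it ever enter ALARM during the incidents?
aws cloudwatch describe-alarm-history \
  --alarm-name "ERP-CPU-High" \
  --history-item-type StateUpdate \
  --max-records 20
```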

I checked and I’m using Average statistic. That could explain why brief spikes aren’t triggering alarms. But I’m confused about the period setting - I thought Period: 300 with EvaluationPeriods: 1 meant it checks every 5 minutes. Are you saying I need to enable detailed monitoring first? That’s an additional cost, right? Is there a way to make this work with basic monitoring?

Don’t forget to check your SNS topic policy and subscription filter policy. Even if the topic works when tested manually, there might be a filter policy on your email subscription that’s blocking alarm notifications based on message attributes. Go to SNS console, find your subscription, and verify there’s no filter policy applied. Also check CloudWatch alarm actions - make sure the SNS topic ARN is correctly specified in the alarm configuration.
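A sketch of the same checks via the CLI (topic and subscription ARNs are placeholders):

```shell
# List subscriptions on the topic - note your email subscription's ARN.
# Topic ARN is a placeholder.
aws sns list-subscriptions-by-topic \
  --topic-arn "arn:aws:sns:us-east-1:123456789012:erp-alerts"

# Inspect that subscription - FilterPolicy should come back empty/None.
# Subscription ARN below is a placeholder from the command above.
aws sns get-subscription-attributes \
  --subscription-arn "<subscription-arn-from-list-above>" \
  --query 'Attributes.FilterPolicy'
```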

Let me address all three critical areas comprehensively:

Alarm Period and Evaluation: Your current configuration evaluates a single 5-minute average, which is insufficient for catching CPU spikes. With basic monitoring (5-minute metric intervals), you need multiple evaluation periods to ensure sustained high CPU triggers alerts. Update your alarm:

{
  "MetricName": "CPUUtilization",
  "Statistic": "Maximum",
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold",
  "EvaluationPeriods": 3,
  "DatapointsToAlarm": 2,
  "Period": 300,
  "TreatMissingData": "notBreaching"
}

This configuration requires 2 out of 3 consecutive 5-minute periods to exceed 80% before alarming. The key change is using Maximum statistic instead of Average - this captures peak CPU within each 5-minute window rather than smoothing it out. The DatapointsToAlarm parameter (M out of N evaluation) prevents false alarms from transient spikes while still catching sustained high CPU events.
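One way to apply that configuration from the CLI (alarm name, instance ID, and topic ARN are placeholders; remember `put-metric-alarm` replaces the whole alarm definition):

```shell
# Recreate the alarm with Maximum statistic and 2-out-of-3 evaluation.
# Alarm name, instance ID, and topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "ERP-CPU-High" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Maximum \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --period 300 \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --treat-missing-data notBreaching \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:erp-alerts"
```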

SNS Topic Configuration: Your manual SNS test succeeds, but alarm notifications fail - this indicates an alarm action configuration issue, not SNS itself. Verify three things:

  1. Check alarm actions are properly configured:
aws cloudwatch describe-alarms --alarm-names "ERP-CPU-High" --query 'MetricAlarms[0].AlarmActions'

Ensure your SNS topic ARN appears in the output.

  2. Verify the SNS topic policy allows CloudWatch to publish:
{
  "Effect": "Allow",
  "Principal": {"Service": "cloudwatch.amazonaws.com"},
  "Action": "SNS:Publish",
  "Resource": "arn:aws:sns:region:account:topic-name"
}
  3. Check subscription filter policies - go to SNS console, select your topic, click on your email subscription, and verify “Subscription filter policy” shows “None”. If a filter exists, it might be blocking alarm messages based on attributes.

Also review the alarm’s history (the “History” tab on the alarm in the CloudWatch console). Both state transitions and action executions are recorded there, which gives visibility into whether the SNS notification action actually ran and whether it succeeded or failed.
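The action records can also be pulled from the CLI (alarm name is a placeholder):

```shell
# Show recent action executions for the alarm - each record indicates
# whether the SNS publish was attempted and succeeded.
aws cloudwatch describe-alarm-history \
  --alarm-name "ERP-CPU-High" \
  --history-item-type Action \
  --max-records 10
```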

Metric Frequency Alignment: Basic monitoring publishes EC2 metrics every 5 minutes, which limits your detection granularity. However, you don’t need detailed monitoring (1-minute metrics) if you configure alarms correctly. The issue isn’t frequency alignment - it’s that your single evaluation period with Average statistic misses spikes that occur within a 5-minute window.

For cost-effective monitoring without detailed metrics, use this strategy:

  • Set Period to 300 (5 minutes) to match basic monitoring frequency
  • Use Maximum statistic to capture peak values within each period
  • Set EvaluationPeriods to 3 and DatapointsToAlarm to 2 for sustained spike detection
  • Lower your threshold to 75% to compensate for less granular monitoring
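To confirm the Average-vs-Maximum gap against your past incidents, you can pull both statistics for the same window and compare them side by side (instance ID and timestamps are placeholders - use your actual spike window):

```shell
# Compare Average vs Maximum CPU over a past incident window.
# Instance ID and timestamps are placeholders.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-01-15T10:00:00Z \
  --end-time 2024-01-15T11:00:00Z \
  --period 300 \
  --statistics Average Maximum
```

If Maximum breaches 80% in windows where Average stays below it, that confirms the smoothing explanation.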

If you need faster detection for critical ERP workloads, enable detailed monitoring on production instances only:

aws ec2 monitor-instances --instance-ids i-1234567890abcdef0

This costs approximately $2.10/month per instance but provides 1-minute metrics. With detailed monitoring, reconfigure your alarm to Period: 60, EvaluationPeriods: 5, DatapointsToAlarm: 3 - this detects CPU above 80% for 3 out of 5 consecutive minutes (5-minute window with better granularity).
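With detailed monitoring enabled, the reconfigured alarm might look like this (same placeholder names as earlier in the thread):

```shell
# 1-minute alarm: CPU above 80% in 3 out of 5 consecutive minutes.
# Alarm name, instance ID, and topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "ERP-CPU-High" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Maximum \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:erp-alerts"
```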

For your specific incidents where CPU hit 95% for 10-15 minutes without alerts, the most likely root cause is your Average statistic combined with a single evaluation period. If a 5-minute average came in at 78% while instantaneous CPU peaked at 95%, your alarm wouldn’t trigger. Switch to Maximum statistic and implement the M-out-of-N evaluation pattern I described above. This configuration will reliably catch sustained high CPU events while avoiding false positives from brief, acceptable spikes during ERP batch processing.

Your evaluation period of 1 with a 5-minute period means the alarm only checks a single data point. If that one 5-minute average comes in below 80%, the alarm won’t trigger even if CPU was at 95% for part of the window. With basic monitoring there is only one data point per 5-minute window, so either increase EvaluationPeriods to 2 or 3, or enable detailed monitoring (1-minute metrics) and switch to 1-minute periods.

With basic monitoring, EC2 publishes metrics every 5 minutes. Your alarm evaluates every 5 minutes but only looks at one data point (EvaluationPeriods: 1). Change to EvaluationPeriods: 2 so it requires two consecutive 5-minute periods above threshold before alarming. This reduces false positives but still catches sustained high CPU. For the statistic issue, switch from Average to Maximum to catch spikes that might be smoothed out in averages.