Let me address all three critical areas comprehensively:
Alarm Period and Evaluation: Your current configuration evaluates a single 5-minute average, which is insufficient for catching CPU spikes. With basic monitoring (5-minute metric intervals), you need multiple evaluation periods to ensure sustained high CPU triggers alerts. Update your alarm:
{
  "MetricName": "CPUUtilization",
  "Statistic": "Maximum",
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold",
  "EvaluationPeriods": 3,
  "DatapointsToAlarm": 2,
  "Period": 300,
  "TreatMissingData": "notBreaching"
}
This configuration requires 2 out of 3 consecutive 5-minute periods to exceed 80% before alarming. The key change is using the Maximum statistic instead of Average: this captures peak CPU within each 5-minute window rather than smoothing it out. The DatapointsToAlarm parameter (M out of N evaluation) prevents false alarms from transient spikes while still catching sustained high CPU events.
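Applied via the CLI, the same settings might look like the following sketch; the alarm name, instance ID, and topic ARN are placeholders taken from elsewhere in this thread, so substitute your own:

```shell
# Create or update the alarm with Maximum statistic and 2-of-3 evaluation
aws cloudwatch put-metric-alarm \
  --alarm-name "ERP-CPU-High" \
  --namespace "AWS/EC2" \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Maximum \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --period 300 \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:region:account:topic-name
```

Note that put-metric-alarm replaces the alarm's full configuration, so include the alarm actions here rather than relying on previously set values.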
SNS Topic Configuration: Your manual SNS test succeeds, but alarm notifications fail - this points to a problem in the alarm's action configuration or the topic's access policy, not in SNS delivery itself. Verify three things:
- Check alarm actions are properly configured:
aws cloudwatch describe-alarms --alarm-names "ERP-CPU-High" --query 'MetricAlarms[0].AlarmActions'
Ensure your SNS topic ARN appears in the output.
- Verify the SNS topic policy allows CloudWatch to publish:
{
  "Effect": "Allow",
  "Principal": {"Service": "cloudwatch.amazonaws.com"},
  "Action": "SNS:Publish",
  "Resource": "arn:aws:sns:region:account:topic-name"
}
- Check subscription filter policies - in the SNS console, select your topic, open your email subscription, and verify that "Subscription filter policy" shows "None". If a filter policy exists, it may be silently dropping alarm messages whose attributes don't match.
Also review the alarm's history: CloudWatch records an entry for each action invocation, including failures, which gives visibility into why notification attempts succeed or fail.
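Alarm history can be queried directly; this pulls only the action entries (alarm name assumed from the example above):

```shell
# Show the most recent action invocations for the alarm,
# including whether each notification attempt succeeded
aws cloudwatch describe-alarm-history \
  --alarm-name "ERP-CPU-High" \
  --history-item-type Action \
  --max-records 10
```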
Metric Frequency Alignment: Basic monitoring publishes EC2 metrics every 5 minutes, which limits your detection granularity. However, you don’t need detailed monitoring (1-minute metrics) if you configure alarms correctly. The issue isn’t frequency alignment - it’s that your single evaluation period with Average statistic misses spikes that occur within a 5-minute window.
For cost-effective monitoring without detailed metrics, use this strategy:
- Set Period to 300 (5 minutes) to match basic monitoring frequency
- Use Maximum statistic to capture peak values within each period
- Set EvaluationPeriods to 3 and DatapointsToAlarm to 2 for sustained spike detection
- Lower your threshold to 75% to compensate for less granular monitoring
If you need faster detection for critical ERP workloads, enable detailed monitoring on production instances only:
aws ec2 monitor-instances --instance-ids i-1234567890abcdef0
This costs approximately $2.10/month per instance but provides 1-minute metrics. With detailed monitoring, reconfigure your alarm to Period: 60, EvaluationPeriods: 5, DatapointsToAlarm: 3 - this detects CPU above 80% for 3 out of 5 consecutive minutes (5-minute window with better granularity).
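With detailed monitoring enabled, the reconfigured alarm might look like this sketch (same placeholder names as the earlier example):

```shell
# 1-minute period, 3-of-5 evaluation: fires when CPU exceeds 80%
# in 3 out of 5 consecutive minutes
aws cloudwatch put-metric-alarm \
  --alarm-name "ERP-CPU-High" \
  --namespace "AWS/EC2" \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Maximum \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --period 60 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:region:account:topic-name
```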
For your specific incidents where CPU hit 95% for 10-15 minutes without alerts, the Average statistic combined with a single evaluation period is the most likely cause: a 5-minute average can sit at 78% while instantaneous CPU peaks at 95%, and the alarm never triggers. That said, a spike sustained at 95% for 10-15 minutes should eventually push even the 5-minute average past 80%, so if no notification arrived at all, the broken alarm-action path described above deserves equal scrutiny. Switch to the Maximum statistic and implement the M-out-of-N evaluation pattern described above. This configuration will reliably catch sustained high-CPU events while avoiding false positives from brief, acceptable spikes during ERP batch processing.