Container Insights metrics delayed in CloudWatch for Prometheus scraper

We’re running EKS clusters with Container Insights enabled and using Prometheus for custom metrics scraping. We recently noticed that CloudWatch metrics lag our Prometheus dashboards by 5-10 minutes.

Our Prometheus scrape interval is set to 30 seconds, and we’ve verified the CloudWatch agent is running on all nodes. The delay seems to worsen during peak load hours when our cluster scales to 20+ nodes.


apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yaml: |
    global:
      scrape_interval: 30s

Has anyone experienced similar metric lag with Container Insights? Wondering if this is related to the agent configuration or resource constraints on our worker nodes.

I’ve seen this before. The default CloudWatch agent configuration batches metrics before sending them, which can introduce delays. Check your agent’s flush interval settings - it’s likely set to 60 seconds or higher. Also, during scaling events, the agent might be CPU-throttled if you haven’t allocated enough resources to the DaemonSet.
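To put rough numbers on that: even a 60-second flush interval on its own only explains about 90 seconds of lag, so multi-minute delays usually mean flushes are backing up behind throttling or CPU starvation. A quick sketch (the backlog count is a made-up illustration, not a measured value):

```python
# Back-of-the-envelope staleness estimate (illustrative numbers, not measured).
scrape_interval_s = 30   # Prometheus scrape interval from the question
flush_interval_s = 60    # assumed agent flush interval

# Worst case for a single datapoint with no backlog: scraped just after the
# previous scrape, then waits one full flush cycle before being shipped.
worst_case_no_backlog_s = scrape_interval_s + flush_interval_s
print(worst_case_no_backlog_s)  # 90 seconds -- nowhere near 5-10 minutes

# If flushes are throttled or retried, batches queue up; each delayed flush
# adds roughly one flush interval of extra lag on top.
delayed_flushes = 6  # hypothetical backlog during a peak-load window
worst_case_with_backlog_s = worst_case_no_backlog_s + delayed_flushes * flush_interval_s
print(worst_case_with_backlog_s)  # 450 seconds, i.e. ~7.5 minutes
```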

Here’s a comprehensive solution addressing all three areas causing your delays:

1. Prometheus Scrape Interval Alignment: Your 30-second Prometheus scrape interval is fine, but you need to synchronize the CloudWatch agent flush interval. Update your CloudWatch agent ConfigMap:

{
  "agent": {
    "metrics_collection_interval": 30,
    "force_flush_interval": 30,
    "metric_buffer_limit": 10000
  }
}

This ensures metrics are flushed every 30 seconds, matching your Prometheus collection cycle and eliminating the sampling mismatch.

2. CloudWatch Agent Resource Configuration: Your current 200m CPU/200Mi memory allocation is insufficient for a 20+ node cluster. Update the agent DaemonSet:

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

This prevents CPU throttling during metric bursts and provides adequate buffer space. Monitor the actual usage after deployment - you may need to adjust further based on your metric cardinality.

3. API Rate Limit Management: With 20+ agents, you’re likely hitting CloudWatch PutMetricData limits (150 TPS default). Implement these optimizations:

  • Enable metric aggregation in the agent config to reduce API calls
  • Filter out low-value metrics using drop_original_metrics configuration
  • Request a service quota increase for PutMetricData in your region (can go up to 1500 TPS)
  • Consider using metric_aggregation_interval: 60 for less critical metrics while keeping critical ones at 30s
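To see why the rate limit bites even when the average call rate looks safe, here’s a rough estimate; the series cardinality and per-call batch size below are assumptions for illustration, not measured values:

```python
# Rough PutMetricData call-rate estimate (hypothetical cardinality numbers).
nodes = 20
series_per_node = 50_000   # assumed high-cardinality workload
metrics_per_call = 1_000   # assumed per-request batch size
flush_interval_s = 30

calls_per_flush_per_node = series_per_node // metrics_per_call  # 50 calls
avg_tps = nodes * calls_per_flush_per_node / flush_interval_s
print(round(avg_tps, 1))  # ~33 calls/sec averaged out -- under the quota

# But agent flushes are not coordinated. If many agents flush in the same
# second (e.g. right after a scaling event), the burst hits all at once.
burst_tps = nodes * calls_per_flush_per_node
print(burst_tps)  # 1000 calls in that second -- far above a 150 TPS quota
```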

After implementing these changes, you should see metric delays drop to under 1 minute consistently. The combination of proper resource allocation, aligned intervals, and rate limit management addresses all three constraint areas. Monitor CloudWatch agent logs for the first 48 hours to verify no throttling errors remain.

Definitely increase those resource allocations. For clusters with 20+ nodes and high metric cardinality, I typically set the CloudWatch agent to at least 500m CPU and 512Mi memory. The agent needs headroom to buffer metrics during burst periods. Also consider adjusting the metric_buffer_limit parameter if you’re seeing dropped metrics during scaling events.
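For a sense of scale on the buffer sizing (illustrative numbers only; the series count and throttling window are assumptions):

```python
# Buffer sizing check during a throttling episode (hypothetical numbers).
series_per_node = 15_000   # assumed active series on one node
flush_interval_s = 30
throttle_window_s = 300    # hypothetical 5-minute throttling episode

missed_flushes = throttle_window_s // flush_interval_s  # 10 flush cycles
buffered_points = series_per_node * missed_flushes
print(buffered_points)  # 150000 datapoints queued -- a 10000-point buffer drops most of them
```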

Thanks for the pointer. I checked our agent config and force_flush_interval is indeed set to 60s. Also noticed the agent pods are requesting only 200m CPU and 200Mi memory. During peak times, we’re seeing CPU throttling in the metrics. Should I increase these resource limits?

Update: Enabled debug logging and found throttling errors during peak hours. Also saw the CPU throttling Raj mentioned.