Container Insights metrics delayed in CloudWatch for Prometheus scraper

We’re running EKS clusters with Container Insights enabled and using Prometheus for custom metrics scraping. We recently noticed that CloudWatch metrics lag our Prometheus dashboards by 5-10 minutes.

Our Prometheus scrape interval is set to 30 seconds, and we’ve verified the CloudWatch agent is running on all nodes. The delay seems to worsen during peak load hours when our cluster scales to 20+ nodes.


apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yaml: |
    global:
      scrape_interval: 30s

Has anyone experienced similar metric lag with Container Insights? Wondering if this is related to the agent configuration or resource constraints on our worker nodes.

I’ve seen this before. The default CloudWatch agent configuration batches metrics before sending them, which can introduce delays. Check your agent’s flush interval settings - it’s likely set to 60 seconds or higher. Also, during scaling events, the agent might be CPU-throttled if you haven’t allocated enough resources to the DaemonSet.
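To put rough numbers on that: even a 60-second flush interval on its own only explains about 90 seconds of lag, so multi-minute delays usually mean flushes are backing up behind throttling or CPU starvation. A quick sketch (the backlog count is a made-up illustration, not a measured value):

```python
# Back-of-the-envelope staleness estimate (illustrative numbers, not measured).
scrape_interval_s = 30   # Prometheus scrape interval from the question
flush_interval_s = 60    # assumed agent flush interval

# Worst case for a single datapoint with no backlog: scraped just after the
# previous scrape, then waits one full flush cycle before being shipped.
worst_case_no_backlog_s = scrape_interval_s + flush_interval_s
print(worst_case_no_backlog_s)  # 90 seconds -- nowhere near 5-10 minutes

# If flushes are throttled or retried, batches queue up; each delayed flush
# adds roughly one flush interval of extra lag on top.
delayed_flushes = 6  # hypothetical backlog during a peak-load window
worst_case_with_backlog_s = worst_case_no_backlog_s + delayed_flushes * flush_interval_s
print(worst_case_with_backlog_s)  # 450 seconds, i.e. ~7.5 minutes
```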

Here’s a comprehensive solution addressing all three areas causing your delays:

1. Prometheus Scrape Interval Alignment: Your 30-second Prometheus scrape interval is fine, but you need to synchronize the CloudWatch agent flush interval. Update your CloudWatch agent ConfigMap:

{
  "agent": {
    "metrics_collection_interval": 30,
    "force_flush_interval": 30,
    "metric_buffer_limit": 10000
  }
}

This ensures metrics are flushed every 30 seconds, matching your Prometheus collection cycle and eliminating the sampling mismatch.

2. CloudWatch Agent Resource Configuration: Your current 200m CPU/200Mi memory allocation is insufficient for a 20+ node cluster. Update the agent DaemonSet:

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

This prevents CPU throttling during metric bursts and provides adequate buffer space. Monitor the actual usage after deployment - you may need to adjust further based on your metric cardinality.

3. API Rate Limit Management: With 20+ agents, you’re likely hitting CloudWatch PutMetricData limits (150 TPS default). Implement these optimizations:

  • Enable metric aggregation in the agent config to reduce API calls
  • Filter out low-value metrics using drop_original_metrics configuration
  • Request a service quota increase for PutMetricData in your region (can go up to 1500 TPS)
  • Consider using metric_aggregation_interval: 60 for less critical metrics while keeping critical ones at 30s
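To see why the rate limit bites even when the average call rate looks safe, here’s a rough estimate; the series cardinality and per-call batch size below are assumptions for illustration, not measured values:

```python
# Rough PutMetricData call-rate estimate (hypothetical cardinality numbers).
nodes = 20
series_per_node = 50_000   # assumed high-cardinality workload
metrics_per_call = 1_000   # assumed per-request batch size
flush_interval_s = 30

calls_per_flush_per_node = series_per_node // metrics_per_call  # 50 calls
avg_tps = nodes * calls_per_flush_per_node / flush_interval_s
print(round(avg_tps, 1))  # ~33 calls/sec averaged out -- under the quota

# But agent flushes are not coordinated. If many agents flush in the same
# second (e.g. right after a scaling event), the burst hits all at once.
burst_tps = nodes * calls_per_flush_per_node
print(burst_tps)  # 1000 calls in that second -- far above a 150 TPS quota
```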

After implementing these changes, you should see metric delays drop to under 1 minute consistently. The combination of proper resource allocation, aligned intervals, and rate limit management addresses all three constraint areas. Monitor CloudWatch agent logs for the first 48 hours to verify no throttling errors remain.

Definitely increase those resource allocations. For clusters with 20+ nodes and high metric cardinality, I typically set the CloudWatch agent to at least 500m CPU and 512Mi memory. The agent needs headroom to buffer metrics during burst periods. Also consider adjusting the metric_buffer_limit parameter if you’re seeing dropped metrics during scaling events.
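For a sense of scale on the buffer sizing (illustrative numbers only; the series count and throttling window are assumptions):

```python
# Buffer sizing check during a throttling episode (hypothetical numbers).
series_per_node = 15_000   # assumed active series on one node
flush_interval_s = 30
throttle_window_s = 300    # hypothetical 5-minute throttling episode

missed_flushes = throttle_window_s // flush_interval_s  # 10 flush cycles
buffered_points = series_per_node * missed_flushes
print(buffered_points)  # 150000 datapoints queued -- a 10000-point buffer drops most of them
```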

Thanks for the pointer. I checked our agent config and force_flush_interval is indeed set to 60s. Also noticed the agent pods are requesting only 200m CPU and 200Mi memory. During peak times, we’re seeing CPU throttling in the metrics. Should I increase these resource limits?

Update: Enabled debug logging and found throttling errors during peak hours. Also saw the CPU throttling Raj mentioned.