Aurora Serverless connection timeouts from ECS containers during scaling events

We’re experiencing intermittent connection timeouts when our ECS Fargate tasks try to connect to Aurora Serverless v1 during scaling events. The application is a Node.js API using the mysql2 library.

Connection configuration:


connectionLimit: 10
connectTimeout: 10000
queueLimit: 0

During traffic spikes, Aurora scales up from 2 ACUs to 8 ACUs, but our application logs show connection timeouts during this 30-60 second scaling window. Failed transactions spike to about 15% during these periods. We’re monitoring with CloudWatch but not sure which metrics would help identify the root cause. Is this expected behavior with Aurora Serverless, or is there a connection pool tuning issue we need to address?

RDS Proxy would definitely help here. It maintains a connection pool and handles the scaling transitions gracefully by queuing requests during the brief scaling window. For Aurora Serverless v1, this is almost essential for production workloads. The proxy adds about 1-2ms latency, which is negligible compared to your timeout issues. Your connection pool settings also seem aggressive - 10 connections per container could overwhelm Aurora during scaling.

Aurora Serverless v1 does have a brief pause during scaling, but 15% failure rate seems high. Are you using the Data API or direct MySQL connections? The Data API handles scaling transitions better. Also, what’s your current min and max ACU configuration?

Complete Solution for Aurora Serverless Connection Timeouts

Your issue is a combination of connection pool misconfiguration and Aurora Serverless v1’s connection limits during scaling. Here’s how to fix it:

Aurora Scaling Configuration:

Aurora Serverless v1 has dynamic max_connections based on current ACU capacity:

  • 2 ACUs: ~90 connections
  • 4 ACUs: ~180 connections
  • 8 ACUs: ~360 connections

During scaling from 2 to 8 ACUs, there’s a 30-60 second transition where connections may be briefly unavailable. Your 20-30 containers with 10 connections each (200-300 total) immediately exceed the 90 connection limit at 2 ACUs, causing timeouts before scaling even begins.

Connection Pool Tuning:

Reduce your connection pool dramatically:

connectionLimit: 2,
connectTimeout: 30000,
queueLimit: 0,
waitForConnections: true,
enableKeepAlive: true,
keepAliveInitialDelay: 10000

With 30 containers at 2 connections each, you’ll use 60 connections at 2 ACUs (~67% utilization), leaving headroom. The longer connectTimeout (30s) gives Aurora time to finish scaling before the client gives up. Keep-alive helps detect connections that were silently dropped during the scaling transition.
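The capacity arithmetic above can be sketched as a quick sanity check. The ~45 connections per ACU figure below is an approximation inferred from the ~90/~180/~360 limits listed earlier, not an official AWS formula:

```javascript
// Approximate Aurora Serverless v1 connection capacity (~45 per ACU,
// inferred from the ~90 at 2 ACUs / ~360 at 8 ACUs figures above).
function maxConnections(acus) {
  return 45 * acus;
}

// Fraction of connection capacity an ECS fleet would consume.
function fleetUtilization(containers, poolSize, acus) {
  return (containers * poolSize) / maxConnections(acus);
}

console.log(fleetUtilization(30, 2, 2));  // tuned pool: 60 of 90 connections
console.log(fleetUtilization(30, 10, 2)); // original pool: 300 of 90, far over capacity
```

Plugging in your own peak task count and ACU floor shows whether the fleet fits under the connection ceiling before scaling even starts.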

Implement application-level connection retry logic:

const maxRetries = 3;
const baseDelay = 1000;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function queryWithRetry(pool, sql) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await pool.query(sql);
    } catch (err) {
      if (attempt === maxRetries) throw err;
      // Exponential backoff: waits 2s, then 4s, before retrying
      await sleep(baseDelay * Math.pow(2, attempt));
    }
  }
}

CloudWatch Monitoring:

Create a dashboard tracking these metrics:

  • DatabaseConnections - current active connections
  • ServerlessDatabaseCapacity - current ACU level
  • ACUUtilization - percentage of current capacity used
  • CommitThroughput and SelectThroughput - database activity

Set CloudWatch alarms:

  1. DatabaseConnections > 75% of calculated max_connections for current ACU
  2. ACUUtilization > 70% for 5 minutes (triggers scaling)
  3. High connection errors from your application logs

Calculate connection threshold: At 2 ACUs with 90 max connections, alarm at 68 connections (75%). This gives early warning before hitting limits.
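As one concrete example of the first alarm, here is a hedged AWS CLI sketch for the 2-ACU case; the cluster identifier and SNS topic ARN are placeholders for your own resources, and the threshold of 68 comes from the 75% calculation above:

```shell
# Alarm when DatabaseConnections exceeds 68 (75% of the ~90-connection
# ceiling at 2 ACUs). Replace my-aurora-cluster and the SNS ARN with
# your own resources; re-derive the threshold if your minimum ACU changes.
aws cloudwatch put-metric-alarm \
  --alarm-name aurora-connections-near-limit \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBClusterIdentifier,Value=my-aurora-cluster \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 68 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:db-alerts
```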

Additional Recommendations:

  1. Consider RDS Proxy: Deploy RDS Proxy in front of Aurora Serverless. It pools connections and handles scaling transitions transparently. Your containers connect to the proxy (which maintains connections to Aurora), eliminating timeouts during scaling:

    • Proxy connection ceiling: RDS Proxy expresses this as MaxConnectionsPercent, a share of the database’s max_connections; size it so the proxy can absorb your container count × pool size
    • Proxy connection pool: the proxy multiplexes many client connections over a smaller set of Aurora connections
    • Adds ~1-2ms latency, a fair trade against a 15% failure rate
    • Caveat: verify engine-mode support first; AWS documents RDS Proxy as incompatible with Aurora Serverless v1, so this option may require the v2 migration discussed below
  2. Increase Minimum ACUs: Set min ACUs to 4 instead of 2. This doubles your connection capacity to 180 and reduces scaling frequency. The cost increase is minimal compared to lost transactions.

  3. Pre-warming Strategy: If you can predict traffic spikes, trigger scaling proactively by running CPU-bound queries that push ACU utilization above 70%, prompting Aurora to scale before the actual load hits. A genuinely lightweight query won’t move utilization enough to trigger scaling.

  4. Evaluate Aurora Serverless v2: Consider migrating to v2, which scales in finer increments (0.5 ACU) and scales much faster (typically under 15 seconds). v2 also maintains connections during scaling, eliminating this entire class of issues.
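The pre-warming idea in item 3 could look something like the sketch below. The preWarmAurora helper is hypothetical, and the BENCHMARK() query and pacing are illustrative assumptions, not tuned recommendations:

```javascript
// Hypothetical pre-warming helper: issue a burst of CPU-bound queries so
// ACU utilization rises and Aurora scales before real traffic arrives.
async function preWarmAurora(pool, { bursts = 5, pauseMs = 2000 } = {}) {
  // BENCHMARK() burns CPU on the writer without touching application data.
  const warmupSql = 'SELECT BENCHMARK(5000000, MD5(RAND()))';
  for (let i = 0; i < bursts; i++) {
    await pool.query(warmupSql);
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
  return bursts; // number of warm-up queries issued
}
```

You would schedule this a minute or two before a known spike, for example from an ECS scheduled task or a cron-triggered Lambda, and watch ServerlessDatabaseCapacity to confirm the scale-up happened.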

Implementing connection pool reduction and CloudWatch monitoring will immediately improve your situation. Adding RDS Proxy will eliminate timeouts entirely during scaling events.

We typically run 20-30 ECS tasks during peak hours, so that would be 200-300 connections total with our current pool size. Is that too many for Aurora Serverless? What would be a recommended connection limit per container?

For CloudWatch monitoring, you should track DatabaseConnections, ServerlessDatabaseCapacity, and ACUUtilization metrics. Set up alarms when DatabaseConnections approaches max_connections for your current ACU level. This will help you see when you’re hitting connection limits during scaling events.

Yes, 200-300 connections is excessive. Aurora Serverless v1 has a max_connections limit based on ACUs - at 2 ACUs you only get about 90 connections. When all your containers try to establish 10 connections each, you’re hitting the connection limit which causes timeouts. Reduce your pool to 2-3 connections per container and implement connection retry logic with exponential backoff.

We’re using direct MySQL connections to the cluster endpoint. Current configuration is min 2 ACU, max 16 ACU, auto-pause disabled. We chose direct connections because we need sub-100ms response times and heard Data API adds latency. Should we be using RDS Proxy instead? I’ve seen it mentioned but don’t fully understand how it would help with scaling events.