Aurora failover latency causes ERP transaction stalls during maintenance

Our ERP system running on Aurora MySQL experiences 30-60 second transaction stalls during planned failovers or maintenance windows. The Aurora cluster has one writer and two readers across multiple AZs.

We’re connecting using the cluster endpoint:


jdbc:mysql://erp-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:3306/erp

During failovers, the application doesn’t reconnect quickly and users see timeout errors. Our connection pool is configured with 50 max connections and a 30-second timeout. We’ve monitored CloudWatch and see the failover completes in about 15 seconds, but applications remain disconnected much longer. Is there a way to reduce this latency? The 30-60 second stalls are unacceptable for our ERP workflows.

Are you using the reader endpoint for read queries? Separating read and write traffic can help. Also, check your application’s connection retry logic. Most database drivers don’t automatically retry on connection failures - your app needs to handle that.
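A minimal sketch of the read/write split, assuming hypothetical endpoint hosts (the `cluster-ro-` form is Aurora’s reader endpoint; swap in your own cluster name):

```java
// Routes connections by intent: writes go to the cluster (writer) endpoint,
// read-only work goes to the reader endpoint. Hosts below are placeholders.
public class EndpointRouter {
    private static final String WRITER_URL =
        "jdbc:mysql://erp-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:3306/erp";
    private static final String READER_URL =
        "jdbc:mysql://erp-cluster.cluster-ro-xxxxx.us-east-1.rds.amazonaws.com:3306/erp";

    /** Picks the JDBC URL for a unit of work based on whether it only reads. */
    public static String urlFor(boolean readOnly) {
        return readOnly ? READER_URL : WRITER_URL;
    }
}
```

In practice you would back each URL with its own connection pool so reader traffic never competes with writer connections during a failover.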

The 30-60 second stalls you’re experiencing are due to three compounding factors that need to be addressed systematically:

Aurora Cluster Endpoint Usage: You’re correctly using the cluster endpoint for writes, which automatically points to the current writer instance. However, your configuration needs optimization:

  1. Use separate endpoints: the cluster endpoint for writes and the reader endpoint for read-only queries.
  2. Configure your JDBC connection with failover parameters:

jdbc:aws-wrapper:mysql://erp-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:3306/erp
?connectTimeout=5000&socketTimeout=10000

The jdbc:aws-wrapper:mysql:// prefix is what enables the AWS Advanced JDBC Wrapper’s Aurora-aware failover handling. (Plain MySQL Connector/J has no aurora:// scheme and no topology awareness, so it can only wait for DNS to catch up.)
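A sketch of wiring this up, assuming the AWS Advanced JDBC Wrapper (software.amazon.jdbc) is on your classpath; only the URL/property assembly below is plain JDK, and the property names are the wrapper’s documented ones as I understand them:

```java
import java.util.Properties;

public class AuroraFailoverConfig {
    public static final String URL =
        "jdbc:aws-wrapper:mysql://erp-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:3306/erp";

    /** Builds connection properties enabling the wrapper's failover plugin. */
    public static Properties failoverProps(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Topology-aware failover: reconnect to the new writer without waiting on DNS
        props.setProperty("wrapperPlugins", "failover");
        props.setProperty("connectTimeout", "5000");
        props.setProperty("socketTimeout", "10000");
        return props;
    }
    // Usage: DriverManager.getConnection(URL, failoverProps("erp_user", password))
}
```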

Connection Pool Tuning: Your pool settings are causing extended outages. Implement these changes:

  1. Reduce the connection acquisition timeout from 30s to 5-10s (HikariCP properties):
maxPoolSize=50
connectionTimeout=10000
validationTimeout=5000
idleTimeout=300000
  2. Enable aggressive connection validation. Note that testWhileIdle, testOnBorrow, and timeBetweenEvictionRunsMillis are DBCP/Tomcat pool settings; HikariCP validates on checkout automatically and uses different knobs:
connectionTestQuery=SELECT 1
maxLifetime=600000
keepaliveTime=30000
  3. Add JVM DNS settings to respect Aurora’s 5-second DNS TTL:

-Dsun.net.inetaddr.ttl=1
-Dsun.net.inetaddr.negative.ttl=1
  4. Implement connection retry logic in your application:
Connection connection = null;
int maxRetries = 3;
for (int i = 0; i < maxRetries; i++) {
    try {
        connection = dataSource.getConnection();
        break;
    } catch (SQLException e) {
        if (i == maxRetries - 1) throw e;
        // Back off before retrying so the pool isn't hammered mid-failover
        Thread.sleep(2000L * (i + 1));
    }
}
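If fixed-delay retries cause a reconnect stampede when all 50 pooled connections fail at once, exponential backoff with jitter spreads the attempts out; a minimal sketch (the base delay and cap are illustrative, not tuned values):

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    /** Exponential backoff delay in ms for attempt n (0-based), capped at 10s. */
    public static long delayMillis(int attempt) {
        long base = 500L << Math.min(attempt, 4);   // 500, 1000, 2000, 4000, 8000
        long capped = Math.min(base, 10_000L);
        // Full jitter: pick uniformly in [0, capped) so clients don't retry in lockstep
        return ThreadLocalRandom.current().nextLong(capped);
    }
}
```

Call Thread.sleep(Backoff.delayMillis(i)) in place of the fixed sleep in the retry loop.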

CloudWatch Failover Monitoring: Set up comprehensive monitoring to understand failover behavior:

Key metrics to track:

  • DatabaseConnections - drops during failover, should recover within 15-20s
  • CommitLatency - spikes during promotion of new writer
  • AuroraBinlogReplicaLag - should be near zero before failover
  • EngineUptime - resets when new writer is promoted

Create CloudWatch alarms:

aws cloudwatch put-metric-alarm \
  --alarm-name aurora-high-commit-latency \
  --namespace AWS/RDS \
  --metric-name CommitLatency \
  --dimensions Name=DBClusterIdentifier,Value=erp-cluster \
  --statistic Average \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold

Enable Enhanced Monitoring (1-second granularity) to capture precise failover timing. Use Performance Insights to identify which queries were in-flight during failover.

Additional Recommendations:

  1. Follow Aurora MySQL’s fast-failover best practices (short DNS TTLs, aggressive TCP keepalives, an Aurora-aware driver) rather than relying on driver defaults
  2. Use Aurora Global Database if you need cross-region failover capabilities
  3. Test failovers regularly during maintenance windows to validate your configuration
  4. Consider using RDS Proxy for connection pooling at the infrastructure level - it maintains connections during failover
  5. Set up VPC Flow Logs to monitor network-level connection resets during failover

The combination of proper endpoint usage, aggressive connection pool validation, and correct DNS caching should reduce your failover impact to 10-15 seconds maximum. The remaining time is Aurora’s actual promotion process, which you can’t eliminate but can monitor effectively with CloudWatch.

Your connection pool timeout is too high. Set it to 5-10 seconds max. Also configure testOnBorrow=true and validationQuery=SELECT 1 (DBCP-style settings; HikariCP validates on checkout by default) so the pool validates connections before handing them to the app. During failover, stale connections get detected and discarded quickly. What connection pool library are you using - HikariCP, DBCP2, or something else?

We’re not currently using the reader endpoint - all traffic goes through the cluster endpoint. The JVM DNS caching is a good point. I’ll check that setting. But even with proper DNS resolution, shouldn’t the connection pool detect failures faster and reconnect? Our 30-second timeout seems long.

The cluster endpoint has DNS TTL of 5 seconds, but many JVMs cache DNS lookups longer than that. Check your JVM DNS cache settings. Add -Dsun.net.inetaddr.ttl=1 to your Java options to respect the short TTL. Without this, your app might be connecting to the old writer IP after failover.
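The TTL can also be set programmatically before any lookups happen. networkaddress.cache.ttl is the supported Java security property; sun.net.inetaddr.ttl is its legacy system-property alias:

```java
import java.security.Security;

public class DnsTtlConfig {
    /** Caps the JVM's positive and negative DNS cache TTLs at 1 second. */
    public static void apply() {
        // Must run before the first InetAddress lookup to take effect
        Security.setProperty("networkaddress.cache.ttl", "1");
        Security.setProperty("networkaddress.cache.negative.ttl", "1");
    }
}
```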

For ERP systems, you need aggressive failover detection. Set TCP keepalive at the OS level (net.ipv4.tcp_keepalive_time=10) and use connection pool health checks every few seconds. Also, consider using Aurora’s Fast Failover feature with the MySQL JDBC driver’s failover parameters. This can reduce detection time significantly.
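At the JDBC level, Connector/J exposes a tcpKeepAlive URL parameter (enabled by default), but the OS-level sysctls above control how quickly dead peers are actually detected. The socket flag itself is just SO_KEEPALIVE:

```java
import java.net.Socket;
import java.net.SocketException;

public class KeepAliveDemo {
    /** Enables SO_KEEPALIVE on a socket and returns the resulting flag. */
    public static boolean enableKeepAlive(Socket socket) throws SocketException {
        // With SO_KEEPALIVE set, the kernel probes idle connections;
        // probe timing comes from net.ipv4.tcp_keepalive_* sysctls on Linux
        socket.setKeepAlive(true);
        return socket.getKeepAlive();
    }
}
```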