Aurora failover latency causes ERP transaction stalls during maintenance

Our ERP system running on Aurora MySQL experiences 30-60 second transaction stalls during planned failovers or maintenance windows. The Aurora cluster has one writer and two readers across multiple AZs.

We’re connecting using the cluster endpoint:


jdbc:mysql://erp-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:3306/erp

During failovers, the application doesn’t reconnect quickly and users see timeout errors. Our connection pool is configured with 50 max connections and a 30-second timeout. We’ve monitored CloudWatch and see the failover completes in about 15 seconds, but applications remain disconnected much longer. Is there a way to reduce this latency? The 30-60 second stalls are unacceptable for our ERP workflows.

Are you using the reader endpoint for read queries? Separating read and write traffic can help. Also, check your application’s connection retry logic. Most database drivers don’t automatically retry on connection failures - your app needs to handle that.
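A minimal sketch of the read/write split, assuming hypothetical endpoint hosts (the `cluster-ro-` form is Aurora’s reader endpoint; swap in your own cluster name):

```java
// Routes connections by intent: writes go to the cluster (writer) endpoint,
// read-only work goes to the reader endpoint. Hosts below are placeholders.
public class EndpointRouter {
    private static final String WRITER_URL =
        "jdbc:mysql://erp-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:3306/erp";
    private static final String READER_URL =
        "jdbc:mysql://erp-cluster.cluster-ro-xxxxx.us-east-1.rds.amazonaws.com:3306/erp";

    /** Picks the JDBC URL for a unit of work based on whether it only reads. */
    public static String urlFor(boolean readOnly) {
        return readOnly ? READER_URL : WRITER_URL;
    }
}
```

In practice you would back each URL with its own connection pool so reader traffic never competes with writer connections during a failover.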

The 30-60 second stalls you’re experiencing are due to three compounding factors that need to be addressed systematically:

Aurora Cluster Endpoint Usage: You’re correctly using the cluster endpoint for writes, which automatically points to the current writer instance. However, your configuration needs optimization:

  1. Use separate endpoints: the cluster endpoint for writes and the reader endpoint for read-only queries.
  2. Configure your JDBC connection with failover parameters:

jdbc:aws-wrapper:mysql://erp-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:3306/erp
?connectTimeout=5000&socketTimeout=10000

The jdbc:aws-wrapper:mysql:// prefix is what enables the AWS Advanced JDBC Wrapper’s Aurora-aware failover handling. (Plain MySQL Connector/J has no aurora:// scheme and no topology awareness, so it can only wait for DNS to catch up.)
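A sketch of wiring this up, assuming the AWS Advanced JDBC Wrapper (software.amazon.jdbc) is on your classpath; only the URL/property assembly below is plain JDK, and the property names are the wrapper’s documented ones as I understand them:

```java
import java.util.Properties;

public class AuroraFailoverConfig {
    public static final String URL =
        "jdbc:aws-wrapper:mysql://erp-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com:3306/erp";

    /** Builds connection properties enabling the wrapper's failover plugin. */
    public static Properties failoverProps(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Topology-aware failover: reconnect to the new writer without waiting on DNS
        props.setProperty("wrapperPlugins", "failover");
        props.setProperty("connectTimeout", "5000");
        props.setProperty("socketTimeout", "10000");
        return props;
    }
    // Usage: DriverManager.getConnection(URL, failoverProps("erp_user", password))
}
```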

Connection Pool Tuning: Your pool settings are causing extended outages. Implement these changes:

  1. Reduce the connection acquisition timeout from 30s to 5-10s (HikariCP properties):
maxPoolSize=50
connectionTimeout=10000
validationTimeout=5000
idleTimeout=300000
  2. Enable aggressive connection validation. Note that testWhileIdle, testOnBorrow, and timeBetweenEvictionRunsMillis are DBCP/Tomcat pool settings; HikariCP validates on checkout automatically and uses different knobs:
connectionTestQuery=SELECT 1
maxLifetime=600000
keepaliveTime=30000
  3. Add JVM DNS settings to respect Aurora’s 5-second DNS TTL:

-Dsun.net.inetaddr.ttl=1
-Dsun.net.inetaddr.negative.ttl=1
  4. Implement connection retry logic in your application:
Connection connection = null;
int maxRetries = 3;
for (int i = 0; i < maxRetries; i++) {
    try {
        connection = dataSource.getConnection();
        break;
    } catch (SQLException e) {
        if (i == maxRetries - 1) throw e;
        // Back off before retrying so the pool isn't hammered mid-failover
        Thread.sleep(2000L * (i + 1));
    }
}
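If fixed-delay retries cause a reconnect stampede when all 50 pooled connections fail at once, exponential backoff with jitter spreads the attempts out; a minimal sketch (the base delay and cap are illustrative, not tuned values):

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    /** Exponential backoff delay in ms for attempt n (0-based), capped at 10s. */
    public static long delayMillis(int attempt) {
        long base = 500L << Math.min(attempt, 4);   // 500, 1000, 2000, 4000, 8000
        long capped = Math.min(base, 10_000L);
        // Full jitter: pick uniformly in [0, capped) so clients don't retry in lockstep
        return ThreadLocalRandom.current().nextLong(capped);
    }
}
```

Call Thread.sleep(Backoff.delayMillis(i)) in place of the fixed sleep in the retry loop.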

CloudWatch Failover Monitoring: Set up comprehensive monitoring to understand failover behavior:

Key metrics to track:

  • DatabaseConnections - drops during failover, should recover within 15-20s
  • CommitLatency - spikes during promotion of new writer
  • AuroraBinlogReplicaLag - should be near zero before failover
  • EngineUptime - resets when new writer is promoted

Create CloudWatch alarms:

aws cloudwatch put-metric-alarm \
  --alarm-name aurora-high-commit-latency \
  --namespace AWS/RDS \
  --metric-name CommitLatency \
  --dimensions Name=DBClusterIdentifier,Value=erp-cluster \
  --statistic Average \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold

Enable Enhanced Monitoring (1-second granularity) to capture precise failover timing. Use Performance Insights to identify which queries were in-flight during failover.

Additional Recommendations:

  1. Follow Aurora MySQL’s fast-failover best practices (short DNS TTLs, aggressive TCP keepalives, an Aurora-aware driver) rather than relying on driver defaults
  2. Use Aurora Global Database if you need cross-region failover capabilities
  3. Test failovers regularly during maintenance windows to validate your configuration
  4. Consider using RDS Proxy for connection pooling at the infrastructure level - it maintains connections during failover
  5. Set up VPC Flow Logs to monitor network-level connection resets during failover

The combination of proper endpoint usage, aggressive connection pool validation, and correct DNS caching should reduce your failover impact to 10-15 seconds maximum. The remaining time is Aurora’s actual promotion process, which you can’t eliminate but can monitor effectively with CloudWatch.

Your connection pool timeout is too high. Set it to 5-10 seconds max. Also configure testOnBorrow=true and validationQuery=SELECT 1 (DBCP-style settings; HikariCP validates on checkout by default) so the pool validates connections before handing them to the app. During failover, stale connections get detected and discarded quickly. What connection pool library are you using - HikariCP, DBCP2, or something else?

We’re not currently using the reader endpoint - all traffic goes through the cluster endpoint. The JVM DNS caching is a good point. I’ll check that setting. But even with proper DNS resolution, shouldn’t the connection pool detect failures faster and reconnect? Our 30-second timeout seems long.

The cluster endpoint has DNS TTL of 5 seconds, but many JVMs cache DNS lookups longer than that. Check your JVM DNS cache settings. Add -Dsun.net.inetaddr.ttl=1 to your Java options to respect the short TTL. Without this, your app might be connecting to the old writer IP after failover.
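The TTL can also be set programmatically before any lookups happen. networkaddress.cache.ttl is the supported Java security property; sun.net.inetaddr.ttl is its legacy system-property alias:

```java
import java.security.Security;

public class DnsTtlConfig {
    /** Caps the JVM's positive and negative DNS cache TTLs at 1 second. */
    public static void apply() {
        // Must run before the first InetAddress lookup to take effect
        Security.setProperty("networkaddress.cache.ttl", "1");
        Security.setProperty("networkaddress.cache.negative.ttl", "1");
    }
}
```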

For ERP systems, you need aggressive failover detection. Set TCP keepalive at the OS level (net.ipv4.tcp_keepalive_time=10) and use connection pool health checks every few seconds. Also, consider using Aurora’s Fast Failover feature with the MySQL JDBC driver’s failover parameters. This can reduce detection time significantly.
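At the JDBC level, Connector/J exposes a tcpKeepAlive URL parameter (enabled by default), but the OS-level sysctls above control how quickly dead peers are actually detected. The socket flag itself is just SO_KEEPALIVE:

```java
import java.net.Socket;
import java.net.SocketException;

public class KeepAliveDemo {
    /** Enables SO_KEEPALIVE on a socket and returns the resulting flag. */
    public static boolean enableKeepAlive(Socket socket) throws SocketException {
        // With SO_KEEPALIVE set, the kernel probes idle connections;
        // probe timing comes from net.ipv4.tcp_keepalive_* sysctls on Linux
        socket.setKeepAlive(true);
        return socket.getKeepAlive();
    }
}
```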