After a regional outage last week, our Cloud SQL instance failed over to the standby replica, but the failover process took 3-4 minutes during which our application experienced massive connection errors. Even after the failover completed, we saw elevated error rates for another 10 minutes.
Our setup uses Cloud SQL for PostgreSQL with high availability enabled. The application tier runs on GKE with connection pooling configured. During the failover window, we logged thousands of connection timeout errors and database unavailable exceptions.
Connection pool config:
maxPoolSize=50
minIdle=10
connectionTimeout=30000
idleTimeout=600000
The Cloud SQL failover process itself seems to work, but our application doesn’t handle the transition gracefully. We need better strategies for connection pool tuning and retry logic to survive these failover events without significant user impact. What’s the recommended approach for handling Cloud SQL failover from the application side?
I reduced the connection timeout to 10 seconds and added basic retry logic with 3 attempts. But I’m seeing that even after the failover completes, the first few connection attempts still fail. Looking at Cloud SQL logs, the replica promotion completes within 2 minutes, but connections are rejected for several more minutes. Is this because the connection pool is holding dead connections?
Your connection timeout of 30 seconds is too long for failover scenarios. Reduce it to 5-10 seconds so connections fail fast and can retry quickly. Also implement exponential backoff in your retry logic. During failover, the old primary becomes unavailable immediately but the new primary takes time to fully initialize, so aggressive retries just add load.
Yes, stale connections are a major issue. Your connection pool doesn’t know the underlying instance changed. Add connection validation queries (like SELECT 1) before using pooled connections. Set testOnBorrow=true in your pool config. This adds slight overhead but ensures you don’t use dead connections. Also configure maxLifetime to force connection renewal every 10-15 minutes even during normal operation.
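The advice in the replies above (validate pooled connections before use, cap connection lifetime, retry with exponential backoff) can be seen working together in the simulation below. It is an added example, not code from any poster in this thread; `FakeBackend`, `FakeConn`, `Pool` and `borrow_with_retry` are hypothetical names, and the backend is a constructed stand-in for the database instance rather than a real driver.

```python
import random
import time

class FakeBackend:
    """Constructed stand-in for the instance; a failover bumps `gen`."""
    def __init__(self, clock):
        self.clock, self.gen, self.up = clock, 0, True
    def connect(self):
        if not self.up:
            raise ConnectionError("database unavailable")
        return FakeConn(self)

class FakeConn:
    """Constructed connection tied to the backend generation it was opened on."""
    def __init__(self, backend):
        self.backend, self.gen, self.born = backend, backend.gen, backend.clock()
    def ping(self):  # plays the role of the SELECT 1 validation query
        return self.backend.up and self.gen == self.backend.gen

class Pool:
    """Minimal pool: validate on borrow and enforce a maximum lifetime."""
    def __init__(self, backend, max_lifetime=600.0):
        self.backend, self.max_lifetime, self.idle = backend, max_lifetime, []
    def borrow(self):
        while self.idle:
            conn = self.idle.pop()
            young = self.backend.clock() - conn.born < self.max_lifetime
            if young and conn.ping():  # dead or expired connections are dropped
                return conn
        return self.backend.connect()
    def give_back(self, conn):
        self.idle.append(conn)

def backoff_delays(count, base=0.5, cap=10.0, rng=random.random):
    """Full-jitter exponential backoff: delay n is in [0, min(cap, base * 2**n)]."""
    return [rng() * min(cap, base * 2 ** n) for n in range(count)]

def borrow_with_retry(pool, attempts=5, sleep=time.sleep, **backoff):
    """Retry borrow() on ConnectionError, sleeping between attempts only."""
    delays = backoff_delays(attempts - 1, **backoff)
    for n in range(attempts):
        try:
            return pool.borrow()
        except ConnectionError:
            if n == attempts - 1:
                raise
            sleep(delays[n])
```

The jitter spreads reconnect attempts from many pods so they do not all hit the newly promoted primary at the same instant.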