Cloud SQL failover triggers ERP downtime due to DNS propagation delays

Our Cloud SQL PostgreSQL instance failed over to the standby replica during scheduled maintenance yesterday, and our ERP application experienced 8-12 minutes of downtime even though Cloud SQL documentation claims failover happens in under 2 minutes.

Our application connects using the Cloud SQL instance connection name:


spring.datasource.url=jdbc:postgresql://PROJECT:REGION:INSTANCE/erp_db

During the failover, the application couldn’t establish connections and threw timeout errors. The database became accessible to the application again after about 10 minutes, although the Cloud SQL instance itself shows it was available within 90 seconds of the failover starting.

We suspect DNS propagation delays are causing the extended outage, but we’re not sure how to fix this. The ERP system handles critical business operations and we can’t afford 10-minute outages during failovers. What’s the recommended approach for minimizing connection disruption during Cloud SQL failover events?

I recommend switching to Cloud SQL Proxy immediately. It maintains a persistent connection to Cloud SQL and handles failover transparently. Your application connects to localhost and the proxy manages the actual Cloud SQL connection. We made this switch after similar issues and failover became seamless: the application doesn’t even notice that the database failed over.

The issue is definitely DNS caching. When Cloud SQL fails over, the instance connection name resolves to a new IP address, but your application JVM and OS are caching the old IP. Even though Cloud SQL updates DNS immediately, cached entries don’t expire until their TTL passes. You need to either reduce DNS cache TTL in your application or use Cloud SQL Proxy which handles failover automatically without DNS dependencies.

Let me provide a comprehensive solution to eliminate your ERP downtime during Cloud SQL failover events.

Understanding the Problem:

Your 8-12 minute downtime during a sub-2-minute failover is caused by DNS caching at multiple layers:

  1. Cloud SQL Failover Process (actual 60-120 seconds):

    • Standby replica promoted to primary
    • DNS A record updated to new primary IP
    • Cloud SQL reports instance as available
  2. DNS Propagation Delays (your 8-12 minute gap):

    • JVM DNS cache (successful lookups are cached forever when a security manager is installed; otherwise for an implementation-specific period, 30 seconds in OpenJDK)
    • OS DNS cache (varies by OS, typically 60-300 seconds)
    • Application connection pool holding dead connections
    • No automatic retry logic in application

Recommended Solution: Cloud SQL Proxy

Cloud SQL Proxy removes DNS from the database connection path, which avoids DNS-related failover issues, and it is the Google-recommended approach for production systems.

How Cloud SQL Proxy Handles Failover:

The proxy doesn’t use DNS for ongoing connections. Instead:

  1. Uses Cloud SQL Admin API to query instance metadata
  2. Receives current primary IP directly from API (no DNS)
  3. Maintains authenticated connection with automatic certificate rotation
  4. Detects connection failures immediately
  5. Queries API for new primary location
  6. Reconnects to new primary within seconds
  7. Application sees brief connection error, retries, succeeds
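
Step 7 only works if the application actually retries. Here is a minimal sketch of what that retry could look like, assuming connection-level failures surface as a SQLTransientException or as a SQLException with an SQLState in class 08 (connection exception). The FailoverRetry class, the SqlCall interface, and the backoff values are illustrative names and numbers of my own, not part of the proxy or the PostgreSQL driver.

```java
import java.sql.SQLException;
import java.sql.SQLTransientException;

/** Hypothetical helper: retries a database call while a failover is in progress. */
final class FailoverRetry {

    /** A database call that may throw SQLException (illustrative interface). */
    interface SqlCall<T> {
        T run() throws SQLException;
    }

    /** Treat SQLState class 08 and SQLTransientException as connection-level failures. */
    static boolean isRetryable(SQLException e) {
        String state = e.getSQLState();
        return e instanceof SQLTransientException
                || (state != null && state.startsWith("08"));
    }

    static <T> T withRetry(SqlCall<T> call, int maxAttempts, long initialDelayMs)
            throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.run();
            } catch (SQLException e) {
                // Give up on non-connection errors or when attempts are exhausted.
                if (!isRetryable(e) || attempt >= maxAttempts) {
                    throw e;
                }
            }
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new SQLException("Interrupted while waiting to retry", ie);
            }
            // Exponential backoff, capped at 5 seconds between attempts.
            delayMs = Math.min(delayMs * 2, 5_000L);
        }
    }
}
```

A call such as `FailoverRetry.withRetry(() -> dataSource.getConnection(), 5, 200)` would then ride out a short failover window. Only wrap operations that are safe to repeat; a write that may already have committed should not be retried blindly.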

Implementation for Your ERP System:

Step 1: Install Cloud SQL Proxy on Each Compute Engine VM

# Download and install
wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
chmod +x cloud_sql_proxy
sudo mv cloud_sql_proxy /usr/local/bin/

Step 2: Create Systemd Service

Create /etc/systemd/system/cloud-sql-proxy.service:


[Unit]
Description=Cloud SQL Proxy
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=your-app-user
ExecStart=/usr/local/bin/cloud_sql_proxy \
  -instances=PROJECT:REGION:INSTANCE=tcp:5432 \
  -ip_address_types=PRIVATE
Restart=always

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable cloud-sql-proxy
sudo systemctl start cloud-sql-proxy

Step 3: Update Application Configuration

Change your JDBC URL from:


spring.datasource.url=jdbc:postgresql://PROJECT:REGION:INSTANCE/erp_db

To:


spring.datasource.url=jdbc:postgresql://127.0.0.1:5432/erp_db

The application now connects to localhost where the proxy listens.

Step 4: Configure Connection Pool for Resilience

Update your Spring datasource properties:

# All Hikari timeouts below are in milliseconds
spring.datasource.hikari.connection-timeout=10000
spring.datasource.hikari.validation-timeout=5000
# Retire pooled connections after 10 minutes
spring.datasource.hikari.max-lifetime=600000
spring.datasource.hikari.connection-test-query=SELECT 1

Addressing Your Specific Concerns:

Cloud SQL Failover Behavior: During failover:

  • Primary becomes unavailable (health check fails)
  • Standby promoted to primary (30-60 seconds)
  • DNS updated (immediate, but caching causes delays)
  • Cloud SQL Proxy detects connection failure (1-2 seconds)
  • Proxy queries Admin API for current primary (1 second)
  • Proxy reconnects to new primary (2-5 seconds)
  • Application disruption once the new primary is up: 5-10 seconds, vs your current 10 minutes

DNS Propagation Delays: With Cloud SQL Proxy, DNS is irrelevant for database connections:

  • Proxy uses Admin API, not DNS
  • Application uses localhost (127.0.0.1), not DNS
  • No DNS caching issues
  • No JVM DNS cache problems
  • Failover detection is active, not passive

Cloud SQL Proxy Usage: Yes, install the proxy on each Compute Engine VM:

  • Lightweight process (minimal CPU/memory)
  • No single point of failure
  • Each application instance has dedicated proxy
  • Proxy authenticates using VM service account
  • No additional credentials needed

Service Account Permissions: Ensure your Compute Engine VMs use a service account that has the Cloud SQL Client role (roles/cloudsql.client). To grant it:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='serviceAccount:VM_SA@PROJECT_ID.iam.gserviceaccount.com' \
  --role='roles/cloudsql.client'

Alternative Approach (If Proxy Not Viable):

If you absolutely cannot use Cloud SQL Proxy:

  1. Use Private IP with VPC peering (reduces but doesn’t eliminate DNS issues)

  2. Configure JVM DNS cache in your application:

    
    -Dsun.net.inetaddr.ttl=30
    -Dsun.net.inetaddr.negative.ttl=10
    
  3. Implement connection retry in application code

  4. Use connection pool validation to detect stale connections

  5. Set aggressive timeouts (5-10 seconds)

However, this approach still leaves you vulnerable to 30-60 second outages during failover.
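
If you cannot change the JVM launch flags, the DNS cache setting from item 2 can also be applied in code through the networkaddress.cache.ttl and networkaddress.cache.negative.ttl security properties. This is a minimal sketch; the DnsCacheConfig class name is my own, and it assumes the method runs at the very start of main(), before the JVM performs its first hostname lookup, because the cache policy is read once.

```java
import java.security.Security;

/** Hypothetical startup helper: caps how long the JVM caches DNS lookups. */
final class DnsCacheConfig {

    /**
     * Sets the JVM-wide DNS cache TTLs, in seconds.
     * Call this before any hostname is resolved, since the JVM reads
     * these properties only when its address cache is first initialized.
     */
    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        // How long successful lookups are cached.
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // How long failed lookups are cached.
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }
}
```

Calling `DnsCacheConfig.apply(30, 10)` as the first statement in main() mirrors the 30 and 10 second values from the flags above.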

Testing Your Solution:

After implementing Cloud SQL Proxy, test failover:

  1. Trigger manual failover:

    gcloud sql instances failover INSTANCE_NAME

  2. Monitor application logs for connection errors

  3. Measure actual downtime experienced by ERP

  4. Verify automatic recovery

Expected result: connection errors while the standby is promoted, recovery within roughly 5-10 seconds of the new primary becoming available, and no manual intervention required.
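
To put a number on the downtime the ERP actually sees during the test, one option is to probe the database on a fixed interval (for example a `SELECT 1` once per second, which is my assumption, not something the proxy provides) and compute the longest run of failed probes afterwards. A minimal sketch of that calculation, with OutageMeter as a hypothetical class name:

```java
/** Hypothetical helper: finds the longest outage in a series of probe results. */
final class OutageMeter {

    /**
     * @param timesMs probe timestamps in milliseconds, in ascending order
     * @param ok      whether the probe at the same index succeeded
     * @return the longest span, in milliseconds, from the first failed probe of
     *         an outage to the next successful probe (or to the last probe if
     *         the outage had not ended when probing stopped)
     */
    static long longestOutageMs(long[] timesMs, boolean[] ok) {
        if (timesMs.length != ok.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        long longest = 0;
        boolean inOutage = false;
        long outageStart = 0;
        for (int i = 0; i < timesMs.length; i++) {
            if (!ok[i]) {
                if (!inOutage) {
                    // First failed probe marks the start of an outage.
                    inOutage = true;
                    outageStart = timesMs[i];
                }
            } else if (inOutage) {
                // First successful probe after failures closes the outage.
                longest = Math.max(longest, timesMs[i] - outageStart);
                inOutage = false;
            }
        }
        if (inOutage) {
            // Outage still open at the end: count it up to the last probe.
            longest = Math.max(longest, timesMs[timesMs.length - 1] - outageStart);
        }
        return longest;
    }
}
```

The result is only as precise as the probe interval, so a one-second interval cannot distinguish a 5 second outage from a 6 second one.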

Monitoring and Alerting:

Set up Cloud Monitoring alerts for:

  • Cloud SQL failover events
  • Application connection pool exhaustion
  • Cloud SQL Proxy process health
  • Database connection error rates

With Cloud SQL Proxy properly configured and retry logic in place, your ERP system should see only brief disruption during Cloud SQL failover events, which you can confirm against your availability requirements with the failover test above.