Let me provide a comprehensive solution to eliminate your ERP downtime during Cloud SQL failover events.
Understanding the Problem:
Your 8-12 minute downtime during a sub-2-minute failover is most likely caused by DNS caching at multiple layers, compounded by stale pooled connections and missing retry logic:
- Cloud SQL Failover Process (actual 60-120 seconds):
  - Standby replica promoted to primary
  - DNS A record updated to new primary IP
  - Cloud SQL reports instance as available
- DNS Propagation Delays (your 8-12 minute gap):
  - JVM DNS cache (successful lookups are cached forever when a security manager is installed; otherwise the default is implementation-specific, typically 30 seconds)
  - OS DNS cache (varies by OS, typically 60-300 seconds)
  - Application connection pool holding dead connections
  - No automatic retry logic in application
Recommended Solution: Cloud SQL Proxy
Cloud SQL Proxy takes DNS out of the failover path and is the Google-recommended approach for production systems.
How Cloud SQL Proxy Handles Failover:
The proxy doesn’t use DNS for ongoing connections. Instead:
- Uses Cloud SQL Admin API to query instance metadata
- Receives current primary IP directly from API (no DNS)
- Maintains authenticated connection with automatic certificate rotation
- Detects connection failures immediately
- Queries API for new primary location
- Reconnects to new primary within seconds
- Application sees brief connection error, retries, succeeds
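The last step above assumes the application actually retries after a connection error. A minimal sketch of such a retry with capped exponential backoff is shown here. The class name `FailoverRetry`, the method `withRetry`, and the user name and password are hypothetical placeholders, and the `main` method assumes the PostgreSQL JDBC driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.Callable;

/** Hypothetical helper: retries a database operation with capped exponential backoff. */
final class FailoverRetry {
    private FailoverRetry() {}

    /**
     * Runs the operation, retrying only on SQLException.
     * The delay doubles after each failed attempt, capped at 5 seconds.
     */
    static <T> T withRetry(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (SQLException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts, rethrow below
                }
                Thread.sleep(delay);
                delay = Math.min(delay * 2, 5_000L);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder credentials; the URL matches the proxy listener from Step 3.
        try (Connection c = withRetry(
                () -> DriverManager.getConnection(
                        "jdbc:postgresql://127.0.0.1:5432/erp_db", "erp_user", "change-me"),
                5, 500)) {
            System.out.println("connected: " + c.isValid(2));
        }
    }
}
```

With five attempts and a 500 ms initial delay, the helper waits 500 + 1000 + 2000 + 4000 ms between attempts, which spans the 5-10 second window discussed in this answer. Treat those numbers as a starting point and tune them against a real failover test.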
Implementation for Your ERP System:
Step 1: Install Cloud SQL Proxy on Each Compute Engine VM
# Download and install
wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
chmod +x cloud_sql_proxy
sudo mv cloud_sql_proxy /usr/local/bin/
Step 2: Create Systemd Service
Create /etc/systemd/system/cloud-sql-proxy.service:
[Unit]
Description=Cloud SQL Proxy
After=network.target
[Service]
Type=simple
User=your-app-user
ExecStart=/usr/local/bin/cloud_sql_proxy \
-instances=PROJECT:REGION:INSTANCE=tcp:5432 \
-ip_address_types=PRIVATE
Restart=always
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl enable cloud-sql-proxy
sudo systemctl start cloud-sql-proxy
Step 3: Update Application Configuration
Change your JDBC URL from:
spring.datasource.url=jdbc:postgresql://PROJECT:REGION:INSTANCE/erp_db
To:
spring.datasource.url=jdbc:postgresql://127.0.0.1:5432/erp_db
The application now connects to localhost where the proxy listens.
Step 4: Configure Connection Pool for Resilience
Update your Spring datasource properties:
spring.datasource.hikari.connection-timeout=10000
spring.datasource.hikari.validation-timeout=5000
spring.datasource.hikari.max-lifetime=600000
spring.datasource.hikari.connection-test-query=SELECT 1
Addressing Your Specific Concerns:
Cloud SQL Failover Behavior:
During failover:
- Primary becomes unavailable (health check fails)
- Standby promoted to primary (30-60 seconds)
- DNS updated (immediate, but caching causes delays)
- Cloud SQL Proxy detects connection failure (1-2 seconds)
- Proxy queries Admin API for current primary (1 second)
- Proxy reconnects to new primary (2-5 seconds)
- Total application disruption: 5-10 seconds vs your current 8-12 minutes
DNS Propagation Delays:
With Cloud SQL Proxy, DNS is irrelevant for database connections:
- Proxy uses Admin API, not DNS
- Application uses localhost (127.0.0.1), not DNS
- No DNS caching issues
- No JVM DNS cache problems
- Failover detection is active, not passive
Cloud SQL Proxy Usage:
Yes, install the proxy on each Compute Engine VM:
- Lightweight process (minimal CPU/memory)
- No single point of failure
- Each application instance has a dedicated proxy
- Proxy authenticates using VM service account
- No additional credentials needed
Service Account Permissions:
Ensure your Compute Engine VMs use a service account with:
gcloud projects add-iam-policy-binding PROJECT_ID \
--member='serviceAccount:VM_SA@PROJECT.iam.gserviceaccount.com' \
--role='roles/cloudsql.client'
Alternative Approach (If Proxy Not Viable):
If you absolutely cannot use Cloud SQL Proxy:
- Use Private IP with VPC peering (reduces but doesn’t eliminate DNS issues)
- Configure JVM DNS cache in your application:
  -Dsun.net.inetaddr.ttl=30
  -Dsun.net.inetaddr.negative.ttl=10
- Implement connection retry in application code
- Use connection pool validation to detect stale connections
- Set aggressive timeouts (5-10 seconds)
However, this approach still leaves you vulnerable to 30-60 second outages during failover.
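The `sun.net.inetaddr.*` flags above are legacy system properties. The documented equivalents are the security properties `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl`, which can also be set in code. A minimal sketch follows; the class name `DnsCacheConfig` is hypothetical, and it assumes the method is called at startup before the JVM performs its first host name lookup, since the cache policy is read once.

```java
import java.net.InetAddress;
import java.security.Security;

/** Hypothetical startup helper that shortens the JVM DNS cache lifetime. */
final class DnsCacheConfig {
    private DnsCacheConfig() {}

    /** Sets positive and negative DNS cache TTLs, in seconds. */
    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) throws Exception {
        // Same values as the -D flags: 30 s for successful lookups, 10 s for failures.
        apply(30, 10);
        // Lookups made after this point follow the new TTLs.
        System.out.println(InetAddress.getByName("localhost"));
    }
}
```

In a Spring application, one option is to call `DnsCacheConfig.apply(30, 10)` as the first statement of `main`, before the application context starts.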
Testing Your Solution:
After implementing Cloud SQL Proxy, test failover:
1. Trigger manual failover:
gcloud sql instances failover INSTANCE_NAME
2. Monitor application logs for connection errors
3. Measure actual downtime experienced by ERP
4. Verify automatic recovery
Expected result: 5-10 seconds of connection errors, automatic recovery, no manual intervention required.
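To put a number on the downtime measured in step 3, a small probe can run alongside the failover test. The sketch below opens a JDBC connection through the proxy once per second and reports the longest stretch of failed probes. The class name `DowntimeProbe`, the credentials, and the five-minute sample window are hypothetical, and the `main` method assumes the PostgreSQL JDBC driver is on the classpath. It uses a real database connection rather than a bare TCP check, on the assumption that the proxy's local port may accept connections even while the backend is unreachable.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Hypothetical probe that measures the longest outage seen during a failover test. */
final class DowntimeProbe {
    private DowntimeProbe() {}

    /**
     * Returns the longest span, in milliseconds, from the first failed sample
     * of an outage to the next successful sample (or to the last sample if the
     * outage is still in progress).
     */
    static long longestOutageMillis(long[] timestamps, boolean[] ok) {
        long longest = 0;
        long outageStart = -1;
        for (int i = 0; i < timestamps.length; i++) {
            if (!ok[i]) {
                if (outageStart < 0) {
                    outageStart = timestamps[i]; // outage begins
                }
            } else if (outageStart >= 0) {
                longest = Math.max(longest, timestamps[i] - outageStart);
                outageStart = -1; // outage ended
            }
        }
        if (outageStart >= 0 && timestamps.length > 0) {
            longest = Math.max(longest, timestamps[timestamps.length - 1] - outageStart);
        }
        return longest;
    }

    public static void main(String[] args) throws InterruptedException {
        String url = "jdbc:postgresql://127.0.0.1:5432/erp_db"; // proxy listener
        int samples = 300; // about five minutes at one probe per second
        long[] timestamps = new long[samples];
        boolean[] ok = new boolean[samples];
        DriverManager.setLoginTimeout(2); // seconds
        for (int i = 0; i < samples; i++) {
            timestamps[i] = System.currentTimeMillis();
            try (Connection c = DriverManager.getConnection(url, "erp_user", "change-me")) {
                ok[i] = c.isValid(2);
            } catch (SQLException e) {
                ok[i] = false;
            }
            Thread.sleep(1000);
        }
        System.out.println("Longest outage (ms): " + longestOutageMillis(timestamps, ok));
    }
}
```

Start the probe, trigger the failover, and compare the reported value with the 5-10 second expectation. Because probes run once per second, the result is only accurate to about one second plus the connection timeout.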
Monitoring and Alerting:
Set up Cloud Monitoring alerts:
- Cloud SQL failover events
- Application connection pool exhaustion
- Cloud SQL Proxy process health
- Database connection error rates
With Cloud SQL Proxy properly configured, your ERP system will experience minimal disruption during Cloud SQL failover events, meeting your availability requirements.