Cloud SQL connection pool exhaustion from containerized ERP app leads to intermittent crashes

Our containerized ERP application running on GKE keeps crashing with connection pool exhaustion errors when connecting to Cloud SQL PostgreSQL. We’re using Cloud SQL Proxy as a sidecar container in our pods.

The errors appear during moderate load (50-80 concurrent users). Our Cloud SQL instance is db-custom-4-16384 with max_connections set to 500. Each pod’s connection pool is configured with maxPoolSize=20, and we typically run 8-10 pod replicas.


ERROR: Connection pool exhausted
    at HikariPool.getConnection(HikariPool.java:123)
Caused by: PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections

The math should work (10 pods × 20 connections = 200, well under the 500 limit), but we’re still hitting the limit. Could this be related to how Cloud SQL Proxy manages connections, or to our app’s connection pool configuration?
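For reference, the arithmetic above can be written out as a small Java sketch. The class name `ConnectionBudget` and the `reservedSlots` parameter are illustrative assumptions. The value 3 is PostgreSQL's stock default for superuser_reserved_connections, and the number actually reserved on a Cloud SQL instance may differ.

```java
public class ConnectionBudget {
    // Worst case if every pod fills its pool completely.
    static int expectedPeak(int pods, int poolSizePerPod) {
        return pods * poolSizePerPod;
    }

    // Slots left for ordinary clients once the expected peak is subtracted.
    // reservedSlots is an assumption (PostgreSQL's default is 3).
    static int headroom(int maxConnections, int reservedSlots,
                        int pods, int poolSizePerPod) {
        return maxConnections - reservedSlots - expectedPeak(pods, poolSizePerPod);
    }

    public static void main(String[] args) {
        // Figures from the post: 10 pods, pool size 20, max_connections 500.
        System.out.println("expected peak = " + expectedPeak(10, 20));
        System.out.println("headroom = " + headroom(500, 3, 10, 20));
    }
}
```

On paper that leaves close to 300 spare slots, so reaching the limit implies connections that are not accounted for by the live pods' pools.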

Check your pod termination behavior. When pods are killed or restarted during deployments and their connections aren’t closed gracefully, those connections can remain open on the Cloud SQL side for up to 10 minutes (default TCP timeout), so stale connections build up with every rollout. Implement preStop hooks in your pod spec to explicitly close database connections before termination. Also review your connection pool’s idle timeout settings.
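On the application side, one way to make sure the pool is closed on termination is a JVM shutdown hook, which runs when the container receives SIGTERM. This is a minimal sketch rather than a drop-in fix: `GracefulPoolShutdown` and `closeOnShutdown` are hypothetical names, and the `AutoCloseable` stand-in takes the place of the real pool (HikariDataSource implements Closeable, so it could be passed in the same way).

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GracefulPoolShutdown {
    // Builds a thread that closes the pool; intended for use as a shutdown hook.
    static Thread closeOnShutdown(AutoCloseable pool) {
        return new Thread(() -> {
            try {
                pool.close(); // closes pooled connections so the server frees the slots
            } catch (Exception e) {
                System.err.println("pool close failed: " + e);
            }
        }, "db-pool-shutdown");
    }

    public static void main(String[] args) {
        // Stand-in for the real pool (assumption: yours is an AutoCloseable).
        AtomicBoolean closed = new AtomicBoolean(false);
        AutoCloseable pool = () -> closed.set(true);

        Runtime.getRuntime().addShutdownHook(closeOnShutdown(pool));
        System.out.println("shutdown hook registered");
    }
}
```

The hook only helps if the Cloud SQL Proxy sidecar is still running when it fires, so the shutdown order of the two containers is worth checking as well.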

This is almost certainly a connection lifecycle issue combined with Cloud SQL Proxy configuration. I’ve seen this pattern before with containerized apps. The proxy itself doesn’t consume connections per se, but it can mask connection state issues from your application.

I suspect your application isn’t properly closing connections or has connection leaks. Even with proper pool configuration, if connections aren’t returned to the pool after use, you’ll exhaust the available connections quickly. Add connection leak detection to your HikariCP config with leakDetectionThreshold=60000 (the value is in milliseconds, so 60 seconds) to identify problematic code paths.
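As a rough sketch, the settings discussed in this thread could be collected like this. The keys are HikariCP property names (note that the pool size property is maximumPoolSize). The idleTimeout value is an illustrative assumption, not a recommendation, and `PoolSettings` is a hypothetical class name.

```java
import java.util.Properties;

public class PoolSettings {
    // Collects pool settings under HikariCP's property names.
    static Properties hikariProperties() {
        Properties p = new Properties();
        p.setProperty("maximumPoolSize", "20");           // pool size from the question
        p.setProperty("leakDetectionThreshold", "60000"); // ms; warn if a connection is held > 60 s
        p.setProperty("idleTimeout", "300000");           // ms; assumed value, tune for your workload
        return p;
    }

    public static void main(String[] args) {
        // With HikariCP on the classpath these could be passed to
        // new HikariConfig(Properties); omitted here to stay dependency-free.
        hikariProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

When the threshold is exceeded, HikariCP logs a warning with the stack trace of the code that borrowed the connection, which points at the path that failed to return it.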

Good point about reserved connections. I checked Cloud SQL monitoring and saw the actual connection count spike to 485-490 during crashes. That’s far more than our expected 200. Could there be zombie connections that aren’t being cleaned up properly?