Choosing between GKE and Cloud Run for scaling ETL jobs in data warehouse pipelines: cost, performance, and manageability

We’re redesigning our data warehouse ETL architecture and debating between GKE and Cloud Run for running our transformation jobs. Currently using Compute Engine VMs with cron jobs, which is becoming hard to manage as we scale to 50+ daily ETL pipelines.

Our requirements: Jobs range from 5-minute quick transforms to 2-hour complex aggregations. Most jobs are Python-based with some using Spark. We need to orchestrate dependencies between jobs and handle retries gracefully. Cost is important but not the primary driver - we value operational simplicity and maintainability.
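For the dependency-and-retry requirement, the core logic is small enough to sketch in plain Python before picking an orchestrator: topological ordering plus exponential-backoff retries. The job names and `run_pipeline` helper below are hypothetical, just to show the shape of the problem.

```python
import time
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def run_with_retries(job_fn, max_attempts=3, base_delay=1.0):
    """Run a job, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job_fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

def run_pipeline(jobs, deps):
    """Run jobs in dependency order; deps maps job -> set of prerequisites."""
    for name in TopologicalSorter(deps).static_order():
        run_with_retries(jobs[name])

# Hypothetical pipeline: a transform that depends on two extract jobs.
results = []
jobs = {
    "extract_a": lambda: results.append("extract_a"),
    "extract_b": lambda: results.append("extract_b"),
    "transform": lambda: results.append("transform"),
}
deps = {"transform": {"extract_a", "extract_b"}}
run_pipeline(jobs, deps)
```

Any real orchestrator (Composer/Airflow, Cloud Workflows) gives you this for free, but it is worth being explicit about which semantics you need: per-job retry counts, backoff, and whether a failed job should block only its downstream dependents or the whole run.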

I’m leaning toward Cloud Run for its serverless simplicity, but our team has more Kubernetes experience. Some jobs need stateful processing with local disk caching. Interested in hearing real-world experiences with both approaches for data warehouse ETL workloads. What are the practical tradeoffs beyond the marketing materials?

Autopilot is much simpler - Google manages the control plane and nodes, you just deploy workloads. However, you lose some flexibility: DaemonSets and node-level configuration are restricted, and you're limited to the machine families Autopilot supports. For ETL workloads, these limitations rarely matter. The real tradeoff is cost - Autopilot bills per pod resource request at a premium (~10% more in our experience) for the managed experience, but you save on operational overhead. Standard GKE gives you full control if your Spark jobs need custom node configurations, like high-memory machines or local SSDs.

Thanks all for the insights. The hybrid approach Sarah mentioned is intriguing. For orchestration, we’re evaluating Cloud Composer (managed Airflow) which can trigger both Cloud Run and GKE jobs. The stateful processing concern James raised is valid - reviewing our jobs, only about 10-15% actually need local disk caching, mostly the Spark-based ones. Those could go to GKE while the rest use Cloud Run. Mike, how complex is managing GKE Autopilot versus standard GKE for this use case?
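To make the hybrid idea concrete: the Google provider package for Airflow (`apache-airflow-providers-google`) ships operators for both targets, so a single Composer DAG can route the stateless transforms to Cloud Run jobs and the Spark work to GKE pods. A rough sketch - all project, region, cluster, and image names below are placeholders, and exact operator parameters can vary by provider version:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_run import CloudRunExecuteJobOperator
from airflow.providers.google.cloud.operators.kubernetes_engine import GKEStartPodOperator

with DAG(
    dag_id="hybrid_etl",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # nightly at 02:00
    catchup=False,
) as dag:
    # Stateless Python transform runs as a pre-created Cloud Run job.
    quick_transform = CloudRunExecuteJobOperator(
        task_id="quick_transform",
        project_id="my-project",      # placeholder
        region="us-central1",
        job_name="quick-transform",   # placeholder Cloud Run job
    )

    # Spark aggregation that needs local disk runs as a pod on GKE.
    spark_aggregate = GKEStartPodOperator(
        task_id="spark_aggregate",
        project_id="my-project",
        location="us-central1",
        cluster_name="etl-cluster",   # placeholder cluster
        namespace="default",
        name="spark-aggregate",
        image="us-central1-docker.pkg.dev/my-project/etl/spark-aggregate:latest",
        retries=2,  # Airflow handles the retry policy per task
    )

    quick_transform >> spark_aggregate
```

Dependencies and retries then live in one place (the DAG), regardless of where each job physically executes.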

Cost perspective: Cloud Run's pay-per-use model is compelling for ETL workloads with variable schedules. You pay only for execution time, not idle capacity. GKE nodes run 24/7 unless you configure the cluster autoscaler aggressively (node pools can scale to zero between runs, but scale-up adds latency to job starts). For your 50+ daily pipelines, if they run at different times throughout the day, Cloud Run could be significantly cheaper. However, if all jobs cluster around certain hours (common with nightly ETL), GKE with properly sized node pools might be more cost-effective. Run the math with your actual job execution patterns.
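"Run the math" can literally be a few lines of Python. The unit prices below are made-up placeholders (real Cloud Run per-vCPU-second and per-GiB-second rates and GKE node prices vary by region and machine type), but the structure of the comparison is the point: Cloud Run cost scales with total execution seconds, GKE cost with node uptime.

```python
def cloud_run_cost(jobs_per_day, avg_minutes, vcpus, gib_ram,
                   vcpu_sec_price, gib_sec_price):
    """Pay only for execution time: per-second vCPU + memory charges."""
    seconds = jobs_per_day * avg_minutes * 60
    return seconds * (vcpus * vcpu_sec_price + gib_ram * gib_sec_price)

def gke_node_cost(nodes, hours_per_day, node_hour_price):
    """Pay for node uptime, regardless of how busy the nodes are."""
    return nodes * hours_per_day * node_hour_price

# Hypothetical rates and workload, for illustration only - plug in
# your region's actual prices and your real job execution profile.
daily_run = cloud_run_cost(jobs_per_day=50, avg_minutes=20, vcpus=2,
                           gib_ram=4, vcpu_sec_price=0.000024,
                           gib_sec_price=0.0000025)
daily_gke = gke_node_cost(nodes=3, hours_per_day=24, node_hour_price=0.19)
```

With these toy numbers the spread-out workload favors Cloud Run; shrink `hours_per_day` to model autoscaled node pools that only exist during the nightly window and the GKE number drops fast.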