GPU workload placement strategy: when to burst to cloud vs. retain on-prem?

We’re architecting our AI infrastructure roadmap and running into the classic placement dilemma. We’ve got a decent fleet of on-prem GPUs handling baseline training workloads, but we’re seeing two pressure points. First, our model experimentation velocity is constrained—data science teams want to spin up large training runs without waiting for capacity. Second, our cost predictability is actually pretty good on-prem, but we’re worried we’re over-provisioned for average load and under-provisioned for peak.

The options we’re weighing are either a pure bursting model where we keep steady-state workloads on-prem and overflow into AWS or Azure during spikes, or a more federated approach where we intentionally distribute certain workload types across clouds from the start. Kubernetes is our orchestration layer, but we’re still figuring out the governance and cost controls to make this actually work without surprise bills or spiraling complexity.

Curious what others have landed on. Are you treating cloud as pure overflow capacity, or are you making deliberate placement decisions by workload type? And how are you handling the networking and identity management across environments without it becoming a maintenance nightmare?

Are you seeing GPU reliability issues when you burst to the cloud? We’ve had spot instances disappear mid-training more than once, and it’s frustrating. We’re starting to treat cloud as best-effort capacity and keeping anything mission-critical on our own hardware. The trade-off is we’re probably paying more for redundancy than we need to, but at least we know the workloads will complete.
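For what it’s worth, spot preemption is survivable if jobs checkpoint aggressively to shared storage and resume on restart. A minimal sketch of the pattern (paths and function names here are illustrative, not from any particular framework):

```python
import os
import pickle

CHECKPOINT_PATH = "/shared/ckpt/job.pkl"  # hypothetical shared-storage path

def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Atomically persist training state so a preempted job can resume."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename avoids torn checkpoints

def load_checkpoint(path=CHECKPOINT_PATH):
    """Resume from the last checkpoint, or start fresh."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

def train(total_steps, ckpt_every=100, path=CHECKPOINT_PATH):
    """Toy training loop: resumes from the checkpoint and saves periodically."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        # ... one training step would go here ...
        state["step"] = step + 1
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state, path)
    return state
```

If a spot node vanishes between checkpoints you only lose work back to the last save, which makes the loss rate a tunable cost (checkpoint frequency) rather than a job-killing event.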

We ended up with deliberate placement by workload type rather than just overflow. Compliance-sensitive stuff stays on-prem no matter what. Experimentation and one-off research projects go straight to the cloud because we don’t want to tie up our on-prem fleet. Production inference runs on-prem because the cost is flat and predictable. The federated model works for us because we have teams in different regions with different data residency rules, so we’re managing multiple clusters anyway.
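Our placement rules ended up simple enough to express as a lookup. A sketch of the policy described above (workload-type names and the function are hypothetical, just illustrating the shape):

```python
def place(workload_type: str, compliance_sensitive: bool = False) -> str:
    """Return 'on-prem' or 'cloud' for a given workload type."""
    if compliance_sensitive:
        return "on-prem"  # data residency rules always win
    rules = {
        "production-inference": "on-prem",  # flat, predictable cost
        "recurring-training": "on-prem",    # steady baseline load
        "experimentation": "cloud",         # don't tie up the on-prem fleet
        "research-one-off": "cloud",
    }
    return rules.get(workload_type, "on-prem")  # default to owned capacity
```

Keeping the policy this explicit means there’s one place to argue about when someone wants an exception.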

We went with the bursting model about eighteen months ago and it’s been solid for us. On-prem handles everything that runs predictably—our production inference workloads and the recurring training pipelines. When the research teams need to scale up fast, we let Kubernetes route those jobs to cloud GPUs automatically. The key for us was setting hard budget caps and egress limits in Terraform so we don’t get caught with runaway costs. Networking was surprisingly straightforward once we got VPN tunnels and IAM federation locked down.
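The Terraform side encodes the caps themselves; the admission-side logic is the part worth sketching. Something like this (numbers, names, and the reserve heuristic are illustrative, not our exact implementation):

```python
def allow_burst(current_month_spend: float,
                job_estimate: float,
                monthly_cap: float,
                reserve_fraction: float = 0.1) -> bool:
    """Admit a cloud-burst job only if its estimated cost fits under the
    monthly cap, holding back a reserve for jobs already in flight."""
    headroom = monthly_cap * (1.0 - reserve_fraction) - current_month_spend
    return job_estimate <= headroom
```

Jobs that fail the check queue for on-prem capacity instead of bursting, so the worst case is slower experiments rather than a surprise bill.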

One thing we learned the hard way: don’t underestimate data gravity. We tried routing training jobs to the cloud, but the datasets lived on-prem, and egress fees plus latency killed us. Now we do a two-stage approach—train the foundation models in the cloud where the data can live cheaply in object storage, then pull the optimized models back on-prem for inference. Keeps the costs manageable and inference latency low.
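The arithmetic that pushed us to the two-stage approach is worth making concrete. With a placeholder egress rate (actual rates vary by provider and tier), moving the dataset repeatedly dwarfs pulling the model back once:

```python
EGRESS_RATE = 0.09  # $/GB, placeholder rate; check your provider's pricing

def egress_cost(gb: float, rate: float = EGRESS_RATE) -> float:
    """Rough egress cost for moving `gb` gigabytes out of the cloud."""
    return gb * rate

# Shipping a 20 TB dataset out per training cycle vs. pulling back a 5 GB
# trained model: three orders of magnitude apart.
dataset_cost = egress_cost(20_000)  # 20 TB of training data
model_cost = egress_cost(5)         # one optimized model artifact
```

Keep the data next to the compute and only move the small artifact across the boundary; that one rule drove most of our placement decisions.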

Networking complexity is real. We use a centralized identity provider and enforce policies through a single control plane across all clusters. That’s been the only way to keep it manageable. Also worth mentioning—monitoring and observability across hybrid environments is harder than it looks. We ended up with Prometheus and Grafana scraping metrics from both on-prem and cloud clusters, but unifying the dashboards took effort.
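The core of the unification problem is that the same metric scraped from two clusters is only comparable once every sample carries a cluster label (Prometheus handles this with `external_labels`; the sketch below is a toy illustration of the idea, not Prometheus internals):

```python
def label_samples(samples, cluster):
    """Attach a cluster label to each (metric, labels, value) sample."""
    return [(metric, {**labels, "cluster": cluster}, value)
            for metric, labels, value in samples]

def merge(*labeled_streams):
    """Combine per-cluster sample streams into one queryable set."""
    merged = []
    for stream in labeled_streams:
        merged.extend(stream)
    return merged

# Example: the same gpu_utilization metric from both environments,
# distinguishable only because of the cluster label.
onprem = label_samples([("gpu_utilization", {"node": "a1"}, 0.92)], "onprem")
cloud = label_samples([("gpu_utilization", {"node": "x9"}, 0.41)], "aws")
unified = merge(onprem, cloud)
```

Getting that label applied consistently at scrape time, rather than patching it into dashboards later, was most of the effort in unifying our views.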