How are you structuring platform teams to support enterprise-wide AI adoption?

We’re seeing AI usage explode across our development teams—probably 80% of our engineers are using code generation tools at least weekly, and we have dozens of small pilot projects experimenting with ML models in different product areas. But we’re hitting a wall moving any of this into production at scale. Most pilots stay pilots, and the ones that do ship often become someone’s side project to maintain, with no real governance or monitoring.

Our platform engineering group has been focused on traditional DevOps—CI/CD pipelines, infrastructure-as-code, self-service cloud provisioning. Now leadership is asking us to enable AI adoption across the org, but we’re not sure where to start. Do we build a unified internal developer platform with AI primitives baked in? Do we treat ML infrastructure as a separate capability? How do we handle GPU orchestration, model registries, and agent governance when we’re still getting comfortable with Kubernetes?

Curious how other platform teams are approaching this. Are you embedding AI enablement into your existing IDP, or building parallel tracks? What’s actually working to get teams from experimentation to production systems that deliver business value?

Don’t ignore the cost dimension. Inference costs have dropped a lot, but training large models and running fleets of agents can still burn budget fast if you’re not careful. We’ve seen teams spin up expensive GPU instances for experiments and forget to tear them down, or deploy models that rack up API charges with no business justification. Self-service is great, but pair it with quota management, cost visibility dashboards, and automated cleanup policies. Otherwise finance will shut the whole thing down when they see the bill.
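The automated-cleanup idea above can be sketched in a few lines. This is a minimal, provider-agnostic sweep, assuming experiments carry a purpose tag and an optional TTL; the record fields (`purpose`, `ttl_hours`, `launched_at`) are illustrative, not tied to any real cloud API.

```python
from datetime import datetime, timedelta, timezone

def find_expired_experiments(instances, now=None, default_ttl_hours=24):
    """Return IDs of experiment instances whose TTL has elapsed and that
    should be flagged for teardown (or auto-terminated).
    `instances` is a list of dicts with hypothetical inventory fields."""
    now = now or datetime.now(timezone.utc)
    expired = []
    for inst in instances:
        if inst.get("purpose") != "experiment":
            continue  # only sweep tagged experiment instances, never prod
        ttl = timedelta(hours=inst.get("ttl_hours", default_ttl_hours))
        if now - inst["launched_at"] > ttl:
            expired.append(inst["id"])
    return expired
```

In practice you'd run something like this on a schedule, notify owners before terminating, and feed the results into the same cost dashboards finance sees.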

One thing we learned the hard way: don’t underestimate the complexity jump when you move from managing cloud services to managing hybrid AI workloads across cloud and on-prem. Our footprint grew from thousands of services to a far larger, more heterogeneous fleet almost overnight, and suddenly we needed topology-aware GPU placement, gang scheduling for multi-GPU jobs, and network tuning we’d never dealt with before. If you’re starting fresh, I’d recommend investing heavily in solid abstractions and automation—otherwise you’ll spend all your time firefighting infrastructure instead of enabling developers.
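To make the gang-scheduling point concrete, here's a toy sketch of the core invariant (not how Kueue or any real scheduler is implemented): a multi-GPU job is admitted only if every GPU it needs can be placed at once, with node count minimized as a crude stand-in for topology awareness; otherwise it stays queued rather than being partially scheduled.

```python
def try_admit_gang(job_gpus, free_gpus_by_node):
    """Gang-scheduling sketch: place a job only if ALL requested GPUs fit
    simultaneously, preferring nodes with the most free GPUs (a crude
    topology proxy). Returns {node: gpu_count}, or None to keep queuing."""
    # Greedy: fill the freest nodes first to minimize node spread.
    nodes = sorted(free_gpus_by_node.items(), key=lambda kv: -kv[1])
    placement, remaining = {}, job_gpus
    for node, free in nodes:
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placement[node] = take
            remaining -= take
    # All-or-nothing: a gang is never partially scheduled.
    return placement if remaining == 0 else None
```

The all-or-nothing return is the part that matters: partially started multi-GPU jobs hold resources while deadlocking against each other, which is exactly what gang scheduling exists to prevent.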

What I’ve found is that most teams aren’t blocked by technology—they’re blocked by unclear ownership and fragmented accountability. IT builds infrastructure, data science builds models, product defines requirements, and legal shows up after deployment asking hard questions. You need explicit ownership: who is responsible for model performance, who monitors for drift, who handles retraining, and who responds when something breaks? Without that clarity, production deployments stall because no one wants to be on the hook.

Governance is where most organizations stumble. You need centralized data cataloging, clear lineage tracking, and access controls that apply consistently whether teams are using models in the cloud or on-premises. We set up a data governance account as a central hub—data engineers publish datasets there, and data science teams consume from that single source of truth. It sounds bureaucratic, but it’s the only way we’ve found to prevent shadow AI and compliance disasters. Also, make sure every AI agent or model has a unique identity and can be audited just like any other system component.
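The "single source of truth" hub can be as simple as a record shape everyone agrees on. A minimal sketch, with illustrative field names (this is not any particular catalog product's schema): each entry carries ownership, lineage, and the access rule that applies identically in cloud and on-prem.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    """Minimal central-catalog record: who published a dataset, what it
    was derived from, and who may consume it. Fields are illustrative."""
    dataset_id: str
    owner_team: str
    derived_from: list   # lineage: upstream dataset_ids
    allowed_roles: set   # access control, enforced in every environment
    pii: bool = False

def can_consume(entry, role):
    # One policy check, applied the same way cloud-side and on-prem.
    return role in entry.allowed_roles
```

Even this much structure kills most shadow-AI arguments: if a dataset isn't in the hub with an owner and a lineage trail, it doesn't get consumed.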

Security and agent governance can’t be an afterthought. We’re already seeing early examples of ungoverned agents—bots pulling full datasets when they should have least-privilege access, agents connecting APIs without audit trails, shadow deployments outside IT oversight. The control plane approach makes sense: every agent gets a unique ID, operates in a sandboxed environment, and is subject to the same access policies as human users. If you’re building this from scratch, design identity and access management for agents from day one, not as a retrofit.
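The control-plane idea above boils down to three mechanics: a unique identity per agent, least-privilege scopes, and an audit record for every decision. A deliberately small sketch (class and method names are my own, not any vendor's API):

```python
import uuid
from datetime import datetime, timezone

class AgentControlPlane:
    """Sketch of agent governance: unique identity per agent,
    least-privilege action scopes, and an audit trail for every access."""
    def __init__(self):
        self.scopes = {}      # agent_id -> set of allowed actions
        self.audit_log = []   # every authorization decision is recorded

    def register(self, allowed_actions):
        agent_id = str(uuid.uuid4())  # unique, non-reusable identity
        self.scopes[agent_id] = set(allowed_actions)
        return agent_id

    def authorize(self, agent_id, action):
        # Deny-by-default for unknown agents or out-of-scope actions.
        allowed = action in self.scopes.get(agent_id, set())
        self.audit_log.append({
            "agent": agent_id, "action": action, "allowed": allowed,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return allowed
```

Note that denials are logged too — the "bot pulling a full dataset" failure mode shows up as a pattern of denied `read` attempts long before it becomes an incident.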

From a developer experience perspective, the biggest win has been providing natural-language interfaces and AI-assisted automation within our platform. Developers can describe what infrastructure they need in plain language, and the platform provisions it with guardrails already in place. That’s been way more impactful than telling people to read docs and fill out YAML. But you have to be careful—if the AI hallucinates or suggests the wrong config, trust evaporates fast. Rigorous evaluation and fallback mechanisms are non-negotiable.
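The "guardrails already in place" part is the piece worth sketching: whatever the model suggests gets validated against hard platform policy before anything is provisioned, and any violation routes to fallback instead of apply. The guardrail values and request fields below are hypothetical examples, not real policy.

```python
# Hypothetical guardrails; real limits come from your platform policy.
GUARDRAILS = {
    "allowed_instance_families": {"t3", "m5", "g5"},
    "max_gpu_count": 4,
    "required_tags": {"team", "cost-center", "ttl"},
}

def validate_provision_request(req):
    """Check an AI-generated provisioning request against hard guardrails.
    Returns a list of violations; empty means safe to apply."""
    violations = []
    family = req.get("instance_type", "").split(".")[0]
    if family not in GUARDRAILS["allowed_instance_families"]:
        violations.append(f"instance family '{family}' not allowed")
    if req.get("gpu_count", 0) > GUARDRAILS["max_gpu_count"]:
        violations.append("gpu_count exceeds quota")
    missing = GUARDRAILS["required_tags"] - set(req.get("tags", {}))
    if missing:
        violations.append(f"missing tags: {sorted(missing)}")
    # On any violation, fall back to human review — never auto-apply.
    return violations
```

This is also your hallucination backstop: a confabulated instance type or an absurd GPU count fails validation deterministically, so a wrong suggestion costs trust, not money.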

We went through this exact transition last year. What worked for us was treating AI as infrastructure, not as a feature each team builds separately. We extended our existing IDP to provide self-service access to the full ML lifecycle: managed notebooks integrated with our data lake, standardized model training pipelines with GPU scheduling via Kueue, a central model registry, and deployment templates with monitoring hooks built in. The key was making it dead simple for data scientists to move from experiment to production without reinventing plumbing every time.

We also created an architecture review process for new AI projects—not to slow people down, but to ensure they’re using consistent tooling and understand the governance requirements upfront. The result has been that pilots actually graduate now, because the path to production is paved and well-lit. Still plenty of work ahead on agent orchestration and observability, but at least we’re shipping models that deliver real value instead of accumulating proof-of-concepts that never deploy.
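One concrete way to make pilots "graduate": a promotion gate on the central registry that refuses production deployment until an entry has its prerequisites. A minimal sketch — the required fields are illustrative, not a standard, and the registry entry is just a dict here:

```python
# Hypothetical production-readiness prerequisites for a registry entry.
REQUIRED_FOR_PROD = ("owner", "eval_report", "monitoring_endpoint", "rollback_plan")

def promotion_blockers(model_entry):
    """Return the prerequisites a registry entry is still missing
    (empty or falsy fields count as missing). An empty list means the
    model may graduate from pilot to production."""
    return [f for f in REQUIRED_FOR_PROD if not model_entry.get(f)]
```

Wired into the deployment template, this turns the governance conversation from "legal shows up after deployment" into a checklist teams see on day one.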