Storage architecture for distributed AI: training centralized, inference everywhere

We’re at an inflection point with our AI infrastructure and I’m curious how others are handling the storage side of things. Our ML team has been running training jobs in a single cloud region with decent success, but as we start moving models to production we’re hitting a wall around geographic distribution and latency. The business wants inference to happen close to users across multiple regions for compliance and performance reasons, but training still needs to be centralized where we have our GPU clusters and can optimize utilization.

The challenge is that this split creates all kinds of storage headaches. Training data pipelines flow toward the central hub, but now we need models and feature definitions replicated globally for inference. We’re also dealing with the usual suspects: versioning chaos in the model registry, training-serving skew because features get computed differently at inference time, and checkpoint management that keeps breaking during long training runs. Our current approach is basically duct tape—different storage services for different stages, manual syncing, no real governance.

I know we’re not the only ones dealing with this. What patterns or architectures have worked for you when training is centralized but inference needs to be distributed? How are you keeping feature stores consistent across regions? And are people actually using lakehouse setups or is that just vendor talk?

Something to watch out for with distributed inference: make sure your observability spans the whole path from model registry to regional endpoints. We had a situation where a model performed fine in staging but tanked in production in one specific region because feature values were stale due to replication lag. Took us way too long to diagnose because we weren’t monitoring feature freshness per region. Now we track latency, feature staleness, and prediction drift independently for each inference location. Also consider how you’ll handle rollbacks—if a bad model gets pushed to ten regions you need a fast way to revert everywhere at once.
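To make the per-region tracking concrete, here’s a minimal sketch of the idea (region names, thresholds, and field names are made up for illustration): keep health metrics per inference location and flag any region whose online features have drifted past a staleness budget.

```python
import time
from dataclasses import dataclass

# Hypothetical staleness budget: alert if features are older than 5 minutes.
STALENESS_LIMIT_S = 300

@dataclass
class RegionMetrics:
    """Per-region inference health, tracked independently per location."""
    last_feature_sync: float   # epoch seconds of last feature replication
    p99_latency_ms: float = 0.0
    drift_score: float = 0.0   # e.g. PSI vs. the training distribution

def stale_regions(metrics: dict[str, RegionMetrics], now: float) -> list[str]:
    """Return regions whose online features exceed the staleness budget."""
    return [
        region for region, m in metrics.items()
        if now - m.last_feature_sync > STALENESS_LIMIT_S
    ]

now = time.time()
fleet = {
    "us-east": RegionMetrics(last_feature_sync=now - 30),
    "eu-west": RegionMetrics(last_feature_sync=now - 900),  # lagging replica
}
print(stale_regions(fleet, now))  # -> ['eu-west']
```

The point is that staleness is a first-class signal alongside latency and drift, computed per region rather than fleet-wide, so a single lagging replica surfaces immediately instead of being averaged away.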

Have you looked at orchestration frameworks like Flyte or Kubeflow for managing the pipeline complexity? We integrated our training pipelines with Flyte and it handles a lot of the coordination between data services and training jobs automatically. You define resource requirements and endpoints in config and it provisions the infrastructure. We also reuse data services across multiple training runs, which cuts down on redundant storage. Flyte’s agent framework meant we didn’t have to build custom Kubernetes operators from scratch; we just extended what was already there.

Model registry governance has been our biggest pain point. We’re using MLflow for versioning and lineage tracking and it’s been solid for the technical side—every model version links back to the exact experiment run, training data version, and hyperparameters used. The harder part was building the approval workflows around it. We added custom stages so models progress from experimental to staging to production with sign-offs at each gate. The registry integrates with our CI/CD so promotion triggers automated deployment to the appropriate regions. Still manual review for production approval but at least we have visibility now into what’s deployed where.
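The gating logic is simple enough to sketch in plain Python (stage names roughly match ours; the class, approver names, and deploy hook are illustrative stand-ins for what MLflow custom stages plus CI/CD actually do for us): every promotion records who signed off, and hitting production hands off to the deployment pipeline.

```python
STAGES = ["experimental", "staging", "production"]

def trigger_regional_deploy(mv: "ModelVersion") -> None:
    # Stand-in for the CI/CD hook that fans the model out to its regions.
    print(f"deploying {mv.name} v{mv.version} to approved regions")

class ModelVersion:
    """One registered model version moving through the approval gates."""

    def __init__(self, name: str, version: int):
        self.name, self.version = name, version
        self.stage = "experimental"
        self.approvals: list[str] = []

    def promote(self, approver: str) -> str:
        """Advance one stage, recording the sign-off at each gate."""
        nxt = STAGES[STAGES.index(self.stage) + 1]
        self.approvals.append(f"{approver} -> {nxt}")
        self.stage = nxt
        if nxt == "production":
            trigger_regional_deploy(self)  # promotion triggers deployment
        return nxt

mv = ModelVersion("ranker", 7)
mv.promote("alice")  # experimental -> staging
mv.promote("bob")    # staging -> production, kicks off regional deploy
```

The useful property is the audit trail: "what is deployed where, and who approved it" falls out of the recorded transitions instead of living in someone’s head.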

One thing that helped us was realizing we don’t need to replicate all features everywhere. Most of our models only use a subset of features and only the latest values matter for inference. We set up a feature registry that tracks dependencies per model, and when we deploy a model to a new region the system automatically provisions just the features that model needs in the local online store. Keeps storage costs down and reduces replication lag. For training we keep full history in a Delta Lake setup but inference regions only get snapshots.
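The provisioning step reduces to a set union over the feature registry. A minimal sketch, with made-up model and feature names (our real registry tracks versions and freshness requirements too): given the models deployed in a region, compute exactly which features its online store needs.

```python
# Hypothetical feature registry: which features each model depends on.
FEATURE_DEPS = {
    "ranker_v3": {"user_ctr_7d", "item_price", "item_category"},
    "fraud_v1":  {"user_ctr_7d", "txn_velocity"},
}

def features_for_region(deployed_models: list[str]) -> set[str]:
    """Union of feature dependencies for the models served in one region.

    Only these features get replicated into that region's online store;
    the full history stays in the central Delta Lake.
    """
    needed: set[str] = set()
    for model in deployed_models:
        needed |= FEATURE_DEPS[model]
    return needed

# A region serving only the ranker never replicates txn_velocity.
print(sorted(features_for_region(["ranker_v3"])))
```

Running it for a region with both models would add `txn_velocity` while still shipping `user_ctr_7d` only once, which is where the storage and replication-lag savings come from.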