Storage architecture for distributed AI: training centralized, inference everywhere

We’re at an inflection point with our AI infrastructure and I’m curious how others are handling the storage side of things. Our ML team has been running training jobs in a single cloud region with decent success, but as we start moving models to production we’re hitting a wall around geographic distribution and latency. The business wants inference to happen close to users across multiple regions for compliance and performance reasons, but training still needs to be centralized where we have our GPU clusters and can optimize utilization.

The challenge is that this split creates all kinds of storage headaches. Training data pipelines flow toward the central hub, but now we need models and feature definitions replicated globally for inference. We’re also dealing with the usual suspects: versioning chaos in the model registry, training-serving skew because features get computed differently at inference time, and checkpoint management that keeps breaking during long training runs. Our current approach is basically duct tape—different storage services for different stages, manual syncing, no real governance.

I know we’re not the only ones dealing with this. What patterns or architectures have worked for you when training is centralized but inference needs to be distributed? How are you keeping feature stores consistent across regions? And are people actually using lakehouse setups or is that just vendor talk?

Something to watch out for with distributed inference: make sure your observability spans the whole path from model registry to regional endpoints. We had a situation where a model performed fine in staging but tanked in production in one specific region because feature values were stale due to replication lag. Took us way too long to diagnose because we weren’t monitoring feature freshness per region. Now we track latency, feature staleness, and prediction drift independently for each inference location. Also consider how you’ll handle rollbacks—if a bad model gets pushed to ten regions you need a fast way to revert everywhere at once.
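To make the per-region tracking concrete, here’s a minimal sketch of the idea (region names, thresholds, and field names are made up for illustration): keep health metrics per inference location and flag any region whose online features have drifted past a staleness budget.

```python
import time
from dataclasses import dataclass

# Hypothetical staleness budget: alert if features are older than 5 minutes.
STALENESS_LIMIT_S = 300

@dataclass
class RegionMetrics:
    """Per-region inference health, tracked independently per location."""
    last_feature_sync: float   # epoch seconds of last feature replication
    p99_latency_ms: float = 0.0
    drift_score: float = 0.0   # e.g. PSI vs. the training distribution

def stale_regions(metrics: dict[str, RegionMetrics], now: float) -> list[str]:
    """Return regions whose online features exceed the staleness budget."""
    return [
        region for region, m in metrics.items()
        if now - m.last_feature_sync > STALENESS_LIMIT_S
    ]

now = time.time()
fleet = {
    "us-east": RegionMetrics(last_feature_sync=now - 30),
    "eu-west": RegionMetrics(last_feature_sync=now - 900),  # lagging replica
}
print(stale_regions(fleet, now))  # -> ['eu-west']
```

The point is that staleness is a first-class signal alongside latency and drift, computed per region rather than fleet-wide, so a single lagging replica surfaces immediately instead of being averaged away.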

Have you looked at orchestration frameworks like Flyte or Kubeflow for managing the pipeline complexity? We integrated our training pipelines with Flyte and it handles a lot of the coordination between data services and training jobs automatically. You define resource requirements and endpoints in config and it provisions the infrastructure. We also reuse data services across multiple training runs, which cuts down on redundant storage. Flyte’s agent framework meant we didn’t have to build custom Kubernetes operators from scratch; we just extended what was already there.

Model registry governance has been our biggest pain point. We’re using MLflow for versioning and lineage tracking and it’s been solid for the technical side—every model version links back to the exact experiment run, training data version, and hyperparameters used. The harder part was building the approval workflows around it. We added custom stages so models progress from experimental to staging to production with sign-offs at each gate. The registry integrates with our CI/CD so promotion triggers automated deployment to the appropriate regions. Still manual review for production approval but at least we have visibility now into what’s deployed where.
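The gating logic is simple enough to sketch in plain Python (stage names roughly match ours; the class, approver names, and deploy hook are illustrative stand-ins for what MLflow custom stages plus CI/CD actually do for us): every promotion records who signed off, and hitting production hands off to the deployment pipeline.

```python
STAGES = ["experimental", "staging", "production"]

def trigger_regional_deploy(mv: "ModelVersion") -> None:
    # Stand-in for the CI/CD hook that fans the model out to its regions.
    print(f"deploying {mv.name} v{mv.version} to approved regions")

class ModelVersion:
    """One registered model version moving through the approval gates."""

    def __init__(self, name: str, version: int):
        self.name, self.version = name, version
        self.stage = "experimental"
        self.approvals: list[str] = []

    def promote(self, approver: str) -> str:
        """Advance one stage, recording the sign-off at each gate."""
        nxt = STAGES[STAGES.index(self.stage) + 1]
        self.approvals.append(f"{approver} -> {nxt}")
        self.stage = nxt
        if nxt == "production":
            trigger_regional_deploy(self)  # promotion triggers deployment
        return nxt

mv = ModelVersion("ranker", 7)
mv.promote("alice")  # experimental -> staging
mv.promote("bob")    # staging -> production, kicks off regional deploy
```

The useful property is the audit trail: "what is deployed where, and who approved it" falls out of the recorded transitions instead of living in someone’s head.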

One thing that helped us was realizing we don’t need to replicate all features everywhere. Most of our models only use a subset of features and only the latest values matter for inference. We set up a feature registry that tracks dependencies per model, and when we deploy a model to a new region the system automatically provisions just the features that model needs in the local online store. Keeps storage costs down and reduces replication lag. For training we keep full history in a Delta Lake setup but inference regions only get snapshots.
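The provisioning step reduces to a set union over the feature registry. A minimal sketch, with made-up model and feature names (our real registry tracks versions and freshness requirements too): given the models deployed in a region, compute exactly which features its online store needs.

```python
# Hypothetical feature registry: which features each model depends on.
FEATURE_DEPS = {
    "ranker_v3": {"user_ctr_7d", "item_price", "item_category"},
    "fraud_v1":  {"user_ctr_7d", "txn_velocity"},
}

def features_for_region(deployed_models: list[str]) -> set[str]:
    """Union of feature dependencies for the models served in one region.

    Only these features get replicated into that region's online store;
    the full history stays in the central Delta Lake.
    """
    needed: set[str] = set()
    for model in deployed_models:
        needed |= FEATURE_DEPS[model]
    return needed

# A region serving only the ranker never replicates txn_velocity.
print(sorted(features_for_region(["ranker_v3"])))
```

Running it for a region with both models would add `txn_velocity` while still shipping `user_ctr_7d` only once, which is where the storage and replication-lag savings come from.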