Real-time anomaly detection for AI workload costs – how granular is enough?

We’re trying to implement real-time cost anomaly detection for our AI inference workloads running across AWS and GCP. Right now we get daily cost reports, and by the time we spot something unusual, we’ve already burned through a few thousand dollars. Finance is pushing us to catch these anomalies faster, ideally within an hour or two of them starting.

We’ve experimented with setting up alerts on total daily spend increases (like 10% over baseline), but we’re drowning in false positives. Legitimate traffic spikes trigger alerts constantly, and the team has started ignoring them. We’ve also tried tracking cost-per-request as a unit metric, but our different AI features use different models with very different token costs, so it’s not clear what the baseline should be at an aggregate level.

Has anyone managed to get real-time anomaly detection working without overwhelming the team with noise? What level of granularity are you tracking – per model, per feature, per environment? And how do you handle the fact that AI costs are so volatile compared to traditional infrastructure?

From the finance side, the most useful metric we’ve found is cost-per-outcome rather than just cost-per-request. For example, if you’re running a recommendation engine, track cost per conversion or cost per user session, not just cost per API call. That way you can distinguish between healthy growth (more users, proportional cost increase) and efficiency problems (same users, higher cost per outcome).
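To make that concrete, here's a minimal sketch of the cost-per-outcome comparison. All figures are invented, and the "outcome" can be whatever your business counts (conversions, sessions, etc.):

```python
# Sketch: compare cost-per-outcome across periods to separate healthy
# growth from efficiency regressions. All numbers are hypothetical.

def cost_per_outcome(total_cost: float, outcomes: int) -> float:
    """Spend divided by business outcomes (conversions, sessions, ...)."""
    if outcomes == 0:
        return float("inf")
    return total_cost / outcomes

# Healthy growth: cost and conversions both double -> ratio unchanged.
baseline = cost_per_outcome(total_cost=500.0, outcomes=1000)   # $0.50
growth   = cost_per_outcome(total_cost=1000.0, outcomes=2000)  # $0.50

# Efficiency problem: same conversions, higher spend -> ratio jumps.
problem = cost_per_outcome(total_cost=900.0, outcomes=1000)    # $0.90
```

The point is that `growth` looks identical to `baseline` even though absolute spend doubled, while `problem` stands out even though absolute spend grew less.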

We had the exact same problem. Tracking aggregate spend was useless because everything scales together during normal growth. The breakthrough for us was switching to per-feature cost tracking with separate baselines. Each AI feature gets its own cost anomaly threshold based on its historical behavior and expected usage patterns. That way a legitimate spike in Feature A doesn’t mask a configuration problem in Feature B.
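A rough sketch of what the independent per-feature thresholds look like. The feature histories and the 3-sigma cutoff here are purely illustrative:

```python
# Sketch: each feature gets an alert threshold derived from its own
# history, so features with different scales don't share one baseline.
import statistics

def feature_threshold(history: list[float], n_sigma: float = 3.0) -> float:
    """Alert threshold = historical mean + n_sigma standard deviations."""
    return statistics.fmean(history) + n_sigma * statistics.stdev(history)

def is_anomalous(cost: float, history: list[float]) -> bool:
    return cost > feature_threshold(history)

# Feature A runs hot and spiky; Feature B is small and steady.
history_a = [120.0, 135.0, 110.0, 150.0, 128.0]
history_b = [8.0, 9.0, 8.5, 9.2, 8.8]
```

With these made-up histories, a $160 interval is normal noise for Feature A, but $15 is a clear anomaly for Feature B, even though it's a tenth of the absolute amount.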

Thanks everyone, this is really helpful. Sounds like the consensus is per-feature or per-endpoint baselines with 15–30 minute detection windows, plus integration with deployment events to reduce false positives. We’ll start with tagging cleanup and then build out the auto-baselining. Appreciate the concrete direction.

Do you have tagging in place for all your AI workloads? We struggled with attribution until we enforced consistent tagging by team, feature, and environment. Once we had that, we could set per-team baselines and alert the right people when their specific workloads went off-track. Without it, everything was just a black box of aggregate spend.
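The enforcement check itself can be very simple. Here's a sketch; the required tag keys are just the three we use, and the sample records are made up rather than any real billing-export schema:

```python
# Sketch: flag cost line items missing the tags we attribute spend by.
REQUIRED_TAGS = {"team", "feature", "environment"}

def missing_tags(resource_tags: dict[str, str]) -> set[str]:
    """Tags a resource still needs before its spend can be attributed."""
    return REQUIRED_TAGS - set(resource_tags)

tagged   = {"team": "search", "feature": "rerank", "environment": "prod"}
untagged = {"team": "search"}
```

We run this over the billing export and route anything with missing tags to an "unattributed" bucket that gets reviewed weekly; shrinking that bucket to near zero is what made per-team baselines possible.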

You need to build dynamic baselines that account for time-of-day and day-of-week patterns, not static thresholds. We implemented auto-baselining where the system learns what normal spending looks like for each service, environment, and model endpoint over the past few weeks. The key is that normal spending at 2pm on a Tuesday looks very different from 3am on a Sunday, and the baseline adjusts for that.
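A toy version of the hour-of-week baselining, assuming you can pull timestamped cost samples out of your billing data (the samples below are made up, and a real system would use several weeks of them):

```python
# Sketch: dynamic baselines keyed by (weekday, hour), so "normal" at
# 2pm Tuesday differs from 3am Sunday.
from collections import defaultdict
from datetime import datetime
import statistics

def build_baselines(samples: list[tuple[datetime, float]]) -> dict:
    """Mean historical cost per (weekday, hour) slot."""
    slots = defaultdict(list)
    for ts, cost in samples:
        slots[(ts.weekday(), ts.hour)].append(cost)
    return {slot: statistics.fmean(costs) for slot, costs in slots.items()}

def expected_cost(baselines: dict, ts: datetime) -> float:
    return baselines[(ts.weekday(), ts.hour)]

# Two weeks of samples for two slots (values invented).
samples = [
    (datetime(2024, 6, 4, 14), 40.0),   # Tuesday 2pm
    (datetime(2024, 6, 11, 14), 44.0),  # Tuesday 2pm
    (datetime(2024, 6, 2, 3), 5.0),     # Sunday 3am
    (datetime(2024, 6, 9, 3), 7.0),     # Sunday 3am
]
baselines = build_baselines(samples)
```

Then the anomaly check compares the current interval's cost against `expected_cost` for its slot instead of a single static number.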

For granularity, we track cost-per-request at the model endpoint level, not aggregated. So we know if a particular inference endpoint suddenly starts costing 40% more per call even if total volume hasn’t changed. That catches configuration problems like accidental routing to the wrong region, expanded prompt sizes, or falling back to more expensive model variants during peak load.
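The unit-cost drift check itself is trivial once you have per-endpoint cost and volume; a sketch with invented figures:

```python
# Sketch: flag an endpoint whose cost per call drifts above its own
# baseline even when request volume is flat.
def per_call_cost(total_cost: float, calls: int) -> float:
    return total_cost / calls

def cost_drift(current: float, baseline: float) -> float:
    """Fractional change in per-call cost vs. baseline."""
    return (current - baseline) / baseline

baseline = per_call_cost(total_cost=200.0, calls=100_000)  # $0.0020/call
# Same volume, higher spend: e.g. routing to a pricier region.
current = per_call_cost(total_cost=280.0, calls=100_000)   # $0.0028/call

drift = cost_drift(current, baseline)  # 40% more per call, flat volume
```

An aggregate-spend alert would need total cost to move before firing; this fires on the per-call ratio alone.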

On detection latency, we ingest cost and usage data every 15 minutes and flag anomalies within that window. That’s fast enough to catch issues before they scale across all traffic but slow enough to avoid reacting to momentary blips. When an anomaly fires, the alert includes exactly which endpoint or feature changed, recent deployments in that area, and who owns it, so the right people can act on it immediately. We’ve caught runaway training jobs, misconfigured model routing, and API retry storms this way before they became serious cost problems.
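The alert enrichment is mostly lookups; a stripped-down sketch, where the deploy log and ownership registry are stand-ins for whatever systems you already have:

```python
# Sketch: attach context to a cost anomaly so the alert is actionable.
# DEPLOYS and OWNERS are hypothetical stand-ins for a deploy log and
# an ownership registry.
DEPLOYS = {"rerank-endpoint": ["2024-06-11 13:40 rerank v2.3"]}
OWNERS = {"rerank-endpoint": "search-team"}

def build_alert(endpoint: str, baseline: float, observed: float) -> dict:
    return {
        "endpoint": endpoint,
        "baseline_cost": baseline,
        "observed_cost": observed,
        "recent_deploys": DEPLOYS.get(endpoint, []),
        "owner": OWNERS.get(endpoint, "unassigned"),
    }

alert = build_alert("rerank-endpoint", baseline=40.0, observed=95.0)
```

Routing the alert to `alert["owner"]` instead of a shared channel is what stopped people from tuning the alerts out.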

We’ve also found that token-level tracking is critical for LLM workloads. A small change in prompt structure or response length can double token usage without changing the number of API calls. If you’re only tracking request volume, you’ll miss it completely. We instrument every LLM call to log input tokens, output tokens, and total cost, then aggregate that by feature.
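Our instrumentation boils down to something like this; the per-token prices below are placeholders, not any provider's actual rates:

```python
# Sketch: log tokens and cost per LLM call, aggregated by feature.
from collections import defaultdict

PRICE_IN, PRICE_OUT = 0.000003, 0.000015  # assumed $/token, not real rates

def log_call(ledger: dict, feature: str, tokens_in: int, tokens_out: int) -> None:
    """Record one LLM call's token counts and cost against its feature."""
    entry = ledger[feature]
    entry["tokens_in"] += tokens_in
    entry["tokens_out"] += tokens_out
    entry["cost"] += tokens_in * PRICE_IN + tokens_out * PRICE_OUT

ledger = defaultdict(lambda: {"tokens_in": 0, "tokens_out": 0, "cost": 0.0})
log_call(ledger, "summarize", tokens_in=1200, tokens_out=300)
log_call(ledger, "summarize", tokens_in=1100, tokens_out=280)
log_call(ledger, "chat", tokens_in=400, tokens_out=150)
```

With this in place, a prompt change that doubles `tokens_in` shows up in the feature's cost trend even when the request count is unchanged.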