Real-time anomaly detection for LLM costs – which metrics actually matter?

We’re scaling an internal AI assistant that uses LLM APIs, and our monthly cloud bill has grown from about $12k to $85k in three months. Finance is starting to ask questions, and honestly we don’t have great answers yet. We get daily cost rollups from our cloud provider, but by the time we spot something unusual, whatever caused it has been running for a day or more and the cost is already baked in.

We’ve tried setting up basic threshold alerts (spending over $3k/day triggers a notification), but they fire for legitimate reasons like a big product launch or end-of-quarter activity, so the team started ignoring them. We’re also seeing situations where total spend looks fine but cost per request is quietly climbing—more verbose prompts, longer responses, something—and we don’t catch it until we’re reviewing the monthly retrospective.

Has anyone implemented real-time cost anomaly detection that actually works for LLM workloads? What metrics do you track beyond total daily spend, and how do you distinguish between normal growth and actual problems without drowning in false positives?

Do you have visibility into which services or features are driving the cost increases? We tag every LLM API call with a feature identifier and route the cost data through our observability stack. When an anomaly fires, we can immediately see it’s coming from the document summarization feature or the chat interface, not just “LLM costs are up.” That context makes investigation way faster.
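
Roughly, the attribution pattern looks like this. A minimal Python sketch; the class, feature names, and per-1k-token prices are all made up for illustration, and in practice the aggregation would go to your observability stack rather than an in-memory dict:

```python
from collections import defaultdict

class CostTracker:
    """Attribute LLM API spend to the feature that made the call."""

    def __init__(self):
        self.cost_by_feature = defaultdict(float)

    def record_call(self, feature, input_tokens, output_tokens,
                    input_price_per_1k, output_price_per_1k):
        # Cost of one call at the given per-1k-token prices.
        cost = (input_tokens / 1000) * input_price_per_1k \
             + (output_tokens / 1000) * output_price_per_1k
        self.cost_by_feature[feature] += cost
        return cost

tracker = CostTracker()
tracker.record_call("doc_summarization", 4000, 1000, 0.01, 0.03)
tracker.record_call("chat", 500, 200, 0.01, 0.03)

# When an anomaly fires, the top spender is immediately visible.
print(max(tracker.cost_by_feature, key=tracker.cost_by_feature.get))
```

The point is just that the feature tag travels with every call, so the anomaly alert can name a feature instead of "LLM costs are up."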

Are you tracking which models you’re hitting? We found that our app was falling back to a more expensive model during traffic spikes because the cheaper one had rate limits. Cost per request went up 3x during those windows and we had no idea until someone manually correlated the cost data with our load balancer logs. Now we instrument model selection as a first-class metric.
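
To make that concrete, here's a hypothetical sketch of counting which model actually served each request, so a fallback to a pricier model shows up in cost-per-request instead of hiding in load balancer logs. Model names and per-call prices are invented:

```python
from collections import Counter

# Illustrative per-call prices; real numbers depend on your provider.
PRICE_PER_CALL = {"cheap-model": 0.002, "expensive-model": 0.006}

# Count which model served each request (here, a 10% fallback window).
calls = Counter()
for model in ["cheap-model"] * 90 + ["expensive-model"] * 10:
    calls[model] += 1

total_cost = sum(PRICE_PER_CALL[m] * n for m, n in calls.items())
avg_cost_per_request = total_cost / sum(calls.values())
fallback_share = calls["expensive-model"] / sum(calls.values())
print(round(avg_cost_per_request, 5), fallback_share)
```

Emitting `fallback_share` as a first-class metric is what makes the rate-limit-driven model switch visible in real time.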

One thing that helped us was alerting on unit economics degradation rather than absolute spend. We calculate cost per successful transaction (excluding retries and errors) and alert when it drifts more than 15% from the trailing seven-day average. That catches prompt bloat and inefficient context window usage way faster than looking at total bills.
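
The drift check itself is tiny. A sketch under the assumptions above (cost per successful transaction, 15% band against a trailing seven-day mean; the numbers are illustrative):

```python
def unit_cost(total_cost, successful_txns):
    """Cost per successful transaction; retries/errors excluded upstream."""
    return total_cost / successful_txns

def drifted(today_unit_cost, trailing_7d_unit_costs, threshold=0.15):
    # Compare today's unit cost to the trailing seven-day average.
    baseline = sum(trailing_7d_unit_costs) / len(trailing_7d_unit_costs)
    return abs(today_unit_cost - baseline) / baseline > threshold

history = [0.042, 0.040, 0.041, 0.043, 0.039, 0.040, 0.041]
print(drifted(0.049, history))  # ~20% above baseline, fires
print(drifted(0.041, history))  # within band, quiet
```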

We had the exact same problem with daily rollups being too slow. Switched to a tool that ingests cost data in near real-time (updates every 15 minutes) and compares current spend rate to learned baselines. The trick was tuning sensitivity—too tight and you get alert fatigue, too loose and you miss real issues. It took us about two weeks of tuning to get the thresholds dialed in, but now we catch configuration mistakes and runaway jobs before they rack up serious charges.
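
One simple way to frame that baseline-vs-spend-rate comparison is an exponentially weighted moving average over the 15-minute samples. This is a sketch, not the tool's actual algorithm; `alpha` and `band` are exactly the sensitivity knobs that take tuning, and the values here are illustrative:

```python
def ewma_alert(samples, alpha=0.3, band=0.5):
    """Return indices of 15-minute cost samples that exceed the
    EWMA baseline by more than the tolerance band."""
    baseline = samples[0]
    alerts = []
    for i, x in enumerate(samples[1:], start=1):
        if x > baseline * (1 + band):
            alerts.append(i)
        # Update the learned baseline after checking the sample.
        baseline = alpha * x + (1 - alpha) * baseline
    return alerts

# Steady ~$30 per 15-minute window, then a runaway job doubles the rate.
samples = [30, 31, 29, 30, 32, 30, 64, 66, 65]
print(ewma_alert(samples))  # fires on the jump, then adapts
```

Note the tradeoff this exposes: a large `alpha` adapts to the new level quickly and stops alerting (missing a sustained runaway), while a small `alpha` keeps firing longer but is twitchier on legitimate growth. That's the two weeks of tuning in miniature.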

Another angle: track token usage separately from cost. Sometimes pricing changes or you switch models, and your token consumption stays flat but cost moves. Tracking both lets you isolate whether the issue is behavior (using more tokens) or economics (same usage, different pricing). We export token counts and cost together into our data warehouse and run anomaly detection on both dimensions.
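
That decomposition can be written down directly: if cost per token moved, it's economics; if token volume moved, it's behavior. A hedged sketch with an illustrative 5% tolerance:

```python
def diagnose(tokens_prev, cost_prev, tokens_now, cost_now, tol=0.05):
    """Classify a cost move as behavior (token growth) or economics
    (price-per-token shift). Tolerance is illustrative."""
    token_change = (tokens_now - tokens_prev) / tokens_prev
    unit_prev = cost_prev / tokens_prev   # effective price per token
    unit_now = cost_now / tokens_now
    unit_change = (unit_now - unit_prev) / unit_prev
    if abs(token_change) > tol and abs(unit_change) <= tol:
        return "behavior"    # using more tokens, price flat
    if abs(unit_change) > tol and abs(token_change) <= tol:
        return "economics"   # same usage, different pricing
    return "mixed"

# Same 1M tokens, cost up 50%: a pricing/model change, not usage.
print(diagnose(1_000_000, 30.0, 1_000_000, 45.0))
```

Running anomaly detection on both series gives you this classification for free when an alert fires.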

We track cost per token (split by input and output tokens separately), cost per API call, and cost per active user per day. The key insight for us was learning baselines that account for time-of-day patterns and day-of-week seasonality. Our usage is way higher during business hours, so a spike at 2pm isn’t alarming but the same absolute number at 2am would be. We also built dashboards showing cost per feature so product teams can see when their experiments are getting expensive before it becomes a budget problem. The real win was catching a misconfigured retry loop within 20 minutes instead of discovering it five days later in the bill.
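
The seasonal baseline idea reduces to keeping a separate distribution per (weekday, hour) slot and scoring new observations against that slot only. A minimal sketch; the z-threshold and spend numbers are illustrative, not our production values:

```python
import statistics
from collections import defaultdict

class SeasonalBaseline:
    """Per-(weekday, hour) spend baselines with a z-score alert."""

    def __init__(self, z_threshold=3.0):
        self.history = defaultdict(list)
        self.z = z_threshold

    def observe(self, weekday, hour, spend):
        self.history[(weekday, hour)].append(spend)

    def is_anomalous(self, weekday, hour, spend):
        hist = self.history[(weekday, hour)]
        if len(hist) < 4:
            return False  # not enough data for this slot yet
        mean = statistics.mean(hist)
        stdev = statistics.stdev(hist) or 1e-9  # guard zero variance
        return abs(spend - mean) / stdev > self.z

bl = SeasonalBaseline()
for spend in [120, 118, 125, 122, 119]:  # Mondays at 14:00, busy
    bl.observe(0, 14, spend)
for spend in [8, 9, 7, 10, 8]:           # Mondays at 02:00, quiet
    bl.observe(0, 2, spend)

print(bl.is_anomalous(0, 14, 120))  # normal afternoon spend
print(bl.is_anomalous(0, 2, 120))   # same number at 2am fires
```

That's the whole 2pm-vs-2am distinction: the identical dollar figure scores near zero against the afternoon slot and way off the chart against the overnight slot.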

We also started tracking cache hit rates as a cost-adjacent metric. If cache hit rate drops, cost per request goes up because we’re making more actual API calls. Helped us catch a deployment that accidentally disabled response caching—cost jumped 40% overnight, but would’ve looked like organic growth if we’d only been watching total spend.
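
The arithmetic behind that is worth spelling out, since it explains why a cache regression masquerades as organic growth. Illustrative numbers:

```python
def expected_api_calls(requests, cache_hit_rate):
    """Only cache misses become paid API calls."""
    return requests * (1 - cache_hit_rate)

# At a 0.6 hit rate, 10k requests mean 4k paid calls. If a deploy
# silently drops the hit rate to 0.3, the same traffic makes 7k
# calls: ~75% more spend with zero change in user behavior.
print(round(expected_api_calls(10_000, 0.6)))  # 4000
print(round(expected_api_calls(10_000, 0.3)))  # 7000
```

Alerting on hit rate directly catches the regression at deploy time instead of at invoice time.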