We’ve been running a pilot chatbot for our internal support desk using GPT-4, and the demo results have been solid. Now we’re looking at pushing it to production for our full employee base (around 8,000 people), and the finance team just came back with concerns after seeing the projected token spend.
During the pilot with 50 users, we were averaging about $2,500 a month. Extrapolating that per-user spend out to our full employee base puts us somewhere north of $400K a month just for this one use case. When we started digging into where the cost is coming from, we realized our prompts are pulling in way more context than they probably need to, and we're hitting GPT-4 for every query even when a lot of them are just simple FAQ lookups that a smaller model could handle.
Has anyone else dealt with inference costs spiraling once you go from pilot to real production traffic? What levers actually work for bringing this under control without killing the user experience?
We hit almost the exact same wall last year. What worked for us was putting a routing layer in front of the LLM. Most requests go to a fine-tuned smaller model now, and we only escalate to GPT-4 when the confidence score is low or the query is flagged as complex. Cut our monthly bill by about 40% without users noticing any drop in quality.
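Rough shape of what the router looks like, heavily simplified. The model names, the confidence heuristic, and the threshold below are placeholders; our real router uses the fine-tuned model's own confidence score rather than an LLM self-rating.

```python
# Minimal routing sketch: try the cheap model first, escalate to GPT-4 only
# when the first pass looks low-confidence. All names/thresholds are placeholders.
from openai import OpenAI

client = OpenAI()

CHEAP_MODEL = "gpt-3.5-turbo"      # stand-in for the fine-tuned smaller model
EXPENSIVE_MODEL = "gpt-4"
CONFIDENCE_THRESHOLD = 0.7         # tune against labeled pilot traffic

def faq_confidence(query: str) -> float:
    """Cheap first pass: how likely is this answerable as a simple FAQ?"""
    resp = client.chat.completions.create(
        model=CHEAP_MODEL,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Rate from 0 to 1 how likely this internal support "
                        "question can be answered from the standard FAQ. "
                        "Reply with only the number."},
            {"role": "user", "content": query},
        ],
    )
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0   # unparseable rating: treat as complex and escalate

def answer(query: str) -> str:
    model = CHEAP_MODEL if faq_confidence(query) >= CONFIDENCE_THRESHOLD else EXPENSIVE_MODEL
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content
```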
Are you tracking token usage by department or team? We found that two teams were responsible for 60% of our spend because they were running batch jobs through the API without any throttling. Once we made the cost visible and added guardrails, behavior changed pretty quickly.
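The attribution side doesn't need to be fancy; something along these lines is enough to make the spend visible. The team tag, prices, and in-memory rollup here are placeholders (a real version writes to a warehouse, not a dict):

```python
# Hypothetical per-team cost attribution: tag every request with the caller's
# team and roll up token spend. Prices are assumed per-1K-token rates.
from collections import defaultdict

PRICES = {"gpt-4": {"prompt": 0.03, "completion": 0.06}}  # substitute your contract rates

usage_by_team: dict[str, float] = defaultdict(float)

def record_usage(team: str, model: str, prompt_tokens: int, completion_tokens: int) -> None:
    price = PRICES[model]
    cost = (prompt_tokens / 1000) * price["prompt"] \
         + (completion_tokens / 1000) * price["completion"]
    usage_by_team[team] += cost

# The chat completion response reports token counts in response.usage, so after
# each call you can do roughly:
#   u = response.usage
#   record_usage(team="it-helpdesk", model="gpt-4",
#                prompt_tokens=u.prompt_tokens, completion_tokens=u.completion_tokens)
```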
Check how much context you're actually sending per request. We were sending 3,000+ tokens per query because we were dumping entire knowledge base articles into the prompt. Switching to semantic chunking and only passing the top 2-3 relevant snippets dropped our token usage by more than half. Huge savings with minimal effort.
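For reference, the retrieval side can be fairly small. A rough sketch, with the chunker kept naive and the embedding model and k as assumptions to tune:

```python
# Top-k snippet retrieval instead of pasting whole KB articles into the prompt.
# Chunking here is naive paragraph splitting; swap in real semantic chunking.
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"   # placeholder embedding model

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

def chunk(article: str, max_chars: int = 800) -> list[str]:
    chunks, cur = [], ""
    for para in article.split("\n\n"):
        if cur and len(cur) + len(para) > max_chars:
            chunks.append(cur)
            cur = ""
        cur = (cur + "\n\n" + para).strip()
    if cur:
        chunks.append(cur)
    return chunks

def top_k_snippets(query: str, chunks: list[str], k: int = 3) -> list[str]:
    vecs = embed(chunks + [query])
    chunk_vecs, q = vecs[:-1], vecs[-1]
    # Cosine similarity between the query and every chunk.
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```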
One thing that helped us was adding rate limiting and prompt previews. Users were pasting huge blocks of text without thinking about it. Once we showed them token counts and added a soft cap per user per day, usage became much more intentional and our costs stabilized.
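The preview and cap are just a token count check before the request goes out. Minimal version below; the cap value and the in-memory counters are placeholders (we keep the real counters in Redis):

```python
# Token preview plus a soft per-user daily cap, using tiktoken to count tokens
# client-side so users see the cost of what they paste before sending it.
from collections import defaultdict
from datetime import date

import tiktoken

DAILY_TOKEN_CAP = 20_000                      # assumed soft cap, tune to budget
enc = tiktoken.encoding_for_model("gpt-4")
daily_usage: dict[tuple[str, date], int] = defaultdict(int)

def preview_tokens(prompt: str) -> int:
    """Show the user how many tokens their pasted text will consume."""
    return len(enc.encode(prompt))

def within_cap(user_id: str, prompt: str) -> bool:
    """Record usage and return False once the user exceeds the soft cap."""
    tokens = preview_tokens(prompt)
    key = (user_id, date.today())
    if daily_usage[key] + tokens > DAILY_TOKEN_CAP:
        return False
    daily_usage[key] += tokens
    return True
```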
How are you handling caching? If you’re seeing repeated queries (like the same FAQ coming up multiple times), you should be able to cache responses and avoid hitting the model every time. Depending on your traffic pattern, that alone might save 20-30%.
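For exact-repeat FAQ traffic, it can be as simple as keying a cache on a normalized query. Sketch only; a production version wants a TTL and probably semantic matching for near-duplicates:

```python
# Response cache keyed on a hash of the normalized query, so repeated FAQ
# lookups never hit the model a second time.
import hashlib

cache: dict[str, str] = {}

def normalize(query: str) -> str:
    return " ".join(query.lower().split())

def cached_answer(query: str, call_model) -> str:
    key = hashlib.sha256(normalize(query).encode()).hexdigest()
    if key in cache:
        return cache[key]          # cache hit: no model call, no token spend
    answer = call_model(query)     # your existing LLM call
    cache[key] = answer
    return answer
```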
We’ve been experimenting with a hybrid approach: running a local smaller model for initial triage and only calling the cloud LLM when necessary. It’s more infrastructure to manage, but for high-volume internal tools, the cost savings justify the complexity. Just something to consider if your scale keeps growing.
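Just to illustrate the triage piece: a small local zero-shot classifier can decide whether a query even needs the cloud LLM. The model choice and labels here are assumptions, not what we actually run:

```python
# Local triage with an off-the-shelf zero-shot classifier; only queries that
# look complex get escalated to the paid cloud model.
from transformers import pipeline

triage = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def needs_cloud_llm(query: str) -> bool:
    result = triage(query, candidate_labels=["simple FAQ lookup",
                                             "complex troubleshooting"])
    # Labels come back sorted by score; escalate only when "complex" wins.
    return result["labels"][0] == "complex troubleshooting"
```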
Also worth looking at whether you need GPT-4 at all for this use case. We switched to GPT-3.5 Turbo for most of our internal tooling and honestly couldn’t tell the difference for straightforward Q&A. The cost per token is like a tenth of what we were paying before.