We went live with our ERP refresh two months ago and turned on the embedded Copilot feature for invoice GL coding. During UAT it was fantastic—95%+ accuracy, finance team loved it, leadership was excited about the automation. Now we’re six weeks past cutover and it’s become a mess. Real supplier invoices are flowing through at 5x the volume we tested with, and the AI is miscoding things left and right: vendor names spelled slightly differently from our master data, line items in unexpected formats, charges we never saw in test data. The model just guesses and posts with high confidence. We’ve had to pull people off other work to manually review everything, and they’re starting to ignore the suggestions entirely because trust is gone. Our trial balance is a disaster and we’re racing to clean it up before month-end close.
The bigger problem is we also activated the demand forecast feature in supply chain planning around the same time, and we’re starting to see similar drift there—forecasts that looked solid in testing are now missing by wide margins as real market patterns diverge from the training data. It feels like we turned these features on before the system was actually ready to handle production data at scale, and now we’re paying for it in rework and lost confidence.
I’m curious if others have hit this wall post-go-live with Copilot or AI features in ERP. Did you disable and stabilize first, or try to tune the models in production? How do you even know when to retrain, and who owns that decision? Would appreciate hearing what worked (or didn’t) for teams who’ve been through this.
One thing that helped us was establishing clear governance before go-live about which decisions the AI could make autonomously versus which required human approval. For invoice coding, we set thresholds—anything under a certain amount and from known vendors could auto-post, but new vendors or high-value transactions required manual review. That way when the AI encountered edge cases it hadn’t seen in training, it didn’t just guess. We also built in audit trails so we could see exactly what the AI coded and why. The governance framework probably saved us from the chaos you’re describing.
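To make the threshold idea concrete, here's a rough sketch of what our routing rule boils down to. This is illustrative Python, not anything from the ERP itself: the dollar limit, confidence cutoff, and field names are all made up for the example, and the real values came from our finance policy.

```python
from dataclasses import dataclass

# Hypothetical governance limits -- your policy, not any vendor default.
AUTO_POST_LIMIT = 5_000.00   # invoices at or above this always get a human
MIN_CONFIDENCE = 0.90        # below this, the AI suggestion is review-only

@dataclass
class Invoice:
    vendor_id: str           # matched against vendor master data
    amount: float
    gl_code_suggestion: str  # what the AI wants to post
    confidence: float        # model's self-reported confidence

def route_invoice(inv: Invoice, known_vendors: set[str]) -> str:
    """Decide whether an AI-coded invoice may auto-post or needs review."""
    if inv.vendor_id not in known_vendors:
        return "manual_review"   # new vendor: always a human
    if inv.amount >= AUTO_POST_LIMIT:
        return "manual_review"   # high value: always a human
    if inv.confidence < MIN_CONFIDENCE:
        return "manual_review"   # model is unsure: don't let it guess
    return "auto_post"
```

The point is that the edge cases the model never saw in training (new vendors, unusual amounts) are exactly the ones the rules catch, so a confident wrong guess never posts on its own.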
Model drift monitoring is something most ERP implementations just don’t include, and that gap is what kills AI features post-go-live. You need automated tracking of accuracy metrics over time and clear triggers for when retraining should happen. We built dashboards that show coding accuracy week-over-week and alert when it drops below thresholds. When drift is detected, there’s a defined process for retraining on updated data. Without that infrastructure, the model just silently degrades until someone notices the damage. It’s not a one-time deployment; it’s ongoing system maintenance.
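For anyone wondering what the week-over-week check actually computes, here's a minimal sketch. The thresholds (90% absolute floor, 5-point drop against a trailing 4-week average) are purely illustrative; tune them to your own baseline.

```python
from statistics import mean

def coding_accuracy(pairs):
    """pairs: list of (ai_code, human_final_code) for one week's invoices."""
    return sum(ai == final for ai, final in pairs) / len(pairs)

def drift_alerts(weekly_acc, floor=0.90, max_drop=0.05, window=4):
    """Return the week indices that should trigger a retraining review.

    Fires when a week's accuracy falls below an absolute floor, or drops
    more than max_drop below the trailing-window average.
    """
    alerts = []
    for i, acc in enumerate(weekly_acc):
        trailing = weekly_acc[max(0, i - window):i]
        baseline = mean(trailing) if trailing else acc
        if acc < floor or baseline - acc > max_drop:
            alerts.append(i)
    return alerts
```

The accuracy input assumes you keep the human's final GL code alongside the AI's suggestion, which is the same audit trail you'd want for governance anyway.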
I’d argue part of the problem is activating these features too early in the stabilization window. The first 60-90 days post-go-live are already chaotic—users are learning new processes, dealing with cutover issues, and under a lot of stress. Throwing unproven AI features into that mix just adds to the chaos and damages trust in both the core system and the AI. We’ve had better success treating AI as a phase-two capability: stabilize the core ERP first, validate data quality, train users on standard workflows, and only then gradually introduce AI features in controlled pilots. It takes longer, but adoption and trust are much higher.
The volume spike post-go-live is no joke. We tested with maybe 500 clean invoices during UAT, then went live and suddenly had 3,000 invoices a week with all kinds of formats and edge cases the model had never seen. The false positive rate on exceptions was so high that my team couldn’t keep up, and they started ignoring the AI entirely. We ended up having to go back and retrain the model on a much larger, more diverse dataset that actually reflected real production patterns. Testing on sanitized data just doesn’t prepare the system for the real world.
The forecast drift problem is real. Our demand planning module was trained on three years of history, but when supply chain disruptions hit and customer buying patterns shifted, the model just kept predicting based on outdated patterns. We didn’t have any retraining process in place—no one even knew it needed retraining until forecasts were obviously wrong. Eventually we set up monthly model refresh cycles and added a feedback loop where planners flag bad forecasts so the system can learn. But yeah, it should have been part of the design from day one, not something we bolted on after the fact.
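If it helps anyone setting up that monthly cycle, here's roughly what a refresh trigger can look like: a forecast-error check (MAPE here) combined with the planner-flag feedback loop. The 20% error cutoff and the flag count are just example numbers, not anything from a specific planning module.

```python
def mape(forecasts, actuals):
    """Mean absolute percentage error over one planning cycle."""
    return sum(abs(f - a) / a for f, a in zip(forecasts, actuals)) / len(actuals)

def needs_refresh(forecasts, actuals, planner_flags,
                  mape_limit=0.20, flag_limit=3):
    """True if this cycle's error or planner feedback warrants retraining.

    planner_flags: list of forecasts that planners manually flagged as bad.
    """
    return mape(forecasts, actuals) > mape_limit or len(planner_flags) >= flag_limit
```

Either signal alone can fire the refresh, which matters because planners often notice a pattern shift before the aggregate error metric does.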