Invoice Miscoding and Forecast Drift After ERP Copilot Go-Live

We went live with our ERP upgrade six months ago with embedded Copilot features for invoice GL coding and demand forecasting. The pilot looked great—95% accuracy on test data, smooth UAT, everyone excited about the automation potential. Reality hit hard about three weeks post-cutover.

Invoice miscoding started showing up when real vendor invoices came through at volume. Our test set was maybe 500 clean invoices; production brought thousands with inconsistent formats, abbreviations we’d never seen, and encoding quirks from international suppliers. The AI kept suggesting codes with high confidence even when they were wrong. Finance lost trust fast and started manually reviewing everything, which killed the efficiency gains we were promised. By month two we had GL trial balance issues and were running reconciliations in Excel again.

Demand forecasting was worse in some ways because the drift was silent. The AI trained on two years of historical data, but we had supply chain disruptions and a product line refresh that changed demand patterns. Forecasts looked reasonable on the surface—MAPE was still acceptable—but directional calls were off. We had stockouts in some categories and overstock tying up working capital in others. Took us almost four months to realize the model had drifted and needed retraining, but we had no governance process for who owned that decision or how to execute it.

The lesson for us: you can’t just activate AI features at go-live and assume they’ll work. We needed data quality work we didn’t do, retraining pipelines we didn’t build, and governance frameworks we didn’t establish. We’re fixing it now, but it’s expensive rework and our users are skeptical of anything AI-related. Would love to hear if others hit similar walls and how you recovered.

We had almost the exact same invoice coding problem. Our issue was that the AI was trained on a narrow set of vendor formats during implementation, and real-world invoices from smaller suppliers or international partners just didn’t match. We ended up disabling auto-coding for the first 90 days post-go-live and running the AI in suggestion-only mode while we rebuilt our vendor master data and retrained the model on actual production invoices. Took longer but restored some trust.
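In case it helps anyone, the gating logic itself was the easy part. Here is a minimal sketch of what suggestion-only mode with a confidence floor can look like; the field names, the 0.90 threshold, and the route_suggestion helper are illustrative, not lifted from our actual system:

```python
from dataclasses import dataclass

# Illustrative shape of a GL coding suggestion coming back from the model.
@dataclass
class GlSuggestion:
    invoice_id: str
    suggested_gl_code: str
    confidence: float  # model's self-reported confidence, 0.0 to 1.0

AUTO_POST_ENABLED = False   # hard off for the first 90 days post-go-live
CONFIDENCE_FLOOR = 0.90     # illustrative threshold, tune against your own error rates

def route_suggestion(s: GlSuggestion) -> str:
    """Return 'auto_post', 'suggest', or 'manual' for an AI coding suggestion."""
    if s.confidence >= CONFIDENCE_FLOOR:
        # Only post without human review once auto-coding is deliberately re-enabled.
        return "auto_post" if AUTO_POST_ENABLED else "suggest"
    # Low confidence: plain manual coding, and log the invoice for the retraining set.
    return "manual"

# Even a 97%-confidence suggestion only pre-fills the field while auto-post is off.
print(route_suggestion(GlSuggestion("INV-1001", "6420", 0.97)))  # -> suggest
```

The point is that nothing the model produces touches the ledger until someone deliberately flips auto-posting back on.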

We learned this the hard way too. Our mistake was treating Copilot as a feature deployment instead of a process redesign. When users are already stressed learning new ERP workflows post-go-live, adding AI suggestions that conflict with what they were trained to do just creates resistance. We should have stabilized core processes first, then gradually enabled AI in monitored phases. Now we’re in recovery mode and it’s taking twice as long as it should have.

On the demand side, we found that ensemble forecasting helped—running multiple models in parallel instead of relying on one. When the primary model started drifting, the variance across models flagged it before we had major stockout issues. We also set explicit triggers: if week-over-week forecast volatility exceeds a threshold or if actuals miss forecast by more than 20% for two consecutive periods, that kicks off a retraining review. Made the drift detectable instead of silent.
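Roughly what those triggers look like in code, sketched in pandas with illustrative column names and thresholds; the exact numbers will depend on your demand patterns, and our real pipeline has more wiring around it:

```python
import pandas as pd

VOLATILITY_THRESHOLD = 0.15   # max acceptable week-over-week swing in the forecast itself
MISS_THRESHOLD = 0.20         # actuals missing forecast by more than 20%...
CONSECUTIVE_PERIODS = 2       # ...for this many periods in a row

def needs_retraining_review(df: pd.DataFrame) -> bool:
    """df has one row per period with 'forecast' and 'actual' columns (illustrative names)."""
    # Trigger 1: the forecast itself is swinging too much week over week.
    wow_volatility = df["forecast"].pct_change().abs()
    if (wow_volatility > VOLATILITY_THRESHOLD).any():
        return True

    # Trigger 2: actuals miss forecast by more than the threshold for N consecutive periods.
    miss = (df["actual"] - df["forecast"]).abs() / df["forecast"]
    consecutive_misses = (miss > MISS_THRESHOLD).rolling(CONSECUTIVE_PERIODS).sum()
    return bool((consecutive_misses >= CONSECUTIVE_PERIODS).any())

# Example: two back-to-back ~25% misses should kick off a retraining review.
history = pd.DataFrame({
    "forecast": [1000, 1020, 1010, 1005],
    "actual":   [ 990,  995,  750,  745],
})
print(needs_retraining_review(history))  # -> True
```

The retraining decision itself is still a human call; the code only guarantees somebody has to look.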

Forecast drift is brutal because it’s not obvious until you’re already in trouble. We set up monthly model performance reviews where we compare actuals to forecast and look for systematic bias or accuracy degradation. When MAPE stays flat but directional accuracy drops, that’s your signal. The hard part is getting business ownership of the retraining decision—finance and supply chain both need to sign off, and that governance took us months to sort out.
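If it’s useful, the arithmetic in those reviews is simple. A rough sketch of the three numbers we compare month to month, again with illustrative column names:

```python
import pandas as pd

def monthly_review(df: pd.DataFrame) -> dict:
    """df: one row per period with 'forecast' and 'actual' columns (illustrative names)."""
    err = df["actual"] - df["forecast"]

    # MAPE: average absolute error relative to actuals. Can stay flat even while drift builds.
    mape = (err.abs() / df["actual"]).mean()

    # Systematic bias: consistently over- or under-forecasting shows up here, not in MAPE.
    bias = err.mean() / df["actual"].mean()

    # Directional accuracy: did forecast and actuals move the same way period over period?
    same_direction = df["forecast"].diff().mul(df["actual"].diff()) > 0
    directional_accuracy = same_direction.iloc[1:].mean()

    return {"mape": mape, "bias": bias, "directional_accuracy": directional_accuracy}

# Example: errors are modest (MAPE looks acceptable) but always in the same direction,
# which the bias number surfaces well before the stockout/overstock pain does.
history = pd.DataFrame({
    "forecast": [100, 105, 110, 115],
    "actual":   [ 92,  96, 101, 104],
})
print(monthly_review(history))
```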

Change management is underestimated in these rollouts. When users lose trust in the AI during the chaotic first 100 days post-go-live, it’s incredibly hard to win it back. We now treat AI features as a separate adoption wave—stabilize the core ERP first, prove it works, then introduce AI in limited scope with clear user value. Celebrating small wins and gathering feedback before scaling helps a lot. Forcing AI features on day one of go-live is a recipe for resistance and shadow workarounds.

The root cause here is usually that the AI features are piloted on clean test data but deployed into messy production reality. We now do a pre-go-live data health check specifically for AI readiness: duplicate vendor records, inconsistent GL structures, missing cost center mappings, shadow processes in Excel. If that foundation isn’t solid, the AI will amplify the problems instead of solving them. Also, you need automated retraining pipelines in place before go-live, not something you figure out after the model drifts.
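For anyone building a similar check, here is a rough sketch of the kind of rules involved; the table and column names are assumptions, and the real checks belong in whatever data quality tooling you already run:

```python
import pandas as pd

def ai_readiness_checks(vendors: pd.DataFrame, invoices: pd.DataFrame) -> dict:
    """Illustrative pre-go-live data health checks for AI-assisted invoice coding.

    Assumed columns:
      vendors:  'vendor_id', 'name', 'tax_id'
      invoices: 'invoice_id', 'gl_code', 'cost_center'
    """
    findings = {}

    # Duplicate vendor records: same normalized name or same tax id under different IDs.
    norm_name = vendors["name"].str.lower().str.replace(r"[^a-z0-9]", "", regex=True)
    findings["duplicate_vendor_names"] = int(norm_name.duplicated().sum())
    findings["duplicate_tax_ids"] = int(vendors["tax_id"].dropna().duplicated().sum())

    # Inconsistent GL structures: codes that don't match the expected pattern (here, 4 digits).
    bad_gl = ~invoices["gl_code"].astype(str).str.fullmatch(r"\d{4}")
    findings["nonstandard_gl_codes"] = int(bad_gl.sum())

    # Missing cost center mappings.
    findings["missing_cost_centers"] = int(invoices["cost_center"].isna().sum())

    return findings

# Example: a near-duplicate vendor, a malformed GL code, and a missing cost center all get flagged.
vendors = pd.DataFrame({
    "vendor_id": ["V1", "V2"],
    "name": ["Acme GmbH", "ACME GmbH."],
    "tax_id": ["DE123456", "DE123456"],
})
invoices = pd.DataFrame({
    "invoice_id": ["I1", "I2"],
    "gl_code": ["6420", "64-20"],
    "cost_center": ["CC10", None],
})
print(ai_readiness_checks(vendors, invoices))
```

Every one of those findings is something a human can fix before go-live; the AI just raises the cost of not fixing them.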