AI defect prediction letting critical bugs slip through—how to catch false negatives before production?

We’ve been piloting an AI-based defect prediction model integrated into our CI/CD pipeline for the past six months, and while it’s caught some high-risk modules early, we’re seeing a troubling pattern: critical defects are still reaching production despite the model flagging the code as low-risk. The model was trained on about two years of historical defect data, test outcomes, and code complexity metrics, and initially it seemed to work well—false positives were manageable and the team started trusting the risk scores.

But over the last two months we’ve had three production incidents where the AI gave high confidence that the release was safe (risk scores below 20%), yet we found critical bugs within days of deployment. One was a payment processing issue, another was a data integrity problem in our order pipeline. When we went back and looked at the features the model analyzed, nothing seemed obviously wrong—the code complexity was reasonable, test coverage was solid, and change frequency wasn’t unusual.

We’re starting to suspect the model isn’t keeping up with how our codebase and failure modes have evolved. We haven’t retrained it since deployment, and our architecture has shifted quite a bit in the last quarter. The false negatives are eroding trust faster than the false positives ever did, because people feel like they’re getting a false sense of security. How do others handle ongoing calibration and drift detection for these kinds of systems? And is there a way to add safety checks that catch what the model misses without just reintroducing all the manual overhead we were trying to eliminate?

Have you considered lowering the risk threshold that triggers review while you sort this out? If the model is missing critical defects at a 20% risk score, you might need to flag anything above 10% or 15% for extra scrutiny until you retrain. Yes, it increases false positives in the short term, but false negatives in production are far more expensive. We did this as a stopgap and it bought us time to improve the model without further incidents.
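As a concrete illustration, here's a minimal sketch of that stopgap gate. The threshold values and function names are hypothetical placeholders, not your actual pipeline API:

```python
# Stopgap: tighten the auto-approve threshold from 20% to 10% until the
# model is retrained. All names here are illustrative, not a real API.
SAFE_THRESHOLD = 0.10  # temporarily lowered from the original 0.20

def release_action(risk_score: float) -> str:
    """Map a model risk score in [0, 1] to a pipeline action."""
    if risk_score < SAFE_THRESHOLD:
        return "auto-approve"
    return "manual-review"  # everything else gets extra scrutiny for now
```

The point is that the change is a one-line config tweak you can revert once retraining lands, not a redesign.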

One thing that helped us was adding anomaly detection on top of the defect prediction model. The prediction model looks at code patterns, but the anomaly detection watches runtime behavior during staging deployments—API response times, transaction success rates, error log patterns. If something behaves unusually even when the defect model says it’s safe, we hold the release. It’s not perfect but it’s caught things the ML model missed because it’s looking at different signals.
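For a rough sense of shape, here's a toy version of that kind of runtime check, assuming a simple z-score against a historical baseline (real systems would use something more robust, and the metric values are made up):

```python
import statistics

def staging_anomalous(baseline: list[float], current: float,
                      z_max: float = 3.0) -> bool:
    """Flag a staging metric (e.g. p95 latency in ms) that sits more than
    z_max standard deviations away from its historical baseline."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_max

# Hold the release if any watched metric trips, even when the defect
# model scored the change as low risk.
latency_baseline = [101.0, 99.0, 102.0, 98.0, 100.0]  # illustrative data
hold_release = staging_anomalous(latency_baseline, 180.0)
```

The key design point is the one in the comment: this check runs independently of the defect model, so either signal alone can hold the release.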

Do you have ground truth tracking set up? Meaning, are you systematically recording which predictions were correct and which were wrong so you can measure false negative rates over time? A lot of teams deploy these models and then only notice problems anecdotally. If you’re not logging prediction vs actual outcome, you can’t really tune thresholds or know when performance is degrading. It’s extra instrumentation work but essential for keeping the system honest.
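A minimal sketch of what that instrumentation might look like; the record fields are assumptions about what you'd log, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class PredictionRecord:
    release_id: str
    predicted_risky: bool  # model flagged the release for review
    actual_defect: bool    # a real defect surfaced after deployment

def false_negative_rate(records: list[PredictionRecord]) -> float:
    """Share of releases with real defects that the model scored as safe."""
    defective = [r for r in records if r.actual_defect]
    if not defective:
        return 0.0
    missed = sum(1 for r in defective if not r.predicted_risky)
    return missed / len(defective)
```

Computing this on a rolling window (say, the last 90 days of releases) turns "we feel like it's getting worse" into a number you can alert on.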

Just a thought—are your test cases themselves reliable? We discovered our AI model was trained partially on flaky test data, so it learned some bad patterns. If tests sometimes fail for infrastructure reasons rather than real bugs, the model gets confused about what actually correlates with defects. Cleaning up flaky tests before retraining made a noticeable difference in our false negative rate.
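One cheap heuristic for finding those flaky tests before retraining: a test that both passed and failed on the same commit almost certainly failed for non-code reasons. A sketch, where the input shape is an assumption about what your CI logs contain:

```python
from collections import defaultdict

def find_flaky_tests(runs: list[tuple[str, str, bool]]) -> set[str]:
    """runs are (test_name, commit_sha, passed) tuples from CI history.
    A test with both a pass and a fail on the same commit is likely flaky."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)
    return {test for (test, _sha), seen in outcomes.items() if len(seen) == 2}
```

Tests flagged this way can be quarantined or excluded from the training labels so the model only learns from failures that reflect real defects.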

Are you monitoring the distribution of the features the model uses? We had a case where code complexity metrics started drifting because we adopted a new framework that naturally had higher cyclomatic complexity, and the model didn’t know that was normal now. Setting up drift detection on your top model features can at least alert you when the input data stops looking like what the model was trained on. It won’t fix false negatives directly, but it gives you early warning that recalibration is needed.
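A pure-Python sketch of one such check, using the two-sample Kolmogorov-Smirnov statistic. The 0.2 threshold is an arbitrary illustration; in practice you'd tune it per feature or run a proper significance test:

```python
import bisect

def ks_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def feature_drifted(train_values: list[float], recent_values: list[float],
                    threshold: float = 0.2) -> bool:
    """Alert when a feature's recent distribution has moved away from
    the distribution the model was trained on."""
    return ks_statistic(train_values, recent_values) > threshold
```

Running this weekly on your top features (cyclomatic complexity, churn, coverage) gives you the early-warning signal described above without touching the model itself.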

We ended up implementing a multi-tier threshold system. Anything the model flags as very low risk still gets a lightweight human review—just a quick checklist from the on-call engineer. Medium risk gets deeper review, high risk gets full regression testing. It’s more manual than pure automation, but it prevents the dangerous overconfidence problem where everyone assumes the AI has it covered. The model is still useful for prioritizing attention, but we don’t treat any score as a definitive go/no-go signal anymore.
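A sketch of the tier mapping, assuming the 20% boundary from the original post and an invented 60% cutoff for the top tier:

```python
def review_tier(risk_score: float) -> str:
    """Every release gets at least a lightweight check; the score only
    prioritizes attention, it never grants an automatic pass."""
    if risk_score < 0.20:
        return "checklist-review"   # quick on-call checklist
    if risk_score < 0.60:
        return "deep-review"        # fuller code review
    return "full-regression"        # complete regression suite
```

The important property is that no branch returns "skip": the floor on scrutiny is what prevents the overconfidence failure mode.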

This sounds like classic model decay. If your codebase and deployment patterns have shifted significantly in six months and the model hasn’t been retrained, it’s essentially operating on outdated assumptions about what ‘risky’ looks like. We faced something similar and ended up implementing a lightweight continuous fine-tuning process—every two weeks we feed the model recent defect data and retrain on a rolling window. It’s not a full rebuild, just enough to keep the patterns current. The key is treating model maintenance as ongoing work, not a one-time deployment.
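In case it helps, here's a sketch of the rolling-window selection step. The field names and the 180-day window are assumptions; the actual retraining call depends on your model stack:

```python
from datetime import datetime, timedelta

def rolling_training_set(records: list[dict], now: datetime,
                         window_days: int = 180) -> list[dict]:
    """Keep only defect records from the trailing window, so each retrain
    reflects the codebase as it is now rather than two years ago."""
    cutoff = now - timedelta(days=window_days)
    return [r for r in records if r["merged_at"] >= cutoff]

# Run this on a schedule (e.g. every two weeks), then retrain on the result.
```

Pairing the schedule with the drift monitoring others mentioned lets you also trigger an off-cycle retrain when the inputs shift abruptly.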