AI defect prediction letting critical bugs slip through—how to catch false negatives before production?

We’ve been piloting an AI-based defect prediction model integrated into our CI/CD pipeline for the past six months, and while it’s caught some high-risk modules early, we’re seeing a troubling pattern: critical defects are still reaching production despite the model flagging the code as low-risk. The model was trained on about two years of historical defect data, test outcomes, and code complexity metrics, and initially it seemed to work well—false positives were manageable and the team started trusting the risk scores.

But over the last two months we’ve had three production incidents where the AI gave high confidence that the release was safe (risk scores below 20%), yet we found critical bugs within days of deployment. One was a payment processing issue, another was a data integrity problem in our order pipeline. When we went back and looked at the features the model analyzed, nothing seemed obviously wrong—the code complexity was reasonable, test coverage was solid, and change frequency wasn’t unusual.

We’re starting to suspect the model isn’t keeping up with how our codebase and failure modes have evolved. We haven’t retrained it since deployment, and our architecture has shifted quite a bit in the last quarter. The false negatives are eroding trust faster than the false positives ever did, because people feel like they’re getting a false sense of security. How do others handle ongoing calibration and drift detection for these kinds of systems? And is there a way to add safety checks that catch what the model misses without just reintroducing all the manual overhead we were trying to eliminate?

Have you considered lowering the risk threshold that triggers review while you sort this out? If the model is missing critical defects at a 20% risk score, you might need to flag anything above 10% or 15% for extra scrutiny until you retrain. Yes, it increases false positives in the short term, but false negatives in production are far more expensive. We did this as a stopgap and it bought us time to improve the model without further incidents.
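As a concrete illustration, here's a minimal sketch of that stopgap gate. The threshold values and function names are hypothetical placeholders, not your actual pipeline API:

```python
# Stopgap: tighten the auto-approve threshold from 20% to 10% until the
# model is retrained. All names here are illustrative, not a real API.
SAFE_THRESHOLD = 0.10  # temporarily lowered from the original 0.20

def release_action(risk_score: float) -> str:
    """Map a model risk score in [0, 1] to a pipeline action."""
    if risk_score < SAFE_THRESHOLD:
        return "auto-approve"
    return "manual-review"  # everything else gets extra scrutiny for now
```

The point is that the change is a one-line config tweak you can revert once retraining lands, not a redesign.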

One thing that helped us was adding anomaly detection on top of the defect prediction model. The prediction model looks at code patterns, but the anomaly detection watches runtime behavior during staging deployments—API response times, transaction success rates, error log patterns. If something behaves unusually even when the defect model says it’s safe, we hold the release. It’s not perfect but it’s caught things the ML model missed because it’s looking at different signals.
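For a rough sense of shape, here's a toy version of that kind of runtime check, assuming a simple z-score against a historical baseline (real systems would use something more robust, and the metric values are made up):

```python
import statistics

def staging_anomalous(baseline: list[float], current: float,
                      z_max: float = 3.0) -> bool:
    """Flag a staging metric (e.g. p95 latency in ms) that sits more than
    z_max standard deviations away from its historical baseline."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_max

# Hold the release if any watched metric trips, even when the defect
# model scored the change as low risk.
latency_baseline = [101.0, 99.0, 102.0, 98.0, 100.0]  # illustrative data
hold_release = staging_anomalous(latency_baseline, 180.0)
```

The key design point is the one in the comment: this check runs independently of the defect model, so either signal alone can hold the release.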

Do you have ground truth tracking set up? Meaning, are you systematically recording which predictions were correct and which were wrong so you can measure false negative rates over time? A lot of teams deploy these models and then only notice problems anecdotally. If you’re not logging prediction vs actual outcome, you can’t really tune thresholds or know when performance is degrading. It’s extra instrumentation work but essential for keeping the system honest.
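A minimal sketch of what that instrumentation might look like; the record fields are assumptions about what you'd log, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class PredictionRecord:
    release_id: str
    predicted_risky: bool  # model flagged the release for review
    actual_defect: bool    # a real defect surfaced after deployment

def false_negative_rate(records: list[PredictionRecord]) -> float:
    """Share of releases with real defects that the model scored as safe."""
    defective = [r for r in records if r.actual_defect]
    if not defective:
        return 0.0
    missed = sum(1 for r in defective if not r.predicted_risky)
    return missed / len(defective)
```

Computing this on a rolling window (say, the last 90 days of releases) turns "we feel like it's getting worse" into a number you can alert on.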

Just a thought—are your test cases themselves reliable? We discovered our AI model was trained partially on flaky test data, so it learned some bad patterns. If tests sometimes fail for infrastructure reasons rather than real bugs, the model gets confused about what actually correlates with defects. Cleaning up flaky tests before retraining made a noticeable difference in our false negative rate.
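One cheap heuristic for finding those flaky tests before retraining: a test that both passed and failed on the same commit almost certainly failed for non-code reasons. A sketch, where the input shape is an assumption about what your CI logs contain:

```python
from collections import defaultdict

def find_flaky_tests(runs: list[tuple[str, str, bool]]) -> set[str]:
    """runs are (test_name, commit_sha, passed) tuples from CI history.
    A test with both a pass and a fail on the same commit is likely flaky."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)
    return {test for (test, _sha), seen in outcomes.items() if len(seen) == 2}
```

Tests flagged this way can be quarantined or excluded from the training labels so the model only learns from failures that reflect real defects.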

Are you monitoring the distribution of the features the model uses? We had a case where code complexity metrics started drifting because we adopted a new framework that naturally had higher cyclomatic complexity, and the model didn’t know that was normal now. Setting up drift detection on your top model features can at least alert you when the input data stops looking like what the model was trained on. It won’t fix false negatives directly, but it gives you early warning that recalibration is needed.
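A pure-Python sketch of one such check, using the two-sample Kolmogorov-Smirnov statistic. The 0.2 threshold is an arbitrary illustration; in practice you'd tune it per feature or run a proper significance test:

```python
import bisect

def ks_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def feature_drifted(train_values: list[float], recent_values: list[float],
                    threshold: float = 0.2) -> bool:
    """Alert when a feature's recent distribution has moved away from
    the distribution the model was trained on."""
    return ks_statistic(train_values, recent_values) > threshold
```

Running this weekly on your top features (cyclomatic complexity, churn, coverage) gives you the early-warning signal described above without touching the model itself.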

We ended up implementing a multi-tier threshold system. Anything the model flags as very low risk still gets a lightweight human review—just a quick checklist from the on-call engineer. Medium risk gets deeper review, high risk gets full regression testing. It’s more manual than pure automation, but it prevents the dangerous overconfidence problem where everyone assumes the AI has it covered. The model is still useful for prioritizing attention, but we don’t treat any score as a definitive go/no-go signal anymore.
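A sketch of the tier mapping, assuming the 20% boundary from the original post and an invented 60% cutoff for the top tier:

```python
def review_tier(risk_score: float) -> str:
    """Every release gets at least a lightweight check; the score only
    prioritizes attention, it never grants an automatic pass."""
    if risk_score < 0.20:
        return "checklist-review"   # quick on-call checklist
    if risk_score < 0.60:
        return "deep-review"        # fuller code review
    return "full-regression"        # complete regression suite
```

The important property is that no branch returns "skip": the floor on scrutiny is what prevents the overconfidence failure mode.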

This sounds like classic model decay. If your codebase and deployment patterns have shifted significantly in six months and the model hasn’t been retrained, it’s essentially operating on outdated assumptions about what ‘risky’ looks like. We faced something similar and ended up implementing a lightweight continuous fine-tuning process—every two weeks we feed the model recent defect data and retrain on a rolling window. It’s not a full rebuild, just enough to keep the patterns current. The key is treating model maintenance as ongoing work, not a one-time deployment.
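In case it helps, here's a sketch of the rolling-window selection step. The field names and the 180-day window are assumptions; the actual retraining call depends on your model stack:

```python
from datetime import datetime, timedelta

def rolling_training_set(records: list[dict], now: datetime,
                         window_days: int = 180) -> list[dict]:
    """Keep only defect records from the trailing window, so each retrain
    reflects the codebase as it is now rather than two years ago."""
    cutoff = now - timedelta(days=window_days)
    return [r for r in records if r["merged_at"] >= cutoff]

# Run this on a schedule (e.g. every two weeks), then retrain on the result.
```

Pairing the schedule with the drift monitoring others mentioned lets you also trigger an off-cycle retrain when the inputs shift abruptly.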