We deployed an ML-based defect prediction model into our CI/CD pipeline last year to identify high-risk code modules before release. Initial results looked solid: the false positive rate was acceptable, and the model flagged several problem areas that we caught in staging. About four months in, we started seeing defects slip through into production that the model had marked as low-risk. A post-mortem traced it back to data drift: our training data was from 2023, but development patterns and our tech stack had evolved significantly by mid-2024.
We implemented continuous fine-tuning and added real-time monitoring on the top predictive features—code complexity, change frequency, and test coverage trends. We also switched to a multi-tier threshold architecture instead of a single pass/fail gate. High-confidence predictions still block releases automatically, but medium-confidence flags now route to manual review by senior engineers. Low-confidence alerts are logged for pattern analysis but don’t block the pipeline.
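Roughly, the tiered routing described above could be sketched like this. The cutoff values and action names here are illustrative placeholders, not the actual configuration:

```python
# Sketch of a multi-tier threshold gate: high-confidence predictions block
# automatically, medium-confidence ones go to human review, low-confidence
# ones are only logged. Cutoffs below are assumed values for illustration.

HIGH_CONFIDENCE = 0.85    # assumed cutoff: block the release automatically
MEDIUM_CONFIDENCE = 0.50  # assumed cutoff: route to senior-engineer review

def route_prediction(risk_score: float) -> str:
    """Map a model risk score (0.0 to 1.0) to a pipeline action."""
    if risk_score >= HIGH_CONFIDENCE:
        return "block_release"     # high confidence: automatic gate
    if risk_score >= MEDIUM_CONFIDENCE:
        return "manual_review"     # medium confidence: human in the loop
    return "log_for_analysis"      # low confidence: record, don't block
```

The point of the structure is that only one tier has release-blocking authority, so tuning the medium band changes reviewer load without touching the hard gate.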
The biggest lesson was that model maintenance isn’t optional. We now track false negative rates weekly and retrain quarterly, even if drift metrics look stable. The other surprise was how much trust we’d lost with the dev team after those missed defects—it took three months of transparent reporting and involving them in threshold tuning before people stopped ignoring the alerts. If I were starting over, I’d build the feedback loop and human oversight into the architecture from day one, not bolt it on after things break.
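The post doesn't say which drift metric the team tracks; the Population Stability Index (PSI) is one common choice for comparing a feature's training-time distribution against its current one. A minimal sketch, with the usual caveat that bin count and the "meaningful drift" cutoff (often quoted as PSI > 0.2) are conventions, not the team's actual setup:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """Compare a feature's training-time values (expected) against its
    current values (actual) over equal-width bins. Returns 0.0 for
    identical distributions; larger values indicate more drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def frac(values, i):
        count = sum(1 for v in values
                    if lo + i * width <= v < lo + (i + 1) * width)
        # Smooth empty bins to avoid division by zero and log(0).
        return max(count, 1) / len(values)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )
```

Each term is non-negative, so any distributional shift only pushes the score up; tracking it per feature (complexity, change frequency, coverage trend) gives an early-warning signal even between scheduled retrains.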
The trust erosion part resonates. We had something similar when our automated security scanner started throwing false positives constantly—developers just started merging anyway and treating alerts as noise. Did you find that the multi-tier threshold approach reduced alert fatigue, or did it just shift the problem to the manual review queue?
We’ve been piloting something similar but hit a wall with flaky tests poisoning the training data. How did you ensure the historical defect and test data feeding your model was reliable? Did you filter out known flaky test results, or did you have to clean up the test suite first before the model could even be useful?
How are you handling the retraining cadence without introducing new risks? We’re nervous about automated retraining because if the model learns from recent false negatives, it might overcorrect and start generating too many false positives. Are you doing any kind of validation or A/B testing on the retrained model before deploying it back into the pipeline?
Curious how you’re measuring false negatives in real time if you don’t have ground truth until defects surface in production. Are you using proxy metrics like post-release defect density correlated back to the model’s predictions, or something else? We’re trying to set up similar monitoring but struggling with the lag between prediction and validation.
The part about involving the dev team in threshold tuning is really important. We rolled out something similar last year and treated it as a top-down mandate. The pushback was fierce. Once we opened up the calibration process and let teams see the trade-offs between sensitivity and false positives, adoption improved significantly. People need to understand why the gate exists, not just be told to follow it.
Yeah, flaky tests were a huge problem initially. We had to spend about six weeks stabilizing the test suite before retraining. We flagged any test that failed intermittently more than twice in a month and excluded those results from the training set. It was painful but necessary—feeding the model unreliable data just made the predictions worse. Data quality really is the foundation here.
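The flagging rule described above (intermittent failures more than twice in a month) can be sketched roughly as follows. The record shape and field names are assumptions for illustration, not the actual schema:

```python
from collections import defaultdict
from datetime import datetime

def find_flaky_tests(results, max_intermittent_failures=2):
    """results: iterable of (test_name, timestamp, passed) tuples.

    A test counts as flaky if it failed more than the cutoff in any single
    calendar month AND also passed at some point; i.e. it is intermittent,
    not consistently broken.
    """
    failures = defaultdict(lambda: defaultdict(int))  # name -> (year, month) -> fails
    ever_passed = set()
    for name, ts, passed in results:
        if passed:
            ever_passed.add(name)
        else:
            failures[name][(ts.year, ts.month)] += 1
    return {
        name
        for name, by_month in failures.items()
        if name in ever_passed
        and any(count > max_intermittent_failures for count in by_month.values())
    }
```

Results from any test in the returned set would then be dropped from the training data before retraining; consistently failing tests are a separate cleanup problem, since those failures are real signal.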