Recovering 22K Builds with an ML-Based Flaky Test Detection Platform

We spent the better part of a year building an internal flaky test detection platform after our CI/CD pipelines became unreliable. The problem was straightforward but brutal: engineers couldn’t trust test results anymore. A test would fail, someone would rerun it, and it would pass. Multiply that across hundreds of teams and thousands of daily builds, and you have a full-blown trust crisis.

We implemented a multi-algorithm approach combining retry detection with Bayesian inference over historical execution data. When a test flips from pass to fail (or vice versa) on retry in the same build, we log that as a flakiness signal. For patterns that single retries miss, we apply statistical analysis across a moving window of past executions to calculate flakiness probability. We also score tests based on duration variability, environment consistency, and historical pass/fail ratios.

In the first quarter after deployment, the system recovered over 22,000 builds that would have been blocked by false-positive failures and flagged 7,000 unique flaky tests across our org. The system auto-notifies owning teams, creates tickets with resolution deadlines, and surfaces likely root causes from logs and code changes. The biggest lesson: data quality matters as much as the algorithm. Inconsistent test metadata torpedoed our early accuracy, so we had to invest heavily upfront in reliable labeling and context for every test case.

What we’d do differently: we initially tried pure ML, then pure heuristics. Neither worked well alone. The combination of Bayesian stats, rule-based checks, and traditional ML turned out to be far more robust than any single technique.
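One way to picture that combination: treat the rule-based checks as hard evidence that sets a floor, and blend the two probabilistic signals. The weights and the 0.8 floor below are hypothetical, purely to show the shape of the combiner:

```python
def combined_flaky_score(bayes_prob: float, rule_hits: int, ml_prob: float) -> float:
    """Hypothetical combiner for the three signal families.

    bayes_prob: flakiness probability from the windowed Bayesian model
    rule_hits:  count of rule-based checks that fired (e.g. flip-on-retry)
    ml_prob:    score from a traditional ML classifier over test features
    """
    # Rule-based checks are high-precision: any hit sets a score floor.
    rule_floor = 0.8 if rule_hits > 0 else 0.0
    # Blend the probabilistic signals (weights are illustrative).
    blended = 0.6 * bayes_prob + 0.4 * ml_prob
    return max(rule_floor, blended)
```

The appeal of this structure is that each component fails differently: the rules catch obvious flips immediately, the Bayesian model catches slow-burn patterns, and the classifier picks up correlations (duration variance, environment drift) that neither of the others sees.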

One thing we struggled with in a similar initiative was the rate of false positives from the detection system itself. Sometimes a test would be flagged as flaky when it was actually catching a real intermittent bug in the application. How do you distinguish between legitimate flakiness (test issue) and intermittent application behavior (real defect)? That line can be blurry.

The Bayesian inference piece is interesting. We’ve been considering a purely heuristic approach because our team doesn’t have deep stats expertise. How much tuning did the statistical models require, and do you have dedicated data science support maintaining them, or is it something your QA/platform engineers handle day-to-day?

On cold start: we used a hybrid approach. New tests default to non-flaky but enter a monitoring phase where we track closely for the first 50–100 executions. If flip signals appear early, we flag them sooner. For repos being onboarded, we backfill historical data if it’s available in their old CI system. On quarantine: yes, tests above a certain flakiness threshold are automatically moved to a separate suite that doesn’t block merges, but teams get notified and have SLAs to fix or retire them. We don’t remove tests permanently without human review.
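The lifecycle described above (monitoring phase for new tests, automatic quarantine above a threshold, no permanent removal without human review) can be sketched as a small state machine. The specific thresholds here are my own placeholders drawn from the ranges mentioned, not the real configuration:

```python
from enum import Enum


class TestStatus(Enum):
    MONITORING = "monitoring"    # new test, tracked closely at first
    HEALTHY = "healthy"          # blocks merges normally
    QUARANTINED = "quarantined"  # moved to non-blocking suite, SLA to fix


MONITOR_EXECUTIONS = 100      # illustrative, from the 50-100 range above
QUARANTINE_THRESHOLD = 0.3    # hypothetical flakiness-probability cutoff


def next_status(status: TestStatus, executions: int,
                flakiness_prob: float) -> TestStatus:
    """Hypothetical transition rule for a test's lifecycle."""
    # Quarantine applies at any stage once the score crosses the cutoff;
    # retiring a test permanently still requires human review.
    if flakiness_prob >= QUARANTINE_THRESHOLD:
        return TestStatus.QUARANTINED
    # New tests graduate to healthy after a clean monitoring phase.
    if status is TestStatus.MONITORING and executions >= MONITOR_EXECUTIONS:
        return TestStatus.HEALTHY
    return status
```

Keeping quarantine reversible (a flag, not a deletion) is what makes the human-review backstop cheap: un-quarantining a fixed test is a one-line state change rather than restoring deleted code.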