Recovering 22K Builds with an ML-Based Flaky Test Detection Platform

We spent the better part of a year building an internal flaky test detection platform after our CI/CD pipelines became unreliable. The problem was straightforward but brutal: engineers couldn’t trust test results anymore. A test would fail, someone would rerun it, and it would pass. Multiply that across hundreds of teams and thousands of daily builds, and you have a full-blown trust crisis.

We implemented a multi-algorithm approach combining retry detection with Bayesian inference over historical execution data. When a test flips from pass to fail (or vice versa) on retry in the same build, we log that as a flakiness signal. For patterns that single retries miss, we apply statistical analysis across a moving window of past executions to calculate flakiness probability. We also score tests based on duration variability, environment consistency, and historical pass/fail ratios.

In the first quarter after deployment, the system recovered over 22,000 builds that would have been blocked by false-positive failures and flagged 7,000 unique flaky tests across our org. The system auto-notifies owning teams, creates tickets with resolution deadlines, and surfaces likely root causes from logs and code changes. The biggest lesson: data quality matters as much as the algorithm. Inconsistent test metadata torpedoed our early accuracy, so we had to invest heavily upfront in reliable labeling and context for every test case.

What we’d do differently: we initially tried pure ML, then pure heuristics. Neither worked well alone. The combination of Bayesian stats, rule-based checks, and traditional ML turned out to be far more robust than any single technique.
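One way to picture that combination: treat the rule-based checks as hard evidence that sets a floor, and blend the two probabilistic signals. The weights and the 0.8 floor below are hypothetical, purely to show the shape of the combiner:

```python
def combined_flaky_score(bayes_prob: float, rule_hits: int, ml_prob: float) -> float:
    """Hypothetical combiner for the three signal families.

    bayes_prob: flakiness probability from the windowed Bayesian model
    rule_hits:  count of rule-based checks that fired (e.g. flip-on-retry)
    ml_prob:    score from a traditional ML classifier over test features
    """
    # Rule-based checks are high-precision: any hit sets a score floor.
    rule_floor = 0.8 if rule_hits > 0 else 0.0
    # Blend the probabilistic signals (weights are illustrative).
    blended = 0.6 * bayes_prob + 0.4 * ml_prob
    return max(rule_floor, blended)
```

The appeal of this structure is that each component fails differently: the rules catch obvious flips immediately, the Bayesian model catches slow-burn patterns, and the classifier picks up correlations (duration variance, environment drift) that neither of the others sees.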

One thing we struggled with in a similar initiative was the rate of false positives from the detection system itself. Sometimes a test would be flagged as flaky when it was actually catching a real intermittent bug in the application. How do you distinguish between legitimate flakiness (test issue) and intermittent application behavior (real defect)? That line can be blurry.

The Bayesian inference piece is interesting. We’ve been considering a purely heuristic approach because our team doesn’t have deep stats expertise. How much tuning did the statistical models require, and do you have dedicated data science support maintaining them, or is it something your QA/platform engineers handle day-to-day?

On cold start: we used a hybrid approach. New tests default to non-flaky but enter a monitoring phase where we track closely for the first 50–100 executions. If flip signals appear early, we flag them sooner. For repos being onboarded, we backfill historical data if it’s available in their old CI system. On quarantine: yes, tests above a certain flakiness threshold are automatically moved to a separate suite that doesn’t block merges, but teams get notified and have SLAs to fix or retire them. We don’t remove tests permanently without human review.
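The lifecycle described above (monitoring phase for new tests, automatic quarantine above a threshold, no permanent removal without human review) can be sketched as a small state machine. The specific thresholds here are my own placeholders drawn from the ranges mentioned, not the real configuration:

```python
from enum import Enum


class TestStatus(Enum):
    MONITORING = "monitoring"    # new test, tracked closely at first
    HEALTHY = "healthy"          # blocks merges normally
    QUARANTINED = "quarantined"  # moved to non-blocking suite, SLA to fix


MONITOR_EXECUTIONS = 100      # illustrative, from the 50-100 range above
QUARANTINE_THRESHOLD = 0.3    # hypothetical flakiness-probability cutoff


def next_status(status: TestStatus, executions: int,
                flakiness_prob: float) -> TestStatus:
    """Hypothetical transition rule for a test's lifecycle."""
    # Quarantine applies at any stage once the score crosses the cutoff;
    # retiring a test permanently still requires human review.
    if flakiness_prob >= QUARANTINE_THRESHOLD:
        return TestStatus.QUARANTINED
    # New tests graduate to healthy after a clean monitoring phase.
    if status is TestStatus.MONITORING and executions >= MONITOR_EXECUTIONS:
        return TestStatus.HEALTHY
    return status
```

Keeping quarantine reversible (a flag, not a deletion) is what makes the human-review backstop cheap: un-quarantining a fixed test is a one-line state change rather than restoring deleted code.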