We spent the better part of a year building an internal flaky test detection platform after our CI/CD pipelines became unreliable. The problem was straightforward but brutal: engineers couldn’t trust test results anymore. A test would fail, someone would rerun it, and it would pass. Multiply that across hundreds of teams and thousands of daily builds, and you have a full-blown trust crisis.
We implemented a multi-algorithm approach combining retry detection with Bayesian inference over historical execution data. When a test flips from pass to fail (or vice versa) on retry in the same build, we log that as a flakiness signal. For patterns that single retries miss, we apply statistical analysis across a moving window of past executions to calculate flakiness probability. We also score tests based on duration variability, environment consistency, and historical pass/fail ratios.
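The two core signals above can be sketched in a few lines. This is a hypothetical illustration, not our production code: the window size, the uniform Beta(1,1) prior, and the function names are assumptions made for the example.

```python
from collections import deque

WINDOW = 50  # illustrative moving window of recent executions


def is_retry_flip(results_in_build):
    """A pass/fail flip across retries within one build is a flakiness signal."""
    return len(set(results_in_build)) > 1


def flakiness_probability(history, alpha=1.0, beta=1.0):
    """Beta-Bernoulli posterior mean over a moving window of outcomes.

    history: iterable of booleans, True = the run showed a flaky signal.
    With a Beta(alpha, beta) prior, the posterior mean is
    (flaky + alpha) / (n + alpha + beta).
    """
    window = deque(history, maxlen=WINDOW)
    flaky = sum(window)
    n = len(window)
    return (flaky + alpha) / (n + alpha + beta)


# A test that flipped in 6 of its last 40 runs
print(is_retry_flip(["pass", "fail", "pass"]))                     # True
print(round(flakiness_probability([True] * 6 + [False] * 34), 3))  # 0.167
```

The Bayesian smoothing is the useful part: a test with 1 failure in 2 runs and a test with 20 failures in 40 runs have the same raw failure rate, but the posterior correctly treats the second as far stronger evidence of flakiness.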
In the first quarter after deployment, the system recovered over 22,000 builds that would have been blocked by false-positive failures and flagged 7,000 unique flaky tests across our org. The system auto-notifies owning teams, creates tickets with resolution deadlines, and surfaces likely root causes from logs and code changes. The biggest lesson: data quality matters as much as the algorithm. Inconsistent test metadata torpedoed our early accuracy, so we had to invest heavily upfront in reliable labeling and context for every test case.
What we’d do differently: skip the single-technique phases. We initially tried pure ML, then pure heuristics, and neither worked well alone. The combination of Bayesian stats, rule-based checks, and traditional ML turned out to be far more robust than any single technique.
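That combination can be sketched as a weighted blend of the three signal families. Everything here is illustrative: the rule thresholds, the weights, and the test metadata fields (`duration_cv`, `env_consistent`) are assumptions, and the ML score is stubbed as an input.

```python
def rule_score(test):
    """Rule-based checks on the metadata mentioned above (hypothetical
    fields): duration variability and environment consistency."""
    score = 0.0
    if test.get("duration_cv", 0.0) > 0.5:  # high coefficient of variation
        score += 0.5
    if not test.get("env_consistent", True):
        score += 0.5
    return min(score, 1.0)


def combined_flakiness(bayes_p, rules_p, ml_p, weights=(0.5, 0.3, 0.2)):
    """Weighted blend of Bayesian posterior, rule score, and ML score.

    Weights are illustrative; in practice they would be tuned against
    labeled flaky/non-flaky test data.
    """
    wb, wr, wm = weights
    return wb * bayes_p + wr * rules_p + wm * ml_p


test = {"duration_cv": 0.8, "env_consistent": False}
print(round(combined_flakiness(0.17, rule_score(test), 0.4), 3))  # 0.465
```

The appeal of the blend is that each family covers the others' blind spots: rules catch obvious pathologies instantly, the Bayesian score handles sparse history gracefully, and the ML model picks up patterns neither hand-written rule nor simple statistics would express.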