We’re running daily builds across a pretty complex product suite—lots of microservices, multiple frontends, and integration points everywhere. Our full regression suite takes about 12 hours to run, which is killing our ability to ship faster. We’ve been looking at ML-based test prioritization to focus on high-risk areas instead of running everything on every commit.
We did a small pilot with impact analysis and risk prediction models trained on our historical defect data and code changes. The results were interesting—execution time dropped by maybe 50%, but we also missed a couple of issues that only showed up in tests we didn’t prioritize. Now leadership is nervous about skipping tests, and some engineers don’t trust the model’s recommendations. We’re still manually reviewing which tests to run, which defeats the purpose.
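For anyone curious what the core of an approach like this can look like: at its simplest, "impact analysis plus risk prediction" can be a co-failure index built from past CI runs, used to rank tests against the files in a commit. This is a minimal pure-Python sketch, not any team's actual pipeline, and all names here are illustrative:

```python
from collections import defaultdict

def build_cochange_index(history):
    """history: list of (changed_files, failed_tests) pairs from past CI runs.
    Counts how often each (file, test) pair co-occurred with a failure."""
    index = defaultdict(int)
    for changed_files, failed_tests in history:
        for f in changed_files:
            for t in failed_tests:
                index[(f, t)] += 1
    return index

def prioritize(changed_files, all_tests, index, budget):
    """Rank tests by historical co-failure with the changed files,
    returning only the top `budget` tests to run."""
    scores = {t: sum(index.get((f, t), 0) for f in changed_files)
              for t in all_tests}
    return sorted(all_tests, key=lambda t: scores[t], reverse=True)[:budget]
```

A real model would use richer features (change size, code ownership, test age), but even a frequency index like this makes the "we missed issues in unprioritized tests" failure mode concrete: any test with no co-failure history scores zero and never gets picked.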
How are other teams handling this? Do you run prioritized tests on every commit and full regression on a schedule, or have you found a way to build confidence in the model’s decisions? What metrics convinced your stakeholders that AI prioritization was safe?
What’s your governance process for updating the model? We found that the model degrades over time as the codebase evolves—tests that were high-priority six months ago aren’t anymore, and new high-risk areas emerge. We retrain quarterly using the latest data, and we have a monthly review where QA and dev leads can flag tests that should be manually prioritized regardless of what the model says. That manual override option was critical for getting buy-in.
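The manual-override mechanism described above is simple enough to sketch. Assuming the override list is just a set of test names that QA/dev leads maintain (hypothetical shape, for illustration only), the merge with the model's ranking could look like:

```python
def apply_overrides(model_ranking, always_run, budget):
    """Force manually flagged tests to the front of the selection,
    then fill the remaining budget from the model's ranking."""
    # Flagged tests the model already ranked, in model order
    forced = [t for t in model_ranking if t in always_run]
    # Flagged tests the model dropped entirely -- include them anyway
    forced += [t for t in always_run if t not in model_ranking]
    rest = [t for t in model_ranking if t not in always_run]
    return (forced + rest)[:budget]
```

The key property for buy-in is the second step: a flagged test gets run even when the model no longer ranks it at all, which is exactly the degradation case a quarterly retrain can miss.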
How are you handling flaky tests in the prioritization? We had issues where the model kept flagging certain tests as high-priority because they failed frequently, but they were actually just flaky—timeouts, race conditions, environment issues. We ended up building a separate flakiness detection layer that quarantines unreliable tests before they even reach the prioritization model. Otherwise the model just amplifies the noise.
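One common flakiness signal, sketched here as a minimal illustration (not necessarily what this poster built), is the flip rate: how often a test's result changes between consecutive runs. A genuinely broken test fails consistently; a flaky one alternates.

```python
def flip_rate(outcomes):
    """Fraction of consecutive runs where a test's result flipped.
    outcomes: list of booleans (True = pass), oldest to newest."""
    if len(outcomes) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    return flips / (len(outcomes) - 1)

def quarantine(test_outcomes, threshold=0.3):
    """Return the set of tests whose results flip too often to be
    trusted as a failure signal for the prioritization model."""
    return {t for t, runs in test_outcomes.items()
            if flip_rate(runs) > threshold}
```

Filtering these out before training is what prevents the amplification loop described above: a test that flips constantly racks up failures, which the model reads as "high risk" unless the flakiness layer removes it first.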
From the dev side, the biggest benefit was getting test results back in under an hour instead of waiting until the next morning. We can fix issues the same day we introduce them, which makes a massive difference in productivity. But I agree with the trust issue—early on, I didn’t understand why certain tests were prioritized, and I’d manually trigger full runs just to be safe. Once the QA team added explanations to the CI feedback (“these tests were prioritized because you changed X and historically that affects Y”), I stopped second-guessing it.
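Those "changed X, historically affects Y" explanations don't require anything fancy if the prioritization is history-based: the same co-failure counts that drive the ranking can drive the message. A rough sketch, assuming a plain dict mapping (file, test) pairs to co-failure counts (illustrative names throughout):

```python
def explain(test, changed_files, index, min_count=1):
    """Build a human-readable reason from the co-failure history
    used for ranking. index maps (file, test) pairs to counts."""
    culprits = [(f, index.get((f, test), 0)) for f in changed_files]
    culprits = [(f, n) for f, n in culprits if n >= min_count]
    if not culprits:
        return f"{test}: selected by model (no direct co-failure history)"
    top_file, count = max(culprits, key=lambda fn: fn[1])
    return (f"{test}: prioritized because you changed {top_file}, "
            f"which co-failed with this test {count} time(s) historically")
```

Surfacing the evidence rather than just the score is what turns "the model said so" into something an engineer can sanity-check in the CI output.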
We tried a hybrid approach: AI prioritization for PRs and feature branches, but we always run the full suite before merging to main and before any production deployment. The risk is just too high to skip comprehensive testing at those gates. The AI helps us iterate faster during development, but we don’t rely on it for final validation. It’s more about speed during the dev cycle than replacing full coverage.
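The gate policy above reduces to a small routing decision in CI. As a trivial sketch (event and branch names are placeholders, not tied to any particular CI system):

```python
def select_suite(event, branch, prioritized, full):
    """Full suite at merge/deploy gates; prioritized subset
    for PRs and feature-branch pushes."""
    if branch == "main" or event in {"merge", "deploy"}:
        return full
    return prioritized
```

Keeping the full suite at the gates is what caps the downside: a defect the model misses on a feature branch still surfaces before it reaches production, just later in the cycle.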