We’ve been piloting an LLM-based test generation tool integrated into our CI/CD pipeline for the past three months. The results are mixed and I’m curious how others have managed this.
The tool is incredibly good at cranking out test scenarios – we’re seeing 80–90% reductions in authoring time, which is huge. It generates edge cases we never would have thought of manually, like unusual state transitions and boundary conditions across multiple services. But here’s the problem: it also generates a ton of tests that seem logically sound but don’t really exercise meaningful user workflows. We’re now sitting on thousands of generated test cases, and our regression cycles have actually gotten longer because we’re running so many tests that don’t correlate with real defects.
We’ve tried a few things – manually reviewing samples of generated tests before adding them to the suite, tracking which tests have ever caught a bug and deprioritizing the rest, and tightening the prompts to steer the LLM toward critical paths. But we’re still struggling to separate signal from noise at scale. The team is starting to lose trust in the tool because they see flaky or irrelevant tests failing and it’s hard to tell which failures are real.
How have others approached filtering or prioritizing LLM-generated tests in a CI/CD context? Are there patterns or metrics you use to decide which tests are worth keeping? And has anyone successfully combined LLM generation with ML-based test prioritization to focus execution on high-value scenarios?
We hit this exact problem six months ago. What worked for us was implementing a two-phase approach: LLM generation followed by impact analysis. We trained a lightweight ML model on our historical test execution data to predict which generated tests were most likely to detect defects based on code change patterns. Only tests scoring above a threshold get added to the main regression suite. The rest go into a secondary exploratory suite that runs nightly but doesn’t block CI. This cut our regression time by about 40% while keeping the high-value tests front and center.
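For anyone who wants to picture the gating step, here is a minimal sketch. It assumes the model has already produced a score per generated test; the `ScoredTest` type, the `route_tests` name, and the 0.6 threshold are all made up for illustration and are not our actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTest:
    name: str
    # Hypothetical: predicted likelihood (0..1) that this test detects a
    # defect, as produced by whatever model you trained on execution history.
    defect_score: float


def route_tests(tests, threshold=0.6):
    """Split scored tests into (regression, exploratory) name lists.

    Tests scoring strictly above the threshold go to the blocking regression
    suite; everything else goes to the nightly, non-blocking exploratory suite.
    """
    regression = [t.name for t in tests if t.defect_score > threshold]
    exploratory = [t.name for t in tests if t.defect_score <= threshold]
    return regression, exploratory


if __name__ == "__main__":
    scored = [
        ScoredTest("checkout_boundary_total", 0.91),
        ScoredTest("profile_avatar_resize", 0.22),
    ]
    regression, exploratory = route_tests(scored)
    print("blocking:", regression)
    print("nightly:", exploratory)
```

The threshold is the knob to tune: raising it shrinks the blocking suite at the cost of pushing more tests into the nightly run.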
This resonates. We ended up splitting our test suite into risk tiers based on code change impact analysis. High-risk changes trigger the full LLM-generated suite; low-risk changes only run a curated core set. The key was integrating the test generator with our version control system so it could see exactly what changed and recommend the relevant subset. Reduced our average CI runtime from 45 minutes to about 18 without losing coverage on the stuff that matters.
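Roughly, the tier selection can be sketched like this. It assumes you already have the list of changed file paths (for example from `git diff --name-only`); the path prefixes, function names, and test names are hypothetical placeholders, and a real setup would likely use a richer impact analysis than prefix matching.

```python
# Hypothetical path prefixes treated as high risk; adjust to your repo layout.
HIGH_RISK_PREFIXES = ("services/payments/", "services/auth/")


def risk_tier(changed_files, high_risk_prefixes=HIGH_RISK_PREFIXES):
    """Return "high" if any changed file sits under a high-risk prefix, else "low"."""
    # str.startswith accepts a tuple of prefixes.
    if any(path.startswith(high_risk_prefixes) for path in changed_files):
        return "high"
    return "low"


def select_tests(changed_files, core_tests, generated_tests):
    """Pick the tests to run for a change set.

    Low-risk changes run only the curated core set; high-risk changes run the
    core set plus the full generated suite.
    """
    if risk_tier(changed_files) == "high":
        return sorted(set(core_tests) | set(generated_tests))
    return sorted(set(core_tests))


if __name__ == "__main__":
    core = ["test_login", "test_checkout"]
    generated = ["test_gen_refund_race", "test_gen_token_expiry"]
    print(select_tests(["docs/README.md"], core, generated))
    print(select_tests(["services/auth/session.py"], core, generated))
```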
Are you tracking test effectiveness metrics? We started measuring defect detection rate per test and execution stability (how often a test fails due to flakiness vs real issues). Any test that hasn’t caught a defect in 90 days and has failed spuriously more than twice gets demoted to a lower-priority tier. It’s manual overhead initially but it pays off.
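The demotion rule is simple enough to express in a few lines. This is a sketch under assumptions: the `TestStats` fields are hypothetical names, and it assumes spurious failures are counted over the same 90-day window as defect detection.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class TestStats:
    name: str
    last_defect_caught: Optional[date]  # None if the test has never caught a defect
    spurious_failures: int              # flaky failures in the window (assumed to be the same window)


def should_demote(stats, today, window_days=90, max_spurious=2):
    """Demote when the test caught no defect within the window AND has
    failed spuriously more than `max_spurious` times."""
    no_recent_defect = (
        stats.last_defect_caught is None
        or today - stats.last_defect_caught > timedelta(days=window_days)
    )
    too_flaky = stats.spurious_failures > max_spurious
    return no_recent_defect and too_flaky


if __name__ == "__main__":
    stats = TestStats("test_gen_cart_rounding", None, 4)
    print(should_demote(stats, today=date.today()))
```

Both conditions have to hold, so a flaky test that recently caught a real defect stays in its tier, as does a stable test that has simply been quiet.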
Have you looked at combining this with self-healing test capabilities? We found that a lot of the low-value tests the LLM generated were also the ones that broke most often due to minor UI changes. Adding a self-healing layer reduced maintenance overhead and made it easier to keep a larger test suite without the constant firefighting. Doesn’t solve the prioritization problem but it helps manage the overhead.
Are you doing any human review of the generated tests before they go live? We implemented a lightweight review step where a senior QA engineer samples 10% of newly generated tests each week and marks the ones that are off-target. The LLM retrains on that feedback and the quality improved noticeably. It’s overhead but way less than manually authoring everything or dealing with a bloated suite.
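The sampling step itself can be sketched as below. The function name and the seed parameter are illustrative assumptions; how the reviewer's verdicts are fed back to the generator is tool-specific and not shown.

```python
import random


def sample_for_review(test_ids, fraction=0.10, seed=None):
    """Pick a random sample of newly generated tests for human review.

    Returns at least one test whenever the input is non-empty, so small
    weekly batches still get looked at.
    """
    test_ids = list(test_ids)
    if not test_ids:
        return []
    k = max(1, round(len(test_ids) * fraction))
    # A seeded Random instance keeps the weekly sample reproducible if needed.
    return random.Random(seed).sample(test_ids, k)


if __name__ == "__main__":
    new_tests = [f"test_gen_{i:03d}" for i in range(100)]
    for test_id in sample_for_review(new_tests, seed=42):
        print(test_id)
```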