Balancing LLM-generated test coverage with execution time in CI/CD – how do you filter the noise?

linda308 · February 14, 2025, 9:22am

We’ve been piloting an LLM-based test generation tool integrated into our CI/CD pipeline for the past three months. The results are mixed and I’m curious how others have managed this.

The tool is incredibly good at cranking out test scenarios – we’re seeing 80–90% reductions in authoring time, which is huge. It generates edge cases we never would have thought of manually, like unusual state transitions and boundary conditions across multiple services. But here’s the problem: it also generates a ton of tests that seem logically sound but don’t really exercise meaningful user workflows. We’re now sitting on thousands of generated test cases, and our regression cycles have actually gotten longer because we’re running so many tests that don’t correlate with real defects.

We’ve tried a few things – manually reviewing samples of generated tests before adding them to the suite, tracking which tests have ever caught a bug and deprioritizing the rest, and setting stricter prompts to guide the LLM toward critical paths. But we’re still struggling to separate signal from noise at scale. The team is starting to lose trust in the tool because they see flaky or irrelevant tests failing and it’s hard to tell what’s real.

How have others approached filtering or prioritizing LLM-generated tests in a CI/CD context? Are there patterns or metrics you use to decide which tests are worth keeping? And has anyone successfully combined LLM generation with ML-based test prioritization to focus execution on high-value scenarios?

carlos_admin · February 14, 2025, 11:47am

We hit this exact problem six months ago. What worked for us was implementing a two-phase approach: LLM generation followed by impact analysis. We trained a lightweight ML model on our historical test execution data to predict which generated tests were most likely to detect defects based on code change patterns. Only tests scoring above a threshold get added to the main regression suite. The rest go into a secondary exploratory suite that runs nightly but doesn’t block CI. This cut our regression time by about 40% while keeping the high-value tests front and center.
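For anyone who wants to picture the gating step, here is a minimal sketch. It assumes the model emits a per-test score between 0 and 1. `ScoredTest`, `route_tests`, and the 0.6 threshold are hypothetical placeholders, not details of the setup described above.

```python
from dataclasses import dataclass


@dataclass
class ScoredTest:
    name: str
    # Assumed model output: predicted likelihood that this test detects a defect.
    score: float


def route_tests(tests, threshold=0.6):
    """Split scored tests into a blocking regression suite and a nightly suite.

    Tests at or above the threshold block CI; the rest run nightly only.
    The default threshold is an illustrative value, not a recommendation.
    """
    regression = [t.name for t in tests if t.score >= threshold]
    exploratory = [t.name for t in tests if t.score < threshold]
    return regression, exploratory


if __name__ == "__main__":
    scored = [ScoredTest("checkout_retry", 0.91), ScoredTest("tooltip_hover", 0.12)]
    blocking, nightly = route_tests(scored)
    print("blocking:", blocking)
    print("nightly:", nightly)
```

In practice the threshold would probably be tuned against historical data, for example by checking how many past defects would have been missed at each cutoff.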

caro9172 · February 15, 2025, 3:40pm

This resonates. We ended up splitting our test suite into risk tiers based on code change impact analysis. High-risk changes trigger the full LLM-generated suite; low-risk changes only run a curated core set. The key was integrating the test generator with our version control system so it could see exactly what changed and recommend the relevant subset. Reduced our average CI runtime from 45 minutes to about 18 without losing coverage on the stuff that matters.
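A rough sketch of the tier selection, assuming risk is decided purely from the paths touched by a change. The path patterns, suite names, and function names are all hypothetical; a real setup would likely use richer signals than file paths.

```python
from fnmatch import fnmatch

# Hypothetical patterns for areas where a change is treated as high risk.
HIGH_RISK_PATTERNS = ["src/payments/*", "src/auth/*", "migrations/*"]


def classify_change(changed_files, high_risk_patterns=HIGH_RISK_PATTERNS):
    """Return "high" if any changed path matches a high-risk pattern, else "low"."""
    for path in changed_files:
        # fnmatch's "*" also matches "/" so nested files are covered.
        if any(fnmatch(path, pattern) for pattern in high_risk_patterns):
            return "high"
    return "low"


def select_suite(changed_files):
    """Map the risk tier to a suite name (both names are placeholders)."""
    if classify_change(changed_files) == "high":
        return "full_generated"
    return "curated_core"


if __name__ == "__main__":
    print(select_suite(["src/payments/api/charge.py", "README.md"]))
    print(select_suite(["docs/index.md"]))
```

The list of changed files could come from something like `git diff --name-only` against the target branch, with the result passed to `select_suite` in the pipeline step that picks which tests to run.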

et_anl · February 14, 2025, 1:15pm

Are you tracking test effectiveness metrics? We started measuring defect detection rate per test and execution stability (how often a test fails due to flakiness vs real issues). Any test that hasn’t caught a defect in 90 days and has failed spuriously more than twice gets demoted to a lower-priority tier. It’s manual overhead initially but it pays off.
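The demotion rule can be written down in a few lines. This is only a sketch: `EffectivenessStats` and `should_demote` are hypothetical names, and it assumes the spurious-failure count has already been tallied (the window for counting those failures is left to whoever fills in the record).

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class EffectivenessStats:
    name: str
    # Date the test last caught a real defect; None if it never has.
    last_defect_caught: Optional[date]
    # Number of failures attributed to flakiness rather than real issues.
    spurious_failures: int


def should_demote(stats, today, window_days=90, max_spurious=2):
    """Demote when no defect was caught within the window AND the test
    has failed spuriously more than max_spurious times."""
    cutoff = today - timedelta(days=window_days)
    no_recent_defect = (
        stats.last_defect_caught is None or stats.last_defect_caught < cutoff
    )
    return no_recent_defect and stats.spurious_failures > max_spurious


if __name__ == "__main__":
    today = date(2025, 2, 14)
    stale = EffectivenessStats("legacy_banner", date(2024, 10, 1), 3)
    print(stale.name, "demote:", should_demote(stale, today))
```

Both conditions must hold, so a flaky test that recently caught a real defect stays in its tier, as does a quiet but stable one.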

mary_analyst · February 15, 2025, 8:30am

Have you looked at combining this with self-healing test capabilities? We found that a lot of the low-value tests the LLM generated were also the ones that broke most often due to minor UI changes. Adding a self-healing layer reduced maintenance overhead and made it easier to keep a larger test suite without the constant firefighting. Doesn’t solve the prioritization problem but it helps manage the overhead.

raj_func · February 16, 2025, 9:12am

Are you doing any human review of the generated tests before they go live? We implemented a lightweight review step where a senior QA engineer samples 10% of newly generated tests each week and marks the ones that are off-target. The LLM is retrained on that feedback, and the quality has improved noticeably. It’s overhead, but way less than manually authoring everything or dealing with a bloated suite.
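The weekly sampling step might look something like the sketch below. `sample_for_review` is a hypothetical helper, and the optional seed is an assumption added so a given week's sample can be reproduced. How the reviewer's marks get fed back to the generator is not covered here.

```python
import random


def sample_for_review(test_ids, fraction=0.10, seed=None):
    """Pick a random subset of newly generated tests for human review.

    Always returns at least one test when the input is non-empty.
    """
    test_ids = list(test_ids)
    if not test_ids:
        return []
    rng = random.Random(seed)  # seeded so the same sample can be regenerated
    k = max(1, round(len(test_ids) * fraction))
    return sorted(rng.sample(test_ids, k))


if __name__ == "__main__":
    new_tests = [f"gen_test_{i:03d}" for i in range(50)]
    for test_id in sample_for_review(new_tests, seed=7):
        print(test_id)
```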
