How do you catch aggregation errors in LLM-generated BI queries before they reach executives?

We’re piloting an LLM-powered natural language interface for our executive dashboards in Power BI, and I’m running into a serious trust problem. Last week we had a near-miss where an AI-generated query calculated regional performance by averaging store-level percentages instead of recalculating the ratio at the regional level. The numbers looked plausible—nobody questioned them until finance ran their own report and found a 40% discrepancy.

The core issue is that our LLM confidently generates SQL that parses correctly and executes without errors, but the aggregation logic is fundamentally wrong. We’re seeing things like averaging pre-computed CTRs, using COUNT instead of SUM on conversion metrics, and joins that silently return incomplete result sets. Most of these don’t throw errors; they just produce incorrect numbers that make it into leadership presentations.
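For anyone who hasn't hit this failure mode, here's a minimal self-contained repro of the ratio-averaging bug. The schema and numbers are hypothetical (not our data), but the shape of the error is exactly what we saw:

```python
import sqlite3

# Two stores in one region: a tiny store with a high conversion rate and a
# large store with a low one. Averaging their percentages ignores store size.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_stats (region TEXT, store TEXT, conversions INTEGER, visits INTEGER);
INSERT INTO store_stats VALUES
  ('West', 'A', 5,   100),    -- 5% conversion, 100 visits
  ('West', 'B', 100, 10000);  -- 1% conversion, 10,000 visits
""")

# Wrong: average the pre-computed per-store percentages.
wrong = conn.execute("""
  SELECT AVG(1.0 * conversions / visits) FROM store_stats WHERE region = 'West'
""").fetchone()[0]

# Right: recompute the ratio at the regional level from the raw counts.
right = conn.execute("""
  SELECT 1.0 * SUM(conversions) / SUM(visits) FROM store_stats WHERE region = 'West'
""").fetchone()[0]

print(f"averaged percentages: {wrong:.2%}")  # 3.00%
print(f"recomputed ratio:     {right:.2%}")  # 1.04%
```

Both queries parse, execute, and return a plausible-looking number, which is why nothing downstream flags it.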

We’ve started building validation rules and considering a semantic layer, but I’m curious what others have implemented. How are you catching these errors before they reach decision-makers? Are you using multi-stage validation pipelines, self-correcting workflows, or some kind of automated reconciliation against known-good queries?

This is exactly why I’m skeptical of rolling out AI-powered analytics broadly in our org. We had an incident where a dashboard showed one visitor and one pageview for a new app launch, and leadership almost killed the project. Turned out the query logic was fundamentally broken—actual numbers were 121 visitors and 159 events. The scary part is the dashboard didn’t error; it just silently returned garbage. Now I always cross-check AI-generated metrics against raw upstream logs before trusting anything.

The averaging-an-average problem is brutal. We had a similar issue in our sales performance dashboards where pre-calculated margin percentages at the product level were being averaged up to category level instead of recalculating margin as total profit divided by total revenue. Finance nearly made a massive inventory decision based on those bad numbers. Now we enforce a rule: ratio metrics can only be defined at the chart level using SUM functions for both numerator and denominator, never at the data source level.

We hit this exact problem six months ago with our text-to-SQL pilot. Our solution was to implement a three-gate validation pipeline: syntax check, schema validation, and then a consistency check that compares LLM output against a library of known-good queries for similar questions. If the new query deviates significantly in structure or row counts from established patterns, it gets flagged for manual review before execution. This caught about 60% of our aggregation errors before they reached users.
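For anyone wanting to try this, here's roughly what a stripped-down version of that pipeline could look like. This is an illustrative sketch using SQLite stand-ins: the gate names, the row-count tolerance, and the known-good query are all placeholders, and a real structural-similarity check would be more involved than comparing row counts. Note that SQLite's `EXPLAIN` compiles the statement against the live schema, so the first gate covers both syntax and schema resolution in one step:

```python
import sqlite3

def gate_compiles(conn: sqlite3.Connection, sql: str) -> bool:
    """Gates 1+2: EXPLAIN compiles the query, so bad syntax or
    references to missing tables/columns both fail here."""
    try:
        conn.execute(f"EXPLAIN {sql}")
        return True
    except sqlite3.Error:
        return False

def gate_consistency(conn: sqlite3.Connection, sql: str,
                     reference_sql: str, tolerance: float = 0.5) -> bool:
    """Gate 3 (simplified): compare row counts against a known-good query
    for a similar question; large deviations get flagged for review."""
    n_new = len(conn.execute(sql).fetchall())
    n_ref = len(conn.execute(reference_sql).fetchall())
    return abs(n_new - n_ref) <= tolerance * max(n_ref, 1)

# Toy schema and queries, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (region TEXT, amount REAL);
INSERT INTO orders VALUES ('West', 10), ('East', 20), ('North', 5);
""")

llm_sql = "SELECT region, SUM(amount) FROM orders GROUP BY region"
known_good = "SELECT region, COUNT(*) FROM orders GROUP BY region"

print(gate_compiles(conn, llm_sql))                # True
print(gate_compiles(conn, "SELEC region FROM x"))  # False
print(gate_consistency(conn, llm_sql, known_good)) # True: same grouping shape
```

The consistency gate is the one doing the real work against silent aggregation errors, since those queries sail through the first two gates by design.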