LLM generating SQL that looks right but returns wrong aggregations—how to catch this?

We’re piloting an LLM-powered natural language query interface for our BI platform, and I’m running into a frustrating issue. The model generates SQL that parses cleanly and executes without errors, but the numbers coming back are just wrong. Latest example: someone asked for year-over-year revenue growth by region for customers acquired in Q1, and the query ran fine but returned figures that don’t match what finance calculated manually.

When I dug into the generated SQL, the problem was buried in the JOIN logic. The LLM chose a path through three tables that made syntactic sense but produced a many-to-many relationship it didn’t account for, so revenue got double-counted for about 30% of customers. On the surface, the query looked reasonable. No syntax errors. It just silently returned bad data.
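For anyone who hasn't hit this failure mode, here's a minimal reproduction of the silent fan-out using sqlite3. The schema and table names (`customers`, `orders`, `tickets`) are hypothetical, not the poster's actual tables; the point is that joining an unrelated one-to-many table multiplies every order row, so `SUM(revenue)` inflates with no error:

```python
import sqlite3

# Hypothetical three-table schema reproducing the silent fan-out:
# joining tickets (1:N per customer) alongside orders multiplies
# each order row by that customer's ticket count.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (customer_id INTEGER, revenue REAL);
CREATE TABLE tickets (customer_id INTEGER, subject TEXT);
INSERT INTO customers VALUES (1, 'EMEA');
INSERT INTO orders VALUES (1, 100.0), (1, 50.0);
INSERT INTO tickets VALUES (1, 'a'), (1, 'b');  -- two tickets => 2x fan-out
""")

# LLM-style query: parses, executes, silently double-counts revenue.
bad = conn.execute("""
    SELECT SUM(o.revenue)
    FROM customers c
    JOIN orders o  ON o.customer_id = c.id
    JOIN tickets t ON t.customer_id = c.id
""").fetchone()[0]

# Corrected: pre-aggregate orders before touching the unrelated table.
good = conn.execute("""
    SELECT SUM(r.rev)
    FROM (SELECT customer_id, SUM(revenue) AS rev
          FROM orders GROUP BY customer_id) r
    JOIN customers c ON c.id = r.customer_id
""").fetchone()[0]

print(bad, good)  # 300.0 150.0 -- the naive join doubles revenue
```

Both queries are syntactically valid; only the pre-aggregated version returns the true 150.0.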

We’ve also hit issues with ratio metrics. The model sometimes averages pre-aggregated percentages instead of recalculating from base numerator and denominator, which can throw results off by 50% or more depending on the data distribution. I know semantic layers are supposed to help with this, but we’re still in early piloting and don’t have full governance infrastructure in place yet.
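The ratio-metric failure is easy to demonstrate in isolation. The segment data below is made up, but it shows how averaging pre-aggregated percentages ignores segment size, while recomputing from the base numerator and denominator does not:

```python
# Hypothetical two-segment conversion data: averaging the pre-aggregated
# percentages weights a 10-visit segment equally with a 1000-visit one.
segments = [
    {"conversions": 5,   "visits": 10},    # 50% on a tiny segment
    {"conversions": 100, "visits": 1000},  # 10% on a large segment
]

# Wrong: mean of per-segment percentages (what the LLM sometimes emits).
avg_of_ratios = sum(s["conversions"] / s["visits"] for s in segments) / len(segments)

# Right: ratio of sums over the base columns.
ratio_of_sums = sum(s["conversions"] for s in segments) / sum(s["visits"] for s in segments)

print(avg_of_ratios, ratio_of_sums)  # 0.30 vs roughly 0.104, nearly 3x apart
```

The skew grows with how unevenly the denominator is distributed, which is why the error size varies so much run to run.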

Has anyone tackled this in a production or pilot setting? What kind of validation gates or checks did you put in place to catch queries that execute successfully but return fundamentally misleading results? And how do you handle it when the business is already making decisions based on a dashboard before someone notices the numbers are off?

One pattern that helped us was embedding schema metadata directly into the database as natural language descriptions. Instead of just table and column names, we added comments explaining relationships, cardinality, and business rules. For example, ‘customer_id in orders table is a foreign key to customers table; one customer can have many orders.’ When the LLM has that context, it makes fewer hallucinated guesses about how to join tables. We saw query accuracy improve by about 25% after enriching the schema metadata.
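A minimal sketch of that enrichment step, assuming you keep the descriptions in application code (in Postgres the same notes could live in `COMMENT ON TABLE`/`COMMENT ON COLUMN` and be read back at prompt-build time). All names here are hypothetical:

```python
# Hypothetical schema notes keyed by table, then column; "_table" holds
# the table-level description, other keys hold per-column business rules.
SCHEMA_NOTES = {
    "customers": {
        "_table": "One row per customer.",
        "id": "Primary key.",
    },
    "orders": {
        "_table": "One row per order line, not per order.",
        "customer_id": ("Foreign key to customers.id; one customer "
                        "can have many orders (1:N)."),
        "revenue": "Net of refunds; already in USD.",
    },
}

def schema_context(notes: dict) -> str:
    """Render enriched schema notes as plain text for the LLM prompt."""
    lines = []
    for table, cols in notes.items():
        lines.append(f"Table {table}: {cols.get('_table', '')}")
        for col, desc in cols.items():
            if col != "_table":
                lines.append(f"  - {col}: {desc}")
    return "\n".join(lines)

print(schema_context(SCHEMA_NOTES))
```

Prepending that rendered block to the generation prompt gives the model explicit cardinality and business-rule context instead of forcing it to guess from column names.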

You need a multi-stage validation pipeline before execution:

- Stage one checks syntax and schema existence: does the table actually exist, and are the column names valid?
- Stage two is logical consistency: does the query structure make sense given known table relationships and cardinality? We built a lightweight rules engine that knows our schema's join paths and flags queries that create many-to-many joins without explicit aggregation logic.
- Stage three is a dry-run execution on a small data slice with expected-result validation. If the row count or sum totals are wildly off from historical norms, we block the query and ask the user to clarify their intent.

This setup caught about 70% of hallucinated queries before they hit production dashboards.
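The three stages above can be sketched roughly as follows. This is a toy version under stated assumptions: `KNOWN_TABLES` and `FANOUT_PAIRS` stand in for real schema metadata, and the regex-based table extraction is deliberately naive (a production version would use a proper SQL parser):

```python
import re

# Hypothetical schema metadata: valid tables, plus table pairs whose join
# fans out rows unless the query aggregates first.
KNOWN_TABLES = {"customers", "orders", "tickets"}
FANOUT_PAIRS = [frozenset({"orders", "tickets"})]

def extract_tables(sql: str) -> set:
    # Naive extraction; fine for a sketch, not for production SQL.
    return set(t.lower() for t in re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE))

def stage1_schema(sql: str) -> list:
    """Flag references to tables that do not exist in the schema."""
    unknown = extract_tables(sql) - KNOWN_TABLES
    return [f"unknown table: {t}" for t in sorted(unknown)]

def stage2_join_logic(sql: str) -> list:
    """Flag known fan-out join pairs with no aggregation in sight."""
    tables = extract_tables(sql)
    issues = []
    for pair in FANOUT_PAIRS:
        if pair <= tables and "GROUP BY" not in sql.upper():
            issues.append(f"possible many-to-many fan-out: {sorted(pair)}")
    return issues

def stage3_sanity(total: float, historical_mean: float, tolerance: float = 0.5) -> list:
    """Compare a dry-run total against a historical norm; block big deviations."""
    if historical_mean and abs(total - historical_mean) / historical_mean > tolerance:
        return [f"total {total} deviates >{tolerance:.0%} from norm {historical_mean}"]
    return []

sql = "SELECT SUM(o.revenue) FROM orders o JOIN tickets t ON t.customer_id = o.customer_id"
print(stage1_schema(sql) + stage2_join_logic(sql))  # flags the orders/tickets fan-out
```

Even this crude gate catches the double-counting pattern from the question, and the stage-three check is what surfaces the errors the first two stages can't see structurally.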