LLM generating SQL that looks right but returns wrong aggregations—how to catch this?

We’re piloting an LLM-powered natural language query interface for our BI platform, and I’m running into a frustrating issue. The model generates SQL that parses cleanly and executes without errors, but the numbers coming back are just wrong. Latest example: someone asked for year-over-year revenue growth by region for customers acquired in Q1, and the query ran fine but returned figures that don’t match what finance calculated manually.

When I dug into the generated SQL, the problem was buried in the JOIN logic. The LLM chose a path through three tables that made syntactic sense but produced a many-to-many relationship it didn’t account for, so revenue got double-counted for about 30% of customers. On the surface, the query looked reasonable. No syntax errors. It just silently returned bad data.
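For anyone who hasn't hit this failure mode, here's a minimal reproduction of the silent fan-out using sqlite3. The schema and table names (`customers`, `orders`, `tickets`) are hypothetical, not the poster's actual tables; the point is that joining an unrelated one-to-many table multiplies every order row, so `SUM(revenue)` inflates with no error:

```python
import sqlite3

# Hypothetical three-table schema reproducing the silent fan-out:
# joining tickets (1:N per customer) alongside orders multiplies
# each order row by that customer's ticket count.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (customer_id INTEGER, revenue REAL);
CREATE TABLE tickets (customer_id INTEGER, subject TEXT);
INSERT INTO customers VALUES (1, 'EMEA');
INSERT INTO orders VALUES (1, 100.0), (1, 50.0);
INSERT INTO tickets VALUES (1, 'a'), (1, 'b');  -- two tickets => 2x fan-out
""")

# LLM-style query: parses, executes, silently double-counts revenue.
bad = conn.execute("""
    SELECT SUM(o.revenue)
    FROM customers c
    JOIN orders o  ON o.customer_id = c.id
    JOIN tickets t ON t.customer_id = c.id
""").fetchone()[0]

# Corrected: pre-aggregate orders before touching the unrelated table.
good = conn.execute("""
    SELECT SUM(r.rev)
    FROM (SELECT customer_id, SUM(revenue) AS rev
          FROM orders GROUP BY customer_id) r
    JOIN customers c ON c.id = r.customer_id
""").fetchone()[0]

print(bad, good)  # 300.0 150.0 -- the naive join doubles revenue
```

Both queries are syntactically valid; only the pre-aggregated version returns the true 150.0.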

We’ve also hit issues with ratio metrics. The model sometimes averages pre-aggregated percentages instead of recalculating from base numerator and denominator, which can throw results off by 50% or more depending on the data distribution. I know semantic layers are supposed to help with this, but we’re still in early piloting and don’t have full governance infrastructure in place yet.
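The ratio-metric failure is easy to demonstrate in isolation. The segment data below is made up, but it shows how averaging pre-aggregated percentages ignores segment size, while recomputing from the base numerator and denominator does not:

```python
# Hypothetical two-segment conversion data: averaging the pre-aggregated
# percentages weights a 10-visit segment equally with a 1000-visit one.
segments = [
    {"conversions": 5,   "visits": 10},    # 50% on a tiny segment
    {"conversions": 100, "visits": 1000},  # 10% on a large segment
]

# Wrong: mean of per-segment percentages (what the LLM sometimes emits).
avg_of_ratios = sum(s["conversions"] / s["visits"] for s in segments) / len(segments)

# Right: ratio of sums over the base columns.
ratio_of_sums = sum(s["conversions"] for s in segments) / sum(s["visits"] for s in segments)

print(avg_of_ratios, ratio_of_sums)  # 0.30 vs roughly 0.104, nearly 3x apart
```

The skew grows with how unevenly the denominator is distributed, which is why the error size varies so much run to run.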

Has anyone tackled this in a production or pilot setting? What kind of validation gates or checks did you put in place to catch queries that execute successfully but return fundamentally misleading results? And how do you handle it when the business is already making decisions based on a dashboard before someone notices the numbers are off?

One pattern that helped us was embedding schema metadata directly into the database as natural language descriptions. Instead of just table and column names, we added comments explaining relationships, cardinality, and business rules. For example, ‘customer_id in orders table is a foreign key to customers table; one customer can have many orders.’ When the LLM has that context, it makes fewer hallucinated guesses about how to join tables. We saw query accuracy improve by about 25% after enriching the schema metadata.
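A minimal sketch of that enrichment step, assuming you keep the descriptions in application code (in Postgres the same notes could live in `COMMENT ON TABLE`/`COMMENT ON COLUMN` and be read back at prompt-build time). All names here are hypothetical:

```python
# Hypothetical schema notes keyed by table, then column; "_table" holds
# the table-level description, other keys hold per-column business rules.
SCHEMA_NOTES = {
    "customers": {
        "_table": "One row per customer.",
        "id": "Primary key.",
    },
    "orders": {
        "_table": "One row per order line, not per order.",
        "customer_id": ("Foreign key to customers.id; one customer "
                        "can have many orders (1:N)."),
        "revenue": "Net of refunds; already in USD.",
    },
}

def schema_context(notes: dict) -> str:
    """Render enriched schema notes as plain text for the LLM prompt."""
    lines = []
    for table, cols in notes.items():
        lines.append(f"Table {table}: {cols.get('_table', '')}")
        for col, desc in cols.items():
            if col != "_table":
                lines.append(f"  - {col}: {desc}")
    return "\n".join(lines)

print(schema_context(SCHEMA_NOTES))
```

Prepending that rendered block to the generation prompt gives the model explicit cardinality and business-rule context instead of forcing it to guess from column names.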

You need a multi-stage validation pipeline before execution:

- Stage one checks syntax and schema existence: does the table actually exist, and are the column names valid?
- Stage two is logical consistency: does the query structure make sense given known table relationships and cardinality? We built a lightweight rules engine that knows our schema's join paths and flags queries that create many-to-many joins without explicit aggregation logic.
- Stage three is a dry-run execution on a small data slice with expected-result validation. If the row count or sum totals are wildly off from historical norms, we block the query and ask the user to clarify their intent.

This setup caught about 70% of hallucinated queries before they hit production dashboards.
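The three stages above can be sketched roughly as follows. This is a toy version under stated assumptions: `KNOWN_TABLES` and `FANOUT_PAIRS` stand in for real schema metadata, and the regex-based table extraction is deliberately naive (a production version would use a proper SQL parser):

```python
import re

# Hypothetical schema metadata: valid tables, plus table pairs whose join
# fans out rows unless the query aggregates first.
KNOWN_TABLES = {"customers", "orders", "tickets"}
FANOUT_PAIRS = [frozenset({"orders", "tickets"})]

def extract_tables(sql: str) -> set:
    # Naive extraction; fine for a sketch, not for production SQL.
    return set(t.lower() for t in re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE))

def stage1_schema(sql: str) -> list:
    """Flag references to tables that do not exist in the schema."""
    unknown = extract_tables(sql) - KNOWN_TABLES
    return [f"unknown table: {t}" for t in sorted(unknown)]

def stage2_join_logic(sql: str) -> list:
    """Flag known fan-out join pairs with no aggregation in sight."""
    tables = extract_tables(sql)
    issues = []
    for pair in FANOUT_PAIRS:
        if pair <= tables and "GROUP BY" not in sql.upper():
            issues.append(f"possible many-to-many fan-out: {sorted(pair)}")
    return issues

def stage3_sanity(total: float, historical_mean: float, tolerance: float = 0.5) -> list:
    """Compare a dry-run total against a historical norm; block big deviations."""
    if historical_mean and abs(total - historical_mean) / historical_mean > tolerance:
        return [f"total {total} deviates >{tolerance:.0%} from norm {historical_mean}"]
    return []

sql = "SELECT SUM(o.revenue) FROM orders o JOIN tickets t ON t.customer_id = o.customer_id"
print(stage1_schema(sql) + stage2_join_logic(sql))  # flags the orders/tickets fan-out
```

Even this crude gate catches the double-counting pattern from the question, and the stage-three check is what surfaces the errors the first two stages can't see structurally.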