We’re piloting an LLM-powered natural language query interface for our BI platform, and I’m running into a frustrating issue. The model generates SQL that parses cleanly and executes without errors, but the numbers coming back are just wrong. Latest example: someone asked for year-over-year revenue growth by region for customers acquired in Q1, and the query ran fine but returned figures that don’t match what finance calculated manually.
When I dug into the generated SQL, the problem was buried in the JOIN logic. The LLM chose a join path through three tables that was syntactically valid but created a fan-out (many-to-many) relationship it didn't account for, so revenue got double-counted for roughly 30% of customers. On the surface the query looked reasonable: no syntax errors, it just silently returned bad data.
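To make the failure mode concrete, here's a minimal sketch of the fan-out problem using made-up tables (`orders` and `tickets` are hypothetical, not our actual schema). Joining through a table with multiple rows per customer duplicates the revenue rows before the SUM, and nothing errors out:

```python
import sqlite3

# Hypothetical schema: orders hold revenue; tickets fan out the join
# because a customer can have several tickets.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (customer_id INT, revenue REAL);
CREATE TABLE tickets (customer_id INT);
INSERT INTO orders VALUES (1, 100.0), (2, 200.0);
INSERT INTO tickets VALUES (1), (1), (2);  -- customer 1 has two tickets
""")

true_total = con.execute("SELECT SUM(revenue) FROM orders").fetchone()[0]

# The shape of query the LLM tends to emit: the join duplicates
# customer 1's order row, so revenue is silently double-counted.
joined_total = con.execute("""
    SELECT SUM(o.revenue)
    FROM orders o JOIN tickets t ON o.customer_id = t.customer_id
""").fetchone()[0]

print(true_total, joined_total)  # 300.0 vs 400.0
```

One cheap validation gate that falls out of this: reconcile the aggregate against the base fact table (`true_total == joined_total` here) and reject the query when they diverge, since a fan-out always inflates the sum.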
We’ve also hit issues with ratio metrics. The model sometimes averages pre-aggregated percentages instead of recomputing the ratio from the summed numerator and denominator, which can throw results off by 50% or more depending on how skewed the denominators are. I know semantic layers are supposed to help with this, but we’re still in early piloting and don’t have full governance infrastructure in place yet.
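The ratio-metric failure is easy to reproduce with two made-up regions whose denominators differ by two orders of magnitude (the figures below are illustrative, not our data):

```python
# Hypothetical per-region aggregates: a large region at 10% margin
# and a tiny region at 50% margin.
regions = [
    {"revenue": 1_000_000, "profit": 100_000},  # 10% margin
    {"revenue": 10_000, "profit": 5_000},       # 50% margin
]

# Wrong: average the pre-aggregated percentages (what the model emits).
pcts = [r["profit"] / r["revenue"] for r in regions]
naive = sum(pcts) / len(pcts)

# Right: recompute the ratio from the summed numerator and denominator.
correct = sum(r["profit"] for r in regions) / sum(r["revenue"] for r in regions)

print(f"{naive:.1%} vs {correct:.1%}")  # 30.0% vs 10.4%
```

The unweighted average nearly triples the blended margin because the tiny region gets equal weight, which is the kind of distribution-dependent error that's invisible at query-execution time.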
Has anyone tackled this in a production or pilot setting? What kind of validation gates or checks did you put in place to catch queries that execute successfully but return fundamentally misleading results? And how do you handle it when the business is already making decisions based on a dashboard before someone notices the numbers are off?