Our self-service analytics dashboards have become unusable over the past month as our datasets grew beyond 5 million records. Load times went from 3-4 seconds to over 30 seconds, and some complex dashboards timeout entirely. This is seriously impacting user adoption - business users are abandoning the platform.
The dashboards use multiple widgets with cross-filtering enabled, showing sales trends, regional breakdowns, and product performance. When users apply filters or drill down, the entire dashboard freezes for 20-30 seconds.
Here’s a typical query pattern we’re using:
q = load "sales_data";
q = filter q by 'Date' in ["current_year"];
q = group q by ('Region', 'Product');
q = foreach q generate sum('Revenue') as 'Total';
We haven’t implemented any specific data aggregation strategies or indexed filter fields. Should we be pre-aggregating data or is there a better query optimization approach for large datasets in tcrm-2022?
Are you using compact form for your datasets? With 5 million records, storage format makes a huge difference. Also, check if your date filters are using indexed fields. Date ranges are common filter criteria and should absolutely be indexed. You can verify indexing in the dataset metadata and add indexes through the dataflow configuration if they’re missing.
Beyond indexing, you really need to implement aggregation layers for self-service dashboards at this scale. Create pre-aggregated datasets at different grain levels - daily summaries, weekly rollups, monthly aggregates. Then configure your dashboard to use the appropriate aggregation level based on the selected date range. This is standard practice for enterprise-scale self-service BI. Users drilling into daily detail would query the granular dataset, but overview widgets use the aggregated versions. This can reduce query times by 80-90% for typical use cases.
Here’s a comprehensive optimization strategy addressing all three critical areas:
Query Optimization:
Restructure your SAQL to filter early and minimize data scanning. Your current query loads everything first - instead:
q = load "sales_data";
q = filter q by 'Date' in ["current_year"];
q = filter q by 'Region' is not null;
q = group q by ('Region', 'Product');
q = foreach q generate sum('Revenue') as 'Total';
Apply all filters immediately after load, before any grouping or aggregation. This reduces the working dataset size early in the query pipeline.
Data Aggregation Strategy:
Implement a three-tier aggregation approach in your dataflow:
- Raw dataset (5M records) - for detailed drill-downs only
- Daily aggregates (500K records) - for week/month views
- Monthly aggregates (50K records) - for year/quarter overview
Create separate datasets for each grain level and configure your dashboard to automatically select the appropriate dataset based on date range selection. Use dashboard bindings with conditional logic:
if date_range <= 7 days: use raw_data
if date_range <= 90 days: use daily_aggregates
if date_range > 90 days: use monthly_aggregates
Indexed Filter Fields:
Add indexes to your most commonly filtered dimensions. In your dataflow metadata, configure these fields as indexed:
- Date (absolutely critical for time-series queries)
- Region (common filter criterion)
- Product Category (if used in filters)
- Any field used in dashboard global filters
Indexing requires dataset rebuild but provides 5-10x performance improvement for filtered queries. The rebuild is a one-time cost for ongoing performance gains.
Additional optimizations for tcrm-2022:
- Enable compact storage format (reduces dataset size by 40-60%)
- Implement query result caching with 1-hour TTL for common patterns
- Use pagination for large result sets (limit initial load to 1000 rows)
- Disable cross-filtering on widgets that don’t need it (reduces query cascade)
- Schedule dataset refreshes during off-peak hours to maintain compact form
With these changes implemented, you should see dashboard load times drop from 30+ seconds to under 5 seconds for typical queries, even with continued data growth. The aggregation strategy is the biggest win - most business users don’t need transaction-level detail for overview dashboards.
Also consider implementing result caching for common filter combinations. If multiple users are running similar queries (like current quarter sales by region), cache those results with a reasonable TTL. Tableau CRM in tcrm-2022 has improved caching capabilities that can serve cached results for identical queries, dramatically improving perceived performance for repeated access patterns.
We’re using the default storage format - didn’t realize compact form was an option. The date field isn’t specifically indexed, just a standard dimension. How much performance improvement can we expect from adding indexes? And does that require rebuilding the entire dataset?