We’re experiencing significant query latency spikes in our Azure Log Analytics workspace during peak hours. Our monitoring dashboard runs multiple KQL queries every 30 seconds, and response times jump from 2-3 seconds to 30+ seconds when data ingestion rates exceed 50GB/hour.
Current query pattern:
Heartbeat
| where TimeGenerated > ago(1h)
| summarize count() by Computer, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
We’ve noticed the latency correlates with high ingestion volumes from our container cluster (200+ nodes). The workspace is on the Standard tier with default retention. Query optimization attempts haven’t resolved the issue, and we’re concerned about workspace scaling limits affecting our real-time monitoring capabilities.
I see multiple issues contributing to your latency spikes. Let me address the three key areas systematically:
Query Optimization:
Your current query scans every Heartbeat record in the window across all 200+ nodes. Keep the time filter first (datetime filters are the most effective predicate in KQL), then narrow by Computer before aggregating:
Heartbeat
| where TimeGenerated > ago(1h)
| where Computer in (computerList)
| summarize count() by Computer, bin(TimeGenerated, 5m)
If your dashboard only tracks a subset of nodes, this can cut the scanned data dramatically. For broader queries, apply summarize as early in the pipeline as possible.
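If the node list is reasonably static, it can be declared inline with a let statement. A minimal sketch (the node names and the HeartbeatCount alias are placeholders, not from your environment):

```kusto
// Hypothetical node list - replace with the computers your dashboard tracks
let computerList = dynamic(["node-01", "node-02", "node-03"]);
Heartbeat
| where TimeGenerated > ago(1h)
| where Computer in (computerList)
| summarize HeartbeatCount = count() by Computer, bin(TimeGenerated, 5m)
```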
Data Ingestion Strategy:
At 800GB/day with 50GB/hour spikes, you’re hitting Standard tier soft limits. Implement these ingestion optimizations:
- Enable the Basic logs table plan for verbose container stdout/stderr (substantially lower per-GB ingestion cost, 8-day interactive retention, but note the reduced KQL surface and no log alerting on those tables)
- Use DCR transformations to drop unnecessary fields at ingestion
- Configure container insights to collect only Warning+ logs during peak hours
- Split high-volume sources (container logs) into a separate workspace with Basic logs tier
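A DCR transformation is just a KQL expression applied to the incoming stream before it lands in the workspace. A sketch of the transformKql body, assuming your container log schema has severity and label columns roughly like these (the column names are illustrative):

```kusto
// DCR transformKql sketch: drop verbose entries and unneeded columns at ingestion
// "LogLevel", "PodLabels", "ContainerImageTag" are hypothetical column names
source
| where LogLevel != "Verbose" and LogLevel != "Debug"
| project-away PodLabels, ContainerImageTag
```

Everything filtered here is never billed as ingestion, which is why this tends to pay off faster than query-side tuning.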
Workspace Scaling:
Your workspace needs architectural changes:
- Create a dedicated workspace for container telemetry using Basic logs (reduces query load on primary workspace)
- Keep critical alerting data (Heartbeat, metrics) in Standard tier workspace
- Use cross-workspace queries only for historical analysis, not real-time dashboards
- Consider Log Analytics Dedicated Clusters if budget allows (guaranteed 500GB/day minimum with consistent performance)
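When you do need a historical view spanning both workspaces, KQL's workspace() function joins them in a single query. A sketch, with a placeholder workspace name:

```kusto
// Cross-workspace query sketch - "containers-ws" is a hypothetical workspace name
union
    Heartbeat,
    workspace("containers-ws").Heartbeat
| where TimeGenerated > ago(7d)
| summarize count() by Computer, bin(TimeGenerated, 1h)
```

Keep these out of the 30-second dashboard path; cross-workspace queries are noticeably slower than single-workspace ones.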
For immediate relief, reduce your dashboard refresh rate to 60 seconds and implement query result caching. Set up alerts based on pre-aggregated metrics rather than running queries every 30 seconds. This approach reduced query load by 70% for a similar deployment we optimized last quarter.
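As an example of alerting without constant polling, a single scheduled alert query can replace the dashboard's liveness checks (the 10-minute threshold is an assumption; tune it to your heartbeat interval):

```kusto
// Alert sketch: flag nodes whose last heartbeat is older than 10 minutes
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(10m)
```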
Monitor the _LogOperation table for ingestion-side issues (latency, throttling), and enable query auditing so the LAQueryLogs table captures which specific queries are slow during ingestion spikes.
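Assuming query auditing is enabled via diagnostic settings (it's off by default), a sketch of surfacing the slowest offenders:

```kusto
// Requires query auditing to be enabled so LAQueryLogs is populated
LAQueryLogs
| where TimeGenerated > ago(1d)
| summarize AvgDurationMs = avg(ResponseDurationMs), Runs = count() by QueryText
| top 10 by AvgDurationMs desc
```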
Have you checked your workspace’s daily cap and ingestion rate limits? At 50GB/hour you’re pushing 1.2TB/day which could trigger throttling. Standard tier has soft limits around 500GB/day before performance degrades. You might need to split workspaces by data source or upgrade to dedicated clusters for consistent query performance at that scale.
Before jumping to dedicated clusters, optimize your data collection. Are you ingesting verbose container logs that aren’t needed for alerting? Use DCR transformations to filter at ingestion time. Also, pre-aggregate metrics in your application and send summary data instead of raw logs. We reduced our ingestion by 60% this way while maintaining alert accuracy. Check if you’re duplicating data across multiple tables too.
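To see where the ingestion volume is actually going (and to spot duplication across tables), the built-in Usage table breaks billable ingestion down by data type. Quantity is reported in MB:

```kusto
// Billable ingestion per table over the last day
Usage
| where TimeGenerated > ago(1d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024 by DataType
| order by IngestedGB desc
```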
Thanks for the suggestions. We’re currently at around 800GB/day total ingestion. Would splitting the container logs into a separate workspace help, or should we look at dedicated clusters? Our budget is limited but query performance is critical for alerting.
The Heartbeat table scan without a node filter is likely your bottleneck. When ingestion spikes, Log Analytics prioritizes data writes over queries. Try adding a Computer filter alongside the time range to reduce the scan scope. Also consider pre-aggregating frequently-run rollups on a schedule; Log Analytics doesn't expose Azure Data Explorer-style materialized views directly.
For your specific query pattern, re-computing 5-minute bins across 200+ computers on every 30-second refresh is the expensive part. Try this approach: run the aggregation on a schedule (e.g. a Logic App or Azure Function that executes the query every 5 minutes and writes the results to a custom table via the Logs Ingestion API), then point your dashboard at that much smaller, lower-cardinality table. This is essentially a manual materialized view that updates on a schedule.
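A sketch of the two halves, assuming the scheduled job writes into a hypothetical custom table named HeartbeatSummary_CL:

```kusto
// 1) Aggregation the scheduled job runs every 5 minutes
Heartbeat
| where TimeGenerated > ago(5m)
| summarize HeartbeatCount = count() by Computer, bin(TimeGenerated, 5m)
```

```kusto
// 2) Dashboard query against the (hypothetical) pre-aggregated custom table
HeartbeatSummary_CL
| where TimeGenerated > ago(1h)
| order by TimeGenerated desc
```

Each dashboard refresh now scans at most one row per computer per 5-minute bin instead of every raw heartbeat.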