We’re experiencing significant query latency spikes in our Azure Log Analytics workspace during peak hours. Our monitoring dashboard runs multiple KQL queries every 30 seconds, and response times jump from 2-3 seconds to 30+ seconds when data ingestion rates exceed 50GB/hour.
Current query pattern:
Heartbeat
| where TimeGenerated > ago(1h)
| summarize count() by Computer, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
We’ve noticed the latency correlates with high ingestion volumes from our container cluster (200+ nodes). The workspace is on the Standard tier with default retention. Query optimization attempts haven’t resolved the issue, and we’re concerned about workspace scaling limits affecting our real-time monitoring capabilities.
I see multiple issues contributing to your latency spikes. Let me address the three key areas systematically:
Query Optimization:
Your current query scans every Heartbeat record in the window across all 200+ nodes. Keep the time filter first (datetime filters are the most effective predicate in KQL), then narrow by Computer before aggregating:
Heartbeat
| where TimeGenerated > ago(1h)
| where Computer in (computerList)
| summarize count() by Computer, bin(TimeGenerated, 5m)
If your dashboard only tracks a subset of nodes, this can cut the scanned data dramatically. For broader queries, apply summarize as early in the pipeline as possible.
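If the node list is reasonably static, it can be declared inline with a let statement. A minimal sketch (the node names and the HeartbeatCount alias are placeholders, not from your environment):

```kusto
// Hypothetical node list - replace with the computers your dashboard tracks
let computerList = dynamic(["node-01", "node-02", "node-03"]);
Heartbeat
| where TimeGenerated > ago(1h)
| where Computer in (computerList)
| summarize HeartbeatCount = count() by Computer, bin(TimeGenerated, 5m)
```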
Data Ingestion Strategy:
At 800GB/day with 50GB/hour spikes, you’re hitting Standard tier soft limits. Implement these ingestion optimizations:
- Enable the Basic logs table plan for verbose container stdout/stderr (substantially lower per-GB ingestion cost, 8-day interactive retention, but note the reduced KQL surface and no log alerting on those tables)
- Use DCR transformations to drop unnecessary fields at ingestion
- Configure container insights to collect only Warning+ logs during peak hours
- Split high-volume sources (container logs) into a separate workspace with Basic logs tier
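A DCR transformation is just a KQL expression applied to the incoming stream before it lands in the workspace. A sketch of the transformKql body, assuming your container log schema has severity and label columns roughly like these (the column names are illustrative):

```kusto
// DCR transformKql sketch: drop verbose entries and unneeded columns at ingestion
// "LogLevel", "PodLabels", "ContainerImageTag" are hypothetical column names
source
| where LogLevel != "Verbose" and LogLevel != "Debug"
| project-away PodLabels, ContainerImageTag
```

Everything filtered here is never billed as ingestion, which is why this tends to pay off faster than query-side tuning.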
Workspace Scaling:
Your workspace needs architectural changes:
- Create a dedicated workspace for container telemetry using Basic logs (reduces query load on primary workspace)
- Keep critical alerting data (Heartbeat, metrics) in Standard tier workspace
- Use cross-workspace queries only for historical analysis, not real-time dashboards
- Consider Log Analytics Dedicated Clusters if budget allows (guaranteed 500GB/day minimum with consistent performance)
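When you do need a historical view spanning both workspaces, KQL's workspace() function joins them in a single query. A sketch, with a placeholder workspace name:

```kusto
// Cross-workspace query sketch - "containers-ws" is a hypothetical workspace name
union
    Heartbeat,
    workspace("containers-ws").Heartbeat
| where TimeGenerated > ago(7d)
| summarize count() by Computer, bin(TimeGenerated, 1h)
```

Keep these out of the 30-second dashboard path; cross-workspace queries are noticeably slower than single-workspace ones.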
For immediate relief, reduce your dashboard refresh rate to 60 seconds and implement query result caching. Set up alerts based on pre-aggregated metrics rather than running queries every 30 seconds. This approach reduced query load by 70% for a similar deployment we optimized last quarter.
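As an example of alerting without constant polling, a single scheduled alert query can replace the dashboard's liveness checks (the 10-minute threshold is an assumption; tune it to your heartbeat interval):

```kusto
// Alert sketch: flag nodes whose last heartbeat is older than 10 minutes
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(10m)
```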
Monitor the _LogOperation table for ingestion-side issues (latency, throttling), and enable query auditing so the LAQueryLogs table captures which specific queries are slow during ingestion spikes.
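Assuming query auditing is enabled via diagnostic settings (it's off by default), a sketch of surfacing the slowest offenders:

```kusto
// Requires query auditing to be enabled so LAQueryLogs is populated
LAQueryLogs
| where TimeGenerated > ago(1d)
| summarize AvgDurationMs = avg(ResponseDurationMs), Runs = count() by QueryText
| top 10 by AvgDurationMs desc
```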
Have you checked your workspace’s daily cap and ingestion rate limits? At 50GB/hour you’re pushing 1.2TB/day which could trigger throttling. Standard tier has soft limits around 500GB/day before performance degrades. You might need to split workspaces by data source or upgrade to dedicated clusters for consistent query performance at that scale.
Before jumping to dedicated clusters, optimize your data collection. Are you ingesting verbose container logs that aren’t needed for alerting? Use DCR transformations to filter at ingestion time. Also, pre-aggregate metrics in your application and send summary data instead of raw logs. We reduced our ingestion by 60% this way while maintaining alert accuracy. Check if you’re duplicating data across multiple tables too.
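To see where the ingestion volume is actually going (and to spot duplication across tables), the built-in Usage table breaks billable ingestion down by data type. Quantity is reported in MB:

```kusto
// Billable ingestion per table over the last day
Usage
| where TimeGenerated > ago(1d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024 by DataType
| order by IngestedGB desc
```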
Thanks for the suggestions. We’re currently at around 800GB/day total ingestion. Would splitting the container logs into a separate workspace help, or should we look at dedicated clusters? Our budget is limited but query performance is critical for alerting.
The Heartbeat table scan without a node filter is likely your bottleneck. When ingestion spikes, Log Analytics prioritizes data writes over queries. Try adding a Computer filter alongside the time range to reduce the scan scope. Also consider pre-aggregating frequently-run rollups on a schedule; Log Analytics doesn't expose Azure Data Explorer-style materialized views directly.
For your specific query pattern, re-computing 5-minute bins across 200+ computers on every 30-second refresh is the expensive part. Try this approach: run the aggregation on a schedule (e.g. a Logic App or Azure Function that executes the query every 5 minutes and writes the results to a custom table via the Logs Ingestion API), then point your dashboard at that much smaller, lower-cardinality table. This is essentially a manual materialized view that updates on a schedule.
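A sketch of the two halves, assuming the scheduled job writes into a hypothetical custom table named HeartbeatSummary_CL:

```kusto
// 1) Aggregation the scheduled job runs every 5 minutes
Heartbeat
| where TimeGenerated > ago(5m)
| summarize HeartbeatCount = count() by Computer, bin(TimeGenerated, 5m)
```

```kusto
// 2) Dashboard query against the (hypothetical) pre-aggregated custom table
HeartbeatSummary_CL
| where TimeGenerated > ago(1h)
| order by TimeGenerated desc
```

Each dashboard refresh now scans at most one row per computer per 5-minute bin instead of every raw heartbeat.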