Azure Log Analytics query latency spikes during high-volume ingestion

We’re experiencing severe query latency spikes in our Azure Log Analytics workspace during peak data ingestion periods. Our environment processes approximately 500GB daily from multiple sources including Application Insights, VMs, and container logs.

The issue manifests as query timeouts (>180s) when running standard KQL queries that normally complete in 15-20 seconds. This happens specifically during our ETL windows between 02:00-06:00 UTC when ingestion rates spike to 80GB/hour.

Current workspace configuration:

workspace.sku = PerGB2018
retention = 90 days
daily cap = not set
ingestion volume = 500GB/day avg

We’ve noticed the latency correlates with high data ingestion rates, but I’m unsure if this is a workspace scaling issue, query optimization problem, or network bandwidth limitation. Has anyone dealt with similar query performance degradation during high-volume ingestion periods?

I’ve seen this pattern before. During high ingestion, Log Analytics prioritizes write operations over read queries. Your 500GB daily volume is substantial but manageable. First quick check: are you running summarize operations without proper time filters? That forces full table scans, which is especially painful under ingestion pressure.

Good point about time filters. Most of our queries do include specific time ranges (last 24h or 7d), but we have a few dashboard queries that might be scanning larger datasets. I’ll audit those. However, even our well-filtered queries are timing out during the ingestion window. Could workspace capacity be a factor here?

Your ingestion pattern suggests you might benefit from distributed ingestion. Are all 80GB/hour hitting the workspace simultaneously? We spread our high-volume sources across multiple collection endpoints and that helped reduce contention. Also check if you’re hitting any throttling limits - Azure Monitor has ingestion rate limits per workspace that might be causing backpressure.

I’ll break down the three areas you need to address simultaneously:

Query Optimization: Your KQL queries need immediate attention:

  • Add explicit time ranges to ALL queries, putting where TimeGenerated > ago(24h) at the beginning of the query chain, not buried in later operations
  • Use summarize with bin() for time-series data to reduce result sets
  • For your dashboard queries, standardize time windows so results can be served from the query cache
  • Replace any search operators with explicit table references - search scans all tables and is extremely expensive during ingestion
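As a sketch of that pattern (ContainerLog is one of the standard tables for the container logs mentioned in this thread; swap in your own tables and columns):

```kusto
// Good: the time filter leads the chain, so the engine prunes data
// before summarize runs. bin() keeps the result set dashboard-sized.
ContainerLog
| where TimeGenerated > ago(24h)
| summarize Entries = count() by bin(TimeGenerated, 15m), Computer
| order by TimeGenerated asc

// Bad: search fans out across every table in the workspace and only
// filters by time afterwards - avoid it during ingestion windows.
search "timeout"
| where TimeGenerated > ago(24h)
```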

Data Ingestion Strategy: Your 80GB/hour spike is creating resource contention. Implement these changes:

  • Distribute ingestion across time by staggering your ETL jobs into 30-minute batches instead of parallel execution
  • Enable data collection rules (DCR) to filter and transform data at ingestion time, reducing workspace load
  • Set up separate workspaces for different data tiers: hot operational data (30 days) vs cold analytical data (90 days)
  • Configure ingestion-time transformations to drop unnecessary fields and reduce volume by 20-30%
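The ingestion-time transformation in the last bullet is just KQL attached to the DCR. A minimal transformKql sketch - Level, VerbosePayload, and Message are placeholder field names for your stream's schema, and only a restricted KQL subset (where, extend, project-away, and similar) is allowed here:

```kusto
// Runs once per record at ingestion time, before the data is billed or stored.
source
| where Level != "Verbose"                      // drop low-value rows outright
| project-away VerbosePayload                   // remove columns nobody queries
| extend Message = substring(Message, 0, 2048)  // trim oversized payloads
```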

Workspace Scaling: At roughly 45TB retained (500GB/day × 90 days of retention), you’re definitely a candidate for Azure Monitor dedicated clusters (minimum 500GB/day commitment). Benefits include:

  • Reserved capacity that isolates your ingestion from multi-tenant throttling
  • Customer-managed keys for encryption
  • Cross-workspace queries with better performance
  • Ability to link multiple workspaces to the cluster

Immediate action: Enable Basic Logs for high-volume, low-value data sources (like verbose application logs). This cuts ingestion costs substantially - around 65% - and removes those logs from competing with your critical analytics queries. Note that Basic Logs tables support only a limited subset of KQL and shorter interactive retention, so reserve them for data you rarely query. Your container logs are perfect candidates.

For monitoring: Set up alerts on _LogOperation | where Level == "Warning" to catch ingestion throttling early. Track query performance with the LAQueryLogs table (after enabling the query-audit diagnostic setting on the workspace) to identify problematic queries.
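Concretely, those two checks might look like this - note that LAQueryLogs only populates once the workspace's query-audit diagnostic setting is enabled, and the 60-second threshold is an arbitrary starting point:

```kusto
// Ingestion health: surfaces throttling and collection warnings.
_LogOperation
| where TimeGenerated > ago(1h)
| where Level == "Warning"
| summarize Count = count() by Operation, Detail

// Query audit: find the slowest queries over the last day.
LAQueryLogs
| where TimeGenerated > ago(24h)
| where ResponseDurationMs > 60000
| project TimeGenerated, AADEmail, ResponseDurationMs, QueryText
| order by ResponseDurationMs desc
```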

Implement query optimization first (immediate impact), then ingestion distribution (improvements within about a week), and finally evaluate a dedicated cluster based on two weeks of performance metrics. Your current PerGB2018 tier can handle the volume, but not the concurrent ingestion/query load pattern you’re experiencing.

Have you looked at your query patterns during these spikes? We had similar issues and discovered some of our queries were doing cross-workspace joins during peak ingestion. That combination was killer for performance. Moving those queries to off-peak hours or using materialized views helped significantly. Also, your 90-day retention on 500GB daily means you’re managing roughly 45TB of data - that’s definitely workspace scaling territory.

Thanks everyone. I’ve identified several queries doing unnecessary full scans and we do have some cross-workspace operations. The 45TB total volume is eye-opening - I hadn’t calculated that. What’s the recommended approach for workspace scaling at this volume? Should we be looking at dedicated clusters?