Let me address the three key areas for optimizing your Logs Insights API usage at scale.
Logs Insights API Quotas:
The fundamental constraints you’re working with are:
- 20 concurrent queries per region per account (default)
- 30 StartQuery API calls per second
- Query timeout of 15 minutes for queries that scan large amounts of data
- Results limited to 10,000 records per query (split the query's time range to retrieve more)
Request a quota increase to 50 concurrent queries through the AWS Service Quotas console. This is typically approved within 24-48 hours for production accounts with valid use cases. For your 50 log groups, this means you can run all queries in parallel with some headroom.
However, there’s a more efficient approach: query multiple log groups in a single Insights query. Instead of 50 separate queries, list the log groups in StartQuery’s logGroupNames parameter (the API accepts up to 50 log groups per query). Group your log groups into batches of 10-15 and run 4-5 parallel queries instead of 50. This reduces API calls and improves overall performance.
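A minimal sketch of that batching, assuming boto3; the log group names in the usage note are placeholders, and the client is passed in rather than created so the helpers stay testable:

```python
def chunk_log_groups(log_groups, batch_size=10):
    """Split log group names into batches; one StartQuery call covers a whole batch."""
    return [log_groups[i:i + batch_size] for i in range(0, len(log_groups), batch_size)]

def start_batch_query(logs_client, batch, query_string, start_time, end_time):
    """Start one Logs Insights query over every log group in the batch.

    logs_client is a boto3 CloudWatch Logs client (boto3.client("logs"));
    StartQuery accepts up to 50 log groups per call via logGroupNames.
    """
    response = logs_client.start_query(
        logGroupNames=batch,
        queryString=query_string,
        startTime=start_time,  # epoch seconds
        endTime=end_time,
    )
    return response["queryId"]
```

With 50 log groups and batch_size=10, chunk_log_groups yields 5 batches, so a full refresh needs 5 StartQuery calls instead of 50.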
Batch Query Optimization:
Implement a tiered query strategy based on data freshness requirements:
- Hot tier (last 1 hour): Query every 5 minutes with small time windows
- Warm tier (1-24 hours): Query every 15 minutes, cache results for 10 minutes
- Cold tier (older than 24 hours): Query on-demand only, cache aggressively
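The three tiers can be captured in a small lookup. The hot and warm values come from the list above; the cold-tier cache TTL (one hour here) is an assumed value standing in for "cache aggressively":

```python
# (tier name, refresh interval in seconds, cache TTL in seconds).
# refresh=None means query on demand only; the cold tier's 3600 s TTL is an
# assumed "aggressive" cache value, not a prescribed number.
TIERS = [
    (1.0, ("hot", 300, 0)),       # last hour: refresh every 5 min, no caching
    (24.0, ("warm", 900, 600)),   # 1-24 h: refresh every 15 min, cache 10 min
    (float("inf"), ("cold", None, 3600)),
]

def select_tier(age_hours):
    """Map the age of the requested data to a (tier, refresh_s, cache_ttl_s) tuple."""
    for max_age_hours, tier in TIERS:
        if age_hours <= max_age_hours:
            return tier
```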
For your 5-minute refresh requirement, only query the hot tier in real time. Pre-aggregate warm and cold tier data in DynamoDB or S3 to avoid repeated expensive queries. Keep the StartQuery startTime/endTime window tight, since the query’s time range is what bounds the data scanned, and add an in-query time filter as a second guard:
fields @timestamp, @message | filter @timestamp > ago(5m) | stats count() by service
This limits scan volume and improves query performance from 8-12 minutes to 2-3 minutes for your batch.
Large Result Sets and Retry Logic:
Implement robust error handling with these patterns:
For throttling and concurrency-limit errors (for example, LimitExceededException from StartQuery when the concurrent-query quota is exhausted):
- Exponential backoff: 1s, 2s, 4s, 8s, 16s, 32s (max)
- Add jitter: random(0, min(cap, base * 2^attempt))
- Queue failed queries for retry rather than failing fast
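The backoff-with-jitter rule above, in code (base 1 s, cap 32 s):

```python
import random

BASE_S = 1.0   # first retry delay
CAP_S = 32.0   # maximum delay, per the ladder above

def backoff_delay(attempt):
    """Full-jitter delay for retry number `attempt` (0-based):
    random(0, min(cap, base * 2^attempt))."""
    return random.uniform(0.0, min(CAP_S, BASE_S * 2 ** attempt))
```

Attempt 0 sleeps up to 1 s, attempt 5 and beyond up to the 32 s cap; the randomness spreads retries from many workers so they don't re-throttle in lockstep.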
For result sets that would exceed the 10,000-record cap:
- Always check the 'status' field in the GetQueryResults response; results are final only when it is 'Complete'
- GetQueryResults does not paginate: a query returns at most 10,000 records, so when you hit the cap, split the time range into smaller windows and run one query per window
- Aggregate results client-side across all windows
- Set a maximum window count (e.g., 5 windows = up to 50,000 records) to prevent runaway queries
For concurrent query management:
- Use a semaphore or queue to limit concurrent StartQuery calls to your quota (20 or 50)
- Track query IDs in a priority queue, with newer time ranges prioritized
- Poll GetQueryResults with 2-second intervals (don’t poll too aggressively)
- Implement timeout handling: if a query exceeds 5 minutes, stop it with StopQuery and retry with a smaller time range
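The polling loop with the timeout/cancel behaviour might look like this; the client is injected so the loop can be exercised against a stub, and StopQuery is the real API for cancelling a running query:

```python
import time

def poll_query(logs_client, query_id, timeout_s=300, interval_s=2.0):
    """Poll GetQueryResults every interval_s until the query finishes.

    If the query is still running after timeout_s (5 minutes here), stop it
    so it releases its concurrency slot, then raise so the caller can retry
    with a smaller time range.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled"):
            return response
        if time.monotonic() >= deadline:
            logs_client.stop_query(queryId=query_id)
            raise TimeoutError(f"query {query_id} exceeded {timeout_s}s")
        time.sleep(interval_s)
```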
Implementation pattern:
- Chunk your 50 log groups into 5 batches of 10
- For each batch, start a single Insights query targeting all 10 log groups
- Poll results at 2-second intervals, splitting the time window if the 10,000-record cap is hit
- Move to next batch only after previous batch completes or times out
- Cache results in ElastiCache with 3-minute TTL to serve dashboard requests
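The caching step can be sketched with an in-process TTL cache standing in for ElastiCache (same semantics: serve cached results while the 3-minute TTL holds, re-run the batch queries otherwise):

```python
import time

class TTLCache:
    """Tiny stand-in for ElastiCache: key -> (expiry, value) with a fixed TTL."""
    def __init__(self, ttl_s=180.0):  # 3-minute TTL from the plan above
        self.ttl_s = ttl_s
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            return None  # missing or expired
        return entry[1]

    def put(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl_s, value)

def serve_dashboard(cache, key, run_batch_queries):
    """Return cached results if still fresh; otherwise run the queries and cache them."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    results = run_batch_queries()
    cache.put(key, results)
    return results
```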
This approach should reduce your total execution time from 8-12 minutes to under 4 minutes while staying within API limits. The key is reducing the number of individual queries through log group batching and implementing intelligent caching to avoid redundant API calls for dashboard refreshes.