Azure Blob Storage monitoring alerts not triggering for large file uploads in production workloads

We have Azure Monitor alerts configured to notify us when files larger than 500MB are uploaded to our production Blob Storage container. The alerts worked fine initially, but we’ve noticed they’re not triggering consistently anymore, especially during high-volume periods when multiple large files arrive simultaneously.

We’re using BlobCreated events with Event Grid subscriptions to trigger the alerts, but I’m starting to wonder if we’re hitting some kind of subscription limit or if there’s an issue with our alert rule configuration. The diagnostic logs show the uploads are completing successfully, so the files are definitely reaching the container.

This is causing us to miss critical ingestion events that downstream processes depend on. Has anyone experienced similar issues with Event Grid subscriptions or Azure Monitor alerts not firing reliably for Blob Storage events?

Good point about the retry policy. Our webhook endpoint does take 3-4 seconds to respond sometimes. Could that be causing the backlog?

Based on the symptoms you’re describing, you have three interconnected issues that need to be addressed systematically:

1. BlobCreated Event Diagnostics: First, enable detailed diagnostic logging on your storage account. Navigate to your storage account → Diagnostic settings → Add diagnostic setting. Enable the ‘StorageWrite’ log category (uploads are write operations; ‘StorageRead’ is optional here) and send the logs to Log Analytics. This will help you verify that BlobCreated events are actually being generated for all uploads. Query the logs with:


StorageBlobLogs
| where TimeGenerated > ago(24h)
// SDKs commit large files via PutBlockList rather than a single PutBlob,
// and BlobCreated events fire on either commit operation.
| where OperationName in ("PutBlob", "PutBlockList") and StatusText == "Success"
| where ContentLength > 524288000  // 500 MB

This confirms whether the issue is event generation or event delivery.

2. Event Grid Subscription Limits: The 150-200 dropped events per hour you’re seeing are the key signal. In Event Grid metrics, a dropped event means the delivery retry policy was exhausted (or the event expired), which fits a slow or intermittently failing webhook endpoint more than publish-side throttling; that said, also verify you aren’t approaching the publish limit of roughly 5,000 events per second per topic during bursts. To reduce the drops:

  • Implement event batching by configuring your Event Grid subscription to use batch delivery (max 5000 events per batch)
  • Consider splitting your workload across multiple Event Grid topics by container or file size category
  • Increase the ‘max events per batch’ and ‘preferred batch size in kilobytes’ settings in your subscription configuration
  • Enable dead-lettering to a separate storage container so you can recover dropped events
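
The batching and dead-lettering bullets above can be applied with the Azure CLI. This is a sketch, not a drop-in command: the subscription name `blob-created-sub`, the resource IDs, and the `deadletter-events` container are placeholders you’d replace with your own.

```shell
# Placeholder resource ID for the storage account that is the event source.
STORAGE_ID="/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"

# Update the existing subscription: deliver in batches (up to 5000 events or
# 1024 KB per batch) and dead-letter undeliverable events to a blob container
# so dropped events can be recovered and replayed later.
az eventgrid event-subscription update \
  --name blob-created-sub \
  --source-resource-id "$STORAGE_ID" \
  --max-events-per-batch 5000 \
  --preferred-batch-size-in-kilobytes 1024 \
  --deadletter-endpoint "$STORAGE_ID/blobServices/default/containers/deadletter-events"
```

Note that batch delivery only helps if your endpoint is prepared to receive an array of events in a single POST and acknowledge them together.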

3. Alert Rule Configuration: Your alert rule needs optimization for high-volume scenarios:

  • Change the aggregation granularity to 5 minutes minimum (not 1 minute)
  • Use ‘Total’ aggregation instead of ‘Count’ for the metric evaluation
  • Set the evaluation frequency to match your aggregation window to prevent evaluation gaps
  • Verify your action group isn’t being throttled by checking Action Group metrics in Azure Monitor
  • Consider using Log Analytics-based alerts instead of metric alerts for better handling of high-cardinality data

The 3-4 second webhook response time is definitely contributing to the problem. Event Grid allows up to 30 seconds for an acknowledgment before it queues the event for retry, but consistently slow responses cut your effective delivery throughput and let a backlog build up during bursts. Implement an asynchronous processing pattern where your webhook immediately returns 200 OK and processes the event in the background.
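
That pattern can be sketched with nothing but the Python standard library. This is an illustrative skeleton, not production code: the handler acknowledges immediately and hands the payload to a background worker queue, and it also answers the Event Grid subscription-validation handshake, which any webhook endpoint must do. The processing step is a placeholder for your real alert pipeline.

```python
import json
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Events are queued here and drained by a background worker, so the HTTP
# response never waits on the slow 3-4 second processing path.
event_queue: "queue.Queue" = queue.Queue()

def worker():
    """Drain the queue; the slow alert/ingestion logic lives here."""
    while True:
        event = event_queue.get()
        # ... hand off to the downstream alert pipeline (placeholder) ...
        event_queue.task_done()

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Event Grid POSTs a JSON array of events.
        length = int(self.headers.get("Content-Length", 0))
        events = json.loads(self.rfile.read(length)) if length else []
        for event in events:
            # Event Grid sends a one-time validation event when the
            # subscription is created; echo the code back to complete it.
            if event.get("eventType") == "Microsoft.EventGrid.SubscriptionValidationEvent":
                body = json.dumps(
                    {"validationResponse": event["data"]["validationCode"]}
                ).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
                return
            event_queue.put(event)  # enqueue only; no inline processing
        self.send_response(200)  # immediate ack; never wait on processing here
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

def serve(port=8080):
    """Start the background worker and block serving webhook requests."""
    threading.Thread(target=worker, daemon=True).start()
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

The design point is simply that enqueueing is microseconds while processing is seconds; in a real deployment you’d likely replace the in-process queue with a durable one (Service Bus or a storage queue) so queued events survive a restart.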

Implement these changes in order: diagnostics first to confirm event generation, then optimize Event Grid delivery, and finally tune the alert rules. Monitor for 48 hours after each change to measure improvement.

We had a similar problem and discovered our issue was with Event Grid subscription quotas. Each topic has a default limit of 500 event subscriptions. If you’re creating subscriptions dynamically or have a lot of them, you might be hitting that limit. Also check the event delivery retry policy: by default Event Grid retries for up to 24 hours (or 30 delivery attempts, whichever comes first), and if your endpoint is slow to respond, events queue up and are eventually dropped or dead-lettered.

Thanks Mike. I checked the Event Grid metrics and we’re definitely seeing dropped events during peak hours, around 150-200 drops per hour. Our subscription doesn’t have any advanced filters applied, just the basic BlobCreated event type. Should we be implementing batching or splitting across multiple topics?