We’re architecting our incident management workflow notifications and debating between webhook-based push notifications versus polling-based pull mechanisms. Currently using ETQ 2021 and need to integrate with external ticketing systems.
Webhooks offer real-time notifications but introduce reliability concerns - what happens when our endpoint is down? Polling seems more reliable but introduces latency and increases API load. Curious about others’ experiences with notification strategies in production environments. What’s worked well for high-volume incident workflows?
Having implemented both approaches across multiple ETQ deployments, here’s my analysis of the tradeoffs:
Webhook Reliability and Retry Logic: ETQ’s native webhook retry is limited, so the best practice is to implement your own reliability layer. Use an API gateway that immediately acknowledges webhooks (200 OK) and queues events for asynchronous processing. This prevents ETQ from marking deliveries as failed due to slow processing. Implement idempotency keys in your webhook handler, since ETQ may deliver the same event multiple times during retries.
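A minimal sketch of the "ack fast, process later" pattern with idempotency, using an in-process queue and set as stand-ins for a real broker and dedup store. The function name, event shape, and `eventId` key are illustrative assumptions, not ETQ's actual payload schema:

```python
import queue

processed_ids: set[str] = set()          # idempotency store (Redis/DB in production)
work_queue: queue.Queue = queue.Queue()  # stands in for SQS/RabbitMQ

def handle_webhook(event: dict) -> int:
    """Acknowledge immediately; defer real processing to background workers."""
    event_id = event.get("eventId")       # assumed field name; check your payload
    if event_id is None:
        return 400                        # malformed payload
    if event_id in processed_ids:
        return 200                        # duplicate delivery: ack, do nothing
    processed_ids.add(event_id)
    work_queue.put(event)                 # workers drain this asynchronously
    return 200                            # ETQ sees success within milliseconds
```

The key design point is that the 200 response depends only on enqueueing, never on downstream processing, so slow or failing consumers can't cause ETQ to mark the delivery as failed.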
Polling Frequency Optimization: Smart polling is about balance. Query frequency should match your SLA requirements. For critical incidents needing <1 minute response, webhooks are necessary. For standard incidents with 5-15 minute SLA, polling every 2-3 minutes works fine. Use incremental polling with lastModifiedDate filters to minimize API load. Cache the last successful poll timestamp and only fetch records modified since then.
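The incremental-polling idea can be sketched like this. `fetch_since` is assumed to wrap whatever ETQ query you use with a `lastModifiedDate > cursor` filter; advancing the cursor from the records actually returned (rather than from the local clock) avoids skipping records when server and client clocks disagree:

```python
def incremental_poll(fetch_since, cursor):
    """Fetch only records modified after `cursor`, then advance the cursor.

    fetch_since(ts) is assumed to issue an ETQ query filtered on
    lastModifiedDate > ts; the exact API call is deployment-specific.
    """
    records = fetch_since(cursor)
    if records:
        # Advance to the newest modification time actually observed.
        cursor = max(r["lastModifiedDate"] for r in records)
    return records, cursor
```

Persist the cursor between runs so a restart resumes where the last successful poll left off.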
Hybrid Notification Strategy: This is the enterprise-grade solution. Configure webhooks as the primary mechanism for real-time notifications, but implement a reconciliation polling job every 15-30 minutes that verifies all incidents are accounted for. Track webhook-received incident IDs in your system and compare them against polling results. This catches webhook delivery failures without sacrificing real-time performance. The reconciliation overhead is minimal - usually just comparing ID lists.
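The reconciliation step really is just a set difference plus a replay hook. A sketch, assuming `replay` re-fetches and processes an incident as if its webhook had arrived:

```python
def reconcile(webhook_ids: set, polled_ids: set, replay) -> set:
    """Find incidents seen by polling but never delivered via webhook,
    and replay them through the normal processing path."""
    missed = polled_ids - webhook_ids
    for incident_id in sorted(missed):
        replay(incident_id)  # re-fetch and process the dropped incident
    return missed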
Dead-Letter Queue Implementation: Critical for production reliability. When webhook processing fails after all retries, events should be routed to a DLQ for manual review and reprocessing. We use AWS SQS with a dead-letter queue that triggers alerts when events accumulate. This prevents silent data loss and provides an audit trail of delivery failures.
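A broker-agnostic sketch of the retry-then-DLQ pattern; here a plain list stands in for the SQS dead-letter queue (in SQS itself, the redrive policy moves messages automatically after `maxReceiveCount` failed receives, so you would not write this loop yourself):

```python
def process_with_retries(event: dict, handler, dlq: list, max_attempts: int = 3) -> bool:
    """Attempt handler(event) up to max_attempts times; on exhaustion,
    route the event to the dead-letter queue instead of dropping it."""
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception:
            pass                 # real code: log the failure, maybe back off
    dlq.append(event)            # in production: SQS DLQ + CloudWatch alarm
    return False
```

The point is that the failure path is explicit: an event either processes successfully or lands somewhere auditable, never in a log line that nobody reads.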
For high-volume incident workflows (500+ daily), I recommend the hybrid approach with these specific configurations: webhooks for immediate notification, 15-minute reconciliation polling, message queue buffering with 3 retry attempts, and DLQ for failed events. This architecture achieves 99.9% reliability with <30 second average latency.
One often-overlooked aspect: webhook signatures for security. Always validate webhook signatures using HMAC to ensure requests actually come from ETQ and haven’t been tampered with. ETQ includes signature headers that you should verify before processing any webhook payload.
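Signature verification is a few lines of stdlib code. The digest algorithm (SHA-256) and hex encoding here are common conventions but assumptions as far as ETQ is concerned; confirm the header name and scheme against your ETQ documentation. The constant-time comparison matters - a naive `==` leaks timing information:

```python
import hashlib
import hmac

def verify_signature(payload: bytes, received_sig: str, secret: bytes) -> bool:
    """Recompute the HMAC of the raw request body and compare in constant time.
    Algorithm/encoding are assumed (HMAC-SHA256, hex); verify against ETQ docs."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_sig)
```

Always verify against the raw bytes of the body, before any JSON parsing, since re-serialization can change whitespace and key order and break the signature.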
We use webhooks exclusively and handle reliability through a message queue. ETQ sends a webhook to our API gateway, which immediately queues the event in RabbitMQ and returns 200 OK. Background workers process from the queue with retries. This decouples ETQ from our processing logic and prevents webhook failures from backing up in ETQ. Works great for our 500+ daily incidents.
The hybrid strategy is solid. Another consideration is notification payload size. Webhooks in ETQ 2021 have payload size limits - around 256KB. For incidents with large attachments or extensive audit trails, webhooks might only send metadata and you’ll need to poll the full record anyway. Also think about security - webhooks require exposing an endpoint to the internet with proper authentication. Polling from your network to ETQ might be simpler from a security standpoint.
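The metadata-only-webhook case can be handled with a small fallback: if the payload indicates the full record wasn't included, fetch it via the API. The `truncated` flag and field names below are purely illustrative stand-ins, not ETQ's actual schema:

```python
def resolve_incident(event: dict, fetch_full):
    """Return the full incident record for a webhook event.

    `truncated`/`incidentId`/`record` are hypothetical fields; fetch_full
    is assumed to wrap a GET of the full record from the ETQ API.
    """
    if event.get("truncated"):
        return fetch_full(event["incidentId"])  # webhook carried metadata only
    return event["record"]                      # full record fit in the payload
```

Routing both branches through one function keeps downstream processing identical whether the data arrived by push or by the fallback pull.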