Great questions on both fronts. For notifications, we customize content per channel: emails include full incident details with links to ETQ, SMS contains only critical information (incident ID, severity, brief description), and Slack messages use rich formatting with action buttons for quick acknowledgment. To prevent alert fatigue, we implemented a 5-minute correlation window that groups related alerts from the same source into a single incident.
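To make the per-channel shaping concrete, here's a minimal sketch. The `Incident` fields, the SMS truncation limit, and the Slack block layout are my illustrative assumptions, not ETQ's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    incident_id: str
    severity: str          # "P1".."P4"
    description: str
    etq_url: str           # deep link back to the ETQ record

def format_for_channel(incident: Incident, channel: str) -> dict:
    if channel == "email":
        # Full detail plus a link back to ETQ.
        return {
            "subject": f"[{incident.severity}] Incident {incident.incident_id}",
            "body": f"{incident.description}\n\nDetails: {incident.etq_url}",
        }
    if channel == "sms":
        # Critical info only: ID, severity, truncated description.
        return {"text": f"{incident.incident_id} {incident.severity}: "
                        f"{incident.description[:100]}"}
    if channel == "slack":
        # Rich formatting with an acknowledgment action button.
        return {
            "blocks": [
                {"type": "section",
                 "text": {"type": "mrkdwn",
                          "text": f"*{incident.severity}* {incident.incident_id}: "
                                  f"{incident.description}"}},
                {"type": "actions",
                 "elements": [{"type": "button",
                               "text": {"type": "plain_text", "text": "Acknowledge"},
                               "action_id": "ack_incident"}]},
            ]
        }
    raise ValueError(f"unknown channel: {channel}")
```

One formatter per channel keeps the content rules in one place, so adding a channel later is a new branch rather than a change to the incident model.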
Regarding escalation automation, we’ve configured time-based triggers tied to SLA definitions in ETQ. Here’s our escalation framework:
Webhook Receiver Layer: All incoming monitoring payloads pass through validation (signature verification, IP whitelist, schema check) before creating incidents. The receiver maps alert metadata to ETQ incident fields using our centralized classification rules.
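The three validation gates can be sketched roughly like this. The allowlist network, the required-field set, and the HMAC-SHA256 signature scheme are assumptions for illustration; your monitoring tools' actual signing schemes will vary:

```python
import hmac
import hashlib
import ipaddress
import json

ALLOWED_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]           # example allowlist
REQUIRED_FIELDS = {"source", "severity", "message", "timestamp"}  # minimal schema

def validate_webhook(raw_body: bytes, signature: str, source_ip: str,
                     shared_secret: bytes) -> dict:
    """Run the three gates in order; raise on the first failure."""
    # 1. Signature: HMAC-SHA256 of the raw body, compared in constant time.
    expected = hmac.new(shared_secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("bad signature")
    # 2. IP allowlist.
    ip = ipaddress.ip_address(source_ip)
    if not any(ip in net for net in ALLOWED_NETWORKS):
        raise PermissionError("source IP not allowed")
    # 3. Schema check: all required fields must be present.
    payload = json.loads(raw_body)
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return payload
```

Validating the signature against the raw bytes before parsing JSON matters: re-serialized JSON rarely matches the sender's byte stream, which is a common source of false signature failures.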
Incident Classification: We maintain a configuration table with 50+ mapping rules covering different monitoring sources. Each rule defines severity mapping (P1-P4), initial assignment group, and required response time. Critical infrastructure alerts auto-assign P1 severity; application warnings become P3. The system evaluates rules in priority order until finding a match.
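First-match rule evaluation might look like the sketch below. The rule fields and match keys are hypothetical stand-ins for the real configuration table:

```python
# Rules are evaluated top to bottom; the first match wins.
RULES = [
    {"match": {"source": "infra-monitor", "category": "critical"},
     "severity": "P1", "assign_to": "infrastructure", "response_min": 15},
    {"match": {"category": "app-warning"},
     "severity": "P3", "assign_to": "development", "response_min": 240},
    {"match": {},  # catch-all so nothing falls through unclassified
     "severity": "P4", "assign_to": "triage", "response_min": 480},
]

def classify(alert: dict) -> dict:
    for rule in RULES:
        # A rule matches when every key in its match dict equals the alert's value.
        if all(alert.get(k) == v for k, v in rule["match"].items()):
            return {k: rule[k] for k in ("severity", "assign_to", "response_min")}
    raise LookupError("no rule matched")  # unreachable while a catch-all exists
```

Keeping rules as ordered data rather than code is what makes the later lesson about admin interfaces practical: non-technical users can reorder or edit entries without a deploy.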
Automated Routing: Incidents route to teams based on alert source and category. Network alerts go to infrastructure team, application errors to development, security events to InfoSec. We use ETQ’s workflow engine with custom scripts that query our CMDB to determine ownership dynamically.
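A rough shape for the dynamic ownership lookup, with the CMDB client passed in as a callable (the category fallback table and `ci` field name are assumptions):

```python
CATEGORY_TEAMS = {            # static fallback routing by alert category
    "network": "infrastructure",
    "application": "development",
    "security": "infosec",
}

def route_incident(alert: dict, cmdb_lookup) -> str:
    """Prefer the CMDB owner for the affected CI; fall back to category routing."""
    ci_name = alert.get("ci")          # configuration item, if the alert names one
    if ci_name:
        owner = cmdb_lookup(ci_name)   # e.g. a query against the CMDB service
        if owner:
            return owner
    return CATEGORY_TEAMS.get(alert.get("category"), "triage")
```

Injecting `cmdb_lookup` keeps the routing logic testable without a live CMDB and makes the fallback path explicit when the CMDB has no owner on record.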
Multi-Channel Notifications: Upon incident creation, our integration triggers parallel notifications via email (detailed), SMS (brief), and Slack (interactive). Each channel receives appropriately formatted content. We suppress duplicate notifications using the 5-minute correlation window.
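The 5-minute suppression window reduces, at its core, to a timestamp check per correlation key. A minimal in-memory sketch (a serverless deployment would back this with a shared store such as Redis or DynamoDB, since function instances don't share memory):

```python
import time
from typing import Dict, Optional

WINDOW_SECONDS = 5 * 60
_last_seen: Dict[str, float] = {}   # correlation key -> last notification time

def should_notify(source: str, alert_key: str,
                  now: Optional[float] = None) -> bool:
    """Suppress repeat alerts from the same source/key within the window."""
    now = time.time() if now is None else now
    key = f"{source}:{alert_key}"
    last = _last_seen.get(key)
    if last is not None and now - last < WINDOW_SECONDS:
        return False                # fold into the existing incident, stay quiet
    _last_seen[key] = now
    return True
```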
SLA Tracking and Escalation: ETQ's SLA engine monitors response and resolution times. When a threshold is breached:
- At 50% of SLA: Send warning to assigned team
- At 80% of SLA: Escalate to team lead with Slack ping
- At 100% of SLA: Auto-escalate to manager level, trigger emergency conference bridge, log compliance violation
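The tiers above can be represented as data and checked on each evaluation pass. The action names are illustrative labels, not ETQ workflow calls:

```python
from typing import List, Set, Tuple

# (fraction of SLA elapsed, actions to fire at that tier)
ESCALATION_TIERS: List[Tuple[float, List[str]]] = [
    (0.5, ["warn_assigned_team"]),
    (0.8, ["notify_team_lead", "slack_ping"]),
    (1.0, ["escalate_to_manager", "open_conference_bridge", "log_sla_violation"]),
]

def due_actions(elapsed_min: float, sla_min: float,
                already_fired: Set[float]) -> List[str]:
    """Return actions whose threshold has passed and which haven't fired yet."""
    fraction = elapsed_min / sla_min
    actions: List[str] = []
    for threshold, tier_actions in ESCALATION_TIERS:
        if fraction >= threshold and threshold not in already_fired:
            already_fired.add(threshold)   # fire each tier exactly once
            actions.extend(tier_actions)
    return actions
```

Tracking fired tiers per incident is what keeps a polling-based SLA check idempotent: re-evaluating the same incident twice doesn't page anyone twice.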
For P1 incidents, we enforce a 15-minute initial-response SLA. If there is no acknowledgment within 15 minutes, the system pages the on-call engineer directly and creates a critical escalation record. For off-hours coverage, we integrated with PagerDuty as a failover escalation path when primary contacts don't respond.
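The failover logic amounts to a two-leg escalation. In this sketch the `page_*` callables are stand-ins: in a setup like the one described, the failover leg would post to PagerDuty's Events API, but any pager with an acknowledgment signal fits the same shape:

```python
ACK_TIMEOUT_MIN = 15   # P1 initial-response SLA

def escalate_p1(incident_id: str, minutes_unacked: float,
                page_oncall, page_failover) -> str:
    """Page on-call at SLA breach; fail over if the primary never responds."""
    if minutes_unacked < ACK_TIMEOUT_MIN:
        return "waiting"
    if page_oncall(incident_id):       # True if the page was acknowledged
        return "acknowledged"
    # Primary didn't respond (e.g. off-hours): hand off to the failover
    # escalation path and record a critical escalation.
    page_failover(incident_id)
    return "failed_over"
```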
The entire implementation runs on ETQ’s cloud infrastructure with webhook receivers deployed as serverless functions for scalability. We process 200-300 monitoring alerts daily, with average end-to-end latency of 12 minutes from alert generation to team notification. The system has reduced our mean time to acknowledge from 45 minutes to under 15 minutes, significantly improving our incident response capability.
Key lessons learned: Start with a small set of critical alert types and expand gradually. Invest time in tuning your classification rules based on real incident data. Build admin interfaces for non-technical users to adjust configurations. Monitor your webhook infrastructure itself - we had early issues with receiver timeouts that went undetected.