Real-time incident alerting integrated with monitoring system

We successfully implemented a real-time incident alerting system that bridges our ETQ Reliance platform with our infrastructure monitoring tools. The integration uses webhook receivers to capture monitoring alerts and automatically creates incidents in ETQ with proper classification.

Our webhook endpoint validates incoming payloads from multiple monitoring sources (Datadog, New Relic, custom sensors) and maps them to ETQ incident severity levels based on predefined rules. The system classifies incidents automatically: critical alerts trigger P1 incidents, warnings become P3, and so on.


// Webhook receiver endpoint configuration
POST /api/webhooks/monitoring
Headers: X-Auth-Token, Content-Type: application/json
Validation: signature verification, schema check
Response: 200 OK with incident ID
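To make the validation and mapping steps concrete, here's a minimal sketch of the receiver's core logic in Python. The HMAC signature scheme, field names, and severity map are illustrative assumptions, not our exact implementation; a real deployment would load the secret and mapping rules from configuration.

```python
import hashlib
import hmac
import json

# Hypothetical severity mapping mirroring the rules described above:
# critical -> P1, error -> P2, warning -> P3, info -> P4.
SEVERITY_MAP = {"critical": "P1", "error": "P2", "warning": "P3", "info": "P4"}

REQUIRED_FIELDS = {"source", "alert_type", "severity", "message"}

def verify_signature(body: bytes, signature: str, secret: bytes) -> bool:
    """Compare an HMAC-SHA256 hex digest of the raw body to the header value."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def validate_payload(payload: dict) -> None:
    """Schema check: reject payloads missing required fields."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"payload missing fields: {sorted(missing)}")

def classify(payload: dict) -> dict:
    """Map a validated monitoring alert to ETQ-style incident fields."""
    validate_payload(payload)
    return {
        "severity": SEVERITY_MAP.get(payload["severity"], "P4"),
        "source": payload["source"],
        "summary": payload["message"],
    }
```

The key design point is verifying the signature over the raw bytes before parsing, and using a constant-time comparison so the check doesn't leak timing information.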

We’ve configured automated routing to assign incidents to appropriate teams based on alert source and type, with escalation rules that kick in when SLA thresholds are breached. Multi-channel notifications go out via email, SMS, and Slack simultaneously. The entire flow from alert detection to team notification averages under 15 minutes. Happy to share implementation details and lessons learned.

Great questions on both fronts. For notifications, we customize content per channel - emails include full incident details with links to ETQ, SMS contains critical info only (incident ID, severity, brief description), and Slack messages use rich formatting with action buttons for quick acknowledgment. To prevent fatigue, we implemented a 5-minute correlation window that groups related alerts from the same source into a single incident.
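The correlation window can be sketched as a small in-memory grouper keyed on (source, alert type). The class and incident-ID format below are illustrative; our actual grouping lives in the webhook processor, and a real version would also need persistence and cleanup of stale entries.

```python
import time

CORRELATION_WINDOW_SECONDS = 300  # the 5-minute window described above

class AlertCorrelator:
    """Groups alerts from the same source/type into one incident
    when they arrive within the correlation window (illustrative sketch)."""

    def __init__(self, window: float = CORRELATION_WINDOW_SECONDS):
        self.window = window
        self._open = {}  # (source, alert_type) -> (incident_id, first_seen)
        self._next_id = 1

    def ingest(self, source: str, alert_type: str, now: float = None):
        """Return (incident_id, is_new); repeats inside the window reuse the incident."""
        now = time.time() if now is None else now
        key = (source, alert_type)
        if key in self._open:
            incident_id, first_seen = self._open[key]
            if now - first_seen <= self.window:
                return incident_id, False  # suppressed: joins existing incident
        incident_id = f"INC-{self._next_id:04d}"
        self._next_id += 1
        self._open[key] = (incident_id, now)
        return incident_id, True  # new incident, notifications fire
```

Only the `is_new` path triggers notifications, which is what keeps a burst of related alerts from paging the team repeatedly.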

Regarding escalation automation, we’ve configured time-based triggers tied to SLA definitions in ETQ. Here’s our escalation framework:

Webhook Receiver Layer: All incoming monitoring payloads pass through validation (signature verification, IP whitelist, schema check) before creating incidents. The receiver maps alert metadata to ETQ incident fields using our centralized classification rules.
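The IP whitelist part of that validation chain is simple to express with the standard library. The network ranges below are documentation-reserved examples, not our real allowlist; in practice these come from configuration.

```python
import ipaddress

# Hypothetical allowlist of monitoring-source networks (example ranges only).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/16"),    # internal sensors
    ipaddress.ip_network("203.0.113.0/24"),  # SaaS monitoring egress range
]

def ip_allowed(remote_addr: str) -> bool:
    """Reject webhook calls originating outside the allowlisted networks."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in ALLOWED_NETWORKS)
```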

Incident Classification: We maintain a configuration table with 50+ mapping rules covering different monitoring sources. Each rule defines severity mapping (P1-P4), initial assignment group, and required response time. Critical infrastructure alerts auto-assign P1 severity; application warnings become P3. The system evaluates rules in priority order until finding a match.
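A sketch of that priority-ordered evaluation, with a catch-all rule so every alert classifies. The column names and sample rows are invented for illustration; the real table holds 50+ rules in ETQ configuration.

```python
from dataclasses import dataclass

@dataclass
class MappingRule:
    """One row of the classification table (column names are illustrative)."""
    priority: int
    source_system: str    # "*" matches any source
    keyword: str          # substring matched against the lowercased message
    severity: str         # P1-P4
    assignment_group: str
    response_minutes: int

RULES = sorted([
    MappingRule(10, "datadog", "disk full", "P1", "infrastructure", 15),
    MappingRule(20, "new_relic", "error rate", "P3", "development", 240),
    MappingRule(90, "*", "", "P4", "ops-triage", 480),  # catch-all fallback
], key=lambda r: r.priority)

def classify_alert(source: str, message: str) -> MappingRule:
    """Evaluate rules in priority order and return the first match."""
    text = message.lower()
    for rule in RULES:
        if rule.source_system in ("*", source) and rule.keyword in text:
            return rule
    raise LookupError("no rule matched; the catch-all should prevent this")
```

Evaluating in priority order with an explicit fallback is what keeps severity mapping deterministic even when a new monitoring source starts sending unfamiliar alert types.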

Automated Routing: Incidents route to teams based on alert source and category. Network alerts go to infrastructure team, application errors to development, security events to InfoSec. We use ETQ’s workflow engine with custom scripts that query our CMDB to determine ownership dynamically.
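The routing decision can be reduced to "CMDB ownership first, category fallback second". The dictionaries below stand in for the CMDB query and routing table; a real implementation would call the CMDB API from the workflow script.

```python
# Hypothetical category -> team routing, per the mapping described above.
CATEGORY_TEAMS = {
    "network": "infrastructure",
    "application": "development",
    "security": "infosec",
}

# Stand-in for a CMDB ownership lookup (a real version queries the CMDB API).
CMDB_OWNERS = {"payments-api": "payments-team"}

def route_incident(category: str, ci_name: str = None) -> str:
    """Prefer CMDB-recorded ownership; fall back to category-based routing."""
    if ci_name and ci_name in CMDB_OWNERS:
        return CMDB_OWNERS[ci_name]
    return CATEGORY_TEAMS.get(category, "ops-triage")
```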

Multi-Channel Notifications: Upon incident creation, our integration triggers parallel notifications via email (detailed), SMS (brief), and Slack (interactive). Each channel receives appropriately formatted content. We suppress duplicate notifications using the 5-minute correlation window.

SLA Tracking and Escalation: ETQ’s SLA engine monitors response and resolution times. When thresholds are breached:

  • At 50% of SLA: Send warning to assigned team
  • At 80% of SLA: Escalate to team lead with Slack ping
  • At 100% of SLA: Auto-escalate to manager level, trigger emergency conference bridge, log compliance violation
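Those 50/80/100% thresholds can be expressed as a pure function of elapsed time over the SLA budget, which makes the escalation logic easy to test in isolation. The action names are illustrative labels, not ETQ workflow identifiers.

```python
def escalation_actions(elapsed_minutes: float, sla_minutes: float) -> list:
    """Return the escalation actions due at this point in the SLA window,
    per the 50/80/100% thresholds above (action names are illustrative)."""
    ratio = elapsed_minutes / sla_minutes
    actions = []
    if ratio >= 0.5:
        actions.append("warn_assigned_team")
    if ratio >= 0.8:
        actions.append("escalate_team_lead")
    if ratio >= 1.0:
        actions += ["escalate_manager", "open_conference_bridge",
                    "log_sla_violation"]
    return actions
```

Because the thresholds are cumulative, a fully breached P1 (15 of 15 minutes elapsed) yields all five actions in one pass.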

For P1 incidents, we have a 15-minute initial-response SLA. If there’s no acknowledgment within 15 minutes, the system pages the on-call engineer directly and creates a critical escalation record. For off-hours coverage, we integrated with PagerDuty for failover escalation when primary contacts don’t respond.

The entire implementation runs on ETQ’s cloud infrastructure with webhook receivers deployed as serverless functions for scalability. We process 200-300 monitoring alerts daily, with an average end-to-end latency of 12 minutes from alert generation to team notification. The system has reduced our mean time to acknowledge from 45 minutes to under 15 minutes, significantly improving our incident response capability.

Key lessons learned: Start with a small set of critical alert types and expand gradually. Invest time in tuning your classification rules based on real incident data. Build admin interfaces for non-technical users to adjust configurations. Monitor your webhook infrastructure itself - we had early issues with receiver timeouts that went undetected.

The multi-channel notification piece interests me most. Are you sending identical content across all channels, or do you customize messages per channel? Also, how do you prevent notification fatigue when multiple related alerts fire in quick succession?

We created a centralized mapping configuration table in ETQ that defines rules for each monitoring source. Each rule specifies alert_type, source_system, keyword_patterns, and maps to ETQ severity levels. The webhook processor queries this table on every incoming alert and applies the first matching rule. We also built an admin interface where our ops team can adjust mappings without code changes. This approach keeps the logic maintainable and allows us to fine-tune classifications based on real-world incident patterns we observe over time.
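As a rough picture of what the table rows look like in use: each row carries `alert_type`, `source_system`, `keyword_patterns`, and the target severity, and the processor applies the first matching row. The sample rows and regex patterns below are invented for illustration; in our setup the rows live in ETQ configuration, editable through the admin interface, not in code.

```python
import re

# Illustrative rows of the centralized mapping table described above.
MAPPING_TABLE = [
    {"alert_type": "host_down", "source_system": "datadog",
     "keyword_patterns": [r"\bhost\b.*\bdown\b"], "etq_severity": "P1"},
    {"alert_type": "latency", "source_system": "new_relic",
     "keyword_patterns": [r"response time", r"latency"], "etq_severity": "P3"},
]

def map_alert(source_system: str, text: str) -> str:
    """Apply the first matching table row; default to P4 when nothing matches."""
    for rule in MAPPING_TABLE:
        if rule["source_system"] != source_system:
            continue
        if any(re.search(p, text, re.IGNORECASE)
               for p in rule["keyword_patterns"]):
            return rule["etq_severity"]
    return "P4"
```

Keeping the patterns as data rather than code is what lets the ops team retune classifications from the admin interface without a deployment.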

This is exactly what we’ve been planning! Could you elaborate on how you structured the incident classification logic? We’re dealing with dozens of different alert types from various monitoring tools, and I’m concerned about maintaining consistent severity mapping across all sources.