Let me provide the complete implementation details:
REST API Callback Triggers: Configured ETQ workflow rules to fire webhooks on specific non-conformance events. Key triggers: NC status changes to ‘Open’, severity set to ‘High’ or ‘Critical’, assignment changes, or a due date approaching. ETQ sends a POST request to our callback endpoint with the full NC record as the JSON payload. We validate the webhook signature with HMAC to ensure authenticity:
const crypto = require('crypto');

function validateWebhook(req) {
  const signature = req.headers['x-etq-signature'] || '';
  // Compute over the raw body where possible; re-serializing req.body
  // can break the match if key order or whitespace differs
  const payload = JSON.stringify(req.body);
  const expectedSig = crypto.createHmac('sha256', SECRET_KEY)
    .update(payload).digest('hex');
  // Constant-time comparison avoids leaking timing information
  return signature.length === expectedSig.length &&
    crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(expectedSig));
}
JSON Payload Transformation: Built a transformation pipeline that normalizes ETQ’s complex nested JSON into a simplified notification format. Extract essential fields, apply business rules, and enrich with additional context from our systems. The transformer handles field mapping, data type conversions, and default value population. We use JSON schema validation to ensure transformed payloads meet our notification service requirements.
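A minimal sketch of the transformation step (the field names below are illustrative, not ETQ's actual schema; defaults stand in for the business rules mentioned above):

```javascript
// Normalize a nested ETQ-style record into a flat notification payload.
// Handles field mapping, type conversion, and default value population.
function transformNotification(etqRecord) {
  const nc = etqRecord.nonConformance || {};
  return {
    ncId: String(nc.id),
    title: nc.title || 'Untitled NC',
    severity: (nc.severity || 'LOW').toUpperCase(),
    category: (nc.classification && nc.classification.category) || 'UNCATEGORIZED',
    dueDate: nc.dueDate ? new Date(nc.dueDate).toISOString() : null,
    assignee: nc.assignedTo || null,
  };
}
```

The output of this function is what would then be checked against a JSON schema before handoff to the notification service.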
Department Routing Logic: Implemented a rule engine that maps NC categories to departments with configurable priority levels. Routing rules stored in PostgreSQL with fields: category, departmentCode, priorityLevel, escalationThreshold, notificationTemplate. When processing callback, query rules based on NC category and severity. Multiple departments can receive notifications with different priority levels. For critical NCs affecting multiple areas, we route to all relevant departments simultaneously plus a central quality team.
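The routing lookup can be sketched as follows; in production the rules would come from the PostgreSQL table described above, but here they are plain objects, and the central quality team's department code is hypothetical:

```javascript
// Resolve all departments that should be notified for a given NC.
// Critical NCs always copy the central quality team in addition to
// the category-matched departments.
function resolveRoutes(rules, category, severity) {
  const matches = rules.filter((r) => r.category === category);
  if (severity === 'CRITICAL') {
    matches.push({ departmentCode: 'QUALITY-CENTRAL', priorityLevel: 1 });
  }
  // Lower priorityLevel = more urgent; notify in that order
  return matches.sort((a, b) => a.priorityLevel - b.priorityLevel);
}
```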
Retry Mechanism with Exponential Backoff: Critical for reliability. When notification delivery fails (service down, timeout, error response), we retry with increasing delays: 1st retry after 2 seconds, 2nd after 4 seconds, 3rd after 8 seconds, up to 5 total attempts. Implemented using a job queue (Bull with Redis backend). Each failed notification creates a retry job with delay calculated as: delay = baseDelay * (2 ^ attemptNumber). After 5 failures, route to dead-letter queue and alert operations team.
// Re-enqueue with exponential backoff: 2s, 4s, 8s, 16s, 32s.
// Bull's built-in `attempts` option is omitted because retries are
// scheduled manually here; combining both would multiply the retries.
const retryJob = await notificationQueue.add(
  { ncId, department, attempt: currentAttempt + 1 },
  { delay: Math.pow(2, currentAttempt) * 2000 }
);
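The failure path described above (retry, then dead-letter after five failures) can be sketched like this; the queue and alerting dependencies are passed in explicitly, and the `alertOps` hook is a placeholder for whatever pages the operations team:

```javascript
const MAX_ATTEMPTS = 5;

// Called when a notification delivery fails: either schedule the next
// retry with exponential backoff, or park the job in the dead-letter
// queue and alert operations once retries are exhausted.
async function handleDeliveryFailure(job, err, { notificationQueue, deadLetterQueue, alertOps }) {
  const { ncId, department, attempt } = job.data;
  if (attempt >= MAX_ATTEMPTS) {
    await deadLetterQueue.add({ ncId, department, error: err.message });
    await alertOps(`NC ${ncId}: notification to ${department} failed ${MAX_ATTEMPTS} times`);
    return;
  }
  await notificationQueue.add(
    { ncId, department, attempt: attempt + 1 },
    { delay: Math.pow(2, attempt) * 2000 }
  );
}
```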
Workflow Audit Logging: Comprehensive logging at every stage: webhook receipt, payload validation, transformation, routing decision, notification delivery, acknowledgment, and escalation. Each log entry includes: timestamp, ncId, eventType, actor, outcome, duration, and error details if applicable. Logs stored in Elasticsearch for searching and analysis. This audit trail is essential for compliance and troubleshooting.
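The shape of one audit entry, built from the fields listed above (the helper name is ours, and the entry is what would then be indexed into Elasticsearch):

```javascript
// Build a structured audit entry for one stage of the workflow.
function buildAuditEntry({ ncId, eventType, actor, outcome, startedAt, error }) {
  return {
    timestamp: new Date().toISOString(),
    ncId,
    eventType,                          // e.g. 'webhook_received', 'routing_decision'
    actor,                              // system component or user id
    outcome,                            // 'success' | 'failure' | 'skipped'
    durationMs: Date.now() - startedAt, // stage duration
    // Error details only included on failure
    ...(error ? { error: { message: error.message, stack: error.stack } } : {}),
  };
}
```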
To prevent duplicate notifications from ETQ sending the same webhook multiple times, we implement idempotency using a combination of ncId and event timestamp. Check Redis cache for recent webhook with same ncId + timestamp. If found within 5-minute window, return 200 OK but skip processing. This handles network retries without creating duplicate notifications.
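The Redis check above reduces to a single atomic SET with NX and EX flags (supported by clients such as ioredis); only the first caller for a given ncId + timestamp wins, and the key expires after the 5-minute window:

```javascript
// Returns true if this webhook is new and should be processed;
// false if it is a duplicate within the 5-minute window.
// `redis` is any client exposing set(key, value, 'EX', ttl, 'NX').
async function claimWebhook(redis, ncId, eventTimestamp) {
  const key = `webhook:${ncId}:${eventTimestamp}`;
  // SET NX EX is atomic: the first caller gets 'OK', duplicates get null
  const result = await redis.set(key, '1', 'EX', 300, 'NX');
  return result === 'OK';
}
```

Either way the endpoint returns 200 OK, so ETQ's own retries are acknowledged without reprocessing.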
The 85% reduction came from three factors: roughly 60% of the improvement from eliminating manual routing (the system identifies the correct departments automatically), 20% from faster notification delivery (webhooks instead of polling), and 20% from automatic escalation (no manual follow-up needed for overdue NCs). We track: time-to-notification (average 30 seconds, down from 4 hours), routing accuracy (98%), acknowledgment rate within SLA (92%), and escalation frequency (down from 200/month to 30/month).
Key lesson learned: Start simple with basic routing rules and iterate. We initially built complex multi-factor routing logic that was hard to maintain. Simplified to category-based routing with manual overrides for edge cases. This reduced complexity while maintaining effectiveness.