Automated non-conformance notification workflow using REST API callbacks - 85% reduction in manual escalation

Sharing our implementation of automated non-conformance notification workflows that reduced manual escalation by 85% over six months. We were struggling with delayed notifications when non-conformances required immediate attention across multiple departments.

Built a REST API callback system that triggers notifications based on non-conformance severity and department routing rules. The system transforms ETQ JSON payloads and routes them to appropriate stakeholders with automatic escalation if not acknowledged within SLA timeframes.

const express = require('express');
const app = express();
app.use(express.json());

// Callback endpoint: verify signature, ack immediately, process async
app.post('/api/nc-callback', (req, res) => {
  if (!validateWebhook(req)) return res.status(401).end();
  res.status(200).send({received: true});
  Promise.resolve(req.body)
    .then(transformPayload)
    .then(routeByDepartment)
    .catch((err) => console.error('NC processing failed:', err));
});

Implemented retry mechanism with exponential backoff and comprehensive audit logging. Happy to share details about the architecture and lessons learned.

Let me provide the complete implementation details:

REST API Callback Triggers: Configured ETQ workflow rules to fire webhooks on specific non-conformance events. Key triggers: NC status changes to ‘Open’, severity set to ‘High’ or ‘Critical’, assignment changes, or due date approaching. ETQ sends a POST request to our callback endpoint with the full NC record as the JSON payload. We validate the webhook signature with an HMAC to ensure authenticity:

const crypto = require('crypto');

// Recompute the HMAC over the payload and compare against the header.
// Note: re-serializing req.body assumes stable key order; for exact-byte
// matching, capture the raw request body instead.
function validateWebhook(req) {
  const signature = req.headers['x-etq-signature'] || '';
  const payload = JSON.stringify(req.body);
  const expectedSig = crypto.createHmac('sha256', SECRET_KEY)
    .update(payload).digest('hex');
  // timingSafeEqual prevents timing attacks; lengths must match first
  return signature.length === expectedSig.length &&
    crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(expectedSig));
}

JSON Payload Transformation: Built a transformation pipeline that normalizes ETQ’s complex nested JSON into a simplified notification format. Extract essential fields, apply business rules, and enrich with additional context from our systems. The transformer handles field mapping, data type conversions, and default value population. We use JSON schema validation to ensure transformed payloads meet our notification service requirements.
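As a rough sketch of that pipeline (the ETQ field paths and the target schema here are illustrative assumptions, not the actual webhook structure), with Ajv handling the schema validation step:

const Ajv = require('ajv');

// Target schema for our notification service (illustrative)
const notificationSchema = {
  type: 'object',
  required: ['ncId', 'category', 'severity'],
  properties: {
    ncId: {type: 'string'},
    category: {type: 'string'},
    severity: {enum: ['LOW', 'MEDIUM', 'HIGH', 'CRITICAL']},
    description: {type: 'string'},
    assignedTo: {type: ['string', 'null']},
    createdDate: {type: 'string'}
  }
};
const validateNotification = new Ajv().compile(notificationSchema);

// Field paths are hypothetical; ETQ's real nested structure differs
function transformPayload(etqPayload) {
  const record = etqPayload.record || {};
  const transformed = {
    ncId: record.id != null ? String(record.id) : undefined,
    category: record.classification?.category || 'UNCATEGORIZED',
    severity: (record.severity || 'LOW').toUpperCase(),
    description: record.details?.summary || '',
    assignedTo: record.assignment?.owner || null,
    createdDate: new Date(record.createdAt || Date.now()).toISOString()
  };
  if (!validateNotification(transformed)) {
    throw new Error('NC payload failed schema validation: ' +
      JSON.stringify(validateNotification.errors));
  }
  return transformed;
}

The transformer throws on schema violations so malformed events land in the dead-letter path instead of producing half-formed notifications.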

Department Routing Logic: Implemented a rule engine that maps NC categories to departments with configurable priority levels. Routing rules are stored in PostgreSQL with fields: category, departmentCode, priorityLevel, escalationThreshold, notificationTemplate. When processing a callback, we query rules by NC category and severity. Multiple departments can receive notifications with different priority levels. For critical NCs affecting multiple areas, we route to all relevant departments simultaneously plus a central quality team.
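A simplified sketch of the lookup and fan-out (node-postgres assumed; sendNotification stands in for our notification service client):

const {Pool} = require('pg');
const pool = new Pool(); // connection settings from standard PG env vars

// Look up routing rules for an NC and fan out notifications.
// Table/column names follow the fields described above.
async function routeByDepartment(nc) {
  const {rows: rules} = await pool.query(
    `SELECT department_code, priority_level, escalation_threshold,
            notification_template
       FROM routing_rules
      WHERE category = $1
      ORDER BY priority_level DESC`,
    [nc.category]
  );
  // Critical NCs also go to the central quality team
  if (nc.severity === 'CRITICAL') {
    rules.push({department_code: 'QUALITY_CENTRAL', priority_level: 1});
  }
  await Promise.all(rules.map((rule) => sendNotification(nc, rule)));
}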

Retry Mechanism with Exponential Backoff: Critical for reliability. When notification delivery fails (service down, timeout, error response), we retry with increasing delays: 1st retry after 2 seconds, 2nd after 4 seconds, 3rd after 8 seconds, up to 5 total attempts. Implemented using a job queue (Bull with Redis backend). Each failed notification creates a retry job with delay calculated as: delay = baseDelay * (2 ^ attemptNumber). After 5 failures, route to dead-letter queue and alert operations team.

// Schedule the next delivery attempt; the delay doubles each time
// (2s, 4s, 8s, ...). The caller stops re-enqueueing after 5 total
// attempts and routes the notification to the dead-letter queue.
const retryJob = await notificationQueue.add(
  {ncId, department, attempt: currentAttempt + 1},
  {delay: Math.pow(2, currentAttempt) * 2000}
);

Workflow Audit Logging: Comprehensive logging at every stage: webhook receipt, payload validation, transformation, routing decision, notification delivery, acknowledgment, and escalation. Each log entry includes: timestamp, ncId, eventType, actor, outcome, duration, and error details if applicable. Logs stored in Elasticsearch for searching and analysis. This audit trail is essential for compliance and troubleshooting.
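Each entry is written by a small helper along these lines (sketch assumes the v7 @elastic/elasticsearch Node client; the index name and helper are our own conventions, not part of ETQ):

const {Client} = require('@elastic/elasticsearch');
const esClient = new Client({node: process.env.ES_URL});

// One audit entry per workflow stage, mirroring the fields listed above
async function auditLog(entry) {
  await esClient.index({
    index: 'nc-workflow-audit',
    body: {
      timestamp: new Date().toISOString(),
      ncId: entry.ncId,
      eventType: entry.eventType,  // e.g. 'WEBHOOK_RECEIVED', 'ROUTED'
      actor: entry.actor || 'system',
      outcome: entry.outcome,      // 'SUCCESS' or 'FAILURE'
      durationMs: entry.durationMs,
      error: entry.error || null
    }
  });
}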

To prevent duplicate notifications from ETQ sending the same webhook multiple times, we implement idempotency using a combination of ncId and event timestamp. Check Redis cache for recent webhook with same ncId + timestamp. If found within 5-minute window, return 200 OK but skip processing. This handles network retries without creating duplicate notifications.
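The check itself is a one-liner with an atomic SET (sketch assumes ioredis; the key format is just ncId + event timestamp as described):

const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);

// Atomically record the webhook key with a 5-minute TTL. SET ... NX
// returns null if the key already exists, i.e. a duplicate delivery.
async function isDuplicateWebhook(ncId, eventTimestamp) {
  const key = `webhook:${ncId}:${eventTimestamp}`;
  const result = await redis.set(key, '1', 'EX', 300, 'NX');
  return result === null;
}

We call this before transformation; on a duplicate we still return 200 OK so ETQ stops retrying.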

The 85% reduction came from multiple factors: roughly 60% of the gain from eliminating manual routing (the system automatically identifies the correct departments), 20% from faster notification delivery (webhooks vs polling), and 20% from automatic escalation (no manual follow-up needed for overdue NCs). We track: time-to-notification (avg 30 seconds vs 4 hours previously), routing accuracy (98%), acknowledgment rate within SLA (92%), and escalation frequency (down from 200/month to 30/month).

Key lesson learned: Start simple with basic routing rules and iterate. We initially built complex multi-factor routing logic that was hard to maintain. Simplified to category-based routing with manual overrides for edge cases. This reduced complexity while maintaining effectiveness.

We built a custom transformation layer using JSON schema mapping. Extract key fields from the ETQ payload: ncId, category, severity, description, assignedTo, createdDate. Then enrich with our business logic, adding department codes, escalation thresholds, and notification templates. For error handling, the callback endpoint immediately returns 200 OK to ETQ, then processes asynchronously. If transformation or routing fails, we retry with exponential backoff and log to our monitoring system. Failed events go to a dead-letter queue for manual review.
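Roughly, the failure path in the async processor looks like this (Bull queue names, deliverNotification, and alertOpsTeam are illustrative assumptions consistent with the retry snippet above):

const Queue = require('bull');
const notificationQueue = new Queue('nc-notifications');
const deadLetterQueue = new Queue('nc-notifications-dlq');

// Deliver, retry with backoff, dead-letter after 5 total attempts
notificationQueue.process(async (job) => {
  const {ncId, department, attempt = 0} = job.data;
  try {
    await deliverNotification(ncId, department);
  } catch (err) {
    if (attempt + 1 >= 5) {
      // Exhausted all attempts: park for manual review and alert ops
      await deadLetterQueue.add({ncId, department, error: err.message});
      await alertOpsTeam(ncId, err);
    } else {
      // Re-enqueue with exponential backoff (2s, 4s, 8s, ...)
      await notificationQueue.add(
        {ncId, department, attempt: attempt + 1},
        {delay: Math.pow(2, attempt) * 2000}
      );
    }
  }
});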

This is exactly what we need! How did you handle the department routing logic? Do you maintain department mappings in ETQ or in your custom system? And what triggers the initial callback from ETQ?

What about the JSON payload transformation? ETQ’s webhook payloads can be complex with nested objects. Did you build a custom parser or use a mapping tool? Also curious about error handling - what happens if the transformation fails or the routing service is down?

The exponential backoff retry mechanism is critical. What’s your retry strategy? How many attempts and what intervals? Also, how do you prevent duplicate notifications if ETQ sends the same webhook multiple times due to network issues?