Here’s a comprehensive solution for reliable webhook delivery with proper retry logic, delivery tracking, event ordering, and queue management:
WEBHOOK RETRY LOGIC IMPLEMENTATION:
Since FT 10.0 lacks native retry capabilities, implement a webhook relay service:
- Relay Service Architecture:
- Deploy a dedicated webhook relay between FT MES and your WMS
- Configure FT MES to send webhooks to relay instead of directly to WMS
- Relay persists events to durable message queue before acknowledging receipt
- Separate worker processes consume from queue and deliver to WMS with retry logic
- Retry Configuration:
// Pseudocode - Webhook delivery with retry logic:
1. Receive webhook from FT MES
2. Persist event to message queue (RabbitMQ/Kafka)
3. Respond 200 OK to FT MES only after the event is durably queued
4. Worker process attempts delivery to WMS endpoint
5. On failure: retry with exponential backoff (1s, 2s, 4s, 8s, 16s)
6. Maximum 5 retry attempts, about 31 seconds of total backoff
7. On persistent failure: move to dead letter queue for manual review
8. Log all delivery attempts with status codes and error messages
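The worker side of the retry steps above could be sketched roughly as follows. This is a minimal illustration, not FT MES or WMS code: `send` is a hypothetical callable that POSTs the event to the WMS and returns the HTTP status code, and the `sleep` parameter exists only so the backoff can be tested without waiting.

```python
import time
from typing import Callable, Sequence

# Backoff schedule from the steps above: 1s, 2s, 4s, 8s, 16s (31s in total).
BACKOFF_DELAYS = (1, 2, 4, 8, 16)


def deliver_with_retry(
    send: Callable[[dict], int],
    event: dict,
    delays: Sequence[float] = BACKOFF_DELAYS,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Make one initial attempt, then one retry per entry in `delays`.

    `send` is a hypothetical function that delivers the event to the WMS
    and returns the HTTP status code, or raises on a connection error.
    """
    attempts = []  # every attempt is recorded for the delivery log
    for attempt in range(len(delays) + 1):
        try:
            status = send(event)
            attempts.append({"attempt": attempt + 1, "status": status})
            if 200 <= status < 300:
                return {"outcome": "delivered", "attempts": attempts}
        except Exception as exc:
            attempts.append({"attempt": attempt + 1, "error": str(exc)})
        if attempt < len(delays):
            sleep(delays[attempt])  # wait before the next retry
    # All retries exhausted: the caller should move the event to the DLQ.
    return {"outcome": "dead_letter", "attempts": attempts}
```

The sketch retries on every non-2xx response for simplicity. In practice you may want to send 4xx responses straight to the dead letter queue, since retrying a malformed request rarely helps.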
DELIVERY TRACKING SYSTEM:
Implement comprehensive tracking for webhook lifecycle:
- Tracking Database Schema:
- Event ID (unique identifier)
- Work Order ID
- Event type (status change, transition)
- Source timestamp (when FT MES generated event)
- Receipt timestamp (when relay received webhook)
- Delivery timestamp (when WMS acknowledged)
- Delivery attempts count
- Current status (pending/delivered/failed)
- Response codes from delivery attempts
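As a rough illustration of the schema fields listed above, here is a sketch using SQLite from the Python standard library. The table name, column names, and the `record_attempt` helper are assumptions made for this example; a production relay would more likely use the database you already run.

```python
import sqlite3

# Hypothetical table mirroring the tracking fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS webhook_events (
    event_id        TEXT PRIMARY KEY,
    work_order_id   TEXT NOT NULL,
    event_type      TEXT NOT NULL,
    source_ts       TEXT NOT NULL,   -- when FT MES generated the event
    receipt_ts      TEXT NOT NULL,   -- when the relay received it
    delivery_ts     TEXT,            -- when the WMS acknowledged it
    attempts        INTEGER NOT NULL DEFAULT 0,
    status          TEXT NOT NULL DEFAULT 'pending'
                    CHECK (status IN ('pending', 'delivered', 'failed')),
    response_codes  TEXT NOT NULL DEFAULT ''  -- comma-separated history
);
"""


def record_attempt(conn, event_id, status_code, delivered_at=None):
    """Count one delivery attempt and append its response code.

    Passing `delivered_at` also marks the event as delivered.
    """
    code = str(status_code)
    conn.execute(
        """
        UPDATE webhook_events
           SET attempts = attempts + 1,
               response_codes = CASE WHEN response_codes = '' THEN ?
                                     ELSE response_codes || ',' || ? END,
               delivery_ts = COALESCE(?, delivery_ts),
               status = CASE WHEN ? IS NOT NULL THEN 'delivered'
                             ELSE status END
         WHERE event_id = ?
        """,
        (code, code, delivered_at, delivered_at, event_id),
    )
    conn.commit()
```

Storing response codes as a comma-separated string keeps the sketch to one table. A separate attempts table would be the cleaner choice if you need per-attempt timestamps and error messages.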
- Tracking API Endpoints:
GET /webhook-relay/events/{eventId}/status
Response: {
"eventId": "evt-89234",
"status": "delivered",
"attempts": 2,
"deliveryTime": "2025-06-01T09:15:23Z",
"latency": 3420
}
- Monitoring Dashboard:
- Real-time delivery success rate
- Average delivery latency
- Failed delivery alerts
- Queue depth monitoring
- Event throughput metrics
EVENT ORDERING GUARANTEES:
Ensure events are processed in correct sequence:
- Sequence Number Implementation:
Modify FT MES webhook payload to include sequence information:
{
"eventId": "evt-89234",
"workOrderId": "WO-5521",
"status": "Paused",
"timestamp": "2025-06-01T09:15:20.543Z",
"sequenceNumber": 3,
"previousStatus": "Started"
}
- WMS-Side Ordering Logic:
- Buffer events per work order in memory
- Process events in sequence number order
- Hold out-of-order events for up to 60 seconds
- If gap persists, query FT MES API for missing events
- Reject events whose sequence number is at or below the last processed one (stale or duplicate)
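The buffering rules above might look something like this on the WMS side. This is a sketch under assumptions: the class and method names are invented for illustration, and sequence numbers are assumed to start at 1 and increase by 1 per work order.

```python
import time
from typing import Callable, Dict, List


class WorkOrderSequencer:
    """Releases events per work order in sequenceNumber order."""

    def __init__(self, hold_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock
        self.last_processed: Dict[str, int] = {}      # highest released seq
        self.buffered: Dict[str, Dict[int, dict]] = {}  # held out-of-order events
        self.gap_since: Dict[str, float] = {}         # when the current gap began

    def accept(self, event: dict) -> List[dict]:
        """Return the events that are now ready to process, in order."""
        wo, seq = event["workOrderId"], event["sequenceNumber"]
        last = self.last_processed.get(wo, 0)
        if seq <= last:
            return []  # stale or duplicate: reject
        pending = self.buffered.setdefault(wo, {})
        pending[seq] = event
        ready = []
        while last + 1 in pending:  # release every contiguous event
            last += 1
            ready.append(pending.pop(last))
        self.last_processed[wo] = last
        if not pending:
            self.gap_since.pop(wo, None)
        elif ready or wo not in self.gap_since:
            self.gap_since[wo] = self.clock()  # a (new) gap starts now
        return ready

    def overdue_gaps(self) -> List[str]:
        """Work orders whose gap has lasted longer than hold_seconds."""
        now = self.clock()
        return [wo for wo, since in self.gap_since.items()
                if now - since > self.hold_seconds]
```

A periodic task could call `overdue_gaps()` and, for each work order returned, query FT MES for the missing events. The in-memory state is lost on restart, so the last processed sequence number per work order probably belongs in persistent storage.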
- Ordering in Message Queue:
- Use message queue partitioning by work order ID
- Ensures all events for same work order processed by same consumer
- Maintains FIFO ordering within each work order’s event stream
- Configure Kafka topic with key-based partitioning or RabbitMQ with routing keys
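To show why keying by work order ID preserves per-work-order ordering, here is a small sketch of key-based partition selection. The function name is hypothetical and the hash is not the one any particular broker uses; it only illustrates that the same key always lands on the same partition.

```python
import hashlib


def partition_for(work_order_id: str, num_partitions: int) -> int:
    """Map a work order ID to a stable partition index.

    Illustration only: real brokers and clients apply their own hash.
    """
    digest = hashlib.sha256(work_order_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then fold into range.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because every event for `WO-5521` maps to one partition, a single consumer sees that work order's events in the order they were queued. Changing the partition count remaps keys, so it is best done while the queue is drained.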
QUEUE MANAGEMENT STRATEGY:
Implement robust queue management for reliability:
- Queue Configuration:
Queue: ftmes-webhooks-primary
Durability: Persistent (survives broker restart)
Ack Mode: Manual (consumer explicitly acknowledges processing)
Prefetch: 10 messages per consumer
TTL: 7 days (events older than 7 days moved to archive)
DLQ: ftmes-webhooks-failed (for permanently failed deliveries)
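If the broker is RabbitMQ, the settings above could translate into queue declaration arguments along these lines. This is a sketch, not a tested deployment: the function and dictionary names are made up, while `x-message-ttl`, `x-dead-letter-exchange`, and `x-dead-letter-routing-key` are RabbitMQ's optional queue arguments.

```python
# 7 days expressed in milliseconds, the unit x-message-ttl expects.
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000


def primary_queue_declaration() -> dict:
    """Build declaration settings for the primary queue (hypothetical helper)."""
    return {
        "queue": "ftmes-webhooks-primary",
        "durable": True,  # queue definition survives a broker restart
        "arguments": {
            "x-message-ttl": SEVEN_DAYS_MS,
            # "" is the default exchange, which routes by queue name.
            "x-dead-letter-exchange": "",
            "x-dead-letter-routing-key": "ftmes-webhooks-failed",
        },
    }


# Consumer-side settings: manual acknowledgement, 10 unacked messages max.
CONSUMER_SETTINGS = {"prefetch_count": 10, "auto_ack": False}
```

Two caveats apply. A durable queue only keeps messages across restarts if the messages themselves are published as persistent. Also, RabbitMQ dead-letters expired messages to the same exchange as rejected ones, so separating "archived" from "failed" events would need additional routing that this sketch does not cover.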
- Consumer Pool Management:
- Deploy 3-5 consumer workers for redundancy
- Each consumer processes events independently
- Automatic consumer failover on worker failure
- Dynamic scaling based on queue depth (scale up if depth >1000)
- Backpressure Handling:
- Monitor queue depth continuously
- If depth exceeds 5000 messages: alert operations team
- If depth exceeds 10000: temporarily pause FT MES webhook generation
- Implement circuit breaker pattern to prevent queue overflow
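The depth thresholds mentioned in this section (1000, 5000, 10000) can be captured in one small decision function. The function name and the returned action labels are assumptions for illustration; how each action is carried out depends on your orchestration and on what FT MES allows.

```python
def backpressure_action(queue_depth: int) -> str:
    """Pick the strongest action whose threshold the depth exceeds."""
    if queue_depth > 10000:
        return "pause_source"  # temporarily pause FT MES webhook generation
    if queue_depth > 5000:
        return "alert_ops"     # notify the operations team
    if queue_depth > 1000:
        return "scale_up"      # add consumer workers
    return "ok"
```

A monitoring loop could call this every few seconds with the current depth and act only when the returned label changes, which avoids repeated alerts for the same condition.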
CONFIGURATION CHANGES IN FT MES:
- Webhook Delivery Thread Pool:
Increase thread pool size in ftmes-webhooks.properties:
webhook.delivery.threadPool.size=25
webhook.delivery.timeout=15000
webhook.delivery.queueCapacity=500
- Webhook Endpoint Configuration:
Point webhooks to relay service:
Webhook URL: http://webhook-relay.internal:8080/ftmes-events
Headers: X-Source: factorytalk-mes, X-API-Key: [relay-api-key]
FAILURE RECOVERY PROCEDURES:
- WMS Downtime Handling:
- Events accumulate in relay queue during WMS downtime
- Upon WMS recovery, consumers process backlog at controlled rate
- Implement rate limiting (max 50 events/second) to prevent overwhelming WMS
- Monitor queue drain rate and adjust consumer count as needed
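One common way to cap the backlog drain at 50 events/second is a token bucket. The sketch below is an assumption about how the relay workers might do it, with a hypothetical class name; the `clock` parameter is there only to make the behavior testable.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows roughly `rate` deliveries per second, with bursts up to `capacity`."""

    def __init__(self, rate: float = 50.0, capacity: float = 50.0,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity       # start with a full bucket
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Consume one token if available; return False to signal 'wait'."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A worker would call `try_acquire()` before each delivery and sleep briefly when it returns False. With several workers, the bucket has to be shared (or each worker given a fraction of the rate), otherwise the combined rate exceeds the limit. Lowering `capacity` reduces the initial burst after WMS recovery.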
- Relay Service Failure:
- Deploy relay service in high-availability configuration (3 instances)
- Use load balancer with health checks
- If all relay instances fail, FT MES webhooks fail but events remain in FT MES audit log
- Recovery procedure: query FT MES audit API for missed events and replay to relay
- Message Queue Failure:
- Use message broker clustering (RabbitMQ cluster or Kafka replication)
- Ensure queue persistence to disk
- Regular queue backups for disaster recovery
MONITORING AND ALERTING:
Set up comprehensive monitoring:
- Webhook delivery success rate (alert if <95%)
- Average delivery latency (alert if >10 seconds)
- Queue depth (alert if >5000 messages)
- Dead letter queue depth (alert on any messages)
- Out-of-order event rate (alert if >1%)
- Consumer health checks (alert on consumer failures)
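The thresholds in the list above can be evaluated in a single pass over a metrics snapshot. The metric key names and alert labels below are assumptions for illustration; rates are expressed as fractions (0.95 for 95%).

```python
from typing import Dict, List


def evaluate_alerts(metrics: Dict[str, float]) -> List[str]:
    """Return the names of all alert conditions that currently hold."""
    alerts = []
    if metrics["success_rate"] < 0.95:
        alerts.append("delivery_success_rate")
    if metrics["avg_latency_seconds"] > 10:
        alerts.append("delivery_latency")
    if metrics["queue_depth"] > 5000:
        alerts.append("queue_depth")
    if metrics["dlq_depth"] > 0:  # any dead-lettered message alerts
        alerts.append("dead_letter_queue")
    if metrics["out_of_order_rate"] > 0.01:
        alerts.append("out_of_order_events")
    if metrics["unhealthy_consumers"] > 0:
        alerts.append("consumer_health")
    return alerts
```

In most setups these rules would live in the monitoring system itself rather than in application code; the function mainly documents the thresholds in one testable place.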
IMPLEMENTATION ROADMAP:
- Phase 1 (Week 1-2): Deploy webhook relay service with basic queue
- Phase 2 (Week 3): Implement retry logic and delivery tracking
- Phase 3 (Week 4): Add event ordering and sequence numbers
- Phase 4 (Week 5): Implement monitoring dashboard and alerts
- Phase 5 (Week 6): Load testing and failure scenario validation
ADDITIONAL RECOMMENDATIONS:
- Implement idempotency keys in webhook payloads so WMS can safely process duplicate deliveries
- Use webhook signatures (HMAC) to verify authenticity of events received by WMS
- Create reconciliation job that compares FT MES work order state with WMS state daily
- Document runbooks for common failure scenarios (relay down, queue full, WMS unavailable)
- Consider implementing webhook replay API for manual recovery of specific missed events
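The first two recommendations, idempotency keys and HMAC signatures, could be combined on the WMS side roughly as follows. This is a sketch under assumptions: the class and function names are invented, the `eventId` is used as the idempotency key, and HMAC-SHA256 over the raw request body is one common signing choice rather than something FT MES is known to provide.

```python
import hashlib
import hmac
from typing import Callable, Set


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body).encode("utf-8")
    return hmac.compare_digest(expected, signature.encode("utf-8"))


class IdempotentReceiver:
    """Rejects bad signatures and processes each event ID at most once."""

    def __init__(self, secret: bytes, handler: Callable[[bytes], None]):
        self.secret = secret
        self.handler = handler
        self.seen: Set[str] = set()  # in production: a persistent store

    def receive(self, body: bytes, signature: str, event_id: str) -> str:
        if not verify(self.secret, body, signature):
            return "rejected"
        if event_id in self.seen:
            return "duplicate"  # safe to acknowledge without reprocessing
        self.handler(body)
        self.seen.add(event_id)  # recorded only after successful handling
        return "processed"
```

Returning success for duplicates matters: the relay retries until it gets an acknowledgement, so a duplicate should be acknowledged rather than treated as an error.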
This architecture provides reliable webhook delivery with retry logic, comprehensive delivery tracking, per-work-order event ordering, and robust queue management. The relay service decouples FT MES webhook generation from WMS delivery, so events are retained during WMS outages and rapid work order state transitions. If the relay itself is unreachable, missed events can still be recovered from the FT MES audit log and replayed.