Let me address all the key factors systematically since you’re dealing with multiple interrelated issues.
Session Timeout Configuration: Increase your session timeout from 30 seconds to at least 180 seconds. For work order sync in manufacturing, you need headroom for network variability and server load spikes. Set this in your OPC UA client configuration (values below are in milliseconds):
sessionTimeout=180000
requestTimeout=60000
secureChannelLifetime=300000
Keep-Alive Interval Tuning: Your 10-second keep-alive is reasonable, but it needs to be set in proportion to the revised session timeout. A common rule of thumb is to keep the keep-alive interval at no more than one third of the session timeout, so that several keep-alives fit inside the window and a single delayed response doesn't kill the session. With a 180-second timeout, 15-20 seconds works well: enough buffer for network delays without excessive overhead.
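As a sanity check, the relationships between these timers can be validated in a few lines. This is a plain-Python sketch; the parameter names are illustrative, not tied to any particular OPC UA client library:

```python
# Sketch: cross-check OPC UA timing parameters against each other.
# Parameter names are illustrative -- map them onto whatever your
# client library actually exposes.

def validate_timing(session_timeout_s: float,
                    keep_alive_s: float,
                    firewall_idle_timeout_s: float) -> list[str]:
    """Return a list of warnings for misconfigured timer relationships."""
    warnings = []
    # Several keep-alives should fit inside the session timeout,
    # so one delayed response does not expire the session.
    if keep_alive_s > session_timeout_s / 3:
        warnings.append("keep-alive too close to session timeout")
    # The firewall must never expire connection state before the session does.
    if firewall_idle_timeout_s < session_timeout_s:
        warnings.append("firewall idle timeout shorter than session timeout")
    return warnings

# Recommended values: 180 s session, 15 s keep-alive, 300 s firewall state.
print(validate_timing(180, 15, 300))   # -> []
print(validate_timing(180, 90, 120))   # -> both warnings fire
```

Running this against your current settings before and after the change is a quick way to confirm the timers are consistent.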
Subscription vs Polling Trade-offs: Your BadSessionIdInvalid errors during high activity suggest your subscriptions are overwhelming the server; shift changes create burst traffic when many work orders update simultaneously. Consider a hybrid approach: use subscriptions for critical status changes (Started, Completed, Failed) and poll less critical attributes every 30-60 seconds. This reduces the real-time load on the server while keeping responsiveness for the events that matter.
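The hybrid split can be sketched as follows. This is plain Python; `read_value` and the tag names are placeholders for whatever your MES adapter and tag namespace actually use:

```python
import time

# Hypothetical tag names -- substitute your server's actual node IDs.
CRITICAL_TAGS = {"WorkOrder.Status"}   # subscribe: Started/Completed/Failed
SLOW_TAGS = {"WorkOrder.Progress", "WorkOrder.Operator", "WorkOrder.Notes"}

def partition_tags(tags, critical):
    """Split a tag list into (subscribe, poll) groups."""
    subscribe = [t for t in tags if t in critical]
    poll = [t for t in tags if t not in critical]
    return subscribe, poll

def poll_loop(read_value, tags, interval_s=45, iterations=None):
    """Poll non-critical tags on a slow cycle (30-60 s per the advice above).

    `read_value` is a placeholder for your adapter's read call;
    `iterations=None` runs forever.
    """
    n = 0
    while iterations is None or n < iterations:
        for tag in tags:
            value = read_value(tag)   # one read per tag per cycle
            # ... hand value off to the MES layer here ...
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval_s)
```

The subscription side stays with your existing client; only the tags in the "poll" group move off the subscription, which is what actually relieves the server's publish queue.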
Network Firewall Configuration: Work with your network team to verify three things for port 4840: 1) the stateful connection-tracking timeout matches or exceeds your OPC UA session timeout (300 seconds gives comfortable margin over a 180-second session), 2) deep packet inspection and application-layer gateway features are disabled for OPC UA traffic, since they interfere with the binary protocol's timing, 3) if NAT is in the path, connection tracking is per-session, not per-packet.
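On a Linux-based firewall using conntrack, the state-tracking side of this looks roughly as follows. Treat this as a sketch of what to ask your network team for; commercial firewall appliances expose the equivalent knob under names like "TCP idle timeout", and those are often trimmed far below the Linux default:

```shell
# Inspect the current established-connection tracking timeout (seconds).
# The Linux default is very large (days); appliance firewalls are often
# configured much lower, which is what silently drops "idle" sessions.
sysctl net.netfilter.nf_conntrack_tcp_timeout_established

# Whatever the platform, the value must be at least the OPC UA session
# timeout -- 300 s per the recommendation above. Example (Linux):
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=300
```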
Exponential Backoff Reconnection: Implement this in your MES OPC UA client adapter. After disconnect, don’t reconnect immediately. Use this pattern:
// Pseudocode - Reconnection with backoff:
1. Detect session timeout or BadSessionIdInvalid error
2. Calculate backoff: min(60, 5 * 2^attemptCount) seconds
3. Wait for backoff period before reconnection attempt
4. On successful reconnect, reset attemptCount to 0
5. Log all reconnection attempts with timestamps for analysis
// This prevents reconnection storms during server stress
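The pseudocode above maps directly onto a small reconnect helper. This is a plain-Python sketch; `connect` is a placeholder for your adapter's actual session-establishment call:

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("opcua-reconnect")

def backoff_seconds(attempt: int, base: float = 5.0, cap: float = 60.0) -> float:
    """min(60, 5 * 2^attempt), per step 2 of the pseudocode."""
    return min(cap, base * (2 ** attempt))

def reconnect_with_backoff(connect, max_attempts: int = 10) -> bool:
    """Retry `connect` (a callable that raises on failure) with backoff.

    Returns True on success; the caller then resets its attempt counter
    (step 4). All attempts are logged with timestamps (step 5).
    """
    for attempt in range(max_attempts):
        wait = backoff_seconds(attempt)
        # Optional refinement: random jitter so many clients don't all
        # reconnect in lockstep after a server restart.
        wait += random.uniform(0, 1)
        log.info("reconnect attempt %d in %.1f s", attempt + 1, wait)
        time.sleep(wait)
        try:
            connect()
            return True
        except Exception as exc:
            log.warning("attempt %d failed: %s", attempt + 1, exc)
    return False
```

The jitter line is an addition beyond the pseudocode, but it is cheap insurance when dozens of clients share one server.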
For your specific 15-20 minute timeout pattern, I suspect the cause is the combination of the aggressive 500ms subscription publishing interval you mentioned earlier and the firewall state-timeout mismatch. The server copes initially, but as work order updates accumulate the processing queue builds up, keep-alive responses get delayed, and eventually the firewall drops the apparently idle connection because it hasn't seen bidirectional traffic within its timeout window.
Start with increasing session timeout to 180s and adjusting firewall state tracking to 300s. Then tune your subscription publishing intervals based on actual work order update frequency - 2-5 seconds is usually sufficient for status visibility. Monitor for a week and adjust from there. The exponential backoff will protect you during any remaining edge cases.