OPC UA session keeps timing out during work order status sync

We’re experiencing intermittent OPC UA session timeouts between AVEVA MES AM-2021.2 and our TOP Server when syncing work order status updates. The sessions drop unexpectedly after 15-20 minutes of operation, causing loss of work order visibility on the shop floor.

Our current configuration has the default session timeout set to 30 seconds with a 10-second keep-alive interval. We’re using subscription-based data collection for work order status changes, and the network team confirms port 4840 is open in the firewall rules.

The connection re-establishes automatically, but we lose critical status updates during the downtime window. Has anyone encountered similar session stability issues with OPC UA work order synchronization? I’m particularly concerned about optimizing the timeout settings and whether we should adjust our subscription approach versus polling.

That BadSessionIdInvalid error usually means the server thinks the session expired before the client tried to use it again. The 3-5 second gaps during shift changes are a red flag - it sounds like your MES server is getting overwhelmed. Check your subscription publishing intervals too. If your publishing intervals are too aggressive across several simultaneous subscriptions, you can create a feedback loop where the server can’t keep up with keep-alive responses. We reduced our publishing interval from 500ms to 2000ms for work order status and it stabilized everything.
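A quick back-of-the-envelope check makes the effect of the publishing interval concrete. The subscription count here is hypothetical - plug in your own numbers:

```python
def publish_rate_per_s(num_subscriptions, publishing_interval_ms):
    # Upper bound on publish responses per second the server must service:
    # each subscription can deliver at most one publish response per interval
    return num_subscriptions * 1000 / publishing_interval_ms

# Hypothetical plant with 20 work-order subscriptions (numbers are illustrative)
before = publish_rate_per_s(20, 500)    # 40 responses/sec at a 500 ms interval
after = publish_rate_per_s(20, 2000)    # 10 responses/sec at a 2000 ms interval
```

Cutting the steady-state publish load by 4x leaves the server headroom to answer keep-alives promptly during bursts.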

Good catch on the publishing interval. Another thing to consider is implementing exponential backoff in your reconnection logic. If you’re trying to reconnect immediately after each timeout, you might be making the problem worse during high-load periods. We added a backoff strategy that waits 5 seconds after the first failure, then 10, then 20, up to a max of 60 seconds. This gave the MES server breathing room to recover during peak times.

From a network perspective, definitely verify your firewall isn’t just allowing port 4840 but is also maintaining stateful connections properly. We had to adjust our firewall’s connection timeout to 300 seconds to match our OPC UA session timeout. Also check MTU settings - if you’re experiencing packet fragmentation on the OPC UA traffic, it can cause those mysterious gaps in keep-alive responses that look like server load issues but are actually network layer problems.

Are you seeing any specific error codes in the MES logs when the session drops? We had similar issues last year and found that our firewall was doing deep packet inspection on port 4840, which was causing intermittent delays. Even though the port was “open,” the stateful inspection was interfering with the OPC UA handshake. Work with your network team to verify there’s no inspection or QoS policies affecting that traffic. Also, check if you’re hitting any connection pool limits on the TOP Server side.

I’ve seen this before with TOP Server. The 30-second session timeout is way too aggressive for manufacturing environments with network jitter. Try increasing it to at least 120 seconds in your OPC UA server settings. Also, your 10-second keep-alive might be fine, but make sure it’s actually being honored by both client and server. Check the TOP Server diagnostics to see if keep-alive packets are being sent consistently.

Let me address all the key factors systematically since you’re dealing with multiple interrelated issues.

Session Timeout Configuration: Increase your session timeout from 30 to 180 seconds minimum. For work order sync in manufacturing, you need to account for network variability and server load spikes. Set this in your OPC UA client configuration (all values are in milliseconds):


sessionTimeout=180000
requestTimeout=60000
secureChannelLifetime=300000

Keep-Alive Interval Tuning: Your 10-second keep-alive is reasonable, but it needs to work with your revised session timeout. The rule of thumb is that the keep-alive interval should be no more than 1/3 of the session timeout, so the session survives at least two missed keep-alives. With a 180-second timeout, anything up to 60 seconds is safe; 15-20 seconds leaves a wide buffer for network delays without excessive overhead.
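One way to see why the current 30s/10s pairing is fragile is to count how many consecutive keep-alives can be lost before the session dies. This is a rough model that ignores in-flight request traffic, which also renews the session:

```python
def missed_keepalives_tolerated(session_timeout_s, keepalive_s):
    # How many consecutive keep-alives can go missing before the session expires
    return session_timeout_s // keepalive_s - 1

# Original config: 30 s timeout / 10 s keep-alive -> only 2 missed keep-alives
print(missed_keepalives_tolerated(30, 10))
# Revised config: 180 s timeout / 20 s keep-alive -> 8 missed keep-alives of margin
print(missed_keepalives_tolerated(180, 20))
```

With the original settings, two delayed keep-alives during a load spike are enough to kill the session; the revised settings tolerate eight.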

Subscription vs Polling Trade-offs: Based on your BadSessionIdInvalid errors during high activity, your subscriptions are likely overwhelming the server. Shift changes create burst traffic when multiple work orders update simultaneously. Consider a hybrid approach: use subscriptions for critical status changes (Started, Completed, Failed) but poll less critical attributes every 30-60 seconds. This reduces the real-time burden on the server while maintaining responsiveness for important events.
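The split can be as simple as classifying each attribute once at startup. The attribute names below are illustrative, not AVEVA MES tag names:

```python
# Hypothetical tag plan for a hybrid approach; attribute names are illustrative
CRITICAL_EVENTS = frozenset({"Started", "Completed", "Failed"})

def plan_collection(attributes):
    # Subscribe to status-change events; poll everything else on a slow cycle
    subscribe = [a for a in attributes if a in CRITICAL_EVENTS]
    poll = [a for a in attributes if a not in CRITICAL_EVENTS]
    return subscribe, poll

subscribe, poll = plan_collection(["Started", "Quantity", "Operator", "Completed"])
# subscribe gets the event-driven updates; poll runs on a 30-60 s timer
```

Keeping the subscription set small means burst traffic at shift change is bounded by the handful of critical events, not every attribute on every work order.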

Network Firewall Configuration: Work with your network team to verify three things for port 4840: 1) Stateful connection tracking timeout matches or exceeds your OPC UA session timeout (set to 300 seconds), 2) Disable deep packet inspection or application-layer gateway features for OPC UA traffic - they interfere with binary protocol timing, 3) If using NAT, ensure connection tracking is per-session, not per-packet.

Exponential Backoff Reconnection: Implement this in your MES OPC UA client adapter. After a disconnect (session timeout or BadSessionIdInvalid), don’t reconnect immediately. A runnable Python sketch of the pattern:


import time

def backoff_delay(attempt):
    # min(60, 5 * 2^attempt): 5, 10, 20, 40, then capped at 60 seconds
    return min(60, 5 * 2 ** attempt)

def reconnect_with_backoff(connect, max_attempts=10):
    for attempt in range(max_attempts):
        time.sleep(backoff_delay(attempt))  # backing off prevents reconnection storms
        try:
            return connect()  # caller re-establishes the session; success resets the cycle
        except ConnectionError:
            print(f"reconnect attempt {attempt + 1} failed")  # log timestamps for analysis
    raise ConnectionError("reconnection failed after max_attempts")

For your specific 15-20 minute timeout pattern, I suspect it’s the combination of aggressive subscription publishing (500ms mentioned earlier) plus firewall state timeout mismatch. The server can handle it initially, but as work order updates accumulate, the processing queue builds up, keep-alive responses get delayed, and eventually the firewall drops the “idle” connection because it hasn’t seen bidirectional traffic within its timeout window.
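That interaction can be sketched with a deliberately crude toy model. Every number here is made up; the point is only to show how slowly growing keep-alive delay crossing a firewall idle timeout produces a failure after a consistent number of minutes:

```python
def first_drop_minute(firewall_idle_s, base_delay_s, growth_per_min_s, horizon_min=60):
    # Toy model: keep-alive response delay grows as the server's queue builds;
    # the firewall drops the "idle" connection once the delay exceeds its timeout
    for minute in range(1, horizon_min + 1):
        if base_delay_s + minute * growth_per_min_s > firewall_idle_s:
            return minute
    return None

# Illustrative numbers only: 30 s firewall idle timeout, delay growing 2 s/min
print(first_drop_minute(30, 1, 2))  # drops around minute 15
```

The fix attacks both sides: slower publishing flattens the delay growth, and a longer firewall state timeout raises the ceiling.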

Start with increasing session timeout to 180s and adjusting firewall state tracking to 300s. Then tune your subscription publishing intervals based on actual work order update frequency - 2-5 seconds is usually sufficient for status visibility. Monitor for a week and adjust from there. The exponential backoff will protect you during any remaining edge cases.