Travel expense approval workflow times out during automated deployments

Our travel expense approval workflows are timing out consistently after automated deployments in ICS 2022. The issue manifests within 2-3 hours post-deployment when approval volumes pick up. We’re seeing database connection pool exhaustion and workflow batch optimization failures.

The timeout configuration appears correct at 120 seconds, but workflows are failing at around 90 seconds during the approval routing phase. Error logs show:


WorkflowTimeout: Approval routing exceeded 90s threshold
ConnectionPoolExhausted: Unable to acquire connection after 30s

Our pre-deployment load testing shows the workflow handling 200 concurrent approvals without issue in staging, but production fails at around 150 concurrent workflows post-deployment. The database connection pool settings seem adequate (maxActive=100, maxIdle=50), yet we’re still hitting exhaustion.

Could this be related to how the deployment process reinitializes the workflow engine? Looking for insights on proper connection pool configuration for post-deployment workflow stability.

We had identical timeout issues in ICS 2022 travel expense workflows. The problem was workflow batch optimization settings not being compatible with our deployment process. The workflow engine batches approval requests for efficiency, but post-deployment the batch processor wasn’t scaling correctly. We modified the batch size from 50 to 20 for the first two hours after deployment, then gradually increased it. Also implemented a connection pool pre-warming script that runs before the workflow engine fully initializes. These two changes eliminated our timeout issues completely.
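A rough sketch of the gradual batch-size ramp described above, assuming a hypothetical `set_batch_size()` hook into the workflow engine's admin API (substitute whatever interface your ICS install actually exposes):

```python
import time

def batch_ramp_schedule(start=20, target=50, step=10, ramp_minutes=120):
    """Return (delay_seconds, batch_size) pairs: hold `start` right after
    deployment, then step back up to `target` over `ramp_minutes`."""
    sizes = list(range(start, target + 1, step))   # e.g. [20, 30, 40, 50]
    interval = ramp_minutes * 60 / len(sizes)      # seconds between steps
    return [(i * interval, size) for i, size in enumerate(sizes)]

def apply_ramp(set_batch_size, schedule):
    """Walk the schedule, calling the engine's (hypothetical) admin hook."""
    t0 = time.monotonic()
    for delay, size in schedule:
        time.sleep(max(0.0, t0 + delay - time.monotonic()))
        set_batch_size(size)
```

The exact step sizes matter less than the shape: start small while caches are cold, then let the batch size grow back to its steady-state value.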

Your load testing methodology might be flawed. Testing 200 concurrent approvals in staging doesn’t replicate the post-deployment state where the workflow engine cache is cold. The first few hours after deployment are always the most vulnerable because workflow definitions, user role caches, and approval routing tables all need to be rebuilt in memory. Your 150-workflow failure threshold suggests the cache warming is causing the bottleneck, not the raw processing capacity. Try load testing immediately after a staging deployment to see if you can reproduce the issue.

Excellent troubleshooting suggestions from everyone. After systematically testing each theory, I’ve identified and resolved the issue. It was a combination of connection pool initialization problems and workflow batch optimization misconfiguration.

Root Cause Analysis:

The core issue was that our deployment process wasn’t properly coordinating the workflow engine restart with database connection pool initialization. The workflow engine was accepting approval requests before the connection pool had scaled to operational capacity.

Database Connection Pooling Solution:

Implemented a pre-warming strategy in our deployment automation:

-- Connection pool warm-up script
SELECT COUNT(*) FROM workflow_definitions;
SELECT COUNT(*) FROM approval_routing_rules;
SELECT COUNT(*) FROM user_role_assignments;

These queries force connection establishment and cache population before the workflow engine goes live. We also adjusted the pool configuration:

  • minIdle increased from 20 to 40
  • initialSize set to 40 (was defaulting to 10)
  • maxWait reduced from 30s to 10s (fail faster rather than queue)
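A minimal sketch of the pre-warming idea, using sqlite3 as a stand-in database so it runs anywhere; in production the `connect` factory would come from your actual pool/driver, and `size` would match the `initialSize` above:

```python
import sqlite3

WARMUP_QUERIES = [
    "SELECT COUNT(*) FROM workflow_definitions",
    "SELECT COUNT(*) FROM approval_routing_rules",
    "SELECT COUNT(*) FROM user_role_assignments",
]

def prewarm_pool(connect, size=40, queries=WARMUP_QUERIES):
    """Open `size` connections up front and run the warm-up queries on
    each, so the pool starts at operational capacity instead of scaling
    up under live approval traffic. `connect` is any zero-argument
    factory returning a DB-API connection."""
    conns = [connect() for _ in range(size)]
    for conn in conns:
        cur = conn.cursor()
        for query in queries:
            cur.execute(query)
            cur.fetchone()
    return conns  # hand these back to the pool / leave them idle
```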

Workflow Batch Optimization:

Modified the post-deployment batch processing strategy. For the first 90 minutes after deployment:

  • Batch size: 15 (down from 50)
  • Batch interval: 10 seconds (up from 5 seconds)
  • Max concurrent batches: 5 (down from 10)

This gives the workflow engine breathing room to build its caches without overwhelming the connection pool.
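The throttle window can be expressed as a simple profile switch, assuming the engine lets you push these three settings at runtime (the hook names here are hypothetical):

```python
# Values from the deployment-aware profile described above.
THROTTLED = {"batch_size": 15, "batch_interval_s": 10, "max_concurrent_batches": 5}
NORMAL    = {"batch_size": 50, "batch_interval_s": 5,  "max_concurrent_batches": 10}

def batch_profile(minutes_since_deploy, window_minutes=90):
    """Return the reduced batch profile inside the post-deployment
    window, and the steady-state profile afterwards."""
    return THROTTLED if minutes_since_deploy < window_minutes else NORMAL
```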

Timeout Configuration:

We actually increased the timeout to 180 seconds for the first two hours post-deployment; it then automatically reverts to 120 seconds. This accommodates the cold-cache performance profile without masking genuine issues once the system stabilizes.
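One way to implement the automatic reversion is a one-shot timer fired from the deployment script; `set_timeout` here is a hypothetical hook into the engine's runtime configuration:

```python
import threading

def schedule_timeout_revert(set_timeout, cold_s=180, steady_s=120,
                            window_s=2 * 60 * 60):
    """Raise the workflow timeout immediately after deployment, then
    revert it automatically once the cold-cache window ends.
    `set_timeout` is a hypothetical runtime-configuration hook."""
    set_timeout(cold_s)
    timer = threading.Timer(window_s, set_timeout, args=(steady_s,))
    timer.daemon = True
    timer.start()
    return timer
```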

Load Testing Improvements:

Revised our load testing to include a “post-deployment simulation” scenario. We now restart the workflow engine in staging, wait 60 seconds, then immediately hit it with production-level load. This revealed that our previous testing was unrealistic: we were testing against a warm system that had been running for hours.
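A sketch of that cold-start scenario; both callbacks are hypothetical hooks (`restart_engine()` triggers the staging redeploy, `submit_approval(i)` submits one approval and returns success/failure):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cold_start_load_test(restart_engine, submit_approval, load=150,
                         settle_seconds=60):
    """Restart the staging engine, wait briefly, then fire
    production-level concurrent load against the cold caches."""
    restart_engine()
    time.sleep(settle_seconds)
    with ThreadPoolExecutor(max_workers=load) as pool:
        results = list(pool.map(submit_approval, range(load)))
    return {"submitted": load, "failed": results.count(False)}
```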

Results:

After implementing these changes across three deployments:

  • Zero workflow timeouts in the critical first 3 hours post-deployment
  • Connection pool utilization peaks at 65% (was hitting 100%)
  • Average approval routing time: 2.3 seconds (was 45+ seconds post-deployment)
  • Successfully handled 220 concurrent approvals 30 minutes after deployment

The key insight: automated deployments need deployment-aware configuration profiles, not just static production settings. The first few hours after deployment represent a distinct operational state that requires specific tuning.

Check if your deployment is properly shutting down existing workflow instances before starting new ones. We discovered that our automated deployment was leaving orphaned workflow threads that held onto database connections. These zombie threads weren’t visible in normal monitoring but consumed connection pool resources. After adding explicit workflow engine shutdown commands to our deployment script with a 60-second drain period, the connection pool exhaustion disappeared. The drain period lets active workflows complete before the engine restarts.
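The drain logic is easy to sketch; `active_count` and `stop_engine` are hypothetical hooks into the workflow engine, so adapt the names to whatever your deployment tooling can call:

```python
import time

def drain_and_shutdown(active_count, stop_engine, drain_seconds=60,
                       poll_interval=1.0):
    """Stop accepting new work, wait up to `drain_seconds` for in-flight
    workflows to finish, then stop the engine so no orphaned threads
    keep holding database connections.

    Returns True if all workflows drained before the deadline."""
    deadline = time.monotonic() + drain_seconds
    while active_count() > 0 and time.monotonic() < deadline:
        time.sleep(poll_interval)
    stop_engine()
    return active_count() == 0
```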

Connection pool exhaustion post-deployment usually indicates the pool isn’t warming up properly. When the workflow engine restarts, it initializes with minimum connections (typically 10-20), not the configured maxActive. Under sudden load, the pool tries to scale up but can’t create connections fast enough. Add a post-deployment warm-up script that pre-establishes connections before enabling workflow processing. Also check your connection validation query: if it’s slow, connection acquisition becomes a bottleneck.
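A quick way to check that last point is to time the validation query directly; this uses sqlite3 so it runs anywhere, but in practice you would point it at the same query your pool runs on checkout:

```python
import sqlite3
import time

def validation_query_latency(conn, query="SELECT 1", samples=100):
    """Average the validation query's runtime over `samples` runs. If
    the pool runs this query on every connection checkout, a slow
    query throttles acquisition across the whole pool."""
    cur = conn.cursor()
    start = time.perf_counter()
    for _ in range(samples):
        cur.execute(query)
        cur.fetchone()
    return (time.perf_counter() - start) / samples  # seconds per check
```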