Your transaction timeout issue is a classic batch processing problem that requires addressing three key areas: batch size optimization, API timeout handling, and retry logic. Here’s a comprehensive solution:
Batch Size Optimization:
The optimal batch size depends on several factors - ECN complexity, number of fields being updated, server load, and network latency. For Agile 9.3.5, I’ve found that batches of 50-75 ECNs work well for most scenarios. Larger batches risk timeout, smaller batches add unnecessary overhead.
Implement dynamic batch sizing based on processing time. Start with batches of 50 and measure average processing time per ECN:
// Pseudocode - dynamic batch sizing
1. Calculate batchSize = min(50, remainingECNs)
2. Process the batch and measure totalTime
3. Calculate avgTimePerECN = totalTime / batchSize
4. If avgTimePerECN > threshold, reduce the next batchSize; otherwise keep or gradually grow it
// Target: keep each batch under 3 minutes
Add a configurable delay between batches (30-60 seconds) to allow server resources to stabilize. This prevents resource exhaustion when processing large datasets.
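A minimal sketch of the adjustment step, assuming the 3-minute budget and the 50-75 range suggested above (the class and method names, and the lower bound of 10, are illustrative, not part of any Agile API):

```java
public class BatchSizer {
    static final long TARGET_MS = 180_000;  // keep each batch under 3 minutes
    static final int MIN = 10, MAX = 75;    // assumed bounds from the guidance above

    /** Pick the next batch size from the last batch's average time per ECN. */
    public static int nextBatchSize(int current, long avgMsPerEcn) {
        if (avgMsPerEcn <= 0) return current;            // no signal yet: keep size
        int fit = (int) (TARGET_MS / avgMsPerEcn);       // ECNs that fit the budget
        // Never more than double per step, and stay within the configured bounds.
        return Math.max(MIN, Math.min(MAX, Math.min(fit, current * 2)));
    }
}
```

Between calls, sleep for the configured 30-60 second inter-batch delay before processing the next slice.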
API Timeout Handling:
The 504 Gateway Timeout indicates your API calls are exceeding the server’s maximum execution time. Don’t try to increase the timeout - instead, work within the constraint. Implement timeout detection and graceful degradation:
- Set client-side request timeout slightly below server timeout (4.5 minutes if server is 5 minutes)
- When timeout occurs, immediately stop processing current batch
- Record the last successfully processed ECN ID
- Log timeout details for monitoring and analysis
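The stop-and-record behaviour above can be sketched as follows; `EcnUpdater` and `processUntilTimeout` are hypothetical names (the updater stands in for your actual PUT call), not Agile SDK methods:

```java
import java.net.http.HttpTimeoutException;
import java.util.List;

public class BatchRunner {
    public interface EcnUpdater { void update(String ecnId) throws Exception; }

    /** Process ids in order; stop at the first timeout and return the last
     *  successfully updated ECN id (null if none succeeded). */
    public static String processUntilTimeout(List<String> ecnIds, EcnUpdater updater) {
        String lastOk = null;
        for (String id : ecnIds) {
            try {
                updater.update(id);
                lastOk = id;
            } catch (HttpTimeoutException e) {
                // Timeout: stop the batch immediately and let the caller
                // checkpoint from lastOk.
                System.err.println("Timeout at " + id + "; last success: " + lastOk);
                break;
            } catch (Exception e) {
                // Non-timeout failure on one ECN: log it and continue.
                System.err.println("Failed " + id + ": " + e.getMessage());
            }
        }
        return lastOk;
    }
}
```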
For each API call within a batch, implement individual timeout handling:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(30))   // fail fast on connection problems
    .build();
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create(ecnUpdateUrl))            // your ECN update endpoint (required)
    .timeout(Duration.ofSeconds(45))          // per-request cap, well under server timeout
    .PUT(bodyPublisher)
    .build();
This prevents a single slow ECN update from blocking the entire batch.
Retry Logic:
Implement a sophisticated retry strategy that handles different failure scenarios:
- Transient failures (timeouts, 503 errors): retry with exponential backoff
  - First retry after 10 seconds
  - Second retry after 30 seconds
  - Third retry after 90 seconds
  - Max 3 retries before marking the ECN as failed
- Permanent failures (400, 403 errors): don't retry; log and skip
- Partial batch completion: query ECN states after a timeout to determine which updates succeeded, then retry only the failed ECNs
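The schedule and classification rules can be captured in a small policy class; `RetryPolicy` and its method names are illustrative, and the delays follow 10 * 3^(attempt-1) to reproduce 10s/30s/90s:

```java
public class RetryPolicy {
    public static final int MAX_RETRIES = 3;

    /** Seconds to wait before retry attempt n (1-based): 10, 30, 90. */
    public static long backoffSeconds(int attempt) {
        return 10L * (long) Math.pow(3, attempt - 1);
    }

    /** Transient per the rules above: timeouts and 503s are retried. */
    public static boolean isTransient(int httpStatus) {
        return httpStatus == 503 || httpStatus == 504;
    }

    /** Permanent per the rules above: 400/403 are logged and skipped. */
    public static boolean isPermanent(int httpStatus) {
        return httpStatus == 400 || httpStatus == 403;
    }
}
```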
Maintain a processing state table with columns: ecn_id, status (pending/processing/completed/failed), attempt_count, last_attempt_time, error_message. This enables resumable batch processing.
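In code, one row of that table might look like the record below (a sketch; the type names and the resumability rule are assumptions, and the actual table would live in your database):

```java
import java.time.Instant;

public class EcnState {
    public enum Status { PENDING, PROCESSING, COMPLETED, FAILED }

    /** One row of the processing-state table described above. */
    public record Row(String ecnId, Status status, int attemptCount,
                      Instant lastAttemptTime, String errorMessage) {
        /** Eligible for (re)processing: pending, or failed under the retry cap. */
        public boolean isResumable(int maxAttempts) {
            return status == Status.PENDING
                || (status == Status.FAILED && attemptCount < maxAttempts);
        }
    }
}
```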
Implementation Pattern:
Use a checkpoint-based approach where you commit progress after each batch. If the job fails midway, resume from the last checkpoint rather than restarting from the beginning. Store checkpoint data including: last processed ECN ID, batch number, timestamp, and success count.
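The resume step reduces to slicing the work list at the checkpointed ECN; a minimal sketch (`Checkpoint` and `remainingAfter` are hypothetical names):

```java
import java.util.List;

public class Checkpoint {
    /** Return the ECN ids still to process after lastProcessedId.
     *  A null or unknown checkpoint means start from the beginning. */
    public static List<String> remainingAfter(List<String> allIds, String lastProcessedId) {
        if (lastProcessedId == null) return allIds;   // fresh run, no checkpoint yet
        int idx = allIds.indexOf(lastProcessedId);
        return idx < 0 ? allIds : allIds.subList(idx + 1, allIds.size());
    }
}
```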
Implement idempotent updates by checking ECN state before updating. If an ECN is already in the target state, skip it. This prevents duplicate updates when retrying.
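A sketch of that check-before-write pattern, with `StatusStore` standing in for your read/write calls against Agile (all names here are illustrative):

```java
public class IdempotentUpdater {
    public interface StatusStore {
        String get(String ecnId);             // read current ECN status
        void put(String ecnId, String status); // write new ECN status
    }

    /** Apply the update only if the ECN is not already in the target state.
     *  Returns true when a write actually happened. */
    public static boolean updateIfNeeded(StatusStore store, String ecnId, String target) {
        if (target.equalsIgnoreCase(store.get(ecnId))) return false; // already done
        store.put(ecnId, target);
        return true;
    }
}
```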
Monitoring and Alerting:
Track these metrics: batch processing time, ECNs processed per minute, timeout frequency, retry rate. Set alerts for: batch processing time exceeding 4 minutes, timeout rate above 5%, or retry rate above 10%.
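Those thresholds condense into a single alert check; a sketch, with `BatchMetrics.shouldAlert` as an assumed helper name:

```java
public class BatchMetrics {
    /** Alert when: batch time > 4 minutes, timeout rate > 5%, or retry rate > 10%. */
    public static boolean shouldAlert(long batchMillis, int timeouts, int retries, int totalEcns) {
        if (totalEcns <= 0) return batchMillis > 4 * 60_000L;
        double timeoutRate = (double) timeouts / totalEcns;
        double retryRate = (double) retries / totalEcns;
        return batchMillis > 4 * 60_000L || timeoutRate > 0.05 || retryRate > 0.10;
    }
}
```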
This approach has successfully handled daily batches of 2000+ ECNs in production environments with minimal timeouts and robust error recovery.