Bulk ECN update via API fails with transaction timeout during large batch processing

Our nightly batch job updates 500+ ECNs via the Agile 9.3.5 REST API, but we’re hitting transaction timeout errors after processing about 200 records. The API returns a timeout after 5 minutes and leaves the remaining ECNs unprocessed, causing data inconsistency between our systems.

Error after ~200 updates:
HTTP 504 Gateway Timeout
{"error": "Transaction timeout",
 "message": "Operation exceeded maximum execution time"}

We’re using a simple loop that PUTs each ECN update sequentially. Batch size optimization seems critical here, but we’re not sure whether to split the job into smaller batches or whether there’s a better strategy for handling API timeouts. Has anyone dealt with retry logic for large-scale ECN updates? What’s the recommended approach for batch processing without hitting these timeout limits?

You need to implement idempotent update logic. Before each batch, query the current state of those ECNs to determine which actually need updating. After a timeout, re-query to see which updates succeeded and only retry the failed ones. Don’t retry entire batches blindly. Also, log each ECN update result individually so you can track exactly where the failure occurred. Consider using a database or file to maintain processing state between batch runs.

Five hundred sequential API calls in a single transaction is definitely going to time out. You need to implement batch chunking: split your 500 ECNs into smaller batches of 50-100 and process each batch in a separate transaction. That way, if one batch fails, you can retry just that chunk without reprocessing everything. Also add a small delay between batches to avoid overwhelming the server.
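A minimal sketch of the chunking idea (the list of ECN IDs and the batch size of 50 are illustrative; the per-batch transaction and delay would wrap the loop over the returned chunks):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchChunker {
    // Split a list of ECN IDs into fixed-size chunks so each chunk can run
    // in its own transaction and be retried independently on failure.
    public static List<List<String>> chunk(List<String> ecnIds, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < ecnIds.size(); i += batchSize) {
            batches.add(ecnIds.subList(i, Math.min(i + batchSize, ecnIds.size())));
        }
        return batches;
    }
}
```

Each sublist is a view over the original list, so chunking 500 IDs costs almost nothing; you would process one chunk per transaction and `Thread.sleep(...)` between chunks if you want the pacing delay.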

The 5-minute timeout is likely configured at the application server level, not in Agile itself. Check your application server’s transaction timeout settings and increase it if possible. However, Jane’s approach is better - splitting into smaller batches is more resilient. You should also implement exponential backoff retry logic for failed batches. If a batch times out, wait a bit and retry up to 3 times before marking it as failed.

Your transaction timeout issue is a classic batch processing problem that requires addressing three key areas: batch size optimization, API timeout handling, and retry logic. Here’s a comprehensive solution:

Batch Size Optimization: The optimal batch size depends on several factors - ECN complexity, number of fields being updated, server load, and network latency. For Agile 9.3.5, I’ve found that batches of 50-75 ECNs work well for most scenarios. Larger batches risk timeouts; smaller batches add unnecessary overhead.

Implement dynamic batch sizing based on processing time. Start with batches of 50 and measure average processing time per ECN:

// Pseudocode - Dynamic batch processing:
1. Calculate batchSize = min(50, remainingECNs)
2. Process batch and measure totalTime
3. Calculate avgTimePerECN = totalTime / batchSize
4. Adjust next batchSize based on performance
5. If avgTimePerECN > threshold, reduce batchSize
// Target: Keep batch processing under 3 minutes
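The pseudocode above can be sketched in Java as follows (the threshold, growth step, and [10, 100] clamp are illustrative assumptions, not Agile-specific values):

```java
public class DynamicBatchSizer {
    // Shrink the batch when average per-ECN time exceeds a threshold,
    // grow it gently otherwise; clamp so batches stay in a sane range.
    public static int nextBatchSize(int currentSize, long totalTimeMs,
                                    long thresholdMsPerEcn) {
        long avgTimePerEcn = totalTimeMs / currentSize;
        int next = avgTimePerEcn > thresholdMsPerEcn
                ? currentSize / 2        // too slow: halve the batch
                : currentSize + 10;      // healthy: grow gently
        return Math.max(10, Math.min(next, 100));  // clamp to [10, 100]
    }
}
```

With a 3-second-per-ECN threshold, a 50-record batch that took 200 seconds (4 s/ECN) would shrink to 25, while one that took 100 seconds (2 s/ECN) would grow to 60. Halving on a slow batch but growing only additively keeps the size from oscillating wildly.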

Add a configurable delay between batches (30-60 seconds) to allow server resources to stabilize. This prevents resource exhaustion when processing large datasets.

API Timeout Handling: The 504 Gateway Timeout indicates your API calls are exceeding the server’s maximum execution time. Don’t try to increase the timeout - instead, work within the constraint. Implement timeout detection and graceful degradation:

  1. Set client-side request timeout slightly below server timeout (4.5 minutes if server is 5 minutes)
  2. When timeout occurs, immediately stop processing current batch
  3. Record the last successfully processed ECN ID
  4. Log timeout details for monitoring and analysis

For each API call within a batch, implement individual timeout handling:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(30))  // cap connection setup
    .build();
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create(ecnUpdateUrl))           // placeholder: your ECN endpoint; build() requires a URI
    .timeout(Duration.ofSeconds(45))         // per-request cap
    .PUT(bodyPublisher)                      // bodyPublisher: your serialized ECN payload
    .build();

This prevents a single slow ECN update from blocking the entire batch.

Retry Logic: Implement a sophisticated retry strategy that handles different failure scenarios:

  1. Transient Failures (timeouts, 503 errors): Retry with exponential backoff

    • First retry after 10 seconds
    • Second retry after 30 seconds
    • Third retry after 90 seconds
    • Max 3 retries before marking as failed
  2. Permanent Failures (400, 403 errors): Don’t retry, log and skip

  3. Partial Batch Completion: Query ECN states after timeout to determine which updates succeeded, then retry only failed ECNs
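The classification and backoff schedule above can be captured in a small policy class (the status codes and delays mirror the list; the actual update call is left out as a placeholder):

```java
public class RetryPolicy {
    // Backoff schedule from the list above: 10s, 30s, 90s, then give up.
    static final long[] BACKOFF_MS = {10_000, 30_000, 90_000};

    // Transient failures are worth retrying with backoff.
    public static boolean isTransient(int httpStatus) {
        return httpStatus == 503 || httpStatus == 504;
    }

    // Permanent failures should be logged and skipped, never retried.
    public static boolean isPermanent(int httpStatus) {
        return httpStatus == 400 || httpStatus == 403;
    }

    // Delay in ms before the given retry attempt (0-based); -1 means give up.
    public static long backoffFor(int attempt) {
        return attempt < BACKOFF_MS.length ? BACKOFF_MS[attempt] : -1;
    }
}
```

The caller loops: on a transient status, sleep for `backoffFor(attempt)` and retry; on a permanent status or a `-1` backoff, record the ECN as failed and move on.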

Maintain a processing state table with columns: ecn_id, status (pending/processing/completed/failed), attempt_count, last_attempt_time, error_message. This enables resumable batch processing.
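A minimal in-memory version of that state record might look like this (field names mirror the suggested columns; a real job would persist these rows to a database or file between runs):

```java
import java.time.Instant;

public class EcnState {
    enum Status { PENDING, PROCESSING, COMPLETED, FAILED }

    // Mirrors the suggested columns: ecn_id, status, attempt_count,
    // last_attempt_time, error_message.
    final String ecnId;
    Status status = Status.PENDING;
    int attemptCount = 0;
    Instant lastAttemptTime;
    String errorMessage;

    EcnState(String ecnId) { this.ecnId = ecnId; }

    // Record the outcome of one update attempt.
    void recordAttempt(boolean success, String error) {
        attemptCount++;
        lastAttemptTime = Instant.now();
        status = success ? Status.COMPLETED : Status.FAILED;
        errorMessage = error;
    }
}
```

On restart, the job loads these rows and processes only ECNs still in PENDING or FAILED (below the retry limit), which is what makes the batch resumable.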

Implementation Pattern: Use a checkpoint-based approach where you commit progress after each batch. If the job fails midway, resume from the last checkpoint rather than restarting from the beginning. Store checkpoint data including: last processed ECN ID, batch number, timestamp, and success count.

Implement idempotent updates by checking ECN state before updating. If an ECN is already in the target state, skip it. This prevents duplicate updates when retrying.
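One way to sketch that check, assuming both the current ECN state and the target values are available as field-name maps (how you fetch the current state depends on your integration):

```java
import java.util.Map;
import java.util.Objects;

public class IdempotentUpdate {
    // Skip the PUT when every target field already matches the current state.
    public static boolean needsUpdate(Map<String, String> current,
                                      Map<String, String> target) {
        return target.entrySet().stream()
                .anyMatch(e -> !Objects.equals(current.get(e.getKey()), e.getValue()));
    }
}
```

Guarding each PUT with `needsUpdate` makes blind retries safe: re-running a batch after a timeout only touches the ECNs whose updates did not land.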

Monitoring and Alerting: Track these metrics: batch processing time, ECNs processed per minute, timeout frequency, retry rate. Set alerts for: batch processing time exceeding 4 minutes, timeout rate above 5%, or retry rate above 10%.

This approach has successfully handled daily batches of 2000+ ECNs in production environments with minimal timeouts and robust error recovery.