Error handling patterns for complex procure-to-pay API workflows with multiple dependencies

I’m designing error handling for a complex P2P automation that orchestrates multiple REST API calls: create requisition → submit for approval → convert to PO → receive goods → create invoice → process payment. Each step depends on the previous one succeeding, and failures can occur at any point.

The challenge is building resilient error handling that supports retries without creating duplicate records, implements compensation logic when later steps fail (e.g., if payment fails, should we reverse the invoice?), maintains a complete audit trail of all attempts and outcomes, and coordinates all of this through middleware orchestration.

What patterns have others used for complex multi-step API workflows? Particularly interested in idempotency strategies, when to use compensation versus retry, and how to balance reliability with complexity.

For compensation logic, we follow the pattern: retry transient errors (network timeouts, 503 service unavailable), compensate business errors (validation failures, insufficient inventory). Transient errors get exponential backoff retry up to 5 attempts. Business errors trigger compensation - for example, if PO creation fails due to budget constraints, we cancel the approved requisition and notify the requester rather than leaving it in limbo.
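A minimal sketch of that split in Python - the status-code buckets and exception names are illustrative, not specific to any one API:

```python
# Transient errors get retried; business errors go straight to the
# compensation path. Which codes count as transient is a judgment call.
TRANSIENT_STATUSES = {408, 429, 502, 503, 504}

class TransientError(Exception):
    """Recoverable failure (timeout, 429, 503): retry with backoff."""

class BusinessError(Exception):
    """Non-recoverable failure (validation, budget): compensate and notify."""

def classify_failure(status_code: int, body: str) -> Exception:
    if status_code in TRANSIENT_STATUSES:
        return TransientError(f"HTTP {status_code}: {body}")
    return BusinessError(f"HTTP {status_code}: {body}")
```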

We log everything - full request/response payloads, timestamps, correlation IDs, error details. Storage is cheap and the audit value is enormous when troubleshooting. We use structured logging with consistent field names so we can easily query across workflow steps. For partial failures, we implement health checks that query Fusion to verify record state before proceeding to the next step. Don't trust the API response alone.
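A rough sketch of such a verification check, assuming correlation IDs live in a descriptive flexfield - the resource path, query syntax, and attribute name are placeholders to adapt to your Fusion REST version:

```python
import requests

def verify_record(session: requests.Session, base_url: str, resource: str,
                  correlation_id: str, dff_attribute: str = "AttributeChar1"):
    """Confirm a record really exists in Fusion before starting the next
    step, rather than trusting the previous API response alone."""
    resp = session.get(
        f"{base_url}/{resource}",  # e.g. a requisitions collection endpoint
        params={"q": f"{dff_attribute}='{correlation_id}'"},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return items[0] if items else None  # None means the create never landed
```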

Let me provide comprehensive guidance on error handling patterns for complex P2P workflows:

Idempotency and Retry Strategies: Implement idempotency at multiple levels. Use client-generated unique identifiers (UUIDs) as correlation IDs for each workflow instance. Pass these IDs in custom headers (X-Correlation-ID) with every API call. Before creating any record, query Fusion using the correlation ID to check whether it already exists. Oracle REST APIs support filtering by custom attributes, so store correlation IDs in descriptive flexfields. For retries, implement exponential backoff: wait 2 seconds after the first failure, 4 seconds after the second, 8 seconds after the third, up to a maximum of 60 seconds. Distinguish between retryable errors (429 rate limit, 503 service unavailable, network timeouts) and non-retryable errors (400 bad request, 401 unauthorized, 404 not found). Retryable errors get automatic retries; non-retryable errors trigger the compensation flow.
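A minimal sketch combining check-before-create with backoff - the header name, payload field, and `find_by_correlation_id` callback are illustrative, and `session` is assumed to be a pre-authenticated requests.Session:

```python
import time
import uuid

RETRYABLE_STATUSES = {429, 502, 503, 504}

def idempotent_create(session, url, payload, find_by_correlation_id,
                      max_attempts=5):
    """Create a record at most once: look for a prior success by
    correlation ID before each attempt, and retry only retryable
    statuses with exponential backoff."""
    # Where the ID lives in the payload (a DFF in our case) is up to you.
    cid = payload.setdefault("correlationId", str(uuid.uuid4()))
    for attempt in range(1, max_attempts + 1):
        existing = find_by_correlation_id(cid)
        if existing is not None:
            return existing  # an earlier attempt landed; don't duplicate
        resp = session.post(url, json=payload,
                            headers={"X-Correlation-ID": cid}, timeout=30)
        if resp.ok:
            return resp.json()
        if resp.status_code not in RETRYABLE_STATUSES or attempt == max_attempts:
            resp.raise_for_status()  # non-retryable: hand off to compensation
        time.sleep(min(2 ** attempt, 60))  # 2s, 4s, 8s ... capped at 60s
```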

Compensation Logic Design: Use the saga pattern for distributed transactions across multiple API calls. Define forward transactions (create requisition, create PO, etc.) and compensating transactions (cancel requisition, cancel PO, etc.) for each step. When a step fails, execute the compensating transactions in reverse order to restore the system to a consistent state. Example: if invoice creation fails after goods receipt, the compensation flow should: 1) Reverse the goods receipt, 2) Cancel the PO, 3) Cancel the requisition, 4) Notify stakeholders. Not all failures require full compensation - use business rules to determine compensation scope. For example, if payment processing fails due to a temporary bank outage, retry the payment rather than reversing the entire workflow. Document compensation decisions in a decision matrix that maps failure scenarios to compensation actions.
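A bare-bones saga runner along these lines, with the step implementations left as placeholders:

```python
class SagaStep:
    def __init__(self, name, action, compensate):
        self.name, self.action, self.compensate = name, action, compensate

def run_saga(steps, context):
    """Run forward transactions in order; on failure, run the
    compensating transactions of the completed steps in reverse."""
    completed = []
    try:
        for step in steps:
            step.action(context)
            completed.append(step)
    except Exception:
        for step in reversed(completed):
            try:
                step.compensate(context)
            except Exception:
                # A failed compensation can't be hidden; park it for
                # an operator to resolve manually.
                context.setdefault("manual_review", []).append(step.name)
        raise
```

Wiring it up is then just a list - SagaStep("requisition", create_requisition, cancel_requisition), SagaStep("po", create_po, cancel_po), and so on - where the step functions are your own API wrappers.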

Audit Trail Requirements: Implement comprehensive logging that captures: workflow instance ID, step name, timestamp (ISO-8601 format), API endpoint called, request payload (sanitized for sensitive data), response payload, HTTP status code, error messages, retry attempt number, and user/system context. Store logs in a structured format (JSON) for easy querying. Create separate audit tables for: workflow instances (high-level status), workflow steps (detailed step execution), API calls (request/response pairs), and errors (failures with stack traces). Retain audit data according to compliance requirements - typically 7 years for financial transactions. Implement log correlation using the workflow instance ID so you can trace the entire P2P cycle from requisition to payment.
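One such structured record might look like this - field names follow the list above, and the logger configuration is assumed:

```python
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("p2p.audit")

def audit(workflow_id, step, endpoint, status, attempt,
          error=None, request_payload=None, response_payload=None):
    """Emit one structured audit record per API call; payloads should
    be sanitized before they reach this function."""
    audit_log.info(json.dumps({
        "workflowInstanceId": workflow_id,   # correlates the whole P2P cycle
        "step": step,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO-8601
        "endpoint": endpoint,
        "httpStatus": status,
        "attempt": attempt,
        "error": error,
        "request": request_payload,
        "response": response_payload,
    }))
```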

Middleware Orchestration Patterns: Use event-driven architecture where possible. Instead of polling for approval completion, subscribe to Fusion approval events via webhooks. This reduces latency and API call volume. Implement a state machine pattern to manage workflow progression: define states (RequisitionCreated, ApprovalPending, POCreated, GoodsReceived, InvoiceCreated, PaymentProcessed), valid transitions, and actions at each transition. Persist the state machine's current state in durable storage (a database) so workflows can recover after middleware restarts. Use message queues for asynchronous processing - place each workflow step in a queue, process steps independently, and handle failures by moving messages to a dead-letter queue for manual review.
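A stripped-down sketch of the transition table and its guard - the store here is a plain dict standing in for a database table keyed by workflow instance ID:

```python
# Legal transitions for the P2P workflow. Anything not listed is
# rejected, which surfaces out-of-order or duplicate events early.
TRANSITIONS = {
    "New": {"RequisitionCreated"},
    "RequisitionCreated": {"ApprovalPending"},
    "ApprovalPending": {"POCreated"},
    "POCreated": {"GoodsReceived"},
    "GoodsReceived": {"InvoiceCreated"},
    "InvoiceCreated": {"PaymentProcessed"},
    "PaymentProcessed": set(),  # terminal state
}

def advance(store, workflow_id, new_state):
    """Validate and persist a state transition so a restarted
    orchestrator can pick up exactly where it left off."""
    current = store.get(workflow_id, "New")
    if new_state not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new_state}")
    store[workflow_id] = new_state
    return new_state
```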

Reliability vs Complexity Trade-offs: Start with simple retry logic and add complexity only when needed. Basic pattern: try the operation; if it fails with a retryable error, wait and retry up to 3 times; if it still fails, log the error and alert operations. This handles 80% of scenarios. Add compensation logic for business-critical workflows where partial completion is worse than no completion. Add sophisticated state management for long-running workflows (multi-day approvals). Avoid over-engineering - complex error handling can become a maintenance burden and a source of bugs.

Practical Implementation: Build a reusable workflow engine that handles the common patterns: idempotency checking, retry with backoff, compensation execution, audit logging, and state management. Configure workflows declaratively using JSON or YAML rather than hardcoding logic; this allows changing workflow steps without code changes. Implement the circuit breaker pattern to prevent cascading failures - if the Fusion API is consistently failing, stop sending requests temporarily and fail fast. Monitor workflow metrics: success rate, average duration, retry frequency, compensation frequency. Use these metrics to identify problematic steps and optimize.
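A compact circuit-breaker sketch, with illustrative thresholds:

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, reject calls for
    `cooldown` seconds, then let a single trial request through."""
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```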

Key Success Factors: Treat error handling as a first-class design concern, not an afterthought. Test failure scenarios explicitly - simulate network failures, API errors, validation failures, timeout scenarios. Build operator-friendly tools for workflow monitoring and manual intervention. Document error handling behavior clearly so support teams understand what happens when things go wrong. Review and refine error handling patterns based on production experience.

The goal is building resilient workflows that gracefully handle failures while maintaining data consistency and providing visibility into what happened.