Integration Hub workflow orchestration - real-time event processing vs batch

Looking for perspectives on Integration Hub workflow orchestration patterns. We’re syncing customer data between SAP CX, our ERP system, and a custom billing platform. Current implementation uses batch processing every 4 hours (about 2,500 records per batch), but business is pushing for real-time event-driven updates.

Real-time would improve data freshness significantly - sales reps would see current billing status instead of 4-hour-old data. But I’m concerned about system load and error handling complexity. Our ERP can handle maybe 50 API calls per minute before throttling kicks in. With event-driven architecture, we could easily exceed that during peak periods (new order entry happens in bursts).

What’s been your experience balancing latency requirements with throughput constraints? How do you handle retry logic when downstream systems are temporarily unavailable? Are there hybrid patterns that work better than pure real-time or pure batch?

Your 2,500 records per 4-hour batch is about 10 records/minute average load. If you go event-driven with burst patterns, you might see 200 events/minute during peak order entry (9-11am), then 2 events/minute overnight. The variance kills you. Implement time-based smoothing - collect events in 5-minute windows, then process each batch at a controlled rate. This gives you roughly 5-minute worst-case latency (much better than 4 hours) while preventing burst overload. Add dynamic scaling if queue depth exceeds thresholds.
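A minimal sketch of that smoothing idea in Python - the function names and the 45-calls/minute cap are my own illustrative assumptions, not an Integration Hub API:

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute collection window

def window_key(event_ts):
    # Bucket a Unix timestamp into the start of its 5-minute window.
    return event_ts - (event_ts % WINDOW_SECONDS)

def schedule_windows(events, max_calls_per_minute=45):
    """Group (timestamp, payload) events into 5-minute windows and split
    any oversized window into chunks the downstream system can absorb."""
    windows = defaultdict(list)
    for ts, payload in events:
        windows[window_key(ts)].append(payload)
    cap = max_calls_per_minute * (WINDOW_SECONDS // 60)  # max events per window
    schedule = []
    for start in sorted(windows):
        batch = windows[start]
        for i in range(0, len(batch), cap):
            schedule.append((start, batch[i:i + cap]))
    return schedule
```

Even if a 9am burst drops 300 events into one window, downstream only ever sees chunks it can absorb within the window.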

Monitor your actual event patterns before committing to architecture. We instrumented our systems and found 80% of our “urgent” data updates could wait 15-30 minutes with no business impact. Only 20% truly needed real-time. We implemented tiered processing: critical events (order status changes, payment failures) go real-time with priority queuing, everything else goes to a 15-minute micro-batch. This reduced API load by 70% while meeting business SLAs for time-sensitive data.

We faced identical challenges. Pure event-driven caused cascading failures when our ERP went down for maintenance - Integration Hub kept retrying and eventually backed up thousands of events. We implemented a circuit breaker pattern that automatically switches to batch mode when error rates exceed 10% over a 5-minute window. This prevented total system meltdown while maintaining real-time processing during normal operations. The hybrid approach has been rock solid for 14 months now.
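For anyone wanting to see the shape of that circuit breaker, here's a rough sketch - the class name, the sliding-window bookkeeping, and the minimum-sample guard are my own assumptions; the 10%-over-5-minutes trip condition comes from the post above:

```python
import time
from collections import deque

class CircuitBreaker:
    """Trip to batch mode when the error rate over a sliding window
    exceeds a threshold; resume real-time once the window recovers."""
    def __init__(self, threshold=0.10, window_seconds=300, min_samples=20):
        self.threshold = threshold
        self.window = window_seconds
        self.min_samples = min_samples  # avoid tripping on tiny sample sizes
        self.samples = deque()          # (timestamp, succeeded) pairs
        self.mode = "realtime"

    def record(self, succeeded, now=None):
        now = now if now is not None else time.time()
        self.samples.append((now, succeeded))
        # Drop samples that fell out of the sliding window.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()
        if len(self.samples) >= self.min_samples:
            errors = sum(1 for _, ok in self.samples if not ok)
            if errors / len(self.samples) > self.threshold:
                self.mode = "batch"
            else:
                self.mode = "realtime"
        return self.mode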

Throughput throttling is your real constraint. For your 50 API calls/minute ERP limit, implement a queue-based rate limiter in Integration Hub. Events get queued and processed at controlled rate (45 calls/minute to leave headroom). During burst periods, queue depth grows but you never overwhelm the ERP. We use Redis for the queue with 2-hour retention - if queue depth exceeds 1,000 events, alerts fire and we can temporarily increase batch processing or scale ERP capacity. This gives you near-real-time (under 2 minutes typically) without the instability of pure event-driven.
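A simplified version of that rate-limited queue with a simulated clock (the real one would sit on Redis as described; the class here is an in-memory assumption for illustration):

```python
from collections import deque

class RateLimitedQueue:
    """FIFO queue drained at a fixed rate; 45/minute leaves headroom
    under the 50/minute downstream (ERP) limit."""
    def __init__(self, rate_per_minute=45, alert_depth=1000):
        self.interval = 60.0 / rate_per_minute  # seconds between permitted calls
        self.alert_depth = alert_depth
        self.queue = deque()
        self.next_allowed = 0.0  # simulated-clock time of the next permitted send

    def enqueue(self, event):
        self.queue.append(event)
        return len(self.queue) > self.alert_depth  # True -> fire the depth alert

    def drain(self, now):
        # Send everything the rate budget allows at simulated time `now`.
        sent = []
        while self.queue and self.next_allowed <= now:
            sent.append(self.queue.popleft())
            self.next_allowed += self.interval
        return sent
```

During a burst, `enqueue` returns True once depth crosses 1,000 (your alert hook), while `drain` never exceeds the downstream budget no matter how deep the queue gets.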

After implementing Integration Hub orchestration across multiple enterprise clients, here’s my comprehensive analysis:

Event-Driven Architecture Patterns: Pure event-driven (synchronous processing on every data change) provides minimal latency but introduces significant complexity. Benefits include immediate data consistency and simpler business logic (no need to handle stale data). Challenges include handling burst traffic, managing downstream system failures, and ensuring exactly-once processing semantics.

For your scenario with ERP throttling at 50 calls/minute, pure event-driven is risky unless you implement sophisticated rate limiting and back-pressure mechanisms. The burst pattern you described (order entry peaks) will regularly exceed your capacity.

Best practice: Implement asynchronous event processing with queuing. Events publish to a message queue (Integration Hub supports this natively), and a consumer processes them at a controlled rate. This decouples event generation from processing, providing natural buffering.

Batch Processing Optimization: Your current 4-hour batch cycle is conservative but safe. Batch processing advantages: predictable load patterns, easier error handling (a failed batch can retry), simpler monitoring, lower API overhead. Disadvantages: data staleness, all-or-nothing processing (one bad record can block the entire batch), less responsiveness to urgent changes.

Optimization strategies: Reduce batch interval to 15-30 minutes (micro-batching), implement delta processing (only changed records), add parallel processing for large batches, use compression for network efficiency. Your 2,500 records could easily process in 15-minute windows with proper optimization.
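Delta processing is the highest-leverage item on that list. One way to sketch it - content-hash each record and only forward records whose hash changed since the last batch (the hashing scheme and function names are illustrative assumptions):

```python
import hashlib
import json

def fingerprint(record):
    # Stable content hash; sort_keys makes it independent of dict key order.
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def delta(records, last_hashes):
    """Return only records whose content changed since the previous batch,
    plus the hash state to carry forward into the next window."""
    changed, state = [], {}
    for rec in records:
        h = fingerprint(rec)
        state[rec["id"]] = h
        if last_hashes.get(rec["id"]) != h:
            changed.append(rec)
    return changed, state
```

If only 200 of your 2,500 records actually changed in a 15-minute window, this cuts downstream API calls by over 90% for that window.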

Latency and Throughput Trade-offs: This is the core architectural decision. Framework for analysis:

  1. Measure actual business requirement: What’s the true cost of 4-hour-old billing data? We found most “real-time” requirements are actually “within 15-30 minutes” when you dig into specific use cases.

  2. Calculate throughput capacity: Your ERP at 50 calls/minute = 3,000 calls/hour = 72,000 calls/day. Your current 2,500 records every 4 hours = 15,000 records/day. You have 4.8x headroom if you smooth the load.

  3. Profile your event distribution: Instrument your system to track events by hour. You’ll likely find 60-70% of events cluster in a 4-6 hour business window. This informs queue sizing and rate limiting.

  4. Define SLA tiers: Not all data needs the same latency. Critical events (payment failures, order cancellations) might need <5 minute processing. Standard updates (address changes, preference updates) can wait 30 minutes. Bulk imports can batch overnight.

Recommended approach: Implement priority-based hybrid processing. High-priority events (5% of volume) process real-time with dedicated capacity reservation. Medium-priority (30% of volume) process in 15-minute micro-batches. Low-priority (65% of volume) remain in 4-hour batches. This balances responsiveness with system stability.
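The routing layer for that tiering can be a simple lookup; the event type names below are illustrative assumptions, not a fixed Integration Hub taxonomy:

```python
# Event-type-to-tier mapping, maintained as configuration.
REALTIME = {"order_status_change", "payment_failure", "order_cancellation"}
MICRO_BATCH = {"customer_update", "billing_change", "preference_update"}

def route(event_type):
    if event_type in REALTIME:
        return "realtime"     # ~5% of volume, dedicated capacity
    if event_type in MICRO_BATCH:
        return "micro_batch"  # ~30% of volume, short windows
    return "batch"            # everything else stays on the 4-hour cycle
```

Defaulting unknown types to the batch tier is the safe failure mode: a new event type can never accidentally consume real-time capacity until someone classifies it.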

Error Handling and Retry Strategies: Robust error handling is what separates production-grade integrations from prototypes. Multi-layered strategy:

Layer 1 - Immediate Retry: Transient errors (network timeouts, temporary service unavailability) retry in place with short exponential backoff: 3 attempts over roughly 30 seconds. This handles 60-70% of errors.

Layer 2 - Delayed Retry: Persistent errors route to delayed retry queue. Retry every 15 minutes for 2 hours (8 attempts). Handles another 25% of errors as downstream systems recover.

Layer 3 - Dead Letter Queue: After 8 failed retries, route to dead letter queue for manual investigation. Alert on-call engineer if DLQ depth exceeds threshold. Implement admin UI for manual reprocessing after root cause resolution.
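Layers 1 through 3 compose into a single escalation path. A minimal sketch, assuming simple list-backed queues and a `call` that returns True on success (the function names, delays, and the `delayed_tries` counter are my own illustrative choices):

```python
def attempt_with_backoff(call, attempts=3, base_delay=5, sleep=None):
    """Layer 1: immediate retries with exponential backoff (5s, 10s ->
    ~30s total across 3 attempts). `call` returns True on success."""
    delays = [base_delay * (2 ** i) for i in range(attempts - 1)]  # [5, 10]
    for i in range(attempts):
        if call():
            return True
        if i < len(delays) and sleep:
            sleep(delays[i])  # injectable for testing; time.sleep in production
    return False

def process(event, call, delayed_retry_queue, dead_letter_queue, max_delayed=8):
    """Escalate: immediate retries -> 15-minute delayed queue -> DLQ."""
    if attempt_with_backoff(lambda: call(event)):
        return "ok"
    tries = event.get("delayed_tries", 0)
    if tries < max_delayed:
        event["delayed_tries"] = tries + 1
        delayed_retry_queue.append(event)  # re-driven every 15 minutes (Layer 2)
        return "delayed"
    dead_letter_queue.append(event)        # Layer 3: manual investigation
    return "dead_letter"
```

The key property is that an event can only reach the DLQ after exhausting both the fast and slow retry budgets, so the DLQ stays small enough for a human to triage.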

Layer 4 - Circuit Breaker: Monitor error rates across all integrations. If error rate exceeds 10% over 5-minute window, trip circuit breaker - pause event processing, alert operations team, automatically switch to batch mode if configured. Prevents cascading failures.

Layer 5 - Compensating Transactions: For critical events that fail after partial processing, implement compensating transactions to maintain data consistency. Example: if billing update succeeds but CRM update fails, either rollback billing or queue CRM update for guaranteed eventual consistency.
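The billing-succeeds/CRM-fails case from that example can be sketched like this - the API callables and queue are stand-ins for whatever your actual connectors look like:

```python
def sync_customer(update, billing_api, crm_api, compensation_queue):
    """Apply billing then CRM; if CRM fails after billing has committed,
    queue the CRM side for guaranteed retry rather than leaving the two
    systems permanently inconsistent."""
    if not billing_api(update):
        return "billing_failed"  # nothing applied; safe to retry the whole event
    if crm_api(update):
        return "ok"
    # Partial success: billing is committed, so drive the CRM write to
    # completion later (eventual-consistency path).
    compensation_queue.append(("crm_retry", update))
    return "compensating"
```

Whether you compensate forward (retry the CRM write) or backward (roll back billing) depends on which system is the source of truth; the sketch assumes forward recovery.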

Monitoring and Observability: Comprehensive monitoring dashboard should track:

  • Event processing latency (p50, p95, p99 percentiles)
  • Queue depth by priority tier
  • Error rates by integration endpoint
  • Throughput (events/minute) with burst detection
  • Downstream system health (response times, error rates)
  • Circuit breaker status
  • Dead letter queue depth with aging alerts

Implement correlation IDs that flow through entire integration chain. When troubleshooting, you need to trace a single customer update from SAP CX → Integration Hub → ERP → Billing with complete visibility at each hop.
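The correlation mechanics are simple enough to show in a few lines - a hypothetical event envelope, not an Integration Hub data model:

```python
import uuid

def new_event(payload):
    # Attach a correlation ID once, at the edge (SAP CX ingest).
    return {"correlation_id": str(uuid.uuid4()), "payload": payload, "hops": []}

def record_hop(event, system):
    # Each system appends itself, so the full path is reconstructable.
    event["hops"].append(system)
    return event

def trace(event):
    return event["correlation_id"], " -> ".join(event["hops"])
```

In practice the hop log goes to your central logging system keyed by the correlation ID, so grepping one ID reconstructs the whole SAP CX → Integration Hub → ERP → Billing journey.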

My Specific Recommendation: Implement a three-tier hybrid architecture:

Tier 1 - Real-time (5% of events): Order status changes, payment failures, critical customer updates. Process immediately with dedicated 5 calls/minute capacity reservation. Implement aggressive retry with alerts on failure.

Tier 2 - Micro-batch (30% of events): Standard customer updates, non-urgent billing changes. Collect in 10-minute windows, process batch at 20 calls/minute. Provides 10-minute average latency with smooth load profile.

Tier 3 - Traditional batch (65% of events): Bulk updates, historical corrections, non-time-sensitive data. Keep your 4-hour batch cycle with 25 calls/minute processing rate.

This architecture budgets your full 50 calls/minute (5 + 20 + 25), leaving no headroom - so implement dynamic allocation where Tier 3 yields capacity to Tiers 1 and 2 during peak periods. Add queue-based buffering so bursts queue rather than fail.
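One way that dynamic allocation could look - the borrow amounts and backlog thresholds below are illustrative assumptions to tune against your own queue metrics:

```python
BASE = {"realtime": 5, "micro_batch": 20, "batch": 25}  # calls/minute

def allocate(backlog, total=50):
    """Tier 3 yields capacity when Tier 1/2 queues back up, so the sum
    never exceeds the ERP's 50 calls/minute limit."""
    alloc = dict(BASE)
    borrowed = 0
    if backlog.get("realtime", 0) > 0:       # any realtime backlog is urgent
        alloc["realtime"] += 5
        borrowed += 5
    if backlog.get("micro_batch", 0) > 100:  # micro-batch queue building up
        alloc["micro_batch"] += 10
        borrowed += 10
    alloc["batch"] = max(0, alloc["batch"] - borrowed)
    assert sum(alloc.values()) <= total      # invariant: never exceed the ERP limit
    return alloc
```

Tier 3 can absorb the lost capacity because its SLA is measured in hours; the batch simply finishes a bit later on peak days.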

Implementation timeline: 6-8 weeks with proper testing. Start with Tier 2 micro-batching (easiest), add Tier 1 real-time for specific use cases, optimize Tier 3 batch last. This progressive approach reduces risk while delivering incremental business value.