Process mining data extraction: OData vs REST API for large-scale analytics

We’re implementing process mining for our OutSystems BPM workflows and need to extract large volumes of process execution data (5M+ events across 200K+ process instances). The question is whether to use OData feeds or REST API endpoints for data extraction.

OData’s filtering capabilities look promising for incremental extraction, but I’m concerned about performance at scale. REST API gives us more control over batch sizes and pagination, but we’d need to handle rate limiting and implement our own filtering logic.

Anyone with experience extracting large-scale process mining datasets? Particularly interested in OData filtering strategies and API rate limit handling for batch extractions.

Good point about indexing. We do have indexes on key fields. Kim, when you mention parallel extraction, how do you partition the date ranges without overlap or gaps? And what’s your approach to handling API rate limits?

After implementing process mining data extraction for multiple enterprise clients, here’s my comprehensive analysis addressing the three critical considerations:

OData Filtering Strategies:

OData is powerful for incremental extraction when used correctly, but requires understanding its limitations at scale:

Effective OData Filtering:

  • Use simple date range filters for incremental extraction: $filter=EventTimestamp ge 2024-11-26T00:00:00Z and EventTimestamp lt 2024-11-27T00:00:00Z
  • Combine with $top and $skip for pagination: $top=10000&$skip=0, then $skip=10000, etc.
  • Leverage $select to retrieve only required fields, reducing payload size by 50-70%
  • Index fields used in $filter clauses - this is critical for performance
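
A minimal sketch of this pattern in Python, using only the standard library. The feed URL and field names (`ProcessEventId`, `EventTimestamp`, etc.) are assumptions for illustration; substitute your exposed entity's actual endpoint and attributes:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://example.com/odata/ProcessEvents"  # hypothetical feed URL
BATCH_SIZE = 10_000
FIELDS = "ProcessEventId,ProcessInstanceId,Activity,EventTimestamp"  # assumed names

def build_params(day_start: str, day_end: str, skip: int) -> dict:
    """OData query options for one page of a single day's events."""
    return {
        "$filter": f"EventTimestamp ge {day_start} and EventTimestamp lt {day_end}",
        "$select": FIELDS,                 # only the fields process mining needs
        "$orderby": "ProcessEventId",      # stable order keeps $skip pages consistent
        "$top": BATCH_SIZE,
        "$skip": skip,
    }

def extract_day(day_start: str, day_end: str):
    """Yield record batches until a short (or empty) page signals the end."""
    skip = 0
    while True:
        url = BASE_URL + "?" + urllib.parse.urlencode(build_params(day_start, day_end, skip))
        with urllib.request.urlopen(url, timeout=120) as resp:
            batch = json.load(resp).get("value", [])
        if batch:
            yield batch
        if len(batch) < BATCH_SIZE:
            break
        skip += BATCH_SIZE
```

Note the `$orderby`: without a stable sort, `$skip` pagination can silently duplicate or drop rows between pages.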

OData Performance Limitations:

  • Complex filters with multiple OR conditions can prevent index use, degrading to table scans
  • Nested $expand operations exponentially increase response time
  • $count operations on large datasets (>1M records) can time out
  • Server-side filtering overhead increases with dataset size

For your 5M events scenario, OData works well for daily incremental extraction (typically 20-50K events per day). For historical bulk extraction, OData becomes inefficient beyond 500K records per query due to server-side processing overhead.

Best Practice: Use OData with date range + batch size limiting. Extract one day at a time with 10K record batches. This keeps queries fast and memory usage predictable.

API Rate Limits:

Rate limiting is your primary constraint with REST API batch extraction. OutSystems typically enforces limits at the web server and application tiers:

Typical Limits:

  • 100-500 requests per minute per API endpoint (varies by configuration)
  • 10-50 concurrent connections per client IP
  • Request timeout of 60-120 seconds for long-running queries
  • Payload size limits of 10-50MB per response

Rate Limit Handling Strategy:

  1. Implement Client-Side Rate Limiting: Track your request rate and proactively throttle before hitting server limits. Use a token bucket algorithm: maintain a counter of available requests, refill it at the allowed rate, and consume one token per request.

  2. Exponential Backoff for 429 Responses: When you receive rate limit errors, implement exponential backoff: wait 30s, then 60s, then 120s, max 300s before retry. Include jitter (random delay variation) to avoid thundering herd if multiple extraction jobs are running.

  3. Request Queuing: Build a request queue that respects rate limits. Process queue at controlled pace rather than firing requests as fast as possible.

  4. Batch Size Optimization: Larger batches mean fewer requests but longer processing time per request. We’ve found 25K-50K records per batch optimal - balances throughput with timeout risk.

  5. Parallel Extraction with Coordination: Run multiple extraction threads for different date ranges, but coordinate total request rate across all threads. Don’t let each thread independently max out rate limits.
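
Points 1 and 2 can be sketched in a few lines of plain Python (the 30s base and 300s cap come from the backoff schedule above; everything else is illustrative):

```python
import random
import time

class TokenBucket:
    """Client-side throttle: refill at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # wait for the next refill

def backoff_delay(attempt: int, base: float = 30.0, cap: float = 300.0) -> float:
    """Exponential backoff (30s, 60s, 120s, ... capped at 300s) with jitter."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.0)  # jitter avoids thundering herd

# Typical use: bucket.acquire() before every request; on an HTTP 429,
# time.sleep(backoff_delay(attempt)) and retry up to a maximum attempt count.
```

For coordinated parallel extraction (point 5), share one `TokenBucket` instance across all worker threads so the combined request rate, not each thread's rate, stays under the server limit.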

Batch Extraction Strategies:

For 5M events across 200K process instances, implement a multi-tier extraction strategy:

Tier 1 - Initial Historical Load: Use REST API with date-based partitioning. Divide your historical data into monthly chunks. Extract each month in parallel using separate jobs. Within each month, batch by day with 50K records per batch.

Partitioning logic:

  • Month 1: January 1-31 → 31 daily batches
  • Month 2: February 1-28 → 28 daily batches
  • Process months in parallel (4-6 concurrent month extractions)
  • Each day extracts in 50K record batches until complete
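
To answer the overlap/gap question directly: use half-open intervals, i.e. filter with `ge` on the window start and `lt` on the window end. Every boundary timestamp then belongs to exactly one partition. A sketch of the daily partitioning:

```python
from datetime import date, timedelta

def daily_partitions(start: date, end_exclusive: date):
    """Half-open [day, day+1) windows: every event falls in exactly one range."""
    day = start
    one = timedelta(days=1)
    while day < end_exclusive:
        yield day, day + one  # filter: Timestamp ge start and Timestamp lt end
        day += one
```

Because each window's end is the next window's start, adjacent partitions share a boundary but never a record, so parallel jobs can safely own disjoint date ranges.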

Tier 2 - Daily Incremental Updates: Use OData with simple date filtering for yesterday’s events. Run as a scheduled Timer at 2 AM daily, extracting the previous day’s events in 10K batches.

Tier 3 - Real-Time/Near Real-Time: If you need fresher data, implement hourly extraction using OData filtering on last hour’s events. This typically yields small result sets (100-5000 events) that extract quickly.

Practical Implementation Pattern:

Create an OutSystems Timer-based extraction framework:

  1. Extraction Control Table: Store extraction jobs with fields: JobId, StartDate, EndDate, Status, RecordsExtracted, LastProcessedId, ErrorCount

  2. Batch Processing Timer: Runs every 5 minutes, picks up pending jobs, processes one batch per job per execution

  3. Watermark Tracking: As Carlos mentioned, maintain highest extracted ProcessEventId. This handles late-arriving events that might have old timestamps but were created recently.

  4. Error Handling: Implement retry logic with maximum attempts. Failed batches go to error queue for manual review. Don’t let one bad batch block entire extraction.

  5. Monitoring Dashboard: Track extraction progress, throughput (events/hour), error rates, API response times. Alert on anomalies.
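
The control-table state machine (points 1, 3, and 4) looks roughly like this in pseudocode-style Python. The job fields mirror the Extraction Control Table above; `fetch_batch` is a placeholder for whatever REST/OData call you wire in:

```python
from dataclasses import dataclass

MAX_ERRORS = 3
BATCH = 10_000

@dataclass
class ExtractionJob:
    job_id: int
    start: str
    end: str
    status: str = "Pending"        # Pending -> Running -> Done / Failed
    records_extracted: int = 0
    last_processed_id: int = 0     # watermark for restart-safe extraction
    error_count: int = 0

def process_one_batch(job: ExtractionJob, fetch_batch) -> None:
    """One timer tick: fetch the next page after the job's watermark.

    `fetch_batch(after_id, start, end, limit)` is a hypothetical stand-in for
    the actual API call; it returns event records with a `ProcessEventId`.
    """
    try:
        events = fetch_batch(job.last_processed_id, job.start, job.end, BATCH)
    except Exception:
        job.error_count += 1
        if job.error_count >= MAX_ERRORS:
            job.status = "Failed"  # park in the error queue for manual review
        return
    if not events:
        job.status = "Done"
        return
    job.status = "Running"
    job.records_extracted += len(events)
    job.last_processed_id = max(e["ProcessEventId"] for e in events)
```

Because the watermark is persisted per batch, a timer timeout or crash just resumes from `last_processed_id` on the next run, and one poisoned batch fails only its own job.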

Performance Comparison from Real Implementation:

For 5M event extraction:

  • Pure OData: 8-12 hours (limited by server-side query processing)
  • Pure REST API: 4-6 hours (parallel batch extraction, rate-limited)
  • Hybrid (REST for bulk, OData for incremental): 4 hours initial + 5 minutes daily

Recommendation:

Implement hybrid approach: Use REST API with parallel date-partitioned batch extraction for your initial 5M event load. Once current, switch to OData for daily incremental extraction. This gives you speed for bulk loads and simplicity for ongoing updates. Monitor your extraction metrics and adjust batch sizes based on observed API performance - every environment has different optimal parameters.

Don’t forget about incremental extraction strategies beyond just date filtering. We use watermark columns - track the highest ProcessEventId extracted, then query for events with ID greater than watermark. This handles late-arriving events better than pure date-based extraction. Combine with date range as safety check. For batch sizing, test with your actual data - we found 25K records optimal for our network/server combo, but YMMV.
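
A minimal sketch of the watermark-plus-date-safety pattern described above (field names assumed, as before):

```python
def incremental_filter(watermark_id: int, window_start: str) -> str:
    """OData $filter combining an ID watermark with a date-range safety check.

    The ID comparison catches late-arriving events regardless of their
    timestamps; the date bound keeps the query from scanning the whole table.
    """
    return (f"ProcessEventId gt {watermark_id} and "
            f"EventTimestamp ge {window_start}")

def advance_watermark(current: int, batch: list) -> int:
    """New watermark = highest ProcessEventId seen so far (unchanged if empty)."""
    return max([current] + [e["ProcessEventId"] for e in batch])
```

Persist the watermark only after the batch is safely stored, so a failed write never advances it past unextracted events.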

Consider using OutSystems’ built-in batch processing capabilities rather than building custom extraction logic. The Timer mechanism with BPT (Business Process Technology) can handle large-scale extraction reliably. Set up a timer that processes extraction in configurable batch sizes, maintains state between runs, and handles failures gracefully. This integrates better with OutSystems monitoring and doesn’t require external orchestration.