After implementing process mining data extraction for multiple enterprise clients, here’s my comprehensive analysis addressing the three critical considerations:
OData Filtering Strategies:
OData is powerful for incremental extraction when used correctly, but requires understanding its limitations at scale:
Effective OData Filtering:
- Use simple date range filters for incremental extraction: $filter=EventTimestamp ge 2024-11-26T00:00:00Z and EventTimestamp lt 2024-11-27T00:00:00Z
- Combine with $top and $skip for pagination: $top=10000&$skip=0, then $skip=10000, etc.
- Leverage $select to retrieve only required fields - in our experience this often cuts payload size by 50-70%
- Index fields used in $filter clauses - this is critical for performance
OData Performance Limitations:
- Complex filters with many OR conditions can defeat the server-side query optimizer
- Nested $expand operations multiply joins and sharply increase response time
- $count operations on large datasets (>1M records) can time out
- Server-side filtering overhead grows with dataset size
For your 5M events scenario, OData works well for daily incremental extraction (typically 20-50K events per day). For historical bulk extraction, OData becomes inefficient beyond 500K records per query due to server-side processing overhead.
Best Practice: Use OData with date range + batch size limiting. Extract one day at a time with 10K record batches. This keeps queries fast and memory usage predictable.
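A minimal Python sketch of this pattern - the endpoint URL and field names are illustrative placeholders, not the actual OutSystems entity names; substitute your own:

```python
from urllib.parse import urlencode

# Hypothetical OData endpoint and field names, for illustration only.
BASE_URL = "https://example.com/odata/ProcessEvents"
PAGE_SIZE = 10_000

def page_url(day_start: str, day_end: str, skip: int) -> str:
    """Build one paginated, filtered OData request URL for a single day's window."""
    params = {
        "$filter": f"EventTimestamp ge {day_start} and EventTimestamp lt {day_end}",
        "$select": "ProcessEventId,ProcessInstanceId,Activity,EventTimestamp",
        "$top": PAGE_SIZE,   # fixed batch size keeps memory predictable
        "$skip": skip,       # advance by PAGE_SIZE until an empty page comes back
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

The client loops, incrementing `skip` by `PAGE_SIZE` and stopping on the first empty page, then moves to the next day's window.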
API Rate Limits:
Rate limiting is your primary constraint with REST API batch extraction. OutSystems typically enforces limits at the web server and application tiers:
Typical Limits:
- 100-500 requests per minute per API endpoint (varies by configuration)
- 10-50 concurrent connections per client IP
- Request timeout of 60-120 seconds for long-running queries
- Payload size limits of 10-50MB per response
Rate Limit Handling Strategy:
1. Implement Client-Side Rate Limiting: Track your request rate and proactively throttle before hitting server limits. Use a token bucket algorithm - maintain a counter of available requests, refill it at the allowed rate, and consume one token per request.
2. Exponential Backoff for 429 Responses: When you receive rate limit errors, back off exponentially: wait 30s, then 60s, then 120s, up to a 300s maximum before retrying. Include jitter (random delay variation) to avoid a thundering herd if multiple extraction jobs are running.
3. Request Queuing: Build a request queue that respects rate limits. Process the queue at a controlled pace rather than firing requests as fast as possible.
4. Batch Size Optimization: Larger batches mean fewer requests but longer processing time per request. We've found 25K-50K records per batch optimal - it balances throughput against timeout risk.
5. Parallel Extraction with Coordination: Run multiple extraction threads for different date ranges, but coordinate the total request rate across all threads. Don't let each thread independently max out the rate limits.
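Items 1 and 2 can be sketched in a few lines of Python. The constants mirror the numbers above (30s base, 300s cap, 10% jitter); everything else is an illustrative sketch, not production code:

```python
import random
import time

class TokenBucket:
    """Client-side limiter: tokens refill at the allowed rate; one token per request."""

    def __init__(self, rate_per_minute, capacity=None):
        self.rate = rate_per_minute / 60.0          # tokens per second
        self.capacity = capacity if capacity else rate_per_minute
        self.tokens = float(self.capacity)          # start full
        self.last = time.monotonic()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

def backoff_delay(attempt, base=30.0, cap=300.0):
    """429 handling: 30s, 60s, 120s... capped at 300s, plus up to 10% jitter."""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.1)  # jitter avoids thundering herd
```

Call `bucket.acquire()` before every API request, and on a 429 response sleep for `backoff_delay(attempt)` before retrying.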
Batch Extraction Strategies:
For 5M events across 200K process instances, implement a multi-tier extraction strategy:
Tier 1 - Initial Historical Load:
Use REST API with date-based partitioning. Divide your historical data into monthly chunks. Extract each month in parallel using separate jobs. Within each month, batch by day with 50K records per batch.
Partitioning logic:
- Month 1: January 1-31 → 31 daily batches
- Month 2: February 1-28 → 28 daily batches
- Process months in parallel (4-6 concurrent month extractions)
- Each day extracts in 50K record batches until complete
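The partitioning logic above can be sketched as follows. The `extract_day` callable is a hypothetical stand-in for whatever pulls one day's events in 50K-record batches:

```python
from calendar import monthrange
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def month_days(year, month):
    """All (day_start, day_end) windows for one month, one per calendar day."""
    n_days = monthrange(year, month)[1]
    first = date(year, month, 1)
    return [(first + timedelta(days=i), first + timedelta(days=i + 1))
            for i in range(n_days)]

def extract_month(year, month, extract_day):
    """Walk one month day by day; extract_day pulls each day in 50K-record batches."""
    for day_start, day_end in month_days(year, month):
        extract_day(day_start, day_end)

def extract_history(months, extract_day, workers=4):
    """Run up to `workers` month extractions in parallel (4-6 is the sweet spot)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(extract_month, y, m, extract_day) for y, m in months]
        for f in futures:
            f.result()   # surface any per-month extraction errors
```

Note that the per-thread `extract_day` calls should still share one rate limiter so the parallel months don't collectively exceed the API limits.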
Tier 2 - Daily Incremental Updates:
Use OData with a simple date filter for yesterday's events. Run it as a scheduled Timer at 2 AM daily, extracting the previous day's events in 10K-record batches.
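Generating the daily filter is trivial; a sketch, reusing the EventTimestamp field name from the earlier examples:

```python
from datetime import date, timedelta

def yesterday_filter(today=None):
    """OData $filter string covering the previous full day (UTC)."""
    today = today or date.today()
    start = today - timedelta(days=1)
    return (f"EventTimestamp ge {start.isoformat()}T00:00:00Z "
            f"and EventTimestamp lt {today.isoformat()}T00:00:00Z")
```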
Tier 3 - Real-Time/Near Real-Time:
If you need fresher data, implement hourly extraction using an OData filter on the last hour's events. This typically yields small result sets (100-5,000 events) that extract quickly.
Practical Implementation Pattern:
Create an OutSystems Timer-based extraction framework:
1. Extraction Control Table: Store extraction jobs with fields JobId, StartDate, EndDate, Status, RecordsExtracted, LastProcessedId, and ErrorCount
2. Batch Processing Timer: Runs every 5 minutes, picks up pending jobs, and processes one batch per job per execution
3. Watermark Tracking: As Carlos mentioned, maintain the highest extracted ProcessEventId. This handles late-arriving events that may carry old timestamps but were created recently.
4. Error Handling: Implement retry logic with a maximum number of attempts. Failed batches go to an error queue for manual review. Don't let one bad batch block the entire extraction.
5. Monitoring Dashboard: Track extraction progress, throughput (events/hour), error rates, and API response times. Alert on anomalies.
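Tying the control table, watermark, and error handling together, one timer tick per job might look like this sketch (field names mirror the control table above; `fetch_batch` is a hypothetical stand-in for the API call):

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3

@dataclass
class ExtractionJob:
    """Mirrors one control-table row (field names are illustrative)."""
    job_id: int
    status: str = "Pending"        # Pending / Running / Done / Error
    last_processed_id: int = 0     # watermark: highest extracted ProcessEventId
    records_extracted: int = 0
    error_count: int = 0

def process_one_batch(job, fetch_batch):
    """One timer tick: fetch events past the watermark, advance it, handle errors."""
    try:
        rows = fetch_batch(after_id=job.last_processed_id)  # hypothetical fetcher
        if not rows:
            job.status = "Done"
            return
        job.last_processed_id = max(r["ProcessEventId"] for r in rows)
        job.records_extracted += len(rows)
        job.status = "Running"
    except Exception:
        job.error_count += 1
        if job.error_count >= MAX_ATTEMPTS:
            job.status = "Error"   # route to the error queue for manual review
```

Because the watermark only advances after a batch succeeds, a failed batch is simply retried from the same position on the next tick, and one bad job never blocks the others.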
Performance Comparison from Real Implementation:
For 5M event extraction:
- Pure OData: 8-12 hours (limited by server-side query processing)
- Pure REST API: 4-6 hours (parallel batch extraction, rate-limited)
- Hybrid (REST for bulk, OData for incremental): 4 hours initial + 5 minutes daily
Recommendation:
Implement the hybrid approach: use the REST API with parallel, date-partitioned batch extraction for your initial 5M event load. Once caught up, switch to OData for daily incremental extraction. This gives you speed for bulk loads and simplicity for ongoing updates. Monitor your extraction metrics and adjust batch sizes based on observed API performance - every environment has different optimal parameters.