Let me provide a comprehensive solution addressing ETL memory allocation, event log preprocessing, and server resource scaling for your scenario.
ETL Memory Allocation:
First, optimize your Java heap configuration for the ETL engine:
-Xms8192m -Xmx12288m       # initial 8 GB, max 12 GB heap
-XX:+UseG1GC               # G1 garbage collector handles large heaps with shorter pauses
-XX:MaxGCPauseMillis=200   # target max 200 ms GC pauses
Configure the ETL batch processing parameters:
ETLConfig.BatchSize = 75000 // Process 75K events per batch
ETLConfig.CommitInterval = 50000 // Commit every 50K events
ETLConfig.EnableStreaming = true // Enable streaming mode
Event Log Preprocessing Strategy:
Before importing into Creatio, implement a preprocessing pipeline:
Data Reduction (typically reduces size by 35-45%):
- Remove debug/system events not relevant to process analysis
- Eliminate redundant attributes (keep only: CaseID, Activity, Timestamp, Resource, essential business attributes)
- Consolidate events: If you have multiple events for the same activity within seconds, merge them
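The consolidation step can be sketched as a small dedup pass. This is a minimal example, not Creatio's own logic; it assumes each event is a dict with `CaseID`, `Activity`, and an already-parsed `Timestamp` (the field names are illustrative):

```python
from datetime import datetime, timedelta

def consolidate_events(events, window_seconds=5):
    """Drop repeated (CaseID, Activity) events that fall within a short
    window of the last kept occurrence, keeping the earliest timestamp."""
    merged = []
    for ev in sorted(events, key=lambda e: (e["CaseID"], e["Timestamp"])):
        if merged:
            last = merged[-1]
            same = (last["CaseID"] == ev["CaseID"]
                    and last["Activity"] == ev["Activity"])
            close = ev["Timestamp"] - last["Timestamp"] <= timedelta(seconds=window_seconds)
            if same and close:
                continue  # near-duplicate within the window: drop it
        merged.append(ev)
    return merged
```

Tune `window_seconds` to whatever gap your source system produces for duplicate writes; too large a window risks merging genuinely distinct repetitions of an activity.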
Temporal Partitioning:
- Split your 2.5M event log into quarterly chunks (approximately 625K events each)
- Naming convention: EventLog_2024_Q1.csv, EventLog_2024_Q2.csv, etc.
- Import each quarter separately, then use Creatio’s process mining merge feature
Data Quality Checks (prevent import failures):
- Validate timestamp formats are consistent
- Ensure CaseIDs don’t have special characters that cause parsing issues
- Check for null values in critical fields (CaseID, Activity, Timestamp)
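These checks can be implemented as a per-row validator run before import. A sketch, assuming the three required columns above and one agreed timestamp format (the "safe" CaseID character set is an assumption — widen it to match your IDs):

```python
import re
from datetime import datetime

REQUIRED = ("CaseID", "Activity", "Timestamp")
CASE_ID_OK = re.compile(r"^[A-Za-z0-9_-]+$")  # assumed safe charset for CaseIDs

def validate_row(row, ts_format="%Y-%m-%d %H:%M:%S"):
    """Return a list of problems for one event row; an empty list means valid."""
    problems = [f"missing {f}" for f in REQUIRED if not row.get(f)]
    if row.get("CaseID") and not CASE_ID_OK.match(row["CaseID"]):
        problems.append("CaseID has special characters")
    if row.get("Timestamp"):
        try:
            datetime.strptime(row["Timestamp"], ts_format)
        except ValueError:
            problems.append("bad timestamp format")
    return problems
```

Logging the problem list alongside the row lets you fix or quarantine bad records instead of failing the whole import.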
Preprocessing Implementation:
// Pseudocode - Event log preprocessing:
1. Load raw event log in streaming mode (don't load entire file)
2. Apply filters: Remove system events, validate required fields
3. Transform: Standardize timestamps, normalize activity names
4. Partition: Write to separate files based on time period (quarterly)
5. Compress: Use gzip compression for storage (reduces size 60-70%)
6. Generate metadata: Event counts, date ranges, case counts per partition
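The steps above can be sketched as one streaming pass in Python. This is a minimal illustration, not Creatio tooling: it assumes a CSV source with CaseID/Activity/Timestamp columns, illustrative system-event names to filter, and combines partitioning (step 4), gzip compression (step 5), and per-partition counts (step 6):

```python
import csv
import gzip
import os
from collections import defaultdict
from datetime import datetime

def preprocess(src_path, out_dir, ts_format="%Y-%m-%d %H:%M:%S",
               drop_activities=("SystemHeartbeat", "DebugTrace")):  # assumed names
    """Stream the raw CSV row by row (never loading the whole file),
    filter out invalid/system events, and write gzip-compressed
    quarterly partitions plus per-partition event counts."""
    os.makedirs(out_dir, exist_ok=True)
    writers, files, counts = {}, {}, defaultdict(int)
    with open(src_path, newline="") as src:
        for row in csv.DictReader(src):
            if not (row.get("CaseID") and row.get("Activity") and row.get("Timestamp")):
                continue                      # drop rows missing required fields
            if row["Activity"] in drop_activities:
                continue                      # drop system/debug events
            ts = datetime.strptime(row["Timestamp"], ts_format)
            key = f"{ts.year}_Q{(ts.month - 1) // 3 + 1}"
            if key not in writers:            # open one gzip file per quarter
                f = gzip.open(os.path.join(out_dir, f"EventLog_{key}.csv.gz"),
                              "wt", newline="")
                files[key] = f
                writers[key] = csv.DictWriter(f, fieldnames=row.keys())
                writers[key].writeheader()
            writers[key].writerow(row)
            counts[key] += 1
    for f in files.values():
        f.close()
    return dict(counts)                       # metadata: events per partition
```

Because only one row is in memory at a time, this scales to a 2.5M-event file without touching the heap limits discussed above.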
Server Resource Scaling:
For sustainable process mining with 2-5M event workloads:
Minimum Hardware Recommendations:
- RAM: 32GB (allocate 20GB to application server, 12GB to OS/other)
- CPU: 12+ cores (ETL engine can parallelize event processing)
- Storage: NVMe SSD for temp files and database (ETL writes 2-3x event log size in temp data)
- Network: If database is remote, ensure 1Gbps+ connection
Configuration Optimization:
- Database connection pool: Set to 20-30 connections for parallel ETL processing
- Temp directory: Point to fast SSD with at least 50GB free space
- Enable parallel processing: Configure ETL to use 6-8 worker threads
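The worker-thread idea can be sketched with a bounded thread pool that imports partitions concurrently. `import_fn` here is a hypothetical placeholder for whatever loads one chunk; the 6-worker default matches the 6-8 thread suggestion above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def import_partitions(paths, import_fn, max_workers=6):
    """Run a per-partition import function across a bounded pool of worker
    threads, collecting results (or errors) per file without stopping."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(import_fn, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception as exc:          # one failed chunk doesn't abort the rest
                results[path] = f"failed: {exc}"
    return results
```

Capping `max_workers` also keeps the number of simultaneous database connections within the 20-30 connection pool suggested above.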
Scaling Strategy by Event Volume:
- <1M events: 16GB RAM, 8 cores (your current setup)
- 1-3M events: 32GB RAM, 12 cores (recommended upgrade)
- 3-5M events: 48GB RAM, 16 cores
- >5M events: Consider distributed processing or database-level process mining
Implementation Roadmap:
Phase 1 - Immediate (No Hardware Changes):
- Increase Java heap to 12GB with G1GC
- Enable streaming mode and reduce batch size to 75K
- Preprocess event log: Remove unnecessary attributes (target 40% size reduction)
- Split into 500K event chunks and import sequentially
Expected result: Successfully import 2.5M events in 4-5 sequential batches
Phase 2 - Short-term (Optimize Current Hardware):
- Implement automated preprocessing pipeline
- Configure parallel ETL processing (4-6 threads)
- Optimize database queries with proper indexing on CaseID and Timestamp
- Set up monitoring for memory usage during ETL
Expected result: Reduce import time by 30-40%, more reliable processing
Phase 3 - Long-term (Scale Infrastructure):
- Upgrade to 32GB RAM server
- Migrate to NVMe SSD storage
- Implement quarterly automated imports with merge
- Set up retention policy (archive events older than 2 years)
Expected result: Handle 5M+ events, support continuous process mining
Monitoring and Validation:
After implementation, track these metrics:
- Memory peak usage during ETL (should stay under 85% of allocated heap)
- Events processed per minute (target: 8K-12K events/min)
- Import failure rate (target: <2%)
- End-to-end import time for 500K events (target: <15 minutes)
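As one way to operationalize these checks, a small helper can compute the metrics from raw import statistics and flag missed targets (thresholds taken from the list above; the function and parameter names are illustrative):

```python
def etl_metrics(events_processed, failures, elapsed_seconds,
                heap_peak_mb, heap_alloc_mb):
    """Compute throughput, failure rate, and heap usage, and flag
    whether all of the stated targets were met."""
    events_per_min = events_processed / (elapsed_seconds / 60)
    failure_rate = failures / events_processed if events_processed else 0.0
    heap_pct = heap_peak_mb / heap_alloc_mb * 100
    return {
        "events_per_min": events_per_min,
        "failure_rate_pct": failure_rate * 100,
        "heap_usage_pct": heap_pct,
        # targets: >= 8K events/min, < 2% failures, <= 85% of allocated heap
        "within_targets": (events_per_min >= 8000
                           and failure_rate < 0.02
                           and heap_pct <= 85),
    }
```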
This comprehensive approach should resolve your immediate memory issues while providing a scalable foundation for growing event volumes. The preprocessing step is critical - I’ve seen it reduce import failures by 90% in large-scale implementations.