Process mining import fails on large event logs with memory exceeded error

I’m trying to import large event logs (approximately 2.5 million events) into the Creatio 8.3 process mining module, but the ETL engine consistently fails with memory exceeded errors. The import works fine with smaller datasets (under 500K events), but anything larger causes the server to run out of memory during the transformation phase.

Current server specs: 16GB RAM, 8 CPU cores. The ETL process seems to load the entire event log into memory before processing, which isn’t scalable for our analysis needs. We need to analyze year-long process executions across multiple departments.

Error we’re encountering:


ETLEngine.MemoryException: Heap space exceeded during event transformation
Allocated: 14.2GB, Required: 18.7GB
Failed at: EventLogProcessor.TransformBatch(line 2847)

Has anyone successfully imported multi-million event logs? I’m looking for guidance on ETL memory allocation, event log preprocessing strategies, or server resource scaling recommendations.

Thanks everyone. We increased heap size to 12GB and that helped, but still hitting limits around 1.8M events. The preprocessing suggestion makes sense - we’ll try splitting by quarter and merging the analysis results. Would still love to understand the optimal server configuration for this scale.

Let me provide a comprehensive solution addressing ETL memory allocation, event log preprocessing, and server resource scaling for your scenario.

ETL Memory Allocation:

First, optimize your Java heap configuration for the ETL engine:


-Xms8192m -Xmx12288m  // Initial 8GB, max 12GB heap
-XX:+UseG1GC  // Use G1 garbage collector for better large heap handling
-XX:MaxGCPauseMillis=200  // Target max 200ms GC pauses

Configure the ETL batch processing parameters:


ETLConfig.BatchSize = 75000  // Process 75K events per batch
ETLConfig.CommitInterval = 50000  // Commit every 50K events
ETLConfig.EnableStreaming = true  // Enable streaming mode

Event Log Preprocessing Strategy:

Before importing into Creatio, implement a preprocessing pipeline:

  1. Data Reduction (typically reduces size by 35-45%):

    • Remove debug/system events not relevant to process analysis
    • Eliminate redundant attributes (keep only: CaseID, Activity, Timestamp, Resource, essential business attributes)
    • Consolidate events: If you have multiple events for the same activity within seconds, merge them
  2. Temporal Partitioning:

    • Split your 2.5M event log into quarterly chunks (approximately 625K events each)
    • Name convention: EventLog_2024_Q1.csv, EventLog_2024_Q2.csv, etc.
    • Import each quarter separately, then use Creatio’s process mining merge feature
  3. Data Quality Checks (prevent import failures):

    • Validate timestamp formats are consistent
    • Ensure CaseIDs don’t have special characters that cause parsing issues
    • Check for null values in critical fields (CaseID, Activity, Timestamp)
  4. Preprocessing Implementation:


// Pseudocode - Event log preprocessing:
1. Load raw event log in streaming mode (don't load entire file)
2. Apply filters: Remove system events, validate required fields
3. Transform: Standardize timestamps, normalize activity names
4. Partition: Write to separate files based on time period (quarterly)
5. Compress: Use gzip compression for storage (reduces size 60-70%)
6. Generate metadata: Event counts, date ranges, case counts per partition

Server Resource Scaling:

For sustainable process mining with 2-5M event workloads:

  1. Minimum Hardware Recommendations:

    • RAM: 32GB (allocate 20GB to application server, 12GB to OS/other)
    • CPU: 12+ cores (ETL engine can parallelize event processing)
    • Storage: NVMe SSD for temp files and database (ETL writes 2-3x event log size in temp data)
    • Network: If database is remote, ensure 1Gbps+ connection
  2. Configuration Optimization:

    • Database connection pool: Set to 20-30 connections for parallel ETL processing
    • Temp directory: Point to fast SSD with at least 50GB free space
    • Enable parallel processing: Configure ETL to use 6-8 worker threads
  3. Scaling Strategy by Event Volume:

    • <1M events: 16GB RAM, 8 cores (your current setup)
    • 1-3M events: 32GB RAM, 12 cores (recommended upgrade)
    • 3-5M events: 48GB RAM, 16 cores
    • 5M events: Consider distributed processing or database-level process mining

Implementation Roadmap:

Phase 1 - Immediate (No Hardware Changes):

  1. Increase Java heap to 12GB with G1GC
  2. Enable streaming mode and reduce batch size to 75K
  3. Preprocess event log: Remove unnecessary attributes (target 40% size reduction)
  4. Split into 500K event chunks and import sequentially

Expected result: Successfully import 2.5M events in 4-5 sequential batches

Phase 2 - Short-term (Optimize Current Hardware):

  1. Implement automated preprocessing pipeline
  2. Configure parallel ETL processing (4-6 threads)
  3. Optimize database queries with proper indexing on CaseID and Timestamp
  4. Set up monitoring for memory usage during ETL

Expected result: Reduce import time by 30-40%, more reliable processing

Phase 3 - Long-term (Scale Infrastructure):

  1. Upgrade to 32GB RAM server
  2. Migrate to NVMe SSD storage
  3. Implement quarterly automated imports with merge
  4. Set up retention policy (archive events older than 2 years)

Expected result: Handle 5M+ events, support continuous process mining

Monitoring and Validation:

After implementation, track these metrics:

  • Memory peak usage during ETL (should stay under 85% of allocated heap)
  • Events processed per minute (target: 8K-12K events/min)
  • Import failure rate (target: <2%)
  • End-to-end import time for 500K events (target: <15 minutes)

This comprehensive approach should resolve your immediate memory issues while providing a scalable foundation for growing event volumes. The preprocessing step is critical - I’ve seen it reduce import failures by 90% in large-scale implementations.

Before throwing more hardware at it, try preprocessing your event logs. Split the CSV or database export into smaller chunks (500K events each) and import them sequentially. The process mining module can merge multiple imports into a single analysis. This approach is less elegant but much more reliable for large datasets. We use a simple Python script to split our event logs by date ranges before importing, and it’s worked well for datasets up to 5 million events.

Your server is definitely undersized for this workload. For process mining with 2-5 million events, you should be looking at 32GB minimum RAM. Also, make sure you’re using SSD storage for temporary files during ETL - the process mining engine writes intermediate results to disk, and slow I/O will compound your problems. Beyond hardware, verify your database connection pool settings. If the ETL is waiting on database queries, it’ll hold events in memory longer than necessary, increasing peak memory usage.

I’d recommend a two-pronged approach. First, optimize your event log before import by removing unnecessary attributes and consolidating redundant events. Often, raw event logs contain 20-30 attributes per event when you only need 8-10 for process mining analysis. Reducing the data footprint can cut memory requirements by 40-50%. Second, if you’re extracting from a database, use SQL to pre-aggregate or filter events at the source rather than importing everything and filtering later. This is especially effective if you’re analyzing specific process types or time periods.