Dataflow pipeline fails when ERP container hits OOM, causing incomplete loads

Dataflow pipeline jobs fail when ERP containers run out of memory during large batch processing. We’re loading daily transaction data from our ERP system into BigQuery, processing approximately 2M records per day. The pipeline works fine for smaller datasets but consistently fails with OOM errors when processing month-end reconciliation data (8-10M records). Here’s the error pattern we’re seeing:


java.lang.OutOfMemoryError: Java heap space
at DataProcessor.transformRecord(DataProcessor.java:142)
Worker crashed: Container killed due to memory limit

Our current Dataflow worker configuration uses n1-standard-4 machines with default memory settings. The pipeline code loads entire batches into memory for validation before writing to BigQuery. Is this a worker sizing issue or do we need to optimize the pipeline code itself?

Likely both. For predictable large loads, configure autoscaling with higher minimums, and consider n1-highmem machines instead of n1-standard: the extra RAM helps with aggregation operations. But the real fix is probably in the pipeline code itself.

GroupByKey can be memory-intensive with large groups. Have you considered using Combine operations instead? They’re more memory-efficient because they perform partial aggregations before the final grouping. Also, what’s your current autoscaling configuration? Default settings might not scale workers fast enough for sudden load increases.

Loading entire batches into memory is definitely problematic at that scale. Dataflow is designed for streaming transformations where each element is processed independently. Are you using windowing or batching operations that accumulate state?

One more thing to check: are you setting explicit memory limits on your ERP containers? If the container’s memory limit is lower than the worker VM’s available memory, you’ll hit container OOM before fully utilizing the VM resources.

Let me address all three focus areas comprehensively: worker memory configuration, pipeline code optimization, and autoscaling settings.

For Dataflow worker memory configuration, switch from n1-standard-4 to n1-highmem-4 machines. This increases available memory from 15GB to 26GB per worker without changing CPU allocation. Configure the worker options explicitly:


--workerMachineType=n1-highmem-4
--diskSizeGb=100
--workerDiskType=pd-ssd

Also critical: if you’re running ERP containers alongside Dataflow workers, ensure container memory limits don’t exceed 60-70% of VM memory, leaving headroom for Dataflow’s own operations. Set container limits explicitly in your deployment configuration to prevent resource contention.
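As an illustration, if the ERP containers run on GKE or a similar orchestrator, a resources stanza along these lines keeps the container under that 60-70% ceiling. The values assume a 26GB n1-highmem-4 node and are examples, not your actual configuration:

```yaml
# Illustrative memory limits for an ERP container on a 26GB
# n1-highmem-4 node; values are examples to adapt, not a prescription.
resources:
  requests:
    memory: "12Gi"
  limits:
    memory: "16Gi"   # roughly 60% of 26GB, leaving headroom for Dataflow
```

Setting the request below the limit gives the scheduler an honest baseline while the limit caps worst-case usage before it starves the worker VM.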

For pipeline code optimization, your current approach of loading entire batches into memory is the root cause. If you're grouping by account with GroupByKey, large groups create memory pressure that no amount of RAM will solve at scale. Refactor using Combine.perKey() instead:


// Pseudocode - Memory-efficient aggregation:
1. Replace GroupByKey with Combine.perKey(new TransactionValidator())
2. Implement CombineFn with partial aggregation in createAccumulator()
3. Process transactions incrementally in addInput() method
4. Merge partial results in mergeAccumulators() for distributed processing
5. Final validation in extractOutput() with compact state representation

This approach processes transactions incrementally rather than accumulating all in memory. Each worker handles partial aggregations, and Dataflow merges results efficiently. For your validation requirements, maintain a compact state object (account balance, transaction IDs for duplicate checking) rather than storing full transaction objects.
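The steps above can be sketched in plain Java. This is a minimal illustration of the CombineFn contract (createAccumulator / addInput / mergeAccumulators / extractOutput) without the Beam dependency; TransactionValidator and the duplicate-check state are hypothetical stand-ins for your actual validation logic:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the CombineFn pattern: compact per-key state
// (here, distinct transaction IDs) instead of full transaction objects.
class TransactionValidator {
    // Compact accumulator: IDs for duplicate checking plus a counter.
    static class Acc {
        Set<String> seenIds = new HashSet<>();
        long duplicates = 0;
    }

    // createAccumulator(): fresh, empty state per key.
    Acc createAccumulator() {
        return new Acc();
    }

    // addInput(): fold one transaction in incrementally; nothing is buffered.
    Acc addInput(Acc acc, String txnId) {
        if (!acc.seenIds.add(txnId)) {
            acc.duplicates++;
        }
        return acc;
    }

    // mergeAccumulators(): combine partial results from different workers.
    Acc mergeAccumulators(Acc a, Acc b) {
        a.duplicates += b.duplicates;
        for (String id : b.seenIds) {
            if (!a.seenIds.add(id)) {
                a.duplicates++;   // same ID seen on two workers
            }
        }
        return a;
    }

    // extractOutput(): emit only the compact result, not the inputs.
    String extractOutput(Acc acc) {
        return "distinct=" + acc.seenIds.size() + " duplicates=" + acc.duplicates;
    }
}
```

In the real pipeline these four methods would live in a subclass of Beam's Combine.CombineFn wired up via Combine.perKey(new TransactionValidator()); richer state (running balances, validation flags) follows the same shape.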

Additionally, implement windowing to break processing into smaller chunks. Even for batch data, fixed windows of 1-2 hours prevent unbounded state accumulation and enable better parallelization.

For autoscaling settings, configure explicit parameters for month-end processing. Default autoscaling is too conservative for predictable large loads:


--numWorkers=10
--maxNumWorkers=50
--autoscalingAlgorithm=THROUGHPUT_BASED

Set numWorkers to start with adequate capacity rather than ramping up slowly, and use maxNumWorkers to cap costs while still allowing scale for peak loads. THROUGHPUT_BASED is already the default algorithm for batch jobs, but setting it explicitly guards against autoscaling having been disabled (NONE) elsewhere in your configuration.

For month-end jobs specifically, consider using separate pipeline configurations with higher minimums. You can trigger these via Cloud Scheduler or detect the workload size programmatically and adjust parameters accordingly.
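As a sketch of the programmatic route, the launcher could pick worker counts from the estimated record count before submitting the job. WorkerSizing, its thresholds, and its return values are hypothetical, not tuned recommendations:

```java
// Hypothetical sizing heuristic: choose {numWorkers, maxNumWorkers}
// from the estimated input size before launching the Dataflow job.
class WorkerSizing {
    // Returns {initial workers, max workers}; 5M records is an
    // illustrative cutoff between daily and month-end scale.
    static int[] choose(long estimatedRecords) {
        if (estimatedRecords > 5_000_000L) {
            return new int[] {10, 50};  // month-end: start with capacity
        }
        return new int[] {3, 20};       // daily load: modest sizing
    }
}
```

The chosen values would then feed the job's pipeline options (setNumWorkers / setMaxNumWorkers, or the equivalent command-line flags) at launch time.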

Implement monitoring on these key metrics: worker memory utilization (should stay below 80%), data freshness (time between ingestion and processing), and worker count over time. Set alerts if memory utilization exceeds 85% or if autoscaling hits maxNumWorkers, indicating you need further tuning.

The combination of increased worker memory, refactored pipeline code using Combine operations, and properly configured autoscaling should eliminate your OOM failures while improving overall pipeline efficiency and cost. The code optimization is most critical - even with unlimited memory, accumulating millions of records per key isn’t scalable long-term.