Let me address all three focus areas comprehensively: worker memory configuration, pipeline code optimization, and autoscaling settings.
For Dataflow worker memory configuration, switch from n1-standard-4 to n1-highmem-4 machines. This increases available memory from 15GB to 26GB per worker without changing CPU allocation. Configure the worker options explicitly:
--workerMachineType=n1-highmem-4
--diskSizeGb=100
--workerDiskType=pd-ssd
Also critical: if you’re running ERP containers alongside Dataflow workers, ensure container memory limits don’t exceed 60-70% of VM memory, leaving headroom for Dataflow’s own operations. Set container limits explicitly in your deployment configuration to prevent resource contention.
For pipeline code optimization, your current approach of loading entire batches into memory is the root cause. The GroupByKey operation with large account groups creates memory pressure that no amount of RAM will solve at scale. Refactor using Combine.perKey() instead:
// Pseudocode - memory-efficient aggregation:
1. Replace GroupByKey with Combine.perKey(new TransactionValidator())
2. Implement a CombineFn whose createAccumulator() initializes a compact accumulator
3. Process transactions incrementally in the addInput() method
4. Merge partial results in mergeAccumulators() for distributed processing
5. Perform final validation in extractOutput() on the compact state representation
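The lifecycle above can be sketched in plain Java. This is an illustrative sketch, not the Beam SDK itself: in a real pipeline, TransactionValidator would extend Beam's Combine.CombineFn and be passed to Combine.perKey(); the class and field names here (TxnAccumulator, balanceCents, firstAmounts) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the CombineFn lifecycle; names are hypothetical.
// In Beam, this would extend Combine.CombineFn<Transaction, TxnAccumulator, String>.
class TransactionValidator {

    // Compact per-key state: one map entry per unique transaction ID
    // (for duplicate checking) plus a running balance - not full transaction objects.
    static class TxnAccumulator {
        Map<String, Long> firstAmounts = new HashMap<>();
        long balanceCents = 0;
        long duplicates = 0;
    }

    TxnAccumulator createAccumulator() {
        return new TxnAccumulator();
    }

    // Fold one transaction into the accumulator instead of buffering it.
    TxnAccumulator addInput(TxnAccumulator acc, String txnId, long amountCents) {
        if (acc.firstAmounts.putIfAbsent(txnId, amountCents) != null) {
            acc.duplicates++;                 // duplicate ID: count it, don't re-apply the amount
        } else {
            acc.balanceCents += amountCents;  // first time this ID is seen
        }
        return acc;
    }

    // Combine partial results computed on different workers.
    TxnAccumulator mergeAccumulators(TxnAccumulator a, TxnAccumulator b) {
        for (Map.Entry<String, Long> e : b.firstAmounts.entrySet()) {
            if (a.firstAmounts.putIfAbsent(e.getKey(), e.getValue()) != null) {
                a.duplicates++;               // same ID seen on both workers
            } else {
                a.balanceCents += e.getValue();
            }
        }
        a.duplicates += b.duplicates;
        return a;
    }

    // Final, compact output: balance and duplicate count, never the raw transactions.
    String extractOutput(TxnAccumulator acc) {
        return acc.balanceCents + " cents, " + acc.duplicates + " duplicate(s)";
    }
}
```

Because merging only combines two small accumulators, workers never hold a full account group in memory at once, which is exactly what GroupByKey forces them to do.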
This approach processes transactions incrementally rather than accumulating all in memory. Each worker handles partial aggregations, and Dataflow merges results efficiently. For your validation requirements, maintain a compact state object (account balance, transaction IDs for duplicate checking) rather than storing full transaction objects.
Additionally, implement windowing to break processing into smaller chunks. Even for batch data, fixed windows of 1-2 hours prevent unbounded state accumulation and enable better parallelization.
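The core idea behind fixed windows — every element maps to a window determined by truncating its timestamp, so state is scoped per (key, window) instead of growing without bound — can be sketched in plain Java. The WindowAssign class and its helper are illustrative, and the truncation assumes non-negative epoch timestamps.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of fixed-window assignment: each event time maps to the start of
// its window, so per-key state is bounded by the window size.
class WindowAssign {
    static Instant windowStart(Instant eventTime, Duration windowSize) {
        long sizeMillis = windowSize.toMillis();
        // Integer division truncates to the window boundary (non-negative epochs assumed).
        long start = (eventTime.toEpochMilli() / sizeMillis) * sizeMillis;
        return Instant.ofEpochMilli(start);
    }
}
```

In an actual Beam pipeline you would not hand-roll this; you would apply Window.into(FixedWindows.of(...)) to the PCollection, and the runner handles assignment and state scoping for you.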
For autoscaling settings, configure explicit parameters for month-end processing. Default autoscaling is too conservative for predictable large loads:
--numWorkers=10
--maxNumWorkers=50
--autoscalingAlgorithm=THROUGHPUT_BASED
Set numWorkers high enough to start with adequate capacity rather than ramping up slowly, and use maxNumWorkers to cap runaway costs while still allowing scale-out at peak. THROUGHPUT_BASED autoscaling (the default for batch jobs) scales workers based on backlog and throughput; setting it explicitly guards against it being disabled, since autoscalingAlgorithm=NONE turns autoscaling off entirely.
For month-end jobs specifically, consider using separate pipeline configurations with higher minimums. You can trigger these via Cloud Scheduler or detect the workload size programmatically and adjust parameters accordingly.
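One way to sketch "detect the workload size programmatically and adjust parameters": derive the starting worker count from an estimated record count before launching the job. The WorkerSizing class, the per-worker rate, and the bounds are illustrative placeholders you would tune against your own throughput measurements.

```java
// Sketch: choose the initial worker count from an estimated workload size.
// RECORDS_PER_WORKER and the bounds are illustrative, not tuned values.
class WorkerSizing {
    static final long RECORDS_PER_WORKER = 5_000_000L; // assumed records one worker handles
    static final int MIN_WORKERS = 10;                 // floor matching --numWorkers above
    static final int MAX_WORKERS = 50;                 // cap matching --maxNumWorkers above

    static int initialWorkers(long estimatedRecords) {
        // Ceiling division, then clamp to the configured bounds.
        long needed = (estimatedRecords + RECORDS_PER_WORKER - 1) / RECORDS_PER_WORKER;
        return (int) Math.max(MIN_WORKERS, Math.min(MAX_WORKERS, needed));
    }
}
```

The computed value would then be passed as --numWorkers when the month-end job is launched, e.g. from a Cloud Scheduler-triggered launcher.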
Implement monitoring on these key metrics: worker memory utilization (should stay below 80%), data freshness (time between ingestion and processing), and worker count over time. Set alerts if memory utilization exceeds 85% or if autoscaling hits maxNumWorkers, indicating you need further tuning.
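The two alert conditions above reduce to a simple predicate, sketched below. The PipelineAlerts class and shouldAlert helper are illustrative; in practice you would express these thresholds as Cloud Monitoring alerting policies rather than polling them in code.

```java
// Sketch of the alert conditions: memory above 85%, or autoscaler pinned at max.
class PipelineAlerts {
    static final double MEMORY_ALERT_FRACTION = 0.85;

    static boolean shouldAlert(double memoryUtilization, int currentWorkers, int maxNumWorkers) {
        boolean memoryHigh = memoryUtilization > MEMORY_ALERT_FRACTION;  // headroom exhausted
        boolean pinnedAtMax = currentWorkers >= maxNumWorkers;           // can't scale further
        return memoryHigh || pinnedAtMax;
    }
}
```

Either condition firing means the current tuning has run out of slack: the first points at per-worker memory, the second at the autoscaling ceiling.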
The combination of increased worker memory, refactored pipeline code using Combine operations, and properly configured autoscaling should eliminate your OOM failures while improving overall pipeline efficiency and cost. The code optimization is most critical - even with unlimited memory, accumulating millions of records per key isn’t scalable long-term.