Distribution management cloud batch job stuck in processing status

After moving our distribution-mgmt module to a cloud-hybrid setup on ES 10.2.600, we’re experiencing issues with the nightly order allocation batch job. The job starts normally but gets stuck in “Processing” status and never completes. This is blocking our entire order fulfillment process.

The job is scheduled through the Epicor Job Scheduler to run at 2 AM daily. It processes warehouse allocation for pending sales orders. The job ran fine for the first week after migration, but now it consistently hangs. When I check the job status in the morning, it shows:


Job: DIST_ORDER_ALLOCATION_NIGHTLY
Status: Processing
Start Time: 02:00:15
Elapsed: 6h 23m (still running)
Progress: 0%

The cloud application server shows the job process is running but consuming minimal CPU. There are no error logs, no timeout messages - it just sits there indefinitely. Manual order allocation through the UI works fine during the day. The scheduler agent connectivity seems OK for other jobs. Could this be related to cloud firewall configuration or job heartbeat timeout settings? We need this resolved urgently as orders are piling up.

The fact that it worked initially suggests the job configuration itself is correct. A job that starts but never progresses or completes usually indicates a connectivity issue between the scheduler agent and the worker process, especially in cloud-hybrid environments where they might be on different network segments.

Check if your scheduler agent can maintain persistent connections to the cloud worker nodes. Firewalls often allow initial connections but drop long-running sessions, which would cause exactly this symptom - job starts but then can’t communicate progress back to the scheduler.

You need to increase the idle timeout to at least 30 minutes, preferably 60 minutes for long batch jobs. But there’s another critical setting - the job heartbeat timeout in Epicor itself. If the scheduler doesn’t receive a heartbeat within its configured timeout period, it marks the job as failed even if it’s still running. These two timeouts need to be coordinated.

Here’s the complete solution to fix your stuck batch job issue:

1. Configure Scheduler-Agent Connectivity: The core problem is that your scheduler agent (on-premises) loses connection to the cloud worker process due to firewall idle timeouts. You need to ensure persistent connectivity throughout job execution.

On your scheduler agent server, edit the scheduler configuration file (typically SchedulerAgent.config):

<SchedulerAgent>
  <Heartbeat interval="30" timeout="300" />
  <KeepAlive enabled="true" interval="25" />
</SchedulerAgent>

This enables TCP keep-alive packets every 25 seconds, preventing firewall idle timeout disconnections. The heartbeat timeout of 300 seconds (5 minutes) gives the worker process time to send status updates even during intensive processing.
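What that keep-alive setting does can be sketched at the socket level. This is a generic illustration of OS-level TCP keep-alive, not an Epicor API — the function name and the portability guards are mine:

```python
import socket

def make_keepalive_socket(idle=25, interval=25, count=3):
    """Create a TCP socket with OS-level keep-alive probes enabled.
    idle/interval mirror the KeepAlive interval="25" in the agent config:
    after `idle` seconds of silence, probe every `interval` seconds,
    giving up after `count` unanswered probes."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs are platform-specific (Linux names shown),
    # so guard each one for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return s
```

The point is that the probes are traffic from the firewall's perspective, so the connection never looks idle even while the worker is silently crunching allocations.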

2. Cloud Firewall Configuration: Work with your network team to update Azure Network Security Group rules:

Inbound Rule (to cloud workers):

  • Source: Your on-prem scheduler agent IP
  • Destination: Cloud application server subnet
  • Ports: 4502-4503 (Epicor job management)
  • Protocol: TCP
  • Idle Timeout: 3600 seconds (60 minutes)

Outbound Rule (from cloud workers):

  • Source: Cloud application server subnet
  • Destination: On-prem scheduler agent IP
  • Ports: 4502-4503
  • Protocol: TCP
  • Idle Timeout: 3600 seconds

If using Azure Load Balancer, also configure:


IdleTimeoutInMinutes: 30
EnableTcpReset: true
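Before (and after) raising any timeouts, it's worth confirming the drop empirically. A small probe like the following — a hypothetical helper, not an Epicor tool — opens a connection to a worker port, sits idle past the suspected window with keep-alive off, then tries to send:

```python
import socket
import time

def probe_idle_timeout(host, port, idle_seconds):
    """Open a TCP connection, stay silent past the suspected firewall
    idle window, then send data. A silently dropped connection surfaces
    as a ConnectionResetError / BrokenPipeError on the send."""
    with socket.create_connection((host, port), timeout=10) as s:
        time.sleep(idle_seconds)   # sit idle: no data, no keep-alive
        try:
            s.sendall(b"ping")
            return True            # connection survived the idle window
        except OSError:
            return False           # firewall (or peer) dropped it
```

Running it against port 4502 with idle_seconds just above the firewall's idle window should reproduce the drop if the NSG idle timeout really is the culprit, and return True once the rules are fixed.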

3. Job Heartbeat Timeout Settings: In Epicor Admin Console, navigate to System Management > Job Scheduler > Job Definitions. Edit your DIST_ORDER_ALLOCATION_NIGHTLY job:

<JobDefinition>
  <Name>DIST_ORDER_ALLOCATION_NIGHTLY</Name>
  <HeartbeatInterval>60</HeartbeatInterval>
  <HeartbeatTimeout>600</HeartbeatTimeout>
  <MaxExecutionTime>7200</MaxExecutionTime>
</JobDefinition>

Settings explained:

  • HeartbeatInterval: Worker sends status update every 60 seconds
  • HeartbeatTimeout: Scheduler waits up to 600 seconds (10 minutes) for heartbeat before considering job failed
  • MaxExecutionTime: Job automatically terminates after 7200 seconds (2 hours) to prevent runaway processes

4. Enable Detailed Job Logging: To troubleshoot future issues, enable comprehensive job logging in your cloud environment. Add to application server configuration:


job.logging.level=DEBUG
job.logging.includeHeartbeat=true
job.logging.path=/logs/scheduler/

This logs every heartbeat exchange and connection status, making it easy to identify connectivity issues.
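Once heartbeat logging is on, gaps between consecutive heartbeat entries point straight at the drop. A throwaway parser along these lines can flag them — the timestamp-first log layout is an assumption, so adjust the parsing to whatever Epicor actually writes:

```python
from datetime import datetime

def heartbeat_gaps(lines, max_gap_seconds=60, fmt="%H:%M:%S"):
    """Scan heartbeat log lines (assumed to start with an HH:MM:SS
    timestamp) and return (start, end, seconds) for every gap longer
    than the configured HeartbeatInterval."""
    stamps = [datetime.strptime(line.split()[0], fmt)
              for line in lines if line.strip()]
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta > max_gap_seconds:
            gaps.append((prev.strftime(fmt), cur.strftime(fmt), delta))
    return gaps
```

A healthy run shows steady 60-second spacing; a single multi-minute gap right around the firewall's idle window is the smoking gun.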

5. Test and Validate: After implementing these changes:

  1. Restart the scheduler agent service on your on-prem server
  2. Restart the Epicor application service on cloud servers
  3. Manually trigger the order allocation job during business hours (don’t wait until 2 AM)
  4. Monitor the job logs for heartbeat messages - you should see regular status updates
  5. Verify the job completes successfully
  6. Check that the job status updates correctly in the scheduler console throughout execution

Why This Works: Your job was getting stuck because the network connection between the scheduler and the worker was being dropped by the firewall's idle timeout. The scheduler lost visibility into job status, so the job stayed in Processing even though the worker process was still running (or had quietly finished). By enabling TCP keep-alive, increasing firewall idle timeouts, and configuring appropriate heartbeat intervals, you maintain continuous communication for the full job duration. The coordinated settings (25s keep-alive < 30s agent heartbeat < 60s job heartbeat < 4min firewall idle < 10min heartbeat timeout) ensure no timer expires before the next communication occurs on that link.
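That ordering constraint is easy to encode as a one-line sanity check whenever you tune any of these values. A sketch, with defaults taken from the numbers in this thread (4-minute NSG idle = 240 s):

```python
def timeouts_coordinated(keepalive=25, agent_hb=30, job_hb=60,
                         fw_idle=240, hb_timeout=600):
    """True when each timer in the chain fires before the next one
    expires, so no link ever goes silent long enough for anything
    downstream to trip first."""
    return keepalive < agent_hb < job_hb < fw_idle < hb_timeout
```

If you later raise the firewall idle timeout to 3600 s as recommended above, the check still holds; it only fails if someone, say, pushes the keep-alive interval past a heartbeat window.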

Additional Recommendation: Consider implementing job monitoring alerts using Azure Monitor or your existing monitoring platform. Set up alerts when:

  • Job execution time exceeds 90 minutes (the top of the job's usual 45-90 minute range)
  • Heartbeat timeout warnings appear in logs
  • Job remains in Processing status for more than 2 hours

This comprehensive solution addresses all three focus areas: scheduler-agent connectivity through keep-alive and heartbeat configuration, proper firewall configuration with appropriate idle timeouts, and coordinated job heartbeat timeout settings that prevent premature job failure while still catching genuine hangs.

That’s a good lead. Our scheduler agent runs on-premises while the distribution-mgmt application servers are in Azure. I can see the initial connection succeeds, but I don’t know how to verify whether it’s being maintained throughout the job execution. Are there specific ports or protocols I should check with our network team?

I checked with our network team. The Azure NSG has a 4-minute idle timeout on TCP connections. Our order allocation job typically takes 45-90 minutes, so that would definitely explain the disconnect. What’s the recommended timeout setting for long-running Epicor batch jobs? And do I need to change anything in the job configuration itself?