Distribution management cloud batch job stuck in processing status

After moving our distribution-mgmt module to a cloud-hybrid setup on ES 10.2.600, we’re experiencing issues with the nightly order allocation batch job. The job starts normally but gets stuck in “Processing” status and never completes. This is blocking our entire order fulfillment process.

The job is scheduled through the Epicor Job Scheduler to run at 2 AM daily. It processes warehouse allocation for pending sales orders. The job ran fine for the first week after migration, but now it consistently hangs. When I check the job status in the morning, it shows:


Job: DIST_ORDER_ALLOCATION_NIGHTLY
Status: Processing
Start Time: 02:00:15
Elapsed: 6h 23m (still running)
Progress: 0%

The cloud application server shows the job process is running but consuming minimal CPU. There are no error logs, no timeout messages - it just sits there indefinitely. Manual order allocation through the UI works fine during the day. The scheduler agent connectivity seems OK for other jobs. Could this be related to cloud firewall configuration or job heartbeat timeout settings? We need this resolved urgently as orders are piling up.

The fact that it worked initially suggests the job configuration itself is correct. A job that starts but never progresses or completes usually indicates a connectivity issue between the scheduler agent and the worker process, especially in cloud-hybrid environments where they might be on different network segments.

Check if your scheduler agent can maintain persistent connections to the cloud worker nodes. Firewalls often allow initial connections but drop long-running sessions, which would cause exactly this symptom - job starts but then can’t communicate progress back to the scheduler.

You need to increase the idle timeout to at least 30 minutes, preferably 60 minutes for long batch jobs. But there’s another critical setting - the job heartbeat timeout in Epicor itself. If the scheduler doesn’t receive a heartbeat within its configured timeout period, it marks the job as failed even if it’s still running. These two timeouts need to be coordinated.

Here’s the complete solution to fix your stuck batch job issue:

1. Configure Scheduler-Agent Connectivity: The core problem is that your scheduler agent (on-premises) loses connection to the cloud worker process due to firewall idle timeouts. You need to ensure persistent connectivity throughout job execution.

On your scheduler agent server, edit the scheduler configuration file (typically SchedulerAgent.config):

<SchedulerAgent>
  <Heartbeat interval="30" timeout="300" />
  <KeepAlive enabled="true" interval="25" />
</SchedulerAgent>

This enables TCP keep-alive packets every 25 seconds, preventing firewall idle timeout disconnections. The heartbeat timeout of 300 seconds (5 minutes) gives the worker process time to send status updates even during intensive processing.
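What that keep-alive setting does can be sketched at the socket level. This is a generic illustration of OS-level TCP keep-alive, not an Epicor API — the function name and the portability guards are mine:

```python
import socket

def make_keepalive_socket(idle=25, interval=25, count=3):
    """Create a TCP socket with OS-level keep-alive probes enabled.
    idle/interval mirror the KeepAlive interval="25" in the agent config:
    after `idle` seconds of silence, probe every `interval` seconds,
    giving up after `count` unanswered probes."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs are platform-specific (Linux names shown),
    # so guard each one for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return s
```

The point is that the probes are traffic from the firewall's perspective, so the connection never looks idle even while the worker is silently crunching allocations.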

2. Cloud Firewall Configuration: Work with your network team to update Azure Network Security Group rules:

Inbound Rule (to cloud workers):

  • Source: Your on-prem scheduler agent IP
  • Destination: Cloud application server subnet
  • Ports: 4502-4503 (Epicor job management)
  • Protocol: TCP
  • Idle Timeout: 3600 seconds (60 minutes)

Outbound Rule (from cloud workers):

  • Source: Cloud application server subnet
  • Destination: On-prem scheduler agent IP
  • Ports: 4502-4503
  • Protocol: TCP
  • Idle Timeout: 3600 seconds

If using Azure Load Balancer, also configure:


IdleTimeoutInMinutes: 30
EnableTcpReset: true
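Before (and after) raising any timeouts, it's worth confirming the drop empirically. A small probe like the following — a hypothetical helper, not an Epicor tool — opens a connection to a worker port, sits idle past the suspected window with keep-alive off, then tries to send:

```python
import socket
import time

def probe_idle_timeout(host, port, idle_seconds):
    """Open a TCP connection, stay silent past the suspected firewall
    idle window, then send data. A silently dropped connection surfaces
    as a ConnectionResetError / BrokenPipeError on the send."""
    with socket.create_connection((host, port), timeout=10) as s:
        time.sleep(idle_seconds)   # sit idle: no data, no keep-alive
        try:
            s.sendall(b"ping")
            return True            # connection survived the idle window
        except OSError:
            return False           # firewall (or peer) dropped it
```

Running it against port 4502 with idle_seconds just above the firewall's idle window should reproduce the drop if the NSG idle timeout really is the culprit, and return True once the rules are fixed.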

3. Job Heartbeat Timeout Settings: In Epicor Admin Console, navigate to System Management > Job Scheduler > Job Definitions. Edit your DIST_ORDER_ALLOCATION_NIGHTLY job:

<JobDefinition>
  <Name>DIST_ORDER_ALLOCATION_NIGHTLY</Name>
  <HeartbeatInterval>60</HeartbeatInterval>
  <HeartbeatTimeout>600</HeartbeatTimeout>
  <MaxExecutionTime>7200</MaxExecutionTime>
</JobDefinition>

Settings explained:

  • HeartbeatInterval: Worker sends status update every 60 seconds
  • HeartbeatTimeout: Scheduler waits up to 600 seconds (10 minutes) for heartbeat before considering job failed
  • MaxExecutionTime: Job automatically terminates after 7200 seconds (2 hours) to prevent runaway processes

4. Enable Detailed Job Logging: To troubleshoot future issues, enable comprehensive job logging in your cloud environment. Add to application server configuration:


job.logging.level=DEBUG
job.logging.includeHeartbeat=true
job.logging.path=/logs/scheduler/

This logs every heartbeat exchange and connection status, making it easy to identify connectivity issues.
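Once heartbeat logging is on, gaps between consecutive heartbeat entries point straight at the drop. A throwaway parser along these lines can flag them — the timestamp-first log layout is an assumption, so adjust the parsing to whatever Epicor actually writes:

```python
from datetime import datetime

def heartbeat_gaps(lines, max_gap_seconds=60, fmt="%H:%M:%S"):
    """Scan heartbeat log lines (assumed to start with an HH:MM:SS
    timestamp) and return (start, end, seconds) for every gap longer
    than the configured HeartbeatInterval."""
    stamps = [datetime.strptime(line.split()[0], fmt)
              for line in lines if line.strip()]
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta > max_gap_seconds:
            gaps.append((prev.strftime(fmt), cur.strftime(fmt), delta))
    return gaps
```

A healthy run shows steady 60-second spacing; a single multi-minute gap right around the firewall's idle window is the smoking gun.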

5. Test and Validate: After implementing these changes:

  1. Restart the scheduler agent service on your on-prem server
  2. Restart the Epicor application service on cloud servers
  3. Manually trigger the order allocation job during business hours (don’t wait until 2 AM)
  4. Monitor the job logs for heartbeat messages - you should see regular status updates
  5. Verify the job completes successfully
  6. Check that the job status updates correctly in the scheduler console throughout execution

Why This Works: Your job was getting stuck because the network connection between the scheduler and the worker was being dropped by the firewall's idle timeout. The scheduler lost visibility into job status, so the job stayed in Processing even though the worker process was still running (or had quietly finished). By enabling TCP keep-alive, increasing firewall idle timeouts, and configuring appropriate heartbeat intervals, you maintain continuous communication for the full job duration. The coordinated settings (25s keep-alive < 30s agent heartbeat < 60s job heartbeat < 4min firewall idle < 10min heartbeat timeout) ensure no timer expires before the next communication occurs on that link.
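That ordering constraint is easy to encode as a one-line sanity check whenever you tune any of these values. A sketch, with defaults taken from the numbers in this thread (4-minute NSG idle = 240 s):

```python
def timeouts_coordinated(keepalive=25, agent_hb=30, job_hb=60,
                         fw_idle=240, hb_timeout=600):
    """True when each timer in the chain fires before the next one
    expires, so no link ever goes silent long enough for anything
    downstream to trip first."""
    return keepalive < agent_hb < job_hb < fw_idle < hb_timeout
```

If you later raise the firewall idle timeout to 3600 s as recommended above, the check still holds; it only fails if someone, say, pushes the keep-alive interval past a heartbeat window.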

Additional Recommendation: Consider implementing job monitoring alerts using Azure Monitor or your existing monitoring platform. Set up alerts when:

  • Job execution time exceeds 90 minutes (the top of the job's usual 45-90 minute range)
  • Heartbeat timeout warnings appear in logs
  • Job remains in Processing status for more than 2 hours

This comprehensive solution addresses all three focus areas: scheduler-agent connectivity through keep-alive and heartbeat configuration, proper firewall configuration with appropriate idle timeouts, and coordinated job heartbeat timeout settings that prevent premature job failure while still catching genuine hangs.

That’s a good lead. Our scheduler agent runs on-premises while the distribution-mgmt application servers are in Azure. I can see the initial connection succeeds, but I don’t know how to verify whether it’s being maintained throughout the job execution. Are there specific ports or protocols I should check with our network team?

I checked with our network team. The Azure NSG has a 4-minute idle timeout on TCP connections. Our order allocation job typically takes 45-90 minutes, so that would definitely explain the disconnect. What’s the recommended timeout setting for long-running Epicor batch jobs? And do I need to change anything in the job configuration itself?