Here’s the complete solution to fix your stuck batch job issue:
1. Configure Scheduler-Agent Connectivity:
The core problem is that your scheduler agent (on-premises) loses connection to the cloud worker process due to firewall idle timeouts. You need to ensure persistent connectivity throughout job execution.
On your scheduler agent server, edit the scheduler configuration file (typically SchedulerAgent.config):
<SchedulerAgent>
<Heartbeat interval="30" timeout="300" />
<KeepAlive enabled="true" interval="25" />
</SchedulerAgent>
This enables TCP keep-alive packets every 25 seconds, preventing firewall idle timeout disconnections. The heartbeat timeout of 300 seconds (5 minutes) gives the worker process time to send status updates even during intensive processing.
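Under the hood, the agent's KeepAlive setting maps to standard TCP keep-alive socket options that the operating system exposes. As a rough illustration of what that setting does at the socket level (generic socket code, not Epicor's implementation; the Linux-specific constants are guarded because they vary by platform):

```python
import socket

def enable_keepalive(sock: socket.socket, idle: int = 25,
                     interval: int = 25, count: int = 4) -> None:
    """Enable TCP keep-alive: probe after `idle` seconds of silence,
    re-probe every `interval` seconds, give up after `count` misses."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Fine-tuning constants exist on Linux; guard for other platforms.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s, idle=25)
# getsockopt returns nonzero once keep-alive is enabled
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
s.close()
```

With a 25-second idle value, the kernel sends a probe before any 4-minute (or longer) idle timer in the network path can expire.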
2. Cloud Firewall Configuration:
Work with your network team to update the firewall path between the agent and the cloud workers. In Azure this means the Network Security Group rules plus any firewall appliance in between; note that NSG rules themselves do not expose an idle-timeout setting, so the idle-timeout values below apply to the firewall appliance in the path, while Azure's own idle timeout is configured on the Load Balancer:
Inbound Rule (to cloud workers):
- Source: Your on-prem scheduler agent IP
- Destination: Cloud application server subnet
- Ports: 4502-4503 (Epicor job management)
- Protocol: TCP
- Idle Timeout: 3600 seconds (60 minutes)
Outbound Rule (from cloud workers):
- Source: Cloud application server subnet
- Destination: On-prem scheduler agent IP
- Ports: 4502-4503
- Protocol: TCP
- Idle Timeout: 3600 seconds
If using Azure Load Balancer, also configure:
IdleTimeoutInMinutes: 30
EnableTcpReset: true
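Once the rules are in place, you can verify reachability from the agent host immediately instead of waiting for the nightly run. A minimal sketch (the hostname is a placeholder for your cloud application server; the ports match the rules above):

```python
import socket

def check_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Verify each job-management port is reachable from the agent host.
for port in (4502, 4503):
    status = "open" if check_port("app-server.example.com", port) else "blocked"
    print(f"port {port}: {status}")
```

Run this from the on-prem agent server to test the inbound rule, and from a cloud worker back toward the agent to test the outbound rule.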
3. Job Heartbeat Timeout Settings:
In Epicor Admin Console, navigate to System Management > Job Scheduler > Job Definitions. Edit your DIST_ORDER_ALLOCATION_NIGHTLY job:
<JobDefinition>
<Name>DIST_ORDER_ALLOCATION_NIGHTLY</Name>
<HeartbeatInterval>60</HeartbeatInterval>
<HeartbeatTimeout>600</HeartbeatTimeout>
<MaxExecutionTime>7200</MaxExecutionTime>
</JobDefinition>
Settings explained:
- HeartbeatInterval: Worker sends status update every 60 seconds
- HeartbeatTimeout: Scheduler waits up to 600 seconds (10 minutes) for heartbeat before considering job failed
- MaxExecutionTime: Job automatically terminates after 7200 seconds (2 hours) to prevent runaway processes
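The interplay of these three values amounts to a simple watchdog: a job is healthy while heartbeats keep arriving, considered failed once silence exceeds HeartbeatTimeout, and terminated once total runtime exceeds MaxExecutionTime. A minimal illustration of that logic (this is a sketch, not Epicor's actual scheduler code):

```python
HEARTBEAT_INTERVAL = 60    # worker sends a status update every 60 s
HEARTBEAT_TIMEOUT = 600    # scheduler tolerates up to 600 s of silence
MAX_EXECUTION_TIME = 7200  # hard stop after 2 h

def job_state(last_heartbeat: float, started: float, now: float) -> str:
    """Classify a job the way the scheduler's watchdog would."""
    if now - started > MAX_EXECUTION_TIME:
        return "terminated"   # runaway job, hard limit reached
    if now - last_heartbeat > HEARTBEAT_TIMEOUT:
        return "failed"       # heartbeat lost, connection presumed dead
    return "running"

print(job_state(last_heartbeat=0, started=0, now=30))      # running
print(job_state(last_heartbeat=0, started=0, now=700))     # failed
print(job_state(last_heartbeat=7000, started=0, now=7300)) # terminated
```

Note the 10x margin between interval and timeout: up to nine consecutive heartbeats can be lost to transient network blips before the scheduler gives up on the job.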
4. Enable Detailed Job Logging:
To troubleshoot future issues, enable comprehensive job logging in your cloud environment. Add to application server configuration:
job.logging.level=DEBUG
job.logging.includeHeartbeat=true
job.logging.path=/logs/scheduler/
This logs every heartbeat exchange and connection status, making it easy to identify connectivity issues.
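With heartbeat logging enabled, a gap between consecutive heartbeat entries that exceeds the configured interval is the signature of a dropped connection. A small sketch of scanning for such gaps (the log line format here is hypothetical; adjust the parsing to your actual layout):

```python
from datetime import datetime

# Hypothetical log excerpt in "YYYY-MM-DD HH:MM:SS LEVEL message" form.
sample_log = """\
2024-05-01 02:00:00 DEBUG Heartbeat sent job=DIST_ORDER_ALLOCATION_NIGHTLY
2024-05-01 02:01:00 DEBUG Heartbeat sent job=DIST_ORDER_ALLOCATION_NIGHTLY
2024-05-01 02:09:30 DEBUG Heartbeat sent job=DIST_ORDER_ALLOCATION_NIGHTLY
"""

def heartbeat_gaps(log_text: str, max_gap_seconds: float = 120.0):
    """Return (previous, current) timestamp pairs spaced wider than max_gap_seconds."""
    stamps = [
        datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        for line in log_text.splitlines()
        if "Heartbeat" in line
    ]
    return [
        (a, b) for a, b in zip(stamps, stamps[1:])
        if (b - a).total_seconds() > max_gap_seconds
    ]

for prev, cur in heartbeat_gaps(sample_log):
    print(f"gap: {prev} -> {cur}")  # flags the 02:01 -> 02:09:30 silence
```

The default threshold of 120 seconds is twice the 60-second heartbeat interval, so a single late heartbeat does not trigger a false positive.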
5. Test and Validate:
After implementing these changes:
- Restart the scheduler agent service on your on-prem server
- Restart the Epicor application service on cloud servers
- Manually trigger the order allocation job during business hours (don’t wait until 2 AM)
- Monitor the job logs for heartbeat messages; you should see regular status updates
- Verify the job completes successfully
- Check that the job status updates correctly in the scheduler console throughout execution
Why This Works:
Your job was getting stuck because the network connection between scheduler and worker was being dropped by idle timeouts; Azure's default idle timeout for TCP flows is only four minutes, well short of a long-running allocation job. Once the connection dropped, the scheduler lost visibility into job status and could not mark the job complete even though the worker process was still running. Enabling TCP keep-alive, raising the firewall and load-balancer idle timeouts, and configuring matching heartbeat intervals maintains continuous communication throughout execution. The settings are coordinated so that every timeout is longer than the traffic that refreshes it: the 25-second keep-alive and 30-to-60-second heartbeats arrive well inside the 5-to-10-minute heartbeat timeouts and the 30-to-60-minute idle timeouts, so no timer can expire between two consecutive packets.
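That ordering can be checked mechanically. Using the values configured in the sections above (the timer labels are descriptive, not configuration keys):

```python
# Each timer, the traffic that resets it, the refresh interval (s),
# and the timeout (s), taken from the configuration above.
chain = [
    ("firewall idle timeout",   "TCP keep-alive",  25, 3600),
    ("LB idle timeout",         "TCP keep-alive",  25, 30 * 60),
    ("agent heartbeat timeout", "agent heartbeat", 30, 300),
    ("job heartbeat timeout",   "job heartbeat",   60, 600),
]
for timer, refresher, interval, timeout in chain:
    # A timer is safe only if the traffic that resets it arrives
    # strictly more often than the timer expires.
    safe = timeout > interval
    print(f"{timer}: {refresher} every {interval}s vs {timeout}s limit "
          f"({timeout // interval}x margin) -> {'OK' if safe else 'TOO SHORT'}")
```

Every timer here has at least a 10x margin, so a handful of lost packets never pushes a connection past its limit.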
Additional Recommendation:
Consider implementing job monitoring alerts using Azure Monitor or your existing monitoring platform. Set up alerts when:
- Job execution time exceeds 90 minutes (your typical max)
- Heartbeat timeout warnings appear in logs
- Job remains in Processing status for more than 2 hours
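Whatever monitoring platform you use, the three conditions reduce to simple threshold checks against a job-status snapshot. A sketch of the evaluation logic (the field names are illustrative, not an Epicor or Azure Monitor API):

```python
def alerts_for(job: dict) -> list:
    """Return the alert conditions a job-status snapshot currently trips."""
    fired = []
    if job["elapsed_minutes"] > 90:
        fired.append("execution time exceeds 90 minutes")
    if job["heartbeat_warnings"] > 0:
        fired.append("heartbeat timeout warnings in logs")
    if job["status"] == "Processing" and job["elapsed_minutes"] > 120:
        fired.append("stuck in Processing for more than 2 hours")
    return fired

snapshot = {"elapsed_minutes": 135, "heartbeat_warnings": 2,
            "status": "Processing"}
for alert in alerts_for(snapshot):
    print("ALERT:", alert)
```

The 90-minute threshold fires first as an early warning, while the 2-hour check lines up with the MaxExecutionTime hard stop configured in step 3.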
This comprehensive solution addresses all three focus areas: scheduler-agent connectivity through keep-alive and heartbeat configuration, proper firewall configuration with appropriate idle timeouts, and coordinated job heartbeat timeout settings that prevent premature job failure while still catching genuine hangs.