Automated ECS task API orchestration for scheduled batch jobs reduces manual operations workload by 60%

We recently automated our batch job orchestration using the ECS RunTask API with EventBridge scheduling, eliminating the manual operations overhead we had with EC2-based batch processing. Previously, our team manually triggered batch jobs through a web interface and monitored logs to verify completion, which consumed about 4-5 hours of engineering time daily across multiple time zones.

The solution leverages EventBridge scheduled rules to invoke Lambda functions that call the ECS RunTask API with specific task definitions for each batch job type. We have 15 different batch jobs (data processing, report generation, ETL pipelines) that run at various intervals throughout the day. The automated system now handles job scheduling, execution, and basic monitoring without human intervention. I’ll share the architecture and implementation details that might help others looking to automate similar workflows.

What about monitoring? When jobs were manual, someone was watching them. How do you know if an automated job fails or gets stuck?

Let me walk through our complete implementation addressing the ECS RunTask API usage, EventBridge scheduling, and automated monitoring.

ECS RunTask API Usage: We created a Python Lambda function that serves as the orchestration layer for all batch jobs. Here’s the core RunTask implementation:

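import boto3

ecs = boto3.client('ecs')

# Start one task from revision 12 of the data-processor task definition
response = ecs.run_task(
    cluster='batch-jobs-cluster',
    taskDefinition='data-processor:12'
)
# RunTask can return placement failures in 'failures' without raising
if response['failures']:
    raise RuntimeError(response['failures'])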

The Lambda function accepts job parameters from EventBridge and dynamically constructs the RunTask API call. Key aspects of our implementation:

  1. Task Definition versioning: We maintain separate task definitions for each job type (data-processor, report-generator, etl-pipeline). Each definition specifies container image, CPU/memory requirements, IAM role, and logging configuration.

  2. Dynamic parameter passing: Job-specific parameters are passed via container environment overrides in the RunTask call. This allows the same task definition to handle different inputs:

overrides={'containerOverrides': [{
    'name': 'batch-container',
    'environment': [
        {'name': 'JOB_DATE', 'value': '2025-04-28'},
        {'name': 'REGION', 'value': 'us-east-1'}
    ]
}]}

  3. Network configuration: Our ECS tasks run on Fargate with the awsvpc network mode, which requires subnets and security groups to be specified in the RunTask call. We use private subnets with a NAT gateway for secure outbound access.

  4. Launch type selection: We use Fargate for most jobs (no infrastructure management), but some compute-intensive ETL jobs use the EC2 launch type with Spot Instances for 70% cost savings.
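To tie the four points above together, here is a minimal sketch of how the orchestration Lambda might turn an EventBridge input payload into a RunTask call. The subnet and security group IDs are placeholders, the `launchType` and `jobDate` input keys are hypothetical additions for illustration (the example input only carries `jobType`, `taskDef`, and `region`), and the sketch assumes the EC2-based jobs do not use awsvpc networking.

```python
from typing import Any

# Hypothetical placeholders: substitute your own subnet and security group IDs.
PRIVATE_SUBNETS = ["subnet-EXAMPLE1", "subnet-EXAMPLE2"]
SECURITY_GROUPS = ["sg-EXAMPLE"]


def build_run_task_kwargs(event: dict[str, Any]) -> dict[str, Any]:
    """Translate an EventBridge input payload into RunTask keyword arguments."""
    environment = [{"name": "REGION", "value": event["region"]}]
    if "jobDate" in event:  # hypothetical optional key
        environment.append({"name": "JOB_DATE", "value": event["jobDate"]})

    kwargs: dict[str, Any] = {
        "cluster": "batch-jobs-cluster",
        "taskDefinition": event["taskDef"],
        # "launchType" is a hypothetical input key; Fargate is the default here.
        "launchType": event.get("launchType", "FARGATE"),
        "overrides": {
            "containerOverrides": [
                {"name": "batch-container", "environment": environment}
            ]
        },
    }
    if kwargs["launchType"] == "FARGATE":
        # Fargate tasks use awsvpc mode, so subnets and security groups are required.
        kwargs["networkConfiguration"] = {
            "awsvpcConfiguration": {
                "subnets": PRIVATE_SUBNETS,
                "securityGroups": SECURITY_GROUPS,
                "assignPublicIp": "DISABLED",  # private subnets behind a NAT gateway
            }
        }
    return kwargs


def handler(event, context):
    """Lambda entry point: start the task and surface placement failures."""
    import boto3  # imported lazily so the builder can be exercised without AWS access

    response = boto3.client("ecs").run_task(**build_run_task_kwargs(event))
    if response["failures"]:
        raise RuntimeError(f"RunTask failed: {response['failures']}")
    return {"taskArn": response["tasks"][0]["taskArn"]}
```

Keeping the request construction in a pure function makes it easy to unit test each job's parameters without touching AWS.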

EventBridge Scheduling: We replaced our manual job triggers with EventBridge scheduled rules. The architecture:

  1. Created 15 EventBridge rules, one per batch job type
  2. Each rule has a cron expression defining when it should fire (e.g., ‘cron(0 2 * * ? *)’ for 2 AM daily)
  3. Rule target is the orchestration Lambda function with job-specific parameters passed in the input JSON

Example rule configuration:

  • Rule name: daily-data-processor
  • Schedule: cron(0 2 * * ? *) - 2 AM daily UTC
  • Target: orchestration-lambda
  • Input: {"jobType": "data-processor", "taskDef": "data-processor:12", "region": "us-east-1"}
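For anyone who prefers to script the rule rather than click through the console, the configuration above could be created roughly like this with boto3's `put_rule` and `put_targets`. The Lambda ARN (including the account ID) is a made-up placeholder, and the helper names are mine, not part of the original setup.

```python
import json

# Hypothetical ARN for illustration only.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:orchestration-lambda"


def build_rule_requests(rule_name: str, schedule: str, lambda_arn: str, job_input: dict):
    """Return the keyword arguments for events.put_rule and events.put_targets."""
    put_rule = {
        "Name": rule_name,
        "ScheduleExpression": schedule,  # cron expressions are evaluated in UTC
        "State": "ENABLED",
    }
    put_targets = {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": "orchestration-lambda",
                "Arn": lambda_arn,
                # Constant JSON delivered to the Lambda as its event payload.
                "Input": json.dumps(job_input),
            }
        ],
    }
    return put_rule, put_targets


def create_rule(rule_name: str, schedule: str, lambda_arn: str, job_input: dict) -> None:
    """Create or update the scheduled rule and attach the Lambda target."""
    import boto3  # imported lazily so the builder can be exercised without AWS access

    events = boto3.client("events")
    put_rule, put_targets = build_rule_requests(rule_name, schedule, lambda_arn, job_input)
    events.put_rule(**put_rule)
    events.put_targets(**put_targets)


rule_request, targets_request = build_rule_requests(
    "daily-data-processor",
    "cron(0 2 * * ? *)",
    LAMBDA_ARN,
    {"jobType": "data-processor", "taskDef": "data-processor:12", "region": "us-east-1"},
)
```

Note that the Lambda function also needs a resource-based permission allowing events.amazonaws.com to invoke it, which this sketch does not set up.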

For jobs with dependencies, we created Step Functions state machines that EventBridge triggers. The state machine calls ECS RunTask for each step and waits for completion:

// Pseudocode - ETL pipeline state machine:

  1. Extract Task: Run ECS task to extract data from source
  2. Wait for Extract completion (poll task status)
  3. Transform Task: Run ECS task to transform extracted data
  4. Wait for Transform completion
  5. Load Task: Run ECS task to load into target database
  6. Send completion notification via SNS

This handles complex workflows with 3-5 sequential jobs that previously required manual coordination.
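The pipeline steps above could be expressed as a Step Functions definition along these lines. This is a sketch under assumptions, not our exact state machine: it uses the `ecs:runTask.sync` service integration so Step Functions waits for each task instead of an explicit polling loop, and the task definition names (`etl-extract`, `etl-transform`, `etl-load`), network IDs, SNS topic ARN, and retry values are all hypothetical.

```python
import json

# Hypothetical placeholders; replace with your own network resources.
SUBNETS = ["subnet-EXAMPLE"]
SECURITY_GROUPS = ["sg-EXAMPLE"]


def ecs_step(task_def: str, next_state: str) -> dict:
    """One pipeline step: run an ECS task and block until it stops."""
    return {
        "Type": "Task",
        # The .sync suffix makes Step Functions wait for the task to finish.
        "Resource": "arn:aws:states:::ecs:runTask.sync",
        "Parameters": {
            "Cluster": "batch-jobs-cluster",
            "TaskDefinition": task_def,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "AwsvpcConfiguration": {
                    "Subnets": SUBNETS,
                    "SecurityGroups": SECURITY_GROUPS,
                }
            },
        },
        # Retry values are illustrative, not the settings from the post.
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 60,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }
        ],
        "Next": next_state,
    }


def build_etl_definition(topic_arn: str) -> dict:
    """Extract -> Transform -> Load -> SNS notification, run strictly in order."""
    return {
        "Comment": "ETL pipeline sketch",
        "StartAt": "Extract",
        "States": {
            "Extract": ecs_step("etl-extract", "Transform"),
            "Transform": ecs_step("etl-transform", "Load"),
            "Load": ecs_step("etl-load", "Notify"),
            "Notify": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": topic_arn,
                    "Message": "ETL pipeline completed",
                },
                "End": True,
            },
        },
    }


# Hypothetical topic ARN for illustration only.
definition = build_etl_definition("arn:aws:sns:us-east-1:123456789012:batch-notifications")
definition_json = json.dumps(definition, indent=2)
```

The resulting `definition_json` string is what you would pass as the `definition` argument when creating the state machine, together with an IAM role that is allowed to run the ECS tasks and publish to the topic.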

Automated Job Monitoring: We implemented comprehensive monitoring to replace manual observation:

  1. EventBridge (formerly CloudWatch Events) rule for task state changes:

    • Created EventBridge rule matching ECS task state change events
    • Rule pattern filters for STOPPED tasks with non-zero exit codes (failures)
    • Triggers Lambda function that parses failure reason and sends detailed Slack notification
  2. CloudWatch Logs Insights for job analysis:

    • All ECS tasks log to CloudWatch Logs with structured JSON format
    • Created saved queries for common troubleshooting patterns
    • Lambda function runs Logs Insights query after job completion to extract key metrics
  3. Custom CloudWatch metrics:

    • Lambda publishes custom metrics: job duration, records processed, error count
    • Created dashboards showing job performance trends over time
    • Alarms trigger when job duration exceeds baseline by 50% or error rate spikes
  4. SNS notification workflow:

    • Success: Minimal notification to audit log only
    • Failure: Detailed notification to Slack with error details, task ARN, logs link
    • Timeout: Alert if job doesn’t complete within expected duration (set per job type)
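As a rough illustration of the failure-detection piece, the sketch below pairs an event pattern for stopped ECS tasks with a function that builds the failure message. It differs from the setup described above in one respect: the rule pattern only matches STOPPED tasks, and the exit-code check happens in the handler, which also catches tasks that stopped without reporting any exit code. The function name and the sample ARN are hypothetical, and posting the message to Slack is left out.

```python
from typing import Optional

# Event pattern for the EventBridge rule: every ECS task that reaches STOPPED.
STOPPED_TASK_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["STOPPED"]},
}


def summarize_failure(event: dict) -> Optional[str]:
    """Return a notification message for a failed task, or None if it succeeded."""
    detail = event["detail"]
    if detail.get("lastStatus") != "STOPPED":
        return None

    # A container counts as failed if its exit code is missing or non-zero.
    failed = [c for c in detail.get("containers", []) if c.get("exitCode") != 0]
    if not failed:
        return None

    lines = [
        f"ECS task failed: {detail['taskArn']}",
        f"Reason: {detail.get('stoppedReason', 'unknown')}",
    ]
    for container in failed:
        lines.append(
            f"- {container['name']}: exit code {container.get('exitCode', 'none')}"
        )
    return "\n".join(lines)


# Trimmed-down sample event with a made-up task ARN.
sample_event = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "detail": {
        "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/batch-jobs-cluster/EXAMPLE",
        "lastStatus": "STOPPED",
        "stoppedReason": "Essential container in task exited",
        "containers": [{"name": "batch-container", "exitCode": 1}],
    },
}
message = summarize_failure(sample_event)
```

Returning `None` for successful runs keeps the success path quiet, in line with the "minimal notification" approach for successes.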

Results and Benefits: After 6 months of operation:

  • Zero manual interventions for routine batch jobs (down from 4-5 hours daily)
  • Job failure rate decreased from 8% to 2% due to automated retries and better error handling
  • Cost reduction of 40% by using Fargate Spot and right-sizing task resources based on metrics
  • Job completion time improved 25% by running parallel jobs that were previously sequential due to manual coordination

The key to success was starting with a few simple jobs, validating that the monitoring catches failures reliably, then gradually migrating more complex workflows. The ECS RunTask API combined with EventBridge provides a powerful, scalable automation platform that eliminated our operational toil while improving reliability.

How do you handle job configuration and parameters? Our batch jobs need different inputs depending on the day, region, or customer. Hardcoding everything in task definitions doesn’t seem scalable.

For dependencies, we use Step Functions to orchestrate jobs that need to run in sequence. EventBridge triggers the Step Functions workflow, which then calls the ECS RunTask API for each step and waits for the task to complete before proceeding. Error handling and retries are configured in the state machine itself (its Retry and Catch settings), so no one has to step in when a step fails. For independent jobs, EventBridge triggers the Lambda function directly, which calls RunTask. That path is simpler and more cost-effective.