Let me walk through our complete implementation addressing the ECS RunTask API usage, EventBridge scheduling, and automated monitoring.
ECS RunTask API Usage:
We created a Python Lambda function that serves as the orchestration layer for all batch jobs. Here’s the core RunTask implementation:
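import boto3
ecs = boto3.client('ecs')
response = ecs.run_task(
    cluster='batch-jobs-cluster',
    taskDefinition='data-processor:12',
    launchType='FARGATE',
    # awsvpc mode needs subnets and security groups; the IDs below are placeholders
    networkConfiguration={'awsvpcConfiguration': {
        'subnets': ['subnet-xxxxxxxx'],
        'securityGroups': ['sg-xxxxxxxx'],
    }},
)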
The Lambda function accepts job parameters from EventBridge and dynamically constructs the RunTask API call. Key aspects of our implementation:
- Task definition versioning: We maintain separate task definitions for each job type (data-processor, report-generator, etl-pipeline). Each definition specifies container image, CPU/memory requirements, IAM role, and logging configuration.
- Dynamic parameter passing: Job-specific parameters are passed via container environment overrides in the RunTask call. This allows the same task definition to handle different inputs:
overrides={'containerOverrides': [{
    'name': 'batch-container',
    'environment': [
        {'name': 'JOB_DATE', 'value': '2025-04-28'},
        {'name': 'REGION', 'value': 'us-east-1'}
    ]
}]}
- Network configuration: Our ECS tasks run on Fargate with the awsvpc network mode, which requires subnets and security groups to be specified in the RunTask call. We use private subnets with a NAT gateway for secure outbound access.
- Launch type selection: We use Fargate for most jobs (no infrastructure to manage), but some compute-intensive ETL jobs use the EC2 launch type with Spot Instances for 70% cost savings.
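The aspects above can be sketched as a single handler. This is a minimal sketch, not necessarily how the function described here is written: the `taskDef` and `region` keys come from the rule input shown later in this post, while the `jobDate` and `launchType` event keys, the `TASK_SUBNETS` and `TASK_SECURITY_GROUPS` environment variables, and the choice to send network configuration only for Fargate are assumptions made for illustration.

```python
import os

CLUSTER = "batch-jobs-cluster"      # cluster name from the example above
CONTAINER_NAME = "batch-container"  # container name from the overrides example


def build_run_task_kwargs(event, subnets, security_groups):
    """Translate an EventBridge input payload into run_task keyword arguments."""
    env = [{"name": "REGION", "value": event["region"]}]
    # "jobDate" is a hypothetical optional key; it is forwarded only if present.
    if "jobDate" in event:
        env.append({"name": "JOB_DATE", "value": event["jobDate"]})

    kwargs = {
        "cluster": CLUSTER,
        "taskDefinition": event["taskDef"],
        # "launchType" is a hypothetical key; Fargate is assumed as the default.
        "launchType": event.get("launchType", "FARGATE"),
        "overrides": {
            "containerOverrides": [{"name": CONTAINER_NAME, "environment": env}]
        },
    }
    # Assumption: only the Fargate jobs use awsvpc networking, so the
    # network configuration is attached for that launch type only.
    if kwargs["launchType"] == "FARGATE":
        kwargs["networkConfiguration"] = {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",
            }
        }
    return kwargs


def handler(event, context):
    # Imported lazily so the builder above can be exercised without AWS access.
    import boto3

    subnets = os.environ["TASK_SUBNETS"].split(",")
    security_groups = os.environ["TASK_SECURITY_GROUPS"].split(",")
    response = boto3.client("ecs").run_task(
        **build_run_task_kwargs(event, subnets, security_groups)
    )
    # run_task can report placement problems in "failures" without raising.
    if response.get("failures"):
        raise RuntimeError(f"RunTask failed: {response['failures']}")
    return {"taskArn": response["tasks"][0]["taskArn"]}
```

Keeping the argument construction in a pure function makes it easy to unit test the mapping from rule input to RunTask parameters without calling AWS.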
EventBridge Scheduling:
We replaced our manual job triggers with EventBridge scheduled rules. The architecture:
- Created 15 EventBridge rules, one per batch job type
- Each rule has a cron expression defining when it should fire (e.g., 'cron(0 2 * * ? *)' for 2 AM UTC daily)
- Each rule's target is the orchestration Lambda function, with job-specific parameters passed in the input JSON
Example rule configuration:
- Rule name: daily-data-processor
- Schedule: cron(0 2 * * ? *) - 2 AM daily UTC
- Target: orchestration-lambda
- Input: {"jobType": "data-processor", "taskDef": "data-processor:12", "region": "us-east-1"}
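A rule like the one above could be created with boto3 roughly as follows. This is a sketch under assumptions: the Lambda ARN is passed in by the caller, the target `Id` is an arbitrary label chosen here, and the helper names are hypothetical.

```python
import json


def build_rule_requests(rule_name, schedule, lambda_arn, job_input):
    """Return (put_rule kwargs, put_targets kwargs) for one scheduled job."""
    put_rule = {
        "Name": rule_name,
        "ScheduleExpression": schedule,
        "State": "ENABLED",
    }
    put_targets = {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": "orchestration-lambda",  # arbitrary label for this target
                "Arn": lambda_arn,
                # Constant JSON string delivered to the target as its event.
                "Input": json.dumps(job_input),
            }
        ],
    }
    return put_rule, put_targets


def create_daily_data_processor_rule(lambda_arn):
    # Imported lazily so the builder above can be exercised without AWS access.
    import boto3

    events = boto3.client("events")
    put_rule, put_targets = build_rule_requests(
        "daily-data-processor",
        "cron(0 2 * * ? *)",  # 2 AM UTC daily
        lambda_arn,
        {"jobType": "data-processor", "taskDef": "data-processor:12", "region": "us-east-1"},
    )
    events.put_rule(**put_rule)
    events.put_targets(**put_targets)
```

The Lambda function also needs a resource-based permission that allows events.amazonaws.com to invoke it; that step is left out of the sketch.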
For jobs with dependencies, we created Step Functions state machines that EventBridge triggers. The state machine calls ECS RunTask for each step and waits for completion:
// Pseudocode - ETL pipeline state machine:
- Extract Task: Run ECS task to extract data from source
- Wait for Extract completion (poll task status)
- Transform Task: Run ECS task to transform extracted data
- Wait for Transform completion
- Load Task: Run ECS task to load into target database
- Send completion notification via SNS
This handles complex workflows with 3-5 sequential jobs that previously required manual coordination.
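As an illustration, the ETL pipeline above could be expressed as an Amazon States Language definition built in Python. Note the swapped-in technique: instead of explicit polling, this sketch uses the `ecs:runTask.sync` service integration, which has Step Functions wait for the task to stop. The task definition names (`etl-extract`, `etl-transform`, `etl-load`), the state names, and the notification message are hypothetical.

```python
import json


def ecs_step(task_definition, next_state, subnets, security_groups):
    """One Task state that runs an ECS task and waits for it to stop."""
    return {
        "Type": "Task",
        # The .sync suffix makes Step Functions wait for task completion.
        "Resource": "arn:aws:states:::ecs:runTask.sync",
        "Parameters": {
            "Cluster": "batch-jobs-cluster",
            "TaskDefinition": task_definition,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "AwsvpcConfiguration": {
                    "Subnets": subnets,
                    "SecurityGroups": security_groups,
                    "AssignPublicIp": "DISABLED",
                }
            },
        },
        "Next": next_state,
    }


def build_etl_definition(topic_arn, subnets, security_groups):
    """Chain Extract -> Transform -> Load -> Notify into one state machine."""
    steps = [
        ("Extract", "etl-extract"),      # hypothetical task definition names
        ("Transform", "etl-transform"),
        ("Load", "etl-load"),
    ]
    states = {}
    for i, (name, task_def) in enumerate(steps):
        # The last ECS step hands over to the SNS notification state.
        next_state = steps[i + 1][0] if i + 1 < len(steps) else "Notify"
        states[name] = ecs_step(task_def, next_state, subnets, security_groups)
    states["Notify"] = {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {"TopicArn": topic_arn, "Message": "ETL pipeline completed"},
        "End": True,
    }
    return {"Comment": "ETL pipeline sketch", "StartAt": "Extract", "States": states}


def definition_json(topic_arn, subnets, security_groups):
    """Serialize the definition for use when creating the state machine."""
    return json.dumps(build_etl_definition(topic_arn, subnets, security_groups), indent=2)
```

The resulting JSON string is what would be supplied as the state machine definition; retries and error handling per step are omitted to keep the sketch short.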
Automated Job Monitoring:
We implemented comprehensive monitoring to replace manual observation:
- CloudWatch Events for task state changes:
  - Created EventBridge rule matching ECS task state change events
  - Rule pattern filters for STOPPED tasks with non-zero exit codes (failures)
  - Triggers Lambda function that parses failure reason and sends detailed Slack notification
- CloudWatch Logs Insights for job analysis:
  - All ECS tasks log to CloudWatch Logs with structured JSON format
  - Created saved queries for common troubleshooting patterns
  - Lambda function runs Logs Insights query after job completion to extract key metrics
- Custom CloudWatch metrics:
  - Lambda publishes custom metrics: job duration, records processed, error count
  - Created dashboards showing job performance trends over time
  - Alarms trigger when job duration exceeds baseline by 50% or error rate spikes
- SNS notification workflow:
  - Success: Minimal notification to audit log only
  - Failure: Detailed notification to Slack with error details, task ARN, logs link
  - Timeout: Alert if job doesn’t complete within expected duration (set per job type)
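The failure-detection path described above might look like the following sketch. The event pattern uses the standard ECS Task State Change event fields; the `ALERT_TOPIC_ARN` environment variable, the message layout, and the helper names are assumptions, and forwarding from SNS to Slack is left out. Tasks that stop before a container starts may carry no exit code, so a handler may also want to inspect `stoppedReason`.

```python
import os

# Event pattern for the EventBridge rule: stopped tasks with a non-zero exit code.
FAILED_TASK_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "lastStatus": ["STOPPED"],
        "containers": {"exitCode": [{"anything-but": 0}]},
    },
}


def summarize_failure(event):
    """Return a short failure summary, or None if no container failed."""
    detail = event["detail"]
    # Keep containers that reported an exit code other than 0.
    failed = [
        c for c in detail.get("containers", [])
        if c.get("exitCode") not in (0, None)
    ]
    if not failed:
        return None
    lines = [
        f"ECS task failed: {detail['taskArn']}",
        f"Stopped reason: {detail.get('stoppedReason', 'unknown')}",
    ]
    for c in failed:
        lines.append(f"- {c['name']}: exit code {c['exitCode']}")
    return "\n".join(lines)


def handler(event, context):
    summary = summarize_failure(event)
    if summary is None:
        return
    # Imported lazily so summarize_failure can be tested without AWS access.
    import boto3

    boto3.client("sns").publish(
        TopicArn=os.environ["ALERT_TOPIC_ARN"],  # hypothetical variable name
        Message=summary,
    )
```

Re-checking the exit codes inside the handler keeps it safe to reuse even if the rule pattern is later loosened to match every STOPPED task.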
Results and Benefits:
After 6 months of operation:
- No manual intervention for routine batch jobs (down from 4-5 hours of manual work daily)
- Job failure rate decreased from 8% to 2% due to automated retries and better error handling
- Cost reduction of 40% by using Fargate Spot and right-sizing task resources based on metrics
- Job completion time improved 25% by running in parallel jobs that previously ran sequentially because of manual coordination
The key to success was starting with a few simple jobs, validating that the monitoring catches failures reliably, and then gradually migrating more complex workflows. The ECS RunTask API combined with EventBridge gave us a scalable automation platform that eliminated our operational toil while improving reliability.