EC2 AMI backup automation through Lambda fails to enforce retention policies causing cost escalation

Lambda function creates EC2 AMI backups successfully every night but orphaned AMIs and snapshots are accumulating. We implemented a 30-day retention policy in our Lambda code but it’s not deleting old backups. Monthly costs increased from $450 to $2,100 over three months due to snapshot storage.

Our Lambda function tags AMIs with CreatedDate and RetentionDays but the cleanup logic isn’t working. We’re also not getting proper CloudWatch logs to debug why deletions fail.

ami = ec2.create_image(
    InstanceId=instance_id,
    Name=f"backup-{instance_id}-{date}"
)
ec2.create_tags(Resources=[ami['ImageId']],
    Tags=[{'Key':'RetentionDays','Value':'30'}])

Should we be using Data Lifecycle Manager instead? How do others handle AMI retention at scale without manual cleanup?

We had the exact same cost explosion issue. The usual culprits are Lambda timeout limits and missing error handling. If your cleanup function times out while processing a large number of AMIs, the run is cut off partway and nothing alerts you unless you alarm on errors. You need to implement pagination and process deletions in batches. Also add a dead letter queue to catch failed invocations. DLM is honestly much simpler for this use case: it handles both AMI and snapshot lifecycle automatically with no code to maintain.

I checked and we’re definitely not deleting snapshots - that’s probably a big part of the cost issue. How do I find which snapshots belong to a deregistered AMI? The snapshot descriptions reference AMI IDs but I’m not sure how to reliably query that relationship.

Let me address your retention automation challenges comprehensively:

Data Lifecycle Manager (DLM) vs Lambda: For EC2 AMI backups, DLM is the recommended approach and will solve most of your issues. However, if you need custom logic (like conditional backups based on instance tags or integration with external systems), Lambda is still valid. Here’s how to fix your implementation:

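DLM Policy Configuration: DLM handles both the AMI and its snapshots when the policy type is IMAGE_MANAGEMENT (the default type, EBS_SNAPSHOT_MANAGEMENT, only creates snapshots). The command also needs a description, a state, an execution role, and a schedule with a name and a create rule:

# DLM_ROLE_ARN is a placeholder for the ARN of your DLM execution role
aws dlm create-lifecycle-policy \
  --description "Daily AMI backup, keep 30" --state ENABLED \
  --execution-role-arn "$DLM_ROLE_ARN" \
  --policy-details '{"PolicyType":"IMAGE_MANAGEMENT","ResourceTypes":["INSTANCE"],
    "TargetTags":[{"Key":"Backup","Value":"true"}],
    "Schedules":[{"Name":"daily","CreateRule":{"Interval":24,"IntervalUnit":"HOURS","Times":["03:00"]},
      "RetainRule":{"Count":30}}]}'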

DLM advantages: automatic snapshot cleanup when an AMI ages out, built-in retry logic, no Lambda maintenance, and no charge for DLM itself (you pay only for the snapshot storage it manages).

AMI and Snapshot Tagging Strategy: Your current tagging is incomplete. You need:

# Pseudocode - Comprehensive tagging:
1. Tag AMI with: CreatedDate, RetentionDays, InstanceId, BackupType
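2. Tag all associated snapshots with the same tags (create_image accepts TagSpecifications for both the image and snapshot resource types, so this can happen in the same call)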
3. Add DeleteAfter timestamp (CreatedDate + RetentionDays)
4. Use consistent tag schema: BackupPolicy:AutomatedDaily
5. Enable Cost Allocation Tags in billing console
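
The tagging steps above can be sketched in Python roughly as follows. This is a minimal sketch, not your actual function: the helper names build_backup_tags and create_tagged_backup are made up for illustration, the BackupType value "ami" is an assumption, and ec2 is expected to be a boto3 EC2 client such as boto3.client("ec2"). It stores DeleteAfter as an ISO 8601 UTC timestamp so the cleanup job can compare it directly.

```python
from datetime import datetime, timedelta, timezone


def build_backup_tags(instance_id, retention_days=30, now=None):
    """Build the tag list shared by the AMI and its snapshots."""
    created = now or datetime.now(timezone.utc)
    delete_after = created + timedelta(days=retention_days)
    return [
        {"Key": "CreatedDate", "Value": created.isoformat()},
        {"Key": "RetentionDays", "Value": str(retention_days)},
        {"Key": "DeleteAfter", "Value": delete_after.isoformat()},
        {"Key": "InstanceId", "Value": instance_id},
        {"Key": "BackupType", "Value": "ami"},  # assumed value
        {"Key": "BackupPolicy", "Value": "AutomatedDaily"},
    ]


def create_tagged_backup(ec2, instance_id, retention_days=30, now=None):
    """Create an AMI and tag it and its snapshots in the same API call."""
    created = now or datetime.now(timezone.utc)
    tags = build_backup_tags(instance_id, retention_days, created)
    response = ec2.create_image(
        InstanceId=instance_id,
        Name=f"backup-{instance_id}-{created:%Y-%m-%d}",
        TagSpecifications=[
            {"ResourceType": "image", "Tags": tags},
            {"ResourceType": "snapshot", "Tags": tags},
        ],
    )
    return response["ImageId"]
```

Passing the tags through TagSpecifications avoids a separate create_tags call, so a failure between the two calls cannot leave an untagged AMI behind.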

Lambda CloudWatch Logging: Your logging is probably failing due to insufficient permissions or missing log group. Fix:

  • Ensure Lambda execution role has logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
  • Add explicit logging at each step: backup creation, tagging, cleanup attempts, errors
  • Use structured logging with JSON format for easier CloudWatch Insights queries
  • Set log retention to 30 days minimum for debugging
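
For the structured logging point, one possible shape is a small helper that emits one JSON object per step. This is only a sketch: the logger name "ami_backup" and the field names step, status and image_id are assumptions, not anything your function already uses.

```python
import json
import logging

logger = logging.getLogger("ami_backup")  # hypothetical logger name
logger.setLevel(logging.INFO)


def format_event(step, status, **fields):
    """Serialize one backup/cleanup step as a single JSON line."""
    return json.dumps({"step": step, "status": status, **fields},
                      sort_keys=True, default=str)


def log_event(step, status, **fields):
    """Log the step at ERROR level on failure, INFO otherwise."""
    line = format_event(step, status, **fields)
    if status == "error":
        logger.error(line)
    else:
        logger.info(line)
    return line
```

A call such as log_event("deregister", "error", image_id="ami-0abc", reason="AccessDenied") then gives you one searchable line per failed deletion.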

Snapshot Lifecycle Automation: The complete cleanup logic you’re missing:

# Pseudocode - Complete cleanup flow:
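1. List your own AMIs and keep those whose DeleteAfter tag is earlier than current_date (EC2 tag filters only match values, so the date comparison has to happen in your code)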
2. For each AMI:
   a. Call describe_images to get BlockDeviceMappings
   b. Extract all Ebs.SnapshotId values from mappings
   c. Check AMI sharing with describe_image_attribute
   d. If shared, call modify_image_attribute to remove permissions
   e. Call deregister_image
   f. For each snapshot_id:
      - Verify snapshot not used by other AMIs
      - Call delete_snapshot with error handling
3. Log all actions and errors to CloudWatch
4. Send summary metrics to CloudWatch custom namespace
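
A rough Python sketch of the flow above, under several assumptions: ec2 is a boto3 EC2 client from a boto3 version that provides a describe_images paginator, DeleteAfter holds an ISO 8601 timestamp, and the function names are invented for illustration. It leaves out the launch-permission steps (c, d) and the custom metrics (step 4), and it skips AMIs that are not in the available state so it does not touch a backup that is still being created.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("ami_cleanup")  # hypothetical logger name


def tag_value(resource, key):
    """Return the value of a tag on an EC2 resource, or None."""
    for tag in resource.get("Tags", []):
        if tag["Key"] == key:
            return tag["Value"]
    return None


def snapshot_ids(image):
    """Collect the EBS snapshot IDs referenced by an AMI."""
    return [m["Ebs"]["SnapshotId"]
            for m in image.get("BlockDeviceMappings", [])
            if "SnapshotId" in m.get("Ebs", {})]


def is_expired(image, now):
    """True when the AMI is available and its DeleteAfter tag has passed."""
    delete_after = tag_value(image, "DeleteAfter")
    if delete_after is None or image.get("State") != "available":
        return False
    parsed = datetime.fromisoformat(delete_after)
    if parsed.tzinfo is None:  # treat naive timestamps as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed < now


def cleanup_expired_images(ec2, now=None):
    """Deregister expired AMIs, then delete snapshots no other AMI uses."""
    now = now or datetime.now(timezone.utc)
    images = []
    for page in ec2.get_paginator("describe_images").paginate(Owners=["self"]):
        images.extend(page["Images"])
    expired = [i for i in images if is_expired(i, now)]
    expired_ids = {i["ImageId"] for i in expired}
    in_use = {s for i in images if i["ImageId"] not in expired_ids
              for s in snapshot_ids(i)}
    summary = {"deregistered": [], "snapshots_deleted": [], "errors": []}
    for image in expired:
        try:
            ec2.deregister_image(ImageId=image["ImageId"])
            summary["deregistered"].append(image["ImageId"])
        except Exception as exc:  # botocore ClientError in practice
            logger.error("deregister %s failed: %s", image["ImageId"], exc)
            summary["errors"].append(image["ImageId"])
            continue
        for snap_id in snapshot_ids(image):
            if snap_id in in_use:
                continue  # still referenced by an AMI we are keeping
            try:
                ec2.delete_snapshot(SnapshotId=snap_id)
                summary["snapshots_deleted"].append(snap_id)
            except Exception as exc:
                logger.error("delete %s failed: %s", snap_id, exc)
                summary["errors"].append(snap_id)
    return summary
```

Returning the summary dictionary gives the handler something concrete to log or publish, so a run that deletes nothing is visible instead of silent.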

Critical Issues in Your Current Code:

  1. No snapshot tracking - AMI deletion leaves orphaned snapshots
  2. Missing error handling - failures are silent
  3. No pagination - a single describe call over thousands of AMIs or snapshots is slow and can push the function past its timeout
  4. Race condition - cleanup runs while backup is in progress
  5. No validation - doesn’t check if AMI is actually old enough to delete

Recommended Solution: Migrate to DLM for standard use cases. It’s purpose-built for this and eliminates most of your operational burden. Reserve Lambda for:

  • Pre-backup validation (instance health checks)
  • Post-backup verification (AMI availability testing)
  • Custom notification workflows
  • Integration with external backup catalogs

Cost Recovery Plan:

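  1. Identify all orphaned snapshots: list the snapshots you own, pull the “ami-” ID out of each description (snapshots created by CreateImage mention the AMI they were made for), and flag those whose AMI no longer exists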
  2. Create one-time cleanup Lambda with higher timeout (5 minutes)
  3. Process deletions in batches of 50 with exponential backoff
  4. Monitor CloudWatch for throttling errors
  5. Expected cost reduction: 60-70% once cleanup completes
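
Steps 1 and 3 of the plan could look roughly like this in Python. Treat it as a sketch under assumptions: ec2 is a boto3 EC2 client, snapshot descriptions still carry the AMI ID that CreateImage wrote into them, throttling surfaces as an error whose code is RequestLimitExceeded or Throttling, and all function names are invented here. Review the orphan list by hand before deleting anything.

```python
import re
import time

AMI_ID_PATTERN = re.compile(r"ami-[0-9a-f]+")


def find_orphaned_snapshots(snapshots, existing_image_ids):
    """Return IDs of snapshots whose description names a missing AMI."""
    orphans = []
    for snap in snapshots:
        match = AMI_ID_PATTERN.search(snap.get("Description", ""))
        if match and match.group(0) not in existing_image_ids:
            orphans.append(snap["SnapshotId"])
    return orphans


def list_orphaned_snapshots(ec2):
    """Fetch own AMIs and snapshots page by page, then compare them."""
    image_ids = set()
    for page in ec2.get_paginator("describe_images").paginate(Owners=["self"]):
        image_ids.update(i["ImageId"] for i in page["Images"])
    snapshots = []
    for page in ec2.get_paginator("describe_snapshots").paginate(OwnerIds=["self"]):
        snapshots.extend(page["Snapshots"])
    return find_orphaned_snapshots(snapshots, image_ids)


def is_throttled(exc):
    """Detect a throttling error from a botocore-style exception."""
    code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
    return code in ("RequestLimitExceeded", "Throttling")


def delete_with_backoff(ec2, snap_id, max_retries, sleep):
    """Delete one snapshot, retrying throttled calls with 1s, 2s, 4s... waits."""
    for attempt in range(max_retries):
        try:
            ec2.delete_snapshot(SnapshotId=snap_id)
            return True
        except Exception as exc:  # botocore ClientError in practice
            if not is_throttled(exc):
                return False
            sleep(2 ** attempt)
    return False


def delete_in_batches(ec2, snapshot_ids, batch_size=50, max_retries=5,
                      sleep=time.sleep):
    """Delete snapshots batch by batch; return the IDs that failed."""
    failed = []
    for start in range(0, len(snapshot_ids), batch_size):
        for snap_id in snapshot_ids[start:start + batch_size]:
            if not delete_with_backoff(ec2, snap_id, max_retries, sleep):
                failed.append(snap_id)
        sleep(1)  # short pause between batches to stay under API limits
    return failed
```

The list of failed IDs that delete_in_batches returns can be fed back into the next run, which keeps each invocation comfortably inside the Lambda timeout.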

Of your $2,100 bill, about $1,650 (roughly 80%) is growth over the original $450, and most of that is likely orphaned snapshots. Clean those up first, then implement DLM going forward.