EC2 AMI backup automation through Lambda fails to enforce retention policies causing cost escalation

lauradev · December 8, 2024, 4:35pm

Lambda function creates EC2 AMI backups successfully every night but orphaned AMIs and snapshots are accumulating. We implemented a 30-day retention policy in our Lambda code but it’s not deleting old backups. Monthly costs increased from $450 to $2,100 over three months due to snapshot storage.

Our Lambda function tags AMIs with CreatedDate and RetentionDays but the cleanup logic isn’t working. We’re also not getting proper CloudWatch logs to debug why deletions fail.

ami = ec2.create_image(
    InstanceId=instance_id,
    Name=f"backup-{instance_id}-{date}"
)
ec2.create_tags(Resources=[ami['ImageId']],
    Tags=[{'Key':'RetentionDays','Value':'30'}])

Should we be using Data Lifecycle Manager instead? How do others handle AMI retention at scale without manual cleanup?

amandalead · December 15, 2024, 5:46am

We had the exact same cost explosion issue. The problem is Lambda timeout limits and error handling. If your cleanup function times out while processing a large number of AMIs, it fails silently. You need to implement pagination and process deletions in batches. Also add dead letter queues to catch failures. DLM is honestly much simpler for this use case - it handles both AMI and snapshot lifecycle automatically with no code to maintain.
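A minimal sketch of the pagination and batching idea, assuming a boto3 EC2 client is passed in as `ec2`. The helper names `iter_owned_images` and `batched` and the page size are illustrative, not taken from the original function:

```python
from itertools import islice


def iter_owned_images(ec2, page_size=200):
    """Yield every AMI owned by this account, one page at a time.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    paginator = ec2.get_paginator("describe_images")
    pages = paginator.paginate(
        Owners=["self"],
        PaginationConfig={"PageSize": page_size},
    )
    for page in pages:
        yield from page.get("Images", [])


def batched(items, size):
    """Split any iterable into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each batch can then be processed in a single invocation, with the remainder handed to the next run, so one slow deletion does not push the whole function past its timeout.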

justin_pro · December 22, 2024, 5:20pm

I checked and we’re definitely not deleting snapshots - that’s probably a big part of the cost issue. How do I find which snapshots belong to a deregistered AMI? The snapshot descriptions reference AMI IDs but I’m not sure how to reliably query that relationship.
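One possible way to query that relationship, sketched under the assumption that the snapshots carry the usual CreateImage description ("Created by CreateImage(i-...) for ami-..."). Manually created snapshots will not match, and the function names here are hypothetical:

```python
import re

# Snapshots created by CreateImage typically carry a description such as
# "Created by CreateImage(i-0abc...) for ami-0def...". Treat the format as a
# convention, not a guarantee: manually created snapshots differ.
AMI_ID_PATTERN = re.compile(r"\bami-[0-9a-f]{8,17}\b")


def ami_id_from_description(description):
    """Return the first AMI ID mentioned in a snapshot description, or None."""
    match = AMI_ID_PATTERN.search(description or "")
    return match.group(0) if match else None


def find_orphaned_snapshots(snapshots, existing_ami_ids):
    """Return IDs of snapshots whose referenced AMI is no longer registered.

    `snapshots` is a list of dicts shaped like describe_snapshots results;
    `existing_ami_ids` is the set of ImageId values from describe_images.
    """
    orphaned = []
    for snap in snapshots:
        ami_id = ami_id_from_description(snap.get("Description"))
        if ami_id and ami_id not in existing_ami_ids:
            orphaned.append(snap["SnapshotId"])
    return orphaned
```

Feed it the results of `describe_snapshots(OwnerIds=["self"])` and the set of image IDs from `describe_images(Owners=["self"])`, and review the output before deleting anything.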

jessicalead · January 6, 2025, 12:34pm

Let me address your retention automation challenges comprehensively:

Data Lifecycle Manager (DLM) vs Lambda: For EC2 AMI backups, DLM is the recommended approach and will solve most of your issues. However, if you need custom logic (like conditional backups based on instance tags or integration with external systems), Lambda is still valid. Here’s how to fix your implementation:

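DLM Policy Configuration: DLM automatically handles both AMI and snapshot lifecycle. The call also needs a description, a state, an execution role, and a schedule with a name and a create rule. The role ARN and the 03:00 UTC start time below are placeholders:

aws dlm create-lifecycle-policy \
  --description "Nightly AMI backups, keep last 30" \
  --state ENABLED \
  --execution-role-arn arn:aws:iam::<account-id>:role/<dlm-ami-role> \
  --policy-details '{"PolicyType":"IMAGE_MANAGEMENT","ResourceTypes":["INSTANCE"],
    "TargetTags":[{"Key":"Backup","Value":"true"}],
    "Schedules":[{"Name":"nightly","CreateRule":{"Interval":24,"IntervalUnit":"HOURS","Times":["03:00"]},
      "RetainRule":{"Count":30}}]}'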

DLM advantages: automatic snapshot cleanup, built-in retry logic, no Lambda maintenance, better cost efficiency.

AMI and Snapshot Tagging Strategy: Your current tagging is incomplete. You need:

# Pseudocode - Comprehensive tagging:
1. Tag AMI with: CreatedDate, RetentionDays, InstanceId, BackupType
2. Immediately tag all associated snapshots with same tags
3. Add DeleteAfter timestamp (CreatedDate + RetentionDays)
4. Use consistent tag schema: BackupPolicy:AutomatedDaily
5. Enable Cost Allocation Tags in billing console
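A sketch of steps 1-4, assuming a boto3 EC2 client is passed in as `ec2`. It relies on the `TagSpecifications` parameter of `create_image`, which accepts both `image` and `snapshot` resource types, so the snapshots are tagged at creation rather than in a second call. The `BackupType` value and the function names are illustrative:

```python
from datetime import datetime, timedelta, timezone


def build_backup_tag_specifications(instance_id, retention_days=30, now=None):
    """Build TagSpecifications that tag the AMI and its snapshots identically."""
    created = now or datetime.now(timezone.utc)
    delete_after = created + timedelta(days=retention_days)
    tags = [
        {"Key": "CreatedDate", "Value": created.strftime("%Y-%m-%d")},
        {"Key": "RetentionDays", "Value": str(retention_days)},
        {"Key": "DeleteAfter", "Value": delete_after.strftime("%Y-%m-%d")},
        {"Key": "InstanceId", "Value": instance_id},
        {"Key": "BackupType", "Value": "AMI"},  # assumed value
        {"Key": "BackupPolicy", "Value": "AutomatedDaily"},
    ]
    # The "snapshot" entry applies the same tags to the snapshots created
    # as part of this AMI, so no second tagging call is needed.
    return [
        {"ResourceType": "image", "Tags": tags},
        {"ResourceType": "snapshot", "Tags": tags},
    ]


def create_tagged_backup(ec2, instance_id, retention_days=30):
    """Create an AMI with tags applied at creation time; returns the ImageId."""
    now = datetime.now(timezone.utc)
    response = ec2.create_image(
        InstanceId=instance_id,
        Name=f"backup-{instance_id}-{now:%Y-%m-%d}",
        TagSpecifications=build_backup_tag_specifications(
            instance_id, retention_days, now
        ),
    )
    return response["ImageId"]
```

Storing `DeleteAfter` as an ISO date keeps the later expiry check to a simple date comparison.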

Lambda CloudWatch Logging: Your logging is probably failing due to insufficient permissions or a missing log group. Fix:

Ensure Lambda execution role has logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
Add explicit logging at each step: backup creation, tagging, cleanup attempts, errors
Use structured logging with JSON format for easier CloudWatch Insights queries
Set log retention to 30 days minimum for debugging
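For the structured logging point, something along these lines would do. It is a sketch using only the standard library; the logger name and the field names (`step`, `status`) are arbitrary choices:

```python
import json
import logging

logger = logging.getLogger("ami_backup")
logger.setLevel(logging.INFO)


def format_event(step, status, **details):
    """Serialize one backup/cleanup step as a single JSON line."""
    record = {"step": step, "status": status}
    record.update(details)
    return json.dumps(record, sort_keys=True, default=str)


def log_event(step, status, **details):
    """Log at ERROR for failures and INFO otherwise."""
    level = logging.ERROR if status == "error" else logging.INFO
    logger.log(level, format_event(step, status, **details))
```

A call such as `log_event("deregister_image", "error", image_id=image_id, error=str(exc))` then produces one JSON object per line, which CloudWatch Logs Insights can filter by field.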

Snapshot Lifecycle Automation: The complete cleanup logic you’re missing:

# Pseudocode - Complete cleanup flow:
1. Query all AMIs with tag DeleteAfter < current_date
2. For each AMI:
   a. Call describe_images to get BlockDeviceMappings
   b. Extract all Ebs.SnapshotId values from mappings
   c. Check AMI sharing with describe_image_attribute
   d. If shared, call modify_image_attribute to remove permissions
   e. Call deregister_image
   f. For each snapshot_id:
      - Verify snapshot not used by other AMIs
      - Call delete_snapshot with error handling
3. Log all actions and errors to CloudWatch
4. Send summary metrics to CloudWatch custom namespace
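The flow above can be sketched roughly as follows, assuming `ec2` is a boto3 EC2 client, `images` is the full list of owned AMIs from `describe_images`, and `DeleteAfter` is stored as `YYYY-MM-DD`. EC2 tag filters only match exact values or wildcards, so the "less than" comparison is done client-side. The sharing check (steps c and d) and the metrics step are left out for brevity:

```python
from datetime import date


def is_expired(image, today):
    """True if the AMI's DeleteAfter tag (YYYY-MM-DD) is before `today`."""
    tags = {t["Key"]: t["Value"] for t in image.get("Tags", [])}
    delete_after = tags.get("DeleteAfter")
    if not delete_after:
        return False  # never delete untagged AMIs
    return date.fromisoformat(delete_after) < today


def snapshot_ids(image):
    """Collect the EBS snapshot IDs referenced by an AMI."""
    return [
        mapping["Ebs"]["SnapshotId"]
        for mapping in image.get("BlockDeviceMappings", [])
        if "SnapshotId" in mapping.get("Ebs", {})
    ]


def cleanup_expired(ec2, images, today=None):
    """Deregister expired AMIs, then delete snapshots no other AMI uses.

    Returns (deleted_ids, errors) so the caller can log and emit metrics.
    """
    today = today or date.today()
    expired = [img for img in images if is_expired(img, today)]
    expired_ids = {img["ImageId"] for img in expired}
    # Snapshots still referenced by AMIs we are keeping must survive.
    in_use = {
        snap
        for img in images
        if img["ImageId"] not in expired_ids
        for snap in snapshot_ids(img)
    }
    deleted, errors = [], []
    for img in expired:
        try:
            ec2.deregister_image(ImageId=img["ImageId"])
            deleted.append(img["ImageId"])
        except Exception as exc:  # botocore ClientError in practice
            errors.append((img["ImageId"], str(exc)))
            continue  # snapshots of a still-registered AMI cannot be deleted
        for snap in snapshot_ids(img):
            if snap in in_use:
                continue
            try:
                ec2.delete_snapshot(SnapshotId=snap)
                deleted.append(snap)
            except Exception as exc:
                errors.append((snap, str(exc)))
    return deleted, errors
```

Reading the snapshot IDs before deregistering matters: once the AMI is gone, its block device mappings are no longer available from `describe_images`.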

Critical Issues in Your Current Code:

No snapshot tracking - deregistering an AMI does not delete its snapshots, so they are left orphaned
Missing error handling - failures are silent
No pagination - unpaginated describe_images calls become prone to throttling and timeouts as the AMI count grows
Race condition - cleanup runs while backup is in progress
No validation - doesn’t check if AMI is actually old enough to delete

Recommended Solution: Migrate to DLM for standard use cases. It’s purpose-built for this and eliminates 90% of your operational burden. Reserve Lambda for:

Pre-backup validation (instance health checks)
Post-backup verification (AMI availability testing)
Custom notification workflows
Integration with external backup catalogs

Cost Recovery Plan:

Identify all orphaned snapshots: filter for descriptions containing “ami-” where the referenced AMI no longer exists
Create a one-time cleanup Lambda with a higher timeout (5 minutes; Lambda allows up to 15)
Process deletions in batches of 50 with exponential backoff
Monitor CloudWatch for throttling errors
Expected cost reduction: 60-70% once cleanup completes
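A sketch of the batched deletion with exponential backoff, assuming the caller supplies a `delete_fn` such as `lambda sid: ec2.delete_snapshot(SnapshotId=sid)`. The function names, the delays, and the pause between batches are illustrative:

```python
import random
import time


def delete_with_backoff(delete_fn, snapshot_id, max_attempts=5,
                        base_delay=1.0, sleep=time.sleep):
    """Call delete_fn(snapshot_id), retrying with exponential backoff.

    Returns True on success, False once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            delete_fn(snapshot_id)
            return True
        except Exception:  # narrow to throttling errors in real code
            if attempt == max_attempts - 1:
                return False
            # 1s, 2s, 4s, ... plus jitter so retries do not synchronize
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    return False


def delete_in_batches(delete_fn, snapshot_ids, batch_size=50, sleep=time.sleep):
    """Delete snapshots in batches; returns the IDs that could not be deleted."""
    failed = []
    for start in range(0, len(snapshot_ids), batch_size):
        for snapshot_id in snapshot_ids[start:start + batch_size]:
            if not delete_with_backoff(delete_fn, snapshot_id, sleep=sleep):
                failed.append(snapshot_id)
        sleep(1.0)  # brief pause between batches to ease API pressure
    return failed
```

The returned list of failed IDs can be logged and retried in a later run instead of being lost silently.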

Your $2,100 cost is likely 80% orphaned snapshots. Clean those up first, then implement DLM going forward.
