Let me address your retention automation challenges comprehensively:
Data Lifecycle Manager (DLM) vs Lambda:
For EC2 AMI backups, DLM is the recommended approach and will solve most of your issues. However, if you need custom logic (like conditional backups based on instance tags or integration with external systems), Lambda is still valid. Here’s how to fix your implementation:
DLM Policy Configuration:
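DLM handles both AMI and snapshot lifecycle when you use an AMI policy (PolicyType IMAGE_MANAGEMENT): once an AMI ages out, DLM deregisters it and deletes its backing snapshots. The policy also needs a description, a state, an execution role, and a schedule with a create rule. The account ID, schedule name, and start time below are placeholders:
aws dlm create-lifecycle-policy \
  --description "Daily AMI backups" \
  --state ENABLED \
  --execution-role-arn arn:aws:iam::<account-id>:role/AWSDataLifecycleManagerDefaultRoleForAMIManagement \
  --policy-details '{"PolicyType":"IMAGE_MANAGEMENT","ResourceTypes":["INSTANCE"],
    "TargetTags":[{"Key":"Backup","Value":"true"}],
    "Schedules":[{"Name":"DailyAMI","CreateRule":{"Interval":24,"IntervalUnit":"HOURS","Times":["03:00"]},
      "RetainRule":{"Count":30}}]}'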
DLM advantages: automatic snapshot cleanup, built-in retry logic, no Lambda code to maintain, and no charge for DLM itself (you pay only for the snapshot storage).
AMI and Snapshot Tagging Strategy:
Your current tagging is incomplete. You need:
# Pseudocode - Comprehensive tagging:
1. Tag AMI with: CreatedDate, RetentionDays, InstanceId, BackupType
2. Immediately tag all associated snapshots with same tags
3. Add DeleteAfter timestamp (CreatedDate + RetentionDays)
4. Use consistent tag schema: BackupPolicy:AutomatedDaily
5. Enable Cost Allocation Tags in billing console
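The tagging steps above could look roughly like this in Python. This is a minimal sketch, not your actual handler: the helper names, the YYYY-MM-DD date format, and the "daily" BackupType value are my assumptions, and `ec2` is expected to be a boto3 EC2 client (`boto3.client("ec2")`) passed in by the caller.

```python
from datetime import datetime, timedelta, timezone


def build_backup_tags(instance_id, retention_days, backup_type="daily", now=None):
    """Return EC2-style tags, including a precomputed DeleteAfter date."""
    created = now or datetime.now(timezone.utc)
    delete_after = created + timedelta(days=retention_days)
    tags = {
        "CreatedDate": created.strftime("%Y-%m-%d"),
        "RetentionDays": str(retention_days),
        "InstanceId": instance_id,
        "BackupType": backup_type,  # placeholder value, adjust to your schema
        "BackupPolicy": "AutomatedDaily",
        "DeleteAfter": delete_after.strftime("%Y-%m-%d"),
    }
    return [{"Key": key, "Value": value} for key, value in tags.items()]


def tag_ami_and_snapshots(ec2, image_id, tags):
    """Apply the same tags to an AMI and every EBS snapshot behind it."""
    image = ec2.describe_images(ImageIds=[image_id])["Images"][0]
    snapshot_ids = [
        mapping["Ebs"]["SnapshotId"]
        for mapping in image.get("BlockDeviceMappings", [])
        if "SnapshotId" in mapping.get("Ebs", {})  # skips instance-store volumes
    ]
    # One create_tags call covers the AMI and all of its snapshots.
    ec2.create_tags(Resources=[image_id, *snapshot_ids], Tags=tags)
    return snapshot_ids
```

Call `tag_ami_and_snapshots` only after the AMI's snapshot IDs show up in its block device mappings; waiting on `ec2.get_waiter("image_available")` first is one way to make sure of that.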
Lambda CloudWatch Logging:
Your logging is probably failing due to insufficient permissions or missing log group. Fix:
- Ensure Lambda execution role has logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
- Add explicit logging at each step: backup creation, tagging, cleanup attempts, errors
- Use structured logging with JSON format for easier CloudWatch Insights queries
- Set log retention to 30 days minimum for debugging
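For the structured logging and retention points, here is a small sketch. The field names (`step`, `status`) and both helper names are hypothetical; `logs` is assumed to be a boto3 CloudWatch Logs client, and the log group name follows Lambda's default `/aws/lambda/<function-name>` pattern.

```python
import json


def log_event(step, status, **fields):
    """Print one JSON object per line; Lambda forwards stdout to CloudWatch."""
    record = {"step": step, "status": status, **fields}
    line = json.dumps(record, default=str, sort_keys=True)
    print(line)
    return line


def set_log_retention(logs, function_name, days=30):
    """Set retention on the function's default log group."""
    logs.put_retention_policy(
        logGroupName=f"/aws/lambda/{function_name}",
        retentionInDays=days,
    )
```

A call such as `log_event("cleanup", "error", image_id=image_id, error=str(exc))` at each step gives you fields you can filter on in Logs Insights, for example `filter status = "error"`.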
Snapshot Lifecycle Automation:
The complete cleanup logic you’re missing:
# Pseudocode - Complete cleanup flow:
1. List your AMIs that carry a DeleteAfter tag, then keep those with DeleteAfter < current_date (compare client-side; EC2 tag filters only match exact values or wildcards)
2. For each AMI:
   a. Call describe_images to get BlockDeviceMappings
   b. Extract all Ebs.SnapshotId values from mappings
   c. Check AMI sharing with describe_image_attribute
   d. If shared, call modify_image_attribute to remove permissions
   e. Call deregister_image
   f. For each snapshot_id:
      - Verify snapshot not used by other AMIs
      - Call delete_snapshot with error handling
3. Log all actions and errors to CloudWatch
4. Send summary metrics to CloudWatch custom namespace
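The flow above can be sketched in Python as follows. Treat it as an illustration under a few assumptions: `ec2` is a boto3 EC2 client, `images` is the full list of AMIs you own (already fetched from describe_images), DeleteAfter is stored as YYYY-MM-DD, and the function and key names are mine. The "used by other AMIs" check is only as good as the `images` list you pass in.

```python
from datetime import date


def tag_value(image, key):
    """Return the value of one tag on an image dict, or None if absent."""
    for tag in image.get("Tags", []):
        if tag["Key"] == key:
            return tag["Value"]
    return None


def snapshot_ids_of(image):
    """Collect the EBS snapshot IDs from an image's block device mappings."""
    return [
        mapping["Ebs"]["SnapshotId"]
        for mapping in image.get("BlockDeviceMappings", [])
        if "SnapshotId" in mapping.get("Ebs", {})
    ]


def cleanup_expired_amis(ec2, images, today=None):
    """Deregister expired AMIs and delete snapshots no kept AMI still uses."""
    today = today or date.today()
    expired, kept = [], []
    for image in images:
        delete_after = tag_value(image, "DeleteAfter")
        is_expired = bool(delete_after) and date.fromisoformat(delete_after) < today
        (expired if is_expired else kept).append(image)

    # Snapshots referenced by AMIs we are keeping must not be deleted.
    in_use = {sid for image in kept for sid in snapshot_ids_of(image)}
    summary = {"deregistered": [], "deleted_snapshots": [], "errors": []}

    for image in expired:
        image_id = image["ImageId"]
        try:
            perms = ec2.describe_image_attribute(
                ImageId=image_id, Attribute="launchPermission"
            ).get("LaunchPermissions", [])
            if perms:  # stop sharing before deregistering
                ec2.modify_image_attribute(
                    ImageId=image_id, LaunchPermission={"Remove": perms}
                )
            ec2.deregister_image(ImageId=image_id)
            summary["deregistered"].append(image_id)
        except Exception as exc:  # with boto3 this is usually a ClientError
            summary["errors"].append({"resource": image_id, "error": str(exc)})
            continue  # leave snapshots alone while the AMI is still registered

        for snapshot_id in snapshot_ids_of(image):
            if snapshot_id in in_use or snapshot_id in summary["deleted_snapshots"]:
                continue
            try:
                ec2.delete_snapshot(SnapshotId=snapshot_id)
                summary["deleted_snapshots"].append(snapshot_id)
            except Exception as exc:
                summary["errors"].append({"resource": snapshot_id, "error": str(exc)})
    return summary
```

The returned summary is what you would log and turn into custom metrics (counts of deregistered AMIs, deleted snapshots, and errors), so a failed run is visible instead of silent.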
Critical Issues in Your Current Code:
- No snapshot tracking - AMI deletion leaves orphaned snapshots
- Missing error handling - failures are silent
- No pagination - any results beyond the first page returned by the API are never processed
- Race condition - cleanup runs while backup is in progress
- No validation - doesn’t check if AMI is actually old enough to delete
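Three of these issues (pagination, the race with in-progress backups, and the missing age check) can be addressed with two small helpers. This is a sketch with hypothetical function names; it assumes `ec2` is a boto3 EC2 client from a boto3 version recent enough to provide a `describe_images` paginator, and that each AMI carries the RetentionDays tag described earlier.

```python
from datetime import date


def list_owned_images(ec2):
    """Walk every page of describe_images instead of reading only the first."""
    images = []
    paginator = ec2.get_paginator("describe_images")
    for page in paginator.paginate(Owners=["self"]):
        images.extend(page["Images"])
    return images


def is_safe_to_delete(image, today):
    """True only for finished AMIs that are older than their RetentionDays tag."""
    if image.get("State") != "available":
        return False  # still pending (backup in progress) or failed
    tags = {tag["Key"]: tag["Value"] for tag in image.get("Tags", [])}
    if not tags.get("RetentionDays", "").isdigit():
        return False  # no usable retention tag: never delete by default
    # CreationDate is an ISO 8601 string; the first 10 characters are the date.
    created = date.fromisoformat(image["CreationDate"][:10])
    return (today - created).days > int(tags["RetentionDays"])
```

Checking the AMI's own CreationDate in addition to the DeleteAfter tag means a mistyped or missing tag cannot cause a fresh backup to be deleted.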
Recommended Solution:
Migrate to DLM for standard use cases. It’s purpose-built for this and eliminates 90% of your operational burden. Reserve Lambda for:
- Pre-backup validation (instance health checks)
- Post-backup verification (AMI availability testing)
- Custom notification workflows
- Integration with external backup catalogs
Cost Recovery Plan:
- Identify all orphaned snapshots: filter by description containing “ami-” but AMI no longer exists
- Create a one-time cleanup Lambda with a higher timeout (5 minutes; Lambda allows up to 15)
- Process deletions in batches of 50 with exponential backoff
- Monitor CloudWatch for throttling errors
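A rough sketch of the one-time cleanup follows. The function names and defaults are assumptions of mine. `snapshots` is the list of snapshot dicts from describe_snapshots (owned by you), `existing_image_ids` is the set of AMI IDs that still exist, and `ec2` is a boto3 EC2 client. Throttling is detected through EC2's `RequestLimitExceeded` error code.

```python
import re
import time

AMI_ID = re.compile(r"ami-[0-9a-f]+")


def find_orphaned_snapshots(snapshots, existing_image_ids):
    """Return IDs of snapshots whose description names an AMI that is gone."""
    orphaned = []
    for snapshot in snapshots:
        match = AMI_ID.search(snapshot.get("Description", ""))
        if match and match.group(0) not in existing_image_ids:
            orphaned.append(snapshot["SnapshotId"])
    return orphaned


def delete_in_batches(ec2, snapshot_ids, batch_size=50, max_retries=5,
                      base_delay=1.0, batch_pause=2.0, sleep=time.sleep):
    """Delete snapshots batch by batch, backing off exponentially on throttling."""
    deleted, failed = [], []
    for start in range(0, len(snapshot_ids), batch_size):
        for snapshot_id in snapshot_ids[start:start + batch_size]:
            for attempt in range(max_retries):
                try:
                    ec2.delete_snapshot(SnapshotId=snapshot_id)
                    deleted.append(snapshot_id)
                    break
                except Exception as exc:  # with boto3, usually a ClientError
                    code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
                    throttled = code == "RequestLimitExceeded"
                    if not throttled or attempt == max_retries - 1:
                        failed.append((snapshot_id, code or str(exc)))
                        break
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        if start + batch_size < len(snapshot_ids):
            sleep(batch_pause)  # breathe between batches
    return deleted, failed
```

Review the output of `find_orphaned_snapshots` before deleting anything; a description match is a heuristic, and snapshots you created by hand for other purposes should not end up in the list.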
- Expected cost reduction: 60-70% once cleanup completes
Your $2,100 cost is likely 80% orphaned snapshots. Clean those up first, then implement DLM going forward.