Differences between OCI Compute backup and snapshot for VM recovery strategies

I’m designing a disaster recovery strategy for our OCI Compute instances and trying to understand the practical differences between using boot volume backups versus snapshots. The documentation covers the technical differences, but I’d like to hear from people who have actually implemented both approaches.

We’re running about 50 production VMs across multiple availability domains, and we need to balance recovery time objectives (RTO) with storage costs. From what I understand, snapshots are faster to create but backups offer better cost efficiency for long-term retention.

For those who have experience with both methods:

  • What are the real-world restore times you’ve seen?
  • How do the storage costs compare over time?
  • Are there compliance considerations that favor one approach over the other?
  • Can you mix both strategies effectively?

I’m particularly interested in which approach works better for different VM types: database servers, application servers, and stateless web servers.

One thing to consider is that snapshots are region-specific, while backups can be copied across regions more easily. If your DR strategy involves multiple regions, backups give you more flexibility. We learned this the hard way during a regional outage - our snapshots were unavailable, but we could restore from backups that had been replicated to another region.

Great insights everyone. It sounds like a hybrid approach makes the most sense - snapshots for short-term, fast recovery needs, and backups for long-term retention and compliance. I’m curious about the automation aspect - are you using OCI policies to automatically manage the lifecycle of both snapshots and backups, or handling it through custom scripts?

Let me share a comprehensive analysis of OCI Compute backup versus snapshot strategies based on our production experience managing a large VM fleet.

Full Backup vs Snapshot - Technical Comparison

The fundamental difference lies in how data is stored and managed:

Snapshots:

  • Point-in-time copy of boot/block volumes stored in the Block Volume service
  • Created almost instantaneously (seconds to minutes)
  • Stored as full copies - no compression or deduplication
  • Region-specific - cannot be directly copied across regions
  • Charged at block storage rates (~$0.05/GB/month)
  • Ideal for short-term recovery scenarios

Backups:

  • Incremental copies stored in Object Storage
  • First backup is full, subsequent backups are incremental
  • Automatic compression and deduplication applied
  • Can be copied to other regions for DR
  • Charged at object storage rates (~$0.0255/GB/month for Standard tier, ~$0.0099/GB/month for Archive)
  • Better for long-term retention and compliance

Restore Time Comparison - Real World Data

Based on our testing across different VM sizes:

Snapshot Restore Times:

  • 100GB boot volume: 12-18 minutes
  • 500GB boot volume: 15-25 minutes
  • 1TB boot volume: 20-30 minutes

Restore time is relatively consistent because you’re cloning within the Block Volume service. The limiting factor is usually the VM provisioning time, not data transfer.

Backup Restore Times:

  • 100GB boot volume: 35-50 minutes
  • 500GB boot volume: 60-90 minutes
  • 1TB boot volume: 90-150 minutes

Restore time increases with volume size because data must be transferred from Object Storage and written to Block Storage. Network bandwidth and Object Storage API limits affect performance.

Key Insight: For RTO under 30 minutes, snapshots are essential. For RTO of 1-2 hours, backups are acceptable.
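
To make the RTO rule of thumb concrete, here is a minimal Python sketch that checks which method fits a given RTO. It only encodes the upper-bound restore times reported above; the table layout, the function names, and treating 1TB as 1000GB are my own assumptions, and your measured times will differ.

```python
# Upper-bound restore times in minutes, copied from the figures reported above.
# Keys are volume sizes in GB (1TB is treated as 1000GB for simplicity).
RESTORE_MINUTES = {
    "snapshot": {100: 18, 500: 25, 1000: 30},
    "backup": {100: 50, 500: 90, 1000: 150},
}


def worst_case_restore(method, volume_gb):
    """Return the reported upper bound for the smallest tested size covering volume_gb."""
    table = RESTORE_MINUTES[method]
    for size in sorted(table):
        if volume_gb <= size:
            return table[size]
    raise ValueError(f"no restore data for volumes larger than {max(table)}GB")


def methods_meeting_rto(volume_gb, rto_minutes):
    """List the methods whose worst-case restore time fits within the RTO."""
    return [
        method
        for method in ("backup", "snapshot")  # cheaper option listed first
        if worst_case_restore(method, volume_gb) <= rto_minutes
    ]


if __name__ == "__main__":
    print(methods_meeting_rto(500, 30))
    print(methods_meeting_rto(500, 120))
```

With these numbers, a 500GB volume with a 30-minute RTO leaves only snapshots, while a 2-hour RTO allows either method.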

Storage Cost Analysis

Let’s analyze costs for a typical 500GB boot volume over 12 months:

Snapshot Strategy (retain 7 days):

  • Daily snapshots: 7 snapshots × 500GB × $0.05 = $175/month
  • Annual cost: $2,100
  • No incremental savings - each snapshot is full size

Backup Strategy (retain 12 months):

  • First full backup: 500GB × $0.0255 = $12.75/month
  • 11 monthly incremental backups (assuming 10% change each): 11 × 50GB × $0.0255 = $14.03/month once all are retained
  • Annual cost: ~$325 in the first year ((12.75 + 14.03) × 12 ≈ $321, rounded up), ~$155/year ongoing (after compression/dedup)
  • Backups older than 3 months moved to Archive tier: Additional 30% savings

Hybrid Strategy (our recommended approach):

  • 3 recent snapshots (fast recovery): 3 × 500GB × $0.05 = $75/month = $900/year
  • 12 months backups (compliance): ~$325/year
  • Total: ~$1,225/year
  • Provides both fast RTO and long-term retention

For 50 VMs:

  • Snapshot-only: $105,000/year
  • Backup-only: $16,250/year
  • Hybrid: $61,250/year

The hybrid approach saves ~$44K annually versus snapshot-only while maintaining fast recovery capability.
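
If you want to rerun this comparison with your own volume sizes, a small Python sketch like the following reproduces the arithmetic. The rates are the approximate figures quoted above, not current list prices, and the function names and the steady-state assumption (one full backup plus 11 incrementals held all year, no compression or dedup) are mine.

```python
SNAPSHOT_RATE = 0.05   # $/GB/month, block storage rate quoted above
BACKUP_RATE = 0.0255   # $/GB/month, Standard object storage rate quoted above


def snapshot_annual_cost(volume_gb, retained, rate=SNAPSHOT_RATE):
    """Every retained snapshot is billed at the full volume size."""
    return retained * volume_gb * rate * 12


def backup_annual_cost(volume_gb, incrementals=11, change_rate=0.10, rate=BACKUP_RATE):
    """One full backup plus `incrementals` incremental copies, held for 12 months."""
    stored_gb = volume_gb + incrementals * volume_gb * change_rate
    return stored_gb * rate * 12


def fleet_costs(volume_gb=500, vm_count=50):
    """Annual cost of each strategy across the whole fleet."""
    per_vm = {
        "snapshot_only": snapshot_annual_cost(volume_gb, retained=7),
        "backup_only": backup_annual_cost(volume_gb),
        "hybrid": snapshot_annual_cost(volume_gb, retained=3)
        + backup_annual_cost(volume_gb),
    }
    return {name: round(cost * vm_count, 2) for name, cost in per_vm.items()}


if __name__ == "__main__":
    for name, cost in fleet_costs().items():
        print(f"{name}: ${cost:,.2f}/year")
```

Without rounding the per-VM backup figure up to $325, the backup-only and hybrid fleet totals come out slightly lower (about $16,065 and $61,065), but the ranking and the rough savings are the same.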

Compliance and Governance Considerations

For regulated industries:

  1. Audit Requirements: Backups provide better audit trails with detailed metadata about backup creation, retention, and deletion events

  2. Retention Policies: Most compliance frameworks require 7+ years retention. Backups in Archive tier ($0.0099/GB/month) make this economically feasible

  3. Immutability: Backups can leverage Object Storage retention rules to prevent deletion or modification

  4. Cross-Region DR: Compliance often requires geographic redundancy. Backup copies to remote regions are straightforward; snapshot replication requires custom automation

  5. Data Classification: Backups support tagging and metadata for data classification requirements

VM Type-Specific Strategies

Database Servers (High RTO sensitivity):

  • Strategy: Hybrid with emphasis on snapshots
  • Snapshots: 3-5 recent (last 24-48 hours)
  • Backups: Daily for 30 days, weekly for 12 months
  • Rationale: Fast recovery critical for business continuity
  • Additional: Use database-native backup tools alongside VM-level protection

Application Servers (Moderate RTO):

  • Strategy: Backup-focused with limited snapshots
  • Snapshots: 1-2 pre-maintenance window only
  • Backups: Daily for 30 days, weekly for 6-12 months
  • Rationale: Can tolerate 1-2 hour RTO, cost optimization priority

Stateless Web Servers (Low RTO sensitivity):

  • Strategy: Minimal protection
  • Snapshots: Golden image snapshots only (after patching/updates)
  • Backups: Weekly or monthly for configuration drift detection
  • Rationale: Can be rebuilt from automation/IaC quickly
  • Consider: Skip VM-level backup entirely, rely on infrastructure-as-code
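
One way to keep these per-tier rules in a single place for automation is a small lookup table. This is only a sketch: the `ProtectionPolicy` class, the tier keys, and `policy_for` are hypothetical names, and the values simply restate the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionPolicy:
    strategy: str
    snapshots: str
    backups: str


# Values restate the per-tier recommendations above; the keys are illustrative.
POLICIES = {
    "database": ProtectionPolicy(
        strategy="hybrid with emphasis on snapshots",
        snapshots="3-5 recent (last 24-48 hours)",
        backups="daily for 30 days, weekly for 12 months",
    ),
    "application": ProtectionPolicy(
        strategy="backup-focused with limited snapshots",
        snapshots="1-2, pre-maintenance window only",
        backups="daily for 30 days, weekly for 6-12 months",
    ),
    "web": ProtectionPolicy(
        strategy="minimal protection",
        snapshots="golden image snapshots only",
        backups="weekly or monthly",
    ),
}


def policy_for(vm_type):
    """Look up the protection policy for a VM tier (case-insensitive)."""
    try:
        return POLICIES[vm_type.lower()]
    except KeyError:
        raise ValueError(f"unknown VM type: {vm_type!r}") from None


if __name__ == "__main__":
    print(policy_for("Database"))
```

Tagging each instance with its tier and resolving the policy from a table like this keeps the scheduling scripts free of per-VM special cases.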

Automation and Lifecycle Management

We use a combination of OCI native features and custom automation:

OCI Native Policies:

  • Boot volume backup policies (Bronze/Silver/Gold tiers)
  • Automatic scheduling and retention management
  • Good for standardized backup requirements

Custom Automation (Terraform + OCI CLI):


// Pseudocode for hybrid backup strategy:
1. Create snapshot before maintenance windows (OCI Events trigger)
2. Retain last 3 snapshots, delete older ones (daily cleanup job)
3. Create daily incremental backups via backup policy
4. Copy weekly backups to DR region (weekend job)
5. Move backups >90 days to Archive tier (monthly job)
6. Alert on backup failures or retention policy violations
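
Steps 2 and 5 above boil down to simple selection logic. Here is a minimal Python sketch of just that part, under the assumption that you have already listed your resources as (name, creation time) pairs; the function names are hypothetical, and the actual OCI calls to list, delete, or re-tier resources are deliberately left out.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, keep=3):
    """Given (name, created_at) pairs, return names of all but the newest `keep`."""
    ordered = sorted(snapshots, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ordered[keep:]]


def backups_to_archive(backups, now, max_age_days=90):
    """Given (name, created_at) pairs, return names older than `max_age_days`."""
    cutoff = now - timedelta(days=max_age_days)
    return [name for name, created_at in backups if created_at < cutoff]


if __name__ == "__main__":
    now = datetime(2024, 6, 1)
    snaps = [(f"snap-{d}", now - timedelta(days=d)) for d in range(5)]
    backups = [
        ("backup-recent", now - timedelta(days=10)),
        ("backup-old", now - timedelta(days=120)),
    ]
    print(snapshots_to_delete(snaps))
    print(backups_to_archive(backups, now))
```

Keeping the selection logic separate from the API calls makes it easy to dry-run the cleanup job and review what it would remove before anything is deleted.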

Best Practices from Production Experience

  1. Tag Everything: Use consistent tags for backup/snapshot resources to track costs and automate lifecycle

  2. Test Restores Quarterly: We restore random VMs every quarter to validate both snapshot and backup recovery procedures

  3. Monitor Backup Growth: Track incremental backup sizes to detect configuration drift or unexpected data growth

  4. Document Recovery Procedures: Maintain runbooks for both snapshot and backup restore processes

  5. Use Separate Compartments: Isolate backup resources in dedicated compartments for better cost tracking and access control

  6. Consider Backup Exclusions: For VMs with ephemeral data (caches, logs), exclude non-essential volumes from backup to reduce costs

  7. Leverage Lifecycle Policies: Use Object Storage lifecycle rules to automatically transition old backups to Archive tier

  8. Cross-Region Strategy: For critical systems, maintain backup copies in at least two regions

Recommended Approach for Your 50 VMs

Based on your requirements:

  1. Categorize VMs: Database (10 VMs), Application (25 VMs), Web (15 VMs)

  2. Database VMs: Hybrid strategy - 3 snapshots + daily backups for 90 days + weekly backups for 12 months

  3. Application VMs: Backup-focused - 1 snapshot pre-maintenance + daily backups for 30 days + weekly backups for 6 months

  4. Web VMs: Minimal - Golden image snapshots + weekly backups for 30 days

  5. Estimated Annual Cost: ~$45,000 (versus $105K snapshot-only or $16K backup-only)

  6. RTO Achievement: Database VMs <30 min, Application VMs <2 hours, Web VMs <4 hours (or rebuild)

This balanced approach addresses RTO requirements, optimizes costs, and meets compliance needs while providing flexibility for different workload types.

The cost difference is substantial over time. Snapshots are stored as full copies in block storage, while backups use object storage with compression and deduplication. For a 500GB boot volume, a snapshot costs about $25/month, while a backup might be $8-12/month depending on compression ratio. If you’re keeping 12 months of retention, that adds up quickly across 50 VMs. For database servers, we actually use both - snapshots for quick rollback during maintenance, backups for compliance and long-term recovery.

That’s helpful, Sara. Are you seeing significant cost differences between the two? I’m trying to estimate the TCO for a 12-month retention policy. Also, do you use different strategies for database VMs versus app servers?

We use both, but for different purposes: snapshots for quick recovery during maintenance windows or before major changes, since they’re fast to create and restore, and backups for long-term retention and compliance. The restore time difference is significant. Snapshots can have a VM back up in 15-20 minutes, while backups can take 45-90 minutes depending on size. Cost-wise, snapshots get expensive if you keep too many, so we only retain the last 3-5 snapshots and rely on backups for anything older than a week.

From a compliance perspective, backups are generally preferred because they support longer retention periods and have better audit trails. We need 7-year retention for certain systems, which is impractical with snapshots due to cost. Also, backups can be moved to Archive storage tier for even lower costs on data you rarely need to access. However, for systems where RTO is critical (under 30 minutes), we maintain recent snapshots alongside the backup strategy.