We’re running our ERP database on Cloud SQL for PostgreSQL (500GB, growing 15GB monthly) and need to finalize our backup and disaster recovery strategy. GCP offers multiple options and I’m trying to understand the tradeoffs between them for an ERP environment where data loss is highly sensitive.
Currently we’re using the default automated backups (daily at 3 AM) with 7-day retention. This seems basic for an ERP system. Point-in-time recovery is appealing because we could recover to any moment before a data corruption incident, but I’m unclear on the performance impact and cost implications of continuous transaction log archiving.
Cross-region replication is another consideration. Our primary instance is in us-central1, and we’re evaluating whether to set up a read replica in us-east1 for disaster recovery. The replication lag concerns me - if there’s a 2-3 second delay, we could lose recent transactions in a regional failure. But the alternative is restoring from backup, which takes 30-45 minutes for our database size.
How are others protecting ERP data in Cloud SQL? What combination of automated backups, PITR, and replication provides the best balance of recovery objectives, cost, and operational complexity?
Don’t forget about backup testing. We schedule quarterly DR drills where we restore from backup to a test instance and verify data integrity. This validates both the backup process and our recovery procedures. Found several issues during drills that would have been disasters in real incidents - incorrect restore parameters, missing application configuration, outdated runbooks.
Also consider exporting critical tables to Cloud Storage as an additional safety layer. We export our financial transactions table nightly to GCS with versioning enabled. It’s not a full backup replacement but provides another recovery option.
Here’s a comprehensive backup strategy framework for ERP systems on Cloud SQL, addressing all three focus areas:
Cloud SQL Automated Backups - Foundation Layer:
Automated backups are your first line of defense but shouldn’t be your only protection. For ERP systems, configure:
- Backup Frequency: Daily automated backups are adequate for most ERP scenarios. Schedule them during the lowest-activity period (typically 2-4 AM) to minimize performance impact. The backup window for a 500GB database is approximately 15-20 minutes.
- Retention Period: Extend from the default 7 days to at least 30 days for ERP environments. This provides recovery options for issues discovered weeks later (reconciliation errors, month-end close problems). Maximum retention is 365 days - use this if your compliance requirements allow.
- Backup Location: Automated backups are stored in the same region as the database by default. For additional protection, enable multi-region backup storage (available in Enterprise tier). This protects against regional disasters affecting both the primary database and its backups.
- Incremental Backups: Cloud SQL uses incremental backups after the first full backup, reducing storage costs. Assuming about 15GB of changed data per day (your stated 15GB monthly figure is net growth, so measure your actual change rate), each week adds roughly 105GB of incremental backup data on top of the initial 500GB full backup, not the 3.5TB that seven full copies would need.
- Backup Verification: Automate backup testing using Cloud Scheduler + Cloud Functions. Weekly, restore the most recent backup to a test instance, run validation queries against known data sets, and confirm the restore completes successfully. This catches backup corruption before you need the backup for an actual recovery.
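To make the incremental-backup arithmetic above easy to redo with your own numbers, here is a minimal sketch. The 15GB daily change rate is the figure assumed in this answer, not a measured value, and the function names are made up for illustration.

```python
def incremental_storage_gb(daily_change_gb, days):
    """Storage added by `days` daily incremental backups."""
    return daily_change_gb * days


def full_copy_storage_gb(database_gb, days):
    """Storage needed if every daily backup were a full copy."""
    return database_gb * days


# Assumed inputs: 500GB database, 15GB of changed data per day, 7 days.
weekly_incremental_gb = incremental_storage_gb(15, 7)
weekly_full_copies_gb = full_copy_storage_gb(500, 7)

print(f"Incrementals per week: {weekly_incremental_gb} GB")
print(f"Seven full copies:     {weekly_full_copies_gb} GB")
```

Swap in the change rate you actually observe; a write-heavy ERP can change far more data per day than its net growth suggests.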
Point-in-Time Recovery - Precision Protection:
PITR is critical for ERP environments where single transactions can have significant financial impact:
- Enable PITR: Transaction log archiving adds <5% overhead and approximately $6-8/month storage cost for your database size. This is negligible compared to the value of recovering to the exact second before data corruption or user error.
- Recovery Scenarios: PITR solves problems automated backups can’t:
  - Accidental DELETE/UPDATE affecting financial records (recover to the moment before the query)
  - Data corruption from an application bug (recover to the last known good state)
  - Ransomware or malicious activity (recover to a point before the attack)
- Retention Configuration: Set PITR retention to 14 days minimum for ERP systems. This covers scenarios where issues aren’t immediately apparent (month-end close finds discrepancies from two weeks ago). Maximum retention is 35 days, which I recommend for financial ERP systems.
- Recovery Process: PITR creates a new Cloud SQL instance at the specified timestamp. Plan for a recovery time of 30-60 minutes for your 500GB database. The new instance can take over as primary after application validation. Keep the original instance until verification is complete.
- Log Archive Storage: Transaction logs are stored in the backup storage bucket, charged at $0.08/GB/month. Your 2GB daily log generation over 35 days retention = ~70GB = $5.60/month. Enable lifecycle policies to automatically delete logs older than the retention period.
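The log storage estimate above is simple enough to script, which helps when comparing 14-day and 35-day retention. This is only a sketch: the 2GB/day log volume and the $0.08/GB/month price are the figures quoted in this answer, so check them against your own metrics and current pricing.

```python
def pitr_log_cost(daily_log_gb, retention_days, price_per_gb_month=0.08):
    """Return (retained log volume in GB, monthly storage cost in USD)."""
    retained_gb = daily_log_gb * retention_days
    return retained_gb, retained_gb * price_per_gb_month


# Compare the two retention settings discussed above.
for days in (14, 35):
    retained_gb, monthly_cost = pitr_log_cost(2, days)
    print(f"{days}-day retention: {retained_gb} GB, ${monthly_cost:.2f}/month")
```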
Cross-Region Replication - High Availability Layer:
For ERP systems with strict RTO/RPO requirements, cross-region replication is essential:
- Architecture Design: A primary instance in us-central1 with a read replica in us-east1 provides protection against regional failures. The replica is continuously updated with 2-5 second lag under normal conditions. This lag represents your maximum data loss in a regional disaster scenario.
- Replication Lag Monitoring: Set up Cloud Monitoring alerts for replica lag exceeding 10 seconds. Sustained high lag indicates network issues or replica resource constraints. Monitor the replica_lag metric and alert if the average over 5 minutes exceeds the threshold.
- Failover Process: Replica promotion is a manual operation requiring 2-5 minutes:
  - Verify the primary region is truly unavailable (not a transient network issue)
  - Promote the read replica to a standalone instance using gcloud sql instances promote-replica
  - Update application connection strings to point to the new primary
  - Validate application functionality and data consistency
  - Document the failover in the incident log for the post-mortem
- Cost Considerations: A read replica costs the same as the primary instance (compute + storage). For your configuration, expect ~$400-600/month additional cost depending on machine type. This is your disaster recovery insurance premium.
- Failback Strategy: After regional recovery, you can’t simply demote the promoted replica back. You must:
  - Create a new replica in the original region from the current primary
  - Allow the replica to catch up (may take hours for 500GB)
  - Schedule a maintenance window for failback
  - Promote the original-region replica and update application connections
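The lag alerting rule described above (average over a 5-minute window exceeding 10 seconds) can be illustrated with a small sketch. In practice Cloud Monitoring evaluates this for you; the function below is a hypothetical stand-in that only shows why averaging over a window avoids paging on a single spike.

```python
from statistics import mean


def lag_alert(samples_seconds, threshold_seconds=10.0):
    """Return True when average lag across the window exceeds the threshold.

    `samples_seconds` is assumed to hold the lag readings collected
    during one 5-minute window.
    """
    if not samples_seconds:
        return False  # no data in the window: nothing to evaluate
    return mean(samples_seconds) > threshold_seconds


normal = [2, 3, 4, 5, 3]        # typical 2-5 second lag
spike = [2, 2, 40, 2, 2]        # one brief spike, average 9.6s
sustained = [8, 12, 15, 14, 11] # sustained high lag, average 12s

for name, window in (("normal", normal), ("spike", spike), ("sustained", sustained)):
    print(name, lag_alert(window))
```

Whether missing data should alert is a design choice; here an empty window is treated as "no alert", but during a real regional outage absent metrics are themselves a signal worth a separate alert.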
Comprehensive Strategy - Layered Defense:
Combine all three approaches for robust ERP data protection:
- Automated Backups: Daily backups with 30-day retention for recovery from major issues
- Point-in-Time Recovery: 14-day PITR retention for precision recovery from data corruption or user errors
- Cross-Region Replication: us-east1 read replica for disaster recovery with <5 second RPO and <1 hour RTO
- Long-Term Archives: Monthly exports to Cloud Storage for compliance (7-year retention using Coldline storage)
- Regular Testing: Quarterly DR drills validating restore procedures, failover process, and application recovery
This layered approach provides a Recovery Point Objective (RPO) of 2-5 seconds and a Recovery Time Objective (RTO) of 30-60 minutes for most scenarios. Total additional cost is approximately $400-600/month for the replica plus $20-30/month for backup storage, a reasonable investment for protecting critical ERP data.
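One way to put the layers into a runbook is a simple decision helper that maps an incident to the recovery layer that covers it. This is a sketch only: the incident labels and function name are invented for illustration, and the 14-day and 30-day windows are the retention values suggested in this answer.

```python
PITR_RETENTION_DAYS = 14      # suggested PITR retention from this answer
BACKUP_RETENTION_DAYS = 30    # suggested automated backup retention


def choose_recovery_layer(incident, age_days):
    """Pick a recovery layer for an incident that began `age_days` ago."""
    if incident == "regional_outage":
        return "promote cross-region replica"
    if age_days <= PITR_RETENTION_DAYS:
        return "point-in-time recovery"
    if age_days <= BACKUP_RETENTION_DAYS:
        return "automated backup restore"
    return "long-term export in Cloud Storage"


print(choose_recovery_layer("regional_outage", 0))
print(choose_recovery_layer("bad_update", 1))
print(choose_recovery_layer("reconciliation_error", 21))
print(choose_recovery_layer("audit_request", 400))
```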
For your specific 500GB database growing 15GB monthly, this strategy scales well. Monitor backup sizes and adjust retention policies as database grows. Consider partitioning large tables to improve backup and recovery performance when you reach 1TB+.
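For planning purposes, a rough projection of when the database crosses the 1TB mark might look like this. It assumes growth stays linear at 15GB per month and treats 1TB as 1000GB; both are simplifications.

```python
import math


def months_until(target_gb, current_gb=500, growth_gb_per_month=15):
    """Whole months until the database reaches `target_gb` at linear growth."""
    if current_gb >= target_gb:
        return 0
    return math.ceil((target_gb - current_gb) / growth_gb_per_month)


months = months_until(1000)
print(f"About {months} months (~{months / 12:.1f} years) until 1TB")
```

At the stated growth rate that is nearly three years out, so partitioning is something to plan for rather than an urgent task.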
Cross-region replication is essential if you have strict RTO requirements. Yes, there’s 2-3 seconds of replication lag, but that’s far better than a 30-45 minute restore time. For ERP, you need to decide: can your business tolerate 45 minutes of downtime plus potential data loss back to the last backup? If not, a cross-region replica is mandatory.
Consider setting up automated failover using Cloud SQL’s high availability configuration. This handles zone failures automatically within the same region. For regional disasters, you’d manually promote the cross-region replica.
One more consideration for ERP: backup retention for compliance. Many industries require 7-year retention of financial data. Cloud SQL automated backups only go 35 days maximum. You need a separate long-term backup strategy - typically exporting to Cloud Storage with lifecycle policies. We do monthly exports to GCS, then use Nearline or Coldline storage classes for cost optimization.
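For the long-term export bucket, the lifecycle policy can be generated as a JSON file and applied with your usual tooling. The sketch below builds a Cloud Storage lifecycle configuration that moves objects to Coldline and later deletes them. The 90-day transition and the roughly 7-year deletion age are assumptions for illustration; confirm the exact retention period with your compliance team before enabling any delete rule.

```python
import json

SEVEN_YEARS_DAYS = 7 * 365 + 2  # assumed: 7 years including up to two leap days


def lifecycle_config(coldline_after_days=90, delete_after_days=SEVEN_YEARS_DAYS):
    """Build a Cloud Storage lifecycle configuration as a plain dict."""
    return {
        "rule": [
            {
                # Move monthly exports to Coldline once they are rarely read.
                "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
                "condition": {"age": coldline_after_days},
            },
            {
                # Delete exports once the assumed retention period has passed.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days},
            },
        ]
    }


config_json = json.dumps(lifecycle_config(), indent=2)
print(config_json)
```

Write the output to a file and apply it to the export bucket; if deletion must never happen early, consider a bucket retention policy in addition to the lifecycle rule.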