Based on your symptoms and the troubleshooting discussion, here’s a comprehensive analysis of your geo-replication lag issue and how to resolve it:
1. Geo-Replication Monitoring:
The persistent CATCH_UP state 5 days after failover indicates an ongoing issue, not just post-failover synchronization. Normal catch-up should complete within hours, not days. Let’s diagnose the root cause:
Primary Monitoring Queries:
-- Check current replication status (run on the primary database)
SELECT
    partner_server, partner_database,
    replication_state_desc, replication_lag_sec,
    last_replication, role_desc
FROM sys.dm_geo_replication_link_status;
-- Check log generation since the last log backup
SELECT
    log_since_last_log_backup_mb,
    log_backup_time
FROM sys.dm_db_log_stats(DB_ID());
Key Metrics to Evaluate:
- Replication lag consistently above 30 seconds violates your RPO
- CATCH_UP state during business hours suggests throughput bottleneck
- Intermittent SYNCHRONIZED status indicates the link is functional but overwhelmed
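The evaluation rules above can be sketched as a small helper. This is a hypothetical sketch, not an official API: the state names mirror the values reported by sys.dm_geo_replication_link_status, while the function name and verdict strings are assumptions for illustration.

```python
# Hedged sketch: classify one geo-replication link reading against a 30 s RPO.
# State names (SEEDING, CATCH_UP, SYNCHRONIZED) come from
# sys.dm_geo_replication_link_status; thresholds and verdicts are assumptions.

RPO_SECONDS = 30

def classify_link(state_desc: str, lag_sec: int) -> str:
    """Return a coarse health verdict for one replication link reading."""
    if state_desc == "SEEDING":
        return "initializing: secondary is still seeding"
    if state_desc == "CATCH_UP":
        return "degraded: link is catching up; investigate throughput"
    if state_desc == "SYNCHRONIZED" and lag_sec > RPO_SECONDS:
        return "rpo-breach: link is healthy but lag exceeds RPO"
    if state_desc == "SYNCHRONIZED":
        return "healthy"
    return "unknown: unexpected replication state"

print(classify_link("CATCH_UP", 120))
print(classify_link("SYNCHRONIZED", 12))
```

Feeding each row of the first monitoring query through a check like this makes the "consistently above 30 seconds" rule easy to automate.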
2. Failover Testing Impact Analysis:
Planned failover reverses the replication direction, which can expose underlying issues:
Common Post-Failover Issues:
- Compute Tier Mismatch: Verify both databases are identical tier/size. After failover, if the new secondary (old primary) was scaled down, it can’t keep up with replication throughput.
- Network Path Changes: Replication now flows in opposite direction. New path might have different latency or bandwidth characteristics.
- Transaction Volume Increase: If workload changed during/after failover test, higher transaction rate could exceed replication capacity.
- Configuration Drift: Failover might have reset certain database configurations or connection settings.
Verification Steps:
-- Compare service tiers (run in the master database on each server)
SELECT
    d.name AS database_name,
    dso.edition,
    dso.service_objective,
    dso.elastic_pool_name
FROM sys.database_service_objectives AS dso
JOIN sys.databases AS d
    ON d.database_id = dso.database_id;
Ensure both primary and secondary show identical edition and service_objective values. Any mismatch can cause persistent lag.
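Once you have the rows from both servers, the drift check itself is trivial to script. A minimal sketch, assuming you have fetched each side's row into a dict (the dict shape and example values are illustrative assumptions):

```python
# Hedged sketch: compare service-objective rows fetched from the primary and
# secondary servers. Keys mirror sys.database_service_objectives columns;
# the example values are assumptions for illustration.

def tier_mismatches(primary: dict, secondary: dict) -> list:
    """Return the attribute names whose values differ between the two sides."""
    keys = ("edition", "service_objective", "elastic_pool_name")
    return [k for k in keys if primary.get(k) != secondary.get(k)]

primary = {"edition": "Premium", "service_objective": "P2", "elastic_pool_name": None}
secondary = {"edition": "Standard", "service_objective": "S3", "elastic_pool_name": None}

print(tier_mismatches(primary, secondary))  # non-empty list means configuration drift
```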
3. RPO/RTO Compliance Resolution:
To meet your 30-second RPO requirement, you need to address the root cause:
Immediate Actions:
- Scale Up Secondary Database: If compute is the bottleneck, scale the secondary to match or exceed primary capacity. The secondary needs sufficient resources to apply incoming transaction logs.
- Check Network Connectivity: Use Azure Network Watcher to verify connectivity between regions. Look for packet loss or high latency.
- Review Transaction Log Activity: Identify whether specific operations (bulk inserts, index rebuilds) are generating excessive log traffic during business hours.
- Verify No Long-Running Transactions: Long transactions on the primary can delay replication. Query sys.dm_tran_active_transactions to identify any transactions open for extended periods.
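The long-running-transaction check in the last step can be sketched as follows. This is a hypothetical helper, assuming you have already fetched (name, transaction_begin_time) pairs such as those exposed by sys.dm_tran_active_transactions; the 5-minute threshold is an assumption, not a documented limit.

```python
# Hedged sketch: flag transactions open longer than a threshold, given
# (name, begin_time) pairs like those from sys.dm_tran_active_transactions.
# The 5-minute threshold is an assumption; tune it to your workload.
from datetime import datetime, timedelta

def long_running(transactions, now, threshold=timedelta(minutes=5)):
    """Return names of transactions that have been open past the threshold."""
    return [name for name, begin in transactions if now - begin > threshold]

now = datetime(2024, 1, 15, 10, 30)
txns = [
    ("nightly_etl", datetime(2024, 1, 15, 9, 0)),   # open 90 minutes
    ("web_request", datetime(2024, 1, 15, 10, 29)), # open 1 minute
]
print(long_running(txns, now))  # ['nightly_etl']
```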
Long-Term Solutions:
// Pseudocode - Implement comprehensive monitoring:
1. Create Azure Monitor alert for replication_lag_sec > 30
2. Set up Log Analytics workspace to track geo-replication metrics
3. Configure automated scaling for secondary during high-traffic periods
4. Implement transaction log monitoring to detect anomalies
5. Schedule regular failover tests during low-traffic maintenance windows
// See documentation: Azure SQL Database monitoring best practices
RPO/RTO Optimization:
- Consider Business Critical Tier: Provides faster replication and lower lag compared to General Purpose
- Use Zone-Redundant Configuration: Reduces failover time and maintains sync during regional issues
- Implement Read Scale-Out: Distribute read workload to secondary, reducing primary load
- Review Application Connection Strings: Ensure apps use failover group listener, not direct database endpoints
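On the last point, a minimal sketch of a listener-based connection string. The `<failover-group-name>.database.windows.net` listener format is standard for Azure SQL failover groups, but the names "myapp-fg" and "orders" here are hypothetical:

```python
# Hedged sketch: build a connection string that targets the failover group
# listener rather than a specific server, so connections follow the current
# primary after failover. "myapp-fg" and "orders" are hypothetical names.

def fg_connection_string(fg_name: str, database: str) -> str:
    server = f"{fg_name}.database.windows.net"
    return (
        f"Server=tcp:{server},1433;"
        f"Database={database};"
        "Encrypt=yes;Connection Timeout=30;"
    )

print(fg_connection_string("myapp-fg", "orders"))
```

An application using a string like this never needs editing after a failover; one that hardcodes `myserver-westus2.database.windows.net` keeps hitting the old primary.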
Failover Testing Best Practices:
To prevent this issue in future tests:
- Pre-Test Validation: Before failover, verify that both databases are at the identical service tier, neither has resource constraints, and replication lag is minimal
- Test During Low-Traffic Windows: Schedule failover tests during maintenance windows to minimize transaction volume during role reversal
- Post-Test Monitoring: Implement 24-hour enhanced monitoring after failover to catch lag issues early
- Automated Rollback: If lag exceeds the threshold for more than 2 hours post-failover, consider failing back to the original configuration
- Document Baseline Metrics: Record normal replication lag, transaction rate, and DTU usage before testing for comparison
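The automated-rollback rule above ("lag over threshold for more than 2 hours") needs sustained-breach detection, not a single-sample check. A minimal sketch, assuming lag samples collected on a fixed cadence; the function name, sample shape, and 15-minute cadence are assumptions:

```python
# Hedged sketch: measure how long the current unbroken run of over-threshold
# lag samples has lasted, to drive the "fail back after 2 hours" rule.
# Sample cadence and thresholds are assumptions.
from datetime import datetime, timedelta

def breach_duration(samples, threshold_sec=30):
    """Given (timestamp, lag_sec) samples in time order, return the length of
    the most recent unbroken run of samples above the threshold."""
    run_start = None
    last_ts = None
    for ts, lag in samples:
        run_start = (run_start or ts) if lag > threshold_sec else None
        last_ts = ts
    return (last_ts - run_start) if run_start else timedelta(0)

t0 = datetime(2024, 1, 15, 8, 0)
samples = [(t0 + timedelta(minutes=15 * i), 90) for i in range(10)]  # 2h15m breached
print(breach_duration(samples) > timedelta(hours=2))  # True -> consider failing back
```

A single in-spec sample resets the run, so transient recoveries correctly defer the rollback decision.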
Immediate Remediation Steps:
- Verify both databases are on identical service tier (S3, P2, BC_Gen5_4, etc.)
- If secondary is lower tier, scale it up to match primary immediately
- Check Azure Service Health for any ongoing issues in West US 2 region
- Review application connection patterns - ensure no apps are bypassing failover group listener
- Monitor for 48 hours after making changes to confirm lag returns to <5 seconds
If lag persists after ensuring identical configurations, open an Azure support ticket - there may be an underlying platform issue with the geo-replication link that requires Microsoft intervention. Include your sys.dm_geo_replication_link_status output and transaction log statistics in the ticket.
The combination of proper tier matching, network verification, and transaction monitoring should resolve your replication lag and bring you back into RPO compliance within 24-48 hours.