SQL Database geo-replication shows high data sync lag after planned failover test

We performed a planned failover test on our Azure SQL Database last week to validate our disaster recovery procedures. The failover itself completed successfully, but since then we’re seeing significant replication lag between primary and secondary regions. The geo-replication monitoring shows sync lag times of 45-90 seconds, whereas before the test it was consistently under 5 seconds.

Our RPO requirement is 30 seconds maximum, so this current state violates our compliance targets. I’ve checked the DTU utilization on both primary (East US) and secondary (West US 2) databases, and neither is showing resource constraints - both are running at 40-50% DTU usage.


Query: SELECT * FROM sys.dm_geo_replication_link_status
Result: replication_lag_sec = 67
        last_replication = 2025-05-22 13:15:23
        replication_state = CATCH_UP

Failover testing is critical for our compliance, but we can’t have it impact production replication performance for days afterward. Has anyone experienced geo-replication lag issues persisting after failover tests? What could cause the replication to not return to normal sync speeds?

Have you verified that both databases are on the same service tier and compute size? I’ve seen cases where after failover, the new secondary (old primary) was inadvertently scaled down, causing replication lag because it couldn’t keep up with the incoming replication stream. Also check for any geo-replication configuration changes that might have been applied during the failover test.

This sounds like a network throughput issue between your primary and secondary regions. After failover, the replication topology changed - your former secondary became primary, and vice versa. Check if there are any network routing changes or bandwidth limitations. Also verify that your secondary region has the same service tier and compute size as the primary. Mismatched tiers can cause persistent lag.

Another thing to investigate - check your failover group configuration if you’re using one. After a planned failover, the roles are reversed, but if there were any changes to the read-write listener endpoint or if applications are still connecting to the old primary’s endpoint directly (bypassing the listener), you might see unexpected behavior. Also verify your secondary region isn’t experiencing any Azure platform issues using the Service Health dashboard.

Check if your database is still in CATCH_UP state. After a planned failover, the new secondary (your former primary) needs to catch up with any transactions that occurred during the failover process. This can take time depending on transaction volume. The replication_state should eventually return to SYNCHRONIZED. How long has it been since the failover completed?

Look at your transaction log generation rate during business hours. If you’re generating transaction log faster than geo-replication can transfer it to the secondary region, you’ll see persistent lag. Run this query on your primary to check log generation statistics: SELECT * FROM sys.dm_db_log_stats(DB_ID()). Also check whether you have any long-running transactions that might be holding up the replication process.

It’s been 5 days since the failover. The state shows CATCH_UP intermittently - sometimes it shows SYNCHRONIZED for a few hours, then goes back to CATCH_UP. The lag seems to spike during our business hours (8am-5pm EST) when transaction volume is higher, but even during low-traffic periods overnight, it’s still showing 15-20 second lag instead of the sub-5 second lag we had before.

Based on your symptoms and the troubleshooting discussion, here’s a comprehensive analysis of your geo-replication lag issue and how to resolve it:

1. Geo-Replication Monitoring:

The persistent CATCH_UP state 5 days after failover indicates an ongoing issue, not just post-failover synchronization. Normal catch-up should complete within hours, not days. Let’s diagnose the root cause:

Primary Monitoring Queries:


-- Check current replication status
SELECT
    partner_server, partner_database,
    replication_state_desc, replication_lag_sec,
    last_replication, role_desc
FROM sys.dm_geo_replication_link_status;

-- Check log generation rate
SELECT
    log_since_last_log_backup_mb AS log_mb,
    log_backup_time
FROM sys.dm_db_log_stats(DB_ID());

Key Metrics to Evaluate:

  • Replication lag consistently above 30 seconds violates your RPO
  • CATCH_UP state during business hours suggests throughput bottleneck
  • Intermittent SYNCHRONIZED status indicates the link is functional but overwhelmed

2. Failover Testing Impact Analysis:

Planned failover reverses the replication direction, which can expose underlying issues:

Common Post-Failover Issues:

  • Compute Tier Mismatch: Verify both databases are identical tier/size. After failover, if the new secondary (old primary) was scaled down, it can’t keep up with replication throughput.
  • Network Path Changes: Replication now flows in the opposite direction, and the new path might have different latency or bandwidth characteristics.
  • Transaction Volume Increase: If workload changed during/after failover test, higher transaction rate could exceed replication capacity.
  • Configuration Drift: Failover might have reset certain database configurations or connection settings.

Verification Steps:


-- Compare service tiers (run on each database in turn)
SELECT
    DB_NAME(database_id) AS database_name,
    edition, service_objective, elastic_pool_name
FROM sys.database_service_objectives;

Run this on both the primary and the secondary and confirm that edition and service_objective are identical. Any mismatch can cause persistent lag.
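
If the secondary turns out to be on a smaller service objective, it can be scaled in place with T-SQL run against master on the secondary’s logical server. The database name and target objective below are placeholders; substitute your own:

-- Scale the secondary to match the primary's service objective
ALTER DATABASE [YourSecondaryDb] MODIFY (SERVICE_OBJECTIVE = 'P2');

The scale operation is online, but expect a brief reconnection blip when it completes; replication catch-up should accelerate once the new compute is active.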

3. RPO/RTO Compliance Resolution:

To meet your 30-second RPO requirement, you need to address the root cause:

Immediate Actions:

  1. Scale Up Secondary Database: If compute is the bottleneck, scale the secondary to match or exceed primary capacity. The secondary needs sufficient resources to apply incoming transaction logs.

  2. Check Network Connectivity: Use Azure Network Watcher to verify connectivity between regions. Look for packet loss or high latency.

  3. Review Transaction Log Activity: Identify if specific operations (bulk inserts, index rebuilds) are generating excessive log traffic during business hours.

  4. Verify No Long-Running Transactions: Long transactions on primary can delay replication. Query sys.dm_tran_active_transactions to identify any transactions open for extended periods.
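
For step 4, a sketch of the long-running-transaction check (the five-minute cutoff is an arbitrary example threshold, and this assumes the server clock is UTC, as it is in Azure SQL Database):

-- Transactions that have been open for more than 5 minutes
SELECT
    st.session_id,
    at.transaction_id,
    at.name,
    at.transaction_begin_time,
    DATEDIFF(SECOND, at.transaction_begin_time, GETUTCDATE()) AS open_seconds
FROM sys.dm_tran_active_transactions AS at
JOIN sys.dm_tran_session_transactions AS st
    ON st.transaction_id = at.transaction_id
WHERE at.transaction_begin_time < DATEADD(MINUTE, -5, GETUTCDATE())
ORDER BY at.transaction_begin_time;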

Long-Term Solutions:


  1. Create an Azure Monitor alert when replication_lag_sec exceeds 30 seconds
  2. Set up a Log Analytics workspace to track geo-replication metrics over time
  3. Configure automated scaling for the secondary during high-traffic periods
  4. Implement transaction log monitoring to detect anomalies
  5. Schedule regular failover tests during low-traffic maintenance windows

(See the Azure SQL Database monitoring best practices documentation.)
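
On the alerting point: as far as I know, geo-replication lag isn’t exposed as a built-in platform metric, so a common approach is to poll the DMV on a schedule (e.g. from an Elastic Job or an Automation runbook) and alert on the result:

-- Flag any geo-replication link that exceeds the 30-second RPO
SELECT
    partner_server,
    partner_database,
    replication_lag_sec,
    CASE WHEN replication_lag_sec > 30
         THEN 'RPO VIOLATION' ELSE 'OK' END AS rpo_status
FROM sys.dm_geo_replication_link_status;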

RPO/RTO Optimization:

  • Consider Business Critical Tier: Provides faster replication and lower lag compared to General Purpose
  • Use Zone-Redundant Configuration: Reduces failover time and maintains sync during regional issues
  • Implement Read Scale-Out: Distribute read workload to secondary, reducing primary load
  • Review Application Connection Strings: Ensure apps use failover group listener, not direct database endpoints
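
On the last point, connection strings should target the failover group listener endpoints rather than a specific server, so connections follow role changes automatically. In these ADO.NET-style examples, my-failover-group and MyDb are placeholders:

Read-write listener (always routes to the current primary):
  Server=tcp:my-failover-group.database.windows.net,1433;Initial Catalog=MyDb;Encrypt=True;

Read-only listener (routes to the secondary, for read scale-out):
  Server=tcp:my-failover-group.secondary.database.windows.net,1433;Initial Catalog=MyDb;Encrypt=True;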

Failover Testing Best Practices:

To prevent this issue in future tests:

  1. Pre-Test Validation: Before failover, verify both databases are at identical service tier, no resource constraints, and replication lag is minimal

  2. Test During Low-Traffic Windows: Schedule failover tests during maintenance windows to minimize transaction volume during role reversal

  3. Post-Test Monitoring: Implement 24-hour enhanced monitoring after failover to catch lag issues early

  4. Automated Rollback: If lag exceeds threshold for more than 2 hours post-failover, consider failing back to original configuration

  5. Document Baseline Metrics: Record normal replication lag, transaction rate, and DTU usage before testing for comparison

Immediate Remediation Steps:

  1. Verify both databases are on identical service tier (S3, P2, BC_Gen5_4, etc.)
  2. If secondary is lower tier, scale it up to match primary immediately
  3. Check Azure Service Health for any ongoing issues in West US 2 region
  4. Review application connection patterns - ensure no apps are bypassing failover group listener
  5. Monitor for 48 hours after making changes to confirm lag returns to <5 seconds

If lag persists after ensuring identical configurations, open an Azure support ticket - there may be an underlying platform issue with the geo-replication link that requires Microsoft intervention. Include your sys.dm_geo_replication_link_status output and transaction log statistics in the ticket.

The combination of proper tier matching, network verification, and transaction monitoring should resolve your replication lag and bring you back into RPO compliance within 24-48 hours.