Based on all the symptoms and discussion, here’s a comprehensive solution addressing your geo-replication lag:
Geo-Replication Monitoring Analysis:
The intermittent SEEDING state is the key diagnostic indicator. This state should only occur during initial setup or when the secondary is being completely rebuilt. After a planned failover and failback, the replication link should remain in CATCH_UP state. The appearance of SEEDING suggests the geo-replication link encountered an issue during your failover test and is now attempting to re-synchronize.
Monitor replication health with this query, run in the user database on the primary (the DMV is sys.dm_geo_replication_link_status; sys.dm_geo_replication_links does not exist):
SELECT partner_server, partner_database, replication_state_desc, replication_lag_sec FROM sys.dm_geo_replication_link_status;
Also check recent management operations for errors (sys.dm_operation_status lives in the master database, so connect there; filter on the operation column if the list is long):
SELECT * FROM sys.dm_operation_status ORDER BY start_time DESC;
Failover Testing Root Cause:
Your planned failover test likely encountered one of these issues:
- Network latency spike during failover causing link degradation
- DTU resource exhaustion on secondary during role swap
- Long-running transactions that blocked clean failover completion
- Service maintenance in one region that coincided with your test
The fact that lag started Monday after a Friday/Sunday failover test suggests the link didn’t fully recover from the role changes.
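If long-running transactions are the suspect, you can check for them directly before (and during) a test. This is a sketch using the standard transaction DMVs; the 60-second threshold is an arbitrary starting point you should tune:

```sql
-- Find user transactions that have been open longer than 60 seconds
SELECT s.session_id,
       t.transaction_id,
       t.transaction_begin_time,
       DATEDIFF(SECOND, t.transaction_begin_time, SYSUTCDATETIME()) AS open_seconds
FROM sys.dm_tran_active_transactions AS t
JOIN sys.dm_tran_session_transactions AS st
  ON st.transaction_id = t.transaction_id
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = st.session_id
WHERE DATEDIFF(SECOND, t.transaction_begin_time, SYSUTCDATETIME()) > 60;
```

Any transaction still open when you initiate a planned failover has to drain first, which stretches the role swap and can leave the link in a degraded state afterward.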
RPO/RTO Compliance Recovery:
To meet your 30-second RPO requirement:
- Immediate Fix - Rebuild Geo-Replica:
The most reliable solution is to remove and re-add the geo-replica:
-- Run both statements in master on the current primary logical server
ALTER DATABASE [YourDB] REMOVE SECONDARY ON SERVER [secondary-server];
ALTER DATABASE [YourDB] ADD SECONDARY ON SERVER [secondary-server] WITH (ALLOW_CONNECTIONS = ALL);
This will put the link into SEEDING (expected) while a clean replication link is established. Seeding time for ~180 GB depends on service tier and cross-region bandwidth; a few hours is a reasonable planning estimate, and note that you are out of RPO compliance until seeding completes.
- Scale Up During ETL Windows:
Your S3 tier (100 DTUs) may be insufficient during heavy ETL operations. Monitor DTU usage:
SELECT end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent FROM sys.dm_db_resource_stats ORDER BY end_time DESC
If consistently above 80% during ETL, scale to S4 (200 DTUs) or higher. You can automate this:
az sql db update --resource-group <rg> --server <server> --name <db> --service-objective S4
- Optimize ETL Batch Processing:
Break large batch updates into smaller transactions:
- Process 5,000-10,000 rows per batch instead of millions
- Add COMMIT statements between batches
- This reduces transaction log burst and smooths replication load
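The batching advice above can be sketched in T-SQL. Table and column names (dbo.StagingFacts, Processed) are placeholders for illustration; substitute your ETL target:

```sql
-- Process rows in small batches so each commit ships a modest log burst
DECLARE @BatchSize int = 5000,
        @Rows int = 1;

WHILE @Rows > 0
BEGIN
    BEGIN TRANSACTION;

    UPDATE TOP (@BatchSize) dbo.StagingFacts   -- placeholder table
    SET Processed = 1                          -- placeholder column/predicate
    WHERE Processed = 0;

    SET @Rows = @@ROWCOUNT;

    COMMIT TRANSACTION;

    -- Optional: a brief pause gives the geo-secondary room to catch up
    WAITFOR DELAY '00:00:01';
END;
```

Each COMMIT bounds the log that must be shipped at once, so the secondary drains steadily instead of falling minutes behind after one multi-million-row transaction.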
- Implement Proactive Monitoring:
Set up Azure Monitor alerts:
- Alert when replication_lag_sec > 30 seconds
- Alert when replication_state_desc is anything other than 'CATCH_UP'
- Alert on DTU usage > 80% sustained for 10 minutes
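Of these, the DTU alert maps directly onto a platform metric; replication lag generally is not exposed as a standard Azure Monitor metric for single databases, so the lag and state alerts are usually implemented as a scheduled job polling the DMV query shown earlier. A sketch of the DTU alert via the CLI, with `<rg>`, `<sub-id>`, `<server>`, `<db>`, and the action group ID as placeholders:

```shell
# Sketch: alert when average DTU consumption exceeds 80% over a 10-minute window
az monitor metrics alert create \
  --name "sqldb-dtu-high" \
  --resource-group <rg> \
  --scopes "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Sql/servers/<server>/databases/<db>" \
  --condition "avg dtu_consumption_percent > 80" \
  --window-size 10m \
  --evaluation-frequency 5m \
  --action <action-group-id>
```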
- Failover Testing Best Practices:
For future tests:
- Perform during low-activity periods (not before heavy ETL)
- Verify replication lag < 5 seconds before initiating failover
- Monitor for 48 hours after failback to catch issues early
- Document baseline performance metrics before test
- Test during quarterly maintenance, not monthly
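The pre-test lag check can be made explicit in your failover runbook. A sketch to run on the primary immediately before initiating a planned failover; it raises an error (which your runbook can treat as "abort") instead of proceeding when the link is unhealthy:

```sql
-- Abort the failover runbook if the link isn't healthy or lag exceeds 5 seconds
IF EXISTS (
    SELECT 1
    FROM sys.dm_geo_replication_link_status
    WHERE replication_state_desc <> 'CATCH_UP'
       OR replication_lag_sec > 5
)
    RAISERROR('Geo-replication link not ready for planned failover.', 16, 1);
```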
Long-term Solution:
Consider upgrading to the Premium tier. Premium P2 (250 DTUs) provides substantially more I/O and log throughput than S3, which typically keeps replication lag lower and RPO compliance more consistent (Azure does not publish a formal lag SLA, so treat this as headroom rather than a guarantee). Alternatively, evaluate the Business Critical vCore tier, which adds built-in local high availability alongside geo-replication.
Immediate Action Plan:
- Verify no ongoing Azure service issues in both regions
- Check current DTU utilization - scale up if needed
- Remove and re-add geo-replica to establish clean link
- Monitor replication lag hourly during initial seeding
- Once stable, implement ETL batch optimization
- Set up automated monitoring alerts
- Schedule next failover test for 90 days out with proper preparation
This approach will restore your RPO compliance while addressing the root causes of replication lag post-failover.