SQL Database geo-replication shows high data sync lag after failover test

After performing a planned failover test for our Azure SQL Database, we’re seeing significant data sync lag on the geo-replica. Before the test, replication lag was typically under 5 seconds. Now it’s consistently showing 45-60 seconds lag, and sometimes spikes to 2-3 minutes.

Our setup: Primary in East US, secondary replica in West US 2. Database is Standard S3 tier, about 180GB size. We did the failover test last Friday during a maintenance window, failed back to primary on Sunday, and the lag started Monday morning.

Monitoring query showing the issue:


SELECT replication_state_desc, last_replication, replication_lag_sec FROM sys.dm_geo_replication_link_status

The replication state shows as SEEDING intermittently, then switches back to CATCH_UP. We have a 30-second RPO requirement for this database due to compliance, so this lag puts us out of compliance, and it was our disaster recovery testing that exposed the issue. Any ideas what could cause persistent lag after a failover test?

Also worth checking if your failover test itself completed cleanly. Sometimes a failover can leave the replication link in a degraded state. You might need to remove and re-add the geo-replica to establish a fresh replication link. This would cause the SEEDING state you’re seeing as it rebuilds the secondary.

Good catch on the transaction log. The log space usage is at 67% on primary, which is higher than our normal 40-45%. We do have a nightly ETL process that runs large batch updates. Could that be overwhelming the replication? The lag seems worst in the mornings after the ETL completes.

The SEEDING state is concerning - that usually only appears during initial geo-replication setup or when the secondary is being rebuilt. After a planned failover and failback, you shouldn’t see SEEDING at all. This suggests the geo-replication link might have broken and is trying to re-establish. Check if there were any network issues or maintenance events in either region during your failover window. Also verify that both databases are showing as online and healthy in the portal.

Your ETL is definitely a factor. Large batch operations generate substantial transaction log activity that must be replicated. Consider breaking your ETL into smaller batches with commits between them. This reduces the burst load on geo-replication. Also, are you monitoring DTU usage during ETL? S3 tier gives you 100 DTUs - if your primary is maxing out during ETL, the secondary might not have enough resources to keep up with replication. You might need to scale up temporarily during ETL windows or permanently if this is a recurring issue.

Based on all the symptoms and discussion, here’s a comprehensive solution addressing your geo-replication lag:

Geo-Replication Monitoring Analysis: The intermittent SEEDING state is the key diagnostic indicator. This state should only occur during initial setup or when the secondary is being completely rebuilt. After a planned failover and failback, the replication link should remain in CATCH_UP state. The appearance of SEEDING suggests the geo-replication link encountered an issue during your failover test and is now attempting to re-synchronize.

Monitor replication health with this query on primary:


SELECT partner_server, partner_database, replication_state_desc, replication_lag_sec FROM sys.dm_geo_replication_link_status

Also check for recent service operations and errors (run this in the master database of the logical server):


SELECT operation, state_desc, error_desc, start_time FROM sys.dm_operation_status ORDER BY start_time DESC

Failover Testing Root Cause: Your planned failover test likely encountered one of these issues:

  1. Network latency spike during failover causing link degradation
  2. DTU resource exhaustion on secondary during role swap
  3. Long-running transactions that blocked clean failover completion
  4. Service maintenance in one region that coincided with your test

The fact that lag started Monday after a Friday/Sunday failover test suggests the link didn’t fully recover from the role changes.

RPO/RTO Compliance Recovery: To meet your 30-second RPO requirement:

  1. Immediate Fix - Rebuild Geo-Replica: The most reliable solution is to remove and re-add the geo-replica (run both statements from the master database on the primary server):

ALTER DATABASE [YourDB] REMOVE SECONDARY ON SERVER [secondary-server]
ALTER DATABASE [YourDB] ADD SECONDARY ON SERVER [secondary-server] WITH (ALLOW_CONNECTIONS = ALL)

This will cause SEEDING (expected), but establishes a clean replication link. Initial seeding for 180GB typically takes 2-4 hours.
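To sanity-check that window, here is a rough back-of-envelope estimate. The 15-25 MB/s sustained seeding throughput used below is an assumption for illustration, not a documented figure for the S3 tier; measure your own environment.

```python
def estimate_seeding_hours(db_size_gb: float, throughput_mb_s: float) -> float:
    """Rough seeding-time estimate: database size / sustained copy throughput.

    throughput_mb_s is an assumed figure, not an Azure-documented rate.
    """
    size_mb = db_size_gb * 1024
    return size_mb / throughput_mb_s / 3600  # seconds -> hours

# 180 GB at an assumed 15-25 MB/s lands in the 2-4 hour range quoted above
for rate in (15, 20, 25):
    print(f"{rate} MB/s -> {estimate_seeding_hours(180, rate):.1f} h")
```

If your measured throughput is much lower, plan the rebuild for a longer maintenance window.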

  2. Scale Up During ETL Windows: Your S3 tier (100 DTUs) may be insufficient during heavy ETL operations. Monitor DTU usage:

SELECT end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent FROM sys.dm_db_resource_stats ORDER BY end_time DESC

If consistently above 80% during ETL, scale to S4 (200 DTUs) or higher. You can automate this:


az sql db update --resource-group <rg> --server <server> --name <db> --service-objective S4
  3. Optimize ETL Batch Processing: Break large batch updates into smaller transactions:
  • Process 5,000-10,000 rows per batch instead of millions
  • Add COMMIT statements between batches
  • This reduces transaction log burst and smooths replication load
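The batching pattern above can be sketched in pure Python (the `apply_batch` callback and the 5,000-row batch size are illustrative stand-ins, not details from your actual ETL):

```python
def chunk_ids(ids, batch_size=5000):
    """Yield successive slices so each can be updated and committed
    separately, keeping per-transaction log volume small."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

def run_etl(ids, apply_batch, batch_size=5000):
    """apply_batch stands in for an UPDATE ... WHERE Id IN (...) call.

    Committing once per batch caps the transaction-log burst that
    geo-replication must ship, instead of one multi-million-row transaction.
    """
    committed = 0
    for batch in chunk_ids(ids, batch_size):
        apply_batch(batch)   # hypothetical: execute the UPDATE for this batch
        committed += 1       # COMMIT would go here in the real ETL
    return committed
```

In T-SQL the same idea is a WHILE loop over `UPDATE TOP (5000) ...` with an explicit COMMIT each iteration, looping until no rows are affected.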
  4. Implement Proactive Monitoring: Set up Azure Monitor alerts:
  • Alert when replication_lag_sec > 30 seconds
  • Alert on replication_state_desc != 'CATCH_UP'
  • Alert on DTU usage > 80% sustained for 10 minutes
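The two replication alert conditions can be expressed as a simple evaluation function. This is a sketch: the thresholds mirror the bullets above, and the input dict mimics one row from sys.dm_geo_replication_link_status (it is not an Azure Monitor API).

```python
def evaluate_replication_health(row, rpo_seconds=30):
    """Return a list of alert messages for one geo-replication status row.

    row is a dict shaped like a sys.dm_geo_replication_link_status record,
    e.g. {"replication_state_desc": "CATCH_UP", "replication_lag_sec": 12}.
    """
    alerts = []
    lag = row.get("replication_lag_sec", 0)
    if lag > rpo_seconds:
        alerts.append(f"lag {lag}s exceeds RPO of {rpo_seconds}s")
    if row.get("replication_state_desc") != "CATCH_UP":
        alerts.append(f"unexpected state: {row.get('replication_state_desc')}")
    return alerts
```

Wire a function like this into whatever polls the DMV (an Azure Function, a runbook, or a cron job) and raise the actual alert through your existing channel.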
  5. Failover Testing Best Practices: For future tests:
  • Perform during low-activity periods (not before heavy ETL)
  • Verify replication lag < 5 seconds before initiating failover
  • Monitor for 48 hours after failback to catch issues early
  • Document baseline performance metrics before test
  • Test during quarterly maintenance, not monthly

Long-term Solution: Consider upgrading to Premium tier for better performance and lower replication lag guarantees. Premium P2 (250 DTUs) provides better geo-replication performance and more consistent RPO compliance. Alternatively, evaluate Business Critical tier which offers built-in high availability and better replication capabilities.

Immediate Action Plan:

  1. Verify no ongoing Azure service issues in both regions
  2. Check current DTU utilization - scale up if needed
  3. Remove and re-add geo-replica to establish clean link
  4. Monitor replication lag hourly during initial seeding
  5. Once stable, implement ETL batch optimization
  6. Set up automated monitoring alerts
  7. Schedule next failover test for 90 days out with proper preparation

This approach will restore your RPO compliance while addressing the root causes of replication lag post-failover.