Based on all the symptoms and discussion, here’s a comprehensive solution addressing your geo-replication lag:
Geo-Replication Monitoring Analysis:
The intermittent SEEDING state is the key diagnostic indicator. This state should only occur during initial setup or when the secondary is being completely rebuilt. After a planned failover and failback, the replication link should remain in CATCH_UP state. The appearance of SEEDING suggests the geo-replication link encountered an issue during your failover test and is now attempting to re-synchronize.
Monitor replication health with this query, run in the user database on the primary (the DMV is sys.dm_geo_replication_link_status; sys.dm_geo_replication_links does not exist):
SELECT partner_server, partner_database, replication_state_desc, replication_lag_sec FROM sys.dm_geo_replication_link_status;
Also check recent management operations for errors (sys.dm_operation_status lives in the master database, so connect there; filter on the operation column if the list is long):
SELECT * FROM sys.dm_operation_status ORDER BY start_time DESC;
Failover Testing Root Cause:
Your planned failover test likely encountered one of these issues:
- Network latency spike during failover causing link degradation
- DTU resource exhaustion on secondary during role swap
- Long-running transactions that blocked clean failover completion
- Service maintenance in one region that coincided with your test
The fact that lag started Monday after a Friday/Sunday failover test suggests the link didn’t fully recover from the role changes.
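If long-running transactions are the suspect, you can check for them directly before (and during) a test. This is a sketch using the standard transaction DMVs; the 60-second threshold is an arbitrary starting point you should tune:

```sql
-- Find user transactions that have been open longer than 60 seconds
SELECT s.session_id,
       t.transaction_id,
       t.transaction_begin_time,
       DATEDIFF(SECOND, t.transaction_begin_time, SYSUTCDATETIME()) AS open_seconds
FROM sys.dm_tran_active_transactions AS t
JOIN sys.dm_tran_session_transactions AS st
  ON st.transaction_id = t.transaction_id
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = st.session_id
WHERE DATEDIFF(SECOND, t.transaction_begin_time, SYSUTCDATETIME()) > 60;
```

Any transaction still open when you initiate a planned failover has to drain first, which stretches the role swap and can leave the link in a degraded state afterward.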
RPO/RTO Compliance Recovery:
To meet your 30-second RPO requirement:
- Immediate Fix - Rebuild Geo-Replica:
The most reliable solution is to remove and re-add the geo-replica:
-- Run both statements in master on the current primary logical server
ALTER DATABASE [YourDB] REMOVE SECONDARY ON SERVER [secondary-server];
ALTER DATABASE [YourDB] ADD SECONDARY ON SERVER [secondary-server] WITH (ALLOW_CONNECTIONS = ALL);
This will put the link into SEEDING (expected) while a clean replication link is established. Seeding time for ~180 GB depends on service tier and cross-region bandwidth; a few hours is a reasonable planning estimate, and note that you are out of RPO compliance until seeding completes.
- Scale Up During ETL Windows:
Your S3 tier (100 DTUs) may be insufficient during heavy ETL operations. Monitor DTU usage:
SELECT end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent FROM sys.dm_db_resource_stats ORDER BY end_time DESC
If consistently above 80% during ETL, scale to S4 (200 DTUs) or higher. You can automate this:
az sql db update --resource-group <rg> --server <server> --name <db> --service-objective S4
- Optimize ETL Batch Processing:
Break large batch updates into smaller transactions:
- Process 5,000-10,000 rows per batch instead of millions
- Add COMMIT statements between batches
- This reduces transaction log burst and smooths replication load
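The batching advice above can be sketched in T-SQL. Table and column names (dbo.StagingFacts, Processed) are placeholders for illustration; substitute your ETL target:

```sql
-- Process rows in small batches so each commit ships a modest log burst
DECLARE @BatchSize int = 5000,
        @Rows int = 1;

WHILE @Rows > 0
BEGIN
    BEGIN TRANSACTION;

    UPDATE TOP (@BatchSize) dbo.StagingFacts   -- placeholder table
    SET Processed = 1                          -- placeholder column/predicate
    WHERE Processed = 0;

    SET @Rows = @@ROWCOUNT;

    COMMIT TRANSACTION;

    -- Optional: a brief pause gives the geo-secondary room to catch up
    WAITFOR DELAY '00:00:01';
END;
```

Each COMMIT bounds the log that must be shipped at once, so the secondary drains steadily instead of falling minutes behind after one multi-million-row transaction.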
- Implement Proactive Monitoring:
Set up Azure Monitor alerts:
- Alert when replication_lag_sec > 30 seconds
- Alert when replication_state_desc is anything other than 'CATCH_UP'
- Alert on DTU usage > 80% sustained for 10 minutes
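Of these, the DTU alert maps directly onto a platform metric; replication lag generally is not exposed as a standard Azure Monitor metric for single databases, so the lag and state alerts are usually implemented as a scheduled job polling the DMV query shown earlier. A sketch of the DTU alert via the CLI, with `<rg>`, `<sub-id>`, `<server>`, `<db>`, and the action group ID as placeholders:

```shell
# Sketch: alert when average DTU consumption exceeds 80% over a 10-minute window
az monitor metrics alert create \
  --name "sqldb-dtu-high" \
  --resource-group <rg> \
  --scopes "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Sql/servers/<server>/databases/<db>" \
  --condition "avg dtu_consumption_percent > 80" \
  --window-size 10m \
  --evaluation-frequency 5m \
  --action <action-group-id>
```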
- Failover Testing Best Practices:
For future tests:
- Perform during low-activity periods (not before heavy ETL)
- Verify replication lag < 5 seconds before initiating failover
- Monitor for 48 hours after failback to catch issues early
- Document baseline performance metrics before test
- Test during quarterly maintenance, not monthly
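The pre-test lag check can be made explicit in your failover runbook. A sketch to run on the primary immediately before initiating a planned failover; it raises an error (which your runbook can treat as "abort") instead of proceeding when the link is unhealthy:

```sql
-- Abort the failover runbook if the link isn't healthy or lag exceeds 5 seconds
IF EXISTS (
    SELECT 1
    FROM sys.dm_geo_replication_link_status
    WHERE replication_state_desc <> 'CATCH_UP'
       OR replication_lag_sec > 5
)
    RAISERROR('Geo-replication link not ready for planned failover.', 16, 1);
```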
Long-term Solution:
Consider upgrading to the Premium tier. Premium P2 (250 DTUs) provides substantially more I/O and log throughput than S3, which typically keeps replication lag lower and RPO compliance more consistent (Azure does not publish a formal lag SLA, so treat this as headroom rather than a guarantee). Alternatively, evaluate the Business Critical vCore tier, which adds built-in local high availability alongside geo-replication.
Immediate Action Plan:
- Verify no ongoing Azure service issues in both regions
- Check current DTU utilization - scale up if needed
- Remove and re-add geo-replica to establish clean link
- Monitor replication lag hourly during initial seeding
- Once stable, implement ETL batch optimization
- Set up automated monitoring alerts
- Schedule next failover test for 90 days out with proper preparation
This approach will restore your RPO compliance while addressing the root causes of replication lag post-failover.