Db2 high availability vs disaster recovery for ERP: failover time and network impact

We’re architecting a Db2 deployment for our ERP system on IBM Cloud and need to decide between HA (high availability) and DR (disaster recovery) configurations. Our ERP requires a 99.95% uptime SLA, which allows about 4.4 hours of downtime per year. The RTO requirement is 15 minutes and the RPO is 5 minutes for disaster scenarios. I understand Db2 HADR (High Availability Disaster Recovery) can provide both HA and DR, but I’m confused about the practical differences in failover time and network requirements. How does a multi-zone cluster setup compare to cross-region HADR replication? What’s the actual network latency impact on ERP transaction performance when using synchronous vs asynchronous HADR modes? We’re looking at a Dallas primary with either a Dallas multi-zone standby or a Washington DC standby. Our ERP does about 500 transactions per second during peak hours with a mix of OLTP and batch processing. Configuration example:

UPDATE DB CFG FOR erpdb USING
  HADR_LOCAL_HOST db-primary.dallas
  HADR_REMOTE_HOST db-standby.washdc
  HADR_SYNCMODE SYNC

Is SYNC mode viable for cross-region with 30-40ms latency? Or should we use NEARSYNC/ASYNC?

So would you recommend multi-zone for HA and separate DR setup for disaster recovery? That seems like double the infrastructure cost. Can HADR handle both use cases with a single standby configuration?

SYNC mode with 40ms latency will kill your ERP performance. Every transaction commit has to wait for acknowledgment from the standby, so you’re adding roughly 40ms to every commit. At 500 TPS, that’s significant. NEARSYNC doesn’t avoid this: it also waits for the standby to acknowledge the log, just on receipt in memory rather than after the write to disk, so it still pays the network round trip. For cross-region, use ASYNC or SUPERASYNC instead. For true HA with fast failover, use multi-zone within Dallas with SYNC mode. Multi-zone latency is typically under 2ms, so performance impact is minimal.

Based on extensive experience with Db2 HADR for ERP systems on IBM Cloud, here’s a comprehensive analysis of your three critical decision points:

Db2 HADR Configuration for HA and DR

Db2 HADR is specifically designed to provide both high availability and disaster recovery through a unified replication framework. The key is understanding that HADR supports multiple standby databases: one principal standby, which can use any synchronization mode, and up to two auxiliary standbys, which always run in SUPERASYNC mode.

For your ERP requirements (99.95% uptime, 15-minute RTO, 5-minute RPO), the optimal architecture is:

Three-Node HADR Configuration:

  1. Primary Database: Dallas Zone 1 (active, serving ERP transactions)
  2. HA Standby: Dallas Zone 2 (principal standby, SYNC mode, automatic failover through a cluster manager such as Pacemaker or TSA)
  3. DR Standby: Washington DC (auxiliary standby, SUPERASYNC mode, manual failover)

This configuration provides:

  • Local HA failover in under 30 seconds (typically 10-15 seconds) for zone failures
  • Geographic DR protection with RPO under 5 minutes for regional disasters
  • Minimal performance impact on primary database

Configuration example:

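-- Primary in Dallas Zone 1
UPDATE DB CFG FOR erpdb USING
  HADR_LOCAL_HOST db-primary.dal.zone1
  HADR_LOCAL_SVC 55001
  HADR_REMOTE_HOST db-standby.dal.zone2
  HADR_REMOTE_SVC 55001
  HADR_TARGET_LIST db-standby.dal.zone2:55001|db-standby.washdc:55001
  HADR_SYNCMODE SYNC
  HADR_TIMEOUT 120
  HADR_PEER_WINDOW 120

HADR_TARGET_LIST is what enables multiple standbys: the first entry is the principal (HA) standby and the remaining entries are auxiliary standbys. The Washington DC host name and port are placeholders, and HADR_REMOTE_INST must also be set (it is omitted here because the instance name is site-specific). A non-zero HADR_PEER_WINDOW is needed when a cluster manager automates takeover, so that a failover from peer state loses no committed transactions; 120 seconds is an illustrative value, not a tuned one.

The HA standby uses SYNC mode because multi-zone latency within Dallas is 1-2ms, adding negligible overhead to transaction commits. The DR standby in Washington DC is an auxiliary standby and therefore runs SUPERASYNC, so the 30-40ms cross-region latency never delays commits on the primary.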

HADR handles both use cases efficiently because:

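  • Primary database ships logs to both standbys simultaneously
  • The principal standby (Dallas Zone 2) uses the configured HADR_SYNCMODE; auxiliary standbys always run SUPERASYNC, whatever value is configured on them
  • Automatic failover through the cluster manager targets the principal standby; takeover on the DR standby is a manual TAKEOVER HADR command
  • DR standby remains available if HA standby fails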

The infrastructure cost is higher than single-standby, but necessary to meet your 99.95% SLA. Single-standby configurations cannot provide both sub-minute HA failover and geographic DR protection.
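
To put the 99.95% figure in context, here is a small sketch of the downtime budget arithmetic. The failover durations are the ones quoted in this thread (30 seconds for HA, 15 minutes for DR); the function names are made up for illustration.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year allowed by an availability percentage."""
    return HOURS_PER_YEAR * 60 * (1 - availability_pct / 100)


def failovers_within_budget(availability_pct: float, failover_seconds: float) -> int:
    """How many failovers of a given duration fit in the yearly budget."""
    budget_seconds = downtime_budget_minutes(availability_pct) * 60
    return int(budget_seconds // failover_seconds)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.95)
    print(f"99.95% allows about {budget:.1f} minutes of downtime per year")
    print("30-second HA failovers that fit:", failovers_within_budget(99.95, 30))
    print("15-minute DR failovers that fit:", failovers_within_budget(99.95, 15 * 60))
```

The point of the comparison is that a handful of manual cross-region failovers would consume most of the yearly budget, while zone-level failovers barely dent it. Planned maintenance also counts against the same budget.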

Multi-Zone Cluster Setup vs Cross-Region Replication

The fundamental difference is latency and failure domain:

Multi-Zone Cluster (Dallas Zone 1 → Dallas Zone 2):

  • Network latency: 1-2ms typical, 5ms worst case
  • Bandwidth: 10-100Gbps private network
  • Failure protection: Protects against zone failures (datacenter issues, power, cooling)
  • Does NOT protect against: Regional disasters, widespread network failures
  • Failover time: 10-30 seconds automatic
  • Performance impact: Small (roughly 1-2ms added per commit with SYNC)
  • SYNCMODE: SYNC recommended

Cross-Region Replication (Dallas → Washington DC):

  • Network latency: 30-40ms typical, can spike to 60-80ms
  • Bandwidth: 1-10Gbps inter-region network
  • Failure protection: Protects against regional disasters, widespread outages
  • Does NOT protect against: Application-level failures, data corruption
  • Failover time: 5-15 minutes manual (requires validation and DNS updates)
  • Performance impact: Significant with SYNC or NEARSYNC mode (30-40ms added per commit)
  • SYNCMODE: ASYNC if this is the only standby; SUPERASYNC (enforced by Db2) if it is an auxiliary standby

For ERP systems requiring both HA and DR:

  • Use multi-zone for HA (automatic, fast, minimal impact)
  • Use cross-region for DR (manual, slower, protects against regional failures)
  • Never rely solely on cross-region SYNC for HA - the latency impact is unacceptable

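Your 500 TPS workload with SYNC mode to Washington DC would add roughly 30-40ms to every commit, because each commit waits one network round trip plus the standby's log write. The delay is per commit, not cumulative: concurrent commits wait in parallel, so at 500 TPS about 20 commits would be waiting at any moment (500 × 0.04s). That is still too slow for ERP work that commits many times in sequence, such as batch jobs, where a single connection would be capped at about 25 commits per second.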
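
The arithmetic behind that kind of estimate can be sketched with Little's law. This is a simplified model that treats the network round trip as the whole commit wait and ignores local log write time; the function names are illustrative.

```python
def commits_in_flight(tps: float, commit_wait_ms: float) -> float:
    """Little's law: average commits waiting = arrival rate x wait time."""
    return tps * commit_wait_ms / 1000.0


def max_serial_commits_per_second(commit_wait_ms: float) -> float:
    """Upper bound for ONE connection that issues commits back to back."""
    return 1000.0 / commit_wait_ms


if __name__ == "__main__":
    for label, wait_ms in [("multi-zone, 2 ms", 2.0), ("cross-region, 40 ms", 40.0)]:
        print(
            label,
            "| commits in flight at 500 TPS:", commits_in_flight(500, wait_ms),
            "| max serial commits/s per connection:", max_serial_commits_per_second(wait_ms),
        )
```

Concurrent OLTP sessions mostly absorb the wait by overlapping, but a single-threaded batch job cannot, which is why batch windows tend to suffer first.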

Network Latency Impact on ERP Transaction Performance

The relationship between HADR SYNCMODE and ERP performance is critical:

SYNC Mode:

  • Transaction commit waits until the standby has written the log to disk and acknowledged it
  • Added latency per commit ≈ one network round trip (RTT) plus the standby's log write time
  • Multi-zone (2ms RTT): about 2ms added per commit, plus the standby write
  • Cross-region (40ms RTT): about 40ms added per commit, plus the standby write
  • Use case: HA standby within same region/metro area

NEARSYNC Mode:

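  • Transaction commit returns once the log is written to disk on the primary and received in memory on the standby
  • The primary's log write and the send to the standby happen in parallel, so the wait is roughly the longer of the two
  • Added latency: Close to one RTT whenever the RTT exceeds the local log write time (so still about 40ms cross-region)
  • Use case: Same-region HA standby where the extra standby disk write of SYNC is too costly
  • Potential data loss: Only if primary and standby fail at the same time, before the standby has flushed the received log to disk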

ASYNC Mode:

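  • Transaction commit returns once the log is written to disk on the primary and handed to the TCP layer on the primary host
  • There is no wait for a standby acknowledgment, but network congestion can still slow the primary while the pair is in peer state
  • Added latency: Near 0ms under normal conditions
  • Use case: Cross-region DR standby with RPO tolerance (5+ minutes)
  • Potential data loss: Log data still in flight at the time of failure; depends on log shipping lag and network conditions

Auxiliary standbys in a multiple-standby setup always run in SUPERASYNC mode, which behaves like ASYNC but never enters peer state, so a slow or congested link to the standby cannot hold back the primary.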
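
As a rough way to compare the modes, the sketch below models the extra commit wait for each one. It assumes the documented Db2 semantics (SYNC waits for the standby disk write, NEARSYNC for receipt in standby memory with the send overlapping the primary write, ASYNC and SUPERASYNC do not wait for the standby). The 1ms log write times are placeholder assumptions, not measurements.

```python
def added_commit_wait_ms(
    mode: str,
    rtt_ms: float,
    primary_write_ms: float = 1.0,  # assumed local log write time
    standby_write_ms: float = 1.0,  # assumed standby log write time
) -> float:
    """Extra commit wait versus a standalone database, under a simplified model."""
    mode = mode.upper()
    if mode == "SYNC":
        # Primary writes, then ships; standby writes to disk before acknowledging.
        return rtt_ms + standby_write_ms
    if mode == "NEARSYNC":
        # Primary write and send overlap; standby acknowledges on receipt in memory.
        return max(0.0, rtt_ms - primary_write_ms)
    if mode in ("ASYNC", "SUPERASYNC"):
        # Commit does not wait for any standby acknowledgment.
        return 0.0
    raise ValueError(f"unknown HADR sync mode: {mode}")


if __name__ == "__main__":
    for rtt in (2.0, 40.0):
        for mode in ("SYNC", "NEARSYNC", "ASYNC", "SUPERASYNC"):
            print(f"RTT {rtt:>4} ms  {mode:<10} +{added_commit_wait_ms(mode, rtt)} ms")
```

The model leaves out congestion and back-pressure effects, so treat it as a first-order comparison and confirm with a load test against the real link.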

For your configuration:

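-- HA Standby (Dallas Zone 2) - principal standby, SYNC mode
HADR_SYNCMODE SYNC
HADR_TIMEOUT 120  -- 2 minutes before disconnected state

-- DR Standby (Washington DC) - auxiliary standby, effective mode is SUPERASYNC
HADR_SYNCMODE ASYNC  -- configured value; used only if this database later becomes primary
HADR_TIMEOUT 120  -- keep the same value as on the primary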

With this configuration:

  • ERP commits see roughly 1-2ms of additional latency, plus the standby's log write time, from HA standby replication (acceptable)
  • DR standby receives logs asynchronously with typical lag of 5-30 seconds (meets 5-minute RPO)
  • Network impact on primary: Approximately 50-100 Mbps per standby for a 500 TPS workload, depending on log volume per transaction; measure your own log rate before sizing (manageable)

Monitor key metrics:

  • HADR_LOG_GAP: Difference between primary and standby log positions
  • HADR_CONNECT_STATUS: Connection health between primary and standbys
  • HADR_STATE: Should be PEER for the SYNC HA standby; the SUPERASYNC DR standby stays in REMOTE_CATCHUP, which is normal for that mode
  • Transaction commit latency: Should remain under 50ms for ERP responsiveness
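
A minimal sketch of how those fields could be checked automatically, assuming you feed it the `KEY = VALUE` text that `db2pd -hadr` prints on the primary. The sample text is hand-written and trimmed for illustration, not captured output, and the 10 MiB log gap threshold is an arbitrary assumption.

```python
# Expected steady state per sync mode: SUPERASYNC standbys never reach PEER.
EXPECTED_STATE = {
    "SYNC": "PEER",
    "NEARSYNC": "PEER",
    "ASYNC": "PEER",
    "SUPERASYNC": "REMOTE_CATCHUP",
}

# Hand-written sample in the KEY = VALUE layout (illustrative values only).
SAMPLE = """
HADR_ROLE = PRIMARY
HADR_SYNCMODE = SYNC
STANDBY_ID = 1
HADR_STATE = PEER
HADR_CONNECT_STATUS = CONNECTED
HADR_LOG_GAP(bytes) = 0

HADR_ROLE = PRIMARY
HADR_SYNCMODE = SUPERASYNC
STANDBY_ID = 2
HADR_STATE = REMOTE_CATCHUP
HADR_CONNECT_STATUS = CONNECTED
HADR_LOG_GAP(bytes) = 52428800
"""


def parse_hadr_blocks(text: str) -> list[dict[str, str]]:
    """Collect KEY = VALUE lines into one dict per standby (HADR_ROLE starts a block)."""
    blocks: list[dict[str, str]] = []
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip headers and blank lines
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "HADR_ROLE" or not blocks:
            blocks.append({})
        blocks[-1][key] = value
    return blocks


def hadr_problems(block: dict[str, str], max_log_gap_bytes: int) -> list[str]:
    """Return human-readable problems for one standby; an empty list means healthy."""
    problems = []
    if block.get("HADR_CONNECT_STATUS") != "CONNECTED":
        problems.append("not connected")
    expected = EXPECTED_STATE.get(block.get("HADR_SYNCMODE", ""))
    if block.get("HADR_STATE") != expected:
        problems.append(f"state {block.get('HADR_STATE')}, expected {expected}")
    if int(block.get("HADR_LOG_GAP(bytes)", "0")) > max_log_gap_bytes:
        problems.append("log gap above threshold")
    return problems


if __name__ == "__main__":
    for standby in parse_hadr_blocks(SAMPLE):
        issues = hadr_problems(standby, max_log_gap_bytes=10 * 1024 * 1024)
        print("standby", standby.get("STANDBY_ID"), issues or "OK")
```

In practice the DR standby probably wants a looser log gap threshold than the HA standby, since some lag is expected over the cross-region link.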

Critical Implementation Notes:

  1. Failover Testing: Test both HA and DR failover scenarios quarterly. HA failover should be automatic and complete in under 30 seconds. DR failover requires manual intervention and should complete in under 15 minutes to meet RTO.

  2. Network Monitoring: Implement continuous latency monitoring between primary and standbys. Alert on latency spikes above 10ms for HA standby, 100ms for DR standby.

  3. Bandwidth Planning: At 500 TPS, expect 50-100 Mbps log shipping traffic during peak hours. Plan for 3-5x this bandwidth for headroom and batch processing spikes.

  4. Automatic Client Reroute: Configure ACR (Automatic Client Reroute) in ERP connection strings to automatically reconnect to new primary after failover.

This architecture meets your 99.95% SLA through multi-zone HA while providing geographic DR protection for catastrophic failures, with acceptable performance impact on your ERP workload.

HADR can handle both, but you need to understand the trade-offs. Multi-zone HADR in Dallas gives you HA with automatic failover in seconds and minimal performance impact. Cross-region HADR to Washington DC gives you DR protection, but failover is slower (minutes) and requires manual intervention in most cases. You can actually have multiple HADR standbys: primary in Dallas Zone 1, a principal standby in Dallas Zone 2 for HA, and an auxiliary standby in Washington DC for DR. The Dallas standby runs SYNC mode; the Washington standby, as an auxiliary, always runs SUPERASYNC.

Don’t forget about network bandwidth requirements for HADR replication. At 500 TPS with typical ERP transaction logs, you’re generating significant log volume. Calculate your log generation rate and ensure network bandwidth can handle it plus overhead. For synchronous replication, you also need consistent low latency - occasional latency spikes will cause transaction delays. Monitor network latency between primary and standby continuously.
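
A back-of-the-envelope way to do that calculation is sketched below. The 20,000 bytes of log per transaction is a hypothetical figure chosen only to make the numbers concrete; substitute the log rate you measure on your own system.

```python
def log_shipping_mbps(tps: float, log_bytes_per_txn: float, standbys: int = 1) -> float:
    """Sustained log-shipping bandwidth in decimal megabits per second."""
    return tps * log_bytes_per_txn * 8 * standbys / 1_000_000


def seconds_of_lag(log_gap_bytes: float, tps: float, log_bytes_per_txn: float) -> float:
    """Convert a log gap in bytes into seconds of transactions at the current rate."""
    return log_gap_bytes / (tps * log_bytes_per_txn)


if __name__ == "__main__":
    TPS = 500
    LOG_BYTES_PER_TXN = 20_000  # hypothetical; measure your real log generation rate

    print("one standby, Mbps:", log_shipping_mbps(TPS, LOG_BYTES_PER_TXN))
    print("two standbys, Mbps:", log_shipping_mbps(TPS, LOG_BYTES_PER_TXN, standbys=2))
    print("lag for a 50 MB gap, s:", seconds_of_lag(50_000_000, TPS, LOG_BYTES_PER_TXN))
```

The lag conversion is handy for RPO checks: a log gap only means something once it is expressed as seconds of work at the current log rate and compared with the 5-minute target.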