After working with several ERP deployments on IBM Cloud, I’ve encountered this exact scenario. The problem stems from how you’ve configured your monitoring alert policy and the fundamental differences between latency and packet loss metric collection.
Issue 1: Monitoring Alert Policy Setup
Your alert policy has three critical configuration problems:
First, using avg() aggregation over a 5-minute window is inappropriate for latency spike detection. Average aggregation dilutes spikes: if you have 150ms latency for 1 minute and 5ms for the other 4 minutes, the average is only 34ms, well below your 100ms threshold. This is why even 15-20 minute spike episodes can fail to trigger alerts when brief periods of normal latency fall inside each evaluation window.
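The dilution effect is easy to verify with the numbers from the example above, one 150ms sample and four 5ms samples in a 5-minute window:

```python
# Why avg() hides spikes: one 1-minute sample at 150 ms, four at 5 ms.
samples_ms = [150, 5, 5, 5, 5]

avg_latency = sum(samples_ms) / len(samples_ms)  # dilutes the spike
max_latency = max(samples_ms)                    # preserves the spike

print(avg_latency)  # 34.0 -> below a 100 ms threshold, no alert
print(max_latency)  # 150  -> breaches the threshold
```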
Second, your metric query lacks proper label filtering. The query avg(ibm_vpc_network_latency_ms{destination="db-tier"}) is too broad. You need to specify source application tier instances and aggregate per-connection, not across all connections globally. Use:
max(ibm_vpc_network_latency_ms{
  source="app-tier",
  destination="db-tier"
}) by (instance)
This ensures you detect spikes on individual connections rather than averaging across healthy and unhealthy connections.
Third, your 5-minute evaluation window is too long for ERP performance SLAs. Peak hour latency spikes need immediate detection. Reduce to 1-2 minute windows with consecutive evaluation periods: “Alert when metric exceeds 100ms for 2 consecutive 1-minute periods.”
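The "N consecutive periods" rule can be sketched as a simple evaluator; the function below is illustrative, not an IBM Cloud API, and assumes one value per 1-minute period:

```python
# Hypothetical alert evaluator: fire only when the metric exceeds the
# threshold for `consecutive` 1-minute periods in a row.
def should_alert(period_values_ms, threshold_ms=100, consecutive=2):
    streak = 0
    for value in period_values_ms:
        streak = streak + 1 if value > threshold_ms else 0
        if streak >= consecutive:
            return True
    return False

print(should_alert([40, 120, 130, 50]))  # True: two consecutive breaches
print(should_alert([40, 120, 50, 130]))  # False: breaches are separated
```

Requiring consecutive breaches filters out one-off blips while still catching a sustained spike within two minutes.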
Issue 2: Latency vs Packet Loss Metrics
Latency and packet loss metrics are collected through fundamentally different mechanisms in IBM Cloud Monitoring:
- Latency metrics: Measured actively by sending probe packets between monitoring agents and calculating round-trip time. Collection frequency is typically every 60 seconds by default, so a spike that occurs between collection intervals can be partially missed or averaged out.
- Packet loss metrics: Calculated from actual traffic flow analysis over longer time windows (usually 5-10 minutes). This is why packet loss stayed near zero: your latency spikes weren’t causing packet drops, just delays.
The key difference: packet loss is a binary per-packet event (a packet is lost or it isn’t), while latency is a continuous metric that requires proper statistical aggregation. Your network can show high latency with zero packet loss when a saturated link queues packets in buffers instead of dropping them.
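A toy FIFO-buffer model illustrates the point: under sustained saturation, queueing delay grows steadily while the drop counter stays at zero until the buffer actually fills (all numbers below are made up for illustration):

```python
# Toy buffer model: arrivals exceed the service rate, so the queue builds
# and waiting time (latency) rises, yet no packets are dropped until the
# buffer is full.
BUFFER_SIZE = 1000   # packets the buffer can hold
queue_depth = 0
dropped = 0

for tick in range(300):        # 300 ticks of saturation
    arrivals, served = 12, 10  # 12 in, 10 out per tick
    queue_depth += arrivals - served
    if queue_depth > BUFFER_SIZE:
        dropped += queue_depth - BUFFER_SIZE
        queue_depth = BUFFER_SIZE

latency_ticks = queue_depth / 10  # time a new packet waits behind the queue
print(dropped, latency_ticks)     # 0 60.0 -> rising latency, zero loss
```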
Issue 3: Synthetic Transaction Monitoring Gap
You’re relying entirely on infrastructure-level network metrics, which don’t capture application-level user experience. ERP response time depends on multiple factors: network latency, database query time, application processing, and serialization overhead.
Implement synthetic transaction monitoring:
- Create synthetic test transactions that mimic real ERP user workflows
- Run these transactions every 1-2 minutes from your application tier
- Measure end-to-end response time including all components
- Alert on synthetic transaction failures or slowness
For ERP systems, synthetic transactions should test critical paths: login, query operations, data entry, and report generation. This gives you true user experience metrics rather than infrastructure metrics.
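A minimal probe might look like the sketch below. The endpoint URL and the warning/critical thresholds are placeholders (assumptions), not real IBM Cloud or ERP APIs; a production version would authenticate and exercise a full workflow:

```python
# Sketch of a synthetic-transaction probe for one ERP workflow step.
import time
import urllib.request

ERP_PROBE_URL = "https://erp.example.internal/api/login-check"  # hypothetical

def run_probe(url=ERP_PROBE_URL, timeout_s=10):
    """Time one end-to-end request; return (success, elapsed milliseconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        ok = False
    return ok, (time.monotonic() - start) * 1000

def classify(ok, elapsed_ms, warn_ms=2000, crit_ms=5000):
    """Map a probe result to an alert severity."""
    if not ok or elapsed_ms > crit_ms:
        return "critical"
    return "warning" if elapsed_ms > warn_ms else "ok"
```

Scheduling `run_probe` every 1-2 minutes and alerting on `classify` output gives you user-experience signals even when infrastructure metrics look healthy.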
Recommended Alert Policy Configuration:
Metric: max(ibm_vpc_network_latency_ms) by (instance)
Filter: source="app-tier" AND destination="db-tier"
Condition: value > 100ms
Evaluation: 1 minute window, 2 consecutive periods
Severity: Warning at 100ms, Critical at 150ms
Additionally, create a separate alert for latency variance:
Metric: stddev(ibm_vpc_network_latency_ms) by (instance)
Condition: value > 50ms
Evaluation: 5 minute window
High standard deviation indicates inconsistent latency even if average stays low.
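The values below are invented to make the point concrete: two connections with an identical 30ms average, where only the standard deviation exposes the jitter on the second one:

```python
# Same mean, very different consistency: stddev flags what the average hides.
import statistics

steady = [28, 30, 31, 29, 32, 30]   # consistent ~30 ms connection
jittery = [5, 5, 150, 5, 5, 10]     # mostly fast, with a large spike

print(statistics.mean(steady), statistics.mean(jittery))  # 30 30
print(round(statistics.stdev(steady), 1))                 # 1.4
print(round(statistics.stdev(jittery), 1))                # 58.8 -> would breach a 50 ms variance alert
```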
Action Items:
- Update your alert policy with max() aggregation and shorter evaluation windows
- Add per-instance alerting to catch individual connection issues
- Implement synthetic transaction monitoring for end-to-end ERP workflows
- Create a composite alert that considers both latency metrics and synthetic transaction results
- Set up a dashboard showing p50, p95, and p99 latency percentiles - these reveal spike patterns better than averages
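To illustrate the last action item, here is a nearest-rank percentile sketch over an invented sample where 3% of requests spike: p50 and p95 look healthy, but p99 exposes the spike the mean would flatten:

```python
# Percentiles reveal spikes that averages hide: 97 fast samples, 3 slow ones.
latencies_ms = sorted([5] * 97 + [150, 160, 170])

def percentile(sorted_vals, p):
    """Nearest-rank percentile on an already-sorted list."""
    idx = max(0, int(round(p / 100 * len(sorted_vals))) - 1)
    return sorted_vals[idx]

print(percentile(latencies_ms, 50))  # 5   -> median looks healthy
print(percentile(latencies_ms, 95))  # 5   -> p95 still looks healthy
print(percentile(latencies_ms, 99))  # 160 -> p99 exposes the spikes
```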
With these changes, you’ll catch latency issues before users report them, meeting your ERP SLA commitments reliably.