Our ERP integration makes thousands of API calls daily to external suppliers and internal microservices. We’ve configured Cloud Monitoring alerts to notify us when API requests fail, but alerts aren’t triggering even though we can see failed requests in the logs.
The log-based metric is configured to count log entries with severity ERROR and jsonPayload.apiStatus >= 400. The alerting policy checks whether the metric exceeds 10 failures in 5 minutes. We’ve manually verified that failed API calls are being logged correctly with the right severity and status codes.
```
filter: severity="ERROR" AND jsonPayload.apiStatus>=400
metric.type: "logging.googleapis.com/user/erp_api_failures"
```
This is critical for incident detection - we’re missing failures that impact business operations. What could prevent the alerts from firing despite matching log entries?
First thing to check - is the log-based metric actually incrementing? Go to Metrics Explorer and query your custom metric to see if data points are being recorded. Sometimes the metric filter syntax doesn’t match what’s actually in the logs. Also verify the metric is being created in the same project where you’re checking alerts.
Carlos, the issue is likely your filter syntax. Log-based metric filters use the same Logging query language as Logs Explorer, but a few features aren’t supported, so test your filter directly in the log-based metric preview before saving. For JSON payload fields you need the full jsonPayload.apiStatus path, and the comparison has to match the field’s type: check whether apiStatus is a string or an integer, because numeric operators won’t match string values.
Alex, checked Metrics Explorer and the metric shows zero data points over the past 7 days, even though Cloud Logging clearly shows ERROR entries with apiStatus fields. The logs and metric are in the same project. Could the jsonPayload path be incorrect?
Here’s the comprehensive solution covering log-based metric configuration, alerting policy setup, and API error monitoring:
Root Cause Analysis:
Your alerts weren’t triggering due to three issues: incorrect filter syntax for string comparison, inconsistent log field names across services, and alignment period misconfiguration.
Log-Based Metric Configuration:
Create a filter that handles both string status codes and multiple field paths (string equality is exact, so list each status code your services actually log):
```
severity="ERROR" AND (
  jsonPayload.apiStatus="400" OR jsonPayload.apiStatus="500" OR
  httpResponse.status>=400
) AND (
  resource.type="cloud_run_revision" OR resource.type="k8s_container"
)
```
Key Configuration Steps:
- Create the metric with the correct data type:
  - Metric type: Counter (for counting failures)
  - Value extractor: leave empty for simple counting
  - Labels: extract jsonPayload.serviceName and jsonPayload.endpoint for granular alerting
- Test the filter before saving:
  - Use the preview feature in the log-based metrics console
  - Verify it matches recent log entries
  - Check that data appears in Metrics Explorer within 2-3 minutes
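To keep the string-equality clauses and resource types from drifting out of sync as services change, the filter above can be generated programmatically. This is a minimal sketch; `build_failure_filter` is an illustrative helper, not part of any Google Cloud SDK.

```python
# Sketch: build the log-based metric filter from explicit lists so each
# string-typed status code is spelled out (numeric operators would
# silently not match string values like "400").

def build_failure_filter(status_codes, resource_types):
    """Return a Cloud Logging filter string matching failed API calls."""
    status_clauses = " OR ".join(
        f'jsonPayload.apiStatus="{code}"' for code in status_codes
    )
    resource_clauses = " OR ".join(
        f'resource.type="{rt}"' for rt in resource_types
    )
    # Explicit parentheses keep the OR groups from being split by AND.
    return (
        'severity="ERROR" AND '
        f"({status_clauses} OR httpResponse.status>=400) AND "
        f"({resource_clauses})"
    )

filter_ = build_failure_filter(
    ["400", "500"], ["cloud_run_revision", "k8s_container"]
)
print(filter_)
```

Paste the generated string into the log-based metric preview to confirm it matches recent entries before saving.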
Alerting Policy Setup:
Create a properly configured alerting policy:
- Condition configuration:
  - Metric: your log-based metric `logging.googleapis.com/user/erp_api_failures`
  - Rolling window: 5 minutes
  - Alignment period: 1 minute (matches log ingestion frequency)
  - Aggregator: Sum
  - Threshold: > 10 failures
  - Duration: 1 minute (trigger after the threshold is exceeded for 1 minute)
- Group by fields: service_name, endpoint (for targeted alerts)
- Notification channels: configure multiple channels (email, Slack, PagerDuty) with proper webhook URLs
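If it helps to see how the condition behaves, here is a pure-Python simulation of the evaluation: failure counts aligned into 1-minute buckets, summed over a 5-minute rolling window, breaching once the windowed sum exceeds 10. It is a stand-in for intuition only, not a Cloud Monitoring API call.

```python
# Sketch: simulate a Sum aggregator over a 5-minute rolling window of
# 1-minute aligned buckets, with a "> 10 failures" threshold.

def rolling_sums(per_minute_counts, window=5):
    """5-minute rolling sum over 1-minute aligned buckets."""
    sums = []
    for i in range(len(per_minute_counts)):
        lo = max(0, i - window + 1)
        sums.append(sum(per_minute_counts[lo : i + 1]))
    return sums

def breaches(per_minute_counts, threshold=10, window=5):
    """True wherever the windowed sum exceeds the threshold."""
    return [s > threshold for s in rolling_sums(per_minute_counts, window)]

# 13 failures spread over minutes 1-5 push the window sum past 10.
counts = [0, 1, 0, 4, 5, 3, 0, 0]
print(breaches(counts))
```

Note the strict `>` at minute 4: a windowed sum of exactly 10 does not breach, matching the "> 10 failures" threshold above.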
API Error Monitoring Best Practices:
- Standardize logging across services:
  - Use consistent field names (httpResponse.status recommended)
  - Always log severity, timestamp, service name, endpoint, and error details
  - Include correlation IDs for request tracing
- Create tiered alerting:
  - Warning: 5-10 failures in 5 minutes (email/Slack)
  - Critical: >10 failures in 5 minutes (PagerDuty)
  - Separate alerts for 4xx (client errors) vs 5xx (server errors)
- Set up a dashboard for visibility:
  - Chart showing API failure rate over time
  - Breakdown by service and endpoint
  - Latency metrics alongside error rates
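Until the field names are standardized, a small normalization shim in each service keeps one metric filter viable across all of them. This sketch assumes the two payload shapes mentioned in the thread (string-typed `jsonPayload.apiStatus` and integer `httpResponse.status`); the helper names are illustrative.

```python
# Sketch: coerce either status field to an int, then classify for the
# tiered 4xx/5xx alerting described above.

def normalized_status(payload):
    """Return the status as an int from either field path, or None."""
    raw = payload.get("apiStatus")  # may be a string like "400"
    if raw is None:
        raw = payload.get("httpResponse", {}).get("status")
    return int(raw) if raw is not None else None

def severity_tier(status):
    """Map a status code to an alerting tier."""
    if status is None or status < 400:
        return "ok"
    return "client_error" if status < 500 else "server_error"

print(severity_tier(normalized_status({"apiStatus": "404"})))
print(severity_tier(normalized_status({"httpResponse": {"status": 503}})))
```

Logging the normalized integer under one agreed field also lets you go back to a simple numeric comparison in the metric filter.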
Validation Steps:
- Generate test failures and verify metric increments in Metrics Explorer
- Manually trigger alert by exceeding threshold
- Confirm notifications arrive at all configured channels
- Test alert auto-resolution when failures drop below threshold
- Review alert history after 48 hours to tune thresholds
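For the first validation step, a small generator makes it easy to produce a reproducible burst of test failures. This sketch only builds the structured payloads; in practice you would emit each one with your logging client (e.g. `log_struct` in google-cloud-logging). The field names mirror the metric filter above and `failure_entry` is a hypothetical helper.

```python
# Sketch: build >10 ERROR entries shaped like the real logs so the
# metric crosses the "> 10 failures in 5 minutes" threshold.
import json
from datetime import datetime, timezone

def failure_entry(service, endpoint, status="500"):
    return {
        "severity": "ERROR",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "jsonPayload": {
            "serviceName": service,
            "endpoint": endpoint,
            "apiStatus": status,  # string, matching the real logs
        },
    }

# 12 entries in one burst exceeds the threshold of 10.
burst = [failure_entry("erp-sync", "/suppliers", "500") for _ in range(12)]
print(json.dumps(burst[0], indent=2))
```

Remember the ~2 minute ingestion delay when checking Metrics Explorer afterwards.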
Common Pitfalls to Avoid:
- Don’t use regex in log filters (performance impact)
- Avoid alignment periods longer than your rolling window
- Don’t set thresholds too low (alert fatigue) or too high (miss critical issues)
- Remember log-based metrics have ~2 minute ingestion delay
Implement Cloud Trace for distributed tracing alongside these alerts for complete API observability across your ERP integration.
Another gotcha - make sure your alerting policy alignment period matches your data granularity. If you’re using 1-minute alignment but data only arrives every 5 minutes, alerts might not trigger as expected.
We had similar issues with inconsistent log structures across services. Recommend standardizing your logging format or creating multiple log-based metrics for different patterns. Also set up notification channels properly - we discovered our alerts were firing but going to a deprecated Slack channel.
Maya, you’re right! The apiStatus field in our logs is actually a string “400” not an integer 400. The numeric comparison was failing silently. Also found that some logs use httpResponse.status instead of jsonPayload.apiStatus depending on the service.
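For anyone else who hits this, here is a minimal Python stand-in for the type mismatch: a numeric comparison against a string-typed field matches nothing, while string equality does. It models the behavior observed in this thread, not the real Logging filter parser.

```python
# Sketch: why jsonPayload.apiStatus>=400 matched zero entries while
# jsonPayload.apiStatus="400" matched them.

def numeric_match(payload, field, threshold):
    """Numeric comparison: only matches when the value is numeric."""
    value = payload.get(field)
    return isinstance(value, (int, float)) and value >= threshold

def string_match(payload, field, literal):
    """String equality: matches the value exactly as logged."""
    return payload.get(field) == literal

payload = {"apiStatus": "400"}  # logged as a string
print(numeric_match(payload, "apiStatus", 400))   # metric stays at zero
print(string_match(payload, "apiStatus", "400"))  # metric increments
```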