Best practices for API monitoring and alerting strategies using custom metrics in GCP

I’m looking to establish comprehensive API monitoring and alerting practices using Cloud Monitoring for our microservices platform. We have about 30 API services running on Cloud Run and GKE, and I want to move beyond basic uptime checks to more sophisticated observability.

I’m particularly interested in strategies for custom metrics that go beyond default HTTP metrics, how to structure alerting policies to avoid alert fatigue while catching real issues, and dashboard design patterns that work well for API-centric architectures. What monitoring approaches have worked well for your teams? What mistakes did you make that others should avoid?

One thing we learned the hard way - don’t just monitor success cases. Track partial failures, degraded responses, and timeouts separately. We had situations where APIs returned 200 OK but with incomplete data, and our monitoring missed it because we only tracked HTTP status codes. Now we instrument custom metrics for data completeness and response quality, not just availability.
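To make that concrete, here's a minimal sketch of classifying responses beyond the HTTP status code so a "200 OK with missing fields" counts as a partial failure. The required-field set, threshold, and `record_metric` helper are all hypothetical stand-ins for whatever metrics client you actually use (an OpenTelemetry counter, a Cloud Monitoring time series, etc.):

```python
# Sketch: record response *quality*, not just HTTP status.
# REQUIRED_FIELDS and record_metric are illustrative placeholders.

from collections import Counter

REQUIRED_FIELDS = {"id", "items", "total"}  # hypothetical contract for one endpoint

metric_counts = Counter()  # stand-in for a real metrics backend

def record_metric(name: str, labels: dict) -> None:
    metric_counts[(name, tuple(sorted(labels.items())))] += 1

def classify_response(status: int, body: dict, elapsed_ms: float,
                      timeout_ms: float = 2000) -> str:
    """Bucket a response as ok / partial / client_error / error / timeout."""
    if elapsed_ms >= timeout_ms:
        return "timeout"
    if status >= 500:
        return "error"
    if status >= 400:
        return "client_error"
    missing = REQUIRED_FIELDS - body.keys()
    return "partial" if missing else "ok"

def observe(status: int, body: dict, elapsed_ms: float) -> str:
    quality = classify_response(status, body, elapsed_ms)
    record_metric("api/response_quality", {"quality": quality})
    return quality

# A 200 with incomplete data is recorded as "partial", not a success:
observe(200, {"id": 1, "items": []}, elapsed_ms=120)  # missing "total" -> "partial"
```

Alerting on the `quality="partial"` ratio is what catches the "200 but incomplete data" case that status-code monitoring misses.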

The SLO-based approach sounds promising. How granular do you make your SLOs? Do you have one per service or per endpoint? And for custom metrics, what’s your approach to instrumenting them - are you using OpenTelemetry or Cloud Monitoring’s client libraries directly?

We define SLOs at the service level for overall health, then have more detailed metrics per critical endpoint. Not every endpoint needs an SLO - focus on user-facing APIs and critical backend services. For instrumentation, we use OpenTelemetry because it's vendor-neutral and gives us flexibility. Cloud Monitoring can ingest OpenTelemetry metrics through the Google Cloud exporter or an OpenTelemetry Collector, so you get the best of both worlds - standardized instrumentation and GCP-native dashboards and alerting.
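A rough bootstrap of that setup, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-gcp-monitoring` packages; the project ID, meter name, and metric name are placeholders, and actually exporting requires GCP credentials, so treat this as a configuration sketch rather than a runnable demo:

```python
# Sketch: OpenTelemetry metrics exported to Cloud Monitoring.
# "my-project", "checkout-service", and the metric name are hypothetical.

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter

exporter = CloudMonitoringMetricsExporter(project_id="my-project")
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")
response_quality = meter.create_counter(
    "response_quality",
    description="API responses labeled by data-completeness outcome",
)

# Instrumentation call sites then just increment with labels:
response_quality.add(1, {"quality": "partial", "endpoint": "/orders"})
```

The payoff of keeping instrumentation in OpenTelemetry is that only the exporter line is GCP-specific; the counters and labels stay portable if you ever change backends.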

We started with too many alerts and quickly learned that’s a mistake. Now we focus on SLO-based alerting using error budgets. We define SLOs for availability and latency, then alert when we’re burning through our error budget too quickly. This reduces noise significantly while still catching issues before they impact users. For custom metrics, we instrument business-level metrics like successful transactions, not just technical metrics.
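The burn-rate math behind that alerting style can be sketched in a few lines. This assumes a hypothetical 99.9% availability SLO and uses the commonly cited multiwindow pattern (e.g. from the Google SRE Workbook), where a burn rate of 1.0 means consuming the error budget exactly on pace:

```python
# Sketch: error-budget burn rate with a multiwindow paging condition.
# SLO target and the 14.4x threshold are illustrative, not prescriptive.

SLO_TARGET = 0.999                 # hypothetical 99.9% availability SLO
ERROR_BUDGET = 1.0 - SLO_TARGET    # fraction of requests allowed to fail

def burn_rate(errors: int, total: int) -> float:
    """Observed error ratio relative to the budget (1.0 = on pace)."""
    if total == 0:
        return 0.0
    return (errors / total) / ERROR_BUDGET

def fast_burn_page(err_short: int, total_short: int,
                   err_long: int, total_long: int,
                   threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window exceed the threshold:
    the long window proves the burn is sustained, the short window proves
    it is still happening. This is what suppresses brief blips."""
    return (burn_rate(err_short, total_short) >= threshold
            and burn_rate(err_long, total_long) >= threshold)

# 2% errors against a 0.1% budget is a 20x burn in both windows -> page.
```

In Cloud Monitoring you'd express the same condition as a burn-rate alerting policy on an SLO rather than hand-rolling it, but the arithmetic above is why those alerts fire so much less often than raw error-rate thresholds.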