Our SRE team is debating whether to invest more heavily in centralized logging (Cloud Logging) or distributed tracing (Cloud Trace) for our ERP system running on GKE. We have limited budget for observability tooling and need to make a strategic choice.
Current situation: We’re using Cloud Logging for all application and system logs, but our incident response times are slow because we’re manually correlating logs across 20+ microservices during outages. The distributed tracing camp argues that trace context would dramatically speed up root cause analysis. The logging camp says we need better log aggregation and structured logging before adding another tool.
Our ERP handles order processing, inventory management, and financial transactions. When something breaks, we need to quickly trace a user request through multiple services. What’s been more valuable in practice for complex microservices troubleshooting?
After considering all perspectives, here's a comprehensive strategy built around three focus areas: centralized log aggregation, distributed trace context, and incident response workflows.
Centralized Log Aggregation:
This should be your foundation. Ensure all 20+ microservices are sending structured logs to Cloud Logging with consistent formatting:
- Use JSON-formatted logs with standardized fields (timestamp, severity, service_name, trace_id, user_id, transaction_id)
- Implement log levels correctly: DEBUG for development, INFO for business events, WARN for degraded performance, ERROR for failures
- Create log-based metrics for key business and technical indicators (order processing rate, error rates by service, latency percentiles)
- Set up log sinks to BigQuery for long-term analysis and cost optimization (Cloud Logging's default retention is 30 days; keeping logs in Logging beyond that incurs extended-retention charges, so BigQuery is usually the cheaper home for history)
- Use exclusion filters to reduce ingestion costs - exclude health check logs, verbose debug logs in production
The value here is largely immediate: on GKE, stdout/stderr logs are collected automatically, so aggregation itself requires no code changes. Standardizing the log format does touch each service, but it's mostly logging configuration rather than application logic.
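As a concrete sketch of the structured-logging convention above: on GKE, any JSON written to stdout lands in Cloud Logging's `jsonPayload`, so a small formatter is enough. The field names (`service_name`, `trace_id`, `transaction_id`) follow this thread's convention, not a GCP requirement:

```python
# Minimal JSON log formatter for the standardized fields discussed above.
# Cloud Logging parses JSON written to stdout on GKE into jsonPayload;
# the field names here are this thread's convention, not a GCP requirement.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "severity": record.levelname,
            "service_name": self.service_name,
            "message": record.getMessage(),
            # Correlation fields, populated via `extra=` at the call site.
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
            "transaction_id": getattr(record, "transaction_id", None),
        }
        return json.dumps(entry)

logger = logging.getLogger("orders")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter("order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order accepted", extra={"trace_id": "abc123", "transaction_id": "tx-42"})
```

Because every service emits the same keys, a single Logging query on `jsonPayload.trace_id` works across all 20+ services.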
Distributed Trace Context:
Implement trace context propagation as your next phase. This doesn’t require full distributed tracing immediately:
- Generate a unique trace ID at your ingress layer (GKE Ingress or API Gateway)
- Propagate this trace ID through all service calls using W3C Trace Context headers
- Include the trace ID in all structured logs (this creates the bridge between logging and tracing)
- Start with automatic instrumentation using OpenTelemetry where possible (Java and Node.js have good auto-instrumentation support)
- Focus initial tracing effort on your critical path: order processing → inventory check → payment processing → fulfillment
With trace context in logs, you can already do basic request tracing using Cloud Logging queries, even before implementing full distributed tracing.
Incident Response Workflows:
Optimize your workflows to leverage both tools effectively:
Phase 1 - Detection (Logs):
- Use log-based alerts for known error patterns and SLO violations
- Create dashboards showing error rates, latency trends, and service health from log-based metrics
- When an alert fires, start with broad log queries to identify the scope (one user? one service? entire system?)
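A starting-point scoping query in the Cloud Logging query language might look like the following (lines are implicitly ANDed; the namespace label and severity threshold are placeholders for your environment):

```
resource.type="k8s_container"
resource.labels.namespace_name="erp"
severity>=ERROR
```

Widening or narrowing from there (drop the namespace, add `jsonPayload.service_name`, add a `jsonPayload.user_id`) answers the "one user? one service? entire system?" question quickly.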
Phase 2 - Diagnosis (Logs + Traces):
- Use logs to identify affected trace IDs or transaction IDs
- If tracing is implemented, pull the specific trace to see the request flow and identify the failing service
- If tracing isn’t available yet, use the trace ID to query logs across all services in chronological order
- Look for timing gaps between service calls to identify latency sources
Phase 3 - Root Cause (Deep Dive):
- Use detailed logs from the failing service to understand the specific error condition
- Check for correlated infrastructure issues (pod restarts, node failures) in GKE logs
- Examine database query logs if the issue is data-layer related
Practical Implementation Timeline:
Weeks 1-4: Logging foundation
- Standardize log format across all services
- Implement correlation IDs (can be simple UUIDs generated at ingress)
- Set up log-based metrics and dashboards
- Configure BigQuery sink for cost optimization
Weeks 5-8: Trace context
- Implement W3C Trace Context header propagation
- Add trace IDs to all log statements
- Test end-to-end trace ID flow for critical paths
Weeks 9-12: Distributed tracing
- Enable OpenTelemetry auto-instrumentation for Java and Node.js services
- Manually instrument Python services for critical paths
- Configure Cloud Trace with a 10% sampling rate initially
- Create trace-based dashboards for latency analysis
Cost Management:
- Cloud Logging: ~$0.50/GB ingested. Budget for 50-100GB/day for 20 microservices = $750-1500/month
- Cloud Trace: ~$0.20/million spans. With 10% sampling on 1M requests/day, and each sampled request generating a few dozen spans across 20+ services, roughly $20-40/month
- BigQuery storage: ~$0.02/GB/month for long-term log retention = $100-200/month
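A quick sanity check on the arithmetic behind those estimates. The unit prices are the ones quoted in this thread (verify current GCP pricing before budgeting), and note the trace figure only works out if each sampled request generates a few dozen spans, which is plausible with 20+ services on the path:

```python
# Back-of-the-envelope check of the monthly observability estimates above.
LOG_INGEST_PER_GB = 0.50        # Cloud Logging, per GB ingested (as quoted)
TRACE_PER_MILLION_SPANS = 0.20  # Cloud Trace, per million spans (as quoted)
DAYS = 30

def logging_monthly(gb_per_day):
    return gb_per_day * DAYS * LOG_INGEST_PER_GB

def trace_monthly(requests_per_day, sample_ratio, spans_per_trace):
    spans = requests_per_day * sample_ratio * spans_per_trace * DAYS
    return spans / 1e6 * TRACE_PER_MILLION_SPANS

# 50-100 GB/day of logs -> $750-$1,500/month
log_low, log_high = logging_monthly(50), logging_monthly(100)
# 1M req/day, 10% sampling, ~35-65 spans per trace -> roughly $20-$40/month
trace_low = trace_monthly(1_000_000, 0.10, 35)
trace_high = trace_monthly(1_000_000, 0.10, 65)
```

The takeaway matches the thread's conclusion: sampled tracing is a rounding error next to log ingestion, so the logging exclusion filters are where cost discipline actually pays off.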
The answer isn’t either/or - it’s both, implemented strategically. Start with logging as your foundation because it provides immediate value with minimal code changes. Add tracing incrementally, focusing on high-value request paths. The trace context in logs creates a bridge between the two systems, giving you request correlation even before full tracing is implemented. For your ERP system, this phased approach balances quick wins with long-term observability maturity.
The instrumentation effort is a real concern. We’re using a mix of Java, Node.js, and Python services. Some teams are already using OpenTelemetry, but it’s not consistent across the organization. Would it make sense to start with automatic instrumentation where possible and gradually expand? Or should we focus on establishing logging best practices first and add tracing later when we have the engineering bandwidth?
Don’t underestimate the cost implications. Cloud Logging charges for ingestion and storage, which can get expensive with verbose logging from 20+ services. Cloud Trace charges per span ingested, which also adds up quickly. We found that sampling traces (1-10% of requests) provides enough visibility for troubleshooting while keeping costs manageable. For logs, implement log levels properly and only send WARN and ERROR to Cloud Logging in production, with INFO and DEBUG available locally or in short-term storage.
Centralized logging gives you breadth - everything that happens in your system. Distributed tracing gives you depth - the exact path of a specific request. For microservices, tracing is transformational when you need to answer “why is this one transaction slow?” Logs are better for “what happened in the last hour?” If your pain point is incident response for specific user requests, tracing will have more immediate impact.
We went through this exact debate last year. Our approach: implement structured logging with correlation IDs as the foundation, then add tracing incrementally starting with the most critical request paths. The correlation IDs bridge the gap - they show up in both logs and traces, making it easier to jump between the two views. Start with your order processing flow since that’s probably your highest-value path to instrument first.
From an incident response perspective, I’ve found that tracing helps with the 20% of incidents that are complex request flow issues, while logging handles the other 80% - application errors, infrastructure problems, security events. During a P1 incident, I want both, but if I had to choose one, I’d take comprehensive logging. Tracing is useless if you don’t know which trace to look at, and logs help you identify the problematic request IDs in the first place.