Distributed trace sampling missing critical error events in Cloud Observability

We’ve implemented distributed tracing across our microservices using IBM Cloud Observability, but we’re losing visibility into critical error events. The sampling algorithm appears to be dropping error traces during high-traffic periods, even though we’ve configured error prioritization in the sampling policy. When we review the trace database, we can see successful transactions, but error traces that appear in our application logs are missing from the observability platform. I’ve checked the sampling decision logging, and it shows a 1% sample rate during peak hours, which seems to be filtering out errors along with normal traffic. Our trace database retention is set to 30 days. How can we ensure error traces are always captured regardless of sampling decisions?

That makes sense about the sampling decision timing. Our errors typically occur in downstream services after the initial sampling decision. How do I configure tail-based sampling in IBM Cloud Observability?

Missing error traces during high-traffic periods is a critical observability gap. The root cause is that standard head-based sampling makes decisions before errors occur downstream. Here’s a comprehensive solution:

Sampling Policy Configuration: Replace your current head-based sampling with a hybrid approach. Configure your observability agent with priority-based sampling:

sampling:
  default_rate: 0.01  # 1% for normal traffic
  priority_rules:
    - condition: "error == true"
      rate: 1.0  # 100% for errors
    - condition: "http.status_code >= 500"
      rate: 1.0
    - condition: "duration > 5000"
      rate: 0.5  # 50% for slow requests

This ensures error traces bypass the 1% sampling rate you’re seeing during peak hours.
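If your agent’s policy syntax differs, the same first-match rule ordering can be expressed directly in code. A minimal pure-Python sketch of the priority logic above (rule predicates and span field names are illustrative, not a specific agent API):

```python
import random

# Priority rules mirrored from the config above: (predicate, sample_rate),
# checked in order; the first matching rule decides the rate.
PRIORITY_RULES = [
    (lambda span: span.get("error") is True, 1.0),             # 100% for errors
    (lambda span: span.get("http.status_code", 0) >= 500, 1.0),
    (lambda span: span.get("duration", 0) > 5000, 0.5),        # 50% for slow requests
]
DEFAULT_RATE = 0.01  # 1% for normal traffic

def should_sample(span, rng=random.random):
    """Return True if the span should be kept; first matching rule wins."""
    for matches, rate in PRIORITY_RULES:
        if matches(span):
            return rng() < rate
    return rng() < DEFAULT_RATE
```

Passing `rng` explicitly makes the decision deterministic in tests; in production the default `random.random` is used.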

Error Trace Prioritization: Implement error detection at multiple span levels. In your application instrumentation:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order") as span:
    try:
        result = process_order(order_id)
    except Exception as e:
        # Mark the span as errored so priority sampling rules can match it
        span.set_attribute("error", True)
        span.set_attribute("error.type", type(e).__name__)
        span.set_status(Status(StatusCode.ERROR))
        # Force sampling decision override
        span.set_attribute("sampling.priority", 1)
        raise

The sampling.priority attribute is a tracer convention (originating with Jaeger/OpenTracing) that tells a supporting agent to retain the trace even if it was initially sampled out; verify that your agent honors it before relying on this.

Sampling Decision Logging: Enable detailed sampling decision logging to understand why traces are dropped:

logging:
  level: debug
  sampling_decisions: true
  include_fields:
    - trace_id
    - sampling_decision
    - decision_reason
    - span_attributes

Analyze the logs to identify patterns in dropped error traces. Look for traces where error attributes were set AFTER the sampling decision.
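One way to run that analysis, assuming the decision log emits one JSON object per line with the fields enabled above (the field names and the "drop" value are illustrative; adjust to your agent’s actual log schema):

```python
import json

def find_late_errors(log_lines):
    """Return trace IDs that were dropped by sampling but whose spans
    later carried error attributes, i.e. errors set after the decision."""
    dropped, errored = set(), set()
    for line in log_lines:
        rec = json.loads(line)
        if rec.get("sampling_decision") == "drop":
            dropped.add(rec["trace_id"])
        if rec.get("span_attributes", {}).get("error"):
            errored.add(rec["trace_id"])
    # The intersection is exactly the gap you are seeing: errored traces
    # that the head-based decision had already discarded.
    return dropped & errored
```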

Tail-Based Sampling Implementation: For critical services, implement tail-based sampling that evaluates complete traces:

processors:
  tail_sampling:
    decision_wait: 10s  # Wait for complete trace
    policies:
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 5000}
      - name: sample-normal
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

This buffers traces for 10 seconds, allowing downstream errors to influence the sampling decision.
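The decision logic those three policies encode can be sketched in plain Python (a simplification: the real processor also enforces the decision_wait window and memory limits, which this sketch omits):

```python
import random
from collections import defaultdict

def tail_sample(spans, default_rate=0.01, rng=random.random):
    """Group buffered spans into complete traces, then apply the three
    policies above in order: errors, latency, probabilistic fallback."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    kept = []
    for trace_id, trace_spans in traces.items():
        if any(s.get("status") == "ERROR" for s in trace_spans):
            kept.append(trace_id)                      # error-traces policy
        elif any(s.get("duration_ms", 0) > 5000 for s in trace_spans):
            kept.append(trace_id)                      # slow-traces policy
        elif rng() < default_rate:
            kept.append(trace_id)                      # sample-normal policy
    return kept
```

Because the whole trace is available, an ERROR status on any downstream span is enough to keep it, which is exactly what head-based sampling cannot do.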

Trace Database Retention Optimization: Your flat 30-day retention may be creating storage pressure that indirectly affects sampling behavior. Implement tiered retention:

  • Error traces: 90 days full retention
  • Slow traces (>5s): 60 days
  • Normal sampled traces: 30 days
  • Dropped traces: Keep metadata only for 7 days

Configure in your observability instance:

{
  "retention_policies": [
    {
      "condition": "error == true",
      "retention_days": 90,
      "storage_tier": "hot"
    },
    {
      "condition": "duration > 5000",
      "retention_days": 60,
      "storage_tier": "warm"
    },
    {
      "condition": "sampled == true",
      "retention_days": 30,
      "storage_tier": "warm"
    }
  ]
}
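Equivalently, tier selection reduces to a first-match lookup. A sketch over the same illustrative trace attributes (not a platform API, just the decision table in code):

```python
def retention_for(trace):
    """Map a trace to (retention_days, storage_tier); first match wins,
    falling through to metadata-only retention for dropped traces."""
    if trace.get("error") is True:
        return 90, "hot"
    if trace.get("duration", 0) > 5000:
        return 60, "warm"
    if trace.get("sampled") is True:
        return 30, "warm"
    return 7, "metadata-only"
```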

Agent Resource Configuration: Tail-based sampling requires significant memory. Update your agent deployment:

resources:
  limits:
    memory: 2Gi
    cpu: 1000m
  requests:
    memory: 1Gi
    cpu: 500m

Validation and Monitoring: Create alerts for sampling effectiveness:

  • Alert when error trace capture rate drops below 95%
  • Monitor sampling decision latency
  • Track trace buffer overflow events

Query to validate error capture:

-- Assumes dropped-trace metadata is retained (per the tiered policy above),
-- otherwise unsampled error traces never appear in this table at all
SELECT
  COUNT(*) AS total_errors,
  SUM(CASE WHEN sampled THEN 1 ELSE 0 END) AS captured_errors,
  (SUM(CASE WHEN sampled THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) AS capture_rate
FROM traces
WHERE error = true
  AND timestamp > NOW() - INTERVAL '1 hour';
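The 95% alert can then be driven off the query’s two counts. A hedged sketch (function name and threshold are illustrative; wiring into your actual alerting system is left out):

```python
def capture_rate_alert(total_errors, captured_errors, threshold_pct=95.0):
    """Return an alert string if the error capture rate falls below the
    threshold, otherwise None. No errors at all is treated as healthy."""
    if total_errors == 0:
        return None
    rate = captured_errors * 100.0 / total_errors
    if rate < threshold_pct:
        return f"error capture rate {rate:.1f}% below {threshold_pct}% target"
    return None
```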

Best Practice Recommendations:

  1. Use tail-based sampling for critical services (authentication, payment, etc.)
  2. Keep head-based probabilistic sampling for high-throughput non-critical services
  3. Always set error attributes before span completion
  4. Implement sampling priority attributes in exception handlers
  5. Monitor agent memory usage and increase it if buffer overflow events occur

This comprehensive approach addresses all four focus areas: sampling policy configuration with priority rules, error trace prioritization through span attributes, sampling decision logging for troubleshooting, and optimized trace database retention. The hybrid head/tail sampling strategy is designed to capture essentially all error traces (buffer overflow under extreme load remains the main residual risk) while controlling costs through aggressive sampling of normal traffic.

We faced this exact issue. The problem is that sampling decisions are made at the trace root, often before the error actually occurs downstream. You need to implement adaptive sampling that can override the initial decision when errors are detected. Also consider using tail-based sampling instead of head-based sampling for critical services.

Error traces should have sampling priority. Check if your application is correctly tagging error spans with error=true attribute. The sampling algorithm relies on span tags to identify high-priority traces. If the error tag isn’t set before the sampling decision, the trace gets treated as normal traffic.

Tail-based sampling requires buffering complete traces before making sampling decisions, which adds latency and resource overhead. Make sure your observability agent has sufficient memory allocated. We had to increase agent memory from 512MB to 2GB to support tail-based sampling for our traffic volume. Also check if your 30-day retention is causing storage pressure that might affect sampling behavior.