Distributed trace sampling missing critical error events in Cloud Observability

We’ve implemented distributed tracing across our microservices using IBM Cloud Observability, but we’re losing visibility into critical error events. The sampling algorithm appears to be dropping error traces during high-traffic periods, even though we’ve configured error prioritization in the sampling policy. When we review the trace database, we can see successful transactions, but error traces that appear in our application logs are missing from the observability platform. I’ve checked the sampling decision logging, and it shows a 1% sample rate during peak hours, which seems to be filtering out errors along with normal traffic. Our trace database retention is set to 30 days. How can we ensure error traces are always captured regardless of sampling decisions?

That makes sense about the sampling decision timing. Our errors typically occur in downstream services after the initial sampling decision. How do I configure tail-based sampling in IBM Cloud Observability?

Missing error traces during high-traffic periods is a critical observability gap. The root cause is that standard head-based sampling makes decisions before errors occur downstream. Here’s a comprehensive solution:

Sampling Policy Configuration: Replace your current head-based sampling with a hybrid approach. Configure your observability agent with priority-based sampling:

sampling:
  default_rate: 0.01  # 1% for normal traffic
  priority_rules:
    - condition: "error == true"
      rate: 1.0  # 100% for errors
    - condition: "http.status_code >= 500"
      rate: 1.0
    - condition: "duration > 5000"
      rate: 0.5  # 50% for slow requests

This ensures error traces bypass the 1% sampling rate you’re seeing during peak hours.
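If your agent’s policy syntax differs, the same first-match rule ordering can be expressed directly in code. A minimal pure-Python sketch of the priority logic above (rule predicates and span field names are illustrative, not a specific agent API):

```python
import random

# Priority rules mirrored from the config above: (predicate, sample_rate),
# checked in order; the first matching rule decides the rate.
PRIORITY_RULES = [
    (lambda span: span.get("error") is True, 1.0),             # 100% for errors
    (lambda span: span.get("http.status_code", 0) >= 500, 1.0),
    (lambda span: span.get("duration", 0) > 5000, 0.5),        # 50% for slow requests
]
DEFAULT_RATE = 0.01  # 1% for normal traffic

def should_sample(span, rng=random.random):
    """Return True if the span should be kept; first matching rule wins."""
    for matches, rate in PRIORITY_RULES:
        if matches(span):
            return rng() < rate
    return rng() < DEFAULT_RATE
```

Passing `rng` explicitly makes the decision deterministic in tests; in production the default `random.random` is used.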

Error Trace Prioritization: Implement error detection at multiple span levels. In your application instrumentation:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order") as span:
    try:
        result = process_order(order_id)
    except Exception as e:
        # Mark the span as errored so priority sampling rules can match it
        span.set_attribute("error", True)
        span.set_attribute("error.type", type(e).__name__)
        span.set_status(Status(StatusCode.ERROR))
        # Force sampling decision override
        span.set_attribute("sampling.priority", 1)
        raise

The sampling.priority attribute is a tracer convention (originating with Jaeger/OpenTracing) that tells a supporting agent to retain the trace even if it was initially sampled out; verify that your agent honors it before relying on this.

Sampling Decision Logging: Enable detailed sampling decision logging to understand why traces are dropped:

logging:
  level: debug
  sampling_decisions: true
  include_fields:
    - trace_id
    - sampling_decision
    - decision_reason
    - span_attributes

Analyze the logs to identify patterns in dropped error traces. Look for traces where error attributes were set AFTER the sampling decision.
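One way to run that analysis, assuming the decision log emits one JSON object per line with the fields enabled above (the field names and the "drop" value are illustrative; adjust to your agent’s actual log schema):

```python
import json

def find_late_errors(log_lines):
    """Return trace IDs that were dropped by sampling but whose spans
    later carried error attributes, i.e. errors set after the decision."""
    dropped, errored = set(), set()
    for line in log_lines:
        rec = json.loads(line)
        if rec.get("sampling_decision") == "drop":
            dropped.add(rec["trace_id"])
        if rec.get("span_attributes", {}).get("error"):
            errored.add(rec["trace_id"])
    # The intersection is exactly the gap you are seeing: errored traces
    # that the head-based decision had already discarded.
    return dropped & errored
```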

Tail-Based Sampling Implementation: For critical services, implement tail-based sampling that evaluates complete traces:

processors:
  tail_sampling:
    decision_wait: 10s  # Wait for complete trace
    policies:
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 5000}
      - name: sample-normal
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

This buffers traces for 10 seconds, allowing downstream errors to influence the sampling decision.
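The decision logic those three policies encode can be sketched in plain Python (a simplification: the real processor also enforces the decision_wait window and memory limits, which this sketch omits):

```python
import random
from collections import defaultdict

def tail_sample(spans, default_rate=0.01, rng=random.random):
    """Group buffered spans into complete traces, then apply the three
    policies above in order: errors, latency, probabilistic fallback."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    kept = []
    for trace_id, trace_spans in traces.items():
        if any(s.get("status") == "ERROR" for s in trace_spans):
            kept.append(trace_id)                      # error-traces policy
        elif any(s.get("duration_ms", 0) > 5000 for s in trace_spans):
            kept.append(trace_id)                      # slow-traces policy
        elif rng() < default_rate:
            kept.append(trace_id)                      # sample-normal policy
    return kept
```

Because the whole trace is available, an ERROR status on any downstream span is enough to keep it, which is exactly what head-based sampling cannot do.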

Trace Database Retention Optimization: Your flat 30-day retention may be creating storage pressure that indirectly affects sampling behavior. Implement tiered retention:

  • Error traces: 90 days full retention
  • Slow traces (>5s): 60 days
  • Normal sampled traces: 30 days
  • Dropped traces: Keep metadata only for 7 days

Configure in your observability instance:

{
  "retention_policies": [
    {
      "condition": "error == true",
      "retention_days": 90,
      "storage_tier": "hot"
    },
    {
      "condition": "duration > 5000",
      "retention_days": 60,
      "storage_tier": "warm"
    },
    {
      "condition": "sampled == true",
      "retention_days": 30,
      "storage_tier": "warm"
    }
  ]
}
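Equivalently, tier selection reduces to a first-match lookup. A sketch over the same illustrative trace attributes (not a platform API, just the decision table in code):

```python
def retention_for(trace):
    """Map a trace to (retention_days, storage_tier); first match wins,
    falling through to metadata-only retention for dropped traces."""
    if trace.get("error") is True:
        return 90, "hot"
    if trace.get("duration", 0) > 5000:
        return 60, "warm"
    if trace.get("sampled") is True:
        return 30, "warm"
    return 7, "metadata-only"
```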

Agent Resource Configuration: Tail-based sampling requires significant memory. Update your agent deployment:

resources:
  limits:
    memory: 2Gi
    cpu: 1000m
  requests:
    memory: 1Gi
    cpu: 500m

Validation and Monitoring: Create alerts for sampling effectiveness:

  • Alert when error trace capture rate drops below 95%
  • Monitor sampling decision latency
  • Track trace buffer overflow events

Query to validate error capture:

-- Assumes dropped-trace metadata is retained (per the tiered policy above),
-- otherwise unsampled error traces never appear in this table at all
SELECT
  COUNT(*) AS total_errors,
  SUM(CASE WHEN sampled THEN 1 ELSE 0 END) AS captured_errors,
  (SUM(CASE WHEN sampled THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) AS capture_rate
FROM traces
WHERE error = true
  AND timestamp > NOW() - INTERVAL '1 hour';
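The 95% alert can then be driven off the query’s two counts. A hedged sketch (function name and threshold are illustrative; wiring into your actual alerting system is left out):

```python
def capture_rate_alert(total_errors, captured_errors, threshold_pct=95.0):
    """Return an alert string if the error capture rate falls below the
    threshold, otherwise None. No errors at all is treated as healthy."""
    if total_errors == 0:
        return None
    rate = captured_errors * 100.0 / total_errors
    if rate < threshold_pct:
        return f"error capture rate {rate:.1f}% below {threshold_pct}% target"
    return None
```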

Best Practice Recommendations:

  1. Use tail-based sampling for critical services (authentication, payment, etc.)
  2. Keep head-based probabilistic sampling for high-throughput non-critical services
  3. Always set error attributes before span completion
  4. Implement sampling priority attributes in exception handlers
  5. Monitor agent memory usage and increase it if buffer overflow events occur

This comprehensive approach addresses all four focus areas: sampling policy configuration with priority rules, error trace prioritization through span attributes, sampling decision logging for troubleshooting, and optimized trace database retention. The hybrid head/tail sampling strategy is designed to capture essentially all error traces (buffer overflow under extreme load remains the main residual risk) while controlling costs through aggressive sampling of normal traffic.

We faced this exact issue. The problem is that sampling decisions are made at the trace root, often before the error actually occurs downstream. You need to implement adaptive sampling that can override the initial decision when errors are detected. Also consider using tail-based sampling instead of head-based sampling for critical services.

Error traces should have sampling priority. Check if your application is correctly tagging error spans with error=true attribute. The sampling algorithm relies on span tags to identify high-priority traces. If the error tag isn’t set before the sampling decision, the trace gets treated as normal traffic.

Tail-based sampling requires buffering complete traces before making sampling decisions, which adds latency and resource overhead. Make sure your observability agent has sufficient memory allocated. We had to increase agent memory from 512MB to 2GB to support tail-based sampling for our traffic volume. Also check if your 30-day retention is causing storage pressure that might affect sampling behavior.