Edge data stream buffer overflows during cloud outage, resulting in data loss

During a 6-hour cloud connectivity outage last weekend, our edge data-stream buffers overflowed, resulting in the loss of approximately 3 million telemetry messages. The buffer was supposed to queue messages locally until cloud connectivity was restored, but it filled up after about 2 hours.

Current buffer configuration:


buffer.maxSize=100MB
buffer.strategy=FIFO
buffer.persistence=memory

We need better buffer size configuration to handle longer outages. We’re also considering a local archiving strategy to prevent data loss, and need buffer usage monitoring to alert us before overflow occurs. With 500 devices sending messages every 10 seconds, the 100MB buffer fills quickly during outages. What’s the recommended approach for handling extended cloud outages without losing telemetry data?

I’ll provide a comprehensive solution addressing all three critical areas:

Buffer Size Configuration: Your 100MB buffer is severely undersized. Calculate required buffer capacity:


Devices: 500
Message rate: 6/min (every 10 sec)
Message size: ~500 bytes (typical telemetry)
Outage duration: 6 hours

Required = 500 * 6 * 60 * 6 * 500 bytes
         = 540 MB uncompressed

With compression (70% reduction), a 6-hour outage needs only ~160MB, but plan for longer ones: a 24-hour outage needs ~2.2GB uncompressed (~650MB compressed). Configure a two-tier buffering system:


# Hot buffer (memory) - last 1 hour
buffer.hot.maxSize=300MB
buffer.hot.persistence=memory
buffer.hot.compression=true

# Warm buffer (disk) - up to 24 hours
buffer.warm.maxSize=2GB
buffer.warm.persistence=disk
buffer.warm.path=/var/lib/iot/buffer
buffer.warm.compression=true
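
If you want to redo this sizing for a different fleet size or outage window, the arithmetic is easy to script. The helper below is illustrative, not part of any product:

```python
def required_buffer_bytes(devices: int, msgs_per_min: float,
                          outage_minutes: float, msg_bytes: int,
                          compression_ratio: float = 0.0) -> int:
    """Estimate buffer capacity needed to ride out an outage.

    compression_ratio is the fraction of size removed by compression
    (0.7 means payloads shrink by 70%).
    """
    raw = devices * msgs_per_min * outage_minutes * msg_bytes
    return int(raw * (1.0 - compression_ratio))

# The incident scenario: 500 devices, 6 msgs/min, 6-hour outage, 500-byte messages
raw = required_buffer_bytes(500, 6, 6 * 60, 500)
print(f"uncompressed: {raw / 1e6:.0f} MB")          # 540 MB
compressed = required_buffer_bytes(500, 6, 6 * 60, 500, compression_ratio=0.7)
print(f"with 70% compression: {compressed / 1e6:.0f} MB")  # 162 MB
```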

Implement automatic tier transition:

if (hotBuffer.usage > 80%) {
  moveOldestMessages(hotBuffer, warmBuffer);
}
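
A minimal runnable sketch of that transition, using in-memory deques as stand-ins for the real hot and warm stores. TieredBuffer and its 80% threshold are assumptions mirroring the pseudocode, and it counts messages rather than bytes for simplicity:

```python
from collections import deque

class TieredBuffer:
    """Toy two-tier buffer: spill oldest hot messages to warm at 80% usage."""

    def __init__(self, hot_capacity: int, warm_capacity: int):
        self.hot = deque()
        self.warm = deque()
        self.hot_capacity = hot_capacity
        self.warm_capacity = warm_capacity

    @property
    def hot_usage(self) -> float:
        return len(self.hot) / self.hot_capacity

    def append(self, msg: bytes) -> None:
        self.hot.append(msg)
        # Mirror the pseudocode: above 80% hot usage, demote oldest messages
        while self.hot_usage > 0.80:
            if len(self.warm) >= self.warm_capacity:
                raise OverflowError("warm tier full; archive or sample")
            self.warm.append(self.hot.popleft())

buf = TieredBuffer(hot_capacity=10, warm_capacity=100)
for i in range(20):
    buf.append(f"msg-{i}".encode())
print(len(buf.hot), len(buf.warm))  # 8 12
```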

Local Archiving Strategy: For outages exceeding 24 hours, implement cold storage archiving:

  1. Configure local archive storage:

archive.enabled=true
archive.path=/mnt/archive
archive.format=parquet
archive.compression=snappy
archive.rotationSize=100MB

  2. Set up archiving rules:

if (warmBuffer.usage > 90%) {
  archiveToLocal({
    source: warmBuffer,
    target: '/mnt/archive',
    priority: 'low-priority-telemetry',
    compress: true
  });
}

  3. Implement smart recovery on reconnection:

recovery.priority=hot-buffer,warm-buffer,local-archive
# hot-buffer: sync immediately
# warm-buffer: sync within 1 hour
# local-archive: background sync

recovery.bandwidth.limit=10Mbps  # Don't saturate the link
recovery.batch.size=1000         # Messages per batch

  4. Optional: Use S3-compatible storage for unlimited capacity:

archive.remote.enabled=true
archive.remote.endpoint=http://minio.local:9000
archive.remote.bucket=iot-telemetry-archive
archive.remote.accessKey=<key>
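
The smart-recovery settings boil down to: drain tiers in priority order, in batches, under a bandwidth cap. A rough sketch, assuming a hypothetical upload_batch() in place of your real cloud client, with a naive sleep-based throttle:

```python
import time

def upload_batch(batch: list) -> None:
    """Stub for the real cloud uploader; replace with your client call."""
    pass

def replay(tiers: dict, batch_size: int = 1000,
           max_bytes_per_sec: float = 1.25e6) -> int:
    """Drain tiers in priority order with batching and a crude rate cap.

    1.25e6 bytes/s corresponds to the 10 Mbps bandwidth limit.
    Returns the number of messages replayed.
    """
    sent = 0
    for name in ("hot-buffer", "warm-buffer", "local-archive"):
        msgs = tiers.get(name, [])
        for i in range(0, len(msgs), batch_size):
            batch = msgs[i:i + batch_size]
            upload_batch(batch)
            sent += len(batch)
            # Sleep long enough to keep average throughput under the cap
            time.sleep(sum(len(m) for m in batch) / max_bytes_per_sec)
    return sent

total = replay({"hot-buffer": [b"x" * 100] * 2500,
                "warm-buffer": [b"y" * 100] * 500})
print(total)  # 3000
```

A production version would also checkpoint progress so a second outage mid-replay does not re-send everything.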

Buffer Usage Monitoring: Implement comprehensive buffer monitoring:

  1. Export metrics:

monitoring.metrics.enabled=true
monitoring.export.prometheus=true
monitoring.export.interval=30s

  2. Key metrics to track:

// Buffer depth
buffer_usage_bytes{tier="hot"}
buffer_usage_bytes{tier="warm"}
buffer_usage_bytes{tier="archive"}

// Write rate
buffer_write_rate_msgs_per_sec
buffer_write_rate_bytes_per_sec

// Estimated time to overflow
buffer_time_to_overflow_seconds

// Message age
buffer_oldest_message_age_seconds

  3. Set up alerts:

alerts:
  - name: BufferHighUsage
    condition: buffer_usage_percent > 70
    severity: warning
    action: notify_ops_team

  - name: BufferCritical
    condition: buffer_usage_percent > 85
    severity: critical
    action: [notify_ops_team, trigger_archiving]

  - name: BufferOverflowImminent
    condition: buffer_time_to_overflow_seconds < 3600
    severity: critical
    action: [notify_ops_team, enable_message_sampling]

  4. Implement adaptive message sampling during high buffer usage:

if (buffer.usage > 85%) {
  // Sample non-critical telemetry
  sampling.rate = 0.5;  // Keep 50% of messages
  sampling.priority = 'preserve-critical';
}

if (buffer.usage > 95%) {
  // Aggressive sampling
  sampling.rate = 0.2;  // Keep 20% of messages
  sampling.strategy = 'statistical';  // Keep representative sample
}
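
Two of the pieces above, the buffer_time_to_overflow_seconds metric and the usage-driven sampling policy, can be sketched together. The function names and thresholds below just mirror the examples and are not a vendor API:

```python
import random

def time_to_overflow_seconds(capacity: int, used: int,
                             write_bps: float, drain_bps: float = 0.0) -> float:
    """Estimate seconds until the buffer overflows at the current net fill rate."""
    net = write_bps - drain_bps
    if net <= 0:
        return float("inf")  # buffer is draining or stable
    return (capacity - used) / net

def sampling_rate(usage: float) -> float:
    """Map buffer usage (0..1) to the keep-rate from the policy above."""
    if usage > 0.95:
        return 0.2   # aggressive: keep 20% of messages
    if usage > 0.85:
        return 0.5   # keep 50% of messages
    return 1.0       # no sampling

def should_keep(usage: float, critical: bool, rng=random.random) -> bool:
    """Critical telemetry is never dropped; the rest is sampled."""
    return critical or rng() < sampling_rate(usage)

# Example: 2 GB buffer, 1.5 GB used, filling at a net 25 KB/s
print(time_to_overflow_seconds(2_000_000_000, 1_500_000_000, 25_000))  # 20000.0
```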

Complete Configuration Example:

dataStream:
  buffer:
    hot:
      maxSize: 300MB
      persistence: memory
      compression: gzip
      compressionLevel: 6
    warm:
      maxSize: 2GB
      persistence: disk
      path: /var/lib/iot/buffer
      compression: gzip
    archive:
      enabled: true
      path: /mnt/archive
      maxSize: unlimited
      format: parquet
      compression: snappy

  overflow:
    strategy: PRIORITY  # Not FIFO
    priorities:
      critical: 1.0    # Never drop
      high: 0.9
      medium: 0.5
      low: 0.2         # Drop first

  monitoring:
    enabled: true
    alertThresholds:
      warning: 70
      critical: 85
    estimateOverflow: true

  recovery:
    prioritize: hot-buffer
    bandwidthLimit: 10Mbps
    batchSize: 1000
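
The PRIORITY overflow strategy can be approximated with a min-heap that evicts the lowest-priority (and, within a tier, oldest) message first. This is a sketch under those assumptions, not the actual broker implementation:

```python
import heapq, itertools

class PriorityBuffer:
    """Bounded buffer that drops lowest-priority, oldest-first on overflow."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.heap = []                 # (priority, seq, msg); smallest pops first
        self.seq = itertools.count()   # tie-breaker: lower seq = older message

    def offer(self, msg, priority: float):
        """priority: 1.0 = critical (drop last) ... 0.2 = low (drop first).

        Returns the dropped message on overflow, else None.
        """
        entry = (priority, next(self.seq), msg)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
            return None
        # Full: evict the lowest-priority entry if the new one outranks it
        if self.heap[0][0] < priority:
            return heapq.heappushpop(self.heap, entry)[2]
        return msg  # new message itself is the lowest priority; drop it

buf = PriorityBuffer(capacity=2)
buf.offer("temp", 0.2)
buf.offer("alarm", 1.0)
dropped = buf.offer("fault", 0.9)   # evicts the low-priority "temp"
print(dropped)  # temp
```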

Best Practices:

  • Size buffers for 2x your expected maximum outage duration
  • Always use disk persistence for warm buffers
  • Implement compression (reduces storage by 60-70%)
  • Use priority-based overflow strategy, not FIFO
  • Monitor buffer usage and set alerts at 70% and 85%
  • Test buffer overflow scenarios in staging
  • Document recovery procedures for extended outages
  • Consider message sampling for non-critical telemetry during high buffer usage

With this configuration, you can handle outages up to 24 hours without data loss, and longer outages with local archiving.

Local archiving to object storage is the right approach for long outages. Configure a two-tier system: hot buffer (1GB) for recent data that gets synced first when connectivity returns, and cold archive (unlimited) for older data that syncs in the background. This ensures critical recent data is prioritized during recovery.

Don’t forget about buffer usage monitoring. You need real-time alerts when buffer usage exceeds 70-80% so you can take action before overflow occurs. We use Prometheus metrics exported from edge nodes to track buffer depth, write rate, and estimated time to overflow. This gives us early warning of potential issues.

FIFO strategy means you’re dropping the oldest messages when the buffer fills. Consider using a priority-based strategy where critical telemetry is preserved and less important data is dropped first. Also implement data compression - you can typically reduce telemetry payload size by 60-70% with gzip compression, effectively multiplying your buffer capacity.
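
The 60-70% figure is easy to sanity-check on a representative payload; the exact ratio depends heavily on message shape (repetitive JSON telemetry compresses far better than binary or already-encrypted data):

```python
import gzip, json

# A batch of repetitive JSON telemetry, the typical best case for gzip
batch = json.dumps([
    {"device_id": f"sensor-{i:04d}", "temperature_c": 21.5,
     "humidity_pct": 40, "status": "OK"}
    for i in range(1000)
]).encode()

compressed = gzip.compress(batch, compresslevel=6)
print(f"{len(batch)} -> {len(compressed)} bytes "
      f"({1 - len(compressed) / len(batch):.0%} smaller)")
```

Run it against a capture of your real messages before committing to a sizing number.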