Edge data stream buffer overflows during cloud outage, resulting in data loss

During a 6-hour cloud connectivity outage last weekend, our edge data-stream buffers overflowed, resulting in the loss of approximately 3 million telemetry messages. The buffer was supposed to queue messages locally until cloud connectivity was restored, but it filled up after about 2 hours.

Current buffer configuration:


buffer.maxSize=100MB
buffer.strategy=FIFO
buffer.persistence=memory

We need better buffer size configuration to handle longer outages. We’re also considering a local archiving strategy to prevent data loss, and need buffer usage monitoring to alert us before overflow occurs. With 500 devices sending messages every 10 seconds, the 100MB buffer fills quickly during outages. What’s the recommended approach for handling extended cloud outages without losing telemetry data?

I’ll provide a comprehensive solution addressing all three critical areas:

Buffer Size Configuration: Your 100MB buffer is severely undersized. Calculate required buffer capacity:


Devices: 500
Message rate: 6/min (every 10 sec)
Message size: ~500 bytes (typical telemetry)
Outage duration: 6 hours

Required = 500 * 6 * 60 * 6 * 500 bytes
         = 540 MB uncompressed

With compression (70% reduction), a 6-hour outage needs only ~160MB, but plan for longer ones: a 24-hour outage needs ~2.2GB uncompressed (~650MB compressed). Configure a two-tier buffering system:


# Hot buffer (memory) - last 1 hour
buffer.hot.maxSize=300MB
buffer.hot.persistence=memory
buffer.hot.compression=true

# Warm buffer (disk) - up to 24 hours
buffer.warm.maxSize=2GB
buffer.warm.persistence=disk
buffer.warm.path=/var/lib/iot/buffer
buffer.warm.compression=true
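
If you want to redo this sizing for a different fleet size or outage window, the arithmetic is easy to script. The helper below is illustrative, not part of any product:

```python
def required_buffer_bytes(devices: int, msgs_per_min: float,
                          outage_minutes: float, msg_bytes: int,
                          compression_ratio: float = 0.0) -> int:
    """Estimate buffer capacity needed to ride out an outage.

    compression_ratio is the fraction of size removed by compression
    (0.7 means payloads shrink by 70%).
    """
    raw = devices * msgs_per_min * outage_minutes * msg_bytes
    return int(raw * (1.0 - compression_ratio))

# The incident scenario: 500 devices, 6 msgs/min, 6-hour outage, 500-byte messages
raw = required_buffer_bytes(500, 6, 6 * 60, 500)
print(f"uncompressed: {raw / 1e6:.0f} MB")          # 540 MB
compressed = required_buffer_bytes(500, 6, 6 * 60, 500, compression_ratio=0.7)
print(f"with 70% compression: {compressed / 1e6:.0f} MB")  # 162 MB
```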

Implement automatic tier transition:

if (hotBuffer.usage > 80%) {
  moveOldestMessages(hotBuffer, warmBuffer);
}
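
A minimal runnable sketch of that transition, using in-memory deques as stand-ins for the real hot and warm stores. TieredBuffer and its 80% threshold are assumptions mirroring the pseudocode, and it counts messages rather than bytes for simplicity:

```python
from collections import deque

class TieredBuffer:
    """Toy two-tier buffer: spill oldest hot messages to warm at 80% usage."""

    def __init__(self, hot_capacity: int, warm_capacity: int):
        self.hot = deque()
        self.warm = deque()
        self.hot_capacity = hot_capacity
        self.warm_capacity = warm_capacity

    @property
    def hot_usage(self) -> float:
        return len(self.hot) / self.hot_capacity

    def append(self, msg: bytes) -> None:
        self.hot.append(msg)
        # Mirror the pseudocode: above 80% hot usage, demote oldest messages
        while self.hot_usage > 0.80:
            if len(self.warm) >= self.warm_capacity:
                raise OverflowError("warm tier full; archive or sample")
            self.warm.append(self.hot.popleft())

buf = TieredBuffer(hot_capacity=10, warm_capacity=100)
for i in range(20):
    buf.append(f"msg-{i}".encode())
print(len(buf.hot), len(buf.warm))  # 8 12
```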

Local Archiving Strategy: For outages exceeding 24 hours, implement cold storage archiving:

  1. Configure local archive storage:

archive.enabled=true
archive.path=/mnt/archive
archive.format=parquet
archive.compression=snappy
archive.rotationSize=100MB

  2. Set up archiving rules:

if (warmBuffer.usage > 90%) {
  archiveToLocal({
    source: warmBuffer,
    target: '/mnt/archive',
    priority: 'low-priority-telemetry',
    compress: true
  });
}

  3. Implement smart recovery on reconnection:

recovery.priority=hot-buffer,warm-buffer,local-archive
# hot-buffer: sync immediately
# warm-buffer: sync within 1 hour
# local-archive: background sync

recovery.bandwidth.limit=10Mbps  # Don't saturate the link
recovery.batch.size=1000         # Messages per batch

  4. Optional: Use S3-compatible storage for unlimited capacity:

archive.remote.enabled=true
archive.remote.endpoint=http://minio.local:9000
archive.remote.bucket=iot-telemetry-archive
archive.remote.accessKey=<key>
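
The smart-recovery settings boil down to: drain tiers in priority order, in batches, under a bandwidth cap. A rough sketch, assuming a hypothetical upload_batch() in place of your real cloud client, with a naive sleep-based throttle:

```python
import time

def upload_batch(batch: list) -> None:
    """Stub for the real cloud uploader; replace with your client call."""
    pass

def replay(tiers: dict, batch_size: int = 1000,
           max_bytes_per_sec: float = 1.25e6) -> int:
    """Drain tiers in priority order with batching and a crude rate cap.

    1.25e6 bytes/s corresponds to the 10 Mbps bandwidth limit.
    Returns the number of messages replayed.
    """
    sent = 0
    for name in ("hot-buffer", "warm-buffer", "local-archive"):
        msgs = tiers.get(name, [])
        for i in range(0, len(msgs), batch_size):
            batch = msgs[i:i + batch_size]
            upload_batch(batch)
            sent += len(batch)
            # Sleep long enough to keep average throughput under the cap
            time.sleep(sum(len(m) for m in batch) / max_bytes_per_sec)
    return sent

total = replay({"hot-buffer": [b"x" * 100] * 2500,
                "warm-buffer": [b"y" * 100] * 500})
print(total)  # 3000
```

A production version would also checkpoint progress so a second outage mid-replay does not re-send everything.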

Buffer Usage Monitoring: Implement comprehensive buffer monitoring:

  1. Export metrics:

monitoring.metrics.enabled=true
monitoring.export.prometheus=true
monitoring.export.interval=30s

  2. Key metrics to track:

// Buffer depth
buffer_usage_bytes{tier="hot"}
buffer_usage_bytes{tier="warm"}
buffer_usage_bytes{tier="archive"}

// Write rate
buffer_write_rate_msgs_per_sec
buffer_write_rate_bytes_per_sec

// Estimated time to overflow
buffer_time_to_overflow_seconds

// Message age
buffer_oldest_message_age_seconds

  3. Set up alerts:

alerts:
  - name: BufferHighUsage
    condition: buffer_usage_percent > 70
    severity: warning
    action: notify_ops_team

  - name: BufferCritical
    condition: buffer_usage_percent > 85
    severity: critical
    action: [notify_ops_team, trigger_archiving]

  - name: BufferOverflowImminent
    condition: buffer_time_to_overflow_seconds < 3600
    severity: critical
    action: [notify_ops_team, enable_message_sampling]

  4. Implement adaptive message sampling during high buffer usage:

if (buffer.usage > 85%) {
  // Sample non-critical telemetry
  sampling.rate = 0.5;  // Keep 50% of messages
  sampling.priority = 'preserve-critical';
}

if (buffer.usage > 95%) {
  // Aggressive sampling
  sampling.rate = 0.2;  // Keep 20% of messages
  sampling.strategy = 'statistical';  // Keep representative sample
}
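
Two of the pieces above, the buffer_time_to_overflow_seconds metric and the usage-driven sampling policy, can be sketched together. The function names and thresholds below just mirror the examples and are not a vendor API:

```python
import random

def time_to_overflow_seconds(capacity: int, used: int,
                             write_bps: float, drain_bps: float = 0.0) -> float:
    """Estimate seconds until the buffer overflows at the current net fill rate."""
    net = write_bps - drain_bps
    if net <= 0:
        return float("inf")  # buffer is draining or stable
    return (capacity - used) / net

def sampling_rate(usage: float) -> float:
    """Map buffer usage (0..1) to the keep-rate from the policy above."""
    if usage > 0.95:
        return 0.2   # aggressive: keep 20% of messages
    if usage > 0.85:
        return 0.5   # keep 50% of messages
    return 1.0       # no sampling

def should_keep(usage: float, critical: bool, rng=random.random) -> bool:
    """Critical telemetry is never dropped; the rest is sampled."""
    return critical or rng() < sampling_rate(usage)

# Example: 2 GB buffer, 1.5 GB used, filling at a net 25 KB/s
print(time_to_overflow_seconds(2_000_000_000, 1_500_000_000, 25_000))  # 20000.0
```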

Complete Configuration Example:

dataStream:
  buffer:
    hot:
      maxSize: 300MB
      persistence: memory
      compression: gzip
      compressionLevel: 6
    warm:
      maxSize: 2GB
      persistence: disk
      path: /var/lib/iot/buffer
      compression: gzip
    archive:
      enabled: true
      path: /mnt/archive
      maxSize: unlimited
      format: parquet
      compression: snappy

  overflow:
    strategy: PRIORITY  # Not FIFO
    priorities:
      critical: 1.0    # Never drop
      high: 0.9
      medium: 0.5
      low: 0.2         # Drop first

  monitoring:
    enabled: true
    alertThresholds:
      warning: 70
      critical: 85
    estimateOverflow: true

  recovery:
    prioritize: hot-buffer
    bandwidthLimit: 10Mbps
    batchSize: 1000
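
The PRIORITY overflow strategy can be approximated with a min-heap that evicts the lowest-priority (and, within a tier, oldest) message first. This is a sketch under those assumptions, not the actual broker implementation:

```python
import heapq, itertools

class PriorityBuffer:
    """Bounded buffer that drops lowest-priority, oldest-first on overflow."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.heap = []                 # (priority, seq, msg); smallest pops first
        self.seq = itertools.count()   # tie-breaker: lower seq = older message

    def offer(self, msg, priority: float):
        """priority: 1.0 = critical (drop last) ... 0.2 = low (drop first).

        Returns the dropped message on overflow, else None.
        """
        entry = (priority, next(self.seq), msg)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
            return None
        # Full: evict the lowest-priority entry if the new one outranks it
        if self.heap[0][0] < priority:
            return heapq.heappushpop(self.heap, entry)[2]
        return msg  # new message itself is the lowest priority; drop it

buf = PriorityBuffer(capacity=2)
buf.offer("temp", 0.2)
buf.offer("alarm", 1.0)
dropped = buf.offer("fault", 0.9)   # evicts the low-priority "temp"
print(dropped)  # temp
```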

Best Practices:

  • Size buffers for 2x your expected maximum outage duration
  • Always use disk persistence for warm buffers
  • Implement compression (reduces storage by 60-70%)
  • Use priority-based overflow strategy, not FIFO
  • Monitor buffer usage and set alerts at 70% and 85%
  • Test buffer overflow scenarios in staging
  • Document recovery procedures for extended outages
  • Consider message sampling for non-critical telemetry during high buffer usage

With this configuration, you can handle outages up to 24 hours without data loss, and longer outages with local archiving.

Local archiving to object storage is the right approach for long outages. Configure a two-tier system: hot buffer (1GB) for recent data that gets synced first when connectivity returns, and cold archive (unlimited) for older data that syncs in the background. This ensures critical recent data is prioritized during recovery.

Don’t forget about buffer usage monitoring. You need real-time alerts when buffer usage exceeds 70-80% so you can take action before overflow occurs. We use Prometheus metrics exported from edge nodes to track buffer depth, write rate, and estimated time to overflow. This gives us early warning of potential issues.

FIFO strategy means you’re dropping the oldest messages when the buffer fills. Consider using a priority-based strategy where critical telemetry is preserved and less important data is dropped first. Also implement data compression - you can typically reduce telemetry payload size by 60-70% with gzip compression, effectively multiplying your buffer capacity.
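
The 60-70% figure is easy to sanity-check on a representative payload; the exact ratio depends heavily on message shape (repetitive JSON telemetry compresses far better than binary or already-encrypted data):

```python
import gzip, json

# A batch of repetitive JSON telemetry, the typical best case for gzip
batch = json.dumps([
    {"device_id": f"sensor-{i:04d}", "temperature_c": 21.5,
     "humidity_pct": 40, "status": "OK"}
    for i in range(1000)
]).encode()

compressed = gzip.compress(batch, compresslevel=6)
print(f"{len(batch)} -> {len(compressed)} bytes "
      f"({1 - len(compressed) / len(batch):.0%} smaller)")
```

Run it against a capture of your real messages before committing to a sizing number.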