I’ll provide a comprehensive solution addressing all three critical areas:
Buffer Size Configuration:
Your 100MB buffer is undersized for your outage window. Calculate the required capacity:
Devices: 500
Message rate: 6/min (one every 10 sec)
Message size: ~500 bytes (typical telemetry)
Outage duration: 6 hours
Required = 500 devices * 6 msgs/min * 60 min * 6 hours * 500 bytes
= 540 MB uncompressed
With compression (70% reduction), a 6-hour outage needs ~160MB, which already exceeds your 100MB, and a 24-hour outage needs ~650MB compressed (~2.2GB raw). Configure a two-tier buffering system:
# Hot buffer (memory) - last 1 hour
buffer.hot.maxSize=300MB
buffer.hot.persistence=memory
buffer.hot.compression=true
# Warm buffer (disk) - up to 24 hours
buffer.warm.maxSize=2GB
buffer.warm.persistence=disk
buffer.warm.path=/var/lib/iot/buffer
buffer.warm.compression=true
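To sanity-check the sizing arithmetic for other fleet sizes or outage windows, a small helper like this works (the function is illustrative, not part of any platform API):

```python
def required_buffer_bytes(devices, msgs_per_min, outage_hours,
                          msg_bytes=500, compression_ratio=0.3):
    """Estimate buffer capacity needed to ride out an outage.

    compression_ratio is the fraction remaining after compression
    (0.3 corresponds to a 70% reduction, typical for JSON telemetry).
    Returns (raw_bytes, compressed_bytes).
    """
    raw = devices * msgs_per_min * 60 * outage_hours * msg_bytes
    return raw, int(raw * compression_ratio)

# The scenario above: 500 devices, 6 msgs/min, 6-hour outage
raw, compressed = required_buffer_bytes(500, 6, 6)
print(raw // 10**6, compressed // 10**6)  # → 540 162 (MB)
```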
Implement automatic tier transition:
if (hotBuffer.usage > 80%) {
    moveOldestMessages(hotBuffer, warmBuffer);
}
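A minimal Python sketch of that transition, with in-memory deques standing in for the real hot and warm stores (class and method names are illustrative):

```python
from collections import deque

class TwoTierBuffer:
    """Hot (memory) tier spills its oldest messages to the warm (disk) tier."""

    def __init__(self, hot_capacity, spill_threshold=0.8):
        self.hot = deque()
        self.warm = deque()  # stand-in for a disk-backed queue
        self.hot_capacity = hot_capacity
        self.spill_threshold = spill_threshold

    def append(self, msg):
        self.hot.append(msg)
        # Once the hot tier passes 80% of capacity, demote oldest-first
        while len(self.hot) > self.hot_capacity * self.spill_threshold:
            self.warm.append(self.hot.popleft())

buf = TwoTierBuffer(hot_capacity=10)
for i in range(12):
    buf.append(i)
print(len(buf.hot), len(buf.warm))  # → 8 4
```

Spilling oldest-first keeps the most recent (most operationally relevant) telemetry in the fast tier.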
Local Archiving Strategy:
For outages exceeding 24 hours, implement cold storage archiving:
- Configure local archive storage:
archive.enabled=true
archive.path=/mnt/archive
archive.format=parquet
archive.compression=snappy
archive.rotationSize=100MB
- Set up archiving rules:
if (warmBuffer.usage > 90%) {
    archiveToLocal({
        source: warmBuffer,
        target: '/mnt/archive',
        priority: 'low-priority-telemetry',
        compress: true
    });
}
- Implement smart recovery on reconnection:
recovery.priority = [
    'hot-buffer',    // Sync immediately
    'warm-buffer',   // Sync within 1 hour
    'local-archive'  // Background sync
];
recovery.bandwidth.limit = 10Mbps; // Don't saturate link
recovery.batch.size = 1000;        // Messages per batch
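The recovery ordering can be sketched as a batched, rate-limited drain. This is a sketch, not a real SDK call: the tier names, the `send_batch` callback, and its return value (bytes sent) are assumptions.

```python
import time

def drain_on_reconnect(tiers, send_batch, batch_size=1000,
                       max_bytes_per_sec=1_250_000):  # ~10 Mbps
    """Drain buffered messages tier by tier, highest priority first,
    throttling so recovery traffic doesn't saturate the uplink.

    send_batch(tier_name, batch) must return the bytes transmitted.
    """
    for name in ('hot-buffer', 'warm-buffer', 'local-archive'):
        queue = tiers.get(name, [])
        while queue:
            batch, queue[:] = queue[:batch_size], queue[batch_size:]
            sent_bytes = send_batch(name, batch)
            # Crude pacing: sleep long enough to stay under the cap
            time.sleep(sent_bytes / max_bytes_per_sec)

sent = []
tiers = {'hot-buffer': ['h1', 'h2'], 'warm-buffer': ['w1'], 'local-archive': []}
drain_on_reconnect(tiers, lambda tier, b: sent.append((tier, b)) or 0)
print(sent)  # → [('hot-buffer', ['h1', 'h2']), ('warm-buffer', ['w1'])]
```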
- Optional: Use S3-compatible storage for unlimited capacity:
archive.remote.enabled=true
archive.remote.endpoint=http://minio.local:9000
archive.remote.bucket=iot-telemetry-archive
archive.remote.accessKey=<key>
Buffer Usage Monitoring:
Implement comprehensive buffer monitoring:
- Export metrics:
monitoring.metrics.enabled=true
monitoring.export.prometheus=true
monitoring.export.interval=30s
- Key metrics to track:
// Buffer depth
buffer_usage_bytes{tier="hot"}
buffer_usage_bytes{tier="warm"}
buffer_usage_bytes{tier="archive"}
// Write rate
buffer_write_rate_msgs_per_sec
buffer_write_rate_bytes_per_sec
// Estimated time to overflow
buffer_time_to_overflow_seconds
// Message age
buffer_oldest_message_age_seconds
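The most actionable of these is `buffer_time_to_overflow_seconds`, which you can derive yourself from capacity and a smoothed net fill rate; a minimal sketch (the function name and parameters are illustrative, not a metrics-library API):

```python
def time_to_overflow(capacity_bytes, used_bytes, write_bytes_per_sec,
                     drain_bytes_per_sec=0.0):
    """Seconds until the buffer fills at the current net fill rate.

    Returns float('inf') when the buffer is static or draining,
    i.e. when no overflow is projected.
    """
    net_rate = write_bytes_per_sec - drain_bytes_per_sec
    if net_rate <= 0:
        return float('inf')
    return (capacity_bytes - used_bytes) / net_rate

# 2 GB warm buffer, 1.5 GB used, filling at 25 KB/s during an outage
eta = time_to_overflow(2_000_000_000, 1_500_000_000, 25_000)
print(round(eta))  # → 20000 (seconds, about 5.5 hours)
```

In practice, feed it a moving average of the write rate rather than an instantaneous sample, or bursty devices will make the estimate flap.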
- Set up alerts:
alerts:
  - name: BufferHighUsage
    condition: buffer_usage_percent > 70
    severity: warning
    action: notify_ops_team
  - name: BufferCritical
    condition: buffer_usage_percent > 85
    severity: critical
    action: [notify_ops_team, trigger_archiving]
  - name: BufferOverflowImminent
    condition: buffer_time_to_overflow_seconds < 3600
    severity: critical
    action: [notify_ops_team, enable_message_sampling]
- Implement adaptive message sampling during high buffer usage:
if (buffer.usage > 85%) {
    // Sample non-critical telemetry
    sampling.rate = 0.5;               // Keep 50% of messages
    sampling.priority = 'preserve-critical';
}
if (buffer.usage > 95%) {
    // Aggressive sampling
    sampling.rate = 0.2;               // Keep 20% of messages
    sampling.strategy = 'statistical'; // Keep representative sample
}
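A priority-preserving sampler can be sketched like this, using deterministic hashing rather than random drops so the kept subset is stable and representative (the function and its parameters are assumptions, not a library API):

```python
import zlib

def keep_message(msg_id, priority, buffer_usage):
    """Decide whether to keep a message under buffer pressure.

    Critical messages are never dropped; others are kept at a rate
    that shrinks as the buffer fills. Hashing the message id makes
    the decision deterministic and roughly uniform across ids.
    """
    if priority == 'critical':
        return True
    if buffer_usage > 0.95:
        rate = 0.2   # aggressive: keep ~20%
    elif buffer_usage > 0.85:
        rate = 0.5   # moderate: keep ~50%
    else:
        return True  # no pressure, keep everything
    return (zlib.crc32(msg_id.encode()) % 100) < rate * 100

kept = sum(keep_message(f"dev-{i}", 'low', 0.9) for i in range(1000))
print(kept)  # roughly half of the 1000 low-priority messages
```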
Complete Configuration Example:
dataStream:
  buffer:
    hot:
      maxSize: 300MB
      persistence: memory
      compression: gzip
      compressionLevel: 6
    warm:
      maxSize: 2GB
      persistence: disk
      path: /var/lib/iot/buffer
      compression: gzip
    archive:
      enabled: true
      path: /mnt/archive
      maxSize: unlimited
      format: parquet
      compression: snappy
    overflow:
      strategy: PRIORITY   # Not FIFO
      priorities:
        critical: 1.0      # Never drop
        high: 0.9
        medium: 0.5
        low: 0.2           # Drop first
  monitoring:
    enabled: true
    alertThresholds:
      warning: 70
      critical: 85
    estimateOverflow: true
  recovery:
    prioritize: hot-buffer
    bandwidthLimit: 10Mbps
    batchSize: 1000
Best Practices:
- Size buffers for 2x your expected maximum outage duration
- Always use disk persistence for warm buffers
- Implement compression (reduces storage by 60-70%)
- Use priority-based overflow strategy, not FIFO
- Monitor buffer usage and set alerts at 70% and 85%
- Test buffer overflow scenarios in staging
- Document recovery procedures for extended outages
- Consider message sampling for non-critical telemetry during high buffer usage
With this configuration, you can handle outages up to 24 hours without data loss, and longer outages with local archiving.