I’ve implemented a complete solution for this log loss issue across multiple GKE clusters. Here’s what you need in order to address all three critical areas:
Cloud Logging Agent Configuration:
Your fluentd DaemonSet needs buffer optimization. Update the ConfigMap:
<buffer>
  @type file
  path /var/log/fluentd-buffer/
  flush_interval 5s
  chunk_limit_size 8MB
  queue_limit_length 32
  retry_max_interval 30s
  overflow_action drop_oldest_chunk
</buffer>
The key changes: a faster flush_interval (5s instead of the 60s default), a larger chunk size for high-volume logs, and explicit overflow handling. Also give fluentd a preStop hook with a 20-second sleep so buffers are flushed before termination.
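One thing worth quantifying: with drop_oldest_chunk as the overflow action, the worst case you can lose is bounded by chunk size times queue length. A quick sanity check using the values from the config above:

```python
# Worst-case fluentd buffer footprint before drop_oldest_chunk starts
# discarding data, from the chunk_limit_size and queue_limit_length above.
chunk_limit_mb = 8
queue_limit_length = 32

max_buffered_mb = chunk_limit_mb * queue_limit_length
print(max_buffered_mb)  # 256 MB of logs at risk if a node dies unflushed
```

If 256 MB of at-risk data is too much for your workload, shrink the queue or flush more aggressively rather than raising limits further.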
Pod Lifecycle Events:
Implement proper shutdown handling in both your application and fluentd:
For application pods:
# In the container spec:
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]
# At the pod spec level, not on the container:
terminationGracePeriodSeconds: 45
For fluentd DaemonSet:
# In the container spec:
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 20"]
# At the pod spec level, not on the container:
terminationGracePeriodSeconds: 60
The sleep gives logs time to be fully written and scraped before final termination. Increase the grace period to 45-60 seconds so it covers the sleep plus buffer flushing; the kubelet sends SIGKILL once it expires.
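A way to sanity-check these numbers: the grace period timer covers the preStop hook, so the sleep plus whatever time the process needs to flush must fit inside it. A small sketch (the flush-time figures are illustrative assumptions, not measured values):

```python
def grace_period_ok(pre_stop_sleep_s, flush_time_s, grace_s, margin_s=10):
    # terminationGracePeriodSeconds starts counting when termination
    # begins, covering the preStop hook and SIGTERM handling; SIGKILL
    # arrives when it expires, so everything must fit with headroom.
    return pre_stop_sleep_s + flush_time_s + margin_s <= grace_s

print(grace_period_ok(15, 15, 45))  # application pod: True
print(grace_period_ok(20, 25, 60))  # fluentd DaemonSet: True
```

If either check fails after you measure real flush times, raise the grace period rather than trimming the sleep.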
Log Persistence Strategies:
Implement a multi-layer approach:
- Application-level persistence: Configure your app to flush logs synchronously on shutdown signals. Most logging frameworks support this:
For Java/Logback, add to logback.xml:
  <shutdownHook class="ch.qos.logback.core.hook.DelayingShutdownHook"/>
- Volume-backed logging: Mount an emptyDir volume for application logs with fluentd tailing from that volume. This decouples log writing from container lifecycle:
# Pod spec:
volumes:
  - name: app-logs
    emptyDir: {}
# Container spec:
volumeMounts:
  - name: app-logs
    mountPath: /var/app/logs
- Dual-shipping: Send critical logs to both Cloud Logging and a persistent store (Cloud Storage) directly from the application. Use Cloud Logging for real-time analysis and Storage for compliance.
- Node-level retention: container logs live under /var/log/pods on the node and outlive the container itself, but the kubelet rotates and garbage-collects them. Where your platform lets you tune kubelet configuration, raise the rotation limits so the agent has time to scrape everything (field names from KubeletConfiguration):
  containerLogMaxSize: "50Mi"
  containerLogMaxFiles: 10
Root Cause Analysis:
Your 2-5 minute gap is most likely a combination of (the windows overlap, so they don't simply add up):
- Fluentd buffer not flushed (2-3 minutes of buffered logs lost)
- Container runtime buffer not scraped (30-60 seconds)
- Application not flushing on shutdown (30-90 seconds)
Verification:
After implementing these changes, test by forcing pod restarts during active logging:
kubectl delete pod POD_NAME --grace-period=30
Monitor fluentd's buffer metrics (for example, buffer_queue_length and buffer_total_queued_size from the monitor_agent plugin, scraped into Cloud Monitoring). These should fall to near-zero before pod termination. With this comprehensive approach, you should largely eliminate the log loss.
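If you enable fluentd's monitor_agent plugin (an assumption, check your image; its /api/plugins.json endpoint reports per-plugin buffer state), a small check like this can confirm buffers drain before termination:

```python
import json

def buffers_drained(plugins_json: str, max_queue: int = 0) -> bool:
    # plugins_json: body returned by monitor_agent's /api/plugins.json.
    # Each output plugin entry reports buffer_queue_length; near-zero
    # values mean chunks are being flushed rather than piling up.
    plugins = json.loads(plugins_json)["plugins"]
    return all(
        p.get("buffer_queue_length", 0) <= max_queue
        for p in plugins
        if p.get("output_plugin")
    )

# Hypothetical sample payload, trimmed to the fields used above.
sample = '{"plugins": [{"plugin_id": "out_gcp", "output_plugin": true, "buffer_queue_length": 0}]}'
print(buffers_drained(sample))  # True
```

Run it right after the preStop sleep starts and again just before it ends; the second reading should be at or near zero.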