Cloud Logging missing container logs after pod restarts in GKE, impacting audit trails

We’re experiencing a critical logging gap in our GKE cluster (v1.21). After pod restarts - whether from deployments, crashes, or autoscaling - we consistently lose 2-5 minutes of container logs from the period immediately before the restart.

This is causing audit compliance issues since we can’t trace the exact sequence of events leading to failures. Our Cloud Logging agent (fluentd) is deployed as a DaemonSet using GKE’s default configuration. The pod lifecycle events are captured in Kubernetes events, but the actual application logs during those final minutes are missing from Cloud Logging.


# Last log entry before restart
2024-12-08T09:23:15Z INFO Request processed
# Pod restart occurred at 09:26:30 per k8s events
# Next log entry after restart
2024-12-08T09:28:45Z INFO Application started
# Missing: 09:23:15 to 09:28:45 (5.5 minutes)

Is this a known limitation of how Cloud Logging handles pod termination, or is there a configuration we’re missing to ensure log persistence during restarts?

This is typically caused by the fluentd buffer not being flushed before pod termination. When a pod receives SIGTERM, it has a grace period (default 30 seconds) to shut down. If fluentd doesn’t flush its buffer within that time, buffered logs are lost. Check your fluentd buffer configuration and termination grace period.
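One fluentd setting worth checking here: for file buffers, flush_at_shutdown defaults to false, so chunks still queued at SIGTERM are left on disk and only get shipped if the same buffer path survives the restart. A minimal fragment (buffer path is illustrative) that forces a flush during shutdown instead:

<buffer>
  @type file
  path /var/log/fluentd-buffer/
  flush_at_shutdown true  # ship queued chunks during SIGTERM handling
</buffer>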

Another aspect to consider - the container runtime log rotation. Docker and containerd both rotate logs based on size or age. If rotation happens during pod termination and the rotated file isn’t picked up by fluentd before pod deletion, those logs are gone forever. Check your runtime log rotation settings and ensure fluentd is tailing both current and rotated log files.
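To make fluentd keep following a file across rotation, in_tail's rotate_wait controls how long the rotated file stays open for reading. A sketch of a container-log source (paths and tag follow common GKE defaults, but verify against your actual ConfigMap):

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  rotate_wait 30        # keep reading a rotated file for 30s
  read_from_head true   # don't skip lines written before fluentd attached
  tag kubernetes.*
  <parse>
    @type json
  </parse>
</source>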

I’ve implemented a complete solution for this log loss issue across multiple GKE clusters. Here’s what you need to address all three critical areas:

Cloud Logging Agent Configuration: Your fluentd DaemonSet needs buffer optimization. Update the ConfigMap:

<buffer>
  @type file
  path /var/log/fluentd-buffer/
  flush_interval 5s
  chunk_limit_size 8MB
  queue_limit_length 32
  retry_max_interval 30s
  overflow_action drop_oldest_chunk
</buffer>

The key changes: a faster flush_interval (5s instead of the 60s default), a larger chunk size for high-volume logs, and explicit overflow handling. Also give the fluentd DaemonSet a preStop hook with a 20-second sleep so buffers can flush before termination.

Pod Lifecycle Events: Implement proper shutdown handling in both your application and fluentd:

For application pods:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]
terminationGracePeriodSeconds: 45

For fluentd DaemonSet:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 20"]
terminationGracePeriodSeconds: 60

The sleep gives the runtime time to finish writing logs and fluentd time to scrape them before final termination. Increase the grace period to 45-60 seconds to accommodate buffer flushing.

Log Persistence Strategies: Implement a multi-layer approach:

  1. Application-level persistence: Configure your app to flush logs synchronously on shutdown signals. Most logging frameworks support this:

# For Java logback
<shutdownHook class="ch.qos.logback.core.hook.DelayingShutdownHook"/>

  2. Volume-backed logging: Mount an emptyDir volume for application logs, with fluentd tailing from that volume. This decouples log writing from the container lifecycle:
volumes:
- name: app-logs
  emptyDir: {}
volumeMounts:
- name: app-logs
  mountPath: /var/app/logs
  3. Dual-shipping: Send critical logs to both Cloud Logging and a persistent store (Cloud Storage) directly from the application. Use Cloud Logging for real-time analysis and Storage for compliance.

  4. Node-level log retention: Note that the kubelet, not containerd, manages container log rotation, and it removes a pod's log files once the pod is deleted, so logs can't be persisted indefinitely at this layer. What you can tune is how much rotated history stays on the node for fluentd to catch up on, via KubeletConfiguration (on GKE, exposed through a node system configuration file if your cluster version supports these fields):


containerLogMaxSize: 50Mi
containerLogMaxFiles: 5
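For item 1 (application-level flushing), here is a minimal Python sketch of a synchronous flush on SIGTERM; the handler and logger names are illustrative, and the same pattern applies in any runtime:

```python
import logging
import signal
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
log = logging.getLogger("app")

def flush_logs_and_exit(signum, frame):
    # Flush every root handler synchronously so the final records
    # reach stdout (and the runtime's log file) before the process dies.
    for handler in logging.getLogger().handlers:
        handler.flush()
    logging.shutdown()
    sys.exit(0)

signal.signal(signal.SIGTERM, flush_logs_and_exit)
log.info("SIGTERM handler registered")
```

Combined with the preStop sleep above, this closes the window where the last records sit in an in-process buffer when the grace period expires.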

Root Cause Analysis: Your 2-5 minute gap is most likely a combination of:

  1. Fluentd buffer not flushed (2-3 minutes of buffered logs lost)
  2. Container runtime buffer not scraped (30-60 seconds)
  3. Application not flushing on shutdown (30-90 seconds)

Verification: After implementing these changes, test by forcing pod restarts during active logging:

kubectl delete pod POD_NAME --grace-period=30

Monitor fluentd buffer metrics in Cloud Monitoring; they should show buffer queue length and buffered bytes dropping to near-zero before pod termination. With this combination of changes, log loss during restarts should largely disappear.

We do have a 30-second termination grace period set. I checked the fluentd logs and found some “buffer overflow” warnings during high-traffic periods. Could the buffer size be the bottleneck? Also, how do we ensure the runtime buffer is flushed?

Buffer overflow in fluentd is definitely part of your problem. The default buffer chunk size may be too small for your log volume; increase chunk_limit_size and queue_limit_length in your fluentd ConfigMap. Also verify that you aren't hitting Cloud Logging API write quotas - throttled writes make fluentd buffer more aggressively, which magnifies what's lost during restarts.
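You can sanity-check whether your buffer settings fit your traffic with quick arithmetic. The numbers below use the config suggested earlier in the thread (8 MB chunks, queue of 32); the 2 MB/s burst rate is an assumption you'd replace with your own measurement:

```python
def max_buffer_seconds(chunk_limit_mb: float, queue_limit_length: int,
                       log_rate_mb_per_s: float) -> float:
    """Seconds of logging the fluentd buffer can absorb before it
    overflows, assuming output flushing has stalled completely."""
    capacity_mb = chunk_limit_mb * queue_limit_length
    return capacity_mb / log_rate_mb_per_s

# 8 MB chunks x queue of 32 = 256 MB capacity; a 2 MB/s burst fills it in:
print(max_buffer_seconds(8, 32, 2.0))  # 128.0 seconds
```

If you see overflow warnings well inside that window, raise queue_limit_length rather than only chunk size, since larger chunks also take longer to ship per flush.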

I’ve dealt with this before. The issue is that container logs written to stdout/stderr are buffered by the container runtime before fluentd scrapes them. When a pod terminates, that buffer isn’t necessarily flushed to disk. You need to ensure your application flushes logs synchronously during shutdown, and that fluentd has adequate time to collect them. Consider implementing a preStop hook that waits 10-15 seconds before actual termination.