Issue Summary: Kubernetes Garbage Collection Failing to Free Disk Space
Issue Details
- Title: Failed to garbage collect required amount of images
- Reported By: samuela
- Date: December 8, 2018
- Kubernetes Version: 1.10.7 (Client and Server)
- Environment: GKE; nodes using Container-Optimized OS; minimum disk size set to 10 GB
Description
The kubelet on GKE nodes repeatedly fails to free disk space through image garbage collection (GC). The kubelet logs show repeated attempts to free millions of bytes, each of which reclaims nothing. The recurring error message is:
- “failed to garbage collect required amount of images. Wanted to free X bytes, but freed 0 bytes.”
This has resulted in the eviction of several pods due to disk pressure, as observed in the eviction warnings and kubelet events.
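To make the numbers in that message concrete, here is a minimal Python sketch of how image GC arrives at the "wanted to free" figure. It assumes the upstream kubelet defaults (imageGCHighThresholdPercent of 85 and imageGCLowThresholdPercent of 80); GKE may configure different values. It is a simplified model, not the kubelet's own code: Image, amount_to_free, and simulate_gc are hypothetical names, and details such as the minimum image age are left out. It shows one plausible way the error can arise, not a confirmed root cause for this issue: if every image on the node is still in use by a container, nothing is eligible for removal and 0 bytes are freed.

```python
from dataclasses import dataclass


@dataclass
class Image:
    name: str          # hypothetical image name, for illustration only
    size_bytes: int
    in_use: bool       # True if a container on the node still references it
    last_used: float   # seconds since epoch; least recently used goes first


def amount_to_free(capacity: int, available: int,
                   high_pct: int = 85, low_pct: int = 80) -> int:
    """Bytes GC would try to reclaim, or 0 if usage is below the high threshold."""
    usage_pct = 100 - (available * 100 // capacity)
    if usage_pct < high_pct:
        return 0
    # Target: bring disk usage back down to the low threshold.
    return capacity * (100 - low_pct) // 100 - available


def simulate_gc(images, wanted: int) -> int:
    """Remove unused images, least recently used first, until `wanted` is met."""
    freed = 0
    for img in sorted(images, key=lambda i: i.last_used):
        if freed >= wanted:
            break
        if img.in_use:
            continue  # images referenced by containers are never removed
        freed += img.size_bytes
    return freed


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Assumed scenario: a 10 GiB disk with 1 GiB available, all images in use.
    wanted = amount_to_free(capacity=10 * GIB, available=1 * GIB)
    images = [Image("app-a", 400 * 1024 ** 2, True, 100.0),
              Image("app-b", 250 * 1024 ** 2, True, 200.0)]
    freed = simulate_gc(images, wanted)
    print(f"Wanted to free {wanted} bytes, but freed {freed} bytes")
```

With these assumed inputs the sketch reports wanting 1073741824 bytes (1 GiB) and freeing 0, because whatever is filling the disk is not an unused image.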
Expected Behavior
- Image GC should successfully reclaim disk space, or the system should prevent scheduling new pods on nodes with insufficient disk space.
Reproduction Steps
- Deploy and delete multiple pods on a node until the node reports disk pressure.
- Monitor the kubelet’s garbage collection attempts and related events.
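The monitoring step can be partly automated. The following is a rough Python sketch that scans log lines for the error quoted in the Description and extracts the two byte counts. The function name parse_gc_failures is hypothetical, the sample line is made up for illustration (only the message wording comes from the report), and how the log lines are obtained is left to the reader.

```python
import re

# Matches the kubelet message quoted in the Description, capturing both counts.
GC_FAILURE = re.compile(
    r"failed to garbage collect required amount of images\. "
    r"Wanted to free (?P<wanted>\d+) bytes, but freed (?P<freed>\d+) bytes"
)


def parse_gc_failures(lines):
    """Yield (wanted_bytes, freed_bytes) for each GC failure found in `lines`."""
    for line in lines:
        match = GC_FAILURE.search(line)
        if match:
            yield int(match.group("wanted")), int(match.group("freed"))


if __name__ == "__main__":
    # Illustrative input only; the byte count is an assumed value.
    sample = [
        "failed to garbage collect required amount of images. "
        "Wanted to free 1073741824 bytes, but freed 0 bytes",
        "some unrelated log line",
    ]
    for wanted, freed in parse_gc_failures(sample):
        print(f"wanted={wanted} freed={freed}")
```

On nodes where the kubelet runs as a systemd unit, piping its journal output into a script like this is one possible way to track how often GC fails and how much it wanted to free each time.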
Additional Context
- The issue appears to be reproducible in other environments: users on GKE, AWS, and AKS report similar garbage collection failures and disk pressure incidents.
- Cordoning and draining nodes temporarily relieves the pressure, but the root cause remains unresolved.
- The problem may relate to misconfigured GC thresholds, bugs in the kubelet’s image management logic, or the limited capacity of undersized node disks.
Recent Activity
- Affected users are encouraged to check their kubelet logs for the GC error above and for related eviction events.
- Contributors have been assigned and discussions continue regarding root causes and possible solutions.
Labels
kind/bug
sig/node
good first issue
help wanted
triage/accepted
This issue remains active, with ongoing efforts to investigate and resolve the image GC failures first reported on GKE. Contributors are welcome to help find a solution, given the impact on Kubernetes reliability and performance.