Summary of Open Issue: Node Lifecycle Controller Pod Readiness
Issue Identifier: Kubernetes/kubernetes #109998 (generalized duplicate)
Description: The Node Lifecycle Controller in Kubernetes fails to mark pods as “not ready” when a node’s Ready status changes from “false” to “unknown,” i.e., when the node was already not ready (for example, due to a container runtime failure such as containerd stopping) before it became unreachable. As a result, pods on the node can remain “ready” even after the node is unreachable, and network traffic may continue to be routed to them.
Expected Behavior:
The MarkPodsNotReady function should be triggered whenever the node transitions to an “unknown” state, regardless of its previous Ready status. This would ensure that all pods on the node are marked as not ready, avoiding stale readiness reporting and unintended traffic routing.
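A minimal sketch of that expectation follows, with illustrative names (expectedGuard, observed, current are hypothetical and do not come from the controller source): the guard fires on any transition into a non-“true” Ready status, which covers both “true” → “unknown” and “false” → “unknown.”

```go
package main

import "fmt"

// expectedGuard sketches the behavior the issue asks for: mark pods
// not ready on any transition into a non-"True" Ready status,
// including "False" -> "Unknown", not only "True" -> "Unknown".
func expectedGuard(observed, current string) bool {
	return current != "True" && observed != current
}

func main() {
	fmt.Println(expectedGuard("False", "Unknown"))   // true: the case the bug misses
	fmt.Println(expectedGuard("True", "Unknown"))    // true: already handled today
	fmt.Println(expectedGuard("Unknown", "Unknown")) // false: no transition, nothing to do
}
```

In the real controller, a predicate along these lines would gate the call to MarkPodsNotReady.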
Reproduction Steps:
- Stop containerd on the node and observe that the node’s status changes to “false.”
- Subsequently, stop kubelet or shut down the node, allowing it to transition to “unknown.”
- Verify that the pods on the affected node incorrectly remain in the “ready” state and are not evicted.
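The last step can be checked programmatically. Below is a client-go sketch that lists the pods scheduled to the affected node and prints their Ready conditions; the node name worker-1 and the default kubeconfig path are assumptions for illustration, not values from the issue.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	const nodeName = "worker-1" // hypothetical; substitute the affected node

	// Load credentials from the default kubeconfig location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List every pod scheduled to the affected node, across all namespaces.
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}

	// After the node goes "unknown", every pod's Ready condition should be
	// "False"; the reported bug leaves them "True".
	for _, pod := range pods.Items {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady {
				fmt.Printf("%s/%s Ready=%s\n", pod.Namespace, pod.Name, cond.Status)
			}
		}
	}
}
```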
Relevant Code:
The issue pertains to a specific section of the node_lifecycle_controller.go file, lines 883 to 888, where the current logic calls MarkPodsNotReady only when a node transitions from “true” to “unknown.”
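Paraphrased (simplified, illustrative names; not a verbatim quote of those lines), the current guard has roughly this shape: it requires the previously observed status to be “true,” so a “false” → “unknown” transition never reaches MarkPodsNotReady.

```go
package main

import "fmt"

// currentGuard paraphrases the existing check (simplified, illustrative
// names; not a verbatim quote of the controller source): MarkPodsNotReady
// is reached only when the previously observed Ready status was "True".
func currentGuard(observed, current string) bool {
	return current != "True" && observed == "True"
}

func main() {
	fmt.Println(currentGuard("True", "Unknown"))  // true: handled today
	fmt.Println(currentGuard("False", "Unknown")) // false: the missed transition
}
```

Comparing currentGuard with the expectedGuard sketch above isolates the one-clause difference under discussion: relaxing the observed == "True" requirement so that any transition into a non-ready state is covered.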
Discussion: The issue has drawn interest from multiple contributors, and several PRs have attempted to address it. However, some PRs have been flagged for broken unit tests or deemed too complex for new contributors. The core discussion centers on the appropriate logic to ensure that pod readiness accurately reflects node status changes.
Labeling: The issue carries the “good first issue” label, though contributors have expressed reservations about its complexity for newcomers. It is also labeled “help wanted,” “kind/bug,” “sig/node,” and “triage/accepted,” indicating active discussion and ongoing efforts to resolve the problem.
Current Status: As of the latest updates, the issue remains open with ongoing contributions and discussions aimed at implementing a suitable fix. The community is encouraged to coordinate efforts to avoid duplication and work towards a resolution.