Add non-graceful shutdown integration test

Pull Request Summary: “Add Non-Graceful Shutdown Integration Test”

Contributor: mowangdk Date: 2024-10-29

Type:

  • Feature: This PR adds an integration test for Kubernetes volume management, covering how the system behaves when a node goes out of service after a non-graceful shutdown.

Purpose:

The integration test validates the “node out of service detach” feature: when a node has shut down non-gracefully and is tainted as out-of-service (OOS), all pods on it should be terminated immediately and their volumes detached without waiting for the default timeout period.
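For context, the out-of-service marker is a node taint. The sketch below shows the shape of that taint and a check for it. The `Taint` struct is a local stand-in that mirrors the Key/Value/Effect fields of the Kubernetes core/v1 type so the example builds with the standard library only; the helper names `outOfServiceTaint` and `isOutOfService` are hypothetical and are not taken from the PR.

```go
package main

// Taint is a local stand-in for the Kubernetes core/v1 Taint type
// (assumption: only the three fields used here are mirrored).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

const (
	// Taint key used by Kubernetes for non-graceful node shutdown handling.
	outOfServiceKey = "node.kubernetes.io/out-of-service"
	// The out-of-service taint is applied with the NoExecute effect.
	effectNoExecute = "NoExecute"
)

// outOfServiceTaint returns the taint that marks a node as out of service.
// The value is free-form; "nodeshutdown" is only an example value.
func outOfServiceTaint() Taint {
	return Taint{Key: outOfServiceKey, Value: "nodeshutdown", Effect: effectNoExecute}
}

// isOutOfService reports whether any taint in the list carries the
// out-of-service key.
func isOutOfService(taints []Taint) bool {
	for _, t := range taints {
		if t.Key == outOfServiceKey {
			return true
		}
	}
	return false
}
```

In a real test the taint would be appended to the node's `Spec.Taints` and the node updated through the API client; that step is omitted here.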

Background:

  • Related Feature PR: #108486
  • Previous PR: #119478
  • This PR addresses the behavior of the volume attach/detach controller in scenarios where nodes become unavailable unexpectedly.

Changes Implemented:

  • New Test Functionality:

      • Implemented an integration test: TestPodTerminationWithNodeOOSDetach.

      • The test creates a node, a pod, and a persistent volume claim (PVC) in a simulated environment, then sets up the conditions needed to observe the outcome of a simulated node failure.

      • When the node is tainted as out-of-service and a non-graceful termination is invoked, the test validates that the pod is forcefully deleted and the volume is detached.

  • Metrics Verification:

      • The test checks the metrics for pod deletion and volume detachment to confirm that the Kubernetes controllers acted on the node's out-of-service state.
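The metrics check described above can be sketched as reading a counter out of Prometheus text-format output. This is a simplified, standard-library-only illustration and not the code from the PR: `counterValue` is a hypothetical helper, it assumes samples carry no trailing timestamp and label values contain no spaces, and any metric names passed to it are the caller's assumption, since this summary does not name the metrics.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// counterValue scans Prometheus text-format output and returns the value of
// the first sample whose metric name equals name and whose label set contains
// every key="value" pair in labels. The bool is false if no sample matches.
func counterValue(exposition, name string, labels map[string]string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		if !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue // a longer metric name that merely shares the prefix
		}
		matched := true
		for k, v := range labels {
			pair := k + `="` + v + `"`
			// Require the pair to start right after '{' or ',' so that a
			// key is never matched as the suffix of another key.
			if !strings.Contains(rest, "{"+pair) && !strings.Contains(rest, ","+pair) {
				matched = false
				break
			}
		}
		if !matched {
			continue
		}
		fields := strings.Fields(rest)
		val, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		return val, true
	}
	return 0, false
}
```

A test would typically read the counter before and after the simulated failure and assert on the difference, which avoids depending on the counter's absolute starting value.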

Code Changes:

  • Modified the integration test suite, with significant additions to:

      • test/integration/volume/attach_detach_test.go

  • Introduced new utility functions that wait for and verify pod and node states.

  • Adjusted the test setup to also create the pod garbage collector (podgc), so that dangling resources are cleaned up.
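The waiting utilities mentioned above generally follow a poll-until-condition pattern. The following is a minimal standard-library sketch of that pattern, not the helpers added by the PR; `pollUntil` and `errWaitTimeout` are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"time"
)

// errWaitTimeout is returned when the condition never became true in time.
var errWaitTimeout = errors.New("timed out waiting for the condition")

// pollUntil calls cond immediately and then once per interval until cond
// returns true (nil is returned), cond returns an error (that error is
// returned), or the next attempt would fall after the timeout
// (errWaitTimeout is returned).
func pollUntil(interval, timeout time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errWaitTimeout
		}
		time.Sleep(interval)
	}
}
```

A condition for this scenario might fetch the pod and return true once it is gone, or fetch the node and return true once its attached-volume list is empty. In-tree Kubernetes tests would more likely rely on the `k8s.io/apimachinery/pkg/util/wait` package than on a hand-rolled loop like this one.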

Testing Outcomes:

  • The integration test confirms that pods are terminated and volumes detached immediately, and that the related metrics report the expected values after these events.

Additional Notes:

  • User-Facing Changes: There are no direct user-facing changes introduced by this PR.
  • This PR is currently marked as a work in progress (do-not-merge/work-in-progress) and requires further review and approval before it can be integrated into the main codebase.

Labels:

  • area/test
  • sig/storage
  • sig/testing
  • size/L
  • Priority: Needs triage and priority assignment.

This PR makes Kubernetes storage management more robust by ensuring that this failure scenario is adequately tested, reinforcing the system’s resilience in the face of unexpected node failures.