Summary of Closed Pull Request: “Add ‘Sticky’ Store Nodes”
Author: GiedriusS
Date Opened: January 28, 2020
Status: Closed (follow-up proposal created)
Overview
This pull request introduces the concept of “sticky” store nodes for the Thanos Query component, enabling nodes to remain part of the active set despite failing health checks. Sticky nodes are designed for scenarios where certain store nodes are expected to be consistently available (e.g., caching layers like Cortex’s query-frontend).
Key Features
- Sticky Nodes: Store nodes whose addresses carry the suffix +sticky are treated as always available, allowing Thanos Query to deliver consistent partial responses even if a node is down.
- Health Check Modifications: The health-checking logic is modified to retain the last known good state of sticky nodes, so they are not removed from the active set during failures.
- Indication of State: Sticky nodes display a yellow UP status on the Stores page when their data can’t be verified but they are still considered available.
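The retention behavior described in the list above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the code from the pull request: the StoreStatus type, its fields, and the UpdateActive function are hypothetical names, and the health check results are passed in as a plain map rather than gathered from real store nodes.

```go
package main

import "fmt"

// StoreStatus is a hypothetical record of one store node in the active set.
type StoreStatus struct {
	Addr     string
	Sticky   bool
	Verified bool // false when the entry is kept only because it is sticky
}

// UpdateActive builds the next active set from the previous one.
// healthy maps an address to whether its latest health check passed.
// Non-sticky nodes that fail the check are dropped; sticky nodes are
// retained with their last known state and marked as unverified.
func UpdateActive(prev []StoreStatus, healthy map[string]bool) []StoreStatus {
	next := make([]StoreStatus, 0, len(prev))
	for _, s := range prev {
		switch {
		case healthy[s.Addr]:
			s.Verified = true
			next = append(next, s)
		case s.Sticky:
			s.Verified = false
			next = append(next, s)
		}
	}
	return next
}

func main() {
	prev := []StoreStatus{
		{Addr: "store-a:10901", Verified: true},
		{Addr: "cache-b:10901", Sticky: true, Verified: true},
	}
	// Simulate a round in which every health check fails.
	for _, s := range UpdateActive(prev, map[string]bool{}) {
		fmt.Printf("%s sticky=%t verified=%t\n", s.Addr, s.Sticky, s.Verified)
	}
}
```

In this sketch, an entry with Verified set to false corresponds to the yellow UP status: the node stays in the active set, but its state has not been confirmed by the latest check.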
Implementation Details
- A node is marked sticky by appending +sticky to the store address.
- A dedicated MetaTarget structure is introduced to handle the addresses along with their sticky state.
- The command-line flag documentation is updated to explain the sticky node behavior.
- Address resolution is changed to store both the address and its sticky state.
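The suffix handling described above might look roughly like the following Go sketch. The MetaTarget name comes from the summary, but its fields (Addr, Sticky) and the ParseMetaTarget helper are assumptions made for illustration; the actual structure in the pull request may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// stickySuffix is the marker appended to a store address.
const stickySuffix = "+sticky"

// MetaTarget pairs a store address with its sticky flag.
// The field names are assumed for this example.
type MetaTarget struct {
	Addr   string
	Sticky bool
}

// ParseMetaTarget (hypothetical helper) strips a trailing "+sticky"
// marker, if present, and records whether it was found.
func ParseMetaTarget(raw string) MetaTarget {
	if strings.HasSuffix(raw, stickySuffix) {
		return MetaTarget{Addr: strings.TrimSuffix(raw, stickySuffix), Sticky: true}
	}
	return MetaTarget{Addr: raw}
}

func main() {
	for _, raw := range []string{"store-a:10901", "cache-b:10901+sticky"} {
		t := ParseMetaTarget(raw)
		fmt.Printf("%s sticky=%t\n", t.Addr, t.Sticky)
	}
}
```

Keeping the flag next to the address means later stages, such as address resolution and health checking, can consult it without re-parsing the original string.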
Testing and Feedback
- Local testing was performed to validate functionality with sticky nodes, highlighting behavior when interacting with non-responsive nodes.
- Initial feedback from other contributors, particularly from bwplotka, questioned the design choice of keeping unhealthy nodes within the healthy set. Suggestions leaned towards making all store nodes sticky by default.
Follow-Up Actions
Due to diverging opinions on the design and expected behavior, the author decided to close this PR and propose a new one to refine the implementation based on community feedback.
Conclusion
The proposed sticky store node feature aims to enhance the reliability of partial responses in Thanos Query by maintaining the state of crucial store nodes during failures. However, further discussions and a proposal for an adjusted implementation are anticipated to solidify the approach and incorporate community input.
Relevant Issue: #1651
Follow-Up Proposal: #2086