Need Cardinality Management Support on Thanos

Summary of Open Issue: Cardinality Management Support in Thanos

Issue ID: #XXXX (Please insert actual issue number)

Created by: sharathfeb12 on 2022-12-30

Current State: Open, with active discussions and interest from multiple contributors.

Problem Statement

Thanos is receiving high-cardinality metrics from several product teams managed by the Central Observability Team. These metrics cause performance bottlenecks in the Store Gateway and Querier components, which prevents teams from running queries efficiently. The standard advice in community discussions is to drop metrics or labels at the Prometheus level, which the issue considers a reactive measure.

Proposed Solution

The issue suggests a proactive approach: a Cardinality Management feature that lets teams monitor and manage the cardinality of incoming metrics and labels before it becomes a problem. This includes:

  • Exposing cardinality metrics and thresholds through an interface.
  • Implementing alerting mechanisms when cardinality levels exceed specific values.
  • Providing the capability to drop metrics and labels directly on the Thanos side, similar to Grafana Mimir, which supports dynamic runtime configuration without restarts.
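
To make the first two points concrete, here is a minimal sketch of what "cardinality per metric and per label" and "alert above a threshold" could mean. It is not Thanos code: the function names `cardinality_report` and `over_threshold`, the sample series, and the limits are all hypothetical and chosen only for illustration. The only convention borrowed from the Prometheus data model is that the metric name lives under the `__name__` label.

```python
from collections import defaultdict


def cardinality_report(series):
    """Count series per metric name and distinct values per label name.

    `series` is an iterable of label dicts, one dict per unique series.
    """
    series_per_metric = defaultdict(int)
    values_per_label = defaultdict(set)
    for labels in series:
        series_per_metric[labels["__name__"]] += 1
        for name, value in labels.items():
            if name != "__name__":
                values_per_label[name].add(value)
    return (
        dict(series_per_metric),
        {name: len(values) for name, values in values_per_label.items()},
    )


def over_threshold(counts, limit):
    """Return (name, count) pairs whose count exceeds `limit`, highest first."""
    offenders = [(name, n) for name, n in counts.items() if n > limit]
    return sorted(offenders, key=lambda item: item[1], reverse=True)


# Hypothetical sample data: `user_id` is the label that inflates cardinality.
SAMPLE_SERIES = [
    {"__name__": "http_requests_total", "team": "checkout", "user_id": "u1"},
    {"__name__": "http_requests_total", "team": "checkout", "user_id": "u2"},
    {"__name__": "http_requests_total", "team": "search", "user_id": "u3"},
    {"__name__": "up", "team": "search"},
]

per_metric, per_label = cardinality_report(SAMPLE_SERIES)
noisy_metrics = over_threshold(per_metric, limit=2)  # hypothetical limit
noisy_labels = over_threshold(per_label, limit=2)    # hypothetical limit
```

In a real deployment the counts would come from the TSDB index rather than an in-memory list, and the limits would presumably be configurable per tenant or team; both of those are assumptions, not something the issue specifies.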

Suggested Features

  • Create API endpoints that allow users to check the cardinality of labels and metrics (inspired by Mimir).
  • Implement a user interface similar to the cardinality explorer found in VictoriaMetrics.
  • Use existing Thanos features, such as series matchers and the flags for downsampling and limiting data, to reduce the overhead caused by high cardinality.
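
As a rough illustration of the kind of data such an endpoint could return, the sketch below ranks entries from a payload shaped like the Prometheus TSDB status response (`/api/v1/status/tsdb`), which reports fields such as `seriesCountByMetricName` and `labelValueCountByLabelName`. Whether and how Thanos would expose an equivalent is exactly what the issue leaves open, so treat the payload shape as an assumption here. The numbers are made up and `top_entries` is a hypothetical helper.

```python
import json

# Illustrative payload only; the values are invented. Field names follow the
# Prometheus TSDB status response, not a confirmed Thanos API.
EXAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "seriesCountByMetricName": [
      {"name": "up", "value": 350},
      {"name": "http_requests_total", "value": 120000}
    ],
    "labelValueCountByLabelName": [
      {"name": "user_id", "value": 95000},
      {"name": "team", "value": 12}
    ]
  }
}
"""


def top_entries(payload, field, limit=10):
    """Return up to `limit` (name, value) pairs from `field`, largest first."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("unexpected status: %r" % body.get("status"))
    entries = body.get("data", {}).get(field) or []
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:limit]]


top_metrics = top_entries(EXAMPLE_RESPONSE, "seriesCountByMetricName", limit=1)
top_labels = top_entries(EXAMPLE_RESPONSE, "labelValueCountByLabelName")
```

A cardinality explorer UI could be built on the same ranked lists, with the metric or label name as the drill-down key.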

Community Feedback

  • Contributors such as maheshbaliga and jatinagwal have expressed interest in tackling the issue.
  • Several members, including matej-g and yeya24, have highlighted the need for clear requirements, a definition of the APIs required, and a structured way to evaluate cardinality.
  • The availability of tools like Grafana Cloud’s Pro and Advanced tiers, which provide insights into cardinality, has been noted as a benchmark for the desired solution.

Action Items

  • Further discussion is needed to establish detailed requirements and design specifications.
  • The development approach should be shaped by a review of existing endpoints and by feedback from community members.
  • Contributors are encouraged to produce actionable proposals and iterate on the implementation of the cardinality management feature.

Labels

  • Difficulty: Medium
  • Good First Issue
  • GSoC/Community Bridge/LFX
  • Help Wanted

Other contributors are welcome to share insights or proposed implementations for this feature, which aims to improve the efficiency and reliability of metrics handling in Thanos.