High Availability - thanos-io/thanos

Thanos is an open-source project that extends Prometheus to provide highly available, scalable, long-term metric storage. It is designed to be easy to deploy and integrate into various environments, including cloud-native and traditional setups. High availability comes from a set of components that can be composed into a Prometheus setup with long-term storage capabilities.

The main components of Thanos are:

Metric sources: These are instances of Prometheus that collect and expose metrics. Thanos can monitor multiple Kubernetes clusters and aggregate metrics from various sources.
Stores: Thanos store instances serve long-term metric data held in object storage, such as S3, GCS, or Azure Blob Storage. Explicit high availability for store instances was the subject of a design proposal that was ultimately rejected. That proposal suggested explicitly supporting and documenting high availability for store instances and reducing the query latency incurred by failing store instances.
Queriers: Thanos queriers provide a global query view across all metric sources and stores. They can be configured for high availability and can be scaled horizontally to handle increased query loads.
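
To make the idea of a global query view over replicated metric sources more concrete, here is a minimal sketch of how series from two replicas might be merged by ignoring a replica label. The function name `deduplicate`, the `replica` label name, and the sample data are assumptions made for illustration. The Thanos Querier exposes a `--query.replica-label` flag for this purpose, but its real deduplication logic is more involved than this simplified version.

```python
from collections import defaultdict


def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only in the replica label.

    Each series is a (labels, samples) pair: labels is a dict and
    samples is a list of (timestamp, value) tuples.
    Hypothetical helper, not part of Thanos.
    """
    merged = defaultdict(dict)
    for labels, samples in series_list:
        # Identity of the series once the replica label is ignored.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        for ts, value in samples:
            # Keep the first sample seen for each timestamp.
            merged[key].setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Two replicas scraping the same target, each with a gap.
replica_a = ({"job": "api", "replica": "a"}, [(10, 1.0), (20, 2.0)])
replica_b = ({"job": "api", "replica": "b"}, [(20, 2.0), (30, 3.0)])

result = deduplicate([replica_a, replica_b])
print(result)
```

Because the two replicas cover each other's gaps, the merged series contains all three timestamps even though neither replica has them all.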

Thanos offers several ways to ensure high availability:

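Replication: Metric data can be kept in more than one place, for example by running replicated Prometheus instances or by having Thanos Receive replicate incoming writes to several receivers, so data remains available even if some instances fail.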
Horizontal scaling: Thanos queriers can be scaled horizontally to handle increased query loads and provide high availability.
Load balancing: The Thanos Receive component uses a hash ring to distribute incoming metrics across multiple receivers, which provides both high availability and scalability.
Retries and failovers: The querier is designed to tolerate temporarily unavailable metric sources or stores, for example by returning the data it could fetch along with a warning (a partial response) instead of failing the whole query.
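
The hash ring distribution and replication described in the list above can be sketched roughly as follows. This is a simplified, modulo-based illustration under stated assumptions, not the code Thanos Receive uses: the function `pick_receivers`, the endpoint names, and the series key format are all hypothetical.

```python
import hashlib


def pick_receivers(series_key, endpoints, replication_factor=1):
    """Choose which receivers should store a series.

    Hypothetical helper: hashes the key, picks a starting endpoint,
    then takes the following endpoints in ring order as replicas.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    if not 1 <= replication_factor <= len(endpoints):
        raise ValueError("replication factor must be between 1 and the number of endpoints")
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(endpoints)
    return [endpoints[(start + i) % len(endpoints)] for i in range(replication_factor)]


# Assumed endpoint names, for illustration only.
ENDPOINTS = ["receive-0:10901", "receive-1:10901", "receive-2:10901"]

targets = pick_receivers('tenant-a/{__name__="up"}', ENDPOINTS, replication_factor=2)
print(targets)
```

The same key always maps to the same receivers, so writes for a series land consistently, and a replication factor above 1 means a single failed receiver does not make that series unavailable.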

Examples:

Thanos Receive: This component receives and processes incoming metrics. It uses a hash ring to distribute metrics across multiple receivers, ensuring high availability and scalability.
Thanos Querier: This component provides a global query view across all metric sources and stores. It can be scaled horizontally to handle increased query loads and can be configured for high availability.
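
As a rough illustration of how a querier can stay useful when one backend is down, the sketch below fans a query out to several stores and collects warnings instead of failing outright. The names `fan_out`, `healthy_store`, and `failing_store` are hypothetical, and the stores are plain Python callables standing in for real network calls. It shows the general partial-response idea only and is not how the Thanos Querier is implemented.

```python
def fan_out(stores, query, partial_response=True):
    """Query every store and merge the results.

    With partial_response enabled, a failing store produces a warning
    rather than an error. Hypothetical helper, not part of Thanos.
    """
    results, warnings = [], []
    for name, fetch in stores.items():
        try:
            results.extend(fetch(query))
        except ConnectionError as err:
            if not partial_response:
                raise
            warnings.append(f"{name}: {err}")
    return results, warnings


def healthy_store(query):
    # Stand-in for a store that answers normally.
    return [(query, 1.0)]


def failing_store(query):
    # Stand-in for a store that is temporarily unreachable.
    raise ConnectionError("store unavailable")


stores = {"store-a": healthy_store, "store-b": failing_store}
results, warnings = fan_out(stores, "up")
print(results, warnings)
```

Passing `partial_response=False` makes the same call raise instead, which is the stricter choice when incomplete data is worse than no data.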

Sources:

Thanos documentation: https://thanos.io/v0.36/thanos/design.md
Thanos rejected proposals: https://thanos.io/v0.36/thanos/proposals-rejected/201807-store-instance-high-availability.md
Thanos code documentation: https://github.com/thanos-io/thanos/tree/main/docs
Thanos code snippets: https://github.com/thanos-io/thanos/tree/main/pkg
Thanos blog: https://thanos.io/blog/
Thanos blog - Thanos at Medallia: https://thanos.io/blog/2022-09-08-thanos-at-medallia/
Thanos blog - Thanos at Aiven: https://thanos.io/blog/2023-06-08-thanos-at-aiven/
Thanos videos:
Thanos: Easier Than Ever to Scale Prometheus and Make It Highly Available: https://www.youtube.com/watch?v=mtwwUqeIHAw
Thanos: Highly Available, Pluggable, Long Term Metric Storage for Everyone!: https://www.youtube.com/watch?v=VG_TtLg84ME
Intro to Thanos: Scale Your Prometheus Monitoring With Ease: https://www.youtube.com/watch?v=m0JgWlTc60Q
Turn It Up to a Million: Ingesting Millions of Metrics with Thanos Receive: https://www.youtube.com/watch?v=5MJqdJq41Ms
High Available + Scalable Prometheus with Thanos in Alibaba: https://www.youtube.com/watch?v=ZS6zMksfipc