High Availability - thanos-io/thanos

Thanos is an open-source project that extends Prometheus to provide highly available, scalable, long-term metric storage. It is designed to be deployed and integrated into various environments, including cloud-native and traditional setups. Rather than being a single service, Thanos is a set of components that can be composed into a highly available Prometheus setup with long-term storage capabilities.

The main components of Thanos are:

  1. Metric sources: These are instances of Prometheus that collect and expose metrics. Thanos can aggregate metrics from many such sources, for example Prometheus instances running in multiple Kubernetes clusters.

  2. Stores: Thanos store instances are responsible for serving long-term metric data. They can use different storage providers, including object storage solutions such as S3, GCS, or Azure Blob Storage. Note that a design proposal to explicitly support and document high availability for store instances, and to reduce the query latency incurred by failing store instances, was put forward but rejected, so that design is not a mechanism Thanos provides.

  3. Queriers: Thanos queriers provide a global query view across all metric sources and stores. They can be configured for high availability and can be scaled horizontally to handle increased query loads.
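
To illustrate how a global view can stay complete when one source has gaps, here is a toy sketch of merging series from redundant sources. It is not Thanos's actual deduplication algorithm. It assumes each source tags its series with a `replica` label (a name chosen here for illustration), and the function `merge_replicas` and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical label name marking which replica produced a series.
REPLICA_LABEL = "replica"


def merge_replicas(series_list, replica_label=REPLICA_LABEL):
    """Merge series that differ only in the replica label.

    Each series is a (labels: dict, samples: list of (timestamp, value)) pair.
    """
    merged = defaultdict(dict)
    for labels, samples in series_list:
        # Identity of a series once the replica label is ignored.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        for ts, value in samples:
            # Keep the first value seen for a timestamp; later replicas fill gaps.
            merged[key].setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


replica_a = ({"job": "api", "replica": "a"}, [(10, 1.0), (20, 2.0)])
replica_b = ({"job": "api", "replica": "b"}, [(20, 2.0), (30, 3.0)])  # lacks t=10, has t=30
result = merge_replicas([replica_a, replica_b])
```

In this sketch the two replicas collapse into a single series covering timestamps 10, 20, and 30, which is the effect a client would want from a highly available query layer.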

Thanos offers several ways to ensure high availability:

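  • Replication: Metric data can be replicated across multiple instances so that it remains available even if some of them fail. For example, Thanos Receive accepts a replication factor (`--receive.replication-factor`) so that each incoming series is written to more than one receiver.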

  • Horizontal scaling: Thanos queriers can be scaled horizontally to handle increased query loads and provide high availability.

  • Load balancing: The Thanos Receive component uses a hash ring to distribute incoming metrics across multiple receivers, which supports both high availability and scalability.

  • Retries and failovers: The querier includes retry and failover mechanisms so that queries can still return results when some metric sources or stores are temporarily unavailable, for example by returning a partial response together with a warning.
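
The combination of hash-based distribution and replication listed above can be sketched as follows. This is a simplified, modulo-based placement scheme written for illustration only; it is not the Thanos Receive implementation, and the function `pick_receivers` and the receiver names are hypothetical.

```python
import hashlib


def pick_receivers(series_labels, receivers, replication_factor=1):
    """Return the receivers that should store a series.

    The series is hashed to a starting position; replicas go to the nodes that follow.
    """
    if not 1 <= replication_factor <= len(receivers):
        raise ValueError("replication_factor must be between 1 and len(receivers)")
    # Stable hash of the sorted label set (Python's built-in hash() is salted per process).
    encoded = ",".join(f"{k}={v}" for k, v in sorted(series_labels.items()))
    digest = hashlib.sha256(encoded.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(receivers)
    return [receivers[(start + i) % len(receivers)] for i in range(replication_factor)]


receivers = ["receive-0", "receive-1", "receive-2"]  # hypothetical endpoint names
targets = pick_receivers({"__name__": "up", "job": "api"}, receivers, replication_factor=2)
```

Because the hash depends only on the label set, every sender picks the same receivers for a given series, and a replication factor of 2 means the series survives the loss of either target. A plain modulo scheme like this one reshuffles most series when the receiver list changes, which is a trade-off a production hash ring would need to address.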

Examples:

  • Thanos Receive: This component accepts metrics pushed to it via Prometheus remote write. It uses a hash ring to distribute those metrics across multiple receivers, which supports high availability and scalability.

  • Thanos Querier: This component provides the global query view across all metric sources and stores. Because queriers are stateless, several replicas can run side by side to handle increased query loads and to remain available if one replica fails.
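
As a rough illustration of querying with retries while tolerating unavailable backends, the sketch below fans a query out to several stores, retries failures, and returns whatever succeeded along with warnings. It is an assumption-laden toy, not the querier's real logic: `query_all`, the store callables, and the retry count are all hypothetical.

```python
def query_all(stores, query, max_attempts=2):
    """Query every store, retrying failures, and return (results, warnings)."""
    results, warnings = [], []
    for name, fetch in stores.items():
        for attempt in range(1, max_attempts + 1):
            try:
                results.extend(fetch(query))
                break
            except ConnectionError as err:
                if attempt == max_attempts:
                    # Give up on this store but keep results from the others.
                    warnings.append(f"{name}: {err}")
    return results, warnings


def healthy(query):
    return [("up", 1.0)]


calls = {"count": 0}


def flaky(query):
    # Fails on the first call only, so a single retry succeeds.
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("timeout")
    return [("up", 2.0)]


def broken(query):
    raise ConnectionError("unavailable")


results, warnings = query_all(
    {"store-a": healthy, "store-b": flaky, "store-c": broken}, "up"
)
```

Here the flaky store recovers on retry, while the broken store contributes only a warning, so the caller still receives data from the two stores that answered.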
