Fault-Tolerant Query Routing - thanos-io/thanos

Fault-Tolerant Query Routing in Thanos

Thanos is a highly available Prometheus setup with long-term storage capabilities. Its read path is designed so that queries can still succeed when some backend stores fail. Several components and mechanisms contribute to this: Queriers, Store Nodes, the Query Frontend, and the distributed execution mode.

Queriers and Store Nodes

Queriers are stateless, horizontally scalable instances that implement PromQL on top of the Store APIs exposed in the cluster. They participate in the cluster to resiliently discover all data sources and store nodes. Store Nodes expose the Store API; in a typical deployment these are Sidecars running next to Prometheus instances and Store Gateways serving data from object storage.

Based on the metadata of store and source nodes, Queriers attempt to minimize the request fanout needed to fetch data for a particular query. Because fewer stores are contacted per query, a failure in one backend store affects fewer queries.
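The fanout reduction described above can be sketched as a filter over store metadata. This is a minimal illustration, not Thanos code: the StoreInfo type, its fields, and the select_stores function are hypothetical names, and the sketch assumes each store advertises a time range and a set of external labels.

```python
from dataclasses import dataclass, field


@dataclass
class StoreInfo:
    """Hypothetical summary of what a store advertises about itself."""
    name: str
    min_time: int  # oldest sample timestamp the store can serve
    max_time: int  # newest sample timestamp the store can serve
    labels: dict = field(default_factory=dict)  # external labels, e.g. {"cluster": "eu1"}


def select_stores(stores, start, end, matchers):
    """Return only the stores that could hold data for the query.

    `matchers` is a dict of label equality constraints taken from the query.
    """
    selected = []
    for store in stores:
        # Skip stores whose time range does not overlap [start, end].
        if store.max_time < start or store.min_time > end:
            continue
        # Skip stores whose external labels contradict an equality matcher.
        # A label the store does not advertise cannot rule the store out.
        if any(k in store.labels and store.labels[k] != v for k, v in matchers.items()):
            continue
        selected.append(store)
    return selected


stores = [
    StoreInfo("sidecar-eu", 9_000, 10_000, {"cluster": "eu1"}),
    StoreInfo("sidecar-us", 9_000, 10_000, {"cluster": "us1"}),
    StoreInfo("bucket-eu", 0, 9_000, {"cluster": "eu1"}),
]
picked = select_stores(stores, start=9_500, end=10_000, matchers={"cluster": "eu1"})
print([s.name for s in picked])  # ['sidecar-eu']
```

In this example only one of three stores is queried, so an outage of either of the other two would not affect the request.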

Query Frontend

The Thanos Query Frontend can retry a query when the HTTP request to the downstream Querier fails. The --query-range.max-retries-per-request flag caps the number of retries per request. The Query Frontend is fully stateless and horizontally scalable.
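The retry behaviour can be pictured with a small sketch. This is not the Query Frontend's implementation: UpstreamError and do_with_retries are hypothetical names, and two details are assumptions of the sketch, namely that the limit is treated as a cap on total attempts and that only 5xx responses are retried.

```python
class UpstreamError(Exception):
    """Hypothetical error carrying the HTTP status of a failed downstream call."""

    def __init__(self, status):
        super().__init__(f"downstream returned HTTP {status}")
        self.status = status


def do_with_retries(send, max_retries):
    """Call `send()` until it succeeds or the attempt budget is used up.

    Assumption of this sketch: 5xx errors are retried, anything else is
    returned to the caller immediately.
    """
    attempts = max(1, max_retries)  # always try at least once
    last_error = None
    for _ in range(attempts):
        try:
            return send()
        except UpstreamError as err:
            if err.status < 500:
                raise  # client errors will not improve on retry
            last_error = err
    raise last_error


calls = {"n": 0}


def flaky():
    """Fails twice with 503, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise UpstreamError(503)
    return "ok"


result = do_with_retries(flaky, max_retries=5)
print(result, calls["n"])  # ok 3
```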

The thanos query-frontend command implements a service that can be put in front of Thanos Queriers to improve the read path. It is based on the Cortex Query Frontend component, so it shares features with it, such as splitting and results caching.
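To make the idea of splitting concrete, the sketch below cuts a long range query into sub-ranges aligned to a fixed interval. It is an illustration only: split_by_interval is a hypothetical helper, timestamps are plain seconds with inclusive bounds, and step alignment, which a real implementation has to handle, is ignored.

```python
def split_by_interval(start, end, interval):
    """Split [start, end] into sub-ranges aligned to multiples of `interval`.

    A non-positive interval disables splitting in this sketch.
    """
    if interval <= 0:
        return [(start, end)]
    ranges = []
    cur = start
    while cur <= end:
        # Last second of the aligned bucket that contains `cur`.
        bucket_end = (cur // interval + 1) * interval - 1
        ranges.append((cur, min(bucket_end, end)))
        cur = bucket_end + 1
    return ranges


DAY = 24 * 60 * 60
parts = split_by_interval(start=DAY // 2, end=2 * DAY + 100, interval=DAY)
for sub_start, sub_end in parts:
    print(sub_start, sub_end)
# 43200 86399
# 86400 172799
# 172800 172900
```

Each sub-range can be sent to a Querier independently, and because the boundaries are aligned, a sub-range that fails can be retried on its own and one that succeeds is a natural unit for results caching.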

Distributed Execution Mode

The distributed execution mode can be enabled using --query.mode=distributed. When this mode is enabled, the Querier will break down each query into independent fragments and delegate them to components which implement the Query API.

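Prometheus is stateful and does not allow replicating its database, so high availability is usually achieved by running two or more Prometheus replicas that scrape the same targets. Thanos Querier pulls data from all replicas and deduplicates the resulting series, filling gaps where one replica is missing samples, transparently to the consumer of the Querier.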
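The gap-filling effect of deduplication can be shown with a deliberately simplified sketch. It assumes both replicas record samples at identical timestamps, which real replicas generally do not, so this is a plain union by timestamp rather than the algorithm Thanos uses. The merge_replicas name is hypothetical.

```python
def merge_replicas(a, b):
    """Merge two replicas' samples for one series into a single timeline.

    Each input is a list of (timestamp, value) pairs.
    When both replicas have a sample at the same timestamp, replica `a` wins.
    """
    merged = dict(b)
    merged.update(dict(a))  # samples from `a` override duplicates from `b`
    return sorted(merged.items())


# Replica A missed the scrapes at t=30 and t=45 (for example, during a restart).
replica_a = [(0, 1.0), (15, 2.0), (60, 5.0)]
# Replica B stopped before t=60.
replica_b = [(0, 1.0), (15, 2.0), (30, 3.0), (45, 4.0)]

series = merge_replicas(replica_a, replica_b)
print(series)
# [(0, 1.0), (15, 2.0), (30, 3.0), (45, 4.0), (60, 5.0)]
```

Neither replica alone holds the full series, but the merged result has no gaps, which is the property the consumer of the Querier sees.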

Example command to run Query Frontend:

thanos query-frontend \
    --http-address "0.0.0.0:9090" \
    --query-frontend.downstream-url="<thanos-querier>:<querier-http-port>" \
    --query-range.horizontal-shards=0 \
    --query-range.max-query-length=0 \
    --query-range.max-query-parallelism=14 \
    --query-range.max-retries-per-request=5 \
    --query-range.max-split-interval=0 \
    --query-range.min-split-interval=0
