This documentation details the step-by-step process for monitoring a production setup using Thanos. The guide is tailored for developers familiar with the Thanos ecosystem and assumes knowledge of Prometheus and general observability practices.

Overview of Monitoring in Thanos

Thanos extends Prometheus, providing additional features for monitoring distributed systems, including long-term storage, global querying, and high availability. To effectively monitor a Thanos deployment, one needs to be familiar with its components, configuration, and how to leverage Prometheus rules and Grafana dashboards for observability.

Step 1: Setting Up Prometheus

Thanos builds on Prometheus as its primary metrics collector, so it is crucial that each Prometheus instance is configured correctly:

  • Deploy Prometheus in the same failure domain as the monitored services.
  • Use persistent storage to retain data across restarts.
  • Disable local compaction when the sidecar uploads blocks to object storage, and rely on the Thanos Compactor for compaction and downsampling instead.

See the example below for the storage flags commonly passed to Prometheus when running it with the Thanos sidecar (note that these are command-line flags, not prometheus.yml settings):

--storage.tsdb.retention.time=30d
--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=2h

Keeping the min and max block durations equal disables local compaction, so the sidecar can safely upload immutable two-hour blocks to object storage.

Source: docs/quick-tutorial.md
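Prometheus instances that feed Thanos also need unique external labels so that Thanos Query can identify each replica and deduplicate its data. A minimal prometheus.yml fragment (the label names and values here are illustrative; choose ones that match your topology):

```yaml
global:
  external_labels:
    cluster: eu-west   # identifies this Prometheus's failure domain
    replica: "0"       # distinguishes HA replicas scraping the same targets
```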

Step 2: Deploy Thanos Components

Deploy the essential Thanos components:

  • Thanos Sidecar: runs alongside each Prometheus instance, exposes its data over the Store API, and can upload TSDB blocks to object storage.
  • Thanos Store (Store Gateway): serves historical metrics from object storage over the same Store API.
  • Thanos Query: fans queries out across Sidecars and Store Gateways and deduplicates the results.
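The components above are typically started with command-line flags along these lines (a sketch only; the addresses, paths, and the bucket.yml object-storage configuration file are placeholders to adapt to your environment):

```
thanos sidecar \
  --tsdb.path            /var/prometheus \
  --prometheus.url       http://localhost:9090 \
  --objstore.config-file bucket.yml

thanos store \
  --data-dir             ./data \
  --objstore.config-file bucket.yml

thanos query \
  --http-address 0.0.0.0:19192 \
  --store        sidecar:10901 \
  --store        store:10901
```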

For example, a minimal Dockerfile for the Thanos binary could look like this:

ARG BASE_DOCKER_SHA="14d68ca3d69fceaa6224250c83d81d935c053fb13594c811038c461194599973"
FROM quay.io/prometheus/busybox@sha256:${BASE_DOCKER_SHA}
LABEL maintainer="The Thanos Authors"

COPY /thanos_tmp_for_docker /bin/thanos

RUN adduser \
    -D \
    -H \
    -u 1001 \
    thanos && \
    chown thanos /bin/thanos
USER 1001
ENTRYPOINT [ "/bin/thanos" ]

Source: Dockerfile

Step 3: Setting Up Monitoring Mixins

To simplify your monitoring setup, use the Thanos mixin, which provides predefined alerting rules and Grafana dashboards.

  1. Install the required tooling (the mixin is written in Jsonnet; jsonnet-bundler and gojsontoyaml are commonly used alongside it).
  2. Generate alerting rules, recording rules, and dashboards from the existing Jsonnet definitions tailored to Thanos metrics.

Example alert rule for monitoring Thanos Store latency:

- alert: ThanosStoreSeriesGateLatencyHigh
  annotations:
    description: Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.
    summary: Thanos Store has high latency for store series gate requests.
  expr: histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2
  for: 5m
  labels:
    severity: warning

Source: examples/alerts/alerts.yaml
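To see what the alert expression computes, here is an illustrative, stdlib-only Go sketch of the linear interpolation PromQL's histogram_quantile performs over cumulative buckets. The bucket values are made up, and the real function also special-cases the +Inf bucket and empty buckets; this is a simplified model, not Prometheus's implementation.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket, as in a Prometheus
// *_bucket series: the count of observations with value <= le.
type bucket struct {
	le    float64 // upper bound in seconds
	count float64 // cumulative count of observations
}

// quantile linearly interpolates the q-quantile within the first
// bucket whose cumulative count reaches rank = q * total, mirroring
// (in simplified form) what histogram_quantile does.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			// Interpolate between this bucket's bounds.
			return lowerBound + (b.le-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Hypothetical gate latencies: 90 requests <= 1s, 99 <= 2s, 100 <= 4s.
	b := []bucket{{1, 90}, {2, 99}, {4, 100}}
	fmt.Printf("p99 = %.2fs\n", quantile(0.99, b)) // prints: p99 = 2.00s
}
```

With this distribution the p99 lands exactly at 2 seconds, which is the threshold the alert above fires on.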

Step 4: Implementing Instrumentation in Your Application

To capture performance metrics, instrument your services with the standard Prometheus client libraries, the same approach Thanos uses for its own components.

Here is an example of how to instrument a Thanos Store server:

// instrumentedStoreServer wraps a StoreServer and records, per request,
// how many series and chunks were touched.
type instrumentedStoreServer struct {
    storepb.StoreServer
    seriesRequested prometheus.Histogram
    chunksRequested prometheus.Histogram
}

Source: pkg/store/telemetry.go
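The decorator pattern behind that struct can be sketched without the Prometheus client library. In this stdlib-only Go illustration, the StoreServer interface and fakeStore are hypothetical stand-ins for storepb.StoreServer, and a plain slice stands in for the histogram that would receive Observe calls:

```go
package main

import "fmt"

// StoreServer is a hypothetical minimal stand-in for storepb.StoreServer.
type StoreServer interface {
	Series(req string) int // returns the number of series touched
}

// instrumentedStoreServer decorates a StoreServer, recording how many
// series each request touched. In real code the slice would be a
// prometheus.Histogram and the append an Observe call.
type instrumentedStoreServer struct {
	StoreServer
	seriesObserved []int
}

func (s *instrumentedStoreServer) Series(req string) int {
	n := s.StoreServer.Series(req)
	s.seriesObserved = append(s.seriesObserved, n) // histogram.Observe(float64(n))
	return n
}

// fakeStore is a toy backend whose "series count" is just the request length.
type fakeStore struct{}

func (fakeStore) Series(req string) int { return len(req) }

func main() {
	s := &instrumentedStoreServer{StoreServer: fakeStore{}}
	s.Series("up")
	s.Series("rate(x[5m])")
	fmt.Println(s.seriesObserved) // prints: [2 11]
}
```

Because the wrapper embeds the interface, all other methods pass through untouched; only the instrumented call is overridden.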

Step 5: Gathering and Querying Metrics

Use the Thanos Query component to fetch metrics from your deployed Prometheus instances and Thanos Store.

Thanos Query evaluates standard PromQL and exposes the Prometheus-compatible HTTP API, so existing queries and tooling work unchanged across the combined dataset. An example query may look like:

sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m]))

Source: examples/alerts/rules.yaml

Step 6: Creating Dashboards in Grafana

To visualize the metrics captured, create Grafana dashboards that provide insights into the performance and health of your Thanos deployment. Use existing dashboards defined in the Thanos mixin or create custom ones based on your specific requirements.

Example Dashboard Configuration:

{
  "__inputs": [],
  "panels": [
    {
      "title": "Thanos Store Latency",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(thanos_bucket_store_series_gate_duration_seconds_bucket[1m])) by (le))",
          "range": true
        }
      ]
    }
  ]
}

Source: mixin/README.md

Conclusion

Effective production monitoring with Thanos involves deploying and configuring Prometheus properly, leveraging Thanos’s extensible architecture through mixins, and visualizing everything in Grafana. With careful instrumentation, alerting, and dashboarding, you can achieve robust observability for your applications.

For more in-depth information, refer to the Thanos documentation.