Data Management in Thanos

Thanos is an open-source CNCF Incubating project that builds on Prometheus components to create a global-scale, highly available monitoring system. It provides several ways to manage the data it keeps in object storage, including compaction, deletion, and recovery. This document covers each option with examples, based on the official Thanos documentation and code.

Compaction

Thanos builds on the Prometheus 2.0 storage engine (TSDB), which periodically produces immutable blocks of data for a fixed time range. A block is a directory with a handful of larger files containing all sample data and the persisted indices that are required to retrieve the data. The Thanos Sidecar uploads these blocks to a cloud storage bucket, and a component called the Compactor then compacts, downsamples, and applies retention to the data stored there.

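Here’s a summary of how the Compactor works, based on the Thanos documentation:

The Compactor (thanos compact) merges multiple smaller blocks in a bucket into larger ones, which reduces the number of blocks and the index overhead that queries have to pay. By default it runs once and exits, which suits a cron job; with the --wait flag it keeps running and repeats the cycle. The Compactor also downsamples data to 5-minute and 1-hour resolutions and applies retention. Downsampling speeds up long-range queries but does not save space, because the downsampled blocks are stored in addition to the raw ones; retention is what reduces the amount of data that needs to be stored.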
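The retention step can be illustrated with a small model. The following is a minimal sketch, not the Compactor's implementation: it assumes a block expires once its newest sample is older than the retention window for the block's resolution, and that a window of zero means "keep forever". The Block class, the block IDs, and the retention values are hypothetical; the resolution constants are the millisecond values Thanos records in block metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Resolution values as recorded in Thanos block metadata, in milliseconds.
RAW, FIVE_MIN, ONE_HOUR = 0, 300_000, 3_600_000


@dataclass(frozen=True)
class Block:
    ulid: str            # hypothetical block ID, for illustration only
    max_time: datetime   # timestamp of the newest sample in the block
    resolution: int      # RAW, FIVE_MIN or ONE_HOUR


def expired_blocks(blocks, retention, now):
    """Return the IDs of blocks past the retention window for their resolution.

    A window of zero (or a missing entry) is treated as 'keep forever'.
    """
    expired = []
    for block in blocks:
        window = retention.get(block.resolution, timedelta(0))
        if window > timedelta(0) and now > block.max_time + window:
            expired.append(block.ulid)
    return expired


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
retention = {
    RAW: timedelta(days=30),       # assumed raw retention
    FIVE_MIN: timedelta(days=90),  # assumed 5m retention
    ONE_HOUR: timedelta(0),        # keep 1h data forever
}
blocks = [
    Block("raw-old", now - timedelta(days=45), RAW),
    Block("raw-new", now - timedelta(days=2), RAW),
    Block("5m-old", now - timedelta(days=45), FIVE_MIN),
    Block("1h-ancient", now - timedelta(days=900), ONE_HOUR),
]
print(expired_blocks(blocks, retention, now))
```

With these assumed values only the 45-day-old raw block is past its window; the 5-minute block of the same age is kept because its window is longer, which is the usual reason to configure longer retention for coarser resolutions.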

Deletion

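Thanos provides a component called the Ruler (thanos rule), which evaluates Prometheus recording and alerting rules against data in Thanos for exposition and/or upload. The Ruler cannot be used to delete data: a rule only writes new series, and the rule file format has no expiration field or any other setting that removes existing samples. Data is deleted from the bucket at the block level, in one of two ways. The Compactor marks blocks for deletion once they are older than the retention configured per resolution (--retention.resolution-raw, --retention.resolution-5m, --retention.resolution-1h), or an operator marks individual blocks by ID with thanos tools bucket mark --marker=deletion-mark.json. Marked blocks are physically removed by the Compactor after the delete delay (--delete-delay, 48h by default). Removing only the series that match a label selector, such as job="myjob", requires rewriting the affected blocks, which thanos tools bucket rewrite supports.

Here’s an example that shows what a rule can and cannot do:

The following recording rule aggregates a metric named myjob per instance and stores the result as a new series, job:myjob:delete. Despite the name, evaluating it leaves the original myjob samples untouched, whatever their age:
groups:
- name: delete-myjob
  rules:
  - record: job:myjob:delete
    expr: sum by (instance) (myjob)
Recording a derived series, or setting a series to zero, never removes stored samples. To drop myjob data that is older than 7 days, the data has to age out through Compactor retention (which applies to every series in the bucket, not just one job), or the blocks that contain it have to be marked for deletion or rewritten.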
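In practice, block-level deletion is requested with the thanos tools bucket mark command, which places a deletion marker on the given blocks so that the Compactor removes them later. The sketch below only assembles the argument list for such an invocation; it does not run it. The helper name, the bucket.yml path, the details text, and the block ID are hypothetical placeholders, and the flags should be checked against the documentation of the Thanos version in use.

```python
import shlex


def mark_for_deletion_cmd(block_ids, objstore_config_file, details):
    """Build the argv for marking blocks for deletion (hypothetical helper)."""
    if not block_ids:
        raise ValueError("at least one block ID is required")
    cmd = [
        "thanos", "tools", "bucket", "mark",
        "--marker=deletion-mark.json",                      # kind of marker to write
        f"--objstore.config-file={objstore_config_file}",   # bucket configuration
        f"--details={details}",                             # human-readable reason
    ]
    # --id is given once per block to be marked.
    cmd.extend(f"--id={block_id}" for block_id in block_ids)
    return cmd


cmd = mark_for_deletion_cmd(
    ["01ARZ3NDEKTSV4RRFFQ69G5FAV"],  # placeholder ULID, not a real block
    "bucket.yml",
    "obsolete myjob data",
)
# Print a shell-quoted version for review before running it by hand.
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps arguments with spaces, such as the details text, intact if the list is later passed to subprocess.run.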

Recovery

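Thanos provides a component called the Querier (thanos query), which implements the Prometheus HTTP v1 API and aggregates and deduplicates data from the underlying components, including the Store Gateway that serves blocks from the cloud storage bucket. The Querier is read-only: it can show which data is still available, but it cannot restore blocks that have been removed from the bucket. A block that is only marked for deletion still exists in the bucket until the delete delay elapses; after that, it can be recovered only from a copy kept outside Thanos, such as bucket versioning or a backup.

Here’s an example of how to read data back through the Querier:

To check what is still stored for a job, query the Querier for the series you expect. A bare label matcher such as job="myjob" is not a valid PromQL expression; the selector must be wrapped in braces and URL-encoded. For example, to read all series with the label job="myjob", you can use a query like this:
curl "http://thanos-querier:19191/api/v1/query?query=%7Bjob%3D%22myjob%22%7D"
This instant query returns the most recent sample of every series with the label job="myjob" that any connected store can serve. To read historical samples, use /api/v1/query_range with start, end, and step parameters. The results can then be exported or compared against the expected data, depending on your specific use case and requirements.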
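Hand-encoding a selector is error-prone, and a label matcher needs surrounding braces to be valid PromQL. As a minimal sketch, the helper below builds the instant-query URL for the selector {job="myjob"} with the standard library. The function name is hypothetical, the host and port are the ones used in the example above, and dedup is the Querier's deduplication query parameter.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, dedup=True):
    """Return the /api/v1/query URL for a PromQL expression (hypothetical helper)."""
    params = {
        "query": promql,                         # percent-encoded by urlencode
        "dedup": "true" if dedup else "false",   # Querier deduplication switch
    }
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


url = instant_query_url("http://thanos-querier:19191", '{job="myjob"}')
print(url)
```

The printed URL can be passed to curl or any HTTP client; the same approach works for /api/v1/query_range by adding start, end, and step to the parameters.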

Conclusion

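Thanos provides several strategies for managing data in object storage: compaction and downsampling by the Compactor, deletion through retention settings and deletion marks, and read access to the remaining data through the Querier. These strategies build on the Prometheus 2.0 storage engine and can be used to keep queries fast, bound storage growth, and remove data that is no longer needed. Thanos itself cannot restore blocks once they have been removed from the bucket, so recovery depends on the delete delay and on copies of the bucket kept outside Thanos. With these mechanisms in place, organizations can run a global-scale, highly available monitoring system whose data remains manageable over time.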

Sources

Thanos documentation: https://thanos.io/v0.36/thanos/
Thanos code: https://github.com/thanos-io/thanos
Prometheus documentation: https://prometheus.io/docs/
Prometheus code: https://github.com/prometheus/prometheus