Monitoring and Logging @ thanos-io/thanos



Monitoring and Logging for Thanos

What is Monitoring and Logging?

Monitoring and Logging refer to the practices and tools used to collect, process, and analyze data about the performance, health, and behavior of a system or application. In the context of Thanos, they cover setting up and maintaining the project's monitoring and logging systems, including metrics, tracing, and alerting.

Why is Monitoring and Logging important?

Monitoring and Logging are crucial for understanding the performance and behavior of complex systems like Thanos. They help developers and operators to:

- Identify and diagnose issues quickly
- Optimize system performance
- Ensure system reliability and availability
- Comply with regulatory requirements
- Understand user behavior and usage patterns

Insights

Monitoring and Logging have been essential for improving the awareness and observability of systems and applications. In the case of Medallia, the implementation of metrics was a game-changer for the engineering organization, providing valuable insights into the quality and performance of their software.

Thanos, as a distributed observability system, relies on effective monitoring and logging to provide insights into the performance and behavior of the system. Different types of logging are useful for various purposes, such as debugging, tracking queries, and optimizing database performance.

Active Query Logging

Active Query Logging records every query that is currently running in a component. The logger is local to each component and runs independently of the others. It helps in debugging queries that caused a component instance to be killed by an Out of Memory (OOM) error, and in tracking queries that are taking too long.

level=info ts=2019-08-28T14:30:09.142Z caller=main.go:331 component=activeQueryTracker msg="These queries didn't finish in prometheus' last run:" queries="[{"query":"changes(changes(prometheus_http_request_duration_seconds_bucket[1h:1s])[1h:1s])", "timestamp_sec":1567002604}]"

The example above shows an active query log taken from Prometheus. It records the query that had not finished when the previous run ended. Although it does not pinpoint the exact cause of the problem, it flags that query as a potential cause worth investigating.

Audit and Adaptive Logging

Audit logging is another type of logging where all internal API requests are logged. It is useful for tracking the flow of requests made and can help in describing the observed behavior of a request.

{
          protoPayload: {
          @type: "type.googleapis.com/google.cloud.audit.AuditLog",
          status: {},
          authenticationInfo: {principalEmail:  },
          serviceName: "appengine.googleapis.com",
          methodName: "SetIamPolicy",
          authorizationInfo: [...],
          serviceData: {
          @type: "type.googleapis.com/google.appengine.legacy.AuditData",
          policyDelta: { bindingDeltas: [
          action: "ADD",
          role: "roles/logging.privateLogViewer",
          member:  ]
          },
          request: {
          resource: "my-gcp-project-id",
          policy: { bindings: [...], }
          },
          response: {
          bindings: [
          {
          role: "roles/logging.privateLogViewer",
          members: [  ]
          }
          ],
          }
          },
          insertId: "53179D9A9B559.AD6ACC7.B40604EF",
          resource: {
          type: "gcp_app",
          labels: { project_id: "my-gcp-project-id" }
          },
          timestamp: "2019-05-27T16:24:56.135Z",
          severity: "NOTICE",
          logName: "projects/my-gcp-project-id/logs/cloudaudit.googleapis.com%2Factivity",
          }

The example above shows an audit log. It captures the sequence of requests made, which helps in tracing a query step by step through the system.

Adaptive logging records only those queries that meet certain criteria or filters. These logs can later be used to inspect abnormal queries and to improve the overall user experience.

[1] https://www.robustperception.io/what-queries-were-running-when-prometheus-died
[2] https://rollout.io/blog/audit-logs/
[3] https://cloud.google.com/logging/docs/audit/understanding-audit-logs
