Distributed traces can be excessively verbose

Issue Summary: Excessive Verbosity in Distributed Traces of Thanos

Issue Author: ahayworth
Date Created: February 2, 2023

Problem Statement:

In large Thanos deployments, the volume of spans generated by components can be overwhelming. A recent trace captured 59,000 spans in just 3 seconds, a volume that poses significant cost and performance challenges in production environments.

Proposed Solutions:

  1. Span Emission Control:
  • Prevent Thanos from emitting verbose spans entirely.
  • Introduce a configuration option to conditionally emit spans based on user-defined criteria.
  2. Removing Unnecessary Spans:
  • Identify and remove spans that add little value during normal operations, particularly those that describe internal processing rather than RPC calls. Examples include spans from store_matches and internal TSDB operations.
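
The conditional-emission idea can be pictured as a check consulted before a span is started. The following is a minimal sketch and not Thanos code: EmitConfig, DropPrefixes and ShouldEmit are hypothetical names, Thanos has no such option today, and the "tsdb_" prefix is only an illustrative stand-in for internal TSDB span names.

```go
package main

import (
	"fmt"
	"strings"
)

// EmitConfig is a hypothetical user-facing option (not an existing Thanos setting).
type EmitConfig struct {
	// DropPrefixes lists span-name prefixes that should not be emitted.
	DropPrefixes []string
}

// ShouldEmit reports whether a span with the given name passes the filter.
func (c EmitConfig) ShouldEmit(name string) bool {
	for _, p := range c.DropPrefixes {
		if strings.HasPrefix(name, p) {
			return false
		}
	}
	return true
}

func main() {
	// "tsdb_" is an assumed prefix used only for illustration.
	cfg := EmitConfig{DropPrefixes: []string{"store_matches", "tsdb_"}}
	names := []string{"store_matches", "tsdb_head_query", "/thanos.Store/Series"}
	for _, name := range names {
		fmt.Printf("%-24s emit=%v\n", name, cfg.ShouldEmit(name))
	}
}
```

Making the decision from the span name alone keeps the check cheap enough to run before any span is allocated, which is the point of controlling emission at the source.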

Developer Feedback:

  • Participants suggested that some spans are unnecessary and proposed configuration options such as:
  • Keeping only RPC spans (those marked as CLIENT/SERVER spans).
  • Introducing tiered span levels (e.g., “debug” vs. “info”).
  • Adding a “max N-depth” setting to limit span depth.
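
The three suggestions above could be combined into a single policy. The sketch below assumes hypothetical names throughout (Policy, SpanInfo, Allow, and the level constants); none of them exist in Thanos or in the OpenTelemetry API, which has span kinds but no notion of span levels.

```go
package main

import "fmt"

// SpanKind mirrors the kinds mentioned in the discussion.
type SpanKind int

const (
	KindInternal SpanKind = iota
	KindClient
	KindServer
)

// Level is a hypothetical span verbosity tier.
type Level int

const (
	LevelInfo Level = iota
	LevelDebug
)

// Policy is a hypothetical configuration combining the three proposals.
type Policy struct {
	RPCOnly  bool  // keep only CLIENT/SERVER spans
	MaxLevel Level // most verbose level that is still emitted
	MaxDepth int   // deepest span emitted; 0 means no limit
}

// SpanInfo describes a span before it is created. Depth is 1 for the root span.
type SpanInfo struct {
	Kind  SpanKind
	Level Level
	Depth int
}

// Allow reports whether the policy permits emitting the span.
func (p Policy) Allow(s SpanInfo) bool {
	if p.RPCOnly && s.Kind == KindInternal {
		return false
	}
	if s.Level > p.MaxLevel {
		return false
	}
	if p.MaxDepth > 0 && s.Depth > p.MaxDepth {
		return false
	}
	return true
}

func main() {
	p := Policy{RPCOnly: true, MaxLevel: LevelInfo, MaxDepth: 3}
	spans := []SpanInfo{
		{Kind: KindServer, Level: LevelInfo, Depth: 1},
		{Kind: KindInternal, Level: LevelInfo, Depth: 2},
		{Kind: KindClient, Level: LevelDebug, Depth: 2},
		{Kind: KindClient, Level: LevelInfo, Depth: 4},
	}
	for _, s := range spans {
		fmt.Printf("%+v allowed=%v\n", s, p.Allow(s))
	}
}
```

A real implementation would also have to decide what happens to the children of a dropped span, for example by attaching them to the nearest emitted ancestor. The sketch leaves that question open.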

Alternatives Considered:

  • Filtering spans via the OpenTelemetry collector was discussed; however, this does not alleviate the overhead associated with processing the spans.
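
To make the overhead argument concrete, the toy model below counts how many spans the instrumented process has to create and export under each approach. It is an illustration under simplified assumptions, not a benchmark: Span, Process, emit, collectorKeep and run are hypothetical names, and real costs depend on the SDK and exporter in use.

```go
package main

import "fmt"

// Span is a stand-in for a real tracing span.
type Span struct {
	Name     string
	Internal bool
}

// Process counts the work done inside the instrumented process.
type Process struct {
	Created  int // spans allocated and populated
	Exported int // spans serialized and sent to the collector
}

// emit creates and exports a span unless source-side filtering rejects it first.
func (p *Process) emit(name string, internal, dropInternalAtSource bool) *Span {
	if dropInternalAtSource && internal {
		return nil // rejected before any allocation or export
	}
	p.Created++
	p.Exported++
	return &Span{Name: name, Internal: internal}
}

// collectorKeep models a collector-side filter that drops internal spans.
func collectorKeep(s *Span) bool {
	return s != nil && !s.Internal
}

// run emits one RPC span and n internal spans, then applies the collector filter.
// It returns the process counters and the number of spans finally stored.
func run(n int, dropAtSource bool) (Process, int) {
	var p Process
	spans := []*Span{p.emit("rpc", false, dropAtSource)}
	for i := 0; i < n; i++ {
		spans = append(spans, p.emit("internal_op", true, dropAtSource))
	}
	stored := 0
	for _, s := range spans {
		if collectorKeep(s) {
			stored++
		}
	}
	return p, stored
}

func main() {
	a, storedA := run(1000, false)
	b, storedB := run(1000, true)
	fmt.Printf("collector-only filter: created=%d exported=%d stored=%d\n", a.Created, a.Exported, storedA)
	fmt.Printf("source-side filter:    created=%d exported=%d stored=%d\n", b.Created, b.Exported, storedB)
}
```

Both runs store the same single RPC span, but only the source-side variant avoids creating and exporting the 1,000 internal spans in the first place.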

Contextual Insights:

  • The verbosity issue may not be experienced universally across all Thanos users due to varying deployment scales.
  • A detailed analysis of the spans showed a large number of “UNSPECIFIED” spans, attributed to the proliferation of RPC calls.
  • Contributors have hinted at preferring continuous profiling over tracing for performance visibility.

Next Steps:

  • Potential contributors are encouraged to assess this issue and propose implementations based on the outlined suggestions. The objective is to balance necessary operational visibility with reduced trace noise in production environments.

Labels: feature request/improvement, good first issue, help wanted