Monitoring and Logging for Thanos
What is Monitoring and Logging?
Monitoring and logging are the practices and tools used to collect, process, and analyze data about the performance, health, and behavior of a system or application. For Thanos, this means setting up and maintaining metrics, tracing, alerting, and logging for the project.
Why is Monitoring and Logging important?
Monitoring and logging are crucial for understanding the performance and behavior of complex systems like Thanos. They help developers and operators:
- Identify and diagnose issues quickly
- Optimize system performance
- Ensure system reliability and availability
- Comply with regulatory requirements
- Understand user behavior and usage patterns
Insights
Monitoring and logging are essential to the observability of systems and applications. At Medallia, for example, introducing metrics was a game-changer for the engineering organization, giving it valuable insight into the quality and performance of its software.
Thanos, as a distributed observability system, relies on effective monitoring and logging to provide insights into the performance and behavior of the system. Different types of logging are useful for various purposes, such as debugging, tracking queries, and optimizing database performance.
Active Query Logging
Active query logging records all queries currently running in a component. The logger is local to each component and runs standalone within it. It helps debug queries that led to a component instance being killed by an out-of-memory (OOM) error, and track queries that are taking too long.
level=info ts=2019-08-28T14:30:09.142Z caller=main.go:331 component=activeQueryTracker msg="These queries didn't finish in prometheus' last run:" queries="[{\"query\":\"changes(changes(prometheus_http_request_duration_seconds_bucket[1h:1s])[1h:1s])\",\"timestamp_sec\":1567002604}]"
The example above is an active query log taken from Prometheus. It lists the query that did not finish during the last run. The log does not pinpoint what caused the problem, but it flags that query as a potential cause worth investigating.
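The mechanism behind such a log can be sketched in a few lines. The following is a minimal illustration only, not how Prometheus or Thanos implements it: the class name `ActiveQueryTracker`, the file name `queries.active`, and the JSON-on-disk format are all assumptions made for this example. The idea is that every in-flight query is persisted when it starts and removed when it finishes, so whatever is still on disk at the next startup did not complete.

```python
import json
import os
import tempfile
import time


class ActiveQueryTracker:
    """Hypothetical tracker: persists in-flight queries so they survive a crash."""

    def __init__(self, path):
        self.path = path
        # Anything still on disk was in flight when the previous run ended.
        self.unfinished = self._load()
        self.active = {}
        self._next_id = 0
        self._flush()

    def _load(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return json.load(f)

    def _flush(self):
        # Rewrite the whole file on every change; simple, not optimized.
        with open(self.path, "w") as f:
            json.dump(list(self.active.values()), f)

    def start(self, query):
        self._next_id += 1
        self.active[self._next_id] = {
            "query": query,
            "timestamp_sec": int(time.time()),
        }
        self._flush()
        return self._next_id

    def finish(self, query_id):
        self.active.pop(query_id, None)
        self._flush()


path = os.path.join(tempfile.mkdtemp(), "queries.active")

# First run: one query finishes, one is still running when the process "dies".
first = ActiveQueryTracker(path)
done = first.start("up")
stuck = first.start("changes(metric[1h:1s])")
first.finish(done)

# Second run: the tracker reports what never finished.
second = ActiveQueryTracker(path)
for entry in second.unfinished:
    print("did not finish in last run:", entry["query"])
```

Rewriting the file on every change keeps the sketch short; a production tracker would need a cheaper way to persist entries, since the write sits on the query path.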
Audit and Adaptive Logging
Audit logging records every internal API request. It is useful for tracking the flow of requests and for explaining the observed behavior of a particular request.
{
  protoPayload: {
    @type: "type.googleapis.com/google.cloud.audit.AuditLog",
    status: {},
    authenticationInfo: {principalEmail: },
    serviceName: "appengine.googleapis.com",
    methodName: "SetIamPolicy",
    authorizationInfo: [...],
    serviceData: {
      @type: "type.googleapis.com/google.appengine.legacy.AuditData",
      policyDelta: { bindingDeltas: [
        action: "ADD",
        role: "roles/logging.privateLogViewer",
        member: ] }
    },
    request: {
      resource: "my-gcp-project-id",
      policy: { bindings: [...], }
    },
    response: {
      bindings: [
        {
          role: "roles/logging.privateLogViewer",
          members: [ ]
        }
      ],
    }
  },
  insertId: "53179D9A9B559.AD6ACC7.B40604EF",
  resource: {
    type: "gcp_app",
    labels: { project_id: "my-gcp-project-id" }
  },
  timestamp: "2019-05-27T16:24:56.135Z",
  severity: "NOTICE",
  logName: "projects/my-gcp-project-id/logs/cloudaudit.googleapis.com%2Factivity",
}
The example above shows a Google Cloud audit log entry. It captures the request that was made together with its response, which helps in following the flow of a request step by step.
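As a rough illustration of the idea, the sketch below wraps an internal API handler so that every call produces one audit record, whether it succeeds or fails. Everything here is hypothetical: the `audited` decorator, the `label_values` handler, the `principal` argument, and the in-memory `audit_records` sink are stand-ins invented for the example and do not correspond to any Thanos API. Only the `methodName`, `request`, and `response` field names echo the entry shown above.

```python
import functools
import json
import time

audit_records = []  # stand-in sink; a real system would write to a log stream


def audited(method_name):
    """Hypothetical decorator that records every call to an internal API."""
    def wrap(handler):
        @functools.wraps(handler)
        def inner(principal, request):
            record = {
                "methodName": method_name,
                "principal": principal,
                "request": request,
                "timestamp": time.time(),
            }
            try:
                response = handler(principal, request)
                record["status"] = "OK"
                record["response"] = response
                return response
            except Exception as exc:
                record["status"] = "ERROR"
                record["error"] = str(exc)
                raise
            finally:
                # One record per request, whether it succeeded or failed.
                audit_records.append(json.dumps(record))
        return inner
    return wrap


@audited("LabelValues")
def label_values(principal, request):
    if "label" not in request:
        raise ValueError("label is required")
    return {"values": ["prod", "staging"]}


label_values("alice", {"label": "env"})
try:
    label_values("bob", {})
except ValueError:
    pass

for line in audit_records:
    print(line)
```

Emitting the record in the `finally` clause is the design point: failed requests are often the ones an operator most needs to see in the audit trail.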
Adaptive logging records only the queries that satisfy certain criteria or filters. These logs can later be used to inspect abnormal queries and improve the overall user experience.
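One way to picture filter-based logging is a logger that evaluates a set of named predicates against each completed query and keeps the query only if at least one matches. The sketch below assumes two example filters, a 5-second slowness threshold and an error check. The class `AdaptiveQueryLogger`, the filter names, and the threshold are illustrative assumptions, not part of Thanos.

```python
import json


class AdaptiveQueryLogger:
    """Hypothetical logger that keeps a query only if some filter matches it."""

    def __init__(self, filters):
        self.filters = filters  # mapping of filter name -> predicate(entry)
        self.lines = []

    def observe(self, query, duration_sec, error=None):
        entry = {"query": query, "duration_sec": duration_sec, "error": error}
        matched = [name for name, pred in self.filters.items() if pred(entry)]
        if matched:
            # Record which filters fired so the log explains itself.
            entry["matched"] = matched
            self.lines.append(json.dumps(entry))
        return bool(matched)


logger = AdaptiveQueryLogger({
    "slow": lambda e: e["duration_sec"] > 5.0,   # assumed threshold
    "failed": lambda e: e["error"] is not None,
})

logger.observe("up", 0.02)                               # fast and fine: skipped
logger.observe("rate(http_requests_total[5m])", 7.5)     # slow: logged
logger.observe("sum(", 0.01, error="parse error")        # failed: logged

for line in logger.lines:
    print(line)
```

Keeping the filters as plain predicates means new criteria can be added without touching the logger itself.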
[1] https://www.robustperception.io/what-queries-were-running-when-prometheus-died
[2] https://rollout.io/blog/audit-logs/
[3] https://cloud.google.com/logging/docs/audit/understanding-audit-logs