Monitoring and Logging for cilium/cilium

This guide describes how to monitor Cilium in a production environment. It covers enabling Prometheus scraping through ServiceMonitor resources, verifying the setup, querying key metrics, and inspecting datapath events with cilium-dbg monitor.

Setting Up Monitoring with Prometheus

Cilium exposes Prometheus metrics for the agent, the operator, and the Clustermesh API server. The steps below enable scraping of those metrics through ServiceMonitor resources, which are provided by the Prometheus Operator.

Enabling Service Monitors

To enable monitoring for both the Cilium operator and the Cilium agent, set the following Helm chart values, for example in a values file passed to Helm:

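operator:
  prometheus:
    enabled: true         # Expose cilium-operator metrics
    serviceMonitor:
      enabled: true       # Create a ServiceMonitor for cilium-operator
      interval: 10s       # Scrape interval
      jobLabel: cilium-operator  # Name of the Service label used as the Prometheus job
      labels: {}          # Extra labels for the ServiceMonitor, if required

prometheus:
  enabled: true           # Expose cilium-agent metrics
  serviceMonitor:
    enabled: true         # Create a ServiceMonitor for cilium-agent
    interval: 10s         # Scrape interval
    jobLabel: cilium-agent  # Name of the Service label used as the Prometheus job
    labels: {}            # Extra labels for the ServiceMonitor, if required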
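
One way to apply these values is a Helm upgrade. The sketch below assumes a release named cilium installed from the cilium/cilium chart in the kube-system namespace, and a hypothetical values file monitoring-values.yaml that holds the settings above. None of these names come from this guide, so adjust them to your installation (CILIUM_RELEASE and CILIUM_NAMESPACE are illustrative override variables).

```shell
# Sketch: apply monitoring values with Helm.
# Assumptions (not from this guide): release "cilium", chart "cilium/cilium",
# namespace "kube-system", values file "monitoring-values.yaml".
apply_monitoring_values() {
  helm upgrade "${CILIUM_RELEASE:-cilium}" cilium/cilium \
    --namespace "${CILIUM_NAMESPACE:-kube-system}" \
    --reuse-values \
    -f "${1:-monitoring-values.yaml}"
}

# Usage: apply_monitoring_values monitoring-values.yaml
```

The ServiceMonitor resource type comes from the Prometheus Operator, so its CRDs need to be present in the cluster before the upgrade. The --reuse-values flag keeps the values of the existing release and layers the file on top.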

Clustermesh API Server Metrics

To monitor metrics for the Clustermesh API server, ensure the following configuration is in place:

clustermesh:
  apiserver:
    metrics:
      serviceMonitor:
        enabled: true      # Enable service monitor for Clustermesh API server
        etcd:
          interval: 10s     # Set the scrape interval for etcd metrics
          metricRelabelings: []  # List of metric relabeling rules (empty applies none)

Verifying Prometheus Configuration

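After applying the configuration, list the ServiceMonitor resources in the namespace where Cilium is installed. The command below assumes a namespace named cilium; many installations use kube-system instead: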

kubectl get servicemonitor -n cilium

You should see a ServiceMonitor listed for the Cilium operator and for the Cilium agent, plus one for the Clustermesh API server if you enabled it.
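
The check can be scripted. The helper below is a minimal sketch that reads the output of kubectl get servicemonitor -o name on stdin and prints every expected ServiceMonitor that is absent. The function name and the ServiceMonitor names cilium-agent and cilium-operator in the usage line are assumptions; confirm the actual names in your cluster first.

```shell
# Sketch: report expected ServiceMonitors that are absent.
# Reads `kubectl get servicemonitor -o name` output on stdin; the names to
# look for are passed as arguments. Prints one missing name per line.
missing_monitors() {
  local listing expected
  listing="$(cat)"
  for expected in "$@"; do
    # -o name prints "<resource>/<name>", so match on the trailing "/<name>".
    if ! printf '%s\n' "$listing" | grep -q -- "/${expected}\$"; then
      printf '%s\n' "$expected"
    fi
  done
}

# Usage (namespace and ServiceMonitor names are assumptions):
#   kubectl get servicemonitor -n cilium -o name \
#     | missing_monitors cilium-agent cilium-operator
```

Empty output means every expected ServiceMonitor was found.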

Capturing Metrics

Once monitoring is enabled, Cilium exposes various metrics that can be captured and queried via Prometheus. Some notable metrics include:

controllers_runs_total: Number of times that a controller process was run.
jobs_errors_total: Number of job runs that returned an error.
remote_cluster_readiness_status: The readiness status of the remote cluster.
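
Before relying on Prometheus, it can help to confirm that a component actually exposes a given metric. The helper below is a sketch that sums all samples of one metric from Prometheus text-format output read on stdin. It assumes sample lines carry no trailing timestamp, and the usage line assumes the agent exports the controller counter as cilium_controllers_runs_total. How you obtain the text (for example from a component's metrics endpoint) depends on your setup, and metrics.txt is a hypothetical file name.

```shell
# Sketch: sum every sample of one metric in Prometheus text-format input.
# Assumes each sample line is "<name>{labels} <value>" with no timestamp.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                 # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)    # drop the label set, keep the bare name
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}

# Usage (metrics.txt is a hypothetical dump of a metrics endpoint):
#   sum_metric cilium_controllers_runs_total < metrics.txt
```

A result of 0 means either the counter has not incremented yet or the metric is not exported under that name.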

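Metrics are queried in Prometheus by name. The names above are listed without their exporter prefix: the agent exports its metrics under the cilium_ namespace and the operator under cilium_operator_, so prepend the prefix of the exporting component if a bare name returns no data. The bare names are: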

controllers_runs_total
jobs_errors_total
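
Queries can also be issued from the command line through the Prometheus HTTP API. The sketch below wraps an instant query in a small function. PROM_URL and its default of http://localhost:9090 are assumptions about where your Prometheus server is reachable, and the example expression assumes the agent exports the counter with its cilium_ prefix.

```shell
# Sketch: run an instant PromQL query through the Prometheus HTTP API.
# PROM_URL is an assumption; point it at your Prometheus server.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage: per-second rate of controller runs over the last five minutes,
# assuming the agent exports the counter with its cilium_ prefix.
#   prom_query 'rate(cilium_controllers_runs_total[5m])'
```

The API responds with JSON. Using rate() on a counter is usually more informative than the raw value, because counters only ever increase until the process restarts.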

Debugging with cilium-dbg Monitor

Beyond metrics, Cilium provides the cilium-dbg monitor command, which displays in real time the notifications and events emitted by the BPF programs attached to endpoints and devices. This includes:

Dropped packet notifications
Captured packet traces
Policy verdict notifications
Debugging information

Run the command from inside a Cilium agent container:

cilium-dbg monitor [flags]
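
From outside the cluster, the command is typically reached through kubectl exec. The sketch below restricts the output to dropped-packet notifications with the --type filter. It assumes the agent runs as a DaemonSet named cilium in the kube-system namespace; CILIUM_NAMESPACE is an illustrative override variable, not something Cilium defines.

```shell
# Sketch: stream only dropped-packet notifications from one agent pod.
# Assumptions: Cilium runs as DaemonSet "cilium" in namespace "kube-system".
monitor_drops() {
  kubectl -n "${CILIUM_NAMESPACE:-kube-system}" exec ds/cilium -- \
    cilium-dbg monitor --type drop "$@"
}

# Usage: monitor_drops        (extra arguments are passed to cilium-dbg monitor)
```

kubectl picks a single pod of the DaemonSet, so the output covers one node only. To inspect a particular node, exec into the agent pod scheduled on that node instead.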

Conclusion

Monitoring Cilium in production rests on two pieces: Prometheus metrics scraped through ServiceMonitor resources for the agent, the operator, and the Clustermesh API server, and live datapath events from cilium-dbg monitor. Tune the scrape intervals to your load and retention requirements, and enable a ServiceMonitor for every component you run so that no critical metric goes unscraped.

For more detailed configuration options and the full list of metrics, see the official Cilium documentation and the Prometheus Operator documentation.

Source: Documentation/cmdref/cilium-dbg_monitor.md, install/kubernetes/cilium/README.md, Documentation/observability/metrics.rst.