Scaling cilium/cilium.io

Scaling Cilium in production means keeping performance, reliability, and security intact as clusters grow. The sections below cover network policy design, Day 2 operations, monitoring, load balancing, and resource management, each with a short example aimed at experienced developers.

Addressing and Network Policy Scaling

To scale to thousands of nodes and manage complex network policies, design the architecture with scalability in mind from the start. Cilium’s eBPF-based datapath enforces policy on identities derived from workload labels rather than on individual IP addresses, which helps policy enforcement stay efficient as the number of workloads grows.

Example: Configuring Network Policies

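Using Cilium, network policies can be defined to control traffic between workloads. As the number of workloads increases, it is vital to manage policies systematically. The policy below allows endpoints labeled role=backend to reach endpoints labeled role=frontend on TCP port 443: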

{
  "apiVersion": "cilium.io/v2",
  "kind": "CiliumNetworkPolicy",
  "metadata": {
    "name": "example-network-policy"
  },
  "spec": {
    "endpointSelector": {
      "matchLabels": {
        "role": "frontend"
      }
    },
    "ingress": [
      {
        "fromEndpoints": [
          {
            "matchLabels": {
              "role": "backend"
            }
          }
        ],
        "toPorts": [
          {
            "ports": [
              {
                "port": "443",
                "protocol": "TCP"
              }
            ]
          }
        ]
      }
    ]
  }
}
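
As the number of role pairs grows, hand-writing one manifest per pair becomes error-prone. One way to manage policies systematically is to generate them from a small table of allowed flows. The following is a minimal sketch in Python and is not part of Cilium itself. The build_policy helper, the ALLOWED_FLOWS table, and the role names in it are hypothetical. The generated objects mirror the structure of the policy above.

```python
import json

# Hypothetical table of allowed flows: (source role, destination role, TCP port).
ALLOWED_FLOWS = [
    ("backend", "frontend", 443),
    ("frontend", "api", 8080),
]


def build_policy(source_role, dest_role, port):
    """Return a CiliumNetworkPolicy dict allowing source_role -> dest_role on a TCP port."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": f"allow-{source_role}-to-{dest_role}"},
        "spec": {
            # The policy applies to endpoints carrying the destination role label.
            "endpointSelector": {"matchLabels": {"role": dest_role}},
            "ingress": [
                {
                    "fromEndpoints": [{"matchLabels": {"role": source_role}}],
                    # The port is written as a string, as in the policy above.
                    "toPorts": [{"ports": [{"port": str(port), "protocol": "TCP"}]}],
                }
            ],
        },
    }


policies = [build_policy(*flow) for flow in ALLOWED_FLOWS]

if __name__ == "__main__":
    # Wrap all policies in a v1 List so they can be applied in one step.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": policies}, indent=2))
```

The output can be piped to kubectl apply -f - so that the flow table, kept under version control, remains the single place where allowed traffic is declared.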

Challenges in Day 2 Operations

As highlighted by the Delivery Engineering team, creating clusters quickly is straightforward. Maintaining them, keeping them updated, and securing them after deployment is where the real operational challenges lie.

Key Takeaway: Building a robust CI/CD pipeline with automated tests and monitoring is crucial for successful Day 2 operations.

Example: CI/CD Pipeline for Cilium

Leveraging CI/CD tools can help automate deployments. Here is an example of a shell script to install Cilium in a Kubernetes cluster:

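#!/bin/bash
# Abort on the first failing command so a CI job does not report a false success
set -euo pipefail

# Ensure kubectl is configured to communicate with your cluster
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.10.0/install/kubernetes/quick-install.yaml

# Wait until the Cilium agent DaemonSet has finished rolling out
kubectl -n kube-system rollout status daemonset/cilium

# Verify the installation (requires the cilium CLI on the PATH)
cilium status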
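
In a pipeline, the verification step usually needs a pass/fail result rather than human-readable output. The sketch below shows one way to derive that result from the DaemonSet status. It assumes the agent DaemonSet is named cilium in the kube-system namespace and that the pipeline passes in the output of kubectl -n kube-system get daemonset cilium -o json. The daemonset_ready helper is hypothetical.

```python
import json


def daemonset_ready(manifest_json):
    """Return True when every scheduled pod of the DaemonSet reports ready."""
    status = json.loads(manifest_json).get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    # Zero desired pods is treated as "not ready": nothing has been scheduled yet.
    return desired > 0 and ready == desired


# Illustrative input, trimmed to the two status fields the check reads.
sample = '{"status": {"desiredNumberScheduled": 3, "numberReady": 2}}'
print(daemonset_ready(sample))  # False: one of three agents is not ready yet
```

A CI job can exit non-zero when the function returns False, which stops later stages from running against a partially deployed network layer.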

Monitoring and Metrics Collection

Proactive monitoring is essential for scalable systems. Cilium supports integration with popular monitoring tools like Prometheus and Grafana. Metrics collected can help in diagnosing issues early in large-scale environments.

Example: Exposing Metrics for Prometheus

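Make sure Cilium metrics are enabled. Below is a snippet for the cilium-config ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  enable-metrics: "true"
  # Address on which the agent serves Prometheus metrics; the port matches
  # the scrape target below. Verify the key against your Cilium version.
  prometheus-serve-addr: ":9090"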

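Then point Prometheus at the metrics endpoint. The snippet below uses a static target for brevity. In a real cluster, Prometheus service discovery (kubernetes_sd_configs) is the usual way to find every Cilium agent: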

scrape_configs:
  - job_name: 'cilium'
    static_configs:
      - targets: ['<CILIUM_METRICS_SERVICE>:9090']
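
To illustrate how scraped metrics help with early diagnosis, the sketch below sums one counter across all of its label sets from a metrics endpoint's text output. It is a simplified parser, not a replacement for Prometheus, and it assumes samples carry no trailing timestamp. The sum_metric helper is hypothetical, and the metric names and values in the sample are illustrative; the names exposed by a given Cilium version may differ.

```python
def sum_metric(exposition_text, metric_name):
    """Sum all samples of metric_name in Prometheus text exposition format."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Split on the last whitespace so label values containing spaces stay intact.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # not a "series value" pair
        series, value = parts
        name = series.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
    return total


# Illustrative sample, not captured from a real cluster.
sample = """\
# TYPE cilium_drop_count_total counter
cilium_drop_count_total{direction="INGRESS",reason="Policy denied"} 12
cilium_drop_count_total{direction="EGRESS",reason="Policy denied"} 3
cilium_endpoint_state{endpoint_state="ready"} 40
"""

print(sum_metric(sample, "cilium_drop_count_total"))  # 15.0
```

In practice the same aggregation is better expressed as a Prometheus query with an alert on its rate. The script form is mainly useful for quick checks against a single agent.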

Handling High Throughput

To serve high request volumes efficiently, load balancing has to be cheap per packet. Cilium’s native load balancing can run at the XDP (eXpress Data Path) layer, which processes packets in the network driver before they reach the rest of the kernel networking stack, improving throughput and lowering latency.

Example: L4 Load Balancing with Cilium

Implementing L4 load balancing for services can be achieved with the following configuration:

kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: my-app

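To inspect the service-to-backend mappings that Cilium has loaded into its BPF load-balancing maps, run the following against a Cilium agent pod:

kubectl -n kube-system exec ds/cilium -- cilium bpf lb list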

Resource Management

As workloads scale, resource consumption becomes a central concern. Because eBPF programs run in the kernel, Cilium avoids much of the per-packet overhead of long iptables rule chains, which helps keep the datapath's resource usage in check as the number of pods grows.

Example: Resource Allocation Policies

To effectively manage resources during high concurrency, Kubernetes resource requests and limits should be set:

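apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app-image
          resources:
            requests:
              memory: "128Mi"
              cpu: "500m"
            limits:
              memory: "256Mi"
              cpu: "1"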
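
Requests are what the scheduler reserves on a node, so the capacity a Deployment claims is the per-pod request multiplied by the replica count. The sketch below works through that arithmetic for the Deployment above. The helper names are hypothetical, and the parsers handle only the quantity forms used in this document (millicores or whole cores for CPU, Ki/Mi/Gi or plain bytes for memory).

```python
def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity such as "500m" or "1" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


# Binary suffixes used by Kubernetes memory quantities.
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity):
    """Convert a memory quantity with a Ki/Mi/Gi suffix (or plain bytes) to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def total_requests(replicas, cpu, memory):
    """Return (cores, bytes) reserved by all replicas together."""
    return replicas * parse_cpu(cpu), replicas * parse_memory(memory)


cpu_cores, mem_bytes = total_requests(5, "500m", "128Mi")
print(cpu_cores, mem_bytes // 2**20)  # 2.5 640
```

Five replicas therefore reserve 2.5 cores and 640Mi of memory. Running the same calculation before raising the replica count shows whether the cluster has enough schedulable capacity.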

Conclusion

Scaling Cilium for production requires careful consideration of network policies, CI/CD practices, monitoring, efficient load balancing, and comprehensive resource management. Following these guidelines allows developers to leverage Cilium’s full potential while maintaining control and reliability in large production environments.

Source Material:

Cilium Networking and Security for Containers with BPF and XDP - (src/posts/16-03-2017-cilium-networking-and-security-for-containers-with-bpf-and-xdp/index.md)
Scaling to 60k Pods - (src/posts/2019-04-29-cilium-15/index.md)
Scaling for the future with Cilium - (src/pages/use-cases/network-policy.jsx)
Metrics & Tracing Export - (src/pages/use-cases/metrics-export.jsx)
Multi Cluster Gaming Platform - (src/posts/2020-09-03-wildlife-studios-multi-cluster-gaming-platform/index.md)