This document outlines the production scaling strategies for the docker/genai-stack project. It focuses on code examples and technical configurations, serving as a practical guide to achieving scalability in production.
Overview
Scaling the docker/genai-stack in a production environment involves efficient resource management, container orchestration, and configuring services to handle increased loads. The following section describes individual services, Docker configurations, and deployment strategies to facilitate production scaling.
Docker Compose Configuration
The primary configuration for scaling the application in production is managed through the docker-compose.yml file. Each service can be replicated and monitored, ensuring load balancing and redundancy.
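For orientation, here is a minimal sketch of how such a file is organized; the image names and replica counts below are placeholders, and the authoritative definitions live in the project's docker-compose.yml:
services:
  database:
    image: neo4j:5.11
    networks:
      - net
  api:
    image: genai-stack-api   # placeholder; in practice this service is built from the repository
    networks:
      - net
    deploy:
      replicas: 2            # replica counts are honored when deployed to Docker Swarm
networks:
  net: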
Service Configuration
The key services are configured with scalability in mind. The settings most relevant to production scaling are described below:
- LLM Services: The llm and llm-gpu services can be configured to run on multiple instances, especially under GPU profiles for handling intensive computations, as shown below.
llm-gpu:
  <<: *llm
  profiles: ["linux-gpu"]
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
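Because llm-gpu is gated behind a Compose profile, it starts only when that profile is activated, for example with docker compose --profile linux-gpu up.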
Database Scaling
For databases, replication can be configured to ensure fault tolerance and high availability. The Neo4j instance can benefit from clustering (a Neo4j Enterprise Edition feature) for read-write scaling. Adjust the configuration to the deployment requirements, and use volume persistence so data survives container restarts.
database:
  user: neo4j:neo4j
  image: neo4j:5.11
  ports:
    - 7687:7687
    - 7474:7474
  volumes:
    - $PWD/data:/data
  environment:
    - NEO4J_AUTH=${NEO4J_USERNAME-neo4j}/${NEO4J_PASSWORD-password}
    - NEO4J_PLUGINS=["apoc"]
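For production persistence, a named volume is often preferable to the host bind mount shown above, since Docker then manages the data location. A minimal sketch; the volume name neo4j_data is illustrative:
database:
  image: neo4j:5.11
  volumes:
    - neo4j_data:/data

volumes:
  neo4j_data: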
Load Balancing Container Instances
Using Docker Swarm or Kubernetes is essential for managing multiple instances of services such as api, bot, and pdf_bot. Configure scaling and load balancing settings similar to the following example:
api:
  deploy:
    replicas: 3
  healthcheck:
    test: ["CMD-SHELL", "wget --no-verbose --tries=1 http://localhost:8504/ || exit 1"]
    interval: 5s
    timeout: 3s
    retries: 5
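Under Docker Swarm, the same deploy block can also control how replicas are rolled out and restarted. The following sketch uses standard Swarm deploy keys; the values are illustrative:
api:
  deploy:
    replicas: 3
    update_config:
      parallelism: 1      # update one replica at a time
      delay: 10s          # pause between update batches
    restart_policy:
      condition: on-failure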
Multi-Container Docker Networking
Utilize the defined network for inter-service communication. All services should be attached to the same network so they can reach each other, with service discovery handled by Docker's built-in DNS resolution.
networks:
  net:
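Each service joins the network by listing it under its own networks key, and Docker's built-in DNS then resolves service names such as the database host in neo4j://database:7687. A sketch:
services:
  api:
    networks:
      - net
  database:
    networks:
      - net

networks:
  net:
    driver: bridge   # use an overlay driver for multi-node Swarm clusters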
Environment Variable Management
Dynamic environment variables allow for easy adaptation across deployment environments such as staging and production. Use .env files or Docker secrets for sensitive information.
environment:
  - NEO4J_URI=${NEO4J_URI-neo4j://database:7687}
  - OPENAI_API_KEY=${OPENAI_API_KEY}
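For sensitive values such as the OpenAI key, Docker secrets avoid exposing credentials as plain environment variables. A minimal sketch, assuming the secret is stored in a local file (the path is illustrative):
services:
  api:
    secrets:
      - openai_api_key        # mounted at /run/secrets/openai_api_key inside the container

secrets:
  openai_api_key:
    file: ./secrets/openai_api_key.txt   # illustrative path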
Deployment Strategies
System Resource Allocation
Resource constraints such as memory and CPU limits can be defined in the docker-compose.yml file, allowing resources to be managed effectively without overwhelming the host machine.
deploy:
  resources:
    limits:
      cpus: '1.0'
      memory: 512M
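Alongside limits, reservations can guarantee a baseline share of resources for a service; the values below are illustrative and should be tuned to the workload:
deploy:
  resources:
    limits:
      cpus: '1.0'
      memory: 512M
    reservations:
      cpus: '0.25'
      memory: 256M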
Horizontal Scaling with Docker Swarm
To scale horizontally in a production environment, consider orchestrating services with Docker Swarm. Swarm does not auto-scale on its own, but a service can be scaled manually with the following command:
docker service scale <service_name>=<number_of_replicas>
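In Swarm mode the stack is first deployed from the Compose file, after which the scale command above adjusts replica counts at runtime. The stack name genai-stack below is illustrative:
docker stack deploy -c docker-compose.yml genai-stack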
Monitoring and Health Checks
Integrate monitoring solutions such as Prometheus and Grafana for insight into application health and performance. Health checks in the Docker configuration mark failing containers as unhealthy; under Docker Swarm, unhealthy tasks are replaced automatically.
healthcheck:
  test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider localhost:7474 || exit 1"]
  interval: 15s
  timeout: 30s
  retries: 10
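On the monitoring side, a Prometheus scrape configuration could target the stack's services. The job name and target below are assumptions; the services must actually expose a metrics endpoint for scraping to succeed:
# prometheus.yml (illustrative)
scrape_configs:
  - job_name: 'genai-stack'        # assumed job name
    scrape_interval: 15s
    static_configs:
      - targets: ['api:8504']      # assumes the api service exposes metrics on port 8504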
Conclusion
By following the configurations and strategies outlined in this document, developers can effectively scale the docker/genai-stack in a production environment. Ensure that testing precedes deployment to maintain reliability and performance under load.
Source: Docker documentation and related scaling resources.