This documentation outlines the monitoring strategies for the production environment of the `docker/genai-stack` project. It focuses on the health checks, resource management, and service dependencies that are crucial for ensuring the system operates effectively in production.
Health Checks
Health checks are crucial for monitoring the status of services deployed within the stack. They allow the orchestrator to verify if the services are running as expected and can restart them if they fail.
Example: API Service Health Check
Within `docker-compose.yml`, the `api` service includes a health check configuration:
```yaml
api:
  ...
  healthcheck:
    test: ["CMD-SHELL", "wget --no-verbose --tries=1 http://localhost:8504/ || exit 1"]
    interval: 5s
    timeout: 3s
    retries: 5
```
Here, the health check issues a simple HTTP GET request against port 8504 every 5 seconds, with each attempt timing out after 3 seconds. If it fails 5 consecutive times, Docker marks the container as unhealthy.
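If the API container needs time to boot before it can answer requests, a `start_period` can be added so that failures during startup do not count against the retry limit. This is a minimal sketch; the 30-second grace period is an assumption, not a value taken from the repository's compose file:

```yaml
api:
  ...
  healthcheck:
    test: ["CMD-SHELL", "wget --no-verbose --tries=1 http://localhost:8504/ || exit 1"]
    interval: 5s
    timeout: 3s
    retries: 5
    # Assumed grace period: failures during the first 30s are not counted
    # toward the retry limit, avoiding false "unhealthy" states at startup.
    start_period: 30s
```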
Example: Neo4j Database Health Check
The `database` service uses a similar approach:
```yaml
database:
  ...
  healthcheck:
    test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider localhost:7474 || exit 1"]
    interval: 15s
    timeout: 30s
    retries: 10
```
This probes the Neo4j HTTP interface on port 7474. The longer interval, timeout, and retry count accommodate the database's slower startup, and the resulting health status allows dependent services to wait until the database is actually available, maintaining operational integrity if it becomes unreachable.
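A port probe only confirms that the HTTP interface is listening. As a hypothetical deeper check, `cypher-shell` (bundled in the Neo4j image) could verify that the database actually answers queries. The credential environment variables below are assumptions and should match however authentication is configured in your compose file:

```yaml
database:
  ...
  healthcheck:
    # Hypothetical query-level check: succeeds only if Neo4j can execute Cypher.
    # NEO4J_USERNAME / NEO4J_PASSWORD are assumed to be set inside the container.
    test: ["CMD-SHELL", "cypher-shell -u $NEO4J_USERNAME -p $NEO4J_PASSWORD 'RETURN 1;' || exit 1"]
    interval: 15s
    timeout: 30s
    retries: 10
```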
Resource Management
Resource management ensures that each service has adequate compute resources, which is crucial for performance and stability in a production environment.
Example: GPU Resource Allocation
The `llm-gpu` service specifies GPU resources as follows:
```yaml
llm-gpu:
  <<: *llm
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
```
This configuration reserves all available NVIDIA GPUs for the container. Monitoring GPU utilization and memory is essential to confirm that the model running on this service is actually making effective use of the reserved hardware.
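Reserving every GPU is not always desirable on shared hosts. As a hedged sketch (the values are illustrative, not taken from the repository), the same block can reserve a single GPU and cap CPU and memory so the LLM container cannot starve its neighbours:

```yaml
llm-gpu:
  <<: *llm
  deploy:
    resources:
      limits:
        cpus: "4"        # illustrative cap; tune to the host
        memory: 16g      # illustrative cap; tune to the model size
      reservations:
        devices:
          - driver: nvidia
            count: 1     # reserve one GPU instead of all of them
            capabilities: [gpu]
```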
Service Dependencies
Declaring service dependencies controls startup order and ensures that a service only starts once its prerequisites are ready.
Example: Service Dependencies in Loader and Bot Services
Services can be made to depend on other services by using the `depends_on` directive:
```yaml
loader:
  ...
  depends_on:
    database:
      condition: service_healthy
    pull-model:
      condition: service_completed_successfully
```
The above configuration ensures that the `loader` service starts only after the `database` is healthy and the `pull-model` service has run to completion successfully. This approach reduces errors during startup and improves system reliability.
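The same pattern applies to the other application services. The sketch below shows how a `bot` service might declare the same dependencies, with a restart policy added as an assumption for production resilience; the actual definition in the repository may differ:

```yaml
bot:
  ...
  depends_on:
    database:
      condition: service_healthy
    pull-model:
      condition: service_completed_successfully
  # Assumed policy: restart the bot automatically if it crashes at runtime.
  restart: unless-stopped
```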
Monitoring and Logging
Integrating logging with your services provides visibility into their operational health. Each service can be configured to write logs that are then collected by a log aggregation stack such as ELK, while metrics are scraped separately by tools such as Prometheus.
Example: Container Logging
To emit logs to the console for debugging purposes, run the application as the container's foreground process in the Dockerfile or the service definition:
```dockerfile
# In one of the Dockerfiles, e.g., `api.Dockerfile`
CMD ["python", "api.py"]
```
This command sends logs to the container's standard output, where Docker captures them and external logging solutions can collect them.
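Unbounded container logs can fill the host's disk over time. The following hedged sketch shows how log rotation could be configured for a service with the standard `json-file` driver; the size and file-count values are assumptions to adjust for your environment:

```yaml
api:
  ...
  logging:
    driver: json-file
    options:
      max-size: "10m"   # rotate each log file at 10 MB (assumed value)
      max-file: "3"     # keep at most three rotated files per container (assumed value)
```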
Monitoring Strategies
Use of Monitoring Tools: Implement monitoring solutions such as Prometheus or Grafana to observe key metrics (CPU, memory, response times) across all containers; a minimal sketch of such an add-on follows this list.
Alerting Mechanisms: Set up alerting for critical metrics such as service downtime or response time spikes to facilitate quick resolutions to potential issues.
Log Aggregation: Centralize logs from all services, making it easier to analyze historical patterns for investigation and performance tuning.
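As a minimal, hypothetical sketch of the first strategy, container metrics could be exposed with cAdvisor, scraped by Prometheus, and visualized in Grafana by adding services alongside the existing stack. None of these services are part of docker/genai-stack, the image tags are illustrative, and `prometheus.yml` is an assumed scrape configuration you would provide yourself:

```yaml
cadvisor:
  # Exposes per-container CPU, memory, and I/O metrics on port 8080.
  image: gcr.io/cadvisor/cadvisor:v0.47.2
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro

prometheus:
  # Scrapes cAdvisor (and any other exporters) per the assumed prometheus.yml.
  image: prom/prometheus:latest
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
  ports:
    - "9090:9090"

grafana:
  # Dashboards and alerting on top of the Prometheus data source.
  image: grafana/grafana:latest
  ports:
    - "3000:3000"
  depends_on:
    - prometheus
```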
By combining these monitoring strategies with appropriate health checks, resource management, and dependency handling, the production environment for `docker/genai-stack` can be kept robust and reliable.
Source: docker/genai-stack documentation.