Database Schema

The genai-stack repository on GitHub (https://github.com/docker/genai-stack) is a Docker-based machine learning platform. It includes components for data processing, model training, and inference. One essential part of this system is the database schema, which defines how the platform's data is organized and stored.

What is a Database Schema?

A database schema is a blueprint that describes the logical structure of a database. It defines the components that make up the database, such as tables, columns, relationships, and indexes. In a graph database, the equivalent components are node labels, properties, relationship types, and indexes. In the context of genai-stack, the database schema describes how data related to machine learning tasks, model configurations, and experiment results is organized.

Why is the Database Schema important?

The database schema plays a crucial role in ensuring data consistency, security, and efficiency within the genai-stack platform. It enables developers and data scientists to:

  1. Store, retrieve, and manage large volumes of data related to machine learning tasks.
  2. Maintain data integrity by defining relationships between different data entities.
  3. Optimize query performance through indexing and efficient data organization.
  4. Ensure data security by defining access control and encryption policies.

Database Schema Overview

The genai-stack database schema is implemented using Neo4j, a graph database management system. Neo4j stores relationships between data entities as first-class records rather than deriving them through joins, which makes it a good fit for machine learning data where projects, datasets, models, and experiments are heavily interlinked.

Data Model

The Neo4j database schema for genai-stack includes the following node types:

  1. Project: Represents a machine learning project, including its name, description, and other metadata.
  2. Dataset: Represents a dataset used in a machine learning project, including its name, size, and location.
  3. Model: Represents a machine learning model, including its name, architecture, and training history.
  4. Experiment: Represents a single machine learning experiment, including its associated project, dataset, model, and training results.
  5. Metric: Represents a performance metric for a machine learning experiment, such as accuracy, loss, or F1 score.
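
The node types above can be sketched in plain Python to make their properties concrete. This is only an illustration under stated assumptions: the property names follow the descriptions in the list, but their exact names, types, and defaults are hypothetical, as is the helper to_cypher_create, which renders a parameterized Cypher CREATE statement for one node. It does not connect to Neo4j.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict, Tuple


# Property names and types below are assumptions based on the list above.
@dataclass
class Project:
    name: str
    description: str = ""


@dataclass
class Dataset:
    name: str
    size: int = 0        # hypothetical unit: bytes
    location: str = ""


@dataclass
class Model:
    name: str
    architecture: str = ""


@dataclass
class Experiment:
    name: str


@dataclass
class Metric:
    name: str            # e.g. "accuracy", "loss", "f1"
    value: float = 0.0


def to_cypher_create(node: Any) -> Tuple[str, Dict[str, Any]]:
    """Render a parameterized CREATE statement; the class name is the label."""
    label = type(node).__name__
    props = asdict(node)
    # Use $parameters instead of inlining values into the query text.
    assignments = ", ".join(f"{key}: ${key}" for key in props)
    return f"CREATE (n:{label} {{{assignments}}})", props


statement, params = to_cypher_create(Metric("accuracy", 0.93))
print(statement)
print(params)
```

Keeping values in a separate parameter dictionary, rather than formatting them into the statement, is the usual way to avoid quoting problems when the statement is later sent to a database.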

Relationships

The relationships between nodes in the genai-stack database schema include:

  1. (Project)-[:HAS]->(Dataset): A project can have multiple datasets.
  2. (Project)-[:HAS]->(Model): A project can have multiple models.
  3. (Model)-[:TRAINED_ON]->(Dataset): A model is trained on a specific dataset.
  4. (Experiment)-[:RUN]->(Model): An experiment uses a specific model for training.
  5. (Experiment)-[:HAS]->(Metric): An experiment produces multiple metrics.
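
The relationship list above can be written down as data, which makes it easy to print each entry as a Cypher pattern or to check whether a proposed relationship fits the schema. The sketch below is a minimal illustration: the relationship direction (source first, target second) is an assumption read off the list, and the names SCHEMA, as_pattern, and is_allowed are hypothetical helpers, not part of genai-stack or Neo4j.

```python
from typing import List, Tuple

# (source label, relationship type, target label), taken from the list above.
# The direction of each relationship is an assumption.
SCHEMA: List[Tuple[str, str, str]] = [
    ("Project", "HAS", "Dataset"),
    ("Project", "HAS", "Model"),
    ("Model", "TRAINED_ON", "Dataset"),
    ("Experiment", "RUN", "Model"),
    ("Experiment", "HAS", "Metric"),
]


def as_pattern(source: str, rel_type: str, target: str) -> str:
    """Format one schema entry as a Cypher path pattern."""
    return f"(:{source})-[:{rel_type}]->(:{target})"


def is_allowed(source: str, rel_type: str, target: str) -> bool:
    """Return True if the schema lists this exact directed relationship."""
    return (source, rel_type, target) in SCHEMA


for entry in SCHEMA:
    print(as_pattern(*entry))
```

Because the HAS type is shared by three different label pairs, a query that matches on HAS alone would mix datasets, models, and metrics; the node labels on both ends are what disambiguate it.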

Exploring the Database

To explore the genai-stack database schema using the Neo4j client, follow these steps:

  1. Install and start the Neo4j server (https://neo4j.com/docs/operations-manual/current/installation/installing-neo4j/).
  2. Create a new database and import the Cypher queries from the genai-stack repository (tree/main/db/cypher).
  3. Use the Neo4j Browser (https://neo4j.com/developer/tools-browser/) or a Cypher client to execute queries against the database.

For example, to retrieve all projects and their associated datasets, execute the following Cypher query:

    MATCH (p:Project)-[:HAS]->(d:Dataset)
    RETURN p.name AS project_name, d.name AS dataset_name

This query returns one row for each project-dataset pair, containing the project name and the dataset name. Projects that have no datasets do not appear in the result.
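
To show what the query matches without a running Neo4j instance, the following sketch evaluates the same pattern over a tiny in-memory graph. Everything in it is an assumption made for illustration: the sample nodes and edges are invented, and projects_with_datasets is a hypothetical stand-in for the Cypher query, not how Neo4j executes it.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical sample graph: node id -> (label, properties).
nodes: Dict[int, Tuple[str, Dict[str, Any]]] = {
    1: ("Project", {"name": "churn-prediction"}),
    2: ("Dataset", {"name": "customers-2023"}),
    3: ("Dataset", {"name": "support-tickets"}),
    4: ("Model", {"name": "baseline"}),
}

# Directed edges: (start node id, relationship type, end node id).
edges: List[Tuple[int, str, int]] = [
    (1, "HAS", 2),
    (1, "HAS", 3),
    (1, "HAS", 4),  # Project-HAS->Model, must not show up in the result
]


def projects_with_datasets(nodes, edges) -> List[Dict[str, str]]:
    """Mimic MATCH (p:Project)-[:HAS]->(d:Dataset) RETURN p.name, d.name."""
    rows = []
    for start, rel_type, end in edges:
        start_label, start_props = nodes[start]
        end_label, end_props = nodes[end]
        # Labels on both ends and the relationship type must all match.
        if (start_label, rel_type, end_label) == ("Project", "HAS", "Dataset"):
            rows.append({
                "project_name": start_props["name"],
                "dataset_name": end_props["name"],
            })
    return rows


for row in projects_with_datasets(nodes, edges):
    print(row)
```

The Project-to-Model edge is skipped because its end node is not labeled Dataset. In Neo4j itself the row order is not guaranteed unless the query adds an ORDER BY clause.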

For more information on the genai-stack database schema and Neo4j, refer to the following resources:

Sources:
  - <https://neo4j.com/docs/operations-manual/current/intro/what-is-neo4j/>
  - <https://neo4j.com/docs/operations-manual/current/schema/schema-introduction/>
  - <https://neo4j.com/docs/operations-manual/current/schema/schema-data-modeling/>
  - <https://neo4j.com/docs/operations-manual/current/schema/schema-relationships/>
  - <https://neo4j.com/docs/operations-manual/current/schema/schema-import/>
  - <https://neo4j.com/docs/operations-manual/current/cypher/intro/>