Database Schema @ sourcegraph/zoekt

What is a Database Schema?

The database schema defines the structure of a database. It specifies the tables, columns, data types, relationships, and constraints that govern the organization and integrity of the data. [Source: https://en.wikipedia.org/wiki/Database_schema]

Why is a Database Schema important?

A well-defined database schema is essential for several reasons:

Data Organization: It provides a clear structure for storing and retrieving data, making it easier to understand and manage.
Data Integrity: Constraints within the schema ensure data consistency and validity, preventing errors and maintaining data quality.
Query Optimization: Declared keys, indexes, and data types let the database plan and execute queries efficiently, which shortens retrieval times.
Data Modeling: It serves as a blueprint for designing the database, enabling clear communication and collaboration among developers.
Code Maintainability: A consistent schema makes code easier to write, read, and maintain, reducing the likelihood of errors.

Database Schema for Zoekt

Zoekt does not keep its index in a relational database such as PostgreSQL. Indexed data is stored in shard files on disk, described under Index Format below. The entries below are therefore best read as a conceptual view of the kinds of data an index holds, designed to make searching and indexing of code repositories efficient, rather than as literal tables:

Repositories: Metadata about each repository, such as its name, URL, and last indexing time.
Files: Information about individual files within a repository, such as the file path and content.
Trigrams: The index of trigrams (three-character sequences) found in file content and file names.
Branches: The branches on which a specific file exists.
Symbols: Symbols detected in file content, such as function and variable names, together with their locations.

Together, these kinds of data allow relevant results to be retrieved efficiently. For example, a search for a specific term first looks up the term's trigrams to find candidate files, confirms the match against the file content, and then uses the associated file and branch data to return the relevant code.

Index Format

Zoekt’s index is organized into shards. Each shard is a single file that holds the data for one code repository and contains the following:

File contents
File names
Content posting lists (varint encoded)
Filename posting lists (varint encoded)
Branch masks
Metadata (repository name, index format version, etc.)
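To illustrate what a varint-encoded posting list can look like, here is a minimal sketch using Go's standard `encoding/binary` package. It assumes the offsets are sorted and stored as gaps between consecutive values (delta encoding), which is a common companion to varint encoding but is an assumption here, not something stated above. The function names are hypothetical, and the byte layout is not claimed to match Zoekt's shard format.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodePostings writes a sorted list of offsets as varint-encoded gaps.
// Small gaps take one byte each instead of four.
func encodePostings(offsets []uint32) []byte {
	var buf []byte
	var tmp [binary.MaxVarintLen32]byte
	var prev uint32
	for _, off := range offsets {
		n := binary.PutUvarint(tmp[:], uint64(off-prev))
		buf = append(buf, tmp[:n]...)
		prev = off
	}
	return buf
}

// decodePostings reverses encodePostings by summing the decoded gaps.
func decodePostings(data []byte) ([]uint32, error) {
	var out []uint32
	var prev uint32
	for len(data) > 0 {
		delta, n := binary.Uvarint(data)
		if n <= 0 {
			return nil, errors.New("malformed varint in posting list")
		}
		prev += uint32(delta)
		out = append(out, prev)
		data = data[n:]
	}
	return out, nil
}

func main() {
	offsets := []uint32{3, 130, 131, 70000}
	encoded := encodePostings(offsets)
	decoded, err := decodePostings(encoded)
	if err != nil {
		panic(err)
	}
	// Four uint32 values would take 16 bytes unencoded.
	fmt.Println(len(encoded), decoded) // 6 [3 130 131 70000]
}
```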

A shard is roughly 3x the size of the corpus it indexes and should be kept below 4 GiB. The limit exists because offsets within a shard are stored as uint32 values, which can address at most 2^32 bytes.
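The arithmetic behind this limit can be checked with a short sketch. The helper names below are hypothetical, and the 3x factor is only the rule of thumb quoted above, so the result is an estimate rather than a guarantee.

```go
package main

import (
	"fmt"
	"math"
)

// maxShardBytes is the largest offset a uint32 can represent.
const maxShardBytes = math.MaxUint32

// estimateShardBytes applies the rough "3x the corpus" rule of thumb.
func estimateShardBytes(corpusBytes uint64) uint64 {
	return 3 * corpusBytes
}

// fitsInOneShard reports whether the estimated shard stays addressable
// with uint32 offsets.
func fitsInOneShard(corpusBytes uint64) bool {
	return estimateShardBytes(corpusBytes) <= maxShardBytes
}

func main() {
	const gib = 1 << 30
	fmt.Println(fitsInOneShard(1 * gib)) // true: about 3 GiB estimated
	fmt.Println(fitsInOneShard(2 * gib)) // false: about 6 GiB estimated
}
```

Under this estimate, a corpus larger than about 1.33 GiB would not fit in a single shard.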

Ranking

Zoekt implements a ranking system to prioritize search results based on relevance. The following factors influence the ranking:

Number of matched atoms
Closeness of matches for different atoms
Quality of match (e.g., word boundary)
Time of the file's latest update
File name length
Tokenizer ranking (e.g., match in comments or string literals)
Symbol ranking (e.g., match on symbol definition)
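As a rough illustration of how such factors might be combined, the sketch below folds them into a single score and sorts candidates by it. Everything here is an assumption made for the example: the `candidate` fields, the weights, and the additive formula are hypothetical and do not reflect Zoekt's actual scoring.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate holds hypothetical per-file features mirroring the factors above.
type candidate struct {
	file          string
	atomsMatched  int  // number of query atoms that matched
	atomDistance  int  // bytes between matches of different atoms
	wordBoundary  bool // match starts and ends on a word boundary
	daysSinceEdit int  // age of the file's latest update
	inComment     bool // match lies in a comment or string literal
	symbolDef     bool // match is on a symbol definition
}

// score combines the features with made-up weights; higher is better.
func score(c candidate) float64 {
	s := 10.0 * float64(c.atomsMatched)
	s += 5.0 / float64(1+c.atomDistance) // closer matches score higher
	if c.wordBoundary {
		s += 3
	}
	if c.symbolDef {
		s += 4
	}
	if c.inComment {
		s -= 2
	}
	s += 1.0 / float64(1+c.daysSinceEdit) // recently updated files score higher
	s += 1.0 / float64(1+len(c.file))     // shorter file names score higher
	return s
}

// rank returns a copy of cs ordered from best to worst score.
func rank(cs []candidate) []candidate {
	out := append([]candidate(nil), cs...)
	sort.SliceStable(out, func(i, j int) bool { return score(out[i]) > score(out[j]) })
	return out
}

func main() {
	ranked := rank([]candidate{
		{file: "util.go", atomsMatched: 1, inComment: true, daysSinceEdit: 30},
		{file: "server.go", atomsMatched: 1, wordBoundary: true, symbolDef: true, daysSinceEdit: 30},
	})
	for _, c := range ranked {
		fmt.Println(c.file)
	}
}
```

With these weights, the word-boundary match on a symbol definition in `server.go` ranks ahead of the comment match in `util.go`.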

Query Language

Zoekt uses an expression tree to represent search queries. The tree consists of various atoms, including:

ConstQuery: Matches everything or nothing, useful for partial query evaluation.
SubStringQuery: Matches substrings within file content or filenames.
RegexpQuery: Matches regular expressions.
RepoQuery: Matches a specific repository.
BranchQuery: Matches a specific branch.

Queries can be combined using logical operators like AND, OR, and NOT.

Gerrit/Gitiles Integration

Zoekt can be integrated with Gerrit (code review) and Gitiles (repository browsing). Gitiles sends search queries to Zoekt in a way that respects Gerrit’s ACLs. Zoekt returns the search results to Gitiles, which applies any further filtering based on the user's permissions.
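The final filtering step can be pictured with a small sketch. The `result` and `acl` types and the `filterVisible` function are hypothetical and exist only for this illustration. They are not part of Zoekt, Gerrit, or Gitiles, and real ACL evaluation in Gerrit is considerably more involved.

```go
package main

import "fmt"

// result is a hypothetical search hit returned by the search backend.
type result struct {
	repo, branch, file string
}

// acl maps a repository name to the set of branches the user may read.
type acl map[string]map[string]bool

// filterVisible keeps only the results on branches the user can see.
func filterVisible(results []result, visible acl) []result {
	var out []result
	for _, r := range results {
		// Looking up a missing repository yields a nil map, which
		// reports false for every branch.
		if visible[r.repo][r.branch] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	visible := acl{"zoekt": {"main": true}}
	results := []result{
		{"zoekt", "main", "index.go"},
		{"zoekt", "private", "secret.go"},
		{"internal", "main", "keys.go"},
	}
	for _, r := range filterVisible(results, visible) {
		fmt.Println(r.repo, r.branch, r.file)
	}
}
```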

Top-Level Directory Explanations

doc/ - This directory contains documentation for the project.
