Indexing Strategies - sourcegraph/zoekt

This document describes indexing strategies for Zoekt, the code search engine developed in the sourcegraph/zoekt repository. It covers the available indexing options and gives an example for each.

Key Technologies and Dependencies

The following technologies and dependencies are used in Zoekt:

  • Go programming language
  • gRPC
  • Protocol Buffers
  • Go standard library
  • go-ctags
  • go-cmp
  • Slothfs
  • Grafana regexp
  • Jaeger

Indexing Strategies

Zoekt separates writing an index from serving it. The upstream repository ships command-line indexers such as zoekt-index (for directories) and zoekt-git-index (for Git repositories). In the deployment described here, the work is split across two webservers:

  1. gitlab-zoekt-indexer: A webserver, maintained by GitLab on top of Zoekt, that indexes Git repositories and writes the .zoekt index files.
  2. zoekt-webserver: A webserver that answers searches by reading the .zoekt index files.

The .zoekt index files are stored on an SSD for fast searches. Both webservers must run on the same node because they access the same files.
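
Because both servers work on the same directory, the indexer has to publish each index file so that a concurrent reader never sees a half-written file. A common technique, sketched below in Python, is to write to a temporary file in the same directory and then rename it into place. The function names and the .tmp suffix are assumptions made for illustration; this is not Zoekt's actual implementation.

```python
import os
import tempfile
from pathlib import Path


def publish_shard(index_dir: Path, name: str, data: bytes) -> Path:
    """Write a shard so that readers only ever see complete files."""
    index_dir.mkdir(parents=True, exist_ok=True)
    # Create the temporary file in the same directory so the final rename
    # stays on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=index_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        final_path = index_dir / f"{name}.zoekt"
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the partial file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


def list_shards(index_dir: Path) -> list[Path]:
    """Return the published shards, ignoring in-progress temporary files."""
    return sorted(index_dir.glob("*.zoekt"))
```

A reader that only lists *.zoekt files, as list_shards does, never picks up a shard that is still being written.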

Indexing Options

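Zoekt can index one project or many. The options below use gitlab-zoekt-indexer; the flag names in the examples are illustrative and should be confirmed against the help output of the installed version: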

  • Indexing a single project: This is the simplest form of indexing, where a single project is indexed using the gitlab-zoekt-indexer webserver.

Example:

gitlab-zoekt-indexer -repo=<path_to_repo> -index=<path_to_index>

  • Indexing multiple projects: Zoekt supports indexing multiple projects using the -repo flag with multiple paths or a directory containing multiple repositories.

Example:

gitlab-zoekt-indexer -repo=<path_to_repo1>,<path_to_repo2> -index=<path_to_index>

  • Indexing a GitLab project: Zoekt can index a GitLab project using the -project flag with the project ID.

Example:

gitlab-zoekt-indexer -project=<project_id> -index=<path_to_index>

  • Indexing a GitLab group: Zoekt can index a GitLab group using the -group flag with the group ID.

Example:

gitlab-zoekt-indexer -group=<group_id> -index=<path_to_index>
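
When a directory contains many repositories, the list of repositories to index has to be gathered first. The Python sketch below walks a root directory and returns every directory that looks like a Git working tree or a bare repository. It is a hypothetical helper, not part of Zoekt, and its detection rules (a .git entry, or a HEAD file next to an objects directory) are simplifying assumptions.

```python
import os
from pathlib import Path


def is_git_repo(path: Path) -> bool:
    """Heuristic: a working tree has .git; a bare repo has HEAD and objects/."""
    if (path / ".git").exists():
        return True
    return (path / "HEAD").is_file() and (path / "objects").is_dir()


def find_repos(root: Path) -> list[Path]:
    """Return repositories under root, without descending into them."""
    repos = []
    for dirpath, dirnames, _ in os.walk(root):
        current = Path(dirpath)
        if is_git_repo(current):
            repos.append(current)
            # Prune the walk so nested repositories are not reported.
            dirnames.clear()
    return sorted(repos)
```

The resulting paths could then be handed to an indexer one at a time or joined into a single argument, depending on what the tool accepts.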

Optimizing Indexing

To optimize indexing for various scenarios, including large codebases, the following strategies can be used:

  • Sharding: Indexes are sharded by top-level group, so a group search only ever has to query a single Zoekt server.
  • Replication: Indexes can be replicated across nodes for high availability and scalability.
  • Locking mechanism: A lock ensures that a given project is indexed in only one place at a time.
  • De-duplication: Indexing requests are de-duplicated by project_id so that the same project is not indexed more than once.
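
The sharding, locking, and de-duplication rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scheduler used by any real deployment: the names shard_for and IndexScheduler are hypothetical, and hashing the top-level group name is just one possible way to assign groups to shards.

```python
import hashlib
import threading


def shard_for(project_path: str, num_shards: int) -> int:
    """Map a project to a shard using only its top-level group."""
    top_level_group = project_path.strip("/").split("/", 1)[0]
    digest = hashlib.sha256(top_level_group.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class IndexScheduler:
    """Queue of pending projects, de-duplicated by project_id."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._pending: list[int] = []
        self._in_progress: set[int] = set()

    def enqueue(self, project_id: int) -> bool:
        """Add a project unless it is already queued; return True if added."""
        with self._guard:
            if project_id in self._pending:
                return False
            self._pending.append(project_id)
            return True

    def start_next(self):
        """Pick the next queued project that is not currently being indexed."""
        with self._guard:
            for i, project_id in enumerate(self._pending):
                if project_id not in self._in_progress:
                    del self._pending[i]
                    self._in_progress.add(project_id)
                    return project_id
            return None

    def finish(self, project_id: int) -> None:
        """Release the per-project lock after indexing completes."""
        with self._guard:
            self._in_progress.discard(project_id)
```

Because shard_for looks only at the first path segment, every project in a group, including those in subgroups, lands on the same shard. A project that is re-queued while it is being indexed waits until finish is called.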

Indexing Configuration

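Zoekt can index several branches of a repository into a single index. The upstream zoekt-git-index tool accepts a comma-separated -branches flag; -tags is not among its documented flags, so verify the -tags flag in the example below against the indexer's help output before relying on it.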

Example:

gitlab-zoekt-indexer -repo=<path_to_repo> -index=<path_to_index> -branches=master,develop -tags=v1.0,v2.0

Searching

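Zoekt exposes searches over HTTP at /api/search. A query can be a substring (including a prefix) or a regular expression, and it can be narrowed with filters such as file: or repo:. The server reads the index location from its own configuration, so the request body carries only the query.

Example:

curl -X POST "http://zoekt-webserver/api/search" -d '{"Q":"main"}'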
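
The same request can be issued from a program. The Python sketch below builds the POST request separately from sending it, which makes the request easy to inspect. The JSON field name Q, the address localhost:6070, and the need to start zoekt-webserver with its -rpc flag are assumptions based on the upstream README's examples; check them against the server version in use.

```python
import json
import urllib.request


def build_search_request(base_url: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the JSON search endpoint."""
    # "Q" is the assumed name of the query field in the request body.
    body = json.dumps({"Q": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url: str, query: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response."""
    request = build_search_request(base_url, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling search("http://localhost:6070", "main") would return the decoded response as a dictionary, provided a server is listening at that assumed address.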

Serialized Data

Zoekt serializes each index to compact binary .zoekt files. Requests and responses exchanged with the servers are encoded as JSON or Protocol Buffers.

Example of a JSON payload (the field names are illustrative):

{
  "files": [
    {
      "path": "path/to/file",
      "content": "content of the file"
    }
  ]
}

Query Tuning

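Zoekt answers queries from a trigram index, so a query is tuned mainly by making it more selective, for example by adding file:, repo:, or lang: filters or by using a longer literal. Data skipping indexes and compound primary indexes are features of column-oriented databases rather than of Zoekt, and the corresponding flags in the command below are not confirmed Zoekt options.

Example:

gitlab-zoekt-indexer -repo=<path_to_repo> -index=<path_to_index> -data-skipping-index=true -compound-primary-index=true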
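
To see why more selective queries are cheaper, it helps to look at how a trigram index narrows a search: the sets of documents containing each three-character piece of the query are intersected, and only the surviving candidates are scanned. The Python sketch below illustrates that general technique only. It is not Zoekt's implementation, which also records positions and works on compact on-disk data, and all names in it are hypothetical.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return every three-character substring of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self) -> None:
        self._docs: dict[str, str] = {}
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, name: str, content: str) -> None:
        """Store a document and record which trigrams it contains."""
        self._docs[name] = content
        for gram in trigrams(content):
            self._postings[gram].add(name)

    def candidates(self, query: str) -> set[str]:
        """Documents that contain every trigram of the query."""
        grams = trigrams(query)
        if not grams:
            # Queries shorter than three characters cannot be narrowed.
            return set(self._docs)
        sets = [self._postings.get(g, set()) for g in grams]
        return set.intersection(*sets)

    def search(self, query: str) -> list[str]:
        """Verify each candidate with a real substring check."""
        return sorted(n for n in self.candidates(query) if query in self._docs[n])
```

A longer literal produces more trigrams and therefore a smaller candidate set, while a query shorter than three characters falls back to scanning every document.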

EXPLAIN Plans

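Zoekt has no SQL-style EXPLAIN statement. To see how a query was executed, inspect the statistics returned with each search result, such as the number of files considered and the time taken. The explain=true parameter in the request below is not a confirmed Zoekt option.

Example:

curl -X POST "http://zoekt-webserver/api/search?explain=true" -d '{"Q":"main"}'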

Conclusion

Zoekt indexes code into .zoekt files that zoekt-webserver reads to answer searches. Sharding by top-level group, replication, locking, and de-duplication keep indexing reliable as the number of projects grows. Together, gitlab-zoekt-indexer and zoekt-webserver can index and search large codebases efficiently.
