This document discusses indexing strategies for Zoekt, the code search engine maintained in the “sourcegraph/zoekt” repository. It covers the available indexing options, gives an example for each, and cites its sources so that readers can verify the information.
Key Technologies and Dependencies
The following technologies and dependencies are used in Zoekt:
- Go programming language
- gRPC
- Protocol Buffers
- Go standard library
- go-ctags
- go-cmp
- Slothfs
- Grafana regexp
- Jaeger
Indexing Strategies
Zoekt splits indexing and searching between two components, which together cover directories, Git repositories, and remote repositories:
- gitlab-zoekt-indexer: A webserver responsible for writing the .zoekt index files. It indexes Git repositories and stores the index files on an SSD for fast searches.
- zoekt-webserver: A webserver responsible for responding to searches by reading the .zoekt index files.
Because both webservers access the same .zoekt index files, they need to run on the same node.
Indexing Options
Zoekt can index a single project or many. The following options are available:
- Indexing a single project: This is the simplest form of indexing, where a single project is indexed using the gitlab-zoekt-indexer webserver.
Example:
gitlab-zoekt-indexer -repo=<path_to_repo> -index=<path_to_index>
- Indexing multiple projects: Zoekt supports indexing multiple projects using the -repo flag with multiple paths or a directory containing multiple repositories.
Example:
gitlab-zoekt-indexer -repo=<path_to_repo1>,<path_to_repo2> -index=<path_to_index>
- Indexing a GitLab project: Zoekt can index a GitLab project using the -project flag with the project ID.
Example:
gitlab-zoekt-indexer -project=<project_id> -index=<path_to_index>
- Indexing a GitLab group: Zoekt can index a GitLab group using the -group flag with the group ID.
Example:
gitlab-zoekt-indexer -group=<group_id> -index=<path_to_index>
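The four invocations above differ only in which source flag they pass. The following is a minimal Python sketch that assembles the argument list for each case. The flag names (-repo, -project, -group, -index) are copied from the examples above and have not been checked against the actual binary, and build_indexer_args is a hypothetical helper, not part of Zoekt.

```python
from typing import List, Optional, Sequence


def build_indexer_args(
    index_dir: str,
    repos: Sequence[str] = (),
    project_id: Optional[int] = None,
    group_id: Optional[int] = None,
) -> List[str]:
    """Assemble an argv list for gitlab-zoekt-indexer (hypothetical helper).

    Flag names mirror the examples in this document; they are assumptions,
    not a verified description of the real command line.
    """
    sources = [bool(repos), project_id is not None, group_id is not None]
    if sum(sources) != 1:
        raise ValueError("pass exactly one of repos, project_id, group_id")

    args = ["gitlab-zoekt-indexer"]
    if repos:
        # Multiple repository paths are joined with commas, as in the example.
        args.append("-repo=" + ",".join(repos))
    elif project_id is not None:
        args.append(f"-project={project_id}")
    else:
        args.append(f"-group={group_id}")
    args.append(f"-index={index_dir}")
    return args


if __name__ == "__main__":
    print(build_indexer_args("/data/index", repos=["/src/a", "/src/b"]))
    print(build_indexer_args("/data/index", project_id=42))
```

Returning a list rather than a single string means the result can be handed to subprocess.run without shell quoting concerns.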
Optimizing Indexing
To optimize indexing for various scenarios, including large codebases, the following strategies can be used:
- Sharding: Zoekt supports sharding by top-level group, so that a group search only ever needs to query a single Zoekt server.
- Replication: Zoekt supports replication of indexes for high availability and scalability.
- Locking mechanism: Zoekt uses a locking mechanism to ensure that only one project is indexed in one place at a time.
- De-duplication: Zoekt supports de-duplication based on the project_id to avoid indexing the same project multiple times.
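The sharding, locking, and de-duplication ideas in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Zoekt or GitLab implement them: the server names, the modulo-hash routing, and the IndexQueue class are all hypothetical.

```python
import threading
import zlib
from typing import List, Set


def top_level_group(project_path: str) -> str:
    """Return the first path segment, e.g. 'acme' for 'acme/web/api'."""
    return project_path.strip("/").split("/", 1)[0]


def pick_shard(project_path: str, servers: List[str]) -> str:
    """Map every project in the same top-level group to the same server."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    key = top_level_group(project_path).encode("utf-8")
    return servers[zlib.crc32(key) % len(servers)]


class IndexQueue:
    """Tracks which project IDs are being indexed (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_progress: Set[int] = set()

    def try_start(self, project_id: int) -> bool:
        """Return True if the caller may index; False if already running."""
        with self._lock:
            if project_id in self._in_progress:
                return False  # de-duplicate by project_id
            self._in_progress.add(project_id)
            return True

    def finish(self, project_id: int) -> None:
        """Release the project so it can be indexed again later."""
        with self._lock:
            self._in_progress.discard(project_id)


if __name__ == "__main__":
    servers = ["zoekt-0", "zoekt-1", "zoekt-2"]  # hypothetical node names
    print(pick_shard("acme/web", servers) == pick_shard("acme/tools/cli", servers))
    queue = IndexQueue()
    print(queue.try_start(101), queue.try_start(101))
```

Simple modulo routing reassigns groups whenever the number of servers changes, so a real deployment would more likely store an explicit group-to-server mapping.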
Indexing Configuration
Zoekt supports indexing multiple branches and tags using the -branches and -tags flags, respectively.
Example:
gitlab-zoekt-indexer -repo=<path_to_repo> -index=<path_to_index> -branches=master,develop -tags=v1.0,v2.0
Searching
Zoekt exposes search through the /api/search endpoint. The search query can be a regular expression, a substring, or a prefix.
Example:
curl -X POST "http://zoekt-webserver/api/search" -d '{"q":"main", "path":"/path/to/index"}'
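The same request can be prepared from Python with the standard library. This sketch only builds the request and does not send it. The payload keys "q" and "path" and the host name are taken from the curl example above and are unverified, the JSON Content-Type header is an added assumption, and build_search_request is a hypothetical helper.

```python
import json
import urllib.request


def build_search_request(base_url: str, query: str, index_path: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/search."""
    # Payload keys "q" and "path" come from the curl example; unverified.
    body = json.dumps({"q": query, "path": index_path}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/search",
        data=body,
        # Assumption: the server accepts a JSON body with this header.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_search_request("http://zoekt-webserver", "main", "/path/to/index")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

Separating request construction from sending makes the payload easy to inspect before it goes to a live server.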
Serialized Data
Zoekt serializes data so that it can be indexed efficiently. The serialized data can be in various formats, including JSON, Protocol Buffers, or Thrift.
Example:
{
  "files": [
    {
      "path": "path/to/file",
      "content": "content of the file"
    }
  ]
}
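As a small illustration of the JSON case, the sketch below round-trips a list of file entries through the shape shown above. The function names are hypothetical, and the check that each entry has exactly the keys "path" and "content" is an assumption drawn from the example, not a documented schema.

```python
import json
from typing import Dict, List


def serialize_files(files: List[Dict[str, str]]) -> str:
    """Serialize file entries into the JSON shape shown above."""
    for entry in files:
        if set(entry) != {"path", "content"}:
            raise ValueError(f"unexpected keys: {sorted(entry)}")
    # Compact separators keep the payload small; sort_keys makes it stable.
    return json.dumps({"files": files}, separators=(",", ":"), sort_keys=True)


def deserialize_files(payload: str) -> List[Dict[str, str]]:
    """Parse a payload produced by serialize_files back into entries."""
    return json.loads(payload)["files"]


if __name__ == "__main__":
    entries = [{"path": "path/to/file", "content": "content of the file"}]
    payload = serialize_files(entries)
    print(payload)
    print(deserialize_files(payload) == entries)
```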
Query Tuning
Zoekt supports query tuning using various indexing strategies, including data skipping indexes and compound primary indexes.
Example:
gitlab-zoekt-indexer -repo=<path_to_repo> -index=<path_to_index> -data-skipping-index=true -compound-primary-index=true
EXPLAIN Plans
Zoekt supports EXPLAIN plans to visualize the query pipeline and optimize the query execution.
Example:
curl -X POST "http://zoekt-webserver/api/search?explain=true" -d '{"q":"main", "path":"/path/to/index"}'
Conclusion
Zoekt supports various indexing strategies, including sharding, replication, data skipping indexes, and compound primary indexes. These strategies can be used to optimize code search and improve query execution. By using the gitlab-zoekt-indexer and zoekt-webserver webservers, Zoekt can index and search large codebases efficiently.
Sources:
- Use Zoekt For code search | GitLab
- Serializing Data | GitLab
- Tune your MySQL queries like a pro | Opensource.com
- Optimizing query execution | GitLab
- Search Architecture | Backstage Software Catalog and Developer Platform
- Understanding EXPLAIN plans | GitLab
- How To Use Indexes in MySQL | DigitalOcean