Indexing @ sourcegraph/zoekt

What is Indexing?

Indexing is the process of building a lookup structure over a corpus of data, such as source code, so that text searches can be answered quickly without scanning every file. Zoekt is an open-source, fast text search engine designed specifically for source code.

Why is Indexing important?

An index makes it faster to find the specific code you need. With it, Zoekt can run sub-second searches over large codebases, filter results by repository or branch, and answer approximate queries expressed as regular expressions.

Indexing Options and Examples

Positional Trigrams

Zoekt uses positional trigram indexing: for every n-gram of length 3 (trigram) in a file, it stores the offsets at which that trigram occurs. For example, if the corpus is “banana”, the index would be “ban”: 0, “ana”: 1,3, and “nan”: 2. To search for a string like “The quick brown fox”, Zoekt looks up two trigrams from the query, for example “The” and “fox”, and checks that they occur the right distance apart (16 characters in this case).
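The lookup described above can be sketched in a few lines of Go. This is a minimal illustration, not Zoekt's implementation: the helper names buildIndex and candidates are hypothetical, the index is a plain in-memory map, and a real engine would still verify each candidate against the file contents.

```go
package main

import "fmt"

// buildIndex maps every trigram in corpus to the rune offsets where it starts.
// Hypothetical helper for illustration only.
func buildIndex(corpus string) map[string][]int {
	runes := []rune(corpus)
	index := make(map[string][]int)
	for i := 0; i+3 <= len(runes); i++ {
		tri := string(runes[i : i+3])
		index[tri] = append(index[tri], i)
	}
	return index
}

// candidates returns the offsets at which `first` occurs and `last`
// occurs exactly `distance` runes later.
func candidates(index map[string][]int, first, last string, distance int) []int {
	lastAt := make(map[int]bool)
	for _, off := range index[last] {
		lastAt[off] = true
	}
	var out []int
	for _, off := range index[first] {
		if lastAt[off+distance] {
			out = append(out, off)
		}
	}
	return out
}

func main() {
	idx := buildIndex("banana")
	fmt.Println(idx["ban"], idx["ana"], idx["nan"]) // [0] [1 3] [2]

	text := "See The quick brown fox run"
	query := []rune("The quick brown fox")
	first := string(query[:3])            // "The"
	last := string(query[len(query)-3:])  // "fox"
	distance := len(query) - 3            // 16
	fmt.Println(candidates(buildIndex(text), first, last, distance)) // [4]
}
```

Checking only two trigrams keeps the number of posting-list reads small; the distance constraint is what makes two short lists selective enough to be useful.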

Case Sensitivity

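By default, Zoekt searches code without regard for case. To find “abc”, it looks up every case variant of the query trigrams (“abc”, “Abc”, “aBc”, and so on) and then compares the candidate matches case-insensitively. Case-sensitive matching can be requested per query, for example with the case:yes query option.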
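One way to make a trigram lookup ignore case is to look up every upper/lower-case variant of each query trigram and then compare the candidates case-insensitively. The sketch below only enumerates the variants; caseVariants is a hypothetical helper name, not part of Zoekt's API.

```go
package main

import (
	"fmt"
	"unicode"
)

// caseVariants returns every upper/lower-case combination of s.
// Characters without a distinct upper-case form contribute one variant.
func caseVariants(s string) []string {
	variants := []string{""}
	for _, r := range s {
		lo, up := unicode.ToLower(r), unicode.ToUpper(r)
		var next []string
		for _, prefix := range variants {
			next = append(next, prefix+string(lo))
			if up != lo {
				next = append(next, prefix+string(up))
			}
		}
		variants = next
	}
	return variants
}

func main() {
	// Eight variants for a trigram made of three letters.
	fmt.Println(caseVariants("abc")) // [abc abC aBc aBC Abc AbC ABc ABC]
	// Digits have no case, so "a1c" yields only four variants.
	fmt.Println(len(caseVariants("a1c"))) // 4
}
```

A trigram has at most eight variants, so the extra posting-list lookups stay bounded regardless of query length.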

UTF-8 Encoding

Zoekt assumes that files are UTF-8 encoded. It uses rune offsets in the trigram index and converts those back to bytes using a lookup table.
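The rune-to-byte conversion can be sketched as a sampled lookup table: record the byte offset of every k-th rune, then decode forward from the nearest earlier sample. The sampling interval (sampleEvery = 4) and the helper names below are assumptions made for illustration; they are not taken from Zoekt's source.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sampleEvery is an assumed sampling interval, kept small for the demo.
const sampleEvery = 4

// buildRuneTable records the byte offset of every sampleEvery-th rune.
func buildRuneTable(content string) []int {
	var table []int
	runeIdx := 0
	for byteOff := range content { // range yields the byte offset of each rune
		if runeIdx%sampleEvery == 0 {
			table = append(table, byteOff)
		}
		runeIdx++
	}
	return table
}

// runeToByte converts a rune offset into a byte offset by jumping to the
// nearest earlier sample and decoding the remaining runes one by one.
func runeToByte(content string, table []int, runeOff int) int {
	slot := runeOff / sampleEvery
	if slot >= len(table) {
		return len(content)
	}
	byteOff := table[slot]
	for remaining := runeOff % sampleEvery; remaining > 0 && byteOff < len(content); remaining-- {
		_, size := utf8.DecodeRuneInString(content[byteOff:])
		byteOff += size
	}
	return byteOff
}

func main() {
	content := "héllo wörld" // 11 runes, 13 bytes
	table := buildRuneTable(content)
	fmt.Println(table)                          // [0 5 10]
	fmt.Println(runeToByte(content, table, 2))  // 3 (first "l", after 2-byte "é")
	fmt.Println(runeToByte(content, table, 7))  // 8 (start of "ö")
}
```

The table trades a little decoding work per lookup for a much smaller mapping than one entry per rune.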

Branches

Each file blob in the index has a bitmask representing the branches in which that content is found. For example, with the branch bits [master=1, staging=2], a version of “x.java” that is present on both “master” and “staging” has branch mask 3 (1|2). Because identical content is stored once and tagged with every branch that contains it, you can index many similar branches of a repository with little space overhead.

Index Format

The index is organized in shards, where each shard is a file, laid out such that it can be mmap’d efficiently. Each shard contains data for one code repository, including file contents, filenames, content and filename posting lists, branch masks, and metadata.

Ranking

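Zoekt uses several signals for ranking search results: the number of atoms matched, closeness to matches for other atoms, quality of match (whether the match boundary coincides with a word boundary), file latest update time, filename length, tokenizer ranking (whether the match falls in a comment or string literal), and symbol ranking (whether the match is a symbol definition).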

Query Language

Queries are stored as expression trees using the following data structure:

    Query:
        Atom
        | AND QueryList
        | OR QueryList
        | NOT Query
        ;

    Atom:
        ConstQuery
        | SubStringQuery
        | RegexpQuery
        | RepoQuery
        | BranchQuery
        ;

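Both SubStringQuery and RegexpQuery can apply to either file names or file contents, and can optionally be case-insensitive. ConstQuery (match everything, or match nothing) is a useful construct for partial evaluation of a query: once part of a query is known to be true or false for a given shard, it can be replaced by a constant and the rest of the tree simplified.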
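To illustrate how a constant node supports partial evaluation, the sketch below models a reduced form of the expression tree in Go. The type names, the exact-match repository comparison, and the helpers bindRepo and simplify are all assumptions made for this example; they are not Zoekt's query package.

```go
package main

import "fmt"

// Query is any node of the expression tree (simplified for illustration).
type Query interface{}

type Const struct{ Value bool } // match everything / match nothing
type Substring struct {
	Pattern       string
	FileName      bool // match file names instead of contents
	CaseSensitive bool
}
type Repo struct{ Name string }
type And struct{ Children []Query }
type Or struct{ Children []Query }
type Not struct{ Child Query }

// bindRepo replaces every Repo atom with a constant for one shard.
func bindRepo(q Query, shardRepo string) Query {
	switch n := q.(type) {
	case Repo:
		return Const{n.Name == shardRepo} // assumption: exact name match
	case And:
		out := make([]Query, len(n.Children))
		for i, c := range n.Children {
			out[i] = bindRepo(c, shardRepo)
		}
		return And{out}
	case Or:
		out := make([]Query, len(n.Children))
		for i, c := range n.Children {
			out[i] = bindRepo(c, shardRepo)
		}
		return Or{out}
	case Not:
		return Not{bindRepo(n.Child, shardRepo)}
	}
	return q
}

// simplify folds constants away: AND with false is false, OR with true is true.
func simplify(q Query) Query {
	switch n := q.(type) {
	case And:
		var kept []Query
		for _, c := range n.Children {
			c = simplify(c)
			if k, ok := c.(Const); ok {
				if !k.Value {
					return Const{false}
				}
				continue // a true operand does not constrain AND
			}
			kept = append(kept, c)
		}
		if len(kept) == 0 {
			return Const{true}
		}
		if len(kept) == 1 {
			return kept[0]
		}
		return And{kept}
	case Or:
		var kept []Query
		for _, c := range n.Children {
			c = simplify(c)
			if k, ok := c.(Const); ok {
				if k.Value {
					return Const{true}
				}
				continue // a false operand does not widen OR
			}
			kept = append(kept, c)
		}
		if len(kept) == 0 {
			return Const{false}
		}
		if len(kept) == 1 {
			return kept[0]
		}
		return Or{kept}
	case Not:
		c := simplify(n.Child)
		if k, ok := c.(Const); ok {
			return Const{!k.Value}
		}
		return Not{c}
	}
	return q
}

func main() {
	q := And{[]Query{Repo{"zoekt"}, Substring{Pattern: "mmap"}}}
	fmt.Printf("%+v\n", simplify(bindRepo(q, "zoekt")))  // only the substring atom remains
	fmt.Printf("%+v\n", simplify(bindRepo(q, "gerrit"))) // {Value:false}: shard can be skipped
}
```

When the simplified query collapses to a false constant, the shard does not need to be searched at all, which is the practical payoff of carrying constants in the tree.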

Gerrit/Gitiles Integration

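Zoekt can be integrated with Gerrit and its companion browsing project Gitiles to provide code search. Gerrit/Gitiles has a complex ACL system, and a search integration should respect it. Because Gitiles knows the identity of the logged-in user, it can construct search queries that respect those ACLs and, if necessary, filter the results afterwards.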

For more information, please refer to the official documentation and design document.

Top-Level Directory Explanations

doc/ - This directory contains documentation for the project.

gitindex/ - This directory contains the code for indexing Git repositories.

internal/ - This directory contains internal packages used by the project.

shards/ - This directory contains the code for loading index shards and searching across them.