Searching @ sourcegraph/zoekt

What is Searching?

Searching is the core function of Zoekt, a fast text search engine built specifically for source code. [1] It is aimed at finding relevant code quickly in large repositories.

Why is Searching Important?

Software development involves more reading than writing code. [2] Finding the right code to read can be time-consuming and inefficient, especially when working on large projects. Search engines like Zoekt significantly accelerate this process, allowing developers to quickly locate the information they need, similar to how search engines improve browsing on the internet. [2]

Searching and Indexing

Zoekt employs a technique called “positional trigrams” for indexing and searching. [1]

Positional Trigrams

Zoekt builds an index of trigrams (groups of 3 characters) within a file, storing the offset of each trigram’s occurrence. [1] This allows for efficient substring searching by finding matching trigrams at specific distances within the file. [1]

For example, searching for the string “The quick brown fox” involves looking up two trigrams, such as “The” and “fox”, and checking that they occur the right distance apart: “fox” must start 16 characters after “The”. [1]
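A minimal Go sketch of this lookup, assuming ASCII content and byte offsets. The names `posIndex`, `buildIndex`, and `find` are hypothetical and are not Zoekt's API; the real index format is more involved.

```go
package main

import "fmt"

// posIndex maps each trigram to the byte offsets where it starts.
// Hypothetical type for illustration only.
type posIndex map[string][]int

// buildIndex records the start offset of every trigram in content.
func buildIndex(content string) posIndex {
	idx := posIndex{}
	for i := 0; i+3 <= len(content); i++ {
		tri := content[i : i+3]
		idx[tri] = append(idx[tri], i)
	}
	return idx
}

// find returns the offsets at which query occurs in content.
// Queries shorter than one trigram cannot use the index and return nil.
func find(idx posIndex, content, query string) []int {
	if len(query) < 3 {
		return nil
	}
	first, last := query[:3], query[len(query)-3:]
	dist := len(query) - 3 // required gap between the two trigram offsets

	lastAt := map[int]bool{}
	for _, off := range idx[last] {
		lastAt[off] = true
	}

	var hits []int
	for _, off := range idx[first] {
		// Candidate: first and last trigram found at the right distance.
		if !lastAt[off+dist] {
			continue
		}
		// The index lookup did not check the middle of the query,
		// so confirm the candidate against the content itself.
		if content[off:off+len(query)] == query {
			hits = append(hits, off)
		}
	}
	return hits
}

func main() {
	content := "The quick brown fox"
	idx := buildIndex(content)
	fmt.Println(find(idx, content, "The quick brown fox")) // [0]
	fmt.Println(find(idx, content, "quick brown"))         // [4]
}
```

Only two posting lists are read per query in this sketch, regardless of how long the query string is.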

Regular Expressions

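Zoekt handles regular expressions by extracting the literal strings they contain and searching the index for those. [1] To search for (Path|PathFragment).*=.*/usr/local, Zoekt would look for (AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local"). [1] The extracted substrings only narrow down the candidates, which still have to be checked against the full expression.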
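The two-step idea can be sketched in Go as follows. This is an illustration under assumptions, not Zoekt's implementation: `strings.Contains` stands in for the trigram index lookup, the substring query is written out by hand rather than derived from the expression, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// pattern is the regular expression from the example above.
var pattern = regexp.MustCompile(`(Path|PathFragment).*=.*/usr/local`)

// prefilter evaluates the substring query
// (AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local").
// In a real engine this step would be answered from the index.
func prefilter(line string) bool {
	return (strings.Contains(line, "Path") || strings.Contains(line, "PathFragment")) &&
		strings.Contains(line, "/usr/local")
}

// matches runs the cheap substring filter first and evaluates the
// full regular expression only on lines that pass it.
func matches(line string) bool {
	return prefilter(line) && pattern.MatchString(line)
}

func main() {
	lines := []string{
		`PathFragment p = "/usr/local/bin"`,
		`Path is /usr/local`, // passes the prefilter, fails the regexp (no "=")
		`x = 1`,              // rejected by the prefilter
	}
	for _, l := range lines {
		fmt.Printf("%-40q prefilter=%v match=%v\n", l, prefilter(l), matches(l))
	}
}
```

The prefilter can accept lines that the expression rejects, but it never rejects a line that the expression would accept, which is what makes it safe as a first pass.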

Advantages of Positional Trigrams

Efficient Search: Searching only requires intersecting a small number of posting lists. [1]
Optimized Storage: Because each query touches only a few posting lists, they can be stored on slower media such as SSDs. [1]
Document Order: Results are returned in document order, simplifying compound queries with AND and OR. [1]

Downsides of Positional Trigrams

Large Index Size: The index is about 3x the corpus size. [1]
Limited Regular Expression Support: Direct conversion of regular expressions to index ranges is not possible. [1]

Other Indexing Considerations

Case Sensitivity: Zoekt typically performs case-insensitive searches. [1]
UTF-8 Encoding: Zoekt assumes UTF-8 encoding for files. [1]
Branch Support: Zoekt indexes multiple branches of a repository using a bitmask. [1]
Index Format: The index is organized into shards, stored in efficiently mmap-able files. [1]
Ranking: Zoekt uses various signals for ranking, including:
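- Number of matched atoms
- Closeness of a match to matches for other atoms
- Match quality (whether the match boundary coincides with a word boundary)
- File update time
- Filename length
- Tokenizer ranking (whether the match falls in a comment or string literal)
- Symbol ranking (whether the match is a symbol definition) [1]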
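For the case-sensitivity point in the list above, one plausible way to answer case-insensitive queries from a case-sensitive trigram index is to look up every case variant of a trigram and then compare candidates without regard to case. The Go sketch below illustrates that approach as an assumption; `caseVariants` is a hypothetical helper, not a Zoekt function.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// caseVariants returns every upper/lower-case spelling of s.
// Characters without a case distinction contribute a single spelling.
func caseVariants(s string) []string {
	variants := []string{""}
	for _, r := range s {
		lo, up := unicode.ToLower(r), unicode.ToUpper(r)
		var next []string
		for _, v := range variants {
			next = append(next, v+string(lo))
			if up != lo {
				next = append(next, v+string(up))
			}
		}
		variants = next
	}
	return variants
}

func main() {
	// Each variant would be looked up in the index separately.
	fmt.Println(caseVariants("abc"))
	// Candidates are then compared while ignoring case.
	fmt.Println(strings.EqualFold("PathFragment", "pathfragment"))
}
```

A trigram of three letters expands to at most eight lookups, so the cost of this expansion stays bounded.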

Query Language

Zoekt represents queries as expression trees. [1]

Query Structure

Query: One of:
- Atom
- AND QueryList
- OR QueryList
- NOT Query [1]
Atom: A single search term, one of:
- ConstQuery
- SubStringQuery
- RegexpQuery
- RepoQuery
- BranchQuery [1]
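The grammar above maps naturally onto a small set of Go types. The sketch below is an assumption-level illustration: the type names are hypothetical and do not reproduce the types in Zoekt's own query package, and only two of the atom kinds are modelled.

```go
package main

import (
	"fmt"
	"strings"
)

// Query is a node in the expression tree (hypothetical interface).
type Query interface{ String() string }

// Atoms: leaves of the tree.
type Substring struct{ Pattern string }
type Branch struct{ Name string }

// Combinators: inner nodes of the tree.
type And struct{ Children []Query }
type Or struct{ Children []Query }
type Not struct{ Child Query }

func (q Substring) String() string { return fmt.Sprintf("substr:%q", q.Pattern) }
func (q Branch) String() string    { return "branch:" + q.Name }
func (q And) String() string       { return list("AND", q.Children) }
func (q Or) String() string        { return list("OR", q.Children) }
func (q Not) String() string       { return "(NOT " + q.Child.String() + ")" }

// list renders an operator and its operands in prefix notation.
func list(op string, qs []Query) string {
	parts := []string{op}
	for _, q := range qs {
		parts = append(parts, q.String())
	}
	return "(" + strings.Join(parts, " ") + ")"
}

func main() {
	q := And{[]Query{
		Or{[]Query{Substring{"Path"}, Substring{"PathFragment"}}},
		Substring{"/usr/local"},
	}}
	fmt.Println(q)
}
```

Printing the tree yields the same prefix notation used for the regular expression example earlier in this document.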

Query Parsing

Regular Expressions: Strings in the query language are interpreted as regular expressions. [1]
Implicit AND: Elements within parentheses are implicitly joined by AND. [1]
OR Operator: The OR operator has lower priority than the implicit AND. [1]
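The precedence rule can be illustrated with a deliberately simplified Go sketch. It assumes whitespace-separated terms and the keyword `or` (matched case-insensitively here, which is an assumption), ignores parentheses and quoting, and uses hypothetical function names; it is not Zoekt's parser.

```go
package main

import (
	"fmt"
	"strings"
)

// parse splits the input on the "or" keyword first. The terms left in
// each group are joined by the implicit AND, so AND binds tighter.
func parse(input string) [][]string {
	var groups [][]string
	var cur []string
	for _, tok := range strings.Fields(input) {
		if strings.EqualFold(tok, "or") {
			if len(cur) > 0 {
				groups = append(groups, cur)
				cur = nil
			}
			continue
		}
		cur = append(cur, tok)
	}
	if len(cur) > 0 {
		groups = append(groups, cur)
	}
	return groups
}

// render prints the parsed groups in prefix notation.
func render(groups [][]string) string {
	var alts []string
	for _, g := range groups {
		if len(g) == 1 {
			alts = append(alts, g[0])
		} else {
			alts = append(alts, "(AND "+strings.Join(g, " ")+")")
		}
	}
	switch len(alts) {
	case 0:
		return ""
	case 1:
		return alts[0]
	}
	return "(OR " + strings.Join(alts, " ") + ")"
}

func main() {
	fmt.Println(render(parse("foo bar or baz"))) // (OR (AND foo bar) baz)
}
```

Because the split on `or` happens first, `foo bar or baz` groups as "(foo AND bar) OR baz" rather than "foo AND (bar OR baz)".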

Gerrit/Gitiles Integration

Zoekt is designed to integrate with Gerrit and Gitiles, popular systems for code review and code browsing. [1]

Gerrit/Gitiles Search Process

Gitiles finds the branches visible to the logged-in user. [1]
Gitiles sends the raw query, along with branches and repository, to Zoekt. [1]
Zoekt parses the query and embeds it with branch filters. [1]
Zoekt returns the search results, allowing Gitiles to apply further filtering if needed. [1]

Service Management

Zoekt provides a service management tool for automating tasks like:

Polling Git hosting sites for new updates
Reindexing changed repositories
Running and restarting the webserver
Deleting old logs [1]

Security

Zoekt's design distinguishes the sensitive data the service holds from the untrusted data it processes, and describes how the resulting risks are mitigated. [1]

Sensitive Data

Credentials for accessing Git repositories
TLS server certificates
Query logs [1]

Untrusted Data

Code in Git repositories
Search queries [1]

Mitigation Strategies

Zoekt is written in Go, a memory-safe language, which limits memory-safety vulnerabilities. [1]
ctags, which processes the untrusted repository code for symbol detection, runs under seccomp-based sandboxing. [1]

Privacy

Webserver logs containing sensitive data like IP addresses and search queries are deleted after a configurable period. [1]

Frequently Asked Questions

Why codesearch? [2]

Software development involves extensive code reading, and finding the right code can be inefficient, especially in large projects. [2] Search engines like Zoekt accelerate this process by enabling quick code discovery. [2]

What features make a code search engine great? [2]

Coverage: The relevant code should be searchable. [2]
Speed: Results should be returned quickly for efficient query iteration. [2]
Approximate Queries: Matching should support case-insensitive substrings for flexible search. [2]
Filtering: Results can be narrowed by adding further terms to the query. [2]
Ranking: Relevant results, like function definitions, should be prioritized. [2]

How does `zoekt` provide for these? [2]

Coverage: Zoekt includes tools for mirroring Git repositories. [2]
Speed: Zoekt uses a positional trigram-based index for fast retrieval. [2]
Approximate Queries: Zoekt supports substrings and regular expressions with case-insensitive matching. [2]
Filtering: Queries can be filtered using additional atoms and exclusions. [2]
Ranking: Zoekt leverages ctags for symbol detection and ranking. [2]

How does this compare to `grep -r`? [2]

grep -r supports substring search but does not scale well to large corpora, and it offers no filtering or ranking. [2]

What about my IDE? [2]

An IDE works well when the project fits into it, but loading projects into an IDE can be slow and cumbersome, and not every project supports it. [2]

What about the search on `github.com`? [2]

GitHub’s search has good coverage but lacks support for arbitrary substrings. [2]

What about Etsy/Hound? [2]

Etsy/Hound is a code search engine that supports regular expressions but is significantly slower than Zoekt. [2] It also has limited filtering and symbol ranking features. [2]

What about livegrep? [2]

Livegrep supports regular expressions but requires considerable RAM and CPU resources due to its indexing method. [2] It offers only rudimentary filtering and lacks symbol ranking. [2]

How much resources does `zoekt` require? [2]

The search server needs a local SSD for the index (3.5x the corpus size) and at least 20% more RAM than the corpus size. [2]
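As a quick sizing aid, the two ratios above can be turned into a small Go calculation. The function name is hypothetical, and the result is a rough estimate based only on the figures quoted here.

```go
package main

import "fmt"

// requirements estimates SSD and RAM needs from the corpus size:
// index on SSD at 3.5x the corpus, RAM at 20% above the corpus size.
// Integer arithmetic keeps the result exact for round inputs.
func requirements(corpusBytes int64) (ssdBytes, ramBytes int64) {
	ssdBytes = corpusBytes * 35 / 10
	ramBytes = corpusBytes * 12 / 10
	return ssdBytes, ramBytes
}

func main() {
	const gib = int64(1) << 30
	ssd, ram := requirements(10 * gib)
	fmt.Printf("10 GiB corpus: %d GiB SSD, at least %d GiB RAM\n", ssd/gib, ram/gib)
}
```

For a 10 GiB corpus this works out to 35 GiB of SSD space and at least 12 GiB of RAM.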

Can I index multiple branches? [2]

Yes, Zoekt can index up to 64 branches, and identical files across branches occupy space only once in the index. [2]
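A 64-branch limit is what one would expect if each document carries a 64-bit mask with one bit per branch. The Go sketch below illustrates that reading; the `uint64` representation and all names are assumptions for illustration, not Zoekt's actual data structures.

```go
package main

import (
	"errors"
	"fmt"
)

const maxBranches = 64 // one bit per branch in a uint64

// branchBits assigns each branch name a distinct bit (hypothetical type).
type branchBits struct {
	bits map[string]uint64
}

// newBranchBits gives names[i] the bit 1<<i and rejects more than 64 names.
func newBranchBits(names []string) (*branchBits, error) {
	if len(names) > maxBranches {
		return nil, errors.New("more than 64 branches do not fit in one uint64 mask")
	}
	b := &branchBits{bits: map[string]uint64{}}
	for i, n := range names {
		b.bits[n] = uint64(1) << uint(i)
	}
	return b, nil
}

// mask combines the bits of the given branches.
// Unknown names contribute no bits.
func (b *branchBits) mask(names ...string) uint64 {
	var m uint64
	for _, n := range names {
		m |= b.bits[n]
	}
	return m
}

// onBranch reports whether a document with docMask exists on the branch.
func (b *branchBits) onBranch(docMask uint64, name string) bool {
	return docMask&b.bits[name] != 0
}

func main() {
	bb, err := newBranchBits([]string{"main", "release"})
	if err != nil {
		fmt.Println(err)
		return
	}
	shared := bb.mask("main", "release") // identical file, stored once
	onlyMain := bb.mask("main")

	fmt.Println(bb.onBranch(shared, "release"))   // true
	fmt.Println(bb.onBranch(onlyMain, "release")) // false
}
```

A file that is identical on several branches needs only one stored copy with several bits set, which matches the space saving described above.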

How fast is the search? [2]

Rare strings can be retrieved very quickly. For common strings, the search time depends on the number of results requested. [2]

How fast is the indexer? [2]

Indexing time depends on the corpus size, and indexing can be parallelized to speed it up. [2]

What does cs.bazel.build run on? [2]

Currently, it runs on a single Google Cloud VM with 16 vCPUs, 60 GB of RAM, and a physical SSD. [2]

How does `zoekt` work? [2]

Zoekt breaks down files into trigrams and stores their occurrences. [2] Substrings are found by matching trigrams at the correct distances. [2]

I want to know more [2]

Design doc: Technical details [2]
Godoc: API documentation [2]
Gerrit 2016 user summit: Slides [2]
Gerrit 2017 user summit: Transcript, slides, video [2]

API

When running zoekt-webserver with the -rpc option, a JSON HTTP API for searches is available at /api/search. [2]

Example API Request

curl -XPOST -d '{"Q":"needle"}' 'http://127.0.0.1:6070/api/search'

Filtering by Repository IDs

To restrict a search to specific repositories, pass their IDs in the RepoIDs field:

curl -XPOST -d '{"Q":"needle","RepoIDs":[1234,4567]}' 'http://127.0.0.1:6070/api/search'
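The same requests can be issued from Go. The sketch below is a hedged example: it only uses the `Q` and `RepoIDs` fields and the `/api/search` path shown in the curl commands above. The struct and function names, the `uint32` ID type, and the `application/json` content type are assumptions, and the response is returned as raw text because its schema is not described here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// searchRequest mirrors the JSON body used in the curl examples.
type searchRequest struct {
	Q       string   `json:"Q"`
	RepoIDs []uint32 `json:"RepoIDs,omitempty"` // ID type is an assumption
}

// encodeRequest builds the JSON body; RepoIDs is omitted when empty.
func encodeRequest(q string, repoIDs []uint32) ([]byte, error) {
	return json.Marshal(searchRequest{Q: q, RepoIDs: repoIDs})
}

// search posts the query to baseURL + "/api/search" and returns the
// raw response body.
func search(baseURL, q string, repoIDs []uint32) (string, error) {
	body, err := encodeRequest(q, repoIDs)
	if err != nil {
		return "", err
	}
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Post(baseURL+"/api/search", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("search failed: %s: %s", resp.Status, raw)
	}
	return string(raw), nil
}

func main() {
	// Assumes zoekt-webserver is running locally with -rpc enabled.
	out, err := search("http://127.0.0.1:6070", "needle", []uint32{1234, 4567})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```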

Options

Various options can be passed under Opts, documented at [SearchOptions](blob/

Top-Level Directory Explanations

doc/ - This directory contains documentation for the project.

internal/ - This directory contains internal packages used by the project.

query/ - This directory contains the code for query processing.