What is Searching?
Searching is the core functionality of Zoekt, a fast text search engine specifically designed for source code. [1] Zoekt provides a powerful and efficient way to find relevant code within large repositories.
Why is Searching Important?
Software development involves more reading of code than writing it. [2] Finding the right code to read can be time-consuming, especially in large projects. Search engines like Zoekt speed this up by letting developers quickly locate the code they need, much as web search engines make it easier to find pages on the internet. [2]
Searching and Indexing
Zoekt employs a technique called “positional trigrams” for indexing and searching. [1]
Positional Trigrams
Zoekt builds an index of trigrams (groups of 3 characters) within a file, storing the offset of each trigram’s occurrence. [1] This allows for efficient substring searching by finding matching trigrams at specific distances within the file. [1]
For example, a search for the string “The quick brown fox” looks up two trigrams, such as “The” and “fox”, and checks that they occur the right distance apart (here, with “fox” starting 16 characters after “The”). [1]
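The idea can be illustrated with a minimal Python sketch. It is not Zoekt's implementation (Zoekt is written in Go and uses a compact on-disk format); the function names `build_index` and `find` are hypothetical, and the sketch always picks the first and last trigram of the pattern as the pair to compare.

```python
from collections import defaultdict

def build_index(text):
    """Map each trigram to the ascending list of offsets where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - 2):
        index[text[i:i + 3]].append(i)
    return index

def find(index, text, pattern):
    """Return the start offsets of pattern in text.

    The first and last trigram of the pattern act as a filter: they must
    occur exactly len(pattern) - 3 characters apart.
    """
    if len(pattern) < 3:
        raise ValueError("pattern must have at least 3 characters")
    first, last = pattern[:3], pattern[-3:]
    distance = len(pattern) - 3
    last_offsets = set(index.get(last, ()))
    hits = []
    for start in index.get(first, ()):
        # Candidate: both trigrams occur at the expected distance apart.
        if start + distance in last_offsets:
            # Confirm against the content, since the middle may still differ.
            if text[start:start + len(pattern)] == pattern:
                hits.append(start)
    return hits

text = "A fox. The quick brown fox jumps over the lazy dog."
index = build_index(text)
print(find(index, text, "The quick brown fox"))  # [7]
```

The final comparison against the content matters: two trigrams at the right distance only make a position a candidate, not a confirmed match.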
Regular Expressions
Zoekt handles regular expressions by extracting strings from them. [1] To search for (Path|PathFragment).*=.*/usr/local, Zoekt would look for (AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local"). [1]
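As a rough illustration of this two-stage approach, the Python sketch below applies the extracted substring query as a cheap prefilter and runs the full regular expression only on files that pass it. The helper names and the sample files are hypothetical, and plain `in` checks stand in for lookups that Zoekt would answer from its trigram index.

```python
import re

PATTERN = re.compile(r"(Path|PathFragment).*=.*/usr/local")

def may_match(content):
    """Prefilter equivalent to
    (AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local")."""
    return ("Path" in content or "PathFragment" in content) and "/usr/local" in content

def search(files):
    """Return the names of files whose content matches the regular expression."""
    matches = []
    for name, content in files.items():
        if not may_match(content):
            continue  # cheap rejection, the regex never runs on this file
        if PATTERN.search(content):
            matches.append(name)
    return matches

files = {
    "a.sh": "Path = /usr/local/bin",
    "b.sh": "Path is mentioned, /usr/local too, but no assignment",
    "c.sh": "HOME=/home/user",
}
print(search(files))  # ['a.sh']
```

File `b.sh` passes the prefilter but fails the regular expression, which shows why the substring query narrows the candidates but cannot replace the final regex check.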
Advantages of Positional Trigrams
- Efficient Search: Searching only requires intersecting a small number of posting lists. [1]
- Optimized Storage: Posting lists can be stored on media slower than RAM, such as SSDs. [1]
- Document Order: Results are returned in document order, simplifying compound queries with AND and OR. [1]
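To see why document order simplifies compound queries, consider the sketch below. It assumes each sub-query yields an ascending list of document IDs without duplicates; `intersect` and `union` are hypothetical helper names, not Zoekt functions. AND and OR then reduce to a single linear merge, and the output is again in document order, so it can feed another merge.

```python
def intersect(a, b):
    """AND: merge two ascending doc-ID lists, keeping IDs present in both."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: merge two ascending doc-ID lists, keeping IDs present in either."""
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i])
            i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:  # equal IDs: emit once and advance both lists
            out.append(a[i])
            i += 1
            j += 1
    return out

print(intersect([1, 3, 5, 8], [3, 4, 8, 9]))  # [3, 8]
print(union([1, 3, 5, 8], [3, 4, 8, 9]))      # [1, 3, 4, 5, 8, 9]
```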
Downsides of Positional Trigrams
- Large Index Size: The index is about 3x the corpus size. [1]
- Limited Regular Expression Support: Direct conversion of regular expressions to index ranges is not possible. [1]
Other Indexing Considerations
- Case Sensitivity: Zoekt typically performs case-insensitive searches. [1]
- UTF-8 Encoding: Zoekt assumes UTF-8 encoding for files. [1]
- Branch Support: Zoekt indexes multiple branches of a repository using a bitmask. [1]
- Index Format: The index is organized into shards, stored in efficiently mmap-able files. [1]
- Ranking: Zoekt uses various signals for ranking, including:
  - Number of matched atoms
  - Closeness of matches
  - Match quality
  - File update time
  - Filename length
  - Tokenizer ranking
  - Symbol ranking [1]
Query Language
Zoekt employs a query language using expression trees. [1]
Query Structure
- Query: One of:
  - Atom
  - AND QueryList
  - OR QueryList
  - NOT Query [1]
- Atom: A specific search term, one of:
  - ConstQuery
  - SubStringQuery
  - RegexpQuery
  - RepoQuery
  - BranchQuery [1]
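The grammar above describes a tree, which the following Python sketch models with a few small classes. The class names only loosely mirror the grammar and are not Zoekt's Go types; just one atom kind (a substring) is included, and `matches` evaluates the tree against a string rather than against an index.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Substring:
    pattern: str

@dataclass(frozen=True)
class And:
    children: tuple

@dataclass(frozen=True)
class Or:
    children: tuple

@dataclass(frozen=True)
class Not:
    child: object

def matches(query, content):
    """Evaluate a query tree against one document's content."""
    if isinstance(query, Substring):
        return query.pattern in content
    if isinstance(query, And):
        return all(matches(c, content) for c in query.children)
    if isinstance(query, Or):
        return any(matches(c, content) for c in query.children)
    if isinstance(query, Not):
        return not matches(query.child, content)
    raise TypeError(f"unknown query node: {query!r}")

# "needle" AND NOT "haystack"
query = And((Substring("needle"), Not(Substring("haystack"))))
print(matches(query, "a needle here"))       # True
print(matches(query, "needle in haystack"))  # False
```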
Query Parsing
- Regular Expressions: Strings in the query language are interpreted as regular expressions. [1]
- Implicit AND: Elements within parentheses are implicitly joined by AND. [1]
- OR Operator: The OR operator has lower priority than the implicit AND. [1]
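The precedence rule can be sketched with a deliberately tiny parser. This is an illustration under simplifying assumptions, not Zoekt's parser: it ignores parentheses, regular expressions, and modifiers, treats every term as a substring, assumes the operator is spelled `or`, and represents the tree as nested tuples. The function name `parse` is hypothetical.

```python
def parse(query):
    """Parse whitespace-separated terms into a nested-tuple query tree.

    Adjacent terms are joined by an implicit AND; 'or' binds more loosely.
    """
    groups, current = [], []
    for token in query.split():
        if token == "or":
            if not current:
                raise ValueError("'or' needs a term on its left")
            groups.append(current)
            current = []
        else:
            current.append(("substr", token))
    if not current:
        raise ValueError("query is empty or ends with 'or'")
    groups.append(current)
    # Collapse single-element AND/OR nodes so the tree stays minimal.
    ands = [g[0] if len(g) == 1 else ("and", *g) for g in groups]
    return ands[0] if len(ands) == 1 else ("or", *ands)

print(parse("ppp qqq or rrr sss"))
```

For the input `ppp qqq or rrr sss`, the result is an OR node whose two children are AND nodes, one for `ppp qqq` and one for `rrr sss`.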
Gerrit/Gitiles Integration
Zoekt is designed to integrate with Gerrit and Gitiles, popular systems for code review and code browsing. [1]
Gerrit/Gitiles Search Process
- Gitiles finds the branches visible to the logged-in user. [1]
- Gitiles sends the raw query, along with branches and repository, to Zoekt. [1]
- Zoekt parses the query and embeds it with branch filters. [1]
- Zoekt returns the search results, allowing Gitiles to apply further filtering if needed. [1]
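The embedding step can be pictured as wrapping the user's parsed query in an extra AND node. The sketch below is an assumption about the general shape only: the nested-tuple representation, the node labels, and the function name `embed_branch_filter` are all hypothetical and do not come from Zoekt or Gitiles.

```python
def embed_branch_filter(parsed_query, repo, visible_branches):
    """Wrap a parsed query so it matches only the given repository and
    the branches the user is allowed to see."""
    if not visible_branches:
        raise ValueError("user has no visible branches; nothing may be searched")
    branch_filter = ("or", *(("branch", b) for b in visible_branches))
    return ("and", parsed_query, ("repo", repo), branch_filter)

wrapped = embed_branch_filter(("substr", "needle"), "myrepo", ["master", "stable"])
print(wrapped)
```

Because the filter is part of the query tree itself, results from branches the user cannot see are excluded during the search rather than removed afterwards.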
Service Management
Zoekt provides a service management tool for automating tasks like:
- Polling Git hosting sites for new updates
- Reindexing changed repositories
- Running and restarting the webserver
- Deleting old logs [1]
Security
Zoekt's design identifies the sensitive and untrusted data it handles and describes how the associated risks are mitigated. [1]
Sensitive Data
- Credentials for accessing Git repositories
- TLS server certificates
- Query logs [1]
Untrusted Data
- Code in Git repositories
- Search queries [1]
Mitigation Strategies
- Zoekt is written in Go, a memory-safe language, which limits the risk of memory-safety vulnerabilities. [1]
- Seccomp-based sandboxing is used to mitigate risks from ctags. [1]
Privacy
Webserver logs containing sensitive data like IP addresses and search queries are deleted after a configurable period. [1]
Frequently Asked Questions
Why codesearch? [2]
Software development involves extensive code reading, and finding the right code can be inefficient, especially in large projects. [2] Search engines like Zoekt accelerate this process by enabling quick code discovery. [2]
What features make a code search engine great? [2]
- Coverage: The relevant code should be searchable. [2]
- Speed: Results should be returned quickly for efficient query iteration. [2]
- Approximate Queries: Matching should support case-insensitive substrings for flexible search. [2]
- Filtering: It should be possible to narrow down results by adding terms to the query. [2]
- Ranking: Relevant results, like function definitions, should be prioritized. [2]
How does zoekt provide for these? [2]
- Coverage: Zoekt includes tools for mirroring Git repositories. [2]
- Speed: Zoekt uses a positional trigram-based index for fast retrieval. [2]
- Approximate Queries: Zoekt supports substrings and regular expressions with case-insensitive matching. [2]
- Filtering: Queries can be filtered using additional atoms and exclusions. [2]
- Ranking: Zoekt leverages ctags for symbol detection and ranking. [2]
How does this compare to grep -r? [2]
grep -r supports substring search but does not scale well to large corpora, and it lacks filtering and ranking. [2]
What about my IDE? [2]
IDE search works well when a project fits into your IDE, but loading projects into an IDE can be slow and cumbersome, and not all projects support it. [2]
What about the search on github.com? [2]
GitHub’s search has good coverage but lacks support for arbitrary substrings. [2]
What about Etsy/Hound? [2]
Etsy/Hound is a code search engine that supports regular expressions but is significantly slower than Zoekt. [2] It also has limited filtering and symbol ranking features. [2]
What about livegrep? [2]
Livegrep supports regular expressions but requires considerable RAM and CPU resources due to its indexing method. [2] It also offers rudimentary filtering and lacks symbol ranking. [2]
How many resources does zoekt require? [2]
The search server needs a local SSD for the index (3.5x the corpus size) and at least 20% more RAM than the corpus size. [2]
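As a worked example of these rules of thumb, the short Python sketch below turns a corpus size into a rough hardware estimate. The function name and the dictionary keys are hypothetical; the factors 3.5 and 1.2 (20% more than the corpus size) are the only values taken from the text above.

```python
def estimate_resources(corpus_gb):
    """Apply the rules of thumb: index = 3.5x corpus, RAM >= 1.2x corpus."""
    return {
        "index_ssd_gb": corpus_gb * 3.5,
        "min_ram_gb": corpus_gb * 1.2,
    }

print(estimate_resources(10))
```

For a 10 GB corpus this works out to about 35 GB of SSD space for the index and at least about 12 GB of RAM.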
Can I index multiple branches? [2]
Yes, Zoekt can index up to 64 branches, and identical files across branches occupy space only once in the index. [2]
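One way to picture this is a per-file bitmask with one bit per branch, so a file that is identical in several branches is stored once and only its mask records where it appears. The Python sketch below assumes a 64-bit mask, which is an inference from the 64-branch limit rather than something stated here; the names `branch_mask` and `in_branch` are hypothetical.

```python
MAX_BRANCHES = 64  # the limit stated above; one bit per branch in this sketch

def branch_mask(all_branches, containing):
    """Return an integer with bit i set if the file is in all_branches[i]."""
    if len(all_branches) > MAX_BRANCHES:
        raise ValueError("too many branches for a 64-bit mask")
    mask = 0
    for i, name in enumerate(all_branches):
        if name in containing:
            mask |= 1 << i
    return mask

def in_branch(mask, all_branches, name):
    """Check whether the bit for the named branch is set in the mask."""
    return bool(mask & (1 << all_branches.index(name)))

branches = ["main", "release-1", "release-2"]
mask = branch_mask(branches, {"main", "release-2"})
print(bin(mask))                              # 0b101
print(in_branch(mask, branches, "release-1"))  # False
```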
How fast is the search? [2]
Rare strings can be retrieved very quickly. For common strings, search time depends on the number of results requested. [2]
How fast is the indexer? [2]
Indexing time depends on the corpus size, and indexing can be parallelized to speed it up. [2]
What does cs.bazel.build run on? [2]
Currently, it runs on a single Google Cloud VM with 16 vCPUs, 60 GB of RAM, and a physical SSD. [2]
How does zoekt work? [2]
Zoekt breaks down files into trigrams and stores their occurrences. [2] Substrings are found by matching trigrams at the correct distances. [2]
I want to know more [2]
- Design doc: Technical details [2]
- Godoc: API documentation [2]
- Gerrit 2016 user summit: Slides [2]
- Gerrit 2017 user summit: Transcript, slides, video [2]
API
When running zoekt-webserver with the -rpc option, a JSON HTTP API for searches is available at /api/search. [2]
Example API Request
curl -XPOST -d '{"Q":"needle"}' 'http://127.0.0.1:6070/api/search'
Filtering by Repository IDs
To restrict a search to specific repositories efficiently, use the RepoIDs filter:
curl -XPOST -d '{"Q":"needle","RepoIDs":[1234,4567]}' 'http://34.120.239.98/api/search'
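The same requests can be built from a script. The Python sketch below uses only the standard library and mirrors the curl examples: it posts a JSON body with the Q field and, optionally, RepoIDs. The function name `build_search_request` is hypothetical, the default address is the local one from the first example, and the response format is not described here, so the sketch stops at building the request.

```python
import json
import urllib.request

def build_search_request(query, repo_ids=None, base_url="http://127.0.0.1:6070"):
    """Build a POST request for /api/search without sending it."""
    body = {"Q": query}
    if repo_ids:
        body["RepoIDs"] = list(repo_ids)
    return urllib.request.Request(
        base_url + "/api/search",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
    )

request = build_search_request("needle", repo_ids=[1234, 4567])
print(request.full_url)
print(request.data.decode("utf-8"))

# To send it against a running zoekt-webserver started with -rpc:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing the script at a real server.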
Options
Various options can be passed under Opts, documented at [SearchOptions](blob/
Top-Level Directory Explanations
doc/ - This directory contains documentation for the project.
internal/ - This directory contains internal packages used by the project.
query/ - This directory contains the code for query processing.