Embedding and Vector Similarity Search - pingcap/autoflow

In the Autoflow project, vector embeddings and vector similarity search are used to find semantically similar concepts and relationships within a knowledge graph. A vector embedding is a numerical representation of a data object, generated by a machine learning model so that objects with similar meaning map to nearby points in vector space. This makes it possible to search and retrieve data by semantic similarity rather than by exact match.
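To make "semantic similarity" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors and their labels are hand-written, hypothetical values for illustration only; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (not produced by any real model).
embeddings = {
    "database": [0.9, 0.1, 0.0],
    "SQL engine": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

# Score every stored vector against the "database" vector.
query = embeddings["database"]
scores = {name: cosine_similarity(query, vec) for name, vec in embeddings.items()}

# Print the labels from most to least similar to the query.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated vectors score close to 0, which is the property similarity search relies on.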

Two main libraries used in Autoflow for vector similarity search are Faiss and NGT.

Faiss (Facebook AI Similarity Search) is a library developed by Meta for efficient similarity search and clustering of dense vectors. It provides several index types that trade off search speed, memory usage, and accuracy, and it offers GPU implementations for some of them, which makes it suitable for billion-scale similarity search.

NGT (Neighborhood Graph and Tree) is a high-performance, open-source library for searching large-scale, high-dimensional vector data. It performs approximate nearest neighbor search, which is indispensable when working with the high-dimensional vectors produced by deep learning models. NGT is used at Yahoo! JAPAN for similarity-based fashion-item search.

To create embeddings for data within the knowledge graph, Autoflow uses the LlamaIndex and DSPy libraries. LlamaIndex is a data framework for building LLM applications, not a vector database: it loads and indexes data, calls an embedding model to produce vector embeddings, and hands them to a vector store, which makes unstructured data easier to manage and process. DSPy is a framework for programming language models with declarative modules instead of hand-written prompts.

To find similar concepts and relationships in the graph, Autoflow uses the vector similarity search algorithms provided by Faiss and NGT. These algorithms search for the nearest neighbors of a given query vector in high-dimensional space, which allows similar documents, images, products, people, and other entities to be retrieved.
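The nearest-neighbor idea can be sketched as an exact, brute-force search in plain Python. This is not the Faiss or NGT API and not Autoflow's code: those libraries build indexes to answer the same question approximately and much faster at scale. The entity names and vectors below are hypothetical.

```python
import heapq
import math


def nearest_neighbors(query, vectors, k):
    """Return the ids of the k stored vectors closest to `query`.

    Exact search: every stored vector is compared with the query
    using Euclidean distance, so cost grows linearly with the data.
    """
    def distance(item):
        _, vec = item
        return math.dist(query, vec)

    closest = heapq.nsmallest(k, vectors.items(), key=distance)
    return [vec_id for vec_id, _ in closest]


# Hypothetical knowledge-graph entities with toy 3-dimensional embeddings.
graph_entities = {
    "entity_a": [0.9, 0.8, 0.1],
    "entity_b": [0.8, 0.9, 0.2],
    "entity_c": [0.7, 0.6, 0.1],
    "entity_d": [0.1, 0.2, 0.9],
}

query_vector = [0.88, 0.82, 0.12]
result = nearest_neighbors(query_vector, graph_entities, k=2)
print(result)
```

Brute force is a useful correctness baseline, but at millions or billions of vectors an approximate index is the practical choice.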

Example: To find similar accounts on a social media platform, Autoflow would first generate account embeddings from user engagement data, so that accounts which are thematically and topically similar end up close together in the vector space. The system would then use the vector similarity search algorithms provided by Faiss or NGT to find the accounts nearest to a seed account, which are the accounts most similar to it.
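The account example above can be sketched end to end under strong simplifying assumptions: each account's "embedding" is just its normalized engagement count per topic, and the search is an exact scan rather than a Faiss or NGT index. The account names, topics, and counts are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Hypothetical engagement counts per account for three topics:
# (sports, music, tech).
engagement = {
    "seed_account": [120, 5, 10],
    "account_a": [200, 10, 30],
    "account_b": [3, 150, 8],
    "account_c": [15, 20, 300],
}

# Stand-in for a learned embedding model: unit-length engagement profiles.
account_embeddings = {name: normalize(counts) for name, counts in engagement.items()}


def most_similar(seed, embeddings, k):
    """Rank all other accounts by cosine similarity to the seed account."""
    seed_vec = embeddings[seed]
    scored = [
        (sum(a * b for a, b in zip(seed_vec, vec)), name)
        for name, vec in embeddings.items()
        if name != seed
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


similar = most_similar("seed_account", account_embeddings, k=1)
print(similar)
```

Normalizing first means a very active account and a less active one with the same topical mix are treated as similar, which matches the "thematically and topically similar" goal described above.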
