Vector databases have been the fastest-growing database category for the past few years, and their relevance has only grown in the era of Generative AI. What differentiates them from relational databases is their implementation of ANN algorithms. What are those, you ask? Well, this article will explain what ANN algorithms in vector databases are and how they work. It will also discuss how they search data efficiently and their practical applications across various industries. So, let’s begin.
Learn More: Vector Databases in Generative AI Solutions
In relational databases, each record is represented as a row, and its attributes are represented as columns. For instance, consider a table with N author names and their respective research paper data. To find the papers written by a particular author, a naive approach would compare the query author’s name against all N values in the Author column. This method requires N comparisons.
A more efficient method is to sort the Author column alphabetically. Then, using binary search, we can find a matching author in only log(N) comparisons. However, the scenario changes when it comes to finding relevant research papers for a given query. The naive approach is to compute the similarity between the query embedding vector and all the document embedding vectors, which requires N comparisons.
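To make the exact-match case concrete, here is a small sketch using Python’s bisect module on a pre-sorted Author column (the names below are placeholders):

import bisect

# the Author column, sorted alphabetically beforehand
authors = ['Bengio', 'Goodfellow', 'Hinton', 'LeCun', 'Vaswani']

query = 'Hinton'
i = bisect.bisect_left(authors, query)            # takes log(N) comparisons
print(i < len(authors) and authors[i] == query)   # True: exact match found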
Sorting the research paper text or embeddings and using binary search doesn’t work because we are not looking for the exact match to the query embedding. We only want to find the most similar embeddings. Moreover, embeddings represent the data in multi-dimensional space. Sorting by any single dimension doesn’t make sense.
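For reference, this is what the naive embedding search looks like in NumPy: compute a similarity score between the query and every one of the N stored embeddings, then take the top results. The data here is random and only illustrates the shape of the computation.

import numpy as np

N, d = 100000, 256
doc_embeddings = np.random.random((N, d)).astype('float32')
query = np.random.random(d).astype('float32')

# one dot product (an unnormalized similarity) per stored embedding: N comparisons
scores = doc_embeddings @ query
top5 = np.argsort(-scores)[:5]        # indices of the 5 most similar documents
print(top5, scores[top5])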
So, we need different algorithms that can search vectors more efficiently. These algorithms are called Approximate Nearest Neighbor (ANN) algorithms. While they may not always find the exact nearest neighbors that the naive approach would, they significantly improve search speed and efficiency in large, multi-dimensional datasets. The implementation of ANN algorithms is what differentiates vector databases from traditional relational databases.
Learn More: Top 15 Vector Databases in 2024
Now that you understand what ANN algorithms are, let’s find out how different ANN algorithms work.
Tree-based algorithms organize data points so that points that are close together in space are also close together in the tree. A few examples of such trees are the K-dimensional tree (KD-tree), Vantage Point tree (VP-tree), Ball tree, and Rectangular tree (R-tree).
One popular library that implements a tree-based algorithm is Annoy (Approximate Nearest Neighbors Oh Yeah). It was developed by Erik Bernhardsson while working at Spotify. Annoy builds the tree by dividing data points using random hyperplanes.
Let’s look into the details of how this works.
In short, the data is split recursively with random hyperplanes until each leaf of the tree holds only a small number of points. At query time, we descend the tree to the leaf containing the query and compare the query only against the points found there. By also checking branches that lie close to a splitting hyperplane (an important edge case) and by building multiple trees, the algorithm improves the accuracy of the nearest neighbors it finds.
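Here is a minimal sketch of this idea using the Annoy library. The dimension, data, and number of trees below are arbitrary placeholder values.

import random
from annoy import AnnoyIndex

f = 64                                  # embedding dimension
index = AnnoyIndex(f, 'angular')        # 'angular' is a cosine-like distance
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

index.build(10)                         # build a forest of 10 random-hyperplane trees
print(index.get_nns_by_item(0, 5))      # 5 approximate nearest neighbors of item 0

More trees give more accurate results at the cost of a larger index and a slower build.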
In these algorithms, data points are represented as vertices of a graph, and edges are used to traverse the graph to find the nearest neighbors. Let’s understand this in detail using what is currently the most popular algorithm of this kind, Hierarchical Navigable Small World (HNSW).
If you are wondering how the graph is built in the first place: vertices are inserted one at a time, and for each new vertex we find its nearest neighbors exactly as we would for a query. The new vertex is then connected to a predefined number of those nearest vertices through edges.
In the graph, each vertex connects to only a few nearby vertices, creating a small-world network. Because we find neighbors by navigating this network, it is called a navigable small world (NSW).
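To make the idea concrete, here is a simplified sketch of greedy search over a single NSW layer. The vectors array and neighbors adjacency dictionary are assumed to exist; real implementations such as HNSW keep a list of candidate vertices (efSearch) rather than a single current vertex.

import numpy as np

def greedy_nsw_search(vectors, neighbors, query, entry_point):
    # vectors:   array of shape (n, d), one embedding per vertex
    # neighbors: dict mapping each vertex id to the ids it is connected to
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for v in neighbors[current]:
            d = np.linalg.norm(vectors[v] - query)
            if d < current_dist:      # move to any connected vertex that is closer
                current, current_dist = v, d
                improved = True
    return current, current_dist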
When dealing with millions of data points, traversing the graph to find the nearest neighbor starting from a random point can be time-consuming, as each vertex is connected to only a few vertices. Increasing the number of edges for each vertex also takes a lot of time as more distances need to be calculated.
To overcome this problem, multiple graphs with different numbers of vertices are built, and each graph is treated as a layer. The topmost layer contains only a few vertices, each layer below contains progressively more, and the bottom layer contains all the data points. A search starts in the top layer, greedily finds the closest vertex there, and uses it as the entry point for the next layer down, repeating until it reaches the bottom layer.
Thus, the number of traversals and distance calculations is much smaller than in the flat NSW algorithm.
How do we decide the number of layers and how many data points should be in each? The HNSW paper provides the following formula for allocating data points to different layers.
floor(-ln(unif(0,1))*mL)
Here, unif(0,1) is a random number drawn uniformly from the interval [0, 1], and mL is the level multiplier, a normalization constant that controls how likely a point is to be promoted to higher layers (the HNSW paper recommends mL = 1/ln(M), where M is the number of connections per vertex).
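As a quick sanity check, here is the formula applied in Python. Setting mL = 1/ln(M) follows the recommendation in the HNSW paper; the dataset size is arbitrary.

import numpy as np

M = 32                      # connections per vertex (the same M used with Faiss below)
mL = 1 / np.log(M)          # level multiplier recommended in the HNSW paper
rng = np.random.default_rng(0)

# assign each of 100,000 points its maximum layer using floor(-ln(unif(0,1)) * mL)
levels = np.floor(-np.log(rng.uniform(size=100000)) * mL).astype(int)
print(np.bincount(levels))  # the vast majority of points stay in layer 0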
HNSW is the default algorithm in most vector databases. Spotify has also released a new library, Voyager, based on HNSW.
Now, let’s try the HNSW algorithm in practice using the Faiss library. We will first generate some random vectors to act as the database and the queries.
import numpy as np
import faiss
# We can choose some random numbers for the database and queries.
d = 256 # dimension
nb = 100000 # database size
nq = 10000 # number of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
First, let’s try the naive brute-force approach by building a flat index (IndexFlatL2).
flat_index = faiss.IndexFlatL2(d) # build the index
print(flat_index.is_trained)
>>> True
flat_index.add(xb) # add vectors to the index
print(flat_index.ntotal)
>>> 100000
Now, we can search the index.
%%time
# the %%time magic reports how long the cell takes to run in a Jupyter notebook
k = 5 # we can get 5 nearest neighbors
D, I = flat_index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(D[:5]) # distances of the 5 first queries
>>>[[ 69 525 628 595 1413]
[ 603 25 14 1616 698]
[ 732 744 739 589 1185]
[ 447 841 347 373 631]
[1053 924 163 101 302]]
[[33.871002 33.979095 34.67044 34.738922 35.204865]
[34.497314 34.682297 35.488464 35.671005 35.864685]
[32.993195 34.401352 34.411896 34.514572 34.659515]
[33.948517 34.039062 34.364456 34.466248 35.244644]
[33.487595 34.77111 34.81253 34.893692 35.152557]]
Let’s try the HNSW algorithm now.
M = 32 # each vertex will be connected to M other nearest vertices
hnsw_index = faiss.IndexHNSWFlat(d, M) # build the index
print(hnsw_index.is_trained)
>>> True
We can add the data to the index.
# To connect each new vertex to M others, the algorithm greedily searches up to 'efConstruction' candidate vertices.
# The default value is 40; we can change it before adding the dataset.
hnsw_index.hnsw.efConstruction = 48
hnsw_index.add(xb)
# after adding our data, we will find that the maximum level has been set automatically
hnsw_index.hnsw.max_level
>>> 3
# and levels (or layers) are now populated
levels = faiss.vector_to_array(hnsw_index.hnsw.levels)
np.bincount(levels)
>>> array([ 0, 96812, 3093, 92, 3])
We can search now.
# how many entry points will be explored between layers during the search.
# for example, we can select 30 nearest vertices in one layer,
# then start traversing the graph from those vertices in the next layer
hnsw_index.hnsw.efSearch = 30
%%time
hnsw_index.search(xq[:5], k=4)
>>> (array([[33.870995, 33.979073, 34.67042 , 34.738907],
[34.497334, 34.682304, 35.488453, 35.67101 ],
[32.993187, 34.401337, 34.411903, 34.514584],
[33.948494, 34.039097, 34.36444 , 34.46623 ],
[33.487595, 34.771133, 34.81257 , 34.893723]], dtype=float32),
array([[ 69, 525, 628, 595],
[ 603, 25, 14, 1616],
[ 732, 744, 739, 589],
[ 447, 841, 347, 373],
[1053, 924, 163, 101]]))
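Notice that, for these five queries, HNSW returns the same neighbor IDs and nearly identical distances as the exhaustive flat index, even though it examines only a small fraction of the database.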
In these algorithms, we use both trees and graphs to find the nearest neighbors. An example is Neighborhood Graph and Tree (NGT), currently among the best-performing ANN algorithms. NGT uses a dynamic vantage point tree together with a graph. Let’s see how it works.
So, rather than starting the graph traversal from an arbitrary entry point as in HNSW, this algorithm first localizes the search space using a dynamic vantage point tree and then searches the neighborhood graph within that region. This combination of a tree and a graph makes NGT one of the fastest and most accurate ANN algorithms. As of June 2024, the Vald vector database supports this algorithm.
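Below is a rough sketch of using NGT through its Python bindings, ngtpy, following the example in the NGT README. The index path, dimension, and data are placeholders, and the exact API may differ slightly between versions.

import numpy as np
import ngtpy

dim = 128
vectors = np.random.random((10000, dim)).astype('float32')

ngtpy.create(b'paper_index', dim)       # create an on-disk NGT (ANNG) index
index = ngtpy.Index(b'paper_index')
index.batch_insert(vectors)             # insert all vectors into the tree + graph
index.save()

query = np.random.random(dim).astype('float32')
for object_id, distance in index.search(query, 5):   # 5 approximate nearest neighbors
    print(object_id, distance)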
Let’s now explore some of the most common applications of ANN algorithms.
Applications such as music and product recommendations and personalized advertising focus on finding approximate matches to user preferences or content features.
Embedding-based search applications use embeddings to find similar items across media types such as text, images, and molecules, improving search accuracy and relevance.
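As an illustrative sketch (not a production pipeline), an embedding-based text search can combine a sentence-embedding model with one of the indexes above. The model name and documents below are placeholders.

import faiss
from sentence_transformers import SentenceTransformer

docs = [
    'Graph-based approximate nearest neighbor search',
    'Transformers for natural language processing',
    'Tree-based indexing structures for spatial data',
]

model = SentenceTransformer('all-MiniLM-L6-v2')        # any sentence-embedding model works
doc_vecs = model.encode(docs, normalize_embeddings=True).astype('float32')

index = faiss.IndexHNSWFlat(doc_vecs.shape[1], 16)     # HNSW index over the corpus
index.add(doc_vecs)

query_vec = model.encode(['how does HNSW search work?'],
                         normalize_embeddings=True).astype('float32')
distances, ids = index.search(query_vec, 2)
print([docs[i] for i in ids[0]])                       # the most similar documents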
Learn More: 10+ Vector Database Applications in the Real World
Vector databases, through efficient ANN algorithms like tree-based, graph-based, and hybrid methods, significantly enhance search capabilities in multi-dimensional spaces. Their practical applications span various industries, offering powerful solutions for similarity-based recommendations, embedding-based search, and personalized advertising.
I hope this article has given you a clear understanding of ANN algorithms in vector databases. Do check out our other articles on vector databases to learn more. Happy learning!
A. A vector database is a specialized type of database that handles multi-dimensional data, enabling efficient similarity searches using vector embeddings rather than traditional row-column structures.
A. Vector databases utilize various Approximate Nearest Neighbor (ANN) algorithms, including tree-based methods like KD-trees and Annoy, graph-based methods like HNSW, and hybrid methods like NGT.
A. Tree-based ANN algorithms organize data points using structures like KD-trees and Annoy, which divide the data space with hyperplanes, allowing efficient nearest neighbor searches by traversing the tree.
A. Graph-based algorithms, such as HNSW, represent data points as vertices in a graph, using edges to connect nearest neighbors and navigate the graph efficiently to find similar data points.
A. Practical applications of vector databases include similarity-based recommendations for music and products, personalized advertising, and embedding-based searches for text, images, and molecules.