Vector databases have been the fastest-growing database category for the past few years, and their relevance has only grown in the era of Generative AI. What differentiates them from relational databases is their implementation of ANN algorithms. What are those, you ask? Well, this article will explain what ANN algorithms in vector databases are and how they work. It will also discuss how they search data efficiently and their practical applications across various industries. So, let’s begin.
Learn More: Vector Databases in Generative AI Solutions
In relational databases, each record is represented as a row, and its attributes are represented as columns. For instance, consider a table with N author names and their respective research paper data. To find the papers written by a particular author, a naive approach would compare the query author’s name against all N values in the Author column. This method requires N comparisons.
A more efficient method is to sort the Author column alphabetically. Then, using binary search, we can find a matching author in only log(N) comparisons. However, the scenario changes when it comes to finding relevant research papers for a given query. The naive approach is to compute the similarity between the query embedding vector and all the document embedding vectors, which requires N comparisons.
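To make the exact-match case concrete, here is a small sketch using Python’s bisect module on a pre-sorted Author column (the names below are placeholders):

import bisect

# the Author column, sorted alphabetically beforehand
authors = ['Bengio', 'Goodfellow', 'Hinton', 'LeCun', 'Vaswani']

query = 'Hinton'
i = bisect.bisect_left(authors, query)            # takes log(N) comparisons
print(i < len(authors) and authors[i] == query)   # True: exact match found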
Sorting the research paper text or embeddings and using binary search doesn’t work because we are not looking for the exact match to the query embedding. We only want to find the most similar embeddings. Moreover, embeddings represent the data in multi-dimensional space. Sorting by any single dimension doesn’t make sense.
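For reference, this is what the naive embedding search looks like in NumPy: compute a similarity score between the query and every one of the N stored embeddings, then take the top results. The data here is random and only illustrates the shape of the computation.

import numpy as np

N, d = 100000, 256
doc_embeddings = np.random.random((N, d)).astype('float32')
query = np.random.random(d).astype('float32')

# one dot product (an unnormalized similarity) per stored embedding: N comparisons
scores = doc_embeddings @ query
top5 = np.argsort(-scores)[:5]        # indices of the 5 most similar documents
print(top5, scores[top5])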
So, we need different algorithms that can search vectors more efficiently. These algorithms are called Approximate Nearest Neighbor (ANN) algorithms. While they may not always find the exact nearest neighbors that the naive approach would, they significantly improve search speed and efficiency in large, multi-dimensional datasets. The implementation of ANN algorithms is what differentiates vector databases from traditional relational databases.
Learn More: Top 15 Vector Databases in 2024
Now that you understand what ANN algorithms are, let’s find out how different ANN algorithms work.
Tree-based algorithms organize data points so that points that are close together in space are also close together in the tree. A few examples of such trees are the K-dimensional tree (KD-tree), Vantage Point tree (VP-tree), Ball tree, and Rectangular tree (R-tree).
One popular library that implements a tree-based algorithm is Annoy (Approximate Nearest Neighbors Oh Yeah). It was developed by Erik Bernhardsson while working at Spotify. Annoy builds the tree by dividing data points using random hyperplanes.
Let’s look into the details of how this works.
In short, the data is split recursively with random hyperplanes until each leaf of the tree holds only a small number of points. At query time, we descend the tree to the leaf containing the query and compare the query only against the points found there. By also checking branches that lie close to a splitting hyperplane (an important edge case) and by building multiple trees, the algorithm improves the accuracy of the nearest neighbors it finds.
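Here is a minimal sketch of this idea using the Annoy library. The dimension, data, and number of trees below are arbitrary placeholder values.

import random
from annoy import AnnoyIndex

f = 64                                  # embedding dimension
index = AnnoyIndex(f, 'angular')        # 'angular' is a cosine-like distance
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

index.build(10)                         # build a forest of 10 random-hyperplane trees
print(index.get_nns_by_item(0, 5))      # 5 approximate nearest neighbors of item 0

More trees give more accurate results at the cost of a larger index and a slower build.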
In these algorithms, data points are represented as vertices of a graph, and edges are used to traverse the graph to find the nearest neighbors. Let’s understand this in detail using what is currently the most popular algorithm of this kind, Hierarchical Navigable Small World (HNSW).
If you are wondering how the graph is built in the first place: vertices are inserted one at a time, and for each new vertex we find its nearest neighbors exactly as we would for a query. The new vertex is then connected to a predefined number of those nearest vertices through edges.
In the graph, each vertex connects to only a few nearby vertices, creating a small-world network. Because we find neighbors by navigating this network, it is called a navigable small world (NSW).
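To make the idea concrete, here is a simplified sketch of greedy search over a single NSW layer. The vectors array and neighbors adjacency dictionary are assumed to exist; real implementations such as HNSW keep a list of candidate vertices (efSearch) rather than a single current vertex.

import numpy as np

def greedy_nsw_search(vectors, neighbors, query, entry_point):
    # vectors:   array of shape (n, d), one embedding per vertex
    # neighbors: dict mapping each vertex id to the ids it is connected to
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for v in neighbors[current]:
            d = np.linalg.norm(vectors[v] - query)
            if d < current_dist:      # move to any connected vertex that is closer
                current, current_dist = v, d
                improved = True
    return current, current_dist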
When dealing with millions of data points, traversing the graph to find the nearest neighbor starting from a random point can be time-consuming, as each vertex is connected to only a few vertices. Increasing the number of edges for each vertex also takes a lot of time as more distances need to be calculated.
To overcome this problem, multiple graphs with different numbers of vertices are built, and each graph is treated as a layer. The topmost layer contains only a few vertices, each layer below contains progressively more, and the bottom layer contains all the data points. A search starts in the top layer, greedily finds the closest vertex there, and uses it as the entry point for the next layer down, repeating until it reaches the bottom layer.
Thus, the number of traversals and distance calculations is much smaller than in the flat NSW algorithm.
How do we decide the number of layers and how many data points should be in each? The HNSW paper provides the following formula for allocating data points to different layers.
floor(-ln(unif(0,1))*mL)
Here, unif(0,1) is a random number drawn uniformly from the interval [0, 1], and mL is the level multiplier, a normalization constant that controls how likely a point is to be promoted to higher layers (the HNSW paper recommends mL = 1/ln(M), where M is the number of connections per vertex).
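As a quick sanity check, here is the formula applied in Python. Setting mL = 1/ln(M) follows the recommendation in the HNSW paper; the dataset size is arbitrary.

import numpy as np

M = 32                      # connections per vertex (the same M used with Faiss below)
mL = 1 / np.log(M)          # level multiplier recommended in the HNSW paper
rng = np.random.default_rng(0)

# assign each of 100,000 points its maximum layer using floor(-ln(unif(0,1)) * mL)
levels = np.floor(-np.log(rng.uniform(size=100000)) * mL).astype(int)
print(np.bincount(levels))  # the vast majority of points stay in layer 0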
HNSW is the default algorithm in most vector databases. Spotify has also released a new library, Voyager, based on HNSW.
Now, let’s try the HNSW algorithm in practice using the Faiss library. We will first generate some random vectors to act as the database and the queries.
import numpy as np
import faiss
# We can choose some random numbers for the database and queries.
d = 256 # dimension
nb = 100000 # database size
nq = 10000 # number of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
First, let’s try the naive brute-force approach by building a flat index (IndexFlatL2).
flat_index = faiss.IndexFlatL2(d) # build the index
print(flat_index.is_trained)
>>> True
flat_index.add(xb) # add vectors to the index
print(flat_index.ntotal)
>>> 100000
Now, we can search the index.
%%time
# the %%time magic reports how long the cell takes to run in a Jupyter notebook
k = 5 # we can get 5 nearest neighbors
D, I = flat_index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(D[:5]) # distances of the 5 first queries
>>>[[ 69 525 628 595 1413]
[ 603 25 14 1616 698]
[ 732 744 739 589 1185]
[ 447 841 347 373 631]
[1053 924 163 101 302]]
[[33.871002 33.979095 34.67044 34.738922 35.204865]
[34.497314 34.682297 35.488464 35.671005 35.864685]
[32.993195 34.401352 34.411896 34.514572 34.659515]
[33.948517 34.039062 34.364456 34.466248 35.244644]
[33.487595 34.77111 34.81253 34.893692 35.152557]]
Let’s try the HNSW algorithm now.
M = 32 # each vertex will be connected to M other nearest vertices
hnsw_index = faiss.IndexHNSWFlat(d, M) # build the index
print(hnsw_index.is_trained)
>>> True
We can add the data to the index.
# To connect each new vertex to M others, the algorithm greedily searches up to 'efConstruction' candidate vertices.
# The default value is 40; we can change it before adding the dataset.
hnsw_index.hnsw.efConstruction = 48
hnsw_index.add(xb)
# after adding our data, we will find that the maximum level has been set automatically
hnsw_index.hnsw.max_level
>>> 3
# and levels (or layers) are now populated
levels = faiss.vector_to_array(hnsw_index.hnsw.levels)
np.bincount(levels)
>>> array([ 0, 96812, 3093, 92, 3])
We can search now.
# how many entry points will be explored between layers during the search.
# for example, we can select 30 nearest vertices in one layer,
# then start traversing the graph from those vertices in the next layer
hnsw_index.hnsw.efSearch = 30
%%time
hnsw_index.search(xq[:5], k=4)
>>> (array([[33.870995, 33.979073, 34.67042 , 34.738907],
[34.497334, 34.682304, 35.488453, 35.67101 ],
[32.993187, 34.401337, 34.411903, 34.514584],
[33.948494, 34.039097, 34.36444 , 34.46623 ],
[33.487595, 34.771133, 34.81257 , 34.893723]], dtype=float32),
array([[ 69, 525, 628, 595],
[ 603, 25, 14, 1616],
[ 732, 744, 739, 589],
[ 447, 841, 347, 373],
[1053, 924, 163, 101]]))
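Notice that, for these five queries, HNSW returns the same neighbor IDs and nearly identical distances as the exhaustive flat index, even though it examines only a small fraction of the database.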
In these algorithms, we use both trees and graphs to find the nearest neighbors. An example is Neighborhood Graph and Tree (NGT), currently among the best-performing ANN algorithms. NGT uses a dynamic vantage point tree together with a graph. Let’s see how it works.
So, rather than starting the graph traversal from an arbitrary entry point as in HNSW, this algorithm first localizes the search space using a dynamic vantage point tree and then searches the neighborhood graph within that region. This combination of a tree and a graph makes NGT one of the fastest and most accurate ANN algorithms. As of June 2024, the Vald vector database supports this algorithm.
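Below is a rough sketch of using NGT through its Python bindings, ngtpy, following the example in the NGT README. The index path, dimension, and data are placeholders, and the exact API may differ slightly between versions.

import numpy as np
import ngtpy

dim = 128
vectors = np.random.random((10000, dim)).astype('float32')

ngtpy.create(b'paper_index', dim)       # create an on-disk NGT (ANNG) index
index = ngtpy.Index(b'paper_index')
index.batch_insert(vectors)             # insert all vectors into the tree + graph
index.save()

query = np.random.random(dim).astype('float32')
for object_id, distance in index.search(query, 5):   # 5 approximate nearest neighbors
    print(object_id, distance)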
Let’s now explore some of the most common applications of ANN algorithms.
Applications such as music and product recommendations and personalized advertising focus on finding approximate matches to user preferences or content features.
Embedding-based search applications use embeddings to find similar items across media types such as text, images, and molecules, improving search accuracy and relevance.
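As an illustrative sketch (not a production pipeline), an embedding-based text search can combine a sentence-embedding model with one of the indexes above. The model name and documents below are placeholders.

import faiss
from sentence_transformers import SentenceTransformer

docs = [
    'Graph-based approximate nearest neighbor search',
    'Transformers for natural language processing',
    'Tree-based indexing structures for spatial data',
]

model = SentenceTransformer('all-MiniLM-L6-v2')        # any sentence-embedding model works
doc_vecs = model.encode(docs, normalize_embeddings=True).astype('float32')

index = faiss.IndexHNSWFlat(doc_vecs.shape[1], 16)     # HNSW index over the corpus
index.add(doc_vecs)

query_vec = model.encode(['how does HNSW search work?'],
                         normalize_embeddings=True).astype('float32')
distances, ids = index.search(query_vec, 2)
print([docs[i] for i in ids[0]])                       # the most similar documents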
Learn More: 10+ Vector Database Applications in the Real World
Vector databases, through efficient ANN algorithms like tree-based, graph-based, and hybrid methods, significantly enhance search capabilities in multi-dimensional spaces. Their practical applications span various industries, offering powerful solutions for similarity-based recommendations, embedding-based search, and personalized advertising.
I hope this article has given you a clear understanding of ANN algorithms in vector databases. Do check out our other articles on vector databases to learn more. Happy learning!
A. A vector database is a specialized type of database that handles multi-dimensional data, enabling efficient similarity searches using vector embeddings rather than traditional row-column structures.
A. Vector databases utilize various Approximate Nearest Neighbor (ANN) algorithms, including tree-based methods like KD-trees and Annoy, graph-based methods like HNSW, and hybrid methods like NGT.
A. Tree-based ANN algorithms organize data points using structures like KD-trees and Annoy, which divide the data space with hyperplanes, allowing efficient nearest neighbor searches by traversing the tree.
A. Graph-based algorithms, such as HNSW, represent data points as vertices in a graph, using edges to connect nearest neighbors and navigate the graph efficiently to find similar data points.
A. Practical applications of vector databases include similarity-based recommendations for music and products, personalized advertising, and embedding-based searches for text, images, and molecules.