Cosine similarity is a measure for comparing two non-zero vectors. Because it captures the orientation of the vectors regardless of their magnitude, it is widely used in domains such as text analysis, data mining, and information retrieval. This article explores the mathematics of cosine similarity and shows how to use it in Python.
Overview:
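Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. The cosine of the angle between two non-zero vectors can be derived from the Euclidean dot product formula:
$$A \cdot B = \|A\| \, \|B\| \cos\theta$$
Given two n-dimensional vectors of attributes, A and B, the cosine similarity, cos(θ), is expressed in terms of the dot product and the magnitudes as
$$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$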
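The cosine similarity ranges from -1 to 1, where 1 means the two vectors point in the same direction, 0 means they are orthogonal (perpendicular), and -1 means they point in opposite directions.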
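As a quick numerical check of the two ends and the midpoint of this range, here is a minimal sketch using NumPy. The helper name `cosine` and the sample vectors are chosen for illustration only and are not part of any library.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero 1-D vectors (illustrative helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same = cosine([1, 2], [2, 4])         # same direction, different length
orthogonal = cosine([1, 0], [0, 1])   # perpendicular vectors
opposite = cosine([1, 2], [-1, -2])   # opposite directions

print(same, orthogonal, opposite)
```

Up to floating-point rounding, the three results are 1, 0, and -1. The first pair also shows that scaling a vector does not change the result, since only the direction matters.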
Let us now learn how to implement cosine similarity using different libraries:
# Using numpy
import numpy as np
# Define two vectors
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
# Compute cosine similarity
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print("Cosine Similarity (NumPy):", cos_sim)
Here, we create two arrays, A and B, which are the vectors we want to compare. We then apply the cosine similarity formula: the dot product of A and B divided by the product of their magnitudes, which np.linalg.norm computes.
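Because the formula divides by the magnitudes, it is undefined when either vector is all zeros, and the one-line NumPy expression would then produce `nan` with a runtime warning. One possible way to handle this is sketched below. The function name `safe_cosine_similarity` and the choice to raise a `ValueError` are assumptions made for this example, not something prescribed by NumPy.

```python
import numpy as np

def safe_cosine_similarity(a, b):
    """Return the cosine similarity of two 1-D vectors.

    Raises ValueError if either vector has zero magnitude,
    because the similarity is undefined in that case.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / norm_product)

print(safe_cosine_similarity([1, 2, 3], [4, 5, 6]))
```

Depending on the application, returning 0.0 for a zero vector instead of raising may be more convenient; that is a design choice rather than a mathematical fact.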
# Using scikit-learn
from sklearn.metrics.pairwise import cosine_similarity
# Define two vectors
A = [[1, 2, 3]]
B = [[4, 5, 6]]
# Compute cosine similarity
cos_sim = cosine_similarity(A, B)
print("Cosine Similarity (scikit-learn):", cos_sim[0][0])
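Here, the built-in cosine_similarity function from scikit-learn does the work for us. It expects 2D inputs with one row per vector, which is why A and B are written as nested lists, and it returns a matrix of pairwise similarities, so cos_sim[0][0] extracts the single value.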
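The same function can compare several vectors at once. The sketch below passes a single 2D array, in which case every row is compared with every other row. The sample rows are illustrative values, not data from this article.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Each row is one vector; the values are illustrative.
X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [-1, -2, -3],
])

# With a single argument, every row is compared with every row (including itself).
sim_matrix = cosine_similarity(X)

print(sim_matrix.shape)         # (3, 3)
print(np.round(sim_matrix, 4))  # entry [i, j] compares row i with row j
```

The diagonal holds the similarity of each vector with itself, and the matrix is symmetric because the order of the two vectors does not affect the result.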
The first step in the NumPy code is to define the vectors.
Compute the dot product of the two vectors A and B. The dot product is obtained by multiplying corresponding elements of the vectors and summing up the results.
Determine the magnitude (or norm) of each vector A and B. This involves calculating the square root of the sum of the squares of its elements.
The final step is to divide the dot product by the product of the two magnitudes, which yields the cosine similarity.
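To make these steps concrete, the sketch below repeats the calculation for the same vectors using only Python's standard library, with the intermediate arithmetic shown in the comments. The variable names `dot`, `norm_a`, and `norm_b` are chosen for this example.

```python
import math

# Step 1: define the vectors
A = [1, 2, 3]
B = [4, 5, 6]

# Step 2: dot product, 1*4 + 2*5 + 3*6 = 32
dot = sum(a * b for a, b in zip(A, B))

# Step 3: magnitudes, sqrt(1 + 4 + 9) = sqrt(14) and sqrt(16 + 25 + 36) = sqrt(77)
norm_a = math.sqrt(sum(a * a for a in A))
norm_b = math.sqrt(sum(b * b for b in B))

# Step 4: divide the dot product by the product of the magnitudes
cos_sim = dot / (norm_a * norm_b)

print(round(cos_sim, 4))  # 0.9746
```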
Cosine similarity is a powerful tool for measuring the similarity between vectors and is particularly useful for high-dimensional and sparse datasets. In this article, we implemented it in Python with the NumPy and scikit-learn libraries, and both approaches take only a few lines of code. Cosine similarity is important in NLP, text analysis, and recommendation systems because it depends only on the direction of the vectors, not on their magnitude.
A. Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space, indicating how similar the vectors are.
A. In text analysis, we compare documents using cosine similarity by transforming texts into TF-IDF vectors and calculating their similarity.
A. You can implement cosine similarity in Python using the NumPy or scikit-learn libraries, which provide straightforward calculation methods.
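The TF-IDF approach mentioned in the answer on text analysis can be sketched with scikit-learn as follows. The three sample sentences are made up for illustration, and `TfidfVectorizer` is used with its default settings, which is an assumption about a reasonable starting point rather than a recommendation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative documents: the first two share most words, the third shares none.
docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock markets fell sharply today",
]

# Turn each document into a TF-IDF vector (one row per document).
tfidf = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity between all documents.
sim = cosine_similarity(tfidf)

print(sim[0, 1])  # similarity of the two cat sentences
print(sim[0, 2])  # similarity of a cat sentence and the markets sentence
```

The two cat sentences score much higher against each other than against the third sentence, which shares no words with them and therefore scores 0.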