Data science deals with finding patterns in large collections of data. For that, we need to compare, sort, and cluster data points, often within unstructured data. Similarity and dissimilarity measures are crucial here: they quantify how alike two data points are. In this article, we will explore the different types of distance measures used in data science.
Let’s begin by learning about the different vector distance measures we use in data science.
The Euclidean distance is based on the Pythagorean theorem. For two two-dimensional points u = (u1, u2) and v = (v1, v2), it can be calculated as d = ((v1 - u1)^2 + (v2 - u2)^2)^0.5
This formula can be represented as ||u - v||_2
import scipy.spatial.distance as distance
distance.euclidean([1, 0, 0], [0, 1, 0])
# returns 1.4142
distance.euclidean([1, 5, 0], [7, 3, 4])
# returns 7.4833
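To see the formula at work, the second example can also be computed directly with NumPy (a minimal sketch, assuming NumPy is installed):
import numpy as np
u = np.array([1, 5, 0])
v = np.array([7, 3, 4])
# square the coordinate differences, sum them, and take the square root
np.sqrt(np.sum((u - v) ** 2))
>>> 7.4833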
The Minkowski distance is a more generalized measure for calculating distances, represented by ||u - v||_p. By varying the value of p, we can obtain different distances.
For p = 1 it gives the city block (Manhattan) distance, for p = 2 the Euclidean distance, and as p approaches infinity the Chebyshev distance.
distance.minkowski([1, 5, 0], [7, 3, 4], p=2)
>>> 7.4833
distance.minkowski([1, 5, 0], [7, 3, 4], p=1)
>>> 12
distance.minkowski([1, 5, 0], [7, 3, 4], p=100)
>>> 6
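For comparison, SciPy also exposes the Chebyshev distance directly; it matches the large-p result above, since the largest coordinate difference is 6:
distance.chebyshev([1, 5, 0], [7, 3, 4])
>>> 6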
Statistical similarity in data science is generally measured using the Pearson correlation coefficient.
It measures the linear relationship between two vectors.
import scipy.stats
scipy.stats.pearsonr([1, 5, 0], [7, 3, 4])[0]
>>> -0.5447
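As a quick cross-check (again assuming NumPy is installed), the same coefficient can be read off NumPy's correlation matrix:
import numpy as np
np.corrcoef([1, 5, 0], [7, 3, 4])[0, 1]
>>> -0.5447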
Other correlation metrics for different types of variables are discussed here.
The metrics mentioned above are effective for measuring the distance between numerical values. However, when it comes to text, we employ different techniques to calculate the distance.
To calculate text distance metrics, we can install the required library with:
pip install textdistance[extras]
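After installing, the examples below assume the library has been imported:
import textdistance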
Now let’s look at some edit-based distance measures used in data science.
The Hamming distance measures the number of positions at which the corresponding characters of two equal-length strings differ. To compare strings of unequal length, we can pad the shorter one.
textdistance.hamming('series', 'serene')
>>> 3
textdistance.hamming('AGCTTAG', 'ATCTTAG')
>>> 1
textdistance.hamming.normalized_distance('AGCTTAG', 'ATCTTAG')
>>> 0.1428
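The definition itself is easy to verify by counting mismatched positions directly, for example:
sum(c1 != c2 for c1, c2 in zip('AGCTTAG', 'ATCTTAG'))
>>> 1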
The Levenshtein distance is calculated as the minimum number of corrections needed to convert one string into another. The allowed corrections are insertion, deletion, and substitution of a single character.
textdistance.levenshtein('genomics', 'genetics')
>>> 2
textdistance.levenshtein('datamining', 'dataanalysis')
>>> 8
The Damerau-Levenshtein distance also allows the transposition of two adjacent characters, in addition to the corrections from Levenshtein distance.
textdistance.levenshtein('algorithm', 'algortihm')
>>> 2
textdistance.damerau_levenshtein('algorithm', 'algortihm')
>>> 1
The formula for the Jaro-Winkler similarity is Jaro-Winkler = Jaro + (l × p × (1 − Jaro)), where
l = length of the common prefix (up to 4 characters)
p = scaling factor, typically 0.1
Jaro = 1/3 (m/∣s1∣ + m/∣s2∣ + (m − t)/m), where
∣s1∣ and ∣s2∣ are the lengths of the two strings
m is the number of matching characters, where two characters match if they are the same and no more than max(∣s1∣, ∣s2∣)/2 − 1 positions apart
t is half the number of matching characters that appear in a different order (transpositions).
For example, in the strings “MARTHA” and “MARHTA”, “T” and “H” form a transposition
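As a small worked sketch of the boost, take 'genomics' and 'genetics': their Jaro similarity is about 0.8333 and they share the prefix 'gen', so l = 3. Note that many implementations apply the boost only when the Jaro score exceeds a threshold (commonly 0.7), which is why the first call below keeps its plain Jaro value:
jaro = 0.8333          # Jaro similarity of 'genomics' and 'genetics'
l, p = 3, 0.1          # common prefix 'gen' and the usual scaling factor
jaro + l * p * (1 - jaro)
>>> 0.8833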
textdistance.jaro_winkler('datamining', 'dataanalysis')
>>> 0.6444
textdistance.jaro_winkler('genomics', 'genetics')
>>> 0.8833
Let me introduce you to some token-based distance measures in data science.
The Jaccard index measures similarity between two strings by dividing the number of characters common to both by the total number of characters in their union (intersection over union).
textdistance.jaccard('genomics', 'genetics')
>>> 0.6
textdistance.jaccard('datamining', 'dataanalysis')
>>> 0.375
# The results are similarity fractions between the words.
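The 0.6 above is consistent with counting characters as multisets; a quick sketch with collections.Counter reproduces it:
from collections import Counter
A, B = Counter('genomics'), Counter('genetics')
inter = sum((A & B).values())   # 6 characters in common
union = sum((A | B).values())   # 10 characters in the union
inter / union
>>> 0.6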
The Sørensen–Dice coefficient measures similarity between two sets by dividing twice the size of their intersection by the sum of the sizes of the two sets.
textdistance.sorensen_dice('genomics', 'genetics')
>>> 0.75
textdistance.sorensen_dice('datamining', 'dataanalysis')
>>> 0.5454
The Tversky index is a generalization of the Sørensen–Dice coefficient and the Jaccard index.
Tversky Index(A, B) = |A ∩ B| / (|A ∩ B| + α|A − B| + β|B − A|)
When alpha and beta are both 1, it is the same as the Jaccard index. When they are both 0.5, it is the same as the Sørensen–Dice coefficient. We can change those values depending on how much weight to give mismatches from A and from B, respectively.
textdistance.Tversky(ks=[1,1]).similarity('datamining', 'dataanalysis')
>>> 0.375
textdistance.Tversky(ks=[0.5,0.5]).similarity('datamining', 'dataanalysis')
>>> 0.5454
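To make the weighting explicit, here is a minimal multiset sketch of the formula (the tversky helper is hypothetical, written only for illustration), reproducing the two values above:
from collections import Counter
def tversky(a, b, alpha, beta):
    A, B = Counter(a), Counter(b)
    inter = sum((A & B).values())    # |A ∩ B|
    only_a = sum((A - B).values())   # |A − B|
    only_b = sum((B - A).values())   # |B − A|
    return inter / (inter + alpha * only_a + beta * only_b)
tversky('datamining', 'dataanalysis', 1, 1)
>>> 0.375
tversky('datamining', 'dataanalysis', 0.5, 0.5)
>>> 0.5454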
Cosine similarity measures the cosine of the angle between two non-zero vectors in a multidimensional space: cosine_similarity = A·B / (||A|| × ||B||), where A·B is the dot product and ||A|| and ||B|| are the magnitudes of the vectors.
textdistance.cosine('AGCTTAG', 'ATCTTAG')
>>> 0.8571
textdistance.cosine('datamining', 'dataanalysis')
>>> 0.5477
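The textdistance calls above apply this idea to character counts. As a minimal sketch of the vector formula itself, using NumPy on the numeric vectors from earlier:
import numpy as np
A = np.array([1, 5, 0])
B = np.array([7, 3, 4])
# dot product divided by the product of the magnitudes
A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
>>> 0.5016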
We have now come to the last section of this article where we will explore some of the commonly used sequence-based distance measures.
This is the longest subsequence common to both strings, where a subsequence is obtained by deleting zero or more characters without changing the order of the remaining characters.
textdistance.lcsseq('datamining', 'dataanalysis')
>>> 'datani'
textdistance.lcsseq('genomics is study of genome', 'genetics is study of genes')
>>> 'genics is study of gene'
This is the longest substring common to both strings, where a substring is a contiguous sequence of characters within a string.
textdistance.lcsstr('datamining', 'dataanalysis')
>>> 'data'
textdistance.lcsstr('genomics is study of genome', 'genetics is study of genes')
>>> 'ics is study of gen'
Ratcliff-Obershelp similarity is a measure of similarity between two strings based on the concept of matching subsequences. It calculates the similarity by finding the longest matching substring between the two strings and then recursively finding matching substrings in the non-matching segments. Non-matching segments are the parts of each string to the left and right of the matching substring.
Similarity = 2 × M / (|S1| + |S2|), where M is the total number of matching characters found.
Example:
String 1: datamining, String 2: dataanalysis
Longest matching substring: ‘data’, Remaining segments: ‘mining’ and ‘analysis’ both on right side.
Compare 'mining' and 'analysis': the longest match is now a single character. If the implementation matches the 'i' in 'mining' with the 'i' in 'analysis', the remaining segments are 'm' and 'analys' on the left and 'ning' and 's' on the right, which share no further matching characters.
So, M = 4 + 1 = 5, and similarity = 2×5 / (10+12) ≈ 0.4545
textdistance.ratcliff_obershelp('datamining', 'dataanalysis')
>>> 0.4545
textdistance.ratcliff_obershelp('genomics is study of genome', 'genetics is study of genes')
>>> 0.8679
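The same ratio can be cross-checked with Python's standard library, whose difflib.SequenceMatcher is based on this matching-subsequences idea:
from difflib import SequenceMatcher
SequenceMatcher(None, 'datamining', 'dataanalysis').ratio()
>>> 0.4545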
These are some of the commonly used similarity and distance metrics in data science. A few others include the Smith-Waterman algorithm based on dynamic programming, compression-based measures such as normalized compression distance, and phonetic algorithms like the match rating approach.
Learn more about these similarity measures here.
Similarity and dissimilarity measures are crucial in data science for tasks like clustering and classification. This article explored various metrics: Euclidean and Minkowski distances for numerical data, Pearson correlation for statistical relationships, Hamming and Levenshtein distances for text, and methods like Jaro-Winkler, the Tversky index, and Ratcliff-Obershelp similarity for more nuanced comparisons.
A. Euclidean distance is a measure of the straight-line distance between two points in a multidimensional space, commonly used in clustering and classification tasks to compare numerical data points.
A. Levenshtein distance measures the number of insertions, deletions, and substitutions needed to transform one string into another, while Hamming distance only counts character substitutions and requires the strings to be of equal length.
A. Jaro-Winkler distance measures the similarity between two strings, giving higher scores to strings with matching prefixes. It is particularly useful for comparing names and other text data with common prefixes.
A. Cosine Similarity is ideal for comparing document vectors in high-dimensional spaces, such as in information retrieval, text mining, and clustering tasks, where the orientation of vectors (rather than their magnitude) is important.
A. Token-based similarity measures, like Jaccard index and Sørensen-Dice coefficient, compare the sets of tokens (words or characters) in strings. They are important for tasks where the presence and frequency of specific elements are crucial, such as in text analysis and document comparison.