Similarity and Dissimilarity Measures in Data Science

Introduction

Data science deals with finding patterns in large collections of data. For that, we need to compare, sort, and cluster data points within unstructured data. Similarity and dissimilarity measures are crucial in data science because they quantify how alike two data points are. In this article, we will explore the different types of distance measures used in data science.

Overview

  • Understand the use of distance measures in data science.
  • Learn the different types of similarity and dissimilarity measures used in data science.
  • Learn how to implement more than 10 different distance measures in data science.

Vector Distance Measures in Data Science

Let’s begin by learning about the different vector distance measures we use in data science.

Euclidean Distance

This is based on the Pythagorean theorem. For two points u and v in two dimensions, it can be calculated as d = ((v1 − u1)^2 + (v2 − u2)^2)^0.5.

This formula can be represented as ||u − v||_2, the L2 norm of the difference vector.

import scipy.spatial.distance as distance

distance.euclidean([1, 0, 0], [0, 1, 0])
# returns 1.4142

distance.euclidean([1, 5, 0], [7, 3, 4])
# returns 7.4833

Minkowski Distance

This is a more generalized measure for calculating distances, which can be represented as ||u − v||_p. By varying the value of p, we can obtain different distances.

For p = 1, we get the city block (Manhattan) distance; for p = 2, the Euclidean distance; and as p approaches infinity, the Chebyshev distance.

distance.minkowski([1, 5, 0], [7, 3, 4], p=2)
>>> 7.4833

distance.minkowski([1, 5, 0], [7, 3, 4], p=1)
>>> 12

distance.minkowski([1, 5, 0], [7, 3, 4], p=100)
>>> 6
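
To see how p interpolates between these distances, here is a minimal pure-Python sketch (an illustration, not SciPy's implementation):

def minkowski(u, v, p):
    # Sum the p-th powers of the coordinate differences, then take the p-th root
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

print(minkowski([1, 5, 0], [7, 3, 4], 1))    # 12.0 (Manhattan)
print(minkowski([1, 5, 0], [7, 3, 4], 2))    # ~7.4833 (Euclidean)
print(minkowski([1, 5, 0], [7, 3, 4], 100))  # ~6.0, approaching max|a - b| (Chebyshev)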

Statistical Similarity in Data Science

Statistical similarity in data science is generally measured using the Pearson correlation.

Pearson Correlation

It measures the linear relationship between two vectors.

For vectors x and y, the correlation coefficient is r = Σ(xᵢ − x̄)(yᵢ − ȳ) / ( √Σ(xᵢ − x̄)² × √Σ(yᵢ − ȳ)² ).
import scipy.stats

scipy.stats.pearsonr([1, 5, 0], [7, 3, 4])[0]
>>> -0.544
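
As a sanity check, we can compute the same value directly from the formula with NumPy:

import numpy as np

x = np.array([1, 5, 0])
y = np.array([7, 3, 4])
dx, dy = x - x.mean(), y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(r)  # -0.5447..., matching pearsonr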

Other correlation metrics for different types of variables are discussed here.

The metrics mentioned above are effective for measuring the distance between numerical values. However, when it comes to text, we employ different techniques to calculate the distance.

To calculate text distance metrics, we can install the required library with:

pip install textdistance[extras]

and import it before running the examples below:

import textdistance

Edit-based Distance Measures in Data Science

Now let’s look at some edit-based distance measures used in data science.

Hamming Distance

It measures the number of positions at which the characters of two equal-length strings differ.

If we want to calculate it for unequal-length strings, we can pad the shorter string, for example by adding prefix characters (see the sketch after the examples below).

textdistance.hamming('series', 'serene')
>>> 3

textdistance.hamming('AGCTTAG', 'ATCTTAG')
>>> 1

textdistance.hamming.normalized_distance('AGCTTAG', 'ATCTTAG')
>>> 0.1428
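
A minimal sketch of that padding idea (the padding character and left-padding strategy here are illustrative assumptions, not a standard):

def hamming_padded(s1, s2, pad='*'):
    # Left-pad the shorter string so both have equal length,
    # then count the positions where the characters differ
    n = max(len(s1), len(s2))
    s1, s2 = s1.rjust(n, pad), s2.rjust(n, pad)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_padded('series', 'serene'))  # 3, unchanged for equal lengths
print(hamming_padded('gene', 'genome'))    # 5, comparing '**gene' with 'genome'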

Levenshtein Distance

It is calculated based on how many corrections are needed to convert one string to another. The allowed corrections are insertion, deletion, and substitution.

textdistance.levenshtein('genomics', 'genetics')
>>> 2

textdistance.levenshtein('datamining', 'dataanalysis')
>>> 8
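
Under the hood, this is a classic dynamic-programming recurrence; a compact sketch:

def levenshtein(s1, s2):
    # prev[j] = edit distance between the prefix of s1 processed so far
    # and the first j characters of s2
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            # Cheapest of insertion, deletion, or substitution
            # (substitution is free when the characters match)
            curr.append(min(curr[j - 1] + 1,
                            prev[j] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

print(levenshtein('genomics', 'genetics'))        # 2
print(levenshtein('datamining', 'dataanalysis'))  # 8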

Damerau-Levenshtein

It also includes the transposition of two adjacent characters in addition to the corrections from Levenshtein distance.

textdistance.levenshtein('algorithm', 'algortihm')
>>> 2

textdistance.damerau_levenshtein('algorithm', 'algortihm')
>>> 1

Jaro-Winkler Distance

The formula to measure this is Jaro-Winkler = Jaro + (l × p × (1 − Jaro)), where
l = length of the common prefix (up to 4 characters)
p = scaling factor, typically 0.1

Jaro = 1/3 × (m/∣s1∣ + m/∣s2∣ + (m − t)/m), where
∣s1∣ and ∣s2∣ are the lengths of the two strings,
m is the number of matching characters, i.e., characters common to both strings that are no more than ⌊max(∣s1∣, ∣s2∣)/2⌋ − 1 positions apart,
t is the number of transpositions, i.e., half the number of matching characters that appear in a different order.

For example, in the strings “MARTHA” and “MARHTA”, “T” and “H” are transpositions
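
Working that example through by hand: all six characters match (m = 6) and the "T"/"H" pair is out of order (t = 1), so Jaro = 1/3 × (6/6 + 6/6 + 5/6) ≈ 0.9444; with common prefix "MAR" (l = 3, p = 0.1), Jaro-Winkler ≈ 0.9444 + 3 × 0.1 × (1 − 0.9444) ≈ 0.9611. We can confirm with textdistance (assuming its jaro helper alongside jaro_winkler):

textdistance.jaro('MARTHA', 'MARHTA')
>>> 0.9444

textdistance.jaro_winkler('MARTHA', 'MARHTA')
>>> 0.9611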

textdistance.jaro_winkler('datamining', 'dataanalysis')
>>> 0.6444

textdistance.jaro_winkler('genomics', 'genetics')
>>> 0.8833

Token-based Distance Measures in Data Science

Let me introduce you to some token-based distance measures in data science.

Jaccard Index

This measures the similarity between two strings by dividing the number of characters common to both by the total number of characters in both (intersection over union). textdistance treats each string as a multiset (bag) of characters, so repeated characters are counted.

textdistance.jaccard('genomics', 'genetics')
>>> 0.6

textdistance.jaccard('datamining', 'dataanalysis')
>>> 0.375

# The results are the similarity fractions between the words.

Sørensen–Dice Coefficient

It measures the similarity between two sets by dividing twice the size of their intersection by the sum of the sizes of the two sets: 2∣A∩B∣ / (∣A∣ + ∣B∣).

textdistance.sorensen_dice('genomics', 'genetics')
>>> 0.75

textdistance.sorensen_dice('datamining', 'dataanalysis')
>>> 0.5454

Tversky Index

It is a generalization of the Sørensen–Dice coefficient and the Jaccard index.

Tversky Index(A, B) = ∣A∩B∣ / (∣A∩B∣ + α∣A−B∣ + β∣B−A∣)

When α and β are both 1, it is the same as the Jaccard index. When they are both 0.5, it is the same as the Sørensen–Dice coefficient. We can change those values depending on how much weight to give to mismatches from A and B, respectively.

textdistance.Tversky(ks=[1,1]).similarity('datamining', 'dataanalysis')
>>> 0.375

textdistance.Tversky(ks=[0.5,0.5]).similarity('datamining', 'dataanalysis')
>>> 0.5454
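
To make the set relationships concrete, here is a small Counter-based sketch of the Tversky index over character multisets (an illustration, not the library's implementation):

from collections import Counter

def tversky(s1, s2, alpha, beta):
    a, b = Counter(s1), Counter(s2)
    inter = sum((a & b).values())   # characters common to both (with repeats)
    only_a = sum((a - b).values())  # characters only in s1
    only_b = sum((b - a).values())  # characters only in s2
    return inter / (inter + alpha * only_a + beta * only_b)

print(tversky('datamining', 'dataanalysis', 1, 1))      # 0.375, the Jaccard index
print(tversky('datamining', 'dataanalysis', 0.5, 0.5))  # 0.5454..., Sørensen–Dice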

Cosine Similarity

This measures the cosine of the angle between two non-zero vectors in a multidimensional space: cosine_similarity = A·B / (∣∣A∣∣ × ∣∣B∣∣), where A·B is the dot product and ∣∣A∣∣, ∣∣B∣∣ are the magnitudes. For strings, textdistance computes the set-based analogue ∣A∩B∣ / √(∣A∣ × ∣B∣) over character multisets (the Otsuka–Ochiai coefficient), which is consistent with the outputs below.

textdistance.cosine('AGCTTAG', 'ATCTTAG')
>>> 0.8571

textdistance.cosine('datamining', 'dataanalysis')
>>> 0.5477
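
A quick Counter-based check of that multiset formulation (illustrative, but consistent with the outputs above):

from collections import Counter
from math import sqrt

def cosine_chars(s1, s2):
    a, b = Counter(s1), Counter(s2)
    # Multiset intersection size over the geometric mean of the string lengths
    return sum((a & b).values()) / sqrt(len(s1) * len(s2))

print(cosine_chars('AGCTTAG', 'ATCTTAG'))          # 0.8571...
print(cosine_chars('datamining', 'dataanalysis'))  # 0.5477...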

Sequence-based Distance Measures in Data Science

We have now come to the last section of this article where we will explore some of the commonly used sequence-based distance measures.

Longest Common Subsequence

This is the longest subsequence common to both strings, where a subsequence is obtained by deleting zero or more characters without changing the order of the remaining characters.

textdistance.lcsseq('datamining', 'dataanalysis')
>>> 'datani'


textdistance.lcsseq('genomics is study of genome', 'genetics is study of genes')
>>> 'genics is study of gene'
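
For intuition, the subsequence itself can be recovered with a short memoized recursion (a sketch, not textdistance's implementation):

from functools import lru_cache

def lcs_seq(s1, s2):
    @lru_cache(maxsize=None)
    def rec(i, j):
        # Longest common subsequence of s1[i:] and s2[j:]
        if i == len(s1) or j == len(s2):
            return ''
        if s1[i] == s2[j]:
            return s1[i] + rec(i + 1, j + 1)
        a, b = rec(i + 1, j), rec(i, j + 1)
        return a if len(a) >= len(b) else b
    return rec(0, 0)

print(lcs_seq('datamining', 'dataanalysis'))  # 'datani'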

Longest Common Substring

This is the longest substring common to both strings, where a substring is a contiguous sequence of characters within a string.

textdistance.lcsstr('datamining', 'dataanalysis')
>>> 'data'

textdistance.lcsstr('genomics is study of genome', 'genetics is study of genes')
>>> 'ics is study of gen'
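
A matching dynamic-programming sketch for the substring variant, which tracks common suffixes instead of subsequences:

def lcs_str(s1, s2):
    best, best_end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i, c1 in enumerate(s1):
        curr = [0] * (len(s2) + 1)
        for j, c2 in enumerate(s2):
            if c1 == c2:
                # Extend the common suffix ending at s1[i] and s2[j]
                curr[j + 1] = prev[j] + 1
                if curr[j + 1] > best:
                    best, best_end = curr[j + 1], i + 1
        prev = curr
    return s1[best_end - best:best_end]

print(lcs_str('datamining', 'dataanalysis'))  # 'data'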

Ratcliff-Obershelp Similarity

A measure of similarity between two strings based on the concept of matching subsequences. It calculates the similarity by finding the longest matching substring between the two strings and then recursively finding matching substrings in the non-matching segments. Non-matching segments are taken from the left and right parts of the string after dividing the original strings by the matching substring.

Similarity = 2×M / (∣S1∣ + ∣S2∣), where M is the total number of characters in all the matching substrings.

Example:

String 1: datamining, String 2: dataanalysis

Longest matching substring: 'data'. Remaining segments: 'mining' and 'analysis', both on the right side.

Comparing 'mining' and 'analysis', the longest matching substring is a single character; the implementation matches 'i', leaving 'm' vs 'analys' on the left and 'ning' vs 's' on the right. Neither pair has any further matching characters.

So M = 4 + 1 = 5, and the similarity is 2×5 / (10 + 12) = 0.4545.

textdistance.ratcliff_obershelp('datamining', 'dataanalysis')
>>> 0.4545

textdistance.ratcliff_obershelp('genomics is study of genome', 'genetics is study of genes')
>>> 0.8679
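
Python's standard library offers a close variant via difflib.SequenceMatcher, whose ratio() is documented as following Ratcliff-Obershelp-style "gestalt pattern matching" with the same 2×M / (∣S1∣ + ∣S2∣) formula, so we can cross-check without third-party packages:

from difflib import SequenceMatcher

SequenceMatcher(None, 'datamining', 'dataanalysis').ratio()
>>> 0.4545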

These are some of the commonly used similarity and distance metrics in data science. A few others include the Smith-Waterman algorithm based on dynamic programming, the compression-based normalized compression distance, and phonetic algorithms like the match rating approach.

Learn more about these similarity measures here.

Conclusion

Similarity and dissimilarity measures are crucial in Data Science for tasks like clustering and classification. This article explored various metrics: Euclidean and Minkowski distances for numerical data, Pearson correlation for statistical relationships, Hamming and Levenshtein distances for text, and advanced methods like Jaro-Winkler, Tversky index, and Ratcliff-Obershelp similarity for nuanced comparisons, enhancing analytical capabilities.

Frequently Asked Questions

Q1. What is the Euclidean distance and how is it used in Data Science?

A. Euclidean distance is a measure of the straight-line distance between two points in a multidimensional space, commonly used in clustering and classification tasks to compare numerical data points.

Q2. How does the Levenshtein distance differ from the Hamming distance?

A. Levenshtein distance measures the number of insertions, deletions, and substitutions needed to transform one string into another, while Hamming distance only counts character substitutions and requires the strings to be of equal length.

Q3. What is the purpose of the Jaro-Winkler distance?

A. Jaro-Winkler distance measures the similarity between two strings, giving higher scores to strings with matching prefixes. It is particularly useful for comparing names and other text data with common prefixes.

Q4. When should I use Cosine Similarity in text analysis?

A. Cosine Similarity is ideal for comparing document vectors in high-dimensional spaces, such as in information retrieval, text mining, and clustering tasks, where the orientation of vectors (rather than their magnitude) is important.

Q5. What are token-based similarity measures and why are they important?

A. Token-based similarity measures, like Jaccard index and Sørensen-Dice coefficient, compare the sets of tokens (words or characters) in strings. They are important for tasks where the presence and frequency of specific elements are crucial, such as in text analysis and document comparison.
