This article was published as a part of the Data Science Blogathon
DBSCAN(Density-Based Spatial Clustering Application with Noise), an unsupervised machine learning technique is one of the density-based clustering algorithms which is heavily used when we want to deal with the outliers and want the clusters of any random(arbitrary) shapes and sizes.
Therefore it becomes necessary for every aspiring Data Scientist and Machine Learning Engineer to have a good knowledge of this algorithm.
In this article, we will discuss the most important questions on the DBSCAN Clustering Algorithm which is helpful to get you a clear understanding of the algorithm, and also for Data Science Interviews, which covers its very fundamental level to complex concepts.
DBSCAN also known as Density-Based Spatial Clustering Application with Noise is an unsupervised machine learning algorithm that forms the clusters based upon the density of the data points or how close the data is.
As a result, the points which are outside the dense regions are excluded and considered as the noisy points or outliers. This characteristic of the DBSCAN algorithm makes it a perfect fit for outlier detection and making clusters of any random shapes and sizes.
This algorithm works on a parametric approach that uses two parameters i.e., eps(epsilon) and min_pts.
The major steps followed during the DBSCAN algorithm are as follows:
Step-1: Decide the value of the parameters eps and min_pts.
Step-2: For each data point(x) present in the dataset:
Step-3: For each core point, if it is not already assigned to a cluster then create a new cluster. Further, all the neighbouring points are recursively determined and are assigned the same cluster as that of the core point
Step-4: Repeat the above steps until all the points are visited.
Density-based models are the types of models that try to search the data space for the areas of the varied density of data points. It separates different density regions by assigning the data points within these regions in the same cluster.
For Example,
Density-based clustering: Creates clusters according to the density measurement since clusters have a higher density than the rest of the dataset which is observed by calculating the density in the data space.
Partition-based(K-means) and Hierarchical clustering techniques are highly efficient with normal-shaped clusters while density-based techniques are efficient in arbitrary-shaped clusters or detecting outliers.
For example, if we set the parameter min_pts to 4, then a given point needs to have 3 or more neighbouring data points to become a core data point.
Moreover, if the minimum number of points meet the epsilon distance requirement then they are considered as a cluster.
On the contrary, Connectivity involves a transitivity based chaining-approach to determine whether points are located in a particular cluster.
For example, Two points x and y points could be connected if x->r->s->t->y where a->b implies that b is in the neighbourhood of a.
If epsilon is too small: In such cases, we define the sparser clusters as noise i.e, result in the elimination of sparse clusters as outliers.
If epsilon is too large: In such cases, the denser clusters may be merged together, which gives the incorrect clusters.
Yes, the DBSCAN algorithm is very sensitive to the values of epsilon and min_pts. Therefore, it is crucial to understand how to choose the values of epsilon and min_pts. A minor variation in these values can change the results significantly produced by the DBSCAN algorithm.
There is no estimation for this parameter, but the distance functions need to be chosen appropriately based on the nature of the data set.
DBSCAN uses Euclidean distance by default, although other methods can also be used.
For Example, For geographical data, we use the great circle distance.
– Direct Density Reachable
– Density Reachable
– Density Connected
Direct density reachable: A point is called direct density reachable if it has a core point in its neighbourhood.
Density Reachable: A point is known as density reachable from another point if they are connected through a series of core points.
Density Connected: Two points are called density connected if there is a core point that is density reachable from both points.
In other words, we can say that if the number of neighbourhood points around x is greater or equal to MinPts then x is treated as a core point, if the neighbourhood points around x are less than MinPts but are close to a core point then x is treated as a border point. If x is neither a core point nor a border point then x is considered as a noisy point or an outlier.
Therefore in general smaller values of epsilon are preferable and usually, only a small fraction of points remain within this distance of each other.
Core Point: The data point x is the core point since it has at least min_pts (n) within epsilon (eps) distance.
Border Point: The data point y is the border point since it has at least one core point within epsilon (eps) distance and lower than min_pts (n) within epsilon (eps) distance from it.
Noise Point: The data point z is the noise point since it has no core points within epsilon (eps) distance.
To overcome such problems DBSCAN is used as it produces more reasonable results than k-means across a variety of different distributions. We can elucidate this through the following fact:
Image Source: Google Images
Best Case: If we use a spatial indexing system to store the dataset like kd-tree or r-tree such that neighbourhood queries are executed in logarithmic time, then we get O(N log N) runtime complexity.
Worst Case: Without the use of index structure or on degenerated data the worst-case run time complexity remains O(N2).
Average Case: It is the same as the best/worst case depending on data and implementation of the algorithm.
Case-1: The low value of min-Pts i.e, min_pts=1 does not make sense, as then every point on its own will already be a cluster.
Case-2: When min_pts ≤ 2, the result will be the same as that of the hierarchical clustering with the metric as a single link, having dendrogram cut at a height of epsilon.
Therefore, min_pts must be chosen at least 3.
However, larger values of min_pts are usually better when the data sets are having noise and as a result, it will yield more significant clusters.
Generally, as a rule of thumb min_pts = 2*D can be used.
To choose larger values, it may be necessary that the:
1. It does not need a predefined number of clusters i.e, not require an initial specification of the number of clusters.
2. Basically, clusters can be of any random shape and size, including non-spherical ones.
3. It is able to identify noise data, popularly known as outliers.
4. Unlike K means, In DBSCAN the user does not give the number of clusters to be generated as input to the algorithm.
5. DBSCAN can find any shape of clusters.
6. Also, the cluster doesn’t have to be circular.
7. Density-based clustering algorithms don’t include the outliers in any of the clusters. Outliers were considered as noise while clustering and hence they will be eliminated from the cluster after the algorithm completion.
1. DBSCAN clustering will fail when there are no density drops between clusters.
2. It seems to be difficult to detect outlier or noisy points if there is a variation in the density of the clusters.
3. It is sensitive to parameters i.e. it’s hard to determine the correct set of parameters.
4. Distance metric also plays a vital role in the quality of the DBSCAN algorithm
5. With high dimensional data, it does not give effective clusters
6. Not partitionable for multiprocessor systems.
Below given is an image in which the value of min_pts is 4. Accordingly, find the value of (x+y+z+t).
where,
x = Sum of the alphabet positions according to the English dictionary of the core points
y = Sum of the alphabet positions according to the English dictionary of the noisy points
z = Max(alphabet position according to the English dictionary of the border points)
t = Number of border points
For Example, In the series of the alphabet, D’s position is 4, T’s position is 20, etc.
Try to solve the Practice Question and answer it in the comment section below.
For any further queries feel free to contact me.
Thanks for reading!
I hope you enjoyed the questions and were able to test your knowledge about DBSCAN Clustering Algorithm.
If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link
Please feel free to contact me on Linkedin, Email.
Something not mentioned or want to share your thoughts? Feel free to comment below And I’ll get back to you.
Currently, I pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology Jodhpur(IITJ). I am very enthusiastic about Machine learning, Deep Learning, and Artificial Intelligence.
The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.
x+y+z+t = 50 Thanks for the article Chirag!