- Open Access
- Total Downloads : 20
- Authors : Harsh Shinde, Amruta Sankhe
- Paper ID : IJERTCONV5IS01027
- Volume & Issue : ICIATE – 2017 (Volume 5 – Issue 01)
- Published (First Online): 24-04-2018
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Comparison of Enhanced DBSCAN Algorithms: A Review
Harsh Shinde
-
Student of Computer Engineering Atharva College of Engineering, Mumbai University
Mumbai, MH, India
Amruta Sankhe
Prof. of Computer Engineering
Atharva College of Engineering, Mumbai University
Mumbai, MH, India
AbstractWhile data containing any type of information is getting ubiquitous, proper processing of such data is very much obliged. To classify spatial data, various clustering algorithms have been invented. DBSCAN (Density Based Spatial Clustering of Application with Noise) is one of the consummate clustering algorithm with respect to discovery of arbitrarily shaped cluster in a spatial datasets. Due to its flexibility and tremendous research potential, DBSCAN algorithm is one of the most cited in scholarly literature. Considering the fact that most researchers tried to experiment and evaluate the algorithm, certain modifications of DBSCAN were made in order to have more efficient outcomes or to reduce time complexities. This paper discusses such modified algorithms and how they surpass the original DBSCAN in certain ways. Thorough analysis of each algorithm is mentioned and their critical evaluation is done accordingly. These algorithms are then comparatively evaluated with regards to various parameters.
Index TermsClustering algorithms, DBSCAN, DDCAR, RDD-DBSCAN, FastDBSCAN, DDBSCAN, DSets-DBSCAN
-
Introduction
Spatial data management needs proper evaluation and processing of any information to generate useful and desired data. For such implementation of data processing, the technique of Knowledge Discovery of Data (KDD), which is also known as Data Mining, is widely used. Clustering is a popularly used data mining process which classifies spatially represented datasets [1]. This classification of datasets results in formation of similar group of objects, possessing similar properties. To classify such type of data, a number of clustering algorithms have been invented and implemented, making classification of data into arbitrarily shaped, similar datasets called clusters. These clustering algorithms help to extract useful information generally in the form of patterns. Density- based clustering techniques are used to mine information containing large datasets.
Due to emerging trends of big data, density-based clustering algorithms have been mostly cited in scientific literature due to its tremendous research potential and also possessing vast choices of enhancement in its original algorithms. Although the characteristics of other algorithms might work well with similar datasets, density based clustering algorithms is more efficient with respect to any variation of datasets. Out of many density-based clustering algorithms such as DBSCAN, DENCLUE, DBCLASD,
OPTICS, etc [8][9], DBSCAN (Density-Based Spatial Clustering of Application with Noise) is one of the most universally acclaimed. DBSCAN and other density-based algorithms are important because they are unaffected by noise points and can handle clusters of various shapes and sizes. They are a lot of clusters that DBSCAN can find that other algorithms would not be able to find. DBSCAN searches for core objects, i.e., objects having dense neighborhood. DBSCAN has a number of applications ranging from machine learning library to weather analysis or at atmospheric science on a national scale [3]. Due to its vast experimental potential, many researchers have cited and made some changes in the original algorithm. Some algorithm tries to reduce the computational process while other algorithm reduces time complexity over substantially varied datasets.
Researchers have been interested in DBSCAN since its proposal as this algorithm had some massive potential towards various data driven applications. Considering the emerging trends of big data and data mining tools, the applications of DBSCAN is also bringing effective results for various sets of data. DBSCAN possesses various limitations with respect to the variation and the size of datasets. Following such limitations, various advanced algorithms were invented for overcoming different types of shortcomings which the original DBSCAN possessed. These changes were made to enhance the restraints put forth by DBSCAN; some increase the effectiveness of the algorithm, while others produces similar results as the original algorithm but decreasing the time complexity taken by the DBSCAN [5].
The following paper gives an insight towards some recent changes made in the original DBSCAN. These changes are subjective to the overall development of the original algorithm and to overcome various drawbacks associated with DBSCAN. The algorithms discussed in this paper, providing profound understanding of each, are stated as follows: DBSCAN [2], DDCAR [3], RDD-DBSCAN [4], FastDBSCAN [5], DDBSCAN [6] and DSets-
DBSCAN [7]. Each of the given algorithms is thoroughly analyzed and their critical evaluation is done accordingly. Moreover, a separate comparison is made with regards to certain parameters, to obtain a holistic view of the changes made by each of the algorithms.
The rest of the paper is organized as follows. We discuss the original DBSCAN algorithm in Section 2. In
Section 3, we present the summary and elaboration of different algorithms which provide certain modifications and improvements to the original DBSCAN algorithm. Section 4 provides the conclusion of our overall paper.
-
DBSCAN
DBSCAN ( Density-Based Spatial Clustering of Application with Noise ) is a density-based data clustering algorithm which was proposed by Martin Ester, Hans-Peter Kriegal, Jörg Sander and Xiaowei Xu in 1996 [2]. The main purpose of the algorithm is to connect core objects (objects having dense neighborhoods) and their neighbors
NeighborPts'
}
}
if P' is not visited
{ mark P' as visited
NeighborPts' = regionQuery(P', eps) if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with
}
if P' is not yet member of any cluster add P' to cluster C
to form dense regions as clusters. For a set of points in space, the algorithm groups together closely packed points and the points in the low density region are called outliers. Outliers can be used to detect irrelevant information, usually produced during fraud detection. This algorithm is flexible and dynamic with respect to data.
The steps of DBSCAN are given in the following points:
-
For a point, make n-dimensional sphere of radius Eps, for n-dimensional datasets.
-
Count the number of data points within the sphere. Indicate the value as p.
-
If p>min_pts, min_pts being the minimum points that should be present within the radius Eps, mark the center point to be a part of the cluster; also mark points inside Eps as a part of the cluster.
-
Repeat this step to the other points in the sphere except the center, to expand the cluster.
-
If p<min_pts, ignore the point p and proceed to another point in the dataset.
Following is a pseudo-code algorithm of DBSCAN
[10]:DBSCAN(D, eps, MinPts)
{ C = 0
for each point P in dataset D
{ if P is visited continue next point mark P as visited
NeighborPts = regionQuery(P, eps) if sizeof(NeighborPts) <
MinPts mark P as NOISE else {
C = next cluster
expandCluster(P, NeighborPts, C, eps,
MinPts)
}
}
}
expandCluster(P, NeighborPts, C, eps, MinPts)
{ add P to cluster C
for each poit P' in NeighborPts {
regionQuery(P, eps){ return all points within P's eps
neighborhood (including P)
}
The overall average runtime complexity achieved by this algorithm is O(nlogn). The worst case runtime complexity is O(n²).
-
-
LITERATURE REVIEW
In this section, we will discuss and evaluate the different modifications of the original DBSCAN algorithm. Each of the following algorithms pinpoints various flaws on the implementation of DBSCAN algorithm. These flaws were then solved using certain approach towards each of the following algorithms. Complete evaluation of modified algorithms and relevant future work, if any, are also discussed. The following algorithms are the improved version of the DBSCAN:
DDCAR [3] (Data Density clustering using Automated Radii) is a fully autonomous data density based clustering technique. This technique was proposed by Richard Hyde, Plamen Angelov in 2014. The algorithm automatically determines the number of clusters and derives suitable initial radii. This algorithm is non iterative, i.e., it assigns each sample to the appropriate cluster only once. The main advantages of this study over DBSCAN [2] are:
-
No prior knowledge of the number of cluster is required.
-
No initial radii or other user input required.
-
No knowledge of data density required.
-
Assignment of data to their original cluster is done accurately.
-
Time complexity is reduced and it is dependent on the clustering data.
RDD-DBSCAN [4] is an algorithm proposed by Irving Cordova and Teng-Sheng Moh in 2015. This algorithm addresses large datasets utility of DBSCAN as it is not efficient while working with Resilient Distributed Datasets, which are a fast data processing abstraction created directly for in-memory computation of large datasets. Experimentally, RDD-DBSCAN scales as the number of nodes in a given cluster scale, while generating the same results as the sequential version of DBSCAN. The main advantages of this study over DBSCAN [2] are:
-
Overcoming the scalability limitations by operating in a fully distributed fashion.
-
The algorithm scales as the number of nodes in a given cluster scales, while generating similar results as the sequential version of DBSCAN.
-
Improving in-computation memory and is more efficient.
Future work can be done in the following areas:
-
Loading complete datasets for a given partition into memory.
-
Selection of better partitioning scheme for the data space.
FastDBSCAN [5] is proposed by Vu Viet Thang,
-
Pantiukhin and A.I. Galushkin in 2015. This algorithm divides the data into k partitions (using k- means), then uses a min-max method to select points for DBSCAN clustering. It is a hybrid clustering algorithm as it uses a combination of k-means, min-max method and DBSCAN to recover the final clusters and outliers. The main advantages of this study over DBSCAN [2] are:
-
Overcoming quadratic time complexity (O(n2))
processed in DBSCAN.
-
Improves computational time and accuracy. Directions for further research are given as follows:
-
Developing graph based clustering based on k- means.
-
Applying the algorithm in intrusion detection system datasets.
-
DDBSCAN [6] (Different Densities-Based Spatial Clustering of Applications with Noise) is a modified version of DBSCAN [2]. It uses new concepts to deal with spatial
datasets. This algorithm was proposed by M.F. Hassanin,
-
Hassan and Abdalla Shoeb in 2015. The idea behind this algorithm is to define a density factor to the cluster and the object then define a threshold parameter as decision criteria to determine whether joining this object or not. On applying this technique, any cluster will contain indistinguishable density nodes. The main advantages of this study over DBSCAN [1] are:
-
Multi-density cluster handling is enhanced.
-
Effective clustering of adjacent clusters.
-
Clustering of noise points amongst adjacent clusters.
Dsets-DBSCAN [7] is a parameter free hybrid clustering algorithm proposed by Jian Hou, Huijun Gao and Xuelong Li. This algorithm is mainly used in data clustering and image segmentation experiments. The algorithm functions in two main steps. First, we apply histogram equalization to similarity matrices before using it in clustering, to eliminate the regulation parameter. This makes the algorithm parameter-free. The next step is to Run Dsets clustering followed by clustering based on DBSCAN and extract cluster sequentially. The main advantages of this study over DBSCAN [2] are:
-
Considering careful parameter tuning, this algorithm performs better than conventional DBSCAN.
-
Efficiency is increased with regards to data clustering.
-
TABLE I.
COMPARISON OF ENHANCED DBSCAN ALGORITHMS
Algorithm
DBSCAN
DDCAR
RDD-DBSCAN
FastDBSCAN
DDBSCAN
Dset-DBSCAN
Features
Grouping
Automatically
A modified
Divides the data in
Computing the density of a
Application of histogram
together closely
determines the
DBSCAN algorithm
k partitioning
cluster with respect to radius
equalization to pairwise
packed points,
number of clusters
which takes full
(using k-means),
value Eps and Min_pts.
similarity of input data.
called clusters.
and derives suitable
advantage of
then using a min-
Then provide density
This makes the Dsets
data-driven radii by
Apache Sparks
max method to
threshold which is
results independent of
the use of recursive
parallel capabilities
select points for
responsible for joining a
user specified
density equation.
implemented in
DBSCAN
point to a certain cluster or
parameters. Then extend
Scala programming
clustering.
not.
clusters from Dsets with
language and run in
DBSCAN.
top of Apache
Spark.
Advantages
Finds arbitrarily
No prior
Addresses large
Overcomes
Overcomes the problems of:
Parameter-free clustering
shaped clusters.
knowledge of the
datasets utility of
quadratic time
Multi-density cluster
algorithm.
Robust to
number of clusters
DBSCAN as it is not
complexity
discovery.
Prevents over-
outliers.
is required.
efficient while
processed in
Adjacent cluster discovery.
segmentation of arbitrary
No initial radi or
working with RDDs.
DBSCAN.
Noise points amongst
shape.
other user input
Overcomes
Improves
clusters.
Effective in data
required.
scalability issues of
computational time.
clustering.
No knowledge of
the traditional
Improves clustering
the density of the
DBSCAN algorithm
accuracy.
data is required.
by operating in a
Fewer calculation
fully distributed
by the use of
fashion.
recursive density
Efficient
equation.
performance on
Efficiency is
Apache Spark
significant with
platform.
larger datasets and
Improved memory
big data.
management.
Improved
Efficiency.
Disadvantages
Complex
In smaller datasets
Limitation is the
Needs more
–
Careful parameter tuning
computational
the time advantage
need to load the
research to interpret
is required for optimum
costs.
may be reduced
complete dataset for
these advantages of
results.
Prior knowledge
due carrying out
a given partition into
partitioning-based
of the number of
the density
memory.
clustering and
clusters is
calculation twice
Selection of better
density-based
required.
per cluster when
partitioning scheme
clustering for
Knowledge of
adjusting the radii.
for the data space is
constructing hybrid
data density
Algorithm is non-
unknown.
clustering.
required.
iterative, assigns
Improvement on n-
Scalability
each sample to the
dimensional datasets
limitation.
appropriate cluster
(n>1) is unknown.
Multi-density
only once.
cluster
Time complexity is
discovery.
dependent on the
Noise points
clustering data.
amongst adjacent
clusters.
Adjacent cluster
discovery.
Uses heavy user
specified
parameters.
Time
Average runtime
Varies. Depends on
O(n*logn)
Linear
time
–
–
complexity
complexity :
the size of the
complexity.
O(nlogn)
dataset.
Worst case
runtime
complexity :
O(n²)
Non matrix
based
implementation
complexity :
O(n)
Parameters
2
(Eps
and
None
3(Eps, Min_pts,
3(Input dataset D,
3 (Eps,Min_pts,Density
3 parameters for initial
needed
Min_pts)
Max_pts)
number of cluster
Threshold)
Histogram Equalization
for k-means k,
technique(Dataset D,
proportion of data
Eps, Min_pts), none for
t)
Dsets+DBSCAN.
Application
Fraud detection
Atmospheric
Mainly used in
Intrusion detection
Simulation
tasks.
(Testing
Data Clustering.
and uses
systems, data
sciences by using
Apache Spark.
system by
with
real
and
artificial
Image segmentation
analysis.
Hyper Pole to Pole
clustering.
datasets.)
td>
Observations
Experimenting on
(HIPPO) datasets.
various datasets.
Large climate
based datasets.
-
-
CONCLUSION
-
In this paper we have discussed and compared different modified DBSCAN algorithms. The algorithms discussed were DBSCAN, DDCAR, RDD-DBSCAN, FastDBSCAN,
DDBSCAN and DSets-DBSCAN. This evaluation was done due to many disadvantages which DBSCAN had regarding a number of its features. DDCAR and DSets- DBSCAN are parameter free clustering algorithms. They do not require a user input parameter. RDD-DBSCAN is effective while working with Resilient Distributed Datasets. FastDBSCAN improves computational time and accuracy by overcoming the quadratic time complexity possessed in the traditional DBSCAN. DDBSCAN combats the problem of multi-density clustering and generation of noise points amongst clusters. Thus we obtained a holistic view of the changes made by each of the algorithm.
REFERENCES
-
Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, 2012.
-
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in Proc. Int. Conf. Knowl. Discovery Data Mining, 1996, pages 226231..
-
Richard Hyde, Plamen Angelov, A Fully Autonomous Data Density Based Clustering Technique in Evolving and Autonomous Learning Systems (EALS), 2014 IEEE Symposium, Orlando, FL on 9-12 Dec. 2014,pages 116 – 123.
-
Irving Cordova,Teng-Sheng Moh, DBSCAN on Resilient Distributed Datasets in High Performance Computing & Simulation (HPCS), 2015 International Conference, Amsterdam,20- 24 July 2015,pages 531 – 540.
-
Vu Viet Thang ,D. V. Pantiukhin ,A. I. Galushkin, A Hybrid Clustering Algorithm: The FastDBSCAN in 2015 International Conference on Engineering and Telecommunication (EnT), Moscow, 18-19 Nov. 2015, pages 69 – 74.
-
Mohammad F. Hassanin,Mohamed Hassan,Abdalla Shoeb, DDBSCAN: Different Densities-Based Spatial Clustering of Applications with Noise in 2015 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), Kumaracoil, 18-19 Dec. 2015, pages 401
– 404.
-
Jian Hou, Huijun Gao, Xuelong Li, DSets-DBSCAN: A Parameter- Free Clustering Algorithm in IEEE Transactions on Image Processing (Volume: 25, Issue: 7),2016, pages 3182 – 3193.
-
Garima,Hina Gulati,P.K.Singh, Clustering Techniques in Data Mining: A Comparison in 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom),2015,Pages 410 – 415.
-
Saif ur Rehman,Sohail Asghar,Simon Fong ,S. Sarasvady, DBSCAN: Past, present and future in Applications of Digital Information and Web Technologies (ICADIWT), 2014 Fifth International Conference,Bangalore,17-19 Feb. 2014, pages 232 – 238.
-
Wikipedia. (2016, May, 27) DBSCAN from Wikipedia, the free encyclopedia. [Online] Available at: https://en.wikipedia.org/wiki/DBSCAN#Algorithm.