- Open Access
- Authors : Nishu Sharma, Atul Pratap Singh
- Paper ID : IJERTV2IS60712
- Volume & Issue : Volume 02, Issue 06 (June 2013)
- Published (First Online): 19-06-2013
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Comparative Study Of Data Clustering Techniques
Nishu Sharma1, Atul Pratap Singh2
1, 2 M. Tech Scholars, School of Computer Science & Engineering, Galgotias University, Greater Noida
Abstract
Clustering is a process for discovering groups and identifying interesting patterns in data. Data mining refers to the extraction of knowledge from large databases; it can therefore be described as a knowledge mining process and forms the core of knowledge discovery in databases (KDD). The size of datasets, and the need to analyze them, grow day by day, which is why data mining techniques are needed to manage such huge datasets. Data mining techniques are useful in many fields today, such as biology, libraries, GIS, satellite imagery, marketing and medical diagnostics. This paper compares data clustering techniques on the basis of dataset size, cluster model, application areas and other features, and indicates when each technique should be used.
Introduction
Data clustering [1] is the process of putting similar data objects into a group; data objects of one group are dissimilar from the data objects of other groups. Clustering algorithms are not used only to organize data: they are also used for data compression and model construction. The cluster center is the heart of a cluster. The process of building data clusters is outlined below [2].
Fig.1 Clustering Process
First we take the raw data, then we apply a clustering algorithm to it, and finally we obtain clusters of data. This is the process of building data clusters with the help of a clustering algorithm.
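This raw data → algorithm → clusters pipeline can be illustrated with a short, hedged sketch; scikit-learn, the synthetic make_blobs dataset and the choice of K-means are assumptions made only for illustration and are not part of this paper:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)              # raw data
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)  # clustering algorithm
print(labels[:10])   # cluster index assigned to each of the first ten objects
```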
Clustering techniques [10] satisfy two main criteria:
- Each cluster is homogeneous, because a cluster is built from similar objects.
- Objects of one cluster are dissimilar from the objects of other clusters; each cluster should be different from the other clusters.
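One common way to check these two criteria on a concrete result is the silhouette coefficient, which rewards clusterings that are internally homogeneous and well separated; the metric and the scikit-learn calls below are not part of this paper and are shown only as an illustrative sketch:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Values near 1 indicate homogeneous clusters that are well separated from each other.
print(silhouette_score(X, labels))
```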
Related work
Idea of Clustering:
Consider the example of a library system [3]. A library holds a large number of books, which share some qualities and differ in others. With clustering we can organize the books easily: books with similar qualities are kept in the same place, and books with dissimilar qualities are kept in different places. In effect, we form clusters of books. Clustering then makes searching for a specific book much easier, and the search takes less time.
Applications of Clustering:
These are some applications of clustering [4]:
- Data Reduction
- Hypothesis Testing
- Business
- Biology
- Spatial data analysis
- Web mining
Requirements of good clustering algorithm [10]:-
-
Scalability: – Clustering algorithm works well with small datasets and if it will work well with huge datasets. That is called
scalability. It means clustering algorithm should be highly scalable.
-
Dealing with different types of attribute: – The ability to analyze with any type of attributes such as binary, categorical and ordinal data or mixtures of data types.
-
Discovery of cluster with arbitrary shape: – Clusters could be any type of shape. Thats why algorithm should be detected any type of clustered shape.
-
Ability to deal with noise & outliers: – Databases contain noisy data (missing values) and outliers thats why algorithm should ability how to deal with these types of databases.
-
Interpretability & usability: – Users expect clustering results to be interpretable, and usable. Thats why clustering may need to be tied up with specific semantic interpretations.
-
High dimensionality: – A database and data warehouse can contain several dimensions or attributes. Clustering algorithm should be managed high dimensional data.
-
Data order dependency: – Algorithm should be insensitive to the order of input.
-
-
Clustering Techniques
Clustering techniques [5] are the methods applied to raw data in order to obtain clusters. There are many types of clustering techniques, but some of them are particularly important. These are as follows:
Partitioning Algorithm: This method is based on partitioning the data into k groups. The general idea is that objects in the same cluster are close to each other, while objects in different clusters are far from each other. There are two common partitioning algorithms, described below.
K-means algorithm: This algorithm [6] represents each cluster by the mean value (centroid) of the objects in the cluster. The steps are as follows:
Fig.2 K-means
We arbitrarily [4] choose three objects as three initial cluster centers, marking each center by +. Each object is assigned to the cluster whose center is nearest to it, which yields the initial clusters. After forming the clusters we compute the mean value of each cluster, and according to the new mean values we generate new clusters. This process continues until the mean values no longer change, as in the figure [7].
Fig.3 K-means process
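A minimal from-scratch sketch of this assign-and-recompute loop is shown below; the function name kmeans, the use of NumPy, the random data and the stopping rule are illustrative assumptions rather than the exact procedure of [6]:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Illustrative K-means: assign to nearest center, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every object to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its cluster
        # (empty clusters are not handled in this sketch).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # mean values stopped changing
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(np.random.default_rng(1).random((200, 2)), k=3)
```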
Advantages:
- Simplicity
- Effectiveness
- Easily understandable
Disadvantages:
- It is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
- It is sensitive to noise and outlier data points.
K-medoids algorithm: In this algorithm each cluster [7] is represented by one of its own objects, located near the center of the cluster; the algorithm is therefore based on medoids rather than means.
Fig.4 K-medoids
Arbitrarily choose k objects as the initial medoids [8]. Then repeat: assign each remaining object to the cluster with the nearest medoid; randomly pick a non-medoid object Orandom; compute the total cost S of swapping a medoid Oj with Orandom; if S < 0, swap Oj with Orandom to form a new set of k medoids. Stop when no change occurs.
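A minimal sketch of this swap-based (PAM-style) idea is given below; the function name k_medoids, the in-memory distance matrix and the use of NumPy are assumptions made for illustration, not the exact procedure of [8]:

```python
import numpy as np

def k_medoids(X, k, seed=0):
    """Illustrative K-medoids: greedily swap medoids while the total cost decreases."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)        # initial medoids
    cost = D[:, medoids].min(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for j in range(k):                     # try replacing each medoid Oj ...
            for o in range(len(X)):            # ... with each non-medoid object Orandom
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = o
                new_cost = D[:, candidate].min(axis=1).sum()
                if new_cost < cost:            # swap cost S < 0: keep the swap
                    medoids, cost, improved = candidate, new_cost, True
    # Assign every object to the cluster of its nearest medoid.
    return medoids, D[:, medoids].argmin(axis=1)

medoids, labels = k_medoids(np.random.default_rng(1).random((100, 2)), k=3)
```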
Advantages:
- It can handle noise and outlier data points.
- This method is more robust than the K-means algorithm.
Disadvantages:
- It is more costly than the K-means algorithm.
- Slow.
Hierarchical Clustering: This algorithm is based on hierarchical decomposition: it either merges or splits clusters to build a hierarchical structure [9]. In other words, it produces a tree of clusters, also known as a dendrogram, which contains the hierarchy of clusters. Hierarchical clustering is of two types.
Agglomerative approach: This approach is also called the bottom-up approach and is based on merging [10]. We start at the bottom with every object as its own cluster, take the two clusters with the minimum distance between them, and merge them into a single cluster. This process continues until a single cluster remains.
Divisive clustering: This approach is known as the top-down approach and is based on splitting clusters. We start at the top of the hierarchy with a single cluster and divide or split it into smaller clusters. This is called the divisive, or top-down, approach.
Fig.5 Divisive clustering
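A hedged sketch of the agglomerative (bottom-up) variant using SciPy is shown below; SciPy, single-linkage merging and the synthetic data are assumptions made for illustration only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 2))      # raw data objects
Z = linkage(X, method="single")                    # repeatedly merge the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)
```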
Advantages:
- It has a logical structure.
- It is easy to read and interpret.
Disadvantages:
- Merge and split operations can never be undone.
- It can be unstable and unreliable.
Density-based Clustering: This technique is based on the density of the objects and is used for arbitrary shapes; irregular shapes are handled well by density-based clustering. The steps are as follows [11]:
DBSCAN Algorithm: Select an arbitrary point p and find all the points that are density-reachable from p. If p is a core point, a cluster is formed. If p is a border point, no points are density-reachable from p, and DBSCAN [12] visits the next point of the database. The process continues until all points have been processed.
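A hedged illustration of DBSCAN with scikit-learn follows; the library, the eps and min_samples values and the synthetic two-moons data are assumptions for this sketch, not values taken from the paper:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # non-convex shapes
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks points treated as noise/outliers
```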
Advantages:
- It finds clusters of arbitrary shape, not just convex ones.
- It can filter out noise.
Disadvantages:
- DBSCAN does not deal well with high-dimensional data.
Grid-based Clustering: This approach is used for spatial data [13, 14]; spatial data describes the structure of objects in space. The data space is quantized into cells, and we then work only with the objects that fall into those cells.
CLIQUE Algorithm: This is a combination of grid-based and density-based clustering. It is useful for clustering high-dimensional [15] data in large databases. The steps are as follows (a small sketch of the underlying grid idea follows the list):
- Proceed bottom-up to find the dense units (crowded areas) among the cells.
- Generate a minimal number of regions, where each region covers one cluster.
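The cell-quantization and dense-unit step can be sketched as follows; this is only an illustration of the grid-based idea under assumed grid size and density threshold, not a full CLIQUE implementation:

```python
import numpy as np

def dense_cells(X, cells_per_dim=10, density_threshold=5):
    """Quantize the data space into a grid and keep the crowded (dense) cells."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # Map each object to the index of the cell it falls into.
    idx = np.floor((X - mins) / (maxs - mins + 1e-12) * cells_per_dim).astype(int)
    idx = np.clip(idx, 0, cells_per_dim - 1)
    # Count objects per cell and keep the cells above the density threshold.
    cells, counts = np.unique(idx, axis=0, return_counts=True)
    return cells[counts >= density_threshold]

X = np.random.default_rng(0).random((500, 2))
print(dense_cells(X))   # grid coordinates of the dense units
```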
Advantages:
- Processing time is very fast, because it depends only on the number of cells.
- It is effective on large image databases.
Disadvantages:
- The quality of clustering depends on the granularity [16, 17] of the cells.
Comparison
A comparison [18, 19] of all these techniques is given in the table below:
| S. No | Feature | K-means | K-medoids | Hierarchical | DBSCAN | CLIQUE |
|---|---|---|---|---|---|---|
| 1 | Application area | Neural networks, AI, market segmentation, pattern recognition | With KD-tree for software fault prediction | Applied science, psychology, AI, social science | Satellite images, X-ray crystallography | Social networks, biological networks |
| 2 | Shape and size | Spherical | Spherical | Arbitrary and non-convex shapes | Spherical and arbitrary shapes | Finds projected clusters in subspaces of a high-dimensional space |
| 3 | Cluster model | Centroid | Medoid | Connectivity model | Density model | Graph based |
| 4 | Scalability | Yes | Scales well only for small datasets | No | Scales to large datasets | Good scalability |
| 5 | Type of attribute | Numerical | Numerical/categorical | Symbolic | Numerical | Numerical |
| 6 | Outlier handling | No | Yes | Yes | Yes | Yes |

Table 1. Comparison of clustering techniques
Conclusion
In this paper we have briefly described the process of clustering from the data mining point of view and discussed the properties of good clustering methods used to find meaningful partitionings. Clustering lies at the heart of data mining applications. The ability to discover highly correlated regions of objects when their number becomes very large is highly desirable: as data sizes grow day by day and data properties and interrelationships change, managing and organizing such huge datasets is very challenging. We have provided a brief introduction to cluster analysis. All of the approaches discussed have been applied successfully in a number of areas, although more extensive study is needed to compare these different techniques and to better understand their strengths and limitations. In particular, there is no reason to expect that one type of clustering approach will be suitable for all types of data.
References
- A. Gupta, "Comparisons Among Data Mining Algorithms", ICRITO 2013.
- A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1988.
- D. L. Boley, "Principal Direction Divisive Partitioning", Data Mining and Knowledge Discovery, 1998.
- Information Technology Journal, Vol. 10, No. 3, pp. 478-484, 2011.
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, August 2000.
- J. Han and M. Kamber, Data Mining Techniques, Morgan Kaufmann Publishers, 2000.
- B. S. Everitt, Cluster Analysis, 3rd Edition, Edward Arnold Publishers, 1993.
- P. Rai and S. Singh, "A Survey of Clustering Techniques", International Journal of Computer Applications (0975-8887), Vol. 7, No. 12, October 2010.
- M. Steinbach, G. Karypis and V. Kumar, "A Comparison of Document Clustering Techniques", University of Minnesota, Technical Report #00-034, 2000.
- M. Halkidi, Y. Batistakis and M. Vazirgiannis, "On Clustering Validation Techniques", Intelligent Information Systems Journal, Kluwer Publishers, 17(2-3): 107-145.
- R. Ali, U. Ghani and A. Saeed, "Data Clustering and Its Applications", Rudjer Boskovic Institute, 2001.
- P. Arabie and L. J. Hubert, "An Overview of Combinatorial Data Analysis", in: P. Arabie, L. J. Hubert and G. De Soete (Eds.), Clustering and Classification, pp. 5-63, World Scientific Publishing Co., NJ, 1996.
- J. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, NY, 1975.
- G. Fung, "A Comprehensive Overview of Basic Clustering Algorithms", June 22, 2001.
- K. Hammouda and F. Karray, "A Comparative Study of Data Clustering Techniques", University of Waterloo, Ontario, Canada, N2L 3G1.
- O. A. Abbas, "Comparison Between Data Clustering Algorithms", The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008 (Department of Computer Science, Yarmouk University, Jordan).