- Open Access
- Authors : Nishu Sharma, Atul Pratap Singh
- Paper ID : IJERTV2IS60712
- Volume & Issue : Volume 02, Issue 06 (June 2013)
- Published (First Online): 19-06-2013
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Comparative Study Of Data Clustering Techniques
Nishu Sharma1, Atul Pratap Singh2
1, 2 M. Tech Scholars, School of Computer Science & Engineering, Galgotias University, Greater Noida
Abstract
Clustering is a process for discovering groups and identifying interesting patterns in data. Data mining refers to the extraction of knowledge from large databases; it can therefore be described as a knowledge mining process and forms the core of knowledge discovery in databases (KDD). The size of datasets, and the need to analyze them, grow day by day, which is why data mining techniques are needed to manage such huge datasets. Data mining techniques are useful in many fields today, such as biology, libraries, GIS, satellite imagery, marketing and medical diagnostics. This paper compares data clustering techniques on the basis of dataset size, cluster model, application areas and other features, and indicates when each technique should be used.
Introduction
Data clustering [1] is the process of putting similar data objects into a group; data objects of one group are dissimilar from the data objects of other groups. Clustering algorithms are not used only to organize data: they are also used for data compression and model construction. The cluster center is the heart of a cluster. The process of building data clusters is outlined below [2].
Fig.1 Clustering Process
First we take the raw data, then we apply a clustering algorithm to it, and finally we obtain clusters of data. This is the process of building data clusters with the help of a clustering algorithm.
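This raw data → algorithm → clusters pipeline can be illustrated with a short, hedged sketch; scikit-learn, the synthetic make_blobs dataset and the choice of K-means are assumptions made only for illustration and are not part of this paper:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)              # raw data
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)  # clustering algorithm
print(labels[:10])   # cluster index assigned to each of the first ten objects
```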
Clustering techniques [10] satisfy two main criteria:
- Each cluster is homogeneous, because a cluster is built from similar objects.
- Objects of one cluster are dissimilar from the objects of other clusters; each cluster should be different from the other clusters.
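One common way to check these two criteria on a concrete result is the silhouette coefficient, which rewards clusterings that are internally homogeneous and well separated; the metric and the scikit-learn calls below are not part of this paper and are shown only as an illustrative sketch:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Values near 1 indicate homogeneous clusters that are well separated from each other.
print(silhouette_score(X, labels))
```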
Related work
Idea of Clustering:
Consider the example of a library system [3]. A library holds a large number of books, which share some qualities and differ in others. With clustering we can organize the books easily: books with similar qualities are kept in the same place, and books with dissimilar qualities are kept in different places. In effect, we form clusters of books. Clustering then makes searching for a specific book much easier, and the search takes less time.
Applications of Clustering:
These are some applications of clustering [4]:
- Data Reduction
- Hypothesis Testing
- Business
- Biology
- Spatial data analysis
- Web mining
Requirements of good clustering algorithm [10]:-
-
Scalability: – Clustering algorithm works well with small datasets and if it will work well with huge datasets. That is called
scalability. It means clustering algorithm should be highly scalable.
-
Dealing with different types of attribute: – The ability to analyze with any type of attributes such as binary, categorical and ordinal data or mixtures of data types.
-
Discovery of cluster with arbitrary shape: – Clusters could be any type of shape. Thats why algorithm should be detected any type of clustered shape.
-
Ability to deal with noise & outliers: – Databases contain noisy data (missing values) and outliers thats why algorithm should ability how to deal with these types of databases.
-
Interpretability & usability: – Users expect clustering results to be interpretable, and usable. Thats why clustering may need to be tied up with specific semantic interpretations.
-
High dimensionality: – A database and data warehouse can contain several dimensions or attributes. Clustering algorithm should be managed high dimensional data.
-
Data order dependency: – Algorithm should be insensitive to the order of input.
-
-
Clustering Techniques
Clustering techniques [5] are the methods applied to raw data in order to obtain clusters. There are many types of clustering techniques, but some of them are particularly important. These are as follows:
Partitioning Algorithm: This method is based on partitioning the data into k groups. The general idea is that objects in the same cluster are close to each other, while objects in different clusters are far from each other. There are two common partitioning algorithms, described below.
K-means algorithm: This algorithm [6] represents each cluster by the mean value (centroid) of the objects in the cluster. The steps are as follows:
Fig.2 K-means
We arbitrarily [4] choose three objects as three initial cluster centers, marking each center by +. Each object is assigned to the cluster whose center is nearest to it, which yields the initial clusters. After forming the clusters we compute the mean value of each cluster, and according to the new mean values we generate new clusters. This process continues until the mean values no longer change, as in the figure [7].
Fig.3 K-means process
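A minimal from-scratch sketch of this assign-and-recompute loop is shown below; the function name kmeans, the use of NumPy, the random data and the stopping rule are illustrative assumptions rather than the exact procedure of [6]:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Illustrative K-means: assign to nearest center, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every object to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its cluster
        # (empty clusters are not handled in this sketch).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # mean values stopped changing
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(np.random.default_rng(1).random((200, 2)), k=3)
```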
Advantages:
- Simplicity
- Effectiveness
- Easily understandable
Disadvantages:
- It is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
- It is sensitive to noise and outlier data points.
K-medoids algorithm: In this algorithm each cluster [7] is represented by one of its own objects, located near the center of the cluster; the algorithm is therefore based on medoids rather than means.
Fig.4 K-medoids
Arbitrarily choose k objects as the initial medoids [8]. Then repeat: assign each remaining object to the cluster with the nearest medoid; randomly pick a non-medoid object Orandom; compute the total cost S of swapping a medoid Oj with Orandom; if S < 0, swap Oj with Orandom to form a new set of k medoids. Stop when no change occurs.
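A minimal sketch of this swap-based (PAM-style) idea is given below; the function name k_medoids, the in-memory distance matrix and the use of NumPy are assumptions made for illustration, not the exact procedure of [8]:

```python
import numpy as np

def k_medoids(X, k, seed=0):
    """Illustrative K-medoids: greedily swap medoids while the total cost decreases."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)        # initial medoids
    cost = D[:, medoids].min(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for j in range(k):                     # try replacing each medoid Oj ...
            for o in range(len(X)):            # ... with each non-medoid object Orandom
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = o
                new_cost = D[:, candidate].min(axis=1).sum()
                if new_cost < cost:            # swap cost S < 0: keep the swap
                    medoids, cost, improved = candidate, new_cost, True
    # Assign every object to the cluster of its nearest medoid.
    return medoids, D[:, medoids].argmin(axis=1)

medoids, labels = k_medoids(np.random.default_rng(1).random((100, 2)), k=3)
```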
Advantages:
- It can handle noise and outlier data points.
- This method is more robust than the K-means algorithm.
Disadvantages:
- It is more costly than the K-means algorithm.
- Slow.
Hierarchical Clustering: This algorithm is based on hierarchical decomposition: it either merges or splits clusters to build a hierarchical structure [9]. In other words, it produces a tree of clusters, also known as a dendrogram, which contains the hierarchy of clusters. Hierarchical clustering is of two types.
Agglomerative approach: This approach is also called the bottom-up approach and is based on merging [10]. We start at the bottom with every object as its own cluster, take the two clusters with the minimum distance between them, and merge them into a single cluster. This process continues until a single cluster remains.
Divisive clustering: This approach is known as the top-down approach and is based on splitting clusters. We start at the top of the hierarchy with a single cluster and divide or split it into smaller clusters. This is called the divisive, or top-down, approach.
Fig.5 Divisive clustering
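A hedged sketch of the agglomerative (bottom-up) variant using SciPy is shown below; SciPy, single-linkage merging and the synthetic data are assumptions made for illustration only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 2))      # raw data objects
Z = linkage(X, method="single")                    # repeatedly merge the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)
```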
Advantages:
- It has a logical structure.
- It is easy to read and interpret.
Disadvantages:
- Merge and split operations can never be undone.
- It can be unstable and unreliable.
Density-based Clustering: This technique is based on the density of the objects and is used for arbitrary shapes; irregular shapes are handled well by density-based clustering. The steps are as follows [11]:
DBSCAN Algorithm: Select an arbitrary point p and find all the points that are density-reachable from p. If p is a core point, a cluster is formed. If p is a border point, no points are density-reachable from p, and DBSCAN [12] visits the next point of the database. The process continues until all points have been processed.
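A hedged illustration of DBSCAN with scikit-learn follows; the library, the eps and min_samples values and the synthetic two-moons data are assumptions for this sketch, not values taken from the paper:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # non-convex shapes
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks points treated as noise/outliers
```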
Advantages:
- It finds clusters of arbitrary shape, not just convex ones.
- It can filter out noise.
Disadvantages:
- DBSCAN does not deal well with high-dimensional data.
Grid-based Clustering: This approach is used for spatial data [13, 14]; spatial data describes the structure of objects in space. The data space is quantized into cells, and we then work only with the objects that fall into those cells.
CLIQUE Algorithm: This is a combination of grid-based and density-based clustering. It is useful for clustering high-dimensional [15] data in large databases. The steps are as follows (a small sketch of the underlying grid idea follows the list):
- Proceed bottom-up to find the dense units (crowded areas) among the cells.
- Generate a minimal number of regions, where each region covers one cluster.
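The cell-quantization and dense-unit step can be sketched as follows; this is only an illustration of the grid-based idea under assumed grid size and density threshold, not a full CLIQUE implementation:

```python
import numpy as np

def dense_cells(X, cells_per_dim=10, density_threshold=5):
    """Quantize the data space into a grid and keep the crowded (dense) cells."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # Map each object to the index of the cell it falls into.
    idx = np.floor((X - mins) / (maxs - mins + 1e-12) * cells_per_dim).astype(int)
    idx = np.clip(idx, 0, cells_per_dim - 1)
    # Count objects per cell and keep the cells above the density threshold.
    cells, counts = np.unique(idx, axis=0, return_counts=True)
    return cells[counts >= density_threshold]

X = np.random.default_rng(0).random((500, 2))
print(dense_cells(X))   # grid coordinates of the dense units
```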
Advantages:
- Processing time is very fast, because it depends only on the number of cells.
- It is effective on large image databases.
Disadvantages:
- The quality of clustering depends on the granularity [16, 17] of the cells.
Comparison
A comparison [18, 19] of all these techniques is given in the table below:
| S. No | Feature | K-means | K-medoids | Hierarchical | DBSCAN | CLIQUE |
|---|---|---|---|---|---|---|
| 1 | Application area | Neural networks, AI, market segmentation, pattern recognition | With KD-tree for software fault prediction | Applied science, psychology, AI, social science | Satellite images, X-ray crystallography | Social networks, biological networks |
| 2 | Shape and size | Spherical | Spherical | Arbitrary and non-convex shapes | Spherical and arbitrary shapes | Finds projected clusters in subspaces of a high-dimensional space |
| 3 | Cluster model | Centroid | Medoid | Connectivity model | Density model | Graph based |
| 4 | Scalability | Yes | Scales well only for small datasets | No | Scales to large datasets | Good scalability |
| 5 | Type of attribute | Numerical | Numerical/categorical | Symbolic | Numerical | Numerical |
| 6 | Outlier handling | No | Yes | Yes | Yes | Yes |

Table 1. Comparison of clustering techniques
Conclusion
In this paper we have briefly described the process of clustering from the data mining point of view and discussed the properties of good clustering methods used to find meaningful partitionings. Clustering lies at the heart of data mining applications. The ability to discover highly correlated regions of objects when their number becomes very large is highly desirable: as data sizes grow day by day and data properties and interrelationships change, managing and organizing such huge datasets is very challenging. We have provided a brief introduction to cluster analysis. All of the approaches discussed have been applied successfully in a number of areas, although more extensive study is needed to compare these different techniques and to better understand their strengths and limitations. In particular, there is no reason to expect that one type of clustering approach will be suitable for all types of data.
References
- A. Gupta, "Comparisons Among Data Mining Algorithms", ICRITO 2013.
- A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1988.
- D. L. Boley, "Principal Direction Divisive Partitioning", Data Mining and Knowledge Discovery, 1998.
- Information Technology Journal, Vol. 10, No. 3, pp. 478-484, 2011.
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, August 2000.
- J. Han and M. Kamber, Data Mining Techniques, Morgan Kaufmann Publishers, 2000.
- B. S. Everitt, Cluster Analysis, 3rd Edition, Edward Arnold Publishers, 1993.
- P. Rai and S. Singh, "A Survey of Clustering Techniques", International Journal of Computer Applications (0975-8887), Vol. 7, No. 12, October 2010.
- M. Steinbach, G. Karypis and V. Kumar, "A Comparison of Document Clustering Techniques", University of Minnesota, Technical Report #00-034, 2000.
- M. Halkidi, Y. Batistakis and M. Vazirgiannis, "On Clustering Validation Techniques", Intelligent Information Systems Journal, Kluwer Publishers, 17(2-3): 107-145.
- R. Ali, U. Ghani and A. Saeed, "Data Clustering and Its Applications", Rudjer Boskovic Institute, 2001.
- P. Arabie and L. J. Hubert, "An Overview of Combinatorial Data Analysis", in: P. Arabie, L. J. Hubert and G. De Soete (Eds.), Clustering and Classification, pp. 5-63, World Scientific Publishing Co., NJ, 1996.
- J. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, NY, 1975.
- G. Fung, "A Comprehensive Overview of Basic Clustering Algorithms", June 22, 2001.
- K. Hammouda and F. Karray, "A Comparative Study of Data Clustering Techniques", University of Waterloo, Ontario, Canada, N2L 3G1.
- O. A. Abbas, "Comparison Between Data Clustering Algorithms", The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008 (Department of Computer Science, Yarmouk University, Jordan).