New Approach for Clustering – based on Quick Sort

DOI : 10.17577/IJERTV2IS100621


Sabeen Govind P V, Archana K N, Reenu M, Philomina Simon

Department of Computer Science, University of Kerala, India

Abstract

Clustering refers to the process of grouping samples so that samples within a group are similar to one another. In this paper we propose a new clustering algorithm based on quick sort. Experimental results show that our algorithm gives results comparable to existing algorithms, and its complexity is comparatively low.

Keywords: Clustering, Quick sort, K-Means.

  1. INTRODUCTION

    Clustering can be considered an unsupervised learning problem [1]. A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters [4]. Cluster analysis (CA) is an exploratory data analysis tool for organizing observed data (e.g. people, things, events, brands, companies) into meaningful taxonomies, groups, or clusters, maximizing the similarity of cases within each cluster while maximizing the dissimilarity between groups that are initially unknown. Clustering can be applied in various fields such as marketing, biology, and libraries. Cluster analysis is the obverse of factor analysis: whereas factor analysis reduces the number of variables by grouping them into a smaller set of factors, cluster analysis reduces the number of observations or cases by grouping them into a smaller set of clusters. Based on the properties of the clusters generated, clustering algorithms can be broadly classified into hierarchical clustering and partitional clustering. In hierarchical clustering we start with every data point in a separate cluster and keep merging the most similar pairs of data points or clusters until one big cluster remains; this is called a bottom-up or agglomerative method. Partitional clustering instead attempts to decompose the data set directly into a set of disjoint clusters.

  2. RELATED WORKS

    This section reviews related clustering techniques from the literature.

      1. Single linkage, complete linkage and average linkage algorithms

        The single linkage algorithm is obtained by defining the distance between two clusters as the smallest distance between two points, one from each cluster; it is therefore referred to as the minimum method. The complete linkage algorithm instead takes the largest distance between two such points and is referred to as the maximum method. The average linkage algorithm is obtained by defining the distance between two clusters as the average distance between pairs of points, one from each cluster.
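The three linkage rules above differ only in how pairwise point distances are aggregated. A minimal sketch (our own illustration; the function name `linkage_distance` is ours, not from the paper):

```python
import numpy as np

def linkage_distance(A, B, method="single"):
    """Distance between clusters A and B (arrays of 2D points) under a linkage rule."""
    # Pairwise Euclidean distances between every point in A and every point in B.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    if method == "single":    # smallest pairwise distance (minimum method)
        return d.min()
    if method == "complete":  # largest pairwise distance (maximum method)
        return d.max()
    if method == "average":   # mean of all pairwise distances
        return d.mean()
    raise ValueError(f"unknown method: {method}")

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
print(linkage_distance(A, B, "single"))    # 2.0
print(linkage_distance(A, B, "complete"))  # 5.0
print(linkage_distance(A, B, "average"))   # 3.5
```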

      2. Ward's Algorithm

        Another commonly used approach in hierarchical clustering is Ward's method [8]. This approach does not successively combine the two most similar objects; instead, it combines those objects whose merger increases the overall within-cluster variance by the smallest possible degree.

      3. K-Means Algorithm

    K-means is a simple algorithm proposed by MacQueen (1967) that has been adapted to many problem domains. In the K-means algorithm each cluster is associated with a centroid, and each point is assigned to the cluster with the closest centroid [5].
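The assignment and update steps can be sketched as follows (a minimal Python illustration written for this description, not MacQueen's original formulation; random-sample initialization is our assumption, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain K-means on an (n, d) array X: assign each point to the nearest
    centroid, then recompute each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points (an assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: index of the closest centroid for every point.
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=-1), axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return labels, centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(X, 2)
print(labels)  # the two left-hand points share one label, the two right-hand points the other
```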

      4. Fuzzy C-Means Clustering

        Fuzzy c-means clustering is based on optimization of the basic c-means objective function. This function can be optimized by a variety of methods, including iterative minimization or genetic algorithms [10].
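For reference, the c-means objective function mentioned above has the standard form (our notation, with fuzzifier m > 1; not quoted verbatim from [10]):

```latex
J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \,\lVert x_k - v_i \rVert^2,
\qquad \text{subject to } \sum_{i=1}^{c} u_{ik} = 1 \text{ for each } k,
```

where u_{ik} is the membership degree of sample x_k in cluster i and v_i is the i-th cluster prototype.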

      5. Grid based clustering

        Grid-based clustering methods start by forming a grid structure of cells over the input dataset. Each object is assigned to a cell of the grid, and the clustering is performed on the resulting grid structure. STING [6] is a grid-based clustering method.

      6. Model based clustering

        Model-based clustering methods typically assume that the objects in the input dataset match a model, which is often a statistical distribution. COBWEB [1] is a model-based conceptual clustering method.

      7. Density based clustering

    Density-based methods discover clusters based on the density of points in regions. They are capable of producing arbitrarily shaped clusters and of filtering out noise. DBSCAN [7] is a density-based clustering algorithm proposed by Ester et al.

  3. PROPOSED ALGORITHM

    The proposed algorithm is based on quick sort [9], which uses a divide-and-conquer strategy. The algorithm works on data sets consisting of numerical values, where each data value is considered to lie on a 2D plane with both an x coordinate and a y coordinate. We take the x values and perform sorting on them, giving a sorted one-dimensional array of x from which the median value is calculated. (The same can be done on the y values, which gives a similar result, although the resulting clusters will differ.) The algorithm then proceeds by dividing the whole data set into two halves, the first to the left of Mid_Value and the second to the right. These halves can be further divided by taking the median of the newly formed clusters, and so on. The algorithm terminates when the desired number of clusters is reached or when further clustering cannot be performed.

    1. Partitioning Process

      The partitioning is applied by calculating the distances between neighbouring x values. Since sorting has already been performed, the data values within each cluster are close together. The distance from Mid_Value to its left neighbour, Data(Mid_Value - 1), and to its right neighbour, Data(Mid_Value + 1), is calculated, and Mid_Value joins the cluster on the side with the minimum distance. This comparison can also be applied to the two or three values nearest the median. If both clusters are at equal distance, Mid_Value can go to either group.
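A small worked example of this assignment rule (our own illustrative numbers):

```python
# Hypothetical worked example of the mid-value assignment rule described above.
xs = [1.0, 2.0, 3.0, 7.0, 8.0]      # x values, already sorted
mid = len(xs) // 2                  # 0-based index of the median: xs[2] = 3.0
left_gap = xs[mid] - xs[mid - 1]    # distance to predecessor: 3.0 - 2.0 = 1.0
right_gap = xs[mid + 1] - xs[mid]   # distance to successor:   7.0 - 3.0 = 4.0
# The median is closer to its left neighbour, so it joins the left cluster.
if left_gap <= right_gap:
    cluster1, cluster2 = xs[:mid + 1], xs[mid + 1:]
else:
    cluster1, cluster2 = xs[:mid], xs[mid:]
print(cluster1, cluster2)  # [1.0, 2.0, 3.0] [7.0, 8.0]
```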

    2. Pseudocode

      Input: n data points (x, y).
      Output: Required number of clusters.

      Step 1: Sort the x values of the data points using the quick sort algorithm. If an x value is repeated, sort the y values of only the identical x values.

      Sorted_Array = Quick_Sort(Datapoints(1, x));

      Step 2: Compute the mid value of the sorted array: Mid_Value = (n + 1)/2 for odd n and Mid_Value = n/2 for even n.

      Step 3: Compute the distance between Mid_Value and its successor, and between Mid_Value and its predecessor.

      If dist(Mid_Value, Mid_Value - 1) > dist(Mid_Value, Mid_Value + 1), divide the array as

      Arr1 = Sorted_Array(1 : Mid_Value - 1)
      Arr2 = Sorted_Array(Mid_Value : n)

      else

      Arr1 = Sorted_Array(1 : Mid_Value)
      Arr2 = Sorted_Array(Mid_Value + 1 : n)

      Step 4: Display the clusters.

  4. EXPERIMENTAL RESULTS

    Simulations were done in MATLAB. N data points are randomly selected; in the examples below, N = 50.

    Fig (a): Output of K-means algorithm
    Fig (b): Output of proposed method
    Fig (c): Output of K-means algorithm
    Fig (d): Output of proposed method
    Fig (e): Output of proposed method (K = 3)

    Figure (a) shows the output of the K-means algorithm (number of clusters K = 2), and figure (b) shows the output of the proposed method for the same data points (x, y).

    Figures (c) and (d) show the outputs of K-means and the proposed algorithm, respectively, for another randomly generated set of data points.

    Figure (e) shows 50 randomly generated data points grouped into three clusters (K = 3).
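The pseudocode of the proposed method can be sketched in Python as follows. This is our own minimal sketch under stated assumptions: Python's built-in sorting stands in for Quick_Sort, the split rule is the neighbour-distance rule of the partitioning process, and we split the largest remaining cluster first when extending to K > 2 (the paper does not fix this order); the names `split_at_median` and `quicksort_cluster` are ours.

```python
import numpy as np

def split_at_median(cluster):
    """One partition step: the median x joins whichever neighbour is closer."""
    x = cluster[:, 0]
    mid = len(x) // 2
    if x[mid] - x[mid - 1] <= x[mid + 1] - x[mid]:
        cut = mid + 1   # median is closer to its predecessor: it goes left
    else:
        cut = mid       # median is closer to its successor: it goes right
    return cluster[:cut], cluster[cut:]

def quicksort_cluster(points, k):
    """Split (n, 2) data points (x, y) into up to k clusters."""
    points = np.asarray(points, dtype=float)
    # Step 1: sort by x, breaking ties by y (stands in for Quick_Sort).
    order = np.lexsort((points[:, 1], points[:, 0]))
    clusters = [points[order]]
    while len(clusters) < k:
        # Split the largest remaining cluster (an assumed choice; the paper
        # only says the halves can be further divided).
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        if len(clusters[i]) < 3:   # too small to split further
            break
        a, b = split_at_median(clusters.pop(i))
        clusters += [a, b]
    return clusters

cs = quicksort_cluster([(1, 0), (2, 0), (3, 0), (7, 0), (8, 0)], 2)
print([c[:, 0].tolist() for c in cs])  # [[1.0, 2.0, 3.0], [7.0, 8.0]]
```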

    The following table shows various clustering algorithms and their complexities [2].

    Algorithm                 Complexity
    K-means                   O(Nkd) time, O(N + k) space
    Fuzzy c-means             Near O(N)
    Hierarchical clustering   O(N^2) time, O(N^2) space
    Proposed method           O(N^2)

  5. CONCLUSIONS

Clustering can be used to reduce the amount of data and to induce a categorization. In this paper a new method of clustering based on quick sort is presented. The method is simple to implement, and experimental results show that it gives results comparable to the K-means algorithm. The method can be extended to any number of clusters simply by using a recursive function.

REFERENCES

  1. Rui Xu and Donald Wunsch, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.

  2. Madhuri A. Taya and M. M. Raghuwanshi, "Review on Various Clustering Methods for the Image Data", Journal of Emerging Trends in Computing and Information Sciences, Vol. 2, Special Issue, 2010, pp. 35-39.

  3. Naz, Majeed and Irshad, "Image segmentation using fuzzy clustering: A survey", International Conference on Emerging Technologies (ICET), pp. 181-186, 2010.

  4. Song Yu-chen, Jia Xiao-liang and Meng Hai-dong, "Comparative study of clustering methods based on linear data distribution", International Conference on Management Science and Engineering (ICMSE), pp. 377-384, 2012.

  5. Hui Xiong, Junjie Wu and Jian Chen, "K-Means Clustering Versus Validation Measures: A Data-Distribution Perspective", IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 39, No. 2, April 2009.

  6. W. Wang, J. Yang and R. Muntz, "STING: A Statistical Information Grid Approach to Spatial Data Mining", Proc. Int. Conf. VLDB, pp. 186-195, 1997.

  7. M. Ester, H. P. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), Portland, AAAI Press, pp. 291-316, 1996.

  8. Earl Gose, Richard Johnsonbaugh and Steve Jost, Pattern Recognition and Image Analysis (book).

  9. Alfred Aho and D. Ullman, Data Structures and Algorithms (book).

  10. Liu Su-hua and Hou Hui-fang, "A combination of mixture genetic algorithm and Fuzzy C-means clustering algorithm", IEEE International Symposium on Medicine and Education, Vol. 1, pp. 254-258, 2009.
