A Comparative Analysis of Partitioning Based Clustering Algorithms and Applications

DOI : 10.17577/IJERTV2IS120207


Sumit Pandey1, Sanjay Kumar Dubey2, Amity School of Engineering & Technology, Amity University, Noida, India

Clustering takes a set of data points and assigns them to groups, or clusters, such that points in the same cluster are more similar to one another than to members of other clusters. These clusters are then used to find patterns in the data set from which relevant correlations may be inferred. Clustering has applications in a variety of fields, including image segmentation, market research for advertising, data mining, several biological applications, and file compression. Because clustering is used in so many fields, it is important that an algorithm generates accurate clusters and converges on those clusters as quickly as possible, since clustering is often performed on massive data sets, or the cluster assignments may need to be determined in real time.

In this paper, two well-known partitioning based methods, k-means and k-medoids, are compared. The study presented here explores the behavior of these two methods.

  1. Cluster analysis is also called segmentation analysis or taxonomy analysis [1]. It seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation.

    Clustering is a method of unsupervised learning and a well-known technique for statistical data analysis. It is used in many fields such as machine learning, image analysis, pattern recognition, outlier detection, and bioinformatics, to name a few.

In general, the major clustering methods can be classified into the following categories:

  • Partitioning methods

  • Hierarchical methods

  • Density-based methods

  • Grid-based methods

  • Model-based methods

Given a database of n objects or data tuples, a partitioning method constructs K partitions of the data, where each partition represents a cluster and K ≤ n. That is, it classifies the data into K groups, which together satisfy the following requirements [2]:

  1. Each group must contain at least one object.

  2. Each object must belong to exactly one group.

Given K, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects in different clusters are far apart or very different. There are various kinds of criteria for judging the quality of partitions. On the basis of these concepts, various methods have been proposed:

  1. K-means methods

  2. K-medoids methods

  3. Probabilistic clustering

The most well-known and commonly used partitioning methods are k-means, k-medoids and their variations [3]. Both the k-means and k-medoids algorithms are partitional (they break the data set up into groups), and both attempt to minimize the distance between points assigned to a cluster and the point designated as the center of that cluster.

    1. Partitioning Based Clustering Methods

      A partitioning method creates k partitions, called clusters, from a given set of n data objects. Initially, each data object is assigned to one of the partitions. An iterative relocation technique is then used to improve the partitioning by moving objects from one group to another. Each partition is represented by either a centroid or a medoid: a centroid is the average of all data objects in a partition, while a medoid is the most representative point of a cluster. The fundamental requirements of partitioning based methods are that each cluster must contain at least one data object and each data object must belong to exactly one cluster. Various methods have been developed in this category of clustering [3][4]. A distance measure over the feature space is used to quantify the similarity or dissimilarity between data objects. Some of the well-known methods are k-means and k-medoids, together with:

      1. Partitioning around Medoids (PAM)

      2. Clustering LARge Applications (CLARA)

      3. Clustering Large Applications based upon RANdomized Search (CLARANS)

      Of these methods, k-means and k-medoids are reviewed here, and the similarity measure for both algorithms is computed with a distance measure.
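Both methods below rank candidate assignments with such a distance measure. As a minimal illustration, here is a Python sketch of the two measures most often used in this context, Euclidean distance and the Manhattan distance employed in the worked examples later in this paper (the function names are ours, not from any particular library):

import math

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((2, 10), (5, 8)))  # 3.605...
print(manhattan((2, 10), (5, 8)))  # 5, as in the k-means example below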

2.1.1 K-means Clustering Algorithm

K-means is one of the simplest unsupervised learning methods among all partitioning based clustering methods. It classifies a given set of n data objects into k clusters, where k is the number of desired clusters and must be given in advance. K-means classifies a given data set through a certain number of clusters (assume k clusters) fixed a priori, with a centroid defined for each cluster. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results. It is well known that the basic k-means algorithm does not produce an analytic solution; the solution depends on how the objects are initially assigned to clusters. The k-means algorithm gives better results only when the initial partition is close to the final solution. Several attempts have been reported to solve the cluster initialization problem [3].

Formally, the k-means clustering algorithm proceeds in the following steps.

  1. Choose a number of desired clusters, k.

  2. Choose k starting points to be used as initial estimates of the cluster centroids. These are the initial starting values.

  3. Examine each point in the data set and assign it to the cluster whose centroid is nearest to it.

  4. Once every point has been assigned to a cluster, recalculate the k centroids.

  5. Repeat steps 3 and 4 until no point changes its cluster assignment, or until a maximum number of passes through the data set is performed.

Algorithm 1: The k-means Clustering Algorithm

Input:

D = {d1, d2, …, dn} // set of n data items
k // number of desired clusters

Output:

A set of k clusters.

Steps:

  1. Arbitrarily choose k data-items from D as initial centroids;

  2. Repeat

    Assign each item di to the cluster with the closest centroid;

    Calculate the new mean for each cluster;

    Until the convergence criterion is met.
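The pseudocode above translates directly into a short program. The following is a minimal Python sketch of Algorithm 1 under the stated steps (pure Python, Euclidean distance via math.dist, illustrative names), not a tuned implementation:

import math
import random

def kmeans(points, k, max_iters=100):
    # Step 1: arbitrarily choose k data items as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Assign each item to the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [math.dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Recompute each centroid as the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep the old centroid for an empty cluster
        # Convergence criterion: stop when no centroid moves.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

For example, kmeans([(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)], 3) clusters the eight points of the numerical example below, up to the random choice of initial centroids.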

    The k-means algorithm is the most extensively studied clustering algorithm and is generally effective in producing good results. Its major drawback is that it produces different clusters for different sets of initial centroids; the quality of the final clusters depends heavily on the selection of the initial centroids. The k-means algorithm is computationally expensive and requires time proportional to the product of the number of data items, the number of clusters and the number of iterations [3].

    Figure 1: Demonstration of the standard k-means algorithm[4]. (1) k initial "means" (in this case k = 3) are randomly generated within the data domain. (2) k clusters are created by associating every observation with the nearest mean. (3) The centroid of each of the k clusters becomes the new mean. (4) Steps 2 and 3 are repeated until convergence has been reached.

    Numerical Example: Cluster the following eight points (with (x, y) representing locations) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). The initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as d(a, b) = |x2 - x1| + |y2 - y1| (the Manhattan distance).

    First we list all points in the first column of the table below (Table 1). The initial cluster centers, the means, are (2, 10), (5, 8) and (1, 2). Next, we calculate the distance from the first point, (2, 10), to each of the three means using the distance function:

    Distance to mean1 = (2, 10):
    d((2, 10), (2, 10)) = |2 - 2| + |10 - 10| = 0 + 0 = 0

    Distance to mean2 = (5, 8):
    d((2, 10), (5, 8)) = |5 - 2| + |8 - 10| = 3 + 2 = 5

    Distance to mean3 = (1, 2):
    d((2, 10), (1, 2)) = |1 - 2| + |2 - 10| = 1 + 8 = 9

    The point (2, 10) is therefore assigned to cluster 1, whose mean is nearest (distance 0).

    1. Next we need to re-compute the new cluster centers. We do so by taking the mean of all points in each cluster.

    2. The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

    3. In the k-means algorithm, the prototype, called the center, is the mean of all objects belonging to a cluster.

    4. It requires several passes over the entire data set, which can make it very expensive for large data sets.

    5. The k-medoids approach is more robust in this respect.

Table 1: k-means Iteration[4]

Point        Dist. to Mean 1 (2, 10)   Dist. to Mean 2 (5, 8)   Dist. to Mean 3 (1, 2)   Cluster
A1 (2, 10)            0                         5                        9                  1
A2 (2, 5)             5                         6                        4                  3
A3 (8, 4)            12                         7                        9                  2
A4 (5, 8)             5                         0                       10                  2
A5 (7, 5)            10                         5                        9                  2
A6 (6, 4)            10                         5                        7                  2
A7 (1, 2)             9                        10                        0                  3
A8 (4, 9)             3                         2                       10                  2
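The assignments in Table 1 can be checked mechanically. A small illustrative Python script using the Manhattan distance defined above:

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
means = [(2, 10), (5, 8), (1, 2)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for name, p in points.items():
    dists = [manhattan(p, m) for m in means]
    # Cluster indices are 1-based, matching the table.
    print(name, dists, "-> cluster", dists.index(min(dists)) + 1)

This reproduces every row of Table 1; for example, A3 gives distances [12, 7, 9] and is assigned to cluster 2.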

2.1.2 K-Medoids Clustering Algorithm

The k-means method uses the centroid to represent a cluster and is sensitive to outliers: a data object with an extremely large value may distort the distribution of the data. The k-medoids method overcomes this problem by using a medoid, rather than a centroid, to represent the cluster. A medoid is the most centrally located data object in a cluster. Here, k data objects are selected randomly as medoids to represent the k clusters, and every remaining data object is placed in the cluster whose medoid is nearest (most similar) to it. After all data objects have been processed, a new medoid that represents each cluster in a better way is determined, and the entire process is repeated: all data objects are again bound to the clusters based on the new medoids. In each iteration the medoids change their location step by step; in other words, the medoids move in each iteration. This process continues until no medoid moves. As a result, k clusters are found representing the set of n data objects. An algorithm for this method is given below.

Algorithm 2: The k-Medoids Clustering Algorithm

A typical k-medoids algorithm for partitioning based on medoids, or central objects, is as follows:

Input:

k // number of clusters
D // a data set containing n objects

Output:

A set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.

Method:

  1. Arbitrarily choose k objects in D as the initial representative objects (medoids);

  2. Repeat

    Assign each remaining object to the cluster with the nearest medoid;

    Randomly select a non-medoid object Orandom;

    Compute the total cost S of swapping a medoid object Oj with Orandom;

    If S < 0, swap Oj with Orandom to form the new set of k medoids;

    Until no change;
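A minimal Python sketch of this swap-based procedure (in the spirit of PAM) is given below; the names and structure are illustrative, and no attempt is made at the optimizations used by real implementations:

import random

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(points, medoids, dist=manhattan):
    # Sum of each object's distance to its nearest medoid.
    return sum(min(dist(p, m) for m in medoids) for p in points)

def kmedoids(points, k, dist=manhattan):
    medoids = random.sample(points, k)  # arbitrary initial medoids
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:  # "until no change"
        improved = False
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                # Tentatively swap medoid i with the non-medoid object o.
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(points, candidate, dist)
                if cost < best:  # S < 0: the swap lowers the total cost
                    medoids, best = candidate, cost
                    improved = True
    return medoids, best

For the ten-object example below, total_cost(objects, [(3, 4), (7, 4)]) returns 20, matching the hand calculation that follows.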

Numerical Example:

Cluster the given data set of ten objects into two clusters, i.e. k = 2. Consider the data set of ten objects shown in Table 2:

Table 2 : Data Set- Ten Objects

X1

2

6

X2

3

4

X3

3

8

X4

4

7

X5

6

2

X6

6

4

X7

7

3

X8

7

4

X9

8

5

X10

7

6

Step 1

Initialize the k centers. Let us assume c1 = (3, 4) and c2 = (7, 4); here c1 and c2 are selected as the medoids. Calculate the distance from each data object to each medoid so as to associate every data object with its nearest medoid. The cost is calculated using the Manhattan distance metric (the Minkowski metric with r = 1).

Table 3: Cost Calculation[4]

Data Object Xi   Cost to c1 = (3, 4)   Cost to c2 = (7, 4)
(2, 6)                   3                     7
(3, 8)                   4                     8
(4, 7)                   4                     6
(6, 2)                   5                     3
(6, 4)                   3                     1
(7, 3)                   5                     1
(8, 5)                   6                     2
(7, 6)                   6                     2

(The medoids themselves, (3, 4) and (7, 4), have cost 0 and are omitted from the table.)

Then the clusters become:

Cluster1 = {(3, 4), (2, 6), (3, 8), (4, 7)}

Cluster2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}

Since the points (2, 6), (3, 8) and (4, 7) are closer to c1, they form one cluster, while the remaining points form the other cluster.

The total cost involved is 20, where the cost is the sum of the distances of each data object from the medoid of its cluster:

Total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))}
           + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6))}
           = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2)
           = 20

Cluster1 = {(3, 4), (2, 6), (3, 8), (4, 7)}; Cluster2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
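This arithmetic is easy to verify in Python (a self-contained check, mirroring the cost definition above):

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

objects = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
           (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
medoids = [(3, 4), (7, 4)]

# Each object contributes its distance to the nearest medoid (the medoids themselves contribute 0).
total = sum(min(manhattan(p, m) for m in medoids) for p in objects)
print(total)  # 20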

    1. Strengths of K-Means

      1. Relatively scalable and efficient in processing large data sets; the complexity is O(i k n), where i is the total number of iterations, k is the total number of clusters, and n is the total number of objects. Normally, k << n and i << n.

      2. Easy to understand and implement.

    2. Weaknesses

      1. Applicable only when the mean of a cluster is defined; not applicable to categorical data.

      2. Need to specify k, the total number of clusters, in advance.

      3. Not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.

      4. Unable to handle noisy data and outliers.

      5. May terminate at a local optimum.

      6. Result and total run time depend upon the initial partition.

    1. Strengths of K-Medoids

      1. More robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.

    2. Weaknesses

      1. Relatively more costly; the complexity is O(i k (n - k)²), where i is the total number of iterations, k is the total number of clusters, and n is the total number of objects.

      2. Relatively less efficient.

      3. Need to specify k, the total number of clusters, in advance.

      4. Result and total run time depend upon the initial partition.

        Table 4: Comparison of K-means & K-medoids[7][8]

        Setting                                          k-means                k-medoids
        Complexity                                       O(i k n)               O(i k (n - k)²)
        Efficiency                                       Comparatively more     Comparatively less
        Implementation                                   Easy                   Complicated
        Sensitive to outliers?                           Yes                    No
        Requires convex cluster shapes?                  Yes                    Not so much
        Advance specification of number of clusters k    Required               Required
        Initial partition affects result and run time?   Yes                    Yes
        Optimized for                                    Separated clusters     Separated clusters, small data sets

      1. Clustering Algorithms in Identifying Cancerous Data[9]

        Clustering algorithms can be used to identify cancerous data sets. Initially we take known samples of cancerous and non-cancerous data and label both sample sets. We then randomly mix the samples and apply different clustering algorithms to the mixed data (this is known as the learning phase of the clustering algorithm), and check for how many data points each algorithm produces the correct result (since these are known samples, we already know the results beforehand); from this we can calculate the percentage of correct results obtained. Now, if we apply the same algorithm to an arbitrary sample data set, we can expect its results to be correct at roughly the same percentage as during the learning phase. On this basis we can search for the clustering algorithm best suited to our data samples.

        It has been found through experiment that cancerous data sets give the best results with unsupervised non-linear clustering algorithms; hence we can conclude that cancerous data sets are non-linear in nature.

      2. Clustering Algorithms in Search Engines[9]

        Clustering algorithms are the backbone of search engines. Search engines try to group similar objects into one cluster and keep dissimilar objects far from each other, and they return results for a query according to the nearest similar objects clustered around the searched data. The better the clustering algorithm used, the better the chances of getting the required result on the front page. Hence the definition of a "similar object" plays a crucial role in the quality of the search results: the better the definition, the better the results. Most of the brainstorming goes into defining the criteria used for similarity.

      3. Clustering Algorithms in Academics[10][11]

        The ability to monitor the progress of students' academic performance has been a critical issue for the academic community of higher learning. Clustering algorithms can be used to monitor students' academic performance: based on their scores, students are grouped into different clusters (using k-means, fuzzy c-means, etc.), where each cluster denotes a different level of performance. By knowing the number of students in each cluster, we can estimate the average performance of the class as a whole (see the sketch after this list).

      4. Clustering Algorithms in Wireless Sensor Network Based Applications[11][12]

        Clustering algorithms can be used effectively in applications based on wireless sensor networks; one such application is landmine detection. Here the clustering algorithm plays the role of finding the cluster heads (cluster centers), each of which collects all the data in its respective cluster.
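As a concrete illustration of the academics use case above, the following sketch groups students into three performance levels with scikit-learn's KMeans; the score data is invented for the example:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical exam scores, one feature per student.
scores = np.array([[35], [42], [48], [55], [61], [67], [74], [81], [88], [93]])

# Group the students into three performance levels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)

for level in range(3):
    members = scores[km.labels_ == level].ravel()
    print(f"cluster {level}: scores {list(members)}, mean score {members.mean():.1f}")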

From the above study, it can be concluded that partitioning based clustering methods are suitable for spherical-shaped clusters in small to medium sized data sets. Both k-means and k-medoids find clusters in a given database, both require the number of desired clusters k to be specified in advance, and for both the result and run time depend upon the initial partition. The advantage of k-means is its low computation cost, while its drawback is sensitivity to noisy data and outliers. In contrast, k-medoids is not sensitive to noisy data and outliers, but it has a high computation cost.

  1. Salem A. M., Fahim A. M., Torkey F. A., Ramdan M. A., An Efficient Enhanced k-Means Clustering Algorithm, Journal of Zhejiang University Science, 7(10): 1626-1633, 2006.

  2. Oyelade, O. J., Oladipupo, O. O., Obagbuwa, I. C., Application of k-Means Clustering Algorithm for Prediction of Students' Academic Performance, International Journal of Computer Science and Information Security (IJCSIS), Vol. 7, 2010.

  3. Depa Pratima, Nivedita Nimmakanti, Pattern Recognition Algorithms for Cluster Identification Problem, International Journal of Computer Science & Informatics (IJCSI), ISSN (Print): 2231-5292, Vol. II, Issue 1, 2.

  4. Aishwarya Batra, Analysis and Approach: K-Means and K-Medoids Data Mining Algorithms, 5th IEEE International Conference on Advanced Computing & Communication Technologies (ICACCT-2011), ISBN 81-87885-03-3.

  5. Fahim A. M., Salem A. M., Modified Enhanced k-Means Clustering Algorithm, Journal of Zhejiang University Science, 1626-1633, 2006.

  6. Frank Robinson, Amy Apon, Denny Brewer and Larry Dowdy, Initial Starting Point Analysis for K-Means Clustering: A Case Study, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, pp. 345-254.

  7. D. T. Pham, S. S. Dimov, and C. D. Nguyen, Selection of K in K-means Clustering, Proc. IMechE, Vol. 219, Part C: J. Mechanical Engineering Science.

  8. Kurt Hornik, Ingo Feinerer, Martin Kober, Christian Buchta, Spherical k-Means Clustering, Journal of Statistical Software, Vol. 50, Issue 10, September 2012.

  9. T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu, An Efficient k-Means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, pp. 881-892.

  10. Khalid Hammouda, Fakhreddine Karray, A Comparative Study of Data Clustering Techniques, University of Waterloo, Ontario, Canada N2L 3G1.

  11. Osama Abu Abbas, Comparison Between Data Clustering Algorithms, The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008.

  12. Azhar Rauf, Sheeba, Saeed Mahfooz, Shah Khusro and Huma Javed, Enhanced K-Mean Clustering Algorithm to Reduce Number of Iterations and Time Complexity, Middle-East Journal of Scientific Research, 12(7): 959-963, 2012.
