- Open Access
- Authors : Uppe Nanaji, Majji Nagaraju
- Paper ID : IJERTV1IS7262
- Volume & Issue : Volume 01, Issue 07 (September 2012)
- Published (First Online): 25-09-2012
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Statistical Measurement Space for Document Clustering Based on Correlation Preserving Indexing
Uppe Nanaji, Majji Nagaraju
Dept. of CSE
Avanthi Saint Theressa Institute of Engineering & Technology, JNTU Kakinada
Abstract— This paper presents a spectral clustering method named correlation preserving indexing (CPI), which is performed in the correlation similarity measure space. In this framework, the documents are projected into a low-dimensional semantic space in which the correlations between the documents in the local patches are maximized while the correlations between the documents outside these patches are minimized simultaneously. Finally, statistical values are given comparing the existing system with the proposed one. Since the intrinsic geometrical structure of the document space is often embedded in the similarities between the documents, correlation as a similarity measure is more suitable for detecting the intrinsic geometrical structure of the document space than the Euclidean distance. Consequently, the proposed CPI method can effectively discover the intrinsic structures embedded in high-dimensional document space. The effectiveness of the new method is demonstrated by extensive experiments conducted on various data sets and by comparison with existing document clustering techniques.
Keywords— Document clustering, correlation measure, correlation preserving indexing, dimensionality reduction.
I. INTRODUCTION
Document clustering aims to automatically group related documents into clusters. It is one of the most important tasks in machine learning and artificial intelligence and has received much attention in recent years [1]-[3]. Based on various distance measures, a number of methods have been proposed to handle document clustering [4]-[10]. A typical and widely used distance measure is the Euclidean distance. The k-means method [4] is one of the methods that use the Euclidean distance; it minimizes the sum of the squared Euclidean distances between the data points and their corresponding cluster centres. Since the document space is always of high dimensionality, it is preferable to find a low-dimensional representation of the documents to reduce the computation complexity.
Low computation cost is achieved in spectral clustering methods, in which the documents are first projected into a low-dimensional semantic space and then a traditional clustering algorithm is applied to find document clusters. Latent semantic indexing (LSI) [7] is one of the effective spectral clustering methods, aimed at finding the best subspace approximation to the original document space by minimizing the global reconstruction error (Euclidean distance).
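For concreteness, the following minimal sketch (ours, not the authors'; the toy matrix sizes are arbitrary) shows the usual way such an LSI subspace is obtained, via a truncated SVD of the term-document matrix:

```python
import numpy as np

# Minimal LSI sketch (illustrative): project a term-document matrix
# onto its top-k left singular vectors, which gives the rank-k subspace
# minimizing the global (Euclidean) reconstruction error.
rng = np.random.default_rng(0)
X = rng.random((500, 40))          # 500 terms x 40 documents (toy data)
k = 5                              # target dimensionality

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_lsi = U[:, :k].T @ X             # k x 40: documents in the LSI subspace
```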
However, because of the high dimensionality of the document space, the documents usually reside on a nonlinear manifold embedded in the similarities between them. The Euclidean distance, being a dissimilarity measure, is not able to effectively capture the nonlinear manifold structure embedded in these similarities [12]. An effective document clustering method must be able to find a low-dimensional representation of the documents that can best preserve the similarities between the data points. The locality preserving indexing (LPI) method is a different spectral clustering method based on graph partitioning theory [8]. The LPI method applies a weighting function to each pairwise distance, attempting to capture the similarity structure, rather than the dissimilarity structure, of the documents. However, it does not overcome the essential limitation of the Euclidean distance, and the selection of the weighting functions is often a difficult task.
In this paper, we propose a new document clustering method based on correlation preserving indexing (CPI), which explicitly considers the manifold structure embedded in the similarities between the documents. It aims to find an optimal semantic subspace by simultaneously maximizing the correlations between the documents in the local patches and minimizing the correlations between the documents outside these patches. The method is evaluated on Reuters sample data, and we present correlation measurements and comparisons between CPI and k-means.
In recent years, some studies [13] suggest that correlation as a similarity measure can capture the intrinsic structure embedded in high-dimensional data, especially when the input data is sparse [19], [20]. In probability theory and statistics, correlation indicates the strength and direction of a linear relationship between two random variables, which reveals the nature of data through the classical geometric concept of an "angle". It is a scale-invariant association measure usually used to calculate the similarity between two vectors. In many cases, correlation can effectively represent the distributional structure of the input data where the conventional Euclidean distance cannot. The usage of correlation as a similarity measure can be found in the canonical correlation analysis (CCA) method [21]. The CCA method finds projections for paired data sets such that the correlations between their low-dimensional representatives in the projected spaces are mutually maximized. Specifically, given a paired data set consisting of matrices $X = [x_1, x_2, \ldots, x_n]$ and $Y = [y_1, y_2, \ldots, y_n]$, we would like to find directions $w_x$ for $X$ and $w_y$ for $Y$ that maximize the correlation between the projections of $X$ on $w_x$ and the projections of $Y$ on $w_y$. This can be expressed as

$$\max_{w_x, w_y} \frac{\langle X w_x,\, Y w_y \rangle}{\lVert X w_x \rVert \cdot \lVert Y w_y \rVert} \tag{1}$$

where $\langle \cdot,\cdot \rangle$ and $\lVert \cdot \rVert$ denote the inner product and norm operators, respectively. As a powerful statistical technique, the CCA method has been applied in the fields of pattern recognition and machine learning [20], [21]. Rather than finding a projection of one set of data, CCA finds projections for two sets of corresponding data $X$ and $Y$ into a single latent space that projects the corresponding points in the two data sets to be as close as possible. In the application of document clustering, while the document matrix $X$ is available, the cluster labels ($Y$) are not, so the CCA method cannot be directly used for clustering. This motivates the document metric proposed in this paper, correlation preserving indexing (CPI).
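As an illustration of objective (1), the sketch below (our own; `cca_objective` is a hypothetical helper, not from the paper) evaluates the correlation between the one-dimensional projections $Xw_x$ and $Yw_y$ for given directions:

```python
import numpy as np

def cca_objective(X, Y, wx, wy):
    """Correlation objective of Eq. (1) for fixed directions wx, wy."""
    u, v = X @ wx, Y @ wy
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Classical CCA maximizes this quantity over $w_x, w_y$ in closed form via a generalized eigenproblem; the sketch only evaluates it.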
Online document clustering aims to group the documents into clusters, which belongs to unsupervised learning. However, it can be transformed into a semi-supervised learning problem by using the following side information:

A1) If two documents are close to each other in the original document space, then they tend to be grouped into the same cluster [8].

A2) If two documents are far away from each other in the original document space, they tend to be grouped into different clusters.

Based on these assumptions, we can perform spectral clustering in the correlation similarity measure space through nearest-neighbour graph learning.
II. RELATED WORK

1) K-Means on Document Sets: The k-means method is one of the methods that use the Euclidean distance; it minimizes the sum of the squared Euclidean distances between the data points and their corresponding cluster centres. Since the document space is always of high dimensionality, it is preferable to find a low-dimensional representation of the documents to reduce the computation complexity.
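A minimal NumPy sketch of this baseline (ours, not the paper's implementation) that minimizes the summed squared Euclidean distance:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means on the rows of X, minimizing the sum of
    squared Euclidean distances to the assigned cluster centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centre (squared Euclidean)
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute centres as cluster means (keep old centre if empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres
```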
2) Correlation-Based Clustering Criteria: Suppose $y_i \in Y$ is the low-dimensional representation of the $i$-th document $x_i \in X$ in the semantic subspace, where $i = 1, 2, \ldots, n$. Mathematically, the correlation between two vectors $u$ and $v$ is defined as

$$\operatorname{corr}(u, v) = \frac{u^T v}{\sqrt{u^T u}\,\sqrt{v^T v}} \tag{2}$$

Note that the correlation corresponds to an angle $\theta$ such that $\cos\theta = \operatorname{corr}(u, v)$; the larger the value of $\operatorname{corr}(u, v)$, the stronger the association between the two vectors $u$ and $v$. The above assumptions A1) and A2) can then be expressed as

$$\max \sum_i \sum_{x_j \in N(x_i)} \operatorname{corr}(y_i, y_j) \tag{3}$$

$$\min \sum_i \sum_{x_j \notin N(x_i)} \operatorname{corr}(y_i, y_j) \tag{4}$$

respectively, where $N(x_i)$ denotes the set of nearest neighbours of $x_i$.
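The following sketch (our illustration; `neighbours` is assumed to be precomputed, e.g., by a k-nearest-neighbour search) evaluates the two sums in (3) and (4) for a candidate representation:

```python
import numpy as np

def corr(u, v):
    """Correlation of Eq. (2): the cosine of the angle between u and v."""
    return float(u @ v / (np.sqrt(u @ u) * np.sqrt(v @ v)))

def cpi_criteria(Y, neighbours):
    """Evaluate the sums of Eqs. (3) and (4) for a low-dimensional
    representation Y (one row per document).  `neighbours[i]` is the
    set N(x_i) of nearest-neighbour indices of document i."""
    n = len(Y)
    inside = sum(corr(Y[i], Y[j]) for i in range(n) for j in neighbours[i])
    outside = sum(corr(Y[i], Y[j]) for i in range(n)
                  for j in range(n) if j != i and j not in neighbours[i])
    return inside, outside   # CPI maximizes the first, minimizes the second
```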
The optimization of (3) and (4) is equivalent to a metric learning problem over similarities of the form $\lambda_{ij}\cos(x_i, x_j)$, where the similarity between the documents $x_i$ and $x_j$ is weighted according to whether $x_i$ and $x_j$ are nearest neighbours of each other.
Physically, this model may be interpreted as follows. All documents are projected onto the unit hypersphere (a circle in 2D). The angles between the points within the local neighbourhoods, $\theta_i$, are minimized, and the angles between the points outside the local neighbourhoods, $\bar\theta_i$, are maximized simultaneously, as illustrated in Fig. 1. On the unit hypersphere, a global angle can be measured by the spherical arc, that is, the geodesic distance. The geodesic distance between $Z$ and $Z'$ on the unit hypersphere can be expressed as

$$d_G(Z, Z') = \arccos(Z^T Z') = \arccos\big(\operatorname{corr}(Z, Z')\big) \tag{5}$$
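A direct transcription of (5) into code (ours; the normalization step makes the inputs valid points on the unit hypersphere):

```python
import numpy as np

def geodesic(z1, z2):
    """Geodesic distance of Eq. (5): the arc length (angle) between two
    points on the unit hypersphere."""
    z1, z2 = z1 / np.linalg.norm(z1), z2 / np.linalg.norm(z2)
    return float(np.arccos(np.clip(z1 @ z2, -1.0, 1.0)))
```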
Since a strong correlation between $Z$ and $Z'$ means a small geodesic distance between $Z$ and $Z'$, CPI is equivalent to simultaneously minimizing the geodesic distances between the points in the local patches and maximizing the geodesic distances between the points outside these patches. The geodesic distance is superior to the traditional Euclidean distance in capturing the latent manifold [14]. Based on this conclusion, CPI can effectively capture the intrinsic structures embedded in the high-dimensional document space.
It has been shown that the Euclidean distance is not appropriate for clustering high-dimensional normalized data such as text, and that a better metric for text clustering is the cosine similarity [12]. Lebanon [17] proposed a distance metric for text documents, defined as

$$d_{F^*_\lambda}(x, y) = \arccos\left(\sum_{i=1}^{n+1} \sqrt{\frac{x_i \lambda_i}{\langle x, \lambda \rangle} \cdot \frac{y_i \lambda_i}{\langle y, \lambda \rangle}}\right) \tag{6}$$

If we use the notations $\tilde x = (\sqrt{x_1}, \sqrt{x_2}, \ldots)$ and $\tilde y = (\sqrt{y_1}, \sqrt{y_2}, \ldots)$ and set $\lambda_1 = \lambda_2 = \cdots = \lambda_n$, then the distance metric $d_{F^*_\lambda}(x, y)$ reduces to

$$d_F(x, y) = \arccos\left(\sum_{i=1}^{n+1} \frac{\lambda \sqrt{x_i y_i}}{\lVert \tilde x \rVert\,\lVert \tilde y \rVert}\right) = \arccos\big(\operatorname{Corr}(\tilde x, \tilde y)\big) \tag{7}$$

This distance is very similar to the distance defined by (5). Since the distance $d_{F^*_\lambda}(x, y)$ is local (thus it captures local variations within the space) and is defined on the entire embedding space [17], correlation might be a suitable distance measure for capturing the intrinsic structure embedded in document space. That is why the proposed CPI method is expected to outperform the LPI method. Note, however, that this distance is obtained from training data and is used for classification rather than clustering.
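A small sketch of the equal-weight reduction (7) (ours; it implements only the special case where all $\lambda_i$ are equal, so the weight cancels out of the correlation):

```python
import numpy as np

def fisher_distance(x, y):
    """Equal-weight case of Eqs. (6)-(7): arccos of the correlation
    between the element-wise square roots of x and y, where x and y
    are non-negative term-frequency vectors."""
    xs, ys = np.sqrt(x), np.sqrt(y)
    c = xs @ ys / (np.linalg.norm(xs) * np.linalg.norm(ys))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))
```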
III. DOCUMENT CLUSTERING BASED ON CORRELATION PRESERVING INDEXING

In high-dimensional document space, the semantic structure is usually implicit, and it is desirable to find a low-dimensional semantic subspace in which that structure becomes clear. Discovering the intrinsic structure of the document space is therefore often a primary concern of document clustering. Since the manifold structure is often embedded in the similarities between the documents, correlation as a similarity measure is suitable for capturing the manifold structure embedded in the high-dimensional document space.

1) Clustering Algorithm Based on CPI: Given a set of documents $x_1, x_2, \ldots, x_n$, let $X$ denote the document matrix. The algorithm for document clustering based on CPI can be summarized as follows (a sketch in code follows the list):

a) Construct the local neighbour patches and compute the matrices $M_S$ and $M_T$.

b) Project the document vectors into the SVD subspace by throwing away the zero singular values. The singular value decomposition of $X$ can be written as $X = U \Sigma V^T$, where all zero singular values in $\Sigma$ have been removed and, accordingly, the vectors in $U$ and $V$ that correspond to these zero singular values have been removed as well. The document vectors in the SVD subspace are then obtained by $\tilde X = U^T X$.

c) Compute the CPI projection. Based on the multipliers $\lambda_0, \lambda_1, \ldots, \lambda_n$ obtained from (19) and (20), one can compute the matrix $M^* = M_T + \lambda_1 \tilde x_1 \tilde x_1^T + \cdots + \lambda_n \tilde x_n \tilde x_n^T$. Let $W_{CPI}$ be the solution of the generalized eigenvalue problem $M_S W = \lambda M^* W$. The low-dimensional representation of the documents can then be computed as $\tilde x = W^T X$, where $W = U W_{CPI}$ is the transformation matrix.

d) Cluster the documents in the CPI semantic subspace. Since the documents are projected onto the unit hypersphere, the inner product is a natural measure of similarity. We seek a partitioning $\{C_j\}_{j=1}^{k}$ that maximizes $\sum_{j=1}^{k} \sum_{x \in C_j} \frac{x^T m_j}{\lVert m_j \rVert}$, where $m_j$ is the mean of the document vectors contained in cluster $C_j$.
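The sketch below is our hedged end-to-end reading of steps a)-d), not the authors' implementation: the paper does not spell out $M_S$ and $M_T$ in this excerpt, so they are built here as illustrative stand-ins from document pairs inside/outside the neighbour patches, and the multiplier-based $M^*$ is replaced by a regularized $M_T$.

```python
import numpy as np
from scipy.linalg import eigh

def cpi_cluster(X, k_neighbours=5, dim=10, n_clusters=3, n_iter=50):
    """Hedged sketch of CPI clustering. X is a d x n term-document matrix."""
    d, n = X.shape

    # step a: local neighbour patches via correlation similarity
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    S = Xn.T @ Xn                                   # n x n correlations
    nbrs = np.argsort(-S, axis=1)[:, 1:k_neighbours + 1]

    # step b: project into the SVD subspace, dropping zero singular values
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U = U[:, s > 1e-10]
    Xt = U.T @ X                                    # documents in SVD subspace

    # illustrative M_S / M_T from pairs inside / outside the patches
    inside = np.zeros((n, n), dtype=bool)
    for i in range(n):
        inside[i, nbrs[i]] = True
    Ms = np.zeros((Xt.shape[0], Xt.shape[0]))
    Mt = np.zeros_like(Ms)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            P = np.outer(Xt[:, i], Xt[:, j])
            T = Ms if (inside[i, j] or inside[j, i]) else Mt
            T += P + P.T                            # in-place symmetric update

    # step c: CPI projection from the generalized eigenproblem
    Mt += 1e-6 * np.eye(len(Mt))                    # regularize for stability
    vals, vecs = eigh(Ms, Mt)                       # ascending eigenvalues
    Wcpi = vecs[:, -min(dim, vecs.shape[1]):]       # top eigenvectors
    Y = Wcpi.T @ Xt                                 # dim x n representation

    # step d: spherical k-means using the inner product on the hypersphere
    Y = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    rng = np.random.default_rng(0)
    labels = rng.integers(0, n_clusters, size=n)
    for _ in range(n_iter):
        m = np.stack([Y[:, labels == j].mean(axis=1) if np.any(labels == j)
                      else Y[:, rng.integers(n)]
                      for j in range(n_clusters)], axis=1)
        m = m / np.linalg.norm(m, axis=0, keepdims=True)
        new = (Y.T @ m).argmax(axis=1)              # maximize x^T m_j / ||m_j||
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```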
IV. COMPLEXITY ANALYSIS

The time complexity of the CPI clustering algorithm can be analyzed as follows. Consider n documents in a d-dimensional space (d >> n). In step a, we first compute the pairwise distances, which needs O(n^2 d) operations; we then find the k nearest neighbours of each data point, which needs O(kn^2) operations; finally, computing the matrices M_S and M_T requires O(n^2 d) and O(n(n-k)d) operations, respectively. Thus, the computation cost of step a is O(2n^2 d + kn^2 + n(n-k)d). In step b, the SVD decomposition of the matrix X needs O(d^3) operations, and projecting the documents into the n-dimensional SVD subspace takes O(n^2 d) operations; as a result, step b costs O(d^3 + n^2 d). Transforming the documents into the m-dimensional semantic subspace in step c requires O(mn^2) operations. In step d, it takes O(lcmn) operations to find the final document clusters, where l is the number of iterations and c is the number of clusters. Since k << n, l << n, and m, n << d in document clustering applications, step b dominates the computation. To reduce its cost, one can apply an iterative SVD algorithm [18] rather than a full matrix decomposition, or use a feature selection method to first reduce the dimension.
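One way to realize that saving, sketched under our own assumptions (SciPy's Krylov-based `svds` stands in for the bi-iteration algorithm of [18]; the matrix sizes are arbitrary):

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Cost-saving alternative for step b: compute only the leading singular
# triplets of the sparse term-document matrix instead of a full O(d^3)
# decomposition.
X = sparse_random(10000, 500, density=0.01, random_state=0)  # d x n, sparse
U, s, Vt = svds(X, k=100)          # top-100 singular triplets only
X_proj = (X.T @ U).T               # 100 x n: documents in the reduced subspace
```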
V. DOCUMENT REPRESENTATION
In all experiments, each document is represented as a term frequency vector. The term frequency vector can be computed as follows:
-
Transform the documents to a list of terms after word stemming operations.
-
Remove stop words. Stop words are common words that contain no semantic content.
-
Compute the term frequency vector using the TF/IDF weighting scheme. The TF/IDF weight assigned to the term $t_i$ in document $d_j$ is given by

$$\text{tf-idf}_{i,j} = \text{tf}_{i,j} \times \text{idf}_i$$

Here $\text{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$ is the term frequency of the term $t_i$ in document $d_j$, where $n_{i,j}$ is the number of occurrences of the term $t_i$ in document $d_j$, and $\text{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$ is the inverse document frequency, a measure of the general importance of the term $t_i$; $|D|$ is the total number of documents in the corpus and $|\{d : t_i \in d\}|$ is the number of documents in which the term $t_i$ appears.

Let $v = \{t_1, t_2, \ldots, t_m\}$ be the list of terms remaining after the stop-word removal and word stemming operations. The term frequency vector $X_j$ of document $d_j$ is defined as $X_j = [x_{1j}, x_{2j}, \ldots, x_{mj}]$ with $x_{ij} = \text{tf-idf}_{i,j}$. Using n documents from the corpus, we construct an m x n term-document matrix X (a small sketch follows). The above process can be completed by using the text-to-matrix generator (TMG) code.
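A self-contained sketch of this construction (ours; the TMG code referenced above is not reproduced, and the toy documents are made up):

```python
import numpy as np

def term_document_matrix(docs):
    """Build the m x n TF-IDF term-document matrix X described above.
    `docs` is a list of documents, each a list of already stemmed,
    stop-word-free terms."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    m, n = len(vocab), len(docs)

    X = np.zeros((m, n))
    df = np.zeros(m)                      # document frequency per term
    for j, doc in enumerate(docs):
        for t in doc:
            X[index[t], j] += 1           # raw counts n_{i,j}
        X[:, j] /= len(doc)               # tf_{i,j} = n_{i,j} / sum_k n_{k,j}
        df[np.unique([index[t] for t in doc])] += 1
    idf = np.log(n / df)                  # idf_i = log(|D| / |{d : t_i in d}|)
    return X * idf[:, None], vocab

X, vocab = term_document_matrix([["text", "cluster", "text"],
                                 ["cluster", "index"],
                                 ["semantic", "index", "text"]])
```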
VI. CLUSTERING RESULTS

1) Correlation Measurement Values Generated: Experiments were performed on the Reuters and OHSUMED data sets. We compared the proposed algorithm with other competing algorithms under the same experimental settings. In all experiments, our algorithm performs better than, or competitively with, the other algorithms.

TABLE 1: Document clustering results based on CPI.

TABLE 2: Reuters document sets drawn from different data sets (sample data of the Reuters corpus).

TABLE 3: Final document clusters based on CPI; each cluster contains a similar group of documents.

TABLE 4: Statistical measurements of correlation values between documents belonging to the Reuters data set.
VII. CONCLUSION

We have presented a new document clustering method based on correlation preserving indexing. It simultaneously maximizes the correlation between the documents in the local patches and minimizes the correlation between the documents outside these patches. Consequently, a low-dimensional semantic subspace is derived in which the documents corresponding to the same semantics are close to each other. Extensive experiments on the NG20, Reuters, and OHSUMED corpora show that the proposed CPI method outperforms other classical clustering methods. Furthermore, the CPI method has a good capability of capturing and assigning cluster structure, and thus it can effectively deal with very large data sets.
REFERENCES

[1] R. T. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proceedings of the 20th VLDB Conference, Santiago, 1994, pp. 144-155.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[3] S. Kotsiantis and P. Pintelas, "Recent advances in clustering: a brief survey," WSEAS Transactions on Information Science and Applications, vol. 1, no. 1, pp. 73-81, 2004.
[12] S. Zhong and J. Ghosh, "Scalable, balanced model-based clustering," in Proc. 3rd SIAM Int. Conf. Data Mining, 2003, pp. 71-82.
[13] Y. Fu, S. Yan, and T. S. Huang, "Correlation metric for generalized feature extraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 12, pp. 2229-2235, Dec. 2008.
[14] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, Dec. 2000.
[15] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. 20th International Conference on Machine Learning (ICML), 2003.
[16] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences, University of Wisconsin-Madison, Tech. Rep., 2005.
[17] G. Lebanon, "Metric learning for text documents," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 497-507, 2006.
[18] P. Strobach, "Bi-iteration SVD subspace tracking algorithms," IEEE Transactions on Signal Processing, vol. 45, no. 5, pp. 1222-1240, May 1997.
[19] Y. Ma, S. Lao, E. Takikawa, and M. Kawade, "Discriminant analysis in correlation similarity measure space," in Proc. 24th International Conference on Machine Learning (ICML), 2007, pp. 577-584.
[20] R. D. Juday, B. V. K. Kumar, and A. Mahalanobis, Correlation Pattern Recognition. Cambridge University Press, 2005.
[21] D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor, "Canonical correlation analysis: an overview with application to learning methods," Neural Computation, vol. 16, no. 12, pp. 2639-2664, 2004.