Statistical Measurement Space for Document Clustering Based on Correlation Preserving Indexing

DOI : 10.17577/IJERTV1IS7262


Uppe Nanaji, Majji Nagaraju

Dept. of CSE,

Avanthi Saint Theressa Institute of Engineering & Technology, JNTU Kakinada

Abstract— This paper presents a spectral clustering method named correlation preserving indexing (CPI), which is performed in the correlation similarity measure space. In this framework, the documents are projected into a low-dimensional semantic space in which the correlations between the documents in the local patches are maximized while the correlations between the documents outside these patches are minimized simultaneously. Statistical values comparing the existing system with the proposed system are also presented. Since the intrinsic geometrical structure of the document space is often embedded in the similarities between the documents, correlation as a similarity measure is more suitable for detecting the intrinsic geometrical structure of the document space than Euclidean distance. Consequently, the proposed CPI method can effectively discover the intrinsic structures embedded in high-dimensional document space. The effectiveness of the new method is demonstrated by extensive experiments conducted on various data sets and by comparison with existing document clustering techniques.

Keywords— Document clustering, correlation measure, latent semantic indexing, dimensionality reduction.

I. INTRODUCTION

Document clustering aims to automatically group related documents into clusters. It is one of the most important tasks in machine learning and artificial intelligence and has received much attention in recent years [1]-[3]. Based on various distance measures, a number of methods have been proposed to handle document clustering [4]-[10]. A typical and widely used distance measure is the Euclidean distance. The k-means method [4] is one of the methods that use the Euclidean distance; it minimizes the sum of the squared Euclidean distances between the data points and their corresponding cluster centres. Since the document space is always of high dimensionality, it is preferable to find a low-dimensional representation of the documents to reduce the computation complexity.
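For concreteness, a minimal sketch of this k-means baseline on TF-IDF document vectors, assuming scikit-learn and a toy corpus (both are illustrative choices, not part of the original experiments):

```python
# Minimal k-means baseline on TF-IDF document vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stocks fell on profit warnings",
    "the central bank raised interest rates",
    "the team won the championship game",
    "players traded before the season opener",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment for each document
```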

Low computation cost is achieved in spectral clustering methods, in which the documents are first projected into a low-dimensional semantic space and then a traditional clustering algorithm is applied to find document clusters. Latent semantic indexing (LSI) [7] is one of the effective spectral clustering methods, aimed at finding the best subspace approximation to the original document space by minimizing the global reconstruction error (Euclidean distance).
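A corresponding sketch of this LSI-style pipeline, again assuming scikit-learn; TruncatedSVD stands in for the rank-k subspace approximation, and the dimension 2 is arbitrary:

```python
# LSI-style spectral step: project the term-document matrix onto its
# leading singular subspace, then run k-means in the reduced space.
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["judge rules on appeal", "court hears the case",
        "rain delays the match", "striker scores twice"]
X = TfidfVectorizer().fit_transform(docs)

Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(labels)
```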

However, because of the high dimensionality of the document space, the documents usually reside on a nonlinear manifold embedded in the similarities between them. Thus, LSI is not able to effectively capture the nonlinear manifold structure embedded in these similarities [12]. An effective document clustering method must be able to find a low-dimensional representation of the documents that best preserves the similarities between the data points. The locality preserving indexing (LPI) method is a different spectral clustering method based on graph partitioning theory [8]. The LPI method applies a weighting function to each pairwise distance, attempting to capture the similarity structure, rather than the dissimilarity structure, of the documents. However, it does not overcome the essential limitation of Euclidean distance. Furthermore, the selection of the weighting function is often a difficult task.


In this paper, we propose a new document clustering method based on correlation preserving indexing (CPI), which explicitly considers the manifold structure embedded in the similarities between the documents. It aims to find an optimal semantic subspace by simultaneously maximizing the correlations between the documents in the local patches and minimizing the correlations between the documents outside these patches. The method is evaluated on Reuters sample data, and correlation measurements and comparisons between CPI and k-means are presented.

In recent years, some studies [13] suggest that correlation as a similarity measure can capture the intrinsic structure embedded in high-dimensional data, especially when the input data is sparse [19], [20]. In probability theory and statistics, correlation indicates the strength and direction of a linear relationship between two random variables, which reveals the nature of the data through the classical geometric concept of an "angle". It is a scale-invariant association measure usually used to calculate the similarity between two vectors. In many cases, correlation can effectively represent the distributional structure of the input data where conventional Euclidean distance cannot. The usage of correlation as a similarity measure can be found in the canonical correlation analysis (CCA) method [21]. The CCA method finds projections for paired data sets such that the correlations between their low-dimensional representatives in the projected spaces are mutually maximized.

Specifically, given a paired data set consisting of matrices $X = [x_1, x_2, \ldots, x_n]$ and $Y = [y_1, y_2, \ldots, y_n]$, we would like to find directions $w_x$ for $X$ and $w_y$ for $Y$ that maximize the correlation between the projections of $X$ on $w_x$ and the projections of $Y$ on $w_y$. This can be expressed as

$$\max_{w_x, w_y} \frac{\langle X w_x, Y w_y \rangle}{\| X w_x \| \cdot \| Y w_y \|} \quad (1)$$

where $\langle \cdot,\cdot \rangle$ and $\| \cdot \|$ denote the inner product and norm operators, respectively. As a powerful statistical technique, the CCA method has been applied in the field of pattern recognition and machine learning [20], [21]. Rather than finding a projection of one set of data, CCA finds projections for two sets of corresponding data $X$ and $Y$ into a single latent space such that the corresponding points in the two data sets are as close as possible. In the application of document clustering, however, while the document matrix $X$ is available, the cluster labels $Y$ are not, so the CCA method cannot be directly used for clustering. In this paper, we instead propose a document metric, correlation preserving indexing (CPI).
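To make objective (1) concrete, the following sketch evaluates the CCA correlation for one fixed pair of directions on random placeholder data (the data and dimensions are assumptions of the sketch, not values from the paper):

```python
# Evaluate the CCA-style objective (1) for fixed directions wx, wy:
# corr = <X wx, Y wy> / (||X wx|| * ||Y wy||).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # paired data set 1 (placeholder)
Y = rng.normal(size=(100, 3))   # paired data set 2 (placeholder)
wx = rng.normal(size=5)
wy = rng.normal(size=3)

u, v = X @ wx, Y @ wy
corr = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(corr)  # in [-1, 1]; CCA searches for wx, wy maximizing this value
```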

II. DOCUMENT CLUSTERING BASED ON CORRELATION PRESERVING INDEXING

In high-dimensional document space, the semantic structure is usually implicit. It is desirable to find a low-dimensional semantic subspace in which the semantic structure becomes clear. Hence, discovering the intrinsic structure of the document space is often a primary concern of document clustering. Since the manifold structure is often embedded in the similarities between the documents, correlation as a similarity measure is suitable for capturing the manifold structure embedded in the high-dimensional document space.

Mathematically, the correlation between two vectors $u$ and $v$ is defined as

$$\mathrm{corr}(u, v) = \frac{u^T v}{\sqrt{u^T u}\,\sqrt{v^T v}} \quad (2)$$

Note that the correlation corresponds to an angle $\theta$ such that $\cos\theta = \mathrm{corr}(u, v)$. The larger the value of $\mathrm{corr}(u, v)$, the stronger the association between the two vectors $u$ and $v$.

Online document clustering aims to group documents into clusters, which belongs to unsupervised learning. However, it can be transformed into semi-supervised learning by using the following assumptions:

A1) If two documents are close to each other in the original document space, they tend to be grouped into the same cluster [8].

A2) If two documents are far away from each other in the original document space, they tend to be grouped into different clusters.

Based on these assumptions, we can propose a spectral clustering method in the correlation similarity measure space through nearest-neighbour graph learning.

1) K-Means on Document Sets: The k-means method is one of the methods that use the Euclidean distance; it minimizes the sum of the squared Euclidean distances between the data points and their corresponding cluster centres. Since the document space is always of high dimensionality, it is preferable to find a low-dimensional representation of the documents to reduce computation complexity.

2) Correlation-Based Clustering Criteria: Suppose $y_i \in Y$ is the low-dimensional representation of the $i$-th document $x_i \in X$ in the semantic subspace, where $i = 1, 2, \ldots, n$. Then the above assumptions A1) and A2) can be expressed as

$$\max \sum_i \sum_{x_j \in N(x_i)} \mathrm{corr}(y_i, y_j) \quad (3)$$

$$\min \sum_i \sum_{x_j \notin N(x_i)} \mathrm{corr}(y_i, y_j) \quad (4)$$

respectively, where $N(x_i)$ denotes the set of nearest neighbours of $x_i$. The optimization of (3) and (4) is equivalent to learning the metric

$$d(x, y) = \lambda \cos(x, y)$$

where $d(x, y)$ denotes the similarity between the documents $x$ and $y$, and $\lambda$ corresponds to whether $x$ and $y$ are nearest neighbours of each other.

Physically, this model may be interpreted as follows. All documents are projected onto the unit hypersphere (a circle in 2-D). The global angles between the points in the local neighbourhoods, $\theta_i$, are minimized while the global angles between the points outside the local neighbourhoods, $\bar{\theta}_i$, are maximized simultaneously, as illustrated in Fig. 1. On the unit hypersphere, a global angle can be measured by a spherical arc, that is, the geodesic distance. The geodesic distance between $Z$ and $Z'$ on the unit hypersphere can be expressed as

$$d_G(Z, Z') = \arccos(Z^T Z') = \arccos(\mathrm{corr}(Z, Z')) \quad (5)$$

Since a strong correlation between $Z$ and $Z'$ means a small geodesic distance between $Z$ and $Z'$, CPI is equivalent to simultaneously minimizing the geodesic distances between the points in the local patches and maximizing the geodesic distances between the points outside these patches. The geodesic distance is superior to the traditional Euclidean distance in capturing the latent manifold [14]. Based on this conclusion, CPI can effectively capture the intrinsic structures embedded in the high-dimensional document space.
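A small sketch of the quantities in (2) and (5), on illustrative vectors:

```python
# Correlation (2) and geodesic distance on the unit hypersphere (5):
# d_G(z, z') = arccos(z^T z') = arccos(corr(z, z')) for unit-norm z, z'.
import numpy as np

def corr(u, v):
    return (u @ v) / (np.sqrt(u @ u) * np.sqrt(v @ v))

def geodesic(u, v):
    # Clip guards against round-off pushing the correlation outside [-1, 1].
    return np.arccos(np.clip(corr(u, v), -1.0, 1.0))

z1 = np.array([0.8, 0.6, 0.0])   # illustrative document vectors
z2 = np.array([0.6, 0.8, 0.0])
print(corr(z1, z2), geodesic(z1, z2))  # strong correlation -> small arc
```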

III. RELATED WORK

It has been shown that Euclidean distance is not appropriate for clustering high-dimensional normalized data such as text, and that a better metric for text clustering is the cosine similarity [12]. Lebanon [17] proposed a distance metric for text documents, defined as

$$d_{F,\lambda}(x, y) = \arccos\left( \sum_{i=1}^{n+1} \sqrt{ \frac{\lambda_i x_i}{\langle \lambda, x \rangle} \cdot \frac{\lambda_i y_i}{\langle \lambda, y \rangle} } \right) \quad (6)$$

If we use the notations $\sqrt{x} = (\sqrt{x_1}, \sqrt{x_2}, \ldots)$ and $\sqrt{y} = (\sqrt{y_1}, \sqrt{y_2}, \ldots)$ and set $\lambda_1 = \lambda_2 = \cdots = \lambda_n$, then the distance metric $d_{F,\lambda}(x, y)$ reduces to

$$d_F(x, y) = \arccos\left( \sum_{i=1}^{n+1} \frac{\sqrt{x_i y_i}}{\sqrt{\sum_i x_i}\,\sqrt{\sum_i y_i}} \right) = \arccos(\mathrm{Corr}(\sqrt{x}, \sqrt{y})) \quad (7)$$

This distance is very similar to the distance defined by (5). Since the distance $d_{F,\lambda}(x, y)$ is local (thus it captures local variations within the space) and is defined on the entire embedding space [17], correlation might be a suitable distance measure for capturing the intrinsic structure embedded in document space. That is why the proposed CPI method is expected to outperform the LPI method. Note that the distance $d_{F,\lambda}(x, y)$ must be obtained from training data, so it is suited to classification rather than clustering.
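The reduced distance (7) is straightforward to compute; a sketch for nonnegative term-frequency vectors (the sample vectors are illustrative):

```python
# Distance (7): arccos of the correlation between sqrt(x) and sqrt(y),
# for nonnegative term-frequency vectors x, y (uniform lambda weights).
import numpy as np

def d_F(x, y):
    num = np.sum(np.sqrt(x * y))
    den = np.sqrt(x.sum()) * np.sqrt(y.sum())
    return np.arccos(np.clip(num / den, -1.0, 1.0))

x = np.array([3.0, 1.0, 0.0, 2.0])   # illustrative term counts
y = np.array([2.0, 2.0, 1.0, 1.0])
print(d_F(x, y))
```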

1) Clustering Algorithm Based on CPI:

Given a set of documents $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$, let $X$ denote the document matrix. The algorithm for document clustering based on CPI can be summarized as follows:

1. Construct the local neighbour patches and compute the matrices $M_S$ and $M_T$.

2. Project the document vectors into the SVD subspace by throwing away the zero singular values. The singular value decomposition of $X$ can be written as $X = U \Sigma V^T$, where all zero singular values in $\Sigma$ have been removed and the vectors in $U$ and $V$ that correspond to these zero singular values have been removed as well. Thus the document vectors in the SVD subspace can be obtained by $\tilde{X} = U^T X$.

3. Compute the CPI projection. Based on the multipliers $\lambda_0, \lambda_1, \ldots, \lambda_n$ obtained from (19) and (20), one can compute the matrix $M = M_T + \lambda_0 M_S + \lambda_1 \tilde{x}_1 \tilde{x}_1^T + \cdots + \lambda_n \tilde{x}_n \tilde{x}_n^T$. Let $W_{CPI}$ be the solution of the generalized eigenvalue problem $M_S W = \lambda M W$. Then the low-dimensional representation of the documents can be computed by $\tilde{Y} = W^T X$, where $W = U W_{CPI}$ is the transformation matrix.

4. Cluster the documents in the CPI semantic subspace. Since the documents are projected onto the unit hypersphere, the inner product is a natural measure of similarity. We seek a partitioning $\{\pi_j\}_{j=1}^{k}$ that maximizes $\sum_{j=1}^{k} \sum_{x \in \pi_j} x^T c_j$, with $c_j = m_j / \|m_j\|$, where $m_j$ is the mean of the document vectors contained in cluster $\pi_j$.
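Step 4 is essentially spherical k-means. A minimal sketch, assuming the CPI-projected documents are already available (here replaced by random placeholders, since the construction of $M_S$, $M_T$ and the multipliers of step 3 is given elsewhere in the paper):

```python
# Spherical k-means for step 4: assign each unit vector to the centroid
# maximizing the inner product x^T c_j, then recompute c_j = m_j / ||m_j||.
import numpy as np

def spherical_kmeans(Y, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # unit hypersphere
    C = Y[rng.choice(len(Y), size=k, replace=False)]   # initial centroids
    for _ in range(iters):
        labels = np.argmax(Y @ C.T, axis=1)            # best inner product
        for j in range(k):
            m = Y[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
            C[j] = m / np.linalg.norm(m)
    return labels

Y_tilde = np.random.default_rng(1).normal(size=(20, 3))  # stand-in for CPI output
print(spherical_kmeans(Y_tilde, k=2))
```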

IV. COMPLEXITY ANALYSIS

The time complexity of the CPI clustering algorithm can be analyzed as follows. Consider $n$ documents in a $d$-dimensional space ($d \gg n$). In step 1, we need to compute the pairwise distances, which takes $O(n^2 d)$ operations; finding the $k$ nearest neighbours of each data point takes $O(k n^2)$ operations; and computing the matrices $M_S$ and $M_T$ requires $O(n^2 d)$ and $O(n(n-k)d)$ operations, respectively. Thus, the computation cost of step 1 is $O(2n^2 d + k n^2 + n(n-k)d)$. In step 2, the SVD decomposition of the matrix $X$ needs $O(d^3)$ operations and projecting the documents into the $n$-dimensional SVD subspace takes $O(n^2 d)$ operations, so step 2 costs $O(d^3 + n^2 d)$. In step 3, transforming the documents into the $m$-dimensional semantic subspace requires $O(m n^2)$ operations. In step 4, it takes $O(lcmn)$ operations to find the final document clusters, where $l$ is the number of iterations and $c$ is the number of clusters. Since $k \ll n$, $l \ll n$, and $m, n \ll d$ in document clustering applications, step 2 dominates the computation. To reduce the cost of step 2, one can apply an iterative SVD algorithm [18] rather than a full matrix decomposition, or use a feature selection method to first reduce the dimension.
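To illustrate the last remark, a truncated, randomized SVD can replace the full decomposition; scikit-learn's TruncatedSVD is used below merely as a stand-in for the bi-iteration algorithm of [18], and the matrix sizes are illustrative:

```python
# Truncated SVD as a cheaper substitute for a full O(d^3) decomposition:
# only the leading r components of the document-term matrix are computed.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))        # 200 documents x 5000 terms (placeholder)
svd = TruncatedSVD(n_components=50, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X)        # n x r projection, no full SVD needed
print(X_reduced.shape)                  # (200, 50)
```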

V. DOCUMENT REPRESENTATION

In all experiments, each document is represented as a term frequency vector. The term frequency vector can be computed as follows:

1. Transform each document into a list of terms after word stemming operations.

2. Remove stop words. Stop words are common words that carry no semantic content.

3. Compute the term frequency vector using the TF-IDF weighting scheme. The weight assigned to the term $t_i$ in document $d_j$ is given by $\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$. Here $\mathrm{tf}_{i,j} = n_{i,j} / \sum_k n_{k,j}$ is the term frequency of the term $t_i$ in document $d_j$, where $n_{i,j}$ is the number of occurrences of the considered term $t_i$ in document $d_j$; and $\mathrm{idf}_i = \log\left( |D| \,/\, |\{d : t_i \in d\}| \right)$ is the inverse document frequency, a measure of the general importance of the term $t_i$, where $|D|$ is the total number of documents in the corpus and $|\{d : t_i \in d\}|$ is the number of documents in which the term $t_i$ appears.

Let $v = \{t_1, t_2, \ldots, t_m\}$ be the list of terms after the stop-word removal and word stemming operations. The term frequency vector $X_j$ of document $d_j$ is defined as $X_j = [x_{1j}, x_{2j}, \ldots, x_{mj}]$ with $x_{ij} = \mathrm{tfidf}_{i,j}$. Using $n$ documents from the corpus, we construct an $m \times n$ term-document matrix $X$. The above process can be completed by using the text-to-matrix generator (TMG) code.
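A minimal sketch of steps 1-3 implementing the tf and idf formulas above directly (stemming is omitted; the stop-word list and corpus are illustrative, and the TMG code is not used here):

```python
# Build an m x n term-document matrix with tf_ij = n_ij / sum_k n_kj and
# idf_i = log(|D| / |{d : t_i in d}|), as defined above.
import math
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
stop = {"the", "on", "and"}                      # illustrative stop-word list
tokens = [[w for w in d.split() if w not in stop] for d in docs]

terms = sorted({w for t in tokens for w in t})   # term list v = {t_1 .. t_m}
df = {w: sum(w in t for t in tokens) for w in terms}
idf = {w: math.log(len(docs) / df[w]) for w in terms}

cols = []                                        # one column per document
for t in tokens:
    counts = Counter(t)
    total = sum(counts.values())
    cols.append([counts[w] / total * idf[w] for w in terms])
X = list(map(list, zip(*cols)))                  # transpose to m x n
print(len(X), len(X[0]))                         # m terms, n documents
```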

VI. CLUSTERING RESULTS

1) Correlation Measurement Values Generated: Experiments were performed on the Reuters and OHSUMED data sets. We compared the proposed algorithm with other competing algorithms under the same experimental settings. In all experiments, our algorithm performs better than, or competitively with, the other algorithms.

TABLE 1: Document clustering results based on CPI.

TABLE 2: Reuters document sets from different data sets, i.e., sample data of the Reuters corpus.

TABLE 3: Final document clusters based on CPI; each cluster contains a group of similar documents.

TABLE 4: Statistical measurements of the correlation values between documents belonging to the Reuters data set.

VII. CONCLUSION

We presented a new document clustering method based on correlation preserving indexing. It simultaneously maximizes the correlation between the documents in the local patches and minimizes the correlation between the documents outside these patches. Consequently, a low-dimensional semantic subspace is derived in which the documents corresponding to the same semantics are close to each other. Extensive experiments on the NG20, Reuters, and OHSUMED corpora show that the proposed CPI method outperforms other classical clustering methods. Furthermore, the CPI method has good capability in capturing and assigning cluster formations, and thus it can effectively deal with data of very large size.

REFERENCES:

[1] R. T. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proceedings of the 20th VLDB Conference, Santiago. New York, NY, USA: ACM, 1994, pp. 144-155.

[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

[3] S. Kotsiantis and P. Pintelas, "Recent advances in clustering: A brief survey," WSEAS Transactions on Information Science and Applications, vol. 1, no. 1, pp. 73-81, 2004.

[4] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press, 1967, pp. 281-297.

[5] L. D. Baker and A. K. McCallum, "Distributional clustering of words for text classification," in Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, Melbourne, AU. New York, NY, USA: ACM, 1998, pp. 96-103.

[6] X. Liu, Y. Gong, W. Xu, and S. Zhu, "Document clustering with cluster refinement and model selection capabilities," in SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2002, pp. 191-198.

[7] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.

[8] D. Cai, X. He, and J. Han, "Document clustering using locality preserving indexing," IEEE Trans. on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1624-1637, 2005.

[9] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," in SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2003, pp. 267-273.

[10] S. Zhong and J. Ghosh, "Generative model-based document clustering: a comparative study," Knowl. Inf. Syst., vol. 8, no. 3, pp. 374-384, 2005.

[11] D. K. Agrafiotis and H. Xu, "A self-organizing principle for learning nonlinear manifolds," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 25, pp. 15869-15872, 2002.

[12] S. Zhong and J. Ghosh, "Scalable, balanced model-based clustering," in Proc. 3rd SIAM Int. Conf. Data Mining. New York, NY, USA: ACM, 2003, pp. 71-82.

[13] Y. Fu, S. Yan, and T. S. Huang, "Correlation metric for generalized feature extraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 12, pp. 2229-2235, Dec. 2008.

[14] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, Dec. 2000.

[15] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in ICML-03, 20th International Conference on Machine Learning, 2003.

[16] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences, University of Wisconsin-Madison, Tech. Rep., 2005.

[17] G. Lebanon, "Metric learning for text documents," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 497-507, 2006.

[18] P. Strobach, "Bi-iteration SVD subspace tracking algorithms," IEEE Transactions on Signal Processing, vol. 45, no. 5, pp. 1222-1240, May 1997.

[19] Y. Ma, S. Lao, E. Takikawa, and M. Kawade, "Discriminant analysis in correlation similarity measure space," in ICML '07: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 577-584.

[20] R. D. Juday, B. V. K. Kumar, and A. Mahalanobis, Correlation Pattern Recognition. Cambridge University Press, 2005.

[21] D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Comput., vol. 16, no. 12, pp. 2639-2664, 2004.
