Lingo an approach for Clustering

DOI : 10.17577/IJERTV1IS3166

Download Full-Text PDF Cite this Publication

Text Only Version

Lingo an approach for Clustering

Poonam C. Fafat

Department of Information Technology PRMIT&R,Badnera

Prof.S.S.Sikchi

Department of Information Technology PRMIT&R,Badnera

Abstract Clustering is a process of forming groups (clusters) of similar objects from a given set of inputs. When applied to web search results, clustering can be perceived as a way of organizing the results into a number of easily browsable thematic groups. Top Down clustering is a type of Hierarchical Clustering. It tries to find bigger clusters first and then does fine grained clustering on these clusters. Hence the name Top Down. Any clustering algorithm can be used to perform the Top Level Clustering ( finding bigger clusters ) and the Bottom Level Clustering ( fine grained clustering on each of the top level clusters).

Keywords- Top down Clustering;snippets;outlier detection

  1. INTRODUCTION

    Top Down clustering is a type of Hie rarchica l Clustering. It tries to find bigger c lusters first and then does fine grained clustering on these clusters. Hence the name Top Down. Any clustering algorith m can be used to perform the Top Level Clustering and the Bottom Level Clustering The first step to execute the top down clustering, would be to run clustering algorith m , pre ferably with clustering para meters which will produce bigger clusters. This would be the top level c lustering. Then, the output of this clustering should be post processed, to group them into respective top level clusters. Clustering is a process of forming groups (clusters) of similar objects fro m a g iven set of inputs. When applied to web search results, clustering can be pe rceived as a way of organizing the results into a numbe r of easily browsable thematic groups. Moreover, the advanced users, who sometimes issue general queries to learn about the whole spectrum of sub-topics available, will no longer be made to manually scan hundreds of the resulting documents. Instead, they will be presented with a concise Su mma ry of various subjects, with an option of drilling down selected topics.What should be emphasized about web search clustering is that the thematic groups must be created ad hoc and fully automatica lly. Addit ionally, in contrast to the classic Information Retrieva l tasks, the web search clustering algorith ms do not operate on the full te xt documents, but only on the short summaries of the m returned by search engines, called snippe ts. Therefore, the algorith ms must take into account the limited length and often low quality of input data.

  2. BACKGROUND

    Co mpared to classic full-te xt clustering methods, search results clustering has its own characteristics: (1) its processing targets are document snippets returned by search engine rather than full-te xt sources. Therefore, the algorith ms must be prepared for limited length and often low quality of input data. (2) ad -hoc processing. This type of clustering should be regarded as the post-processing part of the on-line search engine. Therefore, the algorith ms should satisfy the need of real-t ime web service. (3) the goal of search results clustering is to help the users to locate the interest more quic kly. The refore, the algorith ms should be capable of supplying users with meaningful, concise and unambiguous clustering labels.

    In this paper the clustering of data will be done using the lingo algorithm. Lingo algorith m is a c lustering algorith m.

  3. PROPOSED WORK

The proposed work is based on the clustering algorith m. To the desired search engine the input query will be given.

  1. Imple mentation of Clustering Algorith m on dataset which is high dimensional categorical data.

  2. Imple mentation of Detection of Outliers high dimensional categorical data.

  3. Performance analysis of Lingo algorith m with

other clustering method.

Outlie r detection is a fundamental task that is useful in a number of data analytic applications. Outliers arise due to mechanica l faults, changes in system behaviour, fraudulent behaviour, network intrusions or human e rrors. It deals with the problem of identifying rare o r a typica l point s that wide ly diverge fro m the general behaviour or mode l of the

data. The process of detecting outliers and subsequently using them for data analysis relies on the underlying application. For e xa mp le, outlier detection can be employed as a pre-processing step to clean the data set fro m erroneous measure ments and noisy data points. On the other hand, it can also be used to isolate suspicious or interesting patterns in the data. Exa mples include fraud detection, customer relationship manage ment, network intrusion, clin ical diagnosis and biological data analysis .

Clustering is a process of forming groups (clusters) of similar objects fro m a g iven set of inputs. Good clusters have the characteristic that objects belonging to the same cluster are "similar" to each other, while objects fro m two diffe rent clusters are "dissimilar". The idea of clustering originates fro m statistics where it was applied to nu merical data. However, co mputer science and data min ing in particular, e xtended the notion to other types of data such as text or mu ltimed ia.

Clearly, web snippets belong to the class of textual data and hence it is the text c lustering algorith ms that should be regarded as a base for the web search results clustering systems.

When designing a web clustering algo rith m, special attention must be paid to ensuring that both contents and description (labels) of the resulting groups are mean ingful to the users. The majority of currently used text clustering algorith ms follow a sche me where cluster content discovery is performed first, and then, based on the content, the labels are determined. Unfortunately, this may result in some groups' descriptions being meaningless to the users, which in turn, is very often caused by the nonsensical content of the clusters themselves. To avoid such problems LINGO adopts a radically diffe rent approach to finding and describing groups.

The general idea behind LINGO is to first find

mean ingful descriptions of clusters, and then, based on the descriptions, determine their content. To assign documents to the already labeled groups LINGO could use the Latent Se mantic Inde xing in the setting for which it was origina lly designed: given a query retrieve the best matching documents. When a cluster labe l is fed into the LSI as a query, as a result contents of the cluster will be returned. This approach should take advantage of the LSI's ability to capture high-order semantic dependencies in the input collection. In this way not only would documents that contain the cluster label be retrieved, but also the documents in wh ich the same concept is e xpressed without using the e xact phrase. In web search results clustering, however, the effect of semantic retrieval is sharply diminished by the small size of the input web snippets This, in turn, severely affects the precision of cluster content assignment. That is why we have decided to use the simple Vector Space Model instead of the LSI to determine the cluster content.

To become a full-featured clustering algorith m, the process of finding cluster labels and contents must be preceded by some preprocessing of the input collection. This stage should encompass text filtering, document's language recognition, stemming and stop words identification. It is also recommended that post-processing of the resulting clusters be performedto eliminate groups with identical contents and to merge the overlapping ones. As a summary of the introductory discussion, in Figure 1.2 we present the ma in phases of LINGO.

/** Phase 1: Preprocessing */ for each document

{

do text filtering;

identify the document's language; apply stemming;

ma rk stop words;

}

/** Phase 2: Feature e xt raction */ discover frequent terms and phrases;

/** Phase 3: Cluster label induction */ use LSI to d iscover abstract concepts; for each abstract concept{

find best-matching phrase;

}

prune simila r c luster labels;

/** Phase 4: Cluster content discovery */ for each cluster label{

use VSM to determine the cluster contents;

}

/** Phase 5: Final c luster formation */ calculate cluster scores;

apply cluster me rging

Noteworthy is the fact that a ll five phases of LINGO are independent and easily separable. This allo ws to man ipulate the quality and resource require ments of the algorith m by providing a lternative imp le mentations of some of its components.

Conclusion

The aim o f this project is to e xperiment with an effect ive clustering algorith m on the search res ults returned from

Google search engine. The clustering method used in this project is ca lled Description Co mes First (DCF). The core idea is to first find meaningful c luster labels, then assign snippets to them to create proper c lusters. So the algorith m can be split into two ma jor phases: cluster label induction phase and cluster content assignment phase.

Evaluation of the algorith m will be carried out on the test data fro m Open Directory Project (ODP) which serves as a source of pre-clustered document snippets.

The main result of this project is the design of LINGO a description-oriented web search results clustering algorith m. We strongly felt that it should be easier to find docu ments that match a meaningful cluster label than to describe a potentially senseless group of snippets. To verify the practical value of our idea we imple mented LINGO as a component of the Carrot2 fra me work. During the imple mentation work, we identified addit ional factors that significantly influence the quality of clustering. We cla im that proper preprocessing of the input data, including language identification, is of crucia l importance in web mining tasks.

  1. D. Barbara´, J. Couto, and Y. Li, COOLCAT: An Entropy -Based Algorithm for Categorical Clustering, Proc. 11th ACM Conf. Information and Knowledge M anagement (CIKM 02), pp. 582-589, 2002.

    .

    References

    1. P. Andritsos, P. Tsaparas, R. M iller, and K. Sevcik,

      LIM BO: Scalable Clustering of Categorical Data, Proc. Ninth Intl Conf. Extending Database Technology (EDBT 04), pp. 123-146, 2004.

    2. R. Ng and J. Han, CLARANS: A M ethod for Clustering

      Objects for Spatial Data M ining, IEEE Trans. Knowledge and Data Eng., vol. 14, no. 5, pp. 1003-1016, Sept./Oct. 2002.

    3. S. Guha, R. Rastogi, and K. Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, Information Systems, vol. 25, no. 5, pp. 345-366, 2001

    4. J. Basak and R. Krishnapuram, Interpretable

      Hierarchical Clustering by Constructing an Unsupervised Decision Tree, IEEE Trans. Knowledge and Data Eng., vol. 17, no. 1, Jan. 2005

    5. Eugenio Cesario, Giuseppe M anco, Riccardo Ortale,

Top-Down Parameter-Free Clustering of High- Dimensional Categorical Data, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 12, DECEM BER 2007.

Leave a Reply