Survey Paper on Clustering of Documents Based on Partitioning the Features

Prof. S. R. Durugkar; Madhuri Malode

doi:10.17577/IJERTV3IS10496

Volume 03, Issue 01 (January 2014)

Survey Paper on Clustering of Documents Based on Partitioning the Features

DOI : 10.17577/IJERTV3IS10496

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 67
Total Downloads : 280
Authors : Prof. S. R. Durugkar, Madhuri Malode
Paper ID : IJERTV3IS10496
Volume & Issue : Volume 03, Issue 01 (January 2014)
Published (First Online): 18-01-2014
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Survey Paper on Clustering of Documents Based on Partitioning the Features

Prof. S. R. Durugkar1, Madhuri Malode2

Assistant Professor1, P.G. student2 Department of Computer Engineering

Late G.N.Sapkal College of Engineering, Anjaneri, Nasik 1, 2

University Of Pune

Abstract :

Finding the appropriate number of clusters to which documents should be partitioned is crucial in document clustering. In this paper we will focus on various clustering techniques and our proposed system to discover the cluster structure without requiring the number of clusters as

input.

the user is interested in. The users are forced to sift through a long list of off-topic documents. Moreover, internal relationships among the documents in the search result are rarely presented and are left for the user.

Document features or even we can say that the various attributes will be automatically partitioned into two groups, in particular, discriminative words and nondiscriminative words, and contribute differently to document clustering.

There is one variational inference algorithm which we have studied to infer the document collection structure as well as the partition of document words at the same time. We will justify at the end in the conclusion how our approach will perform well on the data set. Then we will justify our systems accuracy and efficiency by describing what we have proposed with the predicted modules and features for the system.

Keywords Database management, database applications-text mining, pattern recognition, clustering document clustering, feature partition

Fig 1.1 Clustering Overview

Search engines or any information retrieval application are an invaluable tool for retrieving information from the Web. In response to a user query, they return a list of results ranked in order of relevance to the query.

I.INTRODUCTION

Clustering

Clustering can be measured the most important unsupervised learning problem;

It contract with finding a structure in a collection of unlabeled data.

A general definition of clustering could be the process of organizing given objects into certain number of groups whose members are similar in some way. A cluster is therefore a collection of objects which are similar between them and are dissimilar to the objects belonging to other clusters.

If the query is general, it is extremely difficult to identify the specific document which

Fig 1.2 Formation of cluster of given data

The objective of clustering is to decrease the amount of data by categorizing or grouping similar data items and present them collectively. Such grouping is pervasive in the way humans process information, and one of the inspirations for using clustering algorithms is to

provide automated tools to help in constructing categories or taxonomies

The user starts at the top of the list and follows it down examining one result at a time, until the sought information has been found. third method is search results clustering, which consists of grouping the results returned by a search engine into a hierarchy of labeled clusters (also called categories). This method combines the best features of query-based and category- based search, Not only has search results clustering attracted considerable commercial interest, but it is also an active research area, with a large number of published papers discussing specific issues and systems. Search results clustering is clearly related to the field of document clustering but it poses unique challenges concerning both the effectiveness and the efficiency of the underlying algorithms that cannot be addressed by conventional techniques.
Uses and possible Application Areas of Clustering
- Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
- Biology: classification of plants and animals given their features;
- Libraries: book ordering;
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
- City-planning: identifying groups of houses according to their house type, value and geographical location;
- Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
Typical requirements or constraint for execution of any clustering algorithm
- Scalability of the objects or the system;
- Dealing with different objects;
- Discovering clusters with arbitrary pattern;
- Minimal requirements for specific system
- Ability to deal with noisy and outlier data;
- High dimensionality;
- Interpretability and usability.
Document Clustering

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval.

Initially, document clustering was used for improving the precision or recall in information retrieval applications and as an efficient way of finding the nearest neighbors of

a document so that system will return the max relevant document in response to users query. Document clustering has also been used to automatically generate hierarchical clusters of documents

SURVEY ON VARIOUS CLUSTERING TECHNIQUES

There are many clustering techniques which are available in the market, and each of them may give a different grouping of objects.

The choice of a particular method will depend on the type of output desired that is it depends on the end user to select one of them as per his requirement and form the desired number of clusters. The known performance of method with particular types of data, the hardware and software facilities available and the size of the dataset.
The complete link method is quite similar to the single link method except that it uses the least similar pair between two or more clusters to determine the inter-cluster similarity for generation of n clusters. This method is characterized by small, tightly bound clusters. Therefore need arises to find out the similar

clusters either baed on their distance or similarity.

And many more techniques present in the market for generation of clusters.
SURVEY ON DOCUMENT CLUSTERING
Mean field variational inference is a particular class of variational methods. Consider a model with a hyperparameter , latent variables W= { v1,v2,v3,vn} and data points x={ x1,x2,Xd} . Mean field variance method starts from a family of distributions Q by using which both the mean field procedure and the subsequent inference procedures are easy to handle.

In order to yield a computationally effective inference method, its very necessary and important to choose a reasonable family of distributions Q.
PROPOSED SYSTEM

Basic initial flow of the system:

Fig 4.1 Flow of initial processing of system

As observe in the above flow chart first of all what we have to do is to prepare a list of words i.e. vocabulary or dictionary.

Then selection of document one by one to check whether the term or word is occurred in that particular document or not.

Based on this we will try to extract another features of the documents and will prepare a set of features.

Now by considering that set of features we will form a set document in which n documents will poses that particular feature.
Use of Blocked Gibbs Sampling Algorithm

Another effective inference algorithm for our proposed model is the blocked Gibbs sampling algorithm.

Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution as we are using it for identifying the relationship between n documents and trying to form a group of documents based some similar features.

It is commonly used as a means of statistical inference, especially Bayesian inference, Bayesian inference derives the posterior probability as a consequence of two antecedents, a prior probability and a "likelihood function" derived from a probability model for the data to be observed. Bayesian inference computes the posterior probability according to Bayes' rule which is as follows.

P (H|E) = P (E|H).P (H)/P (E)

It is a randomized algorithm (i.e. an algorithm that makes use of random numbers, and hence may produce different results each time it is run), and is an alternative to deterministic algorithms for statistical inference such as variation Bayes or the expectation- maximization algorithm .
Proposed Idea with example

Fig 4.2 Proposed Architecture

Above proposed architecture depicts that what we wish to perform after collecting n no. of documents.
- Consider A, B, C, D, E as documents with the some similarities:
- Then we will have documents A,B,C,D,E

Now we will try to group the given documents based on some similar features and scenario will look like as follows:

In our proposed work what we wish to do is as follows:

First of all we have many documents as a input, then we will prepare a index or vocabulary for those documents.

Now we need identify various features

actually we will predict some features and by considering those features as a base we will cluster the documents.

Additionally we can add an information model such as probabilistic model, fuzzy model, Boolean model, extended Boolean model etc. we will compare the features and form a cluster of documents.

at thenext step it will be as follows :

and few more scenarios will be as follows:

Parser module which will handle prime task is to parse given documents into tokens & preparation of index or we can say the vocabulary will starts.

System will use this index/ vocabulary for further process by forming various rules so that it will form n document sets.

CONCLUSION

We will show that following targets will definitely achieve as follows if we will form a set or clusters of given documents :
- Document clustering is the act of collecting similar documents into bins, where similarity is some function on a document.
- If a collection is well clustered, we can search only the cluster that will contain relevant documents.
- Searching a smaller collection should improve effectiveness and efficiency.
So it will very useful to have clusters of data based on some similarity. In our proposed system we will use Dirichlet Process Mixture Model, mean variance algorithm and blocked gibbs sampling algorithm. Our proposed system will be as discussed above.

REFERENCES
1. Michael Steinbach George Karypis Vipin Kumar,A Comparison of Document Clustering Techniques Department of Computer Science and Engineering, University of Minnesota
2. Inderjit Dhillon, Jacob Kogan,Charles Nicholas ,Feature Selection and Document Clustering
3. http://en.wikipedia.org/wiki/ Bayesian_inference
4. C. Elkan, Clustering Documents with an Exponential-Family Approximation of the Dirichlet Compound Multinomial Distribution, Proc. Intl Conf. Machine Learning, pp. 289-296, 2006.
5. R. Madsen, D. Kauchak, and C. Elkan, Modeling Word Burstiness Using the Dirichlet Distribution, Proc. Intl Conf. Machine Learning, pp. 545-552, 2005.

Survey Paper on Clustering of Documents Based on Partitioning the Features

Leave a Reply