Personalized Web Directory : A Knowledge Discovery Approach

Madhavi S. Darokar; Prof. Mansi Bhonsle

doi:10.17577/IJERTV2IS60112

Volume 02, Issue 06 (June 2013)

Open Access
Published (First Online): 17-06-2013
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Madhavi S. Darokar (Researcher), Prof. Mansi Bhonsle, G.H.R.C.E.M., Pune, India.

Abstract

The World Wide Web is a rich source of information, and the number of users accessing web sites increases every day. Because the web is growing so rapidly, information overload makes it difficult to retrieve online information easily. Personalization addresses this problem by focusing retrieval on the information needs of the specific user. Web mining techniques make it possible to deliver such personalized content to users effectively and efficiently.

Web usage mining is an area of data mining that deals with the extraction of interesting knowledge from the usage data of the World Wide Web. We build a knowledge discovery framework for constructing community-specific web directories by applying personalization to web directories. A web directory is a hierarchical organization of web pages into categories, each covering a specific theme of user interest; such directories are generally constructed manually by human experts. Cluster analysis, which organizes a collection of objects into cohesive groups of interest, can instead play a very important role in automating this process.

Key terms

Clustering, Personalization, Pattern Analysis, Machine Learning, Web Directory, Web Logs, Web Usage Mining.

Introduction

A tremendous amount of data is available on the World Wide Web. Because its size increases continuously, finding relevant data has become a cumbersome task; this is the information overload problem. The main objective of this paper is to construct a web directory that reduces this problem with the help of personalization. In a web directory, the content of the web is organized into a thematic hierarchy that lists the topics in which users are interested. Real-world examples include DMOZ (the Open Directory Project) and the Yahoo! directory. A user finds information by starting from a broad category and gradually narrowing down until the desired thematic content is reached. The user therefore has to dig deep into the directory before finding satisfactory information.

To alleviate this problem, we apply personalization to web directories using web usage data. An aggregate user community model is constructed from the browsing behavior recorded in web usage data and is used to personalize services on the web. This is achieved by applying pattern analysis to the usage data, which are available as web log data on the server side. Clustering and probabilistic approaches are used for the pattern analysis to build community-specific personalized web directories. Because web data have a high degree of thematic diversity (increased dimensionality and semantic incoherence), we create a knowledge discovery framework that constructs community web directories by applying personalization to a web directory, turning the construction into an automatic machine learning process [3].
1. Existing Systems
  
  Some existing systems fully automate the personalization process and employ machine learning methods. One such system is the Montage system, which creates a personalized portal by applying a number of heuristic metrics to web usage data, such as the interest in a page or topic and the probability of revisiting a page. The portal consists of links to the pages the user has visited, organized in the manner of the ODP [4]. Another system is the Power Bookmark system, which collects bookmark information for a single user, such as frequently visited pages and query results returned by a search engine. It uses a text classification technique to assign web pages to specific folders. Its limitations are that it adapts only to the view of a single user, without constructing an aggregate model of users, and that the classification methods it uses scale poorly [2]. In services such as Microsoft (www.microsoft.com), the Open Directory Project (ODP) (dmoz.org), and Yahoo!, personalized search is possible, but the directory itself is not personalized [3]. In these directories, web pages are assigned to categories manually.
2. Deficiency in the existing systems

The web is a goldmine of information, but information overload frustrates web users, so personalized services are required for information retrieval on the web.
Because of the tremendous growth in its size, the web in its current state has not achieved the goal of guiding each user to the information that user needs.
Web pages are categorized manually, so topic coverage is limited.
The size of the web increases its complexity, which makes appropriate navigation difficult.

Related Work

Much research has been carried out on web personalization using web usage data. Pattern discovery from web log data is mostly done with clustering methods [4]. Clustering divides the data into groups that differ from each other. It is used to group people who show common interests while browsing the web, or web pages with similar content. Clustering methods fall into three categories [2]:
1. Partitioning Methods
2. Hierarchical Methods
3. Model Based Methods
  
In the partitioning methods, the algorithms used are Leader, PageGather, and Expectation Maximization. The Leader algorithm clusters user sessions, each represented by an n-dimensional vector, where n is the number of web pages accessed in the session. Each component of the vector is a weight w that counts the interest of the user in a particular web page. The algorithm produces good-quality clusters, and pattern discovery relies on these vector and weight characteristics. Its drawback is that different sets of clusters can be generated depending on the order in which the training vectors are presented to the algorithm.

PageGather takes user sessions as input and is used to improve the presentation of a web site by gathering sets of pages that users visit together, according to the web log data. Its advantage is that it produces overlapping clusters that capture the same browsing behavior of users. Its drawback is its computational cost, because it is a graph-based method in which the nodes of the graph are web pages and the edges represent the co-occurrence of pages [16].

The Expectation Maximization algorithm is also partition based. It takes as input user sessions in the form of URLs from web log files, represented by the topic categories of the web pages, and clusters the user sessions into groups called communities. Its advantage is memory efficiency; its drawback is that it is computationally expensive. This family also includes fuzzy clustering algorithms.
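The order dependence of the Leader algorithm can be illustrated with a minimal sketch. It assumes Euclidean distance, a nearest-leader assignment rule, and a hypothetical distance threshold; none of these choices is specified in the paper, and the session vectors are toy data.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length weight vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leader_clustering(sessions, threshold):
    """Single-pass Leader clustering (illustrative variant).

    The first vector becomes the first leader. Every later vector joins
    the nearest existing leader if it lies within `threshold`; otherwise
    it becomes a new leader itself.
    """
    leaders = []   # one representative vector per cluster
    clusters = []  # positions of the input vectors assigned to each leader
    for idx, vec in enumerate(sessions):
        best, best_dist = None, None
        for c, leader in enumerate(leaders):
            d = euclidean(vec, leader)
            if best_dist is None or d < best_dist:
                best, best_dist = c, d
        if best is not None and best_dist <= threshold:
            clusters[best].append(idx)
        else:
            leaders.append(vec)
            clusters.append([idx])
    return clusters


# Toy session vectors over three pages (weights are hypothetical visit counts).
sessions = [[0, 0, 0], [2, 0, 0], [4, 0, 0]]

# Same data, two presentation orders, same (assumed) threshold.
in_order = leader_clustering(sessions, threshold=2.5)
reordered = leader_clustering(
    [sessions[1], sessions[0], sessions[2]], threshold=2.5
)

print(len(in_order), "clusters when presented in the original order")
print(len(reordered), "clusters when the first two vectors are swapped")
```

With these toy vectors, the original order yields two clusters, while swapping the first two vectors yields a single cluster, which is the sensitivity to input sequence noted above.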
  
Among the hierarchical methods, the BIRCH algorithm is used for clustering user sessions. The web log data are converted into user sessions, each containing the IP address and timestamp, and the sessions are organized like the page hierarchy of the web. The algorithm is very efficient and applicable to large volumes of data. Its drawback is that the result depends on the order of the input.
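The conversion of web log data into user sessions can be sketched as follows. This is only an illustration: the record layout `(ip, timestamp, url)`, the 30-minute inactivity timeout, and the sample URLs are assumptions made for the example, not details taken from the paper.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed inactivity cut-off


def build_sessions(records, timeout=SESSION_TIMEOUT):
    """Group (ip, timestamp, url) records into per-IP sessions.

    A new session starts for an IP whenever the gap between two
    consecutive requests from that IP exceeds `timeout`.
    """
    by_ip = defaultdict(list)
    for ip, ts, url in records:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, hits in by_ip.items():
        hits.sort()  # chronological order within one IP
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > timeout:
                sessions.append((ip, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((ip, current))
    return sessions


# Hypothetical log records: (client IP, seconds since start, requested URL).
records = [
    ("10.0.0.1", 0, "/sports/football"),
    ("10.0.0.2", 10, "/news/world"),
    ("10.0.0.1", 600, "/sports/tennis"),
    ("10.0.0.1", 4000, "/arts/music"),
]

sessions = build_sessions(records)
for ip, urls in sessions:
    print(ip, urls)
```

Identifying users by IP address alone is a simplification, since several users behind one proxy share an address; a real preprocessing step would likely need additional heuristics.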
  
Among the model-based methods, Autoclass, the Self-Organizing Map, and incremental algorithms are used to construct community models of users with similar interests in their web usage patterns [5]. User communities are clustered according to their interests in browsing the web. Autoclass is mathematically sound but computationally expensive. The Self-Organizing Map maps high-dimensional data well but requires the number of clusters to be specified in advance. Incremental algorithms are applicable to large data sets but suffer from scalability problems.
  
Thus personalization of web-based systems can be achieved from web usage data. For the experimental setup, the log file is collected from the proxy server of an ISP. The software requirements for our project are JDK 7 and the NetBeans 7.1 IDE, with minimum hardware requirements of a 1 GB hard disk, 512 MB of RAM, and a Pentium 3 processor running the Windows 7 operating system.
System Architecture

An aggregate user model is constructed by collecting data from web proxies as users browse the web and applying machine learning techniques [15]. The main purpose of the project is to construct a community web directory automatically, which results in operational personalized knowledge. The process of getting from the data to the community web directories is summarized below.

Fig 1: System architecture for web directory personalization.

The system architecture basically consists of three blocks:
1. Preparation of the usage data
2. Discovery of the community web directory
3. Evaluation of the community web directory

These are described as follows.

Fig 2: Data Flow for web directory personalization.
Results and Discussion

Evaluation of Community Web Directory: Current research has approximated the gain of the end user but has not taken into account the cost of the cases in which users do not find what they are looking for in the personalized directory. This issue requires evaluating community web directories in user studies according to user preferences, which we implement in this project. The web log files are collected from the ISP, and the proposed methodology addresses the limitations of existing methods by reducing the dimensionality of the problem through the classification of individual web pages into the categories of the web directory. The various components of the methodology could be replaced by a number of alternatives. Most importantly, more sophisticated methods for extracting the categories from usage data, in addition to the use of an existing web directory, would make the mapping of pages to domains and then to categories more accurate and complete. In the first module of the project, we aim to use an efficient k-means algorithm in order to obtain efficient results.
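The idea of reducing dimensionality by mapping pages to directory categories before clustering can be sketched as below. The category list, the page-to-category table, the sample sessions, and the choice of initial centroids are all hypothetical, and the k-means routine is the plain Lloyd iteration rather than any particular optimized variant the project may use.

```python
# Hypothetical top-level directory categories.
CATEGORIES = ["sports", "news", "arts"]

# Hypothetical page-to-category assignment; in practice this would come
# from a page classifier or from an existing web directory.
PAGE_CATEGORY = {
    "/football": "sports",
    "/tennis": "sports",
    "/world": "news",
    "/politics": "news",
    "/music": "arts",
}


def session_to_vector(urls):
    """Count how many pages of each category a session touches."""
    vec = [0.0] * len(CATEGORIES)
    for url in urls:
        cat = PAGE_CATEGORY.get(url)
        if cat is not None:
            vec[CATEGORIES.index(cat)] += 1.0
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, centroids, max_iter=20):
    """Plain Lloyd's k-means starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


sessions = [
    ["/football", "/tennis"],
    ["/football", "/tennis", "/world"],
    ["/world", "/politics"],
    ["/politics", "/music"],
]

# Five distinct pages collapse into three category dimensions.
vectors = [session_to_vector(s) for s in sessions]
labels, centroids = kmeans(vectors, centroids=[vectors[0], vectors[2]])
print(labels)
```

In this toy run the first two sessions (mostly sports) and the last two (mostly news and arts) end up in separate communities. Passing explicit initial centroids keeps the sketch deterministic; random initialization would be the more common choice.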

The experiment is performed to evaluate the three pattern discovery methods. With this method of personalization, we examine the percentage by which the web directory shrinks. The average path length of the community web directory is calculated and compared against that of the original directory. We also evaluate coverage and user gain against another algorithm in order to address the problem of local overload.
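Two of these measures can be made concrete with a small sketch. The paper does not give formulas, so the definitions here are assumptions: shrinkage is taken as the fraction of category nodes removed from the original directory, and average path length as the mean depth of the leaf categories. Both directories are toy data represented as nested dictionaries.

```python
def count_nodes(tree):
    """Number of category nodes in a nested-dict directory."""
    return sum(1 + count_nodes(child) for child in tree.values())


def leaf_depths(tree, depth=1):
    """Depth of every leaf category; top-level categories have depth 1."""
    depths = []
    for child in tree.values():
        if child:
            depths.extend(leaf_depths(child, depth + 1))
        else:
            depths.append(depth)
    return depths


def average_path_length(tree):
    """Mean number of steps from the root to a leaf category."""
    depths = leaf_depths(tree)
    return sum(depths) / len(depths)


def shrinkage(original, community):
    """Fraction of category nodes pruned away (assumed definition)."""
    return 1.0 - count_nodes(community) / count_nodes(original)


# Hypothetical original directory.
original = {
    "Sports": {"Football": {"Leagues": {}, "Clubs": {}}, "Tennis": {}},
    "News": {"World": {}, "Politics": {}},
    "Arts": {"Music": {}},
}

# Hypothetical community directory: the subset one community actually uses.
community = {
    "Sports": {"Tennis": {}},
    "News": {"World": {}, "Politics": {}},
}

print("shrinkage:", shrinkage(original, community))
print("avg path (original):", average_path_length(original))
print("avg path (community):", average_path_length(community))
```

Coverage and user gain are left out of the sketch because computing them requires the pages users were actually looking for, which the toy data do not model.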

Fig 3: Graph for Dimensionality of user session.

Fig 4: Clustering of user sessions.
Conclusion

Today's world is a world of the internet, with e-services such as e-commerce, e-learning, and e-banking emerging in the new web era. The internet is turning web sites into businesses and increasing the competition between them. Offering personalized services creates a user-friendly web environment. The proposed methodology alleviates the problem of information overload by tailoring the community web directory to the needs and interests of particular users. With the help of machine learning methods, such directories can be constructed using clustering and probabilistic approaches. Because web usage data are diverse and voluminous, they can be reduced by classifying web pages into folders, called directories, that are mapped to user interests. Future work is needed to check the robustness of the algorithm under a changing environment and to analyze the parameters of the community model.

References

[1] M. Ramkrishna, L. Gowdar, and M. Havanur, "Web Mining: Key Accomplishment, Applications and Future Directions," International Journal of Data Storage and Data Engineering, vol. 1, pp. 4-9, 2010.
[2] D. Pierrakos, G. Paliouras, C. Papatheodorou, and C. D. Spyropoulos, "Web Usage Mining as a Tool for Personalization: A Survey," User Modeling and User-Adapted Interaction, vol. 13, pp. 311-372, 2010.
[3] G. Paliouras, C. Papatheodorou, V. Karkaletsis, and C. D. Spyropoulos, "Discovering User Communities on the Internet Using Unsupervised Machine Learning Techniques," Interacting with Computers J., vol. 14, no. 6, pp. 761-791, 2002.
[4] D. Pierrakos and G. Paliouras, "Personalizing Web Directories with the Aid of Web Usage Data," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 9, Sep. 2010.
[5] D. Pierrakos, G. Paliouras, C. Papatheodorou, V. Karkaletsis, and M. Dikaiakos, "Web Community Directories: A New Approach to Web Personalization," Web Mining: From Web to Semantic Web, pp. 113-129, Springer, 2004.
[6] D. Pierrakos and G. Paliouras, "Exploiting Probabilistic Latent Information for the Construction of Community Web Directories," Proc. 10th Int'l Conf. User Modeling, L. Ardissono, P. Brna, and A. Mitrovic, eds., pp. 89-98, 2005.
[7] D. Chen, D. Wang, and F. Yu, "A PLSA-Based Approach for Building User Profile and Implementing Personalized Recommendation," Proc. Joint Ninth Asia-Pacific Web Conf. (APWeb '07) and Eighth Int'l Conf. Web-Age Information Management (WAIM '07), pp. 606-613, 2007.
[8] B. Mobasher, R. Cooley, and J. Srivastava, "Automatic Personalization Based on Web Usage Mining," Comm. ACM, vol. 43, no. 8, pp. 142-151, 2000.
[9] P. Brusilovsky, A. Kobsa, and W. Nejdl, eds., The Adaptive Web, Methods and Strategies of Web Personalization, Springer, 2007.
[10] T. Hofmann, "Learning What People (Don't) Want," Proc. 12th European Conf. in Machine Learning, pp. 214-225, 2001.
[11] G. Xu, Y. Zhang, and Y. Xun, "Modeling User Behaviour for Web Recommendation Using LDA Model," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence and Intelligent Agent Technology, pp. 529-532, 2008.
[12] W. Chu and S.-T.P. Park, "Personalized Recommendation on Dynamic Content Using Predictive Bilinear Models," Proc. 18th Int'l Conf. World Wide Web (WWW), pp. 691-700, 2009.
[13] X. Jin, Y. Zhou, and B. Mobasher, "Task-Oriented Web User Modeling for Recommendation," Proc. 10th International Conf. User Modeling, L. Ardissono, P. Brna, and A. Mitrovic, eds., pp. 109-118, 2005.
[14] Y. Fu, K. Sandhu, and M. Shih, "A Generalization-Based Approach to Clustering of Web Usage Sessions," in Proceedings of the 1999 KDD Workshop on Web Mining, San Diego, CA, vol. 1836 of LNAI, Springer, 2000, pp. 21-38.
[15] M. S. Chen, J. S. Park, and P. S. Yu, "Efficient Data Mining for Path Traversal Patterns," Knowledge and Data Engineering, 10(2), 1998, pp. 209-221.
[16] A. Joshi and R. Krishnapuram, "On Mining Web Access Logs," in ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000, pp. 63-69.
[17] B. Berendt, "Web Usage Mining, Site Semantics, and the Support of Navigation," in Proceedings of the Workshop WEBKDD 2000 - Web Mining for E-Commerce - Challenges and Opportunities, 6th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 2000.
[18] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa, "Discovery and Evaluation of Aggregate Usage Profiles for Web Personalization," Data Mining and Knowledge Discovery, 6(1), 2002, pp. 61-82.
[19] P. Berkhin, "Survey of Clustering Data Mining Techniques," Springer Berlin Heidelberg, Berlin, 2006.
[20] L. Catledge and J. Pitkow, "Characterizing Browsing Behaviors on the World Wide Web," Computer Networks and ISDN Systems, 1995, vol. 27, no. 6, pp. 1065-1073.
[21] B. Mobasher, R. Cooley, and J. Srivastava, "Automatic Personalization Based on Web Session Clustering," Communications of ACM, 2000, vol. 43, no. 8, pp. 142-151.
[22] J. Yu, "General C-Means Clustering Model," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, vol. 27, no. 8, pp. 1197-1211.
[23] Y. Fu, K. Sandhu, and M. Y. Shih, "Clustering of Web Users Based on Access Patterns," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Workshop on Web Mining, Springer, 1999, pp. 560-567.
