Comparison of UWAD Tool with Other Tools Used for Preprocessing

DOI: 10.17577/IJERTV2IS120598

Nirali Honest

Smt. Chandaben Mohanbhai Patel Institute of Computer Applications, Charotar University of Science and Technology (CHARUSAT), Changa, India.

Dr. Bankim Patel

Shrimad Rajchandra Institute of Management and Computer Application, Uka Tarsadia University, Bardoli, India.

Dr. Atul Patel

Smt. Chandaben Mohanbhai Patel Institute of Computer Applications, Charotar University of Science and Technology (CHARUSAT), Changa, India.

Abstract

The purpose of the preprocessing phase is to produce accurate data quickly. The quality of the data generated by this phase determines which interesting patterns can be discovered from it. Designing this phase takes maximum effort in the entire process of Web Usage Mining (WUM), as it focuses on reducing the quantity of data without compromising its quality. This paper discusses the implementation concerns of the preprocessing phase using the Log Parser Lizard tool and the problems faced in using it. It also shows the usefulness of the University Website Access Domain (UWAD) tool.

  1. Introduction

Due to the huge volume of information and the lack of a unified structure, information retrieval is difficult. Most users may not have good knowledge of the structure of the information network, and may easily get fed up after taking many access hops and lose patience while waiting for information [1]. The Web is the single largest data source in the world; due to the heterogeneity and lack of structure of web data, mining it is a challenging task [2]. Preprocessing of a log file is a complex and laborious job, taking 80% of the total time of the web usage mining process as a whole [3]. The importance of the preprocessing step in web usage mining cannot be negated. Paying due attention to the preprocessing step improves the quality of data [4]; furthermore, preprocessing improves the efficiency and effectiveness of the other two steps of WUM, pattern discovery and pattern analysis. Information about internet users is stored in different raw log files. Doru Tanasa et al. [5] focus on web server logs from several web sites, generally belonging to the same organization; an important organization might have several web servers for its web sites. Fang Yuan et al. [6] mainly focus on analyzing visiting information from logged data in order to extract usage patterns, which can be classified into three categories: similar user groups, relevant page groups and frequent access paths.

Web server logs are plain text (ASCII) files and are independent of the server platform. Mohd Helmy Abd Wahab et al. [7] discuss the types of logs; traditionally there are four types of server logs:

    • Transfer Log

    • Error Log

    • Agent Log

    • Referrer Log

Each HTTP protocol transaction, whether completed or not, is recorded in the logs, and some transactions are recorded in more than one log. The transfer log and error log are standard. The referrer and agent logs may or may not be turned on at the server, or may be added to the transfer log file to create an extended log file format. Currently, there are three formats available for recording log files:

• W3C Extended Log File Format

• Microsoft IIS Log File Format

• NCSA Common Log File Format

The W3C Extended log file format, Microsoft IIS log file format, and NCSA log file format are all ASCII text formats. This proposed research assumes that the server uses the W3C Extended Log File Format to record log files.
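For reference, the lines below illustrate the W3C Extended Log File Format: a plain-text file whose #Fields directive declares the columns recorded for each entry. The field list mirrors the fields queried later in this paper; the directives and all values shown are illustrative, not taken from the actual log.

#Software: Microsoft Internet Information Services
#Version: 1.0
#Fields: date time c-ip cs-method cs-uri-stem cs-uri-query s-ip s-port cs-version cs(User-Agent) sc-status sc-bytes cs-bytes time-taken
2012-11-28 10:15:32 198.51.100.7 GET /default.aspx - 203.0.113.5 80 HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+8.0;+Windows+NT+6.1) 200 4320 410 125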

The experiments in this proposed research are based on the following parameters.

    • Preprocessing Accuracy

    • Hit Ratio

    • Bandwidth usage

    • Access pattern for particular events

    • Ease of Use

  2. Comparison

The initial testing of the algorithms below was carried out using the Log Parser Lizard tool, which supports cleaning files, exporting files, writing queries and generating reports. The drawback of the tool is that queries must be written for every operation, intermediate results must be generated manually, and MS Excel is needed for other utility tasks, which is tedious and prone to error. Customization of reports based on the molded mining process is not available. Another objective is to generate per-page frequency, but pages designed with a CMS are generated with a page ID rather than a page name, and binding the ID to the name is not done by Log Parser Lizard. Certain tools, such as StatCounter, support the generation of per-page frequency, but they report the page ID and not the name, so the report may not be informative. Taking these points into consideration, we have prepared our own tool for our mining process, namely UWAD (University Website Access Domain), and the algorithm steps listed below are now implemented using this tool. The listing below shows the algorithms used and the queries written for achieving certain tasks. The next section shows the results generated using the UWAD tool. Figure 1 shows the use of the tool for selecting the file to preprocess, followed by the summary of the cleaning process. Table 3 shows the summary of the data cleaning process. Figure 2 reflects the cleaned log data.

      1. Data Cleaning

1. Download the W3C Extended Log file from the internet.

2. Parse the raw log file according to the delimiter (space) and convert it to the appropriate fields of the W3C Extended log file format.

3. Remove all entries with extensions other than .html, .asp, .aspx and .php. This also includes log entries that do not have any URL in the URL field.

4. Remove log entries having status codes other than 200 and 304 from the file.

5. Remove entries with request methods other than GET and POST.

6. Remove web crawlers, robots and spiders.

For testing steps 2 to 6, the Log Parser Lizard tool is used by supplying the appropriate queries.

1. Download the W3C Extended Log file from the internet. The sample file is the log of 28th November 2012 of the CHARUSAT website.

            Figure 1: Snapshot of raw log file.

2. Parse the raw log file according to the delimiter (space) and convert it to the appropriate fields of the W3C Extended log file format. For testing this step, Log Parser Lizard is used and the query below is executed.

          Select * from C:\logs\file_121128.log;

          ( Rows:26380 Time taken: 00:02:29 )

          Figure 2: Snapshot of log file separated in different fields.

Steps 3 and 4 are performed together: remove all entries with extensions other than .html, .asp, .aspx and .php, and also remove log entries having status codes other than 200 and 304 from the file.

select LogFilename, date, time, cs-method, cs-uri-stem, cs-uri-query, c-ip, s-ip, sc-status, time-taken, s-port, cs-version, cs(User-Agent), sc-bytes, cs-bytes from C:\logs\file1.csv where (cs-uri-stem like '%.htm' and (sc-status=200 or sc-status=304)) or (cs-uri-stem like '%.asp' and (sc-status=200 or sc-status=304)) or (cs-uri-stem like '%.php' and (sc-status=200 or sc-status=304)) or (cs-uri-stem like '%.aspx' and (sc-status=200 or sc-status=304))

          ( Rows:2498 Time taken: 00:00:08 )

          Figure 3: Snapshot of cleaning log file with particular file extension and status code.

5. Remove entries with request methods other than GET and POST.

select LogFilename, date, time, cs-method, cs-uri-stem, cs-uri-query, c-ip, s-ip, sc-status, time-taken, s-port, cs-version, cs(User-Agent), sc-bytes, cs-bytes from C:\logs\file2.csv where cs-method like 'GET' or cs-method like 'POST'

            ( Rows:2497 Time taken: 00:00:07 )

            Figure 4: Snapshot of cleaning log file for particular method Get and Post.

6. Remove web crawlers, robots and spiders.

select LogFilename, date, time, cs-method, cs-uri-stem, cs-uri-query, c-ip, s-ip, sc-status, time-taken, s-port, cs-version, cs(User-Agent), sc-bytes, cs-bytes from C:\logs\file3.csv where not cs(User-Agent) like '%spider%'

select LogFilename, date, time, cs-method, cs-uri-stem, cs-uri-query, c-ip, s-ip, sc-status, time-taken, s-port, cs-version, cs(User-Agent), sc-bytes, cs-bytes from C:\logs\file4.csv where not cs(User-Agent) like '%crawler%'

select LogFilename, date, time, cs-method, cs-uri-stem, cs-uri-query, c-ip, s-ip, sc-status, time-taken, s-port, cs-version, cs(User-Agent), sc-bytes, cs-bytes from C:\logs\file5.csv where not cs(User-Agent) like '%robot%'

          ( Rows:2197 Time taken: 00:00:21 )

Figure 5: Snapshot of cleaning log file by removing crawlers, robots and spiders.
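Taken together, steps 2 to 6 amount to a single pass over the parsed log. The following Python sketch consolidates them, assuming the input is a W3C Extended log whose #Fields directive names the columns as illustrated in Section 1; the file path and function name are illustrative and are not part of either tool.

# Minimal sketch of the cleaning steps: parse by space delimiter, then
# filter by extension, status code, request method and user agent.
KEEP_EXTENSIONS = ('.html', '.asp', '.aspx', '.php')
KEEP_STATUS = {'200', '304'}
KEEP_METHODS = {'GET', 'POST'}
BOT_MARKERS = ('spider', 'crawler', 'robot')

def clean_w3c_log(path):
    fields, kept = [], []
    with open(path, encoding='ascii', errors='replace') as fh:
        for line in fh:
            line = line.strip()
            if line.startswith('#Fields:'):
                fields = line.split()[1:]   # directive names the columns
                continue
            if not line or line.startswith('#'):
                continue                    # other directives (#Date, ...)
            values = line.split(' ')        # step 2: space is the delimiter
            if len(values) != len(fields):
                continue                    # malformed entry
            entry = dict(zip(fields, values))
            uri = entry.get('cs-uri-stem', '')
            if not uri.lower().endswith(KEEP_EXTENSIONS):
                continue                    # step 3: also drops empty URLs
            if entry.get('sc-status') not in KEEP_STATUS:
                continue                    # step 4: keep 200 and 304 only
            if entry.get('cs-method') not in KEEP_METHODS:
                continue                    # step 5: keep GET and POST only
            agent = entry.get('cs(User-Agent)', '').lower()
            if any(marker in agent for marker in BOT_MARKERS):
                continue                    # step 6: crawlers, robots, spiders
            kept.append(entry)
    return kept

entries = clean_w3c_log(r'C:\logs\file_121128.log')
print(len(entries), 'entries retained after cleaning')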

      2. User Identification

1. Read a record from the cleaned log file.

2. If the IP address is new, then add a new record with the IP address, browser and OS details, and increment the count of the number of users.

3. If the IP address is already present, then compare the browser and OS details; if they are not the same, increment the count of the number of users.

1. Read a record from the cleaned log file.

            Select c-ip,cs-version,cs(User-Agent),time

            from C:\ulogs\file6.csv

( Rows:2196 Time taken: 00:00:01 )

Figure 6: Snapshot of reading records from stored intermediate file.

2. If the IP address is new, then add a new record with the IP address, browser and OS details, and increment the count of the number of users; if the IP address is already present, then compare the browser and OS details and, if they are not the same, increment the count of the number of users.

          Select c-ip,cs-version,cs(User-Agent)

          from C:\ulogs\ulist1.csv order by c-ip

          ( Rows:2197 Time taken: 00:00:01 )

Figure 7: Snapshot of reading records based on the combination of IP, version and user agent.

Select distinct c-ip, cs-version, cs(User-Agent) from C:\ulogs\ulist1.csv

( Rows:352 Time taken: 00:00:00 )

Figure 8: Snapshot of reading records with unique IP.
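The user identification rules above can be cross-checked with a short Python sketch. Distinct users are counted per combination of IP address, HTTP version and user agent (the browser and OS details); the entries argument is assumed to be the cleaned log produced by the sketch in the previous subsection.

# Sketch of user identification: each new (version, user agent)
# combination under an IP address counts as a separate user,
# mirroring the distinct-combination query above.
def identify_users(entries):
    users = {}  # ip -> set of (cs-version, user agent) signatures
    for entry in entries:
        signature = (entry.get('cs-version', ''), entry.get('cs(User-Agent)', ''))
        users.setdefault(entry['c-ip'], set()).add(signature)
    return sum(len(signatures) for signatures in users.values())

print(identify_users(entries), 'distinct users identified')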

3. Session Identification

1. Read a record from the log file.

2. If there is a new user, then there is a new session.

3. Within one user session, if the referring page is null, it can be concluded that a new session has started.

4. If the time between page requests exceeds a certain limit (30 minutes), it is assumed that the user is starting a new session.
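The sketch below applies rules 2 to 4 in order, assuming each cleaned entry carries date and time fields in W3C format and a cs(Referer) field for the referring page, and that entries are sorted by time; the 30-minute threshold comes from step 4.

from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def count_sessions(entries):
    # A new user, a null referrer within a session, or a gap of more
    # than 30 minutes between requests each opens a new session.
    last_seen = {}   # user key -> timestamp of the previous request
    sessions = 0
    for entry in entries:
        user = (entry['c-ip'], entry.get('cs(User-Agent)', ''))
        stamp = datetime.strptime(entry['date'] + ' ' + entry['time'],
                                  '%Y-%m-%d %H:%M:%S')
        referrer = entry.get('cs(Referer)', '-')
        previous = last_seen.get(user)
        if previous is None:                       # rule 2: new user
            sessions += 1
        elif referrer in ('-', ''):                # rule 3: null referrer
            sessions += 1
        elif stamp - previous > SESSION_TIMEOUT:   # rule 4: 30-minute gap
            sessions += 1
        last_seen[user] = stamp
    return sessions

print(count_sessions(entries), 'sessions identified')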

3. Experimental Results

The Log Parser Lizard tool is very useful, but performing the steps with it requires many intermediate tasks, such as writing queries, storing intermediate results and performing other utility tasks in MS Excel, just to clean a single file. All these activities add extra overhead, so the process becomes error prone and time consuming. Customization based on the requirements of the molded mining process is not available, and pages designed with the Master Page concept [8][9] are not supported by most tools [10].

After using both tools, the results vary in processing time and in the elimination of redundantly writing the same queries over different files, which aids the ease and speed of generating results. Table 1 shows the result derived after applying the algorithm using the Log Parser Lizard tool.

Stage of Preprocessing              | Result
------------------------------------|------------------------------------------
Initial size of file                | 12886 KB (12.5 MB)
Size of file after cleaning         | 608 KB
Web objects before cleaning         | 26380
Web objects after cleaning          | 2226
Total time taken for data cleaning  | 2.41 minutes + extra overhead of approximately 5 minutes (writing queries, generating results, saving intermediate files, applying other utilities)

(Table 1: Result analysis of proposed cleaning process using Log Parser Lizard tool)

Table 2 shows the result after applying the algorithm in the UWAD tool. It reduces the time for processing the records and saves the data in a database for further use.


Figure 9: Snapshot of calculating session count.

(Table-2 Result Analysis of Proposed Cleaning Process using UWAD Tool)

(Table-3 Comparison of Tools Used)

Standard as well as visual reports can be generated using the UWAD tool, with customized requirements for the mining process, as shown in Figures 10 and 11.

Figure 10 (a): Snapshot of selecting report. (b): Snapshot of summary report.

Figure 11 (a): Snapshot of cleaned log files.

(b): Snapshot of particular file details.

Conclusion

This research focuses on key areas: reduction of effort through the automation of a number of tasks, easier association of mining goals with the results that can be obtained, and the possibility of involving people having fewer technical skills with respect to mining. With the existing tools, these areas are not satisfied optimally, so UWAD provides the necessary foundation upon which further extended patterns based on the molded mining process can be generated. The next pattern will focus on the per-page frequency of pages designed using the CMS and master page concepts.

Acknowledgment

The authors thank Charotar University of Science and Technology (CHARUSAT) for providing necessary resources to accomplish this study.

References

1. Han Jia-Wei, Meng Xiao-Feng, Wang Jing, et al., "Research on Web Mining", Journal of Computer Research and Development, 2001, 38(4): 405-414.

2. Bing Liu, "Web Content Mining", The 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

3. Pabarskaite, Z., "Implementing Advanced Cleaning and End-User Interpretability Technologies in Web Log Mining", 24th Int. Conf. on Information Technology Interfaces (ITI 2002), June 24-27, 2002, Cavtat, Croatia.

4. Han, J. and M. Kamber, Data Mining: Concepts and Techniques, San Francisco, Morgan Kaufmann Publishers (an imprint of Elsevier), 2006.

5. Doru Tanasa and Brigitte Trousse, "Advanced Data Preprocessing for Intersites Web Usage Mining", IEEE Computer Society, March/April 2004.

6. Fang Yuan, Li-Juan Wang, Ge Yu, "Study on Data Preprocessing Algorithm in Web Log Mining", Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, 2-5 November 2003.

7. Mohd Helmy Abd Wahab, Mohd Norzali Haji Mohd, et al., "Data Pre-processing on Web Server Logs for Generalized Association Rules Mining Algorithm", World Academy of Science, Engineering and Technology, 2008.

8. Master Page architecture and working, found at http://msdn.microsoft.com/en-us/library/wtxbf3hh.aspx

9. Master Page information, found at http://www.w3schools.com/aspnet/aspnet_masterpages.asp

10. Nirali Honest, Dr. Bankim Patel, Dr. Atul Patel, "Sessionization Process for the Pages Designed with the Concept of CMS", International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE), ISSN: 2277 128X, Volume 3, Issue 6, September 2013.
