- Authors : Mansi Yadav, Prof. Pankaj Dalal
- Paper ID : IJERTV3IS120551
- Volume & Issue : Volume 03, Issue 12 (December 2014)
- Published (First Online): 18-12-2014
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Algorithms for Web Log Data: WUM Pre-Processing Phase
Mansi Yadav, M.Tech (S.E.) Scholar
Pankaj Dalal, Professor
Shrinathji Institute of Technology & Engineering, Nathdwara-313301
Abstract – The age of information is Web-centred. World Wide Web (WWW) servers are the source for spreading information across the world through websites. Users find information through pages written in Hypertext Markup Language (HTML), PHP or ASP. The arrangement of pages is also one factor in improving the accessibility of information: arranging pages according to the patterns that users hit most often can improve accessibility. These patterns of pages can be extracted from the log file by Web Usage (Log) Mining (WUM). WUM extracts page hits, users and pattern information through the Web Log Mining (WLM) process. This paper implements WLM through algorithms for data extraction, data cleaning, and user information extraction (user identification, session identification, etc.).
Keywords – Web Mining, Web Usage Mining (WUM), Server Log File, Data Pre-Processing, Web Page Pattern Discovery, Web Page Pattern Analysis.
INTRODUCTION
Web mining is an application of data mining, which is the process of extracting meaningful data from a data warehouse. Web mining is classified into three categories: Web Content Mining (WCM), Web Structure Mining (WSM) and Web Usage Mining (WUM). [1][6]
Web Content Mining (WCM):
WCM is the process of extracting meaningful information from the contents of web documents such as text, audio, video and images; it is therefore also known as Text Mining. [2][6]
Web Structure Mining (WSM):
WSM is the process of mining useful information from the Web hyperlink structure. The HITS (Hyperlink-Induced Topic Search) and PageRank algorithms are used in WSM. [2][6]
Web Usage Mining (WUM):
WUM is the process of extracting usage patterns from web log files. It is also known as Web Log Mining. [2][6]
WEB USAGE MINING
WUM is the process of extracting meaningful user access information from Web data. It is used to understand and improve Web-based applications for users. WUM is a powerful tool for analyzing, designing and modifying websites according to user access patterns. The three main phases of WUM are Data Pre-Processing, Pattern Discovery and Pattern Analysis. [1]
Figure 1 shows the WUM process. Data Pre-Processing is the first phase and involves the Data Cleaning, User Identification, Session Identification and Path Completion steps. [1][3]
Fig.1. WUM Process: the web log file passes through Data Pre-Processing (Data Cleaning, User Identification, Session Identification, Path Completion), Pattern Discovery (Association Rules, Sequential Patterns) and Pattern Analysis (OLAP, Visualization) to produce the access patterns.
Data Cleaning removes unwanted and irrelevant records that are not used in the pattern discovery process, such as records with the suffixes css, js, jpeg, png, etc., and records whose status code is not 200. The result of the cleaning phase is a reduced log file size and increased accuracy of the log file. [4] User Identification is the second step of Pre-Processing and finds the users of the website; a new IP address represents a new user. Session Identification finds the sessions of each user; the default timeout for a single session is 30 minutes. Path Completion is the last step of Pre-Processing: it finds the complete user access paths, and missing paths are added. [7]
Pattern Discovery is the second phase of the WUM process; it finds user access patterns in the cleaned log files using techniques such as sequential patterns, association rules, clustering and classification rules. [3][8][9][10]
Pattern Analysis is the final phase of the WUM process; it removes uninteresting patterns from the discovered patterns and mines the most frequent patterns using knowledge query mechanisms such as Structured Query Language (SQL) and visualization techniques. [3][8][9][10]
PROPOSED ALGORITHM
Data Collection:
We have collected six months of live log data from the server of the www.drgoyal.co.in website. A web log file (WLF) contains information about website visitors: IP address, host name, user name, timestamp, method, path, protocol, status code and agent information.
Example of a log file entry:
188.143.232.211 - - [31/Jul/2014:04:54:15 -0500] "GET /msg.php HTTP/1.1" 200 10743 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
Web Log File Extraction:
A WLF consists of various data fields. Data field extraction separates these fields before the cleaning process; it splits a single server log entry into its individual data fields. The data field extraction algorithm is implemented in the C language. The separated fields of the log file are saved into a file, and the size of the log file and the number of records are calculated.
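As an illustration of how the fields of the example log entry shown earlier can be separated, the following is a minimal C sketch (not the authors' original code); the buffer sizes, field widths and the use of sscanf are assumptions made for this example.

#include <stdio.h>

int main(void)
{
    /* Example entry in common log format, as shown above. */
    const char *entry =
        "188.143.232.211 - - [31/Jul/2014:04:54:15 -0500] "
        "\"GET /msg.php HTTP/1.1\" 200 10743 \"-\" "
        "\"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)\"";
    char ip[64], timestamp[64], method[16], path[256], protocol[32];
    int status;
    long bytes;

    /* Split one log entry into its data fields: IP address, timestamp,
     * method, path, protocol, status code and bytes sent. */
    if (sscanf(entry, "%63s %*s %*s [%63[^]]] \"%15s %255s %31[^\"]\" %d %ld",
               ip, timestamp, method, path, protocol, &status, &bytes) == 7) {
        printf("%s | %s | %s | %s | %s | %d | %ld\n",
               ip, timestamp, method, path, protocol, status, bytes);
    }
    return 0;
}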
Algorithm for Log File Extraction
Input: Web Log File (in text format)
Output: Extracted Log File (in text format)
Step 1: Open the WLF in read mode.
Step 2: Open the Extracted Log File in write mode.
Step 3: While the end of the WLF (EOF) is not reached:
        Read one line;
        Write the line into the Extracted Log File followed by an #end marker;
Step 4: Calculate the size of the Extracted Log File and the number of records.
Step 5: Close both files.
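A minimal C sketch of this extraction step is given below for illustration; it follows the steps above, while the input and output file names, the buffer size and the error handling are assumptions.

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *wlf = fopen("weblog.txt", "r");          /* assumed input file name  */
    FILE *out = fopen("extracted_log.txt", "w");   /* assumed output file name */
    char line[4096];
    long records = 0, bytes = 0;

    if (wlf == NULL || out == NULL) {
        perror("fopen");
        return 1;
    }
    /* Step 3: read the WLF line by line and write each record with #end. */
    while (fgets(line, sizeof line, wlf) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';        /* strip the newline        */
        bytes += fprintf(out, "%s #end\n", line);
        records++;
    }
    /* Step 4: report the number of records and the extracted file size. */
    printf("records: %ld, extracted file size: %ld bytes\n", records, bytes);
    fclose(wlf);                                   /* Step 5: close both files */
    fclose(out);
    return 0;
}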
Log File Cleaning:
The log file cleaning algorithm retains only those entries whose status code is 200, whose method is GET or POST, and whose requested file suffix is not js, xml, txt, gif, jpg, png or css.
Algorithm for Log File Cleaning
Input: Extracted Log File (in text format)
Output: Cleaned Log File (in text format)
Step 1: Open the Extracted Log File in read mode.
Step 2: Open the Cleaned Log File in write mode.
Step 3: While the end of the Extracted Log File (EOF) is not reached:
        Read one record;
        If (status code == 200) and (method == GET or POST) and (suffix is not css, xml, js, txt, png, jpg, jpeg or gif)
            Write the record into the Cleaned Log File;
        Else
            Discard the record;
Step 4: Calculate the size of the Cleaned Log File and the number of records in it.
Step 5: Close both files.
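A sketch of the record filter at the heart of this cleaning step is shown below; it is illustrative only, and the helper name keep_record and the exact suffix list are assumptions based on the conditions stated above.

#include <stdio.h>
#include <string.h>

/* Suffixes that mark irrelevant requests (style sheets, scripts, images). */
static const char *skip_suffix[] = { ".css", ".js", ".xml", ".txt",
                                     ".gif", ".jpg", ".jpeg", ".png" };

/* Return 1 if a record passes the cleaning conditions, 0 otherwise. */
static int keep_record(int status, const char *method, const char *path)
{
    size_t i, plen = strlen(path);

    if (status != 200)
        return 0;
    if (strcmp(method, "GET") != 0 && strcmp(method, "POST") != 0)
        return 0;
    for (i = 0; i < sizeof skip_suffix / sizeof skip_suffix[0]; i++) {
        size_t slen = strlen(skip_suffix[i]);
        if (plen >= slen && strcmp(path + plen - slen, skip_suffix[i]) == 0)
            return 0;                  /* requested file has an excluded suffix */
    }
    return 1;
}

int main(void)
{
    /* Small self-test of the filter on hand-written field values. */
    printf("%d\n", keep_record(200, "GET", "/msg.php"));   /* kept: prints 1    */
    printf("%d\n", keep_record(200, "GET", "/logo.png"));  /* dropped: prints 0 */
    printf("%d\n", keep_record(404, "GET", "/msg.php"));   /* dropped: prints 0 */
    return 0;
}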
User Identification:
Our algorithm follows the rules below to identify users:
If there is a new Internet Protocol (IP) address, it represents a new user/client.
If the IP address is the same but the operating system (OS) is different, it also represents a new user.
Algorithm for User Identification
Input: Cleaned Log File (in text format)
Output: Number of users, LogUser File
Step 1: Open the Cleaned Log File in read mode.
Step 2: Open the LogUser File in write mode.
Step 3: Initialize int ucount = 1; char OldIP[20], NewIP[20];
Step 4: If OldIP != NewIP then
            Increment ucount by one;
            Write the record into the LogUser File;
        Else
            Write the record into the LogUser File;
        End if
Step 5: Repeat Step 4 until EOF.
Step 6: Return the number of users (ucount).
Step 7: Close both files.
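A minimal C sketch of this user count is given below for illustration; it assumes the IP address is the first whitespace-separated field of each record in the cleaned log, and, like the algorithm above, it registers a new user whenever the IP address changes from one record to the next (the OS check from the rules is left out for brevity).

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *in  = fopen("cleaned_log.txt", "r");   /* assumed input file name   */
    FILE *out = fopen("loguser.txt", "w");       /* assumed output file name  */
    char line[4096], old_ip[64] = "", new_ip[64];
    int ucount = 0;

    if (in == NULL || out == NULL) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof line, in) != NULL) {
        if (sscanf(line, "%63s", new_ip) != 1)
            continue;                            /* skip a malformed record   */
        if (strcmp(old_ip, new_ip) != 0) {       /* IP changed => new user    */
            ucount++;
            strcpy(old_ip, new_ip);
        }
        fputs(line, out);                        /* record is kept either way */
    }
    printf("number of users: %d\n", ucount);
    fclose(in);
    fclose(out);
    return 0;
}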
Session Identification:
The web page accesses of each user are divided into individual sessions, and a timeout mechanism is used to identify the sessions. The session identification algorithm follows these rules:
If there is a new IP address (user), a new session starts. A default timeout of 30 minutes is taken.
If a user's next access to the website or a web page comes after more than 30 minutes, a new session starts.
Algorithm for Session Identification
Input: LogUser File (in text format)
Output: LogSession File (in text format), number of sessions
Step 1: Open the LogUser File in read mode.
Step 2: Open the LogSession File in write mode.
Step 3: Initialize isession and usertime.
Step 4: usertime = TimeDiff(time1, time2);
        // TimeDiff finds the difference between the timestamps of two consecutive records (previous and next)
Step 5: If usertime > 30 minutes then
            Increment isession by one;
            Write the record into the LogSession File;
        End if
Step 6: Repeat Steps 4 and 5 until EOF.
Step 7: Return the isession value.
Step 8: Close both files.
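The timeout test itself can be sketched in C as follows; this is illustrative only and assumes that the timestamps of a user's accesses have already been parsed into time_t values in chronological order.

#include <stdio.h>
#include <time.h>

#define SESSION_TIMEOUT (30 * 60)   /* default timeout: 30 minutes, in seconds */

/* Count sessions from one user's access timestamps: a gap longer than the
 * timeout between consecutive accesses starts a new session. */
static int count_sessions(const time_t *access, int n)
{
    int i, isession;

    if (n == 0)
        return 0;
    isession = 1;                           /* the first access opens a session */
    for (i = 1; i < n; i++) {
        if (difftime(access[i], access[i - 1]) > SESSION_TIMEOUT)
            isession++;                     /* long gap => new session          */
    }
    return isession;
}

int main(void)
{
    /* Hypothetical access times in epoch seconds: gaps of 10, 45 and 5 minutes. */
    time_t access[] = { 0, 600, 3300, 3600 };

    printf("number of sessions: %d\n", count_sessions(access, 4));   /* prints 2 */
    return 0;
}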
EXPERIMENTAL RESULTS
The extracted log file is 1825 KB in size and contains 14,225 entries. Figure 2 shows a snapshot of the result.
Fig.2. Result after extracting the log data
After cleaning, the log file is 340 KB in size and 3,061 entries remain in the cleaned log file. Figure 3 shows a snapshot of the result.
Fig.3. Result after Cleaning
Next, the users and their sessions are identified: 325 users and 1,831 sessions are found in the drgoyal website log data. Figure 4 shows a snapshot of the results.
Fig.4. User and session identification
RESULT ANALYSIS
There were 14,225 records and the file size was 1825 KB before cleaning; after the cleaning process only 3,061 records remain and the file size is 340 KB. This is shown in Table 1 and presented graphically in Figure 5.
 | No. of Records | Size of file (in KB)
Before Cleaning | 14,225 | 1825
After Cleaning | 3,061 | 340
Reduction (in %) | 78.49 | 81.37
Table 1: Comparison of the number of records and the file size before and after cleaning the log
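For clarity, each reduction percentage in Table 1 is computed as (value before cleaning − value after cleaning) / value before cleaning × 100; for the file size, for example, (1825 − 340) / 1825 × 100 ≈ 81.37%.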
Fig.5. Graph of the number of records and the file size before and after cleaning the log

The results of the algorithms on the drgoyal.co.in log are tabulated in Table 2.

drgoyal.co.in
Duration of log data records | April 1, 2014 to October 1, 2014
Size before cleaning (KB) | 1825
Size after cleaning (KB) | 340
Reduction in size (%) | 81.37
Records before cleaning | 14,225
Records after cleaning | 3,061
Reduction in records (%) | 78.49
Number of users | 325
Number of sessions | 1,831
Table 2: Result comparison

CONCLUSION
This paper presents a brief overview of WUM. Data cleaning is an important task in the WUM process. We have implemented the proposed algorithms for log file extraction, log file cleaning, user identification and session identification. The cleaned log file retains the valuable information of the original log, and because the cleaning step reduces both the number of records and the file size, it increases the quality of the log file and reduces the time required for the pattern discovery process.

FUTURE WORK
In future work, the user access patterns will be extracted and the accessibility time of websites will be reduced using a most-frequent-pattern algorithm and a maximum-hits-pattern algorithm. We will also present our suggested pattern algorithm for improving the accessibility time of websites.
REFERENCES
[1] Vijay Kumar Padala, Sayeed Yasin, Durga Bhavani Alanka, "A Novel Method for Data Cleaning and User-Session Identification for Web Mining", Vol. 3, Issue 5, Sep-Oct 2013.
[2] Shaily G. Langhnoja, Mehul P. Barot, Darshak B. Mehta, "Web Usage Mining Using Association Rule Mining on Clustered Data for Pattern Discovery", International Journal of Data Mining Techniques and Applications, Vol. 02, Issue 01, June 2013.
[3] Mona S. Kamat, J. W. Bakal, Madhu Nashipudi, "Comparative Study of Techniques to Discover Frequent Patterns of Web Usage Mining", Volume 2, Issue 3, 2013.
[4] Mrs. R. Kousalya, Ms. K. Suguna, Dr. V. Saravanan, "Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm", International Journal of Scientific & Engineering Research, Volume 4, Issue 3, March 2013.
[5] S. Gowri Shanthi, Dr. Antony Selvadoss Thanamani, "Web Page Categorization Using Web Mining", International Journal of Advanced Research in Computer Engineering & Technology, Volume 1, Issue 7, September 2012.
[6] Devinder Kaur, Ravneet Kaur, "Minimizing the Repeated Database Scan Using an Efficient Frequent Pattern Mining Algorithm in Web Usage Mining", International Journal of Research in Advent Technology, Vol. 2, No. 6, June 2014.
[7] V. Chitraa, Dr. Antony Selvadoss Thanamani, "A Novel Technique for Sessions Identification in Web Usage Mining Preprocessing", International Journal of Computer Applications, Volume 34, No. 9, November 2011.
[8] Monika Verma, Shikha Pandey, "An Efficient Algorithm for Frequent Pattern Mining using Web Analysis Approach", IJCSET, Vol. 2, Issue 7, July 2012.
[9] K. S. R. Pavan Kumar, L. Manoj Chowdary, V. V. Sreedhar, "A Critique on Web Usage Mining", International Journal of Computer Science and Information Technologies, Vol. 3.
[10] L. K. Joshila Grace, Dhinaharan Nagamalai, V. Maheswari, "Analysis of Web Logs and Web User in Web Mining", International Journal of Network Security & Its Applications (IJNSA), Vol. 3, No. 1, January 2011.