- Open Access
- Total Downloads : 278
- Authors : Mrs.Kalpana Wani, Prof. Madhu Nashipudimath
- Paper ID : IJERTV2IS100500
- Volume & Issue : Volume 02, Issue 10 (October 2013)
- Published (First Online): 17-10-2013
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Association Rule Mining for Large Database
Vol. 2 Issue 10, October – 2013
Mrs.Kalpana Wani
Department of Computer Science, PIIT New Panvel, Navi Mumbai, India
Abstract
As we know World Wide Web plays vital role in serving the needs of the user's on web. The web log files are generated as a result of an interaction between the client and the service provider on web. Web log file contains the massive hidden valuable information pertaining to the visitors, if mined can be used for predicting the navigation behavior of the users. However the task of discovering frequent sequence patterns from the web log is challenging. This system focuses on adopting an intelligent technique that can provide personalized web service for accessing related web pages more efficiently and effectively, so that it can be determined which web pages are more likely to be accessed by the user in future. Proposed system uses two intelligent algorithms for predicting the user behavior's namely FP Growth and Eclat.These algorithms save the time and space problem of existing system. Further from the frequent pages pattern Direct and Indirect Association Rules are generated and based on that Ranking is provided to pages which will help recommendation system to recommend similar search pages.
Keywords Frequent pattern, Association Rule, Indirect association rule, Complex association rule.
-
Introduction
The World Wide Web serves as a vast, widely distributed, global information service center for advertisement, consumer information, e-commerce, education, financial management, government, news and many other information services. So, it has become much more difficult to access relevant information from the web with the explosive growth of information available on the internet. Therefore, further research work needs to be carried out for extracting the appropriate content as per the user's needs. This can be performed using Web mining as it helps in extracting the common sequences of the user's accessed web pages. In the web environment, association rules are typically applied to HTTP server log data that contain historical user sessions. Web sessions are gathered without any user involvement and additionally, they reliably reflect user behavior while navigating throughout a web site. For that reason, web sessions can be regarded as an important source of information about users. Association rules that reveal similarities between web pages derived from user behavior can be simply utilized in recommender systems. The main goal of such a recommendation is to suggest to the current user some web pages that appear to be useful.
Prof. Madhu Nashipudimath Department of Information Technology PIIT New Panvel, Navi Mumbai, India
-
Literature Survey
The literature review focuses on the study, comparison and contrast of the available preprocessing techniques. Data Cleaning is done to remove the inappropriate records with unsuccessful status [17]. The ability of using the data mining techniques to extract information from the server logs was first introduced by [14], [15], and [16]. The elements [11] of the web usage access are the users and web pages accessed by the user. The goal of web usage mining is to analyze the user access behavior patterns. Web mining can be practices in three different domains i.e. the content mining, hyper link web structure mining and web usage mining. These approaches effort to extract valuable information from the web which are then applied to some real world problems. The user's surfing behavior analysis follows three phases [6]: data collection and preparation, pattern discovery and content recommendation.
The output generated from the pattern analysis consists of sequences of accesses with corresponding probabilities. Frequent Pattern describes how often the pages are accessed together in a sequence. This has been introduced in 1993 by Argawal et al. [17] .The algorithms used to mine the usage are association rule mining and sequence analysis. Association rule mining discovers relationships between different web pages within a web site this approach helps to find out the order in which the pages are visited, reduces the bandwidth usage and storage needs, which undoubtedly results in improving the system efficiency and effectiveness i.e. an improved system. The objective of the association rules mining [18], [12],[10] is to discover correlations of association between existing records in a dataset. In [12],[10] fundamental association rules and indirect association rule have been described In [8], more practical and efficient methods are being presented, in order to find association rules in the case of less frequent items. Web log [13] can do this only after analyzing data resulted from the users' current and history data.
-
Problem Statement
The objective of this project is to analyzing the navigation behavior of the web access users from the accessed (filtered) data by applying two algorithms to find frequent access pages and finds the direct and indirect association rule between the frequent access pages. Besides many advantages,
association the loss of some vital information. Typical association rules focus on the co- occurrence of items (purchased products, visited web pages, etc.) within the transaction set. Web pages are connected with each other using hyperlinks and they usually determine all possible navigational paths. To reach a page, the user is often forced to navigate through other pages, e.g., a home page, a login page, etc. Additionally, the web site content is usually organized by the designer into thematic blocks which are not always suitable for particular users. For all these reasons, some personalized recommendation mechanisms are very useful in most web portals.
-
Proposed System
The Proposed system will analyse user behavior efficiently and effectively, so that it can be determined which web pages are more likely to be accessed by the user in future. Performance of web page Recommendation System is improved by using algorithm which will save time and memory space for discovering frequently access pages and finds the Direct and Indirect Association Rules from Frequent Access Pages to recommend more number of pages to recommendation system by assigning Ranks to the frequent access pages.
Fig.1. Architecture Diagram for Proposed System
-
Steps for Implementation
Following steps are used for implementation:
Step 1: Data Cleaning & Preprocessing Preprocessing is necessary, because Log file contain noisy & ambiguous data which may affect result of mining process. Some of web log file data are unnecessary for analysis process and could affect detection of web attack. Data preprocessing
is an important steps to filter and organize only appropriate information before applying any web mining algorithm. Preprocessing reduce log file size also increase quality of available data. The purpose of data preprocessing is to improve data quality and increase mining accuracy. Preprocessing consists of field extraction, data cleansing, user identification, session identification.
Fig .2. Block diagram for Data Preprocessing
Step 2: Finding Frequent Pattern
First count the occurrences of pages in database and then reorder the pages with higher count page first and so on. Draw the FP Growth tree and start mining FP tree by considering the suffix as page with lowest count first. Find the conditional pattern base for each page and if length of conditional pattern base is less than som threshold value then find the frequent pattern using Éclat algorithm and if length is greater than or equal to the threshold value then use FP Growth algorithm for finding the frequent pattern of visited pages. Sample session dataset and remaining calculation is shown below.
Table 1. Session Database
Session Id
Pages
1
P1,P2,P4
2
P1,P4
3
P1,P2,P4
4
P1,P3
5
P2,P4,P5,P6
6
P2,P4
7
P4,P5,P6
8
P2,P4,P5,P6
9
P1,P6
10
P1,P3
Table 2.Page Count
P4:7
P1:6
P2:5
P6:4
P5:3
P3:2
Table 3. Session Database with ordered Pages
Session Id
Pages
Ordered Pages
1
P1,P2,P4
P4,P1,P2
2
P1,P4
P4,P1
3
P1,P2,P4
P4,P1,P2
4
P1,P3
P1,P3
5
P2,P4,P5,P6
P4,P2,P6,P5
6
P2,P4
P4,P2
7
P4,P5,P6
P4,P6,P5
8
P2,P4,P5,P6
P4,P2,P6,P5
9
P1,P6
P1,P6
10
P1,P3
P1,P3
After rearranging the pages, form the FP Growth Tree starting from the lowest to highest page count as shown below. Count the no.of occurrences of each page.
Fig.3. FP-Growth Tree Table 4. FP-Tree Mining
P2
{P4 P1:2}
{P4:3}
{P4:5,P1:2}
{P4 P1 P2:2}
{P4 P2:5}
{P1 P2:2}
P1
{P4:3}
{P4:3}
{P4 P1:3}
P2
{P4 P1:2}
{P4:3}
{P4:5,P1:2}
{P4 P1 P2:2}
{P4 P2:5}
{P1 P2:2}
P1
{P4:3}
{P4:3}
{P4 P1:3}
After finding the frequent pattern find the confidence and support value for each frequent pattern as shown below.
Frequent Pattern
Confidence
Support
{P1,P3}
30%
20%
{P4 P6 P5}
P4 P6->P5 = 100% P4 P5->P6 =100%
P5 P6->P4 =100%
30%
{P4 P2 P6 P5}
P4 ->P2 P6 P5 = 28% P2 ->P4 P6 P5 = 40% P6 ->P4 P2 P5 = 5% P5 -> P4 P2 P6 = 60% P2 P6 P5->P4 =100% P4 P6 P5->P2 = 100% P4 P2 P5->P6 = 100% P4 P2 P6->P5 = 100% P4 P2->P6 P5 =40% P4 P6->P2 P5 = 60% P4 P5->P2 P6 =60% P6 P5->P4 P2 =60% P2 P5->P4 P6 =100% P2 P6->P4 P5 =100%
20%
{P4 P2 P5}
P4 P2->P5 =40% P4 P5->P2 = 60% P2 P5->P4 = 100%
20%
{P4 P5}
42%
30%
{P2 P5}
40%
20%
{P5 P6}
100%
30%
Frequent Pattern
Confidence
Support
{P4 P6}
42%
30%
{P2 P6}
40%
20%
{P4 P2 P6}
P4 P2->P6 = 40% P4 P6->P2 = 60% P6 P2->P4 = 100%
20%
{P4 P1 P2}
P4 P1->P2 = 60% P4 P2->P1 = 40% P1 P2->P4 = 100%
20%
{P4 P2 }
71%
50%
{P1 P2 }
33%
20%
Frequent Pattern
Confidence
Support
{P1,P3}
30%
20%
{P4 P6 P5}
P4 P6->P5 = 100% P4 P5->P6 =100%
P5 P6->P4 =100%
30%
{P4 P2 P6 P5}
P4 ->P2 P6 P5 = 28% P2 ->P4 P6 P5 = 40% P6 ->P4 P2 P5 = 5% P5 -> P4 P2 P6 = 60% P2 P6 P5->P4 =100% P4 P6 P5->P2 = 100% P4 P2 P5->P6 = 100% P4 P2 P6->P5 = 100% P4 P2->P6 P5 =40% P4 P6->P2 P5 = 60% P4 P5->P2 P6 =60% P6 P5->P4 P2 =60% P2 P5->P4 P6 =100% P2 P6->P4 P5 =100%
20%
{P4 P2 P5}
P4 P2->P5 =40% P4 P5->P2 = 60% P2 P5->P4 = 100%
20%
{P4 P5}
42%
30%
{P2 P5}
40%
20%
{P5 P6}
100%
30%
Frequent Pattern
Confidence
Support
{P4 P6}
42%
30%
{P2 P6}
40%
20%
{P4 P2 P6}
P4 P2->P6 = 40% P4 P6->P2 = 60% P6 P2->P4 = 100%
20%
{P4 P1 P2}
P4 P1->P2 = 60% P4 P2->P1 = 40% P1 P2->P4 = 100%
20%
{P4 P2 }
71%
50%
{P1 P2 }
33%
20%
Table 6. Frequent Patterns with Support and Confidence
Item
Condition Pattern Base
Conditional Pattern
Frequent Pattern
P3
{P1:2}
{P1:2}
{P1,P3:2}
{P4 P6 P5:3}
{P4 P2 P6
P5
{P4:1,P6:1}
{P4:2,P2:2,P6
:2}
{P4,P6:1}
{P4 P2 P6:2}
P5:2}
{P4 P2 P5:2}
{P4 P5:3}
{P2 P5:2}
{P5 P6:3}
P6
{P4:1}
{P1:1}
{P4,P2:2}
{P4:1}
{P4 P2:2}
{P4 P6:3}
{P2 P4:2}
{P4 P2 P6:2}
Similarly find the confidence and support value for remaining frequent pattern as shown above.
Step 3: Direct Association Rule Generation between the Frequent Access Pages
After getting the frequently visited page sequence, filter the pattern by applying some minimum confidence and support value and form the one page association rule between pages of frequent pattern and calculate its support and confidence values and filter it with respect to minimum confidence and support value as shown below.
Table 7. Formation of Direct Association Rule
Sr.No
Rule
Confidence
Support
1
P1->P4
0.5
0.2
2
P2->P4
1
0.5
3
P2->P5
0.4
0.2
4
P2->P6
0.4
0.2
5
P4 ->P2
0.6
0.5
6
P4->P6
0.5
0.4
7
P5->P2
0.6
0.2
8
P6->P2
0.4
0.2
9
P6->P4
0.8
0.4
10
P6->P5
0.4
0.2
Step 4: Indirect Association Rule Generation between the Frequent Access Pages
In direct association rule of pages there are many pages which are connected to each other indirectly via some transitive pages. Find such pages which are indirectly connected and filter it with respect to minimum confidence and support value.
Fig. 4. Indirect Association between Two Web Pages.
Step 5: Complete Indirect Association Rule Generation between the Frequent Accesses Pages
The complete indirect association rule di#dj
aggregates all partial indirect association rules from di to dj with respect to all existing transitive pages dk Tij and is characterized by complete indirect confidence con#(di#dj )
Where
L1
(i–k)
L2
(k–j)
Transitive pages
Lk (k—j)
Complete Indirect Rule
P4 ->P2 P5->P2 P6->P2
P2->P4 P2->P5 P2->P6
P2
P2->P4 P2->P5 P2->P6
P4->#P5,P2 P4->#P6,P2 P5->#P4,P2 P5->#P6,P2 P6->#P5,P2
P1->P4 P2->P4 P6->P4
P4->P2 P4->P6
P4
P4->P2 P4->P6
P1->#P2,P4 P1->#P6,P4 P2->#P6,P4 P6->#P2,P4
P2->P5 P6->P5
P5->P2
P5
P5->P2
P6->#P2,P5
P2->P6 P4->P6
P6->P2 P6->P4 P6->P5
P6
P6->P2 P6->P4 P6->P5
P2->#P4,P6 P2->#P5,P6 P4->#P2,P6 P4->#P5,P6
L1
(i–k)
L2
(k–j)
Transitive pages
Lk (k—j)
Complete Indirect Rule
P4 ->P2 P5->P2 P6->P2
P2->P4 P2->P5 P2->P6
P2
P2->P4 P2->P5 P2->P6
P4->#P5,P2 P4->#P6,P2 P5->#P4,P2 P5->#P6,P2 P6->#P5,P2
P1->P4 P2->P4 P6->P4
P4->P2 P4->P6
P4
P4->P2 P4->P6
P1->#P2,P4 P1->#P6,P4 P2->#P6,P4 P6->#P2,P4
P2->P5 P6->P5
P5->P2
P5
P5->P2
P6->#P2,P5
P2->P6 P4->P6
P6->P2 P6->P4 P6->P5
P6
P6->P2 P6->P4 P6->P5
P2->#P4,P6 P2->#P5,P6 P4->#P2,P6 P4->#P5,P6
Table 8. Formation of Indirect Association Rule
Sr.No
Rule
Confidence
1
P1-->#P2,P4
0.3
2
P1-->#P6,P4
0.25
3
P2-->#P4,P6
0.32
4
P2-->#P5,P6
0.16
5
P2- ->#P6,P4
0.5
6
P4-->#P2,P6
0.2
7
P4-->#P5,P2
0.24
0.22
8
P4-->#P5,P6
0.2
9
P4-->#P6,P2
0.24
10
P5-->#P4,P2
0.6
11
P5-->#P6,P2
0.24
12
P6-->#P4,P2
0.4
13
P6-->#P5,P2
0.16
Sr.No
Rule
Confidence
14
P6->#P2,P4
0.48
0.36
15
P6->#P2,P5
Sr.No
Rule
Confidence
1
P1-->#P2,P4
0.3
2
P1-->#P6,P4
0.25
3
P2-->#P4,P6
0.32
4
P2-->#P5,P6
0.16
5
P2- ->#P6,P4
0.5
6
P4-->#P2,P6
0.2
7
P4-->#P5,P2
0.24
0.22
8
P4-->#P5,P6
0.2
9
P4-->#P6,P2
0.24
10
P5-->#P4,P2
0.6
11
P5-->#P6,P2
0.24
12
P6-->#P4,P2
0.4
13
P6-->#P5,P2
0.16
Sr.No
Rule
Confidence
14
P6->#P2,P4
0.48
0.36
15
P6->#P2,P5
Table 9. Indirect Association Rule with Confidence
0.24
Step 5: Complex Association Rule Generation
A complex association rule exists if at least one of two component rules exists, i.e., either direct or complete indirect or both of them. The main quality features of both direct and indirect rules confidencesare combined within complex association rules.
A complex association rule is characterized by the complex confidence, con*(di dj ), as follows
Con*(di*dj) =.Con(didj) +(1 ).Con# (di#dj)
Where is the direct confidence reinforcing factor [0, 1]
Calculate the complex confidence value for different values of . Setting we can emphasize or damp the direct confidence at the expense of the Complete indirect one. The greater the value of , the closer the complex confidence to the direct one.
Depending on highest value of rank is assigned to the pages and sends it to recommendation system.
Fig. 5. Complex association results from either a direct association (a) or a complete indirect one
(b) or both (c).
By considering different values of i.e. between [0.2-0.9] value of coplex association rule is calculated, its calculation is shown in table on next page. Based on this value rank is assigned to the pages which is shown below in table.
Table 10. Page Ranking List
Page
Direct Pi->Pj
Indirect Pi->#Pj
Complex:con*(Pi->Pj)
= 0.2
= 0.3
= 0.4
= 0.5
= 0.6
= 0.7
= 0.8
= 0.9
P1
P4
P2 P6
P2P6P4
P2P4P6
P4P2P6
P2P4P6
P4P2P6
P2
P4P5P6
P4P5P6
P2 P6 P4
P4
P2P5
P2P5P6
P4P2P6
P2P4P6
P4P2P6
P5
P2
P4P6
P4P6P2
P4P2P6
P2P4P5
P6
P2P4P5
P4P5
P6P2P4
P4P6P2
P2P4P6
Table11. Values of complex confidence
Sr.No
Rule
Direct Pi->Pj
Indirect Pi->#Pj
Complex:con*(Pi->Pj)
= 0.2
= 0.3
= 0.4
= 0.5
= 0.6
= 0.7
= 0.8
= 0.9
1
P1-->*P2
0.24
0.21
0.18
0.15
0.12
0.9
0.6
0.03
2
P1-->*P3
3
P1-->*P4
0.1
0.15
0.2
0.25
0.30
0.35
0.40
0.45
4
P1-->*P5
5
P1-->*P6
0.2
0.175
0.026
0.125
0.1
0.075
0.05
0.025
6
P2-->*P1
7
P2-->*P4
0.456
0.524
0.592
0.66
0.728
0.796
0.864
0.932
8
P2-->*P5
0.208
0.013
0.015
0.016
0.015
0.013
0.010
0.005
9
P2-->*P6
0.32
0.042
0.048
0.05
0.048
0.42
0.032
0.018
10
P3-->*P1
11
P4-->*P1
12
P4-->*P2
0.019
0.025
0.028
0.030
0.288
0.252
0.192
0.010
13
P4-->*P5
0.176
0.154
0.132
0.11
0.088
0.066
0.044
0.022
14
P4-->*P6
0.019
0.025
0.028
0.03
0.028
0.025
0.019
0.010
15
P5-->*P1
16
P5-->*P2
0.12
0.18
0.24
0.30
0.36
0.42
0.48
0.54
17
P5-->*P4
0.48
0.42
0.36
0.30
0.24
0.18
0.12
0.06
18
P5-->*P6
0.192
0.168
0.144
0.12
0.096
0.072
0.048
0.024
19
P6-->*P1
20
P6-->*P2
0.08
0.12
0.16
0.20
0.24
0.28
0.32
0.36
21
P6-->*P4
0.051
0.067
0.076
0.08
0.076
0.067
0.512
0.028
-
Conclusion
Web usage mining has valuable uses to the marketing of businesses and a direct impact to the success of their promotional strategies and internet traffic. Analysis of web log data will help companies to develop promotions that are more effective. A quick web page recommendation system help user to get required information with multiple options without going through entire web site pages. Proposed system will provide efficient method for extraction of the frequent access pages and method for finding Direct and Indirect Association Rule which will provide multiple pages for Recommendation system.
-
Reference
-
Bina Kotiyalt, Ankit Kumar2, Bhaskar Pant3, R.H. Goudar4, Shiv ali Chauhan5 and Sonam Junee6. User Behavior Analysis in Web Log through Comparative Study of Éclat and Apriori, Proceedings of the International Conference on Intelligent Systems and Control (ISCO 2013)
-
Dr. S. Vijayarani, Ms. P. Sathya An Efficient Algorithm for Mining Frequent Items in Data Streams,in International Journal of Innovative Research in Computer and Communication Engineering Vol. 1, Issue 3, May 2013
-
Ms. Shweta, Dr. Kanwal Garg ,Mining Efficient Association Rules Through Apriori Algorithm Using Attributes and Comparative Analysis of Various Association Rule Algorithms
in Volume 3, Issue 6, June 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering.
-
N. VENKATESAN, RAMARAJ, FREQUENT ITEMSET MINING WITH BIT SEARCH, Journal of Theoretical and Applied Information Technology, 15 July 2012. Vol. 41 No.1
-
Ravi Bhushan and Rajender Nath, Recommendation of Optimized Web Pages to Users Using Web Log Mining Techniques, 978-1-4673-4529-3/12/$31.00c 2012 IEEE
-
Rachna Somkunwar Hash-Set Technique of Association Rulesin Volume 2, Issue 11, November 2012 ISSN: 2277 128 X, International Journal of Advanced Research in Computer Science and Software Engineering.
-
Chowdary Farha ahmed, Byeong-Soo Jeong, Eficient mining of high utility patterns over data streams with a sliding window model.Springerlink.com, 2011
-
Yo unghee Kim Won Young Kim and Ungmo Kim Mining frequent item sets with normalized weight in continuous data streams. Journal of information processing systems. 2010.
-
L. Zhou, Z. Zhong, J. Chang, J. Li, J.Z. Huang, S. Feng,
Balanced parallel FP-Growth with Map Reduce, in Conference
on Information Computing and Telecommunications, IEEE, pp. 243 246, 2010.
-
Ding, Yau, S. S.: TCOM, An innovative data structure for mining association rules among infrequent items, pp. 290-301. Comput. Math. Appl. 57, 2 (2009).
-
Ding, 1. Yau, S. S.: TCOM, an innovative data structure for mining association rules among infrequent items, pp. 290-301. Comput. Math. Appl. 57, 2 (2009).
-
Sang Lin ,Hu-yan Cui ,Ren Ying ,Zhou-lin Lin , Algorithm Research for Mining Maximal Frequent Itemsets Based on Item Constraints, 2009 Second International Symposium on Information Science and Engineering, ISBN: 978-0-7695-3991.
-
PRZEMYSAW KAZIENKO . MINING INDIRECT ASSOCIATION RULES FOR WEB RECOMMENDATION, Int. J. Appl. Math. Comput. Sci., 2009, Vol. 19, No. 1, 165186