- Open Access
- Total Downloads : 503
- Authors : S. Jaiganesh, P. Babu, K. Nimmati Satheesh
- Paper ID : IJERTV2IS4543
- Volume & Issue : Volume 02, Issue 04 (April 2013)
- Published (First Online): 17-04-2013
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Comparative Study Of Various Web Search Algorithms For The Improvement Of Web Crawler
S. Jaiganesh1, P. Babu2, K. Nimmati Satheesh3
1Associate Professor, 2Associate Professor, 3Assistant Professor,
Department of Computer Applications, PSNA College of Engineering and Technology,
Dindigul 624 622, Tamil Nadu, India.
Abstract— The crawler is a vital module of a web search engine. The effectiveness of a crawler directly affects the searching quality of web search engines. As the crawler interacts with millions of hosts or servers over a period of weeks or months, the issues of validity, flexibility and manageability are of major importance. The crawler could also retrieve other information, which may be unimportant to the search, from the HTML files as it parses them to get new URLs. In this paper, an attempt has been made to improve the performance of the Web crawler by comparing certain features of several algorithms such as best-first, breadth-first, pagerank, shark search and HITS. For this, various performance parameters such as precision, recall, accuracy and F-Score are taken into consideration. Based on the output parameters, an analysis is made for the improvement of the web crawler towards web searching.
Keywords: Web Crawler, best-first, breadth-first, pagerank, shark-search, HITS, precision, recall, accuracy and F-measures
-
INTRODUCTION
Web crawlers, also known as spiders or robots, are programs that automatically retrieve Web pages when a query is placed in a search engine. Since information on the Web is scattered among billions of pages served by millions of servers around the globe, users who browse the Web follow hyperlinks to access information, virtually moving from one page to the next.
A crawler can visit many sites to collect information that can be analyzed and mined in a central location [1]. Many machine learning approaches are employed to estimate the significance of pages with respect to user queries. This is a critical task because it greatly influences the perceived effectiveness of a search engine. Users often look at only a few top hits, making the precision achieved by the ranking algorithm of dominant importance. Many search engines rank pages principally based on their lexical similarity to the query.
The main objective of this work is to compare five crawler algorithms, namely PageRank, Breadth-First, Best-First, Shark-Search and HITS, and to assess their performance through measures such as precision, recall, accuracy and F-Score.
-
FUNCTIONALITY OF A CRAWLER
A crawler starts from a set of seed pages (URLs) and then uses the links within them to fetch other pages. The links in these pages are, in turn, extracted and the corresponding pages are visited. The process repeats until a sufficient number of pages are visited or some other objective is achieved. In fact, Google founders Sergey Brin and Lawrence Page, in their seminal paper [2], identified the Web crawler as the most sophisticated yet fragile component of a search engine.
The frontier [1] is the list of unvisited URLs maintained by the crawler. The list is initialized with seed URLs, which may be provided by the user or another program. In each iteration of its main loop, the crawler picks the next URL from the frontier, fetches the page corresponding to that URL through HTTP, parses the retrieved page to extract its URLs, adds the newly discovered URLs to the frontier, and stores the page in a local disk repository. The crawling process may be terminated when a certain number of pages have been crawled.
A crawler is, in essence, a graph search algorithm. The Web can be seen as a large graph with pages as its nodes and hyperlinks as its edges. A crawler starts from a few of the nodes (seeds) and then follows the edges to reach other nodes[3]. The process of fetching a page and extracting the links within it is analogous to expanding a node in graph search. The frontier is the main data structure, which contains the URLs of unvisited pages[8].
Typical crawlers attempt to store the frontier in main memory for efficiency. Given the declining price of memory and the spread of 64-bit processors, quite a large frontier size is feasible. Yet the crawler designer must decide which URLs have low priority and are thus discarded when the frontier fills up [9]. Note that, given some maximum size, the frontier fills up quickly due to the high fan-out of pages. Even more importantly, the crawler algorithm must specify the order in which new URLs are extracted from the frontier to be visited [4]. These mechanisms determine the graph search algorithm implemented by the crawler.
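As a concrete illustration of such a bounded frontier, the sketch below keeps URLs in a priority queue and discards the lowest-priority entry once a configured maximum is reached. This is a minimal sketch in Python; the class and parameter names (Frontier, max_size) are illustrative and not taken from the paper.

```python
import heapq

class Frontier:
    """Bounded priority frontier: highest-score URLs are dequeued first,
    and the lowest-score URL is discarded once max_size is exceeded."""

    def __init__(self, max_size=10000):
        self.max_size = max_size
        self.heap = []          # entries are (-score, url) so heapq acts as a max-heap
        self.seen = set()       # URLs already added, to avoid duplicates

    def add(self, url, score):
        if url in self.seen:
            return
        self.seen.add(url)
        heapq.heappush(self.heap, (-score, url))
        if len(self.heap) > self.max_size:
            # Drop the lowest-priority URL (the largest negated score).
            self.heap.remove(max(self.heap))
            heapq.heapify(self.heap)

    def next_url(self):
        # Return the highest-scoring unvisited URL, or None when empty.
        if not self.heap:
            return None
        return heapq.heappop(self.heap)[1]
```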
-
CRAWLER ALGORITHMS
This section deals with the crawler algorithms under comparison in this study: PageRank, Breadth-First, Best-First, Shark-Search and HITS.
-
Page Rank
PageRank was proposed by Brin and Page [2] as a possible model of user surfing behavior. The PageRank of a page represents the probability that a random surfer (one who follows links randomly from page to page) will be on that page at any given time [5]. A page's score depends recursively upon the scores of the pages that point to it.
PageRank(Topic, StartingUrls[], frequency)
{
    for (i = 0; i < NumStartingUrls; i++)
        Enqueue(Frontier, StartingUrls[i]);
    do
    {
        if (multiplies(Visited, frequency))
        {
            recompute_scores_pr();              // periodically recompute PageRank scores
        }
        Link = DequeueTopLink(Frontier);
        Doc = Fetch(Link);
        ScoreSim = Sim(Topic, Doc);
        Enqueue(BufferedPages, Doc, ScoreSim);
        if (#BufferedPages >= MaxBuffer)
        {
            DequeueBottomLinks(BufferedPages);
        }
        Merge(Frontier, ExtractLinks(Doc), ScorePR);
        if (#Frontier > MaxBuffer)
        {
            DequeueBottomLinks(Frontier);
        }
        Visited = Visited + 1;
    } while (Visited < MaxPages);
}
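The recompute_scores_pr step above refers to recomputing PageRank over the pages fetched so far. A minimal power-iteration sketch is given below in Python; the adjacency representation, the damping factor of 0.85 and the fixed iteration count are illustrative assumptions, not values taken from the paper.

```python
def pagerank(out_links, damping=0.85, iterations=20):
    """out_links: dict mapping each page to the list of pages it links to.
    Returns a dict of PageRank scores (random-surfer probabilities)."""
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    out_links = {p: out_links.get(p, []) for p in pages}   # pages with no known outlinks
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, targets in out_links.items():
            if targets:
                for q in targets:
                    new_rank[q] += damping * rank[p] / len(targets)
            else:                                           # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Example: PageRank of a tiny three-page graph.
print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))
```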
-
Best-First
Best-First crawlers have been studied by Cho et al. [2] and Hersovici et al. [2]. The basic idea is that, given a frontier of URLs, the best URL according to some estimation criterion is selected for crawling, with the frontier used as a priority queue. In this algorithm, the URL selection process is guided by the lexical similarity between the topic's keywords and the source page of the URL [4]. Thus the similarity between a page p and the topic keywords is used to estimate the relevance of all the outgoing links of p.
BestFirst(StartingUrls)
{
    for (i = 0; i < NumStartingUrls; i++)
        Enqueue(Frontier, StartingUrls[i], MaxScore);
    do
    {
        Url = Dequeue(Frontier);
        Page = Fetch(Url);
        Score = GetTopicScore(Page);
        Visited = Visited + 1;
        Enqueue(Frontier, ExtractLinks(Page), Score);
    } while (Visited < MaxPages && Frontier != Null);
}
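The topic score that orders the frontier is typically a lexical similarity between the topic keywords and the page text, as described above. The sketch below shows one simple way such a GetTopicScore could be realised as a term-frequency cosine similarity in Python; the tokenisation is an illustrative assumption, not the exact scoring used in the experiments.

```python
import math
import re
from collections import Counter

def cosine_similarity(topic, page_text):
    """Cosine similarity between term-frequency vectors of the topic
    keywords and the page text (a simple stand-in for GetTopicScore)."""
    def tokens(s):
        return re.findall(r"[a-z0-9]+", s.lower())
    a, b = Counter(tokens(topic)), Counter(tokens(page_text))
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("web crawler algorithms", "A crawler fetches web pages and follows links."))
```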
-
Breadth-First
Breadth-First algorithm is the simplest strategy for crawling. It does not utilize heuristics in deciding which URL to visit next. It uses the frontier as a FIFO queue, crawling links in the order in which they are encountered.
BreadthFirst(StartingUrls)
{
    for (i = 0; i < NumStartingUrls; i++)
        Enqueue(Frontier, StartingUrls[i]);
    do
    {
        Url = Dequeue(Frontier);
        Page = Fetch(Url);
        Visited = Visited + 1;
        Enqueue(Frontier, ExtractLinks(Page));
    } while (Visited < MaxPages && Frontier != Null);
}
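For completeness, a minimal runnable version of this FIFO strategy is sketched below in Python using only the standard library. The link extraction (anchor tags via html.parser) and the error handling are illustrative stand-ins, not the setup used in the experiments.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags, resolved against the base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def breadth_first_crawl(starting_urls, max_pages=100):
    frontier = deque(starting_urls)         # FIFO queue: crawl links in discovery order
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except OSError:
            continue                        # skip pages that fail to download
        parser = LinkExtractor(url)
        parser.feed(html)
        frontier.extend(parser.links)       # newly discovered URLs go to the back of the queue
    return visited
```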
-
Shark-search
Shark-Search [6] is a more aggressive version of Fish-Search. In Fish-Search, the crawler searches more extensively in areas of the Web in which relevant pages have been found. At the same time, the algorithm discontinues searches in regions that do not yield relevant pages. Shark-Search offers two main improvements over Fish-Search. It uses a continuous-valued function for measuring relevance, as opposed to the binary relevance function in Fish-Search. In addition, Shark-Search has a more refined notion of potential scores for the links in the crawl frontier.
Shark(Topic, StartingUrls)
{
    foreach Link in StartingUrls
    {
        SetDepth(Link, d);
        Enqueue(Frontier, Link);
    }
    while (Visited < MaxPages)
    {
        Link = DequeueTopLink(Frontier);
        Doc = Fetch(Link);
        DocScore = Sim(Topic, Doc);
        if (Depth(Link) > 0)
        {
            foreach Outlink in ExtractLinks(Doc)
            {
                Score = (1 - r) * NeighborhoodScore(Outlink)
                      + r * InheritedScore(Outlink);
                if (DocScore > 0)
                {
                    SetDepth(Outlink, d);
                }
                else
                {
                    SetDepth(Outlink, Depth(Link) - 1);
                }
                Enqueue(Frontier, Outlink, Score);
            }
        }
        if (#Frontier > MaxBuffer)
        {
            DequeueBottomLink(Frontier);
        }
        Visited = Visited + 1;
    }
}
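One common way to realise these potential scores, following the description above, is sketched below in Python: an inherited score propagated from the parent page is blended with a neighborhood score built from the anchor text and its surrounding context. The parameter values (r, beta) and the function name are illustrative assumptions, not taken from the paper.

```python
def shark_link_score(inherited, anchor_sim, context_sim, r=0.5, beta=0.8):
    """Potential score of an outlink in Shark-Search (a sketch).
    inherited  : relevance propagated (and decayed) from the parent page
    anchor_sim : similarity of the anchor text to the topic
    context_sim: similarity of the text surrounding the anchor to the topic
    r, beta    : tuning weights (assumed values, for illustration only)."""
    # If the anchor itself matches the topic, its context is treated as fully
    # relevant; otherwise fall back to the context's own similarity.
    anchor_context = 1.0 if anchor_sim > 0 else context_sim
    neighborhood = beta * anchor_sim + (1 - beta) * anchor_context
    return (1 - r) * neighborhood + r * inherited
```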
-
HITS
In the HITS algorithm, the first step is to retrieve the set of results to the search query. The computation is performed only on this result set, not across all Web pages. Authority and hub values are defined in terms of one another in a mutual recursion. An authority value is computed as the sum of the scaled hub values that point to that page. A hub value is the sum of the scaled authority values of the pages it points to. Some implementations also consider the relevance of the linked pages.
Hits(G)                              // G: set of pages in the result set
{
    foreach p in G
        p.auth = 1; p.hub = 1;
    HubsAndAuthorities(G)
    {
        for (i = 0; i < k; i++)      // k iterations
        {
            foreach p in G           // authority: sum of hub scores of pages linking to p
                foreach q in p.IncomingNeighbors
                    p.auth += q.hub;
            foreach p in G           // hub: sum of authority scores of pages p links to
                foreach r in p.OutgoingNeighbors
                    p.hub += r.auth;
        }
    }
}
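A minimal Python version of these mutually recursive updates is sketched below, with the usual normalisation after each pass. The graph representation and the fixed iteration count are illustrative assumptions.

```python
import math

def hits(out_links, iterations=20):
    """out_links: dict mapping each page to the pages it links to.
    Returns (authority, hub) score dictionaries."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    in_links = {p: [q for q in out_links if p in out_links.get(q, [])] for p in pages}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of pages linking to p.
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub of p: sum of authority scores of pages p links to.
        hub = {p: sum(auth[q] for q in out_links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

authority, hubs = hits({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```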
-
EXPERIMENTAL SETUP
Experiments are conducted using the Lund dataset containing 100 attributes. Three categories of experiments are carried out for the performance measures, based on:
- The set of starting URLs with most links
- The set of starting URLs with the highest topic score
- The set of starting URLs with the lowest topic score

In the first experiment, the set of starting URLs was the first ten pages with most links.

Table I - URLs with most links

| Algorithm | # of pages visited | # of relevant pages visited |
| --- | --- | --- |
| Breadth First | 100 | 28.797 |
| Best First | 100 | 33.93 |
| Page Rank | 100 | 35.09 |
| Shark Search | 100 | 27.06 |
| HITS | 100 | 31.23 |

In the second experiment, the set of starting URLs was the first ten pages with the highest topic score.
Table II - URLs with the highest topic score

| Algorithm | # of pages visited | # of relevant pages visited |
| --- | --- | --- |
| Breadth First | 100 | 32.52 |
| Best First | 100 | 34.23 |
| Page Rank | 100 | 35.363 |
| Shark Search | 100 | 31.5 |
| HITS | 100 | 32.6 |
In the third experiment, the set of starting URLs was the first ten pages with the lowest topic score.

Table III - URLs with the lowest topic score

| Algorithm | # of pages visited | # of relevant pages visited |
| --- | --- | --- |
| Breadth First | 100 | 32.95 |
| Best First | 100 | 33.78 |
| Page Rank | 100 | 35.18 |
| Shark Search | 100 | 34.2 |
| HITS | 100 | 31.0 |
-
RESULTS AND DISCUSSION
Based on the dataset experiments, the following performance measures, namely precision, recall, accuracy and F-Score, are taken into account for the assessment of the algorithms [7].
Precision [10] is defined as the fraction of the documents retrieved that are relevant to the user's information need.

Precision = TP / (TP + FP)

Accuracy [10] is defined as the proportion of true results (both TP and TN) in the population.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

F-Score [10] is defined as the weighted harmonic mean of precision and recall.

F-Score = 2 * (Precision * Recall) / (Precision + Recall)
Recall [10] is defined as the fraction of the documents relevant to the query that are successfully retrieved.

Recall = TP / (TP + FN)
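All four measures follow directly from the confusion-matrix counts, as in the small Python helper below. The example counts passed at the end are illustrative only and are not taken from the experiments.

```python
def evaluation_scores(tp, fp, tn, fn):
    """Precision, recall, accuracy and F-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_score": f_score}

# Illustrative counts only.
print(evaluation_scores(tp=35, fp=65, tn=40, fn=43))
```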
Based on the experiment results shown in Table IV (a)-(e) below, an analysis is made of the various algorithms. In the first case (most links), precision is better in Page Rank, recall is better in Best First, accuracy is better in HITS and F-Score is better in Page Rank. In the second case (highest topic score), precision is better in Page Rank, recall is better in Best First, accuracy is better in Shark Search and F-Score is better in Page Rank. In the third case (lowest topic score), precision is better in Page Rank, recall is better in Best First, accuracy is better in Shark Search and F-Score is better in Page Rank.
A graphical view of the analysis is shown in Figures 1, 2, 3 and 4. It is also realized that there is a strong need to probe further into this in order to develop a robust crawler algorithm that performs better on all the above-mentioned counts.
Table IV (a) - Best First

| Best First | Precision | Recall | Accuracy | F-Score |
| --- | --- | --- | --- | --- |
| Most Links | 33.93 | 47 | 47.7 | 39.41 |
| Highest Topic Score | 34.23 | 47.7 | 48.34 | 39.85 |
| Lowest Topic Score | 33.78 | 47.1 | 47.03 | 39.34 |
| Average | 33.98 | 47.3 | 47.69 | 39.54 |
Table IV (c) - Shark Search

| Shark Search | Precision | Recall | Accuracy | F-Score |
| --- | --- | --- | --- | --- |
| Most Links | 27.05 | 41.23 | 44.25 | 32.66 |
| Highest Topic Score | 31.5 | 44.33 | 44.98 | 36.83 |
| Lowest Topic Score | 34.28 | 46.82 | 47.67 | 39.58 |
| Average | 30.94 | 44.13 | 45.63 | 36.36 |
Table IV (d) - Page Rank

| Page Rank | Precision | Recall | Accuracy | F-Score |
| --- | --- | --- | --- | --- |
| Most Links | 35.09 | 44.63 | 45.78 | 39.29 |
| Highest Topic Score | 35.363 | 45.37 | 46.41 | 39.75 |
| Lowest Topic Score | 35.181 | 46.21 | 47.12 | 39.95 |
| Average | 35.21 | 45.4 | 46.44 | 39.66 |
Table IV (e) - Breadth First

| Breadth First | Precision | Recall | Accuracy | F-Score |
| --- | --- | --- | --- | --- |
| Most Links | 28.8 | 40.4 | 43.12 | 33.61 |
| Highest Topic Score | 32.5 | 46 | 47.49 | 38.12 |
| Lowest Topic Score | 33 | 46 | 47.51 | 38.41 |
| Average | 31.43 | 44.3 | 46.04 | 36.71 |
Table IV (b) - HITS

| HITS | Precision | Recall | Accuracy | F-Score |
| --- | --- | --- | --- | --- |
| Most Links | 31.23 | 46.77 | 47.85 | 37.45 |
| Highest Topic Score | 32.6 | 46.47 | 47.53 | 38.32 |
| Lowest Topic Score | 30.78 | 45.63 | 46.86 | 36.76 |
| Average | 31.54 | 46.29 | 47.41 | 37.51 |
Graphical View of the Performance Measures
Figure 1: Performance measure by Precision
Figure 2: Performance measure by Recall
Figure 3: Performance measure by Accuracy
Figure 4: Performance measure by F-score
-
CONCLUSIONS
In this paper, an attempt has been made to improve the efficiency of the Web crawler, and thereby Web searching, by comparing certain features of several algorithms such as Best-First, Breadth-First, PageRank, Shark-Search and HITS. For this, various performance parameters such as precision, recall, accuracy and F-Score are taken into consideration. Based on the output parameters, it is observed that the PageRank algorithm outperforms the other algorithms on various performance measures. This result leads to the conclusion that the PageRank algorithm improves the performance of a Web crawler for quick information retrieval. Any efficient Web search algorithm for a Web crawler should focus on precision and F-Score for the betterment of Web search.
-
REFERENCES
Personalizing web search and crawling from click stream data.
[7] Rafael Romero Trujillo, "Simulation tool to study focused web crawling strategies", 2006.
[8] Roger P. Menezes and Soumen Chakrabarti, "Crawling the Web at Desktop Scales", IIT Bombay, 2004.
[9] K. Shchekotykhin and G. Friedrich, "xCrawl: a high recall crawling method for Web mining", University of Klagenfurt, Austria, Springer, Nov 2009.
[10] Accuracy, Precision, Recall and F-Score, Wikipedia, the free encyclopedia.