- Open Access
- Total Downloads : 142
- Authors : Kota Vamsee Krishna, Smita Deshmukh
- Paper ID : IJERTV4IS020727
- Volume & Issue : Volume 04, Issue 02 (February 2015)
- Published (First Online): 25-02-2015
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Search Results Re-ranking using User Goals
Kota Vamsee Krishna#1, Smita Deshmukh#2
#Information Technology, Terna Engineering College, Mumbai University Sector-22, Nerul, Navi Mumbai, Maharashtra, India
Abstract Before submitting any query to a search engine, every user has a specific goal in mind but that goal is not known to the search engine. Based on the query related matching information which is currently available in the database, search engine will display the results and user will have to scan through them to find out the website of his interest. From user experience perspective, this is a very time consuming activity. User would be happy to see the displayed search results to be categorized into various groups (such as Business, Technology, Sports etc..) and the results in each group to be arrangedstarting from the individual users most visited to least visited website. This project is about a desktop-based application which will re-rank the display of search results, for end users, using user goals. Users will have to register themselves and create user id and password. Considering one login/logout as a single session, usage logs will be captured, feedback sessions and pseudo-documents will be generated to re-rank and optimize display of search results. The performance of this appliation will be evaluated using Classified Average Precision (CAP) factor.
Keywords User search goals, feedback sessions, pseudo- documents, re-ranking search results, classified average precision.
-
INTRODUCTION
In todays world, if anyone needs any information then the first thing that they do is to search for it in any of the popular search engines using Internet. Some of the popular search engines are Google, Microsoft Bing etc.. Let us call the information that user is looking for as a Query. Before entering the query in search engine, every user has a goal (i.e. specific information) in mind that theywant to achieve. But the search engine does not know what exactly the user is looking for, so it will display all the available URLs belonging to various domains without any categorization. The user has to scan through all the displayed URLs one-by-one until the desired URL is identified.
Practically every user wants the required/relevant/wanted/needed information to be quickly displayed to save time and this would be possible only if there is an improvement in the display of search results. This gives rise to the need for developing a desktop based application which will analyse usage logs, produce feedback sessions, map the feedback sessions to pseudo-documents, cluster the pseudo-documents, infer user search goals, re-rank the search results whenever user submits the same/similar query next time, evaluate the performance of the re-ranked results using Classified Average Precision(CAP) and further optimize it.
-
RELATED WORK
The authors of reference paper [1],propose a unique approach to conclude user search goals by adeep analysis of search engine query logs. The work in reference paper [2],talks about text classification algorithms used to automatically classify random search results into an existing category structure as a continuous on-going activity. Organizing search results will allow a user to concentrate on URLs in the user required categories.Further, the reference paper [3] discusses about organization ofthe web search results into a cluster and thusfacilitate a user in quick browsing of the search results. The reference paper [4] highlights regarding effective organizing of search results as a critical activityso as to improvise the utility of any search engine. To navigate through relevant documents easily, clustered search results prove to be efficient.Displaying the relevant documents in the order starting from most relevant to least relevant is done by a good information retrieval system and this is explained inreference paper [5]. The goal is to develop a method that utilizes click-through data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking.As per reference paper [6] depending on the clicks relative preferences are obtained which are averagely accurate. The work in reference paper [7] presentsa method which is based on query clustering process. It facilitatesgrouping of semantic similar queries which could be identified. Referring to the work in reference paper no. [8],it proposes and evaluates a method for auto- segmentation of users query streams into distinct units.Reference paper [9]states that query substitution is generation of a new query inorder to replace the user's old query.The work in reference paper [10] presentsincrease of precision retrieval, the new search engines input manually verified answers to frequently asked queries.
-
FRAMEWORK
The framework is divided into two parts. The upper part is to generate the feedback session and pseudo-document. The lower part is display re-ranked search results, whichare obtained from original search results.User search goals can be inferred from the upper part.
Fig.2: Mapping Feedback session to Pseudo-document
Producing a pseudo-document includes two steps:
-
Using URLs in the feedback session. Each URLs title andsnippet are represented by a Term Frequency- InverseDocument Frequency (TF-IDF) vector [1], respectively, as in
Fig1: Framework
-
-
FEEDBACK SESSIONS AND PSEUDO
DOCUMENTS
-
Feedback Session. A feedback session consists of both clickedand unclicked URLs and ends with the last URL
Tui = [tw1, tw2, ., twn]T
Sui = [sw1, sw2, ., swn]T (1) where Tui and Sui are the TF-IDF vectors of the URLs titleand snippet, respectively. ui means the ith URL in thefeedback session. Andwj(j=1,2,,n) is the jth termappearing in the enriched URLs. twj and swjrepresent the TF-IDF value of the jthterm in the URLs title and snippet, respectively. Consideringthat URLs titles and snippets have different significances,we represent the enriched URL by the
that wasclicked in a single session.
weightedsum of Tuiand Sui, namely
Fui = wtTui+wsSui = [fw1, fw2, , fwn]T
(2)
Fig.2: Feedback Session
Referring to Fig.2, 0 means Unclicked URL. All 10 URLs are a part of single session. The rectangular represents a feedback session.
-
Pseudo Documents
Users generally tend to submit either a single word or may be few words to a search engine. Collectively, a set of these keywords can be called as goal texts or pseudo-documents.
-
Map Feedback Sessions to Pseudo-Documents
The feedback session consists of URLs, title and snippet. Using text processing techniques such as stemming, commonly used words such as nouns can be extracted as keywords and stored in pseudo-documents. Next time whenever user submits the same keyword which is present in pseudo-document, all the related URLs captured in feedback session will get displayed. But our aim is to further re- rankthese results into groups and to re-rank URLs in each group, for that clustering will be needed.
where Fui means the feature representation of the ith URLin the feedback session, and wtand ws are the weights of thetitles and the snippets, respectively. We set wsto be 1 at first.Then, we stipulate that the titles should be more significantthan the snippets. Therefore, the weight of the titles shouldbe higher and we set wt to be 2 in this paper. We also tried toset wtto be 1.5, the results were similar. Based on (2), thefeature representation of the URLs in the feedbacksessioncan be obtained. It is worth noting that although Tui and Suiare TF- IDF features, Fui is not a TF-IDF feature. Each term of Fui (i.e., fwj ) indicatesthe importance of a term in the ith URL.
2) Forming pseudo-document based on URL representations.Let Ffs be the feature representation of a feedback session,and ffs(w) be the value for the term w. Let Fum(m=1, 2,,M) and Ful(1= 2,,L) be the featurerepresentationsof the clicked and unclicked URLs in this feedbacksession, respectively. Let fucm(w) and ful(w) be the valuesfor the term w in the vectors. We want to obtain such aFfs that the sum of the distances between Ffs andeach Fucm is minimized and the sum of the distancesbetween Ffs and each Ful is maximized. Based on theassumption that the terms in the vectors are independent,we can perform optimization on each dimension independently,as shown in
Ffs = [ffs(w1), ffs(w2),,ffs(wn)]T
Ffs(w) = arg minffs(w) {M[ffs(w)-fucm(w)]2 – L[ffs(w)- ful(w)]2}, ffs(w)Ic(3)
Let Ic be the interval [µfuc(w)-fuc(w), µfuc(w)+fuc(w)] and Ibe the interval [µfu(w)-fu(w), µfu(w)+fu(w)], where
µfuc(w) and fuc(w) represent the mean and mean square error of fu(w), respectively. If IccI or IcIc, we consider that the
user does not care about the term w.In this situation, we set ffs(w) to be 0, as shown in
Ffs(w) = 0, IccI or IcIc (4)
is a parameter balancing the importance of clicked andunclicked URLs. When in (3) is 0, unclicked URLs are nottaken into account. On the other hand, if is too big,unclicked URLs will dominate the value of ffs(w). In thispaper, we set to be 0.5.
-
-
IMPROVED USER SEARCH RESULTSUSING
PSEUDO-DOCUMENTS
Using pseudo-documents we can predict user search goals and thus obtain improved user search results.
As in (3) and (4), each feedback session is representedbya pseudo-document and the feature representation of thepseudo-document is Ffs. The similarity between twopseudo-documents is computed as the cosine score of Ffsiand Ffsj , as follows:
Simi,j = cos(Ffsi , Ffsj) = (Ffsi . Ffsj) / (|Ffsi|.|Ffsj|) (5) And the distance between two feedback sessions is
Disi,j = 1 Simi,j (6)
We cluster pseudo-documents by K-means clusteringwhich is simple and effective. Since we do not know theexact number of user search goals for each query, we set Kto be five different values (i.e., 1, 2,.,5) and performclustering based on these five values, respectively.
k=1
fsk
After clustering all the pseudo-documents, each clustercan be considered as one user search goal. The center pointof a cluster is computed as the average of the vectors of allthe pseudo-documents in the cluster, as shown in
not suitable for evaluating the re-structured or clustered searching results.
-
Voted Average Precision (VAP).Referring to fig.2, the URLs in a single feedback session are re-structured into two classes where the un-boldfaced ones are clustered into class 1 and boldfaced ones are clustered into class 2. Voted Average Precision (VAP) is the average precision of the class including more clicks namely votes. If the numbers of the clicks in two classes are the same, the bigger AP is selected as the VAP. Assume that one user has only one search goal, then ideally all the clicked URLs in a single session should belong to one class. And a good restructuring of search results should have higher VAP.
The URLs in the single session are restructured into two classes where the un-boldfaced ones are clustered into class1 and boldfaced ones are clustered into class 2.
For example, the VAP of the restructured search results is the AP of class 1, calculated by: VAP = (1/3)*[(1/1)+(2/2)+(3/6)] = 0.833.
However, VAP is still an unsatisfactory criterion. Considering an extreme case, if each URL in the click session is categorized into one class, VAP will always be the highest value namely 1 no matter whether users have so many search goals or not. Therefore, there should be a risk to avoid classifying search results into too many classes by error.To overcome the limitations of VAP, CAP is proposed.
-
Classified Average Precision (CAP).CAP selects the Average Precision of the class that user is interested in (i.e., with the most clicks/votes) and takes the risk of wrong classification into account. The influence of Risk on CAP is adjusted, which can be learned from training data. Finally, we
Fcenteri=
CiF
/C , (F
c Clusteri)(7)
utilize CAP to evaluate the performance of restructuring
i
fsk
where Fcenteri is the ith clusters center and Ci is the numberof the pseudo-documents in the ith cluster. Fcenteri isutilized to conclude the search goal of the ith cluster.
Keywords having highest values in the centre are the user search goals. They can also be used in future query recommendations.
-
-
EVALUATION
We will use CAP to evaluate re-ranked results inorder to select the best cluster number.
-
Re-ranking Web Search Results:Web search results are re- ranked by categorizing URLs of original search results into groups (Business, Technology, Sports, etc).
-
Evaluation Criterion:From user click-through logs, implicit relevance feedbacks can be obtained, namely Clicked URL means relevant and Unclicked URL means irrelevant.
-
Average Precision.A possible evaluation criterion is the average precision (AP) which evaluates according to user implicit feedbacks.
-
search results.
Considering another extreme case, if all the URLs in the search results are categorized into one class, Risk will always be the lowest namely 0; however, VAP could be very low. Generally, categorizing search results into less clusters will induce smaller Risk and bigger VAP, and more clusters will result in bigger Risk and smaller VAP. The proposed CAP depends on both of Risk and VAP.
CAP = VAP × (1 Risk) (9)
is used to adjust the influence of Risk on CAP, which can be learned from training data.
m
Risk = i,j=1(i<j)m dij / C 2 (10)
It calculates the normalized number of clicked URL pairs that are not in the same class, where m is the number of the clicked URLs. If the pair of the ith clicked URL and the jth clicked URL are not categorized into one class, dij will be 1; otherwise, it will be 0.
C2m = m(m-1)/2 is the total number of the clicked URL pairs. Then, the risk in can be calculated by: Risk = 3/6
= 1/2 = 0.5. The proposed CAP depends on both Risk and
AP = (1/N+) [
Nrel(r).(R /r) ] (8)
VAP.
r=1 r
where N+ is the number of clicked documents in the retrieved ones, r is the rank, N is the total number of retrieved documents, rel() is a binary function on the relevance of a given rank, and Rr is the number of relevant retrieved documents of rank r or less. For example, considering the single feedback shown in Fig.2and we can compute AP as: { (1/4) * [(1/2)+(2/3)+(3/7)+(4/9)]} = 0.510. However, AP is
-
-
EXPERIMENTS
To implement the desktop based application, Google API was downloaded and added to the reference library so that application could fetch URLs from Google.
As a first step, user registered by clicking on Create New Account, filled-in all the details and thus create a user id and password.
created and next time when the same user (using same user id and password) logged-in and made the same search, improved/re-ranked search results got displayed as shown in Fig.6.
Fig.3: Login page
Fig.4: Registration page
Fig.5: Display of original search results
Table 1: Pseudo-document keywords for the user query Sun
Fig.6: Display of Improved Search Results using Pseudo-documents
As shown in Fig.5, for the first time when user made a search, using Google API, original results as displayed in Google were shown. These URLs were captured as usage logs, feedback session was generated, pseudo docment was
Considering = 0.7 in (9), let us compare our method with the existing method for 10 random queries. The existing method is from reference paper[20].
Fig.7: Our method Vs. Existing method
As shown in Fig.7, CAP of our method is better.
-
CONCLUSION
In reality, our approach can discover user search goals for some popular queries offline at first. Then, when users submit one of the queries, the search engine can return the results that are categorized into different groups according to user search goals online. Thus, users can find what they want conveniently.
REFERENCES
-
Zheng Lu, Hongyuan Zha, Xiaokang Yang, Weiyao Lin and Zhaohui Zheng, A New Algorithm for Inferring User Search Goals with Feedback Sessions, IEEE transactions on Knowledge and Data Engineering, Vol.25, no.3, pp. 502-513, 2013.
-
H. Chen and S. Dumais, Bringing Order to the Web: Automatically Categorizing Search Results, Proc. SIGCHI Conf. Human Factors in Computing Systems (SIGCHI 00), pp. 145-152, 2000.
-
H.-J Zeng, Q.-C He, Z. Chen, W.-Y Ma, and J. Ma, Learning to Cluster Web Search Results, Proc. 27th Ann. Intl ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR 04), pp. 210-217, 2004.
-
X. Wang and C.-X Zhai, Learn from Web Search Logs to Organize Search Results, Proc. 30th Ann. Intl ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR 07), pp. 87-94, 2007.
-
T. Joachims, Optimizing Search Engines Using Clickthrough Data, Proc. Eighth ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining (SIGKDD 02), pp. 133-142, 2002.
-
T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay,
Accurately Interpreting Clickthrough Data as Implicit Feedback, Proc. 28th Ann. Intl ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR 05), pp. 154-161, 2005.
-
R. Baeza-Yates, C. Hurtado, and M. Mendoza, Query Recommendation Using Query Logs in Search Engines, Proc. Intl Conf. Current Trends in Database Technology (EDBT 04), pp. 588-596, 2004.
-
R. Jones and K.L. Klinkner, Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs, Proc. 17th ACM Conf. Information and Knowledge Management (CIKM 08), pp. 699-708, 2008.
-
R. Jones, B. Rey, O. Madani, and W. Greiner, Generating Query Substitutions, Proc. 15th Intl Conf. World Wide Web (WWW 06), pp. 387-396, 2006.
-
J.-R Wen, J.-Y Nie, and H.-J Zhang, Clustering User Queries of a Search Engine, Proc. Tenth Intl Conf. World Wide Web (WWW 01), pp. 162-168, 2001.