
Structured Query Formulation and Result Organization for Session Search

A Thesis submitted to the Faculty of the Graduate School of Arts and Sciences of Georgetown University in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

By Dongyi Guan

Washington, DC
April 22, 2013

Copyright © 2013 by Dongyi Guan
All Rights Reserved

Structured Query Formulation and Result Organization for Session Search

Dongyi Guan

Thesis Advisor: Dr. Grace Hui Yang

Abstract

Complicated search tasks such as making a travel plan usually require more than one search query. A user interacts with a search engine for multiple iterations, which we call a session. Session search is the task of document retrieval within a session. A session often involves a series of interactions between the user and the search engine. To make use of all the queries and various interactions in a session, we propose an effective structured query formulation method for session search. By identifying phrase-like textual nuggets, we investigate different degrees of importance for phrases in queries, aggregate them to create a highly effective session-wise query, and send it to a state-of-the-art search engine to retrieve relevant documents. Our system participated in the TREC 2012 Session track evaluation and won the second position in whole session search (RL2-RL4).

A second main contribution of this thesis is to increase the stability of result organization for session search. Search result clustering (SRC) hierarchies are widely used in organizing search results. These hierarchies provide users with overviews of their search results. Search result organization is usually sensitive to even slight changes in queries. Within a session, queries are related; hence the search result organization should be related as well and should maintain a more stable representation. We propose two monothetic concept hierarchy approaches that exploit external knowledge to build more stable SRC hierarchies for session search. One approach corrects erroneous relations generated by Subsumption, a state-of-the-art concept hierarchy construction approach. The other employs external knowledge to build SRC hierarchies directly. Evaluations show that our approaches generate statistically significantly more stable search result organizations while keeping the organization in good quality.

Index words: Information retrieval, session search, structured query, search result organization

Acknowledgments

This thesis would not have been completed without the guidance and the help of the people who contributed and extended their valuable assistance in the preparation and completion of this study.

First and foremost, I would like to express my utmost gratitude to my advisor Dr. Grace H. Yang for her continuous support and inspiring instruction throughout my study and research. Dr. Grace H. Yang is a great advisor with patience, motivation, enthusiasm, and immense knowledge. I would also like to thank her for encouraging and helping me to shape my interests and ideas.

Besides my advisor, I am deeply grateful to the rest of my thesis committee, Dr. Lisa Singh and Dr. Calvin Newport, for their insightful comments and high quality questions.

My sincere thanks also go to the professors at Georgetown University for their great support and kind help: Dr. Ophir Frieder, Dr. Evan Barba, Dr. Eric Burger, Dr. Der-Chen Chang, Dr. Jeremy Fineman, Dr. Nazli Goharian, Dr. Bala Kalyanasundaram, Dr. Mark Maloof, Dr. Jami Montgomery, Dr. Micah Sherr, Dr. Clay Shields, Dr. Richard Squier, Dr. Mahendran Velauthapillai, and Dr. Wenchao Zhou.

I also thank my friends Yifan Gu, Jiyun Luo, Jon Parker, Henry Tan, Amin Teymorian, Chris Wacek, Yifang Wei, Andrew Yates, and Sicong Zhang, for the stimulating discussions and the sleepless nights we worked together before deadlines.

I owe my warm thanks to my family for their continuous love and support of my decisions. My parents always give me advice to help me get through the difficult times. I am so grateful to my fiancée, whose love and unconditional support allowed me to finish this journey.

Finally, I would like to dedicate this work to my late Grandma, who left us too soon. I hope that this work makes her proud.

Table of Contents

Chapter 1  Introduction
  1.1  Motivation
  1.2  Session Search
       1.2.1  Overview
       1.2.2  Query Formulation
       1.2.3  Search Result Organization
  1.3  Challenges
       1.3.1  Challenges in Query Formulation for Session Search
       1.3.2  Challenges in Result Organization for Session Search
  1.4  TREC Session Tracks
  1.5  Our Approaches
       1.5.1  Structured Query Formulation for Session Search
       1.5.2  Stable Search Result Organization by Exploiting External Knowledge
  1.6  Contributions of this Thesis
  1.7  Outline

Chapter 2  Related Work
  2.1  Session Search and TREC Session Tracks
  2.2  Query Formulation
  2.3  Search Result Organization
       2.3.1  Hierarchical Clustering
       2.3.2  Subsumption
       2.3.3  Exploiting External Knowledge

Chapter 3  Effective Structured Query Formulation for Session Search
  3.1  Identifying Nuggets and Formulating Structured Queries
       3.1.1  The Strict Method
       3.1.2  The Relaxed Method
  3.2  Query Aggregation within a Session
       3.2.1  Aggregation Schemes
  3.3  Query Expansion by Anchor Text
  3.4  Removing Duplicated Queries
  3.5  Document Re-ranking
  3.6  Evaluation for Session Search
       3.6.1  Datasets, Baseline, and Evaluation Metrics
       3.6.2  Results for TREC 2011 Session Track
       3.6.3  Results for TREC 2012 Session Track
       3.6.4  Official Evaluation Results for TREC 2012 Session Track
  3.7  Chapter Summary

Chapter 4  Increasing Stability of Result Organization for Session Search
  4.1  Utilizing External Knowledge to Increase Stability of Search Result Organization
       4.1.1  Identifying Reference Wikipedia Entries
       4.1.2  Improving Stability of Subsumption
       4.1.3  Building Concept Hierarchy Purely Based on Wikipedia
  4.2  Evaluation for Search Result Organization
       4.2.1  Hierarchy Stability
       4.2.2  Hierarchy Quality
  4.3  Chapter Summary

Chapter 5  Conclusion
  5.1  Research Summary
  5.2  Significance of the Thesis
  5.3  Future Directions

Bibliography

List of Figures

1.1  Typical procedure of session search.
1.2  Retrieved documents by Lemur (TREC 2011 Session 25). The top document only describes the symptoms and treatments for communicable diseases, which is not relevant to the session topic collagen vascular disease.
1.3  Search result clustering (SRC) hierarchies by Yippy (TREC 2010 Session 123). SRC hierarchies (a) and (b) are for the queries diet and low carb diet respectively. A low carb diet, South Beach Diet, that should have appeared in both (a) and (b) is missing in (b); the cluster Diet And Weight Loss in (a) is dramatically changed in (b). Screenshot was snapped at 15:51 EST, 6/15/2012 from Yippy.
3.1  A sample nugget in the TREC 2012 session 53 query servering spinal cord paralysis.
3.2  Words in a snippet built from the TREC 2012 session 53 query servering spinal cord consequenses, where spinal is always connected to cord.
3.3  Words in a snippet built from the TREC 2011 session 20 query dooney bourke purses, where dooney and bourke is a brand name but the user omits the word and.
3.4  ndcg@10 values of retrieved documents using the TREC 2011 Session track dataset. Two cases, with threshold and without threshold, are compared.
3.5  Anchor text in a web page.
3.6  Changes in ndcg@10 from RL1 to RL2 presented by the TREC 2012 Session track. Error bars are 95% confidence intervals (Figure 1 in [26]).
3.7  All results by ndcg@10 for the current query in the session for each subtask (Table 2 in [26]).
4.1  Framework overview of the Wikipedia-enhanced concept hierarchy construction system.
4.2  Mapping to a relevant Wikipedia entry. Text in circles denotes Wikipedia entries, while text in rectangles denotes concepts. Based on the context of the current search session, the entry Gestational diabetes is selected as the most relevant Wikipedia entry. Therefore the concept GDM is mapped to Gestational diabetes, whose supercategories are Diabetes and Health issues in pregnancy.
4.3  An example of Wikipedia-enhanced Subsumption. The concepts Diabetes and type 2 diabetes satisfy Eq. (4.5) and are identified as a potential subsumption pair. The reference Wikipedia entry of Diabetes is a category, and the reference Wikipedia entry of type 2 diabetes is the Wikipedia entry Diabetes mellitus type 2. Therefore we check whether Diabetes is one of the supercategories of Diabetes mellitus type 2 and confirm that diabetes subsumes type 2 diabetes.
4.4  An example of Wikipedia-only hierarchy construction. From the concept Diabetes mellitus we find the reference Wikipedia entry Diabetes mellitus, then we find its start category Diabetes. Similarly, for another concept joslin, we find its reference Wikipedia entry Joslin Diabetes Center and its start category Diabetes organizations. We then expand from these two start categories. Diabetes organizations is one of the subcategories of Diabetes, thus we merge them together.
4.5  Major clusters in hierarchies built by Clusty for TREC 2010 session 3. (a) is for the query diabetes education and (b) is for diabetes education videos books.
4.6  Major clusters in hierarchies built by Wiki-only for TREC 2010 session 3. (a) is for the query diabetes education and (b) is for diabetes education videos books.
4.7  Major clusters in hierarchies built by Subsumption for TREC 2010 session 3. (a) is for the query diabetes education and (b) is for diabetes education videos books.
4.8  Major clusters in hierarchies built by Subsumption+Wiki for TREC 2010 session 3. (a) is for the query diabetes education and (b) is for diabetes education videos books.
4.9  Search result organization quality improvement vs. stability for Subsumption and Subsumption+Wiki.
4.10 Extreme case 1: a totally static hierarchy for two queries in a session (TREC 2010 session 107).
4.11 Extreme case 2: a totally different hierarchy for two queries in a session (TREC 2010 session 75).

List of Tables

3.1  ndcg@10 for TREC 2011 Session track RL1. The Dirichlet smoothing method is used; µ = 4000, f = 10 for the strict method and µ = 4000, f = 20 for the relaxed method. Methods are compared to the baseline (the original query). A significant improvement over the baseline is indicated with † at the p < 0.05 level and with ‡ at a stricter level (t-test, single-tailed). The best run and median run in TREC 2011 are listed for comparison.
3.2  ndcg@10 for TREC 2011 Session track RL2. The Dirichlet smoothing method and the strict method are used; µ = 4000, f = 5 for uniform, and µ = 4500, f = 5 for previous vs. current (PvC) and distance-based. Methods are compared to the baseline (the original query). A significant improvement over the baseline is indicated with † at the p < 0.05 level and with ‡ at a stricter level (t-test, single-tailed). The best run and median run in TREC 2011 are listed for comparison.
3.3  ndcg@10 for TREC 2011 Session track RL3 and RL4. All runs use the strict method and the configuration µ = 4500, f = 5. Methods are compared to the baseline (the original query). A significant improvement over the baseline is indicated with † at the p < 0.05 level and with ‡ at a stricter level (t-test, single-tailed). The best run and median run in TREC 2011 are listed for comparison.
3.4  Methods and parameter settings for the TREC 2012 Session track. µ is the Dirichlet smoothing parameter; f is the number of pseudo-relevance feedback documents.
3.5  ndcg@10 for the TREC 2012 Session track. The mean of the median evaluation results in TREC 2012 is listed.
3.6  AP for the TREC 2012 Session track. The mean of the median evaluation results in TREC 2012 is listed.
4.1  Statistics of the TREC 2010 and TREC 2011 Session track datasets.
4.2  Stability of search result organization for TREC 2010 Session queries. Approaches are compared to the baseline, Subsumption. A significant improvement over the baseline is indicated with † at p < 0.05 and with ‡ at a stricter level (t-test, single-tailed).
4.3  Stability of search result organization for TREC 2011 Session queries. Approaches are compared to the baseline, Subsumption. A significant improvement over the baseline is indicated with † at p < 0.05 and with ‡ at a stricter level (t-test, single-tailed).

Chapter 1
Introduction

1.1 Motivation

Complicated search tasks, such as planning a trip, buying a product, or looking for a good elementary school, are common in our daily life. These tasks often contain multiple sub-topics, so they require more than one query. A user usually interacts with a search engine when performing complicated search tasks. These interactions form a session. Session search is the task of document retrieval within a session.

Major Web search engines, including Google and Bing, return a list of documents ranked in decreasing order of relevance to a single query. However, this representation may not fully satisfy users, because a complicated information need may contain multiple sub-topics, and documents relevant to different sub-topics may be mixed together in the returned document list. If a search engine organizes the search results into a hierarchical representation that explicitly shows the sub-topics emerging in the documents, a user will probably be able to locate the needed information in the search results more easily and more efficiently.

For example, if a user is preparing an article about the Pocono Mountains region, he or she would want to search for information about many sub-topics of the region, such as national parks, resorts, and shopping. It is not easy to retrieve information about all these aspects with one query.

Figure 1.1: Typical procedure of session search.

Consequently, the user may begin with a query pocono mountain region and then turn to queries like pocono mountains region things to do, pocono mountains region activities, and pocono mountains region national park to search the sub-topics. Because the queries are about different sub-topics, the user may expect a system that organizes the relevant documents into a hierarchical structure, placing documents about different sub-topics in different groups such as activities or national park.

1.2 Session Search

Session search is a field devoted to finding documents relevant to a session. A system that supports session search accepts an entire session, which includes a series of previous queries with corresponding previous search results and a current/last query, and retrieves documents relevant to the topic of the session.

1.2.1 Overview

Figure 1.1 shows a typical procedure for session search. The interaction between a user and a search engine can be represented as a session. The session contains a series of previous queries q_1, q_2, ..., q_{n-1}, each associated with a set of relevant documents, i.e., previous results D_1, D_2, ..., D_{n-1}, and a current/last query q_n. The search engine usually formulates a query that represents the entire session. It then applies retrieval models to the formulated query and an indexed corpus to retrieve relevant documents. After retrieval, the search engine presents them to the user. The user may be satisfied with the search results and finish the procedure, or may be unsatisfied and modify the session to re-retrieve documents. This thesis studies two crucial components in this procedure: query formulation and search result organization.

1.2.2 Query Formulation

Query formulation is important because the retrieval model directly relies on the formulated query. The system can retrieve more relevant documents if the formulated query represents the topic of the session more accurately. Structural formulation for queries, such as combining terms, assigning weights to terms, or query expansion, focuses on the underlying meanings in queries. Structured queries identify the concepts in queries and emphasize the important concepts as individual atoms. In other words, structured queries express user intentions more precisely, so as to retrieve relevant documents more effectively.

In the example in Section 1.1, there are multiple concepts in the session: pocono mountains region, things to do, activities, and national park. Since pocono mountains region appears in every query, it is probably much more important than the others.

Therefore, the search engine can build a structured query that assigns a higher weight to pocono mountains region to express the importance of that concept.

1.2.3 Search Result Organization

A clear organization of search results gives users an overview of the results and may help users discover their further information needs effectively. Since the results of a session search often cover multiple aspects, a search engine is friendly to the user if it applies search result clustering to organize the search results into hierarchies, which we call SRC hierarchies. SRC hierarchies support better information access by improving the display of information. Search results are presented in a "lay of the land" format, which presents similar results together and reveals important concepts in lower ranked results.

In the example in Section 1.1, an SRC hierarchy is appropriate for organizing the search results because the documents relevant to the topic contain multiple sub-topics, and some sub-topics can be further divided. For example, the query pocono mountains things to do may return documents that can be divided into more detailed groups such as hiking or camping.

1.3 Challenges

The complexity of session search lays great challenges in front of researchers, especially in the two crucial components of the session search procedure: query formulation and search result organization.

Figure 1.2: Retrieved documents by Lemur (TREC 2011 Session 25). The top document only describes the symptoms and treatments for communicable diseases, which is not relevant to the session topic collagen vascular disease.

1.3.1 Challenges in Query Formulation for Session Search

Words within a query may form phrases that express coherent meanings, or concepts. A word group may describe a topic different from that of every single word, and the words in the group may be more important than the rest. Furthermore, a session contains multiple queries, some of which are more important than others for expressing the topic of the session. If a search engine treats all the words in a session individually and identically, it could rank documents relevant to every single word higher. However, documents relevant to single words are not necessarily relevant to the topic of the query. This may decrease search accuracy.

Figure 1.2 shows an example of directly submitting all the words of a session's queries to Lemur, a powerful search engine. The session is composed of three queries: collagen vascular disease causes symptoms treatments effects, CVD causes symptoms treatments, and collagen vascular disease causes symptoms treatments. As we can see, the search engine processes collagen, vascular, and disease as separate words. Moreover, the common words disease, symptoms, and treatments, which occur repeatedly in all the queries, heavily bias the search results, giving high ranks to documents related to symptoms and treatments of other diseases. Consequently, relevant documents about the topic collagen vascular disease do not appear in the top retrieval results.

Session search thus lays two challenges in front of us: (1) how to identify word groups that express unit coherent meanings, that is, concepts, in queries within a session; (2) how to formulate these word groups into a structured query according to their importance.

1.3.2 Challenges in Result Organization for Session Search

SRC hierarchies (see an example in Figure 1.3) are suitable for organizing the search results of a regular search. However, most SRC hierarchies created by state-of-the-art algorithms are overly sensitive to minor query changes, regardless of whether the queries are similar and belong to the same session. Such minor query changes often occur within a session. For instance, about 38.6% of adjacent queries in TREC Session tracks show only a one-word change and 26.4% show a two-word change.

Figure 1.3 shows hierarchies generated by Yippy for the adjacent queries diet and low carb diet. The second query low carb diet is a specification of the first.

Figure 1.3: Search result clustering (SRC) hierarchies by Yippy (TREC 2010 Session 123). SRC hierarchies (a) and (b) are for the queries diet and low carb diet respectively. A low carb diet, South Beach Diet, that should have appeared in both (a) and (b) is missing in (b); the cluster Diet And Weight Loss in (a) is dramatically changed in (b). Screenshot was snapped at 15:51 EST, 6/15/2012 from Yippy.

We observe many changes between the two SRC hierarchies (a) and (b). Overall, hierarchies (a) and (b) share only 4 common words, weight, loss, review, and diet, and 0 common pair-wise relations. This is a very low overlap given that these two queries are closely related and within the same session.

The dramatic change, a.k.a. instability, of SRC hierarchies for a session search weakens their functionality as an information overview.

With rapidly changing SRC hierarchies, users may perceive them as random search result organizations, and it is difficult to re-find relevant documents identified for previous queries. We argue that although SRC hierarchies should not be static, while making changes they should maintain the basic topics and structures across the entire session. Ideally, SRC hierarchies should not only be closely related to the current query and its search results but also reflect changes in adjacent queries to the right degree and at the right places. In this work, we address this new challenge of producing stable SRC hierarchies for session search.

1.4 TREC Session Tracks

The National Institute of Standards and Technology (NIST) held TREC Session tracks [24, 25, 26] for three years, from 2010 to 2012. TREC Session tracks aim to test whether IR systems can improve search accuracy with the assistance of previous queries and the corresponding user interactions. The session data was composed of sequences of queries q_1, q_2, ..., q_{n-1}, and q_n, with only the current (last) query q_n being the subject for retrieval; q_1, q_2, ..., q_{n-1} were correspondingly named previous queries. NIST invited faculty, staff, and students at the University of Sheffield as users to generate session queries. In addition to these queries, NIST provided a user interaction for each previous query in a session. A user interaction contained a ranked document list that was retrieved for a previous query and user-click information, including the click order, start time, and end time. The TREC participants (we are one of them) were requested to submit their retrieval results as ranked document lists. NIST assessors evaluated the submissions. TREC released official evaluation results every year.

1.5 Our Approaches

In this work, we tackle challenges in two crucial components of session search: query formulation and search result organization. We formulate structured queries for sessions to improve search accuracy. We also propose to build stable and high quality SRC hierarchies for session search.

1.5.1 Structured Query Formulation for Session Search

Observation shows that a query often contains phrases that describe a coherent meaning as a group. For example, the query russian politics kursk submarine (TREC 2012, session 18) contains two phrases, russian politics and kursk submarine, each of which expresses a concept and cannot be split. Phrases are usually more related to the topic of a session and hence more important than single words. A structured query can represent the phrases in a query. Therefore, we focus on formulating effective structured queries for search tasks within a session.

In order to represent phrases, we introduce the nugget, a substring of a query whose terms frequently occur together. We propose to identify nuggets by examining, in the pseudo-relevance feedback, the distance between terms that are adjacent in a query. Two rules, named strict and relaxed, are applied when calculating the term distance.

We can generate a set of terms and nuggets from every query in a session. However, the importance of these queries is not identical. We combine the terms and nuggets from every query into one structured query using different aggregation schemes. We compare three schemes: uniform, previous vs. current, and distance-based. The schemes are designed based on the order of queries in a session.

Our approach includes query expansion and document re-ranking as well. The top k terms in the anchor texts of the pseudo-relevance feedback are extracted to expand the structured query. We then re-rank the retrieved documents by comparing them to the clicked documents in the user interactions, using the dwell times as weights.

1.5.2 Stable Search Result Organization by Exploiting External Knowledge

External knowledge sources such as Wikipedia and WordNet are compiled manually. Therefore, they are widely used to enhance automatic information retrieval. Correct relations between concepts are crucial for generating high quality SRC hierarchies. We apply external knowledge as a reference to build relations between concepts. We choose Wikipedia in this work because it contains extensive definitions for concepts and relations, represented by links and categories. Wikipedia is used in two ways: (1) fixing the incorrect relations generated by an existing approach, which we name Subsumption+Wiki; (2) extracting the category information to build the concept hierarchies directly, which we name Wiki-only.

The issue of unstable SRC hierarchies might occur for various reasons, of which the most significant is the popular bottom-up clustering strategy. In contrast, monothetic concept hierarchy approaches first extract the labels (or concepts) from retrieved documents and then organize these concepts into hierarchies. Since labels are obtained before clusters are formed, they are not derived from the clusters. Monothetic concept hierarchy approaches hence produce more stable hierarchies than clustering approaches. Therefore, we build our system on the monothetic concept hierarchy approach.

In both methods that exploit Wikipedia, we extract a set of concepts from the document set, i.e., the search results.

In the first method, we apply an existing approach to draw the possible parent-child relations between pairs of concepts. Then we identify the location of each pair of concepts in the category network of Wikipedia and filter out the incorrect relations. In the second method, for each concept we identify the most relevant page in Wikipedia and extract the category structure from this Wikipedia page. The category structures for all concepts are merged to build the SRC hierarchies.

1.6 Contributions of this Thesis

This thesis focuses on improving search accuracy for session search and building stable SRC hierarchies for queries in a session. By combining the nugget approach and aggregation schemes, a structured query represents the topic of a session more accurately. In addition, our approach integrates external knowledge into the monothetic concept hierarchy algorithm and significantly increases the stability of SRC hierarchies without loss of quality. The specific contributions are: 1) we propose an approach that introduces the concept of a nugget to formulate a session into a structured query; 2) we propose an efficient method to predict a window size for a nugget; 3) we present two effective approaches that organize search results into SRC hierarchies of high stability and high quality; 4) we evaluate the stability of concept hierarchies built by monothetic concept hierarchy approaches and by clustering approaches over the datasets of the TREC Session tracks.

We propose to formulate a structured query to represent the topic of a session precisely. We try to find the phrases in queries, which express atomic meanings. In particular, we introduce the concept of a nugget, a phrase-like substring of a query. Based on identifying nuggets, an effective approach is proposed to generate a structured query from a session. Evaluation indicates that nuggets increase the accuracy of session search.

Moreover, we propose an efficient relaxed method to predict an appropriate window size for a nugget according to the average distance between two terms in the pseudo-relevance feedback. Experiments show that the relaxed method gives an advantage on the single-query task over the traditional method of examining n-grams. Furthermore, we introduce three aggregation schemes for the multiple queries in a session. A session contains multiple queries, from which we can obtain a set of nuggets. The queries may differ in importance, hence some nuggets may play more important roles than others. We find that the last query is commonly more important than the previous ones.

This thesis further studies result organization for session search. Search result organization gives the user an overview of the relevant documents, which helps the user locate the needed information rapidly. We present a novel framework based on the monothetic concept hierarchy approach, which shows advantages in stability over the popular organization approaches that are mostly based on hierarchical clustering. Our algorithm dynamically maps concepts to Wikipedia entries and generates the hierarchical structure, which can efficiently extract Wikipedia category structures about a specific topic. We are the first to evaluate the stability of concept hierarchies built by monothetic concept hierarchy approaches and by clustering approaches. Moreover, we are the first to integrate external knowledge into a monothetic concept hierarchy approach to correct erroneous parent-child relationships between concepts. The results indicate that our approach improves the quality of the hierarchies.

1.7 Outline

The rest of this thesis is organized as follows. Chapter 2 discusses the related work. Chapter 3 presents the methods of generating an effective structured query from the session data. Chapter 4 presents the enhancement of result organization for session search by integrating the Wikipedia category structure. Chapter 5 summarizes the thesis and describes possible directions for future work.

Chapter 2
Related Work

This chapter reviews the work related to this thesis research, including the submissions to the TREC Session tracks, query formulation, and search result organization.

2.1 Session Search and TREC Session Tracks

In the TREC 2011 and TREC 2012 Session tracks [25, 26], a session contained multiple queries q_1, q_2, ..., q_{n-1}, q_n, and user interactions such as the previous search results and click information. Four subtasks were requested:

RL1. Only using the current query q_n.
RL2. Including the previous queries q_1, q_2, ..., q_{n-1} and the current query q_n.
RL3. Including the top retrieved documents for the previous queries.
RL4. Considering additional information about which top results were clicked by users.

The ClueWeb09 collection is used as the corpus in the TREC Session tracks. Participants are allowed to use the first 50 million documents of ClueWeb09, named Category B or CatB, as the corpus instead; however, they are evaluated as if they were using the entire collection (named Category A or CatA).

Twenty teams participated in the TREC Session tracks over three years [22, 6, 45, 32, 28, 19, 23, 13, 33, 2, 15, 30, 50]. Evaluation results showed significant improvements from the first subtask to the last one in most submissions. These results indicated that considering all session information contributed to search accuracy.

Jiang et al. [22, 23] applied the sequential dependence model (SDM) [36] as the basic retrieval model. SDM features, including all single terms, ordered phrases, and unordered phrases, were extracted from the query. The features were then incorporated into the Lemur system using the Indri query language. The session historical query model (SH-QM) was used when including previous queries. For each SDM feature, a weight was assigned by linearly combining the frequencies of the feature in the current and previous queries. After introducing previous search results in RL3 and RL4, the authors applied the pseudo-relevance feedback query model (PRF-QM) to the single-term features. The weight of a single-term feature was adjusted by calculating the term frequency in the pseudo-relevance feedback. In RL3, the top 10 ranked Wikipedia documents served as the pseudo-relevance feedback, while in RL4, clicked documents together with their snippets served as the pseudo-relevance feedback. Furthermore, Jiang et al. introduced document novelty to adjust document scores in retrieval. Scores of documents previously clicked by the user were lowered based on their ranks in previous search results. Jiang et al. achieved the top rank in the TREC 2011 and TREC 2012 Session tracks.

Albakour et al. from the University of Essex [34] utilized anchor texts to expand queries. The anchor log proposed by the University of Twente was used as the reference to find terms on topics similar to the session. First, stop words were removed from the queries in a session. Then they found the lines in the anchor log containing any of these queries.

Terms in these lines were extracted to expand the queries. The anchor text approach proved effective and was adopted by other teams, such as the BUPT team [31].

The Nootropia model [38] was applied in another approach proposed by the University of Essex [34]. The authors built a Nootropia network based on previous search results and then re-ranked the documents retrieved for the current query. They experimented with two opposite strategies. The positive one assumed that previous search results in a session were relevant to the topic of the session, so documents with higher Nootropia scores would be ranked higher in the final results. The negative one made the opposite assumption, that previous search results in a session dissatisfied the user who submitted the session, so documents with higher Nootropia scores would be ranked lower in the final results. Evaluation indicated that the positive strategy was more valid than the negative one.

The CWI team [17] presented a discount rate model for the previous queries. They assumed two classes of users: good users and bad users. A good user learned from previous search results in a session to generate a high quality query, so the current query in a session was able to express the topic of the session precisely. In contrast, a bad user failed to adjust queries to fit the topic of a session, so the previous queries had equal value in representing the topic of the session. Based on this assumption, a session submitted by a better user received a more discounted rate for its previous queries. When only considering the queries in the session, the authors used the average number of interactions over all sessions as the standard to determine whether a session was submitted by a good user or a bad user. A session submitted by a good user was supposed to be finished within the average number of interactions, while a session submitted by a bad user was supposed to contain more interactions than the average number.

After adding the information about the previous search results, the average adjacent interaction overlap over all sessions became the standard for differentiating sessions submitted by good users from those submitted by bad users. The authors assumed that a session submitted by a better user would have less overlap between its adjacent interactions.

The BUPT team [31] exploited the dwell time of documents clicked by users. They built a reference document set containing all clicked documents in a session. The dwell time of every document in the set was then transformed into attention time by an exponential decay function with respect to the rank of the document. Next, they predicted the attention time for every retrieved document based on its cosine similarity to the reference document set. Finally, they re-ranked the retrieved documents according to their predicted attention time.

Most teams modified the retrieval models and used query expansion [22, 23, 33, 2, 30, 50] to fit session search. However, they did not apply query formulation to generate structured queries. Structured queries can represent phrases, which can emphasize the important terms in a query. We propose to build structured queries for session search in our work.

2.2 Query Formulation

The process of query formulation modifies the original query submitted by a user [8]. The goal is to understand the user intention underlying the query more accurately. Query formulation includes spelling correction, term proximity, etc.

As a crucial component of search engines, spelling correction has been studied thoroughly [29, 20, 10]. For example, Li et al. proposed a generalized hidden Markov model to correct query spelling errors [29]. They divided spelling errors into six types and designed a rule to fix each type.

For each word in a query submitted by a user, they classified it into one of the types based on the Markov model. The parameters of the Markov model are trained using manually corrected documents.

Many structured query formulation approaches were based on n-grams, defined as continuous terms of length n [3, 37, 42]. For example, Bendersky et al. focused on optimizing the weights of concepts in a query [3]. The authors first extracted bi-grams from a query as concept candidates. They then referred to multiple information sources, such as ClueWeb09 and Wikipedia, concurrently to evaluate the relatedness between a bi-gram and the query. With this evaluation, they filtered out meaningless bi-grams and assigned a weight to each remaining bi-gram, i.e., concept. Finally, a structured query consisted of the concepts associated with weights.

Mishne et al. applied proximity terms to web retrieval. They extracted n-grams from a query and then experimented with multiple ways to define a term frequency (tf) and an inverse document frequency (idf) for an n-gram. For example, the idf could be defined as the minimum or maximum idf of the terms in a group. Finally, they applied a traditional tf-idf retrieval model with the extended tf and idf values for n-grams.

Zhao and Callan tried to identify term mismatches and fix them by expanding the query using boolean conjunctive normal form (CNF) [51]. CNF queries contain AND and OR operators to describe relations between terms in a query. The authors experimented with two measurements, highest inverse document frequency and lowest probability of a term in the pseudo-relevance feedback, to diagnose mismatched terms. Evaluation showed that using the probability of a term in the pseudo-relevance feedback was more accurate. After identifying mismatched terms, they used a set of manually built CNF queries to expand the original query.

Huston and Croft detected key concepts in queries with a classifier [18]. The features used in the classifier included term frequency, inverse document frequency, residual inverse document frequency, weighted information gain, n-gram term frequency, and query frequency. The classifier was trained using the GOV2 dataset.

Approaches using n-grams can effectively represent phrases in a query. However, some phrases have multiple forms. For example, both pull out a book and pull it out contain the phrase pull out. The n-gram method is therefore sometimes too strict: if we only identify continuous terms, we may miss some relevant documents. Boolean conjunctive normal form, on the contrary, can hardly represent phrases. A classifier can precisely detect key concepts in queries given a good training dataset; however, it is not easy to find a training dataset that fits all queries. We propose a relaxed method, which relies on the query itself, to predict window sizes for nuggets and then build structured queries.

2.3 Search Result Organization

Meta search engines such as Clusty (now Yippy) employ search result clustering (SRC) [1, 5, 41] to automatically organize search results into hierarchical clusters. There are two strategies for clustering search results: hierarchical clustering and monothetic concept hierarchy construction. Furthermore, external knowledge bases are increasingly exploited to improve the quality of clustering. The remainder of this section discusses the related work on hierarchical clustering, Subsumption, and exploiting external knowledge.

2.3.1 Hierarchical Clustering

Most search result organization has adopted clustering-based approaches, which share a common scheme of first clustering similar documents and then assigning labels to the clusters [5, 9]. Clustering-based approaches often produce non-interpretable clusters and semantically ill hierarchies due to their data-driven nature and poor cluster labeling.

Even in the best known commercial clustering-based search engine, Clusty (now Yippy), which presents search results in hierarchical clusters and labels the clusters with variable-length sentences, cluster labeling remains a challenging issue.

2.3.2 Subsumption

Monothetic concept hierarchy approaches build concept hierarchies differently. They avoid cluster labeling by first extracting concepts from documents and then organizing the concepts into a hierarchy where each concept attaches to the subset of documents containing it. Hence, for document sets about similar topics, monothetic concept hierarchy approaches usually generate hierarchies with more stable nodes, which are concepts extracted from the entire document set.

The Subsumption approach [39] is a classic and state-of-the-art monothetic concept hierarchy approach. It builds browsing hierarchies based on conditional probability. The authors expanded the query by Local Context Analysis [46] and used the expanded query to retrieve documents. Terms with a high ratio of occurrence in the retrieved documents to occurrence in the collection were added to the concept set, which also contained the query terms. For a term pair (x, y) in the concept set, x was said to subsume y if P(x|y) ≥ 0.8 and P(y|x) < 1.
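To make the rule concrete, here is a minimal sketch of the subsumption test computed from document-frequency counts. The function and the toy data are hypothetical illustrations, not the implementation of [39]; the conditional probabilities are estimated as set-overlap fractions.

```python
from itertools import permutations

def subsumes(docs_with_x, docs_with_y, threshold=0.8):
    """Subsumption's rule: x subsumes y if P(x|y) >= threshold and P(y|x) < 1,
    with probabilities estimated from the sets of documents containing each."""
    both = len(docs_with_x & docs_with_y)
    p_x_given_y = both / len(docs_with_y)  # fraction of y-docs that also contain x
    p_y_given_x = both / len(docs_with_x)  # fraction of x-docs that also contain y
    return p_x_given_y >= threshold and p_y_given_x < 1.0

# Toy example: concept -> set of document ids containing it.
concept_docs = {
    "diabetes": {1, 2, 3, 4, 5},
    "type 2 diabetes": {2, 3, 4},
}
for x, y in permutations(concept_docs, 2):
    if subsumes(concept_docs[x], concept_docs[y]):
        print(f"{x!r} subsumes {y!r}")  # 'diabetes' subsumes 'type 2 diabetes'
```

The asymmetric test is what makes the parent-child direction fall out of plain co-occurrence counts: the more general concept appears in almost every document that mentions the more specific one, but not vice versa.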

2.3.3 Exploiting External Knowledge

Computer scientists have used Wikipedia to improve their research work because Wikipedia was compiled by thousands of experts from all over the world [16, 44, 4, 40, 27]. Not only the texts but also the links and the categories in Wikipedia are prevalently exploited.

Carmel et al. improved cluster labeling accuracy by using labels from Wikipedia [4]. They first look for a set of terms that maximizes the Jensen-Shannon divergence (JSD) distance between the specific cluster and the entire corpus. They then search Wikipedia using these terms for a list of documents. The titles and corresponding categories are picked as candidate cluster labels. After that, they rank the candidates by a Mutual Information judgment and a Score Propagation judgment. The results indicate high quality labels.

Han et al. organized search results by topic by leveraging the knowledge that Wikipedia links provide [16]. They chose Wikipedia concepts from the link words in the retrieved Wikipedia documents. A semantic graph was then built based on the semantic relatedness between these Wikipedia concepts. The graph was divided into communities according to the internal link density. The terms in a community represent the sub-topics of the query. Finally, the search results for the query were assigned to the communities by comparing their similarity to the communities.

Wang et al. [44] constructed a thesaurus of concepts from Wikipedia and used this thesaurus to improve text classification. They used an out-link category-based measure to help decide whether two articles were related. The out-link categories of an article were defined as the categories to which the articles out-linked from the original one belong. Two articles were more closely related if their out-link categories overlapped more. They reported significant improvement in text classification by introducing the out-link category-based measure.

Many SRC hierarchy construction approaches are data driven, such as the widely used hierarchical clustering algorithms. These algorithms first group similar documents into clusters and then label the clusters as hierarchy nodes. Multiple aspects in textual search results often yield mixed-initiative clusters, which reduce the stability of SRC hierarchies. Moreover, when clustering algorithms build clusters bottom-up, small changes in leaf clusters propagate to upper levels and amplify the instability.

Furthermore, hierarchy labels are automatically generated from the documents in a cluster, which is often data-sensitive, so SRC hierarchies can look even more unstable. Monothetic concept hierarchy approaches usually generate hierarchies with stable nodes. However, they often produce hierarchies short of semantic meaning because only term frequencies, not meanings, are taken into account. This work fills the gap by exploiting external knowledge to correct relations between concepts.

Chapter 3
Effective Structured Query Formulation for Session Search

In session search, a user feeds a session into a search engine. The session includes a series of previous queries q_1, q_2, ..., q_{n-1} with corresponding previous results D_1, D_2, ..., D_{n-1}, and a current/last query q_n. All the queries have a common underlying topic, the session topic. The search engine is expected to retrieve documents relevant to the session topic.

Each query in a session is composed of terms and phrases. In order to represent the topic of the session precisely, we extract phrases from each query and combine the words and phrases of all queries when performing document retrieval. The Indri query language supports complex queries such as proximity terms and combined beliefs, which benefits building structured queries from all queries in a session. In this work, we further expand structured queries with anchor texts. Anchor texts are texts in a document, each of which is associated with a link to another document. The research reported in this chapter has been published in the Proceedings of the 21st Text REtrieval Conference (TREC 2012) [13].

3.1 Identifying Nuggets and Formulating Structured Queries

In a query, several words sometimes bundle together as a phrase to express a coherent meaning. We identify phrase-like text nuggets and formulate them into Lemur queries for retrieval.

Figure 3.1: A sample nugget in the TREC 2012 session 53 query servering spinal cord paralysis.

Nuggets are substrings of a query; they are similar to phrases but not necessarily as semantically coherent. Figure 3.1 shows an example of a nugget: the words spinal and cord often occur together to represent a specific concept. We discover that a valid nugget appears frequently in the top returned snippets for a query. Hence, we identify nuggets to formulate new structured queries in the Lemur query language. In particular, we look for nuggets in the top s snippets returned by Lemur for a query q. Nuggets are identified by two methods, a strict one and a relaxed one, as described below.

3.1.1 The Strict Method

First, a query is represented as a word list q = w_1 w_2 ... w_n. We send this word list to Lemur and retrieve the top s snippets over an inverted index built for ClueWeb09 CatB. All snippets are then concatenated into a reference document R. For every bi-gram in q, we count its occurrences in R. The occurrence of a bi-gram is normalized by the smaller occurrence count of the two words in the bi-gram. A bi-gram is marked as a nugget candidate if its normalized occurrence exceeds a threshold, as shown in (3.1):

\[
\frac{count(w_i w_{i+1}; R)}{\min\big(count(w_i; R),\ count(w_{i+1}; R)\big)} \geq \theta \tag{3.1}
\]

where count(x; R) denotes the occurrence of x in the reference document R, w_i and w_{i+1} are adjacent words in the query, and θ is the threshold, tuned to 0.97 over all of the TREC 2011 session data. For example, in the TREC 2012 session 53 query servering spinal cord consequenses, we identify the bi-gram spinal cord as a candidate.

Figure 3.2: Words in a snippet built from the TREC 2012 session 53 query servering spinal cord consequenses, where spinal is always connected to cord.

Bi-grams can connect to form longer n-grams. For instance, TREC 2011 session 11 contains the query hawaii real estate average resale value house or condo news. We discover that hawaii real and real estate are both marked as nugget candidates, so they can be merged into the longer sequence hawaii real estate. On the contrary, estate average is not a candidate, hence we cannot append it to form hawaii real estate average. Therefore, hawaii real estate is the longest sequence and is recognized as a nugget. Consequently, the query is broken down into nuggets and single words, all of which serve as elements to build up a structured query in the Lemur query language:

  #combine(nugget_1 nugget_2 ... nugget_m w_1 w_2 ... w_r)   (3.2)

where we suppose there are m nuggets and r single words.
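To make the strict method concrete, the sketch below implements Eqs. (3.1) and (3.2) under simplifying assumptions: the reference document R is passed in as a plain string instead of being built from retrieved snippets, occurrences are counted by simple token and substring matching, and all names are hypothetical.

```python
def strict_nuggets(query_words, reference, theta=0.97):
    """Identify nuggets by Eq. (3.1): a bi-gram is a candidate when its count
    in the reference document R, normalized by the smaller count of its two
    words, reaches theta; adjacent candidates merge into longer nuggets."""
    tokens = reference.lower().split()
    text = " ".join(tokens)

    def count(phrase):
        # Occurrences in R: token count for words, substring count for
        # bi-grams (simple matching is adequate for a sketch).
        return tokens.count(phrase) if " " not in phrase else text.count(phrase)

    words = [w.lower() for w in query_words]
    cand = []
    for w1, w2 in zip(words, words[1:]):
        denom = min(count(w1), count(w2))
        cand.append(denom > 0 and count(f"{w1} {w2}") / denom >= theta)

    nuggets, singles, i = [], [], 0
    while i < len(words):
        j = i
        while j < len(words) - 1 and cand[j]:
            j += 1                      # extend the run of candidate bi-grams
        if j > i:
            nuggets.append(" ".join(words[i:j + 1]))
            i = j + 1
        else:
            singles.append(words[i])
            i += 1
    parts = [f"#1({n})" for n in nuggets] + singles
    return f"#combine({' '.join(parts)})"   # Eq. (3.2)

# Toy reference document R standing in for concatenated snippets:
R = "spinal cord injury ... spinal cord damage ... the spinal cord"
print(strict_nuggets(["servering", "spinal", "cord", "consequenses"], R))
# -> #combine(#1(spinal cord) servering consequenses)
```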

Figure 3.3: Words in a snippet built from the TREC 2011 session 20 query dooney bourke purses, where dooney and bourke is a brand name but the user omits the word and.

The example nugget detection for TREC 2012 session 53 is shown in Figure 3.2. We obtain the structured query #1(spinal cord) servering consequenses.

3.1.2 The Relaxed Method

The operator #1 is a strict structure operator and may miss relevant documents. For example, the queries in TREC 2011 session 20 all contain dooney bourke. However, dooney and bourke is a brand name and is sometimes written as dooney bourke. We would miss relevant documents containing the phrase dooney and bourke if we formulated the query as #1(dooney bourke). Hence, we introduce a relaxed method for query formulation.

We relax the constraints based on the intuition that the distance between two words reflects their associativity. In particular, we first retrieve the reference document R as in Section 3.1.1. Every word's position in the snippet is marked as shown in Figure 3.3.

Figure 3.4: ndcg@10 values of retrieved documents using the TREC 2011 Session track dataset. Two cases, with threshold and without threshold, are compared.

We then estimate the centroid of a word w_i by

\[
\bar{x}(w_i) = \frac{\sum_j x_j(w_i; R)}{count(w_i; R)} \tag{3.3}
\]

where s is the number of snippets, R is the reference document concatenated from those snippets, x_j(w_i; R) is the index of the j-th instance of w_i in R, and count(w_i; R) is the occurrence of w_i in R. For every bi-gram in a query, the distance between the estimated centroids of its two words is calculated. We predict the window size (X in #X) of a nugget based on this distance.

Intuitively, it is reasonable to assume that the window size is proportional to the distance between their estimated centroids, which can be written as:

\[
nugget = \#\Big\lceil \frac{|\bar{x}(w_i) - \bar{x}(w_{i+1})|}{\xi} \Big\rceil (w_i\ w_{i+1}) \tag{3.4}
\]

where ξ is an empirical factor. However, some terms in a query do not form nuggets. The distance between the centroids of such terms may be very large and generate a large window size, which introduces noise and hurts search precision. Therefore, we set a threshold to filter out term pairs whose centroids are too far apart. Figure 3.4 compares the ndcg@10 values of the retrieved documents over the TREC 2011 sessions with and without the threshold; it shows that precision increases greatly with the threshold. A decision rule can be derived from Eq. (3.4) with the threshold:

\[
nugget =
\begin{cases}
\#1(w_i\ w_{i+1}) & \text{if } |\bar{x}(w_i) - \bar{x}(w_{i+1})| \leq \xi \\
\#2(w_i\ w_{i+1}) & \text{if } \xi < |\bar{x}(w_i) - \bar{x}(w_{i+1})| \leq 2\xi \\
\text{no nugget} & \text{if } |\bar{x}(w_i) - \bar{x}(w_{i+1})| > 2\xi
\end{cases} \tag{3.5}
\]

where we set the threshold to 2ξ by experiment, i.e., we only consider nuggets with a window size no larger than 2. A structured query is then formulated as in Eq. (3.2). We tune ξ from 2 to 8 using the TREC 2011 Session track dataset. Figure 3.4 shows the ndcg@10 values for different ξ. We find that the precision of session search is not sensitive to the value of ξ. Hence, we choose the ξ value with the largest ndcg@10, which is 5. For the above query dooney bourke purses, Figure 3.3 shows that the procedure generates the structured query #2(dooney bourke) purses.
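A compact sketch of the relaxed rule follows, again with hypothetical names and a pre-built reference document standing in for the retrieved snippets; it implements the centroid of Eq. (3.3) and the decision rule of Eq. (3.5).

```python
def relaxed_window(w1, w2, reference, xi=5.0):
    """Choose the Indri window operator for a bi-gram by Eq. (3.5):
    '#1' if the centroid distance is <= xi, '#2' if <= 2*xi, else None."""
    tokens = reference.lower().split()

    def centroid(word):
        # Eq. (3.3): mean token index of the word's occurrences in R.
        pos = [i for i, t in enumerate(tokens) if t == word]
        return sum(pos) / len(pos) if pos else None

    c1, c2 = centroid(w1.lower()), centroid(w2.lower())
    if c1 is None or c2 is None:
        return None
    dist = abs(c1 - c2)
    if dist <= xi:
        return "#1"
    if dist <= 2 * xi:
        return "#2"
    return None                       # centroids too far apart: no nugget

# Toy reference where "dooney" and "bourke" are often separated by "and".
# The thesis tunes xi = 5 on full snippet sets; this toy R is tiny, so a
# smaller xi is used purely for illustration.
R = "dooney and bourke handbags ... dooney bourke outlet ... dooney and bourke"
op = relaxed_window("dooney", "bourke", R, xi=1.0)
print(f"{op}(dooney bourke)" if op else "no nugget")   # -> #2(dooney bourke)
```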

3.2 Query Aggregation within a Session

A session contains multiple queries, from each of which we can build a structured query. Therefore, we aggregate over all queries in a session to generate one large structured query. We first obtain a set of nuggets and single words from every query, q_k = {nugget_ik, w_jk}, by the approach presented in Section 3.1. Then we merge them to form a structured query:

  #weight( λ_1 #combine(nugget_11 nugget_12 ... nugget_1m w_11 w_12 ... w_1r)
           λ_2 #combine(nugget_21 nugget_22 ... nugget_2m w_21 w_22 ... w_2r)
           ...
           λ_n #combine(nugget_n1 nugget_n2 ... nugget_nm w_n1 w_n2 ... w_nr) )   (3.6)

where λ_k denotes the weight of query q_k. Note that the last #combine is for the current query q_n.

3.2.1 Aggregation Schemes

Three weighting schemes are designed to determine the weight λ_k, namely uniform, previous vs. current, and distance-based.

uniform. Queries are assigned the same weight, i.e., λ_k = 1.

previous vs. current. All previous queries share the same weight while the current query uses a complementary and higher weight. In particular, we define:

\[
\lambda_k =
\begin{cases}
\lambda_p & k = 1, 2, \dots, n-1 \\
1 - \lambda_p & k = n
\end{cases} \tag{3.7}
\]

where λ_p is tuned to be 0.4 on the TREC 2011 Session track data.

distance-based. The weights are distributed based on how far a query's position in the session is from the current query. We use a reciprocal function to model this:

\[
\lambda_k =
\begin{cases}
\dfrac{\lambda_p}{n-k} & k = 1, 2, \dots, n-1 \\
1 - \lambda_p & k = n
\end{cases} \tag{3.8}
\]

where λ_p is tuned to be 0.4 based on the TREC 2011 Session track data and k is the position of a query.
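The three schemes reduce to a few lines of arithmetic. A minimal sketch, with hypothetical function names, computes the weight vector (λ_1, ..., λ_n) of Eqs. (3.7) and (3.8) and assembles the #weight query of Eq. (3.6):

```python
def scheme_weights(n, scheme, lambda_p=0.4):
    """Weights (lambda_1, ..., lambda_n) for the n session queries."""
    if scheme == "uniform":
        return [1.0] * n
    if scheme == "pvc":          # previous vs. current, Eq. (3.7)
        return [lambda_p] * (n - 1) + [1.0 - lambda_p]
    if scheme == "distance":     # distance-based, Eq. (3.8)
        return [lambda_p / (n - k) for k in range(1, n)] + [1.0 - lambda_p]
    raise ValueError(f"unknown scheme: {scheme}")

def aggregate(session_combines, scheme="pvc"):
    """Wrap per-query #combine strings into the #weight query of Eq. (3.6)."""
    weights = scheme_weights(len(session_combines), scheme)
    inner = " ".join(f"{w:.2f} {c}" for w, c in zip(weights, session_combines))
    return f"#weight( {inner} )"

combines = [
    "#combine(#1(pocono mountains region))",
    "#combine(#1(pocono mountains region) #1(things to do))",
]
print(aggregate(combines))
# -> #weight( 0.40 #combine(#1(pocono mountains region))
#             0.60 #combine(#1(pocono mountains region) #1(things to do)) )
# (output is one line; wrapped here for readability)
```

Note how previous vs. current puts 0.4 on every previous query and 0.6 on the current one, while distance-based further divides the previous-query weight by its distance from the current query.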

3.3 Query Expansion by Anchor Text

Figure 3.5: Anchor text in a web page.

A session also provides the previous search results, which are pages relevant to the previous queries. An anchor text pointing to a page often provides a valuable human-created description of that page [34], as shown in Figure 3.5, which enables us to expand a query with words from anchor texts. An anchor log is extracted by harvestlinks in the Lemur toolkit. We collect anchor texts for all previous search results and sort them by term frequency in decreasing order. The top 5 most frequent anchor texts are appended to the structured query generated in Section 3.2, each with a weight proportional to its term frequency:

  #weight( λ_1 #combine(nugget_11 nugget_12 ... nugget_1m w_11 w_12 ... w_1r)
           λ_2 #combine(nugget_21 nugget_22 ... nugget_2m w_21 w_22 ... w_2r)
           ...
           λ_n #combine(nugget_n1 nugget_n2 ... nugget_nm w_n1 w_n2 ... w_nr)
           βω_1 #combine(e_1)  βω_2 #combine(e_2)  ...  βω_5 #combine(e_5) )   (3.9)

where e_i (i = 1..5) are the top 5 anchor texts, ω_i (i = 1..5) denotes the corresponding frequency of each anchor text, normalized by the maximum frequency, and β is a factor that adjusts the influence of the anchor texts, tuned to 0.1 based on the TREC 2011 session data.
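A sketch of the expansion step, assuming the anchor log has already been reduced to (anchor text, frequency) pairs; parsing of the harvestlinks output is omitted, and the names and toy frequencies are hypothetical.

```python
def expand_with_anchors(weight_body, anchor_counts, beta=0.1, top_k=5):
    """Append the top-k anchor texts to a #weight body per Eq. (3.9), each
    weighted by beta * (frequency / max frequency)."""
    top = sorted(anchor_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    max_freq = top[0][1]
    extra = " ".join(
        f"{beta * freq / max_freq:.3f} #combine({text})" for text, freq in top
    )
    return f"#weight( {weight_body} {extra} )"

# Hypothetical anchor-text frequencies collected from previous results:
anchors = {"spinal cord injury": 40, "paraplegia": 25, "type of paralysi": 50}
print(expand_with_anchors("1.0 #combine(#1(spinal cord) servering)", anchors))
# -> #weight( 1.0 #combine(#1(spinal cord) servering)
#             0.100 #combine(type of paralysi)
#             0.080 #combine(spinal cord injury) 0.050 #combine(paraplegia) )
# (output is one line; wrapped here for readability)
```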

For example, in TREC 2012 session 53, the anchor texts with top frequencies are type of paralysi, quadriplegia paraplegia, paraplegia, spinal cord injury, and quadriplegic tetraplegic, hence the final structured query becomes #weight(1.0 #1(spinal cord) 0.6 consequenses 0.4 paralysis 1.0 servering #combine(type of paralysi) #combine(quadriplegia paraplegia) paraplegia #combine(spinal cord injury) #combine(quadriplegic tetraplegic)), where the last five elements come from the anchor texts.

3.4 Removing Duplicated Queries

The trace of how a user modifies queries in a session may suggest the intention of the user, so it can be exploited to study the user's real information need. We notice that a user sometimes repeats a previous query, producing duplicated queries. Thus, we make two assumptions to refine the final structured query, as follows.

If a previous query is the same as the current query q_n, we only use the current query to generate the final structured query. For example, in TREC 2011 session 22, the current query shoulder joint pain is the same as the first query shoulder joint pain. A possible reason is that the search results for the intermediate queries did not satisfy the user, so the user returned to one of the previous queries.

If multiple previous queries are duplicated but they are all different from q_n, we remove the duplicates when formulating the final structured query. For example, in TREC 2011 session 60, the query non-extinct marsupials occurs three times and the query marsupial manure occurs twice. It could bias the search results if we used all of these duplicate queries.

In duplicate detection, we consider one special situation: if a substring of one query is an abbreviation of a substring of another, we consider the two queries duplicates. For example, the only difference between the queries History of DSEC and History of dupont science essay contest is DSEC versus dupont science essay contest, where the former is the abbreviation of the latter, hence the queries are considered duplicates. To detect abbreviations, we scan a query string and split a word into letters if the word is entirely uppercase. In the example above, the first query is transformed into History of D S E C. When comparing two queries, two words in corresponding positions are considered the same if one of them contains only one capital letter and both start with the same letter. In the above example, dupont and D are considered the same.
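A minimal sketch of this abbreviation-aware duplicate test follows; the function names are hypothetical, and real queries would also be case-normalized and stripped of punctuation before comparison.

```python
def explode_upper(query):
    """Split fully-uppercase words into single letters: 'History of DSEC'
    -> ['History', 'of', 'D', 'S', 'E', 'C']."""
    out = []
    for word in query.split():
        out.extend(word if word.isupper() and len(word) > 1 else [word])
    return out

def words_match(a, b):
    # A single capital letter matches any word starting with that letter.
    if len(a) == 1 and a.isupper():
        return b.lower().startswith(a.lower())
    if len(b) == 1 and b.isupper():
        return a.lower().startswith(b.lower())
    return a.lower() == b.lower()

def duplicates(q1, q2):
    """True if the two queries are duplicates up to abbreviation expansion."""
    w1, w2 = explode_upper(q1), explode_upper(q2)
    return len(w1) == len(w2) and all(words_match(a, b) for a, b in zip(w1, w2))

print(duplicates("History of DSEC", "History of dupont science essay contest"))
# -> True: D/S/E/C match dupont/science/essay/contest position by position
```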

In our experiments, the raw dwell time used in our method strongly biases the document weights towards documents with long dwell times, meaning that satisfying visits receive much higher weights. For example, if a document has been read by a user for more than 30 seconds, we consider that the user is satisfied with this document. On the contrary, if the dwell time of a document is only a few seconds, the user might just have glanced at the document and found the content not relevant. Since the dwell time is multiplied by the similarity, the former document achieves a much higher score than the latter.

3.6 Evaluation for Session Search

We participated in the TREC 2012 Session track and submitted three runs using different approach combinations, which are listed in Table 3.4. The four subtasks, named RL1, RL2, RL3, and RL4, are described in Section 2.1. In the evaluation results officially released by NIST [26], we achieved the highest improvement from RL1 to RL2, and our retrieval results for RL2-RL4 ranked second among the participants.

3.6.1 Datasets, Baseline, and Evaluation Metrics

We build an inverted index over ClueWeb09 CatB. An anchor log is acquired by applying harvestlinks over ClueWeb09 CatA, since the official previous search results are from CatA. Previous research demonstrates that the ClueWeb09 collection contains many spam documents; we filter out documents whose Waterloo GroupX spam ranking score [7] is less than 70.
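As a small illustrative sketch (the helper names are assumptions, not the exact implementation), the spam filtering step simply keeps the documents at or above the score threshold:

    def filter_spam(doc_ids, spam_score, threshold=70):
        """spam_score: dict mapping a ClueWeb09 document id to its Waterloo
        GroupX spam ranking score; lower scores indicate likelier spam."""
        return [d for d in doc_ids if spam_score.get(d, 0) >= threshold]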

The Lemur search engine is employed in our experiments as the baseline. Lemur's language model based on the Bayesian belief network is applied [35]. The language model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution [49]:

p_µ(w|d) = (c(w; d) + µ·p(w|C)) / (Σ_w c(w; d) + µ)        (3.11)

where c(w; d) denotes the number of occurrences of term w in document d, p(w|C) is the collection language model, and µ is the smoothing parameter. The parameter µ is tuned based on the 2011 session data.

The metrics provided by the TREC 2012 Session track [26] are used to evaluate the retrieval performance: Expected Reciprocal Rank (ERR), ERR@10, ERR normalized by the maximum ERR per query (nerr), nerr@10, normalized discounted cumulative gain (ndcg), ndcg@10, Average Precision (AP), and Precision@10, where ndcg@10 serves as the primary metric and is defined as [21]:

ndcg@10 = ( Σ_{i=1}^{10} rel(i) / (1 + log_2 i) ) / ( Σ_{i=1}^{10} rel*(i) / (1 + log_2 i) )        (3.12)

where rel(i) is the relevance score of the document at rank i in the ranked document list retrieved for a session, and rel*(i) denotes the relevance score of the document at rank i in the ideal ranked document list for the session. The top 10 documents are taken into account because search engines usually display the top 10 relevant documents on the first page, which are the most attractive to a user.
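Minimal sketches of Eq. (3.11) and Eq. (3.12) follow, assuming relevance grades are given as plain lists ordered by rank; these are illustrative implementations, not the official TREC evaluation code.

    import math

    def dirichlet_score(tf_wd, doc_len, p_wc, mu=4000.0):
        """Eq. (3.11): Dirichlet-smoothed p(w|d). tf_wd is c(w; d), doc_len is
        the total number of term occurrences in d, and p_wc is p(w|C)."""
        return (tf_wd + mu * p_wc) / (doc_len + mu)

    def ndcg_at_10(rel, ideal_rel):
        """Eq. (3.12): rel and ideal_rel list the relevance grades of the
        retrieved and the ideal ranked lists, rank 1 first."""
        def dcg10(grades):
            return sum(g / (1.0 + math.log(i + 1, 2))
                       for i, g in enumerate(grades[:10]))
        ideal = dcg10(ideal_rel)
        return dcg10(rel) / ideal if ideal else 0.0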

3.6.2 Results for TREC 2011 Session Track

For RL1, where only the current query q_n is available, we generate a structured query from q_n by the approach described in Section 3.1 and send it to Lemur. The Dirichlet parameter µ and the number of pseudo relevance feedback documents f are tested on the TREC 2011 session data. The documents retrieved by directly searching q_n serve as the baseline.

Table 3.1: ndcg@10 for TREC 2011 Session track RL1. The Dirichlet smoothing method is used: µ = 4000, f = 10 for the strict method and µ = 4000, f = 20 for the relaxed method. Methods are compared to the baseline (original query); a significant improvement over the baseline is indicated at the p < 0.05 level and at a stricter significance level (t-test, single-tailed). The best run and median run in TREC 2011 are listed for comparison.

Method    original query    strict     relaxed    TREC 2011 Best
%chg      0.00%             13.50%     17.79%     12.17%

Table 3.1 shows the ndcg@10 results for RL1 on the TREC 2011 data. By formulating structured queries using nuggets, we greatly boost the search accuracy over the baseline, by 13.50%. The relaxed form achieves an even better improvement of 17.79%.

For RL2, we apply query expansion with the previous queries as explained in Section 3.2. We observe that the strict method performs much better, because the window size in the relaxed method is hard to optimize for multiple queries. Table 3.2 presents the ndcg@10 for RL2 on the TREC 2011 session data. We find that the previous vs. current scheme gives the best search accuracy. It is worth noting that the distance-based scheme performs even worse than the uniform scheme, which implies that the modification of user intention is complex and we cannot assume that an early query has less importance in the entire session.

For RL3 and RL4, we combine several methods, including anchor texts, removing duplicated queries, and re-ranking by dwell time. Table 3.3 displays the ndcg@10 for RL3 and RL4 on the 2011 Session track data. It illustrates that removing duplicated queries significantly improves the performance. However, neither re-ranking nor considering only clicked documents contributes to the results. The reason may lie in that we calculate cosine similarity based on the full text of documents, which perhaps introduces a lot of noise.

Table 3.2: ndcg@10 for TREC 2011 Session track RL2. The Dirichlet smoothing method and the strict method are used: µ = 4000, f = 5 for uniform, and µ = 4500, f = 5 for previous vs. current (PvC) and distance-based. Methods are compared to the baseline (original query); a significant improvement over the baseline is indicated at the p < 0.05 level and at a stricter significance level (t-test, single-tailed). The best run and median run in TREC 2011 are listed for comparison.

Scheme    original query    uniform    PvC        distance-based    TREC 2011 Best
%chg      0.00%             32.47%     36.94%     31.17%            26.73%

Table 3.3: ndcg@10 for TREC 2011 Session track RL3 and RL4. All runs use the strict method and the configuration µ = 4500, f = 5. Methods are compared to the baseline (baseline = anchor text); a significant improvement over the baseline is indicated at the p < 0.05 level and at a stricter significance level (t-test, single-tailed). The best run and median run in TREC 2011 are listed for comparison.

                                 all documents          clicked documents
Method                           ndcg@10    %chg        ndcg@10    %chg
RL3   all queries
      remove duplicate
RL4   re-rank by dwell time

3.6.3 Results for TREC 2012 Session Track

We submitted three runs to the TREC 2012 Session track. The run names, methods, and parameters are listed in Table 3.4, where µ is the Dirichlet smoothing parameter and f is the number of pseudo relevance feedback documents.

Table 3.4: Methods and parameter settings for TREC 2012 Session track. µ is the Dirichlet smoothing parameter, f is the number of pseudo relevance feedback documents.

guphrase1:  RL1: strict method, µ = 4000, f = 10.  RL2: strict method + query expansion, µ = 4500, f = 5.  RL3: strict method + query expansion + anchor text + remove duplicates, µ = 4500, f = 5.  RL4: strict method + query expansion + anchor text + all queries, µ = 4500, f = 5.

guphrase2:  RL1: strict method, µ = 3500, f = 10.  RL2: strict method + query expansion, µ = 5000, f = 5.  RL3: strict method + query expansion + anchor text + remove duplicates, µ = 5000, f = 5.  RL4: strict method + query expansion + anchor text + all queries, µ = 5000, f = 5.

gurelaxphr:  RL1: relaxed method, µ = 4000, f = 20.  RL2: relaxed method + query expansion, µ = 4500, f = 20.  RL3: relaxed method + query expansion + anchor text + remove duplicates, µ = 4500, f = 20.  RL4: strict method + query expansion + anchor text + re-ranking by time, µ = 4500, f = 5.

Table 3.5: ndcg@10 for TREC 2012 Session track. The mean and the median of the evaluation results in TREC 2012 are listed.

run    original query    guphrase1    guphrase2    gurelaxphr    TREC Best
RL1
RL2
RL3
RL4

Table 3.6: AP for TREC 2012 Session track. The mean and the median of the evaluation results in TREC 2012 are listed.

run    original query    guphrase1    guphrase2    gurelaxphr    TREC Best
RL1
RL2
RL3
RL4

Figure 3.6: Changes in ndcg@10 from RL1 to RL2 presented by the TREC 2012 Session track. Error bars are 95% confidence intervals (Figure 1 in [26]).

The evaluation results of ndcg@10 and Average Precision (AP) by TREC are presented in Table 3.5 and Table 3.6. They show trends similar to what we observe on the TREC 2011 data, but in a much lower range, even beneath the results using the original query. This may imply that our query formulation methods overfit the TREC 2011 session data. Nonetheless, using previous queries and eliminating duplicates continues to demonstrate significant improvement in search accuracy.

3.6.4 Official Evaluation Results for TREC 2012 Session Track

The TREC 2012 Session track presented the evaluation for all participants [26]. The runs were compared on both the individual subtasks and the improvements between the pairs of subtasks.

Figure 3.7: All results by ndcg@10 for the current query in the session for each subtask (Table 2 in [26]).

Our runs achieved the highest improvement from RL1 to RL2, as shown in Figure 3.6 (Figure 1 in [26]). This improvement ranked us second among all the groups in RL2, RL3, and RL4, as shown in Figure 3.7 (Table 2 in [26]). The evaluation results demonstrate the effectiveness of query formulation that combines nuggets and user interaction information in session search.

3.7 Chapter Summary

In this chapter, we describe an approach to building effective structured queries in session search. The concept of nuggets is introduced to represent the phrase-like semantic units in a query. A window size can be predicted for a nugget by the relaxed method in nugget identification. Nuggets from all queries are combined by three aggregation schemes. Experiments indicate that injecting nuggets into all queries in a session increases the search accuracy significantly, and removing duplicated queries in a session improves the search accuracy further. In addition, nuggets from the current query are more important than those from previous queries; however, no evidence shows a difference in the importance of individual previous queries. Query expansion and document re-ranking are applied to make additional progress in search accuracy. Moreover, we design two rules to remove duplicate queries within a session, which improves the search accuracy effectively. All these techniques make our results rank second among all the participants in the subtasks that involve a session in the TREC 2012 Session track.

Chapter 4
Increasing Stability of Result Organization for Session Search

The relatedness of the queries in a session requires high stability of the search result organization. In order to improve the stability of SRC hierarchies, we present an original system framework based on the monothetic concept hierarchy approach. In particular, we first extract concepts from the document set. The hierarchies are then built according to statistics of the concepts, such as their document frequencies in the document set. Additionally, we apply the category information in Wikipedia to regulate the parent-child relationships between pairs of concepts.

It is worth mentioning that we investigate how to increase the stability of concept hierarchies by considering only the current query and its search results. One may argue that the instability issue could be resolved by considering the queries in the same session all together when building SRC hierarchies. However, in Web search, session membership is not always available; therefore, our task is more consistent with the real application. Moreover, our task is to independently generate similar hierarchies for queries as long as these queries are similar, which poses a greater challenge. Furthermore, our algorithms can be extended to include other queries in the session if the session segmentation is known.

The research results reported in this chapter have been published in the Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013) [12].

Figure 4.1: Framework overview of the Wikipedia-enhanced concept hierarchy construction system.

4.1 Utilizing External Knowledge to Increase Stability of Search Result Organization

We propose to exploit external knowledge to increase the stability of SRC hierarchies. Wikipedia, a broadly used knowledge base, is used as the main source of external knowledge. We refer to each article in Wikipedia as a page, which usually discusses a single topic. The title of a page is called an entry. Every entry belongs to one or more categories. The categories in Wikipedia are organized following subsumption (also called is-a) relations; together, all Wikipedia categories form a network that consists of many connected hierarchies.

Our framework consists of three components: concept extraction, identifying reference Wikipedia entries, and relationship construction, as shown in Figure 4.1. Initially, the framework takes in a single query q and its search results D and extracts a concept set C that best represents D, using an efficient version of the approach in [48] (Chapter 4).

Next, for each concept c ∈ C, the framework identifies its most relevant Wikipedia entry e, which is called a reference Wikipedia entry. Finally, relationship construction adopts two schemes to incorporate Wikipedia category information: one applies Subsumption [39] first and then refines the relationships according to Wikipedia categories, while the other connects the concepts purely based on Wikipedia. We present mapping to reference Wikipedia entries in Section 4.2, followed by enhancing Subsumption with Wikipedia in Section 4.3 and constructing hierarchies purely based on Wikipedia in Section 4.4.

4.2 Identifying Reference Wikipedia Entries

Given the set of concepts C acquired by concept extraction, we identify the reference Wikipedia entry for each concept. In particular, we first obtain potential Wikipedia entries by retrieval. We employ the Lemur toolkit to build an index over the entire Wikipedia collection in the ClueWeb09 CatB dataset. Each concept c ∈ C is sent as a query to the index and the top 10 returned Wikipedia pages are kept. The titles of these pages are considered Wikipedia entry candidates for c. We denote these entries as {e_i}, i = 1, ..., 10.

We then select the most relevant Wikipedia entry as the reference Wikipedia entry. Although we have obtained a ranked list of Wikipedia pages for c, the top result is not always the best-suited Wikipedia entry for the search session. For instance, TREC 2010 session 3 is about diabetes education; the top Lemur-returned Wikipedia entry for the concept GDM is "GNOME Display Manager", which is not relevant, while the second-ranked entry "Gestational diabetes" is relevant. We propose to disambiguate among the top returned Wikipedia entries by the following measures.

Figure 4.2: Mapping to the relevant Wikipedia entry. Text in circles denotes Wikipedia entries, while text in rectangles denotes concepts. Based on the context of the current search session, the entry "Gestational diabetes" is selected as the most relevant Wikipedia entry. Therefore, the concept GDM is mapped to "Gestational diabetes", whose super-categories are "Diabetes" and "Health issues in pregnancy".

Cosine Similarity. Selected by the concept extraction component, most concepts in C are meaningful phrases and map exactly to a Wikipedia entry. However, many multiple-word concepts and entries only partially match each other. If they partially match with a good portion, they should still be considered matched. We therefore measure the similarity between a concept c and its candidate Wikipedia entries by cosine similarity. In particular, we represent the concept and the entry as term vectors after stemming and stop word removal. If a candidate entry, i.e., the title of a Wikipedia page, starts with "Category:", we remove the prefix "Category:". The cosine similarity of c and Wikipedia entry candidate e_i is:

Sim(c, e_i) = (v_c · v_{e_i}) / (|v_c| |v_{e_i}|)        (4.1)

where v_c and v_{e_i} are the term vectors of c and e_i respectively.

Mutual Information. To resolve the ambiguity in Wikipedia entry candidates, we select the entry that best fits the current search query q and its search results D. For example, in Figure 4.2, the concept GDM could mean "GNOME Display Manager" or "Gestational Diabetes Mellitus"; given the query diabetes education, only the latter is relevant. We need a measure to indicate the similarity between a candidate entry e_i and the search query. Since the concept set C can be used to represent the search results D, we convert this problem into measuring the similarity between e_i and C. We calculate the mutual information MI(e_i, C) between an entry candidate e_i and the extracted concept set C as described in [4], but with a modified formula for calculating the weight of a concept:

w(c) = log(1 + ctf(c)) · idf(c)        (4.2)

where ctf(c) is the term frequency of concept c with regard to the entire document set, and idf(c) is the inverse document frequency of concept c with regard to the entire document set. It is worth noting that [4] clustered the document set first; therefore, the weight formula in [4] counted the term frequency of a concept with regard to the cluster to which the concept belongs. In addition, the weight formula in [4] slightly biased the weights of terms distributed over many cluster documents by multiplying an extra term

cdf(c, L) = log(N(c, L) + 1)        (4.3)

where N(c, L) is the document frequency of the concept c with regard to the cluster L.

Finally, we aggregate the scores. Each candidate entry is scored by a linear combination of cosine similarity and MI:

score(e_i) = α·Sim(e_i, c) + (1 - α)·MI(e_i, c)        (4.4)

where α is set to 0.8 empirically. The aggregated score considers both the word similarity and the topic relevance of a candidate entry. The highest-scored candidate entry is selected as the reference Wikipedia entry. Figure 4.2 illustrates the procedure of finding the reference Wikipedia entry.
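A minimal sketch of the candidate scoring in Eqs. (4.1), (4.2), and (4.4) follows; the MI computation itself follows [4] and is abstracted behind a function argument here, and all names are illustrative assumptions rather than the exact implementation.

    import math

    def cosine(u, v):
        """Eq. (4.1): cosine similarity of sparse term vectors."""
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    def concept_weight(ctf, idf):
        """Eq. (4.2): concept weight from collection term frequency and idf."""
        return math.log(1 + ctf) * idf

    def pick_reference_entry(concept_vec, candidates, mi, alpha=0.8):
        """Eq. (4.4): candidates is a list of (entry, term_vector) pairs and
        mi(entry) returns MI(e_i, C); returns the highest-scored entry."""
        def score(ev):
            return alpha * cosine(concept_vec, ev[1]) + (1 - alpha) * mi(ev[0])
        return max(candidates, key=score)[0]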

4.3 Improving Stability of Subsumption

Subsumption is a popular approach for building concept hierarchies [39]. It identifies the is-a relationship between two concepts based on conditional probabilities: concept x subsumes concept y if 0.8 < P(x|y) < 1 and P(y|x) < 1. The main weakness of Subsumption is that a minor fluctuation in document frequency may produce the opposite conclusion. For example, in the search results for the query diabetes education, the two concepts type 1 diabetes and type 2 diabetes show very similar document frequencies. Small changes in the search result documents may completely flip the decision from type 1 diabetes subsuming type 2 diabetes to type 2 diabetes subsuming type 1 diabetes. Neither conclusion is reliable or stable.

In this work, we propose to inject Wikipedia category information into Subsumption to build more stable hierarchies. First, we build a concept hierarchy by Subsumption. For the sake of efficiency, we sort all concepts in C by their document frequencies in D from high to low, and compare the document frequency of a concept c with that of every concept having a higher document frequency than c. Since the concepts are all relevant to the same session, we slightly relax the decision condition in Subsumption: for concepts x and y with document frequencies df_x > df_y, we say x potentially subsumes y if

log(1 + df_y) / log(1 + df_x) > 0.6        (4.5)

where df_x and df_y are the document frequencies of concepts x and y respectively, evaluated in D.

Second, based on the reference Wikipedia entries e_x and e_y for concepts x and y, we evaluate every potential subsumption pair (x, y) in the following cases:

e_x is marked as a Wikipedia category: We extract the Wikipedia categories that e_y belongs to, including the case that e_y itself is a Wikipedia category, from e_y's Wikipedia page. Note that e_y may have multiple categories. The list of Wikipedia categories for e_y is called the super-categories of e_y and denoted S_y. "x subsumes y" is confirmed if e_x ∈ S_y.

Neither e_x nor e_y is marked as a Wikipedia category: We extract the Wikipedia categories that contain e_y (e_x) to form its super-category set S_y (S_x). For each s_yi ∈ S_y, we again extract its super-categories and form the super-super-category set SS_y for e_y. Next, we calculate a subsumption score by counting the overlap between SS_y and S_x, normalized by the smaller size of SS_y and S_x. The subsumption score for concepts x and y is defined as:

Score_sub(x, y) = count(s; s ∈ S_x and s ∈ SS_y) / min(|S_x|, |SS_y|)        (4.6)

where count(s; s ∈ S_x and s ∈ SS_y) denotes the number of categories that appear in both S_x and SS_y. If Score_sub(x, y) for a potential subsumption pair (x, y) passes a threshold (set to 0.6), x subsumes y.

e_y is marked as a Wikipedia category but e_x is not: The potential subsumption relationship between x and y is canceled.

By employing Wikipedia to refine and expand the relationships identified by Subsumption, we remove the majority of the noise in hierarchies built by Subsumption. Figure 4.3 demonstrates this procedure.
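The two tests, Eq. (4.5) and Eq. (4.6), reduce to a few lines. The sketch below assumes the category sets are available as Python sets; the names are illustrative.

    import math

    def potentially_subsumes(df_x, df_y):
        """Eq. (4.5): relaxed Subsumption test on document frequencies,
        with df_x > df_y."""
        return math.log(1 + df_y) / math.log(1 + df_x) > 0.6

    def wikipedia_confirms(S_x, SS_y, threshold=0.6):
        """Eq. (4.6): overlap of e_x's super-categories (S_x) with e_y's
        super-super-categories (SS_y), normalized by the smaller set."""
        if not S_x or not SS_y:
            return False
        return len(S_x & SS_y) / float(min(len(S_x), len(SS_y))) > threshold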

Figure 4.3: An example of Wikipedia-enhanced Subsumption. The concepts "Diabetes" and "type 2 diabetes" satisfy Eq. (4.5) and are identified as a potential subsumption pair. The reference Wikipedia entry of "Diabetes" is a category, and the reference Wikipedia entry of "type 2 diabetes" is the Wikipedia entry "Diabetes mellitus type 2". Therefore, we check whether "Diabetes" is one of the super-categories of "Diabetes mellitus type 2" and confirm that diabetes subsumes type 2 diabetes.

4.4 Building Concept Hierarchies Purely Based on Wikipedia

This section describes how to build SRC hierarchies purely based on Wikipedia. We observed that categories on the same topic often share common super-categories or common subcategories. This inspired us to create hierarchies by joining Wikipedia subtrees. The algorithm is described as follows (a condensed sketch appears after the four steps):

First, identify the start categories. For each concept c ∈ C, we collect all Wikipedia categories that c's reference Wikipedia entry belongs to. We call these categories start categories. If an entry is marked as a category, it is itself the start category.

Second, expand from the start categories. For each start category, we extract its subcategories from its Wikipedia page. Among these subcategories, we choose those relevant to the current query for further expansion. The relevance of (e_i, q) is measured by the MI measure described in Section 4.2, and the subcategories with an MI score higher than a threshold (set to 0.9) are kept.

Figure 4.4: An example of Wikipedia-only hierarchy construction. From the concept "Diabetes mellitus" we find the reference Wikipedia entry "Diabetes mellitus", then we find its start category "Diabetes". Similarly, for another concept, "joslin", we find its reference Wikipedia entry "Joslin Diabetes Center" and its start category "Diabetes organizations". We then expand from these two start categories. "Diabetes organizations" is one of the subcategories of "Diabetes", thus we merge them together.

For the sake of efficiency as well as hierarchy quality, we expand the subcategories to at most three levels. Since concepts in the search session share many start categories, expanding to a limited number of levels hardly misses relevant categories. At the end of this step, we generate a forest of trees consisting of all concepts in C as well as their related Wikipedia categories.

Third, select the right nodes to merge the trees. We apply the MI score described in Section 4.2 to determine which super-category fits the search session and assign the common node as its child. For example, the start categories "Diabetes" and "Medical and health organizations by medical condition" share a common child node "Diabetes organizations", which is a start category too; "Diabetes" is selected as the super-category of "Diabetes organizations". The trees that have common nodes get connected and form a larger hierarchy.

Last, clean up the hierarchy. For every internal node in the joined structure, we traverse downwards to the leaves. Along the way, we trim the nodes that have no offspring in the concept set C, eliminating noise that is irrelevant to the current query. Figure 4.4 shows the Wikipedia-only algorithm.
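As referenced above, here is a condensed sketch of the four steps. The helpers reference_entry(c), categories_of(e), subcategories_of(cat), and mi_score(node) stand in for lookups against the Wikipedia data and the MI measure of Section 4.2; none of these names come from the actual implementation, and this is a sketch under those assumptions rather than the algorithm verbatim.

    def build_wiki_hierarchy(concepts, reference_entry, categories_of,
                             subcategories_of, mi_score, max_depth=3):
        # Step 1: collect the start categories of each concept's reference entry.
        start = {cat for c in concepts
                 for cat in categories_of(reference_entry(c))}
        # Step 2: expand each start category up to three levels, keeping only
        # subcategories whose MI relevance passes the threshold (0.9).
        edges, frontier = set(), set(start)
        for _ in range(max_depth):
            next_level = set()
            for cat in frontier:
                for sub in subcategories_of(cat):
                    if mi_score(sub) > 0.9:
                        edges.add((cat, sub))
                        next_level.add(sub)
            frontier = next_level
        # Step 3: trees sharing a node are merged through the shared edges;
        # when a node has several parents, keep the parent with the best MI.
        best_parent = {}
        for parent, child in edges:
            if child not in best_parent or mi_score(parent) > mi_score(best_parent[child]):
                best_parent[child] = parent
        # Step 4 (omitted here): trim nodes with no descendant in the concept set.
        return best_parent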

4.5 Evaluation for Search Result Organization

We evaluate our approach using the datasets of the TREC 2010 and 2011 Session tracks. For each q, to obtain its search results D, we retrieve the top 1000 documents returned by Lemur from an index built over the ClueWeb09 CatB collection. All relevant documents identified by TREC assessors are merged into the result set. Table 4.1 summarizes the data used in this evaluation.

Table 4.1: Statistics of the TREC 2010 and TREC 2011 Session track datasets.

Dataset      #sessions    #q    #q per session    #doc
TREC 2010
TREC 2011
Total

We compare our approaches, Subsumption+Wikipedia (Section 4.3) and Wikipedia-only (Section 4.4), with the following systems:

Clusty (now Yippy): We could not re-implement Clusty's algorithm. Instead, we sent queries to yippy.com and saved the hierarchies.

Hierarchical clustering: We employ WEKA (version 3.6.6), using bottom-up hierarchical clustering based on cosine similarity, to form hierarchical document clusters and then assign labels to the clusters. The labeling is done by a highly effective cluster labeling algorithm [4].

Subsumption: A popular monothetic concept hierarchy construction algorithm [39], used as the baseline. We modify Subsumption's decision parameters to suit our dataset; in particular, we consider that x subsumes y if P(x|y) ≥ 0.6 and P(y|x) < 1.

4.5.1 Hierarchy Stability

To quantitatively evaluate the stability of SRC hierarchies, we compare the similarity between the SRC hierarchies created within one search session. Given a query session Q with queries q_1, q_2, ..., q_n, the stability of the SRC hierarchies for Q is measured by the average of the pairwise hierarchy similarity between unique query pairs in Q. It is defined as:

Stability(Q) = 2 / (n(n-1)) · Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} Sim_hie(H_i, H_j)        (4.7)

where n is the number of queries in Q, H_i and H_j are the SRC hierarchies for queries q_i and q_j, and Sim_hie(H_i, H_j) is the hierarchy similarity between H_i and H_j. We apply three methods to calculate Sim_hie(H_i, H_j) (a sketch of the stability computation follows the definitions). Suppose there are M nodes in H_i and N nodes in H_j:

node overlap: measures the percentage of identical nodes in H_i and H_j, normalized by min(M, N).

parent-child precision: measures the percentage of similar parent-child pairs in H_i and H_j, normalized by min(M, N).

fragment-based similarity (FBS) [48]: given two hierarchies H_i and H_j, FBS compares their similarity by calculating

FBS(H_i, H_j) = 1 / max(M, N) · Σ_{p=1}^{m} Sim_cos(c_ip, c_jp)        (4.8)

where c_ip ∈ H_i, c_jp ∈ H_j, and they are the p-th matched pair among the m matched fragment pairs.
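A minimal sketch of Eq. (4.7) and the node overlap measure, assuming each hierarchy exposes its node labels as a set; parent-child precision and FBS would be passed in the same way, and the names are illustrative.

    def node_overlap(nodes_i, nodes_j):
        """Identical nodes in the two hierarchies, normalized by min(M, N)."""
        return len(nodes_i & nodes_j) / float(min(len(nodes_i), len(nodes_j)))

    def stability(hierarchies, sim_hie):
        """Eq. (4.7): average pairwise similarity over the unique query pairs
        of one session; sim_hie is any of the three similarity measures."""
        n = len(hierarchies)
        if n < 2:
            return 1.0
        total = sum(sim_hie(hierarchies[i], hierarchies[j])
                    for i in range(n) for j in range(i + 1, n))
        return 2.0 * total / (n * (n - 1))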

These three metrics measure different aspects of two hierarchies. Node overlap measures the content difference between hierarchies and ignores structural differences. Parent-child precision measures local content and structural differences and is a very strict measure. FBS considers both content and structural differences; it measures differences at the fragment level and tolerates minor changes in hierarchies.

Table 4.2 and Table 4.3 summarize the stability evaluation over the TREC 2010 and 2011 datasets, respectively. The most stable hierarchies are generated by the proposed approaches, which statistically significantly outperform Subsumption in terms of stability in FBS on the evaluation datasets. Not only our approaches but also Subsumption tremendously improves the stability of SRC hierarchies as compared to Clusty. Our observation is that, because a monothetic concept hierarchy approach acquires concepts directly from the search results, it probably learns from a more complete dataset rather than a segment of the data (one cluster) and is able to avoid minor changes.

Figure 4.5 and Figure 4.6 exhibit the major clusters in the SRC hierarchies for TREC 2010 session 3 generated by Clusty and Wiki-only (Section 4.4), respectively. The queries are "diabetes education" and "diabetes education videos books". We observe that the Clusty hierarchies (Figure 4.5(a)(b)) are less stable than those built by Wiki-only (Figure 4.6(a)(b)). For example, Clusty groups the search results by types of services (Figure 4.5(a)); however, "Blood Sugar", a test indicator of diabetes and not a type of service, is added after the query is slightly changed (Figure 4.5(b)). Moreover, the largest cluster in Figure 4.5(a), "Research", disappears completely in Figure 4.5(b). These changes make the Clusty hierarchies less stable and less desirable. The Wiki-only approach (Figure 4.6(a)(b)), which employs an external knowledge base, better maintains a single classification dimension, in this case types of diabetes, and is easy to follow.

Table 4.2: Stability of search result organization for TREC 2010 Session queries. Approaches are compared to the baseline, Subsumption; a significant improvement over the baseline is indicated at the p < 0.05 level and at a stricter significance level (t-test, single-tailed).

Method                       FBS               Node overlap      Parent-child precision
                             Average   %chg    Average   %chg    Average   %chg
Clusty
Hierarchical clustering
Subsumption
Subsumption + Wikipedia
Wikipedia only

Table 4.3: Stability of search result organization for TREC 2011 Session queries. Approaches are compared to the baseline, Subsumption; a significant improvement over the baseline is indicated at the p < 0.05 level and at a stricter significance level (t-test, single-tailed).

Method                       FBS               Node overlap      Parent-child precision
                             Average   %chg    Average   %chg    Average   %chg
Clusty
Hierarchical clustering
Subsumption
Subsumption + Wikipedia
Wikipedia only

Figure 4.5: Major clusters in the hierarchies built by Clusty for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".

Figure 4.6: Major clusters in the hierarchies built by Wiki-only for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".

4.5.2 Hierarchy Quality

One may argue that perfect stability can be achieved by a static SRC hierarchy that ignores query changes in a session. To avoid evaluating SRC hierarchies only by stability while sacrificing other important features, such as hierarchy quality, we manually evaluate the hierarchies. In particular, we compare two approaches, Subsumption and Subsumption+Wikipedia, to see how much quality improvement is gained by adding Wikipedia information.

Figure 4.7 and Figure 4.8 illustrate the major clusters in the hierarchies built for TREC 2010 session 3 by Subsumption (Section 2.3.2) and Subsumption+Wiki (Section 4.3). We observe errors in Figure 4.7(a): "Type 1 diabetes" is misplaced under "type 2 diabetes", while Figure 4.8(a) corrects this relationship and both concepts are correctly identified under "diabetes". Moreover, we find that the hierarchies created with Wikipedia (Figure 4.8(a)(b)) exhibit higher stability than those created by Subsumption only (Figure 4.7(a)(b)). For example, in Figure 4.7, "type 2 diabetes" becomes the root of the hierarchy when the query changes, while in Figure 4.8 the main structure of the hierarchy, "Diabetes" with the two children "type 2 diabetes" and "Type 1 diabetes", is maintained.

We further compare the hierarchies generated by Subsumption+Wiki (Section 4.3) and Wiki-only (Section 4.4). The Wiki-only approach generates more stable hierarchies because it utilizes Wikipedia entries, which are standardized concepts, to connect the concepts extracted from the search results. This may cause a high overlap between the hierarchies generated from queries about a similar topic. On the contrary, the hierarchies generated by the Subsumption+Wiki approach are more related to the query, because it primarily builds the relations between the concepts extracted from the search results and only uses Wikipedia to filter out the inappropriate relations.

Figure 4.7: Major clusters in the hierarchies built by Subsumption for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".

Figure 4.8: Major clusters in the hierarchies built by Subsumption+Wiki for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".

Figure 4.9: Search result organization quality improvement vs. stability for Subsumption and Subsumption+Wiki.

Quantitatively, we measure the quality improvement of Subsumption+Wiki over Subsumption by checking the correctness of the parent-child concept pairs in a hierarchy H as:

( (count_{w,corr} - count_{w,err}) - (count_{s,corr} - count_{s,err}) ) / (count_w + count_s)        (4.9)

where count is the number of concept pairs in H, w denotes Subsumption+Wikipedia, s denotes Subsumption, corr marks the correct pairs, and err marks the incorrect pairs. Figure 4.9 plots the quality improvement vs. stability for Subsumption and Subsumption+Wiki over all evaluated query sessions. Stability is measured by the number of different parent-child pairs in the corresponding hierarchies generated by these two approaches.
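Read this way, Eq. (4.9) is a one-line computation; the sketch below mirrors the reconstruction above, with illustrative names.

    def quality_improvement(w_corr, w_err, s_corr, s_err, count_w, count_s):
        """Eq. (4.9): net correctness gain of Subsumption+Wikipedia (w) over
        Subsumption (s), normalized by the total number of concept pairs."""
        return ((w_corr - w_err) - (s_corr - s_err)) / float(count_w + count_s)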

Figure 4.10: Extreme case 1. A totally static hierarchy for two queries in a session (TREC 2010 session 107).

Figure 4.9 demonstrates that quality and stability can correlate well. Moreover, we calculate the Spearman's rank correlation coefficient [43] and the Pearson's correlation coefficient [43] between quality improvement and stability, which support this observation.

Queries change only slightly within a session, hence the user may not expect a totally static hierarchy in session search. Figure 4.10 and Figure 4.11 show two extreme cases. In the first case, the two queries in a session are "elliptical trainer" and "elliptical trainer benefits" (TREC 2010 session 107). The hierarchies are exactly the same for these two queries, but the user may want a more detailed hierarchy about benefits for the query "elliptical trainer benefits". With hierarchies that are not stable, the user would not be satisfied either, as shown in Figure 4.11 (TREC 2010 session 75). Therefore, in these extreme cases, the quality of the hierarchies is poor regardless of stability, shown as the red line in Figure 4.9. The comparison indicates that our proposed techniques increase the quality of hierarchies while improving stability.

Figure 4.11: Extreme case 2. A totally different hierarchy for two queries in a session (TREC 2010 session 75).

4.6 Chapter Summary

This chapter presents a system framework that can generate stable hierarchies for session search. Because the query usually changes little within a session, stability is required for the search result organization. Our system first extracts the concepts from the document set, and then uses the concepts as the nodes to build the hierarchy. We present two approaches that exploit Wikipedia categories to improve the stability of the hierarchy. The first one corrects the mistaken relationships generated by Subsumption, while the second one builds the hierarchy purely from the Wikipedia categories related to the concepts. The monothetic concept hierarchy approaches show a significant improvement in stability over the hierarchical clustering approaches. The evaluation further shows that the Wikipedia category information increases not only the stability but also the quality of the hierarchy.


More information

On Duplicate Results in a Search Session

On Duplicate Results in a Search Session On Duplicate Results in a Search Session Jiepu Jiang Daqing He Shuguang Han School of Information Sciences University of Pittsburgh jiepu.jiang@gmail.com dah44@pitt.edu shh69@pitt.edu ABSTRACT In this

More information

Routing and Ad-hoc Retrieval with the. Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers. University of Dortmund, Germany.

Routing and Ad-hoc Retrieval with the. Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers. University of Dortmund, Germany. Routing and Ad-hoc Retrieval with the TREC-3 Collection in a Distributed Loosely Federated Environment Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers University of Dortmund, Germany

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor

More information

Experiments with ClueWeb09: Relevance Feedback and Web Tracks

Experiments with ClueWeb09: Relevance Feedback and Web Tracks Experiments with ClueWeb09: Relevance Feedback and Web Tracks Mark D. Smucker 1, Charles L. A. Clarke 2, and Gordon V. Cormack 2 1 Department of Management Sciences, University of Waterloo 2 David R. Cheriton

More information

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Jeffrey Dalton University of Massachusetts, Amherst jdalton@cs.umass.edu Laura

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Jianyong Wang Department of Computer Science and Technology Tsinghua University Jianyong Wang Department of Computer Science and Technology Tsinghua University jianyong@tsinghua.edu.cn Joint work with Wei Shen (Tsinghua), Ping Luo (HP), and Min Wang (HP) Outline Introduction to entity

More information

CPSC W1: Midterm 1 Sample Solution

CPSC W1: Midterm 1 Sample Solution CPSC 320 2017W1: Midterm 1 Sample Solution January 26, 2018 Problem reminders: EMERGENCY DISTRIBUTION PROBLEM (EDP) EDP's input is an undirected, unweighted graph G = (V, E) plus a set of distribution

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

Skill. Robot/ Controller

Skill. Robot/ Controller Skill Acquisition from Human Demonstration Using a Hidden Markov Model G. E. Hovland, P. Sikka and B. J. McCarragher Department of Engineering Faculty of Engineering and Information Technology The Australian

More information

Using a Medical Thesaurus to Predict Query Difficulty

Using a Medical Thesaurus to Predict Query Difficulty Using a Medical Thesaurus to Predict Query Difficulty Florian Boudin, Jian-Yun Nie, Martin Dawes To cite this version: Florian Boudin, Jian-Yun Nie, Martin Dawes. Using a Medical Thesaurus to Predict Query

More information

GEO BASED ROUTING FOR BORDER GATEWAY PROTOCOL IN ISP MULTI-HOMING ENVIRONMENT

GEO BASED ROUTING FOR BORDER GATEWAY PROTOCOL IN ISP MULTI-HOMING ENVIRONMENT GEO BASED ROUTING FOR BORDER GATEWAY PROTOCOL IN ISP MULTI-HOMING ENVIRONMENT Duleep Thilakarathne (118473A) Degree of Master of Science Department of Electronic and Telecommunication Engineering University

More information

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Sungbin Choi, Jinwook Choi Medical Informatics Laboratory, Seoul National University, Seoul, Republic of

More information

Clustering (COSC 488) Nazli Goharian. Document Clustering.

Clustering (COSC 488) Nazli Goharian. Document Clustering. Clustering (COSC 488) Nazli Goharian nazli@ir.cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

More information

indexing and query processing. The inverted le was constructed for the retrieval target collection which contains full texts of two years' Japanese pa

indexing and query processing. The inverted le was constructed for the retrieval target collection which contains full texts of two years' Japanese pa Term Distillation in Patent Retrieval Hideo Itoh Hiroko Mano Yasushi Ogawa Software R&D Group, RICOH Co., Ltd. 1-1-17 Koishikawa, Bunkyo-ku, Tokyo 112-0002, JAPAN fhideo,mano,yogawag@src.ricoh.co.jp Abstract

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

A New Technique to Optimize User s Browsing Session using Data Mining

A New Technique to Optimize User s Browsing Session using Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Retrieval and Feedback Models for Blog Distillation

Retrieval and Feedback Models for Blog Distillation Retrieval and Feedback Models for Blog Distillation CMU at the TREC 2007 Blog Track Jonathan Elsas, Jaime Arguello, Jamie Callan, Jaime Carbonell CMU s Blog Distillation Focus Two Research Questions: What

More information

Probabilistic Learning Approaches for Indexing and Retrieval with the. TREC-2 Collection

Probabilistic Learning Approaches for Indexing and Retrieval with the. TREC-2 Collection Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection Norbert Fuhr, Ulrich Pfeifer, Christoph Bremkamp, Michael Pollmann University of Dortmund, Germany Chris Buckley

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES A. Likas, K. Blekas and A. Stafylopatis National Technical University of Athens Department

More information

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Romain Deveaud 1 and Florian Boudin 2 1 LIA - University of Avignon romain.deveaud@univ-avignon.fr

More information

Retrieval and Feedback Models for Blog Distillation

Retrieval and Feedback Models for Blog Distillation Retrieval and Feedback Models for Blog Distillation Jonathan Elsas, Jaime Arguello, Jamie Callan, Jaime Carbonell Language Technologies Institute, School of Computer Science, Carnegie Mellon University

More information

Search Engine Architecture II

Search Engine Architecture II Search Engine Architecture II Primary Goals of Search Engines Effectiveness (quality): to retrieve the most relevant set of documents for a query Process text and store text statistics to improve relevance

More information

Co-Channel Interference in Bluetooth Piconets

Co-Channel Interference in Bluetooth Piconets Co-Channel Interference in Bluetooth Piconets By Jamel Pleasant Lynch Jr. Committee Chairman: Dr. Brian Woerner ABSTRACT Bluetooth is an emerging short-range RF wireless voice and data communication technology

More information

TREC-3 Ad Hoc Retrieval and Routing. Experiments using the WIN System. Paul Thompson. Howard Turtle. Bokyung Yang. James Flood

TREC-3 Ad Hoc Retrieval and Routing. Experiments using the WIN System. Paul Thompson. Howard Turtle. Bokyung Yang. James Flood TREC-3 Ad Hoc Retrieval and Routing Experiments using the WIN System Paul Thompson Howard Turtle Bokyung Yang James Flood West Publishing Company Eagan, MN 55123 1 Introduction The WIN retrieval engine

More information