Extracting Search-Focused Key N-Grams for Relevance Ranking in Web Search


Chen Wang, Fudan University, Shanghai, P.R. China; Hang Li, Microsoft Research Asia, Beijing, P.R. China; Keping Bi, Peking University, Beijing, P.R. China; Guihong Cao, Microsoft Corporation, Redmond, WA, USA; Yunhua Hu, Microsoft Research Asia, Beijing, P.R. China

ABSTRACT
In web search, relevance ranking of popular pages is relatively easy, because of the inclusion of strong signals such as anchor text and search log data. In contrast, with less popular pages, relevance ranking becomes very challenging due to a lack of information. In this paper the former are referred to as head pages, and the latter as tail pages. We address the challenge by learning a model that can extract search-focused key n-grams from web pages, and using the key n-grams for searches of the pages, particularly the tail pages. To the best of our knowledge, this problem has not been previously studied. Our approach has four characteristics. First, key n-grams are search-focused in the sense that they are defined as those which can compose good queries for searching the page. Second, key n-grams are learned in a relative sense using learning to rank techniques. Third, key n-grams are learned using search log data, such that the characteristics of key n-grams in the search log data, particularly in the heads, can be applied to the other data, particularly to the tails. Fourth, the extracted key n-grams are used as features of the relevance ranking model, which is also trained with learning to rank techniques. Experiments validate the effectiveness of the proposed approach with large-scale web search datasets. The results show that our approach can significantly improve relevance ranking performance on both heads and tails, and particularly on tails, compared with baseline approaches. The characteristics of our approach have also been fully investigated through comprehensive experiments.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

This work was conducted at Microsoft Research Asia when the first two authors were interns there.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WSDM '12, February 8-12, 2012, Seattle, Washington, USA. Copyright 2012 ACM /12/02...$

General Terms
Algorithms, Experimentation, Performance

Keywords
Search Relevance, Ranking, Tail Page, Key N-Gram Extraction, Learning to Rank

1. INTRODUCTION
In web search, the relevance ranking model, which can be either manually created or automatically learned, assigns scores representing the degree of a web page's relevance with respect to the query, and ranks pages according to score. The ranking model utilizes information such as the frequency of query words in the title, body, URL, anchor text and search log data in determining relevance. There are popular web pages that include rich information such as anchor text and search log data. For these pages, it is easy for the ranking model to predict relevance with respect to a query and thereby assign reliable relevance scores. In contrast, there are also less popular web pages that lack sufficient information, and in these instances it is challenging to accurately calculate relevance. In this paper, we refer to web pages with a lot of anchor text and associated queries in search log data as head pages. Web pages with little anchor text and few associated queries are called tail pages.
In the distribution of web page visits, head pages have high visit frequencies, while tail pages have low visit frequencies. In this paper, we aim to solve the problem of improving tail page relevance, currently one of the most challenging problems in web search. Our approach has the following characteristics: 1) extracting search-focused information from web pages; 2) taking key n-grams as the representation of search-focused information; 3) employing learning to rank to train the extraction model using search log data; 4) employing learning to rank to train a model for relevance ranking using search-focused key n-grams as features. We deal with tail page relevance by extracting good queries, i.e., queries most suitable for searching the page, assuming that the data sources for the extraction only include the title, URL, and page body. (A more difficult question is how to automatically generate good queries instead of extracting them; we leave this problem for future work.) Such information is available for any page, including tail pages. When searching with a page's good queries, both the page and the queries should be highly relevant. We refer to this as search-focused extraction.

We extract search-focused key n-grams from web pages and use them to improve relevance, particularly tail page relevance. The key n-grams should compose good queries for searching the pages. There are two reasons why we chose key n-gram extraction rather than keyphrase extraction. First, conventional relevance models, whether or not they are created by machine learning, usually employ only n-grams from queries and documents. Therefore, the extraction of key n-grams is adequate for the purpose of enhancing ranking model performance. Second, the use of n-grams eliminates the need to segment queries and documents, and thus frees us from segmentation errors. We further employ a learning to rank approach for the extraction of key n-grams. The problem is formalized as ranking the n-grams of a given web page. The importance of key n-grams is only meaningful in a relative sense, and thus one does not need to make hard categorization decisions between important and unimportant n-grams, following the ranking formalization we proposed earlier in [14]. We use n-gram positions and web page HTML tags, as well as term frequencies, as features in the learning to rank model. We take search log data as training data for learning the extraction model (one can also consider taking anchor text data as training data). The premise is that the statistical properties of a page's good queries can be learned and applied across different web pages. The objective of learning is an exact and accurate extraction of search-focused key n-grams, because the page-associated queries are sets of key n-grams for search. The abundance of search log data available for head pages means that the extraction model is learned primarily from those pages. In this way, we can extend the knowledge acquired from head pages to tail pages, thereby effectively addressing the challenge of determining tail page relevance.
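As an illustration of why n-grams sidestep segmentation, here is a minimal sketch of our own (not code from the paper): every contiguous window of tokens up to n = 3 is a candidate, so no segmentation decision is ever made.

```python
def ngrams(tokens, max_n=3):
    """Enumerate all contiguous n-grams (n <= max_n) of a token sequence.
    Every window is a candidate, so no segmentation step is needed."""
    result = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result

print(ngrams(["star", "wars", "lego"]))
# ['star', 'wars', 'lego', 'star wars', 'wars lego', 'star wars lego']
```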
Note that the learned model can also help improve the relevance of head pages. The extracted key n-grams come with scores representing their strength on a given page. We further take a learning to rank approach to train the relevance ranking model, using the key n-grams and their scores as additional ranking model features. Those n-grams are also utilized in the original ranking model, as they are from the same page, and thus their contributions to the ranking model are further enhanced by our approach. We have conducted experiments with two large-scale web search datasets to validate the effectiveness of the proposed approach. Each dataset contains more than 10,000 queries and approximately 1,000,000 documents. There are, on average, 65 documents with relevance judgments for each query. All the evaluations are conducted on search relevance ranking in terms of MAP and NDCG. Results show that our approach can significantly improve search relevance on both head and tail pages, with particularly greater improvement on tail pages. Our method also works better than a conventional keyphrase extraction method, indicating that it is better to employ a search-focused approach. We studied several important characteristics of our approach in both the key n-gram extraction phase and the relevance ranking phase. The results show that it is better to formalize key n-gram extraction as a ranking problem rather than a classification problem. The results also show that our approach trained with search log data performs comparably to one trained with human-labeled data. Moreover, the performance of our approach saturates as the size of the search log data increases. These results suggest that head page search log data, which is of low cost, can be effectively leveraged to train a model for improving search relevance.
We have found that reasonably good relevance ranking results can be achieved with unigrams alone, and that performance can be further improved when bigrams and trigrams are also included. Furthermore, we have found that extracting the top 20 key n-grams achieves the best performance in relevance ranking. In addition, we have observed that the use of key n-gram scores can further enhance relevance ranking. The contributions of this paper are as follows. We have proposed search-focused key n-gram extraction from web pages to enhance search relevance, particularly tail page relevance (to the best of our knowledge, this problem has not been studied before). We have developed a method for extracting search-focused key n-grams using learning to rank and search log data. We have significantly improved search relevance using the extracted key n-grams, particularly for tail pages. We have also conducted comprehensive studies to investigate the characteristics of our approach. The rest of the paper is organized as follows. Sec. 2 introduces related work. Sec. 3 gives the motivation for the problem. We introduce our proposed approach in Sec. 4 and present experimental results in Sec. 5. We offer discussion in Sec. 6. Sec. 7 concludes the paper and provides suggestions for future work.

2. RELATED WORK
2.1 Relevance Ranking
Relevance ranking, one of the most important search engine components, assigns scores representing a document's degree of relevance with respect to the query and ranks the documents according to their scores. Many methods have been proposed as a basis for constructing a relevance ranking model. Traditionally, the ranking model is manually created with a few fine-tuned parameters. Recently, machine learning techniques, called learning to rank, have also been applied to ranking model construction.
Traditional models such as the Vector Space Model [26], BM25 [25], Language Models for IR [18, 23], Markov Random Fields [21], and learning to rank models [19, 20] make use of the n-grams of the queries and documents as features. In fact, with all of these methods, the queries and documents are viewed as vectors of n-grams. Intuitively, if the query's n-grams occur a number of times in the document, then it is likely that the document is relevant to the query. More information about web pages can be utilized in web search; for example, the body, URL, anchor texts and the queries associated in search log data. Previous work has shown that the use of web page (HTML document) titles, anchor texts and URLs can enhance relevance ranking accuracy in web search. For example, Cutler et al. [5] proposed using the structures of HTML documents to improve document retrieval; they linearly combined term frequencies in several fields extracted from an HTML document. In TREC-2003, for example, more than half of the participants considered the use of richer representations of web pages [4]. Hu et al. [11] proposed extracting titles from the bodies of web pages and using the extracted titles as new metadata fields for web search. Search log data is also a useful resource for improving relevance ranking performance. Agichtein et al. [1], for example, showed that using a web page's associated queries in search log data can enhance the accuracy of the page's relevance ranking. Wang et al. [31] indicated that different web page fields usually have different characteristics and should be represented by different language models. Bendersky et al. [2] proposed employing a discriminative method for extracting and weighting terms or phrases in pseudo relevance feedback and then exploiting the weighted terms in relevance ranking. They focused on term weighting from the query side and did

not make use of information from the document side. Carvalho et al. [6] proposed finding phrasal terms in documents to enhance relevance ranking; to extract phrasal terms, they utilized statistical features and an SVM classifier. To the best of our knowledge, no previous work has studied the problem of extracting and using key n-grams from web pages for relevance ranking.

2.2 Keyphrase Extraction
Our work is also related to keyphrase extraction, given that n-grams represent phrase fragments. A number of authors, such as Frank et al., Witten et al., and Turney [7, 28, 29, 32], treat keyphrase extraction as a classification problem. In this approach, document phrases are labeled as keyphrases or non-keyphrases, a classifier is trained using the labeled data, and the classifier is then used to categorize phrases within a new document as keyphrases or non-keyphrases. Several learning methods have been applied to train the classifier, including naive Bayes [7, 32], decision trees [28, 29], rule induction [12], and neural networks [27]. The KEA tool, based on naive Bayes, is publicly available for use in keyphrase extraction studies. Recently, we [14] proposed formalizing keyphrase extraction as a ranking problem and learning a ranker to rank phrases by the degree to which they are keyphrases; the learning to rank method Ranking SVM is employed in the construction of the ranker. Another line of work proposes the use of log data as training data for keyphrase extraction. Irmak et al. [13] extracted key terms from user log data in a user-centric entity detection system and used learning to rank to learn the extraction model. Paranjpe [22] proposed a similar method which utilizes search log data to learn the aboutness of a document represented by words and phrases; a regression model was used. The evaluation of the extracted keyphrases in both studies was limited to browsing scenarios.
Our work differs from existing work in the following points: 1) our goal is to enhance search relevance ranking; 2) we extract key n-grams instead of keyphrases.

2.3 Search Log Mining for Ranking
Search log data contains queries and clicked URLs. Each web page's URL is associated with a number of queries, and the number of times the page is clicked in each query's searches is also recorded. Search log data has been used for relevance ranking in web search; for example, as training data for creating a ranking model [15, 24], and as features of a relevance ranking model [1]. Search log data is usually only available for popular queries and pages. Gao et al. [8] proposed two smoothing methods to propagate click numbers along the click-through graph. In this paper, we use search log data as training data for key n-gram extraction. The model learned by our method can even be applied to web pages (tail pages) without any clicks or anchor text.

3. MOTIVATION
Head pages usually have a lot of anchor text and associated queries in search log data, which can serve as strong signals for judging page relevance with respect to queries. Modern search engines can effectively leverage these signals and perform very well in head page searches. In contrast, tail pages are usually limited to just a few anchor texts and associated queries, or even none at all. As a result, searching such pages tends to be difficult. Tail pages, including new pages, personal homepages, and discussion pages, are of high value to users. These pages need to be easily found by those who want to search for them.

Figure 1: Similar appearances of queries in an example head page (top) and tail page (bottom)
Head page queries (clicks): 1. us citizenship and immigration service (~10,000) 2. us citizenship (~10,000) 3. naturalization (~6,000) 4. green card (~100) 5. immigration (~4,000) 6. employee immigration (~10)
Tail page queries (clicks): 1. oregon (3) 2. office of private health partnerships (11) 3. oregon health insurance (2)

According to a study by Goel et al. [9], search of tail pages is an important element in enhancing user satisfaction, and good search results on tail pages can play a central role in winning users' loyalty to a particular search engine. If, for example, a user wants to find his friend's homepage, his satisfaction will be quite high if it can be found immediately by the search engine. Our statistics on a collection of web pages show that more than 75% are tail pages with fewer than one anchor text and associated query. This illustrates how tail page relevance has become one of the most challenging issues in web search. Our proposal is to extract or generate good queries for each page, including tail pages, which should be most suitable for searching the page. The extraction or generation uses the page title, URL and body, all of which are available for tail pages. When searched with its good queries, the page should rank highly. Berger et al. [3] proposed taking a translation approach to document retrieval; the extraction or generation here can likewise be viewed as translating a web page into a number of keyphrases. We refer to this task as search-focused keyphrase extraction or generation. In this paper, we focus on extraction. We accomplish our task by two means: learning from search log data and key n-gram extraction. First, we take search log data as training data for the extraction. A web page's associated queries can be viewed as good queries for searching the page, and the data can be used to train a model. Since head pages have more click data, we end up learning the model primarily from head pages and applying it to tail pages.

Second, we consider key n-gram extraction an approximation of keyphrase extraction. Queries, particularly long queries such as "star wars anniversary edition lego darth vader fighter", are difficult to segment. If the query is associated with a page in the search log data, then we take all of the query's n-grams as key n-grams of the page. In this way, we can skip query segmentation, which is difficult to do with a high degree of accuracy. In fact, all current ranking models, even those based on machine learning, usually use only n-grams as features, meaning that key n-gram extraction is sufficient to enhance the model's performance. We assume that key n-grams share the same patterns across different pages, and that we can successfully learn an extraction model from the head and apply it to the tail. Fig. 1 gives an example to illustrate this. The page on the top is a head page with more than 600,000 clicks in the search engine's log within one year. The page on the bottom is a tail page with just 23 clicks. Despite this difference, the queries in the two pages have similar patterns and appearances. In both pages, the queries are located in region A and region C. Specifically, region A includes the page's title and subtitle. Regions B and D include website navigation links. The page's main content is in region C. Region E consists of outer links. We can use the formatting, term frequency and position information to identify whether an n-gram is likely to be a key n-gram.

4. OUR APPROACH
Our approach consists of two parts: key n-gram extraction and relevance ranking. Fig. 2 shows an overview.

4.1 Key N-Gram Extraction
Pre-Processing
We assume that the objects to be searched and ranked by the search engine are web pages. During pre-processing, a web page in HTML format is parsed and represented as a sequence of tags/words. Then all the words are converted into lower case, and stop words in a list1 are removed.
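The pre-processing step can be sketched as follows. This is our own minimal illustration, using a tiny stand-in stop list rather than the full SMART list cited in the footnote, and Python's standard HTML parser rather than whatever parser the authors used.

```python
from html.parser import HTMLParser

STOP_WORDS = {"the", "a", "an", "of", "and"}  # stand-in for the full stop list

class PageTokenizer(HTMLParser):
    """Parse an HTML page into a flat sequence of tags and lower-cased,
    stop-word-filtered words, as in the pre-processing step above."""

    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        self.sequence.append(("TAG", tag))

    def handle_data(self, data):
        for word in data.lower().split():
            if word not in STOP_WORDS:
                self.sequence.append(("WORD", word))

parser = PageTokenizer()
parser.feed("<h1>The Experimental Result</h1>")
print(parser.sequence)  # [('TAG', 'h1'), ('WORD', 'experimental'), ('WORD', 'result')]
```

Keeping the tags in the sequence preserves the separation and formatting cues that the later feature-generation step depends on.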
In our approach, we define an n-gram as n successive words within a short text separated by punctuation symbols and special HTML tags. Note that some HTML tags provide a natural separation of text, e.g., <h1>Experimental Result</h1> indicates that "Experimental Result" is a short text. Some tags do not imply a separation, e.g., <font color="red">significant</font> improvement.

Training Data Generation
We propose using search log data for training a key n-gram extraction model, because search log data represents users' implicit judgments on the relevance between queries and documents. More specifically, if users search with a query and click a page afterward, and this occurs many times (e.g., beyond a threshold), then it is very likely that the query and the page are relevant. We can consider automatically extracting queries from the page. Head pages generally have a number of associated queries in the search log data. Such data can naturally be used as training data for the automatic extraction of queries, particularly for tail pages. Instead of extracting queries or keyphrases, we extract key n-grams. We treat the n-grams in each of the document's associated queries as its labeled key n-grams. For example, when a document ABDC is associated with the query ABC, we consider unigrams A, B, C and bigram AB to be key n-grams, with the assumption that they should be ranked higher than unigram D and bigrams BD and DC by the extraction model. Features for each n-gram are then extracted, as described in the next subsection, and an extraction model is trained. The advantage of taking this approach is that no segmentation of queries and documents is necessary, which is difficult to carry out accurately. In fact, existing ranking models usually make use of n-grams only as features, and thus extraction of key n-grams is sufficient for enhancing search relevance.

1 ftp://ftp.cs.cornell.edu/pub/smart/english.stop
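The labeling scheme above can be sketched in a few lines. This is our own illustration (the function name is ours): with document ABDC and associated query ABC, exactly the n-grams shared with the query come out labeled as key.

```python
def label_ngrams(doc_tokens, query_tokens, max_n=3):
    """Label each document n-gram 1 (key) if it also occurs contiguously
    in an associated query from the search log, else 0 (non-key)."""
    def grams(tokens):
        return {" ".join(tokens[i:i + n])
                for n in range(1, max_n + 1)
                for i in range(len(tokens) - n + 1)}
    query_grams = grams(query_tokens)
    return {g: int(g in query_grams) for g in grams(doc_tokens)}

# Document ABDC, associated query ABC: A, B, C and "A B" are key;
# D, "B D" and "D C" are not.
labels = label_ngrams(["a", "b", "d", "c"], ["a", "b", "c"])
print(labels["a b"], labels["b d"])  # 1 0
```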
Deriving training data from search log data has the benefit of low cost. An alternative would be to have human labelers assign relevant queries to selected web pages. However, this suffers from two problems: 1) the cost is high; 2) human labelers are not the query owners, making it difficult for them to come up with relevant queries for a page. An approximation of such a method uses conventional relevance judgment data, in which each page is associated with a small number of queries, or even just one query. As will be seen in our experiments, such an approximation does not work better than our method using search log data. Given that human labeling is expensive and log data is very low cost, the use of log data for training data creation is definitely a good practice.

N-Gram Features Generation
Web pages contain rich formatting information compared to plain text. We utilize both textual and formatting information to create features in the extraction (ranking) model in order to accurately extract key n-grams. We have conducted a comprehensive study of the features for the task. Below is a list of features found to be useful in an empirical study of 500 randomly selected pages and the key n-grams associated with them. N-grams may be highlighted with different HTML formatting information, and this formatting information is useful in identifying the importance of n-grams.

1. Frequency Features
The original/normalized term frequencies of an n-gram within several fields, tags and attributes are utilized.
a) Frequency in Fields: The n-gram's frequencies in four web page fields: URL, page title, meta-keyword and meta-description.
b) Frequency within Structure Tags: The frequencies of an n-gram in texts within a header, table or list, indicated by HTML tags including <h1>,..., <h6>, <table>, <li> and <dd>.
c) Frequency within Highlight Tags: The frequencies of an n-gram in texts highlighted or emphasized by HTML tags including <a>, <b>, <i>, <em> and <strong>.
d) Frequency within Attributes of Tags: The frequencies of an n-gram in web page tag attributes. These are hidden texts which are not visible to users. We found, however, that they are still valuable for key n-gram extraction; for example, the title of an image: <img title="Still Life: Vase with Fifteen Sunflowers..." />. Specifically, the title, alt, href and src tag attributes are used.
e) Frequencies in Other Contexts: The frequencies of an n-gram in other contexts, including 1) the page headers, i.e., the n-gram frequency within any of the <h1>,..., <h6> tags; 2) the page meta-data field; 3) the page body; 4) the whole HTML file.

2. Appearance Features
The appearances of n-grams are also important indicators of their importance.
a) Position: The n-gram's position when it first appears in the title, paragraph and document.
b) Coverage: The coverage of an n-gram in the title or a header, e.g., whether the n-gram covers more than 50% of the title.
c) Distribution: The n-gram's distribution across different parts

of a page. The page is separated into several parts, and the entropy of the n-gram across these parts is used.

Figure 2: Framework of our approach (Phase 1: key n-gram extraction; Phase 2: relevance ranking)

Key N-Gram Extraction
Key n-gram extraction is formalized as a learning to rank problem. In learning, a ranking model is trained which can rank n-grams according to their relative importance as key n-grams associated with a given web page. Features are defined and utilized for the ranking of n-grams. In extraction, given a new page and the trained model, the n-grams in the page are ranked with the model, and the top K n-grams are selected as key n-grams of the page. We give a formalization of the learning task. Let X ⊆ R^p be the space of features of n-grams, and let Y = {r_1, r_2,..., r_m} be the space of ranks. There exists a total order among the ranks: r_m ≻ r_{m-1} ≻ ... ≻ r_1. Here, m = 2, representing key n-grams and non-key n-grams. The goal is to learn a ranking function f(x) such that for any pair of n-grams (x_i, y_i) and (x_j, y_j), the following condition holds:

    f(x_i) > f(x_j) ⟺ y_i ≻ y_j    (1)

Here x_i and x_j are elements of X, and y_i and y_j are elements of Y representing the ranks of x_i and x_j. We employ Ranking SVM [10] to learn the ranking function f(x), and we specifically use the SVMRank tool [16]. We assume that f(x) = w^T x is a linear function of x. Given a training set, we first convert it to ordered pairs of n-grams: P = {(i, j) | (x_i, x_j), y_i ≻ y_j}. f(x) is learned by solving the following optimization problem:

    ŵ = argmin_w (1/2) w^T w + c Σ_{(i,j)∈P} ξ_{ij}
    s.t. ∀(i, j) ∈ P: w^T x_i − w^T x_j ≥ 1 − ξ_{ij}, ξ_{ij} ≥ 0    (2)

where ξ_{ij} denotes slack variables and c is a parameter.

4.2 Relevance Ranking
Relevance Ranking Features Generation
In web search, web pages are represented in several fields, also referred to as meta-streams. In this paper, we consider the following meta-streams: URL, page title, page body, meta-keywords, meta-description, anchor texts, queries associated in the search log data, and the key n-gram stream created by our method. The first five meta-streams are extracted from the web page itself, and they reflect the web designer's view. Anchor texts are extracted from other pages, and they represent other web designers' summaries. The query meta-stream consists of users' queries leading to clicks on the page; it provides the search users' view. The key n-gram meta-stream created by our method also provides a web page summary. Note that the key n-grams are extracted based solely on the information from the first five meta-streams. The model for extraction is trained primarily from head pages, using their associated queries as training data, and is applied to tail pages, which may not include anchor texts and associated queries. A ranking model includes query-document matching features that represent the relevance of the document with respect to the query. Popular features include tf-idf, BM25 and minimal span. All can be defined on the meta-streams. Document features which describe the importance of the document itself, such as PageRank and spam score, are also used.
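The pairwise Ranking SVM objective can be illustrated with a toy sketch. This is our own simplified subgradient version, not the SVMRank implementation used in the paper: for each ordered pair (x_i preferred over x_j), it pushes w·x_i − w·x_j above a margin of 1 under L2 regularization.

```python
# Toy subgradient training of a pairwise ranking objective in the
# spirit of Ranking SVM (our own sketch, not SVMRank itself).

def dot(w, x):
    return sum(wk * xk for wk, xk in zip(w, x))

def train_pairwise(pairs, dim, c=1.0, lr=0.1, epochs=100):
    w = [0.0] * dim
    for _ in range(epochs):
        for xi, xj in pairs:  # xi should rank above xj
            margin = dot(w, xi) - dot(w, xj)
            for k in range(dim):
                grad = w[k] / len(pairs)          # regularizer term
                if margin < 1.0:                  # hinge loss is active
                    grad -= c * (xi[k] - xj[k])
                w[k] -= lr * grad
    return w

# Hypothetical 2-d features: key n-grams (first of each pair) vs. non-key
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.2, 0.7])]
w = train_pairwise(pairs, dim=2)
```

After training, the learned w scores the preferred item of each pair higher, which is all the extraction phase needs in order to rank a page's n-grams.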
Given a query and a document, we derive the following query-document matching features from each meta-stream:
a) Unigram/Bigram/Trigram BM25: n-gram BM25 [33] is an extension of traditional unigram-based BM25.
b) Original/Normalized PerfectMatch: The number of exact matches between the query and the text in the stream.
c) Original/Normalized OrderedMatch: The number of continuous words in the stream which can be matched with the words in the query in the same order.
d) Original/Normalized PartiallyMatch: The number of continuous words in the stream which are all contained in the query.
e) Original/Normalized QueryWordFound: The number of words in the query which also appear in the stream.
In addition, PageRank and domain rank scores are used as document features in our model.

Ranking Model Learning and Prediction
We employ learning to rank techniques to automatically construct the ranking model from labeled training data for relevance ranking. Again, we use Ranking SVM as the learning algorithm. In search, given a new query and the retrieved web pages, we generate the query-document matching features between the query and each page's meta-streams, and calculate ranking scores using the learned ranking model. Then we rank all of the web pages in descending order of their ranking scores.
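Two of the matching features above can be sketched in a few lines. This is our own illustration under simple assumptions (whitespace tokenization, distinct query words for QueryWordFound); the paper does not spell out these details.

```python
def query_word_found(query, stream):
    """QueryWordFound: number of distinct query words also found in the stream."""
    stream_words = set(stream.split())
    return sum(1 for w in set(query.split()) if w in stream_words)

def perfect_match(query, stream):
    """PerfectMatch: number of exact occurrences of the whole query in the stream."""
    q, s = query.split(), stream.split()
    return sum(1 for i in range(len(s) - len(q) + 1) if s[i:i + len(q)] == q)

title_stream = "oregon health insurance plans"  # hypothetical title meta-stream
print(query_word_found("oregon insurance", title_stream))  # 2
print(perfect_match("health insurance", title_stream))     # 1
```

The same two functions apply unchanged to the key n-gram meta-stream, which is how the extracted n-grams feed into the ranking model.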

5. EXPERIMENTS

5.1 Experiment Settings

Dataset. We conducted experiments on a large search log dataset and three large-scale relevance ranking datasets. We used a search log collected from a commercial search engine over the course of one year. It contains 76,882,604 distinct URLs and 121,100,491 distinct queries with 10,280,590,031 clicks. To derive training data for key n-gram extraction, the query-document pairs with at least click_min = 2 clicks were kept and the others were discarded. Then N_train = 2,000 HTML pages, each with a minimum of query_min = 5 associated queries, were randomly selected from the search log data. Some data statistics are given in Tab. 1.

Table 1: Dataset used for training of key n-gram extraction
  # of HTML pages                2,000
  Average # of queries per page
  Average # of words per query   3.47

The relevance ranking experiments were conducted on three large-scale datasets which contain queries, documents, and their relevance judgments. The first two sets are comprised of general queries (queries randomly sampled from the search log), and the last set of tail queries (queries randomly sampled from the low-frequency queries of the search log). They are referred to as Training (Training Set), Test1 (Test Set with General Queries), and Test2 (Test Set with Tail Queries). We use the Training Set to learn a relevance ranking model and use Test1 and Test2 to evaluate the model. Each set includes 10,000 queries and about 1,000,000 web pages. The relevance judgments are at five levels: Perfect, Excellent, Good, Fair, and Bad. Some statistics about the three datasets are given in Tab. 2.

Table 2: Datasets used for relevance ranking
  Dataset                     Training    Test1 (General Queries)    Test2 (Tail Queries)
  # of queries                20,429      12,705                     10,991
  # of HTML pages             1,008,480   1,131,…                    …768
  Avg. # of pages/query
  Avg. # of perfect/query
  Avg. # of excellent/query
  Avg. # of good/query
  Avg. # of fair/query
  Avg. # of bad/query
  Avg. # of words/query

Parameters and Evaluation Measures. We first set the ranges of our method's parameters, then created extraction and ranking models under all possible parameter settings, and tested our method's relevance ranking performance on the Test1 and Test2 datasets. We report the best performance on the two test datasets at the best parameter setting. (It turns out that the best parameter settings are the same.) As we later describe in detail, we started from the best parameter setting, changed one parameter at a time, and fixed the others.

In key n-gram extraction, there are three parameters: the number of training pages N_train, the minimum number of clicks between a query and a page click_min, and the minimum number of queries associated with a page query_min. We set the range of N_train as {100, 200, 500, 1000, 2000}, click_min as {1, 2, 5}, and query_min as {2, 5, 10}. The best setting is N_train = 2,000, click_min = 2, and query_min = 5. In relevance ranking, there are two parameters: the number K of top n-grams selected, and the n-gram length n. We set the range of K as {5, 10, 20, 30} and n as {1, 2, 3}. The best setting is K = 20 and n = 3. There is also a parameter c for Ranking SVM, which is used in both key n-gram extraction and relevance ranking. In extraction, c is chosen from {0.001, 0.01, 0.1, 1, 10, 100, 1000}; in ranking, c is selected from {0.01, 0.1, 1, 10, 100, 1000, 10000, …}. Note that the ranges of c in the two phases are quite different because the tasks are quite different. The best settings are c = 0.1 for extraction and c = 1,000 for ranking.

Mean Average Precision (MAP) [30] and Normalized Discounted Cumulative Gain (NDCG) [17] are used as evaluation measures in relevance ranking.

5.2 Experiments on Relevance Ranking

We evaluated our proposed approach for relevance ranking. First, we evaluated overall performance to see whether our method can enhance relevance ranking.
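For reference, NDCG, one of the evaluation measures above, can be computed as in the following sketch (standard 2^rel − 1 gain with log2 discount; the exact variant used in the paper may differ):

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for a ranked list of graded relevance labels,
    e.g. Bad=0 .. Perfect=4. Standard formulation; a sketch rather
    than the paper's exact implementation."""
    def dcg(rels):
        # gain 2^rel - 1, discounted by log2 of the (1-based) rank + 1
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0; swapping a Perfect page below a Bad one lowers the score.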
Then, we compared our method against a popular keyphrase extraction method, KEA [32], used in relevance ranking. We particularly examined how our approach performed on head and tail pages.

Relevance Ranking with Extracted Key N-Grams

We tested the effectiveness of using extracted key n-grams in relevance ranking. One baseline used the ranking model trained with features derived from several meta-streams (see Sec. 4.2 for details) without using extracted n-grams, referred to as the baseline. Note that both our method and the baseline were built with the same web page information; our method further extracts and utilizes additional information, namely search-focused key n-grams, for page ranking. The parameters of our method were at the best setting described above. The experimental results are presented in Fig. 3. They indicate that relevance ranking performance can be significantly enhanced by adding extracted key n-grams into the ranking model. More specifically, our method improves NDCG@1 by 1.92 and 1.94 points on the Test1 and Test2 datasets respectively. The improvements in MAP on Test1 and Test2 are 0.87 and 1.28 points respectively. A t-test shows that the improvements are all statistically significant (p < 0.001).

Ranking Performance Comparison with KEA

Another baseline used KEA [32] for relevance ranking; it takes web pages as input and extracts keyphrases from them. The keyphrases are used as a page meta-stream, and a ranking model with additional features from the meta-stream is trained. There are differences between our method and KEA: our method learns the model from search log data while KEA learns from human-labeled keyphrases, and our method is search-focused while KEA is general-purpose. The KEA model was trained using the data in Jiang et al.'s work [14], which consists of 300 web pages annotated with human-labeled keyphrases.
To make the comparison fair, we took the same set of web pages and their associated queries as training data for our method. The other parameters of our method were set as in the best parameter setting. Note that our approach's performance here is therefore slightly worse than that reported above. The parameters of KEA, such as the number of output keyphrases, were tuned, and we report the best KEA results.

Figure 3: Ranking performance comparison between the baseline (without key n-grams) and our approach (with key n-grams)

From the results shown in Fig. 4, we can see that KEA works better than the baseline but performs worse than our method. The results indicate that relevance ranking can be enhanced by both keyphrase and key n-gram extraction; moreover, it is better to employ search-focused key n-gram extraction than general keyphrase extraction.

Ranking Performance on Head and Tail Pages

Our method is trained mainly with head page data, because head pages have more associated queries with higher click frequencies. It is potentially most useful for relevance ranking of tail pages, because tail pages usually have less anchor text and fewer associated queries. We evaluated our method's performance on head pages and tail pages respectively. We split both the Test1 and Test2 datasets into a Head Page Subset and a Tail Page Subset. When more than 75% of a query's relevant pages (those rated Perfect, Excellent, or Good) had neither other associated queries nor anchor texts, we assigned the query and its associated pages to the Tail Page Subset; otherwise, to the Head Page Subset. The ratio of Head Page Subsets to Tail Page Subsets was about 3:1 for both Test1 and Test2. The results are shown in Fig. 5. We can see that: 1) Our proposed method yields significantly larger improvements on the Tail Page Subset than on the Head Page Subset for both datasets. 2) The improvements made by our method remain statistically significant even on the Head Page Subsets (p < 0.001). 3) The improvements on the Test2 dataset are slightly greater than on the Test1 dataset, indicating that our approach is more effective on tail queries than on general queries. Similar conclusions were drawn when we tested other split ratios such as 50%, 60%, 70%, 80%, and 90%.
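The 75% splitting rule described above can be sketched as follows. The data layout and field names here are illustrative assumptions, not the paper's actual format.

```python
def split_head_tail(queries, threshold=0.75):
    """Split evaluation queries into Head/Tail Page Subsets: a query
    goes to the tail subset when more than `threshold` of its relevant
    pages (Perfect/Excellent/Good) have neither other associated
    queries nor anchor text. Field names are hypothetical."""
    RELEVANT = {"Perfect", "Excellent", "Good"}
    head, tail = [], []
    for q in queries:
        rel = [p for p in q["pages"] if p["label"] in RELEVANT]
        # relevant pages lacking both other queries and anchor text
        sparse = sum(1 for p in rel
                     if not p["has_other_queries"] and not p["has_anchor"])
        if rel and sparse / len(rel) > threshold:
            tail.append(q)
        else:
            head.append(q)
    return head, tail
```

Varying `threshold` gives the alternative split ratios (0.5, 0.6, 0.7, 0.8, 0.9) tested above.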
5.3 Studies on Key N-Gram Extraction

We studied alternatives to our approach to key n-gram extraction.

Figure 4: Ranking performance comparison between KEA and our approach

Figure 5: Ranking performance improvement comparison between head pages and tail pages

Classification vs. Ranking

In our approach, we formalize the extraction problem as one of ranking rather than classification. We compared the two options based on Ranking SVM and SVM. The results, shown in Fig. 6, indicate that using Ranking SVM is more effective than using SVM. A t-test shows that Ranking SVM's improvement over SVM is statistically significant (p < 0.001). This validates the assumption that it is better to formalize the key n-gram extraction problem as one of ranking rather than classification: it is hard to judge key n-grams in an absolute manner, but easy to judge whether one n-gram is more important than another.

Training Data Generation

In our approach, we derive training data from search log data, assuming that a document is relevant to a query if the document is searched and clicked for the query at least click_min times. We evaluated the effectiveness of this data creation method by comparing training with search log data vs. training with human-labeled data.
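The ranking formulation can be reduced to classification over pairs, which is essentially what Ranking SVM does [10, 15]. A minimal sketch of generating such pairwise instances from graded n-grams (the data layout is assumed):

```python
def pairwise_instances(items):
    """Turn graded items (feature_vector, grade) into pairwise training
    instances for a Ranking SVM-style learner: for each pair with
    different grades, emit the feature difference labeled by the
    preference direction. A standard reduction; the paper's exact
    setup may differ."""
    X, y = [], []
    for i, (xi, gi) in enumerate(items):
        for xj, gj in items[i + 1:]:
            if gi == gj:
                continue  # no preference between equally graded items
            diff = [a - b for a, b in zip(xi, xj)]
            X.append(diff)
            y.append(1 if gi > gj else -1)
    return X, y
```

A linear classifier trained on these differences yields a scoring function whose ordering respects the pairwise preferences, which is why relative judgments suffice.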

Figure 6: Ranking performance comparison between SVM (classification) and Ranking SVM (ranking) for key n-gram extraction

We randomly selected N_train = 2,000 web pages that had both search log data and human judgments. These pages did not overlap with the test data in Test1 and Test2. The pages labeled with search log data and the pages labeled with human judgments were used to train key n-gram extraction models respectively. For the human-labeled data, the queries with the original judgments Perfect, Excellent, and Good were used as labels. The results are shown in Fig. 7. We can see that: 1) Both options perform better than the baseline, suggesting that it is better to take a search-focused key n-gram extraction approach. 2) The model trained with search log data performs slightly better than the model trained with human-labeled data, indicating that search log data is adequate for accurate key n-gram extraction. Given the cost advantages, it is clearly better to exploit search log data as training data.

Training Data Size

We studied different values of the training data size N_train. We randomly selected 100, 200, 500, 1,000, and 2,000 web pages, together with their associated queries, as training data for key n-gram extraction. The results in Fig. 8 indicate that the differences among training data sizes are quite small; performance may increase slightly as more training data becomes available. It appears that N_train = 2,000 is enough to train a good model for key n-gram extraction.

5.4 Studies on Relevance Ranking

We investigated alternatives in our relevance ranking approach.

Unigram, Bigram and Trigram

We studied the effect of different values of n in relevance ranking. Specifically, we considered three options: 1) Using unigrams. 2) Using both unigrams and bigrams.
3) Using unigrams, bigrams, and trigrams. We did not consider four-grams or higher-order n-grams because the average number of words in a query is approximately three.

Figure 7: Ranking performance comparison between different training data generation methods in the extraction phase

Fig. 9 shows the relevance ranking results. The use of unigrams outperforms the baseline, and including bigrams and trigrams further improves performance. However, the amount of improvement becomes smaller as n increases. Thus, unigrams play the most important role in the results, while bigrams and trigrams add further value.

Top K

We studied the impact of different values of K in top-K n-gram selection, trying K = 5, 10, 20, and 30. The results are given in Fig. 10. Ranking performance first increases and then decreases as K grows; the best result is achieved at K = 20.

N-Gram Extraction Score

Our approach uses the extracted key n-grams as a new meta-stream to create ranking model features; in other words, it ignores the extraction scores of the key n-grams. An alternative is to also utilize the extraction scores as features in the ranking model. We considered three ways to use the scores: 1) The scores are used to weight the n-grams in the key n-gram meta-stream (Score as Weight). 2) The scores of all the key n-grams appearing in the query are summed and used as a feature (Score Sum). 3) The two strategies are combined (All). The experimental results are shown in Fig. 11. The change in performance is quite small when using Score as Weight, while Score Sum further improves ranking performance on both datasets. Moreover, the best performance is achieved when both strategies are combined, further improving our method's NDCG@1 by 1.04 and 1.24 points on the Test1 and Test2 datasets.
(Compared with the baseline, the improvements are 2.96 and 3.18 points respectively.) This experiment shows that key n-gram extraction scores can benefit relevance ranking even when very simple strategies are used.
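The two score-based strategies can be sketched loosely as follows, where `key_ngrams` maps each extracted n-gram to its extraction score; both the data structure and the simplified feature definitions are assumptions, not the paper's exact formulas.

```python
def score_features(query, key_ngrams):
    """Sketch of Score Sum and a score-weighted match feature.
    `key_ngrams`: dict mapping extracted n-grams to extraction scores.
    Simplified assumptions, not the paper's exact definitions."""
    q_text = query.lower()
    q_terms = q_text.split()

    # Score Sum: total score of the key n-grams that appear in the query
    score_sum = sum(s for ng, s in key_ngrams.items() if ng in q_text)

    # Score as Weight (loose reading): query-term matches against the
    # key n-gram meta-stream, each weighted by the n-gram's score
    weighted = sum(s for ng, s in key_ngrams.items()
                   for t in q_terms if t in ng.split())
    return {"ScoreSum": score_sum, "ScoreWeightedMatch": weighted}
```

In the actual model the weights would enter the meta-stream's BM25-style features rather than a single count, but the sketch conveys the idea.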

Figure 8: Ranking performance with different training data sizes in the extraction phase

Figure 9: Ranking performance comparison between unigrams, bigrams and trigrams

6. DISCUSSION

We analyzed the results to investigate why our approach outperforms the baseline, as well as its limitations.

Table 3: Examples of extracted key n-grams

Example 1
URL: transformer-app-makes-your-ipod-touch-look-like-an-iphone/
Queries: cydia app to make itouch in to an iphone
Search log: (none)
Key n-grams: transformer, ipod, iphone, ipod touch look, touch look like, look like, like iphone, ipod touch, transformer app, app makes

Example 2
URL: early_show_crowd_celebrates_myrtle_beach_board-ar /
Queries: myrtle beach boardwalk grand opening
Search log: myrtle beach, myrtle beach boardwalk
Key n-grams: myrtle beach, beach, myrtle, grand opening, cbs, show, cbs early show, early show, crowd celebrate, myrtle beach boardwalk

By checking the cases in which our method worked better than the baseline, we found that our approach extracts good n-grams from a page that match a query even when they do not appear in anchor texts or the page's associated queries. Tab. 3 shows examples. In the first example, no search log data is associated with the page, yet our approach extracts key n-grams that contribute to ranking. In the second example, although the page does have associated search log data, our approach additionally extracts "grand opening", which does not occur in that log data. Our method thus adds a good representation of a web page and enhances its relevance ranking. We also examined the erroneous cases generated by our approach, and found that ranking performance could be further improved with representations better than n-grams. Occasionally, n-grams do not suffice to represent the matches between queries and documents.
Obviously, there are matches based on linguistic structures that cannot be represented well by n-grams. For example, "ipod touch look" in Tab. 3 is extracted by our approach, but its linguistic structure is not captured. If the query is "Abraaj Capital" and the web page is the home page of the Abraaj Capital Art Prize, our method still assigns a high score to the page because of the match through the bigram "Abraaj Capital". This suggests that a fundamental change to relevance models, allowing the use of more complicated representations, might be necessary.

7. CONCLUSIONS

This paper has studied the problem of extracting search-focused key n-grams from web pages and utilizing these key n-grams, as well as their weights, as features to enhance relevance ranking in web search. We use search log data to generate training data and employ learning to rank to learn the key n-gram extraction model, mainly from head pages, and we apply the learned model to all web pages, particularly tail pages. The extracted key n-grams provide another description of a web page, in addition to anchor texts and the queries in search log data. Results on several large datasets validate the effectiveness of the proposed approach: it significantly improves relevance ranking, and the improvements on tail pages are particularly large. In future work, we plan to apply our key n-gram extraction method to other datasets and other tasks such as index compression, and to study the problem of key n-gram generation for web pages.

8. REFERENCES

[1] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In Proc. of SIGIR '06, pages 19–26, 2006.
[2] M. Bendersky, D. Metzler, and W. B. Croft. Parameterized concept weighting in verbose queries. In Proc. of SIGIR '11, 2011.

Figure 10: Ranking performance with different K in top-K n-gram selection

Figure 11: Ranking performance with extraction scores of key n-grams as features in relevance ranking

[3] A. Berger and J. Lafferty. Information retrieval as statistical translation. In Proc. of SIGIR '99, 1999.
[4] N. Craswell, D. Hawking, R. Wilkinson, and M. Wu. Overview of the TREC 2003 web track. In Proc. of TREC '03, pages 78–92, 2003.
[5] M. Cutler, Y. Shih, and W. Meng. Using the structure of HTML documents to improve retrieval. In Proc. of USITS '97, 1997.
[6] A. L. da Costa Carvalho, E. S. de Moura, and P. Calado. Using statistical features to find phrasal terms in text collections. Journal of Information and Data Management, 1(3).
[7] E. Frank, G. W. Paynter, I. H. Witten, C. Gutwin, and C. G. Nevill-Manning. Domain-specific keyphrase extraction. In Proc. of IJCAI '99, 1999.
[8] J. Gao, W. Yuan, X. Li, K. Deng, and J.-Y. Nie. Smoothing clickthrough data for web search ranking. In Proc. of SIGIR '09, 2009.
[9] S. Goel, A. Broder, E. Gabrilovich, and B. Pang. Anatomy of the long tail: ordinary people with extraordinary tastes. In Proc. of WSDM '10, 2010.
[10] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, 2000.
[11] Y. Hu, G. Xin, R. Song, G. Hu, S. Shi, Y. Cao, and H. Li. Title extraction from bodies of HTML documents and its application to web page retrieval. In Proc. of SIGIR '05, 2005.
[12] A. Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proc. of EMNLP '03, 2003.
[13] U. Irmak, V. V. Brzeski, and R. Kraft. Contextual ranking of keywords using click data. In Proc. of ICDE '09, 2009.
[14] X. Jiang, Y. Hu, and H. Li. A ranking approach to keyphrase extraction. In Proc. of SIGIR '09, 2009.
[15] T. Joachims. Optimizing search engines using clickthrough data. In Proc. of KDD '02, 2002.
[16] T. Joachims. Training linear SVMs in linear time. In Proc. of KDD '06, 2006.
[17] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proc. of SIGIR '00, pages 41–48, 2000.
[18] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proc. of SIGIR '01, 2001.
[19] H. Li. Learning to rank for information retrieval and natural language processing. Synthesis Lectures on Human Language Technologies, 4(1):1–113, 2011.
[20] T. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 2009.
[21] D. Metzler and W. Croft. A Markov random field model for term dependencies. In Proc. of SIGIR '05, 2005.
[22] D. Paranjpe. Learning document aboutness from implicit user feedback and document structure. In Proc. of CIKM '09, 2009.
[23] J. Ponte and W. Croft. A language modeling approach to information retrieval. In Proc. of SIGIR '98, 1998.
[24] F. Radlinski and T. Joachims. Query chains: learning to rank from implicit feedback. In Proc. of KDD '05, 2005.
[25] S. E. Robertson and S. J. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proc. of SIGIR '94, 1994.
[26] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18, 1975.
[27] K. Sarkar, M. Nasipuri, and S. Ghoser. A new approach to keyphrase extraction using neural networks. International Journal of Computer Science, 7:16–25.
[28] P. D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4), 2000.
[29] P. D. Turney. Mining the web for lexical knowledge to improve keyphrase extraction: Learning from labeled and unlabeled data. Technical Report ERB-1096, National Research Council, Institute for Information Technology.
[30] E. Voorhees and D. Harman. TREC: Experiment and evaluation in information retrieval. Computational Linguistics, 32(4).
[31] K. Wang, X. Li, and J. Gao. Multi-style language model for web scale information retrieval. In Proc. of SIGIR '10, 2010.
[32] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. KEA: practical automatic keyphrase extraction. In Proc. of DL '99, 1999.
[33] J. Xu, H. Li, and C. Zhong. Relevance ranking using kernels. In Proc. of AIRS '10, pages 1–12, 2010.

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

A New Approach to Query Segmentation for Relevance Ranking in Web Search

A New Approach to Query Segmentation for Relevance Ranking in Web Search Noname manuscript No. (will be inserted by the editor) A New Approach to Query Segmentation for Relevance Ranking in Web Search Haocheng Wu Yunhua Hu Hang Li Enhong Chen Received: date / Accepted: date

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data Misha Bilenko and Ryen White presented by Matt Richardson Microsoft Research Search = Modeling User Behavior

More information

News-Oriented Keyword Indexing with Maximum Entropy Principle.

News-Oriented Keyword Indexing with Maximum Entropy Principle. News-Oriented Keyword Indexing with Maximum Entropy Principle. Li Sujian' Wang Houfeng' Yu Shiwen' Xin Chengsheng2 'Institute of Computational Linguistics, Peking University, 100871, Beijing, China Ilisujian,

More information

WebSci and Learning to Rank for IR

WebSci and Learning to Rank for IR WebSci and Learning to Rank for IR Ernesto Diaz-Aviles L3S Research Center. Hannover, Germany diaz@l3s.de Ernesto Diaz-Aviles www.l3s.de 1/16 Motivation: Information Explosion Ernesto Diaz-Aviles

More information

TREC-10 Web Track Experiments at MSRA

TREC-10 Web Track Experiments at MSRA TREC-10 Web Track Experiments at MSRA Jianfeng Gao*, Guihong Cao #, Hongzhao He #, Min Zhang ##, Jian-Yun Nie**, Stephen Walker*, Stephen Robertson* * Microsoft Research, {jfgao,sw,ser}@microsoft.com **

More information

An Attempt to Identify Weakest and Strongest Queries

An Attempt to Identify Weakest and Strongest Queries An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics

More information

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016 Advanced Topics in Information Retrieval Learning to Rank Vinay Setty vsetty@mpi-inf.mpg.de Jannik Strötgen jannik.stroetgen@mpi-inf.mpg.de ATIR July 14, 2016 Before we start oral exams July 28, the full

More information

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction The 2014 Conference on Computational Linguistics and Speech Processing ROCLING 2014, pp. 110-124 The Association for Computational Linguistics and Chinese Language Processing Collaborative Ranking between

More information

Northeastern University in TREC 2009 Million Query Track

Northeastern University in TREC 2009 Million Query Track Northeastern University in TREC 2009 Million Query Track Evangelos Kanoulas, Keshi Dai, Virgil Pavlu, Stefan Savev, Javed Aslam Information Studies Department, University of Sheffield, Sheffield, UK College

More information

Improving Difficult Queries by Leveraging Clusters in Term Graph

Improving Difficult Queries by Leveraging Clusters in Term Graph Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu

More information

Optimizing Search Engines using Click-through Data

Optimizing Search Engines using Click-through Data Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches

More information

Automatic Summarization

Automatic Summarization Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization

More information

A Few Things to Know about Machine Learning for Web Search

A Few Things to Know about Machine Learning for Web Search AIRS 2012 Tianjin, China Dec. 19, 2012 A Few Things to Know about Machine Learning for Web Search Hang Li Noah s Ark Lab Huawei Technologies Talk Outline My projects at MSRA Some conclusions from our research

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

A Machine Learning Approach for Improved BM25 Retrieval

A Machine Learning Approach for Improved BM25 Retrieval A Machine Learning Approach for Improved BM25 Retrieval Krysta M. Svore and Christopher J. C. Burges Microsoft Research One Microsoft Way Redmond, WA 98052 {ksvore,cburges}@microsoft.com Microsoft Research

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval Tie-Yan Liu 1, Jun Xu 1, Tao Qin 2, Wenying Xiong 3, and Hang Li 1

LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval Tie-Yan Liu 1, Jun Xu 1, Tao Qin 2, Wenying Xiong 3, and Hang Li 1 LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval Tie-Yan Liu 1, Jun Xu 1, Tao Qin 2, Wenying Xiong 3, and Hang Li 1 1 Microsoft Research Asia, No.49 Zhichun Road, Haidian

More information

University of Delaware at Diversity Task of Web Track 2010

University of Delaware at Diversity Task of Web Track 2010 University of Delaware at Diversity Task of Web Track 2010 Wei Zheng 1, Xuanhui Wang 2, and Hui Fang 1 1 Department of ECE, University of Delaware 2 Yahoo! Abstract We report our systems and experiments

More information

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback RMIT @ TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback Ameer Albahem ameer.albahem@rmit.edu.au Lawrence Cavedon lawrence.cavedon@rmit.edu.au Damiano

More information

An Investigation of Basic Retrieval Models for the Dynamic Domain Task

An Investigation of Basic Retrieval Models for the Dynamic Domain Task An Investigation of Basic Retrieval Models for the Dynamic Domain Task Razieh Rahimi and Grace Hui Yang Department of Computer Science, Georgetown University rr1042@georgetown.edu, huiyang@cs.georgetown.edu

More information

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts.

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts. Volume 5, Issue 3, March 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Advanced Preferred

More information

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Semantic Annotation using Horizontal and Vertical Contexts

Semantic Annotation using Horizontal and Vertical Contexts Semantic Annotation using Horizontal and Vertical Contexts Mingcai Hong, Jie Tang, and Juanzi Li Department of Computer Science & Technology, Tsinghua University, 100084. China. {hmc, tj, ljz}@keg.cs.tsinghua.edu.cn

More information

High Accuracy Retrieval with Multiple Nested Ranker

High Accuracy Retrieval with Multiple Nested Ranker High Accuracy Retrieval with Multiple Nested Ranker Irina Matveeva University of Chicago 5801 S. Ellis Ave Chicago, IL 60637 matveeva@uchicago.edu Chris Burges Microsoft Research One Microsoft Way Redmond,

More information

Information Retrieval

Information Retrieval Information Retrieval Learning to Rank Ilya Markov i.markov@uva.nl University of Amsterdam Ilya Markov i.markov@uva.nl Information Retrieval 1 Course overview Offline Data Acquisition Data Processing Data

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

Sentiment analysis under temporal shift

Focused Retrieval Using Topical Language and Structure

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iConference 2015 Proceedings.

Extracting Visual Snippets for Query Suggestion in Collaborative Web Search

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Chapter 8: Evaluating Search Engines

Ranking and Learning: weighted scoring for ranking; learning to rank; ranking as classification (UCSB, Tao Yang)

Modern Retrieval Evaluations (Hongning Wang)

Exploring Reductions for Long Web Queries

A Document-centered Approach to a Natural Language Music Search Engine

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014

Link Graph Analysis for Adult Images Classification

Predictive Indexing for Fast Search

Robust Relevance-Based Language Models

A Deep Top-K Relevance Matching Model for Ad-hoc Retrieval

A Survey on Positive and Unlabelled Learning

Boolean Model (Hongning Wang)

Query Independent Scholarly Article Ranking

Retrieval Evaluation (Hongning Wang)

A User Preference Based Search Engine

CS229 Final Project: Predicting Expected Email Response Times

Custom IDF weights for boosting the relevancy of retrieved documents in textual retrieval

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

Studying of Classifying Chinese SMS Messages Based on Bayesian Classification

Detecting Multilingual and Multi-Regional Query Intent in Web Search

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

NTU Approaches to Subtopic Mining and Document Ranking at NTCIR-9 Intent Task

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Risk Minimization and Language Modeling in Text Retrieval (Thesis Summary)

Evaluating an Associative Browsing Model for Personal Information

Learning Lexicon Models from Search Logs for Query Expansion

Verbose Query Reduction by Learning to Rank for Social Book Search Track

Adapting Document Ranking to Users' Preferences using Click-through Data

Personalized Web Search

Window Extraction for Information Retrieval

Introduction (book table of contents): What is the World Wide Web?; A Brief History of the Web and the Internet; Web Data Mining; What is Data Mining?; What is Web Mining?; Summary of Chapters

Link-Contexts for Ranking

A Task Level Metric for Measuring Web Search Satisfaction and its Application on Improving Relevance Estimation

Document Structure Analysis in Associative Patent Retrieval

A New Technique to Optimize User's Browsing Session using Data Mining

A Novel Categorized Search Strategy using Distributional Clustering

Chapter 27: Introduction to Information Retrieval and Web Search

Reducing Redundancy with Anchor Text and Spam Priors

Information Retrieval (M&S Ch. 15)

Chapter 5: Summary and Conclusion

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags

Semi-Parametric and Non-parametric Term Weighting for Information Retrieval

Estimating Credibility of User Clicks with Mouse Movement and Eye-tracking Information

Collaborative Filtering using a Spreading Activation Approach

Entity and Knowledge Base-oriented Information Retrieval

University of Virginia, CS 4501: Information Retrieval, Fall 2015 (exam)

On Duplicate Results in a Search Session

Image Retrieval System: Based on User Requirement and Inferring Analysis through Feedback

CPSC 340: Machine Learning and Data Mining (Probabilistic Classification, Fall 2017)

TriRank: Review-aware Explainable Recommendation by Modeling Aspects

Learning Subclass Representations for Visually-varied Image Classification (arXiv cs.MM, 12 Jan 2016)

Repositorio Institucional de la Universidad Autónoma de Madrid (author-produced version of a conference paper)

The Comparative Study of Machine Learning Algorithms in Text Data Classification

A Facebook Profile Based TV Shows and Movies Recommendation System (International Journal of Advance Engineering and Research Development)

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

Term Frequency Normalisation Tuning for BM25 and DFR Models

Chapter Three: Information Retrieval System

Federated Search (Jaime Arguello, INLS 509: Information Retrieval)

Deep Web Crawling and Mining for Building Advanced Search Application

Context based Re-ranking of Web Documents (CReWD)

CS473: Course Review (Luo Si, Purdue University)

Ranking models in Information Retrieval: A Survey