Extracting Search-Focused Key N-Grams for Relevance Ranking in Web Search


Chen Wang, Fudan University, Shanghai, P.R. China; Hang Li, Microsoft Research Asia, Beijing, P.R. China; Keping Bi, Peking University, Beijing, P.R. China; Guihong Cao, Microsoft Corporation, Redmond, WA, USA; Yunhua Hu, Microsoft Research Asia, Beijing, P.R. China

ABSTRACT
In web search, relevance ranking of popular pages is relatively easy, because of the inclusion of strong signals such as anchor text and search log data. In contrast, with less popular pages, relevance ranking becomes very challenging due to a lack of information. In this paper the former are referred to as head pages, and the latter as tail pages. We address the challenge by learning a model that can extract search-focused key n-grams from web pages, and using the key n-grams for searches of the pages, particularly the tail pages. To the best of our knowledge, this problem has not been previously studied. Our approach has four characteristics. First, key n-grams are search-focused in the sense that they are defined as those which can compose good queries for searching the page. Second, key n-grams are learned in a relative sense using learning to rank techniques. Third, key n-grams are learned using search log data, such that the characteristics of key n-grams in the search log data, particularly in the heads, can be applied to the other data, particularly to the tails. Fourth, the extracted key n-grams are used as features of the relevance ranking model, which is also trained with learning to rank techniques. Experiments validate the effectiveness of the proposed approach with large-scale web search datasets. The results show that our approach can significantly improve relevance ranking performance on both heads and tails, and particularly on tails, compared with baseline approaches. The characteristics of our approach have also been fully investigated through comprehensive experiments.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

This work was conducted at Microsoft Research Asia when the first two authors were interns there.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WSDM '12, February 8-12, 2012, Seattle, Washington, USA. Copyright 2012 ACM /12/02...$

General Terms
Algorithms, Experimentation, Performance

Keywords
Search Relevance, Ranking, Tail Page, Key N-Gram Extraction, Learning to Rank

1. INTRODUCTION
In web search, the relevance ranking model, which can be either manually created or automatically learned, assigns scores representing the degree of a web page's relevance with respect to the query, and ranks pages according to score. The ranking model utilizes information such as the frequency of query words in the title, body, URL, anchor text and search log data in determining relevance. There are popular web pages that include rich information such as anchor text and search log data. For these pages, it is easy for the ranking model to predict relevance with respect to a query and thereby assign reliable relevance scores. In contrast, there are also less popular web pages that lack sufficient information, and in these instances it is challenging to accurately calculate relevance. In this paper, we refer to web pages with a lot of anchor text and associated queries in search log data as head pages. Web pages with little anchor text and few associated queries are called tail pages.
In the distribution of web page visits, head pages have high visit frequencies, while tail pages have low visit frequencies. In this paper, we aim to solve the problem of improving tail page relevance, currently one of the most challenging problems in web search. Our approach has the following characteristics: 1) extracting search-focused information from web pages; 2) taking key n-grams as the representation of search-focused information; 3) employing learning to rank to train the extraction model using search log data; 4) employing learning to rank to train a model for relevance ranking using search-focused key n-grams as features. We deal with tail page relevance by extracting good queries, i.e., queries most suitable for searching the page, assuming that the data sources for the extraction only include the title, URL, and page body. (A more difficult question is how to automatically generate good queries instead of extracting them; we leave this problem for future work.) Such information is available for any page, including tail pages. When searching with a page's good queries, both the page and the queries should be highly relevant. We refer to this as search-focused extraction.

We extract search-focused key n-grams from web pages and use them to improve relevance, particularly tail page relevance. The key n-grams should compose good queries for searching the pages. There are two reasons why we chose key n-gram extraction rather than keyphrase extraction. First, conventional relevance models, whether or not they are created by machine learning, usually employ only n-grams from queries and documents. Therefore, the extraction of key n-grams is adequate for the purpose of enhancing ranking model performance. Second, the use of n-grams eliminates the need to segment queries and documents, and thus frees us from segmentation errors. We further employ a learning to rank approach for the extraction of key n-grams. The problem is formalized as ranking the n-grams of a given web page. The importance of key n-grams is only meaningful in a relative sense, and thus one does not need to make hard categorization decisions between important and unimportant n-grams, following the ranking formalization we proposed earlier in [14]. We use n-gram positions and web page HTML tags, as well as term frequencies, as features in the learning to rank model. We take search log data as training data for learning the extraction model (one can also consider taking anchor text data as training data). The premise is that the statistical properties of a page's good queries can be learned and applied across different web pages. The objective of learning is an exact and accurate extraction of search-focused key n-grams, because the page-associated queries are sets of key n-grams for search. The abundance of search log data available for head pages means that the extraction model is learned primarily from those pages. In this way, we can extend the knowledge acquired from head pages to tail pages, thereby effectively addressing the challenge of determining tail page relevance.
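As an illustration of why n-grams sidestep segmentation, here is a minimal sketch of our own (not code from the paper): every contiguous window of tokens up to n = 3 is a candidate, so no segmentation decision is ever made.

```python
def ngrams(tokens, max_n=3):
    """Enumerate all contiguous n-grams (n <= max_n) of a token sequence.
    Every window is a candidate, so no segmentation step is needed."""
    result = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result

print(ngrams(["star", "wars", "lego"]))
# ['star', 'wars', 'lego', 'star wars', 'wars lego', 'star wars lego']
```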
Note that the learned model can also help improve the relevance of head pages. The extracted key n-grams come with scores representing their strength on a given page. We further take a learning to rank approach to train the relevance ranking model, using the key n-grams and their scores as additional ranking model features. Those n-grams are also utilized in the original ranking model, as they are from the same page, and thus their contributions to the ranking model are further enhanced by our approach. We have conducted experiments with two large-scale web search datasets to validate the effectiveness of the proposed approach. Each dataset contains more than 10,000 queries and approximately 1,000,000 documents. There are, on average, 65 documents with relevance judgments for each query. All the evaluations are conducted on search relevance ranking in terms of MAP and NDCG. Results show that our approach can significantly improve search relevance on both head and tail pages, with particularly greater improvement on tail pages. Our method also works better than a conventional keyphrase extraction method, indicating that it is better to employ a search-focused approach. We studied several important characteristics of our approach in both the key n-gram extraction phase and the relevance ranking phase. The results show that it is better to formalize key n-gram extraction as a ranking problem rather than a classification problem. The results also show that our approach trained with search log data performs comparably to one trained with human-labeled data. Moreover, the performance of our approach saturates as the size of the search log data increases. These results suggest that head page search log data, which is of low cost, can be effectively leveraged to train a model for improving search relevance.
We have found that reasonably good relevance ranking results can be achieved with unigrams alone, and that performance can be further improved when bigrams and trigrams are also included. Furthermore, we have found that extracting the top 20 key n-grams achieves the best performance in relevance ranking. In addition, we have observed that the use of key n-gram scores can further enhance relevance ranking. The contributions of this paper are as follows. We have proposed search-focused key n-gram extraction from web pages to enhance search relevance, particularly tail page relevance (to the best of our knowledge, this problem has not been studied before). We have developed a method for extracting search-focused key n-grams using learning to rank and search log data. We have significantly improved search relevance using the extracted key n-grams, particularly for tail pages. We have also conducted comprehensive studies to investigate the characteristics of our approach. The rest of the paper is organized as follows. Sec. 2 introduces related work. Sec. 3 gives the motivation for the problem. We introduce our proposed approach in Sec. 4 and present experimental results in Sec. 5. We offer discussion in Sec. 6. Sec. 7 concludes the paper and provides suggestions for future work.

2. RELATED WORK
2.1 Relevance Ranking
Relevance ranking, one of the most important search engine components, assigns scores representing a document's degree of relevance with respect to the query and ranks the documents according to their scores. Many methods have been proposed as a basis for constructing a relevance ranking model. Traditionally, the ranking model is manually created with a few fine-tuned parameters. Recently, machine learning techniques, called learning to rank, have also been applied to ranking model construction.
Traditional models such as the Vector Space Model [26], BM25 [25], Language Models for IR [18, 23], Markov Random Fields [21], and learning to rank models [19, 20] make use of the n-grams of the queries and documents as features. In fact, with all of these methods, the queries and documents are viewed as vectors of n-grams. Intuitively, if the query's n-grams occur a number of times in the document, then it is likely that the document is relevant to the query. More information about web pages can be utilized in web search; for example, the body, URL, anchor texts and the queries associated in search log data. Previous work has shown that the use of web page (HTML document) titles, anchor texts and URLs can enhance relevance ranking accuracy in web search. For example, Cutler et al. [5] proposed using the structures of HTML documents to improve document retrieval; they linearly combined term frequencies in several fields extracted from an HTML document. In TREC-2003, for example, more than half of the participants considered the use of richer representations of web pages [4]. Hu et al. [11] proposed extracting titles from the bodies of web pages and using the extracted titles as new metadata fields for web search. Search log data is also a useful resource for improving relevance ranking performance. Agichtein et al. [1], for example, showed that using a web page's associated queries in search log data can enhance the accuracy of the page's relevance ranking. Wang et al. [31] indicated that different web page fields usually have different characteristics and should be represented by different language models. Bendersky et al. [2] proposed employing a discriminative method for extracting and weighting terms or phrases in pseudo relevance feedback and then exploiting the weighted terms in relevance ranking. They focused on term weighting from the query side and did

not make use of information from the document side. Carvalho et al. [6] proposed finding phrasal terms in documents to enhance relevance ranking; to extract phrasal terms, they utilized statistical features and an SVM classifier. To the best of our knowledge, no previous work has studied the problem of extracting and using key n-grams from web pages for relevance ranking.

2.2 Keyphrase Extraction
Our work is also related to keyphrase extraction, given that n-grams represent phrase fragments. A number of authors, such as Frank et al., Witten et al., and Turney [7, 28, 29, 32], treat keyphrase extraction as a classification problem. In this approach, document phrases are labeled as keyphrases or non-keyphrases, a classifier is trained using the labeled data, and the classifier is then used to categorize phrases within a new document as keyphrases or non-keyphrases. Several learning methods have been applied to train the classifier, including naive Bayes [7, 32], decision trees [28, 29], rule induction [12], and neural networks [27]. The KEA tool, based on naive Bayes, is publicly available for use in keyphrase extraction studies. Recently, we [14] proposed formalizing keyphrase extraction as a ranking problem and learning a ranker to rank phrases by the degree to which they are keyphrases; the learning to rank method Ranking SVM is employed in the construction of the ranker. Another line of work proposes the use of log data as training data for keyphrase extraction. Irmak et al. [13] extracted key terms from user log data in a user-centric entity detection system and used learning to rank to learn the extraction model. Paranjpe [22] proposed a similar method which utilizes search log data to learn the aboutness of a document represented by words and phrases; a regression model was used. The evaluation of the extracted keyphrases in both studies was limited to browsing scenarios.
Our work differs from existing work in the following points: 1) our goal is to enhance search relevance ranking; 2) we extract key n-grams instead of keyphrases.

2.3 Search Log Mining for Ranking
Search log data contains queries and clicked URLs. Each web page's URL is associated with a number of queries, and the number of times the page is clicked in each query's searches is also recorded. Search log data has been used for relevance ranking in web search; for example, as training data for creating a ranking model [15, 24], and as features of a relevance ranking model [1]. Search log data is usually only available for popular queries and pages. Gao et al. [8] proposed two smoothing methods to propagate click numbers along the click-through graph. In this paper, we use search log data as training data for key n-gram extraction. The model learned by our method can even be applied to web pages (tail pages) without any clicks or anchor text.

3. MOTIVATION
Head pages usually have a lot of anchor text and associated queries in search log data, which can serve as strong signals for judging page relevance with respect to queries. Modern search engines can effectively leverage these signals and perform very well in head page searches. In contrast, tail pages are usually limited to just a few anchor texts and associated queries, or even none at all. As a result, searching such pages tends to be difficult. Tail pages, including new pages, personal homepages, and discussion pages, are of high value to users. These pages need to be easily found by those who want to search for them.

Figure 1: Similar appearances of queries in an example head page (top) and tail page (bottom)
Head page queries (clicks): 1. us citizenship and immigration service (~10,000) 2. us citizenship (~10,000) 3. naturalization (~6,000) 4. green card (~100) 5. immigration (~4,000) 6. employee immigration (~10)
Tail page queries (clicks): 1. oregon (3) 2. office of private health partnerships (11) 3. oregon health insurance (2)

According to a study by Goel et al. [9], search of tail pages is an important element in enhancing user satisfaction, and good search results on tail pages can play a central role in winning users' loyalty to a particular search engine. If, for example, a user wants to find his friend's homepage, his satisfaction will be quite high if it can be found immediately by the search engine. Our statistics on a collection of web pages show that more than 75% are tail pages with fewer than one anchor text and associated query. This illustrates how tail page relevance has become one of the most challenging issues in web search. Our proposal is to extract or generate good queries for each page, including tail pages, which should be most suitable for searching the page. The extraction or generation uses the page title, URL and body, all of which are available for tail pages. When searched with its good queries, the page should rank highly. Berger et al. [3] proposed taking a translation approach to document retrieval; the extraction or generation here can likewise be viewed as translating a web page into a number of keyphrases. We refer to this task as search-focused keyphrase extraction or generation. In this paper, we focus on extraction. We accomplish our task by two means: learning from search log data and key n-gram extraction. First, we take search log data as training data for the extraction. A web page's associated queries can be viewed as good queries for searching the page, and the data can be used to train a model. Since head pages have more click data, we end up learning the model primarily from head pages and applying it to tail pages.

Second, we consider key n-gram extraction an approximation of keyphrase extraction. Queries, particularly long queries such as "star wars anniversary edition lego darth vader fighter", are difficult to segment. If the query is associated with a page in the search log data, then we take all of the query's n-grams as key n-grams of the page. In this way, we can skip query segmentation, which is difficult to do with a high degree of accuracy. In fact, all current ranking models, even those based on machine learning, usually use only n-grams as features, meaning that key n-gram extraction is sufficient to enhance the model's performance. We assume that key n-grams share the same patterns across different pages, and that we can successfully learn an extraction model from the head and apply it to the tail. Fig. 1 gives an example to illustrate this. The page on the top is a head page with more than 600,000 clicks in the search engine's log within one year. The page on the bottom is a tail page with just 23 clicks. Despite this difference, the queries in the two pages have similar patterns and appearances. In both pages, the queries are located in region A and region C. Specifically, region A includes the page's title and subtitle. Regions B and D include website navigation links. The page's main content is in region C. Region E consists of outer links. We can use the formatting, term frequency and position information to identify whether an n-gram is likely to be a key n-gram.

4. OUR APPROACH
Our approach consists of two parts: key n-gram extraction and relevance ranking. Fig. 2 shows an overview.

4.1 Key N-Gram Extraction
Pre-Processing
We assume that the objects to be searched and ranked by the search engine are web pages. During pre-processing, a web page in HTML format is parsed and represented as a sequence of tags/words. Then all the words are converted into lower case, and stop words in a list1 are removed.
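The pre-processing step can be sketched as follows. This is our own minimal illustration, using a tiny stand-in stop list rather than the full SMART list cited in the footnote, and Python's standard HTML parser rather than whatever parser the authors used.

```python
from html.parser import HTMLParser

STOP_WORDS = {"the", "a", "an", "of", "and"}  # stand-in for the full stop list

class PageTokenizer(HTMLParser):
    """Parse an HTML page into a flat sequence of tags and lower-cased,
    stop-word-filtered words, as in the pre-processing step above."""

    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        self.sequence.append(("TAG", tag))

    def handle_data(self, data):
        for word in data.lower().split():
            if word not in STOP_WORDS:
                self.sequence.append(("WORD", word))

parser = PageTokenizer()
parser.feed("<h1>The Experimental Result</h1>")
print(parser.sequence)  # [('TAG', 'h1'), ('WORD', 'experimental'), ('WORD', 'result')]
```

Keeping the tags in the sequence preserves the separation and formatting cues that the later feature-generation step depends on.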
In our approach, we define an n-gram as n successive words within a short text separated by punctuation symbols and special HTML tags. Note that some HTML tags provide a natural separation of text, e.g., <h1>Experimental Result</h1> indicates that "Experimental Result" is a short text. Some tags do not imply a separation, e.g., <font color="red">significant</font> improvement.

Training Data Generation
We propose using search log data for training a key n-gram extraction model, because search log data represents users' implicit judgments on the relevance between queries and documents. More specifically, if users search with a query and click a page afterward, and this occurs many times (e.g., beyond a threshold), then it is very likely that the query and the page are relevant. We can consider automatically extracting queries from the page. Head pages generally have a number of associated queries in the search log data. Such data can naturally be used as training data for the automatic extraction of queries, particularly for tail pages. Instead of extracting queries or keyphrases, we extract key n-grams. We treat the n-grams in each of the document's associated queries as its labeled key n-grams. For example, when a document ABDC is associated with the query ABC, we consider unigrams A, B, C and bigram AB to be key n-grams, with the assumption that they should be ranked higher than unigram D and bigrams BD and DC by the extraction model. Features for each n-gram are then extracted, as described in the next subsection, and an extraction model is trained. The advantage of taking this approach is that no segmentation of queries and documents is necessary, which is difficult to carry out accurately. In fact, existing ranking models usually make use of n-grams only as features, and thus extraction of key n-grams is sufficient for enhancing search relevance.

1 ftp://ftp.cs.cornell.edu/pub/smart/english.stop
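The labeling scheme above can be sketched in a few lines. This is our own illustration (the function name is ours): with document ABDC and associated query ABC, exactly the n-grams shared with the query come out labeled as key.

```python
def label_ngrams(doc_tokens, query_tokens, max_n=3):
    """Label each document n-gram 1 (key) if it also occurs contiguously
    in an associated query from the search log, else 0 (non-key)."""
    def grams(tokens):
        return {" ".join(tokens[i:i + n])
                for n in range(1, max_n + 1)
                for i in range(len(tokens) - n + 1)}
    query_grams = grams(query_tokens)
    return {g: int(g in query_grams) for g in grams(doc_tokens)}

# Document ABDC, associated query ABC: A, B, C and "A B" are key;
# D, "B D" and "D C" are not.
labels = label_ngrams(["a", "b", "d", "c"], ["a", "b", "c"])
print(labels["a b"], labels["b d"])  # 1 0
```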
Deriving training data from search log data has the benefit of low cost. An alternative would be to have human labelers assign relevant queries to selected web pages. However, this suffers from two problems: 1) the cost is high; 2) human labelers are not the query owners, making it difficult for them to come up with relevant queries for a page. An approximation of such a method uses conventional relevance judgment data, in which each page is associated with a small number of queries, or even just one query. As will be seen in our experiments, such an approximation does not work better than our method using search log data. Given that human labeling is expensive and log data is very low cost, the use of log data for training data creation is definitely a good practice.

N-Gram Features Generation
Web pages contain rich formatting information compared to plain text. We utilize both textual and formatting information to create features in the extraction (ranking) model in order to accurately extract key n-grams. We have conducted a comprehensive study of the features for the task. Below is a list of features found to be useful in an empirical study of 500 randomly selected pages and the key n-grams associated with them. N-grams may be highlighted with different HTML formatting information, and this formatting information is useful in identifying the importance of n-grams.

1. Frequency Features
The original/normalized term frequencies of an n-gram within several fields, tags and attributes are utilized.
a) Frequency in Fields: The n-gram's frequencies in four web page fields: URL, page title, meta-keyword and meta-description.
b) Frequency within Structure Tags: The frequencies of an n-gram in texts within a header, table or list, indicated by HTML tags including <h1>,..., <h6>, <table>, <li> and <dd>.
c) Frequency within Highlight Tags: The frequencies of an n-gram in texts highlighted or emphasized by HTML tags including <a>, <b>, <i>, <em> and <strong>.
d) Frequency within Attributes of Tags: The frequencies of an n-gram in web page tag attributes. These are hidden texts which are not visible to users. We found, however, that they are still valuable for key n-gram extraction; for example, the title of an image: <img title="Still Life: Vase with Fifteen Sunflowers..." />. Specifically, the title, alt, href and src tag attributes are used.
e) Frequencies in Other Contexts: The frequencies of an n-gram in other contexts, including 1) the page headers, i.e., the n-gram frequency within any of the <h1>,..., <h6> tags; 2) the page meta-data field; 3) the page body; 4) the whole HTML file.

2. Appearance Features
The appearances of n-grams are also important indicators of their importance.
a) Position: The n-gram's position when it first appears in the title, paragraph and document.
b) Coverage: The coverage of an n-gram in the title or a header, e.g., whether the n-gram covers more than 50% of the title.
c) Distribution: The n-gram's distribution across different parts

of a page. The page is separated into several parts, and the entropy of the n-gram across these parts is used.

Figure 2: Framework of our approach (Phase 1: key n-gram extraction; Phase 2: relevance ranking)

Key N-Gram Extraction
Key n-gram extraction is formalized as a learning to rank problem. In learning, a ranking model is trained which can rank n-grams according to their relative importance as key n-grams associated with a given web page. Features are defined and utilized for the ranking of n-grams. In extraction, given a new page and the trained model, the n-grams in the page are ranked with the model, and the top K n-grams are selected as key n-grams of the page. We give a formalization of the learning task. Let X ⊆ R^p be the space of features of n-grams, and let Y = {r_1, r_2,..., r_m} be the space of ranks. There exists a total order among the ranks: r_m ≻ r_{m-1} ≻ ... ≻ r_1. Here, m = 2, representing key n-grams and non-key n-grams. The goal is to learn a ranking function f(x) such that for any pair of n-grams (x_i, y_i) and (x_j, y_j), the following condition holds:

    f(x_i) > f(x_j) ⟺ y_i ≻ y_j    (1)

Here x_i and x_j are elements of X, and y_i and y_j are elements of Y representing the ranks of x_i and x_j. We employ Ranking SVM [10] to learn the ranking function f(x), and we specifically use the SVMRank tool [16]. We assume that f(x) = w^T x is a linear function of x. Given a training set, we first convert it to ordered pairs of n-grams: P = {(i, j) | (x_i, x_j), y_i ≻ y_j}. f(x) is learned by solving the following optimization problem:

    ŵ = argmin_w (1/2) w^T w + c Σ_{(i,j)∈P} ξ_{ij}
    s.t. ∀(i, j) ∈ P: w^T x_i − w^T x_j ≥ 1 − ξ_{ij}, ξ_{ij} ≥ 0    (2)

where ξ_{ij} denotes slack variables and c is a parameter.

4.2 Relevance Ranking
Relevance Ranking Features Generation
In web search, web pages are represented in several fields, also referred to as meta-streams. In this paper, we consider the following meta-streams: URL, page title, page body, meta-keywords, meta-description, anchor texts, queries associated in the search log data, and the key n-gram stream created by our method. The first five meta-streams are extracted from the web page itself, and they reflect the web designer's view. Anchor texts are extracted from other pages, and they represent other web designers' summaries. The query meta-stream consists of users' queries leading to clicks on the page; it provides the search users' view. The key n-gram meta-stream created by our method also provides a web page summary. Note that the key n-grams are extracted based solely on the information from the first five meta-streams. The model for extraction is trained primarily from head pages, using their associated queries as training data, and is applied to tail pages, which may not include anchor texts and associated queries. A ranking model includes query-document matching features that represent the relevance of the document with respect to the query. Popular features include tf-idf, BM25 and minimal span. All can be defined on the meta-streams. Document features which describe the importance of the document itself, such as PageRank and spam score, are also used.
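The pairwise Ranking SVM objective can be illustrated with a toy sketch. This is our own simplified subgradient version, not the SVMRank implementation used in the paper: for each ordered pair (x_i preferred over x_j), it pushes w·x_i − w·x_j above a margin of 1 under L2 regularization.

```python
# Toy subgradient training of a pairwise ranking objective in the
# spirit of Ranking SVM (our own sketch, not SVMRank itself).

def dot(w, x):
    return sum(wk * xk for wk, xk in zip(w, x))

def train_pairwise(pairs, dim, c=1.0, lr=0.1, epochs=100):
    w = [0.0] * dim
    for _ in range(epochs):
        for xi, xj in pairs:  # xi should rank above xj
            margin = dot(w, xi) - dot(w, xj)
            for k in range(dim):
                grad = w[k] / len(pairs)          # regularizer term
                if margin < 1.0:                  # hinge loss is active
                    grad -= c * (xi[k] - xj[k])
                w[k] -= lr * grad
    return w

# Hypothetical 2-d features: key n-grams (first of each pair) vs. non-key
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.2, 0.7])]
w = train_pairwise(pairs, dim=2)
```

After training, the learned w scores the preferred item of each pair higher, which is all the extraction phase needs in order to rank a page's n-grams.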
Given a query and a document, we derive the following query-document matching features from each meta-stream:
a) Unigram/Bigram/Trigram BM25: n-gram BM25 [33] is an extension of traditional unigram-based BM25.
b) Original/Normalized PerfectMatch: The number of exact matches between the query and the text in the stream.
c) Original/Normalized OrderedMatch: The number of continuous words in the stream which can be matched with the words in the query in the same order.
d) Original/Normalized PartiallyMatch: The number of continuous words in the stream which are all contained in the query.
e) Original/Normalized QueryWordFound: The number of words in the query which also appear in the stream.
In addition, PageRank and domain rank scores are used as document features in our model.

Ranking Model Learning and Prediction
We employ learning to rank techniques to automatically construct the ranking model from labeled training data for relevance ranking. Again, we use Ranking SVM as the learning algorithm. In search, given a new query and the retrieved web pages, we generate the query-document matching features between the query and each page's meta-streams, and calculate ranking scores using the learned ranking model. Then we rank all of the web pages in descending order of their ranking scores.
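Two of the matching features above can be sketched in a few lines. This is our own illustration under simple assumptions (whitespace tokenization, distinct query words for QueryWordFound); the paper does not spell out these details.

```python
def query_word_found(query, stream):
    """QueryWordFound: number of distinct query words also found in the stream."""
    stream_words = set(stream.split())
    return sum(1 for w in set(query.split()) if w in stream_words)

def perfect_match(query, stream):
    """PerfectMatch: number of exact occurrences of the whole query in the stream."""
    q, s = query.split(), stream.split()
    return sum(1 for i in range(len(s) - len(q) + 1) if s[i:i + len(q)] == q)

title_stream = "oregon health insurance plans"  # hypothetical title meta-stream
print(query_word_found("oregon insurance", title_stream))  # 2
print(perfect_match("health insurance", title_stream))     # 1
```

The same two functions apply unchanged to the key n-gram meta-stream, which is how the extracted n-grams feed into the ranking model.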

5. EXPERIMENTS

5.1 Experiment Settings

Dataset. We conducted experiments on a large search log dataset and three large-scale relevance ranking datasets. We used a search log collected from a commercial search engine over the course of one year. It contains 76,882,604 distinct URLs and 121,100,491 distinct queries with 10,280,590,031 clicks. To derive training data for key n-gram extraction, the query-document pairs with at least click_min = 2 clicks were kept and the others were discarded. Then N_train = 2,000 HTML pages, each with a minimum of query_min = 5 associated queries, were randomly selected from the search log data. Some data statistics are given in Tab. 1.

Table 1: Dataset used for training of key n-gram extraction
  # of HTML pages                2,000
  Average # of queries per page
  Average # of words per query   3.47

The relevance ranking experiments were conducted on three large-scale datasets which contain queries, documents, and their relevance judgments. The first two sets are comprised of general queries (queries randomly sampled from the search log), and the last set of tail queries (queries randomly sampled from the low-frequency queries of the search log). They are referred to as Training (Training Set), Test1 (Test Set with General Queries), and Test2 (Test Set with Tail Queries). We use the Training Set to learn a relevance ranking model and use Test1 and Test2 to evaluate the model. Each set includes 10,000 queries and about 1,000,000 web pages. The relevance judgments are at five levels: Perfect, Excellent, Good, Fair, and Bad. Some statistics about the three datasets are given in Tab. 2.

Table 2: Datasets used for relevance ranking
  Dataset                     Training    Test1 (General Queries)    Test2 (Tail Queries)
  # of queries                20,429      12,705                     10,991
  # of HTML pages             1,008,480   1,131,…                    …768
  Avg. # of pages/query
  Avg. # of perfect/query
  Avg. # of excellent/query
  Avg. # of good/query
  Avg. # of fair/query
  Avg. # of bad/query
  Avg. # of words/query

Parameters and Evaluation Measures. We first set the ranges of our method's parameters, then created extraction and ranking models under all possible parameter settings, and tested our method's relevance ranking performance on the Test1 and Test2 datasets. We report the best performance on the two test datasets at the best parameter setting. (It turns out that the best parameter settings are the same.) As we later describe in detail, we started from the best parameter setting, changed one parameter at a time, and fixed the others.

In key n-gram extraction, there are three parameters: the number of training pages N_train, the minimum number of clicks between a query and a page click_min, and the minimum number of queries associated with a page query_min. We set the range of N_train as {100, 200, 500, 1000, 2000}, click_min as {1, 2, 5}, and query_min as {2, 5, 10}. The best setting is N_train = 2,000, click_min = 2, and query_min = 5. In relevance ranking, there are two parameters: the number K of top n-grams selected, and the n-gram length n. We set the range of K as {5, 10, 20, 30} and n as {1, 2, 3}. The best setting is K = 20 and n = 3. There is also a parameter c for Ranking SVM, which is used in both key n-gram extraction and relevance ranking. In extraction, c is chosen from {0.001, 0.01, 0.1, 1, 10, 100, 1000}; in ranking, c is selected from {0.01, 0.1, 1, 10, 100, 1000, 10000, …}. Note that the ranges of c in the two phases are quite different because the tasks are quite different. The best settings are c = 0.1 for extraction and c = 1,000 for ranking.

Mean Average Precision (MAP) [30] and Normalized Discounted Cumulative Gain (NDCG) [17] are used as evaluation measures in relevance ranking.

5.2 Experiments on Relevance Ranking

We evaluated our proposed approach for relevance ranking. First, we evaluated overall performance to see whether our method can enhance relevance ranking.
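For reference, NDCG, one of the evaluation measures above, can be computed as in the following sketch (standard 2^rel − 1 gain with log2 discount; the exact variant used in the paper may differ):

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for a ranked list of graded relevance labels,
    e.g. Bad=0 .. Perfect=4. Standard formulation; a sketch rather
    than the paper's exact implementation."""
    def dcg(rels):
        # gain 2^rel - 1, discounted by log2 of the (1-based) rank + 1
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0; swapping a Perfect page below a Bad one lowers the score.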
Then, we compared our method against a popular keyphrase extraction method, KEA [32], used in relevance ranking. We particularly examined how our approach performed on head and tail pages.

Relevance Ranking with Extracted Key N-Grams

We tested the effectiveness of using extracted key n-grams in relevance ranking. One baseline used the ranking model trained with features derived from several meta-streams (see Sec. 4.2 for details) without using extracted n-grams, referred to as the baseline. Note that both our method and the baseline were built with the same web page information; our method further extracts and utilizes additional information, namely search-focused key n-grams, for page ranking. The parameters of our method were at the best setting described above. The experimental results are presented in Fig. 3. They indicate that relevance ranking performance can be significantly enhanced by adding extracted key n-grams into the ranking model. More specifically, our method improves NDCG@1 by 1.92 and 1.94 points on the Test1 and Test2 datasets respectively. The improvements in MAP on Test1 and Test2 are 0.87 and 1.28 points respectively. A t-test shows that the improvements are all statistically significant (p < 0.001).

Ranking Performance Comparison with KEA

Another baseline used KEA [32] for relevance ranking; it takes web pages as input and extracts keyphrases from them. The keyphrases are used as a page meta-stream, and a ranking model with additional features from the meta-stream is trained. There are differences between our method and KEA: our method learns the model from search log data while KEA learns from human-labeled keyphrases, and our method is search-focused while KEA is general-purpose. The KEA model was trained using the data in Jiang et al.'s work [14], which consists of 300 web pages annotated with human-labeled keyphrases.
To make the comparison fair, we took the same set of web pages and their associated queries as training data for our method. The other parameters of our method were set as in the best parameter setting. Note that our approach's performance here is therefore slightly worse than that reported above. The parameters of KEA, such as the number of output keyphrases, were tuned, and we report the best KEA results.

Figure 3: Ranking performance comparison between the baseline (without key n-grams) and our approach (with key n-grams)

From the results shown in Fig. 4, we can see that KEA works better than the baseline but performs worse than our method. The results indicate that relevance ranking can be enhanced by both keyphrase and key n-gram extraction; moreover, it is better to employ search-focused key n-gram extraction than general keyphrase extraction.

Ranking Performance on Head and Tail Pages

Our method is trained mainly with head page data, because head pages have more associated queries with higher click frequencies. It is potentially most useful for relevance ranking of tail pages, because tail pages usually have less anchor text and fewer associated queries. We evaluated our method's performance on head pages and tail pages respectively. We split both the Test1 and Test2 datasets into a Head Page Subset and a Tail Page Subset. When more than 75% of a query's relevant pages (those rated Perfect, Excellent, or Good) had neither other associated queries nor anchor texts, we assigned the query and its associated pages to the Tail Page Subset; otherwise, to the Head Page Subset. The ratio of Head Page Subsets to Tail Page Subsets was about 3:1 for both Test1 and Test2. The results are shown in Fig. 5. We can see that: 1) Our proposed method yields significantly larger improvements on the Tail Page Subset than on the Head Page Subset for both datasets. 2) The improvements made by our method remain statistically significant even on the Head Page Subsets (p < 0.001). 3) The improvements on the Test2 dataset are slightly greater than on the Test1 dataset, indicating that our approach is more effective on tail queries than on general queries. Similar conclusions were drawn when we tested other split ratios such as 50%, 60%, 70%, 80%, and 90%.
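The 75% splitting rule described above can be sketched as follows. The data layout and field names here are illustrative assumptions, not the paper's actual format.

```python
def split_head_tail(queries, threshold=0.75):
    """Split evaluation queries into Head/Tail Page Subsets: a query
    goes to the tail subset when more than `threshold` of its relevant
    pages (Perfect/Excellent/Good) have neither other associated
    queries nor anchor text. Field names are hypothetical."""
    RELEVANT = {"Perfect", "Excellent", "Good"}
    head, tail = [], []
    for q in queries:
        rel = [p for p in q["pages"] if p["label"] in RELEVANT]
        # relevant pages lacking both other queries and anchor text
        sparse = sum(1 for p in rel
                     if not p["has_other_queries"] and not p["has_anchor"])
        if rel and sparse / len(rel) > threshold:
            tail.append(q)
        else:
            head.append(q)
    return head, tail
```

Varying `threshold` gives the alternative split ratios (0.5, 0.6, 0.7, 0.8, 0.9) tested above.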
5.3 Studies on Key N-Gram Extraction

We studied alternatives to our approach to key n-gram extraction.

Figure 4: Ranking performance comparison between KEA and our approach

Figure 5: Ranking performance improvement comparison between head pages and tail pages

Classification vs. Ranking

In our approach, we formalize the extraction problem as one of ranking rather than classification. We compared the two options based on Ranking SVM and SVM. The results, shown in Fig. 6, indicate that using Ranking SVM is more effective than using SVM. A t-test shows that Ranking SVM's improvement over SVM is statistically significant (p < 0.001). This validates the assumption that it is better to formalize the key n-gram extraction problem as one of ranking rather than classification: it is hard to judge key n-grams in an absolute manner, but easy to judge whether one n-gram is more important than another.

Training Data Generation

In our approach, we derive training data from search log data, assuming that a document is relevant to a query if the document is searched and clicked for the query at least click_min times. We evaluated the effectiveness of this data creation method by comparing training with search log data vs. training with human-labeled data.
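The ranking formulation can be reduced to classification over pairs, which is essentially what Ranking SVM does [10, 15]. A minimal sketch of generating such pairwise instances from graded n-grams (the data layout is assumed):

```python
def pairwise_instances(items):
    """Turn graded items (feature_vector, grade) into pairwise training
    instances for a Ranking SVM-style learner: for each pair with
    different grades, emit the feature difference labeled by the
    preference direction. A standard reduction; the paper's exact
    setup may differ."""
    X, y = [], []
    for i, (xi, gi) in enumerate(items):
        for xj, gj in items[i + 1:]:
            if gi == gj:
                continue  # no preference between equally graded items
            diff = [a - b for a, b in zip(xi, xj)]
            X.append(diff)
            y.append(1 if gi > gj else -1)
    return X, y
```

A linear classifier trained on these differences yields a scoring function whose ordering respects the pairwise preferences, which is why relative judgments suffice.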

Figure 6: Ranking performance comparison between SVM (classification) and Ranking SVM (ranking) for key n-gram extraction

We randomly selected N_train = 2,000 web pages that had both search log data and human judgments. These pages did not overlap with the test data in Test1 and Test2. The pages labeled with search log data and the pages labeled with human judgments were used to train key n-gram extraction models respectively. For the human-labeled data, the queries with the original judgments Perfect, Excellent, and Good were used as labels. The results are shown in Fig. 7. We can see that: 1) Both options perform better than the baseline, suggesting that it is better to take a search-focused key n-gram extraction approach. 2) The model trained with search log data performs slightly better than the model trained with human-labeled data, indicating that search log data is adequate for accurate key n-gram extraction. Given the cost advantages, it is clearly better to exploit search log data as training data.

Training Data Size

We studied different values of the training data size N_train. We randomly selected 100, 200, 500, 1,000, and 2,000 web pages, together with their associated queries, as training data for key n-gram extraction. The results in Fig. 8 indicate that the differences among training data sizes are quite small; performance may increase slightly as more training data becomes available. It appears that N_train = 2,000 is enough to train a good model for key n-gram extraction.

5.4 Studies on Relevance Ranking

We investigated alternatives in our relevance ranking approach.

Unigram, Bigram and Trigram

We studied the effect of different values of n in relevance ranking. Specifically, we considered three options: 1) Using unigrams. 2) Using both unigrams and bigrams.
3) Using unigrams, bigrams, and trigrams. We did not consider four-grams or higher-order n-grams because the average number of words in a query is approximately three.

Figure 7: Ranking performance comparison between different training data generation methods in the extraction phase

Fig. 9 shows the relevance ranking results. The use of unigrams outperforms the baseline, and including bigrams and trigrams further improves performance. However, the amount of improvement becomes smaller as n increases. Thus, unigrams play the most important role in the results, while bigrams and trigrams add further value.

Top K

We studied the impact of different values of K in top-K n-gram selection, trying K = 5, 10, 20, and 30. The results are given in Fig. 10. Ranking performance first increases and then decreases as K grows; the best result is achieved at K = 20.

N-Gram Extraction Score

Our approach uses the extracted key n-grams as a new meta-stream to create ranking model features; in other words, it ignores the extraction scores of the key n-grams. An alternative is to also utilize the extraction scores as features in the ranking model. We considered three ways to use the scores: 1) The scores are used to weight the n-grams in the key n-gram meta-stream (Score as Weight). 2) The scores of all the key n-grams appearing in the query are summed and used as a feature (Score Sum). 3) The two strategies are combined (All). The experimental results are shown in Fig. 11. The change in performance is quite small when using Score as Weight, while Score Sum further improves ranking performance on both datasets. Moreover, the best performance is achieved when both strategies are combined, further improving our method's NDCG@1 by 1.04 and 1.24 points on the Test1 and Test2 datasets.
(Compared with the baseline, the improvements are 2.96 and 3.18 points respectively.) This experiment shows that key n-gram extraction scores can benefit relevance ranking even when very simple strategies are used.
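The two score-based strategies can be sketched loosely as follows, where `key_ngrams` maps each extracted n-gram to its extraction score; both the data structure and the simplified feature definitions are assumptions, not the paper's exact formulas.

```python
def score_features(query, key_ngrams):
    """Sketch of Score Sum and a score-weighted match feature.
    `key_ngrams`: dict mapping extracted n-grams to extraction scores.
    Simplified assumptions, not the paper's exact definitions."""
    q_text = query.lower()
    q_terms = q_text.split()

    # Score Sum: total score of the key n-grams that appear in the query
    score_sum = sum(s for ng, s in key_ngrams.items() if ng in q_text)

    # Score as Weight (loose reading): query-term matches against the
    # key n-gram meta-stream, each weighted by the n-gram's score
    weighted = sum(s for ng, s in key_ngrams.items()
                   for t in q_terms if t in ng.split())
    return {"ScoreSum": score_sum, "ScoreWeightedMatch": weighted}
```

In the actual model the weights would enter the meta-stream's BM25-style features rather than a single count, but the sketch conveys the idea.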

Figure 8: Ranking performance with different training data sizes in the extraction phase

Figure 9: Ranking performance comparison between unigrams, bigrams and trigrams

6. DISCUSSION

We analyzed the results to investigate why our approach outperforms the baseline, as well as its limitations.

Table 3: Examples of extracted key n-grams

Example 1
URL: transformer-app-makes-your-ipod-touch-look-like-an-iphone/
Queries: cydia app to make itouch in to an iphone
Search log: (none)
Key n-grams: transformer, ipod, iphone, ipod touch look, touch look like, look like, like iphone, ipod touch, transformer app, app makes

Example 2
URL: early_show_crowd_celebrates_myrtle_beach_board-ar /
Queries: myrtle beach boardwalk grand opening
Search log: myrtle beach, myrtle beach boardwalk
Key n-grams: myrtle beach, beach, myrtle, grand opening, cbs, show, cbs early show, early show, crowd celebrate, myrtle beach boardwalk

By checking the cases in which our method worked better than the baseline, we found that our approach extracts good n-grams from a page that match a query even when they do not appear in anchor texts or the page's associated queries. Tab. 3 shows examples. In the first example, no search log data is associated with the page, yet our approach extracts key n-grams that contribute to ranking. In the second example, although the page does have associated search log data, our approach additionally extracts "grand opening", which does not occur in that log data. Our method thus adds a good representation of a web page and enhances its relevance ranking. We also examined the erroneous cases generated by our approach, and found that ranking performance could be further improved with representations better than n-grams. Occasionally, n-grams do not suffice to represent the matches between queries and documents.
Obviously, there are matches based on linguistic structures that cannot be represented well by n-grams. For example, "ipod touch look" in Tab. 3 is extracted by our approach, but its linguistic structure is not captured. If the query is "Abraaj Capital" and the web page is the home page of the Abraaj Capital Art Prize, our method still assigns a high score to the page because of the match through the bigram "Abraaj Capital". This suggests that a fundamental change to relevance models, allowing the use of more complicated representations, might be necessary.

7. CONCLUSIONS

This paper has studied the problem of extracting search-focused key n-grams from web pages and utilizing these key n-grams, as well as their weights, as features to enhance relevance ranking in web search. We use search log data to generate training data and employ learning to rank to learn the key n-gram extraction model, mainly from head pages, and we apply the learned model to all web pages, particularly tail pages. The extracted key n-grams provide another description of a web page, in addition to anchor texts and the queries in search log data. Results on several large datasets validate the effectiveness of the proposed approach: it significantly improves relevance ranking, and the improvements on tail pages are particularly large. In future work, we plan to apply our key n-gram extraction method to other datasets and other tasks such as index compression, and to study the problem of key n-gram generation for web pages.

8. REFERENCES

[1] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In Proc. of SIGIR '06, pages 19–26, 2006.
[2] M. Bendersky, D. Metzler, and W. B. Croft. Parameterized concept weighting in verbose queries. In Proc. of SIGIR '11, 2011.

Figure 10: Ranking performance with different K in top-K n-gram selection

Figure 11: Ranking performance with extraction scores of key n-grams as features in relevance ranking

[3] A. Berger and J. Lafferty. Information retrieval as statistical translation. In Proc. of SIGIR '99, 1999.
[4] N. Craswell, D. Hawking, R. Wilkinson, and M. Wu. Overview of the TREC 2003 web track. In Proc. of TREC '03, pages 78–92, 2003.
[5] M. Cutler, Y. Shih, and W. Meng. Using the structure of HTML documents to improve retrieval. In Proc. of USITS '97, 1997.
[6] A. L. da Costa Carvalho, E. S. de Moura, and P. Calado. Using statistical features to find phrasal terms in text collections. Journal of Information and Data Management, 1(3).
[7] E. Frank, G. W. Paynter, I. H. Witten, C. Gutwin, and C. G. Nevill-Manning. Domain-specific keyphrase extraction. In Proc. of IJCAI '99, 1999.
[8] J. Gao, W. Yuan, X. Li, K. Deng, and J.-Y. Nie. Smoothing clickthrough data for web search ranking. In Proc. of SIGIR '09, 2009.
[9] S. Goel, A. Broder, E. Gabrilovich, and B. Pang. Anatomy of the long tail: ordinary people with extraordinary tastes. In Proc. of WSDM '10, 2010.
[10] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, 2000.
[11] Y. Hu, G. Xin, R. Song, G. Hu, S. Shi, Y. Cao, and H. Li. Title extraction from bodies of HTML documents and its application to web page retrieval. In Proc. of SIGIR '05, 2005.
[12] A. Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proc. of EMNLP '03, 2003.
[13] U. Irmak, V. V. Brzeski, and R. Kraft. Contextual ranking of keywords using click data. In Proc. of ICDE '09, 2009.
[14] X. Jiang, Y. Hu, and H. Li. A ranking approach to keyphrase extraction. In Proc. of SIGIR '09, 2009.
[15] T. Joachims. Optimizing search engines using clickthrough data. In Proc. of KDD '02, 2002.
[16] T. Joachims. Training linear SVMs in linear time. In Proc. of KDD '06, 2006.
[17] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proc. of SIGIR '00, pages 41–48, 2000.
[18] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proc. of SIGIR '01, 2001.
[19] H. Li. Learning to rank for information retrieval and natural language processing. Synthesis Lectures on Human Language Technologies, 4(1):1–113, 2011.
[20] T. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 2009.
[21] D. Metzler and W. Croft. A Markov random field model for term dependencies. In Proc. of SIGIR '05, 2005.
[22] D. Paranjpe. Learning document aboutness from implicit user feedback and document structure. In Proc. of CIKM '09, 2009.
[23] J. Ponte and W. Croft. A language modeling approach to information retrieval. In Proc. of SIGIR '98, 1998.
[24] F. Radlinski and T. Joachims. Query chains: learning to rank from implicit feedback. In Proc. of KDD '05, 2005.
[25] S. E. Robertson and S. J. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proc. of SIGIR '94, 1994.
[26] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18, 1975.
[27] K. Sarkar, M. Nasipuri, and S. Ghoser. A new approach to keyphrase extraction using neural networks. International Journal of Computer Science, 7:16–25.
[28] P. D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4), 2000.
[29] P. D. Turney. Mining the web for lexical knowledge to improve keyphrase extraction: Learning from labeled and unlabeled data. Technical Report ERB-1096, National Research Council, Institute for Information Technology.
[30] E. Voorhees and D. Harman. TREC: Experiment and evaluation in information retrieval. Computational Linguistics, 32(4).
[31] K. Wang, X. Li, and J. Gao. Multi-style language model for web scale information retrieval. In Proc. of SIGIR '10, 2010.
[32] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. KEA: practical automatic keyphrase extraction. In Proc. of DL '99, 1999.
[33] J. Xu, H. Li, and C. Zhong. Relevance ranking using kernels. In Proc. of AIRS '10, pages 1–12, 2010.

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

A New Approach to Query Segmentation for Relevance Ranking in Web Search

A New Approach to Query Segmentation for Relevance Ranking in Web Search Noname manuscript No. (will be inserted by the editor) A New Approach to Query Segmentation for Relevance Ranking in Web Search Haocheng Wu Yunhua Hu Hang Li Enhong Chen Received: date / Accepted: date

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data Misha Bilenko and Ryen White presented by Matt Richardson Microsoft Research Search = Modeling User Behavior

More information

News-Oriented Keyword Indexing with Maximum Entropy Principle.

News-Oriented Keyword Indexing with Maximum Entropy Principle. News-Oriented Keyword Indexing with Maximum Entropy Principle. Li Sujian' Wang Houfeng' Yu Shiwen' Xin Chengsheng2 'Institute of Computational Linguistics, Peking University, 100871, Beijing, China Ilisujian,

More information

WebSci and Learning to Rank for IR

WebSci and Learning to Rank for IR WebSci and Learning to Rank for IR Ernesto Diaz-Aviles L3S Research Center. Hannover, Germany diaz@l3s.de Ernesto Diaz-Aviles www.l3s.de 1/16 Motivation: Information Explosion Ernesto Diaz-Aviles

More information

TREC-10 Web Track Experiments at MSRA

TREC-10 Web Track Experiments at MSRA TREC-10 Web Track Experiments at MSRA Jianfeng Gao*, Guihong Cao #, Hongzhao He #, Min Zhang ##, Jian-Yun Nie**, Stephen Walker*, Stephen Robertson* * Microsoft Research, {jfgao,sw,ser}@microsoft.com **

More information

An Attempt to Identify Weakest and Strongest Queries

An Attempt to Identify Weakest and Strongest Queries An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics

More information

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016 Advanced Topics in Information Retrieval Learning to Rank Vinay Setty vsetty@mpi-inf.mpg.de Jannik Strötgen jannik.stroetgen@mpi-inf.mpg.de ATIR July 14, 2016 Before we start oral exams July 28, the full

More information

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction The 2014 Conference on Computational Linguistics and Speech Processing ROCLING 2014, pp. 110-124 The Association for Computational Linguistics and Chinese Language Processing Collaborative Ranking between

More information

Northeastern University in TREC 2009 Million Query Track

Northeastern University in TREC 2009 Million Query Track Northeastern University in TREC 2009 Million Query Track Evangelos Kanoulas, Keshi Dai, Virgil Pavlu, Stefan Savev, Javed Aslam Information Studies Department, University of Sheffield, Sheffield, UK College

More information

Improving Difficult Queries by Leveraging Clusters in Term Graph

Improving Difficult Queries by Leveraging Clusters in Term Graph Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu

More information

Optimizing Search Engines using Click-through Data

Optimizing Search Engines using Click-through Data Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches

More information

Automatic Summarization

Automatic Summarization Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization

More information

A Few Things to Know about Machine Learning for Web Search

A Few Things to Know about Machine Learning for Web Search AIRS 2012 Tianjin, China Dec. 19, 2012 A Few Things to Know about Machine Learning for Web Search Hang Li Noah s Ark Lab Huawei Technologies Talk Outline My projects at MSRA Some conclusions from our research

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

A Machine Learning Approach for Improved BM25 Retrieval

A Machine Learning Approach for Improved BM25 Retrieval A Machine Learning Approach for Improved BM25 Retrieval Krysta M. Svore and Christopher J. C. Burges Microsoft Research One Microsoft Way Redmond, WA 98052 {ksvore,cburges}@microsoft.com Microsoft Research

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval Tie-Yan Liu 1, Jun Xu 1, Tao Qin 2, Wenying Xiong 3, and Hang Li 1

LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval Tie-Yan Liu 1, Jun Xu 1, Tao Qin 2, Wenying Xiong 3, and Hang Li 1 LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval Tie-Yan Liu 1, Jun Xu 1, Tao Qin 2, Wenying Xiong 3, and Hang Li 1 1 Microsoft Research Asia, No.49 Zhichun Road, Haidian

More information

University of Delaware at Diversity Task of Web Track 2010

University of Delaware at Diversity Task of Web Track 2010 University of Delaware at Diversity Task of Web Track 2010 Wei Zheng 1, Xuanhui Wang 2, and Hui Fang 1 1 Department of ECE, University of Delaware 2 Yahoo! Abstract We report our systems and experiments

More information

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback RMIT @ TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback Ameer Albahem ameer.albahem@rmit.edu.au Lawrence Cavedon lawrence.cavedon@rmit.edu.au Damiano

More information

An Investigation of Basic Retrieval Models for the Dynamic Domain Task

An Investigation of Basic Retrieval Models for the Dynamic Domain Task An Investigation of Basic Retrieval Models for the Dynamic Domain Task Razieh Rahimi and Grace Hui Yang Department of Computer Science, Georgetown University rr1042@georgetown.edu, huiyang@cs.georgetown.edu

More information

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts.

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts. Volume 5, Issue 3, March 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Advanced Preferred

More information

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Semantic Annotation using Horizontal and Vertical Contexts

Semantic Annotation using Horizontal and Vertical Contexts Semantic Annotation using Horizontal and Vertical Contexts Mingcai Hong, Jie Tang, and Juanzi Li Department of Computer Science & Technology, Tsinghua University, 100084. China. {hmc, tj, ljz}@keg.cs.tsinghua.edu.cn

More information

High Accuracy Retrieval with Multiple Nested Ranker

High Accuracy Retrieval with Multiple Nested Ranker High Accuracy Retrieval with Multiple Nested Ranker Irina Matveeva University of Chicago 5801 S. Ellis Ave Chicago, IL 60637 matveeva@uchicago.edu Chris Burges Microsoft Research One Microsoft Way Redmond,

More information

Information Retrieval

Information Retrieval Information Retrieval Learning to Rank Ilya Markov i.markov@uva.nl University of Amsterdam Ilya Markov i.markov@uva.nl Information Retrieval 1 Course overview Offline Data Acquisition Data Processing Data

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

Sentiment analysis under temporal shift

Focused Retrieval Using Topical Language and Structure

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iConference 2015 Proceedings.

Extracting Visual Snippets for Query Suggestion in Collaborative Web Search

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Chapter 8: Evaluating Search Engines

Ranking and Learning: weighted scoring for ranking; learning to rank; ranking as classification (UCSB, Tao Yang)

Modern Retrieval Evaluations (Hongning Wang)

Exploring Reductions for Long Web Queries

A Document-centered Approach to a Natural Language Music Search Engine

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014

Link Graph Analysis for Adult Images Classification

Predictive Indexing for Fast Search

Robust Relevance-Based Language Models

A Deep Top-K Relevance Matching Model for Ad-hoc Retrieval

A Survey on Positive and Unlabelled Learning

Boolean Model (Hongning Wang)

Query Independent Scholarly Article Ranking

Retrieval Evaluation (Hongning Wang)

A User Preference Based Search Engine

CS229 Final Project: Predicting Expected Email Response Times

Custom IDF weights for boosting the relevancy of retrieved documents in textual retrieval

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

Studying of Classifying Chinese SMS Messages Based on Bayesian Classification

Detecting Multilingual and Multi-Regional Query Intent in Web Search

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

NTU Approaches to Subtopic Mining and Document Ranking at NTCIR-9 Intent Task

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Risk Minimization and Language Modeling in Text Retrieval (Thesis Summary)

Evaluating an Associative Browsing Model for Personal Information

Learning Lexicon Models from Search Logs for Query Expansion

Verbose Query Reduction by Learning to Rank for Social Book Search Track

Adapting Document Ranking to Users' Preferences using Click-through Data

Personalized Web Search

Window Extraction for Information Retrieval

Introduction (book table of contents): What is the World Wide Web?; A Brief History of the Web and the Internet; Web Data Mining; What is Data Mining?; What is Web Mining?; Summary of Chapters

Link-Contexts for Ranking

A Task Level Metric for Measuring Web Search Satisfaction and its Application on Improving Relevance Estimation

Document Structure Analysis in Associative Patent Retrieval

A New Technique to Optimize User's Browsing Session using Data Mining

A Novel Categorized Search Strategy using Distributional Clustering

Chapter 27: Introduction to Information Retrieval and Web Search

Reducing Redundancy with Anchor Text and Spam Priors

Information Retrieval (M&S Ch. 15)

Chapter 5: Summary and Conclusion

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags

Semi-Parametric and Non-parametric Term Weighting for Information Retrieval

Estimating Credibility of User Clicks with Mouse Movement and Eye-tracking Information

Collaborative Filtering using a Spreading Activation Approach

Entity and Knowledge Base-oriented Information Retrieval

University of Virginia, CS 4501: Information Retrieval, Fall 2015 (exam)

On Duplicate Results in a Search Session

Image Retrieval System: Based on User Requirement and Inferring Analysis through Feedback

CPSC 340: Machine Learning and Data Mining (Probabilistic Classification, Fall 2017)

TriRank: Review-aware Explainable Recommendation by Modeling Aspects

Learning Subclass Representations for Visually-varied Image Classification (arXiv cs.MM, 12 Jan 2016)

Repositorio Institucional de la Universidad Autónoma de Madrid (author-produced version of a conference paper)

The Comparative Study of Machine Learning Algorithms in Text Data Classification

A Facebook Profile Based TV Shows and Movies Recommendation System (International Journal of Advance Engineering and Research Development)

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

Term Frequency Normalisation Tuning for BM25 and DFR Models

Chapter Three: Information Retrieval System

Federated Search (Jaime Arguello, INLS 509: Information Retrieval)

Deep Web Crawling and Mining for Building Advanced Search Application

Context based Re-ranking of Web Documents (CReWD)

CS473: Course Review (Luo Si, Purdue University)

Ranking models in Information Retrieval: A Survey