A Framework for Web Host Quality Detection
A. Aycan Atak and Şule Gündüz Ögüdücü

Abstract

With the rapid growth of the World Wide Web, finding useful and desired information in a short amount of time has become an important issue for Web users. Search engines and focused crawlers help people navigate the Internet. A user expresses her information need in the form of a query, and a huge number of Web pages are returned for this query. However, the majority of users view only a single results page (the top 10 Web pages as ranked by the search engine). Even if the returned Web pages do not provide the exact information they need, users rarely refine their query based on the results of their initial query. Thus, not only finding relevant Web pages but also ranking them plays an important role for search engines. For this reason, determining the quality of Web pages is one of the main priorities of search engines, since low-quality Web pages cause search engine results to be vague and flooded with irrelevant Web pages. In this paper, we propose a novel method for determining the quality of Web pages. The proposed method first identifies the genre of Web pages and then determines their quality based on their genre. Our experimental results on the ECML/PKDD 2010 Discovery Challenge data set show that the proposed method is effective and efficient.

I. INTRODUCTION

The number of Web pages grows rapidly every day. The advent of the Web has caused a dramatic increase in the use of the Internet as a huge, widely distributed, global information service for every kind of information. Since there is no central system to control the Web, it is impossible to estimate the precise number of Web sites and Web pages on the Internet. Monthly surveys by sites like Netcraft¹ have shown that in September 2010 there were nearly 227,225,642 sites on the Web.
In this environment, search engines help people locate information relevant to their needs, expressed in the form of a query. However, the lack of central control has also increased the number of Web pages containing noisy, contradictory and unreliable information. As a result, even the first pages of search results are heavily spammed. Since Web searchers usually examine only the top ten results, it is important for search engines to list truly relevant and high-quality Web pages at the top of the search results. Web spam can significantly decrease the quality of search engine results. Search engines rely on efficient algorithms to detect and block spam Web pages. Without such algorithms, search engine results may be unreliable and Web searchers lose their trust and confidence in the search engine.

Common approaches for spam detection are based on extracting a set of content-based and link-based features from Web pages. From the machine learning point of view, Web spam detection is treated as a binary classification problem in which Web site content is labeled as spam or non-spam. In this problem, Web pages are represented as feature vectors whose dimensions correspond to the terms appearing on them. However, this technique is vulnerable to Web sites faking high relevance with respect to some topics. This is called Search Engine Persuasion (SEP) in [9]. Link analysis is one of the solutions to this problem. Thus, using PageRank scores to eliminate spam Web pages from search results has been standard practice for years.

A. Aycan Atak is with the Department of Computer Engineering, Istanbul Technical University, Istanbul, 34469, Turkey (ataka@itu.edu.tr).
Şule Gündüz Ögüdücü is with the Department of Computer Engineering, Istanbul Technical University, Istanbul, 34469, Turkey (sgunduz@itu.edu.tr).
¹ Accessed on 27 Sept
However, applying spam filtering methods still does not guarantee that search engines list relevant and high-quality Web pages first. Ranking results by relevance and quality is a different task from traditional spam detection. One of the differences is that the measurement of relevance and quality of search results is subjective, since it depends heavily on people's perceptions of the relevance and quality of information. Low quality is not simply equivalent to Web spam. To promote the research and practice of new strategies for determining the overall rank, quality and importance of a Web site and for estimating Web content quality, the ECML/PKDD 2010 Discovery Challenge was organized. Unlike the traditional Web spam detection problem, the aim of this challenge is to develop site-level classification for the genre of Web sites (editorial, news, commercial, educational, deep Web, Web spam and more) as well as their readability, authoritativeness, trustworthiness and neutrality. The motivation behind this labeling procedure is to help organizations, such as search engines and Web archives, prioritize their efforts to gather, store and organize their collections of Web pages.

Although link-based features are commonly used today for determining relevant and trusted Web hosts, term vectors obtained from Web content are still important components of this kind of quality determination problem. Using only link-based features such as PageRank scores, it is difficult to determine the factuality or genre of a Web site, both of which affect its quality score. However, given the size of the Web, it is not feasible to extract the full content of Web hosts in order to determine a quality score for them. Besides, for organizations such as search engines or Web archives, it is important to determine the quality score of a Web host without downloading the content of its Web pages.
In this paper, we focus on determining the quality score of Web pages. For this task, we used the data set provided by the Discovery Challenge. The Discovery Challenge tasks included the prediction of a quality score predefined based on genre, trust, factuality, bias and spamicity. We found that content-based features are useful when predicting the genre of Web pages. However, for determining more subjective characteristics of Web pages, such as trustiness, neutrality and bias, using the predicted genre labels of Web pages yields better results.

The rest of this paper is organized as follows. Section II reviews related work. Section III gives a detailed description and analysis of the available data set. Section IV presents the classifiers and feature selection methods used in this study. Experimental results are reported and discussed in Section V. Finally, conclusions and future work plans are given in Section VI.

II. RELATED WORK

Web spam detection has been identified as a serious problem for search engines and Web archives. However, studies on this problem have slowed down since the problem of detecting spam in social networks has become more attractive for researchers. Another important research area is determining the utility of a Web page in relation to an information need represented as a query. It has been shown that a range of factors affect human judgments of relevance. These studies, however, have been conducted on textual documents, whose structure differs from that of Web documents with their wide range of formally and informally produced multimedia content and hyperlinks. Studies on users' perceptions of the relevance of information on the Web are few.

Recently, several studies were performed within the ECML/PKDD 2010 Discovery Challenge (DC2010). Geng et al. used multi-scale attributes composed of attribute groups including content features, page-level and host-level link features, and TFIDF features [1]. After fusing the different sets of features, bagging is applied to a C4.5 decision tree to classify Web sites according to the categories given in the DC2010 data sets.
In that study, it was found that the host-level link features are robust for classification tasks and that feature fusion is necessary for statistical Web content quality assessment. Sokolov et al. used the RankBoost algorithm in their instance-based model and propagation schemes in their graph-based model separately [2]. They showed that an iterative algorithm for learning the propagation scheme is comparatively more efficient at revealing correlations between different quality levels. Nikulin reduced the dimensionality of host attributes using Wilcoxon-based feature selection [3]. He also reduced the multi-class decision problem to a corresponding number of two-class decision problems using the one-against-all method. In that study, Nikulin took the final decision using the minimal and maximal values from the result set of the two-class decision problems. For predicting host quality, Lex et al. used voting with three classifiers: a J48 decision tree, a class-feature-centroid classifier (CFC) and a support vector machine (SVM) [4]. In that study, each classifier is applied to a different type of attributes, and an oversampling method is used to deal with the imbalanced data set problem.

III. DATA SET

The data set used in this study, DC2010, is provided by the European Archive Foundation as the material of the ECML/PKDD Discovery Challenge 2010 on Web Quality [5]. It was created by crawling Web sites in the .eu domain in three languages: English, German and French. Thus, the data set is separated into three parts, where each part contains Web hosts in a different language with the same group of attributes. In this data set, four sets of attributes are provided: a link-based attribute set, a content-based attribute set, a natural language processing (NLP) attribute set, and term frequency vectors of hosts. The URLs and the host graph of the crawled portion of the Web are also provided with the data set.
These sets of attributes are described in the following paragraphs.

The link-based attributes are obtained from the Web graph. Among the 178 link-based attributes, the most salient seem to be the PageRank, out-degree, in-degree and TrustRank values. These attributes give information about the graph properties of hosts.

Content-based attributes are obtained using the full content of hosts. Examples of this set of attributes are the number of words in the homepage of a host or the compression rate of the homepage. The total number of attributes in this set is 98.

NLP attributes are computed per URL using the text content of Web hosts. This attribute set includes features such as the number of tokens in a URL or the counts of token types such as adverbs or pronouns in a URL.

The term frequency vector of each host is computed by counting the number of times words appear within that host. The term frequency vector consists of the 50,000 most frequent terms after eliminating stop words.

One of the tasks in this challenge is the quality determination of Web sites, where the quality of a Web site is measured as an aggregate function of its genre, neutrality, bias and trustiness. Therefore, predicting these class attributes of Web sites is the main problem, and its solution yields the quality prediction. The first class attribute to be predicted is genre. There are 6 possible values of this attribute in the provided data set, so a Web site is labeled with one of the following categories: spam, news-editorial, commercial, educational, discussion or personal-leisure. As can be seen from Table I, the data set is highly imbalanced with respect to genre, with more Web sites labeled as commercial or educational. Very little information is available for some genre labels in the training set, which affects the learning results negatively. For example, in the French data set only one host is labeled as discussion.
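The term frequency vectors described earlier in this section can be illustrated with a small sketch. It is not the procedure used to build DC2010: the stop-word list, the tokenizer, the host names and the vocabulary size of 3 (instead of 50,000) are all assumptions made for illustration.

```python
from collections import Counter
import re

# Tiny illustrative stop-word list (an assumption; the data set's list is not given).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_vocabulary(host_texts, max_terms):
    """Return the max_terms most frequent terms over all hosts."""
    totals = Counter()
    for text in host_texts.values():
        totals.update(tokenize(text))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_terms]]

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text of one host."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

# Hypothetical hosts and page text, used only to exercise the functions.
hosts = {
    "shop.example.eu": "Buy cheap shoes. Buy the best shoes in the shop.",
    "uni.example.eu": "The university offers courses for students.",
}
vocabulary = build_vocabulary(hosts, max_terms=3)
vectors = {host: term_frequency_vector(text, vocabulary) for host, text in hosts.items()}
```

With such a small vocabulary the second host ends up with an all-zero vector, which loosely mirrors the instances without a usable term frequency vector that are removed from the training and validation sets.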
TABLE I
GENRE DISTRIBUTION FOR EACH DATASET

Dataset   Spam   News-Editorial   Commercial   Educational   Discussion   Personal-Leisure
English   2.7%   3.0%             36.2%        34.0%         4.3%         19.9%
German    3.7%   11.2%            44.8%        12.0%         10.4%        18.0%
French    4.8%   3.7%             33.3%        13.2%         0.5%         44.4%

TABLE II
TRUSTINESS (A), NEUTRALITY (B) AND BIAS (C) DISTRIBUTION FOR EACH PART OF DATASET

(A) Trustiness
Dataset   1      2       3
English   0.3%   1.1%    98.5%
German    5.7%   70.0%   24.1%
French    0.0%   2.0%    98.0%

(B) Neutrality
Dataset   1      2       3
English   0.3%   2.3%    97.4%
German    3.4%   61.0%   35.6%
French    1.0%   5.2%    93.8%

(C) Bias
Dataset   1      2
English   1.2%   98.8%
German    4.6%   95.4%
French    0.0%   100.0%

Fig. 1. Proposed methodology

The second class attribute, trustiness, can take three values: 1, 2 and 3. A value of 1 means that the host is not reliable. Values of 2 and 3 mean that the host is reliable. The third class attribute, neutrality, indicates the factuality of a host and also has three values: 1, 2 and 3. A value of 1 means that the host is problematic. The fourth class attribute, bias, can take two values: 1 and 2. A value of 1 indicates that there are significant bias problems in the host. As can be seen in Table II, the training data set is imbalanced. For example, the training set of the French data does not contain any example with a trustiness or bias value of 1.

IV. METHODOLOGY

In this study, we only utilize the term vector attribute set. The remaining three sets of attributes, namely the link-based, content-based and NLP attributes, were found to be irrelevant for predicting the class attributes based on the results of feature selection methods. It is observed that these three sets of attributes do not contain sufficient information to determine the class labels. It must be noted that some instances in the data set do not have a term frequency vector. These instances are removed from both the training and validation sets. It is observed that the genre values of hosts are more informative than the term frequency vectors in determining the hosts' trustiness, neutrality and bias values.
For this reason, in this study the genre of Web hosts is first determined from the term frequency vectors, and the predicted genre is then used to identify the trustiness, neutrality and bias values of hosts. The obtained values are used to compute the quality score of the hosts. An overview of our methodology is given in Figure 1.

For the machine learning algorithms, including attribute selection, classification and oversampling methods, the implementations in the WEKA data mining tool are used. However, it is observed that oversampling methods do not handle the imbalanced data set problem successfully and that they reduce the performance of the classification algorithms. For this reason, oversampling methods and the classification results obtained with them are not reported in this study.

A. Genre Prediction

Genre prediction poses several problems at once: it is a multi-class decision problem, a text categorization problem, and it involves a high-dimensional vector space. Yang and Pedersen compared several attribute selection methods for text categorization, including information gain, chi-square statistics, document frequency, mutual information and term strength [6]. According to that study, information gain and chi-square statistics are among the best performing attribute selection methods for text categorization. In this study, chi-square statistics also performs better for all data sets: it decreases classification execution time while increasing the accuracy of the classifier. For each data set, the number of selected terms is determined with the hill-climbing method. The attribute selection method and the number of selected terms that minimize the training error are selected for the final classification.

For classification, different classifiers are applied to the same data set. The support vector machine (SVM), which is considered the best performing classifier for many text categorization problems [7], also performs better in this study. As in the attribute selection phase, the classifier and its parameters are selected based on the results of the hill-climbing method.

From the original term vectors, new term vectors are generated with binary weighting. With binary weighting, for host h and term t, the weight of t in the term vector of h is set to 1 if h contains t, and to 0 otherwise. Each experiment to find the best classifier and attribute selection method is applied to both versions of the term vectors (term frequency and binary) to illustrate the effect of the weighting scheme on the classification results. Term vectors with binary weighting yield better classification accuracy. Table III presents the classifier and feature selection method applied to each data set and the resulting number of terms after feature selection.

TABLE III
FINAL DECISIONS FOR ATTRIBUTE SELECTION METHOD AND CLASSIFIER

Dataset   Feature Selection Method   # Of Selected Terms   Weighting   Classifier
English   Chi-square Stat.                                 Binary      SVM
German    Chi-square Stat.           100                   Binary      SVM
French    Chi-square Stat.                                 Binary      SVM

B. Trustiness, Neutrality and Bias Prediction

Trustiness, neutrality and bias are predicted based on the results of genre prediction. For a host h, the genre prediction process produces 6 probability values, each of which indicates the probability of the host belonging to one genre. A host $h_i$ can then be represented with 6 attributes, $g_{i1}, g_{i2}, \ldots, g_{i6}$, where the value of each attribute is obtained from genre prediction and $\sum_{j=1}^{6} g_{ij} = 1$. Linear regression is applied to this data set to predict the trustiness, neutrality and bias values of each host. Thus, linear regression produces a score for each host's trustiness, neutrality and bias values. The hosts in the test set are then sorted by each of these values individually.

C. Quality Prediction

To determine the quality levels of hosts, the DC2010 tasks also include the prediction of a utility score predefined based on genre, trustiness, neutrality and bias values. For each host h, the utility value is calculated as in Algorithm 1 by combining the genre, trustiness, neutrality and bias values of h.

Algorithm 1 Quality Determination
    value = 0;
    if News-Edit OR Educational then value = 5;
    else if Discussion then value = 4;
    else if Commercial OR Personal-Leisure then value = 3;
    if neutrality = 3 then value += 2;
    if bias = 1 then value -= 2;
    if trustiness = 3 then value += 2;

Based on Algorithm 1, the utility value of a host can range between -2 and 9. The categories News and Educational have the highest quality. Trusted, unbiased and neutral content also has a high quality score. By default, Web spam hosts have the lowest quality. The utility value may be used to predict the quality of a host. As before, linear regression can be applied to obtain quality scores.

V. EXPERIMENTAL RESULTS

The results of the experiments are reported in terms of the evaluation metrics used at DC2010. Following the tasks defined by this challenge, quality prediction is evaluated for the English, German and French data sets, while genre, trustiness, neutrality and bias prediction are evaluated only for the English data set.

A. Evaluation

For evaluation, normalized discounted cumulative gain (NDCG) is used at DC2010. NDCG is obtained by normalizing the discounted cumulative gain (DCG) with the ideal DCG value. DCG and NDCG can be computed with Eq. 1 and Eq. 2, respectively.

$$\mathrm{DCG} = \sum_{rank=1}^{N} utility(rank) \left( 1 - \frac{rank}{N} \right) \qquad (1)$$

$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IdealDCG}} \qquad (2)$$

In Eq. 1, N is the number of test instances in the test set, and the ideal DCG is the DCG value obtained by the ideal ordering of the hosts. According to the NDCG formula, it is important to rank hosts with higher utility values at the top. Note that for a perfect ranking algorithm the DCG value equals the ideal DCG, producing an NDCG of 1.0. Thus, all NDCG values are relative values on the interval 0.0 to 1.0. For the genre prediction problem, when evaluating results for genre g, the utility value is 1 for host h if h belongs to genre g; otherwise the utility value is 0.
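To make the utility score and the evaluation measure concrete, the following is a minimal sketch rather than the challenge's official scorer. It assumes that the discount in Eq. 1 is the linear term (1 - rank/N) with ranks starting at 1; the genre strings and the example hosts are illustrative.

```python
def utility(genre, trustiness, neutrality, bias):
    """Utility of one host, following the rules of Algorithm 1."""
    value = 0  # hosts that match no genre rule (e.g. spam) keep the base value
    if genre in ("news-editorial", "educational"):
        value = 5
    elif genre == "discussion":
        value = 4
    elif genre in ("commercial", "personal-leisure"):
        value = 3
    if neutrality == 3:
        value += 2
    if bias == 1:
        value -= 2
    if trustiness == 3:
        value += 2
    return value

def dcg(utilities_in_ranked_order):
    """DCG with the assumed linear discount (1 - rank/N), rank starting at 1."""
    n = len(utilities_in_ranked_order)
    return sum(u * (1 - rank / n)
               for rank, u in enumerate(utilities_in_ranked_order, start=1))

def ndcg(utilities_in_ranked_order):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(utilities_in_ranked_order, reverse=True))
    return dcg(utilities_in_ranked_order) / ideal if ideal != 0 else 0.0

# Four hypothetical hosts with made-up labels.
best = utility("news-editorial", trustiness=3, neutrality=3, bias=2)   # 5 + 2 + 2
good = utility("discussion", trustiness=3, neutrality=2, bias=2)       # 4 + 2
plain = utility("commercial", trustiness=2, neutrality=2, bias=2)      # 3
worst = utility("spam", trustiness=1, neutrality=1, bias=1)            # 0 - 2

perfect_score = ndcg([best, good, plain, worst])
reversed_score = ndcg([worst, plain, good, best])
```

Under this form of the discount the last-ranked host always receives weight 0, so mistakes near the bottom of the ranking cost much less than mistakes near the top.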
B. Genre Prediction

TABLE IV
GENRE PREDICTION RESULTS FOR ENGLISH DATASET

Genre              NDCG
Spam               0.88
News-Editorial     0.73
Commercial         0.84
Educational        0.87
Discussion         0.76
Personal-Leisure   0.81
Overall            0.82

Genre prediction results are given in Table IV. For each genre, the NDCG value is computed separately. The overall result of genre prediction is the arithmetic average of all NDCG values obtained from genre prediction. This experiment is applied to the English data set only. As can be seen from Table IV, the NDCG results are high compared to the results from previous publications. These results indicate that our proposed method yields higher NDCG values for spam genre prediction. The NDCG values for the spam and educational genres are the highest. This may be because spam and educational hosts are more word-specific than other genres: there are more discriminative words that occur mostly in spam and educational hosts. In contrast, discussion and news-editorial hosts share more words.

C. Trustiness, Neutrality and Bias Prediction

Results of the trustiness, neutrality and bias predictions are given in Table V. These class attributes are predicted with term vectors and with genre prediction results separately. As in genre prediction, this experiment is applied to the English data set only.

TABLE V
TRUSTINESS, NEUTRALITY AND BIAS PREDICTION RESULTS FOR ENGLISH DATASET

Class Attribute   NDCG Using Term Vectors   NDCG Using Genre Predictions
Trustiness
Neutrality
Bias

As can be seen from Table V, except for neutrality, using genre prediction results instead of term vectors causes a significant improvement in the NDCG values. It can also be concluded that, for Web hosts, neutrality is less dependent on genre values than trustiness and bias are. Even for neutrality, however, using genre prediction results still gives better results than using term vectors.
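The genre-based predictions compared above can be sketched as follows. This is not the WEKA linear regression used in the study: it is a plain least-squares fit written out by hand, with no intercept because the six genre probabilities already sum to 1, and with a tiny ridge term added as an assumption for numerical safety. The training labels and host names are invented for illustration.

```python
GENRES = ["spam", "news-editorial", "commercial",
          "educational", "discussion", "personal-leisure"]

def fit_least_squares(rows, targets, ridge=1e-8):
    """Solve (G^T G + ridge * I) w = G^T y with Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented normal-equation matrix [G^T G + ridge * I | G^T y].
    a = []
    for i in range(k):
        row = [sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
               for j in range(k)]
        row.append(sum(r[i] * t for r, t in zip(rows, targets)))
        a.append(row)
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def predict(weights, genre_probs):
    """Score of one host: weighted sum of its genre probabilities."""
    return sum(w * g for w, g in zip(weights, genre_probs))

# Hypothetical training hosts: one per genre, with made-up trustiness labels.
train_probs = [[1.0 if j == i else 0.0 for j in range(len(GENRES))]
               for i in range(len(GENRES))]
train_trustiness = [1, 3, 2, 3, 2, 2]
weights = fit_least_squares(train_probs, train_trustiness)

# Hypothetical test hosts described only by their predicted genre probabilities.
test_hosts = {
    "a.example.eu": [0.8, 0.0, 0.1, 0.0, 0.0, 0.1],
    "b.example.eu": [0.0, 0.7, 0.1, 0.2, 0.0, 0.0],
    "c.example.eu": [0.1, 0.0, 0.6, 0.0, 0.1, 0.2],
}
ranking = sorted(test_hosts, key=lambda h: predict(weights, test_hosts[h]), reverse=True)
```

The resulting ordering is what an NDCG evaluation would consume. Because each host is described by only six numbers, the regression is far cheaper than one over thousands of term features.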
Compared with other studies, these results are quite satisfactory. For trustiness, this is the best NDCG value obtained so far. The bias prediction results are also at the same level as the top results.

D. Quality Prediction

TABLE VI
QUALITY PREDICTION RESULTS

Dataset   NDCG
English   0.85
German    0.81
French    0.82

Quality prediction results are given in Table VI. For each data set, quality is computed separately using the trustiness, neutrality and bias prediction results. These results are comparable with the results obtained in other studies on the same data set. Using prediction results from other classification processes reduces the size of the attribute set, which in turn reduces the execution time of the classifier. For this reason, these results are more satisfactory when execution time and accuracy are considered together.

VI. CONCLUSION AND FUTURE WORK

In this study, a method to measure the quality of Web hosts is proposed. It is first hypothesized and then observed that the terms of hosts contain more information about Web host genre than link-based metrics. A framework is presented for predicting class attributes related to Web hosts. To evaluate the proposed framework, a data set from a conference challenge is used. The experimental results show that our proposed method is superior to previously proposed methods in terms of NDCG and execution time.

We are extending the classification method in several ways. More accurate predictions can be made using hybrid data sets that include both category prediction results and term vectors. Following our work on quality prediction as a two-class decision problem [8] and the multi-class decision problem studied here, detecting quality using propagation on graph structures is the main subject of our future work. Since the DC2010 data set includes the Web host graph structure, such future studies can use the same data set as this study.

VII. ACKNOWLEDGEMENT

Author Gunduz-Oguducu was partially supported by the Scientific and Technological Research Council of Turkey (TUBITAK) EEEAG project 110E027.

REFERENCES

[1] G. Geng, X. Jin, X. Zhang and D. Zhang, Evaluating Web content quality via multi-scale features, ECML/PKDD 2010 Discovery Challenge Workshop.
[2] A. Sokolov, T. Urvoy, L. Denoyer and O. Richard, Madspam consortium at the ECML/PKDD Discovery Challenge 2010, ECML/PKDD 2010 Discovery Challenge Workshop.
[3] V. Nikulin, Web-mining with Wilcoxon-based feature selection, ensembling and multiple binary classifiers, ECML/PKDD 2010 Discovery Challenge Workshop.
[4] E. Lex, I. Khan, H. Bischof and M. Granitzer, Assessing the quality of Web content, ECML/PKDD 2010 Discovery Challenge Workshop, 2010.
[5] A. A. Benczur, C. Castillo, M. Erdelyi, Z. Gyongyi, J. Masanes and M. Matthews, ECML/PKDD 2010 Discovery Challenge Data Set, Crawled by the European Archive Foundation.
[6] Y. Yang and J. O. Pedersen, A comparative study on feature selection in text categorization, In Proceedings of the Fourteenth International Conference on Machine Learning (ICML 97), Douglas H. Fisher (Ed.), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[7] T. Joachims, Text categorization with support vector machines: Learning with many relevant features, The 10th European Conference on Machine Learning.
[8] A. A. Atak and S. G. Oguducu, A framework for social spam detection based on relational Bayesian classifier, In Proceedings of the 6th International Conference on Data Mining (DMIN10), pp. 71-77.
[9] M. Marchiori, The quest for correct information on the Web: hyper search engines, In Selected Papers from the Sixth International Conference on World Wide Web, Phillip H. Enslow, Jr., Mike Genesereth, and Anna Patterson (Eds.), Elsevier Science Publishers Ltd., Essex, UK, 1997.
More informationLink Prediction for Social Network
Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue
More informationLink Analysis and Web Search
Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html
More informationEnhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm
Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,
More informationCS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows:
CS299 Detailed Plan Shawn Tice February 5, 2013 Overview The high-level steps for classifying web pages in Yioop are as follows: 1. Create a new classifier for a unique label. 2. Train it on a labelled
More informationContents. Preface to the Second Edition
Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................
More informationAutomatic New Topic Identification in Search Engine Transaction Log Using Goal Programming
Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationCS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai
CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until
More informationString Vector based KNN for Text Categorization
458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research
More informationA Content Vector Model for Text Classification
A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.
More informationUser Intent Discovery using Analysis of Browsing History
User Intent Discovery using Analysis of Browsing History Wael K. Abdallah Information Systems Dept Computers & Information Faculty Mansoura University Mansoura, Egypt Dr. / Aziza S. Asem Information Systems
More informationClassifying Twitter Data in Multiple Classes Based On Sentiment Class Labels
Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Richa Jain 1, Namrata Sharma 2 1M.Tech Scholar, Department of CSE, Sushila Devi Bansal College of Engineering, Indore (M.P.),
More informationSearch Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]
Search Evaluation Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Table of Content Search Engine Evaluation Metrics for relevancy Precision/recall F-measure MAP NDCG Difficulties in Evaluating
More informationProximity Prestige using Incremental Iteration in Page Rank Algorithm
Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationA Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion
A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham
More informationMachine Learning Techniques for Data Mining
Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.
More informationImproving the Efficiency of Fast Using Semantic Similarity Algorithm
International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year
More informationCS249: ADVANCED DATA MINING
CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal
More informationContent Based Smart Crawler For Efficiently Harvesting Deep Web Interface
Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Prof. T.P.Aher(ME), Ms.Rupal R.Boob, Ms.Saburi V.Dhole, Ms.Dipika B.Avhad, Ms.Suvarna S.Burkul 1 Assistant Professor, Computer
More informationSCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR
SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR P.SHENBAGAVALLI M.E., Research Scholar, Assistant professor/cse MPNMJ Engineering college Sspshenba2@gmail.com J.SARAVANAKUMAR B.Tech(IT)., PG
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data
More informationAUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS
AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS Nilam B. Lonkar 1, Dinesh B. Hanchate 2 Student of Computer Engineering, Pune University VPKBIET, Baramati, India Computer Engineering, Pune University VPKBIET,
More informationFeature Selection Using Modified-MCA Based Scoring Metric for Classification
2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification
More informationFig 1. Overview of IE-based text mining framework
DiscoTEX: A framework of Combining IE and KDD for Text Mining Ritesh Kumar Research Scholar, Singhania University, Pacheri Beri, Rajsthan riteshchandel@gmail.com Abstract: Text mining based on the integration
More informationFocused crawling: a new approach to topic-specific Web resource discovery. Authors
Focused crawling: a new approach to topic-specific Web resource discovery Authors Soumen Chakrabarti Martin van den Berg Byron Dom Presented By: Mohamed Ali Soliman m2ali@cs.uwaterloo.ca Outline Why Focused
More informationSlides for Data Mining by I. H. Witten and E. Frank
Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-
More informationAutomatic Classification of Audio Data
Automatic Classification of Audio Data Carlos H. C. Lopes, Jaime D. Valle Jr. & Alessandro L. Koerich IEEE International Conference on Systems, Man and Cybernetics The Hague, The Netherlands October 2004
More informationIn the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,
1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to
More informationAn Empirical Study of Lazy Multilabel Classification Algorithms
An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
More informationCHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION
CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant
More informationInformation-Theoretic Feature Selection Algorithms for Text Classification
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute
More informationA PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS
A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS KULWADEE SOMBOONVIWAT Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033,
More informationKeywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database.
Volume 6, Issue 5, May 016 ISSN: 77 18X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Fuzzy Logic in Online
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More informationDATA MINING II - 1DL460. Spring 2014"
DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationAutomatic Query Type Identification Based on Click Through Information
Automatic Query Type Identification Based on Click Through Information Yiqun Liu 1,MinZhang 1,LiyunRu 2, and Shaoping Ma 1 1 State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing, China
More informationFeature-weighted k-nearest Neighbor Classifier
Proceedings of the 27 IEEE Symposium on Foundations of Computational Intelligence (FOCI 27) Feature-weighted k-nearest Neighbor Classifier Diego P. Vivencio vivencio@comp.uf scar.br Estevam R. Hruschka
More informationGRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM
http:// GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM Akshay Kumar 1, Vibhor Harit 2, Balwant Singh 3, Manzoor Husain Dar 4 1 M.Tech (CSE), Kurukshetra University, Kurukshetra,
More informationSearch Engines Chapter 8 Evaluating Search Engines Felix Naumann
Search Engines Chapter 8 Evaluating Search Engines 9.7.2009 Felix Naumann Evaluation 2 Evaluation is key to building effective and efficient search engines. Drives advancement of search engines When intuition
More informationRandom projection for non-gaussian mixture models
Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,
More informationA Novel Approach to Image Segmentation for Traffic Sign Recognition Jon Jay Hack and Sidd Jagadish
A Novel Approach to Image Segmentation for Traffic Sign Recognition Jon Jay Hack and Sidd Jagadish Introduction/Motivation: As autonomous vehicles, such as Google s self-driving car, have recently become
More informationA Dynamic Bayesian Network Click Model for Web Search Ranking
A Dynamic Bayesian Network Click Model for Web Search Ranking Olivier Chapelle and Anne Ya Zhang Apr 22, 2009 18th International World Wide Web Conference Introduction Motivation Clicks provide valuable
More informationDynamic Visualization of Hubs and Authorities during Web Search
Dynamic Visualization of Hubs and Authorities during Web Search Richard H. Fowler 1, David Navarro, Wendy A. Lawrence-Fowler, Xusheng Wang Department of Computer Science University of Texas Pan American
More informationDATA MINING - 1DL105, 1DL111
1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database
More informationEfficient Voting Prediction for Pairwise Multilabel Classification
Efficient Voting Prediction for Pairwise Multilabel Classification Eneldo Loza Mencía, Sang-Hyeun Park and Johannes Fürnkranz TU-Darmstadt - Knowledge Engineering Group Hochschulstr. 10 - Darmstadt - Germany
More informationIJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:
IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T
More informationWeb Data mining-a Research area in Web usage mining
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,
More informationAn Empirical Performance Comparison of Machine Learning Methods for Spam Categorization
An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization Chih-Chin Lai a Ming-Chi Tsai b a Dept. of Computer Science and Information Engineering National University
More informationNETWORK FAULT DETECTION - A CASE FOR DATA MINING
NETWORK FAULT DETECTION - A CASE FOR DATA MINING Poonam Chaudhary & Vikram Singh Department of Computer Science Ch. Devi Lal University, Sirsa ABSTRACT: Parts of the general network fault management problem,
More informationGlobal Journal of Engineering Science and Research Management
CONFIGURATION FILE RECOMMENDATIONS USING METADATA AND FUZZY TECHNIQUE Gnanamani.H*, Mr. C.V. Shanmuka Swamy * PG Student Department of Computer Science Shridevi Institute Of Engineering and Technology
More informationTDT- An Efficient Clustering Algorithm for Large Database Ms. Kritika Maheshwari, Mr. M.Rajsekaran
TDT- An Efficient Clustering Algorithm for Large Database Ms. Kritika Maheshwari, Mr. M.Rajsekaran M-Tech Scholar, Department of Computer Science and Engineering, SRM University, India Assistant Professor,
More informationChallenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track
Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde
More informationA Naïve Soft Computing based Approach for Gene Expression Data Analysis
Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2124 2128 International Conference on Modeling Optimization and Computing (ICMOC-2012) A Naïve Soft Computing based Approach for
More informationWeb Structure Mining using Link Analysis Algorithms
Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.
More informationContent Based Image Retrieval system with a combination of Rough Set and Support Vector Machine
Shahabi Lotfabadi, M., Shiratuddin, M.F. and Wong, K.W. (2013) Content Based Image Retrieval system with a combination of rough set and support vector machine. In: 9th Annual International Joint Conferences
More informationEncoding Words into String Vectors for Word Categorization
Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,
More informationResearch and Design of Key Technology of Vertical Search Engine for Educational Resources
2017 International Conference on Arts and Design, Education and Social Sciences (ADESS 2017) ISBN: 978-1-60595-511-7 Research and Design of Key Technology of Vertical Search Engine for Educational Resources
More informationA Survey on Postive and Unlabelled Learning
A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled
More informationManaging Open Bug Repositories through Bug Report Prioritization Using SVMs
Managing Open Bug Repositories through Bug Report Prioritization Using SVMs Jaweria Kanwal Quaid-i-Azam University, Islamabad kjaweria09@yahoo.com Onaiza Maqbool Quaid-i-Azam University, Islamabad onaiza@qau.edu.pk
More informationCIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets
CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,
More informationImproving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets
Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)
More informationCombining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating
Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationUsing PageRank in Feature Selection
Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important
More informationCLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval
DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationRetrieval Evaluation. Hongning Wang
Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User
More information