A Framework for Web Host Quality Detection

A. Aycan Atak and Şule Gündüz Ögüdücü

Abstract

With the rapid growth of the World Wide Web, finding useful and desired information in a short amount of time has become an important issue for Web users. Search engines and focused crawlers help people navigate the Internet. A user expresses her information need in the form of a query, and a huge number of Web pages are returned for this query. However, the majority of users view only the first results page (the top 10 Web pages as ranked by the search engine). Even if the returned Web pages do not provide the exact information they need, users do not refine their query based on the results of the initial query. Thus, not only finding relevant Web pages but also ranking them plays an important role for search engines. For this reason, determining the quality of Web pages is one of the main priorities of search engines, since low-quality Web pages cause search engine results to be extremely vague and flooded with irrelevant Web pages. In this paper, we propose a novel method for determining the quality of Web pages. The proposed method first identifies the genre of Web pages and then determines their quality based on their genre. Our experimental results show that the proposed method is effective and efficient.

I. INTRODUCTION

The number of Web pages grows rapidly every day. The advent of the Web has caused a dramatic increase in the use of the Internet as a huge, widely distributed, global information service for every kind of information. Since there is no central system to control the Web, it is impossible to estimate the precise number of Web sites and Web pages on the Internet. Monthly surveys by sites like Netcraft¹ show that in September 2010 there were 227,225,642 sites on the Web.
In this environment, search engines help people locate information relevant to their search needs, expressed in the form of a query. However, the lack of central control has also increased the number of Web pages containing highly noisy, contradictory and unreliable information. As a result, even the first results pages are heavily spammed. Since Web searchers usually examine only the top ten results, it is important for search engines to list truly relevant and high-quality Web pages at the top of the search results. Web spam can significantly decrease the quality of search engine results. Search engines rely on efficient algorithms to detect and block spam Web pages. Without such algorithms, search engine results may be unreliable and Web searchers lose their trust and confidence in the search engine.

Common approaches for spam detection are based on extracting a set of content-based and link-based features from Web pages. From the machine learning point of view, Web spam detection is considered a binary classification of Web site content as spam or non-spam. In this problem, Web pages are represented by feature vectors whose dimensions correspond to the terms that appear on them. However, this technique is vulnerable to Web sites faking high relevance with respect to some topics. This is called Search Engine Persuasion (SEP) in [9]. Link analysis is one of the solutions to this problem. Thus, using PageRank scores to eliminate spam Web pages from search results has been standard practice for years.

A. Aycan Atak is with the Department of Computer Engineering, Istanbul Technical University, Istanbul, 34469, Turkey (e-mail: ataka@itu.edu.tr).
Şule Gündüz Ögüdücü is with the Department of Computer Engineering, Istanbul Technical University, Istanbul, 34469, Turkey (e-mail: sgunduz@itu.edu.tr).
¹ Accessed on 27 Sept.
However, applying spam filtering methods may still not guarantee that search engines list relevant and high-quality Web pages first. Ranking results by relevance and quality is a different task from traditional spam detection. One difference is that the measurement of the relevance and quality of search results is subjective, since it depends heavily on people's perceptions of the relevance and quality of information. Low quality is not simply equivalent to Web spam. To promote research on and practice of new strategies for determining the overall rank, quality and importance of a Web site and for estimating Web content quality, the ECML/PKDD Discovery Challenge was organized in 2010. Unlike the traditional Web spam detection problem, the aim of this challenge is to develop site-level classification of the genre of Web sites (editorial, news, commercial, educational, deep Web, Web spam and more) as well as of their readability, authoritativeness, trustworthiness and neutrality. The motivation behind this labeling procedure is to help organizations, such as search engines and Web archives, prioritize their efforts to gather, store and organize their collections of Web pages.

Although link-based features are commonly used today for determining relevant and trusted Web hosts, term vectors obtained from Web content are still important components of this kind of quality determination problem. Using only link-based features such as PageRank scores, it is difficult to determine the factuality or genre of a Web site, both of which affect its quality score. However, considering the size of the Web, it is not feasible to extract the full content of Web hosts in order to determine a quality score for them. Besides, for organizations such as search engines or Web archives, it is important to determine the quality score of a Web host without downloading the content of its Web pages.
In this paper, we focus on determining the quality score of Web pages. For this task, we used the data set provided by the ECML/PKDD Discovery Challenge 2010. The Discovery Challenge tasks included the prediction of a predefined quality score based on genre, trust, factuality, bias and spamicity. We found that content-based features are useful when predicting the genre of Web pages. However, when determining more subjective characteristics of Web pages, such as trustiness, neutrality and bias, using the predicted genre labels of Web pages yields better results.

The rest of this paper is organized as follows. Section II reviews related work. Section III gives a detailed description and analysis of the available data set. Section IV presents the classifiers and feature selection methods used in this study. Experimental results are presented and discussed in Section V. Finally, conclusions and plans for future work are given in Section VI.

II. RELATED WORK

Web spam detection has been identified as a serious problem for search engines and Web archives. However, research on this problem has slowed down since the problem of detecting spam in social networks has become more attractive to researchers. Another important research area is determining the utility of a Web page in relation to an information need represented as a query. It has been shown that a range of factors affect human judgments of relevance. However, these studies were conducted on textual documents, whose structure differs from that of Web documents, which contain a wide range of formally and informally produced multimedia content and hyperlinks. Studies on users' perceptions of the relevance of information on the Web are few.

Recently, several studies were carried out within the ECML/PKDD 2010 Discovery Challenge (DC2010). Geng et al. used multi-scale attributes composed of attribute groups including content features, page-level and host-level link features and TF-IDF features [1]. After fusing the different sets of features, bagging is applied to C4.5 decision trees to classify Web sites according to the categories given in the DC2010 data sets.
In that study, host-level link features were found to be robust for classification tasks, and feature fusion was found to be necessary for statistical Web content quality assessment. Sokolov et al. used the RankBoost algorithm in their instance-based model and propagation schemes in their graph-based model [2]. They showed that an iterative algorithm for learning the propagation scheme is comparatively more efficient at revealing correlations between different quality levels. Nikulin reduced the dimensionality of host attributes using Wilcoxon-based feature selection [3]. He also reduced the multi-class decision problem to a corresponding number of two-class decision problems using the one-against-all method, and took the final decision using the minimal and maximal values from the result set of the two-class decision problems. For predicting host quality, Lex et al. used voting with three classifiers: a J48 decision tree, a class-feature-centroid (CFC) classifier and a support vector machine (SVM) [4]. In that study, each classifier is applied to a different type of attribute, and oversampling is used to deal with the imbalanced data set problem.

III. DATA SET

The data set used in this study, DC2010, is provided by the European Archive Foundation as the material of the ECML/PKDD Discovery Challenge 2010 on Web Quality [5]. It was created by crawling Web sites in the .eu domain in three languages: English, German and French. Thus, the data set is separated into three parts, each containing Web hosts in a different language with the same group of attributes. Four sets of attributes are provided: a link-based attribute set, a content-based attribute set, a natural language processing (NLP) attribute set and term frequency vectors of hosts. The URL and host graphs of the crawled portion of the Web are also provided with the data set.
These sets of attributes are described in the following paragraphs.

Link-based attributes are obtained from the Web graph. Among the 178 link-based attributes, the most salient appear to be the PageRank, out-degree, in-degree and TrustRank values. These attributes give information about the graph properties of hosts.

Content-based attributes are obtained using the full content of hosts. For a host, the number of words in the homepage or the compression rate of the homepage are examples of this set of attributes. The total number of attributes in this set is 98.

NLP attributes are computed per URL using the text content of Web hosts. This attribute set includes features such as the number of tokens in a URL or the counts of token types, such as adverbs or pronouns, in a URL.

The term frequency vector of each host is computed by counting the number of times words appear within the host. The term frequency vector consists of the 50,000 most frequent terms after eliminating stop words.

One of the tasks in this challenge is determining the quality of Web sites, where the quality of a Web site is measured as an aggregate function of its genre, neutrality, bias and trustiness. Therefore, predicting these class attributes of Web sites is the main problem underlying quality prediction. The first class attribute to be predicted is genre. There are 6 possible values of this attribute in the provided data set: a Web site is labeled as spam, news-editorial, commercial, educational, discussion or personal-leisure. As can be seen from Table I, the data set is highly imbalanced with respect to genre, with more Web sites labeled as commercial or educational. Very little information is available for some genre labels in the training set, which affects the learning results negatively. For example, in the French data set, only one host is labeled as discussion.
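As a rough illustration of how term frequency vectors like those in the data set could be built, the following Python sketch counts words per host, drops stop words and keeps the most frequent terms overall. The stop-word list, the whitespace tokenizer and all function names are assumptions made for this example; the paper does not describe how the DC2010 vectors were actually produced.

```python
from collections import Counter

# Hypothetical stop-word list; the list used for DC2010 is not given in the paper.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def build_vocabulary(host_texts, max_terms):
    """Return the max_terms most frequent terms over all hosts."""
    totals = Counter()
    for text in host_texts.values():
        totals.update(tokenize(text))
    return [term for term, _ in totals.most_common(max_terms)]


def term_frequency_vectors(host_texts, max_terms=50000):
    """Map each host to a list of term counts aligned with the vocabulary."""
    vocabulary = build_vocabulary(host_texts, max_terms)
    vectors = {}
    for host, text in host_texts.items():
        counts = Counter(tokenize(text))
        # Counter returns 0 for terms that do not occur in this host.
        vectors[host] = [counts[term] for term in vocabulary]
    return vocabulary, vectors
```

Calling `term_frequency_vectors(host_texts, max_terms=50000)` on a dictionary that maps host names to their text would mirror the 50,000-term cut-off mentioned above; a smaller `max_terms` is convenient for experimenting on toy data.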

TABLE I
GENRE DISTRIBUTION FOR EACH DATASET

Dataset   Spam   News-Editorial   Commercial   Educational   Discussion   Personal-Leisure
English   2.7%   3.0%             36.2%        34.0%         4.3%         19.9%
German    3.7%   11.2%            44.8%        12.0%         10.4%        18.0%
French    4.8%   3.7%             33.3%        13.2%         0.5%         44.4%

TABLE II
TRUSTINESS (A), NEUTRALITY (B) AND BIAS (C) DISTRIBUTION FOR EACH PART OF DATASET

(A) Trustiness
Dataset   1      2       3
English   0.3%   1.1%    98.5%
German    5.7%   70.0%   24.1%
French    0.0%   2.0%    98.0%

(B) Neutrality
Dataset   1      2       3
English   0.3%   2.3%    97.4%
German    3.4%   61.0%   35.6%
French    1.0%   5.2%    93.8%

(C) Bias
Dataset   1      2
English   1.2%   98.8%
German    4.6%   95.4%
French    0.0%   100.0%

Fig. 1. Proposed methodology

The second class attribute, trustiness, can take three values: 1, 2 and 3. A value of 1 means that the host is not reliable; values of 2 and 3 mean that the host is reliable. The third class attribute, neutrality, indicates the factuality of a host and also has three values: 1, 2 and 3. A value of 1 means that the host is problematic. The fourth class attribute, bias, can take two values: 1 and 2. A value of 1 indicates that there are significant bias problems in the host. As can be seen in Table II, the training data set is imbalanced. For example, the training set of the French data does not contain any example with a trustiness or bias value of 1.

IV. METHODOLOGY

In this study, we only utilize the term vector attribute set. The remaining three sets of attributes, namely the link-based, content-based and NLP attributes, were found to be irrelevant for predicting the class attributes based on the results of feature selection methods; they do not contain sufficient information to determine the class labels. Note that some instances in the data set do not have a term frequency vector. These instances are removed from both the training and validation sets. We observed that the genre values of hosts are more informative than the term frequency vectors in determining the hosts' trustiness, neutrality and bias values.
For this reason, in this study the genre of Web hosts is first determined based on the term frequency vectors, and the predicted genre is used in turn to identify the trustiness, neutrality and bias values of hosts. The obtained values are used to compute the quality score of the hosts. An overview of our methodology is given in Figure 1. For the machine learning algorithms, including attribute selection, classification and oversampling methods, the implementations in the WEKA data mining tool are used. However, we observed that oversampling methods do not handle the imbalanced data set problem successfully and reduce the performance of the classification algorithms. For this reason, oversampling methods and the classification results obtained with them are not reported in this study.

A. Genre Prediction

Genre prediction poses several problems: a multi-class decision problem, a text categorization problem and a high-dimensional vector space. Yang and Pedersen compared several attribute selection methods for text categorization, including information gain, chi-square statistics, document frequency, mutual information and term strength [6]. According to that study, information gain and chi-square statistics are among the best-performing attribute selection methods for text categorization. In this study, chi-square statistics also perform better for all data sets: they decrease classification execution times while increasing the accuracy of the classifier. For each data set, the number of selected terms is determined with the hill-climbing method. The attribute selection method and the number of selected terms that minimize the training error are selected for the final classification.

For classification, different classifiers are applied to the same data set. Support vector machines (SVM), which are considered the best-performing classifier for many text categorization problems [7], also perform better in this study. As in the attribute selection phase, the classifier and its parameters are selected based on the results of the hill-climbing method.

From the original term vectors, new term vectors are generated with binary weighting. According to binary weighting, for host h and term t, if h contains t then the weight of t in the term vector of h is set to 1; otherwise it is set to 0. Each experiment to find the best classifier and attribute selection method is applied to both data sets, that is, to the original term frequency vectors and to the binary-weighted vectors, to illustrate the effect of different weighting schemes on the classification results. Term vectors with binary weighting give better results in terms of classification accuracy. Table III presents the feature selection method and classifier applied to each data set and the resulting number of terms after feature selection.

TABLE III
FINAL DECISIONS FOR ATTRIBUTE SELECTION METHOD AND CLASSIFIER

Dataset   Feature Selection Method   # of Selected Terms   Weighting   Classifier
English   Chi-square Stat.                                 Binary      SVM
German    Chi-square Stat.           100                   Binary      SVM
French    Chi-square Stat.                                 Binary      SVM

B. Trustiness, Neutrality and Bias Prediction

Trustiness, neutrality and bias are predicted based on the results of genre prediction. For a host h, the genre prediction process produces 6 probability values, each indicating the probability of belonging to a genre. A host h_i can then be represented with 6 attributes, g_{i1}, g_{i2}, ..., g_{i6}, where the value of each attribute is obtained from genre prediction and $\sum_{j=1}^{6} g_{ij} = 1$. Linear regression is applied to this data set to predict the trustiness, neutrality and bias values of each host. Thus, linear regression produces a score for each host's trustiness, neutrality and bias values. The hosts in the test set are then sorted by each of these values individually.

C. Quality Prediction

To determine the quality levels of hosts, the DC2010 tasks also include the prediction of a predefined utility score based on genre, trustiness, neutrality and bias values. For each host h, the utility value is calculated as in Algorithm 1 by combining the genre, trustiness, neutrality and bias values of h.

Algorithm 1 Quality Determination
    value = 0;
    if News-Edit OR Educational then value = 5;
    else if Discussion then value = 4;
    else if Commercial OR Personal-Leisure then value = 3;
    if neutrality = 3 then value += 2;
    if bias = 1 then value -= 2;
    if trustiness = 3 then value += 2;

Based on Algorithm 1, the utility value of a host ranges between -2 and 9. The News-Editorial and Educational categories have the highest quality. Trusted, unbiased and neutral content also has a high quality score. By default, Web spam hosts have the lowest quality. The utility value may be used to predict the quality of a host. Similarly, linear regression can be applied to obtain quality scores.

V. EXPERIMENTAL RESULTS

The results of the experiments are reported in terms of the evaluation metrics used at DC2010. In line with the tasks defined by this challenge, quality prediction is evaluated for the English, German and French data sets, while genre, trustiness, neutrality and bias prediction are evaluated only for the English data set.

A. Evaluation

For evaluation, normalized discounted cumulative gain (ndcg) is used at DC2010. ndcg is obtained by normalizing the discounted cumulative gain (DCG) with the ideal DCG value. DCG and ndcg can be computed with Eq. 1 and Eq. 2, respectively.

$$\mathrm{DCG} = \sum_{rank=1}^{N} \mathrm{utility}(rank) \cdot \left(1 - \frac{rank}{N}\right) \tag{1}$$

$$\mathrm{ndcg} = \frac{\mathrm{DCG}}{\mathrm{IdealDCG}} \tag{2}$$

In Eq. 1, N is the number of instances in the test set, and the ideal DCG is the DCG value obtained by the ideal ordering of the hosts. According to the ndcg formula, it is important to rank hosts with higher utility values at the top. Note that for a perfect ranking the DCG equals the ideal DCG, producing an ndcg of 1.0. Thus, all ndcg values are relative values on the interval 0.0 to 1.0.

For the genre prediction problem, when evaluating results for genre g, the utility value of host h is 1 if h belongs to genre g; otherwise it is 0.
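Algorithm 1 and the evaluation measure of Eq. 1 and Eq. 2 can be sketched together in Python as follows. This is a minimal illustration rather than the official DC2010 scorer: the function names and the lowercase genre labels are hypothetical, the subtraction for bias = 1 is inferred from the stated utility range of -2 to 9, and the linear discount 1 - rank/N is one reading of Eq. 1 that should be checked against the challenge definition.

```python
# Base utility per genre as in Algorithm 1; any other genre (spam) keeps 0.
BASE_UTILITY = {
    "news-editorial": 5,
    "educational": 5,
    "discussion": 4,
    "commercial": 3,
    "personal-leisure": 3,
}


def utility(genre, trustiness, neutrality, bias):
    """Combine the four class attributes into a utility value (Algorithm 1)."""
    value = BASE_UTILITY.get(genre, 0)
    if neutrality == 3:
        value += 2
    if bias == 1:
        value -= 2  # assumed sign: gives the stated minimum of -2
    if trustiness == 3:
        value += 2
    return value


def dcg(utilities):
    """Discounted cumulative gain with an assumed linear discount 1 - rank/N."""
    n = len(utilities)
    return sum(u * (1 - rank / n) for rank, u in enumerate(utilities, start=1))


def ndcg(predicted_scores, true_utilities):
    """Rank hosts by predicted score and normalize the DCG by the ideal DCG."""
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_utilities[i] for i in order]
    ideal_dcg = dcg(sorted(true_utilities, reverse=True))
    if ideal_dcg == 0:
        return 0.0  # no host carries utility, so the ranking cannot be scored
    return dcg(ranked) / ideal_dcg
```

For the per-genre evaluation described above, `true_utilities` would simply hold 1 for hosts of the genre under evaluation and 0 for all others, while `predicted_scores` would hold the classifier's score for that genre.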

B. Genre Prediction

Genre prediction results are given in Table IV.

TABLE IV
GENRE PREDICTION RESULTS FOR ENGLISH DATASET

Genre              ndcg
Spam               0.88
News-Editorial     0.73
Commercial         0.84
Educational        0.87
Discussion         0.76
Personal-Leisure   0.81
Overall            0.82

For each genre, the ndcg value is computed separately. The overall result of genre prediction is computed as the arithmetic average of the ndcg values obtained for the individual genres. This experiment is applied to the English data set only. As can be seen from Table IV, the ndcg results are high compared to the results from previous publications. These results indicate that our proposed method yields higher ndcg values for spam genre prediction. The ndcg values for the spam and educational genres are the highest. This may be because spam and educational hosts contain more discriminative words: according to Table IV, these genres are more word-specific than the other genres, whereas discussion and news-editorial hosts have more words in common.

C. Trustiness, Neutrality and Bias Prediction

Results of the trustiness, neutrality and bias predictions are given in Table V. These class attributes are predicted with term vectors and with genre prediction results separately. As in genre prediction, this experiment is applied to the English data set only.

TABLE V
TRUSTINESS, NEUTRALITY AND BIAS PREDICTION RESULTS FOR ENGLISH DATASET

Class Attribute   ndcg Using Term Vectors   ndcg Using Genre Predictions
Trustiness
Neutrality
Bias

As can be seen from Table V, except for neutrality, using genre prediction results instead of term vectors leads to a significant improvement in the ndcg values. It can also be concluded that, for Web hosts, neutrality depends less on genre than trustiness and bias do. Even for neutrality, however, using genre prediction results still gives better results than using term vectors.
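One way to make the notion of discriminative words from Section V-B concrete is the chi-square statistic used for attribute selection in Section IV-A, computed on binary-weighted term vectors. The Python sketch below uses the two-way contingency form given by Yang and Pedersen [6]. The function names and toy inputs are hypothetical, and the experiments in this paper relied on the WEKA implementation rather than on this code.

```python
def binary_weights(term_counts):
    """Binary weighting: 1 if the host contains the term, otherwise 0."""
    return [1 if count > 0 else 0 for count in term_counts]


def chi_square(presence, labels, genre):
    """Chi-square score of one term for one genre.

    presence: 0/1 flag per host telling whether the term occurs in the host.
    labels:   genre label per host, in the same order as presence.
    """
    a = b = c = d = 0
    for present, label in zip(presence, labels):
        in_genre = label == genre
        if present and in_genre:
            a += 1  # term occurs, host belongs to the genre
        elif present:
            b += 1  # term occurs, host belongs to another genre
        elif in_genre:
            c += 1  # term absent, host belongs to the genre
        else:
            d += 1  # term absent, host belongs to another genre
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # the term or the genre is constant, so no evidence
    return n * (a * d - c * b) ** 2 / denominator


def top_terms(term_presence, labels, genre, k):
    """Return the k terms with the highest chi-square score for the genre."""
    scores = {term: chi_square(flags, labels, genre)
              for term, flags in term_presence.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A term that appears in every spam host and in no other host receives the maximum score for its contingency table, while a term spread evenly across genres scores 0, which matches the intuition that word-specific genres are easier to rank.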
Compared with other studies, these results are quite satisfactory. For trustiness, the best ndcg value so far is obtained. Bias prediction results are also on the same level as the top results.

D. Quality Prediction

Quality prediction results are given in Table VI.

TABLE VI
QUALITY PREDICTION RESULTS

Dataset   ndcg
English   0.85
German    0.81
French    0.82

For each data set, quality is computed separately using the trustiness, neutrality and bias prediction results. These results are also comparable to the results obtained in other studies on the same data set. Using prediction results from other classification processes reduces the size of the attribute set, which in turn reduces the execution time of the classifier. Therefore, these results are more satisfactory when execution time and accuracy are considered together.

VI. CONCLUSION AND FUTURE WORK

In this study, a method to measure the quality of Web hosts is proposed. It is first hypothesized and then observed that the terms of hosts contain more information about Web host genre than link-based metrics. A framework is presented for predicting class attributes of Web hosts. To evaluate the proposed framework, the data set of a conference challenge is used. The experimental results show that the proposed method is superior to previously proposed methods in terms of ndcg and execution time. We are extending the classification method in several ways. More accurate predictions may be possible using hybrid data sets that include both category prediction results and term vectors. Following our studies on quality prediction as a two-class decision problem [8] and as the multi-class decision problem considered here, detecting quality using propagation on graph structures is the main subject of our future work. Since the DC2010 data set includes the Web host graph structure, future studies can use the same data set as this study.

VII. ACKNOWLEDGEMENT

Author Gunduz-Oguducu was partially supported by the Scientific and Technological Research Council of Turkey (TUBITAK) EEEAG project 110E027.

REFERENCES

[1] G. Geng, X. Jin, X. Zhang and D. Zhang, Evaluating Web content quality via multi-scale features, ECML/PKDD 2010 Discovery Challenge Workshop, 2010.
[2] A. Sokolov, T. Urvoy, L. Denoyer and O. Richard, Madspam consortium at the ECML/PKDD Discovery Challenge 2010, ECML/PKDD 2010 Discovery Challenge Workshop, 2010.
[3] V. Nikulin, Web-mining with Wilcoxon-based feature selection, ensembling and multiple binary classifiers, ECML/PKDD 2010 Discovery Challenge Workshop, 2010.
[4] E. Lex, I. Khan, H. Bischof and M. Granitzer, Assessing the quality of Web content, ECML/PKDD 2010 Discovery Challenge Workshop, 2010.

[5] A. A. Benczur, C. Castillo, M. Erdelyi, Z. Gyongi, J. Masanes and M. Matthews, ECML/PKDD 2010 Discovery Challenge Data Set, Crawled by the European Archive Foundation.
[6] Y. Yang and J. O. Pedersen, A comparative study on feature selection in text categorization, In Proceedings of the Fourteenth International Conference on Machine Learning (ICML 97), Douglas H. Fisher (Ed.), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[7] T. Joachims, Text categorization with support vector machines: Learning with many relevant features, The 10th European Conference on Machine Learning.
[8] A. A. Atak and S. G. Oguducu, A framework for social spam detection based on relational bayesian classifier, In Proceedings of the 6th International Conference on Data Mining (DMIN10), pp. 71-77.
[9] M. Marchiori, The quest for correct information on the Web: hyper search engines, In Selected papers from the sixth international conference on World Wide Web, Phillip H. Enslow, Jr., Mike Genesereth, and Anna Patterson (Eds.), Elsevier Science Publishers Ltd., Essex, UK, 1997.


More information

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,

More information

Automated Tagging for Online Q&A Forums

Automated Tagging for Online Q&A Forums 1 Automated Tagging for Online Q&A Forums Rajat Sharma, Nitin Kalra, Gautam Nagpal University of California, San Diego, La Jolla, CA 92093, USA {ras043, nikalra, gnagpal}@ucsd.edu Abstract Hashtags created

More information

Chapter 8. Evaluating Search Engine

Chapter 8. Evaluating Search Engine Chapter 8 Evaluating Search Engine Evaluation Evaluation is key to building effective and efficient search engines Measurement usually carried out in controlled laboratory experiments Online testing can

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Semantic Website Clustering

Semantic Website Clustering Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,

More information

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows:

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows: CS299 Detailed Plan Shawn Tice February 5, 2013 Overview The high-level steps for classifying web pages in Yioop are as follows: 1. Create a new classifier for a unique label. 2. Train it on a labelled

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

User Intent Discovery using Analysis of Browsing History

User Intent Discovery using Analysis of Browsing History User Intent Discovery using Analysis of Browsing History Wael K. Abdallah Information Systems Dept Computers & Information Faculty Mansoura University Mansoura, Egypt Dr. / Aziza S. Asem Information Systems

More information

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Richa Jain 1, Namrata Sharma 2 1M.Tech Scholar, Department of CSE, Sushila Devi Bansal College of Engineering, Indore (M.P.),

More information

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Search Evaluation Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Table of Content Search Engine Evaluation Metrics for relevancy Precision/recall F-measure MAP NDCG Difficulties in Evaluating

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Prof. T.P.Aher(ME), Ms.Rupal R.Boob, Ms.Saburi V.Dhole, Ms.Dipika B.Avhad, Ms.Suvarna S.Burkul 1 Assistant Professor, Computer

More information

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR P.SHENBAGAVALLI M.E., Research Scholar, Assistant professor/cse MPNMJ Engineering college Sspshenba2@gmail.com J.SARAVANAKUMAR B.Tech(IT)., PG

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS Nilam B. Lonkar 1, Dinesh B. Hanchate 2 Student of Computer Engineering, Pune University VPKBIET, Baramati, India Computer Engineering, Pune University VPKBIET,

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

Fig 1. Overview of IE-based text mining framework

Fig 1. Overview of IE-based text mining framework DiscoTEX: A framework of Combining IE and KDD for Text Mining Ritesh Kumar Research Scholar, Singhania University, Pacheri Beri, Rajsthan riteshchandel@gmail.com Abstract: Text mining based on the integration

More information

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

Focused crawling: a new approach to topic-specific Web resource discovery. Authors Focused crawling: a new approach to topic-specific Web resource discovery Authors Soumen Chakrabarti Martin van den Berg Byron Dom Presented By: Mohamed Ali Soliman m2ali@cs.uwaterloo.ca Outline Why Focused

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Automatic Classification of Audio Data

Automatic Classification of Audio Data Automatic Classification of Audio Data Carlos H. C. Lopes, Jaime D. Valle Jr. & Alessandro L. Koerich IEEE International Conference on Systems, Man and Cybernetics The Hague, The Netherlands October 2004

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Information-Theoretic Feature Selection Algorithms for Text Classification

Information-Theoretic Feature Selection Algorithms for Text Classification Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute

More information

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS KULWADEE SOMBOONVIWAT Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033,

More information

Keywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database.

Keywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database. Volume 6, Issue 5, May 016 ISSN: 77 18X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Fuzzy Logic in Online

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Automatic Query Type Identification Based on Click Through Information

Automatic Query Type Identification Based on Click Through Information Automatic Query Type Identification Based on Click Through Information Yiqun Liu 1,MinZhang 1,LiyunRu 2, and Shaoping Ma 1 1 State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing, China

More information

Feature-weighted k-nearest Neighbor Classifier

Feature-weighted k-nearest Neighbor Classifier Proceedings of the 27 IEEE Symposium on Foundations of Computational Intelligence (FOCI 27) Feature-weighted k-nearest Neighbor Classifier Diego P. Vivencio vivencio@comp.uf scar.br Estevam R. Hruschka

More information

GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM

GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM http:// GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM Akshay Kumar 1, Vibhor Harit 2, Balwant Singh 3, Manzoor Husain Dar 4 1 M.Tech (CSE), Kurukshetra University, Kurukshetra,

More information

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann Search Engines Chapter 8 Evaluating Search Engines 9.7.2009 Felix Naumann Evaluation 2 Evaluation is key to building effective and efficient search engines. Drives advancement of search engines When intuition

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

A Novel Approach to Image Segmentation for Traffic Sign Recognition Jon Jay Hack and Sidd Jagadish

A Novel Approach to Image Segmentation for Traffic Sign Recognition Jon Jay Hack and Sidd Jagadish A Novel Approach to Image Segmentation for Traffic Sign Recognition Jon Jay Hack and Sidd Jagadish Introduction/Motivation: As autonomous vehicles, such as Google s self-driving car, have recently become

More information

A Dynamic Bayesian Network Click Model for Web Search Ranking

A Dynamic Bayesian Network Click Model for Web Search Ranking A Dynamic Bayesian Network Click Model for Web Search Ranking Olivier Chapelle and Anne Ya Zhang Apr 22, 2009 18th International World Wide Web Conference Introduction Motivation Clicks provide valuable

More information

Dynamic Visualization of Hubs and Authorities during Web Search

Dynamic Visualization of Hubs and Authorities during Web Search Dynamic Visualization of Hubs and Authorities during Web Search Richard H. Fowler 1, David Navarro, Wendy A. Lawrence-Fowler, Xusheng Wang Department of Computer Science University of Texas Pan American

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

Efficient Voting Prediction for Pairwise Multilabel Classification

Efficient Voting Prediction for Pairwise Multilabel Classification Efficient Voting Prediction for Pairwise Multilabel Classification Eneldo Loza Mencía, Sang-Hyeun Park and Johannes Fürnkranz TU-Darmstadt - Knowledge Engineering Group Hochschulstr. 10 - Darmstadt - Germany

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information

An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

An Empirical Performance Comparison of Machine Learning Methods for Spam  Categorization An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization Chih-Chin Lai a Ming-Chi Tsai b a Dept. of Computer Science and Information Engineering National University

More information

NETWORK FAULT DETECTION - A CASE FOR DATA MINING

NETWORK FAULT DETECTION - A CASE FOR DATA MINING NETWORK FAULT DETECTION - A CASE FOR DATA MINING Poonam Chaudhary & Vikram Singh Department of Computer Science Ch. Devi Lal University, Sirsa ABSTRACT: Parts of the general network fault management problem,

More information

Global Journal of Engineering Science and Research Management

Global Journal of Engineering Science and Research Management CONFIGURATION FILE RECOMMENDATIONS USING METADATA AND FUZZY TECHNIQUE Gnanamani.H*, Mr. C.V. Shanmuka Swamy * PG Student Department of Computer Science Shridevi Institute Of Engineering and Technology

More information

TDT- An Efficient Clustering Algorithm for Large Database Ms. Kritika Maheshwari, Mr. M.Rajsekaran

TDT- An Efficient Clustering Algorithm for Large Database Ms. Kritika Maheshwari, Mr. M.Rajsekaran TDT- An Efficient Clustering Algorithm for Large Database Ms. Kritika Maheshwari, Mr. M.Rajsekaran M-Tech Scholar, Department of Computer Science and Engineering, SRM University, India Assistant Professor,

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

A Naïve Soft Computing based Approach for Gene Expression Data Analysis

A Naïve Soft Computing based Approach for Gene Expression Data Analysis Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2124 2128 International Conference on Modeling Optimization and Computing (ICMOC-2012) A Naïve Soft Computing based Approach for

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Content Based Image Retrieval system with a combination of Rough Set and Support Vector Machine

Content Based Image Retrieval system with a combination of Rough Set and Support Vector Machine Shahabi Lotfabadi, M., Shiratuddin, M.F. and Wong, K.W. (2013) Content Based Image Retrieval system with a combination of rough set and support vector machine. In: 9th Annual International Joint Conferences

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Research and Design of Key Technology of Vertical Search Engine for Educational Resources

Research and Design of Key Technology of Vertical Search Engine for Educational Resources 2017 International Conference on Arts and Design, Education and Social Sciences (ADESS 2017) ISBN: 978-1-60595-511-7 Research and Design of Key Technology of Vertical Search Engine for Educational Resources

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs Managing Open Bug Repositories through Bug Report Prioritization Using SVMs Jaweria Kanwal Quaid-i-Azam University, Islamabad kjaweria09@yahoo.com Onaiza Maqbool Quaid-i-Azam University, Islamabad onaiza@qau.edu.pk

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)

More information

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Retrieval Evaluation. Hongning Wang

Retrieval Evaluation. Hongning Wang Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User

More information