A Framework for Web Host Quality Detection


A. Aycan Atak and Şule Gündüz Ögüdücü

Abstract

With the rapid growth of the World Wide Web, finding useful and desired information in a short amount of time has become an important issue for Web users. Search engines and focused crawlers help people navigate the Internet. A user expresses her information need in the form of a query, and a huge number of Web pages are returned for this query. However, the majority of users view only a single results page (the top 10 Web pages as ranked by the search engine). Even if the returned Web pages do not provide the exact information they need, users tend not to refine their query based on the results of their initial query. Thus, not only finding relevant Web pages but also ranking them plays an important role for search engines. For this reason, determining the quality of Web pages is one of the main priorities of search engines, since low quality Web pages cause search engine results to be vague and flooded with irrelevant Web pages. In this paper, we propose a novel method for determining the quality of Web pages. The proposed method first identifies the genre of Web pages and then determines their quality based on their genre. Our experimental results show that the proposed method is effective and efficient.

I. INTRODUCTION

The number of Web pages grows rapidly every day. The advent of the Web has caused a dramatic increase in the use of the Internet as a huge, widely distributed, global information service for every kind of information. Since there is no central system to control the Web, it is impossible to estimate the precise number of Web sites and Web pages on the Internet. Monthly surveys by sites like Netcraft (see footnote 1) have shown that in September 2010 there were 227,225,642 sites on the Web.
In this environment, search engines help people locate information relevant to their search needs, expressed in the form of a query. However, the lack of central control has also increased the number of Web pages containing noisy, contradictory and unreliable information. As a result, even the first pages of search results are heavily spammed. Since Web searchers usually examine only the top ten results, it is important for search engines to list truly relevant and high quality Web pages at the top of the search results. Web spam can significantly decrease the quality of search engine results. Search engines rely on efficient algorithms to detect and block spam Web pages. Without such algorithms, search engine results may be unreliable, and Web searchers lose their trust and confidence in the search engine.

Common approaches to spam detection are based on extracting a set of content-based and link-based features from Web pages. From the machine learning point of view, Web spam detection is a binary classification problem in which Web site content is labeled as spam or non-spam. In this problem, Web pages are represented by feature vectors whose dimensions correspond to the terms appearing on them. However, this technique is vulnerable to Web sites that fake high relevance with respect to some topics; this is called Search Engine Persuasion (SEP) in [9]. Link analysis is one of the solutions to this problem, and using PageRank scores to eliminate spam Web pages from search results has been standard practice for years.

A. Aycan Atak is with the Department of Computer Engineering, Istanbul Technical University, Istanbul, 34469, Turkey (e-mail: ataka@itu.edu.tr). Şule Gündüz Ögüdücü is with the Department of Computer Engineering, Istanbul Technical University, Istanbul, 34469, Turkey (e-mail: sgunduz@itu.edu.tr).

1 Accessed on 27 Sept
However, applying spam filtering methods still does not guarantee that search engines list relevant and high quality Web pages first. Ranking results by relevance and quality is a different task from traditional spam detection. One difference is that the measurement of the relevance and quality of search results is subjective, since it depends heavily on people's perceptions of the relevance and quality of information. Low quality is not simply equivalent to Web spam. To promote research on and practice of new strategies for determining the overall rank, quality and importance of a Web site and for estimating Web content quality, the ECML/PKDD Discovery Challenge was organized in 2010. Unlike the traditional Web spam detection problem, the aim of this challenge is to develop site-level classification of the genre of Web sites (editorial, news, commercial, educational, deep Web, Web spam and more) as well as of their readability, authoritativeness, trustworthiness and neutrality. The motivation behind this labeling procedure is to help organizations such as search engines and Web archives prioritize their procedures for gathering, storing and organizing their collections of Web pages.

Although link-based features are commonly used today for determining relevant and trusted Web hosts, term vectors obtained from Web content are still important components of this kind of quality determination problem. Using only link-based features such as PageRank scores, it is difficult to determine the factuality or genre of a Web site, both of which affect its quality score. However, considering the size of the Web, it is not feasible to extract the content of Web hosts in order to determine a quality score for them. Besides, for organizations such as search engines or Web archives, it is important to determine the quality score of a Web host without downloading the content of its Web pages.
In this paper, we focus on determining the quality score of Web pages. For this task, we used the data set provided by the Discovery Challenge. The Discovery Challenge tasks included the prediction of a quality score predefined in terms of genre, trust, factuality, bias and spamicity. We found that content-based features are useful for predicting the genre of Web pages. However, for more subjective characteristics of Web pages, such as trustiness, neutrality and bias, using the predicted genre labels yields better results.

The rest of this paper is organized as follows. Section II reviews related work. Section III describes and analyzes the data set. Section IV presents the classifiers and feature selection methods used in this study. Experimental results are reported and discussed in Section V. Finally, conclusions and plans for future work are given in Section VI.

II. RELATED WORK

Web spam detection has been identified as a serious problem for search engines and Web archives. However, studies on this problem have slowed down as the problem of detecting spam in social networks has become more attractive to researchers. Another important research area is determining the utility of a Web page in relation to an information need represented as a query. It has been shown that a range of factors affect human judgments of relevance. However, these studies were conducted on textual documents, whose structure differs from that of Web documents, which contain a wide range of formally and informally produced multimedia content and hyperlinks. Studies on users' perceptions of the relevance of information on the Web are few.

Recently, several studies were performed within the ECML/PKDD 2010 Discovery Challenge (DC2010). Geng et al. used multi-scale attributes composed of attribute groups including content features, page-level and host-level link features, and TF-IDF features [1]. After fusing the different feature sets, bagging was applied to C4.5 decision trees to classify Web sites according to the categories given in the DC2010 data sets.
That study found that host-level link features are robust for classification tasks and that feature fusion is necessary for statistical Web content quality assessment. Sokolov et al. used the RankBoost algorithm in their instance-based model and propagation schemes in their graph-based model separately [2]. They showed that an iterative algorithm for learning the propagation scheme is comparatively more efficient at revealing correlations between different quality levels. Nikulin reduced the dimensionality of host attributes using Wilcoxon-based feature selection [3]. He also reduced the multi-class decision problem to a corresponding number of two-class decision problems using the one-against-all method, and took the final decision using the minimal and maximal values from the results of the two-class decision problems. For predicting host quality, Lex et al. used voting with three classifiers: a J48 decision tree, a class-feature-centroid (CFC) classifier and a support vector machine (SVM) [4]. In that study, each classifier was applied to a different type of attribute, and oversampling was used to deal with the imbalanced data set problem.

III. DATA SET

The data set used in this study, DC2010, is provided by the European Archive Foundation as the material of the ECML/PKDD Discovery Challenge 2010 on Web Quality [5]. It was created by crawling Web sites in the .eu domain in three languages: English, German and French. Thus, the data set is separated into three parts, each containing Web hosts in a different language with the same group of attributes. Four sets of attributes are provided: a link-based attribute set, a content-based attribute set, a natural language processing (NLP) attribute set and term frequency vectors of hosts. The URLs and the host graph of the crawled portion of the Web are also provided with the data set.
These sets of attributes are described in the following paragraphs.

The link-based attributes are obtained from the Web graph. Among the 178 link-based attributes, the most salient appear to be PageRank, out-degree, in-degree and TrustRank values. These attributes give information about the graph properties of hosts.

Content-based attributes are obtained using the full content of hosts. Examples for a host include the number of words in the homepage and the compression rate of the homepage. This set contains 98 attributes.

NLP attributes are computed per URL using the text content of Web hosts. This attribute set includes features such as the number of tokens in a URL or the counts of token types such as adverbs or pronouns in a URL.

The term frequency vector of each host is computed by counting the number of times each word appears within the host. The term frequency vector consists of the 50,000 most frequent terms after eliminating stop words.

One of the tasks in this challenge is determining the quality of Web sites, where the quality of a Web site is measured as an aggregate function of its genre, neutrality, bias and trustiness. Therefore, predicting these class attributes of Web sites is the main problem, since they determine the quality prediction. The first class attribute to be predicted is genre. There are 6 possible values of this attribute in the provided data set: a Web site is labeled with one of the categories spam, news-editorial, commercial, educational, discussion or personal-leisure. As can be seen from Table I, the data set is highly imbalanced with respect to genre, with more Web sites labeled as commercial or educational. Very little information is available for some genre labels in the training set, which will affect the learning results negatively. For example, in the French data set, there is only one host labeled as discussion.

TABLE I
GENRE DISTRIBUTION FOR EACH DATASET

Dataset   Spam   News-Editorial   Commercial   Educational   Discussion   Personal-Leisure
English   2.7%   3.0%             36.2%        34.0%         4.3%         19.9%
German    3.7%   11.2%            44.8%        12.0%         10.4%        18.0%
French    4.8%   3.7%             33.3%        13.2%         0.5%         44.4%

TABLE II
TRUSTINESS (A), NEUTRALITY (B) AND BIAS (C) DISTRIBUTION FOR EACH PART OF DATASET

(A) Trustiness
Dataset   1      2       3
English   0.3%   1.1%    98.5%
German    5.7%   70.0%   24.1%
French    0.0%   2.0%    98.0%

(B) Neutrality
Dataset   1      2       3
English   0.3%   2.3%    97.4%
German    3.4%   61.0%   35.6%
French    1.0%   5.2%    93.8%

(C) Bias
Dataset   1      2
English   1.2%   98.8%
German    4.6%   95.4%
French    0.0%   100.0%

Fig. 1. Proposed methodology

The second class attribute, trustiness, can take three values: 1, 2 and 3. A value of 1 means that the host is not reliable; values of 2 and 3 mean that the host is reliable. The third class attribute, neutrality, indicates the factuality of a host and also has three values: 1, 2 and 3. A value of 1 means that the host is problematic. The fourth class attribute, bias, can take two values: 1 and 2. A value of 1 indicates that there are significant bias problems in the host. As can be seen in Table II, the training data set is imbalanced. For example, the training set of the French data does not contain any example with a trustiness or bias value of 1.

IV. METHODOLOGY

In this study, we utilize only the term vector attribute set. The remaining three sets of attributes, namely link-based, content-based and NLP attributes, were found to be irrelevant for predicting the class attributes, based on the results of feature selection methods; these three sets do not contain sufficient information to determine the class labels. Note that some instances in the data set do not have a term frequency vector. These instances were removed from both the training and validation sets. We observed that the genre values of hosts are more informative than the term frequency vectors in determining the hosts' trustiness, neutrality and bias values.
For this reason, in this study the genre of Web hosts is first determined based on the term frequency vectors, and the predicted genre is then used to identify the trustiness, neutrality and bias values of the hosts. The obtained values are used to compute the quality score of the hosts. An overview of our methodology is given in Figure 1. For the machine learning algorithms, including attribute selection, classification and oversampling methods, the implementations in the WEKA data mining tool are used. However, we observed that oversampling methods do not handle the imbalanced data set problem successfully and reduce the performance of the classification algorithms. For this reason, oversampling methods and the classification results obtained with them are not reported in this study.

A. Genre Prediction

Genre prediction poses several problems: it is a multi-class decision problem and a text categorization problem in a high-dimensional vector space. Yang and Pedersen compared several attribute selection methods for text categorization, including information gain, chi-square statistics, document frequency, mutual information and term strength [6]. According to that study, information gain and chi-square statistics are among the best performing attribute selection methods for text categorization. In this study, chi-square statistics also performed better than the alternatives for all data sets: it decreases classification execution time while increasing the accuracy of the classifier. For each data set, the number of selected terms is determined with the hill-climbing method. The attribute selection method and the number of selected terms that minimize the training error are selected for the final classification.

For classification, different classifiers are applied to the same data set. The support vector machine (SVM), which is considered the best performing classifier for many text categorization problems [7], also performed better than the other classifiers in this study. As in the attribute selection phase, the classifier and its parameters are selected based on the results of the hill-climbing method.

From the original term vectors, new term vectors are generated with binary weighting. Under binary weighting, for host h and term t, the weight of t in the term vector of h is set to 1 if h contains t, and to 0 otherwise. Each experiment to find the best classifier and attribute selection method is applied to both versions of the data (term frequency and binary weighting) to illustrate the effect of the weighting scheme on classification results. Term vectors with binary weighting gave better results in terms of classification accuracy. Table III presents the classifier and feature selection method applied to each data set and the resulting number of terms after feature selection.

TABLE III
FINAL DECISIONS FOR ATTRIBUTE SELECTION METHOD AND CLASSIFIER

Dataset   Feature Selection Method   # of Selected Terms   Weighting   Classifier
English   Chi-square Stat.                                 Binary      SVM
German    Chi-square Stat.           100                   Binary      SVM
French    Chi-square Stat.                                 Binary      SVM

B. Trustiness, Neutrality and Bias Prediction

Trustiness, neutrality and bias are predicted based on the results of genre prediction. For a host h, the genre prediction process produces 6 probability values, each indicating the probability that h belongs to one genre. A host $h_i$ can then be represented with 6 attributes $g_{i1}, g_{i2}, \ldots, g_{i6}$, where the value of each attribute is obtained from genre prediction and $\sum_{j=1}^{6} g_{ij} = 1$. Linear regression is applied to this data set to predict the trustiness, neutrality and bias values of each host. Thus, linear regression produces a score for each host's trustiness, neutrality and bias. The hosts in the test set are then sorted by each of these values individually.

C. Quality Prediction

The DC2010 tasks also include the prediction of a utility score, predefined in terms of genre, trustiness, neutrality and bias values, for determining the quality level of hosts. For each host h, the utility value is calculated as in Algorithm 1 by combining the genre, trustiness, neutrality and bias values of h.

Algorithm 1 Quality Determination
value = 0;
if News-Edit OR Educational then value = 5;
else if Discussion then value = 4;
else if Commercial OR Personal-Leisure then value = 3;
if neutrality = 3 then value += 2;
if bias = 1 then value -= 2;
if trustiness = 3 then value += 2;

Based on Algorithm 1, the utility value of a host ranges between -2 and 9. The News-Editorial and Educational categories have the highest quality. Trusted, unbiased and neutral content also has a high quality score. By default, Web spam hosts have the lowest quality. The utility value may be used to predict the quality of a host. As before, linear regression can be applied to obtain quality scores.

V. EXPERIMENTAL RESULTS

The experimental results are reported in terms of the evaluation metrics used at DC2010. In accordance with the tasks defined by this challenge, quality prediction is evaluated on the English, German and French data sets, while genre, trustiness, neutrality and bias prediction are evaluated only on the English data set.

A. Evaluation

For evaluation, normalized discounted cumulative gain (NDCG) is used at DC2010. NDCG is obtained by normalizing the discounted cumulative gain (DCG) by the ideal DCG value. DCG and NDCG are computed with Eq. 1 and Eq. 2, respectively.

$$\mathrm{DCG} = \sum_{rank=1}^{N} utility(rank) \left(1 - \frac{rank}{N}\right) \qquad (1)$$

$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IdealDCG}} \qquad (2)$$

In Eq. 1, N is the number of instances in the test set, and the ideal DCG is the DCG value obtained by the ideal ordering of the hosts. According to the NDCG formula, it is important to rank hosts with higher utility values at the top.
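As an illustration of Eq. 1 and Eq. 2, the following Python sketch computes NDCG for a set of hosts ranked by a predicted score. It assumes a linear discount of 1 - rank/N in the DCG; the function names dcg and ndcg and the toy scores are hypothetical and are not taken from the DC2010 evaluation tools.

```python
from typing import Sequence


def dcg(utilities: Sequence[float]) -> float:
    """DCG of utilities given in ranked order (rank 1 first).

    Assumes the linear discount (1 - rank / N), where N is the list length.
    """
    n = len(utilities)
    return sum(u * (1.0 - rank / n) for rank, u in enumerate(utilities, start=1))


def ndcg(scores: Sequence[float], utilities: Sequence[float]) -> float:
    """Rank hosts by descending predicted score and normalise by the ideal DCG."""
    if len(scores) != len(utilities):
        raise ValueError("scores and utilities must have the same length")
    # Order host indices by predicted score, best first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [utilities[i] for i in order]
    # The ideal ordering places the highest true utilities first.
    ideal_dcg = dcg(sorted(utilities, reverse=True))
    # Guard against a non-positive ideal DCG (e.g. all utilities are zero).
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Three hypothetical hosts whose predicted order matches the ideal order.
print(ndcg([0.9, 0.2, 0.5], [9, 0, 5]))  # prints 1.0
```

With this discount the last-ranked host contributes nothing to the DCG, so errors near the top of the ranking cost the most.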
Note that for a perfect ranking the DCG equals the ideal DCG, giving an NDCG of 1.0. All NDCG values are therefore relative values on the interval 0.0 to 1.0. For the genre prediction problem, when evaluating the results for genre g, the utility value of host h is 1 if h belongs to genre g, and 0 otherwise.
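The two utility definitions used in the evaluation, the quality utility of Algorithm 1 and the 0/1 indicator used for genre prediction, can be sketched in Python as follows. This is a minimal sketch rather than the challenge's reference implementation: the function names, the lowercase genre strings and the reading of the bias rule as a subtraction of 2 (which matches the stated utility range of -2 to 9) are assumptions.

```python
def quality_utility(genre: str, trustiness: int, neutrality: int, bias: int) -> int:
    """Utility of a host following Algorithm 1 (genre names are assumed spellings)."""
    if genre in ("news-editorial", "educational"):
        value = 5
    elif genre == "discussion":
        value = 4
    elif genre in ("commercial", "personal-leisure"):
        value = 3
    else:
        value = 0  # spam hosts keep the default value
    if neutrality == 3:
        value += 2
    if bias == 1:
        value -= 2  # assumed penalty for significant bias problems
    if trustiness == 3:
        value += 2
    return value


def genre_utility(host_genre: str, target_genre: str) -> int:
    """Binary utility used when evaluating the ranking for one genre."""
    return 1 if host_genre == target_genre else 0


# A trusted, neutral, unbiased educational host reaches the top of the range.
best = quality_utility("educational", trustiness=3, neutrality=3, bias=2)
# A biased spam host reaches the bottom of the range.
worst = quality_utility("spam", trustiness=1, neutrality=1, bias=1)
```

Either utility can then be used as the gain of a host when a ranking is scored with DCG.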

B. Genre Prediction

Genre prediction results are given in Table IV. For each genre, the NDCG value is computed separately. The overall result of genre prediction is the arithmetic average of all NDCG values obtained from genre prediction. This experiment is applied to the English data set only.

TABLE IV
GENRE PREDICTION RESULTS FOR ENGLISH DATASET

Genre              NDCG
Spam               0.88
News-Editorial     0.73
Commercial         0.84
Educational        0.87
Discussion         0.76
Personal-Leisure   0.81
Overall            0.82

As can be seen from Table IV, the NDCG results are high compared to the results from previous publications. These results indicate that our proposed method yields higher NDCG values for spam genre prediction. The NDCG values for the spam and educational genres are the highest. This may be because spam and educational hosts are more word-specific than other genres: there are more discriminative words that occur mostly in spam and educational hosts. In contrast, discussion and news-editorial hosts have more words in common.

C. Trustiness, Neutrality and Bias Prediction

Results of the trustiness, neutrality and bias predictions are given in Table V. These class attributes are predicted separately with term vectors and with genre prediction results. As in genre prediction, this experiment is applied to the English data set only.

TABLE V
TRUSTINESS, NEUTRALITY AND BIAS PREDICTION RESULTS FOR ENGLISH DATASET

Class Attribute   NDCG Using Term Vectors   NDCG Using Genre Predictions
Trustiness
Neutrality
Bias

As can be seen from Table V, except for neutrality, using genre prediction results instead of term vectors yields a significant improvement in the NDCG values. It can also be concluded that, for Web hosts, neutrality is less dependent on genre values than trustiness and bias are. Nevertheless, for neutrality as well, using genre prediction results still gives better results than using term vectors.
Compared with other studies, these results are quite satisfactory. For trustiness, the best NDCG value reported so far is obtained. Bias prediction results are also at the same level as the top results.

D. Quality Prediction

Quality prediction results are given in Table VI. For each data set, quality is computed separately using the trustiness, neutrality and bias prediction results. These results are also comparable with the results obtained in other studies on the same data set. Using prediction results from other classification processes reduces the size of the attribute set, which in turn reduces the execution time of the classifier. For this reason, these results are more satisfactory when execution time and accuracy are considered together.

TABLE VI
QUALITY PREDICTION RESULTS

Dataset   NDCG
English   0.85
German    0.81
French    0.82

VI. CONCLUSION AND FUTURE WORK

In this study, a method for measuring the quality of Web hosts is proposed. It is first hypothesized and then observed that the terms of hosts contain more information about Web host genre than link-based metrics. A framework is presented for predicting class attributes of Web hosts. To evaluate the proposed framework, the data set of a conference challenge is used. The experimental results show that our proposed method is superior to previously proposed methods in terms of NDCG and execution time.

We are extending the classification method in several ways. More accurate predictions may be obtained using hybrid data sets that include both category prediction results and term vectors. Following our studies on quality prediction as a two-class decision problem [8] and as the multi-class decision problem presented here, detecting quality using propagation on graph structures is the main subject of our future work. Since the DC2010 data set includes the Web host graph structure, future studies can use the same data set as this study.

VII. ACKNOWLEDGEMENT

Author Gunduz-Oguducu was partially supported by the Scientific and Technological Research Council of Turkey (TUBITAK) EEEAG project 110E027.

REFERENCES

[1] G. Geng, X. Jin, X. Zhang and D. Zhang, "Evaluating Web content quality via multi-scale features," ECML/PKDD 2010 Discovery Challenge Workshop, 2010.
[2] A. Sokolov, T. Urvoy, L. Denoyer and O. Richard, "Madspam consortium at the ECML/PKDD Discovery Challenge 2010," ECML/PKDD 2010 Discovery Challenge Workshop, 2010.
[3] V. Nikulin, "Web-mining with Wilcoxon-based feature selection, ensembling and multiple binary classifiers," ECML/PKDD 2010 Discovery Challenge Workshop, 2010.
[4] E. Lex, I. Khan, H. Bischof and M. Granitzer, "Assessing the quality of Web content," ECML/PKDD 2010 Discovery Challenge Workshop, 2010.

[5] A. A. Benczur, C. Castillo, M. Erdelyi, Z. Gyongyi, J. Masanes and M. Matthews, "ECML/PKDD 2010 Discovery Challenge Data Set," crawled by the European Archive Foundation.
[6] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the Fourteenth International Conference on Machine Learning (ICML 97), Douglas H. Fisher (Ed.), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[7] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," The 10th European Conference on Machine Learning, 1998.
[8] A. A. Atak and S. G. Oguducu, "A framework for social spam detection based on relational Bayesian classifier," in Proceedings of the 6th International Conference on Data Mining (DMIN10), pp. 71-77, 2010.
[9] M. Marchiori, "The quest for correct information on the Web: hyper search engines," in Selected Papers from the Sixth International Conference on World Wide Web, Phillip H. Enslow, Jr., Mike Genesereth and Anna Patterson (Eds.), Elsevier Science Publishers Ltd., Essex, UK, 1997.
