Automated Controversy Detection on the Web

Shiri Dori-Hacohen and James Allan
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts Amherst
Amherst, Massachusetts
{shiri,

ABSTRACT
Alerting users about controversial search results can encourage critical literacy, promote healthy civic discourse and counteract the filter bubble effect, and would therefore be a useful feature in a search engine or browser extension. In order to implement such a feature, however, the binary classification task of determining which topics or webpages are controversial must be solved. Earlier work described a proof-of-concept approach to this problem using a supervised nearest neighbor classifier, comparing web pages to an oracle of manually annotated Wikipedia articles. This paper generalizes and extends that concept by taking the human out of the loop, leveraging the rich metadata available in Wikipedia articles in a distantly-supervised classification approach. The new technique we present allows the nearest neighbor approach to be extended to a much larger scale and to other datasets. The results improve substantially over naive baselines and are nearly identical to the supervised approach by standard measures of F1, F0.5, and accuracy. Finally, we discuss implications of solving this problem as part of a wider subject of interest to the IR community, and suggest several avenues for further exploration in this exciting new space.

1. INTRODUCTION
On the web today, alternative medicine sites appear alongside pediatrician advice websites. In the United States, the phrase "global warming is a hoax" is in wide circulation, and political debates rage, both online and offline, over the economy, same-sex marriage and healthcare. Worldwide, geopolitical and religious tensions are causing strife and upheaval; the Middle East is being reshaped by a generation that expects, or even demands, access to information. In countries with unlimited access, controversies proliferate online, but not everyone hears all sides of the argument; the filter bubble effect encourages confirmation bias by offering users the answers they want to hear [17]. In other cases, access does not translate into trustworthy information. For example, first-time parents seeking information about vaccines will find plenty of "proof" that they cause autism, and may not even realize the depth of the controversy involved (see Figure 1). Ads for helplines displayed to users searching for abortion are frequently, and discreetly, funded by pro-life (anti-abortion) religious groups [11].

Figure 1: One of the top query results in a major search engine for the query "do vaccines cause autism" [11].

Unfortunately, critical literacy, civic discourse and trustworthy information are not immediate results of successful information retrieval.
The underlying thread connecting all these examples is that users searching for these topics may not even be aware that a controversy exists; without the aid of a search engine feature or browser extension to warn them, they may never find out. We believe that a valuable addition to the end-user experience would be to inform users about controversial topics; this requires detecting such topics as a prerequisite. This problem was first proposed as a binary classification task by Dori-Hacohen and Allan [8]. Our paper suggests a method to solve that task and determine which topics or webpages are controversial. In prior work, a nearest neighbor approach was proposed to detect controversy, and a proof of concept was shown using human judgments as an oracle for the controversy level of related Wikipedia articles [8]; this approach obtained 10-20% absolute gains over several baselines.

However, human annotations are expensive to generate and do not scale. Dori-Hacohen and Allan had 1,761 Wikipedia articles annotated, which, while representing 20% of the articles in their dataset, are only a minuscule 0.04% of the English-language Wikipedia. This naturally raises the question of whether the nearest neighbor approach can be generalized, obviating the need for human annotations.

In this work, we answer that question by extending and generalizing the nearest neighbor approach to controversy detection, and present a new, distantly-supervised approach to detecting controversial topics. We use automated techniques to score related Wikipedia articles, replacing the human annotations and creating a fully automated system, with no human in the loop, that can classify a webpage as either controversial or not. We consider our system distantly-supervised [16] since we use heuristic labels for the Wikipedia articles, which act as a bridge between the rich metadata available in Wikipedia and the sparse data on the web. While one might hypothesize that using an automated approach to scoring Wikipedia articles (instead of an oracle of human annotations) would degrade the results, our approach achieves results comparable to the fully-supervised system [8], while at the same time making the approach scalable and applicable to any web dataset.

2. RELATED WORK
Several strands of related work inform our work on controversy: controversy detection in Wikipedia, controversy on the web and in search, fact disputes and trustworthiness, and sentiment analysis. We describe each area in turn.

2.1 Controversy Detection in Wikipedia
Research on controversy detection in Wikipedia can be leveraged and applied to a wider scope on the web. Wikipedia's collaborative nature, along with its versioning system, is a rich resource that has excited many researchers, with several papers focusing on detecting controversy in Wikipedia. Kittur et al. first proposed that controversy could be detected automatically using the rich metadata available on Wikipedia articles; they proposed a logistic regression classifier based on several metrics such as the length of the article, its associated talk page, and the number of users [14]. Sumi, Yasseri and their colleagues proposed a metric for controversy level, M, that is based on reverts, a special type of Wikipedia edit in which an editor completely undoes the previous edit [21, 25]. Sepehri Rad et al. presented a binary classifier for controversy using collaboration networks, and also presented a comparative study of the various approaches to detecting controversy in Wikipedia [20, 19].

Wikipedia is a valuable resource, but it often hides the existence of debate by presenting even controversial topics in deliberately neutral tones [24], which may be misleading to people unfamiliar with the debate. For example, the Wikipedia article for U.S. President Barack Obama makes him appear non-controversial, while both the Talk page and an automated analysis show otherwise.

While detecting controversy in Wikipedia automatically can be seen as an end in itself, these controversy detection methods have wider reach and can be used as a step toward solving other problems. Recently, Das et al.
explored the possibility of Wikipedia administrators attempting to manipulate controversial pages in a certain area, by extending any controversy metric into a clustered controversy measure; they hypothesized that editors focusing on a certain controversial topic are more invested in the outcomes than those spreading their edits across several topical areas [7]. Additionally, Wikipedia has been used in the past as a valuable resource assisting in controversy detection elsewhere, whether as a lexicon or as a hierarchy for controversial words and topics [3, 2, 18]. Likewise, we make use of a few of the Wikipedia-specific controversy measures described above as a step in our approach (see Section 3.2). As described above, prior work showed that it is possible to use related Wikipedia articles as a proxy for controversy on the web, by using human annotations as an oracle rating the controversy of the articles [8]. In contrast, we use automatically generated values for the Wikipedia articles. In order to evaluate our distantly-supervised system and compare it to the previously described supervised results, we acquired the same dataset collected by Dori-Hacohen and Allan, consisting of webpages and Wikipedia articles annotated as controversial or noncontroversial [8].

2.2 Controversy on the web and in search
Outside of Wikipedia, other targeted domains such as news [3, 5] and Twitter [18] have been mined for controversial topics, mostly focusing on politics and politicians. Some work relies on domain-specific sources such as Debatepedia [3, 13] that are likewise politics-heavy: for example, as of this writing Debatepedia has no entry discussing homeopathy. We consider controversy to be wider in scope than simply politics; medical and religious controversies are equally interesting. In some cases, a simple word search can be useful in detecting controversial queries [10]; unfortunately, this query-side approach to controversy detection relied on the Google Suggest API, which was deprecated in 2011 as part of the Google Labs shutdown.

Assuming one knows that a query is controversial, diversifying search results based on opinions is a useful feature [12, 13]. Prior work has shown that there is value in presenting users with information that may differ from their original perspective, whether it is portrayed implicitly or explicitly [9, 23, 26]. When reading about claims that may be in dispute, users do not actively seek contrasting information or viewpoints, but they are more likely to read a contrasting viewpoint when it is explicitly portrayed as such [23]. Yom-Tov et al. described how, for polarized political topics, two versions of seemingly similar search queries existed; using one query or the other exposed the biases of the user issuing the query (e.g. "obamacare" vs. "affordable health care") [26]. Their research demonstrated that interspersing search results from the opposite viewpoint on such polarized queries had the double effect of encouraging users both to read more diverse opinions and to read more news in general [26]; this partially counteracts the filter bubble effect [17]. This is the motivation behind our work: to automatically inform users of the various viewpoints on controversial topics.

2.3 Fact disputes and Trustworthiness
Previous work has also explored fact disputes and trustworthiness [9, 23], which are often related to controversial topics.

For example, the Dispute Finder tool focused on finding and exposing disputed claims on the web to users as they browse [9]; our ultimate goal, similarly, is to inform users of controversial topics in their search results. However, Dispute Finder focused largely on manually added or bootstrapped fact disputes, whereas we are interested in scalably detecting controversies that may, or may not, stem from fact disputes. Some controversies indeed originate from arguments over fact disputes (such as the birthplace of Barack Obama) or over matters of scientific study (the effect of vaccines on autism). In other cases, however, judgments, values and interpretation can fuel the fire despite widespread agreement on the basic facts of the matter; moral debates over abortion and euthanasia, or political debates over the size of the U.S. government and the Israeli-Palestinian conflict, are some examples.

2.4 Sentiment Analysis
Sentiment analysis can naturally be seen as a useful tool and a step towards detecting varying opinions and, potentially, controversy [5, 18, 22]. However, sentiment analysis and opinion mining work has long focused on product or movie reviews, where users have no intention or reason to hide their true thoughts; in contrast, many biased webpages intentionally obfuscate the sentiment and controversy involved, attempting to pass off their vision as objective. As others have mentioned, sentiment alone may not suffice for detecting controversy [3, 8], though it may be useful as a feature. We believe that warning about controversial webpages therefore requires a dedicated approach that will likely not be based solely on sentiment analysis, and must rely on information outside the text of the page itself - at the very least, other websites on the same topic.

3. NEAREST NEIGHBOR APPROACH
Our approach to detecting controversy on the web is a nearest neighbor classifier that maps webpages to the Wikipedia articles related to them. We start from a webpage and find Wikipedia articles that discuss the same topic; if the Wikipedia articles are controversial, it is reasonable to assume the webpage is controversial as well. Prior work, using human annotations as an oracle for the Wikipedia articles' controversy level, showed that such a nearest neighbor classifier can achieve an accuracy of 72.9% and F0.5 of 64.8%, which represent measurable improvements over several baselines [8]. Our work replicates these results in a scalable manner by removing the human in the loop and replacing it with automatically generated scores for Wikipedia articles.

The choice to map specifically to Wikipedia, rather than to arbitrary webpages, was driven by the availability of Wikipedia's rich metadata and edit history; as described above, there is a fair amount of prior work dedicated to detecting controversy on Wikipedia using this wealth of information [14, 19, 20, 21, 25]. Using a targeted nearest neighbor approach, with neighbors specifically selected from Wikipedia, allows us to bridge the gap between sparse web data and Wikipedia, affording an opportunity to leverage this valuable resource for other areas of the web. We consider our approach a distantly-supervised classifier [16], since the labels used for controversy on Wikipedia are automatically generated heuristics rather than truth labels; some of them were learned using a supervised classifier on Wikipedia, but none of them were trained for the task at hand, namely classifying webpages' controversy.
To implement our nearest neighbor classifier, we use several components: matching via query generation, scoring the Wikipedia articles, aggregation, thresholding and voting. A schematic of the system is shown in Figure 2, which we will refer to while describing the various parts.

Figure 2: Schematic of the fully automated Nearest Neighbor Approach.

3.1 Matching via Query Generation
We use the same query generation approach described in previous work [8] to map from webpages to the related Wikipedia articles. The top ten most frequent terms on the webpage, excluding stop words, are extracted (step 1 in Figure 2) and then used as a keyword query, restricted to the Wikipedia domain and run on a commercial search engine (step 2 in Figure 2). We use one of two different stop sets: a 418-word set (which we refer to as "Full Stopping" [4]) or a 35-word set ("Light Stopping" [15]). Wikipedia redirects were followed wherever applicable in order to ensure we reached the full Wikipedia article with its associated metadata; any talk or user pages were ignored. We consider the articles returned from the query as the webpage's neighbors, which will be evaluated for their controversy level. Based on the assumption that higher-ranked articles might be more relevant but provide less coverage, we varied the number of neighbors in our experiments from 1 to 20, or used all articles returned by the search engine. A brief evaluation of the query generation approach is presented in Section 5.1.
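To make the matching step concrete, the sketch below illustrates one way the query generation could be implemented; the tiny stop list, the tokenizer and the site-restriction syntax are our own simplifications rather than the authors' exact setup.

```python
import re
from collections import Counter

# Tiny illustrative stop list; the paper uses either a 418-word ("Full Stopping")
# or a 35-word ("Light Stopping") stop set.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
              "are", "for", "on", "that", "with"}

def generate_query(page_text, num_terms=10):
    """Return a keyword query built from the top-N most frequent non-stopword terms."""
    tokens = re.findall(r"[a-z']+", page_text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 1)
    top_terms = [term for term, _ in counts.most_common(num_terms)]
    # Step 2 in Figure 2: restrict the query to the Wikipedia domain before
    # sending it to a commercial search engine (shown here as a site: filter).
    return " ".join(top_terms) + " site:en.wikipedia.org"
```

The articles returned for such a query (after following redirects and discarding talk and user pages) then serve as the webpage's neighbors.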

3.2 Automatically-generated Wikipedia labels
The Wikipedia articles found as neighbors to webpages were labeled with several scores measuring their controversy level (step 3 in Figure 2). In previous work [8], human annotations were used to label the Wikipedia articles. In this paper, we use three different types of automated scores for controversy in Wikipedia, which we refer to as D, C and M. All three scores can be automatically generated for any Wikipedia article, based on information available in its associated metadata, talk page and revision history. For all three scores (and without loss of generality), higher scores indicate more controversy.

3.2.1 D score - presence of dispute tags
D tests for the presence of Dispute tags that are manually added to the talk pages of Wikipedia articles by their contributors [14, 19]. These tags, while very sparse and therefore difficult to rely on [19, 8], are potentially valuable when they are present. We test for the presence of such a tag, and use the result as a binary score (1 if the tag exists, -1 if it doesn't). Although this approach relies on data that was manually entered into Wikipedia, our use of it is automatic and unsupervised. Unfortunately, the number of dispute tags available is very low: in a recent Wikipedia dump, there were only 1,449 articles that had a dispute tag on their talk page, only 0.03% of the articles on Wikipedia. This is an even smaller dataset than the human annotations provided in prior work [8]; the overlap between these articles and the 8,755 articles in our dataset is a mere 165 articles.

3.2.2 C score - metadata-based regression
The C score is a regression that predicts the controversy level of a Wikipedia article using a variety of metadata features. This regression is based on the approach first described by Kittur et al. [14], and is referred to alternately as the "meta" classifier [19] or the C score [7]. The goal of this score is to extend the dispute tags to the wider set of articles that don't contain them, thus increasing coverage. The C score is trained using articles that contain dispute tags as training examples. The regression relies on the assumption that the more revisions of an article the dispute tag appears in, the more controversial the page will be. The classifier is thus implemented as a linear regression that attempts to predict, for a given Wikipedia article that doesn't actually contain a dispute tag, the estimated number of revisions which would theoretically contain a dispute tag (if it did have one). The classifier uses a variety of metadata features, such as the length of the article and its associated talk page, the number of revisions to each of those, and the number of editors and anonymous editors, among others. We use the version of this regression implemented and trained recently by Das et al. [7], which weighs the score by the number of edits on the page and normalizes it, generating a floating point score in the range (0,1).

3.2.3 M score - mutual reverts score
The M score, as defined by Sumi et al., is a different way of estimating the controversy level of a Wikipedia article, based on the concept of reverts and edit wars in Wikipedia [21, 25]. A revert is a special type of edit in Wikipedia, wherein an editor completely undoes a previous edit, thus reverting the page to a prior version. Simply counting reverts, however, is not a sufficient measure of controversy; reverts are quite commonly associated with vandalism, which is very frequent on popular pages regardless of their controversy level. In an edit war, however, multiple editors revert each other, often more than once - termed mutual reverts [21]. The M score is therefore defined as a function of the number of mutual reverts to the article and the number of edits each editor has performed on the page [21]. The premise of the M score is that the more times an editor has edited a page, the less likely it is that a revert of their edit is associated with vandalism (most vandals are blocked from Wikipedia quickly), and therefore the more likely it is to be caused by an actual controversy.
Thus, the M score of an article considers the sum of mutual reverts on the page: every pair of editors engaging in a mutual revert on it is weighed by the minimum number of contributions of the two editors [21]. The effect of vandalism is thus expected to be minimized. Finally, the top pair of reverting editors is ignored, thus nullifying potential personal vendettas. The score is a positive real number, theoretically unbounded (in practice it ranges from 0 to several billion).

3.3 Aggregation and Thresholding
The score for a webpage is computed by taking either the maximum or the average of all its neighbors' scores (step 4 in Figure 2); which of the two aggregation methods is used is one of the parameters we vary in our experiments. After aggregation, each webpage has a single score according to each of the three scoring methods (D, C and M), which can be considered an estimator of the page's controversy level. A threshold must be chosen if one wishes to convert these scores into binary judgments (step 5 in Figure 2). We found that the threshold chosen for the M metric by Sumi et al. [21] was not sufficient for our purposes, while C had no available threshold. We therefore trained several thresholds for both C and M (see Section 4.1).

3.4 Voting
In addition to using each of the three labels in isolation, we can also combine them by voting (step 6 in Figure 2). We apply one of several voting schemes to the binary classification labels, after the thresholds have been applied. The schemes we use are:

Majority vote: consider the webpage controversial if at least two out of the three labels are controversial.

Logical Or: consider the webpage controversial if any of the three labels is controversial.

Logical And: consider the webpage controversial only if all three labels are controversial.

Other logical combinations: we consider results for the combination (Dispute ∨ (C ∧ M)), based on the premise that if the dispute tag happens to be present, it would be valuable. We tried all three combinations of the form (x ∨ (y ∧ z)), but D's coverage was so low that with x=C or x=M the results were essentially identical to majority voting; we therefore omit them.
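Putting the aggregation, thresholding and voting steps together, the following minimal sketch shows how a webpage's binary label could be derived from its neighbors' D, C and M scores. The threshold values are placeholders (the actual thresholds are trained as described in Section 4.1), and the helper is ours, not the authors' implementation.

```python
from statistics import mean

def classify_webpage(neighbor_scores, use_max=True, thresh_c=0.5, thresh_m=1000.0):
    """Steps 4-6 of Figure 2: aggregate, threshold, and majority-vote.

    neighbor_scores: one dict per Wikipedia neighbor, e.g.
        {"D": -1, "C": 0.42, "M": 3500.0}
    """
    aggregate = max if use_max else mean
    agg = {s: aggregate(n[s] for n in neighbor_scores) for s in ("D", "C", "M")}

    # Step 5: thresholding. D is already binary, with a natural threshold at 0.
    labels = {
        "D": agg["D"] > 0,
        "C": agg["C"] > thresh_c,
        "M": agg["M"] > thresh_m,
    }

    # Step 6: majority vote; the "Or"/"And" schemes would use any()/all() instead.
    return sum(labels.values()) >= 2
```

With max aggregation, a single neighbor whose score exceeds a threshold is enough to flip the corresponding label, which is why the choice of aggregation interacts with the choice of threshold (see the discussion in Section 6).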

Table 1: Data set size and annotations. NNT denotes the subset of Wikipedia articles that are Nearest Neighbors of the webpages' Training set.

Webpages
Set        Pages   Controversial
All        377     (32.6%)
Training           (29.8%)
Testing            (38.0%)

Wikipedia articles
Set   Articles   Annotated   Controversial
All   8,755      1,761       (16.0%)
NNT   4,060                  (13.5%)

4. EXPERIMENTAL SETUP AND DATA SET
The dataset we use, which was described in prior work [8], comprises 377 webpages from ClueWeb09, drawn from 41 seed queries sent to a commercial search engine. The collection was randomly split into approximately 60% training and 40% testing sets; the seeds were split up, with some seeds' pages going entirely into either train or test, and others being split between the two sets. The final distribution of controversy in the two sets differed slightly (see Table 1), as the training set had a lower proportion of controversial pages than the testing set (29.8% vs. 38.0%). The webpages were annotated for controversy level on a four-point scale. Additionally, the dataset contained 8,755 Wikipedia articles that were found as neighbors of the webpages in the set, using the approach described in Section 3.1. Of these articles, 4,060 were the Nearest Neighbors associated with the Training set ("NNT" in Table 1), which we used to train the thresholds for C and M.

We use the following five metrics, both to train and to evaluate our results:

Precision is the proportion of pages that the system labeled as controversial that were actually controversial.

Recall is the proportion of pages that the system labeled as controversial, out of all the controversial pages in the dataset.

Accuracy is the proportion of pages that the system labeled correctly (both controversial and noncontroversial), out of all the pages in the dataset.

F1 is the harmonic mean of precision and recall.

F0.5 is the weighted harmonic mean which gives precision twice as much importance as recall.

Another way of looking at these definitions is in the classic IR sense of these metrics, with "controversial" standing in for "relevant".
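Both F-measures are instances of the standard weighted harmonic mean Fβ (with β = 1 and β = 0.5, respectively), shown here for completeness:

```latex
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R},
\qquad
F_1 = \frac{2 P R}{P + R},
\qquad
F_{0.5} = \frac{1.25\, P R}{0.25\, P + R}
```

Setting β = 0.5 weights precision twice as heavily as recall, matching the definition above.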
Figure 3: Precision-Recall curves (uninterpolated) for C and M thresholds on the Wikipedia NNT set.

4.1 Threshold training
C and M are both real-valued, as described in Section 3.2; in order to generate a binary classification, we must select a threshold above which the page will be considered controversial. (For the D score, which we defined as a binary value of either -1 or 1, a natural threshold exists at 0.) The Precision-Recall curves for both scores are displayed in Figure 3. Since we had access to human annotations on some of the Wikipedia articles [8], we trained the thresholds for C and M on the subset of articles associated with the training set (labeled NNT in Table 1). We select five thresholds for each of the two scoring methods, based on the best results achieved on this subset for Precision, Recall, F1, Accuracy and F0.5 as defined above. The resulting thresholds are shown in Table 2, along with the results achieved on the same set by D. (Note that a threshold of 0 for the M metric returns every nonzero score as controversial, and only zero scores as noncontroversial.)

For comparison, we also present single-class baselines on this task of labeling the Wikipedia articles. The dominant class baseline labels all pages as noncontroversial, which yields 0% precision and 0% recall since no pages are labeled as controversial; accuracy is the only nonzero metric in this case. The nondominant class baseline, in turn, labels all pages as controversial and thus achieves 100% recall; its precision and accuracy are equivalent to the incidence of controversy in the dataset. Finally, a random baseline, which labels every article as either controversial or noncontroversial based on a coin flip, is presented for comparison (average of three random runs). Its precision is also similar to the incidence of controversy in the dataset, while its recall and accuracy are closer to 50%.

5. EVALUATION
We treat the controversy detection problem as a binary classification problem of assigning labels of controversial and noncontroversial to webpages. We therefore evaluate our results on the classification problem using the definitions of precision, recall, accuracy and Fβ given in Section 4. We present a brief evaluation of the query generation approach before turning to our results for the controversy detection problem.

5.1 Judgments from Matching
A key step in our approach is selecting which Wikipedia articles to use as nearest neighbors. In order to evaluate how well our query generation approach maps webpages to Wikipedia articles, we evaluated the automated queries and the relevance of their results to the original webpage. This allows an intrinsic measure of the effectiveness of this step, independent of its effect on the extrinsic task. Focusing on the queries generated by the Light Stopping set (see Section 3.1), we annotated 3,430 of the query-article combinations (out of 7,630 combinations total) that were returned from the search engine; the combinations represented 2,454 unique Wikipedia articles.

Table 2: Thresholds for M and C trained on the NNT set, based on the task of labeling Wikipedia articles (not webpages). Results are optimized for each of the metrics (Precision, Recall, F1, Accuracy and F0.5); the metric optimized is in bold. [Table body: five thresholds each for M and C, plus the Dispute, Random, Dominant and Non-Dominant baselines, with columns Threshold, P, R, F1, Acc and F0.5; numeric values omitted.]

Figure 4: Judgments on Wikipedia articles returned by the automatically-generated queries, by query rank. Annotators could choose one of the following options: H-On = Highly on [webpage's] topic, S-On = Slightly on topic, S-Off = Slightly off topic, H-Off = Highly off topic, Links = Links to this topic, but doesn't discuss it directly, DK = Don't Know.

Preference was given to annotating articles that were higher-ranked in the query results, resulting in 361 annotations for the top rank, and between 330 and 280 annotations each for ranks 2 through 10. Our annotators were presented with the webpage and the titles of up to 10 Wikipedia articles in alphabetical order (not ranked order). The annotators were asked to name the single article that best matched the webpage. They were also asked to judge, for each article, whether it was relevant to the original page. The annotators were not shown the automatically-generated query. The options were "highly on topic", "slightly on topic", "slightly off topic" or "highly off topic", as well as "links to this topic, but doesn't discuss it directly" or "don't know".

Figure 4 shows how the ranked list of Wikipedia articles was judged. It is clear from the figure that the top-ranking article was usually viewed as highly on topic, but that the quality then dropped rapidly. However, if both on-topic judgments are combined, a large number of highly or slightly relevant articles are being selected. Also, the higher the rank, the more likely that the article was on topic. Figure 5 further supports this, by showing that most of the best articles are at ranks 1 to 3; the top-ranked article was considered the best for 42% of the pages. Using a reciprocal rank contribution of 0 for "don't know" or "none of the above", the Mean Reciprocal Rank for the dataset was

5.2 Baseline runs and prior work
We compare our approach to several baselines. We use the same sentiment analysis approach as described in prior work [8], along with the random and single-class baselines described in Section 4.1. The sentiment analysis baseline is a logistic regression classifier [1] trained to detect the presence of sentiment on the webpage, whether positive or negative; sentiment is used as a proxy for controversy. Finally, the best results from prior work are displayed [8]; as mentioned above, these results use a similar nearest neighbor approach, with human annotations serving as an oracle for the Wikipedia articles' controversy level.

5.3 Our approach
As described in Section 3, we varied several parameters in our nearest neighbor approach:
1. Stopping set (Light or Full)
2. Number of neighbors (k=1..20, or no limit)
3. Aggregation method (average or max)
4. Scoring or voting method (C, M, D; Majority, Or, And, D ∨ (C ∧ M))
5. Thresholds for C and M (one of the five values each from Table 2)
These parameters were evaluated on the training set and the best runs were selected, optimizing for Precision, Recall, F1, F0.5 and Accuracy. The parameters that performed best for each of the scoring/voting methods were then run on the test set.
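This parameter selection amounts to a small grid search over the settings listed above; the sketch below illustrates the procedure, with a hypothetical evaluate_run helper standing in for a full run of the nearest neighbor pipeline on the training set.

```python
from itertools import product

def select_best_runs(evaluate_run, thresholds_c, thresholds_m,
                     metrics=("P", "R", "F1", "F0.5", "Acc")):
    """Return, for each target metric, the best parameter setting on the training set."""
    grid = product(
        ["Light", "Full"],                                   # stopping set
        list(range(1, 21)) + [None],                         # number of neighbors (None = no limit)
        ["avg", "max"],                                      # aggregation
        ["C", "M", "D", "Majority", "Or", "And", "D_or_C_and_M"],  # scoring / voting
        thresholds_c,                                        # trained C thresholds (Table 2)
        thresholds_m,                                        # trained M thresholds (Table 2)
    )
    best = {m: (None, -1.0) for m in metrics}
    for params in grid:
        scores = evaluate_run(*params)   # dict: metric -> value on the training set
        for m in metrics:
            if scores[m] > best[m][1]:
                best[m] = (params, scores[m])
    return best
```

The winning parameter setting for each target metric is then run once on the test set, which is how the rows of Table 3 were selected.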
5.4 Results
The results of our approach on the test set are displayed in Table 3. For ease of discussion, we will refer to row numbers in the table. We present the best runs, as trained on the training set, with their test set results. (In some cases, runs with similar numbers of neighbors and otherwise identical parameters produced very close results; we present only the best of these similar runs.)

6. DISCUSSION

Table 3: Results on the Testing Set. Results are displayed for the best runs on the training set, using each scoring method, optimized for F1, Accuracy and F0.5. Sample results optimizing for Recall are also displayed. The overall best results of our runs, in each metric, are displayed in bold; the best prior results (rows 18-20, marked Oracle [8]) and baseline results (rows 21-25) are also displayed in bold. See text for discussion. Run parameters by row (trained thresholds, target metrics and numeric results omitted):

#   Stop   Score        k         agg
1   Full   M            8         avg
2   Light  M            8         max
3   Light  M            12        max
4   Light  M            18        max
5   Light  C            14        max
6   Light  C            15        max
7   Light  C            7         avg
8   Light  C            7         max
9   Light  D            19        max
10  Full   D            5         max
11  Light  D            6         max
12  Light  Majority     15        max
13  Full   Majority     5         max
14  Full   Or           8         avg
15  Light  And          no limit  max
16  Full   D ∨ (C ∧ M)  8         avg
17  Light  D ∨ (C ∧ M)  7         avg
18  Light  Oracle       4         avg
19  Light  Oracle       20        max
20  Full   Oracle       8         max
21  Sentiment baseline
22  Random baseline
23  Random baseline
24  Dominant Class baseline (all noncontroversial)
25  Non-Dominant Class baseline (all controversial)

As the results in Table 3 show, our fully-automated approach (rows 1-17) achieves results higher than all baselines (rows 21-25) in all five metrics. The precision-recall curves for select runs are presented in Figure 6. Results optimizing for precision and recall are largely omitted from the table; values of exactly 100% for each were achievable on the training set by many different parameter combinations, often using either the most discriminative (for precision) or the most permissive (for recall) thresholds. The runs with 100% recall on the training set generalized well, and consistently achieved at or near 100% recall on the test set. These runs were essentially identical to the nondominant class baseline, and we present the best of them (rows 4 and 8 in Table 3). The thresholds for precision did not generalize across the folds; they were mostly too restrictive, generating a single controversy label on the training set (thus gaining 100% correctness on one label) and, in most cases, returning no results on the test set, making them equal to the dominant class baseline.

Note that for the task of detecting controversy in Wikipedia articles (one of the steps in our approach), the Precision and Recall combinations achieved by the automated methods were fairly low, as shown in Table 2 and Figure 3; F1 and F0.5 are never above 35%. Despite those seemingly poor results for this step, aggregating these automated scores over several neighbors results in fairly high F values for the webpage task (near or above 60% for both the M and C methods). We note that the results for the dispute method on the webpage task were much lower than one might otherwise expect; however, they are partially explained by the low results on the Wikipedia task (see Table 2).

We observe that when taking the maximum of the neighbors, results were generally better with higher, more discriminative thresholds and a large number of neighbors; when the average was used, a lower threshold with fewer neighbors was more effective. This makes sense when considering that the max function is more sensitive to noise than the average function: a single highly controversial neighbor can change the classification, so a higher threshold can reduce the sensitivity to such noise while extending coverage by considering more neighbors.
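As a toy illustration of this sensitivity (our own made-up scores, not data from the paper), consider five neighbors of which one is a noisy outlier:

```python
from statistics import mean

# Hypothetical C scores for five Wikipedia neighbors; the last one is an outlier.
scores = [0.05, 0.10, 0.08, 0.12, 0.90]

print(max(scores))   # 0.90 -- with a moderate threshold (say 0.5), max calls the page controversial
print(mean(scores))  # 0.25 -- the same threshold is not crossed when averaging
```

A higher threshold therefore protects max aggregation from single noisy neighbors, while averaging can tolerate a lower threshold but dilutes the contribution of a genuinely controversial neighbor as more neighbors are added.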
For many of the max runs, increasing the number of neighbors beyond a certain point had very little effect on the actual results, since controversial pages had already been labeled as such by a higher-ranked neighbor. The method that optimized F0.5 on the training set among all the single-score approaches was the run using Light Stopping and M with a rather high (discriminative) threshold, aggregating over the maximal value of all the result neighbors (rows 2 and 3 in Table 3). Using k values of 8 through 12 achieved identical results on the training set.

These runs ended up achieving some of the best results on the test set: with k=8 the results were the best for F0.5 as well as Accuracy (row 2), and with k=12 the results were the best for F1 among the single-score methods (row 3). The k=8 run (row 2) showed a 10.1% absolute gain in accuracy (16.3% relative gain) over the dominant class baseline, which had the best accuracy score among the baselines; F0.5 showed a 19.5% absolute gain (44.5% relative gain) over the best baseline F0.5 score, which was achieved by the Random 50 baseline. The k=12 run (row 3) showed a 9% absolute gain (16.3% relative gain) in F1 over the best baseline for F1, the nondominant class baseline. Even though none of the results displayed in the table were optimized for precision, they still had higher results than the baselines across the board (compare rows 1-17 to rows 21-25).

Figure 5: Frequency of page selected as best, by rank. DK = Don't Know, N = None of the above.

Figure 6: Precision-Recall curves (uninterpolated) for select runs on the Test set. Row numbers refer to Table 3. (Runs shown: Light C 7 avg, row 7; Light M 8 max, row 2.)

Among the voting methods, the method that optimized F1 on the training set was Majority voting, using Light Stopping and aggregating over the maximal value of 15 neighbors, with discriminative thresholds for both M and C (row 12). This run showed a 10.4% absolute gain (18.9% relative gain) on the test set over the best baseline for F1.

The results of the sentiment baseline (row 21 in Table 3) are surprisingly similar to the all-controversial baseline (row 25); at a closer look, the sentiment classifier returns only about 10% of the webpages as lacking sentiment, and therefore its results are not far from the single-class baseline. We tried applying higher confidence thresholds to the sentiment classifier, but this only resulted in lower recall without improvement in precision. It is worth noting that the sentiment classifier was not trained to detect controversy; it is clear from these results, as others have noted, that sentiment alone is too simplistic to predict controversy [3, 8]. For example, a webpage may proclaim its love of Julia Roberts; this contains clear sentiment, but fairly little controversy (unless you happen to hate Julia Roberts).

Finally, when comparing our distantly-supervised results (rows 1-17) to the best supervised runs from prior work (rows 18-20, see [8]), the results are quite comparable. The best supervised run (row 18) had absolute gains of 1.5% and 0.8% in F0.5 and Accuracy, respectively, over our best runs for those metrics (row 2); F1 for the same runs had an absolute gain of 4.5% for our approach over the supervised approach. The run that is best for F1 using our approach (row 12) has a 1.3% absolute gain over the best F1 run in prior work (row 20). This is despite the fact that the supervised runs relied on human judgments for Wikipedia articles, while the distantly-supervised methods we described can be applied automatically to any webpage in a scalable manner.

7. CONCLUSIONS AND FUTURE WORK
In this work, we presented the first fully automated approach to solving the recently proposed binary classification task of controversy detection [8]. We showed that web controversy detection can be done with automatic labeling of exemplars in a nearest neighbor classifier. Our approach improves upon the previous work by creating a scalable, distantly-supervised classification system that leverages the rich metadata available in Wikipedia and uses it to classify webpages for which such information is not available.
We reported results that represent 20% absolute gains in F-measures and 10% absolute gains in accuracy over several baselines including sentiment analysis, and are comparable to prior work that used human annotations as an oracle stand-in [8]. As with other work that leverages such controversy scores calculated on Wikipedia articles, our approach is modular and therefore agnostic to the method chosen; the scores used for the Wikipedia articles can be generated via any method available to evaluate their controversy level [7].

For example, scores based on a network collaboration approach [20] could be substituted in place of the M and C values, or added to them as another feature. Additionally, the nearest neighbor method we described is agnostic to the choice of target collection we query, and is not limited to Wikipedia. Other rich web collections whose controversy values can be easily computed, such as Debate.org, Debatabase or procon.org, could also be used alongside Wikipedia as targeted neighbors, potentially improving precision.

There are several ways in which our method could be improved upon, and we look forward to exploring some of them. More complex query generation methods could be employed to match neighbors; alternatively, matching could be enhanced by entity linking (Wikification), or language models of the pages and articles could be used to compare candidate neighbors directly. Other classifiers for the Wikipedia articles could be explored, and standard machine learning approaches could be used to combine our method with other features such as sentiment analysis. We would also be interested in validating our approach on a larger data set: the low results that the dispute metric received on the data set presented here raise interesting questions that we have not yet had the chance to explore.

The nearest neighbor approach we presented is limited in nature by the collection it targets; it will not properly detect controversial topics that are sparsely covered by Wikipedia, or not covered at all. One could argue that this is a major problem with the approach: though Wikipedia's coverage is fairly wide, it will not reach the truly long tail of the web when it comes to controversies of small scope, or those interesting only to a specific subpopulation. Entirely different approaches would need to be employed to detect such controversies. We believe sentiment presence alone may not suffice to detect controversy, and some biased websites may intentionally attempt to present themselves as neutral. Nonetheless, it is possible that some metric of sentiment variance across multiple websites could provide useful clues; clustering websites topically and then analyzing the sentiments across them might be one way to handle this challenge. Another approach could use language models or topic models to automatically detect the fact that strongly opposing, possibly biased points of view exist on a topic, and thus that it may be controversial. This would flip the directionality of some recent work that presupposes subjectivity and bias to detect points of view [6, 26].

Being able to automatically detect controversies on the web has potential implications for improving critical literacy and civic discourse. Along with the ever-growing trend towards personalization in search comes the risk of fragmenting the web into separate worlds, with search engines showing users only what they think the user wants to see in a self-fulfilling prophecy. The ability to inform users about fact disputes and controversies in their queries enables search engines to improve trustworthiness. Additionally, explicitly exposing bias and polarization may partially counteract the filter bubble or echo chamber effects, wherein click feedback further reinforces users' predispositions.

We see the controversy detection problem itself as a prerequisite to several other interesting applications and larger problems. The most obvious application is a feature for search engines, or a browser plugin, that informs users when the webpage they are looking at is controversial.
It would be useful to conduct user studies to see whether such a warning has an impact on search and browsing activity, whether in the moment or over time. Tracking the evolution of controversial topics on the web over time, or exploring the incidence of controversy in the wild, are additional questions that could be of interest to social scientists. Other interesting problems include diversifying controversial search results according to the stances and opinions expressed in them, and automatically clustering or classifying webpages into various points of view on controversial topics, to name just a few. We are fascinated to explore these larger avenues of future work that can further healthy debates on the web, encourage civic discourse, and promote critical literacy for end-users of search.

8. ACKNOWLEDGMENTS
This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF grant #IIS. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor. We would like to thank Ed Chi, Kevin Murphy, Elad Yom-Tov and the members of the CIIR lab for fruitful conversations; Allen Lavoie, Hoda Sepehri Rad and Taha Yasseri added to those, along with supplying valuable resources. Sandeep Kalra, Ravali Rochampally, Emma Tosch and Celal Ziftci commented on versions of the draft and helped improve it significantly. Finally, this paper would not have been possible to write without the incredible support of Gonen Dori-Hacohen.

9. REFERENCES
[1] E. Aktolga and J. Allan. Sentiment diversification with different biases. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, New York, NY, USA. ACM.
[2] R. Awadallah, M. Ramanath, and G. Weikum. OpinioNetIt: understanding the opinions-people network for politically controversial topics. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, New York, NY, USA. ACM.
[3] R. Awadallah, M. Ramanath, and G. Weikum. Harmony and dissonance: organizing the people's voices on political controversies. In Proceedings of WSDM.
[4] J. P. Callan, W. B. Croft, and S. M. Harding. The INQUERY retrieval system. In Database and Expert Systems Applications. Springer Vienna.
[5] Y. Choi, Y. Jung, and S.-H. Myaeng. Identifying controversial issues and their sub-topics in news articles. In Proceedings of the 2010 Pacific Asia Conference on Intelligence and Security Informatics, PAISI '10, Berlin, Heidelberg. Springer-Verlag.
[6] S. Das and A. Lavoie. Automated inference of point of view from user interactions in collective intelligence venues. In Workshop on Social Computing and User Generated Content, EC '13.

[7] S. Das, A. Lavoie, and M. Magdon-Ismail. Manipulation among the arbiters of collective intelligence: How Wikipedia administrators mold public opinion. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, New York, NY, USA. ACM.
[8] S. Dori-Hacohen and J. Allan. Detecting controversy on the web. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, New York, NY, USA. ACM.
[9] R. Ennals, B. Trushkowsky, and J. M. Agosta. Highlighting disputed claims on the web. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, New York, NY, USA. ACM.
[10] K. Gyllstrom and M.-F. Moens. Clash of the typings: finding controversies and children's topics within queries. In Proceedings of the 33rd European Conference on Advances in Information Retrieval, ECIR '11, pages 80-91, Berlin, Heidelberg. Springer-Verlag.
[11] Heroic Media. Free abortion help website, January.
[12] M. Kacimi and J. Gamper. Diversifying search results of controversial queries. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 93-98, New York, NY, USA. ACM.
[13] M. Kacimi and J. Gamper. MOUNA: mining opinions to unveil neglected arguments. In Proceedings of CIKM.
[14] A. Kittur, B. Suh, B. A. Pendleton, and E. H. Chi. He says, she says: conflict and coordination in Wikipedia. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '07, New York, NY, USA. ACM.
[15] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.
[16] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, Stroudsburg, PA, USA. Association for Computational Linguistics.
[17] E. Pariser. The Filter Bubble: What the Internet Is Hiding from You. Penguin Press HC.
[18] A.-M. Popescu and M. Pennacchiotti. Detecting controversial events from Twitter. In Proceedings of CIKM.
[19] H. Sepehri Rad and D. Barbosa. Identifying controversial articles in Wikipedia: A comparative study. In Proceedings of the 8th Conference on WikiSym, WikiSym '12. ACM.
[20] H. Sepehri Rad, A. Makazhanov, D. Rafiei, and D. Barbosa. Leveraging editor collaboration patterns in Wikipedia. In Proceedings of the 23rd ACM Conference on Hypertext and Social Media, HT '12, pages 13-22, New York, NY, USA. ACM.
[21] R. Sumi, T. Yasseri, A. Rung, A. Kornai, and J. Kertész. Edit wars in Wikipedia. CoRR.
[22] M. Tsytsarau, T. Palpanas, and K. Denecke. Scalable discovery of contradictions on the web. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, New York, NY, USA. ACM.
[23] V. G. V. Vydiswaran, C. Zhai, D. Roth, and P. Pirolli. BiasTrust: Teaching biased users about controversial topics. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, New York, NY, USA. ACM.
[24] Wikipedia. Wikipedia: Neutral Point of View policy, January. Retrieved from https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view#Controversial_subjects.
[25] T. Yasseri, R. Sumi, A. Rung, A. Kornai, and J. Kertész. Dynamics of conflicts in Wikipedia. PLoS ONE, 7(6):e38869.
[26] E. Yom-Tov, S. Dumais, and Q. Guo. Promoting civil discourse through search engine diversity. Social Science Computer Review, 2013.


More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

Chapter 6 Evaluation Metrics and Evaluation

Chapter 6 Evaluation Metrics and Evaluation Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific

More information

second_language research_teaching sla vivian_cook language_department idl

second_language research_teaching sla vivian_cook language_department idl Using Implicit Relevance Feedback in a Web Search Assistant Maria Fasli and Udo Kruschwitz Department of Computer Science, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, United Kingdom fmfasli

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,

More information

Predicting Query Performance on the Web

Predicting Query Performance on the Web Predicting Query Performance on the Web No Author Given Abstract. Predicting performance of queries has many useful applications like automatic query reformulation and automatic spell correction. However,

More information

Oleksandr Kuzomin, Bohdan Tkachenko

Oleksandr Kuzomin, Bohdan Tkachenko International Journal "Information Technologies Knowledge" Volume 9, Number 2, 2015 131 INTELLECTUAL SEARCH ENGINE OF ADEQUATE INFORMATION IN INTERNET FOR CREATING DATABASES AND KNOWLEDGE BASES Oleksandr

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Evaluating the Usefulness of Sentiment Information for Focused Crawlers Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

Improving Difficult Queries by Leveraging Clusters in Term Graph

Improving Difficult Queries by Leveraging Clusters in Term Graph Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

On the automatic classification of app reviews

On the automatic classification of app reviews Requirements Eng (2016) 21:311 331 DOI 10.1007/s00766-016-0251-9 RE 2015 On the automatic classification of app reviews Walid Maalej 1 Zijad Kurtanović 1 Hadeer Nabil 2 Christoph Stanik 1 Walid: please

More information

Window Extraction for Information Retrieval

Window Extraction for Information Retrieval Window Extraction for Information Retrieval Samuel Huston Center for Intelligent Information Retrieval University of Massachusetts Amherst Amherst, MA, 01002, USA sjh@cs.umass.edu W. Bruce Croft Center

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

UFeed: Refining Web Data Integration Based on User Feedback

UFeed: Refining Web Data Integration Based on User Feedback UFeed: Refining Web Data Integration Based on User Feedback ABSTRT Ahmed El-Roby University of Waterloo aelroby@uwaterlooca One of the main challenges in large-scale data integration for relational schemas

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Walid Magdy, Gareth J.F. Jones Centre for Next Generation Localisation School of Computing Dublin City University,

More information

Building and Annotating Corpora of Collaborative Authoring in Wikipedia

Building and Annotating Corpora of Collaborative Authoring in Wikipedia Building and Annotating Corpora of Collaborative Authoring in Wikipedia Johannes Daxenberger, Oliver Ferschke and Iryna Gurevych Workshop: Building Corpora of Computer-Mediated Communication: Issues, Challenges,

More information

Assignment 1. Assignment 2. Relevance. Performance Evaluation. Retrieval System Evaluation. Evaluate an IR system

Assignment 1. Assignment 2. Relevance. Performance Evaluation. Retrieval System Evaluation. Evaluate an IR system Retrieval System Evaluation W. Frisch Institute of Government, European Studies and Comparative Social Science University Vienna Assignment 1 How did you select the search engines? How did you find the

More information

Self-tuning ongoing terminology extraction retrained on terminology validation decisions

Self-tuning ongoing terminology extraction retrained on terminology validation decisions Self-tuning ongoing terminology extraction retrained on terminology validation decisions Alfredo Maldonado and David Lewis ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin

More information

Popularity of Twitter Accounts: PageRank on a Social Network

Popularity of Twitter Accounts: PageRank on a Social Network Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,

More information

Evaluating an Associative Browsing Model for Personal Information

Evaluating an Associative Browsing Model for Personal Information Evaluating an Associative Browsing Model for Personal Information Jinyoung Kim, W. Bruce Croft, David A. Smith and Anton Bakalov Department of Computer Science University of Massachusetts Amherst {jykim,croft,dasmith,abakalov}@cs.umass.edu

More information

Wrapper: An Application for Evaluating Exploratory Searching Outside of the Lab

Wrapper: An Application for Evaluating Exploratory Searching Outside of the Lab Wrapper: An Application for Evaluating Exploratory Searching Outside of the Lab Bernard J Jansen College of Information Sciences and Technology The Pennsylvania State University University Park PA 16802

More information

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,

More information

Final Report - Smart and Fast Sorting

Final Report - Smart and Fast  Sorting Final Report - Smart and Fast Email Sorting Antonin Bas - Clement Mennesson 1 Project s Description Some people receive hundreds of emails a week and sorting all of them into different categories (e.g.

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval SCCS414: Information Storage and Retrieval Christopher Manning and Prabhakar Raghavan Lecture 10: Text Classification; Vector Space Classification (Rocchio) Relevance

More information

Semantic Website Clustering

Semantic Website Clustering Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic

More information

Recommender Systems. Collaborative Filtering & Content-Based Recommending

Recommender Systems. Collaborative Filtering & Content-Based Recommending Recommender Systems Collaborative Filtering & Content-Based Recommending 1 Recommender Systems Systems for recommending items (e.g. books, movies, CD s, web pages, newsgroup messages) to users based on

More information

Key questions to ask before commissioning any web designer to build your website.

Key questions to ask before commissioning any web designer to build your website. Key questions to ask before commissioning any web designer to build your website. KEY QUESTIONS TO ASK Before commissioning a web designer to build your website. As both an entrepreneur and business owner,

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

Text Categorization (I)

Text Categorization (I) CS473 CS-473 Text Categorization (I) Luo Si Department of Computer Science Purdue University Text Categorization (I) Outline Introduction to the task of text categorization Manual v.s. automatic text categorization

More information

CMO Briefing Google+:

CMO Briefing Google+: www.bootcampdigital.com CMO Briefing Google+: How Google s New Social Network Can Impact Your Business Facts Google+ had over 30 million users in the first month and was the fastest growing social network

More information

Indexing and Query Processing

Indexing and Query Processing Indexing and Query Processing Jaime Arguello INLS 509: Information Retrieval jarguell@email.unc.edu January 28, 2013 Basic Information Retrieval Process doc doc doc doc doc information need document representation

More information

Website Designs Australia

Website Designs Australia Proudly Brought To You By: Website Designs Australia Contents Disclaimer... 4 Why Your Local Business Needs Google Plus... 5 1 How Google Plus Can Improve Your Search Engine Rankings... 6 1. Google Search

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

A Measurement Design for the Comparison of Expert Usability Evaluation and Mobile App User Reviews

A Measurement Design for the Comparison of Expert Usability Evaluation and Mobile App User Reviews A Measurement Design for the Comparison of Expert Usability Evaluation and Mobile App User Reviews Necmiye Genc-Nayebi and Alain Abran Department of Software Engineering and Information Technology, Ecole

More information

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul 1 CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul Introduction Our problem is crawling a static social graph (snapshot). Given

More information

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Zhe Wang Princeton University Jim Gemmell Microsoft Research September 2005 Technical Report MSR-TR-2006-30 Microsoft Research Microsoft

More information

How to Conduct a Heuristic Evaluation

How to Conduct a Heuristic Evaluation Page 1 of 9 useit.com Papers and Essays Heuristic Evaluation How to conduct a heuristic evaluation How to Conduct a Heuristic Evaluation by Jakob Nielsen Heuristic evaluation (Nielsen and Molich, 1990;

More information

Use of Synthetic Data in Testing Administrative Records Systems

Use of Synthetic Data in Testing Administrative Records Systems Use of Synthetic Data in Testing Administrative Records Systems K. Bradley Paxton and Thomas Hager ADI, LLC 200 Canal View Boulevard, Rochester, NY 14623 brad.paxton@adillc.net, tom.hager@adillc.net Executive

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Opinions in Federated Search: University of Lugano at TREC 2014 Federated Web Search Track

Opinions in Federated Search: University of Lugano at TREC 2014 Federated Web Search Track Opinions in Federated Search: University of Lugano at TREC 2014 Federated Web Search Track Anastasia Giachanou 1,IlyaMarkov 2 and Fabio Crestani 1 1 Faculty of Informatics, University of Lugano, Switzerland

More information

21. Search Models and UIs for IR

21. Search Models and UIs for IR 21. Search Models and UIs for IR INFO 202-10 November 2008 Bob Glushko Plan for Today's Lecture The "Classical" Model of Search and the "Classical" UI for IR Web-based Search Best practices for UIs in

More information

An Investigation of Basic Retrieval Models for the Dynamic Domain Task

An Investigation of Basic Retrieval Models for the Dynamic Domain Task An Investigation of Basic Retrieval Models for the Dynamic Domain Task Razieh Rahimi and Grace Hui Yang Department of Computer Science, Georgetown University rr1042@georgetown.edu, huiyang@cs.georgetown.edu

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science 310 Million + Current Domain Names 11 Billion+ Historical Domain Profiles 5 Million+ New Domain Profiles Daily

More information

Taccumulation of the social network data has raised

Taccumulation of the social network data has raised International Journal of Advanced Research in Social Sciences, Environmental Studies & Technology Hard Print: 2536-6505 Online: 2536-6513 September, 2016 Vol. 2, No. 1 Review Social Network Analysis and

More information

An Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst

An Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst An Evaluation of Information Retrieval Accuracy with Simulated OCR Output W.B. Croft y, S.M. Harding y, K. Taghva z, and J. Borsack z y Computer Science Department University of Massachusetts, Amherst

More information

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,

More information

Browser-Oriented Universal Cross-Site Recommendation and Explanation based on User Browsing Logs

Browser-Oriented Universal Cross-Site Recommendation and Explanation based on User Browsing Logs Browser-Oriented Universal Cross-Site Recommendation and Explanation based on User Browsing Logs Yongfeng Zhang, Tsinghua University zhangyf07@gmail.com Outline Research Background Research Topic Current

More information

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,

More information

TREC 2017 Dynamic Domain Track Overview

TREC 2017 Dynamic Domain Track Overview TREC 2017 Dynamic Domain Track Overview Grace Hui Yang Zhiwen Tang Ian Soboroff Georgetown University Georgetown University NIST huiyang@cs.georgetown.edu zt79@georgetown.edu ian.soboroff@nist.gov 1. Introduction

More information

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati Evaluation Metrics (Classifiers) CS Section Anand Avati Topics Why? Binary classifiers Metrics Rank view Thresholding Confusion Matrix Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity,

More information

Building Rich User Profiles for Personalized News Recommendation

Building Rich User Profiles for Personalized News Recommendation Building Rich User Profiles for Personalized News Recommendation Youssef Meguebli 1, Mouna Kacimi 2, Bich-liên Doan 1, and Fabrice Popineau 1 1 SUPELEC Systems Sciences (E3S), Gif sur Yvette, France, {youssef.meguebli,bich-lien.doan,fabrice.popineau}@supelec.fr

More information

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA. Data Mining Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA January 13, 2011 Important Note! This presentation was obtained from Dr. Vijay Raghavan

More information

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016 Advanced Topics in Information Retrieval Learning to Rank Vinay Setty vsetty@mpi-inf.mpg.de Jannik Strötgen jannik.stroetgen@mpi-inf.mpg.de ATIR July 14, 2016 Before we start oral exams July 28, the full

More information

Optimizing Search Engines using Click-through Data

Optimizing Search Engines using Click-through Data Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches

More information