InfoQ: Computational Assessment of Information Quality on the Web


Davide Ceolin 1*, Chantal van Son 2, Lora Aroyo 2, Julia Noordegraaf 3, Ozkan Sener 2, Robin Sharma 2, Piek Vossen 2, Serge ter Braake 3, Lesia Tkacz 2, Inger Leemans 2, Rens Bod 3

1 Centrum voor Wiskunde en Informatica (CWI); 2 Vrije Universiteit Amsterdam; 3 Universiteit van Amsterdam
* corresponding author (davide.ceolin@cwi.nl / @davideceolin)

As a result of the democratic nature of the Web, people can contribute many types of information by means of blog posts, articles, tweets, and more. The resulting wealth of information is potentially useful to a wide variety of users, ranging from laymen to scholars. However, the published documents vary greatly in quality and express a multitude of perspectives on different topics. Online information may be provided by incompetent authors or by malicious ones. Additionally, the sharing and liking mechanisms of social media introduce popularity as a currency (numbers of retweets, reposts, likes, etc.), which creates an additional incentive to misuse the democracy of the Web: to generate artificial popularity for a particular perspective, or to disguise incompetence behind professional-looking websites (low-quality documents can easily be crafted to appear credible and to gain popularity). Fake news is a noteworthy example: documents containing low-quality (e.g., inaccurate) information, often created with misleading intentions, that nevertheless gain high popularity. This adds another layer of uncertainty regarding information quality. To complicate things further, quality is not monolithic or binary: it is hardly possible to judge documents as good or bad in absolute terms. The overall quality of a document depends on the topic, on the user who assesses it, and on the intended uses of the document.
However, it is possible to decompose quality into objective dimensions or aspects. Combining these dimensions increases awareness of the aspects that contribute to information quality, as well as of the relation between quality and perspectives [1-6]. In this paper, we introduce InfoQ, a quality assessment tool developed in the context of the QuPiD2 project (Quality and Perspectives in Deep Data, https://qupid-project.net/). QuPiD2 investigates methods and tools for computational support to capture, model and assess the diversity in quality of online information and the multitude of perspectives (i.e., beliefs, opinions and world views) reflected in it. To this end, we explore the perception factors and linguistic phenomena that reflect a certain perspective or influence the perceived quality of a text, and

develop tools to automatically detect perspectives and assess quality. InfoQ offers the following quality assessment functionalities:

- document-centric assessment: for a given URL, InfoQ provides a detailed analysis of its information quality;
- topic-centric assessment: for a given topic, InfoQ provides an in-depth comparative analysis of the quality of all the documents related to that topic.

The quality assessment performed by InfoQ provides a comprehensive, exhaustive and multi-perspective view along multiple quality dimensions (precision, trustworthiness, accuracy, neutrality, readability, and relevance with respect to a given topic). To address the intrinsic subjectivity of quality assessment, InfoQ is based on a symbiotic pipeline that brings together humans and machines to gather information assessments and the factors that impact them. InfoQ's machine learning models are trained on (1) quality assessments provided by experts and crowds, and rely on (2) automatically extracted document features, such as NLP features (e.g., sentiment, named entities), provenance features (e.g., source type), web-based technical features (e.g., latency, i.e., website speed), and crowdsourced features (e.g., Web of Trust trustworthiness scores, http://mywot.com). The models identify correlations between document features and human assessments, and thereby allow any Web document to be assessed. As machine learning algorithms, we employ multi-label regression and Support Vector Machines. Figure 1 gives an overview of InfoQ.

Figure 1. Overview of InfoQ, our online quality assessment tool.
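The learning setup described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, uses a `MultiOutputRegressor` wrapping an `SVR` as one plausible realization of "multi-label regression with Support Vector Machines", and all feature values and assessment scores below are invented for the example.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Illustrative document features (values are made up):
# [sentiment, number of named entities, latency in ms, WoT score]
X_train = np.array([
    [0.8, 12, 120, 90],
    [-0.3, 2, 800, 35],
    [0.1, 7, 300, 70],
    [0.5, 9, 150, 85],
])

# Illustrative expert/crowd assessments per quality dimension:
# [accuracy, trustworthiness, neutrality, readability]
y_train = np.array([
    [0.90, 0.85, 0.70, 0.80],
    [0.30, 0.20, 0.40, 0.50],
    [0.60, 0.65, 0.60, 0.70],
    [0.85, 0.80, 0.75, 0.80],
])

# One SVR is fitted per quality dimension, approximating
# multi-label regression over the assessment vector.
model = MultiOutputRegressor(SVR(kernel="rbf"))
model.fit(X_train, y_train)

# Assess a new, unseen document from its extracted features.
new_doc = np.array([[0.4, 8, 200, 80]])
scores = model.predict(new_doc)  # one predicted score per dimension
print(scores.shape)  # (1, 4)
```

In practice, such a model is only as good as the human assessments it is trained on, which is why the pipeline above pairs the learner with expert- and crowd-provided labels.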

Ultimately, the quality assessments produced by InfoQ are presented to the user in a radar chart visualization. On the one hand, this allows the user to obtain a summary of the quality of a document (or of a group of documents) without having to introduce artificial aggregations. Figure 2 shows a comparison of five documents of different quality. This overview does not detail the exact meaning of each dimension of the radar graph, but it allows a quick comparison among the assessed documents (the smaller the colored area in the graph, the lower the quality of the document, and vice versa).

Figure 2. Example of a high-level view of a list of assessed documents.

On the other hand, such a visualisation allows the final user to investigate the quality of a given document further. When restricting the focus to a smaller number of documents, the user can see precisely how a document scores in each dimension, or how two documents compare to each other (see Figure 3). This increases the user's awareness of the quality of documents and, in turn, allows users to decide whether documents meet their contextual and subjective requirements, without, for instance, having to decide which (combination of) quality dimensions determines the overall quality of a document.

InfoQ is a tool for computationally assessing the quality of Web information. The tool is currently running on local instances and is being deployed at http://qupid.d2s.labs.vu.nl. It has been tested on a group of 50 documents regarding the vaccination debate (selected to represent a small but heterogeneous sample consisting of blog posts, news articles, documents from public authorities, etc.), where it shows promising performance (up to 90% accuracy) that reflects our previous work [3,6]. We also envision three main future developments for InfoQ.
First, the extension of the domains and topics covered (currently the tool allows any document to be assessed, but the accuracy of such assessments is under evaluation). Second, the improvement of the computation speed. Lastly, the personalisation

of the quality assessments, such that different users can obtain assessments matching their specific needs and requirements.

Figure 3. Example comparison of the quality of two documents.

References

[1] Chantal van Son, Emiel van Miltenburg, and Roser Morante (2016). Building a Dictionary of Affixal Negations. In Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM 2016), Osaka, Japan.
[2] Antske Fokkens, Serge ter Braake, Isa Maks, and Davide Ceolin (2016). On the Semantics of Concept Drift: Towards Formal Definitions of Semantic Change. In Drift-a-LOD: Detection, Representation and Management of Concept Drift in Linked Open Data, workshop at EKAW, Bologna, Italy, 20 November 2016.
[3] Davide Ceolin, Julia Noordegraaf, and Lora Aroyo (2016). Capturing the Ineffable: Collecting, Analysing, and Automating Web Document Quality Assessment. In EKAW 2016: Knowledge Engineering and Knowledge Management, 83-97.
[4] Davide Ceolin, Lora Aroyo, and Julia Noordegraaf. Identifying and Classifying Uncertainty Layers in Web Document Quality Assessment. In International Workshop on Uncertainty Reasoning for the Semantic Web, 61-64.
[5] J.M. van der Zwaan, Maarten van Meersbergen, Antske Fokkens, Serge ter Braake, Inger Leemans, Erika Kuijpers, Piek Vossen, and Isa Maks (2016). Storyteller: Visualizing Perspectives in Digital Humanities Projects. In Bozic, Mendel-Gleason, Debruyne and O'Sullivan (eds.), 2nd IFIP Workshop on Computational History and Data-Driven Humanities.

[6] Davide Ceolin, Julia Noordegraaf, Lora Aroyo, and Chantal van Son (2016). Towards Web Documents Quality Assessment for Digital Humanities Scholars. In Proceedings of the 8th ACM Conference on Web Science (WebSci '16). ACM, New York, NY, USA, 315-317.
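As a closing illustration of the radar encoding used in Figures 2 and 3: mapping per-dimension quality scores to a radar polygon, and the reading "smaller colored area, lower quality", can be sketched in a few lines. This is a generic geometric sketch with invented scores, not InfoQ's visualization code.

```python
import math

# Five quality dimensions, as in InfoQ's radar charts: precision,
# trustworthiness, accuracy, neutrality, readability.
# The score values below are purely illustrative.
doc_high = [0.9, 0.8, 0.85, 0.7, 0.9]
doc_low = [0.4, 0.3, 0.5, 0.6, 0.45]

def radar_polygon(scores, max_score=1.0):
    """Spread the dimensions evenly around a circle, map each score to a
    vertex at the corresponding radius, and repeat the first vertex to
    close the polygon."""
    n = len(scores)
    pts = []
    for i, s in enumerate(scores):
        angle = 2 * math.pi * i / n
        r = s / max_score
        pts.append((r * math.cos(angle), r * math.sin(angle)))
    return pts + pts[:1]

def polygon_area(pts):
    """Shoelace formula over a closed vertex list."""
    return 0.5 * abs(sum(x1 * y2 - x2 * y1
                         for (x1, y1), (x2, y2) in zip(pts, pts[1:])))

# A document that scores higher on every dimension encloses a larger
# radar area, matching the visual reading described in the paper.
assert polygon_area(radar_polygon(doc_high)) > polygon_area(radar_polygon(doc_low))
```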