Comparing Sentiment Engine Performance on Reviews and Tweets


Comparing Sentiment Engine Performance on Reviews and Tweets
Emanuele Di Rosa, PhD
CSO, Head of Artificial Intelligence, Finsa s.p.a.
emanuele.dirosa@finsa.it
www.app2check.com
www.finsa.it

Motivations and Goals

Motivations

Accurately computing the sentiment expressed in a text is a capability in wide demand in the market, and ready-to-use APIs with pre-trained sentiment classifiers are available. However, sentiment engines asked to classify a text as positive, negative or neutral do not reach 100% accuracy. They misclassify in many cases, even cases that are straightforward for humans; this affects both research and industrial tools. Ideally, we need industrial sentiment engines that provide high average performance across multiple sources and domains.

On the one hand, academic research advances are visible, and international challenges are organized each year that ask researchers to train or fine-tune their engines to work well on specific tasks (e.g. polarity classification, subjectivity or irony detection), specific sources/domains (e.g. tweets about politics), and specific languages (English, Italian, Arabic, etc.).

On the other hand, tools built for industry have to meet the needs of the market, which asks for engines that can receive any textual source as input (tweets, reviews, etc.) and be applied in general-purpose applications.

Goals

1. Sentiment engine performance: perceived by humans VS experimentally measured.
2. What is the performance gap between industrial general-purpose engines and research engines, given that the latter are built to show high performance in specific settings (source, domain, language, task, etc.)? Are there differences in performance when analyzing tweets or reviews in different languages (e.g. English and Italian)?

Outline

1. Motivations and goals
2. Sentiment Engine (mis)classifications
   - On simple cases
   - On difficult cases: «Cross-domain» Sentiment Classification
3. Experimental Evaluation of Research and Industrial Engines
   - Results on Tweets
   - Results on Product Reviews
4. Conclusions

Sentiment Engine (mis)classifications

Sentiment Engines on simple classifications

We consider some industrial and research sentiment engines that provide an online demo.

Research engines:
- ifeel Platform (running 18 research tools implementing different methods)
- Stanford Deep Learning

Industrial tools:
- IBM Watson
- Google Cloud Natural Language API (Google CNL)
- Finsa X2Check

We test 3 simple sentences with «clear» sentiment classification:
- A negative sentence
- A positive sentence
- A negative («difficult») sentence
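For reference, such a query can also be scripted instead of using the demo. A minimal sketch, assuming the google-cloud-language Python client and configured credentials; the ±0.25 thresholds that discretize the continuous score into the three polarity classes are our illustrative choice, not part of the API:

```python
# Minimal sketch: querying Google CNL for sentiment and mapping the
# continuous score onto the three polarity classes used in this talk.
# Assumes google-cloud-language is installed and
# GOOGLE_APPLICATION_CREDENTIALS is configured.
from google.cloud import language_v1

def classify_sentiment(text: str) -> str:
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    score = client.analyze_sentiment(
        request={"document": document}
    ).document_sentiment.score
    # Illustrative thresholds (assumption, not part of the API).
    if score > 0.25:
        return "positive"
    if score < -0.25:
        return "negative"
    return "neutral"

for sentence in ("I hate this game", "I like this game"):
    print(sentence, "->", classify_sentiment(sentence))
```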

Engines on simple classifications: ifeel Platform. Test sentence: «I hate this game». Screenshots date: June 10, 2017

Engines on simple classifications: Stanford DL. Test sentence: «I hate this game». Screenshots date: June 10, 2017

Engines on simple classifications: IBM Watson. Test sentence: «I hate this game». Screenshots date: June 10, 2017

Engines on simple classifications: ifeel Platform. Test sentence: «I like this game». Screenshots date: June 10, 2017

Engines on simple classifications: Stanford DL. Test sentence: «I like this game». Screenshots date: June 10, 2017

Engines on simple classifications: IBM Watson. Test sentence: «I like this game». Screenshots date: June 10, 2017

Engines on simple classifications: ifeel Platform. Test sentence: «I just connected my game with my facebook account and instead of saving the progress I have lost all my progress and it came on Level 1 although I was on lvl 98 Please help!!!!!». Screenshots date: June 10, 2017

Engines on simple classifications: Stanford DL. Same test sentence. Screenshots date: June 10, 2017

Engines on simple classifications: IBM Watson. Same test sentence. Screenshots date: June 10, 2017

Observations

We showed objective examples of sentiment misclassification by popular research and industrial engines, even on cases that are straightforward for humans.

However, these examples do not allow us to generalize the results or to rank the engines involved; for that, a wide experimental analysis is needed.

Performance in sentiment polarity classification depends on many factors involving the classifier's training (source) set and test (target) set. Some sentiment classifiers are built to perform better on a specific:
- Topic domain (e.g. movies, politics)
- Textual source (tweets, reviews, etc.)
- Language

Cross-domain classification and difficult cases

Cross-domain classification and domain-adaptation

How do we classify the polarity of the following text? "Candy crush is my addiction, I love it!"

This is a case of domain-dependent sentiment. Moreover, it is well known in the literature that:
- Users often use different words when they express sentiment in different domains [Pan S.J., et al. 2010]
- Classifiers trained on one domain may perform poorly on another domain [Pang, et al. 2008]
- The cross-domain sentiment analysis research area works on domain-adaptation techniques [Blitzer, et al. 2007], [Pan S.J., et al. 2010], [Liu B., 2012], [Wu F., et al. 2016], [Wu F., et al. 2017]
- Sometimes domain-adaptation may even lead to worse performance [Pan S.J., et al. 2010]

Document-level VS Sentence-level VS Entity-level SA

"I like this game but after the ios update I get a crash when the app starts. Please do something!!"

It is probably impossible to agree on its overall (document-level) sentiment classification.

It is known in the literature [1] that groups of humans, when evaluating sentiment (polarity over three classes), agree in about 80% of cases, since there can be controversial cases due to the subjective nature of the qualitative evaluation.

[1] T. Wilson, J. Wiebe, P. Hoffmann. Recognizing Contextual Polarity in Phrase-level Sentiment Analysis. In proc. of HLT 2005.

Engine results on this sentence, including App2Check's entity-level output. Screenshots date: June 10, 2017
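To illustrate why the document-level polarity is so contentious here while finer-grained analysis is not, consider this toy lexicon-based scorer; the word lists are purely hypothetical and not any engine's actual lexicon:

```python
# Toy illustration: the positive and negative cues cancel out at
# document level, while splitting on the contrast marker "but"
# recovers one positive and one negative segment.
POSITIVE = {"like", "love", "great"}   # hypothetical lexicon
NEGATIVE = {"crash", "hate", "bug"}    # hypothetical lexicon

def polarity(segment: str) -> str:
    words = segment.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

review = ("I like this game but after the ios update "
          "I get a crash when the app starts. Please do something!!")
print("document-level:", polarity(review))   # +1 and -1 cancel -> neutral
for segment in review.split(" but "):
    print("segment-level:", polarity(segment))
```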

Experimental Evaluation of Research and Industrial Engines

Experimental Evaluation

In order to fairly compare engine performance, we need:
- a gold standard
- reference benchmarks on multiple sources and mixed domains
- benchmarks in more than one language

Tweets (a worst case for industrial engines):
- Benchmarks and engines from Evalita SentiPolC 2016 for the Italian language
- Benchmarks and engines from SemEval 2017 for the English language

Reviews (a worst case for research engines):
- Amazon product reviews: benchmarks from ESWC Semantic Sentiment Analysis 2016

Experimental Evaluation

About pre-trained, ready-to-use industrial sentiment APIs: most commercial SA engines do not allow, in their terms of service, using their APIs for an experimental comparative analysis. Yet the goal of such tools is to measure user opinion and, as with every measurement tool, being aware of their accuracy is fundamental. This is even more important in sentiment analysis since, as we recalled, pre-trained engines may show significantly different performance depending on the target test set.

We considered industrial engines that have a public sentiment API and no explicit restriction in the terms of service against a comparative analysis.

General-purpose APIs:
- Google CNL
- Finsa X2Check

X2Check adaptations, specifically trained on the target source:
- App2Check, specifically trained on app reviews
- Tweet2Check, specifically trained on tweets
- Amazon2Check, specifically trained on Amazon reviews

Evaluation on Tweets in Italian

The SentiPolC combined F-score averages, for each polarity dimension, the F1 of the two membership classes (0 and 1):

$F_1^{\mathrm{pos}} = \frac{F_1^{\mathrm{pos},0} + F_1^{\mathrm{pos},1}}{2}, \qquad F_1^{\mathrm{neg}} = \frac{F_1^{\mathrm{neg},0} + F_1^{\mathrm{neg},1}}{2}, \qquad F = \frac{F_1^{\mathrm{pos}} + F_1^{\mathrm{neg}}}{2}$

F-score gaps annotated on the table, relative to the best research engine: 3.4% and 2.6%; 1.3% and 0.5%; 10.1% and 3.8%.

Tab 1: Evaluation on 2K tweets in Italian from Evalita SentiPolC 2016. Industrial engines added to the official results. Industrial engines VS research engines specifically trained/tuned on the given domain/source.
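A minimal sketch of this combined F-score, assuming each tweet carries 0/1 gold and predicted labels for the positive and for the negative dimension (illustrative, not the official SentiPolC scorer):

```python
# Illustrative computation of the SentiPolC combined F-score.
# Each label is 0/1 membership in the positive (or negative) dimension.

def f1_for_class(gold, pred, cls):
    """F1 treating `cls` (0 or 1) as the target class."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def dimension_f1(gold, pred):
    # Average the F1 of the two membership classes (0 and 1).
    return (f1_for_class(gold, pred, 0) + f1_for_class(gold, pred, 1)) / 2

def sentipolc_f(gold_pos, pred_pos, gold_neg, pred_neg):
    return (dimension_f1(gold_pos, pred_pos) + dimension_f1(gold_neg, pred_neg)) / 2

# Toy example: four tweets.
print(sentipolc_f([1, 0, 1, 0], [1, 0, 0, 0],   # positive dimension
                  [0, 1, 0, 0], [0, 1, 0, 1]))  # negative dimension
```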

Evaluation on Tweets in English

SemEval 2017 Task 4, subtask A measures the macro-averaged recall over the three classes and the macro F1 over the positive and negative classes:

$\mathrm{AvgRec} = \frac{1}{3}\left(R^{P} + R^{N} + R^{U}\right), \qquad \mathrm{AvgF}_1^{PN} = \frac{1}{2}\left(F_1^{P} + F_1^{N}\right)$

AvgF1 gaps annotated on the table, relative to the best research engine: 12.4% and 4.7%.

Tab 2: Evaluation on 12,284 tweets in English from SemEval 2017, Task 4, subtask A. Industrial engines added to the official results. Industrial engines VS research engines specifically trained/tuned on the given domain/source.
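These two measures can be reproduced, for example, with scikit-learn's macro-averaged scores restricted to the relevant labels; a minimal sketch assuming string labels "positive"/"negative"/"neutral":

```python
# Illustrative computation of the SemEval 2017 Task 4A measures.
from sklearn.metrics import f1_score, recall_score

def semeval_scores(gold, pred):
    # AvgRec: recall macro-averaged over all three classes.
    avg_rec = recall_score(
        gold, pred, labels=["positive", "negative", "neutral"],
        average="macro", zero_division=0,
    )
    # AvgF1_PN: F1 macro-averaged over positive and negative only.
    avg_f1_pn = f1_score(
        gold, pred, labels=["positive", "negative"],
        average="macro", zero_division=0,
    )
    return avg_rec, avg_f1_pn

gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "negative"]
print(semeval_scores(gold, pred))
```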

Evaluation on Amazon Product Reviews in English

The per-class F1 is combined into the macro F1 over the positive and negative classes:

$F_1^{P} = \frac{2\,P^{P} R^{P}}{P^{P} + R^{P}}, \qquad MF_1 = \frac{1}{2}\left(F_1^{P} + F_1^{N}\right)$

MF1 deltas annotated on the table: 4.1% and 23.2%.

Tab 5: Evaluation on about 200,000 generic Amazon product reviews in English from ESWC Semantic Sentiment Analysis 2016. Industrial engines VS research engines not specifically trained on the target domain/source.
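As a worked example with illustrative numbers (not taken from Tab 5): with $P^{P} = 0.80$, $R^{P} = 0.70$ and $F_1^{N} = 0.65$,

$F_1^{P} = \frac{2 \cdot 0.80 \cdot 0.70}{0.80 + 0.70} \approx 0.747, \qquad MF_1 = \frac{0.747 + 0.65}{2} \approx 0.698$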

Overall Results

In our experimental evaluation, we showed that:

Taking as reference the best-performing research tool specifically trained on the target source (the worst case for industrial APIs: tweets from SemEval 2017 and Evalita SentiPolC 2016):
- X2Check is within 3.4% of its F-score on the Italian benchmarks and within 11.6% of its Avg-F1 on the English benchmarks
- Google CNL is within 13.5% of its F-score on Italian and within 16.3% of its Avg-F1 on English
- App2Check [not tuned on tweets] is within 9.7% of its F-score on Italian and within 16.9% on English

Taking as reference the best-performing research tool not specifically trained on the target source (the worst case for research engines: Amazon product reviews from ESWC SSA 2016), on Amazon product reviews in English:
- X2Check shows a macro-F1 score 23.2% higher than the best research tool
- Google CNL shows a macro-F1 score 19.1% higher than the best research tool
- App2Check [not tuned on Amazon reviews] is within 13.3% of the best research tool's MF1

Conclusions

Sentiment analysis is still a very complex task, and evaluating engine results on individual examples, relying just on «human perception», is not a scientific approach and leads to wrong conclusions about engine performance. However, such «manual inspection» may help to focus on an engine's defects, understand why some misclassifications occur, and better design/improve the engine.

It is necessary to evaluate the performance of «general purpose» (pre-trained) sentiment engine APIs through an extensive experimental analysis on multiple textual sources and domains, taking into account the overall average KPIs (accuracy, macro-F1 score, etc.).

Since sentiment engines are measurement tools, it would be better if companies provided, together with the pre-trained models, some performance indicators on specific settings (source, topic domain, language, etc.), or at least let buyers perform a comparative analysis.

Domain/source-specific models show in general better results than pre-trained «general purpose» classifiers. However, applying domain-adaptation techniques, or recognizing the best specialized model to apply, may reduce misclassifications on the target domain.

Thank you

Emanuele Di Rosa, PhD
CSO, Head of Artificial Intelligence, Finsa s.p.a.
emanuele.dirosa@finsa.it
www.app2check.com
www.finsa.it

Sentiment Engines on individual examples

Engines on simple classifications: X2Check. Test sentence: «I hate this game». Screenshots date: June 10, 2017

Engines on simple classifications: X2Check. Test sentence: «I like this game». Screenshots date: June 10, 2017

Engines on simple classifications: X2Check. Screenshots date: June 10, 2017