IBM Research - China

Similar documents
Interactive Visual Text Analytics for Decision Making. Shixia Liu Microsoft Research Asia

Understanding Text Corpora with Multiple Facets

Last Week: Visualization Design II

A Web Application to Visualize Trends in Diabetes across the United States

FacetAtlas: Multifaceted Visualization for Rich Text Corpora

iknow Use Cases Michael Br ands Senior Product Manager

New Bee. Samyuktha Sridharan Xuanyi Qi Hanshu Lin

Instructor: Stefan Savev

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.

Use of Social Media Data as a Surveillance Tool. Brenda Curtis, PhD, MSPH Assistant Professor Department of Psychiatry Addictions August 15, 2016

Securing the Internet of Things (IoT) at the U.S. Department of Veterans Affairs

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Interpreting Document Collections with Topic Models. Nikolaos Aletras University College London

Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan

Chapter 6: Information Retrieval and Web Search. An introduction

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

Exploring the Query Expansion Methods for Concept Based Representation

What is this Song About?: Identification of Keywords in Bollywood Lyrics

Web Data mining-a Research area in Web usage mining

MULTIMEDIA ANALYTICS: SYNERGY BETWEEN HUMAN AND MACHINE BY VISUALIZATION

Monitor Qlik Sense sites. Qlik Sense Copyright QlikTech International AB. All rights reserved.

Visual Query Suggestion

DMI Exam PDDM Professional Diploma in Digital Marketing Version: 7.0 [ Total Questions: 199 ]

VISA: A VIsual Sentiment Analysis System

Construction Scheme for Cloud Platform of NSFC Information System

Research and Design of Key Technology of Vertical Search Engine for Educational Resources

Visualization and text mining of patent and non-patent data

Tracking and Evaluating Changes to Address-Based Sampling Frames over Time

Text Mining. Representation of Text Documents

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Learning Objectives for Data Concept and Visualization

MIS2502: Data Analytics Clustering and Segmentation. Jing Gong

A Multiple-stage Approach to Re-ranking Clinical Documents

Clinical Database applications in hospital

Overview of Web Mining Techniques and its Application towards Web

Mining Web Data. Lijun Zhang

Feature LDA: a Supervised Topic Model for Automatic Detection of Web API Documentations from the Web

Oracle9i Data Mining. Data Sheet August 2002

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Question No : 1 Web spiders carry out a key function within search. What is it? Choose one of the following:

Interactive Campaign Planning for Marketing Analysts

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

Developing Focused Crawlers for Genre Specific Search Engines

Mining Web Data. Lijun Zhang

Medical Records Clustering Based on the Text Fetched from Records

Information Retrieval CSCI

Topic Diversity Method for Image Re-Ranking

CLIENT HEALTH QUESTIONNAIRE AND INITIAL SCREENING QUESTIONS HEALTH QUESTIONNAIRE INSTRUCTIONS

Security Control Methods for Statistical Database

APP USER GUIDE Sugar.IQ with Watson

Contents. About this Book...1 Audience... 1 Prerequisites... 1 Conventions... 2

Domain-specific Concept-based Information Retrieval System

Promoting Ranking Diversity for Biomedical Information Retrieval based on LDA

Text Mining: A Burgeoning technology for knowledge extraction

IBM Research Report. Model-Driven Business Transformation and Semantic Web

How GPUs Power Comcast's X1 Voice Remote and Smart Video Analytics. Jan Neumann Comcast Labs DC May 10th, 2017

DATA WAREHOUING UNIT I

Ontology and Hyper Graph Based Dashboards in Data Warehousing Systems

Lesson 3: Building a Market Basket Scenario (Intermediate Data Mining Tutorial)

Improving Difficult Queries by Leveraging Clusters in Term Graph

DIGITALIZATION OF MEDICATION THERAPY MANAGEMENT:

Overview of Big Data

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

VisoLink: A User-Centric Social Relationship Mining

!!!!!! Portfolio Summary!! for more information July, C o n c e r t T e c h n o l o g y

Oracle Endeca Information Discovery

The application of OLAP and Data mining technology in the analysis of. book lending

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

A Measurement Design for the Comparison of Expert Usability Evaluation and Mobile App User Reviews

Ontario Cancer Profiles User Help File

ANNUAL REPORT Visit us at project.eu Supported by. Mission

Free for All! Assessing User Data Exposure to Advertising Libraries on Android

1. Inroduction to Data Mininig

SAS/SPECTRAVIEW Software and Data Mining: A Case Study

Query Disambiguation from Web Search Logs

Developing ArXivSI to Help Scientists to Explore the Research Papers in ArXiv

A Taxonomy of Web Search

A Statistical Method of Knowledge Extraction on Online Stock Forum Using Subspace Clustering with Outlier Detection

NExT++: Center for Extreme Search

retail Free popcorn today cinema All food 20% off women s clothing counter food court

CS-490WIR Web Information Retrieval and Management. Luo Si

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Mapping the library future: Subject navigation for today's and tomorrow's library catalogs

Lecture 2 Map design. Dr. Zhang Spring, 2017

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

Unstructured Data. CS102 Winter 2019

Interactive Visualization of the Stock Market Graph

Trending Words in Digital Library for Term Cloud-based Navigation

Engineering challenges in mhealth systems Octav Chipara

Query Sugges*ons. Debapriyo Majumdar Information Retrieval Spring 2015 Indian Statistical Institute Kolkata

Understanding SSD Reliability in Large-Scale Cloud Systems

An Exploration of Query Term Deletion

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Visual Analytics of Heterogeneous Data in Life Science Applications Hans-Jörg Schulz

An overview of Graph Categories and Graph Primitives

Chapter 3: Data Mining:

Master projects, internships and other opportunities

IBM Advantage: IBM Watson Compare and Comply Element Classification

Transcription:

TIARA: A Visual Exploratory Text Analytic System Furu Wei +, Shixia Liu +, Yangqiu Song +, Shimei Pan #, Michelle X. Zhou*, Weihong Qian +, Lei Shi +, Li Tan + and Qiang Zhang + + IBM Research China, Beijing, China # IBM Research T. J. Watson Center, Hawthorne, NY, USA * IBM Research Almaden Center, San Jose, CA, USA KDD, Washington D.C, Jul 2010 2010

Outline Introduction The TIARA System - System Overview - Analytics - Experiments Demonstration Conclusion

What is TIARA TIARA: Text Insight via Automated Responsive Analytics Text analytics + interactive visualization to visually analyze the topics in text collections and their content changes over time Visually revealing topic strength and content evolution over time

# of docs Interactive Legend Topic Search Box Topic Height Time-sensitive keywords Time TIARA's visual summary of the cause of injury field of the 23,000+ emergency room records from 2002 to 2003

Outline Introduction The TIARA System - System Overview - Analytics - Experiments Demonstration Conclusion

Text Preprocessing Content/Meta Extraction Topic Analysis Text/Topic Indexing Offline Lucene Index (Text, Metadata, Topic, etc.) Online Snippet Generation Time-sensitive Keyword Re-ranking Time-sensitive Keyword Extraction Topic Ranking Topic Summarization JSON Adapter Index Parsing Search Query Parsing Web Service

Analytics: Topic Analysis Goal: Extract topics from a set of documents Solutions - Topic Models (such as Latent Dirichlet Allocation, etc.) - Text Clustering (such as K-means etc.)

Analytics: Topic Ranking Motivations - Help user quickly locate the topics they are mostly interested in Solutions - Topic Content Coverage: Topics covering a significant portion of the corpus content are more important than those covering less content - Topic Variance: Topics appearing in all the documents are to be too generic to be interesting

Analytics: Keyword based Topic Summarization Motivations - Common/general words in topic analysis results (!= stop words!) - LDA on a financial news corpus: common words such as Dow, Jones, Wall, Street etc., are ranked high in many topics because they are relevant to all the topics Solutions - Topic Frequency: if a word occurs frequently in a topic, it is important - Inverse Topic Frequency: if the word also appears in many other topics, the word is not important because it is too common

Analytics: Time-sensitive Keyword Extraction Motivations - Allow user visually analyze content evolutions Solutions - Break the documents into several sub-collections, each of which is associated with a particular time interval Most active time segments Do not break a topic near the peaks of a topic layer - Extract time-sensitive keywords for each sub-collection

Analytics: Time-sensitive Keyword Extraction Solutions (cont ) - Extract time-sensitive keywords for each sub-collection Completeness: whether we can re-cover the original semantics of a topic by combining the semantics associated with each topic segment Distinctiveness: whether we can distinguish one topic segment from another based on their associated keywords

Visualization Visual Design - Stacked graph based visual layout - Augment the stacked graph with layer ordering and tag clouds to generate a keyword based visual text summarization - See Liu et al. (CIKM 2009) for more details Interactions - See our (live) demo

Experiment Studies Data Sets - Email data set: Personal email collection with 8326 emails - Healthcare data set: Emergency room data set (ref. NHAMCS :National Hospital Ambulatory Medical Care Survey) containing 23,501 patient records from 2002 to 2003 Evaluation Plans - Topic ranking - Keyword based topic summarization - Topic segmentation and time-sensitive keyword selection - TIARA s online response time Song et al. (CIKM 2009)

Experiment Studies Evaluation Criteria - Completeness: F 1 measure between topic keywords and combination of time-sensitive keywords - Distinctiveness: KL-divergences of time-sensitive keyword distributions among different segments Baseline System - Select time-sensitive keywords based on term frequencies

Experiment Results Results on Email data (mean+std) Results on Healthcare data (mean+std) 1. TIARA performs better than the baseline system 2. The topics derived from the emails evolve more quickly than those derived from the emergency room records

Experiment Results Results on Healthcare data Results on Email data 1. The most timeconsuming procedures are Search and Parse 2. Sampling, select top N documents, or indexing only the top N topic keywords

Outline Introduction The TIARA System - System Overview - Analytics - Experiments Demonstration Conclusion

We have shown the live demo on Tuesday, 27 July

Application: Visual Email Summarization Who - Email owner: review his work (projects etc.) in 2008 Data - Personal email collection with 8326 emails How could TIARA help him?

Topic Evolution Working on these projects before July TIARA became important! TIARA s visual summary of 8,000+ emails.

Show height

Zoom into details

Drill down into the emails

Click shimei pan

Search cobra

Application: Visual Patient Record Analysis Who - Alice: A government officer responsible for disease control and prevention, who is investigating the major causes and diagnosis for residential illnesses countrywide Data - Healthcare data: emergency room data set containing 23,501 patient records from 2002 to 2003 - Free text fields: cause of injury, reason for visit, and diagnosis - Structured How fields: could patient TIARA sex, age, help etc. her?

Application: Visual Patient Record Analysis Interaction Legend Data navigation TIARA's visual summary of the cause of injury field of the 23,000+ emergency room records from 2002 to 2003

Step 1: Select reason for visit and diagnosis Cutting, twisting and drug abuse are the dominant causal factors within the period of data corpus Content correlation: { open, fingr, cut } Content correlation: { bone, fracture } the number of cases falls in each winter and rebounds from the spring

Step 2: Time zoom IBM Research in (Feb - China 2002 to Jan 2003 within a whole year) Step 3: Select the topic indicating vertigo illnesses from the full topic list (interactive legend) Step 4: Click the vertigo topic trend to expand the view and show correlations Each time segment corresponds to one season of the year

In spring, the patient suffers from adverse effect of drugs as well as some common diseases like diabetes and essential hypertension. While in summer, the high temperature which causes heat exhaustion turns out dominating. Further in winter, the same symptom may be ascribed to complications of common illnesses. The patterns in autumn are quite outstanding, where more urticaria and fever is found.

Step 5: Select the cause of injury field Step 6: Select the patient sex Topmost cut injury happens mostly to men as nearly all the there are in dark green Men tend to twist their ankle during heavy sports including basketball, football and soccer. Women generally get their ankles hurt during walking, running in the porch or missing their steps downstairs

Step 5: Select the cause of injury field Step 6: Select the patient sex The cause of injury for men mainly refers to hard drugs, such as cocaine, opioid, cannabis, as well as alcohol. While for women, drugs for therapy, such as xanax, aspirin and tylenol are the principle causes Suicide: kill and attempt almost exclusively for women

Outline Introduction (What) The TIARA System (How) - System Overview - Analytics - Experiments Demonstration Conclusion

Conclusions TIARA to help users visually view, explore and analyze large collections of documents Lessons from our case studies and initial deployment - TIARA is even more effective for business professionals - It is more effective for those who have some background knowledge (familiar with) on the data What a topic is?

Future Work Use any categorized facet as the topic layers - Topics are difficult to be understood by common users - Classification labels, companies, hotels, ages, pos (part-ofspeech), sentiment orientations, etc. Visualize more meaningful text unit - Besides keywords, we can show more meaningful text units: time-sensitive NEs (named entities), phrases, etc. - More NLP/IE and mining techniques are expected Large scale data sets - Sampling, select top N documents, or indexing only the top N topic keywords

If you want to see our live demo, please contact us! THANK YOU