Mining Trusted Information in Medical Science: An Information Network Approach

Similar documents
Construction and Mining of Heterogeneous Information Networks: Will It Be a Key to Web-Aged Information Management and Mining

ResearchNet and NewsNet: Two Nets That May Capture Many Lovely Birds

Data Mining: Dynamic Past and Promising Future

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Jiawei Han University of Illinois at Urbana Champaign

User Guided Entity Similarity Search Using Meta-Path Selection in Heterogeneous Information Networks

c 2012 by Yizhou Sun. All rights reserved.

Graph Classification in Heterogeneous

CS249: ADVANCED DATA MINING

HetPathMine: A Novel Transductive Classification Algorithm on Heterogeneous Information Networks

IVIS: SEARCH AND VISUALIZATION ON HETEROGENEOUS INFORMATION NETWORKS YINTAO YU THESIS

DATA MINING RESEARCH: RETROSPECT AND PROSPECT

Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar

CSE5243 INTRO. TO DATA MINING

Chapter 1 Introduction

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks

Introduction to Text Mining. Hongning Wang

Integrating Meta-Path Selection with User-Preference for Top-k Relevant Search in Heterogeneous Information Networks

Personalized Entity Recommendation: A Heterogeneous Information Network Approach

Citation Prediction in Heterogeneous Bibliographic Networks

Mining Heterogeneous Information Networks

RankCompete: Simultaneous Ranking and Clustering of Information Networks

Mining Knowledge from Databases: An Information Network Analysis Approach

WE know that most real systems usually consist of a

Chapter 1, Introduction

Keywords Data alignment, Data annotation, Web database, Search Result Record

Query-Based Outlier Detection in Heterogeneous Information Networks

Introduction to Information Retrieval. Hongning Wang

CS425: Algorithms for Web Scale Data

A. Papadopoulos, G. Pallis, M. D. Dikaiakos. Identifying Clusters with Attribute Homogeneity and Similar Connectivity in Information Networks

CS 412 Intro. to Data Mining

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

Mining Query-Based Subnetwork Outliers in Heterogeneous Information Networks

Query Independent Scholarly Article Ranking

Tutorial on Mining Heterogeneous Information Networks

Graph Cube: On Warehousing and OLAP Multidimensional Networks

Analysis of Large Graphs: TrustRank and WebSpam

Scholarly Big Data: Leverage for Science

TupleRank: Ranking Relational Databases using Random Walks on Extended K-partite Graphs

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Data Mining Concepts & Tasks

Link Prediction across Networks by Biased Cross-Network Sampling

CS145: INTRODUCTION TO DATA MINING

3 Data, Data Mining. Chengkai Li

Personalized Recommendations using Knowledge Graphs. Rose Catherine Kanjirathinkal & Prof. William Cohen Carnegie Mellon University

Chapter 2 Survey of Current Developments

Heterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts

c 2014 by Xiao Yu. All rights reserved.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Part I: Data Mining Foundations

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Object Distinction: Distinguishing Objects with Identical Names

OLAP on Information Networks: a new Framework for Dealing with Bibliographic Data

Introduction. IST557 Data Mining: Techniques and Applications. Jessie Li, Penn State University

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Introduction to Data Mining

Basic Data Mining Technique

Data Mining Jay Urbain, PhD. Credits: Nazli Goharian, Jiawei Han, Micheline Kamber, and Jian Pei

Medical Records Clustering Based on the Text Fetched from Records

CS249: ADVANCED DATA MINING

Clustering using Topic Models

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3

CSE-4412: Data Mining

Introduction to Data Mining and Data Analytics

An Overview of Data Warehousing and OLAP Technology

Meta Paths and Meta Structures: Analysing Large Heterogeneous Information Networks

Multi-label Collective Classification using Adaptive Neighborhoods

State of the Art and Trends in Search Engine Technology. Gerhard Weikum

An Improved Apriori Algorithm for Association Rules

Meta Path-Based Collective Classification in Heterogeneous Information Networks

Searching complex graphs

Gurpreet Kaur 1, Naveen Aggarwal 2 1,2

Combining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines

CS145: INTRODUCTION TO DATA MINING

MetaGraph2Vec: Complex Semantic Path Augmented Heterogeneous Network Embedding

Text Analytics (Text Mining)

CS-490WIR Web Information Retrieval and Management. Luo Si

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei i Han 1 University of Illinois, IBM TJ Watson.

TriRank: Review-aware Explainable Recommendation by Modeling Aspects

COMP 465: Data Mining Classification Basics

Semantic Scholar. ICSTI Towards a More Efficient Review of Research Literature 11 September

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

(Big Data Integration) : :

GraphGAN: Graph Representation Learning with Generative Adversarial Nets

Text Analytics (Text Mining)

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

Enhanced Trustworthy and High-Quality Information Retrieval System for Web Search Engines

Céline Robardet. December 9, Habilitation à diriger des recherches

CS490W: Web Information Search & Management. CS-490W Web Information Search and Management. Luo Si. Department of Computer Science Purdue University

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012

Query- vs. Crawling-based Classification of Searchable Web Databases

Data Mining: Concepts and Techniques

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

Semantic Web and Web2.0. Dr Nicholas Gibbins

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Transcription:

Mining Trusted Information in Medical Science: An Information Network Approach Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign Collaborated with many, especially Yizhou Sun, Ming Ji, Chi Wang, Tim Weninger, Xiaoxin Yin, Bo Zhao Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), Microsoft, IBM, Yahoo!, Google, HP Lab & Boeing November 28, 2012 1

Outline Why Information Network Approach for Medical and Health Informatics? Exploring Rich Semantics of Structured Heterogeneous Networks From RankClus to RankClass A PubMed Exploration Information Trust Analysis: An Info. Network Approach From Truth Finder to Latent Truth Model Conclusions 2

The Real World: Heterogeneous Networks Multiple object types and/or multiple link types Movie Studio Venue Paper Author DBLP Bibliographic Network Actor Movie Director The IMDB Movie Network The Facebook Network Homogeneous networks are information loss projection of heterogeneous networks! Directly mining information-richer heterogeneous networks

What Can be Mined from Heterogeneous Networks? DBLP: A Computer Science bibliographic database A sample publication record in DBLP (>1.8 M papers, >0.7 M authors, >10 K venues), Knowledge hidden in DBLP Network How are CS research areas structured? Who are the leading researchers on Web search? What are the most essential terms, venues, authors in AI? Who are the peer researchers of Jure Leskovec? Whom will Christos Faloutsos collaborate with? Which types of relationships are most influential for an author to decide her topics? How was the field of Data Mining emerged or evolving? Which authors are rather different from his/her peers in IR? Mining Functions Clustering Ranking Classification + Ranking Similarity Search Relationship Prediction Relation Strength Learning Network Evolution Outlier/anomaly detection 4

Outline Why Information Network Approach for Medical and Health Informatics? Exploring Rich Semantics of Structured Heterogeneous Networks From RankClus to RankClass A PubMed Exploration Information Trust Analysis: An Info. Network Approach From Truth Finder to Latent Truth Model Conclusions 5

RankClus: Algorithm Framework Initialization Randomly partition Repeat Ranking Sub-Network Ranking objects in each sub-network induced from each cluster Generating new measure space SIGMOD VLDB EDBT KDD ICDM SDM AAAI ICML Tom Mary Alice Bob Cindy Tracy Jack Mike Lucy Jim SIGMOD VLDB EDBT AAAI ICML SDM ICDM Ranking KDD Ranking Objects Clustering Estimate mixture model coefficients for each target object Adjusting cluster Until stable 6

NetClus on DBLP: Database System Cluster database 0.0995511 system 0.0678563 data 0.0214893 query 0.0133316 management 0.00850744 object 0.00837766 relational 0.0081175 VLDB 0.318495 SIGMOD Conf. 0.313903 ICDE 0.188746 PODS 0.107943 EDBT 0.0436849 Surajit Chaudhuri 0.00678065 Michael Stonebraker 0.00616469 Michael J. Carey 0.00545769 C. Mohan 0.00528346 David J. DeWitt 0.00491615 Hector Garcia-Molina 0.00453497 H. V. Jagadish 0.00434289 David B. Lomet 0.00397865 Rank-Based Clustering of Multimedia Data RankCompete: Organize your photo album automatically! 7

Classification: Knowledge Propagation M. Ji, et al., Graph Regularized Transductive Classification on Heterogeneous Information Networks", ECMLPKDD'10. M. Ji, M. Danilevski, et al., Graph Regularized Transductive Classification on Heterogeneous Information Networks", ECMLPKDD'10 8

Experiments with Very Small Training Set DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network Rank objects within each class (with extremely limited label information) Obtain High classification accuracy and excellent rankings within each class Top-5 ranked conferences Top-5 ranked terms Database Data Mining AI IR VLDB KDD IJCAI SIGIR SIGMOD SDM AAAI ECIR ICDE ICDM ICML CIKM PODS PKDD CVPR WWW EDBT PAKDD ECML WSDM data mining learning retrieval database data knowledge information query clustering reasoning web system classification logic search xml frequent cognition text 9

MedRank: Discovering Influential Medical Treatments from Literature Heuristics: A good treatment is likely to be found in good medical articles published in good journals and written by good authors and successful in clinical trials Data (PubMed) and Ontology 20M articles, forming a gigantic heterogeneous infonet Use only those treatments that passed Clinical Trial Phase III MeSH: Medical ontology used Exploring rich semantics of structured heterogeneous networks Star schema MedRank (extension to NetClus) Ranked treatments on popular and non-popular diseases Star Schema for PubMed InfoNet 10

Experiments: Ranking Medical Treatments Treatments of 5 diseases Ranking influential treatments for diseases from MEDLINE data ALS: Amyotrophic Lateral Sclerosis HB: Hepatitis B AIDS: D2: Diabetes Mellitus Type II RA: Rheumatoid Arthritis Rank treatments for AIDS from MEDLINE MedRank vs. baselines using AO (average over sum of weighted overlaps of 1 st d elts) 11

Guidance: Meta Path in Bibliographic Network Relationship prediction: meta path-guided prediction Meta path relationships among similar typed links share similar semantics and are comparable and inferable venue topic publish publish -1 mention -1 paper mention contain/contain -1 cite/cite -1 write -1 write author Co-author prediction (A P A) using topological features also encoded by meta paths, e.g., citation relations between authors (A P P A) 12

Meta-Path Based Co-authorship Prediction in DBLP Co-authorship prediction problem Whether two authors are going to collaborate for the first time Co-authorship encoded in meta-path Author-Paper-Author Topological features encoded in meta-paths Meta-Path Semantic Meaning Meta-paths between authors under length 4 13

The Power of PathPredict Explain the prediction power of each meta-path Wald Test for logistic regression Higher prediction accuracy than using projected homogeneous network 11% higher in prediction accuracy Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors! (Feature collected in [1996, 2002]; Test period in [2003,2009]) 14

Outline Why Information Network Approach for Medical and Health Informatics? Exploring Rich Semantics of Structured Heterogeneous Networks From RankClus to RankClass A PubMed Exploration Information Trust Analysis: An Info. Network Approach From Truth Finder to Latent Truth Model Conclusions 15

Enhancing the Quality of Heterogeneous Info. Networks Info. networks could be untrustworthy, error-prone, missing, TruthFinder [KDD 07]: Inference on trustworthiness by mutual enhancement of info provider and statement trustworthiness Latent Truth Model (LTM) [VLDB12]: Modeling two-sided quality to support multiple true values per entity for truth-finding Web sites Facts w 1 f 1 w 2 f 2 w 3 f 3 w 4 f 4 Objects o 1 o 2 Generating Implicit Negative Claims: Positive Claim Negative Claim Correct Claim Incorrect Claim High Precision, High Recall IMDB High Precision, Low Recall Netflix Low Precision, Low BadSour Recall ce Harry Potter 16

Truth Discovery: Effectiveness of Latent Truth Model Experimental datasets: Large and real Book Authors from abebooks.com (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled) Movie Directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled) Effectiveness of Latent Truth Model: Model source quality in other data integration tasks, e.g. entity resolution. Trustworthiness in multi-genre networks (text-rich networks, social networks, etc.) 17

Outline Why Information Network Approach for Medical and Health Informatics? Exploring Rich Semantics of Structured Heterogeneous Networks From RankClus to RankClass A PubMed Exploration Information Trust Analysis: An Info. Network Approach From Truth Finder to Latent Truth Model Conclusions 18

Conclusions Heterogeneous information networks are ubiquitous Most datasets can be organized or transformed into structured multi-typed heterogeneous info. networks Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, Surprisingly rich knowledge can be mined from such structured heterogeneous info. networks Clustering, ranking, classification, data cleaning, trust analysis, role discovery, similarity search, relationship prediction, Meta path holds a key to effective mining and exploration! Knowledge is power, but knowledge is hidden in massive, but relatively structured nodes and links! Much more to be explored in information network mining! 19

From Data Mining to Mining Info. Networks Han, Kamber and Pei, Data Mining, 3 rd ed. 2011 Yu, Han and Faloutsos (eds.), Link Mining, 2010 Sun and Han, Mining Heterogeneous Information Networks, 2012 20