Construction and Mining of Heterogeneous Information Networks: Will It Be a Key to Web-Aged Information Management and Mining

1 Construction and Mining of Heterogeneous Information Networks: Will It Be a Key to Web-Aged Information Management and Mining
Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign
Collaborated with many students, especially Yizhou Sun, Ming Ji, Chi Wang, Ahmed El-Kishky, Bo Zhao
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), ARO, Microsoft, IBM, Yahoo!, LinkedIn, Google & Boeing
July 3,

2 Outline
Why Heterogeneous Information Networks?
On the Power of Mining Structured Heterogeneous Info. Networks
  Clustering & Classification: A Ranking-Based Approach
  Similarity Search & Meta Path in Heterogeneous Info. Net
  Prediction & Recommendation: A Meta Path-Based Approach
From Unstructured Data to Structured Networks
  From Unstructured Text to Structured Hierarchies
  Type Extraction and Role Discovery: Enrich Network Structures
  Truth Analysis: Enhance Quality of Structured Network
Looking Forward to the Future

3 Where There Is Information, There Are Networks!
Social Networking Websites
Biological Network: Protein Interaction
Research Collaboration Network
Product Recommendation Network via s

4 The Real World: Heterogeneous Networks
Multiple object types and/or multiple link types
[Figure: three example networks: the DBLP Bibliographic Network (Venue, Paper, Author), the IMDB Movie Network (Movie, Studio, Actor, Director), and the Facebook Network]
Homogeneous networks are information-losing projections of heterogeneous networks!
Goal: directly mine the information-richer heterogeneous networks

5 Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!
DBLP: A Computer Science bibliographic database (>2 M papers, >0.7 M authors, >10 K venues)
[Figure: a sample publication record in DBLP]
Knowledge hidden in the DBLP network -> mining function
  How are CS research areas structured? -> Clustering
  Who are the leading researchers on Web search? -> Ranking
  What are the most essential terms, venues, authors in AI? -> Classification + Ranking
  Who are the peer researchers of Jure Leskovec? -> Similarity Search
  Whom will Christos Faloutsos collaborate with? -> Relationship Prediction
  Which types of relationships are most influential for an author to decide her topics? -> Relation Strength Learning
  How did the field of Data Mining emerge and evolve? -> Network Evolution
  Which authors are rather different from their peers in IR? -> Outlier/anomaly detection
Power of het. network modeling: treat Author, Venue, Term & Paper all as first-class citizens!

7 Ranking-Based Clustering & Classification
Basic functions for networks: clustering, classification & ranking
Why not integrate ranking with clustering & classification?
  High-ranked objects should be more important (thus weighted more heavily) in a cluster/class than low-ranked ones
  Ranking makes more sense within a particular cluster/class: Einstein in physics vs. Turing in Computer Science
Integrate ranking with clustering/classification in the same process
  Ranking, as the feature, is conditional (i.e., relative) to a specific cluster/class
  Ranking and clustering/classification mutually enhance each other
Ranking-based clustering: RankClus [EDBT 09], NetClus [KDD 09], PathSelClus [KDD 12]
Ranking-based classification: GNetMine (for classification) [ECMLPKDD 10], RankClass [KDD 11]

8 RankClus: Integrating Clustering with Ranking
An E-M framework in which clustering and ranking iteratively enhance each other
Initialization: randomly partition the target objects
Repeat
  Ranking: rank objects in each sub-network induced from each cluster
  Generating new measure space: estimate mixture model coefficients for each target object
  Adjusting clusters
Until stable
Difference in ranking rules:
  1. Simple ranking rule: an object ranks high if it has more papers
  2. Authority ranking rule: an author ranks high if publishing more in highly ranked venues
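The two ranking rules can be sketched for the venue-author sub-network of a single cluster. This is a minimal illustration, not the RankClus implementation: the link-count matrix `w` (venues by authors), the function names and the fixed iteration count are all assumptions, and the mixture-model estimation and cluster-adjustment steps of the loop are not shown.

```python
def simple_ranking(w):
    """Rule 1: rank venues and authors by their share of all papers."""
    total = sum(sum(row) for row in w)
    venue_rank = [sum(row) / total for row in w]
    author_rank = [sum(row[a] for row in w) / total for a in range(len(w[0]))]
    return venue_rank, author_rank


def authority_ranking(w, iterations=50):
    """Rule 2: venues and authors reinforce each other's rank."""
    n_venues, n_authors = len(w), len(w[0])
    venue_rank = [1.0 / n_venues] * n_venues
    author_rank = [1.0 / n_authors] * n_authors
    for _ in range(iterations):
        # An author ranks high when publishing in highly ranked venues.
        author_rank = [sum(w[v][a] * venue_rank[v] for v in range(n_venues))
                       for a in range(n_authors)]
        norm = sum(author_rank)
        author_rank = [x / norm for x in author_rank]
        # A venue ranks high when it attracts highly ranked authors.
        venue_rank = [sum(w[v][a] * author_rank[a] for a in range(n_authors))
                      for v in range(n_venues)]
        norm = sum(venue_rank)
        venue_rank = [x / norm for x in venue_rank]
    return venue_rank, author_rank


# Hypothetical paper counts: 2 venues (rows) x 3 authors (columns).
w = [[5, 3, 0],
     [1, 0, 4]]
simple_venue, simple_author = simple_ranking(w)
auth_venue, auth_author = authority_ranking(w)
```

Both functions return distributions that sum to 1, which is what lets a ranking be read as a conditional distribution within a cluster.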

9 Step-by-Step Running of RankClus
Clustering and ranking two fields: DB/DM & HW/CA
[Figure: panels showing successive iterations, with captions in order:]
  Initially, ranking distributions are mixed together
  Improved a little: two clusters of objects mixed together, but preserve similarity somehow
  Improved significantly: two clusters are almost well separated
  Well separated
  Stable

10 NetClus: Ranking-Based Clustering with Star Network Schema
Beyond bi-typed networks: capture more semantics with multiple types
Split a network into multiple subnets, each a net-cluster [KDD 09]
DBLP network: using terms, venues & authors to jointly distinguish a sub-field, e.g., database
[Figure: star network schema with Research Paper (P) at the center, linked to Venue (V) by Publish, Author (A) by Write, and Term (T) by Contain; NetClus splits the Computer Science network into net-clusters such as Database, Hardware and Theory]

11 NetClus: Database System Cluster
Term (ranking score where recoverable): database (0.0995511), databases (0.0708818), system (0.0678563), data (0.0214893), query (0.0133316), systems (0.0110413), queries (0.0090603), management, object, relational, processing, based, distributed, xml, oriented, design, web, information, model, efficient
Venue: VLDB, SIGMOD Conf, ICDE, PODS, EDBT
Author: Surajit Chaudhuri, Michael Stonebraker, Michael J. Carey, C. Mohan, David J. DeWitt, Hector Garcia-Molina, H. V. Jagadish, David B. Lomet, Raghu Ramakrishnan, Philip A. Bernstein, Joseph M. Hellerstein, Jeffrey F. Naughton, Yannis E. Ioannidis, Jennifer Widom, Per-Ake Larson, Rakesh Agrawal, Dan Suciu, Michael J. Franklin, Umeshwar Dayal, Abraham Silberschatz
One level deeper: authors in the XML, XQuery cluster

12 Rank-Based Clustering: Works in Multiple Domains
RankCompete: Organize your photo album automatically!
  Use multi-typed image features to build up heterog. networks
  Explore multiple types in a star schema network
MedRank: Rank treatments for AIDS from Medline

13 Classification: Knowledge Propagation Across Heterogeneous Typed Networks
Labeled knowledge propagates through multi-typed objects across heterogeneous networks [ECMLPKDD'10, KDD 11]

14 RankClass: Ranking-Based Classification
Class: a group of multi-typed nodes sharing the same topic + within-class ranking
[Figure: the RankClass framework]

15 Methodology of RankClass
Classification in heterogeneous networks
  Knowledge propagation: class label knowledge is propagated across multi-typed objects through a heterogeneous network
  GNetMine [Ji et al., PKDD 10]: objects are treated equally
RankClass [Ji et al., KDD 11]: ranking-based classification
  Highly ranked objects play a larger role in classification
  An object can only be ranked high in some focused classes
  Class membership and ranking are statistical distributions
  Let ranking and classification mutually enhance each other!
Output: classification results + a ranked list of objects within each class
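The knowledge-propagation idea can be sketched with a plain label-propagation loop over a small multi-typed graph. This is a swapped-in simplification, not GNetMine or RankClass: it ignores link types and within-class ranking, and the node names, the damping factor `alpha` and the iteration count are hypothetical.

```python
def propagate_labels(edges, seeds, n_classes, alpha=0.8, iterations=30):
    """Spread seed labels to unlabeled nodes by repeated neighbor averaging."""
    nodes = sorted({n for edge in edges for n in edge} | set(seeds))
    neighbors = {n: [] for n in nodes}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    # One-hot seed vectors; unlabeled nodes start at all zeros.
    seed_vec = {n: [1.0 if seeds.get(n) == c else 0.0 for c in range(n_classes)]
                for n in nodes}
    scores = {n: list(seed_vec[n]) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            nb = neighbors[n]
            avg = [sum(scores[m][c] for m in nb) / len(nb) if nb else 0.0
                   for c in range(n_classes)]
            # Mix propagated evidence with the node's own seed label.
            updated[n] = [alpha * avg[c] + (1 - alpha) * seed_vec[n][c]
                          for c in range(n_classes)]
        scores = updated
    return {n: max(range(n_classes), key=lambda c: scores[n][c]) for n in nodes}


# Hypothetical network: papers p1..p4, authors a1/a2, two venues, one shared term.
edges = [("p1", "VLDB"), ("p2", "VLDB"), ("p1", "a1"), ("p2", "a1"),
         ("p3", "SIGIR"), ("p4", "SIGIR"), ("p3", "a2"), ("p4", "a2"),
         ("p2", "query"), ("p3", "query")]
seeds = {"VLDB": 0, "SIGIR": 1}  # class 0 = database, class 1 = IR
labels = propagate_labels(edges, seeds, n_classes=2)
```

Only the two venues are labeled, yet papers and authors of every type receive a class, which is the effect the slide describes with extremely limited label information.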

16 Experiments on DBLP
Class: four research areas (communities): database, data mining, AI, information retrieval
Four types of objects: Paper (14376), Conf. (20), Author (14475), Term (8920)
Three types of relations: paper-conf., paper-author, paper-term
Algorithms for comparison
  Learning with Local and Global Consistency (LLGC) [Zhou et al. NIPS 2003], also the homogeneous version of our method
  Weighted-vote Relational Neighbor classifier (wvRN) [Macskassy et al. JMLR 2007]
  Network-only Link-based Classification (nLB) [Lu et al. ICML 2003, Macskassy et al. JMLR 2007]

17 Comparing Classification Accuracy on the DBLP Data

18 Object Ranking in Each Class: Experiment
DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network
Rank objects within each class (with extremely limited label information)
Obtain high classification accuracy and excellent ranking within each class

Top-5 ranked conferences
  Database    Data Mining    AI       IR
  VLDB        KDD            IJCAI    SIGIR
  SIGMOD      SDM            AAAI     ECIR
  ICDE        ICDM           ICML     CIKM
  PODS        PKDD           CVPR     WWW
  EDBT        PAKDD          ECML     WSDM

Top-5 ranked terms
  Database    Data Mining      AI           IR
  data        mining           learning     retrieval
  database    data             knowledge    information
  query       clustering       reasoning    web
  system      classification   logic        search
  xml         frequent         cognition    text

20 Similarity Search: Find Similar Objects in Networks
Who are most similar to Christos Faloutsos?
Meta-path: a meta-level description of a path between two objects
[Figure: schema of the DBLP network]
Different meta-paths carry rather different semantics and lead to very different results!
  Meta-Path: Author-Paper-Author (APA): Christos's students or close collaborators
  Meta-Path: Author-Paper-Venue-Paper-Author (APVPA): authors with similar reputation at similar venues

21 PathSim: A Het. Network Similarity Measure
Popular object similarity measures in networks
  Random walk (RW) (or Personalized PageRank): favors highly visible objects (i.e., objects with large degrees)
  Pairwise random walk (PRW) (or SimRank): favors pure objects (i.e., objects with highly skewed distribution in their in-links or out-links)
  Note: P-PageRank and SimRank distinguish neither object type nor relationship type
PathSim [VLDB 11]
  Favors "peers": objects with strong connectivity and similar visibility under the given meta-path
  [Formula: PathSim score between two objects x and y]
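The way PathSim favors peers can be sketched with its definition from the PathSim paper [VLDB 11]: under a symmetric meta-path, s(x, y) = 2 * |paths from x to y| / (|paths from x to x| + |paths from y to y|). The example below is illustrative only: the author names and paper counts are hypothetical, and the Author-Paper-Venue counts are assumed to be already aggregated into one row per author.

```python
def commuting_matrix(author_venue):
    """Count APVPA path instances between every pair of authors."""
    n = len(author_venue)
    return [[sum(a * b for a, b in zip(author_venue[i], author_venue[j]))
             for j in range(n)]
            for i in range(n)]


def pathsim(m, x, y):
    """PathSim score; assumes both objects have at least one path to themselves."""
    return 2.0 * m[x][y] / (m[x][x] + m[y][y])


# Hypothetical papers per author in two venues (columns).
authors = ["Mike", "Jim", "Mary"]
author_venue = [[2, 1],     # Mike
                [50, 20],   # Jim: far more visible
                [2, 1]]     # Mary: same profile as Mike
m = commuting_matrix(author_venue)
mike_jim = pathsim(m, 0, 1)
mike_mary = pathsim(m, 0, 2)
```

By raw path count Jim is the closest author to Mike (120 paths against 5 for Mary), yet PathSim ranks Mary first because her visibility matches Mike's.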

22 Which Similarity Measure Is Better?
Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)
  Anhai Doan: CS, Wisconsin; Database area; PhD: 2002
  Jignesh Patel: CS, Wisconsin; Database area; PhD: 1998
  Amol Deshpande: CS, Maryland; Database area; PhD: 2004
  Jun Yang: CS, Duke; Database area; PhD:

24 PathPredict: Meta-Path Based Relationship Prediction
Meta path-guided prediction of links and relationships
Philosophy: meta path relationships among similar typed links share similar semantics and are comparable and inferable
[Figure: bibliographic network schema with node types venue, paper, author and topic, and relations publish/publish^-1, write/write^-1, mention/mention^-1, contain/contain^-1, cite/cite^-1]
Bibliographic network: co-author prediction (A-P-A)
Use topological features encoded by meta paths, e.g., citation relations between authors (A-P->P-A)

25 Meta-Path Based Co-authorship Prediction
Co-authorship prediction problem: whether two authors are going to collaborate for the first time
Co-authorship is encoded in the meta-path Author-Paper-Author
Topological features are encoded in meta-paths
[Table: meta-paths between authors under length 4, with columns Meta-Path and Semantic Meaning]

26 The Power of PathPredict: Experiments
Explain the prediction power of each meta-path: Wald test for logistic regression
Higher prediction accuracy than using the projected homogeneous network: 11% higher in prediction accuracy
Co-author prediction for Jian Pei: only 42 among 4809 candidates are true first-time co-authors!
(Features collected in [1996, 2002]; test period in [2003, 2009])
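How meta-path features feed a logistic regression model can be sketched as follows. The sketch is illustrative: the author identifiers, the link counts, and in particular the weights and bias are hypothetical (in PathPredict the weights are learned from historical author pairs, not fixed by hand), and only two of the possible meta-paths are used.

```python
import math


def path_count(links, x, y):
    """Count path instances x - middle - y for a symmetric two-hop pattern."""
    return sum(n * links[y].get(mid, 0) for mid, n in links[x].items())


def coauthor_probability(features, weights, bias):
    """Logistic model over meta-path count features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical counts: papers written with each co-author, papers per venue.
coauthors = {"a": {"c": 3}, "b": {"c": 2}, "d": {"e": 1}}
venues = {"a": {"KDD": 4}, "b": {"KDD": 2}, "d": {"SIGIR": 3}}


def features(x, y):
    # Feature 1: A-P-A-P-A (shared co-authors); feature 2: A-P-V-P-A (shared venues).
    return [path_count(coauthors, x, y), path_count(venues, x, y)]


weights, bias = [0.3, 0.1], -2.0  # hypothetical, normally learned
p_ab = coauthor_probability(features("a", "b"), weights, bias)
p_ad = coauthor_probability(features("a", "d"), weights, bias)
```

Authors a and b share a co-author and a venue, so their predicted probability is higher than that of a and d, who share neither.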

27 Enhancing the Power of Recommender Systems by Heterog. Info. Network Analysis
Collaborative filtering methods suffer from the data sparsity issue
  A small set of users & items have a large number of ratings
  Most users and items have a small number of ratings
[Figure: # of ratings plotted against # of users or items]
[Figure: movie network linking Avatar, Aliens, Titanic and Revolutionary Road with Zoe Saldana, James Cameron, Leonardo DiCaprio and Kate Winslet and the genres Adventure and Romance]
Personalized recommendation with heterog. networks [WSDM 14]
  Heterogeneous relationships complement each other
  Users and items with limited feedback can be connected to the network by different types of paths
  Connect new users or items in the information network
  Different users may require different models: relationship heterogeneity makes personalized recommendation models easier to define
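The way paths in the network connect a user to items without feedback can be sketched by diffusing one observed preference along a movie-person-movie path. This shows only the intuition, not the HeteRec method: the function name and the single-user feedback are hypothetical, while the movie-person links use the entities named on the slide.

```python
def diffuse_feedback(watched, item_links):
    """Score unseen items by the entities they share with watched items."""
    scores = {}
    for seen in watched:
        for candidate, entities in item_links.items():
            if candidate in watched:
                continue
            shared = len(item_links[seen] & entities)
            if shared:
                scores[candidate] = scores.get(candidate, 0) + shared
    return scores


# Movie -> directors and actors.
item_links = {
    "Avatar": {"James Cameron", "Zoe Saldana"},
    "Aliens": {"James Cameron"},
    "Titanic": {"James Cameron", "Leonardo DiCaprio", "Kate Winslet"},
    "Revolutionary Road": {"Leonardo DiCaprio", "Kate Winslet"},
}

# Hypothetical user with a single piece of feedback.
scores = diffuse_feedback({"Titanic"}, item_links)
```

A single rating already yields non-zero scores for three unseen movies, which is how the network eases the sparsity problem; different users could weight the director path and the actor path differently.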

28 Performance Study: Personalized Recommendation in Heterogeneous Networks
Datasets:
Methods to be compared:
  Popularity: recommend the most popular items to users
  Co-click: conditional probabilities between items
  NMF: non-negative matrix factorization on user feedback
  Hybrid-SVM: use Rank-SVM with plain features (utilizes both user feedback and the information network)
Winner: HeteRec personalized recommendation (HeteRec-p)

30 Automated Construction of Heterogeneous Info. Networks
Rich semantics can be mined from structured heterog. networks
Unfortunately, much of the real-world data is unstructured
  News, Wikipedia, blogs, tweets, multimedia data, ...
  Too expensive to be structured by humans: construction must be automated & scalable
How to turn unstructured data into structured heterog. networks?
  From Unstructured Text to Structured Hierarchies
  Type Extraction and Role Discovery: Enrich Network Structures
  Truth Analysis: Enhance Quality of Structured Network
Methodology: incrementally & progressively explore links and structures to construct quality heterogeneous networks

31 From Unstructured Text to Structured Hierarchies
Input: a collection of text with entities
Our recent effort:
  1. Hierarchical topic discovery with entities
  2. Phrase mining
  3. Rank phrases & entities per topic
Output: a hierarchy with phrases & entities
[Figure: topic hierarchy with nodes o, o/1, o/2, o/1/1, o/1/2, o/2/1]
KERT [Danilevsky et al., SDM 14]: LDA + keyphrase extraction and ranking by topic (4 criteria: coverage, purity, phraseness, completeness)
CATHY [Wang et al., KDD 13]: recursive generation of a topic hierarchy based on a term co-occurrence network
CATHYHIN [Wang et al., ICDM 13]: use a hetero. network to help topic hierarchy generation
ToPMine [El-Kishky et al., 2014]: exploring frequent contiguous pattern mining
The hierarchy so generated can be used for network mining, exploration, etc.

32 Automatic Discovery of Clusters and Term Hierarchies from Collection of Documents
KERT: Keyphrase Extraction and Ranking by Topic [SDM 14]
1. Use background LDA (bLDA) to cluster unigrams into T topics + one background topic
2. Extract candidate keyphrases for each foreground topic according to the word-topic assignment
   Candidates should be order-free and may be non-consecutive
   E.g., "mining frequent patterns" vs. "frequent pattern mining"
   E.g., "mining top-k frequent closed patterns"
3. Rank the keyphrases in each foreground topic
   Coverage: "information retrieval" vs. "cross-language information retrieval"
   Purity: only frequent in documents about topic t
   Phraseness: "active learning" vs. "learning classification"
   Completeness: "vector machine" vs. "support vector machine"

33 Experiment: Machine Learning: Top Keyphrases
Dataset: DBLP paper titles
KERT variations: ignore each criterion in turn
Baselines: two variations of an approach from a recent work

34 CATHYHIN: Topic Hierarchy Construction by Integration of Heterogeneous Info. Networks
Using the DBLP heterog. info. network to enhance topical hierarchy generation
Hierarchies are generated not only on topical phrases but also on authors & venues

35 ToPMine: Phrase Mining Then Topic Modeling
Frequent contiguous phrase mining
  Frequency: a low-frequency phrase is likely unimportant; this generalizes the most likely unigrams in LDA
  Collocation: tokens co-occur with a frequency significantly higher than expected by chance
  Completeness: do not double count sub-phrases (e.g., "support vector" is not a phrase when inside "support vector machine")
General frequent pattern mining may place words that are far apart in the same phrase
The pattern for a phrase: each phrase is a contiguous sequence of ordered words
Example title: Markov Blanket Feature Selection for Support Vector Machines

36 Good Phrases Should Be Stat. Significant
Under the assumption that the text data was generated by a sequence of Bernoulli trials, we can derive a significance measure for potential mergings.
[Figure: examples of good phrases]
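One form such a measure can take under the Bernoulli assumption is sketched below. The symbols are assumptions not shown on the slide: L is the total number of tokens in the corpus, f(P) is the frequency of phrase P, and P_1 ⊕ P_2 is the concatenation of two adjacent phrases. If the two phrases occurred independently, the merged phrase would be expected about μ_0 times; the score counts how many (approximate) standard deviations the observed frequency lies above that expectation, with the variance approximated by the observed count.

```latex
\mu_0(P_1, P_2) = L \cdot \frac{f(P_1)}{L} \cdot \frac{f(P_2)}{L},
\qquad
\operatorname{sig}(P_1, P_2) \approx
  \frac{f(P_1 \oplus P_2) - \mu_0(P_1, P_2)}{\sqrt{f(P_1 \oplus P_2)}}
```

A pair of adjacent phrases is a candidate for merging when this score exceeds a chosen threshold.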

37 Bag-of-Phrase Construction: An Example

38 The ToPMine Framework
1. Preprocessing: perform stemming to reduce vocabulary size and increase collisions of phrases; split on phrase-invariant punctuation
2. Frequent contiguous pattern mining: collects aggregate counts and candidate phrases
3. Phrase construction: agglomerative construction of significant phrases (addresses frequency, collocation, and completeness)
4. Use phrase construction to build a bag of phrases: this can be considered an enriched input for later algorithms (not exclusive to topic modeling)
5. Bag-of-phrases topic modeling
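Steps 2 to 4 can be sketched as a greedy, bottom-up merge of adjacent tokens. This is a minimal illustration under stated assumptions, not the ToPMine implementation: the toy documents, the threshold value, the function names and the exact significance score (observed minus expected count, divided by the square root of the observed count) are assumptions chosen to match the Bernoulli-trial description.

```python
import math
from collections import Counter


def count_ngrams(docs, max_len):
    """Step 2: aggregate counts of all contiguous n-grams up to max_len."""
    counts = Counter()
    for doc in docs:
        for n in range(1, max_len + 1):
            for i in range(len(doc) - n + 1):
                counts[tuple(doc[i:i + n])] += 1
    return counts


def significance(counts, total_tokens, left, right):
    """How far the merged phrase's count exceeds its expected count."""
    merged = counts[left + right]
    if merged == 0:
        return float("-inf")
    expected = counts[left] * counts[right] / total_tokens
    return (merged - expected) / math.sqrt(merged)


def segment(doc, counts, total_tokens, threshold):
    """Step 3: repeatedly merge the most significant adjacent pair."""
    phrases = [(token,) for token in doc]
    while len(phrases) > 1:
        scores = [significance(counts, total_tokens, phrases[i], phrases[i + 1])
                  for i in range(len(phrases) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break
        phrases[best:best + 2] = [phrases[best] + phrases[best + 1]]
    return phrases  # Step 4: the document as a bag of phrases


# Hypothetical, already stemmed and tokenized titles.
docs = [
    ["support", "vector", "machine", "learning"],
    ["support", "vector", "machine", "kernel"],
    ["support", "vector", "machine", "training"],
    ["vector", "space", "model"],
    ["machine", "translation", "model"],
]
counts = count_ngrams(docs, max_len=4)
total_tokens = sum(len(doc) for doc in docs)
bag = segment(docs[0], counts, total_tokens, threshold=1.0)
```

In this toy corpus the first title is segmented into the phrase (support, vector, machine) plus the unigram (learning,), so "support vector" is never counted separately, which is the completeness criterion.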

39 Experiments on AP News

40 Experiments on Yelp Reviews

41 Scalability: Runtime Results

43 Role Discovery: Mining Advisor-Advisee Relationships in DBLP Network
Propagation of simple, commonly accepted constraints in a Time-Constrained Probabilistic Factor Graph (TPFG)
  An advisor has more publications and a longer history than the advisee at the time of advising
  Once an advisee becomes an advisor, s/he will not become an advisee again
Input: temporal collaboration network
Output: relationship analysis, visualized as chronological hierarchies
[Figure: collaboration network among Ada, Bob, Ying, Jerry and Smith over 1999-2004; each candidate advisor-advisee link carries a probability and a time interval, e.g., (0.8, [1999, 2000])]

44 Role Discovery: Performance & Case Study
DBLP data: 654,628 authors, 1,076,946 publications, years provided
Labeled data: Math Genealogy Project; AI Genealogy Project; Homepage

  Datasets   RULE          SVM                    IndMAX                 IndMAX                 TPFG                   TPFG
             (heuristics)  (supervised learning)  (empirical parameter)  (optimized parameter)  (empirical parameter)  (optimized parameter)
  TEST1      69.9%         73.4%                  75.2%                  78.9%                  80.2%                  84.4%
  TEST2      69.8%         74.6%                  74.6%                  79.0%                  81.5%                  84.3%
  TEST3      80.6%         86.7%                  83.1%                  90.9%                  88.8%                  91.3%

Case study (Advisee: top-ranked advisors, with time and note)
  David M. Blei: 1. Michael I. Jordan (PhD advisor, 2004 grad); 2. John D. Lafferty (Postdoc, 2006)
  Hong Cheng: 1. Qiang Yang (MS advisor); Jiawei Han (PhD advisor, 2008)
  Sergey Brin: 1. Rajeev Motwani (unofficial advisor)

45 Hierarchical Relationship Discovery
From partially ordered objects to a hierarchy (tree)
  Based on NLP or other techniques to extract partially ordered objects
  Using constraints to discover relationships
Singleton potential (Type: cognitive description; potential definitions not recoverable)
  Homophily: parent and child are similar
  Polarity: parent is superior to child
  Support pattern: patterns frequently occurring with child-parent pairs
  Forbidden pattern: patterns rarely occurring with child-parent pairs
Pairwise potential function: cases (Type: cognitive description)
  Attribute augment: use inherited attributes from parents or children
  Label propagate: similar nodes share similar parents (or children)
  Reciprocity: patterns alternating in child-parent & parent-child pairs
  Constraints: restrict certain patterns
Example: discovery of the Kenny family tree

47 Truth Analysis: Enhancing the Quality of Heterogeneous Info. Networks
Info. networks could be untrustworthy, error-prone, missing, ...
TruthFinder [KDD 07]: inference on trustworthiness by constructing heterogeneous info. networks
[Figure: three-layer network linking Web sites (book sellers) w1-w4 to Facts (book authors) f1-f4 to Objects (books) o1-o2]
True facts and trustable websites mutually enhance each other and become apparent after some iterations
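The mutual reinforcement between facts and websites can be sketched with a basic fixed-point iteration. This is a simplified illustration rather than the full TruthFinder algorithm: it omits dampening and the influence between conflicting facts, and the claim sets, the initial trust of 0.8 and the iteration count are hypothetical.

```python
def truthfinder_sketch(claims, iterations=10, initial_trust=0.8):
    """Alternate between fact confidence and website trustworthiness.

    claims maps each website to the non-empty set of facts it provides.
    """
    trust = {site: initial_trust for site in claims}
    facts = {fact for provided in claims.values() for fact in provided}
    confidence = {}
    for _ in range(iterations):
        # A fact is wrong only if every website providing it is wrong.
        for fact in facts:
            all_wrong = 1.0
            for site, provided in claims.items():
                if fact in provided:
                    all_wrong *= 1.0 - trust[site]
            confidence[fact] = 1.0 - all_wrong
        # A website is as trustworthy as the facts it provides, on average.
        for site, provided in claims.items():
            trust[site] = sum(confidence[f] for f in provided) / len(provided)
    return trust, confidence


# Hypothetical claims: f1 and f2 conflict about book o1, f3 and f4 about o2.
claims = {
    "w1": {"f1"},
    "w2": {"f1"},
    "w3": {"f1", "f3"},
    "w4": {"f2", "f4"},
}
trust, confidence = truthfinder_sketch(claims)
```

The fact backed by three websites ends with higher confidence than its rival, and the websites providing it end with higher trust than the lone dissenter.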

48 Truth Discovery: Multiple Truth Value and Handling False Negatives
Voting may not always work well: some sources tend to miss true values (false negatives), while others tend to produce false claims (false positives)
Why the Latent Truth Model (LTM)? Modeling two-sided quality to support multiple true values per entity for truth finding [VLDB 12]
Generating implicit negative claims
[Figure: positive and negative claims about Harry Potter, marked correct or incorrect, from three sources: IMDB (high precision, high recall), Netflix (high precision, low recall), BadSource (low precision, low recall)]

49 Truth Discovery: Effectiveness of Latent Truth Model
Experimental datasets: large and real
  Book Authors from abebooks.com (1263 books, 879 sources, claims, 2420 book-author, 100 labeled)
  Movie Directors from Bing (15073 movies, 12 sources, claims, movie-director, 100 labeled)
Effectiveness of Latent Truth Model:
  Model source quality in other data integration tasks, e.g., entity resolution
  Trustworthiness in multi-genre networks (text-rich networks, social networks, etc.)

51 On-Going and Future Work
From small, structured data to real, big, unstructured data
  From the DBLP network to networks constructed from news, PubMed, tweets, ...
  Construction of quality, semi-structured heterogeneous networks from massive, real data sets
Integration of information networks with data/text cube and OLAP technologies
  Data/text cubes have shown their power in multidimensional OLAP analysis, e.g., EventCube
  Integration of heterogeneous network mining
  OLAP mining on multi-dimensional social and information networks
Network exploration of mission- or query-based hidden networks

52 From Data Mining to Mining Info. Networks
Han, Kamber and Pei, Data Mining, 3rd ed.
Yu, Han and Faloutsos (eds.), Link Mining, 2010
Sun and Han, Mining Heterogeneous Information Networks,

53 Conclusions
Heterogeneous information networks are ubiquitous
  Most datasets can be organized or transformed into structured multi-typed heterogeneous info. networks
  Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, ...
Surprisingly rich knowledge can be mined from structured heterogeneous info. networks
  Clustering, ranking, classification, path prediction, ...
Knowledge is power, but knowledge is hidden in massive, but relatively structured nodes and links!
Key issue: construction of trusted, semi-structured heterogeneous networks from unstructured data
From data to knowledge: much more remains to be explored, but heterogeneous network mining has shown high promise!

54 References
M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11
Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012
Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09
Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD'09
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", VLDB'11
Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11
Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, "When Will It Happen? Relationship Prediction in Heterogeneous Information Networks", WSDM'12
F. Tao, et al., "EventCube: Multi-Dimensional Search and Mining of Structured and Text Data", (system demo) KDD'13
C. Wang, J. Han, et al., "Mining Advisor-Advisee Relationships from Research Publication Networks", KDD'10
C. Wang, M. Danilevsky, et al., "A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy", KDD'13
C. Wang, M. Danilevsky, et al., "Constructing Topical Hierarchies in Heterogeneous Information Networks", ICDM'13
