Construction and Mining of Heterogeneous Information Networks: Will It Be a Key to Web-Aged Information Management and Mining

Size: px
Start display at page:

Download "Construction and Mining of Heterogeneous Information Networks: Will It Be a Key to Web-Aged Information Management and Mining"

Transcription

1 Construction and Mining of Heterogeneous Information Networks: Will It Be a Key to Web-Aged Information Management and Mining Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign Collaborated with many students, especially Yizhou Sun, Ming Ji, Chi Wang, Ahmed El-Kishky, Bo Zhao Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), ARO, Microsoft, IBM, Yahoo!, LinkedIn, Google & Boeing July 3,

2 Outline Why Heterogeneous Information Networks? On the Power of Mining Structured Heterogeneous Info. Networks Clustering & Classification: A Ranking-Based Approach Similarity Search & Meta Path in Heterogeneous Info. Net Prediction & Recommendation: A Meta Path-Based Approach From Unstructured Data to Structured Networks From Unstructured Text to Structured Hierarchies Type Extraction and Role Discovery: Enrich Network Structures Truth Analysis: Enhance Quality of Structured Network Looking Forward to the Future 2

3 Where There Is Information, There Are Networks! Social Networking Websites Biological Network: Protein Interaction Research Collaboration Network Product Recommendation Network via s 3

4 The Real World: Heterogeneous Networks Multiple object types and/or multiple link types Movie Studio Venue Paper Author DBLP Bibliographic Network Actor Movie Director The IMDB Movie Network The Facebook Network Homogeneous networks are information loss projection of heterogeneous networks! Directly mining information-richer heterogeneous networks 4

5 Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining! DBLP: A Computer Science bibliographic database A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), Knowledge hidden in DBLP Network How are CS research areas structured? Who are the leading researchers on Web search? What are the most essential terms, venues, authors in AI? Who are the peer researchers of Jure Leskovec? Whom will Christos Faloutsos collaborate with? Which types of relationships are most influential for an author to decide her topics? How was the field of Data Mining emerged or evolving? Which authors are rather different from his/her peers in IR? Mining Functions Clustering Ranking Classification + Ranking Similarity Search Relationship Prediction Relation Strength Learning Network Evolution Outlier/anomaly detection Power of het. network modeling: Treat Author, Venue, Term & Paper all first-class citizens! 5

6 Outline Why Heterogeneous Information Networks? On the Power of Mining Structured Heterogeneous Info. Networks Clustering & Classification: A Ranking-Based Approach Similarity Search & Meta Path in Heterogeneous Info. Net Prediction & Recommendation: A Meta Path-Based Approach From Unstructured Data to Structured Networks From Unstructured Text to Structured Hierarchies Type Extraction and Role Discovery: Enrich Network Structures Truth Analysis: Enhance Quality of Structured Network Looking Forward to the Future 6

7 Ranking-Based Clustering & Classification Basic functions for networks: Clustering, classification & ranking Why not integrate ranking with clustering & classification? High-ranked objects should be more important (thus more weighted) in a cluster/class than low-ranked ones Ranking will make more sense within a particular cluster/class Einstein in physics vs. Turing in Coputer Science Integrate ranking with clustering/classification in the same process Ranking, as the feature, is conditional (i.e., relative) to a specific cluster/class Ranking and clustering/classification mutually enhance each other Ranking-based clustering: RankClus [EDBT 09], NetClus [KDD 09], PathSelClus [KDD 12] Ranking-based classification: GNetMine (for classification) [ECMLPKDD 10], RankClass [KDD 11] 7

8 RankClus: Integrating Clustering with Ranking Initialization Randomly partition Repeat Ranking Sub-Network Ranking objects in each sub-network induced from each cluster Generating new measure space Estimate mixture model coefficients for each target object Adjusting cluster Until stable Clustering Ranking An E-M framework for iterative enhancement Difference on ranking rules: 1. Simple ranking rule: Ranking high if more papers 2. Highly ranked if publishing more in highly rankedvenues 8

9 Step-by-Step Running of RankClus Clustering and ranking two fields: DB/DM & HW/CA Initially, ranking distributions are mixed together Improved a little Two clusters of objects mixed together, but preserve similarity somehow Two clusters are almost well separated Improved significantly Well separated Stable 9

10 NetClus: Ranking-Based Clustering with Star Network Schema Beyond bi-typed network: capture more semantics with multiple types Split a network into multi-subnets, each a net-cluster [KDD 09] DBLP network: using terms, venues & authors to jointly distinguish a sub-field, e.g., database V P A Database Venue Author T Publish Write V A Research Paper P Hardware Contain Term Computer Science NetClus V T P A Theory T 10

11 NetClus: Database System Cluster Term Venue Author database databases system data query systems queries management object relational processing based distributed xml oriented design web information model efficient VLDB SIGMOD Conf ICDE PODS EDBT One-level deeper: Authors in XML, Xquery cluster Surajit Chaudhuri Michael Stonebraker Michael J. Carey C. Mohan David J. DeWitt Hector Garcia-Molina H. V. Jagadish David B. Lomet Raghu Ramakrishnan Philip A. Bernstein Joseph M. Hellerstein Jeffrey F. Naughton Yannis E. Ioannidis Jennifer Widom Per-Ake Larson Rakesh Agrawal Dan Suciu Michael J. Franklin Umeshwar Dayal Abraham Silberschatz

12 Rank-Based Clustering: Works in Multiple Domains RankCompete: Organize your photo album automatically! Use multi-typed image features to build up heterog. networks Explore multiple types in a star schema network MedRank: Rank treatments for AIDS from Medline 12

13 Classification: Knowledge Propagation Across Heterogeneous Typed Networks Labeled knowledge propagates through multi-typed objects across heterogeneous networks [ECMLPKDD'10, KDD 11] 13

14 RankClass: Ranking-Based Classification Class: a group of multi-typed nodes sharing the same topic + within-class ranking node class The RankClass Framework 14

15 Methodology of RankClass Classification in heterogeneous networks Knowledge propagation: Class label knowledge propagated across multi-typed objects through a heterogeneous network GNetMine [Ji et al., PKDD 10]: Objects are treated equally RankClass [Ji et al., KDD 11]: Ranking-based classification Highly ranked objects will play more role in classification An object can only be ranked high in some focused classes Class membership and ranking are stat. distributions Let ranking and classification mutually enhance each other! Output: Classification results + ranking list of objects within each class 15

16 Experiments on DBLP Class: Four research areas (communities) Database, data mining, AI, information retrieval Four types of objects Paper (14376), Conf. (20), Author (14475), Term (8920) Three types of relations Paper-conf., paper-author, paper-term Algorithms for comparison Learning with Local and Global Consistency (LLGC) [Zhou et al. NIPS 2003] also the homogeneous version of our method Weighted-vote Relational Neighbor classifier (wvrn) [Macskassy et al. JMLR 2007] Network-only Link-based Classification (nlb) [Lu et al. ICML 2003, Macskassy et al. JMLR 2007] 16

17 Comparing Classification Accuracy on the DBLP Data 17

18 Object Ranking in Each Class: Experiment DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network Rank objects within each class (with extremely limited label information) Obtain high classification accuracy and excellent ranking within each class Top-5 ranked conferences Top-5 ranked terms Database Data Mining AI IR VLDB KDD IJCAI SIGIR SIGMOD SDM AAAI ECIR ICDE ICDM ICML CIKM PODS PKDD CVPR WWW EDBT PAKDD ECML WSDM data mining learning retrieval database data knowledge information query clustering reasoning web system classification logic search xml frequent cognition text 18

19 Outline Why Heterogeneous Information Networks? On the Power of Mining Structured Heterogeneous Info. Networks Clustering & Classification: A Ranking-Based Approach Similarity Search & Meta Path in Heterogeneous Info. Net Prediction & Recommendation: A Meta Path-Based Approach From Unstructured Data to Structured Networks From Unstructured Text to Structured Hierarchies Type Extraction and Role Discovery: Enrich Network Structures Truth Analysis: Enhance Quality of Structured Network Looking Forward to the Future 19

20 Similarity Search: Find Similar Objects in Networks Who are most similar to Christos Faloutsos? Meta-Path: Meta-level description of a path between two objects Different meta-paths Schema of the carry rather different DBLP Network semantics Meta-Path: Author-Paper-Author (APA) Different meta-paths lead to very different results! Meta-Path: Author-Paper-Venue-Paper-Author (APVPA) Christos s students or close collaborators Similar reputation at similar venues 20

21 PathSim: A Het. Network Similarity Measure Popular object similarity measures in networks Random walk (RW) (or Personalized PageRank): Favors highly visible objects (i.e., objects with large degrees) Pairwise random walk (PRW) (or SimRank): Favors pure objects (i.e., objects with highly skewed distribution in their in-links or out-links) PathSim [VLDB 11] Note: P-PageRank and SimRank do not distinguish object type nor relationship type Favor peers : objects with strong connectivity and similar visibility under the given meta-path x y 21

22 Which Similarity Measure Is Better? Anhai Doan CS, Wisconsin Database area PhD: 2002 Jignesh Patel CS, Wisconsin Database area PhD: 1998 Meta-Path: Author-Paper-Venue-Paper-Author (APVPA) Amol Deshpande CS, Maryland Database area PhD: 2004 Jun Yang CS, Duke Database area PhD:

23 Outline Why Heterogeneous Information Networks? On the Power of Mining Structured Heterogeneous Info. Networks Clustering & Classification: A Ranking-Based Approach Similarity Search & Meta Path in Heterogeneous Info. Net Prediction & Recommendation: A Meta Path-Based Approach From Unstructured Data to Structured Networks From Unstructured Text to Structured Hierarchies Type Extraction and Role Discovery: Enrich Network Structures Truth Analysis: Enhance Quality of Structured Network Looking Forward to the Future 23

24 PathPredict: Meta-Path Based Relationship Prediction Meta path-guided prediction of links and relationships vs. Philosophy: Meta path relationships among similar typed links share similar semantics and are comparable and inferable venue publish publish -1 mention topic -1 write paper -1 mention write contain/contain -1 cite/cite -1 author Bibliographic network: Co-author prediction (A P A) Use topological features encoded by meta paths, e.g., citation relations between authors (A P P A) 24

25 Meta-Path Based Co-authorship Prediction Co-authorship prediction problem Whether two authors are going to collaborate for the first time Co-authorship encoded in meta-path Author-Paper-Author Topological features encoded in meta-paths Meta-Path Semantic Meaning Meta-paths between authors under length 4 25

26 The Power of PathPredict: Experiments Explain the prediction power of each meta-path Wald Test for logistic regression Higher prediction accuracy than using projected homogeneous network 11% higher in prediction accuracy Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors! (Feature collected in [1996, 2002]; Test period in [2003,2009]) 26

27 Enhancing the Power of Recommender Systems by Heterog. Info. Network Analysis Collaborative filtering methods suffer from the data sparsity issue Avatar Zoe Saldana Aliens Adventure Titanic James Cameron Leonardo Dicaprio Revolutionary Road Kate Winslet A small set of users & items have a large number of ratings Romance # of ratings Most users and items have a small number of ratings # of users or items Personalized recommendation with heterog. Networks [WSDM 14] Heterogeneous relationships complement each other Users and items with limited feedback can be connected to the network by different types of paths Connect new users or items in the information network Different users may require different models: Relationship heterogeneity makes personalized recommendation models easier to define 27

28 Performance Study: Personalized Recommendation in Heterogeneous Networks Datasets: Methods to be compared: Popularity: Recommend the most popular items to users Co-click: Conditional probabilities between items NMF: Non-negative matrix factorization on user feedback Hybrid-SVM: Use Rank-SVM with plain features (utilize both user feedback and information network) Winner: HeteRec personalized recommendation (HeteRec-p) 28

29 Outline Why Heterogeneous Information Networks? On the Power of Mining Structured Heterogeneous Info. Networks Clustering & Classification: A Ranking-Based Approach Similarity Search & Meta Path in Heterogeneous Info. Net Prediction & Recommendation: A Meta Path-Based Approach From Unstructured Data to Structured Networks From Unstructured Text to Structured Hierarchies Type Extraction and Role Discovery: Enrich Network Structures Truth Analysis: Enhance Quality of Structured Network Looking Forward to the Future 29

30 Automated Construction of Heterogeneous Info. Networks Rich semantics can be mined from structured heterog. networks Unfortunately, much of the real world data is unstructured News, Wikipedia, blogs, tweets, multimedia data, Too expensive to be structured by human: Automated & scalable How to turn unstructured data to structured heterog. Networks? From Unstructured Text to Structured Hierarchies Type Extraction and Role Discovery: Enrich Network Structures Truth Analysis: Enhance Quality of Structured Network Methodology: Incrementally & progressively explore links and structures to construct quality heterogeneous networks 30

31 From Unstructured Text to Structured Hierarchies Entity text Input collection Our recent effort: 1. Hierarchical topic discovery with entities 2. Phrase mining 3. Rank phrases & entities per topic Output hierarchy w/ phrases & entities Kert [Danilevsky et al, SDM 14]: LDA + Keyphrase extraction and ranking by topic (4 criteria: coverage, purity, phraseness, completeness) CATHTY [Wang, et al., KDD 13]: Recursive generation of topic hierarchy based on term co-occurrence network CATHYHIN [Wang, et al., ICDM 13]: Use hetero. network to help topic hierarchy generation TopMine [El-Kishky et al., 2014]: Exploring frequent contiguous pattern mining o/1/1 o/1 o o/1/2 o/2 o/2/1 Hierarchy so generated can be used for network mining, exploration, etc. 31

32 Automatic Discovery of Clusters and Term Hierarchies from Collection of Documents KERT: Keyphrase Extraction and Ranking by Topic [SDM 14] 1. Use background LDA (blda) to cluster unigrams into T topics + one background topic 2. Extract candidate keyphrases for each foreground topic according to the word topic assignment should be order-free and non-consecutive E.g., mining frequent patterns vs. frequent pattern mining E.g., mining top-k frequent closed patterns 3. Rank the keyphrases in each foreground topic Coverage: information retrieval vs. cross-language information retrieval Purity: only frequent in documents about topic t Phraseness: active learning vs. learning classification Completeness: vector machine vs. support vector machine 32

33 Experiment: Machine Learning: Top Keyphrases Dataset: DBLP paper titles KERT variations: ignore each criterion in turn Baselines: two variations of an approach from a recent work 33

34 CATHYHIN: Topic Hierarchy Construction by Integration of Heterogeneous Info. Networks Using DBLP heterog. Info. network to enhance topical hierarchy generation Hierarchies generated not only on topical phrases but also on authors & venues 34

35 TopMine: Phrase Mining Then Top Modeling Frequent Contiguous Phrase Mining Frequency: A low frequency phrase is likely unimportant --- a generalization of the most likely unigrams in LDA Collocation: The co-occurrence of tokens in such frequency that is significantly higher than expected due to chance. Completeness: Do not double count sub-phrases (e.g. support vector is not a phrase when inside support vector machine) General frequent pattern mining may place words far apart in the same phrase The pattern for a phrase: each phrase is a contiguous sequence of ordered words. Markov Blanket Feature Selection for Support Vector Machines 35

36 Good Phrases Should Be Stat. Significant Under the assumption that the text data was generated by a sequence of Bernoulli trials, we can derive a significance measure for potential mergings. Good Phrases! 36

37 Bag-of-Phrase Construction: An Example 37

38 The ToPMine Framework 1. Preprocessing: Perform stemming to reduce vocabulary size and increase collisions of phrases, split on phrase-invariant punctuation 2. Frequent Contiguous Pattern Mining: Collects aggregate counts and candidate phrases 3. Phrase Construction: Agglomerative construction of significant phrases (addresses frequency, collocation, and completeness) 4. Use the phrase construction of bag of phrases : This can be considered an enriched input for later algorithms (not exclusive to topic modeling) 5. Bag-of-phrases Topic Modeling 38

39 Experiments on AP News 39

40 Experiments on Yelp Reviews 40

41 Scalability: Runtime Results 41

42 Outline Why Heterogeneous Information Networks? On the Power of Mining Structured Heterogeneous Info. Networks Clustering & Classification: A Ranking-Based Approach Similarity Search & Meta Path in Heterogeneous Info. Net Prediction & Recommendation: A Meta Path-Based Approach From Unstructured Data to Structured Networks From Unstructured Text to Structured Hierarchies Type Extraction and Role Discovery: Enrich Network Structures Truth Analysis: Enhance Quality of Structured Network Looking Forward to the Future 42

43 Role Discovery: Mining Advisor-Advisee Relationships in DBLP Network Propagation of simple, commonly accepted constraints in Time- Constrained Probabilistic Factor Graph (TPFG) Advisor has more publications and longer history than advisee at the time of advising Once an advisee becomes advisor, s/he will not become advisee again Input: Temporal collaboration network 1999 Output: Relationship analysis (0.9, [/, 1998]) Visualized chorological hierarchies Ada 2000 Bob Ada (0.4, [/, 1998]) (0.5, [/, 2000]) 2000 (0.8, [1999,2000]) Ying Smith Jerry Bob (0.7, [2000, 2001]) (0.65, [2002, 2004]) Ying (0.49, [/, 1999]) Jerry (0.2, [2001, 2003]) 2004 Smith 43

44 Role Discovery: Performance & Case Study DBLP data: 654, 628 authors, 1076,946 publications, years provided Labeled data: MathGealogy Project; AI Gealogy Project; Homepage Datasets RULE SVM IndMAX TPFG TEST1 69.9% 73.4% 75.2% 78.9% 80.2% 84.4% TEST2 69.8% 74.6% 74.6% 79.0% 81.5% 84.3% TEST3 80.6% 86.7% 83.1% 90.9% 88.8% 91.3% Case study heuristics Supervised learning Empirical parameter Advisee Top Ranked Advisor Time Note optimized parameter David M. Blei 1. Michael I. Jordan PhD advisor, 2004 grad 2. John D. Lafferty Postdoc, 2006 Hong Cheng 1. Qiang Yang MS advisor, Jiawei Han PhD advisor, 2008 Sergey Brin 1. Rajeev Motawani Unofficial advisor 44

45 Hierarchical Relationship Discovery From partially ordered objects to hierarchy (tree) Based on NLP or other techniques to extract partially ordered objects Using constraints to discover relationships Singleton Potential Type Cognitive description Potential definition Homophile Polarity Support pattern Forbidden pattern Parent and child are similar Parent is superior to child Patterns frequently occurring with child-parent pairs Patterns rarely occurring with child-parent pairs Pairwise Potential Function: Cases Type Cognitive description Potential definition Attribute Use inherited attributes augment from parents or children Label Similar nodes share similar propagate parents (or children) Patterns altering in childparent & parent-child pairs Reciprocity Constraints Restrict certain patterns Discovery of the Kenny Family Tree 45

46 Outline Why Heterogeneous Information Networks? On the Power of Mining Structured Heterogeneous Info. Networks Clustering & Classification: A Ranking-Based Approach Similarity Search & Meta Path in Heterogeneous Info. Net Prediction & Recommendation: A Meta Path-Based Approach From Unstructured Data to Structured Networks From Unstructured Text to Structured Hierarchies Type Extraction and Role Discovery: Enrich Network Structures Truth Analysis: Enhance Quality of Structured Network Looking Forward to the Future 46

47 Truth Analysis: Enhancing the Quality of Heterogeneous Info. Networks Info. networks could be untrustworthy, error-prone, missing, TruthFinder [KDD 07]: Inference on trustworthiness by constructing heterogeneous info. networks Web sites (Book Sellers) Facts (Book Authors) w 1 f 1 Objects (Books) o 1 w 2 f 2 w 3 f 3 o 2 w 4 f 4 True facts and trustable websites mutually enhance each other and will become apparent after some iterations 47

48 Truth Discovery: Multiple Truth Value and Handling False Negatives Voting may not always work well: Some sources tend to miss true values (False Negatives), while some others tend to produce false claims (False Positives) Why Latent Truth Model (LTM)? Modeling two-sided quality to support multiple true values per entity for truth finding [VLDB 12] Generating Implicit Negative Claims: Positive Claim High Precision, High Recall IMDB Negative Claim Correct Claim Incorrect Claim High Precision, Low Recall Netflix Low Precision, Low Recall BadSource Harry Potter 48

49 Truth Discovery: Effectiveness of Latent Truth Model Experimental datasets: Large and real Book Authors from abebooks.com (1263 books, 879 sources, claims, 2420 book-author, 100 labeled) Movie Directors from Bing (15073 movies, 12 sources, claims, movie-director, 100 labeled) Effectiveness of Latent Truth Model: Model source quality in other data integration tasks, e.g. entity resolution. Trustworthiness in multi-genre networks (text-rich networks, social networks, etc.) 49

50 Outline Why Heterogeneous Information Networks? On the Power of Mining Structured Heterogeneous Info. Networks Clustering & Classification: A Ranking-Based Approach Similarity Search & Meta Path in Heterogeneous Info. Net Prediction & Recommendation: A Meta Path-Based Approach From Unstructured Data to Structured Networks From Unstructured Text to Structured Hierarchies Type Extraction and Role Discovery: Enrich Network Structures Truth Analysis: Enhance Quality of Structured Network Looking Forward to the Future 50

51 On-Going and Future Work From small, structured data to real, big, unstructured data From the DBLP network to networks constructed from news, PubMed, tweets, Construction of quality, semi-structured heterogeneous networks from massive, real data sets Integration information network with data/text cube and OLAP technologies Data/text cubes have been shown its power in multidimensional OLAP analysis, e.g., EventCube Integration of heterogeneous network mining OLAP mining on multi-dimensional social and information networks Network exploration of mission- or query-based hidden networks 51

52 From Data Mining to Mining Info. Networks Han, Kamber and Pei, Data Mining, 3 rd ed Yu, Han and Faloutsos (eds.), Link Mining, 2010 Sun and Han, Mining Heterogeneous Information Networks,

53 Conclusions Heterogeneous information networks are ubiquitous Most datasets can be organized or transformed into structured multi-typed heterogeneous info. networks Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, Surprisingly rich knowledge can be mined from structured heterogeneous info. networks Clustering, ranking, classification, path prediction, Knowledge is power, but knowledge is hidden in massive, but relatively structured nodes and links! Key issue: Construction of trusted, semi-structured heterogeneous networks from unstructured data From data to knowledge: Much more to be explored but heterogeneous network mining has shown high promise! 53

54 References M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11. Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012 Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT 09 Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD 09 Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks, VLDB'11 Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11 Y. Sun, J. Han, C. C. Aggarwal, N. Chawla, When Will It Happen? Relationship Prediction in Heterogeneous Information Networks, WSDM'12 F. Tao, et al., EventCube: Multi-Dimensional Search and Mining of Structured and Text Data, (system demo) KDD 13 C. Wang, J. Han, et al., Mining Advisor-Advisee Relationships from Research Publication Networks", KDD'10 C. Wang, M. Danilevsky, et al., A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy, KDD 13 C. Wang, Marina Danilevsky, et al., "Constructing Topical Hierarchies in Heterogeneous Information Networks", ICDM'13 54

Mining Trusted Information in Medical Science: An Information Network Approach

Mining Trusted Information in Medical Science: An Information Network Approach Mining Trusted Information in Medical Science: An Information Network Approach Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign Collaborated with many, especially Yizhou

More information

ResearchNet and NewsNet: Two Nets That May Capture Many Lovely Birds

ResearchNet and NewsNet: Two Nets That May Capture Many Lovely Birds ResearchNet and NewsNet: Two Nets That May Capture Many Lovely Birds Jiawei Han Data Mining Research Group, Computer Science University of Illinois at Urbana-Champaign Acknowledgements: NSF, ARL, NIH,

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Recommender Systems II Instructor: Yizhou Sun yzsun@cs.ucla.edu May 31, 2017 Recommender Systems Recommendation via Information Network Analysis Hybrid Collaborative Filtering

More information

Personalized Entity Recommendation: A Heterogeneous Information Network Approach

Personalized Entity Recommendation: A Heterogeneous Information Network Approach Personalized Entity Recommendation: A Heterogeneous Information Network Approach Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, Jiawei Han University of

More information

Jiawei Han University of Illinois at Urbana Champaign

Jiawei Han University of Illinois at Urbana Champaign 1 Web Structure Mining and Information Network Analysis: An Integrated Approach Jiawei Han University of Illinois at Urbana Champaign Collaborated with many students in my group, especially Tim Weninger,

More information

Data Mining: Dynamic Past and Promising Future

Data Mining: Dynamic Past and Promising Future SDM@10 Anniversary Panel: Data Mining: A Decade of Progress and Future Outlook Data Mining: Dynamic Past and Promising Future Jiawei Han Department of Computer Science University of Illinois at Urbana

More information

HetPathMine: A Novel Transductive Classification Algorithm on Heterogeneous Information Networks

HetPathMine: A Novel Transductive Classification Algorithm on Heterogeneous Information Networks HetPathMine: A Novel Transductive Classification Algorithm on Heterogeneous Information Networks Chen Luo 1,RenchuGuan 1, Zhe Wang 1,, and Chenghua Lin 2 1 College of Computer Science and Technology, Jilin

More information

User Guided Entity Similarity Search Using Meta-Path Selection in Heterogeneous Information Networks

User Guided Entity Similarity Search Using Meta-Path Selection in Heterogeneous Information Networks User Guided Entity Similarity Search Using Meta-Path Selection in Heterogeneous Information Networks Xiao Yu, Yizhou Sun, Brandon Norick, Tiancheng Mao, Jiawei Han Computer Science Department University

More information

Graph Classification in Heterogeneous

Graph Classification in Heterogeneous Title: Graph Classification in Heterogeneous Networks Name: Xiangnan Kong 1, Philip S. Yu 1 Affil./Addr.: Department of Computer Science University of Illinois at Chicago Chicago, IL, USA E-mail: {xkong4,

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 1 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights

More information

IVIS: SEARCH AND VISUALIZATION ON HETEROGENEOUS INFORMATION NETWORKS YINTAO YU THESIS

IVIS: SEARCH AND VISUALIZATION ON HETEROGENEOUS INFORMATION NETWORKS YINTAO YU THESIS c 2010 Yintao Yu IVIS: SEARCH AND VISUALIZATION ON HETEROGENEOUS INFORMATION NETWORKS BY YINTAO YU THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

Chapter 1 Introduction

Chapter 1 Introduction Chapter 1 Introduction Abstract In this chapter, we introduce some basic concepts and definitions in heterogeneous information network and compare the heterogeneous information network with other related

More information

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Effective Latent Space Graph-based Re-ranking Model with Global Consistency Effective Latent Space Graph-based Re-ranking Model with Global Consistency Feb. 12, 2009 1 Outline Introduction Related work Methodology Graph-based re-ranking model Learning a latent space graph A case

More information

c 2012 by Yizhou Sun. All rights reserved.

c 2012 by Yizhou Sun. All rights reserved. c 2012 by Yizhou Sun. All rights reserved. MINING HETEROGENEOUS INFORMATION NETWORKS BY YIZHOU SUN DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org J.

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,

More information

WE know that most real systems usually consist of a

WE know that most real systems usually consist of a IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 29, NO. 1, JANUARY 2017 17 A Survey of Heterogeneous Information Network Analysis Chuan Shi, Member, IEEE, Yitong Li, Jiawei Zhang, Yizhou Sun,

More information

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks Yu Shi, Huan Gui, Qi Zhu, Lance Kaplan, Jiawei Han University of Illinois at Urbana-Champaign (UIUC) Facebook Inc. U.S. Army Research

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Citation Prediction in Heterogeneous Bibliographic Networks

Citation Prediction in Heterogeneous Bibliographic Networks Citation Prediction in Heterogeneous Bibliographic Networks Xiao Yu Quanquan Gu Mianwei Zhou Jiawei Han University of Illinois at Urbana-Champaign {xiaoyu1, qgu3, zhou18, hanj}@illinois.edu Abstract To

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Integrating Meta-Path Selection with User-Preference for Top-k Relevant Search in Heterogeneous Information Networks

Integrating Meta-Path Selection with User-Preference for Top-k Relevant Search in Heterogeneous Information Networks Integrating Meta-Path Selection with User-Preference for Top-k Relevant Search in Heterogeneous Information Networks Shaoli Bu bsl89723@gmail.com Zhaohui Peng pzh@sdu.edu.cn Abstract Relevance search in

More information

TriRank: Review-aware Explainable Recommendation by Modeling Aspects

TriRank: Review-aware Explainable Recommendation by Modeling Aspects TriRank: Review-aware Explainable Recommendation by Modeling Aspects Xiangnan He, Tao Chen, Min-Yen Kan, Xiao Chen National University of Singapore Presented by Xiangnan He CIKM 15, Melbourne, Australia

More information

Analysis of Large Graphs: TrustRank and WebSpam

Analysis of Large Graphs: TrustRank and WebSpam Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Mining Heterogeneous Information Networks

Mining Heterogeneous Information Networks ACM SIGKDD Conference Tutorial, Washington, D.C., July 25, 2010 Mining Heterogeneous Information Networks Jiawei Han Yizhou Sun Xifeng Yan Philip S. Yu University of Illinois at Urbana Champaign University

More information

Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar

Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar School of Computer Science Faculty of Science University of Windsor How Big is this Big Data? 40 Billion Instagram Photos 300 Hours

More information

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park Link Mining & Entity Resolution Lise Getoor University of Maryland, College Park Learning in Structured Domains Traditional machine learning and data mining approaches assume: A random sample of homogeneous

More information

Mining Knowledge from Databases: An Information Network Analysis Approach

Mining Knowledge from Databases: An Information Network Analysis Approach ACM SIGMOD Conference Tutorial, Indianapolis, Indiana, June 8, 2010 Mining Knowledge from Databases: An Information Network Analysis Approach Jiawei Han Yizhou Sun Xifeng Yan Philip S. Yu University of

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #10: Link Analysis-2 Seoul National University 1 In This Lecture Pagerank: Google formulation Make the solution to converge Computing Pagerank for very large graphs

More information

Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs

Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs Authors: Andreas Wagner, Veli Bicer, Thanh Tran, and Rudi Studer Presenter: Freddy Lecue IBM Research Ireland 2014 International

More information

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010 INFORMATICS SEMINAR SEPT. 27 & OCT. 4, 2010 Introduction to Semi-Supervised Learning Review 2 Overview Citation X. Zhu and A.B. Goldberg, Introduction to Semi- Supervised Learning, Morgan & Claypool Publishers,

More information

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Jianyong Wang Department of Computer Science and Technology Tsinghua University Jianyong Wang Department of Computer Science and Technology Tsinghua University jianyong@tsinghua.edu.cn Joint work with Wei Shen (Tsinghua), Ping Luo (HP), and Min Wang (HP) Outline Introduction to entity

More information

Mining Diversity on Networks

Mining Diversity on Networks Mining Diversity on Networks Lu Liu 1, Feida Zhu 3, Chen Chen 2, Xifeng Yan 4, Jiawei Han 2, Philip Yu 5, and Shiqiang Yang 1 1 Tsinghua University, 2 University of Illinois at Urbana-Champaign 3 Singapore

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/6/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 High dim. data Graph data Infinite data Machine

More information

Chapter 2 Survey of Current Developments

Chapter 2 Survey of Current Developments Chapter 2 Survey of Current Developments Abstract Heterogeneous information network (HIN) provides a new paradigm to manage networked data. Meanwhile, it also introduces new challenges for many data mining

More information

A Machine Learning Approach for Information Retrieval Applications. Luo Si. Department of Computer Science Purdue University

A Machine Learning Approach for Information Retrieval Applications. Luo Si. Department of Computer Science Purdue University A Machine Learning Approach for Information Retrieval Applications Luo Si Department of Computer Science Purdue University Why Information Retrieval: Information Overload: Since the introduction of digital

More information

Computer-based Tracking Protocols: Improving Communication between Databases

Computer-based Tracking Protocols: Improving Communication between Databases Computer-based Tracking Protocols: Improving Communication between Databases Amol Deshpande Database Group Department of Computer Science University of Maryland Overview Food tracking and traceability

More information

Graph Cube: On Warehousing and OLAP Multidimensional Networks

Graph Cube: On Warehousing and OLAP Multidimensional Networks Graph Cube: On Warehousing and OLAP Multidimensional Networks Peixiang Zhao, Xiaolei Li, Dong Xin, Jiawei Han Department of Computer Science, UIUC Groupon Inc. Google Cooperation pzhao4@illinois.edu, hanj@cs.illinois.edu

More information

CSE5243 INTRO. TO DATA MINING

CSE5243 INTRO. TO DATA MINING CSE5243 INTRO. TO DATA MINING Chapter 1. Introduction Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han CSE 5243. Course Page & Schedule Class Homepage:

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

Slides based on those in:

Slides based on those in: Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 2 y 0.8 ½+0.2 ⅓ M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N]

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

OLAP on Information Networks: a new Framework for Dealing with Bibliographic Data

OLAP on Information Networks: a new Framework for Dealing with Bibliographic Data OLAP on Information Networks: a new Framework for Dealing with Bibliographic Data Wararat Jakawat, C. Favre, Sabine Loudcher To cite this version: Wararat Jakawat, C. Favre, Sabine Loudcher. OLAP on Information

More information

Tutorial on Mining Heterogeneous Information Networks

Tutorial on Mining Heterogeneous Information Networks Tutorial on Mining Heterogeneous Information Networks Rokia Missaoui LARIM Université du Québec en Outaouais, Canada http://w3.uqo.ca/missaoui 1 Acknowledgement I am grateful to Professor Jiawei Han who

More information

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges Abhishek Santra 1 and Sanjukta Bhowmick 2 1 Information Technology Laboratory, CSE Department, University of

More information

A. Papadopoulos, G. Pallis, M. D. Dikaiakos. Identifying Clusters with Attribute Homogeneity and Similar Connectivity in Information Networks

A. Papadopoulos, G. Pallis, M. D. Dikaiakos. Identifying Clusters with Attribute Homogeneity and Similar Connectivity in Information Networks A. Papadopoulos, G. Pallis, M. D. Dikaiakos Identifying Clusters with Attribute Homogeneity and Similar Connectivity in Information Networks IEEE/WIC/ACM International Conference on Web Intelligence Nov.

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

Quotient Cube: How to Summarize the Semantics of a Data Cube

Quotient Cube: How to Summarize the Semantics of a Data Cube Quotient Cube: How to Summarize the Semantics of a Data Cube Laks V.S. Lakshmanan (Univ. of British Columbia) * Jian Pei (State Univ. of New York at Buffalo) * Jiawei Han (Univ. of Illinois at Urbana-Champaign)

More information

Query Independent Scholarly Article Ranking

Query Independent Scholarly Article Ranking Query Independent Scholarly Article Ranking Shuai Ma, Chen Gong, Renjun Hu, Dongsheng Luo, Chunming Hu, Jinpeng Huai SKLSDE Lab, Beihang University, China Beijing Advanced Innovation Center for Big Data

More information

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University  Infinite data. Filtering data streams /9/7 Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them

More information

DATA MINING RESEARCH: RETROSPECT AND PROSPECT

DATA MINING RESEARCH: RETROSPECT AND PROSPECT DATA MINING RESEARCH: RETROSPECT AND PROSPECT Prof(Dr).V.SARAVANAN & Mr. ABDUL KHADAR JILANI Department of Computer Science College of Computer and Information Sciences Majmaah University Kingdom of Saudi

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

Object Distinction: Distinguishing Objects with Identical Names

Object Distinction: Distinguishing Objects with Identical Names Object Distinction: Distinguishing Objects with Identical Names Xiaoxin Yin Univ. of Illinois xyin1@uiuc.edu Jiawei Han Univ. of Illinois hanj@cs.uiuc.edu Philip S. Yu IBM T. J. Watson Research Center

More information

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer Data Mining George Karypis Department of Computer Science Digital Technology Center University of Minnesota, Minneapolis, USA. http://www.cs.umn.edu/~karypis karypis@cs.umn.edu Overview Data-mining What

More information

RankCompete: Simultaneous Ranking and Clustering of Information Networks

RankCompete: Simultaneous Ranking and Clustering of Information Networks RankCompete: Simultaneous Ranking and Clustering of Information Networks Liangliang Cao a, Xin Jin b, Zhijun Yin b, Andrey Del Pozo a, Jiebo Luo c, Jiawei Han b, Thomas S. Huang a a Beckman Institute and

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Heterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts

Heterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts Heterogeneous Graph-Based Intent Learning with Queries, Web Pages and Wikipedia Concepts Xiang Ren, Yujing Wang, Xiao Yu, Jun Yan, Zheng Chen, Jiawei Han University of Illinois, at Urbana Champaign MicrosoD

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012 Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data Fall 2012 Data Warehousing and OLAP Introduction Decision Support Technology On Line Analytical Processing Star Schema

More information

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech CSE 6242/ CX 4242 Feb 18, 2014 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety Abhishek

More information

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set To Enhance Scalability of Item Transactions by Parallel and Partition using Dynamic Data Set Priyanka Soni, Research Scholar (CSE), MTRI, Bhopal, priyanka.soni379@gmail.com Dhirendra Kumar Jha, MTRI, Bhopal,

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Searching complex graphs

Searching complex graphs Searching complex graphs complex graph data Big volume: huge number of nodes/links Big variety: complex, heterogeneous schema Big velocity: e.g., frequently updated Noisy, ambiguous attributes and values

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala Master Project Various Aspects of Recommender Systems May 2nd, 2017 Master project SS17 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue

More information

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman http://www.mmds.org Overview of Recommender Systems Content-based Systems Collaborative Filtering J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive

More information

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today 3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

More information

Mining Query-Based Subnetwork Outliers in Heterogeneous Information Networks

Mining Query-Based Subnetwork Outliers in Heterogeneous Information Networks Mining Query-Based Subnetwork Outliers in Heterogeneous Information Networks Honglei Zhuang, Jing Zhang 2, George Brova, Jie Tang 2, Hasan Cam 3, Xifeng Yan 4, Jiawei Han University of Illinois at Urbana-Champaign

More information

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Abstract Mrs. C. Poongodi 1, Ms. R. Kalaivani 2 1 PG Student, 2 Assistant Professor, Department of

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2011 Han, Kamber & Pei. All rights

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

arxiv: v1 [cs.mm] 12 Jan 2016

arxiv: v1 [cs.mm] 12 Jan 2016 Learning Subclass Representations for Visually-varied Image Classification Xinchao Li, Peng Xu, Yue Shi, Martha Larson, Alan Hanjalic Multimedia Information Retrieval Lab, Delft University of Technology

More information

Information Retrieval

Information Retrieval Introduction Information Retrieval Information retrieval is a field concerned with the structure, analysis, organization, storage, searching and retrieval of information Gerard Salton, 1968 J. Pei: Information

More information

Bring Semantic Web to Social Communities

Bring Semantic Web to Social Communities Bring Semantic Web to Social Communities Jie Tang Dept. of Computer Science, Tsinghua University, China jietang@tsinghua.edu.cn April 19, 2010 Abstract Recently, more and more researchers have recognized

More information

TupleRank: Ranking Relational Databases using Random Walks on Extended K-partite Graphs

TupleRank: Ranking Relational Databases using Random Walks on Extended K-partite Graphs TupleRank: Ranking Relational Databases using Random Walks on Extended K-partite Graphs Jiyang Chen, Osmar R. Zaïane, Randy Goebel, Philip S. Yu 2 Department of Computing Science, University of Alberta

More information

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Empowering People with Knowledge the Next Frontier for Web Search Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Important Trends for Web Search Organizing all information Addressing user

More information

Personalized Recommendations using Knowledge Graphs. Rose Catherine Kanjirathinkal & Prof. William Cohen Carnegie Mellon University

Personalized Recommendations using Knowledge Graphs. Rose Catherine Kanjirathinkal & Prof. William Cohen Carnegie Mellon University + Personalized Recommendations using Knowledge Graphs Rose Catherine Kanjirathinkal & Prof. William Cohen Carnegie Mellon University + The Problem 2 n Generate content-based recommendations on sparse real

More information

GraphGAN: Graph Representation Learning with Generative Adversarial Nets

GraphGAN: Graph Representation Learning with Generative Adversarial Nets The 32 nd AAAI Conference on Artificial Intelligence (AAAI 2018) New Orleans, Louisiana, USA GraphGAN: Graph Representation Learning with Generative Adversarial Nets Hongwei Wang 1,2, Jia Wang 3, Jialin

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Mining Latent Entity Structures

Mining Latent Entity Structures MORGAN& CLAYPOOL PUBLISHERS Mining Latent Entity Structures Chi Wang Jiawei Han SyntheSiS LectureS on Data Mining and KnowLeDge DiScovery Jiawei Han, Lise Getoor, Wei Wang, Johannes Gehrke, Robert Grossman,

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

An Overview of Data Warehousing and OLAP Technology

An Overview of Data Warehousing and OLAP Technology An Overview of Data Warehousing and OLAP Technology What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation lecture 2 1 What is Data Warehouse?

More information

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Clustering Results. Result List Example. Clustering Results. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

More information

Plagiarism Detection Using FP-Growth Algorithm

Plagiarism Detection Using FP-Growth Algorithm Northeastern University NLP Project Report Plagiarism Detection Using FP-Growth Algorithm Varun Nandu (nandu.v@husky.neu.edu) Suraj Nair (nair.sur@husky.neu.edu) Supervised by Dr. Lu Wang December 10,

More information

Tribhuvan University Institute of Science and Technology MODEL QUESTION

Tribhuvan University Institute of Science and Technology MODEL QUESTION MODEL QUESTION 1. Suppose that a data warehouse for Big University consists of four dimensions: student, course, semester, and instructor, and two measures count and avg-grade. When at the lowest conceptual

More information

Supervised Models for Multimodal Image Retrieval based on Visual, Semantic and Geographic Information

Supervised Models for Multimodal Image Retrieval based on Visual, Semantic and Geographic Information Supervised Models for Multimodal Image Retrieval based on Visual, Semantic and Geographic Information Duc-Tien Dang-Nguyen, Giulia Boato, Alessandro Moschitti, Francesco G.B. De Natale Department of Information

More information

Leveraging Knowledge Graphs for Web-Scale Unsupervised Semantic Parsing. Interspeech 2013

Leveraging Knowledge Graphs for Web-Scale Unsupervised Semantic Parsing. Interspeech 2013 Leveraging Knowledge Graphs for Web-Scale Unsupervised Semantic Parsing LARRY HECK, DILEK HAKKANI-TÜR, GOKHAN TUR Focus of This Paper SLU and Entity Extraction (Slot Filling) Spoken Language Understanding

More information