Graph Mining and Social Network Analysis

Similar documents
Overlay (and P2P) Networks

Mining Web Data. Lijun Zhang

Part I: Data Mining Foundations

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Data Mining in Bioinformatics Day 5: Graph Mining

Pattern Mining in Frequent Dynamic Subgraphs

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture notes for April 6, 2005

gspan: Graph-Based Substructure Pattern Mining

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

An Improved Apriori Algorithm for Association Rules

Mining Web Data. Lijun Zhang

Midterm Review NETS 112

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park

Data Mining in Bioinformatics Day 3: Graph Mining

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

The k-means Algorithm and Genetic Algorithm

Chapter 1. Social Media and Social Computing. October 2012 Youn-Hee Han

A new approach of Association rule mining algorithm with error estimation techniques for validate the rules

Chapter 4 Data Mining A Short Introduction

Mining frequent Closed Graph Pattern

Data Mining Course Overview

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Association mining rules

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm

Object-Based Classification & ecognition. Zutao Ouyang 11/17/2015

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

G(B)enchmark GraphBench: Towards a Universal Graph Benchmark. Khaled Ammar M. Tamer Özsu

MINING CONCEPT IN BIG DATA

Comparative Study of Subspace Clustering Algorithms

A Review on Cluster Based Approach in Data Mining

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Theme Identification in RDF Graphs

Small-World Models and Network Growth Models. Anastassia Semjonova Roman Tekhov

Mining Frequent Patterns without Candidate Generation

Chapters 11 and 13, Graph Data Mining

An overview of Graph Categories and Graph Primitives

Data mining, 4 cu Lecture 8:

Graph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web

Categorization of Sequential Data using Associative Classifiers

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

Improved Frequent Pattern Mining Algorithm with Indexing

Graph Classification in Heterogeneous

CAIM: Cerca i Anàlisi d Informació Massiva

Topic 1 Classification Alternatives

A SURVEY OF DATA MINING & ITS APPLICATIONS

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

Web Structure Mining Community Detection and Evaluation

Case Study: Social Network Analysis. Part II

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

A Keypoint Descriptor Inspired by Retinal Computation

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Improved Apriori Algorithms- A Survey

Clustering in Data Mining

Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

Basic Data Mining Technique

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

CS512 (Spring 2012) Advanced Data Mining : Midterm Exam I

Community detection algorithms survey and overlapping communities. Presented by Sai Ravi Kiran Mallampati

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

A Cloud Framework for Big Data Analytics Workflows on Azure

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Absorbing Random walks Coverage

Absorbing Random walks Coverage

Clustering part II 1

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

The Transpose Technique to Reduce Number of Transactions of Apriori Algorithm

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges

Degree Distribution: The case of Citation Networks

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

Topology Enhancement in Wireless Multihop Networks: A Top-down Approach

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm?

DATA MINING II - 1DL460. Spring 2014"

An Algorithm for Mining Large Sequences in Databases

Chapter 4: Mining Frequent Patterns, Associations and Correlations

DATA MINING - 1DL105, 1DL111

An Efficient Clustering for Crime Analysis

Online Social Networks and Media

This paper proposes: Mining Frequent Patterns without Candidate Generation

数据挖掘 Introduction to Data Mining

Implementation of Data Mining for Vehicle Theft Detection using Android Application

Discovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing

CS6220: DATA MINING TECHNIQUES

Mining Association Rules in Large Databases

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

Epilog: Further Topics

Arabesque. A system for distributed graph mining. Mohammed Zaki, RPI

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Distance-based Methods: Drawbacks

Transcription:

Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management Systems (Second Edition) " Chapter 9

Graph Mining

Graph Mining Overview q Graphs are becoming increasingly important to model many phenomena in a large class of domains (e.g., bioinformatics, computer vision, social analysis) q To deal with these needs, many data mining approaches have been extended also to graphs and trees q Major approaches " Mining frequent subgraphs " Indexing " Similarity search " Classification " Clustering

Mining frequent subgraphs q Given a labeled graph data set D = {G 1,G 2,,G n } q We define support(g) as the percentage of graphs in D where g is a subgraph q A frequent subgraph in D is a subgraph with a support greater than min_sup q How to find frequent subgraph? " Apriori-based approach " Pattern-growth approach

AprioriGraph q Apply a level-wise iterative algorithm 1. Choose two similar size-k frequent subgraphs in S 2. Merge two similar subgraphs in a size-(k+1) subgraph 3. If the new subgraph is frequent add to S 4. Restart from 2. until all similar subgraphs have been considered. Otherwise restart from 1. and move to k+1. q What is subgraph size? " Number of vertex " Number of edges " Number of edge-disjoint paths q Two subgraphs of size-k are similar if they have the same size-(k-1) subgraph q AprioriGraph has a big computational cost (due to the merging step)

PatternGrowthGraph q Incrementally extend frequent subgraphs 1. Add to S each frequent subgraphs g E obtained by extending subgraph g 2. Until S is not empty, select a new subgraph g in S to extend and start from 1. q How to extend a subgraph? " Add a vertex " Add an edge q The same graph can be discovered many times! " Get rid of duplicates once discovered " Reduce the generation of duplicates

Mining closed, unlabeled, and constrained subgraphs q Closed subgraphs " G is closed iff there is no proper supergraph G with the same support of G " Reduce the growth of subgraphs discovered " Is a more compact representation of knowledge q Unlabeled (or partially labeled) graphs " Introduce a special label Φ " Φ can match any label or only itself q Constrained subgraphs " Containment constraint (edges, vertex, subgraphs) " Geometric constraint " Value constraint

Graph Indexing q Indexing is basilar for effective search and query processing q How to index graphs? q Path-based approach takes the path as indexing unit " All the path up to maxl length are indexed " Does not scale very well q gindex approach takes frequent and discriminative subgraphs as indexing unit " A subgraph is frequent if it has a support greater than a threshold " A subgraph is discriminative if its support cannot be well approximated by the intersection of the graph sets that contain one of its subgraphs

Graph Classification and Clustering q Mining of frequent subgraphs can be effectively used for classification and clustering purposes q Classification " Frequent and discriminative subgraphs are used as features to perform the classification task " A subgraph is discriminative if it is frequent only in one class of graphs and infrequent in the others " The threshold on frequency and discriminativeness should be tuned to obtain the desired classification results q Clustering " The mined frequent subgraphs are used to define similarity between graphs " Two graphs that share a large set of patterns should be considered similar and grouped in the same cluster " The threshold on frequency can be tuned to find the desired number of clusters q As the mining step affects heavily the final outcome, this is an intertwined process rather tan a two-steps process

Social Network Analysis

Social Network q A social network is an heterogeneous and multirelational dataset represented by a graph " Vertexes represent the objects (entities) " Edges represent the links (relationships or interaction) " Both objects and links may have attributes " Social networks are usually very large q Social network can be used to represents many real-world phenomena (not necessarily social) " Electrical power grids " Phone calls " Spread of computer virus " WWW

Small World Networks (1) q Are social networks random graphs? q NO! Internet Map Science Coauthorship Protein Network High degree of local clustering Few degrees of separation

Small World Networks (2) Sarah Peter Jane Ralph Society: Six degrees S. Milgram 1967 F. Karinthy 1929 WWW: 19 degrees Albert et al. 1999

Small World Networks (3) q Definitions " Node s degree is the number of incident edges " Network effective diameter is the max distance within 90% of the network q Properties " Densification power law " Shrinking diameter " Heavy-tailed degrees distribution n: number of nodes e: number of eges 1<α<2 Node degrees Preferential attachment model Node rank

Mining social networks (1) q Several Link mining tasks can be identified in the analysis of social networks q Link based object classification " Classification of objects on the basis of its attributes, its links and attributes of objects linked to it " E.g., predict topic of a paper on the basis of Keywords occurrence Citations and cocitations q Link type prediction " Prediction of link type on the basis of objects attributes " E.g., predict if a link between two Web pages is an advertising link or not q Predicting link existence " Predict the presence of a link between two objects

Mining social networks (2) q Link cardinality estimation " Prediction of the number of links to an object " Prediction of the number of objects reachable from a specific object q Object reconciliation " Discover if two objects are the same on the basis of their attributes and links " E.g., predict if two websites are mirrors of each other q Group detection " Clustering of objects on the basis both of their attributes and their links q Subgraph detection " Discover characteristic subgraphs within network

Challenges q Feature construction " Not only the objects attributes need to be considered but also attributes of linked objects " Feature selection and aggregation techniques must be applied to reduce the size of search space q Collective classification and consolidation " Unlabeled data cannot be classified independently " New objects can be correlated and need to be considered collectively to consolidate the current model q Link prediction " The prior probability of link between two objects may be very low q Community mining from multirelational networks " Many approaches assume an homogenous relationship while social networks usually represent different communities and functionalities

Applications q Link Prediction q Viral Marketing

Link prediction q What edges will be added to the network? q Given a snapshot of a network at time t, link prediction aims to predict the edges that will be added before a given future time t q Link prediction is generally solved assigning to each pair of nodes a weight score(x,y) q The higher the score the more likely that link will be added in the near future q The score(x,y) can be computed in several way " Shortest path: the shortest he path between X and Y the highest is their score " Common neighbors: the greater the number of neighbors X and Y have in common, the highest is their score " Ensemble of all paths: weighted sum of paths that connects X and Y (shorter paths have usually larger weights)

Viral Marketing q Several marketing approaches " Mass marketing is targeted on specific segment of customers " Direct marketing is target on specific customers solely on the basis of their characteristics " Viral marketing tries to exploit the social connections to maximize the output of marketing actions q Each customer has a specific network value based on " The number of connections " Its role in the network (e.g., opinion leader, listener) " Role of its connections q Viral marketing aims to exploit the network value of customers to predict their influence and to maximize the outcome of marketing actions

Viral Marketing: Random Spreading q 500 randomly chosen customers are given a product (from 5000).

Viral Marketing: Directed Spreading q The 500 most connected consumers are given a product.