Scalable Mining of Massive Networks: Distance-based Centrality, Similarity, and Influence. Edith Cohen Tel Aviv University
1 Scalable Mining of Massive Networks: Distance-based Centrality, Similarity, and Influence Edith Cohen Tel Aviv University
2 Graph Datasets: represent relations between things. Examples: bowtie structure of the Web [Broder et al. 2000]; dolphin interactions.
3 Graph Datasets. Hyperlinks (the Web). Social graphs (Facebook, Twitter, LinkedIn, ...): friend, follow, like logs, phone call logs, messages. Commerce transactions (Amazon, eBay). Road networks. Communication networks. Protein interactions.
4 Analytics on Graphs. Centralities/Influence: the power/importance/coverage of a node or a set of nodes. Applications: ranking, viral marketing, ... Similarities/Communities: how tightly related are 2 or more nodes. Applications: friend/product recommendations, attribute completion, advertising, prediction.
5 Challenges. Modeling: models must capture reality and be flexible enough to fit the problem (tunable parameters). Scalability: $10^9$+ interactions (edges), $10^8$+ entities (nodes).
6 Centrality, Similarity, Influence. Structural measures (based on the set of interactions). Local measures: depend only on the set of neighbors (paths of length 1). Degree centrality; influence as cardinality of the union of neighbors; similarity by relation of neighbor sets (Jaccard, Adamic/Adar). Advantages: gets the first-order bit; scalable. Disadvantages: limited recall; spammable.
7 Centrality, Similarity, Influence. Structural measures (based on the set of interactions). Global measures: depend on the paths ensemble; higher recall but much less scalable. Random-walk based (geometric/heat kernel): centrality, similarity, influence through (personalized) PageRank. Katz (sum of paths, discounted by length). Resistance: similarity by effective resistance. Distance/reachability-based measures: centrality, similarity, influence.
8 Distance/Reachability-based Measures of Centrality, Influence, Similarity. Closeness centrality: ability of a node to reach other nodes [Bavelas 1950]+++. Influence: ability of a set of nodes to reach others [GLM2001, KKT2003], [Gomez-Rodriguez+ ICML 2011], [Du+ NIPS 2013], [C DPW 2014]+++. Closeness similarity: relates two nodes based on the similarity of their reach [C DFGGW 2013] (local: [AA 2005, LK2007]).
9 This talk: Unified treatment of distance/reachability-based measures Models: Basic definition of distance-based Different ways of using distance/reachability Argue that models capture intuitive properties Scalability: Sketching and estimation
10 SP distances: Distance vectors
12 SP distances $d_{ij}$: Distance vectors
13 SP distances $d_{ij}$: Distance-based measures. Relate entities based on their distance vectors: closeness centrality $C(\cdot)$, closeness similarity $\mathrm{Sim}(\cdot,\cdot)$, influence $\mathrm{Inf}(\cdot,\cdot,\cdot,\cdot)$.
14 SP distances $d_{ij}$: Distance-based Models. Inbound/outbound/undirected distance. Distance/Reachability/Dijkstra (NN) rank. Topic filter $\beta(v) \ge 0$: weighted relevance of entity. Distances vs. decayed distances $\alpha(d) \ge 0$. Robustness by randomizing lengths/presence of edges (REL), or by using multiple example instances.
16 Topic Filter $\beta(i)$. Weigh entities (nodes or entries of vectors) according to topic, interests, education level, age, community, geography, language, product type, ... Applications: focus the measure on a topic. Useful for recommendations, attribute completion, targeted ads. Similar to the role of the start state in PPR. Evil $\beta$:
18 Farness-penalty vs. Closeness-reward. To emphasize the penalty for weak coverage of far nodes, we use distances as is: the norm of the distance vector of a node is dominated by far nodes. As in classic closeness centrality, facility location, k-medians/means. To emphasize the reward for better coverage of close nodes, we apply a decay ("kernel") function $\alpha(d)$ to the distance: the norm is dominated by closer nodes. As in influence models and local measures. Exact computation: both require full distance vectors. Scalable approximation: different techniques.
19 Quantifying Closeness Reward: Distance Decay (Kernel) Functions. Specify how importance/relevance decays with distance: $\alpha(d) = \exp(-\lambda d)$ (exponential), $1/d$ (geometric), $\exp(-\lambda d^2)$ (Gaussian), $d < \infty\ ?\ 1 : 0$ (reachability), $d < T\ ?\ 1 : 0$ (threshold). Similar to the role of the kernel distribution in PPR.
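A minimal sketch of such decay kernels in Python, for illustration only. The function names and default parameters are assumptions made here, not part of the talk; `inverse_shifted` is the $1/(1+d)$ kernel that the later examples use.

```python
import math

def exponential(d, lam=1.0):
    """alpha(d) = exp(-lam * d)."""
    return math.exp(-lam * d)

def gaussian(d, lam=1.0):
    """alpha(d) = exp(-lam * d^2)."""
    return math.exp(-lam * d * d)

def reachability(d):
    """1 if the node is reachable (finite distance), else 0."""
    return 1.0 if d < math.inf else 0.0

def threshold(d, T):
    """1 if the node is within distance T, else 0."""
    return 1.0 if d < T else 0.0

def inverse_shifted(d):
    """alpha(d) = 1 / (1 + d)."""
    return 1.0 / (1.0 + d)

# Print each kernel on a few distances, including an unreachable node.
distances = [0, 1, 2, 5, math.inf]
kernels = [
    ("exponential", exponential),
    ("gaussian", gaussian),
    ("reachability", reachability),
    ("threshold T=3", lambda d: threshold(d, 3)),
    ("1/(1+d)", inverse_shifted),
]
for name, kernel in kernels:
    print(name, [round(kernel(d), 3) for d in distances])
```

All of these are non-increasing in the distance, which is the only property the estimators later in the talk rely on.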
20 Closeness Vectors. SP distance vectors are mapped to closeness vectors, here with $\alpha(d_{ij}) = 1/(1+d_{ij})$ and $\beta \equiv 1$.
21 Closeness Centrality. Classic (farness penalty) [Bavelas/Sabidussi/Beauchamp 1950, 1965, 1966]: $C_i = (n-1) / \sum_j d_{ij}\,\beta(j)$. Distance-decay (closeness reward) [C Kaplan 2004, Opsahl et al. 2010, Dangalchev 2006, Boldi Vigna 2013, ...]: $C_i = \sum_j \alpha(d_{ij})\,\beta(j)$.
22 Closeness Matrix (distance decay + topic filter): $c_{ij} = \alpha(d_{ij})\,\beta(j)$; $c_i$ (row $i$) is the closeness vector of node $i$. Centrality: $C_i = \sum_j c_{ij}$. Influence of a set of nodes $S$: $\mathrm{Inf}(S) = \sum_j \max_{i \in S} c_{ij}$.
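The definitions above can be sketched directly from a full distance matrix. This is an exact (non-scalable) illustration on a hypothetical 4-node path graph; the function names and the example graph are assumptions made for this sketch.

```python
def closeness_matrix(dist, alpha, beta):
    """c[i][j] = alpha(d_ij) * beta[j]."""
    n = len(dist)
    return [[alpha(dist[i][j]) * beta[j] for j in range(n)] for i in range(n)]

def classic_closeness(dist, i, beta):
    """Farness penalty: (n - 1) / sum_j d_ij * beta[j]."""
    n = len(dist)
    return (n - 1) / sum(dist[i][j] * beta[j] for j in range(n))

def centrality(c, i):
    """Distance-decay closeness centrality: sum of row i."""
    return sum(c[i])

def influence(c, seeds):
    """Influence of a seed set: sum over j of the best closeness among seeds."""
    n = len(c)
    return sum(max(c[i][j] for i in seeds) for j in range(n))

# Hypothetical example: unweighted path graph 0 - 1 - 2 - 3.
dist = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]
alpha = lambda d: 1.0 / (1.0 + d)   # the kernel used in the examples
beta = [1.0, 1.0, 1.0, 1.0]         # uniform topic filter

c = closeness_matrix(dist, alpha, beta)
print("classic closeness of node 0:", classic_closeness(dist, 0, beta))
print("decay centrality of node 0: ", centrality(c, 0))
print("influence of {0, 3}:        ", influence(c, [0, 3]))
```

With a single seed the influence reduces to the centrality of that node, and adding seeds can only increase each coordinate-wise maximum.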
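23 Example: (dist-decay) Closeness Centrality. [Figure: SP distance vectors and the corresponding closeness vectors of two example nodes, with $\alpha(d_{ij}) = 1/(1+d_{ij})$ and $\beta \equiv 1$; centrality values shown: C 2.05 and C 2.]
24 Example + topic filter: Closeness to Evil. [Figure: the same example with $\alpha(d_{ij}) = 1/(1+d_{ij})$ and $\beta$ measuring evilness; values shown: C 2.05 total, C 2. total, .24 evil part, 0.29 evil part.]
25 Example: Influence from Vectors. [Figure: SP distance vectors and closeness vectors with $\alpha(d_{ij}) = 1/(1+d_{ij})$ and $\beta \equiv 1$. The influence of the pair is the sum of coordinate-wise maxima (submodular): Inf 3.64, compared with the individual centralities C 2. and C 2.05.]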
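26 Closeness Similarity: Local Measures. Local: $\mathrm{sim}(i,j)$ depends only on the immediate neighbors of $i, j$. Jaccard: $|\Gamma(i) \cap \Gamma(j)| / |\Gamma(i) \cup \Gamma(j)|$. Adamic-Adar: $\sum_{x \in \Gamma(i) \cap \Gamma(j)} 1/\log|\Gamma(x)|$. Worked examples in the figure: $J = 2/9$ and $J = 3/5$ for two node pairs, with the corresponding Adamic-Adar scores. These work well for $d_{ij} \le 2$, but have no recall for $d_{ij} > 2$.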
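A short sketch of the two local measures on a hypothetical undirected graph stored as neighbor sets. The graph and the function names are illustrative assumptions. It also shows the recall limitation: a pair at distance greater than 2 has no common neighbor and scores 0.

```python
import math

def jaccard(adj, i, j):
    """|N(i) & N(j)| / |N(i) | N(j)|."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

def adamic_adar(adj, i, j):
    """Sum over common neighbors x of 1 / log(degree of x)."""
    return sum(1.0 / math.log(len(adj[x]))
               for x in adj[i] & adj[j]
               if len(adj[x]) > 1)  # guard against log(1) = 0

# Hypothetical graph with edges: a-c, a-d, b-c, b-d, b-e, d-e, e-f, f-g.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d", "f"},
    "f": {"e", "g"},
    "g": {"f"},
}

print("J(a, b)  =", jaccard(adj, "a", "b"))
print("AA(a, b) =", adamic_adar(adj, "a", "b"))
# a and g are 4 hops apart, so both local measures return 0.
print("J(a, g)  =", jaccard(adj, "a", "g"))
print("AA(a, g) =", adamic_adar(adj, "a", "g"))
```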
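27 Closeness Similarity: Global Measures. Based on full closeness vectors $c_{ij} = \alpha(d_{ij})\,\beta(j)$. Weighted Jaccard: $\mathrm{sim}(i,j) = \sum_h \beta(h) \min(c_{ih}, c_{jh}) / \sum_h \beta(h) \max(c_{ih}, c_{jh})$. Cosine: $\mathrm{sim}(i,j) = c_i \cdot c_j / (\lVert c_i \rVert_2 \lVert c_j \rVert_2)$. $L_p$ distance: $L_p\text{-dist}(i,j) = \lVert c_i - c_j \rVert_p$.
28 Example: (global) Closeness Similarity. [Figure: SP distance vectors and closeness vectors of two nodes, with $\alpha(d_{ij}) = 1/(1+d_{ij})$ and $\beta \equiv 1$.] Weighted Jaccard: $\sum_x \min(C_{ix}, C_{jx}) / \sum_x \max(C_{ix}, C_{jx})$ (value shown: 0.2). Cosine similarity: $\sum_x C_{ix} C_{jx} / (\lVert C_i \rVert_2 \lVert C_j \rVert_2)$. $L_1$ distance: $\sum_x |C_{ix} - C_{jx}|$ (value shown: 2.8).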
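The three global measures can be sketched on explicit closeness vectors as follows. The vectors below are hypothetical, and the optional `beta` argument mirrors the per-coordinate weight in the weighted Jaccard formula (uniform by default).

```python
import math

def weighted_jaccard(ci, cj, beta=None):
    """sum_h beta_h * min(c_ih, c_jh) / sum_h beta_h * max(c_ih, c_jh)."""
    beta = beta or [1.0] * len(ci)
    num = sum(b * min(x, y) for b, x, y in zip(beta, ci, cj))
    den = sum(b * max(x, y) for b, x, y in zip(beta, ci, cj))
    return num / den if den > 0 else 0.0

def cosine(ci, cj):
    """Dot product divided by the product of Euclidean norms."""
    dot = sum(x * y for x, y in zip(ci, cj))
    norm_i = math.sqrt(sum(x * x for x in ci))
    norm_j = math.sqrt(sum(y * y for y in cj))
    return dot / (norm_i * norm_j)

def lp_distance(ci, cj, p=1):
    """L_p norm of the difference of the two closeness vectors."""
    return sum(abs(x - y) ** p for x, y in zip(ci, cj)) ** (1.0 / p)

# Hypothetical closeness vectors of two nodes over the same three targets.
c_i = [1.0, 0.5, 0.25]
c_j = [0.5, 1.0, 0.5]

print("weighted Jaccard:", weighted_jaccard(c_i, c_j))
print("cosine:          ", cosine(c_i, c_j))
print("L1 distance:     ", lp_distance(c_i, c_j, p=1))
```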
29 So far: Definitions and some motivation of distance-based measures Next: Scalability thru Sketches and estimators
30 Scalable Closeness Centrality. The closeness centrality of a node v can be computed exactly using a single-source shortest-paths (SSSP) computation from v. But this takes near-linear time per node, which does not scale when we want the centralities of many or all nodes. We want to estimate all centralities within a small relative error using near-linear total computation. Classic (farness penalty): [C DPW: COSN 2014]; previously additive error: [EW SODA 2001, OCL 2008]. Distance-decay (closeness reward): All-distances sketches [C 94], [C Kaplan SIGMOD 2004], [C: PODS 2014].
31 Scaling Up (closeness reward). General tools: All-distances and reachability node sketches (ADS) [C 94], [C Kaplan 2004], [C: PODS 2014]. Estimators applicable with ADS: [C K PODC 2007, VLDB 2008, SIGMETRICS 2008], [C PODC 2014, KDD 2014]. Applications: Distance-decay closeness centrality [C 94], [C Kaplan 2004], [Boldi Vigna 2013], [C: PODS 2014]. Closeness similarity [CDFGGW: COSN 2013]. Influence computation and maximization [CWY KDD 2009], [C Delling Pajor Werneck: CIKM 2014], [Du Song Gomez-Rodriguez Zha NIPS 2013], [C Delling Pajor Werneck 2015].
32 All-Distances Sketches (ADS) [C 94]+: per-node summary structures of un/directed, un/weighted networks. For a node $i \in [n]$, ADS(i) samples the distance vector of $i$. $m$ edges, $n$ nodes; the parameter $k$ trades off sketch size and information.
33 All-Distances Sketches: Definition. ADS(v) is a list of pairs of the form $(i, d_{vi})$. Draw a random permutation of the nodes, $r: [n] \to [n]$. Then $i \in \mathrm{ADS}(v) \iff r(i) <$ the $k$-th smallest rank amongst nodes that are closer to $v$ than $i$. This is a bottom-$k$ ADS; there are other ADS flavors.
34 All-Distances Sketches: Definition. ADS(v) is a list of pairs of the form $(i, d_{vi})$. Draw a random permutation of the nodes, $r: [n] \to [n]$. Then $i \in \mathrm{ADS}(v) \iff r(i) <$ the $k$-th smallest rank amongst nodes that are closer to $v$ than $i$. The expected size of ADS(i) is $\sum_{j=1}^{n} \min\{1, k/j\} \approx k \ln n$. The $j$-th closest node to $i$ is included with probability $\min\{1, k/j\}$.
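The bottom-k definition can be applied literally to a distance-sorted node list. The sketch below is a brute-force illustration for a single node, assuming distinct distances; the function names and the example ranks are hypothetical. It also evaluates the expected-size sum directly.

```python
import heapq

def bottom_k_ads(sorted_nodes, rank, k):
    """Build ADS(v) from (node, distance) pairs sorted by increasing distance.

    A node is kept iff its rank is below the k-th smallest rank among
    strictly closer nodes (or fewer than k nodes are closer).
    """
    ads = []
    smallest = []  # negated ranks: a max-heap of the k smallest ranks so far
    for node, d in sorted_nodes:
        r = rank[node]
        if len(smallest) < k:
            ads.append((node, d))
            heapq.heappush(smallest, -r)
        elif r < -smallest[0]:
            ads.append((node, d))
            heapq.heapreplace(smallest, -r)  # drop the old k-th smallest rank
    return ads

def expected_ads_size(n, k):
    """Sum over j = 1..n of min(1, k / j)."""
    return sum(min(1.0, k / j) for j in range(1, n + 1))

# Hypothetical example: node j lies at distance j from v.
sorted_nodes = [(j, j) for j in range(6)]
rank = {0: 0.5, 1: 0.2, 2: 0.4, 3: 0.1, 4: 0.9, 5: 0.15}

print("k=1:", [node for node, _ in bottom_k_ads(sorted_nodes, rank, 1)])
print("k=2:", [node for node, _ in bottom_k_ads(sorted_nodes, rank, 2)])
print("expected size, n=10**6, k=64:", round(expected_ads_size(10**6, 64), 1))
```

A skipped node never changes the k smallest ranks, so only included nodes update the heap.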
35 SP distances: ADS example Random permutation of nodes
36 ADS example, $k = 1$. [Figure: all nodes sorted by SP distance from the example node, with the entries of its $k = 1$ sketch marked.]
37 ADS example, $k = 2$. [Figure: nodes sorted by SP distance from the example node, with the entries of its $k = 2$ sketch marked.]
38 Sketch Coordination. We use the same permutation to obtain the ADS of all nodes. As a result, the ADS sketches of different nodes are coordinated: related in a way that is useful for queries that involve multiple nodes (similarities, influence, distance) [Brewer, Early, Joyce 1972]. ADSs generalize MinHash sketches by adding a time/distance dimension.
39 Computing ADSs Efficiently. Pruned Dijkstra's algorithm: build ADS entries by increasing permutation rank. Local (dynamic programming) algorithm: build ADS entries by increasing distance. Either way, we scan the outgoing edges of a node only after its ADS is modified, so the number of edge traversals is bounded by the ADS size: $O(km \ln n)$.
40 Computing ADSs Efficiently. Perform pruned Dijkstra from nodes by increasing permutation rank ($k = 1$).
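A compact sketch of the pruned-Dijkstra construction, assuming an undirected graph given as adjacency lists with positive edge lengths and treating "closer" as strictly smaller distance. The function name, graph, and ranks are hypothetical; this is an illustration of the idea rather than the authors' implementation.

```python
import heapq

def pruned_dijkstra_ads(graph, rank, k):
    """Bottom-k ADS for every node of an undirected graph.

    graph: {node: [(neighbor, length), ...]}, rank: {node: rank value}.
    Sources are processed by increasing rank, so every entry already in
    ADS(v) belongs to a node of smaller rank than the current source.
    """
    ads = {v: [] for v in graph}
    for u in sorted(graph, key=rank.get):
        dist = {u: 0.0}
        heap = [(0.0, u)]
        done = set()
        while heap:
            d, v = heapq.heappop(heap)
            if v in done:
                continue
            done.add(v)
            # Count smaller-rank nodes that are strictly closer to v than u.
            closer = sum(1 for _, dv in ads[v] if dv < d)
            if closer >= k:
                continue  # prune: u is not in ADS(v); do not expand v
            ads[v].append((u, d))
            for w, length in graph[v]:
                nd = d + length
                if nd < dist.get(w, float("inf")):
                    dist[w] = nd
                    heapq.heappush(heap, (nd, w))
    return ads

# Hypothetical example: path 0 - 1 - 2 - 3 with unit lengths.
graph = {
    0: [(1, 1.0)],
    1: [(0, 1.0), (2, 1.0)],
    2: [(1, 1.0), (3, 1.0)],
    3: [(2, 1.0)],
}
rank = {0: 0.3, 1: 0.1, 2: 0.4, 3: 0.2}

ads = pruned_dijkstra_ads(graph, rank, k=1)
for v in sorted(ads):
    print(v, sorted(ads[v], key=lambda entry: entry[1]))
```

Edges out of a node are scanned only when the current source is inserted into that node's sketch, which is what ties the work to the total sketch size.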
41 Estimation from sketches
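42 Estimation with All-Distances Sketches. Closeness centrality of node $i$, from ADS(i), with NRMSE $O(1/\sqrt{k})$ (+ concentration) [C 94, C Kaplan 04, C 14]. Influence of $S$, from ADS(i) for $i \in S$, with NRMSE $O(1/\sqrt{k})$ [C 94, Chen WY KDD 09, Du SGZ NIPS 13, C DPW 2014]+++. Closeness similarity $\mathrm{sim}(i,j)$ (weighted Jaccard, cosine), from ADS(i) and ADS(j), with RMSE $O(1/\sqrt{k})$ [C DFGGW COSN 13]. $L_p$ distance, from ADS(i) and ADS(j), with RMSE $O(1/\sqrt{k})$ times the max norm, using [C KDD 2014]. Side note: ADSs can also be used as distance oracles (estimate pairwise distances), spanners, and more.
43 Historic Inverse Probability (HIP): probability & estimator [C 2014]. The HIP inclusion probability $p_{vi}$ of $i \in \mathrm{ADS}(v)$ is conditioned on fixed rank values of the other nodes. To estimate the centrality of $v$, we estimate the presence of $i$ with respect to $v$, $I_{v \leadsto i}$ (= 1 if $v$ reaches $i$, 0 otherwise), using the inverse-probability estimate [HT52]: $a_{vi} = 1/p_{vi}$ when $i \in \mathrm{ADS}(v)$ and 0 otherwise. This is unbiased (when $p_{vi} > 0$). For a bottom-$k$ ADS and $r \sim U[0,1]$: $p_{vi}$ is the $k$-th smallest value in $\{r(u) \mid u \in \mathrm{ADS}(v),\ d_{vu} < d_{vi}\}$, and $p_{vi} = 1$ when fewer than $k$ such nodes exist.
44 Example: HIP estimates. [Figure: bottom-2 ADS of a node with the rank values $r$, the HIP probabilities $p$, and the adjusted weights $a = 1/p$; here $p$ is the 2nd smallest $r$ value among closer nodes.]
45 HIP cardinality estimate. [Figure: bottom-2 ADS with the distance and adjusted weight $a$ of each entry.] Query: $n_6(v) = |\{i \mid d_{vi} \le 6\}| = \sum_i I_{d_{vi} \le 6}$. Estimate: $\hat{n}_6(v) = \sum_{(i, d_{vi}) \in \mathrm{ADS}(v),\ d_{vi} \le 6} a_{vi} = 3.59$.
46 Quality of HIP cardinality estimate. Lemma: the HIP neighborhood cardinality estimator $\hat{n}_d(v) = \sum_{(i, d_{vi}) \in \mathrm{ADS}(v),\ d_{vi} \le d} a_{vi}$ has coefficient of variation $\sigma/\mu \le 1/\sqrt{2k-2}$. This is a $\sqrt{2}$ improvement over MinHash estimators [C 94]. It also uses the sketch optimally.
47 HIP estimates of Centrality. $C_v = \sum_i \alpha(d_{vi})\,\beta(i)$, with $\alpha$ non-increasing and $\beta$ some filter. Estimate: $\hat{C}_v = \sum_{i \in \mathrm{ADS}(v)} a_{vi}\,\alpha(d_{vi})\,\beta(i)$.
48 HIP estimates: closeness to good/evil. [Figure: bottom-2 ADS with the distance, adjusted weight $a$, and filter value $\beta$ of each entry.] Filter: $\beta \ge 0$ measures goodness. Distance-decay kernel: $e^{-d_{vi}}$. Estimate: $\hat{C}_v = \sum_{i \in \mathrm{ADS}(v)} a_{vi}\,\beta(i)\,e^{-d_{vi}}$.
49 Similarity/Influence Estimation. We work with HIP inclusions and sum estimators (estimate separately the contribution of each node). We use the monotone estimation formulation [C K Random 2014; C PODC 2014] and the L* estimator (the unique optimal monotone unbiased nonnegative sum estimator). Influence: can be estimated by merging ADS sketches and applying the centrality estimator, but L* is tighter. Similarity: inverse probability on joint inclusion may not even apply; L* gets around the problem.
50 Estimating Jaccard Closeness Similarity. SP distance, using kernel $\alpha(d_{ij}) = 1/(1+d_{ij})$: $\mathrm{sim}(i,j) = \frac{\sum_x \min(c_{ix}, c_{jx})}{\sum_x \max(c_{ix}, c_{jx})} = \frac{\sum_x \min\left(\frac{1}{d_{ix}+1}, \frac{1}{d_{jx}+1}\right)}{\sum_x \max\left(\frac{1}{d_{ix}+1}, \frac{1}{d_{jx}+1}\right)}$. Sum estimator: for all $x$, obtain unbiased estimates $\hat{N}_x$ of $\min\left(\frac{1}{d_{ix}+1}, \frac{1}{d_{jx}+1}\right)$ and $\hat{X}_x$ of $\max\left(\frac{1}{d_{ix}+1}, \frac{1}{d_{jx}+1}\right)$; then $\widehat{\mathrm{sim}}(i,j) = \sum_x \hat{N}_x / \sum_x \hat{X}_x$. $\hat{X}_x > 0$ or $\hat{N}_x > 0$ only when $x \in \mathrm{ADS}(i) \cup \mathrm{ADS}(j)$, so $\widehat{\mathrm{sim}}(i,j)$ is easy to compute using only $\mathrm{ADS}(i) \cup \mathrm{ADS}(j)$. We use L* for $\hat{X}_x$. We show $\hat{N}_x$ (inverse probability). We get relative error on the denominator and additive error of the same order on the numerator.
51 Estimating Jaccard Closeness Similarity. Estimate $\hat{N}_x$ for $\min\left(\frac{1}{d_{ix}+1}, \frac{1}{d_{jx}+1}\right)$: for $x \in \mathrm{ADS}(i) \cap \mathrm{ADS}(j)$, we compute the HIP probability $p$ of inclusion in the intersection and use the inverse-probability estimate $\hat{N}_x = \frac{1}{p} \min\left(\frac{1}{d_{ix}+1}, \frac{1}{d_{jx}+1}\right)$. Otherwise, the estimate is $\hat{N}_x = 0$. This is the best we can do: if $x$ is not in the intersection, the data is consistent with a value of 0, and any nonnegative unbiased estimator must then return 0.
52 Estimating Jaccard Closeness Similarity. Estimate $\hat{X}_x$ for $\max\left(\frac{1}{d_{ix}+1}, \frac{1}{d_{jx}+1}\right)$, for $x \in \mathrm{ADS}(i) \cup \mathrm{ADS}(j)$. We can never determine $X_x$ if $x$ is only reachable from one of $i, j$; we can only lower-bound $X_x$. The L* estimator [C 2014] is an optimal extension of inverse probability to such a setting.
53 NN-rank based closeness similarity. $\mathrm{sim}(i,j) = \frac{\sum_x \min(1/\pi_{ix},\ 1/\pi_{jx})}{\sum_x \max(1/\pi_{ix},\ 1/\pi_{jx})}$, where $\pi_{ix}$ is the Dijkstra (NN) rank of $x$ with respect to $i$. Lemma: $\mathrm{sim}(i,j)$ is approximated well by $|\mathrm{ADS}(i) \cap \mathrm{ADS}(j)| / |\mathrm{ADS}(i) \cup \mathrm{ADS}(j)|$. Proof: $\Pr[x \in \mathrm{ADS}(i) \cap \mathrm{ADS}(j)] \approx k \min(1/\pi_{ix},\ 1/\pi_{jx})$ and $\Pr[x \in \mathrm{ADS}(i) \cup \mathrm{ADS}(j)] \approx k \max(1/\pi_{ix},\ 1/\pi_{jx})$. ADSs can be stored very compactly: no distances, hashes of nodes.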
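A small sketch of the HIP weights and the neighborhood cardinality estimate for a bottom-k ADS, assuming ranks drawn from U[0,1] and distinct distances. The sketch contents and function names are hypothetical and do not reproduce the example on the slides.

```python
def hip_weights(ads, rank, k):
    """Adjusted weights a_vi = 1 / p_vi for a bottom-k ADS.

    ads: (node, distance) pairs sorted by increasing distance.
    p_vi is the k-th smallest rank among sketch entries closer than i,
    or 1 when fewer than k entries are closer.
    """
    weights = {}
    closer_ranks = []
    for node, _ in ads:
        if len(closer_ranks) < k:
            p = 1.0
        else:
            p = sorted(closer_ranks)[k - 1]
        weights[node] = 1.0 / p
        closer_ranks.append(rank[node])
    return weights

def hip_cardinality(ads, weights, d):
    """Estimate the number of nodes within distance d of the sketch owner."""
    return sum(weights[node] for node, dist in ads if dist <= d)

# Hypothetical bottom-2 ADS of some node v, with the ranks of its entries.
ads_v = [("a", 0), ("b", 1), ("c", 2), ("d", 4)]
rank = {"a": 0.5, "b": 0.2, "c": 0.4, "d": 0.1}

weights = hip_weights(ads_v, rank, k=2)
print("weights:", weights)
print("estimated nodes within distance 2:", hip_cardinality(ads_v, weights, 2))
print("estimated nodes within distance 4:", hip_cardinality(ads_v, weights, 4))
```

The first k entries are always sampled, so they carry weight 1; later entries are scaled up by the inverse of the probability with which they were kept.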
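The filtered centrality estimate is a weighted sum over sketch entries. The sketch below assumes the HIP weights have already been computed; the entry values are made up for illustration and are not the numbers from the slide.

```python
import math

def hip_centrality(entries, alpha):
    """Estimate C_v as the sum of a_vi * alpha(d_vi) * beta(i) over ADS(v).

    entries: (distance, hip_weight, beta) triples, one per sketch entry.
    """
    return sum(a * alpha(d) * b for d, a, b in entries)

# Hypothetical bottom-2 sketch: (distance, adjusted weight a, goodness beta).
entries = [
    (0, 1.0, 1.0),
    (1, 1.0, 0.0),   # an "evil" node contributes nothing under this filter
    (2, 2.0, 1.0),
    (4, 2.5, 0.5),
]

exp_decay = lambda d: math.exp(-d)
print("estimated closeness to good:", hip_centrality(entries, exp_decay))
```

Changing the filter or the kernel needs no new sketch: the same entries and weights are reused with different `alpha` and `beta` values.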
54 Next: Enhancing distance/reachability-based models with REL. REL: randomizing lengths/presence of edges. (Implicit) in the Independent Cascade diffusion model [Kempe Kleinberg Tardos KDD 2003]. Closeness similarity [C DFGGW 2013]. Timed diffusion [Gomez-Rodriguez BS 2011, Du SGZ 2013, C DPW 2014, ...].
55 Strength of relation between two entities. Basic intuitions for dependence on the path ensemble: increase with shorter paths; increase with more (redundant) paths.
56 Boosting distance by Randomizing Edge Lengths (REL). We expect the strength of a relation to increase with the multiplicity of paths and decrease with the length of paths. Eigenvector measures (Rooted PageRank, RWR, Hitting Time, Commute Time, Effective Resistance, Katz) satisfy this strictly. SP distances: only weakly.
57 Randomizing Edge Lengths (REL) Replace edge lengths (or presence) with independent random variables Look at the expected measure
58 Randomizing Edge Lengths (REL) Expected distance is lower with multiple paths: expectation of the minimum < minimum of expectations
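A tiny Monte Carlo illustration of this inequality, assuming two nodes joined by several parallel single-edge paths whose lengths are independent Exp(1) variables. The setup and the function name are assumptions chosen so that the exact answer, 1 divided by the number of paths, is known.

```python
import random

def expected_min_length(num_paths, samples=20000, seed=0):
    """Monte Carlo estimate of E[min of num_paths independent Exp(1) lengths]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += min(rng.expovariate(1.0) for _ in range(num_paths))
    return total / samples

# Each path has expected length 1, so the minimum of expectations is 1.
for m in (1, 2, 4):
    print(m, "parallel paths: expected distance ~", round(expected_min_length(m), 3))
```

The expected distance shrinks as paths are added even though every individual path keeps the same expected length, which is how REL rewards path multiplicity.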
59 Randomizing Edge Lengths (REL) Scalability: use Monte Carlo simulations of model to generate fixed instances (graphs) Build sketches for multiple instances
60 Benefits of REL in network analysis. The measure rewards path multiplicity. Robustness: the measure is not sensitive to small changes in edge weights (or in edge presence, for reachability).
61 Next: Some (preliminary) experimental results
62 Scalability: Timed Influence with REL. Combined ADS sketches for all nodes, obtained from $l = 64$ instances generated by assigning independent Exp-distributed edge lengths. Sketch parameter $k = 64$. Centrality (1 seed) and influence (50 seeds) queries with $\alpha(x) = \exp(-10x)$. [Table: preprocessing time [h:m], and query time [μs] and %err for 1 seed and for 50 seeds, on Slashdot (77K nodes, 828K edges), Gowalla (197K nodes, 1.9M edges), and Twitter F (456K nodes, 15M edges); the timing and error values are missing from the transcription.] [C Delling Pajor Werneck 2014]
63 Closeness Similarity evaluation [C Delling Fuchs Goldberg Goldszmidt Werneck COSN 2013]. [Table: data sets arXiv, DBLP, Twitter, small-world; columns: nodes ($\times 10^6$), edges ($\times 10^6$), ADS label size; values missing from the transcription.]
64 Similarity Evaluation. Spearman coefficient: correlation between the meta-data ranking and the similarity-measure ranking (1 = perfect match, 0 = random rankings), on pairs selected uniformly by ground-truth similarity. [Table: rows Adamic-Adar, hops, SP distance, RWR (four settings), Closeness, Closeness REL; columns arXiv, DBLP, Twitter, SmallWorld; coefficient values missing from the transcription.]
65 Similarity Evaluation. No clear winner (the networks are too different). Local measures are limited. RWR has very good recall (with tuning) but is computationally expensive. Closeness (+REL): robust, good recall, fast queries (microseconds).
66 Conclusion Distance/reachability based measures of centrality, influence, and similarity: global measures that are flexible and highly scalable Presented a unified treatment Scalability through: All-Distances and reachability sketches Estimators applicable to sketches Future: Model fitting framework, Evaluations, Algorithms Engineering
68 Components of my work are joint work with (subsets of): Daniel Delling, Fabian Fuchs, Andrew Goldberg, Moises Goldszmidt, Haim Kaplan, Thomas Pajor, and Renato Werneck
More informationExact Computation of Influence Spread by Binary Decision Diagrams
Exact Computation of Influence Spread by Binary Decision Diagrams Takanori Maehara 1), Hirofumi Suzuki 2), Masakazu Ishihata 2) 1) Riken Center for Advanced Intelligence Project 2) Hokkaido University
More informationScribe from 2014/2015: Jessica Su, Hieu Pham Date: October 6, 2016 Editor: Jimmy Wu
CS 267 Lecture 3 Shortest paths, graph diameter Scribe from 2014/2015: Jessica Su, Hieu Pham Date: October 6, 2016 Editor: Jimmy Wu Today we will talk about algorithms for finding shortest paths in a graph.
More informationDiffusion and Clustering on Large Graphs
Diffusion and Clustering on Large Graphs Alexander Tsiatas Thesis Proposal / Advancement Exam 8 December 2011 Introduction Graphs are omnipresent in the real world both natural and man-made Examples of
More informationPoint-to-Point Shortest Path Algorithms with Preprocessing
Point-to-Point Shortest Path Algorithms with Preprocessing Andrew V. Goldberg Microsoft Research Silicon Valley www.research.microsoft.com/ goldberg/ Joint work with Chris Harrelson, Haim Kaplan, and Retato
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu [Kumar et al. 99] 2/13/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
More informationCentrality Measures. Computing Closeness and Betweennes. Andrea Marino. Pisa, February PhD Course on Graph Mining Algorithms, Università di Pisa
Computing Closeness and Betweennes PhD Course on Graph Mining Algorithms, Università di Pisa Pisa, February 2018 Centrality measures The problem of identifying the most central nodes in a network is a
More informationApproximation Algorithms
Chapter 8 Approximation Algorithms Algorithm Theory WS 2016/17 Fabian Kuhn Approximation Algorithms Optimization appears everywhere in computer science We have seen many examples, e.g.: scheduling jobs
More informationCS 664 Slides #11 Image Segmentation. Prof. Dan Huttenlocher Fall 2003
CS 664 Slides #11 Image Segmentation Prof. Dan Huttenlocher Fall 2003 Image Segmentation Find regions of image that are coherent Dual of edge detection Regions vs. boundaries Related to clustering problems
More informationA Class of Submodular Functions for Document Summarization
A Class of Submodular Functions for Document Summarization Hui Lin, Jeff Bilmes University of Washington, Seattle Dept. of Electrical Engineering June 20, 2011 Lin and Bilmes Submodular Summarization June
More informationInformation Propagation in Interaction Networks
Information Propagation in Interaction Networks Rohit Kumar Department of Computer and Decision Engineering Université Libre de Bruxelles, Belgium Department of Service and Information System Engineering
More informationProteins, Particles, and Pseudo-Max- Marginals: A Submodular Approach Jason Pacheco Erik Sudderth
Proteins, Particles, and Pseudo-Max- Marginals: A Submodular Approach Jason Pacheco Erik Sudderth Department of Computer Science Brown University, Providence RI Protein Side Chain Prediction Estimate side
More informationNearest Neighbor with KD Trees
Case Study 2: Document Retrieval Finding Similar Documents Using Nearest Neighbors Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox January 22 nd, 2013 1 Nearest
More informationReach for A : Efficient Point-to-Point Shortest Path Algorithms
Reach for A : Efficient Point-to-Point Shortest Path Algorithms Andrew V. Goldberg 1 Haim Kaplan 2 Renato F. Werneck 3 October 2005 Technical Report MSR-TR-2005-132 We study the point-to-point shortest
More informationLink Prediction and Anomoly Detection
Graphs and Networks Lecture 23 Link Prediction and Anomoly Detection Daniel A. Spielman November 19, 2013 23.1 Disclaimer These notes are not necessarily an accurate representation of what happened in
More informationSome Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing
Some Interesting Applications of Theory PageRank Minhashing Locality-Sensitive Hashing 1 PageRank The thing that makes Google work. Intuition: solve the recursive equation: a page is important if important
More informationEfficient Processing of Multiple DTW Queries in Time Series Databases
Efficient Processing of Multiple DTW Queries in Time Series Databases Hardy Kremer 1 Stephan Günnemann 1 Anca-Maria Ivanescu 1 Ira Assent 2 Thomas Seidl 1 1 RWTH Aachen University, Germany 2 Aarhus University,
More informationUniversity of Maryland. Tuesday, March 2, 2010
Data-Intensive Information Processing Applications Session #5 Graph Algorithms Jimmy Lin University of Maryland Tuesday, March 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationPart 1: Link Analysis & Page Rank
Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,
More informationThanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman
Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman http://www.mmds.org Overview of Recommender Systems Content-based Systems Collaborative Filtering J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive
More informationTask Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document
More informationUncertain Data Management Non-relational models: Graphs
Uncertain Data Management Non-relational models: Graphs Antoine Amarilli 1, Silviu Maniu 2 1 Télécom ParisTech 2 Université Paris-Sud January 16th, 2017 1/21 Credits M. Potamias, F. Bonchi, A. Gionis,
More informationCPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017
CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start
More informationLatent Distance Models and. Graph Feature Models Dr. Asja Fischer, Prof. Jens Lehmann
Latent Distance Models and Graph Feature Models 2016-02-09 Dr. Asja Fischer, Prof. Jens Lehmann Overview Last lecture we saw how neural networks/multi-layer perceptrons (MLPs) can be applied to SRL. After
More informationLeveraging Set Relations in Exact Set Similarity Join
Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationComputer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models
Prof. Daniel Cremers 4. Probabilistic Graphical Models Directed Models The Bayes Filter (Rep.) (Bayes) (Markov) (Tot. prob.) (Markov) (Markov) 2 Graphical Representation (Rep.) We can describe the overall
More informationLatest on Linear Sketches for Large Graphs: Lots of Problems, Little Space, and Loads of Handwaving. Andrew McGregor University of Massachusetts
Latest on Linear Sketches for Large Graphs: Lots of Problems, Little Space, and Loads of Handwaving Andrew McGregor University of Massachusetts Latest on Linear Sketches for Large Graphs: Lots of Problems,
More informationCS224W: Analysis of Networks Jure Leskovec, Stanford University
CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu 11/13/17 Jure Leskovec, Stanford CS224W: Analysis of Networks, http://cs224w.stanford.edu 2 Observations Models
More informationMaximizing the Spread of Influence through a Social Network. David Kempe, Jon Kleinberg and Eva Tardos
Maximizing the Spread of Influence through a Social Network David Kempe, Jon Kleinberg and Eva Tardos Group 9 Lauren Thomas, Ryan Lieblein, Joshua Hammock and Mary Hanvey Introduction In a social network,
More informationLink Analysis in the Cloud
Cloud Computing Link Analysis in the Cloud Dell Zhang Birkbeck, University of London 2017/18 Graph Problems & Representations What is a Graph? G = (V,E), where V represents the set of vertices (nodes)
More informationarxiv: v1 [cs.ma] 8 May 2018
Ordinal Approximation for Social Choice, Matching, and Facility Location Problems given Candidate Positions Elliot Anshelevich and Wennan Zhu arxiv:1805.03103v1 [cs.ma] 8 May 2018 May 9, 2018 Abstract
More informationInstance-Based Learning: Nearest neighbor and kernel regression and classificiation
Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each
More informationA Tutorial on the Probabilistic Framework for Centrality Analysis, Community Detection and Clustering
A Tutorial on the Probabilistic Framework for Centrality Analysis, Community Detection and Clustering 1 Cheng-Shang Chang All rights reserved Abstract Analysis of networked data has received a lot of attention
More informationUnderstanding Clustering Supervising the unsupervised
Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data
More informationClustering: Classic Methods and Modern Views
Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More information