Randomized Composable Core-sets for Distributed Optimization Vahab Mirrokni
1 Randomized Composable Core-sets for Distributed Optimization. Vahab Mirrokni, Algorithms Research Group, Google Research, New York. Mainly based on joint work with Hossein Bateni, Aditya Bhaskara, Hossein Esfandiari, Silvio Lattanzi, and Morteza Zadimoghaddam.
2 Our team: Google NYC Algorithms Research Teams. Market Algorithms / Ads Optimization (search & display); tools: e.g., clustering. Large-Scale Graph Mining / Distributed Optimization; Infrastructure and Large-Scale Optimization; tools: e.g., balanced partitioning. Common expertise: online allocation problems.
3 Three most popular techniques applied in our tools. Local algorithms: message passing / label propagation / local random walks, e.g., similarity ranking via PPR, connected components; our connected-components code runs several times faster than the state-of-the-art. Embedding / hashing / sketching techniques, e.g., linear embedding for balanced graph partitioning to minimize cut; improves the state-of-the-art by 26% and improved flash bandwidth for the search backend by 25% (paper appeared in WSDM'16). Randomized composable core-sets for distributed computation: this talk.
4 Agenda. Composable core-sets: definitions & applications. Applications in distributed & streaming settings: feature selection, diversity in search & recommendation. Composable core-sets for four problems (survey): Diversity Maximization (PODS'14, AAAI'17), Clustering (NIPS'14), Submodular Maximization (STOC'15), and Column Subset Selection (ICML'16). Sketching for coverage problems (on arXiv): sketching technique.
5 Composable Core-Sets for Distributed Optimization. [Diagram] The input set T is partitioned into T1, ..., Tm; machine i runs ALG on Ti to select a set Si; ALG is then run on the union of the selected items S1 ∪ ... ∪ Sm to find the final output set.
6 Composable Core-sets. Setup: partition a data set T of elements into m sets (T1, T2, ..., Tm). Goal: given a set function f, find a subset S* optimizing f(S*). Find: a small core-set Si ⊆ Ti for each part, such that the optimum solution in the union of the core-sets S1 ∪ ... ∪ Sm approximates the optimum solution of T.
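As a concrete illustration, the framework can be sketched in a few lines of Python. Here the objective is a simple maximum-coverage function and plain greedy plays the role of ALG; the function names and the toy instance are ours, not from the talk.

```python
import random

def greedy(sets, k):
    """Standard greedy: repeatedly add the set with the largest marginal coverage."""
    chosen, covered = [], set()
    for _ in range(min(k, len(sets))):
        best = max(sets, key=lambda s: len(s - covered))
        chosen.append(best)
        covered |= best
    return chosen

def composable_coreset_max_cover(sets, k, m):
    """Randomly partition the input among m 'machines', compute a core-set
    (a local greedy solution) on each part, then run greedy on the union."""
    parts = [[] for _ in range(m)]
    for s in sets:
        parts[random.randrange(m)].append(s)
    union = [s for part in parts if part for s in greedy(part, k)]
    return greedy(union, k)

sets = [frozenset({1, 2, 3}), frozenset({3, 4}),
        frozenset({4, 5, 6}), frozenset({6, 7})]
solution = composable_coreset_max_cover(sets, k=2, m=2)
```

On this tiny instance every random partition leaves both large sets in the union of core-sets, so the two-round solution matches the single-machine greedy solution.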
7 Application in MapReduce / Distributed Computation. [Diagram, as before: partition the input set, run ALG on each machine, then run ALG on the union of the selected items to find the final output set.] E.g., two rounds of MapReduce.
8 Application in Streaming Computation. Streaming computation: processing a sequence of n data points on the fly with limited storage. Use a C-composable core-set of size k, for example: split the stream into chunks of size √(nk), so the number of chunks is √(n/k); compute a core-set of size k for each chunk; total space: O(√(nk)).
9 Overview of recent theoretical results. Need to solve (combinatorial) optimization problems on large data. 1. Diversity Maximization: PODS'14 by Indyk, Mahdian, Mahabadi, Mirrokni; for feature selection, AAAI'17 by Abbasi, Ghadiri, Mirrokni, Zadimoghaddam. 2. Capacitated ℓp Clustering: NIPS'14 by Bateni, Bhaskara, Lattanzi, Mirrokni. 3. Submodular Maximization: STOC'15 by Mirrokni, Zadimoghaddam. 4. Column Subset Selection (feature selection): ICML'16 by Altschuler et al. 5. Coverage problems: submitted, by Bateni, Esfandiari, Mirrokni.
10 Applications: Diversity & Submodular Maximization. Diverse suggestions: Play apps, campaign keywords, search results, news articles, YouTube videos. Data summarization, feature selection, exemplar sampling.
11 Feature selection. We have data points (docs, web pages, etc.) and features (topics, etc.). Goal: pick a small set of representative features. [Word cloud of example features: emotion, hotel, finance, hospital, money, business, weather, movie, gaming, smartphone, software, education, security, shopping, ...]
12 Five Problems Considered. General: find a set S of k items and maximize/minimize f(S). Diversity maximization: find a set S of k points maximizing the sum of pairwise distances, i.e., max diversity(S) = Σ_{u,v ∈ S} dist(u, v). Capacitated/balanced clustering: find a set S of k centers and cluster nodes around them while minimizing the sum of distances to S. Coverage/submodular maximization: find a set S of k items maximizing a submodular function f(S); generalizes set cover. Column subset selection: given a matrix A, find a set S of k columns minimizing the error of approximating A in the column space of A[S].
13 Diversity Maximization Problem. Given: a set of n points in a metric space (X, dist). Find a set S of k points. Goal: maximize diversity(S), i.e., diversity(S) = the sum of pairwise distances of the points in S. Background: max dispersion (Halldórsson et al., Abbassi et al.). Useful for feature selection, diverse candidate selection in search, representative centers, ...
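The local-search algorithm used as ALG for this problem can be sketched as follows: start from an arbitrary set of k points and keep swapping while the objective improves (a minimal single-machine sketch; the function names and the toy 1-D instance are ours).

```python
from itertools import combinations

def diversity(S, dist):
    """Sum of pairwise distances of the points in S."""
    return sum(dist(u, v) for u, v in combinations(S, 2))

def local_search_diversity(points, k, dist):
    """Start from an arbitrary set of k points; while some swap of a chosen
    point for an unchosen one strictly improves diversity, perform it."""
    S = list(points[:k])
    improved = True
    while improved:
        improved = False
        for i in range(len(S)):
            for q in points:
                if q in S:
                    continue
                candidate = S[:i] + [q] + S[i + 1:]
                if diversity(candidate, dist) > diversity(S, dist):
                    S, improved = candidate, True
    return S

points = [0, 1, 2, 10, 20]
best = local_search_diversity(points, k=2, dist=lambda u, v: abs(u - v))
```

On the line, the most dispersed pair is {0, 20}, and local search reaches it regardless of the arbitrary starting set.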
14 Core-sets for Diversity Maximization. Two rounds of MapReduce: run LocalSearch on each machine to select items, then run LocalSearch on the selected items to find the final output set. [Diagram as before.] Arbitrary partitioning works; random partitioning is better.
15 Composable Core-set Results for Diversity Maximization. Theorem (Indyk, Mahabadi, Mahdian, Mirrokni '14): the local search algorithm computes a constant-factor composable core-set for maximizing the sum of pairwise distances in two rounds. Theorem (Epasto, Mirrokni, Zadimoghaddam '16): a sampling+greedy algorithm computes a randomized 2-approximate, small-size composable core-set for diversity maximization in one round. Randomized: works under random partitioning. Small-size: the size of the core-set is less than k.
16 Distributed Clustering Problems. Clustering: divide data in a metric space (X, d) into groups containing nearby points. Minimize: k-center: the maximum distance of a point to its nearest center; k-median: the sum of distances of points to their nearest centers; k-means: the sum of squared distances of points to their nearest centers. α-approximation algorithm: cost less than α · OPT.
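The three objectives are instances of one ℓp cost, which the next slides generalize; a minimal sketch (names and the toy instance are ours):

```python
def clustering_cost(points, centers, dist, p):
    """ℓp clustering cost: the ℓp norm of the vector of point-to-nearest-center
    distances. p = 1 gives k-median, p = 2 relates to k-means (which minimizes
    the squared distances), and p = inf gives k-center."""
    d = [min(dist(x, c) for c in centers) for x in points]
    if p == float("inf"):
        return max(d)
    return sum(v ** p for v in d) ** (1.0 / p)

points, centers = [0, 1, 2, 10], [0, 10]
dist = lambda x, c: abs(x - c)
median_cost = clustering_cost(points, centers, dist, 1)             # 0+1+2+0
center_cost = clustering_cost(points, centers, dist, float("inf"))  # max distance
```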
17 Mapping Core-sets for Capacitated Clustering
18 Capacitated ℓp clustering. Problem: given n points in a metric space, find k centers and assign points to centers, respecting capacities, to minimize the ℓp norm of the distance vector. Generalizes balanced k-median, k-means & k-center. The objective is not minimizing cut size (cf. balanced partitioning in the library). Theorem (Bateni, Bhaskara, Lattanzi, Mirrokni, NIPS'14): for any p and k < n, distributed balanced clustering with approximation ratio a small constant times the best single-machine guarantee; #rounds: 2; memory: (n/m)² with m machines. Improves [BMVKV 12] and [BEL 13].
19 Empirical study for distributed clustering. Test in terms of scalability and quality of solution. Two base instances & subsamples: US graph (~30M nodes), World graph (~500M nodes). [Table: size of sequential instance vs. increase in OPT for the US and World graphs.] Quality: pessimistic analysis. Sublinear running-time scaling.
22 Submodular maximization. Problem: given k and a submodular function f, find a set S of size k that maximizes f(S). Some applications: data summarization, feature selection, exemplar clustering. Special case, coverage maximization: given a family of subsets, choose a subfamily of k sets and maximize the cardinality of their union (cover various topics/meanings, target all kinds of users). [IMMM 14] Bad news: no deterministic composable core-set achieves a constant-factor approximation. Randomization is necessary and useful: send each set randomly to some machine; build a core-set on each machine by the greedy algorithm.
23 Randomization to the Rescue: Randomized Core-sets. [Diagram] The input set is partitioned at random into T1, ..., Tm; machine i runs GREEDY on its random part Ti to select Si; GREEDY is then run on the union of the selected items to find the final output set. Two rounds of MapReduce.
24 Results for Submodular Maximization: MZ (STOC'15). A class of 0.33-approximate randomized composable core-sets of size k for non-monotone submodular maximization; for example, the greedy algorithm. It is hard to go beyond a ½-approximation with size k. Impossible to get a better than (1−1/e)-approximate randomized composable core-set of size 4k for monotone f. Results in a 0.54-approximate distributed algorithm in two rounds with linear communication complexity. For small-size composable core-sets of size k′ less than k: a √(k′/k)-approximate randomized composable core-set.
25 Low-Rank Approximation. Given a (large) matrix A ∈ R^{m×n} and target rank k ≪ m, n. Optimal solution: rank-k SVD. Applications: dimensionality reduction, signal denoising, compression, ...
26 Column Subset Selection (CSS). Columns often have important meaning. CSS: low-rank approximation of A restricted to the column space of a subset A[S] of k of its columns. [Diagram: the m×n matrix A approximated via the m×k matrix A[S].]
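The greedy variant of CSS that the distributed algorithm builds on can be sketched as follows (a single-machine sketch using least-squares projection; the function name and toy matrix are ours, and real implementations update residuals incrementally rather than re-solving each time).

```python
import numpy as np

def greedy_css(A, k):
    """Greedy column subset selection: repeatedly add the column that most
    reduces the Frobenius error of projecting A onto the chosen columns."""
    S = []
    for _ in range(k):
        best, best_err = None, None
        for j in range(A.shape[1]):
            if j in S:
                continue
            C = A[:, S + [j]]
            # Project A onto the column space of C via least squares.
            proj = C @ np.linalg.lstsq(C, A, rcond=None)[0]
            err = np.linalg.norm(A - proj)
            if best_err is None or err < best_err:
                best, best_err = j, err
        S.append(best)
    return S

A = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0, 10.0],
              [0.0, 0.0, 10.0]])
selected = greedy_css(A, k=1)
```

On this toy matrix the third column carries almost all of the Frobenius norm, so greedy selects it first.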
30 DISTGREEDY: GCSS(A, B, k) with L machines. [Diagram: the columns of B are distributed across the L machines; selected columns are sent to a designated machine.]
31 DISTGREEDY for column subset selection
32 Empirical results for column subset selection. Training accuracy on a massive data set (news20.binary, a 15k × 100k matrix); speedup over the 2-phase algorithm in parentheses. Interesting experiment: what if we partition more carefully and not randomly? Recent observation: if we treat each machine separately, it does not help much. Random partitioning is good even compared with more careful partitioning.
35 Coverage Problems. Problems: given a set system (n sets and m elements): 1. k-coverage: pick k sets to maximize the size of their union. 2. Set cover: cover all elements with the least number of sets. 3. Set cover with outliers: cover (1−λ)m elements with the least number of sets. Greedy algorithm: repeatedly pick a subset with the maximum marginal coverage; a (1−1/e)-approximation to k-coverage, a (log n)-approximation for set cover, ... Goal: achieve a good, fast approximation with minimum memory footprint. Streaming: elements arrive one by one, not sets. Distributed: linear communication, and memory independent of the size of the ground set.
38 Submodular Maximization vs. Maximum Coverage. A coverage function is a special case of a submodular function: f(R) = the cardinality of the union of the family R of subsets. So, problem solved? [MirrokniZadimoghaddam STOC'15]: randomized composable core-sets work. [Mirzasoleiman et al. NIPS'14]: this method works well in practice! No: this solution has several issues for coverage problems. It requires expensive oracle access to compute the cardinality of a union. Distributed computation: send whole sets around? Streaming: handles the set-arrival model, but not the element-arrival model!
39 Why can't we apply core-sets for submodular functions? [Diagram: the family of subsets is partitioned into subfamilies T1, ..., Tm; each machine runs ALG on its subfamily; ALG is run on the selected items to find the final output set.] What if the subsets are large? Can we send a sketch of them?
40 Idea: send a sketch for each set (e.g., a sample of its elements). [Diagram as before, but each machine receives sketches of the subsets in its part rather than the full sets.] Question: does any approximation-preserving sketch work?
42 Approximation-preserving sketching is not sufficient. Idea: use sketching to define a (1±ε)-approximate oracle to the cardinality-of-union function? [BateniEsfandiariMirrokni 16] Thm 1: a (1±ε)-approximate sketch of the coverage function may NOT help; given only a (1±ε)-approximate oracle to the coverage function, we can be stuck with an n^0.49 approximation. Thm 2: with some tricks, a MinHash-based sketch plus proper sampling WORKS. Sample elements, not sets (different from the previous core-set idea); correlate the samples (MinHash); cap the degrees of elements in the sketch (reduces the memory footprint).
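The two tricks of Thm 2 can be sketched in a few lines: one shared hash over elements correlates the per-set samples (as in MinHash), and a degree cap bounds how many sets an element may survive in. This is an illustrative sketch; the function names are ours, and in the actual construction the threshold p and the cap are computed from the problem size rather than passed in.

```python
import hashlib

def hash01(x):
    """Deterministic hash of an element to [0, 1); shared across all sets so
    the per-set samples are correlated, as in MinHash."""
    h = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2 ** 64

def coverage_sketch(sets, p, max_degree):
    """Keep only elements whose hash value is below p, then cap how many sets
    each surviving element may appear in (its degree in the bipartite graph)."""
    degree = {}
    sketch = []
    for s in sets:
        kept = set()
        for e in s:
            if hash01(e) >= p:                 # dependent element sampling
                continue
            if degree.get(e, 0) >= max_degree: # degree capping
                continue
            degree[e] = degree.get(e, 0) + 1
            kept.add(e)
        sketch.append(frozenset(kept))
    return sketch

sets = [frozenset({1, 2, 3}), frozenset({3, 4})]
sk = coverage_sketch(sets, p=1.0, max_degree=1)
```

With p = 1.0 every element passes the sampling step, and the degree cap of 1 keeps the shared element 3 in only the first set that contains it.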
43 Bipartite Graph Formulation for Coverage Problems. Bipartite graph G(U, V, E): U are the sets, V the elements, and E the membership relation. Maximum coverage problem: pick k sets that cover the maximum number of elements. Set cover problem: pick the minimum number of sets that cover all elements. Set cover with outliers: pick the minimum number of sets that cover a (1−λ)-fraction of the elements.
44 Sketching Technique. Construction: dependent sampling: assign hash values from [0, 1) to the elements; remove any element with hash value exceeding p. Then arbitrarily remove edges so that elements have maximum degree d. Parameters (p = 0.6 and d = 2 in the example): 1) p is easy to compute; 2) d can be found via a round of MapReduce.
45 Approach: build graph → sketch construction → core-set method → final greedy → extract results. Sketch: a sparse subgraph with sufficient information. For instances with many sets, parallelize using core-sets. Any single-machine greedy algorithm can be used in the final stage.
46 Proof ingredients: 1. The parameters are chosen to produce a small sketch, independent of the size of the ground set: O(#sets). Challenge: how to choose the parameters in distributed or streaming models. 2. Any α-approximation on the sketch is an (α + ε)-approximation for the original instance.
47 Summary of Results for Coverage Functions. A special case of submodular maximization; the problems are NP-hard and APX-hard. The greedy algorithm gives the best guarantees and has good (linear-time) implementations: the lazy greedy algorithm and the lazier-than-lazy algorithm. GREEDY: 1) start with an empty solution; 2) until done, (a) find the set with the best marginal coverage and (b) add it to the tentative solution. Problem: the graph should be stored in RAM. Our algorithm: memory O(#sets), linear time, optimal approximation guarantees; works in MapReduce, streaming, etc.
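The lazy greedy implementation mentioned above exploits submodularity: marginal gains only shrink, so a cached gain in a max-heap is a valid upper bound and most sets never need re-evaluation. A minimal sketch for k-coverage (the function name and toy instance are ours):

```python
import heapq

def lazy_greedy_max_cover(sets, k):
    """Lazy greedy for k-coverage: pop the set with the best cached bound,
    recompute its true gain, and select it only if the bound still beats the
    next-best cached bound; otherwise reinsert with the fresh gain."""
    covered, chosen = set(), []
    heap = [(-len(s), i) for i, s in enumerate(sets)]  # (-upper bound, index)
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        neg_bound, i = heapq.heappop(heap)
        gain = len(sets[i] - covered)          # true current marginal gain
        if not heap or gain >= -heap[0][0]:
            chosen.append(i)                   # bound still the best: take it
            covered |= sets[i]
        else:
            heapq.heappush(heap, (-gain, i))   # stale bound: reinsert
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {5}]
chosen, covered = lazy_greedy_max_cover(sets, k=2)
```

The output matches plain greedy, but in large instances only a few stale entries are ever recomputed per round.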
48 Bounds for distributed coverage problems. From [BEM 16]: 1) space independent of the size of the sets or the ground set; 2) optimal approximation factor; 3) communication linear in #sets (independent of their size); 4) small #rounds. Previous work: [39] = [CKT 11], [42] = [MZ 15], [19] = [BENW 16], [43] = [MBKK 16].
49 Bounds for streaming coverage problems. From [BEM 16]: 1) space independent of the size of the ground set; 2) optimal approximation factor; 3) edge vs. set arrival. Previous work: [14] = [CW 15], [22] = [DIMV 14], [24] = [ER 14], [31] = [IMV 15], [49] = [SG 09].
50 Empirical Study. Public datasets: social networks, bags of words, contribution graphs, planted instances. Very small sketches (0.01–5%) suffice for obtaining good approximations (95+%). Without core-sets, we can handle XXXB edges or elements in under an hour.
51 Feature Selection (ongoing). Goal: pick k representative features for the entities, based on composable core-sets. Methods: random clusters, best-cluster method, set cover (pairs). 1) Pick features that cover all entities; 2) pick features that cover many pairs (or triples, etc.) of entities.
52 Summary: Distributed Algorithms for Five Problems. Defined on a metric space, where composable core-sets apply: 1. Diversity Maximization (PODS'14 by Indyk, Mahdian, Mahabadi, Mirrokni; for feature selection, AAAI'17 by Abbasi, Ghadiri, Mirrokni, Zadimoghaddam). 2. Capacitated ℓp Clustering (NIPS'14 by Bateni, Bhaskara, Lattanzi, Mirrokni). Beyond metric spaces, where only randomized partitioning applies: 3. Submodular Maximization (STOC'15 by Mirrokni, Zadimoghaddam). 4. Feature Selection / Column Subset Selection (ICML'16 by Altschuler et al.). Needs adaptive sampling/sketching techniques: 5. Coverage Problems (by Bateni, Esfandiari, Mirrokni).
53 Our team: Google NYC Algorithms Research Team. Recently released external team website: research.google.com/teams/nycalg/
54 THANK YOU
55 Local Search for Diversity Maximization [KDD 13]
More informationIEOR E4008: Computational Discrete Optimization
Yuri Faenza IEOR Department Jan 23th, 2018 Logistics Instructor: Yuri Faenza Assistant Professor @ IEOR from 2016 Research area: Discrete Optimization Schedule: MW, 10:10-11:25 Room: 303 Mudd Office Hours:
More informationPolynomial-Time Approximation Algorithms
6.854 Advanced Algorithms Lecture 20: 10/27/2006 Lecturer: David Karger Scribes: Matt Doherty, John Nham, Sergiy Sidenko, David Schultz Polynomial-Time Approximation Algorithms NP-hard problems are a vast
More informationOnline Ad Allocation: Theory and Practice
Online Ad Allocation: Theory and Practice Vahab Mirrokni December 19, 2014 Based on recent papers in collaboration with my colleagues at Google, Columbia Univ., Cornell, MIT, and Stanford. [S. Balseiro,
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank
More information6 Randomized rounding of semidefinite programs
6 Randomized rounding of semidefinite programs We now turn to a new tool which gives substantially improved performance guarantees for some problems We now show how nonlinear programming relaxations can
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions
More informationDatabase and Knowledge-Base Systems: Data Mining. Martin Ester
Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro
More informationMaximizing the Spread of Influence through a Social Network. David Kempe, Jon Kleinberg and Eva Tardos
Maximizing the Spread of Influence through a Social Network David Kempe, Jon Kleinberg and Eva Tardos Group 9 Lauren Thomas, Ryan Lieblein, Joshua Hammock and Mary Hanvey Introduction In a social network,
More informationDiffusion and Clustering on Large Graphs
Diffusion and Clustering on Large Graphs Alexander Tsiatas Thesis Proposal / Advancement Exam 8 December 2011 Introduction Graphs are omnipresent in the real world both natural and man-made Examples of
More informationFinding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing
Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Goals Many Web-mining problems can be expressed as finding similar sets:. Pages with similar words, e.g., for classification
More informationMaximum Betweenness Centrality: Approximability and Tractable Cases
Maximum Betweenness Centrality: Approximability and Tractable Cases Martin Fink and Joachim Spoerhase Chair of Computer Science I University of Würzburg {martin.a.fink, joachim.spoerhase}@uni-wuerzburg.de
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationApproximation Algorithms
Approximation Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours
More information1 Case study of SVM (Rob)
DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu [Kumar et al. 99] 2/13/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
More informationCSE 202: Design and Analysis of Algorithms Lecture 4
CSE 202: Design and Analysis of Algorithms Lecture 4 Instructor: Kamalika Chaudhuri Announcements HW 1 due in class on Tue Jan 24 Email me your homework partner name, or if you need a partner today Greedy
More informationMaxCover in MapReduce Flavio Chierichetti, Ravi Kumar, Andrew Tomkins
MaxCover in MapReduce Flavio Chierichetti, Ravi Kumar, Andrew Tomkins Advisor Klaus Berberich Presented By: Isha Khosla Outline Motivation Introduction Classical Approach: Greedy Proposed Algorithm: M
More informationApproximation-stability and proxy objectives
Harnessing implicit assumptions in problem formulations: Approximation-stability and proxy objectives Avrim Blum Carnegie Mellon University Based on work joint with Pranjal Awasthi, Nina Balcan, Anupam
More informationHomomorphic Sketches Shrinking Big Data without Sacrificing Structure. Andrew McGregor University of Massachusetts
Homomorphic Sketches Shrinking Big Data without Sacrificing Structure Andrew McGregor University of Massachusetts 4Mv 2 2 32 3 2 3 2 3 4 M 5 3 5 = v 6 7 4 5 = 4Mv5 = 4Mv5 Sketches: Encode data as vector;
More informationPart I Part II Part III Part IV Part V. Influence Maximization
Part I Part II Part III Part IV Part V Influence Maximization 1 Word-of-mouth (WoM) effect in social networks xphone is good xphone is good xphone is good xphone is good xphone is good xphone is good xphone
More informationLocality- Sensitive Hashing Random Projections for NN Search
Case Study 2: Document Retrieval Locality- Sensitive Hashing Random Projections for NN Search Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 18, 2017 Sham Kakade
More information/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18
601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can
More informationNetwork Wide Policy Enforcement. Michael K. Reiter (joint work with V. Sekar, R. Krishnaswamy, A. Gupta)
Network Wide Policy Enforcement Michael K. Reiter (joint work with V. Sekar, R. Krishnaswamy, A. Gupta) 1 Enforcing Policy in Future Networks MF vision includes enforcement of rich policies in the network
More informationIntroduction to Graph Theory
Introduction to Graph Theory Tandy Warnow January 20, 2017 Graphs Tandy Warnow Graphs A graph G = (V, E) is an object that contains a vertex set V and an edge set E. We also write V (G) to denote the vertex
More informationCSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection
CSE 255 Lecture 6 Data Mining and Predictive Analytics Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:
More informationSOLVING LARGE CARPOOLING PROBLEMS USING GRAPH THEORETIC TOOLS
July, 2014 1 SOLVING LARGE CARPOOLING PROBLEMS USING GRAPH THEORETIC TOOLS Irith Ben-Arroyo Hartman Datasim project - (joint work with Abed Abu dbai, Elad Cohen, Daniel Keren) University of Haifa, Israel
More informationPredictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive
More informationProteins, Particles, and Pseudo-Max- Marginals: A Submodular Approach Jason Pacheco Erik Sudderth
Proteins, Particles, and Pseudo-Max- Marginals: A Submodular Approach Jason Pacheco Erik Sudderth Department of Computer Science Brown University, Providence RI Protein Side Chain Prediction Estimate side
More informationarxiv: v2 [cs.ds] 14 Sep 2018
Massively Parallel Dynamic Programming on Trees MohammadHossein Bateni Soheil Behnezhad Mahsa Derakhshan MohammadTaghi Hajiaghayi Vahab Mirrokni arxiv:1809.03685v2 [cs.ds] 14 Sep 2018 Abstract Dynamic
More informationCommunity Detection. Community
Community Detection Community In social sciences: Community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group a.k.a. group,
More informationOn the Max Coloring Problem
On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive
More informationSubmodular Utility Maximization for Deadline Constrained Data Collection in Sensor Networks
Submodular Utility Maximization for Deadline Constrained Data Collection in Sensor Networks Zizhan Zheng, Member, IEEE and Ness B. Shroff, Fellow, IEEE Abstract We study the utility maximization problem
More informationCSC2420 Fall 2012: Algorithm Design, Analysis and Theory
CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin September 20, 2012 1 / 1 Lecture 2 We continue where we left off last lecture, namely we are considering a PTAS for the the knapsack
More informationNon-exhaustive, Overlapping k-means
Non-exhaustive, Overlapping k-means J. J. Whang, I. S. Dhilon, and D. F. Gleich Teresa Lebair University of Maryland, Baltimore County October 29th, 2015 Teresa Lebair UMBC 1/38 Outline Introduction NEO-K-Means
More informationFrom Routing to Traffic Engineering
1 From Routing to Traffic Engineering Robert Soulé Advanced Networking Fall 2016 2 In the beginning B Goal: pair-wise connectivity (get packets from A to B) Approach: configure static rules in routers
More informationCIS 399: Foundations of Data Science
CIS 399: Foundations of Data Science Massively Parallel Algorithms Grigory Yaroslavtsev Warren Center for Network and Data Sciences http://grigory.us Big Data = buzzword Non-experts, media: a lot of spreadsheets,
More informationAdaptive Caching Algorithms with Optimality Guarantees for NDN Networks. Stratis Ioannidis and Edmund Yeh
Adaptive Caching Algorithms with Optimality Guarantees for NDN Networks Stratis Ioannidis and Edmund Yeh A Caching Network Nodes in the network store content items (e.g., files, file chunks) 1 A Caching
More informationNearest Neighbor with KD Trees
Case Study 2: Document Retrieval Finding Similar Documents Using Nearest Neighbors Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox January 22 nd, 2013 1 Nearest
More informationAlgorithms for Nearest Neighbors
Algorithms for Nearest Neighbors Classic Ideas, New Ideas Yury Lifshits Steklov Institute of Mathematics at St.Petersburg http://logic.pdmi.ras.ru/~yura University of Toronto, July 2007 1 / 39 Outline
More informationComp Online Algorithms
Comp 7720 - Online Algorithms Notes 4: Bin Packing Shahin Kamalli University of Manitoba - Fall 208 December, 208 Introduction Bin packing is one of the fundamental problems in theory of computer science.
More information