1 Templates for scalable data analysis. 3: Distributed Latent Variable Models. Amr Ahmed, Alexander J. Smola, Markus Weimer. Yahoo! Research & UC Berkeley & ANU

2 Variations on a theme: inference for mixtures. Parallel inference: parallelization templates. Samplers: scaling up LDA.

3 Inference for Mixture Models

4 Clustering

5 Clustering

6 Clustering

7 Clustering (figure: example clusters such as airline, university, restaurant)

8 Generative Model (plate diagram): for each object i = 1..m, draw a cluster ID y_i from θ and an observation x_i from the cluster parameters μ_{y_i}, Σ_{y_i}; the cluster parameters are μ_j, Σ_j for j = 1..k.

9 Generative Model (same plate diagram), with joint distribution
$$p(X, Y \mid \theta, \mu, \Sigma) = \prod_{i=1}^{n} p(x_i \mid y_i, \mu, \Sigma)\, p(y_i \mid \theta)$$
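As a concrete illustration of this generative process, here is a small NumPy sketch that first draws each cluster ID y_i from θ and then the observation x_i from the corresponding Gaussian; the variable names and example parameters are illustrative only.

```python
import numpy as np

def sample_mixture(n, theta, mus, Sigmas, seed=0):
    """Draw n points from p(X, Y | theta, mu, Sigma): y_i ~ Discrete(theta), x_i ~ N(mu_{y_i}, Sigma_{y_i})."""
    rng = np.random.default_rng(seed)
    y = rng.choice(len(theta), size=n, p=theta)                            # cluster IDs
    x = np.stack([rng.multivariate_normal(mus[j], Sigmas[j]) for j in y])  # observations
    return x, y

# Example: three well-separated 2-D clusters
theta = [0.5, 0.3, 0.2]
mus = [np.zeros(2), np.array([5.0, 0.0]), np.array([0.0, 5.0])]
Sigmas = [np.eye(2)] * 3
X, Y = sample_mixture(1000, theta, mus, Sigmas)
```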

10 de Finetti: any distribution over exchangeable random variables can be written as a mixture over conditionally independent variables:
$$p(x_1, \ldots, x_n) = \int dp(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)$$
Inference should therefore be easy: infer θ given the x_i, and each x_i given θ.

11 Conjugates and Collapsing
Exponential family: $p(x \mid \theta) = \exp(\langle \phi(x), \theta \rangle - g(\theta))$
Conjugate prior: $p(\theta \mid \mu_0, m_0) = \exp(m_0 \langle \mu_0, \theta \rangle - m_0 g(\theta) - h(m_0 \mu_0, m_0))$
Posterior: $p(\theta \mid X, \mu_0, m_0) \propto \exp(\langle m_0 \mu_0 + m \mu[X], \theta \rangle - (m_0 + m) g(\theta) - h(m_0 \mu_0, m_0))$
Collapsing the natural parameter: $p(X \mid \mu_0, m_0) = \exp(h(m_0 \mu_0 + m \mu[X], m_0 + m) - h(m_0 \mu_0, m_0))$
Here $\mu[X]$ is the mean of the sufficient statistics of the data, $m$ the number of observations, and $h$ the log-normalizer of the prior.
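The collapsing identity is easiest to see for a concrete conjugate pair. The sketch below instantiates it for a Dirichlet prior with a categorical likelihood (a choice made purely for illustration; the slide's notation is the general exponential-family form), where h is the Dirichlet log-normalizer.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_dirichlet_categorical(counts, alpha):
    """Collapsed log p(X | alpha) = h(alpha + counts) - h(alpha),
    with h(a) = sum_k log Gamma(a_k) - log Gamma(sum_k a_k) the Dirichlet log-normalizer."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    h = lambda a: gammaln(a).sum() - gammaln(a.sum())
    return h(alpha + counts) - h(alpha)

# Example: three symbols observed with counts (4, 1, 0) under a symmetric prior
print(log_marginal_dirichlet_categorical([4, 1, 0], [0.5, 0.5, 0.5]))
```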

12 Conjugates and Collapsing (figure: the explicit graphical model m_0, μ_0 → θ → x_i next to its collapsed representation, the de Finetti view with θ integrated out)

13 Clustering & Topic Models
Clustering: α prior, θ cluster probability, y cluster label, x instance.
Latent Dirichlet Allocation: α prior, θ topic probability, y topic label, x instance.

14 Clustering & Topic Models (matrix view): documents ≈ membership × cluster/topic distributions. The membership matrix is a (0, 1) matrix for clustering, a stochastic matrix for a topic model, and an arbitrary matrix for LSI.

15 Clustering & Topic Models (matrix view, continued): estimate the cluster/topic distributions and sample or optimize the membership. Clustering: (0, 1) membership matrix; topic model: stochastic membership matrix; LSI: arbitrary matrices.
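To make the factorization concrete, here is a small NumPy sketch of the three variants of the membership matrix (dimensions and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
D, W, T = 100, 5000, 20                         # documents, vocabulary size, clusters/topics
topics = rng.random((T, W))
topics /= topics.sum(axis=1, keepdims=True)     # each row is a distribution over words

# clustering: each row of the membership matrix is a one-hot (0, 1) indicator
membership_clustering = np.eye(T)[rng.integers(T, size=D)]
# topic model: each row is a distribution over topics (a stochastic matrix)
membership_lda = rng.dirichlet(np.ones(T), size=D)
# LSI: the membership matrix may be an arbitrary real matrix

docs_expected = membership_lda @ topics         # D x W: documents = membership x topic distributions
```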

16 V1 - Brute force maximization (Hal Daume; Joey Gonzalez)
Integrate out the latent parameters θ and ψ and maximize p(X, Y | α, β): a discrete maximization problem in Y.
Hard to implement. Overfits a lot (the mode is not a typical sample). Parallelization infeasible.

17 V2 - Brute force maximization (Hoffman, Blei, Bach; in VW)
Integrate out the latent topic assignments y and maximize p(X | θ, ψ, α, β): a continuous nonconvex optimization problem in θ and ψ, solved by stochastic gradient descent over documents.
Easy to implement. Does not overfit much. Great for small datasets. Parallelization difficult/impossible. Memory storage/access is O(T W), which breaks for large models: 1M words and 1000 topics need 4 GB. Roughly 1 MFlop per document per iteration.
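A back-of-the-envelope check of the memory figure quoted above, assuming 4-byte floating-point entries for the dense topic-word table (the byte width is an assumption, not stated on the slide):

```python
# Dense topic-word table: T topics x W word types
T, W, bytes_per_entry = 1_000, 1_000_000, 4
print(T * W * bytes_per_entry / 2**30, "GiB")   # about 3.7 GiB, i.e. roughly 4 GB
```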

18 V3 - Variational approximation (Blei, Ng, Jordan)
Approximate the intractable joint distribution by tractable factors:
$$\log p(x) \geq \log p(x) - D(q(y) \,\|\, p(y \mid x)) = \int dq(y)\,\big[\log p(x) + \log p(y \mid x) - \log q(y)\big] = \int dq(y)\, \log p(x, y) + H[q]$$
This yields an alternating convex optimization problem whose dominant cost is a matrix-matrix multiply.
Easy to implement. Great for small topic counts and vocabularies. Parallelization is easy (only aggregate statistics). Memory storage is O(T W), which breaks for large models. The model is not quite as good as with sampling.

19 V4 - Uncollapsed sampling
Sample y_ij | rest (can be done in parallel). Sample θ | rest and ψ | rest (can be done in parallel). Compatible with MapReduce (only aggregate statistics). Easy to implement. Children can be conditionally independent (for the right model). Memory storage is O(T W), which breaks for large models. Mixes slowly.

20 V5 - Collapsed sampling (Griffiths & Steyvers, 2005)
Integrate out the latent parameters θ and ψ of p(X, Y | α, β) and sample one topic assignment y_ij | X, Y^{-ij} at a time from
$$p(y_{ij} = t \mid X, Y^{-ij}) \propto \frac{n^{-ij}(t, d) + \alpha_t}{n^{-ij}(d) + \sum_t \alpha_t} \cdot \frac{n^{-ij}(t, w) + \beta_w}{n^{-ij}(t) + \sum_w \beta_w}$$
Fast mixing. Easy to implement. Memory efficient. Parallelization infeasible (the variables lock each other).
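A minimal single-machine sketch of this collapsed Gibbs sampler with scalar (symmetric) hyperparameters; the count-array names and defaults are illustrative rather than taken from the slides. The document-length denominator is constant in t and is therefore dropped.

```python
import numpy as np

def collapsed_gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of word-id lists. Returns topic assignments and count tables."""
    rng = np.random.default_rng(seed)
    n_td = np.zeros((T, len(docs)))                         # n(t, d): topic counts per document
    n_tw = np.zeros((T, W))                                 # n(t, w): topic counts per word type
    n_t = np.zeros(T)                                       # n(t): total count per topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]    # random initial assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_td[t, d] += 1; n_tw[t, w] += 1; n_t[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_td[t, d] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1    # remove the "-ij" assignment
                # conditional from the slide (document-length denominator omitted: constant in t)
                p = (n_td[:, d] + alpha) * (n_tw[:, w] + beta) / (n_t + W * beta)
                t = rng.choice(T, p=p / p.sum())
                n_td[t, d] += 1; n_tw[t, w] += 1; n_t[t] += 1
                z[d][i] = t
    return z, n_td, n_tw
```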

22 V6 - Approximating the distribution (Asuncion, Smyth, Welling, ... UCI; Mimno, McCallum, ... UMass)
Run a collapsed sampler per machine, using the same conditional
$$p(y_{ij} = t \mid X, Y^{-ij}) \propto \frac{n^{-ij}(t, d) + \alpha_t}{n^{-ij}(d) + \sum_t \alpha_t} \cdot \frac{n^{-ij}(t, w) + \beta_w}{n^{-ij}(t) + \sum_w \beta_w}$$
and defer synchronization between the machines: no problem for n(t), a big problem for n(t, w).
Easy to implement. Can be memory efficient. Easy parallelization. Mixes slowly / worse likelihood.

23 V7 - Better approximations of the distribution (Smola and Narayanamurthy, 2009; Ahmed, Gonzalez, et al., 2012)
Use the same collapsed sampler but make local copies of the state: implicit for multicore (delayed updates from the samplers), explicit copies for multi-machine. This is not a hierarchical model (cf. Welling, Asuncion, et al., 2008). Memory efficient (each sampler only needs to view its own sufficient statistics). Works multicore and multi-machine. Convergence speed depends on the quality of the synchronizer.

24 V8 - Sequential Monte Carlo (Canini, Shi, Griffiths, 2009; Ahmed et al., 2011)
Integrate out the latent θ and ψ and chain the conditional probabilities of p(X, Y | α, β):
$$p(X, Y \mid \alpha, \beta) = \prod_{i=1}^{m} p(x_i, y_i \mid x_1, y_1, \ldots, x_{i-1}, y_{i-1}, \alpha, \beta)$$
For each particle, sample $y_i \sim p(y_i \mid x_i, x_1, y_1, \ldots, x_{i-1}, y_{i-1}, \alpha, \beta)$, then reweight the particle by the next step's data likelihood $p(x_{i+1} \mid x_1, y_1, \ldots, x_i, y_i, \alpha, \beta)$, and resample particles if the weight distribution is too uneven.

25 V8 - Sequential Monte Carlo, continued (Canini, Shi, Griffiths, 2009; Ahmed et al., 2011)
One pass through the data, but the data is processed sequentially, so parallelization is an open problem. Nontrivial to implement: the sampler itself is easy, but the inheritance tree through the particles is messy, and the data likelihood (an integration over y) needs to be estimated, e.g. as part of the sampler. This is a multiplicative-update algorithm with log loss.
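A schematic sketch of the particle-filter loop described on the last two slides; `sample_assignment` and `data_likelihood` are hypothetical model-specific callbacks, and the effective-sample-size trigger is one common resampling rule rather than the one used by the cited systems.

```python
import numpy as np

def smc_pass(data, n_particles, sample_assignment, data_likelihood, ess_frac=0.5, seed=0):
    """One sequential pass over the data with importance-weighted particles."""
    rng = np.random.default_rng(seed)
    particles = [{"stats": {}} for _ in range(n_particles)]     # per-particle sufficient statistics
    w = np.full(n_particles, 1.0 / n_particles)                 # particle weights
    for i, x in enumerate(data):
        for p in particles:
            sample_assignment(p["stats"], x)                    # sample y_i given the particle's history
        if i + 1 < len(data):
            # reweight by the predictive likelihood of the next data item
            w *= np.array([data_likelihood(p["stats"], data[i + 1]) for p in particles])
            w /= w.sum()
        if 1.0 / np.sum(w ** 2) < ess_frac * n_particles:       # weights too uneven: resample
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles = [{"stats": dict(particles[j]["stats"])} for j in idx]
            w[:] = 1.0 / n_particles
    return particles, w
```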

26 Summary. Approaches compared: uncollapsed, variational approximation, collapsed natural parameters, collapsed topic assignments.
Optimization: overfits, too costly; easy parallelization, big memory footprint, overfits; too costly, easy to optimize, big memory footprint, difficult parallelization; fast mixing, difficult parallelization.
Sampling: slow mixing, conditionally independent; n.a.; approximate inference by delayed updates; sampling difficult; particle filtering, sequential.

27 Parallel Inference

28 3 Problems (figure: data and cluster IDs on one side; cluster means, variances, and cluster weights on the other)

29 3 Problems (figure: data, local state, global state)

30 3 Problems (figure: the data is too big for a single machine; the local state is huge but only needed locally)

31 3 Problems (figure: data, local state, and global state for vanilla LDA and for user profiling)

33 3 Problems: the local state is too large (it does not fit into memory); the global state is too large (network load & barriers; it does not fit into memory).

34 3 Problems: the local state is too large (it does not fit into memory), so stream local data from disk; the global state is too large (network load & barriers; it does not fit into memory).

35 3 Problems: the local state is too large (it does not fit into memory), so stream local data from disk; the global state is too large (network load & barriers; it does not fit into memory), so synchronize asynchronously.

36 3 Problems: the local state is too large (it does not fit into memory), so stream local data from disk; the global state is too large (network load & barriers; it does not fit into memory), so synchronize asynchronously and keep only a partial view.

37 Distribution (figure: global state, replicas, clusters, racks)

39 Synchronization
The child updates its local state. Start with a common state; the child stores the old and the new state, the parent keeps the global state, and differences are transmitted asynchronously. This requires an inverse element for the difference and an abelian group for commutativity (sum, log-sum, cyclic group, exponential families).
Local to global: $x_{\text{global}} \leftarrow x_{\text{global}} + (x - x_{\text{old}})$, then $x_{\text{old}} \leftarrow x$.
Global to local: $x \leftarrow x + (x_{\text{global}} - x_{\text{old}})$, then $x_{\text{old}} \leftarrow x_{\text{global}}$.
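A minimal sketch of this delta exchange, assuming the state is a NumPy array so that the abelian group operation is plain elementwise addition; `global_store` stands in for the parent's copy of the state.

```python
import numpy as np

class SyncClient:
    """Keeps (x, x_old) on the child and exchanges differences with the parent's copy."""
    def __init__(self, x0):
        self.x = x0.copy()          # local working copy
        self.x_old = x0.copy()      # value at the last synchronization

    def push(self, global_store):                   # local to global
        global_store += self.x - self.x_old         # x_global <- x_global + (x - x_old)
        self.x_old = self.x.copy()                  # x_old <- x

    def pull(self, global_store):                   # global to local
        self.x += global_store - self.x_old         # x <- x + (x_global - x_old)
        self.x_old = global_store.copy()            # x_old <- x_global

# Two children updating a shared count vector out of sync
global_store = np.zeros(4)
a, b = SyncClient(global_store), SyncClient(global_store)
a.x += np.array([1.0, 0, 0, 0]); a.push(global_store)
b.x += np.array([0, 2.0, 0, 0]); b.push(global_store); b.pull(global_store)
```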

40 Synchronization
Naive approach (dumb master): the global store is only a (key, value) store, so a local node needs to lock/read/write/unlock the master. This needs 4 TCP/IP round trips and is latency bound.
Better solution (smart master): the client sends a message to the master, it is queued, and the master incorporates it; the master sends a message to the client, it is queued, and the client incorporates it. This is bandwidth bound (more than 10x speedup in practice). The local-to-global and global-to-local updates are as on the previous slide.

41 Distribution
A dedicated server for the variables has insufficient bandwidth (hotspots) and insufficient memory. Instead, select the server for each variable, e.g. via consistent hashing: $m(x) = \operatorname{argmin}_{m \in M} h(x, m)$.
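One way to realize the argmin rule is rendezvous-style hashing, where every (key, server) pair gets a pseudo-random score; the hash function and key format below are assumptions for illustration.

```python
import hashlib

def assign_server(key, servers):
    """m(x) = argmin over m in M of h(x, m), with h a keyed hash of the (key, server) pair."""
    h = lambda k, m: int(hashlib.md5(f"{k}:{m}".encode()).hexdigest(), 16)
    return min(servers, key=lambda m: h(key, m))

print(assign_server("n(t, w=alice)", ["server-0", "server-1", "server-2"]))
```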

42 Distribution & fault tolerance
Storage is O(1/k) per machine and communication is O(1) per machine. Fast snapshots cost O(1/k) per machine (stop the synchronization and dump the state per vertex). Each machine keeps O(k) open connections and achieves O(1/k) throughput. Server selection as before: $m(x) = \operatorname{argmin}_{m \in M} h(x, m)$.

45 Synchronization
The data rate between machines is O(1/k) and the machines operate asynchronously (barrier free). Solution: schedule message pairs and communicate with r random machines simultaneously (figure: local and global state, r = 1).

48 Synchronization
The data rate between machines is O(1/k) and the machines operate asynchronously (barrier free). Solution: schedule message pairs and communicate with r random machines simultaneously (figure: efficiency below 0.89 for the r shown).

49 Synchronization
The data rate between machines is O(1/k) and the machines operate asynchronously (barrier free). Solution: schedule message pairs, communicate with r random machines simultaneously, and use a Luby-Rackoff PRPG for load balancing. Efficiency guarantee: 4 simultaneous connections are sufficient.

50 Scalability

51 Samplers

52 Sampling
Brute-force sampling over a large number of items is expensive. Ideally we want the work to scale with the entropy of the distribution over labels, but the sparsity of the distribution is typically only known after seeing the instances. Approach: decompose the (dense) probability into a dense invariant term and sparse variable terms, and use a fast proposal distribution with rejection sampling.
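A generic sketch of the proposal-plus-rejection idea; the target and proposal below are placeholders, and M must bound p/q for the sampler to be exact.

```python
import numpy as np

def rejection_sample_label(p_unnorm, q, M, rng):
    """Draw a label t with probability proportional to p_unnorm(t), using a cheap proposal q.
    Requires p_unnorm(t) <= M * q[t] for all t."""
    while True:
        t = rng.choice(len(q), p=q)                     # cheap draw from the proposal
        if rng.random() * M * q[t] < p_unnorm(t):       # accept with probability p(t) / (M q(t))
            return t

rng = np.random.default_rng(0)
q = np.array([0.7, 0.2, 0.1])                           # sparse/cheap proxy distribution
p = lambda t: [0.2, 0.5, 0.3][t]                        # unnormalized target
M = max(p(t) / q[t] for t in range(len(q)))
print(rejection_sample_label(p, q, M, rng))
```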

53 Exploiting Sparsity: decomposition (Mimno & McCallum, 2009)
Only the sparse terms need to be updated per word:
$$p(t \mid w_{ij}) \propto \underbrace{\frac{\alpha_t \beta_w}{n(t) + \bar\beta}}_{\text{dense but constant}} + \underbrace{\frac{\beta_w\, n(t, d = i)}{n(t) + \bar\beta}}_{\text{sparse}} + \underbrace{\frac{n(t, w = w_{ij})\,[n(t, d = i) + \alpha_t]}{n(t) + \bar\beta}}_{\text{sparse}}$$
where $\bar\beta = \sum_w \beta_w$. Does not work for clustering (too many factors).
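A sketch of how such a bucket decomposition is used to sample: the smoothing bucket is dense but changes only when n(t) changes (so it can be cached), while the document and word buckets touch only nonzero counts. Variable names and the scalar smoother are illustrative choices, not the exact implementation behind the slide.

```python
import numpy as np

def sample_topic_sparse(w, d, n_t, n_td, n_tw, alpha, beta, beta_sum, rng):
    """Sample t with p(t) proportional to a(t) + b(t) + c(t) as in the decomposition above.
    alpha: length-T array; beta: scalar smoother; beta_sum: sum of the smoothers."""
    denom = n_t + beta_sum
    a = alpha * beta / denom                    # smoothing bucket: dense, cacheable
    doc_topics = np.nonzero(n_td[:, d])[0]      # topics actually used in document d
    b = beta * n_td[doc_topics, d] / denom[doc_topics]
    word_topics = np.nonzero(n_tw[:, w])[0]     # topics in which word w actually occurs
    c = n_tw[word_topics, w] * (n_td[word_topics, d] + alpha[word_topics]) / denom[word_topics]
    u = rng.random() * (a.sum() + b.sum() + c.sum())
    if u < a.sum():                             # land in the smoothing bucket
        return int(np.searchsorted(np.cumsum(a), u))
    u -= a.sum()
    if u < b.sum():                             # land in the (sparse) document bucket
        return int(doc_topics[np.searchsorted(np.cumsum(b), u)])
    u -= b.sum()                                # otherwise the (sparse) word bucket
    j = min(np.searchsorted(np.cumsum(c), u), len(word_topics) - 1)
    return int(word_topics[j])
```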

54 Exploiting Sparsity: context LDA (Petterson et al., 2009)
The smoothers are word and topic dependent:
$$p(t \mid w_{ij}) \propto \frac{\beta(w, t)\, \alpha_t}{n(t) + \bar\beta(t)} + \frac{\beta(w, t)\, n(t, d = i)}{n(t) + \bar\beta(t)} + \frac{n(t, w = w_{ij})\,[n(t, d = i) + \alpha_t]}{n(t) + \bar\beta(t)}$$
The first term is now topic dependent and dense, so the simple sparse factorization does not work. Use Cauchy-Schwarz to upper-bound the first term:
$$\sum_t \frac{\beta(w, t)\, \alpha_t}{n(t) + \bar\beta(t)} \leq \|\beta(w, \cdot)\| \cdot \left\| \frac{\alpha}{n(\cdot) + \bar\beta(\cdot)} \right\|$$
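The bound is plain Cauchy-Schwarz applied to two T-dimensional vectors; a quick numerical sanity check with random placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta_w = rng.random(T)                       # beta(w, t) for one word w
alpha = rng.random(T)                        # alpha_t
denom = rng.random(T) + 1.0                  # n(t) + beta-bar(t)
lhs = np.sum(beta_w * alpha / denom)
rhs = np.linalg.norm(beta_w) * np.linalg.norm(alpha / denom)
assert lhs <= rhs                            # <u, v> <= ||u|| ||v||
```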

55 Collapsed vs Variational
Memory requirements (1k topics, 2M words): variational inference needs 8 GB RAM (no sparsity); the collapsed sampler needs 1.5 GB RAM (rare words). Burn-in plus exploiting sparsity saves a lot.
(figure: sampling speed-up over the standard sampler vs. iteration for 1000 topics, with curves for the unif, doc, and doc,word variants)
Cauchy-Schwarz bound, multilingual LDA, word context, smoothing over time.

56 Fast Proposal
In reality the sparsity assumption often does not hold for the real proposal, so guess a sparse proxy. In the storylines model these are the entities.

57 Variations on a theme: inference for mixtures. Parallel inference: parallelization templates. Samplers: scaling up LDA.
