Templates for scalable data analysis 3: Distributed Latent Variable Models. Amr Ahmed, Alexander J. Smola, Markus Weimer
1 Templates for scalable data analysis 3: Distributed Latent Variable Models. Amr Ahmed, Alexander J. Smola, Markus Weimer. Yahoo! Research & UC Berkeley & ANU
2 Variations on a theme: inference for mixtures. Parallel inference: parallelization templates. Samplers: scaling up LDA.
3 Inference for Mixture Models
4-7 Clustering (figure slides; example clusters: airline, university, restaurant)
8 Generative Model (plate diagram: cluster IDs $y_i$ drawn from $\theta$, observations $x_i$ drawn from the cluster parameters $(\mu_j, \Sigma_j)$, for objects $i = 1..m$ and clusters $j = 1..k$)
9 Generative Model (same plate diagram), with joint probability
$$p(X, Y \mid \theta, \mu) = \prod_{i=1}^{m} p(x_i \mid y_i, \theta, \mu)\, p(y_i \mid \theta)$$
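To make the generative story concrete, the following minimal sketch (not from the slides; all names are illustrative) draws m points from a Gaussian mixture exactly as the plate diagram prescribes:

```python
import numpy as np

def sample_mixture(theta, mus, sigmas, m, seed=0):
    """Draw m points: y_i ~ Discrete(theta), x_i ~ N(mu_{y_i}, Sigma_{y_i})."""
    rng = np.random.default_rng(seed)
    y = rng.choice(len(theta), size=m, p=theta)       # cluster IDs
    x = np.stack([rng.multivariate_normal(mus[j], sigmas[j]) for j in y])
    return x, y

# toy usage: two 2-d clusters
theta = [0.3, 0.7]
mus = [np.zeros(2), np.full(2, 5.0)]
sigmas = [np.eye(2), 0.5 * np.eye(2)]
x, y = sample_mixture(theta, mus, sigmas, m=100)
```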
10 de Finetti: any distribution over exchangeable random variables can be written as conditionally independent given a latent $\theta$:
$$p(x_1, \ldots, x_n) = \int dp(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)$$
Inference should be easy: the conditionals in both directions ($\theta$ given the $x_i$, and each $x_i$ given $\theta$) are simple.
11 Conjugates and Collapsing. Exponential family:
$$p(x \mid \theta) = \exp(\langle \phi(x), \theta \rangle - g(\theta))$$
Conjugate prior:
$$p(\theta \mid \mu_0, m_0) = \exp(m_0 \langle \mu_0, \theta \rangle - m_0 g(\theta) - h(m_0 \mu_0, m_0))$$
Posterior, with $\mu[X]$ the mean sufficient statistic of the data:
$$p(\theta \mid X, \mu_0, m_0) \propto \exp(\langle m_0 \mu_0 + m\,\mu[X], \theta \rangle - (m_0 + m)\, g(\theta) - h(m_0 \mu_0, m_0))$$
Collapsing the natural parameter:
$$p(X \mid \mu_0, m_0) = \exp(h(m_0 \mu_0 + m\,\mu[X],\, m_0 + m) - h(m_0 \mu_0, m_0))$$
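As a concrete instance of the collapsing identity, consider the Dirichlet-multinomial pair, where $h$ is the log-normalizer of the Dirichlet. The sketch below (illustrative, not from the slides) computes $\log p(X \mid \mu_0, m_0)$ for one observed sequence of draws:

```python
import numpy as np
from scipy.special import gammaln

def log_norm(a):
    """Log-partition h of a Dirichlet with parameter vector a."""
    return gammaln(a).sum() - gammaln(a.sum())

def collapsed_log_marginal(counts, m0_mu0):
    """log p(X | mu0, m0) = h(m0*mu0 + counts) - h(m0*mu0): theta integrated out.

    Gives the probability of one particular ordered sequence of draws
    (the multinomial coefficient is omitted)."""
    return log_norm(m0_mu0 + counts) - log_norm(m0_mu0)

counts = np.array([3., 0., 7.])   # sufficient statistics m * mu[X]
prior = np.ones(3)                # pseudo-counts m0 * mu0
print(collapsed_log_marginal(counts, prior))
```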
12 Conjugates and Collapsing (graphical models: hyperparameters $(m_0, \mu_0)$ over $\theta$ over the $x_i$; the collapsed representation integrates out $\theta$, cf. de Finetti)
13 Clustering & Topic Models, side by side. In clustering: $\alpha$ is the prior, $\theta$ the cluster probability, $y$ the cluster label, $x$ the instance. In Latent Dirichlet Allocation: $\alpha$ is the prior, $\theta$ the topic probability, $y$ the topic label, $x$ the instance.
14-15 Clustering & Topic Models as matrix factorization: documents = (cluster/topic distributions) x (membership); the distributions are estimated, the memberships sampled/optimized. Clustering uses a (0, 1) membership matrix, a topic model a stochastic matrix, LSI arbitrary matrices.
16 V1 - Brute force maximization. Integrate out the latent parameters $\theta$ and $\psi$, leaving $p(X, Y \mid \alpha, \beta)$: a discrete maximization problem in $Y$. Hard to implement; overfits a lot (the mode is not a typical sample); parallelization infeasible. (Hal Daume; Joey Gonzalez)
17 V2 - Brute force maximization (Hoffman, Blei, Bach; in VW). Integrate out the latent variables $y$, leaving $p(X \mid \theta, \psi, \alpha, \beta)$: a continuous nonconvex optimization problem in $\theta$ and $\psi$, solved by stochastic gradient descent over documents. Easy to implement; does not overfit much; great for small datasets. Parallelization difficult/impossible. Memory storage/access is O(T W), which breaks for large models: 1M words x 1000 topics = 4GB, and roughly 1 MFlop per document per iteration.
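As a hedged illustration of "SGD over documents", the sketch below takes one ascent step on $\log p(x_d \mid \theta_d, \psi)$ with $\psi$ parameterized through a row-wise softmax so it stays on the simplex. It holds $\theta_d$ fixed for brevity, and all names are assumptions rather than the authors' code:

```python
import numpy as np

def sgd_step_psi(counts_d, theta_d, lam, lr=0.1):
    """One ascent step on log p(x_d | theta_d, psi) w.r.t. lam,
    where psi = softmax(lam) row-wise (T x W).

    counts_d: length-W word counts of one document; theta_d: length-T topic mix.
    """
    psi = np.exp(lam - lam.max(axis=1, keepdims=True))
    psi /= psi.sum(axis=1, keepdims=True)                 # each topic on the simplex
    m = theta_d @ psi                                     # predicted word probabilities
    g_psi = theta_d[:, None] * (counts_d / np.maximum(m, 1e-12))[None, :]
    # chain rule through the row-wise softmax
    g_lam = psi * (g_psi - (g_psi * psi).sum(axis=1, keepdims=True))
    return lam + lr * g_lam
```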
18 V3 - Variational approximation (Blei, Ng, Jordan). Approximate the intractable joint distribution by tractable factors:
$$\log p(x) \ge \log p(x) - D(q(y)\,\|\,p(y \mid x)) = \int dq(y)\,[\log p(x) + \log p(y \mid x) - \log q(y)] = \int dq(y)\, \log p(x, y) + H[q]$$
Alternating convex optimization problem; the dominant cost is a matrix-matrix multiply. Easy to implement; great for small topic counts/vocabularies; parallelization easy (aggregate statistics). Memory storage is O(T W), which breaks for large models, and the model is not quite as good as sampling.
19 V4 - Uncollapsed Sampling. Sample $y_{ij} \mid \text{rest}$: can be done in parallel. Sample $\theta \mid \text{rest}$ and $\psi \mid \text{rest}$: can be done in parallel. Compatible with MapReduce (only aggregate statistics); easy to implement; children can be conditionally independent (for the right model). Memory storage is O(T W), which breaks for large models. Mixes slowly.
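A minimal sketch of one uncollapsed sweep (assuming scalar symmetric hyperparameters; variable names are illustrative): assignments are resampled independently per document, then $\theta$ and $\psi$ are redrawn from their Dirichlet conditionals, so only count aggregation crosses machine boundaries:

```python
import numpy as np

def uncollapsed_sweep(docs, theta, psi, alpha, beta, rng):
    """One V4 sweep. docs: list of word-id arrays; theta: D x T; psi: T x W."""
    D, T = theta.shape
    W = psi.shape[1]
    ndt = np.zeros((D, T))
    ntw = np.zeros((T, W))
    for d, words in enumerate(docs):             # embarrassingly parallel over d
        p = theta[d][:, None] * psi[:, words]     # T x n_d, p(y | theta, psi) up to norm.
        p /= p.sum(axis=0)
        for i, w in enumerate(words):
            t = rng.choice(T, p=p[:, i])          # sample y_ij | rest
            ndt[d, t] += 1
            ntw[t, w] += 1
    # conjugate step: theta | rest and psi | rest are Dirichlet, also parallel
    theta = np.array([rng.dirichlet(row) for row in ndt + alpha])
    psi = np.array([rng.dirichlet(row) for row in ntw + beta])
    return theta, psi
```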
20-21 V5 - Collapsed Sampling (Griffiths & Steyvers 2005). Integrate out the latent parameters $\theta$ and $\psi$, leaving $p(X, Y \mid \alpha, \beta)$, and sample one topic assignment $y_{ij} \mid X, Y^{-ij}$ at a time from
$$p(y_{ij} = t \mid X, Y^{-ij}) \propto \frac{n^{-ij}(t, d) + \alpha_t}{n^{-ij}(d) + \sum_t \alpha_t} \cdot \frac{n^{-ij}(t, w) + \beta_w}{n^{-ij}(t) + \sum_w \beta_w}$$
Fast mixing; easy to implement; memory efficient. Parallelization infeasible (variables lock each other).
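A single-machine sketch of this sampler (assuming scalar symmetric $\alpha$ and $\beta$; the constant per-document denominator is dropped since it cancels in the normalization):

```python
import numpy as np

def collapsed_gibbs(docs, T, W, alpha, beta, sweeps=100, seed=0):
    """docs: list of integer word-id arrays; alpha, beta: scalar hyperparameters."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(T, size=len(ws)) for ws in docs]   # initial assignments
    ndt = np.zeros((len(docs), T))                        # n(t, d)
    ntw = np.zeros((T, W))                                # n(t, w)
    nt = np.zeros(T)                                      # n(t)
    for d, ws in enumerate(docs):
        for i, w in enumerate(ws):
            t = z[d][i]
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    for _ in range(sweeps):
        for d, ws in enumerate(docs):
            for i, w in enumerate(ws):
                t = z[d][i]                               # remove the current token
                ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
                # Griffiths-Steyvers conditional; the per-document denominator
                # is constant in t and cancels on normalization.
                p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + W * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t                               # add it back with the new topic
                ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    return z, ndt, ntw
```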
22 V6 - Approximating the Distribution (Asuncion, Smyth, Welling, ... UCI; Mimno, McCallum, ... UMass). Run a collapsed sampler per machine, drawing from the same conditional
$$p(y_{ij} = t \mid X, Y^{-ij}) \propto \frac{n^{-ij}(t, d) + \alpha_t}{n^{-ij}(d) + \sum_t \alpha_t} \cdot \frac{n^{-ij}(t, w) + \beta_w}{n^{-ij}(t) + \sum_w \beta_w}$$
and defer synchronization between machines: no problem for $n(t)$, a big problem for $n(t, w)$. Easy to implement; can be memory efficient; easy parallelization. Mixes slowly / worse likelihood.
23 V7 - Better Approximations of the Distribution (Smola and Narayanamurthy, 2009; Ahmed, Gonzalez, et al., 2012). Same collapsed conditional, but make local copies of the global state: implicit for multicore (delayed updates from samplers), explicit copies for multi-machine. Not a hierarchical model (Welling, Asuncion, et al. 2008). Memory efficient (each sampler only needs to view its own sufficient statistics); works multicore and multi-machine; convergence speed depends on synchronizer quality.
24 V8 - Sequential Monte Carlo (Canini, Shi, Griffiths, 2009; Ahmed et al., 2011). Integrate out the latent $\theta$ and $\psi$, leaving $p(X, Y \mid \alpha, \beta)$, and chain the conditional probabilities:
$$p(X, Y \mid \alpha, \beta) = \prod_{i=1}^{m} p(x_i, y_i \mid x_1, y_1, \ldots, x_{i-1}, y_{i-1}, \alpha, \beta)$$
For each particle sample $y_i \sim p(y_i \mid x_i, x_1, y_1, \ldots, x_{i-1}, y_{i-1}, \alpha, \beta)$, reweight the particle by the next-step data likelihood $p(x_{i+1} \mid x_1, y_1, \ldots, x_i, y_i, \alpha, \beta)$, and resample particles if the weight distribution is too uneven.
25 V8 - Sequential Monte Carlo (continued). One pass through the data; data-sequential parallelization is an open problem. Nontrivial to implement: the sampler itself is easy, but the inheritance tree through particles is messy, and one needs to estimate the data likelihood (integration over $y$), e.g. as part of the sampler. This is a multiplicative update algorithm with log loss.
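A schematic of the SMC recursion, with the model-specific conditionals left as user-supplied stubs (`propose` and `next_data_loglik` are placeholders, not functions from the papers):

```python
import numpy as np

def particle_filter(xs, n_particles, propose, next_data_loglik, seed=0):
    """Generic SMC skeleton: propose y_i per particle, reweight, resample."""
    rng = np.random.default_rng(seed)
    particles = [[] for _ in range(n_particles)]   # each holds y_1 .. y_i
    logw = np.zeros(n_particles)
    for i, x in enumerate(xs):
        for p in range(n_particles):
            particles[p].append(propose(x, particles[p], rng))
            if i + 1 < len(xs):                    # reweight by next-step likelihood
                logw[p] += next_data_loglik(xs[i + 1], particles[p])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        if 1.0 / (w ** 2).sum() < 0.5 * n_particles:   # effective sample size too low
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles = [list(particles[j]) for j in idx]  # naive copy; the real
            logw[:] = 0.0                          # implementation shares histories
    return particles, logw
```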
26 Summary of the variants. Uncollapsed: optimization overfits and is too costly; sampling mixes slowly, but children are conditionally independent. Variational approximation: optimization parallelizes easily but has a big memory footprint; sampling n.a. Collapsed natural parameters: optimization overfits and is too costly; sampling mixes fast but is difficult to parallelize (approximate inference by delayed updates). Collapsed topic assignments: easy to optimize but big memory footprint and difficult parallelization; sampling is difficult (particle filtering, sequential).
27 Parallel Inference
28-30 3 Problems (diagram: data; local state per instance, e.g. the cluster IDs; global state, e.g. mean, variance, and cluster weights). The data is too big for a single machine; the global state is huge; the local state is needed only locally.
31-32 3 Problems (diagrams contrasting vanilla LDA and user profiling; each splits into data, local state, and global state).
33-36 3 Problems, with solutions. The local state is too large and does not fit into memory: stream local data from disk. The global state is too large, causing network load and barriers, and does not fit into memory: synchronize asynchronously and keep only a partial view.
37-38 Distribution (diagram: global state with replicas per cluster and per rack).
39 Synchronization. Start with common state; the child updates its local state and stores both old and new state, while the parent keeps the global state. Transmit differences asynchronously; this needs an inverse element for the difference and an abelian group for commutativity (sum, log-sum, cyclic group, exponential families). Local to global: $x_{global} \leftarrow x_{global} + (x - x_{old})$, then $x_{old} \leftarrow x$. Global to local: $x \leftarrow x + (x_{global} - x_{old})$, then $x_{old} \leftarrow x_{global}$.
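The two update rules transcribe directly into code. In a real system the deltas travel over the network, but the arithmetic is just this (a sketch assuming state that supports abelian + and -, e.g. numpy arrays):

```python
import numpy as np

class SyncedState:
    """Local replica exchanging additive deltas with a global copy.

    Works for any abelian group with an inverse; numpy arrays under + here."""
    def __init__(self, x):
        self.x = x.copy()          # working state, mutated by the local sampler
        self.x_old = x.copy()      # state as of the last synchronization

    def local_to_global(self, x_global):
        x_global += self.x - self.x_old     # push only the local difference
        self.x_old = self.x.copy()
        return x_global

    def global_to_local(self, x_global):
        self.x += x_global - self.x_old     # fold in everyone else's updates
        self.x_old = x_global.copy()
        return self.x
```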
40 Synchronization. Naive approach (dumb master): the global store is only a (key, value) storage, so a local node needs to lock/read/write/unlock the master, requiring 4 TCP/IP roundtrips per update: latency bound. Better solution (smart master): the client sends a message to the master, it is queued, and the master incorporates it; the master sends a message to the client, it is queued, and the client incorporates it: bandwidth bound (>10x speedup in practice). The local-to-global and global-to-local updates are as on the previous slide.
41 Distribution. A dedicated server for the variables has insufficient bandwidth (hotspots) and insufficient memory. Instead, select the server per key, e.g. via consistent hashing: $m(x) = \arg\min_{m \in M} h(x, m)$.
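One way to realize $m(x) = \arg\min_{m \in M} h(x, m)$ is rendezvous-style hashing over (key, machine) pairs, sketched below; the hash choice is illustrative:

```python
import hashlib

def assign_server(key, machines):
    """m(x) = argmin over machines of h(x, m); every client computes the same owner."""
    def h(k, m):
        return hashlib.md5(f"{k}:{m}".encode()).hexdigest()
    return min(machines, key=lambda m: h(key, m))

machines = ["server-0", "server-1", "server-2"]
print(assign_server("topic_counts/dog", machines))
```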
42-44 Distribution & fault tolerance. Storage is O(1/k) per machine; communication is O(1) per machine; fast snapshots cost O(1/k) per machine (stop the synchronization and dump the state per vertex); O(k) open connections per machine; O(1/k) throughput per machine. Keys are placed by $m(x) = \arg\min_{m \in M} h(x, m)$.
45-48 Synchronization. The data rate between machines is O(1/k), and machines operate asynchronously (barrier free). Solution: schedule message pairs and communicate with r random machines simultaneously (r=1: … < eff. < 0.89).
49 Synchronization. Data rate between machines is O(1/k); machines operate asynchronously (barrier free). Solution: schedule message pairs; communicate with r random machines simultaneously; use a Luby-Rackoff PRPG for load balancing. Efficiency guarantee: 4 simultaneous connections are sufficient.
50 Scalability
51 Samplers
52 Sampling. Brute-force sampling over a large number of items is expensive; ideally we want the work to scale with the entropy of the distribution over labels. The sparsity of the distribution is typically only known after seeing the instances. Approach: decompose the (dense) probability into a dense invariant term and sparse variable terms, and use a fast proposal distribution with rejection sampling.
53 Exploiting Sparsity: decomposition (Mimno & McCallum, 2009). Split the conditional into a dense-but-constant term and sparse terms, with $\bar\beta = \sum_w \beta_w$:
$$p(t \mid w_{ij}) \propto \frac{\alpha_t \beta_w}{n(t) + \bar\beta} + \frac{\beta_w\, n(t, d = i)}{n(t) + \bar\beta} + \frac{n(t, w = w_{ij})\,[n(t, d = i) + \alpha_t]}{n(t) + \bar\beta}$$
The first term is dense but constant; the others are sparse, so only the sparse terms need updating per word. Does not work for clustering (too many factors).
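A sketch of the three-bucket sampler this decomposition enables (the incremental bookkeeping that caches the dense bucket across tokens, which is where the real speed-up comes from, is omitted; all names are assumptions):

```python
import numpy as np

def sparse_topic_sample(w, doc_counts, word_counts, nt, alpha, beta, rng):
    """Sample a topic for word w from the s + r + q bucket decomposition.

    doc_counts:  dict t -> n(t, d), sparse per document
    word_counts: dict t -> n(t, w), sparse per word
    nt: dense n(t); alpha: per-topic prior; beta: per-word prior (arrays)
    """
    denom = nt + beta.sum()                          # n(t) + beta_bar, dense
    s_terms = alpha * beta[w] / denom                # smoothing bucket: dense, constant
    s = s_terms.sum()                                # cached across tokens in SparseLDA
    r_terms = {t: beta[w] * c / denom[t] for t, c in doc_counts.items()}
    q_terms = {t: c * (doc_counts.get(t, 0) + alpha[t]) / denom[t]
               for t, c in word_counts.items()}
    r, q = sum(r_terms.values()), sum(q_terms.values())
    u = rng.uniform(0, s + r + q)
    if u < s:                                        # rare: scan the dense bucket
        return rng.choice(len(nt), p=s_terms / s)
    u -= s
    bucket = r_terms if u < r else q_terms
    if u >= r:
        u -= r
    for t, mass in bucket.items():                   # walk the short sparse bucket
        if u < mass:
            return t
        u -= mass
    return next(iter(bucket))                        # guard against float roundoff
```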
54 Exploiting Sparsity: Context LDA (Petterson et al., 2009). The smoothers are word- and topic-dependent:
$$p(t \mid w_{ij}) \propto \frac{\beta(w, t)\, \alpha_t}{n(t) + \bar\beta(t)} + \frac{\beta(w, t)\, n(t, d = i)}{n(t) + \bar\beta(t)} + \frac{n(t, w = w_{ij})\,[n(t, d = i) + \alpha_t]}{n(t) + \bar\beta(t)}$$
The first term is now topic dependent and dense, so the simple sparse factorization doesn't work. Use Cauchy-Schwarz to upper-bound it:
$$\sum_t \frac{\beta(w, t)\, \alpha_t}{n(t) + \bar\beta(t)} \le \|\beta(w, \cdot)\| \cdot \left\| \frac{\alpha_\cdot}{n(\cdot) + \bar\beta(\cdot)} \right\|$$
55 Collapsed vs Variational. Memory requirements (1k topics, 2M words): variational inference needs 8GB RAM (no sparsity), the collapsed sampler 1.5GB RAM (rare words); burn-in and exploiting sparsity save a lot. (Plot: sampling speed-up over the standard sampler vs. iteration at 1000 topics, for unif, doc, and doc,word variants; labels: Cauchy-Schwarz bound, multilingual LDA, word context, smoothing over time.)
56 Fast Proposal. In reality, sparsity often does not hold for the true distribution, so guess a sparse proxy and use it as the proposal; in the storylines model these are the entities.
57 Variations on a theme: inference for mixtures. Parallel inference: parallelization templates. Samplers: scaling up LDA.