Tree-based Cluster Weighted Modeling: Towards A Massively Parallel Real-Time Digital Stradivarius

Edward S. Boyden III
e@media.mit.edu
Physics and Media Group, MIT Media Lab
20 Ames St., Cambridge, MA 02139 USA

Abstract

Cluster-weighted modeling (CWM) is a versatile inference algorithm for deriving a functional relationship between input data and output data by using a mixture of expert clusters. Each cluster is localized to a Gaussian input region and possesses its own trainable local model. The CWM algorithm uses expectation-maximization (EM) to find the optimal locations of clusters in the input space and to solve for the parameters of the local model. However, the CWM algorithm requires interactions between all the data and all the clusters. For a violin, whose training might easily require over a billion data points and a hundred thousand clusters, such an implementation is clearly undesirable. We use a variant of CWM that makes pseudosoft splits in the data to model the time-series relationship between time-lagged values of human input data and digital audio output data. We describe how this tree implementation would lend itself to a multiprocessor parallelization of the CWM algorithm and examine the expected reduction in time and space requirements. We also consider how this method could be used to perform intelligent training of the violin.

1. Introduction

Among all great mechanical creations, classical instruments alone remain unsurpassed by modern technology; the great violins of Stradivarius and Guarneri have not seen their popularity fade with the years. At the MIT Media Lab, hyperinstruments, augmented instruments, and instruments with unusual interfaces have been developed with abandon, but for the past two years the Physics and Media group, and in particular Neil Gershenfeld, Bernd Schoner, and Eric Metois, have tried to recreate a digital violin with traditional inputs and realistic sound. The theoretical basis for this agenda of reconstruction is solid, and we start with a fact from the theory of nonlinear dynamics. In 1981 Floris Takens proved that for a smoothly flowing classical dynamical system, one can embed its d-dimensional phase-space manifold into a (2d+1)-dimensional real space consisting of time-lagged values of a system observable y [1]. Since y is a function of the state, this implies that it is possible to predict the value of the observable y from 2d+1 time-lagged values of itself.

For systems with inputs, it suffices to take 2d+1 lags of each input as well [2]. Therefore it is, in theory, possible to find a mapping between the time-lagged inputs and outputs of a system and future values of the output of the system. Solving such inference problems has been a very active area of research for the past two decades. Among Bayesian learning techniques, which try to maximize the probability of a model given observed data, one method has been suggested as particularly suitable for the violin problem. If we consider the problem of a one-string violin, we might naively attempt to map the instantaneous values of the finger position and the bow velocity to a sample of an output audio waveform. Takens' theorem suggests that we should actually take 2d+1 values of the finger position, bow velocity, and waveform amplitude. This representation is fraught with difficulty, including the fact that gigaflops of computing power would be required, but two facts limit the actual dimension that we have to emulate: 1) there is limited information contained in the digital waveform that represents the output sound (16-bit, 22 kHz at the current moment), and 2) we only have to recreate the signal up to the perceptual limits of the human ear. Much personal experimentation over the past month has demonstrated that a mere three lags in bow velocity, two lags in finger position, and no lags in the output audio suffice to recreate the in-sample sound of the violin. Two years of experimentation have also suggested an optimal way to represent a chunk of sound [3].

Our preprocessing program chops up the training data into 256/22050-second (approximately 1/86-second) frames, and the program averages the finger position and velocity data over each of these intervals. As for the audio, we extract the frequencies and amplitudes of the 25 most important harmonics, so that the output space is 50-dimensional [3]. Thus these 1/86-second frames represent the input and output data at a discrete instant, and they are the data that we use to train our network. When we want to synthesize new sound from new input data, we simply take time-lagged sets of bow velocity and finger position data, find the 50-dimensional output corresponding to these inputs, and synthesize a 1/86-second frame of sound from these harmonics. The relevant signal processing was developed by [4].
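To make the frame and lag construction concrete, the following is a minimal Python/NumPy sketch (illustrative only; the group's actual preprocessing is the C and Matlab code described in [4]). The function and array names are hypothetical, and the convention that "three lags" means three consecutive frame values ending at the current frame is an assumption of this sketch.

    import numpy as np

    def frame_average(signal, frame_len=256):
        # Average a control signal over consecutive 256-sample frames
        # (256/22050 s, i.e. roughly 1/86 s per frame at 22050 Hz).
        n_frames = len(signal) // frame_len
        return signal[:n_frames * frame_len].reshape(n_frames, frame_len).mean(axis=1)

    def lagged_inputs(bow_vel_frames, finger_pos_frames, bow_lags=3, finger_lags=2):
        # Stack time-lagged frame values of the two control inputs into one
        # input vector per frame, following the embedding argument above.
        max_lag = max(bow_lags, finger_lags)
        rows = []
        for t in range(max_lag - 1, len(bow_vel_frames)):
            rows.append(np.concatenate([
                bow_vel_frames[t - bow_lags + 1 : t + 1],        # 3 bow-velocity frames
                finger_pos_frames[t - finger_lags + 1 : t + 1],  # 2 finger-position frames
            ]))
        return np.asarray(rows)  # shape: (usable frames, bow_lags + finger_lags)

The training target paired with each row would then be the 50-dimensional vector of harmonic frequencies and amplitudes extracted from the corresponding audio frame.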

The method of machine learning used to solve the inference problem is cluster-weighted modeling, which is described in the next section.

2. Cluster-weighted modeling

The cluster-weighted modeling (CWM) algorithm was developed by Gershenfeld, Schoner, and Metois [5]. We wish to reconstruct the probability density p(y, x) from observations of y and x; we assume that we can expand the distribution over M clusters as follows:

p(y, x) = \sum_{m=1}^{M} p(y \mid x, c_m) \, p(x \mid c_m) \, p(c_m)    (1)

We assume that the clusters have the following Gaussian probabilities:

p(x \mid c_m) = \frac{1}{(2\pi)^{d/2} \, |C_m|^{1/2}} \, e^{-(x - \mu_m)^{T} C_m^{-1} (x - \mu_m)/2}    (2)

p(y \mid x, c_m) = \frac{1}{(2\pi)^{d_y/2} \, |C_{m,y}|^{1/2}} \, e^{-(f(x, b_m) - y)^{T} C_{m,y}^{-1} (f(x, b_m) - y)/2}    (3)

where b_m is a set of parameters such that f(x, b_m) is linear in the parameters. One can show that (2) is equivalent to the assumption of softmax gating functions in the mixture-of-experts network described in [6]. We can now apply expectation-maximization to (1) by treating the cluster memberships as hidden variables. This gives the following equations for the cluster means and variances, equivalent to the analogous EM derivation of Gaussian mixture parameters in Bishop [7], although Gershenfeld derives them in a completely original way, using Monte Carlo integration, in [5]:

\mu_m = \frac{\sum_n x_n \, p(c_m \mid x_n, y_n)}{\sum_n p(c_m \mid x_n, y_n)}    (4)

\sigma_m^2 = \frac{\sum_n (x_n - \mu_m)^2 \, p(c_m \mid x_n, y_n)}{\sum_n p(c_m \mid x_n, y_n)}    (5)

We derive the parameters b_m for each cluster m using maximum likelihood:

0 = \frac{\partial}{\partial b_m} \log p(\{y_n, x_n\}) \;\propto\; \sum_n p(c_m \mid x_n, y_n) \, [y_n - f(x_n, b_m)] \, \frac{\partial f(x_n, b_m)}{\partial b_m}    (6)

which gives us, for a function f that is linear in the parameters b_m, i.e.,

f(x, b_m) = \sum_i b_{m,i} \, f_i(x)    (7)

the following solution:

0 = \sum_n p(c_m \mid x_n, y_n) \left[ y_n \, f_i(x_n) - \sum_j b_{m,j} \, f_j(x_n) \, f_i(x_n) \right] \;\equiv\; a_i - \sum_j B_{ij} \, b_{m,j}

b_m = B^{-1} a    (8)

which is just the pseudoinverse solution to the normal equations.
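As an illustration of equations (1)-(5) and (8) (not the group's C implementation), here is a rough NumPy sketch of one EM iteration for a cluster-weighted model with scalar input and output and local linear models f(x, b_m) = b_{m,0} + b_{m,1} x. All names are hypothetical, and the real system uses vector inputs, a 50-dimensional output, and full covariances.

    import numpy as np

    def cwm_em_step(x, y, mu, var, var_y, beta, prior):
        # One EM iteration of a simplified 1-D cluster-weighted model.
        #   x, y    : (N,) input and output samples
        #   mu, var : (M,) cluster means and variances in input space
        #   var_y   : (M,) output variances of the local models
        #   beta    : (M, 2) coefficients of the local models y ~ b0 + b1*x
        #   prior   : (M,) cluster weights p(c_m)
        N, M = len(x), len(mu)
        F = np.column_stack([np.ones(N), x])          # basis functions f_i(x) = [1, x]

        # E-step: responsibilities p(c_m | x_n, y_n) from Eqs. (1)-(3)
        resp = np.empty((N, M))
        for m in range(M):
            f = F @ beta[m]
            px = np.exp(-(x - mu[m])**2 / (2 * var[m])) / np.sqrt(2 * np.pi * var[m])
            py = np.exp(-(y - f)**2 / (2 * var_y[m])) / np.sqrt(2 * np.pi * var_y[m])
            resp[:, m] = prior[m] * px * py
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: Eqs. (4)-(5) and the weighted normal equations b_m = B^{-1} a of Eq. (8)
        for m in range(M):
            w = resp[:, m]
            prior[m] = w.mean()
            mu[m] = (w * x).sum() / w.sum()
            var[m] = (w * (x - mu[m])**2).sum() / w.sum()
            B = F.T @ (w[:, None] * F)                # B_ij = sum_n w_n f_i(x_n) f_j(x_n)
            a = F.T @ (w * y)                         # a_i  = sum_n w_n y_n f_i(x_n)
            beta[m] = np.linalg.solve(B, a)
            var_y[m] = (w * (y - F @ beta[m])**2).sum() / w.sum()
        return mu, var, var_y, beta, prior

Iterating cwm_em_step until the likelihood stops improving gives the flat CWM fit that the tree algorithm of the next section decomposes.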

4. Tree-based algorithm

We devised the following tree-based algorithm to reduce the interactions between the various points and clusters, after Schoner [8]. Let n be the number of clusters that each cluster is subdivided into, d the fraction of duplicated points in the pseudosoft step, L the number of levels of the tree, N the total number of data points, and M the total number of clusters on the lowest level.

1) Run cluster-weighted modeling with n clusters on the current data set.
2) Assign each data point to the cluster for which the likelihood p(x | c_m) is greatest.
3) Choose the d% next highest likelihoods and assign these points to their corresponding clusters, so that the total number of points on any level is N(1+d). This effects a pseudosoft sharing of data between neighboring clusters; one point may be assigned to more than two clusters. (A code sketch of this split appears at the end of this section.)
4) With these assignments in mind, reiterate, running CWM on each of the smaller data sets and dividing the training data as appropriate.
5) On the last level, run ordinary CWM, so that each small set of data is effectively clustered with n clusters.

The reconstruction step is analogous:

1) Given a data point x, evaluate the likelihoods p(x | c_m) for each cluster in the next lower level.
2) Choose the most likely cluster, and repeat step 1 until you reach the last level.
3) On the last level, do ordinary CWM (i.e., evaluate the weighted sum of local models f(x, b_m)) and return the answer.

Thus instead of one CWM operation with M clusters and N data points, one performs on the order of M/2 CWM operations, each with n clusters and a fraction of the number of data points. According to [2], this gives a speedup of roughly

\frac{M \, \log n}{n \, (1+d) \, \log M}

which can easily be over a thousand for the value of M we gave in the introduction, and reasonable values for n (say, 5) and d (about 0.1).
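A minimal sketch of the pseudosoft split in steps 2 and 3, assuming the per-cluster likelihoods p(x_i | c_m) have already been evaluated into an (N x n) array; interpreting "the d% next highest likelihoods" as the d*N best remaining (point, cluster) pairs is an assumption of this sketch, and the function name is hypothetical.

    import numpy as np

    def pseudosoft_split(lik, d=0.1):
        # lik : (N, n) float array of likelihoods p(x_i | c_m) for N points, n children.
        # Returns one index array per child cluster.
        N, n = lik.shape
        best = lik.argmax(axis=1)                       # step 2: hard assignment
        assign = [set(np.where(best == m)[0]) for m in range(n)]

        # step 3: duplicate the d*N (point, cluster) pairs with the next-highest
        # likelihoods, so roughly N*(1+d) points are carried to the next level.
        runner = lik.copy()
        runner[np.arange(N), best] = -np.inf            # exclude each point's winning cluster
        order = np.argsort(runner, axis=None)[::-1]     # remaining pairs, best first
        for idx in order[: int(d * N)]:
            i, m = np.unravel_index(idx, lik.shape)
            assign[m].add(i)                            # a point may join several clusters
        return [np.fromiter(s, dtype=int) for s in assign]

Each returned index set then becomes the training set for one child's subtree, and the recursion continues until the last level.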

5. Implications

On a sample scale fragment, we found rough speedups of 50%-400% for choices of M in the range 5-5, with almost no decline in perceptual sound quality as compared to the ordinary cluster-weighted modeling method. Note that these values of M are much smaller than the actual expected values, so the eventual relative speedup will probably be much greater. Also, the tree driver code was written entirely in Matlab, while the CWM code itself was written in C; the real speedup for an algorithm written completely in C could be significantly greater. We were also able to use sufficient numbers of clusters that synthesis of full A-major scales became possible (before this algorithm was developed, only single notes could be synthesized efficiently). Complex signal analysis and synthesis is now accomplishable in a matter of minutes, as opposed to a matter of hours. Clearly this will be important as work on a realistic violin continues.

Furthermore, this tree model allows the CWM code to be parallelized. The implications for parallelization are obvious: once a pseudosoft split of a batch of input data has been effected, the analysis trees are separate from there onward. Thus different processors can process the different data subsets in parallel. In the reconstruction stage, the likelihoods of each data point with respect to the different clusters on a level can be evaluated on separate processors (a rough sketch of this dispatch appears at the end of this section). We have acquired a 6-processor SGI and plan to experiment with MPI acceleration of the routines.

Also, there remains the possibility of more intelligent training of the violin. What if the initial split were done with respect to finger position alone, and subsequent fine-tuning divisions were done with respect to all the data? This coarse split would effectively act as an additional set of priors based on finger position, and may improve the accuracy of the violin training. The effectiveness of this method has yet to be explored, but it looks promising.

In conclusion, this algorithm is a useful extension to current cluster-weighted-modeling and mixture-of-experts algorithms when applied to datasets involving vast quantities of data and many different regions of operation.
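Since the subtrees below a fixed pseudosoft split share no state, the dispatch can be sketched with ordinary process-level parallelism. The hypothetical Python fragment below only illustrates that structure; the planned implementation uses MPI on the SGI, and fit_subtree is a stub standing in for a recursive CWM fit.

    from multiprocessing import Pool

    def fit_subtree(task):
        # Stub standing in for one independent branch of the analysis tree:
        # run CWM on this data subset and recurse until the last level.
        subset_x, subset_y, level = task
        return {"n_points": len(subset_x), "level": level}

    def train_level_in_parallel(tasks, n_workers=6):
        # After a pseudosoft split, each (subset_x, subset_y, level) task is
        # independent, so it can be handled by its own processor.
        with Pool(n_workers) as pool:
            return pool.map(fit_subtree, tasks)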

References and Notes

1. Takens, F., Detecting Strange Attractors in Turbulence, Lecture Notes in Mathematics, 366-381, Springer, 1981.
2. Schoner, B., State Reconstruction for Determining Predictability in Driven Nonlinear Acoustical Systems, Diploma Thesis, MIT Media Lab, Massachusetts Institute of Technology, May 1996.
3. Schoner, B. (private communication).
4. Bernd Schoner (Physics and Media, MIT Media Lab) wrote all of the signal processing code for creating frames, Kalman-filtering the bow velocity and finger position data, and synthesizing sound. We will start organizing these elements into coherent libraries soon.
5. Gershenfeld, N., Schoner, B., and Metois, E., Cluster-Weighted Modeling for Time Series Prediction and Characterization, preprint, 1997.
6. Jordan, M. I., and Jacobs, R. A., Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, 6, 181-214, 1994.
7. Bishop, C. M., Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
8. An outline of this algorithm is given in Schoner's thesis [4], for a different situation (at that point in the history of the violin, the output representation was still a pure waveform, not a reduced 50-dimensional representation, so the need for vast cluster-crunching algorithms was even more pressing!).

Thanks to Neil Gershenfeld for the original inspiration behind the project, for developing the CWM algorithm, and for writing the original implementation; to Bernd Schoner for lots of signal-processing code and help in understanding it, and for many interesting discussions; to Chris Douglas for helping assemble a coherent infrastructure for the violin software; and to Charles Cooper for many interesting discussions of the relevant music physics. (Digital Stradivarius Team 1997.)

Copyright 1997 Physics and Media, MIT Media Lab. December 1997.