Introducing CURRENNT: The Munich Open-Source CUDA RecurREnt Neural Network Toolkit


Journal of Machine Learning Research 16 (2015) 547-551. Submitted 7/13; Published 3/15. © 2015 Felix Weninger, Johannes Bergmann and Björn Schuller.

Felix Weninger (weninger@tum.de)
Johannes Bergmann (privat@johannes-bergmann.de)
Björn Schuller (schuller@tum.de)
Machine Learning & Signal Processing, Technische Universität München, 80290 Munich, Germany
(B. Schuller is also with the Department of Computing, Imperial College London, UK.)

Editor: Mikio Braun

Abstract

In this article, we introduce CURRENNT, an open-source parallel implementation of deep recurrent neural networks (RNNs) supporting graphics processing units (GPUs) through NVIDIA's Compute Unified Device Architecture (CUDA). CURRENNT supports uni- and bidirectional RNNs with Long Short-Term Memory (LSTM) memory cells, which overcome the vanishing gradient problem. To our knowledge, CURRENNT is the first publicly available parallel implementation of deep LSTM-RNNs. Benchmarks are given on a noisy speech recognition task from the 2013 2nd CHiME Speech Separation and Recognition Challenge, where LSTM-RNNs have been shown to deliver best performance. As a result, double-digit speedups in bidirectional LSTM training are achieved with respect to a reference single-threaded CPU implementation. CURRENNT is available under the GNU General Public License from http://sourceforge.net/p/currennt.

Keywords: parallel computing, deep neural networks, recurrent neural networks, Long Short-Term Memory

1. Introduction

Recurrent neural networks (RNNs) are known as powerful sequence learners. In particular, the Long Short-Term Memory (LSTM) architecture has been proven to provide excellent modeling of language (Sundermeyer et al., 2012), music (Eck and Schmidhuber, 2002), speech (Graves et al., 2013), and facial expressions (Wöllmer et al., 2012). LSTM units overcome the vanishing gradient problem of traditional RNNs by the introduction of a memory cell which can be controlled by input, output and reset operations (Gers et al., 2000). In particular, recent research demonstrates that deep LSTM-RNNs exhibit superior performance in speech recognition in comparison to state-of-the-art deep feedforward networks (Graves et al., 2013). However, in contrast to the widespread usage of the latter (Hinton et al., 2012), RNNs are still not adopted by the research community at large. One of the major barriers is the lack of high-performance implementations for training RNNs; at the same time, such implementations are non-trivial due to the limited parallelism caused by time dependencies. To the best of our knowledge, there is no publicly available software dedicated to parallel LSTM-RNN training. Thus, we introduce our CUda RecurREnt Neural Network Toolkit (CURRENNT), which exploits a mini-batch learning scheme performing parallel weight updates from multiple sequences.

CURRENNT implements learning from big data sets that do not fit into memory by means of a random access data format. GPUs are supported through NVIDIA's Compute Unified Device Architecture (CUDA). The RNN structure implemented in CURRENNT is based on LSTM units; in addition, feedforward network training is supported. Besides simple regression, CURRENNT also includes logistic and softmax output layers for training of binary and multi-way classification.

To briefly refer to some related studies and freely available implementations: A reference CPU implementation of LSTM-RNNs as used by Graves (2008) is available as open-source C++ code (Graves, 2013). A Python library for many machine learning algorithms including LSTM-RNNs has been introduced by Schaul et al. (2010); however, it does not directly support parallel processing. Multi-core training of (standard) RNNs has been investigated by Cernanský (2009), but the source code is not available. Pascanu et al. have recently released a Python implementation ("GroundHog") of various RNN types described in their study (Pascanu et al., 2014), exploiting GPU-accelerated training through Theano; yet, it does not provide LSTM-RNNs, and at the moment there is no user-friendly interface.

[Figure 1 (class diagram): CURRENNT's C++ classes for deep feedforward and LSTM-RNN modeling, including NeuralNetwork, Layer, Optimizer, SteepestDescentOptimizer, DataSet, DataSetFraction, InputLayer, TrainableLayer, PostOutputLayer, LstmLayer, FeedForwardLayer, SsePostOutputLayer, BinaryClassificationLayer, MulticlassClassificationLayer and SoftmaxLayer.]

2. Design

CURRENNT provides a C++ class library for deep LSTM-RNN modeling (cf. Figure 1) and a command line application for network training and evaluation. The network architecture can be specified by the user in JavaScript Object Notation (JSON), and trained parameters are saved in the same format, allowing, e.g., for deep learning with pre-training. For efficiency reasons, features for training and evaluation are given in binary format, adhering to the NetCDF standard, but network outputs can also be saved in CSV format to facilitate post-processing. All C++ code is designed to be platform independent and has been tested on Windows and various Linux distributions. The required CUDA compute capability is 1.3 (2008), allowing usage on virtually all of the consumer-grade NVIDIA GPUs deployed in today's desktop PCs.

The behavior of the gradient descent training algorithm is controlled by various switches of the command line application, allowing, e.g., for on-line or batch learning and for fine-tuning of the training algorithm, such as adding Gaussian noise to the input activations and randomly shuffling training data in on-line learning to improve generalization (Graves et al., 2013). The interested reader is referred to the documentation for more details.
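To make the JSON-based architecture description more concrete, the following is a minimal, purely illustrative sketch of how a deep bidirectional LSTM network might be written out to a JSON file. The key names ("layers", "type", "size") and the file name are assumptions introduced for this example only, not CURRENNT's documented schema; the toolkit documentation defines the actual format.

# Hypothetical example only: the keys and file name below are illustrative
# assumptions, not CURRENNT's actual JSON schema (see the toolkit docs).
import json

network = {
    "layers": [
        {"type": "input",   "size": 39},   # e.g., 39 acoustic features per frame
        {"type": "blstm",   "size": 128},  # bidirectional LSTM hidden layer
        {"type": "blstm",   "size": 128},
        {"type": "softmax", "size": 51},   # multi-way classification output
    ]
}

with open("network.json", "w") as f:
    json.dump(network, f, indent=2)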

3. Implementation

Deep LSTM-RNNs with $N$ layers are implemented as follows. An input sequence $x_t$ is mapped to the output sequence $y_t$, $t = 1, \dots, T$, through the iteration (forward pass):

$$h_t^{(0)} := x_t, \qquad h_t^{(n)} := L_t^{(n)}\!\left(h_t^{(n-1)}, h_{t-1}^{(n)}\right), \qquad y_t := S\!\left(W^{(N),(N+1)} h_t^{(N)} + b^{(N+1)}\right).$$

In the above and the ongoing, $W$ denotes weight matrices and $b$ stands for bias vectors (with superscripts denoting layer indices). $h_t^{(n)}$ denotes the hidden feature representation of time frame $t$ in the level-$n$ units, $n = 1, \dots, N$. The 0-th layer is the input layer and the $(N+1)$-th layer the output layer. $S$ is the (vector-valued) output layer function, e.g., a softmax function for multi-way classification (cf. Figure 1). $L_t^{(n)}$ denotes the composite LSTM activation function which is used instead of the common simple sigmoid-shaped functions. The crucial point is to augment each unit with a state variable $c_t$, resulting in an automaton-like structure. The hidden layer activations correspond to the state variables ("memory cells") scaled by the activations of the output gates $o_t^{(n)}$:

$$h_t^{(n)} = o_t^{(n)} \odot \tanh\!\left(c_t^{(n)}\right), \qquad c_t^{(n)} = f_t^{(n)} \odot c_{t-1}^{(n)} + i_t^{(n)} \odot \tanh\!\left(W^{(n-1),(n)} h_t^{(n-1)} + W^{(n),(n)} h_{t-1}^{(n)} + b_c^{(n)}\right), \quad (1)$$

where $\odot$ denotes element-wise multiplication and $\tanh$ is also applied element-wise. Thus, the state is scaled by a forget gate (Gers et al., 2000) with dynamic activation $f_t^{(n)}$ instead of a recurrent connection with static weight. $i_t^{(n)}$ is the activation of the input gate that regulates the influx from the feedforward and recurrent connections. The activations of the input, output and forget gates are calculated in a similar fashion as $c_t^{(n)}$ (Graves et al., 2013).

From the dependencies between layers ($n-1 \to n$) and time steps ($t-1 \to t$) in the above, it is obvious that parallel computation of feedforward activations cannot be performed across layers; further, parallel computation of recurrent activations is not possible across time steps. Thus, to increase the degree of parallelization, we consider data fractions (cf. Figure 1) of size $P$ out of $S$ sequences in parallel, each having exactly $T$ time steps (creating dummy time steps for shorter sequences, which are neglected in the error calculation). For instance, we consider a state matrix $C^{(n)}$ for the $n$-th layer,

$$C^{(n)} = \left[\, c_{1,p}^{(n)} \cdots c_{1,p+P-1}^{(n)} \;\;\cdots\;\; c_{T,p}^{(n)} \cdots c_{T,p+P-1}^{(n)} \,\right], \quad (2)$$

where $c_{t,p}^{(n)}$ is the state for sequence $p$ in layer $n$ at time $t$. To realize the update equation (1) we can now compute the feedforward part for all time steps and $P$ sequences in parallel, simply by pre-multiplication with $W^{(n-1),(n)}$. For the recurrent part, we can update $C^{(n)}$ from left to right using $W^{(n),(n)}$. Input, output and forget gate activations are calculated in an analogous fashion. In this process, the matrix structure (2) ensures memory locality of the data corresponding to one time step (matrices are stored in column-major order). For bidirectional layers the above matrix structure is replicated at each layer; in the backward part, the recurrent parts are updated from right to left, and activations are collected in a single vector before passing them to the subsequent layer (Graves et al., 2013).

During network training, the backward pass for the hidden layers is realized similarly, by splitting the matrix of weight changes into a part propagated to the preceding layer and a recurrent part propagated to the previous time step, resulting in a parallel implementation of the backpropagation through time (BPTT) algorithm. The weight changes are applied for all sequences (batch learning) or for each data fraction. Thus, if $1 < P < S$ we perform mini-batch learning. In this case, only $P$ sequences have to be kept in memory at once, allowing for learning from large data sets.
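To make the parallelization scheme above concrete, here is a minimal NumPy sketch of a single unidirectional LSTM layer processed over a data fraction of P sequences: the feedforward contributions for all T time steps and P sequences are obtained with one matrix product (the pre-multiplication with W^{(n-1),(n)}), while the recurrent part walks over time from left to right, mirroring the column-wise update of the state matrix in equation (2). This is an illustrative sketch only, not CURRENNT's CUDA code; all identifiers and the gate ordering are assumptions, and peephole connections used in the full LSTM formulation of Graves et al. (2013) are omitted for brevity.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer_forward(X, W_ff, W_rec, b):
    """X: (D, T, P) inputs from the layer below for P parallel sequences.
    W_ff: (4H, D) feedforward weights, W_rec: (4H, H) recurrent weights,
    b: (4H,) bias; rows ordered [input gate, forget gate, cell, output gate].
    Returns H_out: (H, T, P) hidden activations."""
    fourH, D = W_ff.shape
    H = fourH // 4
    _, T, P = X.shape

    # Feedforward part for all T time steps and P sequences at once:
    # a single matrix product, analogous to pre-multiplying with W^(n-1,n).
    Z_ff = (W_ff @ X.reshape(D, T * P) + b[:, None]).reshape(fourH, T, P)

    c = np.zeros((H, P))        # memory cell states for the P sequences
    h = np.zeros((H, P))        # hidden activations
    H_out = np.empty((H, T, P))

    # Recurrent part: advance from left to right over time, updating the
    # block of P state columns per time step, in the spirit of equation (2).
    for t in range(T):
        z = Z_ff[:, t, :] + W_rec @ h                   # (4H, P)
        i = sigmoid(z[0*H:1*H])                         # input gate
        f = sigmoid(z[1*H:2*H])                         # forget gate
        g = np.tanh(z[2*H:3*H])                         # cell candidate
        o = sigmoid(z[3*H:4*H])                         # output gate
        c = f * c + i * g                               # cell update, cf. equation (1)
        h = o * np.tanh(c)                              # state scaled by output gate
        H_out[:, t, :] = h
    return H_out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, H, T, P = 39, 64, 50, 8          # feature dim, cells, frames, parallel sequences
    X = rng.standard_normal((D, T, P))
    W_ff = 0.1 * rng.standard_normal((4 * H, D))
    W_rec = 0.1 * rng.standard_normal((4 * H, H))
    b = np.zeros(4 * H)
    print(lstm_layer_forward(X, W_ff, W_rec, b).shape)  # (64, 50, 8)

Stacking several such layers and feeding the top-layer activations through a softmax output layer reproduces the forward pass of Section 3; increasing P trades memory for parallelism exactly as in the mini-batch scheme described above.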

4. Benchmark

We conclude with a benchmark on a word recognition task with convolutive non-stationary noise from the 2013 2nd CHiME Challenge's track 1 (Vincent et al., 2013), where bidirectional LSTM decoding has been shown to deliver best performance (Geiger et al., 2013). We consider frame-wise word error rate as well as computation speedup in training with respect to the open-source C++ reference implementation by Graves (2013), running in a single CPU thread on an Intel Core2Quad PC with 4 GB of RAM. The GPU is an NVIDIA GTX 560 with 2 GB of RAM. We compare results for different values of P while fixing the other training parameters. The corresponding NetCDF, network configuration, and training parameter files are distributed with CURRENNT.

                               RNNLIB (Graves, 2013)        CURRENNT
Parallel sequences (P)                   1               1     10     50    200
Validation set error (10 ep.)         0.38            0.38   0.35   0.37   0.44
Validation set error (50 ep.)         0.20            0.19   0.16   0.18   0.19
Training time / epoch [s]            7 420           3 805    580    392    334
Speedup                              (1.0)             2.0   12.8   18.9   22.2

Table 1: Performance (error / speedup) on the CHiME 2013 noisy speech recognition task.

Results (Table 1) show that the error rate after 50 epochs is not heavily influenced by the batch size for parallel processing, while speedups of up to 22.2 can be achieved.

5. Conclusions

CURRENNT, our GPU implementation of deep LSTM-RNNs for labeling sequential data, has been shown to deliver double-digit training speedups at equal accuracy in a noisy speech recognition task. Future work will concentrate on discriminative training objectives and cost functions for transcription tasks (Graves, 2008).

Acknowledgments

This research has been supported by DFG grant SCHU 2502/4-1. The authors would like to thank Alex Graves for helpful discussions.

References

M. Cernanský. Training recurrent neural network using multistream extended Kalman filter on multicore processor and CUDA enabled graphic processor unit. In Proc. of ICANN, volume 1, pages 381-390, 2009.

D. Eck and J. Schmidhuber. Learning the long-term structure of the blues. In Proc. of ICANN, pages 284-289, 2002.

J. T. Geiger, F. Weninger, A. Hurmalainen, J. F. Gemmeke, M. Wöllmer, B. Schuller, G. Rigoll, and T. Virtanen. The TUM+TUT+KUL approach to the CHiME Challenge 2013: Multi-stream ASR exploiting BLSTM networks and sparse NMF. In Proc. 2nd CHiME Workshop, pages 25-30, Vancouver, Canada, 2013.

F. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.

A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. PhD thesis, Technische Universität München, 2008.

A. Graves. RNNLIB: A recurrent neural network library for sequence learning problems. http://sourceforge.net/projects/rnnl/, 2013.

A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proc. of ICASSP, pages 6645-6649, Vancouver, Canada, 2013.

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82-97, 2012.

R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. In Proc. of ICLR, 2014.

T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber. PyBrain. Journal of Machine Learning Research, 11:743-746, 2010.

M. Sundermeyer, R. Schlüter, and H. Ney. LSTM neural networks for language modeling. In Proc. of INTERSPEECH, Portland, OR, USA, 2012.

E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni. The second CHiME speech separation and recognition challenge: Datasets, tasks and baselines. In Proc. of ICASSP, pages 126-130, Vancouver, Canada, 2013.

M. Wöllmer, M. Kaiser, F. Eyben, F. Weninger, B. Schuller, and G. Rigoll. Fully automatic audiovisual emotion recognition: Voice, words, and the face. In T. Fingscheidt and W. Kellermann, editors, Proceedings of Speech Communication; 10. ITG Symposium, pages 1-4, Braunschweig, Germany, 2012.