Mocha.jl. Deep Learning for Julia. Chiyuan Zhang CSAIL, MIT


1 Mocha.jl Deep Learning for Julia Chiyuan Zhang CSAIL, MIT

2 JULIA BASICS 10-minute Introduction to Julia

3 A glance at basic syntax

| Numpy | Matlab | Julia | Description |
|---|---|---|---|
| x[0] | x(1) | x[1] | Index 1st elem of array |
| np.random.randn(3,3) | randn(3) | randn(3,3) | 3-by-3 random Gaussian matrix |
| np.arange(1,11) | 1:10 | 1:10 | 1,2,...,10 |
| X * Y | X .* Y | X .* Y | Elementwise mul |
| np.dot(X,Y) | X * Y | X * Y | Matrix mul |
| linalg.solve(X,Y) | X \ Y | X \ Y | Left matrix division |
| d = {'one':1, 'two':2}; d['one'] | d = containers.Map({'one','two'},{1,2}); d('one') | d = Dict("one"=>1,"two"=>2); d["one"] | Hash table |
| r = np.random.rand(*x.shape); y = x * (r > t) | r = rand(size(x)); y = x .* (r > t) | r = rand(size(x)); y = x .* (r .> t) | Dropout |
| f = lambda x, mu, sigma: np.exp(-(x-mu)**2/(2*sigma**2)) / np.sqrt(2*np.pi*sigma**2) | f=@(x,mu,sigma) exp(-(x-mu)^2/(2*sigma^2)) / sqrt(2*pi*sigma^2) | f(x,μ,δ)= exp(-(x-μ)^2/2δ^2)/sqrt(2π*δ^2) | Gaussian density function |
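The last two rows of the syntax comparison can be tried out directly. The following is a minimal sketch, assuming a recent Julia 1.x release rather than the v0.4 the slides were written for; the function names `gauss` and `dropout` are made up for illustration, and `rand(rng, Float64, size(x))` is used in place of `rand(size(x))` because newer Julia versions treat a tuple argument to `rand` as a collection to sample from.

```julia
using Random

# Gaussian density, following the Julia column of the table (σ instead of δ)
gauss(x, μ, σ) = exp(-(x - μ)^2 / (2σ^2)) / sqrt(2π * σ^2)

# Dropout: keep an entry only when its uniform random draw exceeds t
function dropout(x, t; rng = Random.default_rng())
    r = rand(rng, Float64, size(x))   # one uniform draw per entry of x
    return x .* (r .> t)              # broadcasted comparison and product
end

x = ones(4, 4)
y = dropout(x, 0.5)
peak = gauss(0.0, 0.0, 1.0)           # density of N(0, 1) at its mean
```

Note the dotted operators `.*` and `.>`: they broadcast elementwise, which is the one place where the Julia column differs from the Matlab column.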

4 Beyond Syntax
- Close-to-C performance in native Julia code: you typically do not need to explicitly vectorize your code (as you would in Matlab).
- Type annotations, an LLVM-based just-in-time (JIT) compiler, easy parallelization with coroutines on a single machine or across the nodes of a cluster, and more.

5 Convenient FFI
- Calling C/Fortran functions
- Calling Python functions (PyCall.jl, PyPlot.jl, IJulia, ...)
- Interacting with C++ functions and objects directly; see Cxx.jl
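As a small illustration of the C interface, the sketch below calls `strlen` from the C standard library through Julia's built-in `ccall`. It assumes the C library is already loaded into the Julia process, which is the usual situation on Linux and macOS.

```julia
# Call the C function strlen directly; no wrapper or glue code is needed.
# Arguments: function name, return type, tuple of argument types, then values.
n = ccall(:strlen, Csize_t, (Cstring,), "Mocha")
println(n)
```

Julia converts the string to a NUL-terminated C string for the call and returns the result as an unsigned integer.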

6 Powerful Macros
- JuMP: optimization models
- OpenPPL: probabilistic programming

7 Disadvantages of Julia
Julia is still at an early stage, so:
- The ecosystem is still young (653 Julia packages vs. 66,687 PyPI packages); e.g. Images.jl still does not have a resize function.
- The core language is still evolving; e.g. the current v0.4-rc introduced a lot of breaking changes (and also exciting new features).

8 MOCHA.JL Deep Learning in Julia


9 [Figure: collage of images from the sources below, including Figure 3 of the DQN paper, "Comparison of the DQN agent with the best reinforcement learning methods in the literature".]
Image sources:
- Li Deng and Dong Yu. Deep Learning Methods and Applications.
- Zheng, Shuai et al. Conditional Random Fields as Recurrent Neural Networks. arxiv.org cs.CV (2015).
- Google DeepMind. Human-level control through deep reinforcement learning. Nature, Feb 2015.
- Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015.

10 Why is Deep Learning Successful?
Theoretical point of view:
- We are nowhere near a complete theoretical understanding of deep learning yet.
Practical point of view:
- Big Data: large amounts of (thank the Internet) labeled (thank Amazon M-Turk), high-dimensional data (large images, videos, speech and text corpora, etc.)
- Computational Power: GPUs, large clusters
- Human Power: the deep learning conspiracy
- Software Engineering: network architecture and computation components are decoupled


11 Layers & back-propagate
- Top: the typical way of visualizing a neural network. It is clear and intuitive, but does not decompose the computation well into layers.
- Bottom: an alternative way of thinking about neural networks. Each layer is a black box that can carry out forward and backward computation.
- Important: the computation is completely encapsulated inside the layer. The black box does NOT need to know the external environment (e.g. the overall network architecture) to do its computation.
- e.g. Linear Layer (Input → Output)
  Forward: $Y = \sigma(W^\top X)$
  Backward: $\frac{\partial \ell}{\partial W} = \frac{\partial Y}{\partial W} \frac{\partial \ell}{\partial Y}$, $\frac{\partial \ell}{\partial X} = \frac{\partial Y}{\partial X} \frac{\partial \ell}{\partial Y}$
- More generally, a deep neural network can be viewed as a directed acyclic graph (optionally with time-delayed recurrent connections).
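The linear layer example can be written out as a self-contained forward/backward pair. This is a sketch under stated assumptions, not Mocha.jl's implementation: σ is taken to be the logistic sigmoid, X holds one sample per column, the toy loss is the sum of all outputs, and the names `forward` and `backward` are illustrative.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# Forward: Y = σ(W' * X), with X of size d×n and W of size d×k
forward(W, X) = sigmoid.(W' * X)

# Backward: given dY = ∂ℓ/∂Y, return (∂ℓ/∂W, ∂ℓ/∂X) via the chain rule.
# Only W, X, Y and dY are needed; nothing about the rest of the network.
function backward(W, X, Y, dY)
    dZ = dY .* Y .* (1 .- Y)   # through the sigmoid: σ'(z) = σ(z)(1 - σ(z))
    dW = X * dZ'               # d×k, same shape as W
    dX = W * dZ                # d×n, same shape as X
    return dW, dX
end

W = [0.1 -0.2; 0.3 0.4; -0.5 0.6]   # d = 3 inputs, k = 2 outputs
X = [1.0 0.5; -1.0 2.0; 0.0 1.5]    # n = 2 samples
Y = forward(W, X)
dW, dX = backward(W, X, Y, ones(size(Y)))   # toy loss ℓ = sum(Y), so ∂ℓ/∂Y = 1

# Finite-difference check of one entry of ∂ℓ/∂W
h = 1e-6
Wp = copy(W); Wp[1, 1] += h
numeric = (sum(forward(Wp, X)) - sum(forward(W, X))) / h
```

The analytic entry `dW[1, 1]` and the finite-difference estimate `numeric` should agree to several digits, which is also the kind of check a unit test for a layer can perform.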

12 Advantage of de-coupled view of NN
- Highly efficient computation components can be written by programmers and experts in high-performance computing and code optimization, e.g. the cuDNN library from Nvidia.
- Researchers can try out novel architectures easily without worrying too much about the internal implementation of commonly used layers.
- Some examples of complicated networks built with standard components: Network-in-Network, Siamese Networks, Fully-Convolutional Networks, etc.
Image Source: J. Long, E. Shelhamer, T. Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR.

13 Deep Learning Libraries
- C++: Caffe (widely adopted in academia), dmlc/cxxnet, cuda-convnet, CNTK (by MSR), etc.
- Python: Theano (auto-differentiation) and various wrappers; NervanaSystems/neon; etc.
- Lua: Torch 7 (supported by Facebook and Google DeepMind)
- Matlab: MatConvNet (by VGG)
- Julia: pluskid/Mocha.jl

14 Why Mocha.jl?
1. Julia: written in Julia, with easy interaction with the rest of the Julia ecosystem.
2. Minimum dependency: the Julia backend runs out of the box. The CUDA backend depends on Nvidia cuDNN.
3. Correctness: all the computation components are unit-tested.
4. Modular architecture: layers, activation functions, regularizers, network topology, solvers, etc.
Julia compiles with LLVM, so potentially Julia code could be compiled directly to GPU devices in the future. After that, writing neural networks in Julia will be really enjoyable!

15 Image Classification IJulia Demo


16 Mini-Tutorial: ConvNets on MNIST
- MNIST: handwritten digits
- Data preparation: following convention, images are represented as a 4D tensor: width-by-height-by-channels-by-count
- For MNIST: 28-by-28-by-1-by-64
- Mocha.jl supports general ND tensors
- Data are stored in the HDF5 file format, which is commonly supported by Matlab, Numpy, etc.
- See examples/mnist/gen-mnist.sh
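The tensor layout can be illustrated with plain Julia arrays. The sketch below uses random numbers as a stand-in for real MNIST pixels, and the variable names are made up for illustration; it only shows how a mini-batch of 64 flattened images maps onto the width-by-height-by-channels-by-count shape.

```julia
# Hypothetical stand-in for a mini-batch: 64 flattened 28×28 grayscale images,
# one image per column (random values here, not real MNIST data).
flat = rand(Float32, 28 * 28, 64)

# Reshape to width-by-height-by-channels-by-count; no data is copied
batch = reshape(flat, 28, 28, 1, 64)

# Julia arrays are column-major, so the first (width) index varies fastest
first_image = batch[:, :, 1, 1]
```

Because Julia stores arrays in column-major order, the count dimension is the slowest-varying one, so each image occupies a contiguous block of memory.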

17 Defining Network Architecture
A network starts with data layers (inputs) and ends with prediction or loss layers.

    data_layer = AsyncHDF5DataLayer(name="train-data", source="data/train.txt", batch_size=64, shuffle=true)

- The source file data/train.txt lists the HDF5 files for the training set.
- 64 images are provided for each mini-batch.
- The data is shuffled to improve convergence.
- The async data layer uses Julia to pre-read data while waiting for computation on the CPU / GPU.

18 Convolution Layer
[Figure: LeNet architecture. INPUT 32x32, C1 feature maps (convolutions), S2 feature maps (subsampling), C3 feature maps (convolutions), S4 feature maps (subsampling), C5 layer, F6 layer (full connection), OUTPUT 10 (Gaussian connections).]
LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE (1998).

    conv_layer = ConvolutionLayer(name="conv1", n_filter=20, kernel=(5,5), bottoms=[:data], tops=[:conv])

19 Pooling Layer

    pool_layer = PoolingLayer(name="pool1", kernel=(2,2), stride=(2,2), bottoms=[:conv], tops=[:pool])

- The pooling layer operates on the output of the convolution layer.
- By default, MAX pooling is performed; you can switch to MEAN pooling by specifying pooling=Pooling.Mean().
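To see what these kernel and stride settings do to the spatial size, here is a minimal sketch in plain Julia. It assumes no padding and a convolution stride of 1, and it is a naive reference version rather than the code Mocha.jl runs; `out_size` and `maxpool` are illustrative names.

```julia
# Output width of a sliding window, assuming no padding
out_size(n, kernel, stride) = div(n - kernel, stride) + 1

# Naive 2D max pooling over a single-channel image
function maxpool(x::AbstractMatrix, k::Int, s::Int)
    w, h = out_size(size(x, 1), k, s), out_size(size(x, 2), k, s)
    y = similar(x, w, h)
    for j in 1:h, i in 1:w
        i0, j0 = (i - 1) * s + 1, (j - 1) * s + 1          # window origin
        y[i, j] = maximum(view(x, i0:i0+k-1, j0:j0+k-1))   # max over the window
    end
    return y
end

conv_out = out_size(28, 5, 1)          # 5×5 convolution on a 28×28 input
pool_out = out_size(conv_out, 2, 2)    # 2×2 pooling with stride 2
x = reshape(collect(1.0:16.0), 4, 4)
y = maxpool(x, 2, 2)
```

Under these assumptions a 28-by-28 MNIST image shrinks to 24-by-24 after the first convolution and to 12-by-12 after the first pooling layer.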


20 Constructing DAG with tops and bottoms Network architecture is determined by connecting tops (output) blobs to bottoms (input) blobs with matching blob names. Layers are automatically sorted and connected as a directed acyclic graph (DAG). The figure on the right shows the visualization of the LeNet for MNIST: conv-pool x2 + dense x2
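The idea of ordering layers by matching blob names can be sketched with a depth-first topological sort. This is an illustration of the concept only, not Mocha.jl's internal code: the layer descriptions are hypothetical named tuples, and the `[:data, :label]` tops of the data layer are an assumption.

```julia
# Order layers so that every layer comes after the producers of its bottoms
function topo_sort(layers)
    producer = Dict{Symbol,String}()          # blob name => producing layer
    for l in layers, t in l.tops
        producer[t] = l.name
    end
    byname = Dict(l.name => l for l in layers)
    sorted, state = String[], Dict{String,Symbol}()
    function visit(name)
        s = get(state, name, :new)
        s == :done && return
        s == :active && error("cycle detected at layer $name")
        state[name] = :active
        for b in byname[name].bottoms
            haskey(producer, b) || error("no layer produces blob $b")
            visit(producer[b])                # producers are emitted first
        end
        state[name] = :done
        push!(sorted, name)
    end
    foreach(l -> visit(l.name), layers)
    return sorted
end

# Layers deliberately listed out of order
layers = [
    (name = "pool1", bottoms = [:conv], tops = [:pool]),
    (name = "conv1", bottoms = [:data], tops = [:conv]),
    (name = "train-data", bottoms = Symbol[], tops = [:data, :label]),
]
order = topo_sort(layers)
```

Because the ordering is derived from blob names alone, layers can be declared in any order, and a bottom blob that no layer produces is reported as an error.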

21 Definition of the rest of the layers

    conv2_layer = ConvolutionLayer(name="conv2", n_filter=50, kernel=(5,5), bottoms=[:pool], tops=[:conv2])
    pool2_layer = PoolingLayer(name="pool2", kernel=(2,2), stride=(2,2), bottoms=[:conv2], tops=[:pool2])
    fc1_layer = InnerProductLayer(name="ip1", output_dim=500, neuron=Neurons.ReLU(), bottoms=[:pool2], tops=[:ip1])
    fc2_layer = InnerProductLayer(name="ip2", output_dim=10, bottoms=[:ip1], tops=[:ip2])
    loss_layer = SoftmaxLossLayer(name="loss", bottoms=[:ip2, :label])

22 The Stochastic Gradient Descent Solver

    method = SGD()
    params = make_solver_parameters(method, max_iter=10000, regu_coef=0.0005,
        mom_policy=MomPolicy.Fixed(0.9), lr_policy=LRPolicy.Inv(0.01, , 0.75),
        load_from=exp_dir)
    solver = Solver(method, params)

- Solvers have many customizable parameters, including the learning-rate policy, the momentum policy, etc.
- Advanced policies, such as halving the learning rate when performance on the validation set drops, are also supported.
- See the Mocha.jl documentation for other available solvers.
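The effect of these solver parameters can be sketched in plain Julia. The code below is a generic illustration, not Mocha.jl's solver: it assumes the inverse-decay schedule has the Caffe-style form base_lr * (1 + gamma * iter)^(-power), it applies the regularization coefficient as L2 weight decay, and the `gamma` value is hypothetical because the slide does not show it.

```julia
# Inverse-decay learning rate (assumed form): base_lr * (1 + gamma * iter)^(-power)
inv_lr(base_lr, gamma, power, iter) = base_lr * (1 + gamma * iter)^(-power)

# One SGD step with fixed momentum and L2 regularization (weight decay)
function sgd_step!(w, v, grad, lr; mom = 0.9, regu_coef = 0.0005)
    @. v = mom * v - lr * (grad + regu_coef * w)   # update the velocity
    @. w = w + v                                   # move along the velocity
    return w
end

# Toy problem: minimize f(w) = sum(w.^2) / 2, whose gradient is w itself
w, v = [1.0, -2.0], zeros(2)
gamma = 1e-4   # hypothetical value; the slide does not show it
for iter in 0:999
    sgd_step!(w, v, copy(w), inv_lr(0.01, gamma, 0.75, iter))
end
```

With momentum 0.9 the effective step is roughly ten times the nominal learning rate, which is why a base rate as small as 0.01 still makes quick progress on this toy problem.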

23 Coffee Breaks for the solver

    setup_coffee_lounge(solver, save_into="$exp_dir/statistics.jld", every_n_iter=1000)

    # report training progress every 100 iterations
    add_coffee_break(solver, TrainingSummary(), every_n_iter=100)

    # save snapshots every 5000 iterations
    add_coffee_break(solver, Snapshot(exp_dir), every_n_iter=5000)
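Conceptually, a coffee break is a callback that the solver runs on a fixed iteration schedule. The sketch below mimics that scheduling in plain Julia; `run_with_breaks` and the string-returning actions are hypothetical stand-ins, and details such as whether Mocha.jl also triggers breaks at iteration 0 are not modeled.

```julia
# Hypothetical stand-in for the solver loop: run callbacks on a fixed schedule
function run_with_breaks(max_iter, breaks)
    events = String[]
    for iter in 1:max_iter
        # ... one training iteration would happen here ...
        for (every_n_iter, action) in breaks
            iter % every_n_iter == 0 && push!(events, action(iter))
        end
    end
    return events
end

breaks = [
    (100,  iter -> "summary at iteration $iter"),    # like TrainingSummary()
    (5000, iter -> "snapshot at iteration $iter"),   # like Snapshot(exp_dir)
]
events = run_with_breaks(10_000, breaks)
```

Keeping reporting and snapshotting outside the training step is what lets the same solver loop serve quick experiments and long runs alike.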

24 Solver Statistics
- Solver statistics are saved automatically if the coffee lounge is set up.
- Snapshots save the training progress periodically, so training can resume from the last snapshot after an interruption.

25 Switching Backends: CPU vs GPU

    backend = use_gpu ? GPUBackend() : CPUBackend()

26 THANK YOU!
