Deep learning for music, galaxies and plankton

Sander Dieleman
May 17, 2016

I. Galaxies

http://www.galaxyzoo.org

The Galaxy Challenge: automate this classification process
  Competition on Kaggle
  colour image → model → predictions

The data: 140 000 JPEG colour images
  dimensions: 424 x 424
  train: 61 578 images
  test: 79 975 images

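The solution: a convnet with 7 layers
  input: 45 x 45, 3 channels (RGB)
  convolutional layers with 32, 64, 128 and 128 filters (filter sizes 6, 5, 3 and 3)
  feature maps: 40 x 40, 16 x 16, 6 x 6 and 4 x 4
  max pooling to 20 x 20, 8 x 8 and 2 x 2
  128 x 16
  dense layers: 2048 maxout(2), 2048 maxout(2)
  output: 37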

Shallow learning
  $x_n \rightarrow \phi_n \rightarrow f_\theta(\phi_n) \rightarrow y_n$
  training examples → extracted features → shallow model → predictions

Deep learning
  $x_n \rightarrow f_{\theta_k}(\cdots f_{\theta_2}(f_{\theta_1}(x_n))) \rightarrow y_n$
  training examples → deep model → predictions

Deep learning vs. traditional neural networks
  traditional: one hidden layer, output layer
  deep: several hidden layers, output layer
  rectified linear units: y = max(x, 0)
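The rectified linear unit and the stacking of several hidden layers can be sketched in a few lines of NumPy. This is only an illustration: the function names, the layer sizes and the random weights are assumptions and are not taken from the talk.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: y = max(x, 0), applied elementwise."""
    return np.maximum(x, 0)

def forward(x, weights, biases):
    """Apply every hidden layer with a ReLU, then a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return h @ weights[-1] + biases[-1]

# Hypothetical toy network: 4 inputs, two hidden layers of 8 units, 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 2))]
biases = [np.zeros(8), np.zeros(8), np.zeros(2)]
outputs = forward(rng.normal(size=(5, 4)), weights, biases)
```

With a single entry in `weights[:-1]` the same function describes a network with one hidden layer; adding entries makes it deeper without changing the code.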

Convolutional neural networks
  fully connected: flatten
  convolutional: local connectivity, translation invariance
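A one-dimensional toy example may help to see why local connectivity with shared weights leads towards translation invariance. The signal, the kernel and the helper name `conv1d_valid` are made up for this sketch.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide one shared kernel over the signal (cross-correlation, no padding)."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])

signal = np.array([0., 0., 1., 2., 1., 0., 0., 0.])
kernel = np.array([1., 0., -1.])
shifted = np.roll(signal, 2)  # the same pattern, two steps to the right

response = conv1d_valid(signal, kernel)
response_shifted = conv1d_valid(shifted, kernel)
```

The response to the shifted signal is the original response moved by the same two steps, so taking a maximum over positions gives the same value for both inputs.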

Preprocessing: cropping and downsampling
  424 x 424 → 207 x 207 → 69 x 69
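The two preprocessing steps can be sketched as a centre crop followed by a 3x reduction. The slide only gives the sizes, so the centred crop and the block averaging used here are assumptions, as are the function names.

```python
import numpy as np

def center_crop(image, size):
    """Cut a size x size window from the middle of an (H, W, C) image."""
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]

def downsample(image, factor):
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by the factor")
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

# Random stand-in for a 424 x 424 colour image.
image = np.random.default_rng(0).random((424, 424, 3))
cropped = center_crop(image, 207)  # 424 x 424 -> 207 x 207
small = downsample(cropped, 3)     # 207 x 207 -> 69 x 69
```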

Data augmentation: rotation, translation, rescaling, flipping, ...
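A minimal augmentation sketch is given below. To stay within plain NumPy it is restricted to rotations by multiples of 90 degrees, flips and wrap-around shifts; arbitrary rotation angles and rescaling need interpolation and are left out. The function name and the `max_shift` parameter are assumptions.

```python
import numpy as np

def augment(image, rng, max_shift=4):
    """Return a randomly rotated, flipped and shifted copy of an (H, W, C) image."""
    k = rng.integers(0, 4)                   # number of 90 degree rotations
    out = np.rot90(image, k, axes=(0, 1))
    if rng.random() < 0.5:                   # horizontal flip half of the time
        out = out[:, ::-1]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(out, (dy, dx), axis=(0, 1))  # shift with wrap-around

rng = np.random.default_rng(0)
image = rng.random((69, 69, 3))
augmented = augment(image, rng)
```

Each call yields a different view of the same galaxy, so the network rarely sees exactly the same input twice during training.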

Network architecture: exploiting rotation invariance

Training large CNNs requires GPU acceleration
  Intel Core i7 3930K at 3.2 GHz, 6 cores
  32GB RAM
  NVIDIA GeForce GTX 680, 2GB / 4GB (2x)

The filters learned in the first convolutional layer
  [Figure: filters shown per colour channel: Red, Green, Blue]

[Figure: feature maps for example inputs]
  input
  layer 1: 40x40, pooling 1: 20x20
  layer 2: 16x16, pooling 2: 8x8
  layer 3: 6x6
  layer 4: 4x4, pooling 4: 2x2
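The feature map sizes follow from simple arithmetic on the input size. The sketch below assumes a 45 x 45 input, unpadded convolutions with filter sizes 6, 5, 3 and 3, and 2 x 2 non-overlapping max pooling; these values are read off the numbers on the architecture slide and should be treated as assumptions.

```python
def conv_out(size, kernel):
    """Output width of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, window):
    """Output width of non-overlapping max pooling."""
    return size // window

# (name, kind, filter or window size); the sizes are assumptions, see above.
layers = [
    ("layer 1", "conv", 6), ("pooling 1", "pool", 2),
    ("layer 2", "conv", 5), ("pooling 2", "pool", 2),
    ("layer 3", "conv", 3),
    ("layer 4", "conv", 3), ("pooling 4", "pool", 2),
]

size = 45
sizes = {}
for name, kind, k in layers:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
    sizes[name] = size
```

Under these assumptions `sizes` reproduces the widths listed for each layer, which is a useful sanity check when changing filter sizes.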

http://benanne.github.io/2014/04/05/galaxy-zoo.html
https://github.com/benanne/kaggle-galaxies

II. Plankton

Pieter, Jonas, Iryna, Jeroen, Lionel, Sander, Aäron

Preprocessing and data augmentation
  preprocessing: rescale
  augmentation: zoom, rotate, translate, flip, shear, stretch

Network architecture based on OxfordNet
  3x3 convolution
  3x3 overlapping pooling, stride 2
  fully connected layer
  Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan & Zisserman, ICLR 2015

Cyclic pooling
  rotations: 0°, 90°, 180°, 270°

Cyclic pooling
  3x3 convolution
  cyclic slicing
  3x3 pooling, stride 2
  cyclic pooling
  fully connected layer
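Cyclic slicing and cyclic pooling can be sketched as follows. The slicing step stacks the four rotated copies of each image in the batch dimension, and the pooling step combines the four resulting feature vectors. The mean is used as the pooling function here as an assumption, and `toy_features` is a hypothetical stand-in for the convolutional layers.

```python
import numpy as np

def cyclic_slice(batch):
    """(N, H, W) -> (4N, H, W): copies rotated by 0, 90, 180 and 270 degrees."""
    return np.concatenate([np.rot90(batch, k, axes=(1, 2)) for k in range(4)], axis=0)

def cyclic_pool(features, pool=np.mean):
    """(4N, F) -> (N, F): combine the features of the four rotated copies."""
    n = features.shape[0] // 4
    return pool(features.reshape(4, n, -1), axis=0)

def toy_features(batch):
    """Hypothetical orientation-sensitive features: top row sum, left column sum."""
    return np.stack([batch[:, 0, :].sum(axis=1), batch[:, :, 0].sum(axis=1)], axis=1)

rng = np.random.default_rng(0)
images = rng.random((3, 5, 5))
rotated = np.rot90(images, 1, axes=(1, 2))

pooled = cyclic_pool(toy_features(cyclic_slice(images)))
pooled_rotated = cyclic_pool(toy_features(cyclic_slice(rotated)))
```

Because a rotated image produces the same four copies in a different order, the pooled features are identical for both inputs even though `toy_features` itself depends on orientation.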

Cyclic rolling
  rotations: 0°, 90°, 180°, 270°

Pseudo-labeling
  test set predictions from various models
  averaged test set predictions...

Pseudo-labeling
  mixed training batch:
    0.67 training data + labels
    0.33 testing data + averaged test set predictions
  larger training set!
  strong regularizing effect
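Building such a mixed batch can be sketched as below. The 0.67 / 0.33 split comes from the slide; the function name, the array shapes and the use of one-hot training labels next to soft pseudo-labels are assumptions made for the example.

```python
import numpy as np

def mixed_batch(train_x, train_y, test_x, test_pred, batch_size, rng, train_fraction=0.67):
    """Sample a batch of labelled training data plus pseudo-labelled test data."""
    n_train = int(round(batch_size * train_fraction))
    n_test = batch_size - n_train
    train_idx = rng.choice(len(train_x), size=n_train, replace=False)
    test_idx = rng.choice(len(test_x), size=n_test, replace=False)
    x = np.concatenate([train_x[train_idx], test_x[test_idx]])
    y = np.concatenate([train_y[train_idx], test_pred[test_idx]])
    return x, y

rng = np.random.default_rng(0)
n_classes = 5
train_x = rng.random((100, 8))
train_y = np.eye(n_classes)[rng.integers(0, n_classes, size=100)]  # one-hot labels
test_x = rng.random((60, 8))

# Pseudo-labels: average the predicted class probabilities of several models.
model_preds = rng.dirichlet(np.ones(n_classes), size=(3, 60))  # 3 hypothetical models
test_pred = model_preds.mean(axis=0)

batch_x, batch_y = mixed_batch(train_x, train_y, test_x, test_pred, 30, rng)
```

The pseudo-labels stay soft (probabilities rather than hard classes), so uncertain test examples contribute a weaker training signal.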

Traditional CV features
  Image size in pixels
  Image moments (capturing size and shape)
  Haralick texture features

Model averaging: ensembling...

Model averaging: test-time augmentation
  quasi-random affine transformations...

Model averaging: bagging
  same networks retrained on different subsets...
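Two of these averaging steps can be sketched in a few lines. The transformations here are simple flips rather than quasi-random affine transformations, and `toy_model` is a hypothetical stand-in for a trained network; both are assumptions made to keep the example self-contained.

```python
import numpy as np

def predict_tta(model, image, transforms):
    """Test-time augmentation: average predictions over transformed copies."""
    return np.mean([model(t(image)) for t in transforms], axis=0)

def ensemble(predictions, weights=None):
    """Average the predictions of several models, optionally weighted."""
    return np.average(predictions, axis=0, weights=weights)

def toy_model(image):
    """Hypothetical two-class 'model' that only looks at the top-left pixel."""
    p = image[0, 0]
    return np.array([p, 1.0 - p])

transforms = [
    lambda im: im,            # identity
    lambda im: im[:, ::-1],   # horizontal flip
    lambda im: im[::-1, :],   # vertical flip
    lambda im: im[::-1, ::-1] # rotation by 180 degrees
]

image = np.array([[0.1, 0.2], [0.3, 0.4]])
tta_prediction = predict_tta(toy_model, image, transforms)
combined = ensemble(np.array([[0.2, 0.8], [0.4, 0.6]]))
```

For bagging, the same `ensemble` function would be applied to the predictions of networks retrained on different subsets of the training data.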

Software and hardware
  Lots of GPUs: Tesla K40, GeForce GTX 680, GeForce GTX 980
  Theano + Lasagne: very fast prototyping through automatic differentiation and graph optimisations

http://benanne.github.io/2015/03/17/plankton.html
https://github.com/benanne/kaggle-ndsb
Reservoir Lab  http://reslab.elis.ugent.be
Sander Dieleman  http://benanne.github.io  @sedielem
Iryna Korshunova  http://irakorshunova.github.io
Lionel Pigou  http://lpigou.github.io
Pieter Buteneers  http://playn.be  @pieterbuteneers

III. Music

Collaborative filtering: use listening patterns for recommendation
  + good performance
  - cold start problem
  many niche items that only appeal to a small audience

Content-based: use audio content and/or metadata for recommendation
  - worse performance
  + no usage data required
  metadata: Artist, Title
  allows for all items to be recommended regardless of popularity

There is a large semantic gap between audio signals and listener preference
  [Figure: audio signals on one side; genre, mood, popularity, time, lyrical themes, location and instrumentation as factors in listener preference]

The long tail
  [Figure: # listeners per song, with songs ordered from popular to unpopular]
  not enough data to recommend these songs!

Rich get richer
  [Figure: # listeners against popularity]

Latent factor models: project users and songs into the same latent space
  [Figure: users and songs in a shared latent space, showing similar songs, dissimilar songs and good recommendations]
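Once users and songs share a latent space, recommending comes down to comparing vectors. The sketch below scores songs by their dot product with a user vector; the scoring rule, the two-dimensional factors and the function name are assumptions chosen for illustration.

```python
import numpy as np

def recommend(user_vector, song_factors, top_k=3):
    """Return the indices of the top_k songs with the highest dot-product score."""
    scores = song_factors @ user_vector
    return np.argsort(-scores)[:top_k]

# Hypothetical 2-dimensional latent factors for four songs and one user.
song_factors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
])
user_vector = np.array([1.0, 0.0])

top_songs = recommend(user_vector, song_factors, top_k=2)
```

Songs 0 and 2 point in almost the same direction as the user vector, so they rank highest, while song 3 points the opposite way and ranks last.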

Predict latent factors from music audio signals
  audio signals → regression model → latent factors

Qualitative evaluation: visualisation of predicted usage patterns (t-SNE)

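[Figure: network architecture]
  input: spectrograms (30 seconds), 599 frames x 128 frequency bins
  convolutional layers with 256, 256 and 512 filters (filter length 4), each followed by max pooling (4x, 2x, 2x)
  time dimension: 599 → 149 → 73 → 35
  global temporal pooling: mean, max, L2 → 1536
  dense layers: 2048, 2048
  output: 40 latent factors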
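The global temporal pooling step can be sketched as follows: each of the 512 feature channels is summarised over the 35 remaining time steps with three statistics, which gives 3 x 512 = 1536 values. Taking "L2" to mean the L2 norm over time is an assumption, as is the function name.

```python
import numpy as np

def global_temporal_pool(features):
    """(time, channels) -> (3 * channels,): mean, max and L2 norm over time."""
    mean = features.mean(axis=0)
    maximum = features.max(axis=0)
    l2 = np.sqrt((features ** 2).sum(axis=0))
    return np.concatenate([mean, maximum, l2])

# Random stand-in for the output of the last pooling layer: 35 steps, 512 channels.
features = np.random.default_rng(0).random((35, 512))
pooled = global_temporal_pool(features)

# Small worked example with two time steps and two channels.
tiny = global_temporal_pool(np.array([[3.0, 0.0], [4.0, 1.0]]))
```

Pooling over the whole time axis makes the output size independent of the clip length, so the dense layers always receive a fixed-size vector.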

Blog post: http://benanne.github.io/2014/08/05/spotify-cnns.html