Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies


Brandon Hawkins
5 years ago

Automatic Colorization of Black and White Images

Traditionally this was done by hand with human effort because it is such a difficult task. Deep learning can use the objects and their context within the photograph to color the image, much like a human operator might approach the problem.

http://machinelearningmastery.com/inspirational-applications-deep-learning/

Automatically Adding Sounds To Silent Movies

In this task the system must synthesize sounds to match a silent video. The system is trained using 1000 examples of video with sound of a drum stick striking different surfaces and creating different sounds. A deep learning model associates the video frames with a database of pre-recorded sounds in order to select the sound that best matches what is happening in the scene.

Automatic Machine Translation

This is a task where, given a word, phrase or sentence in one language, the system automatically translates it into another language. Automatic machine translation has been around for a long time, but deep learning is achieving top results in two specific areas:
- Automatic Translation of Text
- Automatic Translation of Images

Object Classification and Detection in Photographs

This task requires the classification of objects within a photograph as one of a set of previously known objects. A more complex variation of this task, called object detection, involves specifically identifying one or more objects within the scene of the photograph and drawing a box around them.

Automatic Handwriting Generation

This is a task where, given a corpus of handwriting examples, the system generates new handwriting for a given word or phrase. The handwriting is provided as a sequence of coordinates used by a pen when the handwriting samples were created. From this corpus the relationship between the pen movement and the letters is learned, and new examples can be generated ad hoc. What is fascinating is that different styles can be learned and then mimicked. I would love to see this work combined with some forensic handwriting analysis expertise.

Automatic Text Generation

Please do not use this tool to generate your course project report!

New text is generated word-by-word or character-by-character. The model is capable of learning how to spell, punctuate, form sentences and even capture the style of the text in the corpus.

Automatic Image Caption Generation

Automatic image captioning is the task where, given an image, the system must generate a caption that describes the contents of the image.

Automatic Game Playing

Introduction

Deep learning (also known as deep structured learning, hierarchical learning or deep machine learning) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations.

Biological Motivation

In one experiment, done in 1959, David Hubel and Torsten Wiesel inserted a microelectrode into the primary visual cortex of a cat. They then projected patterns of light and dark on a screen in front of the cat. They found that some neurons fired rapidly when presented with lines at one angle, while others responded best to another angle. They called these neurons "simple cells." Still other neurons, which they termed "complex cells," responded best to lines of a certain angle moving in one direction. These studies showed how the visual system builds an image from simple stimuli into more complex representations.

Why Brains have a deep architecture
- Humans organize their ideas hierarchically, through composition of simpler ideas
- Insufficiently deep architectures can be exponentially inefficient
- Distributed (possibly sparse) representations are necessary
- Input is represented by the activation of a set of features that are not mutually exclusive
- Multiple levels of latent variables allow combinatorial sharing of statistical strength

http://www.cs.toronto.

Features

Neural Network Representation

Steer an autonomous vehicle driving at normal speeds on public highways. The pixels of an image serve as the features for machine learning algorithms.

Features

Pixels do not provide much useful information.

Deep Architecture in the Brain

Features: Simple cells

Sparse structure of input data: natural images contain localized, oriented structures with limited phase alignment across spatial frequency.

Features

400 image patches of 16 × 16 pixels are extracted from many natural scenes, denoted S[i], i = 0, 1, 2, ..., 399, together with a target image patch T. The problem is to find as few S[k] as possible such that $\lVert T - \sum_k a[k]\, S[k] \rVert$ is minimized. Surprisingly, the S[k] selected by the algorithm are always localized, oriented structures with limited phase alignment across spatial frequency.

Overview
- Train networks with many layers (vs. shallow nets with just a couple of layers)
- Multiple layers work to build an improved feature space
  - First layer learns 1st order features (e.g. edges)
  - 2nd layer learns higher order features (combinations of first layer features, combinations of edges, etc.)
- In current models, layers often learn in an unsupervised mode and discover general features of the input space, serving multiple tasks related to the unsupervised instances (image recognition, etc.)
- Then final layer features are fed into supervised layer(s)
  - The entire network is often subsequently tuned using supervised training of the entire net, using the initial weightings learned in the unsupervised phase
- Could also do fully supervised versions, etc. (early BP attempts)

Tasks

Usually best when the input space is locally structured, spatially or temporally (images, language, etc.), vs. arbitrary input features.

Images

Example: view of a learned vision feature layer (Basis). Each square in the figure shows the input image that maximally activates

How Many Features?

Category of document -> topic (thousand) -> term (10 thousand) -> word (million)

Accuracy and Time Complexity

Shallow Learning (Surface Learning)
- Support Vector Machine
- Neural Network
- Logistic Regression

Deep learning and Neural Network

Neural Network

Prone to overfitting and gradient diffusion, difficult to tune, computationally expensive, and not clearly superior in performance.

The denoising autoencoder algorithm consists of multiple steps. It starts with a stochastic mapping of $x$ to $\tilde{x}$ through $q_D(\tilde{x} \mid x)$; this is the corrupting step. Then the corrupted input $\tilde{x}$ passes through a basic autoencoder process and is mapped to a hidden representation $y = f_\theta(\tilde{x}) = s(W\tilde{x} + b)$. From this hidden representation, we can reconstruct $z = g_{\theta'}(y)$. In the last stage, a minimization algorithm runs in order to bring $z$ as close as possible to the uncorrupted input $x$. The reconstruction error $L_H(x, z)$ might be either the cross-entropy loss with an affine-sigmoid decoder, or the squared error loss with an affine decoder.

Convolutional Neural Networks

CNNs exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers. In other words, the inputs of hidden units in layer m come from a subset of units in layer m-1, units that have spatially contiguous receptive fields. The architecture thus ensures that the learnt "filters" produce the strongest response to a spatially local input pattern. Stacking many such layers leads to non-linear "filters" that become increasingly "global".
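The corrupt, encode, decode and compare steps of the denoising autoencoder described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the slides' own implementation: the corruption is assumed to be masking noise (each input component is zeroed with probability `drop_prob`), the decoder is assumed to be affine-sigmoid with its own weights `W_dec` and `b_dec`, and all function names are hypothetical. No training loop is shown; only the forward pass and the two reconstruction errors are computed.

```python
import math
import random


def sigmoid(v):
    """Logistic function s(v)."""
    return 1.0 / (1.0 + math.exp(-v))


def corrupt(x, drop_prob, rng):
    """Assumed corruption q_D: zero each component independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]


def affine_sigmoid(weights, bias, v):
    """Compute s(W v + b), with W given as a list of rows."""
    return [sigmoid(sum(w * vi for w, vi in zip(row, v)) + b)
            for row, b in zip(weights, bias)]


def cross_entropy(x, z, eps=1e-12):
    """Cross-entropy reconstruction error between the clean input x and the reconstruction z."""
    return -sum(xi * math.log(zi + eps) + (1.0 - xi) * math.log(1.0 - zi + eps)
                for xi, zi in zip(x, z))


def squared_error(x, z):
    """Squared-error reconstruction error, the alternative used with an affine decoder."""
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z))


def denoising_forward(x, W, b, W_dec, b_dec, drop_prob, rng):
    """One forward pass: corrupt x, encode to y, decode to z, score z against the clean x."""
    x_tilde = corrupt(x, drop_prob, rng)       # corrupting step
    y = affine_sigmoid(W, b, x_tilde)          # hidden representation y = s(W x_tilde + b)
    z = affine_sigmoid(W_dec, b_dec, y)        # reconstruction z = g(y)
    loss = cross_entropy(x, z)                 # compared with the uncorrupted input
    return x_tilde, y, z, loss
```

Note that the loss is measured against the clean input `x`, not against `x_tilde`; that is what forces the hidden representation to undo the corruption rather than copy it.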

CNN: LeNet-5

CNN: Convolutions

Replicating units in this way allows features to be detected regardless of their position in the visual field, thus constituting the property of translation invariance.

CNN: Pooling

CNN: Convolutions

Suppose you want to learn 9 features from a 5 × 5 image:
- With a fully connected neural network: 5 × 5 × 9 = 225
- With a locally connected neural network: 3 × 3 × 9 = 81
- With weight sharing:
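To make the replicated-unit idea concrete, here is a small sketch in plain Python. It slides one 3 × 3 filter over a 5 × 5 image, which yields 3 × 3 = 9 output units, and it reproduces the fully connected and locally connected counts from the list above. The helper names and the example filter are hypothetical, chosen only for illustration. As in most CNN code, the "convolution" is computed without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def single_pixel_image(row, col, size=5):
    """A size x size image that is zero everywhere except one bright pixel."""
    return [[1 if (r, c) == (row, col) else 0 for c in range(size)]
            for r in range(size)]


# Counts quoted in the text for 9 features on a 5 x 5 image.
fully_connected_weights = 5 * 5 * 9     # every output unit sees every pixel
locally_connected_weights = 3 * 3 * 9   # every output unit has its own 3 x 3 window

# An arbitrary example filter (a vertical-edge detector).
kernel = [[1, 0, -1],
          [2, 0, -2],
          [1, 0, -1]]

response_a = conv2d_valid(single_pixel_image(2, 2), kernel)
response_b = conv2d_valid(single_pixel_image(2, 3), kernel)  # same pattern, one column right
```

Because every output unit uses the same weights, moving the bright pixel one column to the right moves the response pattern one column to the right as well: `response_b[r][c + 1]` equals `response_a[r][c]` wherever both indices are valid.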

Convolutional Layer C1

C1 is a convolutional layer with 6 feature maps of size 28 × 28; each neuron has a 5 × 5 receptive field in the input layer.
- One neuron corresponds to 5 × 5 unit parameters and one bias parameter
- Connections: (5 × 5 + 1) × 6 × (28 × 28) = 122,304
- Trainable parameters: (5 × 5 + 1) × 6 = 156
- If it were fully connected, we would have (32 × 32) × (28 × 28) × 6 parameters

S2 is a subsampling layer with 6 feature maps of size 14 × 14; each unit is connected to a 2 × 2 non-overlapping receptive field in C1.
- Layer S2: 6 × 2 = 12 trainable parameters
- Connections: 14 × 14 × 2 × 2 × 6 = 4704

Convolutional Layer C3

C3 is a convolutional layer with 16 feature maps of size 10 × 10. Each unit in C3 is connected to several 5 × 5 receptive fields in S2.
- Connections: 151,600 (with 1,516 trainable parameters, counting biases)

Subsampling Layer S4

S4 is a subsampling layer with 16 feature maps of size 5 × 5. Each unit in S4 is connected to the corresponding 2 × 2 receptive field in C3.
- Layer S4: 16 × 2 = 32 trainable parameters
- Connections:

Convolutional Layer C5

C5 is a convolutional layer with 120 feature maps of size 1 × 1. Each unit in C5 is connected to all 16 of the 5 × 5 feature maps in S4.
- C5: 120 × (16 × 25) = 48,000 trainable parameters and connections (fully connected)

Layer F6

F6 is 84 fully connected units.
- 84 × (120 + 1) = 10,164 trainable parameters and connections

Output Layer: 84

Weight update: Backpropagation
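The LeNet-5 counts quoted for C1, S2, S4, C5 and F6 can be re-derived with a few lines of arithmetic. The sketch below follows the counting conventions of the text above: bias terms are included for C1 and F6 but not in the S2 connection count or the C5 count. Variable names are hypothetical.

```python
# C1: 6 feature maps of 28 x 28, each unit has a 5 x 5 receptive field plus one bias.
c1_params = (5 * 5 + 1) * 6
c1_connections = c1_params * (28 * 28)

# The same mapping with full connectivity from the 32 x 32 input to 6 maps of 28 x 28.
c1_fully_connected = (32 * 32) * (28 * 28) * 6

# S2: 6 maps of 14 x 14; one coefficient and one bias per map.
s2_params = 6 * 2
s2_connections = 14 * 14 * 2 * 2 * 6

# S4: 16 maps of 5 x 5; one coefficient and one bias per map.
s4_params = 16 * 2

# C5: 120 units, each connected to all 16 maps of 5 x 5 in S4 (biases not counted here).
c5_params = 120 * (16 * 25)

# F6: 84 units, each connected to the 120 C5 outputs plus one bias.
f6_params = 84 * (120 + 1)

for name, value in [("C1 parameters", c1_params),
                    ("C1 connections", c1_connections),
                    ("C1 if fully connected", c1_fully_connected),
                    ("S2 parameters", s2_params),
                    ("S2 connections", s2_connections),
                    ("S4 parameters", s4_params),
                    ("C5 parameters", c5_params),
                    ("F6 parameters", f6_params)]:
    print(f"{name}: {value:,}")
```

The comparison between `c1_params` and `c1_fully_connected` shows the effect of local receptive fields and shared weights: 156 parameters against roughly 4.8 million.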

Recurrent Neural Network (RNN)

[Figure: an RNN unrolled over time, with input weights U, recurrent weights W and output weights V]

$A_t = f(U x_t + W A_{t-1})$

$h_t = \mathrm{softmax}(V A_t)$

Long Term Dependencies

"The clouds are in the sky."

"I grew up in France ... I speak fluent French."
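The recurrence for the hidden state A_t and the output h_t can be sketched as follows in plain Python. The slides do not specify the activation f, so tanh is assumed here; the function names and the way weights are stored (lists of rows) are also illustrative choices, not part of the original material.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(x_t, a_prev, U, W, V):
    """One time step: A_t = f(U x_t + W A_{t-1}), h_t = softmax(V A_t), with f assumed to be tanh."""
    pre_activation = [u + w for u, w in zip(matvec(U, x_t), matvec(W, a_prev))]
    a_t = [math.tanh(p) for p in pre_activation]
    h_t = softmax(matvec(V, a_t))
    return a_t, h_t


def rnn_forward(inputs, a_0, U, W, V):
    """Run the same weights U, W, V over a whole input sequence."""
    a_t, outputs = a_0, []
    for x_t in inputs:
        a_t, h_t = rnn_step(x_t, a_t, U, W, V)
        outputs.append(h_t)
    return a_t, outputs
```

The same three matrices are reused at every step; only the state `a_t` carries information forward. That single carried state is what has to bridge the gap between "France" and "French" in the long-term dependency example.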

Long Short Term Memory Networks

Input Gate

Forget Gate
