Using Metal 2 for Compute

1 Using Metal 2 for Compute. Session 608, Graphics and Games, #WWDC17. Anna Tikhonova, GPU Software Engineer. © 2017 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

2 Metal 2 Ecosystem: Metal API and language; GPU Tools; MetalKit; Metal Performance Shaders.

5 Metal Performance Shaders (MPS) (NEW): GPU-accelerated primitives for Image Processing, Linear Algebra, and Machine Learning Inference. Optimized for iOS and macOS. Related sessions: What's New in Metal, Part 2 (WWDC 2016); What's New in Metal, Part 2 (WWDC 2015).

6 Image Processing

7 Image Processing, primitives available in iOS 10: Convolution; Equalization and Specification; Gaussian Blur; Median; Box, Tent; Thresholding; Sobel; Transpose; Morphology; Image Integral; Lanczos Resampling; Color Conversion; Histogram; Gaussian Pyramid.

8 Image Processing, new primitives (NEW): Image Keypoints; Bilinear Rescale; Image Statistics; Element-wise Arithmetic Operations (with broadcasting).

9 Linear Algebra

10 Linear Algebra, new primitives (NEW): Matrix-Matrix Multiplication; Matrix-Vector Multiplication; Triangular Matrix Factorization and Linear Solvers.

13 Data Representations. MPSVector: interprets data in an MTLBuffer as a 1-dimensional array. MPSMatrix: interprets data in an MTLBuffer as a rectangular array, in row-major order. MPSTemporaryMatrix: allocated from an MTLHeap; use it for most of your intermediate matrices.

14 MPSVector and MPSMatrix input types: Single Precision Floating-Point; Half Precision Floating-Point; 16-bit Signed Integer; 8-bit Signed Integer.

15 MPSVector code example: create a vector of size N

    // Create a Metal buffer of length N
    // (makeBuffer returns an optional in current SDKs, hence the force unwrap)
    let buffer = device.makeBuffer(length: N * MemoryLayout<Float32>.size)!

    // Create a vector descriptor
    let descriptor = MPSVectorDescriptor(length: N, dataType: .float32)

    // Create a vector with descriptor
    let vector = MPSVector(buffer: buffer, descriptor: descriptor)

19 MPSMatrix code example: create a matrix with M rows and N columns

    // Get the recommended bytes per row value to use for sizing a Metal buffer
    let bytesPerRow = MPSMatrixDescriptor.rowBytes(forColumns: N, dataType: .float32)

    // Create a Metal buffer with the recommended bytes per row
    // (makeBuffer returns an optional in current SDKs, hence the force unwrap)
    let buffer = device.makeBuffer(length: M * bytesPerRow)!

    // Create a matrix descriptor
    let descriptor = MPSMatrixDescriptor(rows: M, columns: N, rowBytes: bytesPerRow, dataType: .float32)

    // Create a matrix with descriptor
    let matrix = MPSMatrix(buffer: buffer, descriptor: descriptor)
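Because the recommended bytes-per-row value may be larger than the tightly packed row size, element addresses in a row-major matrix depend on the row stride rather than on the column count. The sketch below illustrates that arithmetic in plain Swift, without Metal. The helper `elementOffset` and the padding of 4 elements per row are assumptions made for illustration; they are not part of MPS.

```swift
// Hypothetical helper (not an MPS API): byte offset of element (row, column)
// in a row-major matrix whose rows are padded to `rowBytes` bytes.
func elementOffset(row: Int, column: Int, rowBytes: Int, elementSize: Int) -> Int {
    return row * rowBytes + column * elementSize
}

// Plain-Swift stand-in for a padded buffer: 2 rows x 3 columns of Float,
// with each row padded to 4 elements. The padding is assumed for illustration.
let paddedColumns = 4
let elementSize = MemoryLayout<Float>.stride
let rowBytes = paddedColumns * elementSize
var storage = [Float](repeating: 0, count: 2 * paddedColumns)

// Write matrix[1][2] = 7 through the offset computation.
let offset = elementOffset(row: 1, column: 2, rowBytes: rowBytes, elementSize: elementSize)
storage[offset / elementSize] = 7
print(offset, storage)
```

Indexing with `row * columns + column` instead would land on the wrong element whenever the rows are padded.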

23 Primitives. Matrix-Matrix and Matrix-Vector Multiplication: API modeled after standard BLAS GEMM and GEMV interfaces. Triangular Matrix Factorization and Linear Solvers: API modeled after standard LAPACK decomposition and solve interfaces.

24 Example: Matrix-Matrix Multiply, C = A B

    // Create matrices A, B and C
    let A = MPSMatrix(buffer: ABuffer,
                      descriptor: MPSMatrixDescriptor(rows: M, columns: K, rowBytes: ARowBytes, dataType: .float32))
    let B = MPSMatrix(buffer: BBuffer,
                      descriptor: MPSMatrixDescriptor(rows: K, columns: N, rowBytes: BRowBytes, dataType: .float32))
    let C = MPSMatrix(buffer: CBuffer,
                      descriptor: MPSMatrixDescriptor(rows: M, columns: N, rowBytes: CRowBytes, dataType: .float32))

25 Example: Matrix-Matrix Multiply, C = A B

    // Perform Metal setup
    // (the make... calls return optionals in current SDKs, hence the force unwraps)
    let device = MTLCreateSystemDefaultDevice()!
    let commandQueue = device.makeCommandQueue()!
    let commandBuffer = commandQueue.makeCommandBuffer()!

    // Create a Matrix-Matrix Multiplication kernel
    let mmKernel = MPSMatrixMultiplication(device: device, resultRows: M, resultColumns: N, interiorColumns: K)

    // Encode kernel to the command buffer
    mmKernel.encode(commandBuffer: commandBuffer, leftMatrix: A, rightMatrix: B, resultMatrix: C)

    // Tell GPU to start doing the work
    commandBuffer.commit()
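When checking GPU output, a small CPU reference for the same product can be useful. The following is a minimal sketch in plain Swift, assuming tightly packed row-major arrays; `referenceMatMul` is a hypothetical helper, not an MPS or BLAS call, and it makes no attempt at performance.

```swift
// Reference C = A * B for tightly packed row-major arrays:
// A is m x k, B is k x n, C is m x n.
func referenceMatMul(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n, "dimension mismatch")
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for p in 0..<k {
                sum += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = sum
        }
    }
    return c
}

let a: [Float] = [1, 2, 3,
                  4, 5, 6]      // 2 x 3
let b: [Float] = [7, 8,
                  9, 10,
                  11, 12]       // 3 x 2
let c = referenceMatMul(a, b, m: 2, n: 2, k: 3)
print(c)
```

Here `m`, `n`, and `k` play the same roles as `resultRows`, `resultColumns`, and `interiorColumns` in the kernel above.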

29 Sample Code: MPSMatrixMultiplication. Triangular Matrix Factorization and Linear Solvers: coming soon.

30 Machine Learning

31 Machine Learning at Apple, architecture: Applications; Domain-Specific Frameworks (Vision, NLP); ML Framework (Core ML); ML Performance Primitives (Accelerate, MPS).

32 What Is Deep Learning?

35 [Image labeled: panda]

37 [Image labels: house, ocean, dress, dog, girl, sunset, bicycle, giraffe, horse, ramp, man, plant, skateboard, lights]

38 Training and Inference. Training to classify images; labels: cat, rabbit, dog, giraffe, horse.

41 Training. [Diagram: labeled images (cat, rabbit, dog, giraffe, horse) are used for training to classify images, which produces the Trained Parameters.]

43 Inference. [Diagram: an Input Image is passed to the CNN, which uses the Trained Parameters to pick among cat, rabbit, dog, giraffe, horse; result: cat.]

44 Agenda: Recap on Convolutional Neural Networks (CNN). Related session: What's New in Metal, Part 2 (WWDC 2016).

45 Agenda: Recap on Convolutional Neural Networks (CNN); Convolutional Neural Networks, New Primitives; Neural Network Graph API; Recurrent Neural Networks (RNN).

47 What Are Convolutional Neural Networks?

50 Convolutional Neural Networks. Biologically inspired; resemble the visual cortex. Hierarchical representation: organized into a hierarchy of layers; higher-level features are derived from lower-level features. Think of a feature as a filter that filters data for that feature.

51 Convolutional Neural Networks, primitives available in iOS 10: Convolution; Fully-Connected; Pooling (Average, Max); Normalization (Cross-Channel, Local Contrast, Spatial); Softmax; Neuron (Linear, ReLU, Sigmoid, TanH, Absolute).
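For readers unfamiliar with the Neuron and Softmax layers named above, here is a minimal CPU sketch of two of them in plain Swift. It uses the standard textbook definitions (ReLU as max(0, x), Softmax as normalized exponentials with the maximum subtracted for numerical stability); it is not a description of how MPS implements them.

```swift
import Foundation

// ReLU neuron: passes positive values through and clamps negatives to zero.
func relu(_ x: Float) -> Float {
    return max(0, x)
}

// Softmax: turns arbitrary scores into probabilities that sum to 1.
// Subtracting the maximum first avoids overflow in exp for large scores.
func softmax(_ logits: [Float]) -> [Float] {
    let maxLogit = logits.max() ?? 0
    let exps = logits.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

let probabilities = softmax([1, 2, 3])
print(probabilities, relu(-2), relu(3))
```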

53 Convolution: core building block; recognizes features in input.

54 1 filter, 3 x 3; 1-channel input; 1-channel output.

60 16 5 x 5 filters; 3-channel input, 40 x 40; 16-channel output, 40 x 40.

61 3*16 5 x 5 filters (one per input channel for each output channel); 3-channel input, 40 x 40; 16-channel output, 40 x 40.
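The filter count on these slides can be checked with a naive CPU convolution. The sketch below is plain Swift and assumes zero padding so the output keeps the 40 x 40 input size, plus the usual neural-network convention of not flipping the kernel; `convolve` is a hypothetical reference function, not an MPS API, and the all-ones data is only there to make the results easy to verify.

```swift
// Naive multi-channel convolution with zero padding ("same" output size).
// input:   [inChannels][height][width]
// weights: [outChannels][inChannels][k][k]
func convolve(input: [[[Float]]], weights: [[[[Float]]]]) -> [[[Float]]] {
    let inChannels = input.count
    let height = input[0].count
    let width = input[0][0].count
    let k = weights[0][0].count
    let pad = k / 2
    let plane = [[Float]](repeating: [Float](repeating: 0, count: width), count: height)
    var output = [[[Float]]](repeating: plane, count: weights.count)
    for o in 0..<weights.count {
        for y in 0..<height {
            for x in 0..<width {
                var sum: Float = 0
                for c in 0..<inChannels {
                    for ky in 0..<k {
                        for kx in 0..<k {
                            let iy = y + ky - pad
                            let ix = x + kx - pad
                            // Taps that fall outside the image contribute zero.
                            if iy >= 0 && iy < height && ix >= 0 && ix < width {
                                sum += input[c][iy][ix] * weights[o][c][ky][kx]
                            }
                        }
                    }
                }
                output[o][y][x] = sum
            }
        }
    }
    return output
}

// 3-channel 40 x 40 input of ones; 16 output channels; 5 x 5 filters of ones.
let onesPlane = [[Float]](repeating: [Float](repeating: 1, count: 40), count: 40)
let input = [[[Float]]](repeating: onesPlane, count: 3)
let filter = [[Float]](repeating: [Float](repeating: 1, count: 5), count: 5)
let perOutput = [[[Float]]](repeating: filter, count: 3)
let weights = [[[[Float]]]](repeating: perOutput, count: 16)

let output = convolve(input: input, weights: weights)
let filterCount = weights.count * weights[0].count   // 16 * 3 two-dimensional filters
print(filterCount, output.count, output[0].count, output[0][0].count)
```

Each output channel sums the responses of three 5 x 5 filters, one per input channel, which is where the 3*16 figure comes from.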

65 Convolutional Neural Networks, new primitives (NEW): New Convolution weight types; Binary and XNOR Convolution; Sub-Pixel Convolution; Dilated Convolution; Convolution Transpose; L2Norm Pooling; Dilated Max Pooling; Log Softmax; Resampling (Lanczos, Bilinear); Upsampling; Arithmetic Operators (Addition, Subtraction, Multiplication, Division); New Neuron layers (Hard Sigmoid, SoftPlus, SoftSign, ELU).

67 Convolution, filter weight types (NEW): Single Precision Floating-Point; and, to reduce memory footprint and improve performance, Half Precision Floating-Point, 8-bit Integer, and Binary.

68 Convolution Primitives (NEW): Standard; Binary and XNOR; Dilated; Sub-Pixel; Transpose.

69 Binary and XNOR Convolution: same operation as regular Convolution, with improved performance and less memory.

71 Binary and XNOR Convolution. Regular Convolution: full-sized input and weights. Binary Convolution: full-sized input, binary weights. XNOR Convolution: binary input, binary weights.
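When both input and weights are binary, a dot product can be computed with bitwise XNOR and a population count instead of multiply-adds, which is the idea the name refers to. The sketch below shows that identity in plain Swift. The bit encoding (1 for +1, 0 for -1), the 64-bit packing, and the helper names are assumptions for illustration; the slides do not describe how MPS implements this internally.

```swift
// Dot product of two {-1, +1} vectors packed one bit per element
// (bit 1 encodes +1, bit 0 encodes -1). `count` is the number of valid bits.
func xnorDot(_ a: UInt64, _ b: UInt64, count: Int) -> Int {
    precondition(count > 0 && count <= 64)
    let mask: UInt64 = count == 64 ? UInt64.max : (UInt64(1) << UInt64(count)) - 1
    let matches = (~(a ^ b) & mask).nonzeroBitCount   // XNOR, then popcount
    return 2 * matches - count                        // matches minus mismatches
}

// Pack element i of a {-1, +1} vector into bit i.
func pack(_ values: [Int]) -> UInt64 {
    var bits: UInt64 = 0
    for (i, value) in values.enumerated() where value > 0 {
        bits |= UInt64(1) << UInt64(i)
    }
    return bits
}

// Reference computed with ordinary multiply-adds.
func plainDot(_ a: [Int], _ b: [Int]) -> Int {
    return zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

let x = [1, -1, 1, 1, -1, -1, 1, -1]
let w = [1, 1, -1, 1, -1, 1, 1, -1]
let fast = xnorDot(pack(x), pack(w), count: x.count)
let slow = plainDot(x, w)
print(fast, slow)
```

Packing 64 elements into one word is also where the memory saving comes from: each weight takes one bit instead of 32.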

74 Dilated Convolution, comparison to regular convolution. [Diagram: input and output of a regular convolution with a 3 x 3 kernel.]

76 Dilated Convolution, how it works. [Diagram: input and output with a 3 x 3 kernel, dilationFactorX = 2, dilationFactorY = 2.]
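With a dilation factor of 2, the nine taps of a 3 x 3 kernel are spread over a 5 x 5 region of the input, because the span is (3 - 1) * 2 + 1 = 5. The following plain-Swift sketch illustrates this under stated assumptions: a single channel, a square kernel, no padding, and a hypothetical reference function `dilatedConvolve` that is not an MPS API.

```swift
// Single-channel dilated convolution over the "valid" region only (no padding).
// Assumes a square kernel and an input at least as large as the dilated kernel.
func dilatedConvolve(_ input: [[Float]], kernel: [[Float]],
                     dilationX: Int, dilationY: Int) -> [[Float]] {
    let k = kernel.count
    let extentY = (k - 1) * dilationY + 1   // input rows spanned by the kernel
    let extentX = (k - 1) * dilationX + 1   // input columns spanned by the kernel
    let outHeight = input.count - extentY + 1
    let outWidth = input[0].count - extentX + 1
    var output = [[Float]](repeating: [Float](repeating: 0, count: outWidth), count: outHeight)
    for y in 0..<outHeight {
        for x in 0..<outWidth {
            var sum: Float = 0
            for ky in 0..<k {
                for kx in 0..<k {
                    // Neighboring taps are `dilation` pixels apart in the input.
                    sum += input[y + ky * dilationY][x + kx * dilationX] * kernel[ky][kx]
                }
            }
            output[y][x] = sum
        }
    }
    return output
}

// 5 x 5 input holding 0...24 in row-major order, 3 x 3 kernel of ones.
let input: [[Float]] = (0..<5).map { r in (0..<5).map { c in Float(r * 5 + c) } }
let ones = [[Float]](repeating: [Float](repeating: 1, count: 3), count: 3)
let regular = dilatedConvolve(input, kernel: ones, dilationX: 1, dilationY: 1)
let dilated = dilatedConvolve(input, kernel: ones, dilationX: 2, dilationY: 2)
print(regular[0][0], dilated[0][0])
```

The dilated kernel still has nine weights, so it sees a wider context at the same parameter cost.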

78 Sub-Pixel Convolution and Convolution Transpose Commonly used for upscaling

79 Upscaling using a box filter: a fixed operation with a constant filter. Input W x H; output 2W x 2H.

82 Sub-Pixel Convolution, how it works: the filters are Trained Parameters. One-channel input, W x H; 4 filters for 2x upscaling; one-channel output, 2W x 2H.

84 Sub-Pixel Convolution, how it works: the outputs of the 4 filters are reshuffled into the one-channel 2W x 2H output.
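The reshuffle step can be sketched in plain Swift as follows. The sketch assumes the four filter outputs are available as four W x H channels and that channel 2 * (y % 2) + (x % 2) supplies output pixel (x, y); that channel ordering and the function name `pixelShuffle2x` are assumptions for illustration, since the slides do not specify the layout MPS uses.

```swift
// Interleave four W x H channels into one 2W x 2H image.
// Assumed layout: channel index = 2 * (y % 2) + (x % 2) for output pixel (x, y).
func pixelShuffle2x(_ channels: [[[Float]]]) -> [[Float]] {
    precondition(channels.count == 4, "2x upscaling needs 4 channels")
    let height = channels[0].count
    let width = channels[0][0].count
    var output = [[Float]](repeating: [Float](repeating: 0, count: 2 * width),
                           count: 2 * height)
    for y in 0..<(2 * height) {
        for x in 0..<(2 * width) {
            let channel = 2 * (y % 2) + (x % 2)
            output[y][x] = channels[channel][y / 2][x / 2]
        }
    }
    return output
}

// Four 1 x 2 channels (height 1, width 2), each filled with its own index.
let channels: [[[Float]]] = (0..<4).map { c in [[Float(c), Float(c)]] }
let upscaled = pixelShuffle2x(channels)
print(upscaled)
```

Unlike the box filter, the values being interleaved come from trained filters, so the upscaling itself is learned.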

85 Convolution Transpose How it works Input W x H

87 Convolution Transpose How it works Intermediate Result 2W x 2H Output W x H

96 New Convolution Primitives. Example: colorizing black and white images. Colorization network* layers: Convolution, Dilated Convolution, Batch Normalization, Convolution Transpose, SoftMax. Dilated Convolution: integrate wider global context. Convolution Transpose: upscale output. *Colorful Image Colorization, Richard Zhang, Phillip Isola, Alexei A. Efros, ECCV 2016.

97 Demo Image colorization

99 Performance Improvements in iOS, Inception-v3 network* (percentage improvement, higher is better): iPhone 6s, 22%; iPhone 7 Plus, 22%; iPad Pro 9.7", 29%; iPad Pro 10.5", 21%. *Rethinking the Inception Architecture for Computer Vision, Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna, CVPR 2015.

105 Neural Network Graph API, overview (NEW): describe a neural network using the graph API. Filter nodes represent operations; image nodes represent data. [Diagram legend: Convolution, Pooling (Avg.), Pooling (Max.), Fully-Connected, SoftMax, Concatenation, Image.]

112 Neural Network Graph API, ease of use: compact representation; save and restore across platforms (NSSecureCoding); initialize once, reuse; execute the graph on the GPU with a single call; no intermediate images to manage, just input/output; auto-configuration of image sizes, padding, centering. MetalImageRecognition code sample*: 4x less code with the NN Graph API.

118 Neural Network Graph API, deliver best performance (NEW): easy to parallelize between CPU and GPU; fuse graph nodes; execute graph nodes concurrently; optimize away Concatenation nodes. [Diagram legend: Convolution, Pooling (Avg.), Pooling (Max.), Fully-Connected, SoftMax, Concatenation, Image.]

119 Filter Nodes, convolution node: create an MPSCNNConvolutionNode with a data source provider

    let conv1 = MPSCNNConvolutionNode(source: MPSNNImageNode(handle: nil),
                                      weights: MyWeights(file: "conv1.dat"))

122 Feeding Parameters to Convolution Layer: just-in-time loading and purging of weights data; minimize memory footprint

    class MyWeights: NSObject, MPSCNNConvolutionDataSource {
        // Initialize the data source object
        init(file: String) { … }

        public func load() -> Bool { … }
        public func descriptor() -> MPSCNNConvolutionDescriptor { … }
        public func weights() -> UnsafeMutableRawPointer { … }
        public func purge() { … }
    }

129 Example: create a graph (conv1, pool1, conv2, pool2, conv3, pool3, conv4, fc1, fc2)

    func makeGraph() -> MPSNNImageNode {
        let conv1 = MPSCNNConvolutionNode(source: MPSNNImageNode(handle: nil),
                                          weights: MyWeights(file: "conv1.dat"))
        let pool1 = MPSCNNPoolingMaxNode(source: conv1.resultImage, filterSize: 2)
        let conv2 = MPSCNNConvolutionNode(source: pool1.resultImage,
                                          weights: MyWeights(file: "conv2.dat"))
        let pool2 = MPSCNNPoolingMaxNode(source: conv2.resultImage, filterSize: 2)
        let conv3 = MPSCNNConvolutionNode(source: pool2.resultImage,
                                          weights: MyWeights(file: "conv3.dat"))
        let pool3 = MPSCNNPoolingMaxNode(source: conv3.resultImage, filterSize: 2)
        let conv4 = MPSCNNConvolutionNode(source: pool3.resultImage,
                                          weights: MyWeights(file: "conv4.dat"))
        let fc1 = MPSCNNFullyConnectedNode(source: conv4.resultImage,
                                           weights: MyWeights(file: "fc1.dat"))
        let fc2 = MPSCNNFullyConnectedNode(source: fc1.resultImage,
                                           weights: MyWeights(file: "fc2.dat"))
        return fc2.resultImage
    }

130 Example: execute graph on the GPU

    // Metal setup
    // (the make... calls return optionals in current SDKs, hence the force unwraps)
    let device = MTLCreateSystemDefaultDevice()!
    let commandQueue = device.makeCommandQueue()!
    let commandBuffer = commandQueue.makeCommandBuffer()!

    // Initialize graph
    let graph = MPSNNGraph(device: device, resultImage: makeGraph())

    // Create input image
    let input = MPSImage(texture: texture, …)

    // Encode graph
    let output = graph?.encode(to: commandBuffer, sourceImages: [input])

    // Tell GPU to start executing work and wait until GPU work is done
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

137 Example: execute graph on the GPU (same code as slide 130). [Timeline diagram: the CPU row shows encode task1, encode task2, and so on; the GPU row shows execute task1, execute task2, and so on. Because the CPU waits for the GPU to finish before encoding the next task, bubbles appear on both rows.]

138 Example: execute graph on the GPU asynchronously

    // Metal setup
    let device = MTLCreateSystemDefaultDevice()!

    // Initialize graph
    let graph = MPSNNGraph(device: device, resultImage: makeGraph())

    // Create input image
    let input = MPSImage(texture: texture, …)

    // Encode graph
    let output = graph?.executeAsync(sourceImages: [input]) { resultImage, error in
        // check for error and use resultImage inside closure
    }

    // Don't wait, encode new GPU task

143 // Example: execute graph on the GPU asynchronously (same code as slide 138, shown beside a CPU/GPU timeline)
Timeline, CPU row: encode task1, encode task2, encode task3, encode task4, encode task5, encode task6.
Timeline, GPU row: execute task1, execute task2, execute task3, execute task4, execute task5.
Horizontal axis: time.
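The "don't wait, encode new GPU task" pattern can be imitated without Metal. The sketch below is only an illustration under stated assumptions: `FakeGPU`, its `executeAsync(task:completion:)` method, and `runPipeline` are hypothetical names, and a serial `DispatchQueue` stands in for the GPU. It is not the MPS API.

```swift
import Dispatch

// Hypothetical stand-in for the GPU: a serial queue that runs tasks in submission order.
final class FakeGPU {
    private let queue = DispatchQueue(label: "fake.gpu")
    private let group = DispatchGroup()
    private var executed: [String] = []   // mutated only on `queue`

    // Returns immediately; `completion` runs later, once the task has executed.
    func executeAsync(task: Int, completion: @escaping (Int) -> Void) {
        queue.async(group: group) {
            self.executed.append("execute task\(task)")
            completion(task * task)        // placeholder for a result image
        }
    }

    // Blocks until every submitted task has completed, then returns the execution log.
    func waitUntilCompleted() -> [String] {
        group.wait()
        return executed
    }
}

func runPipeline(taskCount: Int) -> (executed: [String], results: [Int]) {
    let gpu = FakeGPU()
    var results: [Int] = []
    // The CPU "encodes" all tasks back to back without waiting for any of them.
    for task in 1...taskCount {
        gpu.executeAsync(task: task) { value in results.append(value) }
    }
    let executed = gpu.waitUntilCompleted()
    return (executed, results)
}

let run = runPipeline(taskCount: 6)
print(run.executed)
print(run.results)
```

The only blocking call is the final wait; in a real application the completion handler would consume each result and no wait would be needed.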

144 Demo Inception-v3 using Neural Network Graph API

145 Agenda: Recap on Convolutional Neural Networks (CNN); Convolutional Neural Networks: New Primitives; Neural Network Graph API; Recurrent Neural Networks (RNN)

146 What Are Recurrent Neural Networks?

147 CNN: one-to-one. One input: an image.

148 CNN: one-to-one. One input (image) → CNN inference → one output: a set of probabilities (for example: dog, grass).

149 RNN: sequences, one-to-many. Diagram so far: CNN inference.

150 RNN: sequences, one-to-many. One input (the set of probabilities from CNN inference) → RNN inference → a sequence of outputs: words forming an image caption, for example "A black and white dog laying in the grass".

151 RNN: sequences, many-to-many. Sequence of inputs: a sentence in English ("A black and white dog laying in the grass") → RNN inference.

152 RNN: sequences, many-to-many. Sequence of inputs: a sentence in English ("A black and white dog laying in the grass") → RNN inference → sequence of outputs: the translated sentence ("Чёрно-белая собака лежит на траве", "Mustan ja valkoisen värinen koira makaa ruohikolla").

153 Recurrent Neural Networks: new primitives (NEW): Single Gate, Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Minimally Gated Unit (MGU)

154 Single Gate RNN: the Recurrent Unit enables previous output to affect the output of subsequent iterations. Diagram: Input → Recurrent Unit → Output.
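A minimal CPU sketch of one recurrent step, to make the feedback of the previous output concrete. Everything here is an assumption for illustration: the `SingleGateRNN` type, the `matVec` helper, and the choice of `tanh` as the activation are not taken from the slides, and this is not how MPS implements the primitive.

```swift
import Foundation

// Row-major matrix times vector.
func matVec(_ m: [[Double]], _ v: [Double]) -> [Double] {
    return m.map { row in zip(row, v).reduce(0.0) { $0 + $1.0 * $1.1 } }
}

// Hypothetical single gate recurrent unit: output = tanh(W * input + U * previousOutput + b).
struct SingleGateRNN {
    var inputWeights: [[Double]]      // W: outputSize x inputSize
    var recurrentWeights: [[Double]]  // U: outputSize x outputSize
    var bias: [Double]                // b: outputSize

    func step(input: [Double], previousOutput: [Double]) -> [Double] {
        let wx = matVec(inputWeights, input)
        let uh = matVec(recurrentWeights, previousOutput)
        return (0..<bias.count).map { tanh(wx[$0] + uh[$0] + bias[$0]) }
    }
}

// The same input produces a different output on the second iteration,
// because the first output is fed back in.
let unit = SingleGateRNN(inputWeights: [[1.0]], recurrentWeights: [[1.0]], bias: [0.0])
let first = unit.step(input: [0.5], previousOutput: [0.0])
let second = unit.step(input: [0.5], previousOutput: first)
print(first, second)
```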

155 Long Short-Term Memory (LSTM): built from Single Gate RNNs. Has an internal Memory Cell. Gates control information flow inside the LSTM and what is stored in the Memory Cell. Diagram: Input → LSTM (containing the Memory Cell) → Output.

157-164 LSTM Architecture (diagram built up over several slides)
Overview: Input → LSTM (Memory Cell) → Output. Inside the Memory Cell, Old Memory is turned into New Memory.
Legend: M = Matrix-Matrix or Matrix-Vector Multiply; * and + = point-wise operations.
Forget Gate (Previous Output and Input, each through M): what to keep from old memory.
Input Gate and Cell Gate (Previous Output and Input, each through M): how new input affects new memory.
Output Gate (Previous Output and Input, each through M): how previous output, current input, new memory affect new output.
Data flow: New Memory = Old Memory * Forget Gate + Input Gate * Cell Gate; the Output Gate is then combined point-wise (*) with the New Memory to give the Output.
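The data flow in the diagram can be written out as a small CPU reference. This is a sketch under assumptions, not the MPS implementation: the `Gate` and `LSTMCell` types are hypothetical, sigmoid is assumed for the forget, input, and output gates, and `tanh` is assumed for the cell gate and for the memory before it is scaled by the output gate.

```swift
import Foundation

func sigmoid(_ x: Double) -> Double { return 1.0 / (1.0 + exp(-x)) }

// Row-major matrix times vector (the "M" boxes in the diagram).
func matVec(_ m: [[Double]], _ v: [Double]) -> [Double] {
    return m.map { row in zip(row, v).reduce(0.0) { $0 + $1.0 * $1.1 } }
}

// One gate: M * input + M * previousOutput + bias, before the activation.
struct Gate {
    var inputWeights: [[Double]]
    var recurrentWeights: [[Double]]
    var bias: [Double]

    func preActivation(input: [Double], previousOutput: [Double]) -> [Double] {
        let wx = matVec(inputWeights, input)
        let uh = matVec(recurrentWeights, previousOutput)
        return (0..<bias.count).map { wx[$0] + uh[$0] + bias[$0] }
    }
}

struct LSTMCell {
    var forget: Gate, input: Gate, cell: Gate, output: Gate

    func step(x: [Double], previousOutput h: [Double], oldMemory c: [Double])
        -> (output: [Double], newMemory: [Double]) {
        let f = forget.preActivation(input: x, previousOutput: h).map(sigmoid)    // what to keep
        let i = input.preActivation(input: x, previousOutput: h).map(sigmoid)     // how much new input
        let g = cell.preActivation(input: x, previousOutput: h).map { tanh($0) }  // candidate memory
        let o = output.preActivation(input: x, previousOutput: h).map(sigmoid)    // what to expose
        // Point-wise: new memory = old memory * forget + input * cell.
        let newMemory = (0..<c.count).map { f[$0] * c[$0] + i[$0] * g[$0] }
        // Point-wise: output = output gate * tanh(new memory).
        let newOutput = (0..<c.count).map { o[$0] * tanh(newMemory[$0]) }
        return (newOutput, newMemory)
    }
}

// With all-zero weights every sigmoid gate is 0.5 and the cell gate is 0,
// so half of the old memory is kept and nothing new is written.
let zeroGate = Gate(inputWeights: [[0.0]], recurrentWeights: [[0.0]], bias: [0.0])
let lstm = LSTMCell(forget: zeroGate, input: zeroGate, cell: zeroGate, output: zeroGate)
let result = lstm.step(x: [1.0], previousOutput: [0.0], oldMemory: [2.0])
print(result.newMemory, result.output)
```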

165 // Example: Creating a LSTM RNN
// Create a LSTM layer descriptor
let descriptor = MPSLSTMDescriptor()
descriptor.inputFeatureChannels = inputSize
descriptor.outputFeatureChannels = outputSize
// Create and initialize gate weights with trained parameters, using a data source provider
// for just-in-time loading and purging of weights
descriptor.forgetGateInputWeights = MyWeights(file: "forgetgateweights.dat")
descriptor.cellGateInputWeights = MyWeights(file: "cellgateweights.dat")
// Initialize the rest of the gates
// Metal setup
let device = MTLCreateSystemDefaultDevice()! // Also get commandQueue and commandBuffer
// Create a LSTM layer
let layer = MPSRNNMatrixInferenceLayer(device: device, rnnDescriptor: descriptor)

169 // Example: Running a LSTM RNN on the GPU
// Create input and output data
var inputSequence: [MPSMatrix] = []
var outputSequence: [MPSMatrix] = []
for _ in 0..<N {
    // Matrix size is (1, inputSize), inputSize is number of columns
    inputSequence.append(MPSMatrix(...))
    // Matrix size is (1, outputSize), outputSize is number of columns
    outputSequence.append(MPSMatrix(...))
}
// Submit work to GPU
layer.encodeSequence(commandBuffer: commandBuffer,
                     sourceMatrices: inputSequence,
                     destinationMatrices: outputSequence,
                     recurrentInputState: nil,
                     recurrentOutputStates: nil)
// Tell GPU to start executing work
commandBuffer.commit()

172 Example: Image Captioning, Training. Diagram: training examples with captions → Training to Caption Images → Trained Parameters. The network being trained has two parts: a CNN (determine what is depicted in the image) and an RNN (generate image caption).

175 Example: Image Captioning, Inference. A CNN determines what is depicted in the image and an RNN generates the image caption. Trained Parameters control the CNN layers and the RNN gates.

179 Example: Image Captioning, Inference. Example output caption: "a man riding a wave on top of a surfboard".

180 Example: Image Captioning, Inference. Image Captioning Network*: Inception-v3 (determine what is depicted in the image) followed by an LSTM with a Memory Cell (generate image caption).
*Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge, Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, IEEE Transactions on Pattern Analysis and Machine Intelligence,

181 Example: Image Captioning, LSTM initialization phase. Legend: Convolution, Pooling (Avg.), Pooling (Max.), Fully-Connected, SoftMax. Diagram: Inception-v3 → Feature vector → LSTM (Memory Cell).

185 Example: Image Captioning, Caption generation phase. Legend: Convolution, Pooling (Avg.), Pooling (Max.), Fully-Connected, SoftMax.
Step 1: Input: sentence start token → LSTM (Memory Cell) → Output: 3 best one-word captions.
Step 2: Input: 3 best one-word captions → LSTM (Memory Cell) → Output: 3 best two-word captions.
...
Last step: Input: 3 best N-word captions → LSTM (Memory Cell) → End.

190 Caption Generation, Iteration 1 to Iteration 2. Columns: Caption, Probability (the probability values are not preserved in this transcript).
Iteration 1, top three captions: man; a; the.
Iteration 2, candidates: man on, man in, man surfing; a man, a person, a surfer; the man, the surfer, the young.

196 Caption Generation, Iteration 2 to Iteration 3. Columns: Caption, Probability.
Iteration 2, candidates: man on, man in, man surfing; a man, a person, a surfer; the man, the surfer, the young.
Iteration 3, candidates: a man riding, a man on, a man is; a person riding, a person on, a person in; a surfer is, a surfer riding, a surfer in.

198 Caption Generation, Iteration 3 to Iteration 4. Columns: Caption, Probability.
Iteration 3, candidates: a man riding, a man on, a man is; a person riding, a person on, a person in; a surfer is, a surfer riding, a surfer in.
Iteration 4, candidates: a man riding a, a man riding on, a man riding the; a man on a, a man on his, a man on the; a man is surfing, a man is riding, a man is on.

200 Caption Generation Top three captions: 1. a man riding a wave on top of a surfboard 2. a man on a surfboard riding a wave 3. a man riding a wave on a surfboard
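The iteration scheme shown on the Caption Generation slides (keep the three best captions, extend each with its best next words, keep the three best of the results) resembles a beam search. The sketch below is a hedged illustration only: `Caption`, `beamSearch`, and `toyModel` are hypothetical, the toy word table stands in for the LSTM and SoftMax output, and scoring a caption as the product of its word probabilities is an assumption.

```swift
struct Caption {
    var words: [String]
    var probability: Double
}

// Keeps the `beamWidth` most probable captions after each iteration.
// `nextWords` is a stand-in for the LSTM: given the words so far, it returns
// candidate next words with their probabilities.
func beamSearch(start: String, end: String, beamWidth: Int, maxLength: Int,
                nextWords: ([String]) -> [(String, Double)]) -> [Caption] {
    var beam = [Caption(words: [start], probability: 1.0)]
    var finished: [Caption] = []
    for _ in 0..<maxLength {
        var candidates: [Caption] = []
        for caption in beam {
            // Extend each surviving caption with its best next words.
            let best = nextWords(caption.words).sorted { $0.1 > $1.1 }.prefix(beamWidth)
            for (word, p) in best {
                candidates.append(Caption(words: caption.words + [word],
                                          probability: caption.probability * p))
            }
        }
        candidates.sort { $0.probability > $1.probability }
        beam = []
        for candidate in candidates.prefix(beamWidth) {
            if candidate.words.last == end {
                finished.append(candidate)   // caption is complete
            } else {
                beam.append(candidate)       // caption continues next iteration
            }
        }
        if beam.isEmpty { break }
    }
    let all = (finished + beam).sorted { $0.probability > $1.probability }
    return Array(all.prefix(beamWidth))
}

// Hypothetical toy model keyed by the last word only.
let toyModel: [String: [(String, Double)]] = [
    "<s>": [("a", 0.6), ("the", 0.3), ("man", 0.1)],
    "a": [("man", 0.7), ("surfer", 0.2), ("person", 0.1)],
    "the": [("man", 0.5), ("surfer", 0.5)],
    "man": [("surfing", 0.6), ("</s>", 0.4)],
    "surfer": [("</s>", 1.0)],
    "person": [("</s>", 1.0)],
    "surfing": [("</s>", 1.0)],
]

let captions = beamSearch(start: "<s>", end: "</s>", beamWidth: 3, maxLength: 5) { words in
    toyModel[words.last ?? ""] ?? []
}
for caption in captions {
    print(caption.words.joined(separator: " "), caption.probability)
}
```

With this toy table the best caption is "<s> a man surfing </s>". A real captioner would replace `toyModel` with the LSTM output at each step.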

202 Demo Image captioning CNN + LSTM

203 Summary: GPU accelerated primitives. Expanded support for Image Processing and Convolutional Neural Networks. Added support for Linear Algebra and Recurrent Neural Networks. Optimized for iOS and macOS. New Neural Network Graph API.

204 Related Sessions
Introducing Metal 2, Executive Ballroom, Tuesday 1:50PM
Introducing Core ML, Hall 3, Tuesday 3:10PM
VR with Metal 2, Hall 3, Wednesday 10:00AM
Vision Framework: Building on Core ML, Hall 2, Wednesday 3:10PM
Core ML in depth, Hall 3, Thursday 09:00AM
Accelerate and Sparse Solvers, Executive Ballroom, Thursday 10:00AM
Metal 2 Optimization and Debugging, Grand Ballroom B, Thursday 3:10PM

205 Labs: Metal 2 Lab, Technology Lab, Friday 09:00AM to 12:00PM

206 More Information
