Usable while performant: the challenges building. Soumith Chintala

3 Problem Statement Deep Learning Workloads

4-11 Problem Statement Deep Learning Workloads
    for epoch in range(max_epochs):
        # training_data: N samples, each of some shape D
        for data, target in training_data:
            # data: a mini-batch of M samples (M << N), each of shape D
            # model: a neural network with weights
            output = model(data)
            loss = F.nll_loss(output, target)
            # backpropagation: compute derivatives of the loss with respect
            # to the weights, using the chain rule
            loss.backward()
            # update the weights using the computed gradients
            optimizer.step()
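The loop on these slides leaves `model`, `training_data` and `optimizer` undefined. A minimal runnable sketch follows, assuming a hypothetical toy classification problem (random inputs and labels, a two-layer network, plain SGD); none of these choices come from the talk. It also adds `optimizer.zero_grad()`, which the slides omit but which PyTorch needs so that gradients do not accumulate across iterations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy data: N samples of shape D, split into mini-batches of M samples.
N, D, num_classes, M = 256, 20, 4, 32
inputs = torch.randn(N, D)
targets = torch.randint(0, num_classes, (N,))
training_data = [(inputs[i:i + M], targets[i:i + M]) for i in range(0, N, M)]

# A small network with weights. LogSoftmax is used because F.nll_loss
# expects log-probabilities as input.
model = nn.Sequential(
    nn.Linear(D, 64),
    nn.ReLU(),
    nn.Linear(64, num_classes),
    nn.LogSoftmax(dim=1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def mean_loss():
    """Loss over the whole dataset, computed without tracking gradients."""
    with torch.no_grad():
        return F.nll_loss(model(inputs), targets).item()


loss_before = mean_loss()
max_epochs = 20
for epoch in range(max_epochs):
    for data, target in training_data:
        optimizer.zero_grad()              # clear gradients from the previous step
        output = model(data)               # forward pass
        loss = F.nll_loss(output, target)
        loss.backward()                    # backpropagation
        optimizer.step()                   # weight update
loss_after = mean_loss()
```

Comparing `loss_before` with `loss_after` is a quick sanity check that the loop is actually training.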

12-13 Types of typical operators: Convolution (figure by Vincent Dumoulin)

14 Types of typical operators: Convolution (figure by Vincent Dumoulin)
    # stride 1, no padding
    for oc in range(output_channels):
        for ic in range(input_channels):
            for h in range(output_height):
                for w in range(output_width):
                    for kh in range(kernel_height):
                        for kw in range(kernel_width):
                            output[oc][h][w] += input[ic][h + kh][w + kw] * kernel[oc][ic][kh][kw]
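The six nested loops above can be turned into a small runnable function. This is a sketch under stated assumptions: stride 1, no padding, nested Python lists as tensors, and a hypothetical function name `conv2d_naive`. It is meant to show the loop structure, not how any framework implements convolution.

```python
def conv2d_naive(inp, kernel):
    """Direct convolution (stride 1, no padding).

    inp:    [in_channels][height][width]
    kernel: [out_channels][in_channels][kernel_height][kernel_width]
    """
    in_channels = len(inp)
    height, width = len(inp[0]), len(inp[0][0])
    out_channels = len(kernel)
    kernel_height, kernel_width = len(kernel[0][0]), len(kernel[0][0][0])
    out_height = height - kernel_height + 1
    out_width = width - kernel_width + 1

    output = [[[0.0] * out_width for _ in range(out_height)]
              for _ in range(out_channels)]
    for oc in range(out_channels):
        for ic in range(in_channels):
            for h in range(out_height):
                for w in range(out_width):
                    for kh in range(kernel_height):
                        for kw in range(kernel_width):
                            output[oc][h][w] += (
                                inp[ic][h + kh][w + kw] * kernel[oc][ic][kh][kw]
                            )
    return output


# One 3x3 input channel and one 2x2 kernel of ones (a box filter).
image = [[[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]]
box = [[[[1.0, 1.0],
         [1.0, 1.0]]]]
result = conv2d_naive(image, box)  # [[[12.0, 16.0], [24.0, 28.0]]]
```

Each output pixel is the sum of a 2x2 window of the input, which is why the top-left value is 1 + 2 + 4 + 5 = 12.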

15 Types of typical operators (figure by Vincent Dumoulin)

16 Types of typical operators: Matrix Multiply (figure by Wikipedia)
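For comparison with the convolution loops, a matrix multiply can be written as a triple loop. The sketch below uses nested Python lists and a hypothetical helper name `matmul_naive`; real frameworks dispatch to tuned BLAS or GPU kernels instead.

```python
def matmul_naive(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]
    return out


product = matmul_naive([[1.0, 2.0], [3.0, 4.0]],
                       [[5.0, 6.0], [7.0, 8.0]])  # [[19.0, 22.0], [43.0, 50.0]]
```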

17 Types of typical operators: Pointwise operations (figure by Wikipedia)
    for (i = 0; i < data_length; i++) {
        output[i] = input1[i] + input2[i];
    }

18 Types of typical operators: Reduction operations (figure by Wikipedia)
    double sum = 0.0;
    for (i = 0; i < data_length; i++) {
        sum += input[i];
    }
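The two C loops above differ in one way that matters for performance: a pointwise operation computes each output element independently, while a reduction folds every element into a single accumulator. A small Python sketch with made-up input values shows both side by side.

```python
input1 = [1.0, 2.0, 3.0, 4.0]
input2 = [10.0, 20.0, 30.0, 40.0]

# Pointwise: output[i] depends only on element i of each input,
# so every iteration could run in parallel.
output = [a + b for a, b in zip(input1, input2)]

# Reduction: each iteration reads and writes the same accumulator,
# so the iterations depend on each other.
total = 0.0
for value in input1:
    total += value
```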

19-20 Chained Together (figure by Wikipedia; labels: Input, Output)

21 Chained Together (figure by Wikipedia; labels: Input, Output, "deep")

22 Chained Together (figure by Wikipedia; labels: Input, Output, "deep", recurrent)

23 Trained with Gradient Descent (figure by Wikipedia; labels: Input, Output, "deep", recurrent)

24 Problem Statement Deep Learning Workloads: an easy way to see recurrence
    for epoch in range(max_epochs):
        for data, target in training_data:
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()

25 Problem Statement Deep Learning Workloads: an easy way to see recurrence
    for epoch in range(max_epochs):
        for data, target in training_data:
            output, hidden = [], zeros()   # zeros(): initial hidden state
            for t in range(data.size(0)):  # one model call per time step
                out, hidden = model(data[t], hidden)
                output.append(out)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()

26-27 Problem Statement Deep Learning Workloads
- Vision models
  - model is very deep, straight-line chain with no recurrence
  - lots of convolutions
  - typically run on GPUs
- NLP models (LSTM-RNN)
  - model is 1 to 4 "layers" deep
  - two matmuls across space and time along with pointwise ops
  - typically run on CPUs if small, GPUs if large
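The "two matmuls along with pointwise ops" pattern can be sketched with a plain recurrent cell. Two assumptions are made here: a simple tanh RNN is used instead of an LSTM to keep the example short, and the two matmuls are read as input-to-hidden ("space") and hidden-to-hidden ("time"). The weights and the helper names `matvec` and `rnn_step` are illustrative only.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(w_x, w_h, x, prev_h):
    i2h = matvec(w_x, x)        # matmul across "space": input -> hidden
    h2h = matvec(w_h, prev_h)   # matmul across "time": previous hidden -> hidden
    # Pointwise ops: elementwise add followed by elementwise tanh.
    return [math.tanh(a + b) for a, b in zip(i2h, h2h)]


w_x = [[0.5, -0.5], [0.25, 0.75]]   # hidden_size x input_size
w_h = [[0.0, 1.0], [1.0, 0.0]]      # hidden_size x hidden_size
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

hidden = [0.0, 0.0]
outputs = []
for x_t in sequence:                # the recurrence: each step consumes the last hidden state
    hidden = rnn_step(w_x, w_h, x_t, hidden)
    outputs.append(hidden)
```

Because every step needs the hidden state of the step before it, the time loop cannot be parallelised the way a batch of independent images can.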

28 Deep Learning Frameworks: make this easy to program
    for epoch in range(max_epochs):
        for data, target in training_data:
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()

29 Pre-PyTorch: meta programming (Caffe), meta programming (Theano), imperative (Torch-7)

30-32 Caffe
- define protobuf, run via command-line utility
- small, efficient library; could do convnets well

33-36 Theano
- meta-program Theano VM via Python API
- whole program optimizations, graph fusion
- graphs took minutes to hours to compile and start

37-40 Torch-7
- imperative programming in Lua
- tied closely to underlying C89 implementations
- Lua lacked good tooling and ecosystem

41 What is PyTorch?
- Ndarray library with GPU support
- automatic differentiation engine
- gradient based optimization package
- Utilities (data loading, etc.)
- used as a Numpy-alternative and for Deep Learning and Reinforcement Learning

42 ndarray library
- np.ndarray <-> torch.Tensor
- 200+ operations, similar to numpy
- very fast acceleration on NVIDIA GPUs

43 ndarray library: Numpy and PyTorch

44 ndarray / Tensor library

48 NumPy bridge

49 NumPy bridge: zero memory-copy, very efficient
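The zero-copy claim can be observed directly: a tensor created with `torch.from_numpy` and an array returned by `Tensor.numpy()` share the same underlying buffer, so an in-place change on one side is visible on the other. A minimal sketch for CPU tensors, with arbitrary example values:

```python
import numpy as np
import torch

array = np.ones(5)
tensor = torch.from_numpy(array)   # shares memory with `array`, no copy

array += 1                         # in-place change on the NumPy side...
tensor_sum = tensor.sum().item()   # ...is visible through the tensor

back = tensor.numpy()              # also a view of the same buffer
tensor.mul_(2)                     # in-place change on the PyTorch side
```

After the last line every element of both `array` and `back` is 4.0, since all three objects refer to the same memory.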

52 Seamless GPU Tensors
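The slide content is not preserved here, so the following is only a sketch of what moving tensors to a GPU usually looks like in PyTorch. It falls back to the CPU when no CUDA device is available, and the tensor values are arbitrary.

```python
import torch

# Pick the GPU if one is present, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.ones(2, 3, device=device)                                # created on the device
b = torch.arange(6, dtype=torch.float32).reshape(2, 3).to(device)  # moved to the device
c = (a + b).cpu()                                                  # same operator on either device
```

The arithmetic line is identical for CPU and GPU tensors; only the placement calls differ.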

53 automatic differentiation engine for deep learning and reinforcement learning

54-59 PyTorch Autograd
    W_h = torch.randn(20, 20, requires_grad=True)
    W_x = torch.randn(20, 10, requires_grad=True)
    x = torch.randn(1, 10)
    prev_h = torch.randn(1, 20)
    i2h = torch.mm(W_x, x.t())          # graph node: MM
    h2h = torch.mm(W_h, prev_h.t())     # graph node: MM
    next_h = i2h + h2h                  # graph node: Add
    next_h = next_h.tanh()              # graph node: Tanh
    next_h.backward(torch.ones(20, 1))  # gradient has the shape of next_h, (20, 1)
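One way to build trust in the autograd result is to compare it with the chain rule worked out by hand. The sketch below repeats the computation from the slides with three corrections that are needed for it to run (`True` instead of `true`, consistent `W_x`/`W_h` names, and a `(20, 1)` gradient to match the shape of `next_h`), then checks `W_x.grad` and `W_h.grad` against the analytic expression. The fixed seed is an addition for reproducibility.

```python
import torch

torch.manual_seed(0)

W_h = torch.randn(20, 20, requires_grad=True)
W_x = torch.randn(20, 10, requires_grad=True)
x = torch.randn(1, 10)
prev_h = torch.randn(1, 20)

i2h = torch.mm(W_x, x.t())          # (20, 10) @ (10, 1) -> (20, 1)
h2h = torch.mm(W_h, prev_h.t())     # (20, 20) @ (20, 1) -> (20, 1)
next_h = (i2h + h2h).tanh()

# Backpropagate a gradient of ones, i.e. differentiate next_h.sum().
next_h.backward(torch.ones(20, 1))

# Chain rule by hand: d tanh(z) / dz = 1 - tanh(z)^2, and z = W_x x^T + W_h h^T,
# so the gradients are outer products of that factor with x and prev_h.
dz = 1 - next_h.detach() ** 2        # (20, 1)
expected_W_x_grad = dz @ x           # (20, 1) @ (1, 10) -> (20, 10)
expected_W_h_grad = dz @ prev_h      # (20, 1) @ (1, 20) -> (20, 20)
```

`torch.allclose(W_x.grad, expected_W_x_grad)` can be used to confirm that the two agree up to floating-point tolerance.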

60 Neural Networks

63 Optimization package: SGD, Adagrad, RMSProp, LBFGS, etc.
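The optimizers listed above share one interface in `torch.optim`, so they can be swapped without touching the rest of the training code. The sketch below minimises a made-up quadratic with each of them; the learning rates, step counts and the helper name `minimize` are illustrative assumptions. A closure is passed to `step` because LBFGS needs to re-evaluate the loss, and the other optimizers accept one as well. Note that the class is spelled `torch.optim.RMSprop`.

```python
import torch


def minimize(make_optimizer, steps=200):
    """Minimise sum(w^2) from a fixed starting point and return the final loss."""
    w = torch.tensor([3.0, -2.0], requires_grad=True)
    optimizer = make_optimizer([w])

    def closure():
        optimizer.zero_grad()
        loss = (w ** 2).sum()
        loss.backward()
        return loss

    for _ in range(steps):
        optimizer.step(closure)
    return (w ** 2).sum().item()


initial_loss = 3.0 ** 2 + (-2.0) ** 2
results = {
    "SGD": minimize(lambda params: torch.optim.SGD(params, lr=0.1)),
    "Adagrad": minimize(lambda params: torch.optim.Adagrad(params, lr=0.5)),
    "RMSprop": minimize(lambda params: torch.optim.RMSprop(params, lr=0.01)),
    "LBFGS": minimize(lambda params: torch.optim.LBFGS(params), steps=5),
}
```

Each entry in `results` should end up below `initial_loss`; how far below depends on the optimizer and on the hyperparameters chosen here.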

64-65 Bootstrapping
- Writing Dataset loaders
- Building models
- Implementing Training loop
- Checkpointing models
- Interfacing with environments
- Dealing with GPUs
- Building optimizers
- Building Baselines
Python + PyTorch: an environment to do all of this
bootstrapping the Python tooling stack for good UX

66 Python is slow
- interpreted
- Global interpreter lock
- application logic is an order of magnitude slower than C++
- moved autograd engine to C++
- moved everything to ATen
  - side effect: a clean C++ API

Adam Paszke, Sam Gross, Soumith Chintala, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Alban Desmaison, Andreas Kopf, Edward Yang, Zach Devito,