Lecture 4: Backpropagation and Automatic Differentiation. CSE599W: Spring 2018

Size: px

Start display at page:

Download "Lecture 4: Backpropagation and Automatic Differentiation. CSE599W: Spring 2018"

Beverley Freeman
5 years ago
Views:

1 Lecture 4: Backpropagation and Automatic Differentiation CSE599W: Spring 208

2 Announcement Assignment is out today, due in 2 weeks (Apr 9 th, 5pm)

3 Model Training Overview layer extractor layer2 extractor predictor Objective Training

4 Symbolic Differentiation Input formulae is a symbolic expression tree (computation graph). Implement differentiation rules, e.g., sum rule, product rule, chain rule!(# %)!' For complicated functions, the resultant expression can be exponentially large. Wasteful to keep around intermediate symbolic expressions if we only need a numeric value of the gradient in the end Prone to error =!#!'!%!'!(#%)!' =!#!' % #!%!'! h '!' =!# % '!' *!%(') '

5 Numerical Differentiation We can approximate the gradient using!" # " # h/ 0 "(#) lim!$ % *, h " 4, $ = 4 $ <

6 Numerical Differentiation We can approximate the gradient using!" # " # h/ 0 "(#) lim!$ % *, h " 4, $ = 4 $ = " 4, $ = 4 $ 0.8 ; 0.3 =

7 Numerical Differentiation We can approximate the gradient using!" # " # h/ 0 "(#) lim!$ % *, h Reduce the truncation error by using center difference!" # " # h/ 0 "(# h/ 0 ) lim!$ % *, 2h Bad: rounding error, and slow to compute üa powerful tool to check the correctness of implementation, usually use h = e-6.

8 Backpropagation! Operator % # = %(!, ") "

9 Backpropagation " )* )" = )* )$ )$ )" Compute gradient becomes local computation Operator! $ =!(", #) )* )$ # )* )# = )* )$ )$ )#

10 Backpropagation simple example! = % &(( )*(, *( -, - )

11 Backpropagation simple example " #, = *.( ) % # " & % & *% /% " '

12 Backpropagation simple example " # % # " & % & , = *.( ) *% /% " ' 2.0

13 Backpropagation simple example " # % # " & % & " ' , = *.( ) *% /%.0

14 Backpropagation simple example " # % # " & % & " ' , = *.( ) *% /% , % = /% à 89 = 84 /%& :; :% = :; :, :, :% = /%&

15 Backpropagation simple example " # % # " & % & " ' , = *.( ) *% /% , % = % à =

16 Backpropagation simple example " # % # " & % & " ' , = *.( ) *% /% , % = * 4 à 89 :; :% = :; :, 84 = *4 :, :% = :; :, *4

17 Backpropagation simple example " # % # " & % & " ' , = *.( ) *% /% , %, " = %" à 9: 94 = ", 9: 90 = %

18 Backpropagation simple example " # % # " & % & " ' , = *.( ) *% /%

19 Any problem? Can we do better?

20 Problems of backpropagation You always need to keep intermediate data in the memory during the forward pass in case it will be used in the backpropagation. Lack of flexibility, e.g., compute the gradient of gradient.

21 Automatic Differentiation (autodiff) Create computation graph for gradient computation

22 Automatic Differentiation (autodiff) Create computation graph for gradient computation " # % #, = *.( ) " & % & *% /% " '

23 Automatic Differentiation (autodiff) Create computation graph for gradient computation " # % # " & - = * /( ) % & *% /% " ' % & - % = /% à = /%&

24 Automatic Differentiation (autodiff) Create computation graph for gradient computation " # % # " & - = * /( ) % & *% /% " ' % & - % = % à =

25 Automatic Differentiation (autodiff) Create computation graph for gradient computation " # % # " & - = * /( ) % & *% /% " ' % & - % = * 5 à = *5

26 Automatic Differentiation (autodiff) Create computation graph for gradient computation " # % # " & - = * /( ) % & *% /% " ' 89 - %, " = %" à 8; = % % &

27 Automatic Differentiation (autodiff) Create computation graph for gradient computation " # % # " & - = * /( ) % & *% /% " ' % &

28 AutoDiff Algorithm W x y matmult softmax log mul mean cross_en tropy y_ W_grad matmulttranspose log-grad softmax-grad mul / batch_size

29 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad! " exp! %

30 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad node_to_grad: :! " exp! %

31 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad node_to_grad:! %! " :! " exp! %

32 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad node_to_grad:! %! " :! " exp! % "! %

33 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad node_to_grad: :! % :! % : "! %! "! " exp! % "! %

34 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad node_to_grad: :! % :! % : "! %! "! " exp! % "! %

35 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad! " exp! % $ "! % id node_to_grad: :! % :! % : "! %! "

36 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad! " exp! % $ "! % id node_to_grad: :! % :! % : ", $! %! "

37 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad! " exp! % $ "! % id node_to_grad: :! % :! % : ", $! %! "

38 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad node_to_grad: :! % :! % : ", $! %! "! " exp! %! " "! % $ id

39 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad node_to_grad: :! % :! % : ", $! " :! "! %! "! " exp! %! " $ id "! %

40 AutoDiff Algorithm def gradient(out): node_to_grad[out] = nodes = get_node_list(out) for node in reverse_topo_order(nodes): grad ß sum partial adjoints from output edges input_grads ß node.op.gradient(input, grad) for input in node.inputs add input_grads to node_to_grad return node_to_grad node_to_grad: :! % :! % : ", $! " :! "! %! "! " exp! %! " $ id "! %

41 Backpropagation vs AutoDiff Backpropagation AutoDiff! "! " exp! #! )! (! " exp! #! (! #! #! # id! (! )! )

42 Recap Numerical differentiation Tool to check the correctness of implementation Backpropagation Easy to understand and implement Bad for memory use and schedule optimization Automatic differentiation Generate gradient computation to entire computation graph Better for system optimization

43 References Automatic differentiation in machine learning: a survey CS23n backpropagation:

Lecture 3: Overview of Deep Learning System. CSE599W: Spring 2018

Lecture 3: Overview of Deep Learning System CSE599W: Spring 2018 The Deep Learning Systems Juggle We won t focus on a specific one, but will discuss the common and useful elements of these systems Typical