GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Size: px

Start display at page:

Download "GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING"

Roderick Blair
5 years ago
Views:

1 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING M. Babaeizadeh,, I.Frosio, S.Tyree, J. Clemons, J.Kautz University of Illinois at Urbana-Champaign, USA NVIDIA, USA An ICLR 2017 paper A github project

2 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING M. Babaeizadeh,, I.Frosio, S.Tyree, J. Clemons, J.Kautz University of Illinois at Urbana-Champaign, USA NVIDIA, USA

3 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING M. Babaeizadeh,, I.Frosio, S.Tyree, J. Clemons, J.Kautz University of Illinois at Urbana-Champaign, USA NVIDIA, USA

4 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING 4

5 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Learning to accomplish a task Image from 5

6 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Definitions ~R t S t R, t Environment Agent S t RL agent Observable status S t Reward R t Action a t Policy 6

7 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Definitions ~R t S t Deep RL agent 7

8 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Definitions ~R t Dp( ) S t 8

9 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING 9

10 Agent 16 Agent 2 Agent 1 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING Asynchronous Advantage Actor-Critic (Mnih et al., arxiv: v2, 2015) Dp( ) p ( ) Master model 10

11 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING 11

12 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING The GPU 12

13 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING LOW OCCUPANCY (33%) LOW OCCUPANCY 13

14 Batch size GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING HIGH OCCUPANCY (100%) 14

15 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING HIGH OCCUPANCY (100%), LOW UTILIZATION (40%) time time 15

16 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING HIGH OCCUPANCY (100%), HIGH UTILIZATION (100%) time time 16

17 GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING BANDWIDTH LIMITED time time 17

18 Agent 16 Agent 2 Agent 1 Dp( ) p ( ) Master model + 18

status, reward Pear, pear, pear, pear, Empty, empty, Fig, fig,

19 MAPPING DEEP PROBLEMS TO A GPU REGRESSION, CLASSIFICATION, REINFORCEMENT LEARNING data 100% utilization / occupancy status, reward Pear, pear, pear, pear, Empty, empty, Fig, fig, fig, fig, fig, fig, Strawberry, Strawberry, Strawberry, labels? action 19

20 Agent 16 Agent 2 Agent 1 A3C Master model 20

21 Agent 16 Agent 2 Agent 1 A3C Master model 21

22 Agent 16 Agent 2 Agent 1 A3C Master model 22

23 Agent 16 Agent 2 Agent 1 A3C Master model 23

24 Agent 16 Agent 2 Agent 1 A3C Dp( ) Master model 24

25 Agent 16 Agent 2 Agent 1 A3C Dp( ) p ( ) Master model 25

26 GA3C GPU-based A3C 26 El Capitan big wall, Yosemite Valley

27 Agent N Agent 2 Agent 1 a t GA3C (INFERENCE) prediction queue S t predictors {a t } {S t } Master model 27

28 Agent N Agent 2 Agent 1 GA3C (TRAINING) Dp( ) Master model S t,r t R 0 R 1 R 2 R 4 {S t,r t } training queue trainers 28

29 Agent N Agent 2 Agent 1 a t GA3C prediction queue {a t } {S t } Dp( ) S t predictors Master model R 0 R 1 R 2 R 4 {S t,r t } training queue trainers 29

30 GA3C GPU-based A3C 30 El Capitan big wall, Yosemite Valley

31 GA3C Learn how to balance GPU-based A3C 31 El Capitan big wall, Yosemite Valley

32 Agent N Agent 2 Agent 1 GA3C: PREDICTIONS PER SECOND (PPS) a t prediction queue {a t } {S t } Dp( ) S t predictors Master model R 0 R 1 R 2 R 4 {S t,r t } training queue trainers 32

33 Agent N Agent 2 Agent 1 GA3C: TRAININGS PER SECOND (TPS) a t prediction queue {a t } {S t } Dp( ) S t predictors Master model R 0 R 1 R 2 R 4 {S t,r t } training queue trainers 33

34 AUTOMATIC SCHEDULING Balancing the system at run time ATARI Boxing ATARI Pong N P = # predictors, N T = # trainers, N A = # agents, TPS = training per seconds 34

35 THE ADVANTAGE OF SPEED More frames = faster convergence 35

LARGER DNNS For real world applications (e.g.

4x4 filters, stride 2 FC 256 Timothy P. Lillicrap et al.

36 LARGER DNNS For real world applications (e.g. robotics, automotive) A3C (ATARI) Others (robotics) Conv 16 8x8 filters, stride 4 Conv 32 4x4 filters, stride 2 FC 256 Timothy P. Lillicrap et al., Continuous control with deep reinforcement learning, International Conference on Learning Representations, Conv 32 8x8 filters, stride 1, 2, 3, 4 Conv 32 4x4 filters, stride 2? Conv 64 4x4 FC 256 filters, stride 2 S. Levine et al., End-to-end training of deep visuomotor policies, Journal of Machine Learning Research, 17:1-40,

37 PPS GA3C VS. A3C*: PREDICTIONS PER SECONDS * Our Tensor Flow implementation on a CPU x 11x 12x 20x 45x A3C GA3C 1 Small DNN Large DNN, stride 4 Large DNN, stride 3 Large DNN, stride 2 Large DNN, stride 1 37

38 Utilization (%) CPU & GPU UTILIZATION IN GA3C For larger DNNs CPU % GPU % Small DNN Large DNN, stride 4 Large DNN, stride 3 Large DNN, stride 2 Large DNN, stride 1 38

39 Agent N Agent 2 Agent 1 GA3C POLICY LAG Asynchronous playing and training prediction queue {a t } {S t } Dp( ) S t predictors Master model DNN A R 0 R 1 R 2 R 4 {S t,r t } training queue DNN B trainers 39

40 STABILITY AND CONVERGENCE SPEED Reducing policy lag through min training batch size 40

41 GA3C (45x faster) Balancing computational resources, speed and stability. GPU-based A3C 41 El Capitan big wall, Yosemite Valley

RESOURCES THEORY CODING M. Babaeizadeh, I. Frosio, S. Tyree, J.

net/forum?id=r1v GvBcxl&noteId=r1VGvBcxl).

42 RESOURCES THEORY CODING M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at GvBcxl&noteId=r1VGvBcxl). GA3C, a GPU implementation of A3C (open source at A general architecture to generate and consume training data. 42

43 QUESTIONS QUESTIONS ATARI 2600 Policy lag Multiple GPUs Why TensorFlow Replay memory Github: ICLR 2017: Reinforcement Learning through Asynchronous Advantage Actor-Critic on 43 a GPU.

44 BACKUP SLIDES

45 POLICY LAG IN GA3C Potentially large time lag between training data generation and network update 45

46 SCORES 46

47 Training a larger DNN 47

48 BALANCING COMPUTATIONAL RESOURCES The actors: CPU, PCI-E & GPU Trainings per second Training queue size Prediction queue size 48

GUNREAL: GPU-accelerated UNsupervised REinforcement and Auxiliary Learning

GUNREAL: GPU-accelerated UNsupervised REinforcement and Auxiliary Learning Koichi Shirahata, Youri Coppens, Takuya Fukagai, Yasumoto Tomita, and Atsushi Ike FUJITSU LABORATORIES LTD. March 27, 2018 0 Deep