Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System

1 Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System
Xiaofan Zhang (1), Mohamed El Hadedy (1), Wen-mei Hwu (1), Nam Sung Kim (1), Jinjun Xiong (2), Deming Chen (1)
(1) University of Illinois Urbana-Champaign
(2) T.J. Watson Research Center, IBM, USA

2 Outline
1. Motivation
2. Background
3. Challenges & solutions in the accelerator design
4. Challenges & solutions in the CAPI-integrated design
5. Experimental Results
6. Conclusion

3 Motivation: Why choose the Long-term Recurrent Convolutional Network (LRCN)
LRCN has a wide range of applications: image/video classification, captioning, labeling, and storytelling.
LRCN translates image information into text messages, that is, unstructured information into structured information.
LRCN captures both spatial and temporal features.

5 Background: Hardware acceleration
GPUs: fast; relatively easy to program; high power consumption (power >100 Watts).
CPUs: easy to program; low speed; power [value missing] Watts.
FPGAs: fast; not easy to program; low power; high power/energy efficiency (power ~30 Watts).
AlexNet inference comparison:
Platform                        Img/sec/W
FPGA - Zynq ZC706 (fixed 16)    [value missing]
GPU - TX1 (FP16)                [value missing]
GPU - TX1 (FP32)                8.6
GPU - TitanX (FP32)             2.5
CPU - i7 6700K (FP32)           1.3
The FPGA is the right candidate to host LRCN: fine-grained parallelism, low power budget, high energy efficiency, and a fully customized design.

6 Background: CAPI
Advantages:
The accelerator shares the memory space with the host CPU.
The accelerator has the privilege to access the memory of the host CPU.

7 Background: HLS design flow
Drawbacks of using FPGAs: 1) RTL programming, 2) hardware verification, 3) precise resource allocation. All three are time-consuming and challenging.
The HLS design flow uses high-level programming languages such as C/C++ for abstract description.
It reduces the effort spent on design, design space exploration, and debugging.
HLS offers optimization schemes (#pragma).
It allows a fast response to emerging DNNs.
Figure: HLS design flow.
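
To make the "C/C++ description plus #pragma" idea on this slide concrete, here is a minimal sketch of a dot-product kernel. The function name dot_product, the length N, and the choice of Vivado HLS-style PIPELINE and ARRAY_PARTITION directives are assumptions for illustration; the slides do not show the authors' code or name the directives they used. A standard C++ compiler ignores the HLS pragmas, so the same source can be checked in software first.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical vector length; not taken from the slides.
constexpr int N = 16;

// Dot product of two 16-bit vectors with a 64-bit accumulator.
std::int64_t dot_product(const std::int16_t a[N], const std::int16_t b[N]) {
// Vivado HLS-style directives (assumed tool); a normal compiler ignores them.
#pragma HLS ARRAY_PARTITION variable=a complete
#pragma HLS ARRAY_PARTITION variable=b complete
    assert(a != nullptr && b != nullptr);
    std::int64_t acc = 0;
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        // Widen before multiplying so the product cannot overflow.
        acc += static_cast<std::int64_t>(a[i]) * b[i];
    }
    return acc;
}
```

Changing the directives, rather than the loop body, is what lets a designer explore different hardware trade-offs from one source file.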

9 Challenges in LRCN accelerator design
Computation complexity: processing a single image costs 124 ms (K80 GPU) or 190 ms (Xeon E5 CPU).
Unclear allocation scheme, partitioning, and tiling factors: 18 layers (CNN) + 7x15 layers (RNN) must be mapped to a single FPGA chip. This requires careful resource allocation among the various loops, and the partitioning and tiling factors vary from layer to layer.
Limited on-chip memory and memory access bandwidth: there is not enough on-chip memory to store the weights and biases (7 MB available vs. [value missing] MB required), and external memory bandwidth becomes a bottleneck.

10 Challenges in LRCN accelerator design
LRCN is an end-to-end model for video content description, built as a hybrid neural network that uses both a CNN and an RNN.
Figure: a CNN followed by an RNN with LSTM, producing output word 1, output word 2, output word 3.
Layer characteristics:
Layer   Comp.    Memory
CONV    60.06%   2.69%
FC      5.29%    67.73%
RNN     34.65%   29.58%

11 Solution 1: Resource allocation
Resource Allocation Management (REALM):
Analyze resource allocation among the layers.
Determine the most efficient allocation to minimize latency.
Figure: latency guideline provided by REALM (labels: Latency, Constant).
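
The idea of distributing resources among layers to minimize latency can be illustrated with a simple greedy heuristic. This is not the authors' REALM algorithm, whose details the slides do not give. The latency model (work divided by the number of parallel units), the function names, and the example workloads are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed latency model: `work` operations spread over `units` parallel units.
double layer_latency(double work, int units) { return work / units; }

// Greedy sketch: every layer gets one unit, then each spare unit goes to
// whichever layer is currently the slowest.
std::vector<int> allocate_units(const std::vector<double>& work, int budget) {
    assert(!work.empty());
    assert(budget >= static_cast<int>(work.size()));
    std::vector<int> units(work.size(), 1);
    for (int spare = budget - static_cast<int>(work.size()); spare > 0; --spare) {
        std::size_t slowest = 0;
        for (std::size_t i = 1; i < work.size(); ++i) {
            if (layer_latency(work[i], units[i]) >
                layer_latency(work[slowest], units[slowest])) {
                slowest = i;
            }
        }
        ++units[slowest];
    }
    return units;
}
```

With hypothetical workloads of 60, 5, and 35 and a budget of 10 units, this heuristic assigns 6, 1, and 3 units respectively, so the heaviest layer receives the most hardware.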

12 Solution 2: IP-based design
Use parameterized HLS IPs for regular network construction.
One instance of the IP is responsible for computing one tile in a layer.
The IP consists of Coo multiply-accumulate units of dimension Cii each.
Figure: multipliers feeding a tree adder.
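
The structure described on this slide, Coo multiply-accumulate units of dimension Cii built from multipliers and a tree adder, can be modeled in software roughly as follows. The concrete sizes (Cii = 8, Coo = 4), the 16-bit data type, and the function names mac_unit and ip_tile are hypothetical; the slides do not give the authors' parameter values or code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t Cii = 8;  // inputs per MAC unit (hypothetical value)
constexpr std::size_t Coo = 4;  // number of MAC units (hypothetical value)
static_assert((Cii & (Cii - 1)) == 0, "this tree adder assumes a power-of-two Cii");

// One MAC unit: Cii multipliers followed by a log2(Cii)-level adder tree.
std::int32_t mac_unit(const std::array<std::int16_t, Cii>& in,
                      const std::array<std::int16_t, Cii>& w) {
    std::array<std::int32_t, Cii> p{};
    for (std::size_t i = 0; i < Cii; ++i) {
        p[i] = static_cast<std::int32_t>(in[i]) * w[i];  // multiplier stage
    }
    for (std::size_t stride = Cii / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {
            p[i] += p[i + stride];  // one level of the adder tree
        }
    }
    return p[0];
}

// One IP instance: Coo MAC units share the input vector, each with its own
// weights, and accumulate into Coo partial sums of the current tile.
void ip_tile(const std::array<std::int16_t, Cii>& in,
             const std::array<std::array<std::int16_t, Cii>, Coo>& w,
             std::array<std::int32_t, Coo>& acc) {
    for (std::size_t o = 0; o < Coo; ++o) {
        acc[o] += mac_unit(in, w[o]);
    }
}
```

Calling ip_tile repeatedly over successive input slices accumulates a full tile, which is one way a single IP instance could be reused across a layer.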

13 Solution 3: Network pruning
Neurons in FC1~FC3: [original count missing] -> 256.
Two LSTM layers with 1000 neurons -> one LSTM layer with 256 neurons.
7.8x reduction: 86.56M -> 11.08M.
1.5x reduction: 2.22 GOP -> 1.45 GOP.
Flexible quantization: 16-bit, 12-bit.
Network                                                                   Accuracy
LRCN original (AlexNet + 2 LSTM layers)                                   43%
LRCN pruned, fixed-point (AlexNet + 1 LSTM layer), implemented on FPGA    42%
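
The slide mentions flexible 16-bit and 12-bit quantization without describing the scheme. As an illustration only, the sketch below quantizes a real value to signed fixed point with a configurable total width and number of fractional bits. Round-to-nearest, saturation, and the names quantize and dequantize are assumptions, not the authors' method.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize x to a signed fixed-point integer with `total_bits` bits, of which
// `frac_bits` are fractional. Assumed policy: round to nearest, then saturate.
std::int32_t quantize(double x, int total_bits, int frac_bits) {
    assert(total_bits >= 2 && total_bits <= 31);
    assert(frac_bits >= 0 && frac_bits < total_bits);
    const std::int32_t max_q = (std::int32_t{1} << (total_bits - 1)) - 1;
    const std::int32_t min_q = -max_q - 1;
    const double scaled = std::round(std::ldexp(x, frac_bits));  // x * 2^frac_bits
    if (scaled > max_q) return max_q;
    if (scaled < min_q) return min_q;
    return static_cast<std::int32_t>(scaled);
}

// Map a fixed-point integer back to a real value.
double dequantize(std::int32_t q, int frac_bits) {
    return std::ldexp(static_cast<double>(q), -frac_bits);
}
```

For the same number of fractional bits, a 12-bit format saturates much earlier than a 16-bit one, which is why the width is typically chosen according to the value range of the data being stored.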

14 Solution 4: Memory hierarchy design
Use FIFOs and ping-pong buffers to hide memory access latency.
To fully use the bus interface, collect bits from three bus accesses.
Figure annotation: 8 bits have to be discarded.
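
The slide does not state the bus width or the data width. Purely as an assumption, a 512-bit bus carrying 12-bit values would match the numbers shown: 512 is not a multiple of 12 and leaves 8 bits over per access, while three accesses (1536 bits) hold exactly 128 values. The sketch below packs and unpacks values across such a three-access group; the constants and the names store_value and load_value are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBusBits = 512;   // ASSUMED bus width, not from the slides
constexpr std::size_t kValueBits = 12;  // ASSUMED value width
constexpr std::size_t kAccesses = 3;    // accesses grouped together (from the slide)
constexpr std::size_t kWords = kAccesses * kBusBits / 64;
constexpr std::size_t kValues = kAccesses * kBusBits / kValueBits;

static_assert(kBusBits % kValueBits == 8, "a single access leaves 8 unused bits");
static_assert((kAccesses * kBusBits) % kValueBits == 0, "three accesses leave none");

// Three bus accesses viewed as one contiguous bit string of 64-bit words.
using BusGroup = std::array<std::uint64_t, kWords>;

// Write one value at position `index`; the target bits must still be zero.
void store_value(BusGroup& g, std::size_t index, std::uint16_t value) {
    assert(index < kValues && (value >> kValueBits) == 0);
    const std::size_t bit = index * kValueBits;
    const std::size_t word = bit / 64;
    const std::size_t shift = bit % 64;
    g[word] |= static_cast<std::uint64_t>(value) << shift;
    if (shift + kValueBits > 64) {  // value straddles two 64-bit words
        g[word + 1] |= static_cast<std::uint64_t>(value) >> (64 - shift);
    }
}

// Read back the value at position `index`.
std::uint16_t load_value(const BusGroup& g, std::size_t index) {
    assert(index < kValues);
    const std::size_t bit = index * kValueBits;
    const std::size_t word = bit / 64;
    const std::size_t shift = bit % 64;
    std::uint64_t v = g[word] >> shift;
    if (shift + kValueBits > 64) {
        v |= g[word + 1] << (64 - shift);
    }
    return static_cast<std::uint16_t>(v & ((std::uint64_t{1} << kValueBits) - 1));
}
```

Under these assumed widths, values that cross an access boundary are reassembled from two neighboring words instead of being dropped, which is what removes the per-access waste.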

16 Challenges in CAPI-integrated design
The HLS-generated design must be adapted to the CAPI interface: AXI bus interface -> memory interface.
The FPGA and the CPU must communicate. This requires control signals connecting the CAPI proxies and the HLS-generated kernel, and a new data transmission design (fetching data without the AXI bus interface).
Some manual routing and hacking if necessary.
Use the reserved resources.

17 Solution 1: AXI -> CAPI
In the synthesizable C code, we need to initialize a chunk of memory space for keeping parameters and specify offsets pointing to each network layer's parameters.
Figure: one memory chunk holding L1 parameters, L2 parameters, and L3 parameters, located by L1 Offset, L2 Offset, and L3 Offset.
Saves 23% of LUTs.
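
The single parameter chunk with per-layer offsets described on this slide could look roughly like the following. This is a software sketch under assumed names (compute_offsets, layer_params) and an assumed 16-bit parameter type; the authors' actual synthesizable code is not shown in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fill offsets[i] with the index of layer i's first parameter, given the
// number of parameters in each layer. Layers are stored back to back.
void compute_offsets(const std::size_t sizes[], std::size_t num_layers,
                     std::size_t offsets[]) {
    std::size_t next = 0;
    for (std::size_t i = 0; i < num_layers; ++i) {
        offsets[i] = next;
        next += sizes[i];
    }
}

// Return a pointer to the parameters of `layer` inside the shared chunk.
const std::int16_t* layer_params(const std::int16_t* chunk,
                                 const std::size_t offsets[],
                                 std::size_t num_layers, std::size_t layer) {
    assert(chunk != nullptr);
    assert(layer < num_layers);
    return chunk + offsets[layer];
}
```

Each layer then reads its weights through its own offset, so only one memory region has to be set up and passed to the kernel.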

18 Solution 2: FSM design
Figure: CAPI framework. Blocks shown: CAPI proxies, AFU abstraction layer, cache-coherent PowerBus, PSL interface, Accelerator Function Unit (AFU), DMA, PCIe.
Interface labels in the figure:
Control: to start, reset and stop the AFU.
Command: to send read and write commands from the AFU.
Buffer: to send/receive data from the unit.
Response: to acknowledge completed commands.
MMIO: debugging and control of the AFU.
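
The slides do not show the FSM itself, so the following is only a minimal sketch of a control FSM that reacts to the start, reset, and stop operations named on this slide. The states (Idle, Running, Done), the extra JobFinished event, and the transition rules are hypothetical and do not reproduce the authors' design or the PSL signal names.

```cpp
// Hypothetical control states and events for an accelerator function unit.
enum class State { Idle, Running, Done };
enum class Event { Start, Reset, Stop, JobFinished };

// Pure transition function: current state and event in, next state out.
State next_state(State s, Event e) {
    if (e == Event::Reset) {
        return State::Idle;  // reset takes priority in every state
    }
    switch (s) {
        case State::Idle:
            return e == Event::Start ? State::Running : State::Idle;
        case State::Running:
            if (e == Event::JobFinished) return State::Done;
            if (e == Event::Stop) return State::Idle;
            return State::Running;
        case State::Done:
            return State::Done;  // stays here until a reset arrives
    }
    return State::Idle;  // not reached; keeps compilers satisfied
}
```

Keeping the transition logic in one pure function makes it easy to test in software before it is turned into hardware.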

19 Solution 3: FSM

21 System setup
1) Develop a network using deep learning frameworks.
2) Prune the network and quantize the parameters.
3) Generate resource allocation schemes (REALM).
4) Convert the network into synthesizable C code (HLS IPs).
5) Run HLS to get an optimized RTL design.
6) Stitch the accelerator together with CAPI.

22 Implementation strategies
Baseline: implement one LRCN IP with four resource allocation schemes.
Batch: instantiate two LRCN IPs (batch size: 2).
Semi-batch: implement one single front-end CNN and two identical back-end RNNs.
CNN x1: processes two images in serial.
RNN x2: process the two outputs from the CNN in parallel.

23 Results: Baseline
Same LRCN design with incremental FPGA resources. Each scheme pairs the PSL with one LRCN IP.
Scheme      FPGA resource   Throughput
Scheme 1    45%             9 FPS
Scheme 2    49%             13 FPS
Scheme 3    55%             15 FPS
Scheme 4    59%             17 FPS

24 Results: Batch mode
Batch size 2 (20 FPS).
Two separate computation units with shared control signals and parameter inputs, connected to the PSL.

25 Results: Semi-batch mode
15.4 FPS.
Reuse the CNN twice and batch the RNN (batch size 2), connected to the PSL.

26 Results: Baseline vs. batch vs. semi-batch mode
Figure: FPGA, CPU, GPU comparison for running a single LRCN.

27 Results: Throughput/power performance comparison

29 Conclusion
Implemented an FPGA-based LRCN accelerator in a POWER system using the CAPI interconnection.
Designed FSMs for signal transmission and data movement.
Adapted the HLS-generated design to the CAPI environment.
Expanded the proposed LRCN design with three implementation strategies.

30 Thank you
