HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016

PIM is Coming Back End of Dennard scaling MapReduce, graph processing, deep neural networks, Energy-bound systems Near-Data Processing (NDP) HMC, HBM 3D stacking Inmemory analytics Figs: www.extremetech.com www.cisl.columbia.edu/grads/tuku/research/ www.oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html 2

NDP Logic Requirements Area-efficient o High processing throughput to match the high memory bandwidth o 128 GBps per 50 mm 2 stack > 32 Gflops > 0.6 Gflops/mm 2 Power-efficient o Thermal constraints limit clock frequency o 5 W per stack 100 mw/mm 2 Flexible o Must amortize manufacturing cost through reuse across apps 3

NDP Logic Options Area Efficiency Power Efficiency Flexibility Programmable cores [IRAM, FlexRAM, NDC, TOP-PIM] FPGA (fine-grained) [Active Pages] CGRA (coarse-grained) [NDA] ASIC [MSA, LiM] 4

Reconfigurable Logic Challenges FPGA o Area overhead due to support for bit-level configuration CGRA o Traditional GGRAs Limited flexibility in interconnects, only for regular computation patterns o DySER [HPCA 11] and NDA [HPCA 15] High power due to circuit-switched routing Inefficient for branches and irregular data layouts Heterogeneity: achieve the best of FPGA and CGRA 5

Outline Motivation NDP System Design Heterogeneous Reconfigurable Logic (HRL) Evaluation Conclusions 6

Overall System Architecture Memory stack with NDP capability: runs memory-intensive code High-Speed Serial Link Multi-core chip with cache hierarchy: Runs code with high temporal locality Memory Stack Host Processor Multiple stacks linked to host processor through serial links 7

NDP Stack Vault: Vertical channel Dedicated memory controller 8 16 vaults per stack Bank Channel... DRAM Die vs. DDR3 channel 10x bandwidth (160 GBps) 3-5x power improvement Vault NoC Vault Logic Logic Die Vault logic: Multiple PEs + control logic NoC to interconnect vaults 8

Iterative Execution Flow Processing phase: PEs run tasks independently and in parallel Communication phase: data exchange and sync b/w PEs [PACT 15] o Communication within and across stacks Task Task Task Task PEs PEs PEs PEs Process Local Buffer Task Task Task Task Pull 9

op,addr,sz op,addr,sz op,addr,sz Vault Logic Handles task control and data communication Allows the use of reconfigurable or custom PEs Input & output data info from host or other PEs Global Controller (Host Processor) Cached data (e.g., graph vertices) Fixed Logic User-defined Logic Task Queue To local memory DMA DMA Ctrl Scratchpad Buffer Mem Ctrl Router Generic Load/Store Unit PE PE... PE To remote Streaming data memory (e.g., graph edge list) Separate queue for each consumer Coalesce writebacks In-place combine ops Output Queues write queues 10

Outline Motivation NDP System Design Heterogeneous Reconfigurable Logic (HRL) Evaluation Conclusions 11

HRL Features Fine-grained + coarse-grained reconfigurable blocks o LUTs for flexible control o ALUs for efficient arithmetic Static interconnects o Wide network for data o Separate and narrow network for control Area-efficiency and flexibility Power-efficiency Special blocks for branches & irregular data layout Flexibility Compute throughput per Watt: 2.2x over FPGA, 1.7x over CGRA 12

HRL Array: Logic Blocks Control Input FPGA-style configurable block LUTs for embedded control logic Data Input Special functions: sigmoid, CLBtanh, FU etc. FU FU FU FU CGRA-style functional unit Efficient 48-bit arithmetic/logic ops Registers for pipelining and retiming 1-bit control routing tracks CLB FU FU FU FU FU 16-bit data routing tracks CLB FU FU FU FU FU Flexible IO Interface Simple but flexible alignment CLB OMB OMB OMB OMB OMB Control Output Data Output Output MUX Block Configurable MUXes (tree, cascading, parallel) Put close to output, low cost and flexible 13

HRL Array: Routing Control Input Data Input CLB FU FU FU FU FU 1-bit control routing tracks CLB FU FU FU FU FU 16-bit data routing tracks CLB FU FU FU FU FU CLB OMB OMB OMB OMB OMB Control Output Data Output 14

HRL Array: Routing 16-bit data net tracks 1-bit control net tracks Fully static for low power No switches b/w two networks Separate data and control to reduce area A B Y out FU Block output to routing track connection Routing switch box Block input MUX connection Connection box and switch out CLB in sel OUT INA MXB INB 1-bit control, few tracks 16-bit data, bus-based 15

HRL Array: IO Control Input Data Input CLB FU FU FU FU FU 1-bit control routing tracks CLB FU FU FU FU FU 16-bit data routing tracks CLB FU FU FU FU FU CLB OMB OMB OMB OMB OMB Control Output Data Output 16

HRL Array: IO Control IO Connects to control tracks Same as FPGA Data IO Connect to data net Simple 16-bit chunk alignment 48-bit fixed-point 16-bit short 32-bit int Ctrl IO Data IO 1-bit control tracks 16-bit data tracks 48-bit fixedpoint 16-bit short Low cost and sufficiently flexible even for irregular data 32-bit int 17

Outline Motivation NDP System Design Heterogeneous Reconfigurable Logic (HRL) Evaluation Conclusions 18

Methodology Workloads o 3 analytics frameworks: MapReduce, graph, DNN o 9 representative applications, 11 kernel circuits (KCs) Technology o 45 nm area and power model o CGRA: DySER as in NDA [HPCA 11, HPCA 15] o FPGA: Xilinx Virtex-6 Tools o Synthesize, place & route by Yosys + VTR o System simulation by zsim 19

Single Array Area (mm2) Array Area 1.4 1.2 1 0.8 0.6 0.4 0.2 0 Coarse-grained FUs improve area efficiency Ctrl Routing Data Routing Routing Logic FPGA CGRA HRL Same logic capacity for each type array 20

Power (mw) Array Power 40 35 30 25 20 15 10 5 0 HRL: static but flexible routing CGRA: high power on circuit-switched network FPGA: half the frequency KC1 KC2 KC3 KC4 KC5 KC6 KC7 KC8 KC9 KC10 KC11 Average FPGA CGRA HRL 21

Normalized Perf/Watt Vault Power Efficiency 3.5 3 HRL: 2.2x FPGA, 1.7x CGRA on perf/watt 2.5 2 1.5 1 0.5 0 KC1 KC2 KC3 KC4 KC5 KC6 KC7 KC8 KC9 KC10 KC11 Average FPGA CGRA HRL 22

Normalized Performance Overall Performance ASIC represents the upper bound of efficiency Cores, FPGA, CGRA only match 30% to 80% of ASIC HRL has 92% of ASIC performance on average Memory bandwidth not saturated Memory bandwidth saturated 1 0.8 0.6 0.4 0.2 0 GroupBy Hist LinReg PageRank SSSP CC ConvNet MLP da Cores FPGA CGRA HRL ASIC 23

Conclusions NDP logic requirements: area + power efficiency, flexibility Heterogeneous reconfigurable logic (HRL) o Fine-grained + coarse-grained logic blocks o Static and separate data and control networks o Special blocks for branching and layout management o Vault logic handles communication and control HRL for in-memory analytics o 2.2x performance/watt over FPGA and 1.7x over CGRA o Within 92% of ASIC performance 24

Thanks! Questions?