HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing


HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
Mingyu Gao and Christos Kozyrakis, Stanford University
http://mast.stanford.edu
HPCA, March 14, 2016

PIM is Coming Back
o End of Dennard scaling: energy-bound systems
o MapReduce, graph processing, deep neural networks: in-memory analytics
o Near-Data Processing (NDP), enabled by 3D stacking (HMC, HBM)
Figure credits: www.extremetech.com, www.cisl.columbia.edu/grads/tuku/research/, www.oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html

NDP Logic Requirements
o Area-efficient: high processing throughput to match the high memory bandwidth
  128 GB/s per 50 mm² stack → 32 Gflops → 0.6 Gflops/mm²
o Power-efficient: thermal constraints limit clock frequency
  5 W per stack → 100 mW/mm²
o Flexible: must amortize manufacturing cost through reuse across applications
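The throughput and power targets above can be sanity-checked with a few lines of arithmetic (a minimal sketch, assuming one 32-bit operand consumed per floating-point operation, a 50 mm² logic layer, and the 5 W per-stack budget stated on the slide):

```python
# Sanity check of the NDP logic targets from the slide.
# Assumption (not in the slide): one flop per 4-byte operand fetched.
bandwidth_GBps = 128.0   # memory bandwidth per stack
bytes_per_flop = 4.0     # assumed: one 32-bit operand per flop
area_mm2 = 50.0          # logic die area per stack
power_W = 5.0            # per-stack power budget

required_gflops = bandwidth_GBps / bytes_per_flop   # 32 Gflops
gflops_per_mm2 = required_gflops / area_mm2         # ~0.64, slide rounds to 0.6
mw_per_mm2 = power_W * 1000.0 / area_mm2            # 100 mW/mm²

print(required_gflops, gflops_per_mm2, mw_per_mm2)
```

The computed 0.64 Gflops/mm² matches the slide's rounded "0.6 Gflops/mm²" figure.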

NDP Logic Options: comparing area efficiency, power efficiency, and flexibility
o Programmable cores [IRAM, FlexRAM, NDC, TOP-PIM]
o FPGA (fine-grained) [Active Pages]
o CGRA (coarse-grained) [NDA]
o ASIC [MSA, LiM]

Reconfigurable Logic Challenges
o FPGA: area overhead due to support for bit-level configuration
o CGRA:
  Traditional CGRAs: limited interconnect flexibility, suited only to regular computation patterns
  DySER [HPCA'11] and NDA [HPCA'15]: high power due to circuit-switched routing; inefficient for branches and irregular data layouts
o Heterogeneity: achieve the best of both FPGA and CGRA

Outline
o Motivation
o NDP System Design
o Heterogeneous Reconfigurable Logic (HRL)
o Evaluation
o Conclusions

Overall System Architecture
o Memory stacks with NDP capability run the memory-intensive code
o A multi-core host processor with a cache hierarchy runs code with high temporal locality
o Multiple stacks are linked to the host processor through high-speed serial links

NDP Stack
o 16 vaults per stack; each vault is a vertical channel (banks across the DRAM dies) with a dedicated memory controller
o vs. a DDR3 channel: 10x bandwidth (160 GB/s) and 3-5x power improvement
o Logic die: per-vault logic with multiple PEs plus control logic; a NoC interconnects the vaults

Iterative Execution Flow
o Processing phase: PEs run tasks independently and in parallel, using local buffers
o Communication phase: data exchange and synchronization between PEs [PACT'15]
  Communication occurs both within and across stacks; consumers pull data for the next round of tasks
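The two-phase loop above can be sketched as follows (a toy model: the `PE` class, its summing process step, and the all-to-all pull are illustrative assumptions, not the paper's API):

```python
class PE:
    """Toy processing element: its 'task' is summing its local inputs."""
    def __init__(self, tasks):
        self.local_tasks = tasks
        self.inbox = []       # data pulled during the communication phase

    def process(self):
        # Processing phase: run local tasks independently of other PEs
        return sum(self.local_tasks)

    def pull(self, data):
        # Communication phase: this PE pulls data produced elsewhere
        self.inbox.append(data)

def run_iteration(pes):
    # Processing phase: every PE computes on its own (in parallel in hardware)
    outputs = [pe.process() for pe in pes]
    # Communication phase: all-to-all exchange, then an implicit barrier
    for pe in pes:
        for out in outputs:
            pe.pull(out)
    return outputs
```

Each call to `run_iteration` models one process/communicate round; real deployments repeat it until the application converges.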

Vault Logic
o Handles task control and data communication; allows the use of reconfigurable or custom PEs
o Fixed logic around the user-defined PEs:
  Task queue: input and output data info (op, addr, size) from the host's global controller or from other PEs
  DMA engines and a generic load/store unit to local and remote memory
  Scratchpad buffer for cached data (e.g., graph vertices); streaming access for sequential data (e.g., a graph edge list)
  Output/write queues: a separate queue per consumer, coalesced writebacks, and in-place combine ops
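The output-queue behavior of coalescing writebacks with in-place combine ops can be illustrated with a small sketch (the dictionary-keyed queue and the default additive combine are hypothetical simplifications of the hardware):

```python
def coalesce_writebacks(writes, combine=lambda a, b: a + b):
    """Merge pending writebacks to the same address before they leave
    the output queue, applying an in-place combine op on collisions."""
    queue = {}
    for addr, value in writes:
        if addr in queue:
            queue[addr] = combine(queue[addr], value)  # in-place combine
        else:
            queue[addr] = value                        # first write wins a slot
    return queue
```

With an additive combine this matches reduction-style updates (e.g., accumulating partial PageRank contributions to the same vertex); swapping in `max` or another operator models other reductions.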

Outline
o Motivation
o NDP System Design
o Heterogeneous Reconfigurable Logic (HRL)
o Evaluation
o Conclusions

HRL Features
o Fine-grained + coarse-grained reconfigurable blocks (LUTs for flexible control, ALUs for efficient arithmetic) → area efficiency and flexibility
o Static interconnects (a wide network for data, a separate narrow network for control) → power efficiency
o Special blocks for branches and irregular data layouts → flexibility
o Compute throughput per watt: 2.2x over FPGA, 1.7x over CGRA

HRL Array: Logic Blocks
o CLB: FPGA-style configurable logic block; LUTs for embedded control logic; special functions (sigmoid, tanh, etc.)
o FU: CGRA-style functional unit; efficient 48-bit arithmetic/logic ops; registers for pipelining and retiming
o OMB (output MUX block): configurable MUXes (tree, cascading, parallel); placed close to the outputs, low cost and flexible
o Flexible IO interface: simple but flexible alignment
o Routing: 1-bit control tracks and 16-bit data tracks

HRL Array: Routing
(Array figure: a CLB column and an FU grid feeding a row of OMBs, connected by 1-bit control and 16-bit data routing tracks.)

HRL Array: Routing
o 16-bit data net tracks and 1-bit control net tracks, fully static for low power
o No switches between the two networks; separating data and control reduces area
o Control network: 1-bit, few tracks; data network: 16-bit, bus-based
(Figure: block-output-to-track connections, routing switch boxes, and block-input MUX connections, illustrated with an FU computing Y from inputs A and B and a CLB driving a MUX select.)

HRL Array: IO
(Array figure: control and data IO at the array edges, connected to the 1-bit control and 16-bit data routing tracks.)

HRL Array: IO
o Control IO: connects to the control tracks, same as in an FPGA
o Data IO: connects to the data network with simple 16-bit chunk alignment for 16-bit short, 32-bit int, and 48-bit fixed-point operands
o Low cost and sufficiently flexible, even for irregular data
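The 16-bit chunk alignment can be illustrated as follows (a sketch assuming little-endian chunk order, which the slides do not specify): operands of any supported width travel over the 16-bit data tracks as one or more aligned chunks.

```python
def to_chunks(value, width_bits):
    """Split a width_bits-wide operand into 16-bit chunks for the
    16-bit data tracks (least-significant chunk first, assumed)."""
    assert width_bits % 16 == 0, "widths must be multiples of 16 bits"
    return [(value >> (16 * i)) & 0xFFFF for i in range(width_bits // 16)]

def from_chunks(chunks):
    """Reassemble an operand from its 16-bit chunks."""
    return sum(c << (16 * i) for i, c in enumerate(chunks))
```

A 16-bit short occupies one chunk, a 32-bit int two, and a 48-bit fixed-point value three, which is why the simple chunk-aligned IO stays low cost while still covering all three operand types.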

Outline
o Motivation
o NDP System Design
o Heterogeneous Reconfigurable Logic (HRL)
o Evaluation
o Conclusions

Methodology
o Workloads: 3 analytics frameworks (MapReduce, graph, DNN); 9 representative applications; 11 kernel circuits (KCs)
o Technology: 45 nm area and power models; CGRA: DySER as used in NDA [HPCA'11, HPCA'15]; FPGA: Xilinx Virtex-6
o Tools: synthesis and place & route with Yosys + VTR; system simulation with zsim

Array Area
o Same logic capacity for each type of array
o Coarse-grained FUs improve area efficiency
(Chart: single-array area in mm² for FPGA, CGRA, and HRL, broken down into logic, data routing, and control routing.)

Array Power
o FPGA: runs at half the frequency
o CGRA: high power in the circuit-switched network
o HRL: static but flexible routing
(Chart: array power in mW for FPGA, CGRA, and HRL across KC1-KC11 and their average.)

Vault Power Efficiency
o HRL: 2.2x over FPGA and 1.7x over CGRA in performance per watt
(Chart: normalized perf/watt for FPGA, CGRA, and HRL across KC1-KC11 and their average.)

Overall Performance
o ASIC represents the upper bound of efficiency
o Cores, FPGA, and CGRA reach only 30% to 80% of ASIC performance
o HRL achieves 92% of ASIC performance on average
(Chart: normalized performance of cores, FPGA, CGRA, HRL, and ASIC on GroupBy, Hist, LinReg, PageRank, SSSP, CC, ConvNet, MLP, and dA, grouped by whether memory bandwidth is saturated.)

Conclusions
o NDP logic requirements: area efficiency, power efficiency, and flexibility
o Heterogeneous reconfigurable logic (HRL):
  Fine-grained + coarse-grained logic blocks
  Static, separate data and control networks
  Special blocks for branching and layout management
  Vault logic handles communication and control
o HRL for in-memory analytics: 2.2x performance/watt over FPGA, 1.7x over CGRA; within 92% of ASIC performance

Thanks! Questions?