Lecture 24 Near Data Computing II


EECS 570 Lecture 24: Near Data Computing II
Winter 2018
Prof. Satish Narayanasamy
http://www.eecs.umich.edu/courses/eecs570/

Readings
- ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars, ISCA 2016, Shafiee et al.
- In-Memory Data Parallel Processor, ASPLOS 2018, Fujiki, Mahlke, Das.

Executive Summary
- Classifying images is in vogue; it involves lots of vector-matrix multiplication, and conv nets are the best at it.
- Analog memristor crossbars are a great fit, but analog-to-digital conversion overheads are large; smart encoding reduces them.
- ISAAC is 14.8x better in throughput and 5.5x better in energy than the digital state of the art (DaDianNao).
- A balanced pipeline is critical for high efficiency.
- Preserving high precision is essential in analog.

State-of-the-Art Convolutional Neural Networks
- Deep residual networks: 152 layers, 11 billion operations!
- Composed of convolution layers, pooling layers, and fully connected layers.

Convolution Operation
[Figure: three 2x2 kernels (K_x = K_y = 2) slide over an N_x x N_y input with N_i = 3 input channels at stride S_x = S_y = 1, producing N_o = 3 output feature maps.]

Memristor Dot-Product Engine
- Per cell, Ohm's law multiplies: I1 = V1·G1 and I2 = V2·G2; the shared bitline sums the currents: I = I1 + I2 = V1·G1 + V2·G2.
- A crossbar of cells w_00..w_33 driven by inputs x_0..x_3 therefore produces all outputs y_0..y_3 of a vector-matrix product at once.
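The analog vector-matrix product above can be mimicked numerically; a minimal sketch (illustrative values of my own, not from the slides), assuming weights are programmed as conductances and inputs are applied as row voltages:

import numpy as np

# Weights stored as cell conductances G (one column per output y_j),
# inputs applied as row voltages V. Ohm's law gives per-cell currents
# V_i * G_ij; Kirchhoff's law sums each column, so the column currents
# are exactly the vector-matrix product y = V @ G.
G = np.array([[0.10, 0.20, 0.05, 0.30],
              [0.25, 0.15, 0.10, 0.05],
              [0.05, 0.30, 0.20, 0.10],
              [0.20, 0.05, 0.25, 0.15]])   # w_ij as conductances (illustrative)
V = np.array([0.8, 0.2, 0.5, 1.0])         # x_i as row voltages (illustrative)

I_columns = V @ G                          # y_j = sum_i V_i * G_ij
print(I_columns)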

Memristor Dot-Product Engine (mapping the convolution)
[Figure: the three 2x2 kernels of the earlier convolution example (N_i = 3, N_o = 3, stride S_x = S_y = 1) mapped onto the memristor dot-product engine.]

Crossbar
- 16-bit input neurons are streamed 1 bit at a time, so evaluating the crossbar takes 16 iterations.
- Each 16-bit weight is spread across eight 2-bit memristor cells.
- The per-iteration, per-cell results are combined with shift-and-add to recover the full-precision dot product.
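A minimal numerical sketch of this bit-serial, bit-sliced scheme (my own illustrative model, not code from the paper), showing that streaming input bits over 16 iterations against 2-bit weight slices and shift-and-adding the column sums recovers the exact 16-bit dot product:

import numpy as np

def bitserial_dot(inputs, weights, in_bits=16, cell_bits=2):
    # Each 16-bit weight occupies in_bits // cell_bits = 8 cells of 2 bits;
    # each 16-bit input is applied one bit per iteration (LSB first here).
    cells = in_bits // cell_bits
    result = 0
    for it in range(in_bits):                      # 16 input-bit iterations
        in_bit = (inputs >> it) & 1                # 1-bit slice of every input
        for c in range(cells):
            w_slice = (weights >> (c * cell_bits)) & (2**cell_bits - 1)
            col = int(np.dot(in_bit, w_slice))     # analog column sum, digitized
            result += col << (it + c * cell_bits)  # shift-and-add
    return result

x = np.array([3, 50000, 123], dtype=np.int64)      # 16-bit inputs (illustrative)
w = np.array([7, 2, 40000], dtype=np.int64)        # 16-bit weights (illustrative)
assert bitserial_dot(x, w) == int(np.dot(x, w))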

ISAAC Organization
[Figure: one in-situ multiply-accumulate unit. A digital-to-analog stage drives the crossbar rows (0-127 and 128-255) from an input register over 16 iterations; analog-to-digital converters, shift-and-add units, a sigmoid unit, and output registers accumulate the partial outputs.]

An ISAAC Chip: Inter-Tile Pipelining
- Layers are mapped to tiles (Layer 1 -> Tile 1, Layer 2 -> Tile 2, Layer 3 -> Tile 3), each with its own eDRAM buffer, so consecutive layers execute as a pipeline across tiles.

Balanced Pipeline
- If layer i has strides S_x = 1 and S_y = 2, layer i-1 is replicated two times so that it produces data fast enough to keep the pipeline balanced.
- eDRAM storage allocation tracks each entry as: not computed yet, received from the previous layer, or serviced and released.

Balanced Pipeline
[Figure: thirteen 128x128 crossbars allocated across consecutive layers with strides (S_x, S_y) = (2, 2), (1, 2), and (2, 2); earlier layers receive more crossbar replicas so that every pipeline stage stays equally busy.]

The ADC Overhead
- ADCs occupy large area and are power hungry.
- ADC area and power increase exponentially with resolution and sampling frequency.

The ADC Overhead
- A crossbar with R rows of w-bit memristor cells and v-bit inputs needs an ADC resolution of log2(R) + v + w bits, or log2(R) + v + w - 1 bits when v = 1.
- For R = 128, v = 1, w = 2: 7 + 1 + 2 - 1 = 9 bits.
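A quick sanity check of that formula (my own worked example using the slide's parameters): the worst case is every row contributing its maximum cell value under the maximum input.

import math

# R rows, v-bit inputs (DAC), w-bit cells, as on the slide.
R, v, w = 128, 1, 2
max_column_sum = R * (2**v - 1) * (2**w - 1)          # worst-case analog column value
adc_bits = math.ceil(math.log2(max_column_sum + 1))   # bits needed to digitize it
assert adc_bits == int(math.log2(R)) + v + w - 1      # 7 + 1 + 2 - 1 = 9
print(adc_bits)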

Encoding Scheme
- Apply the maximal (all-ones) input to a column of cells w_{0,0}, w_{0,1}, ..., w_{0,R-1}. If the MSB of the result would be 1, store that column's weights in flipped (complemented) form so that the MSB is always 0.
- The effective ADC resolution required drops to 8 bits.
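A minimal sketch of the idea (my own illustration, assuming 1-bit inputs and 2-bit cells in a 128-row column): columns whose all-ones sum would set the 9th bit are stored complemented, the 8-bit ADC reading is taken, and the true sum is recovered digitally from the count of 1s in the input.

import numpy as np

R, w_bits, adc_bits = 128, 2, 9
rng = np.random.default_rng(0)
weights = rng.integers(0, 2**w_bits, R)       # original column weights
x = rng.integers(0, 2, R)                     # 1-bit inputs

# Flip (complement) the column iff its maximal-input sum would set the MSB.
flip = weights.sum() >= 2**(adc_bits - 1)
stored = (2**w_bits - 1) - weights if flip else weights
assert stored.sum() < 2**(adc_bits - 1)       # flipped column fits in 8 bits

analog = int(np.dot(x, stored))               # what the 8-bit ADC digitizes
ones = int(x.sum())                           # number of 1s in the input
true_sum = (2**w_bits - 1) * ones - analog if flip else analog
assert true_sum == int(np.dot(x, weights))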

Handling Signed Arithmetic
- Input neurons are in 2's complement: the MSB carries weight -2^15, so the 16th (MSB) iteration does a shift-and-subtract instead of a shift-and-add.
- Weights use a biased representation (like an FP exponent) with a bias of 2^15; the bias is removed by subtracting as many biases as there are 1s in the input.

Analysis Metrics
1) CE: Computational Efficiency -> GOPS/mm^2
2) PE: Power Efficiency -> GOPS/W
3) SE: Storage Efficiency -> MB/mm^2

Design Space Exploration
Parameters swept:
1) rows per crossbar
2) ADCs per IMA
3) crossbars per IMA
4) IMAs per tile

Design Space Exploration
[Figure: GOPS/mm^2 and GOPS/W across various design points, marking the ISAAC-CE, ISAAC-PE, and ISAAC-SE configurations that respectively maximize computational, power, and storage efficiency.]

Power Contribution
[Figure: breakdown of chip power by component, including the router and HyperTransport links.]

Improvement over DaDianNao (Throughput)
Throughput is 14.8x better (on deep neural net benchmarks) because:
1. Memristor crossbars have high computational parallelism.
2. DaDianNao fetches both inputs and weights from eDRAM; ISAAC fetches just the inputs.
3. DaDianNao suffers from bandwidth limitations in fully connected layers.
ISAAC draws more power, but for the same reasons it is 5.5x better in terms of energy.

Conclusion
- Takes advantage of analog in-situ computing and fetches just the input neurons.
- Handles ADC overheads with smart encoding, without compromising output precision.
- Faster than DaDianNao thanks to 8x better computational efficiency and a balanced pipeline that keeps all units busy.
- A few questions remain: can online training be integrated?

In-Memory Data Parallel Processor
Daichi Fujiki, Scott Mahlke, Reetuparna Das
M-Bits Research Group

"Data movement is what matters, not arithmetic" - Bill Dally
[Figure: CPUs (many-core, out-of-order, SIMD) and GPUs (many-thread, SIMT, SIMD) running data-parallel applications; data communication is annotated as 40x-1000x more costly than arithmetic.]

In-Memory Computing Exposes Parallelism While Minimizing Data Movement Cost
- Unlike CPUs and GPUs, in-memory computing performs in-situ computation: massive parallelism from SIMD slots over dense memory arrays.
- High bandwidth and low data movement.

In-Memory Computing Reduces Data Movement
(a) Addition: drive both rows at Vdd/2, so I11 = (Vdd/2)·C11 and I21 = (Vdd/2)·C21, and the column current is I1 = (Vdd/2)·(C11 + C21).
(b) Dot-product: I11 = V1·C11, I12 = V1·C12, I21 = V2·C21, I22 = V2·C22; the column currents are I1 = I11 + I21 and I2 = I12 + I22.

In-Memory Computing Exposes Parallelism

                       CPU (2 sockets)       GPU                  ReRAM
                       Intel Xeon E5-2597    NVIDIA TITAN Xp      Scaled from ISAAC*
Area (mm^2)            912.24                471                  494
TDP (W)                290                   250                  416
On-chip memory (MB)    78.96                 9.14                 8,590
SIMD slots             448                   3,840                2,097,152
Freq (GHz)             3.6                   1.585                0.02
SIMD x Freq product    3,227                 6,086                41,953

In-Memory Computing Today
- ReRAM dot-product accelerators perform multiplication and summation in the array: I11 = V1·C11, I12 = V1·C12, I21 = V2·C21, I22 = V2·C22, with column currents I1 = I11 + I21 and I2 = I12 + I22.
- Examples: PRIME [Chi 2016, ISCA], ISAAC [Shafiee 2016, ISCA], Dot-Product Engine [Hu 2016], PipeLayer [Song 2017, HPCA].

In-Memory Computing: No Demonstration of General-Purpose Computing
- How do you program it? There is no established programming model or execution model.
- Computation primitives are limited.

In-Memory Data Parallel Processor: Overview
- HW: a microarchitecture plus a memory ISA (ADD, DOT, MUL, SUB, MOV, MOVS, MOVI, MOVG, SHIFT{L/R}, MASK, LUT, REDUCE_SUM).
- Execution model: Modules split into instruction blocks (IB1, IB2, ...) to expose ILP.
- SW: a data-flow-graph programming model and the IMP compiler to expose DLP.

Computation Primitives: Storage
- Information is stored in analog form as cell conductance (C = 1/resistance).
- A value A is written as conductance C_A (likewise B as C_B) and read back by sensing the cell.

Computation Primitives: Addition
- Ohm's law provides the multiply: IA = (Vdd/2)·CA and IB = (Vdd/2)·CB.
- Kirchhoff's law provides the add: I = IA + IB.
- (a) is addition; (b), subtraction, is a new primitive (next slide).

Computation Primitives: Subtraction (new primitive)
- (a) Addition: I11 = (Vdd/2)·C11 and I21 = (Vdd/2)·C21, so I = IA + IB = (Vdd/2)·(CA + CB).
- (b) Subtraction (new primitive): the rows are driven so that the column current becomes I1 = (Vdd/2)·(CA - CB).

Computation Primitives: Dot-Product and Element-Wise Multiplication
- (c) Dot-product: [X Y] x [[A, C], [B, D]] = [AX + BY, CX + DY]. With VX and VY on the rows, the cell currents are IAX = VX·CA, IBY = VY·CB, ICX = VX·CC, IDY = VY·CD, and the column currents are I1 = IAX + IBY and I2 = ICX + IDY.
- (d) Element-wise multiplication (new primitive): [X Y] o [A B] = [AX, BY], with the multiplier applied as a voltage and the multiplicand stored as a conductance, each product on its own column.
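A small numerical sketch of these two primitives (illustrative values of my own; the element-wise case is modeled here simply as a diagonal conductance matrix so that each input meets only its own weight):

import numpy as np

X, Y = 0.7, 0.3                     # inputs as row voltages (illustrative)
A, B, C, D = 0.2, 0.4, 0.6, 0.8     # weights as conductances (illustrative)

# (c) Dot-product: column currents give [A*X + B*Y, C*X + D*Y]
dot = np.array([X, Y]) @ np.array([[A, C],
                                   [B, D]])

# (d) Element-wise multiplication: one weight per column, so each input
# voltage contributes to exactly one column current -> [A*X, B*Y]
elementwise = np.array([X, Y]) @ np.diag([A, B])

print(dot, elementwise)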

Microarchitecture
- A cluster contains several ReRAM processing units (PUs), a register file, a LUT, and a router.
- Each processing unit = row decoder + RRAM crossbar (XB) + sample-and-hold (S+H) + ADCs + shift-and-add (S+A) unit + registers.

Microarchitecture Parameters
- Array size: 128 x 128; R/W latency: 50 ns; multi-level cell: 2 bits.
- ADC resolution: 5 bits; ADC frequency: 1.2 GSps.
- 8 PUs per array; 128 registers per PU, resolution 2; LUT size: 256 x 8.
(S+H: sample and hold; S+A: shift and add)

ISA

Class                 Opcode        Format                       Cycles
In-situ computation   ADD           <MASK> <DST>                 3
                      DOT           <MASK> <REG_MASK> <DST>      18
                      MUL           <SRC> <SRC> <DST>            18
                      SUB           <SRC> <SRC> <DST>            3
Moves / R/W           MOV           <SRC> <DST>                  3
                      MOVS          <SRC> <DST> <MASK>           3
                      MOVI          <SRC> <IMM>                  1
                      MOVG          <GADDR> <GADDR>              Variable
Misc                  SHIFT{L/R}    <SRC> <SRC> <IMM>            3
                      MASK          <SRC> <SRC> <IMM>            3
                      LUT           <SRC> <SRC>                  4
                      REDUCE_SUM    <SRC> <GADDR>                Variable


Programming Model
KEY OBSERVATION: we need a programming language that merges the concepts of data-flow and SIMD to maximize parallelism.
- Data-flow: side-effect free; explicit dataflow exposes instruction-level parallelism; no dependence on shared-memory primitives.
- SIMD: exposes data-level parallelism.

Execution Model
[Figure: a data flow graph operating on input matrices A and B.]

Execution Model: Modules
- The data flow graph is decomposed by unrolling its innermost dimension.
- A Module is the modularized execution flow applied to the innermost dimension; replicating Modules across elements exposes DLP.

Execution Model: Instruction Blocks
- An Instruction Block (IB) is a partial execution sequence of a Module and is mapped to a single array.
- Splitting a Module into IB1, IB2, ... exposes ILP within the decomposed data flow graph.

Execution Model: Mapping
- Module: the modularized execution flow applied to the innermost dimension.
- Instruction Blocks (IBs): the components of a Module, each mapped to a single ReRAM array.

Execution Model: Summary
- Data flow graphs over input matrices A and B are decomposed into Modules (along the innermost dimension), and Modules into Instruction Blocks, each of which is mapped to a single ReRAM array.


Compilation Flow
- Front end: TensorFlow programs (written in Python, C++, or Java) emit a DFG as a Protocol Buffer, which feeds the IMP compiler.
- IMP compiler: semantic analysis; optimization (node merging, IB expansion, pipelining); backend (instruction lowering, IB scheduling, code generation), all guided by target machine modeling.


Compiler Optimizations: Node Merging
- Exploits multi-operand ADD/SUB and reduces redundant writebacks.
- Example DFG: the adds 2+6 and 3+5 followed by a reduce (8 + 8 = 16) are merged into a single multi-operand add/reduce that produces 16 directly.

Compiler Optimizations: IB Expansion
- Exposes more of a module's parallelism to the architecture.
- Example DFG: the adds 2+6 and 3+5 are expanded across instruction blocks, with pack/unpack operations inserted where values cross IB boundaries.

Compiler Optimizations: Pipelining
- Unpipelined, each operation's compute and writeback (WB) run back to back: Add_0 compute, WB; Add_1 compute, WB; Reduce compute, WB.
- Pipelined, the writeback of one operation overlaps the compute of the next.


Compiler Backend: Instruction Lowering
- Instruction lowering transforms high-level TensorFlow nodes into the memory ISA; e.g., a Div node is lowered via a Newton-Raphson / Maclaurin expansion into LUT, Mul, and Add instructions.
- Division algorithm, q = a / b:
  1. y0 = rcp(b)   (LUT)
  2. q0 = a * y0
  3. e0 = 1 - b * y0
  4. q1 = q0 + e0 * q0
  5. e1 = e0^2
  6. q2 = q1 + e1 * q1
- Supported TF operation nodes: Add, Sub, Mul, Div, Sqrt, Exp, Sum, Less, Conv2D.
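The same six steps in plain Python (my own sketch; the LUT seed here is a hypothetical 8-bit quantized reciprocal standing in for the hardware LUT):

def lowered_divide(a, b, lut_bits=8):
    # Step 1: coarse reciprocal seed, as a LUT lookup would provide.
    y0 = round((1.0 / b) * 2**lut_bits) / 2**lut_bits
    q0 = a * y0            # 2: first quotient estimate
    e0 = 1.0 - b * y0      # 3: relative error of the seed
    q1 = q0 + e0 * q0      # 4: first correction
    e1 = e0 * e0           # 5: error term shrinks quadratically
    q2 = q1 + e1 * q1      # 6: second correction
    return q2

print(lowered_divide(355.0, 113.0), 355.0 / 113.0)  # converges toward a/b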

Compiler Backend: IB Scheduling
- Target # of IBs = 1: the whole DFG runs on a single array, so there is no network delay but a large execution time.

Compiler Backend: IB Scheduling
- Target # of IBs = 2: execution time shrinks (good), but network delay between IB1 and IB2 appears (bad).

Compiler Backend: IB Scheduling with Bottom-Up Greedy [Ellis 1986]
- Collect candidate assignments, then make final assignments.
- Minimize data-transfer latency by taking both operand and successor locations into consideration.


Bottom-Up Greedy Example (step 1)
- Node 1 is assigned to IB1 because IB1 is closer to its operand locations.

Bottom-Up Greedy Example (step 2)
- Node 2 is assigned to IB2 because earlier time slots are available there.

Bottom-Up Greedy Example (step 3)
- The next node is assigned to IB1 because that gives better overlap of communication and computation.
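To make the flavor of this greedy placement concrete, here is a deliberately simplified sketch (my own illustration, not the paper's Bottom-Up Greedy algorithm, which also collects candidate sets and weighs successor locations): each node goes to the IB that minimizes an estimated cost combining operand-transfer delay and the earliest free slot.

# NETWORK_HOP_DELAY and the cost model below are illustrative assumptions.
NETWORK_HOP_DELAY = 4

def greedy_ib_schedule(dfg_nodes, num_ibs):
    """dfg_nodes: topologically ordered list of (node_id, [operand_ids])."""
    placement, ib_free_at = {}, [0] * num_ibs
    for node, operands in dfg_nodes:
        best_ib, best_cost = 0, None
        for ib in range(num_ibs):
            # Operands placed on other IBs must be moved over the network.
            comm = sum(NETWORK_HOP_DELAY
                       for op in operands if placement.get(op, ib) != ib)
            cost = max(ib_free_at[ib], comm)   # can start once both are ready
            if best_cost is None or cost < best_cost:
                best_ib, best_cost = ib, cost
        placement[node] = best_ib
        ib_free_at[best_ib] = best_cost + 1    # occupy one time slot
    return placement

print(greedy_ib_schedule([("a", []), ("b", []), ("c", ["a", "b"])], num_ibs=2))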

Evaluation Methodology
Benchmarks:
- PARSEC 3.0: Blackscholes, Canneal, Fluidanimate
- Rodinia: Backprop, Hotspot, Kmeans, Streamcluster
Platforms:
- CPU (2 sockets): Intel Xeon E5-2597 v3, 3.6 GHz, 28 cores, 56 threads; 78.96 MB on-chip memory; 64 GB DRAM off-chip; Intel VTune Amplifier (performance) and the Intel RAPL interface (power).
- GPU (1 card): NVIDIA Titan Xp, 1.6 GHz, 3840 CUDA cores; 9.14 MB on-chip memory; 12 GB DRAM off-chip; NVPROF (performance) and the NVIDIA System Management Interface (power).
- IMP: 20 MHz ReRAM, 4096 tiles, 64 ReRAM PUs per tile; 8,590 MB on-chip memory; cycle-accurate simulator (Booksim integrated) with trace-based power simulation.

Offloaded Kernel / Application Speedup (vs. CPU)
[Figure: offloaded-kernel speedup on a log scale, and normalized execution time broken into kernel, data loading, NoC, and sequential+barrier components, for Blackscholes, Fluidanimate, Canneal, Streamcluster, and their geomean, comparing CPU and IMP.]
- Offloaded kernels are 41x faster and whole applications 7.5x faster (geomean).
- The capacity limitation of IMP sets the upper bound on the performance improvement.