Lecture 24 Near Data Computing II


1 EECS 570 Lecture 24 Near Data Computing II Winter 2018 Prof. Satish Narayanasamy EECS 570 Lecture 24 Slide 1

2 Readings ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars, ISCA 2016, Shafiee et al. In-Memory Data Parallel Processor, ASPLOS 2018, Fujiki, Mahlke, Das.

3 Executive Summary Classifying images is in vogue: lots of vector-matrix multiplication, and conv nets are the best. Analog memristor crossbars are a great fit, but analog-to-digital conversion carries overheads; smart encoding reduces them. ISAAC is 14.8x better in throughput and 5.5x better in energy than the digital state of the art (DaDianNao). A balanced pipeline is critical for high efficiency, and preserving high precision is essential in analog.

4 State-of-the-art Convolutional Neural Networks Deep Residual Networks: 152 layers! 11 billion operations! Convolution layers, pooling layers, fully connected layers.

5 Convolution Operation [Figure: three 2x2 kernels (Kx = Ky = 2) slide over Ni = 3 input channels with stride Sx = Sy = 1, producing No = 3 output feature maps of size Nx x Ny.]

6 Memristor Dot-product Engine [Figure: a two-cell column computes I1 = V1·G1 and I2 = V2·G2, summed on the bitline as I = I1 + I2 = V1·G1 + V2·G2; a 4x4 crossbar of weights w00..w33 maps inputs x0..x3 to outputs y0..y3.]
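The analog dot product above can be modeled in a few lines: each column current is the sum of (input voltage x cell conductance) down that column, per Ohm's and Kirchhoff's laws. This is a behavioral sketch; the function and variable names are illustrative, not from the paper.

```python
def crossbar_dot(voltages, conductances):
    """voltages: list of row inputs V_i; conductances: matrix G[i][j].
    Returns the column currents I_j = sum_i V_i * G[i][j]."""
    cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(cols)]

x = [1.0, 2.0]            # row input voltages V1, V2
G = [[0.5, 1.0],          # cell conductances (the stored weights)
     [0.25, 0.0]]
print(crossbar_dot(x, G)) # I_j = V1*G[0][j] + V2*G[1][j]
```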

7 Memristor Dot-product Engine [Figure: each kernel's weights map onto one crossbar column; same convolution parameters as before (Kx = Ky = 2, Ni = 3, No = 3, stride Sx = Sy = 1).]

8 Crossbar [Figure: each 16-bit input neuron is streamed 1 bit at a time over 16 iterations, and each 16-bit weight is stored as eight 2-bit cells spread across adjacent columns.]
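This bit-serial scheme can be sketched in software: stream one input bit per iteration, split each weight into 2-bit slices, and recombine partial sums with shift-and-add. A reconstruction of the arithmetic only (slice widths and names are taken from the slide; the code is not ISAAC's implementation):

```python
def split_weight(w, slices=8, bits=2):
    """Little-endian 2-bit slices of a 16-bit weight."""
    mask = (1 << bits) - 1
    return [(w >> (bits * i)) & mask for i in range(slices)]

def bit_serial_mul(x, w, in_bits=16, slices=8, cell_bits=2):
    """Multiply a 16-bit input by a 16-bit weight the crossbar way:
    one input bit per iteration, shift-and-add across iterations."""
    ws = split_weight(w, slices, cell_bits)
    acc = 0
    for i in range(in_bits):
        bit = (x >> i) & 1
        # analog stage: every 2-bit slice sees the same input bit
        partial = sum(s << (cell_bits * j) for j, s in enumerate(ws)) * bit
        acc += partial << i          # digital shift-and-add
    return acc

print(bit_serial_mul(3, 5))  # -> 15, same as 3 * 5
```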

9 ISAAC Organization [Figure: in-situ multiply-accumulate unit: input register feeds digital-to-analog converters driving the crossbar rows over 16 iterations; analog-to-digital conversion and shift-and-add combine the partial outputs into the output register, followed by a sigmoid unit.]

10 An ISAAC Chip Inter-Tile Pipelined [Figure: layers 1, 2, and 3 mapped to eDRAM tiles 1, 2, and 3, pipelined across tiles.]

11 Balanced Pipeline Layer i with Sx = 1 and Sy = 2: replicate layer i-1 two times to match its consumption rate. [Figure: storage allocation in the inter-layer buffer: entries progress from not computed yet, to received from the previous layer, to serviced and released.]

12 Balanced Pipeline [Figure: 128x128 crossbars allocated per layer; layers are replicated in proportion to the strides of their successors (Sx = Sy = 2; Sx = 1, Sy = 2; Sx = Sy = 2) so that all stages run at the same rate.]

13 The ADC Overhead Large area, power hungry; area and power increase exponentially with ADC resolution and frequency.

14 The ADC Overhead ADC resolution = log2(R) + v + w, minus 1 when v = 1. [Figure: R rows of memristor cells, each row driven by a v-bit input and each cell storing w bits.] With R = 128, v = 1, and w = 2: ADC resolution = 7 + 1 + 2 - 1 = 9 bits.
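The resolution formula is easy to check numerically; a small helper (the function name is mine) reproduces the slide's 9-bit figure:

```python
import math

def adc_resolution(r, v, w):
    """Bits needed to resolve the worst-case column sum of r products
    of a v-bit input and a w-bit cell; the -1 applies when v = 1."""
    bits = int(math.log2(r)) + v + w
    return bits - 1 if v == 1 else bits

print(adc_resolution(128, 1, 2))  # -> 9, matching the slide
```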

15 Encoding Scheme If the column sum under the maximal input would set MSB = 1, store the weights in flipped form such that the MSB is always 0. Effective ADC resolution required = 8 bits. [Figure: a column of memristor cells w0,0 ... w0,R-1 driven by the maximal input.]
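A toy model of this flipped-weight idea (my reconstruction, not ISAAC's circuit): if a column's response to the all-ones "maximal" input would set the ADC's top bit, store each cell's complement and recover the true sum digitally, so the ADC only ever needs resolution for the MSB = 0 half of the range. Cell width, row count, and helper names are illustrative.

```python
CELL_MAX = 3                       # 2-bit cells
ROWS = 128
HALF = (CELL_MAX * ROWS) // 2      # MSB threshold of the column sum

def encode_column(weights):
    """Store weights flipped if the maximal-input sum would set the MSB."""
    flipped = sum(weights) > HALF
    stored = [CELL_MAX - w for w in weights] if flipped else list(weights)
    return stored, flipped

def column_sum(stored, flipped):
    """Recover the true maximal-input column sum from what the ADC sees."""
    s = sum(stored)                           # digitized by the ADC
    return CELL_MAX * ROWS - s if flipped else s

ws = [3] * ROWS                    # worst case: raw sum 384 sets the MSB
stored, f = encode_column(ws)
print(sum(stored), column_sum(stored, f))
```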

16 Handling Signed Arithmetic Input neurons: 2's complement; the MSB carries weight -2^15, so for the 16th iteration do a shift-and-subtract. Weights: like the FP exponent representation, stored with a bias of 2^15; subtract as many biases as the number of 1s in the input.

17 Analysis Metrics 1) CE: Computational Efficiency -> GOPS/mm2 2) PE: Power Efficiency -> GOPS/W 3) SE: Storage Efficiency -> MB/mm2

18 Design Space Exploration 1) rows per crossbar 2) ADCs per IMA 3) crossbars per IMA 4) IMAs per tile

19 Design Space Exploration [Charts: GOPS/mm2 and GOPS/W across various design points for the ISAAC-PE, ISAAC-CE, and ISAAC-SE configurations.]

20 Power Contribution [Pie charts: chip power breakdown by component, including the router (3%) and HyperTransport links; remaining slice labels are part of the figure.]

21 Improvement over DaDianNao (Throughput) Throughput: 14.8x better because: 1. Memristor crossbars have high computational parallelism. 2. DaDianNao fetches both inputs and weights from eDRAM; ISAAC fetches just inputs. 3. DaDianNao suffers from bandwidth limitations in fully connected layers. ISAAC requires more power but is 5.5x better in terms of energy for the same reasons. [Chart: results across deep neural net benchmarks.]

22 Conclusion Takes advantage of analog in-situ computing. Fetches just the input neurons. Handles ADC overheads with smart encoding. Does not compromise on output precision. Is faster than DaDianNao due to 8x better computational efficiency and a balanced pipeline keeping all units busy. A few questions still remain: can online training be integrated?

23 In-Memory Data Parallel Processor Daichi Fujiki Scott Mahlke Reetuparna Das M-Bits Research Group

24 "Data movement is what matters, not arithmetic" - Bill Dally [Figure: data-parallel applications on CPU (OoO, many-core SIMD) and GPU (many-thread SIMT/SIMD); data communication costs far exceed arithmetic (1000x and 40x annotations).]

25 In-Memory Computing exposes parallelism while minimizing data movement cost: in-situ computing, massive parallelism (SIMD slots over dense memory arrays), high bandwidth / low data movement.

26 In-Memory Computing Reduces Data Movement [Figure: (a) addition: I11 = (Vdd/2)·C11, I21 = (Vdd/2)·C21, so I1 = (Vdd/2)(C11 + C21); (b) dot product: I1 = V1·C11 + V2·C21, I2 = V1·C12 + V2·C22.]

27 In-Memory Computing Exposes Parallelism [Table comparing CPU (2-socket Intel Xeon E v3), GPU (NVIDIA TITAN Xp), and ReRAM scaled from ISAAC*: area (mm2), TDP (W), on-chip memory (8,590 MB for ReRAM), SIMD slots (448 vs 3,840 vs 2,097,152), frequency (GHz), and SIMD-frequency product (3,227 vs 6,086 vs 41,953).]

28 In-Memory Computing Today [Figure: ReRAM dot-product accelerator: I11 = V1·C11, I12 = V1·C12, I21 = V2·C21, I22 = V2·C22; column currents I1 = I11 + I21 and I2 = I12 + I22 give multiplication + summation.] ReRAM dot-product accelerators: PRIME [Chi 2016, ISCA], ISAAC [Shafiee 2016, ISCA], Dot-Product Engine [Hu 2016], PipeLayer [Song 2017, HPCA].

29 In-Memory Computing: no demonstration of general-purpose computing yet. How to program? No established programming model / execution model; limited computation primitives.

30 In-Memory Data Parallel Processor Overview HW: microarchitecture, memory ISA (ADD, DOT, MUL, SUB, MOV, MOVS, MOVI, MOVG, SHIFT{L/R}, MASK, LUT, REDUCE_SUM), execution model. SW: programming model (data flow graph), IMP compiler (modules split into instruction blocks IB1, IB2 to expose ILP and DLP).

31 HW Processor Architecture - ISA - Execution Model - Programming Model - Compiler SW Computation Primitives [Figure: information is stored in analog as cell conductance C = 1/resistance; values are written as conductances and read back as currents.]

32 Computation Primitives [Figure: Ohm's law gives multiplication, IA = (Vdd/2)·CA; Kirchhoff's law gives addition, I = IA + IB. (a) Addition (b) Subtraction* (*new primitive).]

33 Computation Primitives [Figure: (a) addition: I = (Vdd/2)(CA + CB); (b) subtraction* (new primitive): the column current evaluates (Vdd/2)(CA - CB).]

34 Computation Primitives [Figure: (c) dot product: (X Y)·(A C; B D) = (AX + BY, CX + DY), with multipliers applied as row voltages and multiplicands stored as conductances; (d) element-wise multiplication* (new primitive): (A B) ⊙ (X Y) = (AX, BY).]

35 Microarchitecture [Figure: a cluster contains ReRAM processing units, a register file, a LUT, and a router; each processing unit = RRAM crossbar (XB) + row decoder + sample-and-hold (S+H) + ADCs + shift-and-add (S+A) unit + registers.]

36 Microarchitecture [Table: array size 128 x 128; R/W latency 50 ns; multi-level cell: 2 bits; ADC resolution: 5 bits; ADC frequency: 1.2 GSps; 8 PUs/array; 128 registers/PU; register resolution: 2 bits; LUT size: 256 x 8. S+H = sample and hold.]

37 ISA (in-situ computation, moves, R/W, misc) Opcode, format, cycles: ADD <MASK> <DST>, 3; DOT <MASK> <REG_MASK> <DST>, 18; MUL <SRC> <SRC> <DST>, 18; SUB <SRC> <SRC> <DST>, 3; MOV <SRC> <DST>, 3; MOVS <SRC> <DST> <MASK>, 3; MOVI <SRC> <IMM>, 1; MOVG <GADDR> <GADDR>, variable; SHIFT{L/R} <SRC> <SRC> <IMM>, 3; MASK <SRC> <SRC> <IMM>, 3; LUT <SRC> <SRC>, 4; REDUCE_SUM <SRC> <GADDR>, variable.


39 Programming Model KEY OBSERVATION: we need a programming language that merges the concepts of data-flow and SIMD to maximize parallelism. Data-flow: side-effect free; explicit dataflow exposes instruction-level parallelism; no dependence on shared-memory primitives. SIMD: data-level parallelism.

40 Execution Model [Figure: input matrices A and B feed a data flow graph.]

41 Execution Model [Figure: the data flow graph is decomposed by unrolling the innermost dimension into Modules; a Module is the modularized execution flow applied to the innermost dimension, exposing DLP.]

42 Execution Model [Figure: each Module is a partial execution sequence split into Instruction Blocks (IBs), e.g. IB1 and IB2, each mapped to a single array; IBs expose ILP.]

43 Execution Model [Figure: Modules (the execution flow for the innermost dimension) consist of Instruction Blocks, each mapped to a single ReRAM array.]

44 Execution Model [Figure: the full picture: data flow graphs over input matrices A and B decompose into Modules, whose Instruction Blocks map onto ReRAM arrays.]


46 Compilation Flow [Figure: a TensorFlow DFG (protocol buffer) from Python/C++/Java front ends enters the IMP compiler: semantic analysis, target machine modeling, optimization (NodeMerging, IB Expansion, Pipelining), and backend (Instruction Lowering, IB Scheduling, CodeGen).]


48 Compiler Optimizations: NodeMerging Exploit multi-operand ADD/SUB and reduce redundant writebacks. [Figure: two 8-element Add nodes feeding an Add are merged into a single 16-element Add + Reduce.]
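The merging step can be sketched as a small tree-flattening pass (a toy reconstruction; the IMP compiler operates on TensorFlow DFG nodes, not Python tuples): nested two-operand Add nodes are fused into one multi-operand Add, mirroring the slide's 8+8 -> 16 example and eliminating the intermediate writeback.

```python
def merge_adds(node):
    """node: ('add', [children]) tree with ints at the leaves.
    Returns the tree with nested 'add' nodes fused into their parent."""
    if not isinstance(node, tuple):
        return node                    # leaf operand
    op, kids = node
    flat = []
    for k in map(merge_adds, kids):    # merge bottom-up
        if isinstance(k, tuple) and k[0] == op == "add":
            flat.extend(k[1])          # fuse child add into parent
        else:
            flat.append(k)
    return (op, flat)

t = ("add", [("add", [8, 8]), ("add", [8, 8])])
print(merge_adds(t))                   # one 4-operand add
```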

49 Compiler Optimizations: IB Expansion Expose more parallelism in a module to the architecture. [Figure: an Add over wide operands is expanded, with Pack/Unpack nodes, into parallel Adds spread across instruction blocks.]

50 Compiler Optimizations: Pipelining [Figure: without pipelining, Compute Add_0/WB, Compute Add_1/WB, and Compute Reduce/WB run back-to-back; with pipelining, each writeback overlaps the next compute.]


52 Compiler Backend: Instruction Lowering Transform high-level TF instructions into the memory ISA; e.g., Div (a high-level TF node) lowers via Newton-Raphson / Maclaurin expansion into Add, LUT, and Mul memory-ISA instructions. Division algorithm, q = a / b: 1. y0 = rcp(b) (LUT) 2. q0 = a·y0 3. e0 = 1 - b·y0 4. q1 = q0 + e0·q0 5. e1 = e0^2 6. q2 = q1 + e1·q1. Supported TF operation nodes: Add, Sub, Mul, Div, Sqrt, Exp, Sum, Less, Conv2D.
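The six-step lowering above can be traced directly in software. This sketch assumes a LUT that returns the reciprocal rounded to 8 fractional bits and a divisor normalized into [1, 2), as hardware reciprocal seeds typically are; `lut_rcp` is a stand-in for the real LUT instruction, and only the arithmetic structure follows the slide.

```python
def lut_rcp(b, bits=8):
    """Hypothetical LUT: 1/b rounded to `bits` fractional bits."""
    return round((1.0 / b) * (1 << bits)) / (1 << bits)

def imp_div(a, b):
    """Newton-Raphson division using only add/mul/LUT-style steps."""
    y0 = lut_rcp(b)          # 1. reciprocal seed from LUT
    q0 = a * y0              # 2. first quotient estimate
    e0 = 1 - b * y0          # 3. residual error
    q1 = q0 + e0 * q0        # 4. first refinement
    e1 = e0 * e0             # 5. squared error
    return q1 + e1 * q1      # 6. second refinement

print(imp_div(1.0, 1.5))     # converges to 2/3
```

Each refinement squares the error, so an 8-bit seed reaches roughly 32-bit relative accuracy after the two multiply-add steps.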

53 Compiler Backend: IB Scheduling [Figure: with a target of 1 IB, the whole DFG maps to IB1, giving a large execution time.]

54 Compiler Backend: IB Scheduling [Figure: with a target of 2 IBs, execution time shrinks (good :)) but network delay appears between IB1 and IB2 (bad :().]

55 Compiler Backend: IB Scheduling Bottom-Up Greedy [Ellis 1986]: collect candidate assignments, then make final assignments; minimize data transfer latency by taking both operand and successor locations into consideration.


57 Compiler Backend: IB Scheduling [Figure: BUG example: node 1 is assigned to IB1 because IB1 is closer to its operand locations.]

58 Compiler Backend: IB Scheduling [Figure: node 2 is assigned to IB2 because earlier slots are available there.]

59 Compiler Backend: IB Scheduling [Figure: the next node is assigned to IB1 because that better overlaps communication and computation.]
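The three placement decisions above can be captured in a toy greedy scheduler (illustrative only, not the paper's algorithm): each instruction goes to the IB that minimizes operand-transfer cost plus the earliest free slot, echoing how Bottom-Up Greedy weighs operand locations against slot availability. The cost model and names are assumptions.

```python
def bug_schedule(dag, n_ibs=2, hop_cost=2):
    """dag: list of (name, [operand names]) in topological order.
    Returns {name: ib index} chosen greedily per node."""
    place, free = {}, [0] * n_ibs        # free[i]: next open slot in IB i
    for name, ops in dag:
        best, best_cost = 0, float("inf")
        for ib in range(n_ibs):
            # pay a hop for every operand living in a different IB
            moves = sum(hop_cost for o in ops if place.get(o, ib) != ib)
            cost = free[ib] + moves
            if cost < best_cost:
                best, best_cost = ib, cost
        place[name] = best
        free[best] = best_cost + 1       # occupy the chosen slot
    return place

dag = [("a", []), ("b", []), ("c", ["a", "b"])]
print(bug_schedule(dag))                 # a and b land in separate IBs
```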

60 Evaluation Methodology Benchmarks: PARSEC 3.0 (Blackscholes, Canneal, Fluidanimate); Rodinia (Backprop, Hotspot, Kmeans, Streamcluster). Processors: CPU (2 sockets): Intel Xeon E v3, 3.6 GHz, 28 cores, 56 threads; GPU (1 card): NVIDIA Titan Xp, 1.6 GHz, 3840 CUDA cores; IMP: 20 MHz ReRAM, 4096 tiles, 64 ReRAM PUs/tile. On-chip memory: 9.14 MB (GPU), 8,590 MB (IMP). Off-chip memory: 64 GB DRAM (CPU), 12 GB DRAM (GPU). Performance profilers/simulators: Intel VTune Amplifier, NVPROF, cycle-accurate simulator (BookSim integrated). Power: Intel RAPL interface, NVIDIA System Management Interface, trace-based simulation.

61 Offloaded Kernel / Application Speedup (CPU) [Charts: offloaded-kernel speedup and normalized execution time (kernel, data loading, NoC, sequential+barrier) over CPU for Blackscholes, Fluidanimate, Canneal, Streamcluster, and their geomean; 7.5x annotation on the geomean.] The capacity limitation of IMP sets the upper bound on performance improvement.


More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant

More information

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

More information

Revolutionizing the Datacenter

Revolutionizing the Datacenter Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

Advanced Computer Architecture

Advanced Computer Architecture ECE 563 Advanced Computer Architecture Fall 2010 Lecture 6: VLIW 563 L06.1 Fall 2010 Little s Law Number of Instructions in the pipeline (parallelism) = Throughput * Latency or N T L Throughput per Cycle

More information

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141 EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 14 EE141 Outline Parallelism EE141 2 Parallelism Parallelism is the act of doing more

More information

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters

More information

Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15: Dataflow

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The

More information

Mohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu

Mohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu Mohsen Imani University of California San Diego Winter 2016 Technology Trend for IoT http://www.flashmemorysummit.com/english/collaterals/proceedi ngs/2014/20140807_304c_hill.pdf 2 Motivation IoT significantly

More information

Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism

Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Jiecao Yu 1, Andrew Lukefahr 1, David Palframan 2, Ganesh Dasika 2, Reetuparna Das 1, Scott Mahlke 1 1 University of Michigan 2 ARM

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood

gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

EECS150 - Digital Design Lecture 09 - Parallelism

EECS150 - Digital Design Lecture 09 - Parallelism EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization

More information

NVIDIA FOR DEEP LEARNING. Bill Veenhuis

NVIDIA FOR DEEP LEARNING. Bill Veenhuis NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA

More information

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications ME964 High Performance Computing for Engineering Applications Execution Scheduling in CUDA Revisiting Memory Issues in CUDA February 17, 2011 Dan Negrut, 2011 ME964 UW-Madison Computers are useless. They

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real

More information

High Performance Computing Hiroki Kanezashi Tokyo Institute of Technology Dept. of mathematical and computing sciences Matsuoka Lab.

High Performance Computing Hiroki Kanezashi Tokyo Institute of Technology Dept. of mathematical and computing sciences Matsuoka Lab. High Performance Computing 2015 Hiroki Kanezashi Tokyo Institute of Technology Dept. of mathematical and computing sciences Matsuoka Lab. 1 Reviewed Paper 1 DaDianNao: A Machine- Learning Supercomputer

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

Lecture 4: Instruction Set Architectures. Review: latency vs. throughput

Lecture 4: Instruction Set Architectures. Review: latency vs. throughput Lecture 4: Instruction Set Architectures Last Time Performance analysis Amdahl s Law Performance equation Computer benchmarks Today Review of Amdahl s Law and Performance Equations Introduction to ISAs

More information

15-740/ Computer Architecture, Fall 2011 Midterm Exam II

15-740/ Computer Architecture, Fall 2011 Midterm Exam II 15-740/18-740 Computer Architecture, Fall 2011 Midterm Exam II Instructor: Onur Mutlu Teaching Assistants: Justin Meza, Yoongu Kim Date: December 2, 2011 Name: Instructions: Problem I (69 points) : Problem

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Norwegian University of Science and Technology Department of Computer and Information Science Page 1 of 13 Contact: Magnus Jahre (952 22 309) TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Monday 4. June Time:

More information

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

More information

Vertex Shader Design II

Vertex Shader Design II The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only

More information

THE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research

THE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research THE PATH TO EXASCALE COMPUTING Bill Dally Chief Scientist and Senior Vice President of Research The Goal: Sustained ExaFLOPs on problems of interest 2 Exascale Challenges Energy efficiency Programmability

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Pipelining. CS701 High Performance Computing

Pipelining. CS701 High Performance Computing Pipelining CS701 High Performance Computing Student Presentation 1 Two 20 minute presentations Burks, Goldstine, von Neumann. Preliminary Discussion of the Logical Design of an Electronic Computing Instrument.

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

GPU Microarchitecture Note Set 2 Cores

GPU Microarchitecture Note Set 2 Cores 2 co 1 2 co 1 GPU Microarchitecture Note Set 2 Cores Quick Assembly Language Review Pipelined Floating-Point Functional Unit (FP FU) Typical CPU Statically Scheduled Scalar Core Typical CPU Statically

More information

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al. Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang

More information

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan

More information

Single Instructions Can Execute Several Low Level

Single Instructions Can Execute Several Low Level We have made it easy for you to find a PDF Ebooks without any digging. And by having access to our ebooks online or by storing it on your computer, you have convenient answers with single instructions

More information

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,

Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ, Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - fabio.baruffa@lrz.de LRZ, 27.6.- 29.6.2016 Architecture Overview Intel Xeon Processor Intel Xeon Phi Coprocessor, 1st generation Intel Xeon

More information

ECE 154A Introduction to. Fall 2012

ECE 154A Introduction to. Fall 2012 ECE 154A Introduction to Computer Architecture Fall 2012 Dmitri Strukov Lecture 10 Floating point review Pipelined design IEEE Floating Point Format single: 8 bits double: 11 bits single: 23 bits double:

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information