Lecture 24 Near Data Computing II
1 EECS 570 Lecture 24: Near Data Computing II. Winter 2018, Prof. Satish Narayanasamy. EECS 570 Lecture 23 Slide 1
2 Readings
- ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars, ISCA 2016, Shafiee et al.
- In-Memory Data Parallel Processor, ASPLOS 2018, Fujiki, Mahlke, Das.
3 Executive Summary
- Classifying images is in vogue; it involves lots of vector-matrix multiplication, and conv nets are the best at it.
- An analog memristor crossbar is a great fit, but it incurs analog-to-digital conversion overheads; smart encoding reduces those overheads.
- ISAAC is 14.8x better in throughput and 5.5x better in energy than the digital state of the art (DaDianNao).
- A balanced pipeline is critical for high efficiency.
- Preserving high precision is essential in analog.
4 State-of-the-Art Convolutional Neural Networks
Deep residual networks: 152 layers, 11 billion operations! Built from convolution layers, pooling layers, and fully connected layers.
5 Convolution Operation
(figure: an N_x x N_y x N_i input convolved with kernels 0-2; K_x = K_y = 2, N_i = 3, N_o = 3, stride S_x = S_y = 1)
6 Memristor Dot-Product Engine
Each cell obeys Ohm's law (I1 = V1.G1, I2 = V2.G2), and the bitline sums currents by Kirchhoff's law: I = I1 + I2 = V1.G1 + V2.G2. A 4x4 crossbar of conductances w_00..w_33 thus maps inputs x_0..x_3 to outputs y_0..y_3 in a single step.
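The dot-product relation above can be checked numerically. Below is a minimal digital model of the analog crossbar (the voltage and conductance values are hypothetical, chosen only for illustration):

```python
import numpy as np

def crossbar_dot(V, G):
    """Model a memristor crossbar: each cell contributes I = V_i * G_ij
    (Ohm's law), and each column's bitline sums its cells' currents
    (Kirchhoff's law), so the whole array computes one vector-matrix product."""
    return np.dot(V, G)

V = np.array([1.0, 0.5, 0.25, 0.0])   # row voltages x0..x3 (hypothetical)
G = np.array([[2.0, 0.0, 1.0, 0.0],   # conductances w_ij (hypothetical)
              [0.0, 2.0, 1.0, 0.0],
              [0.0, 0.0, 2.0, 0.0],
              [0.0, 0.0, 0.0, 2.0]])
I = crossbar_dot(V, G)                 # column currents y0..y3
```

The one-step character of the analog array is what the model's single `np.dot` call stands in for.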
7 Memristor Dot-Product Engine
(figure: the three convolution kernels of the previous slide laid out as crossbar columns; same K_x = K_y = 2, N_i = 3, N_o = 3, stride S_x = S_y = 1 parameters)
8 Crossbar
Each 16-bit input neuron is streamed into the crossbar 1 bit at a time over 16 iterations, and each 16-bit weight is split across eight 2-bit cells; shift-and-add logic recombines the partial results.
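This bit-serial, bit-sliced scheme can be sketched as follows (the function and its digital arithmetic are my own model of the idea; the real array evaluates each iteration in analog):

```python
def crossbar_mac(x, w, in_bits=16, cell_bits=2):
    """Multiply a 16-bit input by a 16-bit weight the way the crossbar does:
    stream the input 1 bit per iteration, hold the weight in eight 2-bit
    cells, and recombine partial results with shift-and-add."""
    n_cells = in_bits // cell_bits
    cells = [(w >> (cell_bits * k)) & ((1 << cell_bits) - 1)
             for k in range(n_cells)]          # 2-bit weight slices, LSB first
    acc = 0
    for i in range(in_bits):                   # 16 iterations
        bit = (x >> i) & 1                     # one input bit drives the row
        # each cell's contribution is shifted by its slice position...
        partial = sum(c << (cell_bits * k) for k, c in enumerate(cells)) * bit
        acc += partial << i                    # ...and by the input bit position
    return acc
```

After all 16 iterations the accumulated value equals the full 16x16-bit product.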
9 ISAAC Organization
(figure: input register feeds digital-to-analog converters driving the crossbar rows over 16 iterations; analog-to-digital converters, shift-and-add of partial outputs, a sigmoid unit, and an output register complete the pipeline)
10 An ISAAC Chip: Inter-Tile Pipelining
Layers 1, 2, and 3 are mapped to eDRAM tiles 1, 2, and 3, which operate as a pipeline.
11 Balanced Pipeline
If layer i has strides S_x = 1 and S_y = 2, replicate layer i-1 two times so producer and consumer rates match. Storage allocation (figure legend): a buffer entry starts as "not computed yet", becomes "received from previous layer", and is released once serviced.
12 Balanced Pipeline
(figure: thirteen 128x128 crossbars allocated across layers with strides S_x = 2, S_y = 2; S_x = 1, S_y = 2; and S_x = 2, S_y = 2, each layer replicated enough to keep the pipeline balanced)
13 The ADC Overhead
ADCs have large area and are power hungry; area and power increase exponentially with ADC resolution and frequency.
14 The ADC Overhead
With R rows, v-bit inputs, and w-bit memristor cells, ADC resolution = log2(R) + v + w - 1 (if v = 1). For R = 128, v = 1, w = 2: ADC resolution = 9 bits.
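The closed form can be checked against a direct count of the bits needed for the worst-case column sum (a quick sanity check added here, not from the slides):

```python
from math import ceil, log2

def adc_bits(R, v, w):
    """ADC resolution needed to capture the largest possible column sum:
    R rows, each contributing at most (2^v - 1) * (2^w - 1)."""
    max_sum = R * (2**v - 1) * (2**w - 1)
    return ceil(log2(max_sum + 1))

# ISAAC's operating point: R = 128 rows, v = 1 input bit, w = 2 cell bits.
# Closed form for v = 1: log2(R) + v + w - 1 = 7 + 1 + 2 - 1 = 9 bits.
```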
15 Encoding Scheme
If the maximal input would produce an output whose MSB is 1, store that column's weights (w_0,0 .. w_0,R-1) in flipped form so the MSB is always 0. Effective ADC resolution required drops to 8 bits.
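A toy model of the flipped encoding (the helper names and the decode step are my own; the slide only states the flip condition and the saved bit):

```python
def encode_column(weights, w_bits=2):
    """Store a column complemented when its maximal-input sum would set the
    top (9th) bit, so the stored column's worst-case sum fits in 8 bits."""
    max_cell = (1 << w_bits) - 1
    full = len(weights) * max_cell            # sum if every cell were maximal
    flip = sum(weights) > full // 2           # MSB would be 1 under max input
    stored = [max_cell - w for w in weights] if flip else list(weights)
    return stored, flip

def decode_sum(stored_sum, flip, rows, w_bits=2):
    """Recover the true maximal-input column sum from the ADC reading."""
    max_cell = (1 << w_bits) - 1
    return rows * max_cell - stored_sum if flip else stored_sum
```

Whenever `flip` is set, the stored column's sum is at most half the original range, which is exactly the one-bit saving the slide claims.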
16 Handling Signed Arithmetic
Input neurons use 2's complement: the MSB carries weight -2^15, so the 16th iteration does a shift-and-subtract instead of a shift-and-add. Weights use a biased representation (like the FP exponent representation) with a bias of 2^15; subtract as many biases as the number of 1s in the input.
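The input side of this scheme can be sketched as follows (the weight-bias bookkeeping is omitted; this models only the 2's-complement input handling, and the function names are mine):

```python
def signed_bitserial_dot(xs, ws, bits=16):
    """Bit-serial dot product with 2's-complement inputs: 15 iterations of
    shift-and-add, then one shift-and-subtract for the sign bit, whose
    place value is -2^15."""
    acc = 0
    for i in range(bits):
        partial = sum(((x >> i) & 1) * w for x, w in zip(xs, ws))
        if i == bits - 1:
            acc -= partial << i     # 16th iteration: shift-and-subtract
        else:
            acc += partial << i
    return acc

def to_u16(x):
    """Encode a signed value in 16-bit 2's complement for the model."""
    return x & 0xFFFF
```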
17 Analysis Metrics
1) CE: Computational Efficiency -> GOPS/mm^2
2) PE: Power Efficiency -> GOPS/W
3) SE: Storage Efficiency -> MB/mm^2
18 Design Space Exploration
1) rows per crossbar
2) ADCs per IMA
3) crossbars per IMA
4) IMAs per tile
19 Design Space Exploration
(figure: GOPS/mm^2 and GOPS/W across various design points, marking the ISAAC-PE, ISAAC-CE, and ISAAC-SE configurations)
20 Power Contribution
(figure: pie charts of power and area contributions; legible labels include Router 3% and HyperTransport 16%, with the largest shares at 49% and 58%)
21 Improvement over DaDianNao (Throughput)
Throughput is 14.8x better because:
1. Memristor crossbars have high computational parallelism.
2. DaDianNao fetches both inputs and weights from eDRAM; ISAAC fetches just inputs.
3. DaDianNao suffers from bandwidth limitations in fully connected layers.
ISAAC requires more power but, for the same reasons, is 5.5x better in terms of energy. (Evaluated on deep neural net benchmarks.)
22 Conclusion
- Takes advantage of analog in-situ computing.
- Fetches just the input neurons.
- Handles ADC overheads with smart encoding.
- Does not compromise on output precision.
- Is faster than DaDianNao due to 8x better computational efficiency and a balanced pipeline that keeps all units busy.
- A few questions still remain: can it integrate online training?
23 In-Memory Data Parallel Processor
Daichi Fujiki, Scott Mahlke, Reetuparna Das. M-Bits Research Group.
24 "Data movement is what matters, not arithmetic" - Bill Dally
(figure: data-parallel applications on CPU (OoO, SIMD) vs. GPU (many-core, many-thread SIMT/SIMD); data communication dwarfs arithmetic, with 1000x and 40x cost ratios annotated)
25 In-memory computing exposes parallelism while minimizing data movement cost: compared with CPU and GPU, it offers in-situ computing, massive parallelism (SIMD slots over dense memory arrays), and high bandwidth with low data movement.
26 In-Memory Computing Reduces Data Movement
(figure: (a) addition: both cells driven at Vdd/2, so I11 = (Vdd/2)C11, I21 = (Vdd/2)C21, and I1 = (Vdd/2)(C11 + C21); (b) dot product: I11 = V1C11, I21 = V2C21, etc., with bitline sums I1 = I11 + I21 and I2 = I12 + I22)
27 In-Memory Computing Exposes Parallelism
                      CPU (2 sockets)    GPU                ReRAM
                      Intel Xeon E v3    NVIDIA Titan Xp    (scaled from ISAAC*)
SIMD slots            448                3,840              2,097,152
On-chip memory (MB)   -                  -                  8,590
SIMD x Freq product   3,227              6,086              41,953
(area, TDP, and frequency rows also appear in the original table)
28 In-Memory Computing Today
ReRAM dot-product accelerators perform multiplication (I11 = V1C11, I21 = V2C21, ...) plus summation (I1 = I11 + I21, I2 = I12 + I22) inside the array. Examples: PRIME [Chi 2016, ISCA], ISAAC [Shafiee 2016, ISCA], Dot-Product Engine [Hu 2016, ], PipeLayer [Song 2017, HPCA].
29 In-Memory Computing: No Demonstration of General-Purpose Computing
How to program it? There is no established programming model or execution model, and only limited computation primitives.
30 In-Memory Data Parallel Processor: Overview
HW: a microarchitecture and a memory ISA (ADD, DOT, MUL, SUB, MOV, MOVS, MOVI, MOVG, SHIFT{L/R}, MASK, LUT, REDUCE_SUM).
SW: a programming model, an execution model, and the IMP compiler, which maps a program's data-flow graph into modules and instruction blocks (IB1, IB2, ...), exploiting both ILP and DLP.
31 Computation Primitives
(outline: HW: Processor Architecture, ISA; SW: Execution Model, Programming Model, Compiler)
Information is stored in analog form as cell conductance (C = 1/resistance); values are written as conductances and read back as currents.
32 Computation Primitives
(figure: (a) addition: cells C_A and C_B driven at Vdd/2; Ohm's law gives the per-cell multiply I_A = (Vdd/2)C_A, and Kirchhoff's law gives the add I = I_A + I_B)
33 Computation Primitives
(figure: (b) subtraction, a new primitive: the subtrahend's cell contributes with opposite sign, so the bitline evaluates I1 = (Vdd/2)(C_A - C_B))
34 Computation Primitives
(figure: (c) dot product: [X Y] x [[A B],[C D]] = [AX + BY, CX + DY], with I1 = I_AX + I_CX and I2 = I_BY + I_DY summed on the bitlines; (d) element-wise multiplication, a new primitive: the multiplier is applied as a voltage, e.g. I11 = (Vdd - V1)C11)
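The current-domain primitives above can be mirrored in a few lines. This is an idealized digital model with assumed biasing (the exact circuit details did not survive the transcription), not the hardware itself:

```python
VDD = 1.0

def analog_add(cA, cB):
    """Addition: both cells driven at Vdd/2; the bitline sums their
    currents, giving I = (Vdd/2) * (cA + cB)."""
    return (VDD / 2) * (cA + cB)

def analog_sub(cA, cB):
    """Subtraction (new primitive): the subtrahend contributes with the
    opposite sign, giving I = (Vdd/2) * (cA - cB)."""
    return (VDD / 2) * (cA - cB)

def analog_dot(v, col):
    """Dot product: row voltages v drive one column of conductances;
    the bitline current is the inner product."""
    return sum(vi * ci for vi, ci in zip(v, col))
```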
35 Microarchitecture
A cluster contains multiple ReRAM processing units (PUs), a register file, a LUT, and a router. Each processing unit = RRAM crossbar (XB) + sample-and-hold (S+H) + ADCs + shift-and-add unit (S+A) + registers, with a row decoder driving the array.
36 Microarchitecture: Parameters
Array size: 128 x 128        R/W latency: 50 ns
Multi-level cell: 2 bits     ADC resolution: 5 bits
ADC frequency: 1.2 GSps      PUs/array: 8
Regs/PU: 128                 LUT size: 256 x 8 (resolution 2)
(S+H = sample and hold; S+A = shift and add)
37 ISA
Opcode        Format                       Cycles
ADD           <MASK> <DST>                 3
DOT           <MASK> <REG_MASK> <DST>      18
MUL           <SRC> <SRC> <DST>            18
SUB           <SRC> <SRC> <DST>            3
MOV           <SRC> <DST>                  3
MOVS          <SRC> <DST> <MASK>           3
MOVI          <SRC> <IMM>                  1
MOVG          <GADDR> <GADDR>              Variable
SHIFT{L/R}    <SRC> <SRC> <IMM>            3
MASK          <SRC> <SRC> <IMM>            3
LUT           <SRC> <SRC>                  4
REDUCE_SUM    <SRC> <GADDR>                Variable
(categories: in-situ computation, moves, R/W, misc)
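To make the table concrete, here is a toy software model of two of the ops. The operand semantics (a mask selecting elements to sum, an immediate move) are inferred from the formats above and are assumptions, not the paper's definitions:

```python
def op_add(array, mask, regs, dst):
    """ADD <MASK> <DST>: sum the array elements selected by the mask into
    register dst (3 cycles on the real hardware)."""
    regs[dst] = sum(v for v, m in zip(array, mask) if m)

def op_movi(regs, dst, imm):
    """MOVI <SRC> <IMM>: write an immediate into a register (1 cycle)."""
    regs[dst] = imm

regs = {}
op_movi(regs, "r0", 42)
op_add([1, 2, 3, 4], [1, 0, 1, 1], regs, "r1")   # sums elements 1, 3, 4
```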
39 Programming Model
KEY OBSERVATION: we need a programming language that merges the concepts of data-flow and SIMD to maximize parallelism. Data-flow is side-effect free, and explicit dataflow exposes instruction-level parallelism; SIMD provides data-level parallelism with no dependence on shared-memory primitives.
40 Execution Model
A program over input matrices A and B is represented as a data-flow graph (DFG).
41 Execution Model
The DFG is decomposed by unrolling its innermost dimension into modules: a module is the modularized execution flow applied to the innermost dimension, exposing DLP.
42 Execution Model
Each module is further split into instruction blocks (IB1, IB2, ...): an IB is a partial execution sequence of a module mapped to a single array, and IBs within a module expose ILP.
43 Execution Model
(figure: a module's IB1 and IB2 replicated across modules; each IB is mapped to one ReRAM array)
44 Execution Model
(figure: putting it together: the DFG over input matrices A and B is decomposed into modules along the innermost dimension, each module into IBs, and each IB onto a ReRAM array)
46 Compilation Flow
TensorFlow programs (Python / C++ / Java) are exported as a DFG (protocol buffer) and fed to the IMP compiler: semantic analysis with target machine modeling, then optimization (NodeMerging, IB Expansion, Pipelining), then the backend (Instruction Lowering, IB Scheduling, CodeGen).
48 Compiler Optimizations: NodeMerging
Exploit multi-operand ADD/SUB and reduce redundant writebacks: for example, two 8-operand Adds feeding a Reduce are merged into a single Add + Reduce over 16 operands.
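A sketch of the merging pass over a toy DFG (the node representation is my own):

```python
def merge_adds(node):
    """NodeMerging sketch: collapse nested 'add' nodes into one
    multi-operand add, so the hardware's multi-operand ADD runs once and
    the intermediate writebacks disappear. A node is ('add', [children])
    or a leaf value."""
    if not (isinstance(node, tuple) and node[0] == "add"):
        return node
    operands = []
    for child in node[1]:
        merged = merge_adds(child)
        if isinstance(merged, tuple) and merged[0] == "add":
            operands.extend(merged[1])   # fold the child add's operands upward
        else:
            operands.append(merged)
    return ("add", operands)

tree = ("add", [("add", [1, 2]), ("add", [3, ("add", [4, 5])])])
flat = merge_adds(tree)                  # one add over all five operands
```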
49 Compiler Optimizations: IB Expansion
Expose more parallelism in a module to the architecture: for example, a wide Add is expanded into two parallel Adds, with Pack/Unpack nodes stitching the halves together.
50 Compiler Optimizations: Pipelining
Without pipelining, each node's compute and writeback (WB) run back to back: Add_0 compute then WB, Add_1 compute then WB, Reduce compute then WB. Pipelining overlaps one node's WB with the next node's compute.
52 Compiler Backend: Instruction Lowering
Instruction lowering transforms high-level TF instructions into the memory ISA. A high-level Div node is lowered via Newton-Raphson / Maclaurin into Add, LUT, and Mul instructions.
Division algorithm (q = a / b):
1. y0 = rcp(b)   (LUT)
2. q0 = a * y0
3. e0 = 1 - b * y0
4. q1 = q0 + e0 * q0
5. e1 = e0^2
6. q2 = q1 + e1 * q1
Supported TF operation nodes: Add, Sub, Mul, Div, Sqrt, Exp, Sum, Less, Conv2D.
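Run digitally, the six steps converge quickly even from a coarse seed. Here the 256-entry reciprocal LUT is stood in for by a deliberately sloppy seed with about 10% error (the stand-in is my own; everything else follows the steps above):

```python
def rcp_seed(b):
    """Stand-in for the reciprocal LUT: a seed with ~10% relative error."""
    return 0.9 / b

def imp_div(a, b):
    """Newton-Raphson division as lowered to Add/LUT/Mul instructions:
    y0 = rcp(b); q0 = a*y0; e0 = 1 - b*y0;
    q1 = q0 + e0*q0; e1 = e0^2; q2 = q1 + e1*q1."""
    y0 = rcp_seed(b)        # 1. LUT
    q0 = a * y0             # 2. Mul
    e0 = 1.0 - b * y0       # 3. Mul + Sub
    q1 = q0 + e0 * q0       # 4. Mul + Add
    e1 = e0 * e0            # 5. Mul
    q2 = q1 + e1 * q1       # 6. Mul + Add
    return q2
```

Two refinement steps shrink a 10% seed error to about 10^-4, which is why a small LUT suffices.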
53 Compiler Backend: IB Scheduling
With a target of 1 IB, the whole DFG maps to IB1: there is no network delay, but the execution time is large.
54 Compiler Backend: IB Scheduling
With a target of 2 IBs, the DFG is split between IB1 and IB2: execution time shrinks (good), but network delay appears between the IBs (bad).
55 Compiler Backend: IB Scheduling
Bottom-Up Greedy [Ellis 1986]: collect candidate assignments, then make final assignments, minimizing data transfer latency by taking both operand and successor locations into consideration.
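A toy rendering of the placement step (ignoring slot availability and the candidate-collection phase; the data layout and tie-breaking are my own assumptions, not Ellis' full algorithm):

```python
def bug_place(dfg, operand_loc, num_ibs=2):
    """Bottom-Up Greedy sketch: walk nodes in dependence order and put each
    one on the IB that already holds most of its operands (ties break to
    the lower IB id), approximating minimal data-transfer latency.
    dfg maps node -> list of operand names; operand_loc maps inputs -> IB."""
    placement = {}
    for node, ops in dfg.items():              # dict order assumed topological
        votes = [0] * num_ibs
        for op in ops:
            ib = placement.get(op, operand_loc.get(op))
            if ib is not None:
                votes[ib] += 1
        placement[node] = max(range(num_ibs), key=lambda i: (votes[i], -i))
    return placement

loc = {"a": 0, "b": 0, "c": 1}                 # where the inputs live
plan = bug_place({"x": ["a", "b"], "y": ["x", "c"]}, loc)
```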
57 Compiler Backend: IB Scheduling
(example: node 1 is assigned to IB1 because IB1 is closer to its operand locations)
58 Compiler Backend: IB Scheduling
(example: node 2 is assigned to IB2 because IB2 has earlier slots available)
59 Compiler Backend: IB Scheduling
(example: node 1 is assigned to IB1 because that gives better overlap of communication and computation)
60 Evaluation Methodology
Benchmarks: PARSEC 3.0 (Blackscholes, Canneal, Fluidanimate) and Rodinia (Backprop, Hotspot, Kmeans, Streamcluster).
Processor:       CPU (2 sockets): Intel Xeon E v3, 3.6 GHz, 28 cores, 56 threads
                 GPU (1 card): NVIDIA Titan Xp, 1.6 GHz, 3,840 CUDA cores
                 IMP: 20 MHz ReRAM, 4096 tiles, 64 ReRAM PUs / tile
On-chip memory:  CPU: -            GPU: 9.14 MB       IMP: 8,590 MB
Off-chip memory: CPU: 64 GB DRAM   GPU: 12 GB DRAM
Performance:     CPU: Intel VTune Amplifier; GPU: NVPROF; IMP: cycle-accurate simulator (Booksim integrated)
Power:           CPU: Intel RAPL interface; GPU: NVIDIA System Management Interface; IMP: trace-based simulation
61 Offloaded Kernel / Application Speedup (CPU)
(figure: offloaded-kernel speedup and normalized execution time, broken into kernel, data loading, NoC, and sequential+barrier components, for CPU vs. IMP on Blackscholes, Fluidanimate, Canneal, Streamcluster, and their geomean; a 7.5x speedup is annotated)
The capacity limitation of IMP sets the upper bound on application performance improvement.
More informationPerformance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
More informationEECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141
EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 14 EE141 Outline Parallelism EE141 2 Parallelism Parallelism is the act of doing more
More informationBandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design
Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters
More informationComputer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15: Dataflow
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The
More informationMohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu
Mohsen Imani University of California San Diego Winter 2016 Technology Trend for IoT http://www.flashmemorysummit.com/english/collaterals/proceedi ngs/2014/20140807_304c_hill.pdf 2 Motivation IoT significantly
More informationScalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism
Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Jiecao Yu 1, Andrew Lukefahr 1, David Palframan 2, Ganesh Dasika 2, Reetuparna Das 1, Scott Mahlke 1 1 University of Michigan 2 ARM
More informationSudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread
Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)
More informationgem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood
gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationEECS150 - Digital Design Lecture 09 - Parallelism
EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization
More informationNVIDIA FOR DEEP LEARNING. Bill Veenhuis
NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA
More informationME964 High Performance Computing for Engineering Applications
ME964 High Performance Computing for Engineering Applications Execution Scheduling in CUDA Revisiting Memory Issues in CUDA February 17, 2011 Dan Negrut, 2011 ME964 UW-Madison Computers are useless. They
More informationIntroduction to Multicore architecture. Tao Zhang Oct. 21, 2010
Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)
More informationEnergy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction
More informationFrom Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)
From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real
More informationHigh Performance Computing Hiroki Kanezashi Tokyo Institute of Technology Dept. of mathematical and computing sciences Matsuoka Lab.
High Performance Computing 2015 Hiroki Kanezashi Tokyo Institute of Technology Dept. of mathematical and computing sciences Matsuoka Lab. 1 Reviewed Paper 1 DaDianNao: A Machine- Learning Supercomputer
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationLecture 4: Instruction Set Architectures. Review: latency vs. throughput
Lecture 4: Instruction Set Architectures Last Time Performance analysis Amdahl s Law Performance equation Computer benchmarks Today Review of Amdahl s Law and Performance Equations Introduction to ISAs
More information15-740/ Computer Architecture, Fall 2011 Midterm Exam II
15-740/18-740 Computer Architecture, Fall 2011 Midterm Exam II Instructor: Onur Mutlu Teaching Assistants: Justin Meza, Yoongu Kim Date: December 2, 2011 Name: Instructions: Problem I (69 points) : Problem
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationTDT4260/DT8803 COMPUTER ARCHITECTURE EXAM
Norwegian University of Science and Technology Department of Computer and Information Science Page 1 of 13 Contact: Magnus Jahre (952 22 309) TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Monday 4. June Time:
More informationThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
More informationVertex Shader Design II
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More informationTHE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research
THE PATH TO EXASCALE COMPUTING Bill Dally Chief Scientist and Senior Vice President of Research The Goal: Sustained ExaFLOPs on problems of interest 2 Exascale Challenges Energy efficiency Programmability
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationPerformance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of
More informationPipelining. CS701 High Performance Computing
Pipelining CS701 High Performance Computing Student Presentation 1 Two 20 minute presentations Burks, Goldstine, von Neumann. Preliminary Discussion of the Logical Design of an Electronic Computing Instrument.
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationGPU Microarchitecture Note Set 2 Cores
2 co 1 2 co 1 GPU Microarchitecture Note Set 2 Cores Quick Assembly Language Review Pipelined Floating-Point Functional Unit (FP FU) Typical CPU Statically Scheduled Scalar Core Typical CPU Statically
More informationCan FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationHomework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures
Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang
More informationGoogle Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand
Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan
More informationSingle Instructions Can Execute Several Low Level
We have made it easy for you to find a PDF Ebooks without any digging. And by having access to our ebooks online or by storing it on your computer, you have convenient answers with single instructions
More informationECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University
ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationTools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,
Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - fabio.baruffa@lrz.de LRZ, 27.6.- 29.6.2016 Architecture Overview Intel Xeon Processor Intel Xeon Phi Coprocessor, 1st generation Intel Xeon
More informationECE 154A Introduction to. Fall 2012
ECE 154A Introduction to Computer Architecture Fall 2012 Dmitri Strukov Lecture 10 Floating point review Pipelined design IEEE Floating Point Format single: 8 bits double: 11 bits single: 23 bits double:
More informationLecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"
Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3
More informationSimultaneous Multithreading on Pentium 4
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on
More information