Transcription
1 Tutorial Outline
8:30 am - 9:00 am: Introduction
9:00 am - 10:00 am: Pre-RTL Simulation Framework: Aladdin
10:00 am - 10:30 am: Break
10:30 am - 11:00 am: Workload Characterization Tool: WIICA
11:00 am - 12:00 pm: CAD & Benchmarks: HLS & MachSuite
12:00 pm - 2:00 pm: Lunch
2:00 pm - 3:00 pm: Embedded Keynote Talk: Mark Horowitz (Stanford)
3:00 pm - 3:30 pm: Accelerator Selection Tool: Sigil
3:30 pm - 4:00 pm: Break
4:00 pm - 5:00 pm: Hands-on Exercise
2 Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures
Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks (Harvard University)
3 Today's SoC
[block diagram: CPUs and a GPU/DSP alongside many accelerator (Acc) blocks, connected by buses to a memory interface]
4 Future Accelerator-Centric Architectures
[block diagram: big cores, small cores, GPU/DSP, shared resources, a sea of fine-grained accelerators, and a memory interface]
How to decompose an application into accelerators? (Flexibility)
How to rapidly design lots of accelerators? (Design Cost)
How to design and manage the shared resources? (Programmability)
5 Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator
[flow diagram: unmodified C code plus accelerator design parameters (e.g., # FUs, memory BW) feed Aladdin, which produces power/area and performance estimates for an accelerator-specific datapath with a private L1/scratchpad; combined with shared memory/interconnect models, it simulates accelerator-rich SoC fabrics and memory systems]
6 (same diagram) Aladdin addresses Flexibility and Programmability.
7 (same diagram) Design Assistant: understand the algorithmic-HW design space before RTL (Design Cost).
8 Future Accelerator-Centric Architecture
[scatter plot of power (mW) vs. execution time (us): HLS tools reach only a small set of design points]
9 [same plot, with a much larger region labeled ALADDIN] Aladdin can rapidly evaluate a large design space of accelerator-centric architectures.
10 Aladdin Overview
Optimization Phase: C code -> optimistic IR -> initial dynamic data dependence graph (DDDG) -> idealistic DDDG.
Realization Phase: accelerator design parameters constrain the DDDG (program-constrained, then resource-constrained); the resulting activity drives the power/area models, yielding performance and power/area estimates.
11 (Aladdin Overview roadmap slide repeated)
12 Aladdin is NOT
An HLS flow: no RTL is generated; it gives high-level estimates of power and performance.
A limit-of-ILP study: Aladdin uses fully dynamic analysis to expose algorithmic parallelism in unmodified high-level-language code, but the DDDG is constructed to model accelerators.
13 From C to Design Space
C Code: for(i=0; i<n; ++i) c[i] = a[i] + b[i];
14 (Aladdin Overview roadmap slide repeated)
15 From C to Design Space: IR Dynamic Trace
C Code: for(i=0; i<n; ++i) c[i] = a[i] + b[i];
IR Trace:
0. r0=0               // i = 0
1. r4=load(r0 + r1)   // load a[i]
2. r5=load(r0 + r2)   // load b[i]
3. r6=r4 + r5
4. store(r0 + r3, r6) // store c[i]
5. r0=r0 + 1          // ++i
6. r4=load(r0 + r1)   // load a[i]
7. r5=load(r0 + r2)   // load b[i]
8. r6=r4 + r5
9. store(r0 + r3, r6) // store c[i]
10. r0=r0 + 1         // ++i
16 IR: LLVM IR
A high-level IR: machine- and ISA-independent.
Features: unlimited registers; simple opcodes (add, mul, sin, sqrt); only loads/stores access memory.
Shao et al., "ISA-Independent Workload Characterization and Implications for Specialized Architectures," ISPASS 2013.
17 (Aladdin Overview roadmap slide repeated)
18 From C to Design Space: Initial DDDG
C Code: for(i=0; i<n; ++i) c[i] = a[i] + b[i];
IR Trace: (as on slide 15)
[DDDG figure: per iteration, ld a and ld b feed the add, which feeds st c; the i=0 / i++ chain links iterations, serializing them]
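The register def-use chains in the dynamic trace are exactly what the initial DDDG captures. A minimal sketch of the idea in C (the `Op`/`build_dddg` names and fixed-size arrays are illustrative, not Aladdin's implementation): each op gets a dependence edge from the last writer of every register it reads.

```c
/* Hypothetical sketch of initial-DDDG construction from a dynamic trace:
 * op v gets a dependence edge from op u when v reads a register that u
 * was the last to write. Register dependences only; memory dependences
 * are handled separately (see the memory ambiguation slides). */
#include <assert.h>
#include <string.h>

#define NREGS  16
#define MAXOPS 32
#define MAXSRC 3
#define MAXDEP 4

typedef struct {
    int dest;            /* register written, or -1 (e.g., a store) */
    int srcs[MAXSRC];    /* registers read, -1 for unused slots     */
} Op;

typedef struct {
    int ndeps[MAXOPS];
    int deps[MAXOPS][MAXDEP];   /* producer ops for each op */
} Dddg;

static void build_dddg(const Op *trace, int n, Dddg *g) {
    int last_writer[NREGS];
    memset(g, 0, sizeof *g);
    for (int r = 0; r < NREGS; r++) last_writer[r] = -1;
    for (int v = 0; v < n; v++) {
        for (int s = 0; s < MAXSRC; s++) {
            int r = trace[v].srcs[s];
            if (r >= 0 && last_writer[r] >= 0)
                g->deps[v][g->ndeps[v]++] = last_writer[r];
        }
        if (trace[v].dest >= 0)
            last_writer[trace[v].dest] = v;
    }
}
```

On the first five trace ops above (r0=0; two loads; add; store), this makes the add depend on both loads and the store depend on the add, reproducing the per-iteration chain in the figure.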
19 (Aladdin Overview roadmap slide repeated)
20 From C to Design Space: Idealistic DDDG
C Code: for(i=0; i<n; ++i) c[i] = a[i] + b[i];
IR Trace: (as on slide 15)
[DDDG figure: with the dependences through the loop index removed, each iteration's ld a / ld b / + / st c chain is independent, exposing parallelism across iterations]
21 From C to Design Space: Idealistic DDDG
Includes application-specific customization strategies.
Node-Level: bit-width analysis, strength reduction, tree-height reduction.
Loop-Level: remove dependences between loop index variables.
Memory Optimization: memory-to-register conversion, store-load forwarding, store buffer.
Extensible: e.g., model a CAM accelerator by matching nodes in the DDDG.
22 (Aladdin Overview roadmap slide repeated)
23 From C to Design Space: One Design
Acc Design Parameters: memory BW <= 2, 1 adder.
[schedule figure: with two memory ports and one adder, each cycle issues at most two loads/stores and one add, so the iterations' ld a / ld b / + / st c chains are spread across more cycles]
24 From C to Design Space: Another Design
Acc Design Parameters: memory BW <= 4, 2 adders.
[schedule figure: four memory ops and two adds per cycle pack the same DDDG into fewer cycles, at the cost of more resources]
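One way to see how these parameters turn the DDDG into a cycle count is unit-latency list scheduling with per-cycle resource caps. The toy model below (hypothetical `Node`/`schedule` names; a deliberate simplification, not Aladdin's actual scheduler) reproduces the contrast between the two slides: the same graph needs more cycles under tighter resource constraints.

```c
/* Toy resource-constrained scheduler: ASAP scheduling of a DDDG with
 * unit-latency nodes and per-cycle caps on memory ops and adders.
 * Illustrative only -- not Aladdin's actual algorithm. */
#include <assert.h>

enum Kind { MEM, ADD };

typedef struct {
    enum Kind kind;
    int ndeps;
    int deps[4];       /* indices of producer nodes */
} Node;

/* Returns the number of cycles needed to execute all n nodes (n <= 64). */
static int schedule(const Node *g, int n, int mem_bw, int adders) {
    int done[64] = {0}, cyc[64] = {0};
    int finished = 0, cycle = 0;
    while (finished < n) {
        cycle++;
        int mem_used = 0, add_used = 0;
        for (int v = 0; v < n; v++) {
            if (done[v]) continue;
            int ready = 1;
            for (int d = 0; d < g[v].ndeps; d++) {
                int u = g[v].deps[d];
                /* a producer scheduled this very cycle hasn't finished yet */
                if (!done[u] || cyc[u] == cycle) ready = 0;
            }
            if (!ready) continue;
            if (g[v].kind == MEM && mem_used < mem_bw) {
                mem_used++; done[v] = 1; cyc[v] = cycle; finished++;
            } else if (g[v].kind == ADD && add_used < adders) {
                add_used++; done[v] = 1; cyc[v] = cycle; finished++;
            }
        }
    }
    return cycle;
}
```

For two independent iterations of c[i] = a[i] + b[i] (two loads, an add, and a store each), the slide-23 parameters (memory BW <= 2, 1 adder) take four cycles while the slide-24 parameters (BW <= 4, 2 adders) take three, which is precisely the power/performance trade-off being swept.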
25 From C to Design Space
Realization Phase: DDDG -> power/performance.
Constrain the DDDG with program and user-defined resource constraints.
Program Constraints: control dependence, memory ambiguation.
Resource Constraints: loop-level parallelism, loop pipelining, memory ports, # of FUs (e.g., adders, multipliers).
26 Memory Ambiguation
The idealistic DDDG optimistically removes all false memory dependences, but input-dependent memory accesses cannot be calculated statically.
27 Memory Ambiguation
for(i=0; i<n; ++i) { bucket[ a[i] & 0x11 ]++; }
Input: a[0] = 1; a[1] = 1; a[2] = 1;
Trace (one iteration): 0. i=0; 1. ld a[0]; 2. &; 3. ld b[1]; 4. b[1]++; 5. st b[1]
28 Memory Ambiguation
Input: a[0] = 1; a[1] = 2; a[2] = 1;
Trace (two iterations): ... 6. i++; 7. ld a[1]; 8. &; 9. ld b[2]; 10. b[2]++; 11. st b[2]
29 Memory Ambiguation
Input: a[0] = 1; a[1] = 2; a[2] = 2;
Trace (three iterations): ... 12. i++; 13. ld a[2]; 14. &; 15. ld b[2]; 16. b[2]++; 17. st b[2]
Iterations 2 and 3 both update b[2], so their load/increment/store chains must be ordered; iteration 1 (which updates b[1]) stays independent.
30-33 Memory Ambiguation (the same three-iteration example repeated across animation frames)
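Because the colliding bucket index is only known once the inputs are, a dynamic analysis can serialize exactly the accesses that alias at runtime instead of conservatively serializing every iteration. A hedged sketch of that check on the histogram loop (the `find_conflicts` helper is hypothetical, not Aladdin code):

```c
/* Dynamic memory disambiguation sketch for the histogram loop
 * bucket[a[i] & 0x11]++: iteration j must be ordered after iteration i
 * only when both touch the same bucket at runtime. */
#include <assert.h>

/* Records ordered pairs (i, j), i < j, of iterations that alias.
 * Returns the number of pairs found. */
static int find_conflicts(const int *a, int n, int pairs[][2]) {
    int cnt = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if ((a[i] & 0x11) == (a[j] & 0x11)) {
                pairs[cnt][0] = i;
                pairs[cnt][1] = j;
                cnt++;
            }
    return cnt;
}
```

With the slide's third input, a = {1, 2, 2}, only the last two iterations collide, so the first iteration can still run in parallel with them; a static compiler, unable to prove independence, would serialize all three.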
34 From C to Design Space: Power-Performance per Design
[power vs. cycle plot comparing the two designs: memory BW <= 4 with 2 adders vs. memory BW <= 2 with 1 adder]
35 From C to Design Space: Design Space of an Algorithm
[power vs. cycle scatter plot: the full design space swept by Aladdin]
36 Cycle-Level Activity
[plot: number of active functional units and memory bandwidth over time (cycles) for the FFT phases Twiddle, FFT8, and Shuffle]
37 Power Model
Functional Units Power Model: microbenchmarks characterize the various FUs, synthesized with Design Compiler and a 40nm standard cell library.
Power = Σ (1 <= i <= N) [ activity_i * P_i^dynamic + P_i^leakage ]
SRAM Power Model: commercial register file and SRAM memory compilers with the same 40nm standard cell library.
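The summation above can be written out directly. A small sketch with made-up per-unit numbers (the `UnitPower` struct is illustrative, not Aladdin's interface):

```c
/* Activity-based power model sketch: total power is each unit's leakage
 * plus its activity-weighted dynamic power, summed over all N units.
 * The unit values used below are invented, purely for illustration. */
#include <assert.h>
#include <math.h>

typedef struct {
    double activity;    /* fraction of cycles the unit is active */
    double p_dynamic;   /* dynamic power at full activity (mW)   */
    double p_leakage;   /* leakage power (mW)                    */
} UnitPower;

static double total_power(const UnitPower *u, int n) {
    double p = 0.0;
    for (int i = 0; i < n; i++)
        p += u[i].activity * u[i].p_dynamic + u[i].p_leakage;
    return p;
}
```

For instance, an adder at 50% activity (2.0 mW dynamic, 0.1 mW leakage) plus a fully active SRAM port (1.0 mW dynamic, 0.2 mW leakage) sums to 2.3 mW.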
38 (Aladdin Overview roadmap slide repeated)
39 Aladdin Validation
[flow: C code -> Aladdin -> power/area and performance; compared against an RTL flow: Verilog -> Design Compiler (power/area) and ModelSim (activity/performance)]
40 Aladdin Validation
[flow: same comparison, with the Verilog produced either by an RTL designer or by Vivado HLS from tuned HLS C]
41 Benchmarks
SHOC Benchmark Suite (optimized HLS designs):
  MD: pairwise calculation of the L-J potential
  STENCIL: apply a 3x3 filter to an image
  FFT: 1D 512-point FFT
  GEMM: blocked matrix multiply
  TRIAD: single computation in a DOALL loop
  SORT: radix sort
  SCAN: parallel prefix sum
  REDUCTION: return the sum of an array
Proposed accelerator constructs (hand RTL designs):
  NPU: an individual neuron in a network [MICRO 12]
  Memcached: GET function in Memcached [ISCA 13]
  HARP: data partition accelerator [ISCA 13]
42 Aladdin Validation: FFT
[bar charts: time (Kcycles), power (mW), and area (mm^2) for Aladdin vs. the RTL flow on FFT, with percentage-error annotations]
43 Aladdin Validation: All Benchmarks
[bar charts vs. the RTL flow across MD, STENCIL, FFT, GEMM, TRIAD, SORT, SCAN, REDUCTION and NPU, HASH, HARP: time (Kcycles), power (mW), and area (mm^2); annotated average errors of 0.9% (time), 4.9% (power), and 6.5% (area)]
44 Aladdin enables rapid design space exploration for accelerators.
[validation flow diagram as on slide 40]
45 Algorithm Choices: Aladdin generates a design space per algorithm, so it can be used to quickly compare the design spaces of different algorithms.
Input Dependent: requires inputs that exercise all paths of the code.
Input C Code: Aladdin can create a design space for any C code, but C constructs that require resources outside the accelerator, such as system calls and dynamic memory allocation, are not modeled.
46 Aladdin enables pre-RTL simulation of accelerators together with the rest of the SoC.
[diagram: big and small cores (gem5), GPU (GPGPU-Sim), shared cache/interconnect resources (Orion2), memory interface (DRAMSim2), and the sea of fine-grained accelerators]
47 Accelerator with Memory System using Aladdin
[diagram: Acc -> Cache -> Memory]
48 Modeling Accelerators in an SoC-like Environment
[diagram: Acc and Core share a cache in front of memory; plot: power (mW) vs. time (million cycles) for block=16 and block=32, without memory contention]
49 Modeling Accelerators in an SoC-like Environment
[same setup; side-by-side plots of power (mW) vs. time (million cycles) for block=16 and block=32, without and with memory contention]
50 Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator
Architectures with 1000s of accelerators will be radically different; new design tools are needed.
Aladdin enables rapid design space exploration of future accelerator-centric platforms.
You can find Aladdin at http://vlsiarch.eecs.harvard.edu/accelerators
51 Tutorial References
Y.S. Shao and D. Brooks, "ISA-Independent Workload Characterization and its Implications for Specialized Architectures," ISPASS 2013.
B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, "Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware," ISLPED 2013.
Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, "Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures," ISCA 2014.
B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, "MachSuite: Benchmarks for Accelerator Design and Customized Architectures," IISWC 2014.
More informationgem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood
gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory
More informationCourse Overview Revisited
Course Overview Revisited void blur_filter_3x3( Image &in, Image &blur) { // allocate blur array Image blur(in.width(), in.height()); // blur in the x dimension for (int y = ; y < in.height(); y++) for
More informationECE369. Chapter 5 ECE369
Chapter 5 1 State Elements Unclocked vs. Clocked Clocks used in synchronous logic Clocks are needed in sequential logic to decide when an element that contains state should be updated. State element 1
More informationThis Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital
More informationManaging Hybrid On-chip Scratchpad and Cache Memories for Multi-tasking Embedded Systems
Managing Hybrid On-chip Scratchpad and Cache Memories for Multi-tasking Embedded Systems Zimeng Zhou, Lei Ju, Zhiping Jia, Xin Li School of Computer Science and Technology Shandong University, China Outline
More informationLecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University
Lecture 12 Memory Design & Caches, part 2 Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements HW3 is due today PA2 is available on-line today Part 1 is due on 2/27
More informationIntroduction to Parallel Programming Models
Introduction to Parallel Programming Models Tim Foley Stanford University Beyond Programmable Shading 1 Overview Introduce three kinds of parallelism Used in visual computing Targeting throughput architectures
More informationProject Proposals. 1 Project 1: On-chip Support for ILP, DLP, and TLP in an Imagine-like Stream Processor
EE482C: Advanced Computer Organization Lecture #12 Stream Processor Architecture Stanford University Tuesday, 14 May 2002 Project Proposals Lecture #12: Tuesday, 14 May 2002 Lecturer: Students of the class
More informationHigher Level Programming Abstractions for FPGAs using OpenCL
Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*
More information100M Gate Designs in FPGAs
100M Gate Designs in FPGAs Fact or Fiction? NMI FPGA Network 11 th October 2016 Jonathan Meadowcroft, Cadence Design Systems Why in the world, would I do that? ASIC replacement? Probably not! Cost prohibitive
More informationTen Reasons to Optimize a Processor
By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor
More informationBringing Intelligence to Enterprise Storage Drives
Bringing Intelligence to Enterprise Storage Drives Neil Werdmuller Director Storage Solutions Arm Santa Clara, CA 1 Who am I? 28 years experience in embedded Lead the storage solutions team Work closely
More informationData Warehouse Tuning. Without SQL Modification
Data Warehouse Tuning Without SQL Modification Agenda About Me Tuning Objectives Data Access Profile Data Access Analysis Performance Baseline Potential Model Changes Model Change Testing Testing Results
More informationVLSI Signal Processing
VLSI Signal Processing Programmable DSP Architectures Chih-Wei Liu VLSI Signal Processing Lab Department of Electronics Engineering National Chiao Tung University Outline DSP Arithmetic Stream Interface
More informationGeneric Cycle Accounting GOODA. Generic Optimization Data Analyzer
Generic Cycle Accounting GOODA Generic Optimization Data Analyzer What is Gooda Open sourced PMU analysis tool Processes perf.data file created with "perf record" Intrinsically incorporates hierarchical
More informationPACE: Power-Aware Computing Engines
PACE: Power-Aware Computing Engines Krste Asanovic Saman Amarasinghe Martin Rinard Computer Architecture Group MIT Laboratory for Computer Science http://www.cag.lcs.mit.edu/ PACE Approach Energy- Conscious
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationGeneral Purpose GPU Programming. Advanced Operating Systems Tutorial 9
General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous
More informationGeneral Purpose Signal Processors
General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:
More informationA Near Memory Processor for Vector, Streaming and Bit Manipulations Workloads
A Near Memory Processor for Vector, Streaming and Bit Manipulations Workloads Mingliang Wei, Marc Snir, Josep Torrellas (UIUC) Brett Tremaine (IBM) Work supported by HPCS/PERCS Motivation Many important
More informationGeneral Purpose Processors
Calcolatori Elettronici e Sistemi Operativi Specifications Device that executes a program General Purpose Processors Program list of instructions Instructions are stored in an external memory Stored program
More informationVTA: Open & Flexible DL Acceleration. Thierry Moreau TVM Conference, Dec 12th 2018
VTA: Open & Flexible DL Acceleration Thierry Moreau TVM Conference, Dec 12th 2018 TVM Stack High-Level Differentiable IR Tensor Expression IR LLVM CUDA Metal TVM Stack High-Level Differentiable IR Tensor
More informationASIC Design of Shared Vector Accelerators for Multicore Processors
26 th International Symposium on Computer Architecture and High Performance Computing 2014 ASIC Design of Shared Vector Accelerators for Multicore Processors Spiridon F. Beldianu & Sotirios G. Ziavras
More informationOptimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs
Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs Niu Feng Technical Specialist, ARM Tech Symposia 2016 Agenda Introduction Challenges: Optimizing cache coherent subsystem
More informationThis Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital
More informationWindowing System on a 3D Pipeline. February 2005
Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationCMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on
More information