Tutorial Outline. Pre-RTL Simulation Framework: Aladdin.


1 Tutorial Outline
Time: Topic
8:30 am - 9:00 am: Introduction
9:00 am - 10:00 am: Pre-RTL Simulation Framework: Aladdin
10:00 am - 10:30 am: Break
10:30 am - 11:00 am: Workload Characterization Tool: WIICA
11:00 am - 12:00 pm: CAD & Benchmarks: HLS & MachSuite
12:00 pm - 2:00 pm: Lunch
2:00 pm - 3:00 pm: Embedded Keynote Talk: Mark Horowitz (Stanford)
3:00 pm - 3:30 pm: Accelerator Selection Tool: Sigil
3:30 pm - 4:00 pm: Break
4:00 pm - 5:00 pm: Hands-on Exercise

2 Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures
Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks
Harvard University

3 Today's SoC
[Diagram: two CPUs, a GPU/DSP, and many accelerators (Acc) connected by buses to a memory interface.]

4 Future Accelerator-Centric Architectures
[Diagram: Big Cores, Small Cores, GPU/DSP, Shared Resources, Sea of Fine-Grained Accelerators, Memory Interface.]
How to decompose an application to accelerators?
How to rapidly design lots of accelerators?
How to design and manage the shared resources?
Concerns: Flexibility, Design Cost, Programmability

5 Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator
Inputs: Unmodified C Code; Accelerator Design Parameters (e.g., # FU, mem. BW)
Outputs: Power/Area; Performance
Models: Accelerator-Specific Datapath; Private L1/Scratchpad; Shared Memory/Interconnect Models
Accelerator Simulator: Design Accelerator-Rich SoC Fabrics and Memory Systems

7 Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator
Inputs: Unmodified C Code; Accelerator Design Parameters (e.g., # FU, mem. BW)
Outputs: Power/Area; Performance
Models: Accelerator-Specific Datapath; Private L1/Scratchpad; Shared Memory/Interconnect Models
Accelerator Simulator: Design Accelerator-Rich SoC Fabrics and Memory Systems (Flexibility, Programmability)
Design Assistant: Understand Algorithmic-HW Design Space before RTL (Design Cost)

8 Future Accelerator-Centric Architecture
[Diagram: Big Cores, Small Cores, GPU/DSP, Shared Resources, Sea of Fine-Grained Accelerators, Memory Interface.]
[Figure: HLS design points; axes: Power (mW) vs. Execution Time (us).]

9 Future Accelerator-Centric Architecture
[Diagram: Big Cores, Small Cores, GPU/DSP, Shared Resources, Sea of Fine-Grained Accelerators, Memory Interface.]
[Figure: Aladdin and HLS design points; axes: Power (mW) vs. Execution Time (us).]
Aladdin can rapidly evaluate a large design space of accelerator-centric architectures.

10 Aladdin Overview
Inputs: C Code; Acc Design Parameters
Optimization Phase: Optimistic IR; Initial DDDG; Idealistic DDDG (DDDG = Dynamic Data Dependence Graph)
Realization Phase: Program Constrained DDDG; Resource Constrained DDDG; Power/Area Models
Outputs: Performance; Activity; Power/Area

12 Aladdin is NOT
An HLS flow: no RTL is generated. Aladdin gives high-level estimates of power and performance, and uses fully dynamic analysis to expose algorithmic parallelism for unmodified HLL codes.
A limit-of-ILP study: Aladdin is constructed to model accelerators.

13 From C to Design Space
C Code:
    for (i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

15 From C to Design Space: IR Dynamic Trace
C Code:
    for (i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
IR Dynamic Trace:
    0.  r0 = 0               // i = 0
    1.  r4 = load(r0 + r1)   // load a[i]
    2.  r5 = load(r0 + r2)   // load b[i]
    3.  r6 = r4 + r5
    4.  store(r0 + r3, r6)   // store c[i]
    5.  r0 = r0 + 1          // ++i
    6.  r4 = load(r0 + r1)   // load a[i]
    7.  r5 = load(r0 + r2)   // load b[i]
    8.  r6 = r4 + r5
    9.  store(r0 + r3, r6)   // store c[i]
    10. r0 = r0 + 1          // ++i
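One way to picture how such a trace comes about: the sketch below runs the same loop and records one entry per executed operation. It is an illustration only. The `TraceOp` record, the `OpKind` names, and `trace_vector_add` are hypothetical and do not reflect Aladdin's actual trace format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode set for this sketch; not Aladdin's real trace format. */
typedef enum { OP_INIT, OP_LOAD, OP_ADD, OP_STORE, OP_INC } OpKind;

typedef struct {
    OpKind kind;   /* which operation executed */
    int    iter;   /* loop iteration that produced it */
    char   array;  /* 'a', 'b' or 'c' for memory ops, 0 otherwise */
} TraceOp;

/* Run c[i] = a[i] + b[i] for n iterations and record every dynamic
 * operation in execution order. Returns the number of entries written,
 * which is 1 + 5 * n. */
size_t trace_vector_add(const int *a, const int *b, int *c, int n,
                        TraceOp *trace, size_t cap)
{
    size_t k = 0;
    assert(n >= 0 && cap >= 1 + 5 * (size_t)n);

    trace[k++] = (TraceOp){OP_INIT, 0, 0};          /* i = 0 */
    for (int i = 0; i < n; ++i) {
        trace[k++] = (TraceOp){OP_LOAD, i, 'a'};    /* load a[i] */
        trace[k++] = (TraceOp){OP_LOAD, i, 'b'};    /* load b[i] */
        c[i] = a[i] + b[i];
        trace[k++] = (TraceOp){OP_ADD, i, 0};       /* add */
        trace[k++] = (TraceOp){OP_STORE, i, 'c'};   /* store c[i] */
        trace[k++] = (TraceOp){OP_INC, i, 0};       /* ++i */
    }
    return k;
}
```

For n = 2 the recorder emits 11 entries, matching trace lines 0 through 10 on the slide.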

16 IR: LLVM IR
High-level IR: machine- and ISA-independent
Features:
    Unlimited registers
    Simple opcodes: add, mul, sin, sqrt
    Only load/store access memory
Shao, et al., "ISA-Independent Workload Characterization and Implications for Specialized Architecture," ISPASS, 2013

18 From C to Design Space: Initial DDDG
C code and IR trace as on slide 15.
[Figure: initial DDDG built from the trace. Nodes carry the trace numbers: 0. i=0; 1. ld a; 2. ld b; 3. +; 4. st c; 5. i++; 6. ld a; 7. ld b; 8. +; 9. st c; 10. i++; and so on for later iterations.]
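A minimal sketch of how an initial dependence graph could be derived from the trace, assuming only true (read-after-write) register dependences are tracked and every operation takes one cycle. `IrOp`, `build_dddg`, and `critical_path` are hypothetical names chosen for this illustration, not Aladdin's API.

```c
#include <assert.h>

#define MAX_REGS 16
#define MAX_SRCS 3

typedef struct {
    int dst;             /* destination register, -1 if none (e.g. store) */
    int src[MAX_SRCS];   /* source registers, -1 for unused slots */
} IrOp;

/* For every op, record which earlier op last wrote each source register.
 * dep[i][s] is the producer's trace index, or -1 if the register is
 * live-in (never written inside the trace). */
void build_dddg(const IrOp *ops, int n, int dep[][MAX_SRCS])
{
    int last_writer[MAX_REGS];
    for (int r = 0; r < MAX_REGS; ++r)
        last_writer[r] = -1;

    for (int i = 0; i < n; ++i) {
        for (int s = 0; s < MAX_SRCS; ++s) {
            int r = ops[i].src[s];
            assert(r < MAX_REGS);
            dep[i][s] = (r >= 0) ? last_writer[r] : -1;
        }
        if (ops[i].dst >= 0) {
            assert(ops[i].dst < MAX_REGS);
            last_writer[ops[i].dst] = i;
        }
    }
}

/* Longest dependence chain, counted in ops (unit latency assumed).
 * level[i] receives the earliest step at which op i could execute. */
int critical_path(int n, int dep[][MAX_SRCS], int *level)
{
    int longest = 0;
    for (int i = 0; i < n; ++i) {
        level[i] = 1;
        for (int s = 0; s < MAX_SRCS; ++s) {
            int p = dep[i][s];
            if (p >= 0 && level[p] + 1 > level[i])
                level[i] = level[p] + 1;
        }
        if (level[i] > longest)
            longest = level[i];
    }
    return longest;
}
```

Fed the 11-entry trace from the slide (with r1, r2, r3 treated as live-in base addresses), the longest chain in this model is five operations: i=0, i++, ld, +, st.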

20 From C to Design Space: Idealistic DDDG
C code and IR trace as on slide 15.
[Figure: the initial DDDG next to the idealistic DDDG. Nodes carry the trace numbers: 0. i=0; 5. i++; 10. i++; and per iteration ld a, ld b, +, st c (1-4, 6-9, 11-14, ...).]
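The step from the initial to the idealistic graph can be sketched as an edge-pruning pass, assuming the loop-index updates are known up front and unit latency per operation. The function name and the `is_index_op` flag array are hypothetical, introduced only for this example.

```c
#include <assert.h>

#define MAX_SRCS 3

/* Drop every edge whose producer is a loop-index operation (i = 0, i++),
 * then return the critical path length in ops (unit latency assumed).
 * dep[i][s] holds the trace index of the producer of source s of op i,
 * or -1 if there is none; pruned edges are overwritten with -1. */
int idealistic_critical_path(int n, int dep[][MAX_SRCS],
                             const int *is_index_op, int *level)
{
    int longest = 0;
    for (int i = 0; i < n; ++i) {
        level[i] = 1;
        for (int s = 0; s < MAX_SRCS; ++s) {
            int p = dep[i][s];
            if (p < 0)
                continue;
            assert(p < i);              /* producers precede consumers */
            if (is_index_op[p]) {
                dep[i][s] = -1;         /* index value treated as free */
                continue;
            }
            if (level[p] + 1 > level[i])
                level[i] = level[p] + 1;
        }
        if (level[i] > longest)
            longest = level[i];
    }
    return longest;
}
```

On the vector-add trace, once the index edges are gone the critical path in this model is three operations (ld, +, st) regardless of the iteration count, so iterations are limited only by resources.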

21 From C to Design Space: Idealistic DDDG
Include application-specific customization strategies.
Node-Level: Bit-width Analysis; Strength Reduction; Tree-height Reduction
Loop-Level: Remove dependences between loop index variables
Memory Optimization: Memory-to-Register Conversion; Store-Load Forwarding; Store Buffer
Extensible: e.g., model a CAM accelerator by matching nodes in the DDDG
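As a small illustration of one node-level strategy, tree-height reduction, the sketch below compares the depth of a serial chain of two-input adders with a balanced adder tree over the same operands. Both function names are made up for this example, and the model ignores everything except dependence depth.

```c
#include <assert.h>

/* Depth of a left-leaning chain ((x0 + x1) + x2) + ... over n operands. */
int chain_depth(int n)
{
    assert(n >= 1);
    return n - 1;
}

/* Depth of a balanced tree of two-input adders over the same n operands,
 * which is ceil(log2(n)). */
int tree_depth(int n)
{
    int depth = 0;
    assert(n >= 1 && n <= (1 << 30));
    while ((1 << depth) < n)
        ++depth;
    return depth;
}
```

Summing eight operands needs seven dependent additions as a chain but only three levels as a balanced tree, which shortens the critical path when enough adders are available.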

23 From C to Design Space: One Design
Acc Design Parameters: Memory BW <= 2; 1 Adder
[Figure: the idealistic DDDG, the resource-constrained schedule by cycle, and the resulting resource activity (MEM, MEM, +, MEM per iteration).]

24 From C to Design Space: Another Design
Acc Design Parameters: Memory BW <= 4; 2 Adders
[Figure: the idealistic DDDG, the resource-constrained schedule by cycle, and the resulting resource activity (up to four MEM and two + operations per cycle).]
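The effect of the two parameter sets can be sketched with a simple list scheduler. This is a simplified model under stated assumptions (unit latency, ops considered in trace order, one shared memory bandwidth limit for loads and stores), not Aladdin's scheduler; `FuKind` and `schedule` are hypothetical names.

```c
#include <assert.h>

#define MAX_SRCS 3

typedef enum { FU_NONE, FU_MEM, FU_ADD } FuKind;

/* Greedy cycle-by-cycle scheduling. An op may issue once all of its
 * producers issued in an earlier cycle and a unit of its kind is free.
 * dep[i][s] is a producer index (always < i) or -1.
 * cycle_of[i] receives the issue cycle; returns the total cycle count. */
int schedule(int n, const FuKind *kind, int dep[][MAX_SRCS],
             int mem_bw, int adders, int *cycle_of)
{
    int done = 0, cycle = 0;
    assert(mem_bw > 0 && adders > 0);

    for (int i = 0; i < n; ++i)
        cycle_of[i] = -1;

    while (done < n) {
        int mem_used = 0, add_used = 0;
        for (int i = 0; i < n; ++i) {
            if (cycle_of[i] >= 0)
                continue;                       /* already issued */
            int ready = 1;
            for (int s = 0; s < MAX_SRCS; ++s) {
                int p = dep[i][s];
                if (p >= 0 && (cycle_of[p] < 0 || cycle_of[p] >= cycle))
                    ready = 0;                  /* producer not finished */
            }
            if (!ready)
                continue;
            if (kind[i] == FU_MEM) {
                if (mem_used == mem_bw) continue;
                ++mem_used;
            } else if (kind[i] == FU_ADD) {
                if (add_used == adders) continue;
                ++add_used;
            }
            cycle_of[i] = cycle;
            ++done;
        }
        ++cycle;
    }
    return cycle;
}
```

With four iterations of the loop (index updates already removed), this sketch finishes in 7 cycles for Memory BW <= 2 with 1 adder and in 4 cycles for Memory BW <= 4 with 2 adders. These numbers come from the simplified model, not from the slides.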

25 From C to Design Space: Realization Phase (DDDG -> Power-Perf)
Constrain the DDDG with program and user-defined resource constraints.
Program Constraints: Control Dependence; Memory Ambiguation
Resource Constraints: Loop-level Parallelism; Loop Pipelining; Memory Ports; # of FUs (e.g., adders, multipliers)

26 Memory Ambiguation
The idealistic DDDG optimistically removes all false memory dependences.
Input-dependent memory accesses cannot be calculated statically.

27 Memory Ambiguation
    for (i = 0; i < n; ++i) { bucket[a[i] & 0x3]++; }
Input: a[0] = 1; a[1] = 1; a[2] = 1;
Trace: 0. i=0; 1. ld a[0]; 2. &; 3. ld b[1]; 4. b[1]++; 5. st b[1]

28 Memory Ambiguation
    for (i = 0; i < n; ++i) { bucket[a[i] & 0x3]++; }
Input: a[0] = 1; a[1] = 2; a[2] = 1;
Trace: 0. i=0; 1. ld a[0]; 2. &; 3. ld b[1]; 4. b[1]++; 5. st b[1]; 6. i++; 7. ld a[1]; 8. &; 9. ld b[2]; 10. b[2]++; 11. st b[2]

29 Memory Ambiguation
    for (i = 0; i < n; ++i) { bucket[a[i] & 0x3]++; }
Input: a[0] = 1; a[1] = 2; a[2] = 2;
Iteration 0: 0. i=0; 1. ld a[0]; 2. &; 3. ld b[1]; 4. b[1]++; 5. st b[1]
Iteration 1: 6. i++; 7. ld a[1]; 8. &; 9. ld b[2]; 10. b[2]++; 11. st b[2]
Iteration 2: 12. i++; 13. ld a[2]; 14. &; 15. ld b[2]; 16. b[2]++; 17. st b[2]
Iterations 1 and 2 both update b[2], so 15. ld b[2] has a true memory dependence on 11. st b[2]; iteration 0 touches b[1] and is independent.
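Because the bucket index depends on the input, such dependences can only be found from the addresses seen at run time. A minimal sketch of that bookkeeping follows, assuming the mask keeps the low two bits (0x3), as the b[1]/b[2] accesses in the slide's trace imply. `histogram_mem_deps` and `NUM_BUCKETS` are hypothetical names.

```c
#include <assert.h>

#define NUM_BUCKETS 4

/* For each iteration i of  bucket[a[i] & 0x3]++  record the most recent
 * earlier iteration that stored to the same bucket (-1 if none).
 * Returns the number of true memory dependences found. */
int histogram_mem_deps(const int *a, int n, int *dep_iter)
{
    int last_store[NUM_BUCKETS];
    int count = 0;

    for (int b = 0; b < NUM_BUCKETS; ++b)
        last_store[b] = -1;

    for (int i = 0; i < n; ++i) {
        int idx = a[i] & 0x3;           /* address known only at run time */
        assert(idx >= 0 && idx < NUM_BUCKETS);
        dep_iter[i] = last_store[idx];  /* load depends on this store */
        if (dep_iter[i] >= 0)
            ++count;
        last_store[idx] = i;            /* this iteration's store */
    }
    return count;
}
```

For the input 1, 2, 2 only the third iteration depends on an earlier one (the second); for the input 1, 1, 1 every iteration depends on its predecessor, so the same code yields a fully serialized graph.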

34 From C to Design Space: Power-Performance per Design
[Figure: Power vs. Cycle, one point per design. Design with Memory BW <= 4 and 2 Adders; design with Memory BW <= 2 and 1 Adder.]

35 From C to Design Space: Design Space of an Algorithm
[Figure: Power vs. Cycle across the design space.]

36 Cycle-Level
[Figure: number of active functional units and memory bandwidth over time (cycles) for FFT; phases labeled FFT8, Twiddle, Shuffle.]

37 Power Model
Functional Units Power Model:
    Microbenchmarks characterize various FUs.
    Design Compiler with 40nm standard cell library.
    $$\text{Power} = \sum_{1 \le i \le N} \left( \text{activity}_i \cdot P_i^{\text{dynamic}} + P_i^{\text{leakage}} \right)$$
SRAM Power Model:
    Commercial register file and SRAM memory compilers with the same 40nm standard cell library.
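The functional-unit formula can be written out directly. In this sketch the `FuPower` record, the interpretation of activity as a 0..1 fraction, and the units are assumptions made for illustration; any numbers passed in are placeholders, not characterized values.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double activity;    /* assumed: fraction of cycles the unit is active */
    double p_dynamic;   /* dynamic power when active, e.g. in mW */
    double p_leakage;   /* leakage power, same unit */
} FuPower;

/* Power = sum over units of (activity_i * P_i_dynamic + P_i_leakage). */
double total_power(const FuPower *fu, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        assert(fu[i].activity >= 0.0 && fu[i].activity <= 1.0);
        total += fu[i].activity * fu[i].p_dynamic + fu[i].p_leakage;
    }
    return total;
}
```

Leakage is charged for every unit whether or not it is active, which is why adding functional units raises power even when the extra units sit idle.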

39 Aladdin
[Figure: flow diagram with blocks C Code, Aladdin, Power/Area, Performance, Verilog, Design Compiler, Activity, ModelSim.]

40 Aladdin
[Figure: flow diagram with blocks C Code, Aladdin, Power/Area, Performance, RTL Designer, HLS C Tuning, Vivado HLS, Verilog, Design Compiler, Activity, ModelSim.]

41 Benchmarks
Type                                                 Benchmark   Description
SHOC Benchmark Suite (Optimized HLS Designs)         MD          Pairwise calculation of the L-J Potential
                                                     STENCIL     Apply 3x3 filter to an image
                                                     FFT         1D 512 FFT
                                                     GEMM        Blocked Matrix Multiply
                                                     TRIAD       Single Computation in DOALL loop
                                                     SORT        Radix Sort
                                                     SCAN        Parallel prefix sum
                                                     REDUCTION   Return sum of an array
Proposed Accelerator Constructs (Hand RTL Designs)   NPU         An individual neuron in a network [MICRO '12]
                                                     Memcached   GET function in Memcached [ISCA '13]
                                                     HARP        Data partition accelerator [ISCA '13]

42 Aladdin
[Figure: Aladdin vs. RTL Flow for FFT; panels: Time (KCycles), Power (mW), Area (mm^2).]

43 Aladdin
[Figure: Aladdin vs. RTL Flow for MD, STENCIL, FFT, GEMM, TRIAD, SORT, SCAN, REDUCTION and for NPU, HASH, HARP; panels: Time (KCycles), Power (mW), Area (mm^2). Percentages annotated on the plots: 0.9%, 4.9%, 6.5%.]

44 Aladdin enables rapid design space exploration for accelerators.
[Figure: flow diagram with blocks C Code, Aladdin, Power/Area, Performance, RTL Designer, HLS C Tuning, Vivado HLS, Verilog, Design Compiler, Activity, ModelSim.]

45
Algorithm Choices: Aladdin generates a design space per algorithm. Aladdin can be used to quickly compare the design spaces of algorithms.
Input Dependent: use inputs that exercise all paths of the code.
Input C Code: Aladdin can create a DDDG for any C code. C constructs that require resources outside the accelerator, such as system calls and dynamic memory allocation, are not modeled.

46 Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.
Big Cores: gem5
Small Cores: gem5
GPU: GPGPU-Sim
Shared Resources: Cacti/Orion2
Sea of Fine-Grained Accelerators
Memory Interface: DRAMSim2

47 Accelerator with Memory System using Aladdin
[Diagram: Acc, Cache, Memory.]

48 Modeling Accelerators in an SoC-like Environment
[Diagram: Acc, Core, Cache, Memory.]
[Figure: Power (mW) vs. Time (Million Cycles) without memory contention; series: block=16, block=32.]

49 Modeling Accelerators in an SoC-like Environment
[Diagram: Acc, Cache, Core, Memory.]
[Figure: two panels, without and with memory contention; axes: Power (mW) vs. Time (Million Cycles); series: block=16, block=32.]

50 Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator
Architectures with 1000s of accelerators will be radically different; new design tools are needed.
Aladdin enables rapid design space exploration of future accelerator-centric platforms.
You can find Aladdin at http://vlsiarch.eecs.harvard.edu/accelerators

51 Tutorial References
Y.S. Shao and D. Brooks, "ISA-Independent Workload Characterization and its Implications for Specialized Architectures," ISPASS '13.
B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, "Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware," ISLPED '13.
Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, "Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures," ISCA '14.
B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, "MachSuite: Benchmarks for Accelerator Design and Customized Architectures," IISWC '14.

RoboBees + Aladdin + HELIX Approximate Accelerator Architectures

RoboBees + Aladdin + HELIX Approximate Accelerator Architectures RoboBees + Aladdin + HELIX Approximate Accelerator Architectures Gu-Yeon Wei School of Engineering and Applied Sciences Harvard University CMOS scaling is running out Technological Fallow Period 2 Power

More information

Accelerator Design, Tradeoffs, and Benchmarking

Accelerator Design, Tradeoffs, and Benchmarking Accelerator Design, Tradeoffs, and Benchmarking Vivado HLS MachSuite [ IISWC 2014 ] QuanIfying AcceleraIon [ ISLPED 2013 ] Brandon Reagen, Yakun Sophia Shao, Bob Adolf, Gu- Yeon Wei, David Brooks Harvard

More information

Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware

Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware Brandon Reagen, Yakun Sophia Shao, Gu-Yeon Wei, David Brooks Harvard University, Cambridge, MA, USA {reagen, shao,

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses

DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough 1,2 S. K. Lee 2, N. Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y. Wei 2 1 ARM Research, Boston,

More information

Design and Modeling of Specialized Architectures

Design and Modeling of Specialized Architectures Design and Modeling of Specialized Architectures a dissertation presented by Yakun Sophia Shao to The School of Engineering and Applied Sciences in partial fulfillment of the requirements for the degree

More information

High-Level Synthesis Creating Custom Circuits from High-Level Code

High-Level Synthesis Creating Custom Circuits from High-Level Code High-Level Synthesis Creating Custom Circuits from High-Level Code Hao Zheng Comp Sci & Eng University of South Florida Exis%ng Design Flow Register-transfer (RT) synthesis - Specify RT structure (muxes,

More information

ECE 5775 (Fall 17) High-Level Digital Design Automation. More Pipelining

ECE 5775 (Fall 17) High-Level Digital Design Automation. More Pipelining ECE 5775 (Fall 17) High-Level Digital Design Automation More Pipelining Announcements HW 2 due Monday 10/16 (no late submission) Second round paper bidding @ 5pm tomorrow on Piazza Talk by Prof. Margaret

More information

Flexible wireless communication architectures

Flexible wireless communication architectures Flexible wireless communication architectures Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston TX Faculty Candidate Seminar Southern Methodist University April

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware Tao Chen, Shreesha Srinath Christopher Batten, G. Edward Suh Computer Systems Laboratory School of Electrical

More information

NISC Application and Advantages

NISC Application and Advantages NISC Application and Advantages Daniel D. Gajski Mehrdad Reshadi Center for Embedded Computer Systems University of California, Irvine Irvine, CA 92697-3425, USA {gajski, reshadi}@cecs.uci.edu CECS Technical

More information

Software Defined Hardware

Software Defined Hardware Software Defined Hardware For data intensive computation Wade Shen DARPA I2O September 19, 2017 1 Goal Statement Build runtime reconfigurable hardware and software that enables near ASIC performance (within

More information

Exploration of Cache Coherent CPU- FPGA Heterogeneous System

Exploration of Cache Coherent CPU- FPGA Heterogeneous System Exploration of Cache Coherent CPU- FPGA Heterogeneous System Wei Zhang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology 1 Outline ointroduction to FPGA-based

More information

KiloCore: A 32 nm 1000-Processor Array

KiloCore: A 32 nm 1000-Processor Array KiloCore: A 32 nm 1000-Processor Array Brent Bohnenstiehl, Aaron Stillmaker, Jon Pimentel, Timothy Andreas, Bin Liu, Anh Tran, Emmanuel Adeagbo, Bevan Baas University of California, Davis VLSI Computation

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Preclass 1. Computing Forms. Preclass 1

ESE532: System-on-a-Chip Architecture. Today. Message. Preclass 1. Computing Forms. Preclass 1 ESE532: System-on-a-Chip Architecture Day 15: March 15, 2017 (Very Long Instruction Word Processors) Today (Very Large Instruction Word) Demand Basic Model Costs Tuning Penn ESE532 Spring 2017 -- DeHon

More information

SDA: Software-Defined Accelerator for general-purpose big data analysis system

SDA: Software-Defined Accelerator for general-purpose big data analysis system SDA: Software-Defined Accelerator for general-purpose big data analysis system Jian Ouyang(ouyangjian@baidu.com), Wei Qi, Yong Wang, Yichen Tu, Jing Wang, Bowen Jia Baidu is beyond a search engine Search

More information

Industrial-Strength High-Performance RISC-V Processors for Energy-Efficient Computing

Industrial-Strength High-Performance RISC-V Processors for Energy-Efficient Computing Industrial-Strength High-Performance RISC-V Processors for Energy-Efficient Computing Dave Ditzel dave@esperanto.ai President and CEO Esperanto Technologies, Inc. 7 th RISC-V Workshop November 28, 2017

More information

Understanding GPGPU Vector Register File Usage

Understanding GPGPU Vector Register File Usage Understanding GPGPU Vector Register File Usage Mark Wyse AMD Research, Advanced Micro Devices, Inc. Paul G. Allen School of Computer Science & Engineering, University of Washington AGENDA GPU Architecture

More information

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1 Design of Embedded DSP Processors Unit 2: Design basics 9/11/2017 Unit 2 of TSEA26-2017 H1 1 ASIP/ASIC design flow We need to have the flow in mind, so that we will know what we are talking about in later

More information

Program Op*miza*on and Analysis. Chenyang Lu CSE 467S

Program Op*miza*on and Analysis. Chenyang Lu CSE 467S Program Op*miza*on and Analysis Chenyang Lu CSE 467S 1 Program Transforma*on op#mize Analyze HLL compile assembly assemble Physical Address Rela5ve Address assembly object load executable link Absolute

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining

More information

Efficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling

Efficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling Cornell University Efficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling Tao Chen and G. Edward Suh Computer Systems Laboratory Cornell University Accelerator-Rich

More information

CS 152, Spring 2011 Section 10

CS 152, Spring 2011 Section 10 CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel

More information

Digital Signal Processor Core Technology

Digital Signal Processor Core Technology The World Leader in High Performance Signal Processing Solutions Digital Signal Processor Core Technology Abhijit Giri Satya Simha November 4th 2009 Outline Introduction to SHARC DSP ADSP21469 ADSP2146x

More information

Real-Time Support for GPU. GPU Management Heechul Yun

Real-Time Support for GPU. GPU Management Heechul Yun Real-Time Support for GPU GPU Management Heechul Yun 1 This Week Topic: Real-Time Support for General Purpose Graphic Processing Unit (GPGPU) Today Background Challenges Real-Time GPU Management Frameworks

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

PyMTL/Pydgin Tutorial Schedule

PyMTL/Pydgin Tutorial Schedule PyMTL/Pydgin Tutorial Schedule 8:30am 8:50am Virtual Machine Installation and Setup 8:50am 9:00am : PyMTL/Pydgin Tutorial 9:00am 9:10am : Introduction to Pydgin 9:10am 10:00am : Adding a uction using Pydgin

More information

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal

More information

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015.

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015. Optimizing HW/SW Partition of a Complex Embedded Systems Simon George November 2015 Zynq-7000 All Programmable SoC HP ACP GP Page 2 Zynq UltraScale+ MPSoC Page 3 HW/SW Optimization Challenges application()

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms

Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms Arizona State University Dhinakaran Pandiyan(dpandiya@asu.edu) and Carole-Jean Wu(carole-jean.wu@asu.edu

More information

A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators

A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators Giuseppe Tagliavini, DEI University of Bologna Germain Haugou, IIS ETHZ Andrea Marongiu, DEI University of Bologna & IIS

More information

The Bifrost GPU architecture and the ARM Mali-G71 GPU

The Bifrost GPU architecture and the ARM Mali-G71 GPU The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our

More information

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization CSE 160 Lecture 10 Instruction level parallelism (ILP) Vectorization Announcements Quiz on Friday Signup for Friday labs sessions in APM 2013 Scott B. Baden / CSE 160 / Winter 2013 2 Particle simulation

More information

08 - Address Generator Unit (AGU)

08 - Address Generator Unit (AGU) October 2, 2014 Todays lecture Memory subsystem Address Generator Unit (AGU) Schedule change A new lecture has been entered into the schedule (to compensate for the lost lecture last week) Memory subsystem

More information

LACORE: A RISC-V BASED LINEAR ALGEBRA ACCELERATOR FOR SOC DESIGNS

LACORE: A RISC-V BASED LINEAR ALGEBRA ACCELERATOR FOR SOC DESIGNS 1 LACORE: A RISC-V BASED LINEAR ALGEBRA ACCELERATOR FOR SOC DESIGNS Samuel Steffl and Sherief Reda Brown University, Department of Computer Engineering Partially funded by NSF grant 1438958 Published as

More information

Venezia: a Scalable Multicore Subsystem for Multimedia Applications

Venezia: a Scalable Multicore Subsystem for Multimedia Applications Venezia: a Scalable Multicore Subsystem for Multimedia Applications Takashi Miyamori Toshiba Corporation Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and

More information

From Brook to CUDA. GPU Technology Conference

From Brook to CUDA. GPU Technology Conference From Brook to CUDA GPU Technology Conference A 50 Second Tutorial on GPU Programming by Ian Buck Adding two vectors in C is pretty easy for (i=0; i

More information

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2 ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html

More information

OpenCAPI Technology. Myron Slota Speaker name, Title OpenCAPI Consortium Company/Organization Name. Join the Conversation #OpenPOWERSummit

OpenCAPI Technology. Myron Slota Speaker name, Title OpenCAPI Consortium Company/Organization Name. Join the Conversation #OpenPOWERSummit OpenCAPI Technology Myron Slota Speaker name, Title OpenCAPI Consortium Company/Organization Name Join the Conversation #OpenPOWERSummit Industry Collaboration and Innovation OpenCAPI Topics Computation

More information
