Efficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling

Size: px
Start display at page:

Download "Efficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling"

Transcription

1 Cornell University Efficient Data Supply for Hardware Accelerators with Prefetching and Access/ Execute Decoupling Tao Chen and G. Edward Suh Computer Systems Laboratory Cornell University

2 Accelerator-Rich Computing Systems Computing systems are becoming accelerator-rich General-purpose cores + a large number of accelerators Challenge: Design and verification complexity Non-recurring engineering (NRE) cost per accelerator Manual efforts are a major source of cost Create computation pipelines Manage data supply from memory High-Level Synthesis (HLS) This work: An automated framework for generating accelerators with efficient data supply 2

3 Inefficiencies in Accelerator Data Supply Scratchpad-based accelerators On-chip scratchpad memory (SPM) Manually designed logic to move data between SPM and main memory Pros: Good performance Cons: High design effort, accelerator-specific, not reusable Accelerator SPM Compute Logic Memory Bus Preload Logic 3

4 Inefficiencies in Accelerator Data Supply Scratchpad-based accelerators On-chip scratchpad memory (SPM) Manually designed logic to move data between SPM and main memory Pros: Good performance Cons: High design effort, accelerator-specific, not reusable Cache-based accelerators Accelerator Pros: Low design effort, cache can be reused Compute Logic Cache Memory Bus Cons: Uncertain memory latency impacts performance 3

5 Optimize Data Supply for Cache-Based Accelerators Approach: automated framework for generating accelerators with efficient data supply Accelerator Source Automated Framework Accelerator w/ Efficient Data Supply 4

6 Optimize Data Supply for Cache-Based Accelerators Approach: automated framework for generating accelerators with efficient data supply Accelerator Source Automated Framework Accelerator w/ Efficient Data Supply Techniques Prefetching Tagging memory accesses Accelerator Compute Logic Cache HW Prefetcher Memory Bus 4

7 Optimize Data Supply for Cache-Based Accelerators Approach: automated framework for generating accelerators with efficient data supply Accelerator Source Automated Framework Accelerator w/ Efficient Data Supply Techniques Accelerator Prefetching Tagging memory accesses Access/Execute Decoupling Program slicing + architecture template Access Logic Cache Execute Logic HW Prefetcher Memory Bus 4

8 Impact of Uncertain Memory Latency Example: Sparse Matrix Vector Multiplication (spmv) Pipeline generated with High-Level Synthesis (HLS) // inner loop of sparse matrix // vector multiplication for (j = begin; j < end; j++) { #pragma HLS pipeline Si = val[j] * vec[cols[j]]; sum = sum + Si; } HLS Time 5

9 Impact of Uncertain Memory Latency Example: Sparse Matrix Vector Multiplication (spmv) Pipeline generated with High-Level Synthesis (HLS) // inner loop of sparse matrix // vector multiplication for (j = begin; j < end; j++) { #pragma HLS pipeline Si = val[j] * vec[cols[j]]; sum = sum + Si; } Regular stride Irregular Regular stride Reduce cache misses for regular accesses Prefetch data into the cache Tolerate cache misses for irregular accesses Access/Execute Decoupling HLS 5 Time miss A cache miss stalls the entire accelerator pipeline

10 Hardware Prefetching Predict future memory accesses PC is often used as a hint Stream localization Spatial correlation prediction for (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; } Global Addr Stream C C B8 6

11 Hardware Prefetching Predict future memory accesses PC is often used as a hint Stream localization Spatial correlation prediction for (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; } Problem: accelerators lack a PC Solution: generate PC-like tags for accelerator memory accesses Global Addr Stream C C B8 PC x C Local Addr Streams PC y 541C regular strides PC z B8 irregular no pred 6

12 Hardware Prefetching Predict future memory accesses PC is often used as a hint Stream localization Spatial correlation prediction for (j = begin; j < end; j++) { Si = val[j] * vec[cols[j]]; sum = sum + Si; } Problem: accelerators lack a PC Solution: generate PC-like tags for accelerator memory accesses Global Addr Stream C C B8 Local Addr Streams 6 CDFG PC x C PC y 541C regular strides BB2 x BB1 y + BB3 PC z B8 irregular no pred z

13 Decoupled Access/Execute (DAE) Limitations of Hardware Prefetching Not accurate for complex patterns / Needs warm-up time Fundamental reason: lack of semantic information Decoupled Access/Execute Allow memory accesses to run ahead to preload data Memory Access Cache w/ DAE memory latency Value Comp Accelerator Access Logic Execute Logic Time Manual Preload SPM w/ Manual Preload memory latency Compute Cache HW Prefetche r Memory Bus 7

14 Traditional DAE is not Effective for Accelerators miss Traditional DAE: access part forwards data to execute part Problem: access pipeline stalls on misses Throughput is limited by access pipeline Original Access Decoupled Execute Goal: allow access pipeline to continue to flow under misses 8

15 DAE Accelerator with Decoupled Loads Anatomy of a load AGen Req AGen Req Resp Solution: Delegate request/response handling Resp Req Resp AGen 9

16 Memory Unit Proxy for handling memory requests and responses Supports response reordering and store-to-load forwarding Store Addr Load Addr Load Data to AccU Store Data from ExeU Store Addr Queue Dep Check Fwd Data Store Data Queue Fwd Data Queue Mem Unit Load Queue to ExeU memreq memresp 10

17 Memory Unit Proxy for handling memory requests and responses Supports response reordering and store-to-load forwarding Store Addr Load Addr Load Data to AccU Store Data from ExeU Store Addr Queue Dep Check Fwd Data Store Data Queue Fwd Data Queue Mem Unit Load Queue to ExeU memreq ST memresp 10

18 Automated DAE Accelerator Generation Program slicing for generating access/execute slices Architectural template with configurable parameters access.c HLS slicing accel.c slicing execute.c HLS access.v Architectural Template execute.v Parameters Queue sizes Port width MemUnit config etc HW Generation Access/Execute Decoupled Accel Written in PyMTL 11

19 Evaluation Methodology Vertically integrated modeling methodology System components: cycle-level (gem5) Accelerators: register-transfer-level (Vivado HLS, PyMTL) Area, power and energy: gate-level (commercial ASIC flow) Benchmark accelerators from MachSuite Name Description bbgemm Blocked matrix multiplication bfsbulk Breadth-First Search gemm Dense matrix multiplication mdknn Molecular dynamics (K-Nearest Neighbor) nw Needleman-Wunsch algorithm spmvcrs Sparse matrix vector multiplication stencil2d 2D stencil computation viterbi Viterbi algorithm 12

20 Performance Comparison 2.28x speedup on average Prefetching and DAE work in synergy 13

21 Energy Comparison 15% energy reduction on average because of reduced stalls MemUnits/queues only consume a small amount of energy 14

22 More Details in the Paper Deadlock Avoidance Customization of Memory Units Baseline Validation Power and Area Comparison Energy, Power and Area Breakdown Sensitivity Study on Varying Queue Sizes Design Space Exploration: Queue Size Customization 15

23 Summary Cache-based accelerators Avoid the high design cost of manual data movement logic Problem: Inefficient in handling uncertain memory latency Approach: Automated program analysis and architectural template to generate accelerators with efficient data supply Tagging memory requests to enable prefetching Decoupling to enable memory accesses to run ahead Results: High-performance cache-based accelerators with minimal manual efforts 16

Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling

Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling Tao Chen and G. Edward Suh Cornell University Ithaca, NY 14850, USA {tc466, gs272}@cornell.edu Abstract This

More information

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware Tao Chen, Shreesha Srinath Christopher Batten, G. Edward Suh Computer Systems Laboratory School of Electrical

More information

ARCHITECTURAL FRAMEWORKS FOR AUTOMATED DESIGN AND OPTIMIZATION OF HARDWARE ACCELERATORS

ARCHITECTURAL FRAMEWORKS FOR AUTOMATED DESIGN AND OPTIMIZATION OF HARDWARE ACCELERATORS ARCHITECTURAL FRAMEWORKS FOR AUTOMATED DESIGN AND OPTIMIZATION OF HARDWARE ACCELERATORS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the

More information

Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES

Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES Background and Motivation Graph applications are memory latency bound Caches & Prefetching are existing solution for memory

More information

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer

More information

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware Appears in the Proceedings of the 51st ACM/IEEE Int l Symp. on Microarchitecture (MICRO-51), October 2018 An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

More information

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont.   History Table. Correlating Prediction Table Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.

More information

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

A Comprehensive Analytical Performance Model of DRAM Caches

A Comprehensive Analytical Performance Model of DRAM Caches A Comprehensive Analytical Performance Model of DRAM Caches Authors: Nagendra Gulur *, Mahesh Mehendale *, and R Govindarajan + Presented by: Sreepathi Pai * Texas Instruments, + Indian Institute of Science

More information

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Introduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem

Introduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem Introduction Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: Increase computation power Make the best use of available bandwidth We study the bandwidth

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 Reminder: Lab Assignments Lab Assignment 6 Implementing a more

More information

ECE 5775 Student-Led Discussions (10/16)

ECE 5775 Student-Led Discussions (10/16) ECE 5775 Student-Led Discussions (10/16) Talks: 18-min talk + 2-min Q&A Adam Macioszek, Julia Currie, Nick Sarkis Sparse Matrix Vector Multiplication Nick Comly, Felipe Fortuna, Mark Li, Serena Krech Matrix

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

DeSC: Decoupled Supply-Compute Communication Management for Heterogeneous Architectures

DeSC: Decoupled Supply-Compute Communication Management for Heterogeneous Architectures DeSC Decoupled Supply-Compute Communication Management for Heterogeneous Architectures Tae Jun Ham Princeton University tae@princeton.edu Juan L. Aragón University of Murcia jlaragon@um.es Margaret Martonosi

More information

arxiv: v1 [cs.dc] 30 Jul 2018

arxiv: v1 [cs.dc] 30 Jul 2018 AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture Jason Cong 1,2, Peng Wei 1, Cody Hao Yu 1,2, Peng Zhang 2 1 University of California, Los

More information

APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs

APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs Yunho Oh, Keunsoo Kim, Myung Kuk Yoon, Jong Hyun

More information

Johnathan Alsop*, Matthew D. Sinclair*, Sarita V. Adve* *Illinois, AMD, Wisconsin. Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA)

Johnathan Alsop*, Matthew D. Sinclair*, Sarita V. Adve* *Illinois, AMD, Wisconsin. Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA) Johnathan Alsop*, Matthew D. Sinclair*, Sarita V. Adve* *Illinois, AMD, Wisconsin Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA) Specialized architectures are increasingly important in all compute

More information

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2 ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 13 Memory Part 2

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 13 Memory Part 2 ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,

More information

arxiv: v1 [cs.ar] 3 Jul 2018

arxiv: v1 [cs.ar] 3 Jul 2018 Jason Cong cong@cs.ucla.edu Peng Wei peng.wei.prc@cs.ucla.edu Best-Effort FPGA Programming: A Few Steps Can Go a Long Way Zhenman Fang zhenman@cs.ucla.edu Cody Hao Yu hyu@cs.ucla.edu Yuchen Hao haoyc@cs.ucla.edu

More information

ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen

ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation Qisi Wang Hui-Shun Hung Chien-Fu Chen Outline Data Prefetching Exist Data Prefetcher Stride Data Prefetcher Offset Prefetcher

More information

EE 660: Computer Architecture Advanced Caches

EE 660: Computer Architecture Advanced Caches EE 660: Computer Architecture Advanced Caches Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa Based on the slides of Prof. David Wentzlaff Agenda Review Three C s Basic Cache

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15

More information

Cache Refill/Access Decoupling for Vector Machines

Cache Refill/Access Decoupling for Vector Machines Cache Refill/Access Decoupling for Vector Machines Christopher Batten, Ronny Krashinsky, Steve Gerding, Krste Asanović Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of

More information

Caches. Hiding Memory Access Times

Caches. Hiding Memory Access Times Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Agenda Introduction Memory Hierarchy Design CPU Speed vs.

More information

ECE 5730 Memory Systems

ECE 5730 Memory Systems ECE 5730 Memory Systems Spring 2009 Off-line Cache Content Management Lecture 7: 1 Quiz 4 on Tuesday Announcements Only covers today s lecture No office hours today Lecture 7: 2 Where We re Headed Off-line

More information

Software Defined Hardware

Software Defined Hardware Software Defined Hardware For data intensive computation Wade Shen DARPA I2O September 19, 2017 1 Goal Statement Build runtime reconfigurable hardware and software that enables near ASIC performance (within

More information

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 15-740/18-740 Computer Architecture Lecture 14: Runahead Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 Reviews Due Today Chrysos and Emer, Memory Dependence Prediction Using

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,

More information

Tutorial Outline. 9:00 am 10:00 am Pre-RTL Simulation Framework: Aladdin. 8:30 am 9:00 am! Introduction! 10:00 am 10:30 am! Break!

Tutorial Outline. 9:00 am 10:00 am Pre-RTL Simulation Framework: Aladdin. 8:30 am 9:00 am! Introduction! 10:00 am 10:30 am! Break! Tutorial Outline Time Topic! 8:30 am 9:00 am! Introduction! 9:00 am 10:00 am Pre-RTL Simulation Framework: Aladdin 10:00 am 10:30 am! Break! 10:30 am 11:00 am! Workload Characterization Tool: WIICA! 11:00

More information

Multi-Level Memory Prefetching for Media and Stream Processing

Multi-Level Memory Prefetching for Media and Stream Processing Multi-Level Memory Prefetching for Media and Stream Processing Jason Fritts Assistant Professor, Introduction Multimedia is a dominant computer workload Traditional cache-based memory systems not well-suited

More information

CS7810 Prefetching. Seth Pugsley

CS7810 Prefetching. Seth Pugsley CS7810 Prefetching Seth Pugsley Predicting the Future Where have we seen prediction before? Does it always work? Prefetching is prediction Predict which cache line will be used next, and place it in the

More information

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim Farzad Farshchi, Qijing Huang, Heechul Yun University of Kansas, University of California, Berkeley SiFive Internship Rocket

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

CSC 631: High-Performance Computer Architecture

CSC 631: High-Performance Computer Architecture CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 10: Memory Part II CSC 631: High-Performance Computer Architecture 1 Two predictable properties of memory references: Temporal Locality:

More information

Bandwidth Avoiding Stencil Computations

Bandwidth Avoiding Stencil Computations Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu

More information

Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses

Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand, Onur Mutlu, Phillip B. Gibbons, Michael A.

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

BOBCAT: AMD S LOW-POWER X86 PROCESSOR

BOBCAT: AMD S LOW-POWER X86 PROCESSOR ARCHITECTURES FOR MULTIMEDIA SYSTEMS PROF. CRISTINA SILVANO LOW-POWER X86 20/06/2011 AMD Bobcat Small, Efficient, Low Power x86 core Excellent Performance Synthesizable with smaller number of custom arrays

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Project Proposals. 1 Project 1: On-chip Support for ILP, DLP, and TLP in an Imagine-like Stream Processor

Project Proposals. 1 Project 1: On-chip Support for ILP, DLP, and TLP in an Imagine-like Stream Processor EE482C: Advanced Computer Organization Lecture #12 Stream Processor Architecture Stanford University Tuesday, 14 May 2002 Project Proposals Lecture #12: Tuesday, 14 May 2002 Lecturer: Students of the class

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

Limitations of Scalar Pipelines

Limitations of Scalar Pipelines Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline

More information

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp. Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3

More information

Jim Keller. Digital Equipment Corp. Hudson MA

Jim Keller. Digital Equipment Corp. Hudson MA Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership

More information

SGI Challenge Overview

SGI Challenge Overview CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 2 (Case Studies) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived

More information

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:

More information

Dataflow Supercomputers

Dataflow Supercomputers Dataflow Supercomputers Michael J. Flynn Maxeler Technologies and Stanford University Outline History Dataflow as a supercomputer technology openspl: generalizing the dataflow programming model Optimizing

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

1 Introduction The demand on the performance of memory subsystems is rapidly increasing with the advances in microprocessor architecture. The growing

1 Introduction The demand on the performance of memory subsystems is rapidly increasing with the advances in microprocessor architecture. The growing Tango: a Hardware-based Data Prefetching Technique for Superscalar Processors 1 Shlomit S. Pinter IBM Science and Technology MATAM Advance Technology ctr. Haifa 31905, Israel E-mail: shlomit@vnet.ibm.com

More information

Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17,

Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17, Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms SAMOS XIV July 14-17, 2014 1 Outline Introduction + Motivation Design requirements for many-accelerator SoCs Design problems

More information

Lecture 7 - Memory Hierarchy-II

Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Using FPGAs as Microservices

Using FPGAs as Microservices Using FPGAs as Microservices David Ojika, Ann Gordon-Ross, Herman Lam, Bhavesh Patel, Gaurav Kaul, Jayson Strayer (University of Florida, DELL EMC, Intel Corporation) The 9 th Workshop on Big Data Benchmarks,

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM

ESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM ESE532: System-on-a-Chip Architecture Day 20: April 3, 2017 Pipelining, Frequency, Dataflow Today What drives cycle times Pipelining in Vivado HLS C Avoiding bottlenecks feeding data in Vivado HLS C Penn

More information

Load Balancing for Parallel Multi-core Machines with Non-Uniform Communication Costs

Load Balancing for Parallel Multi-core Machines with Non-Uniform Communication Costs Load Balancing for Parallel Multi-core Machines with Non-Uniform Communication Costs Laércio Lima Pilla llpilla@inf.ufrgs.br LIG Laboratory INRIA Grenoble University Grenoble, France Institute of Informatics

More information

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1) Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache

More information

SHOC: The Scalable HeterOgeneous Computing Benchmark Suite

SHOC: The Scalable HeterOgeneous Computing Benchmark Suite SHOC: The Scalable HeterOgeneous Computing Benchmark Suite Dakar Team Future Technologies Group Oak Ridge National Laboratory Version 1.1.2, November 2011 1 Introduction The Scalable HeterOgeneous Computing

More information

h Coherence Controllers

h Coherence Controllers High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs

More information

All About the Cell Processor

All About the Cell Processor All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,

More information

Final Exam Fall 2008

Final Exam Fall 2008 COE 308 Computer Architecture Final Exam Fall 2008 page 1 of 8 Saturday, February 7, 2009 7:30 10:00 AM Computer Engineering Department College of Computer Sciences & Engineering King Fahd University of

More information

History Table. Latest

History Table. Latest Lecture 15 Prefetching Latest History Table A0 Correlating Prediction Table A0,A1 A3 11 Winter 2019 Prof. Ronald Dreslinski A1 Prefetch A3 h8p://www.eecs.umich.edu/courses/eecs470 Slides developed in part

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 Techniques to Reduce Cache Misses Victim caches Better replacement policies pseudo-lru, NRU Prefetching, cache

More information

Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling

Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling Larisa Stoltzfus*, Murali Emani, Pei-Hung Lin, Chunhua Liao *University of Edinburgh (UK), Lawrence Livermore National Laboratory

More information

Optimizations Enabled by a Decoupled Front-End Architecture

Optimizations Enabled by a Decoupled Front-End Architecture Optimizations Enabled by a Decoupled Front-End Architecture Glenn Reinman y Brad Calder y Todd Austin z y Department of Computer Science and Engineering, University of California, San Diego z Electrical

More information

Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Final Examination Name Computer Architecture EE 4720 Final Examination Primary: 6 December 1999, Alternate: 7 December 1999, 10:00 12:00 CST 15:00 17:00 CST Alias Problem 1 Problem 2 Problem 3 Problem 4 Exam Total (25 pts)

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 16: Prefetching Wrap-up Prof. Onur Mutlu Carnegie Mellon University Announcements Exam solutions online Pick up your exams Feedback forms 2 Feedback Survey Results

More information

ECE 4750 Computer Architecture, Fall 2014 T05 FSM and Pipelined Cache Memories

ECE 4750 Computer Architecture, Fall 2014 T05 FSM and Pipelined Cache Memories ECE 4750 Computer Architecture, Fall 2014 T05 FSM and Pipelined Cache Memories School of Electrical and Computer Engineering Cornell University revision: 2014-10-08-02-02 1 FSM Set-Associative Cache Memory

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)

More information

I/O Buffering and Streaming

I/O Buffering and Streaming I/O Buffering and Streaming I/O Buffering and Caching I/O accesses are reads or writes (e.g., to files) Application access is arbitary (offset, len) Convert accesses to read/write of fixed-size blocks

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science CPUtime = IC CPI Execution + Memory accesses Instruction

More information

ECE 5745 PyMTL CL Modeling

ECE 5745 PyMTL CL Modeling README.md - Grip ECE 5745 PyMTL CL Modeling Most students are quite familiar with functional-level (FL) and register-transfer-level (RTL) modeling from ECE 4750, but students are often less familiar with

More information

A 1-GHz Configurable Processor Core MeP-h1

A 1-GHz Configurable Processor Core MeP-h1 A 1-GHz Configurable Processor Core MeP-h1 Takashi Miyamori, Takanori Tamai, and Masato Uchiyama SoC Research & Development Center, TOSHIBA Corporation Outline Background Pipeline Structure Bus Interface

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Real Time. Real-Time Tasks. Real-Time Guarantees. Real Time Demands Challenges

ESE532: System-on-a-Chip Architecture. Today. Message. Real Time. Real-Time Tasks. Real-Time Guarantees. Real Time Demands Challenges ESE532: System-on-a-Chip Architecture Day 9: October 1, 2018 Real Time Real Time Demands Challenges Today Algorithms Architecture Disciplines to achieve Penn ESE532 Fall 2018 -- DeHon 1 Penn ESE532 Fall

More information

15-740/ Computer Architecture, Fall 2011 Midterm Exam II

15-740/ Computer Architecture, Fall 2011 Midterm Exam II 15-740/18-740 Computer Architecture, Fall 2011 Midterm Exam II Instructor: Onur Mutlu Teaching Assistants: Justin Meza, Yoongu Kim Date: December 2, 2011 Name: Instructions: Problem I (69 points) : Problem

More information

ECE 4750 Computer Architecture, Fall 2018 Lab 3: Blocking Cache

ECE 4750 Computer Architecture, Fall 2018 Lab 3: Blocking Cache School of Electrical and Computer Engineering Cornell University revision: 2018-10-18-16-37 In this lab, you will design two finite-state-machine (FSM) cache microarchitectures, which we will eventually

More information

Inherently Lower Complexity Architectures using Dynamic Optimization. Michael Gschwind Erik Altman

Inherently Lower Complexity Architectures using Dynamic Optimization. Michael Gschwind Erik Altman Inherently Lower Complexity Architectures using Dynamic Optimization Michael Gschwind Erik Altman ÿþýüûúùúüø öõôóüòñõñ ðïîüíñóöñð What is the Problem? Out of order superscalars achieve high performance....butatthecostofhighhigh

More information