The Vector-Thread Architecture Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanovic
1 The Vector-Thread Architecture Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA ISCA 2004
2 Goals for the Vector-Thread Architecture
- Primary goal is efficiency: high performance with low energy and small area
- Take advantage of whatever parallelism and locality is available: DLP, TLP, ILP; allow intermixing of multiple levels of parallelism
- The programming model is key: encode parallelism and locality in a way that enables a complexity-effective implementation, and provide clean abstractions to simplify coding and compilation
3 Vector and Multithreaded Architectures
(figures: a control processor issuing vector control to PE0..PEN over memory, vs. PEs with independent thread control)
- Vector processors provide efficient DLP execution: they amortize instruction control, amortize loop bookkeeping overhead, and exploit structured memory accesses
- But they are unable to execute loops with loop-carried dependencies or complex internal control flow
- Multithreaded processors can flexibly exploit TLP
- But they are unable to amortize common control overhead across threads, unable to exploit structured memory accesses across threads, and require costly memory-based synchronization and communication between threads
4 Vector-Thread Architecture
- VT unifies the vector and multithreaded compute models
- A control processor interacts with a vector of virtual processors (VPs)
- Vector-fetch: the control processor fetches instructions for all VPs in parallel
- Thread-fetch: a VP fetches its own instructions
- VT allows a seamless intermixing of vector and thread control
(figure: control processor issuing vector-fetches to VP0..VPN over memory, with each VP able to thread-fetch)
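The vector-fetch/thread-fetch split above can be sketched as a small software model. This is an illustrative encoding only, not the VT ISA: the class names are my own, and an AIB is modeled as a function over a VP's registers that returns the next AIB to thread-fetch, or None to stop.

```python
# Illustrative model of VT control flow (names and encoding are assumptions).

class VirtualProcessor:
    def __init__(self):
        self.regs = {}

    def execute(self, aib):
        # A VP has no automatic PC: it runs AIBs only as long as each
        # executed AIB thread-fetches a successor.
        while aib is not None:
            aib = aib(self.regs)

class ControlProcessor:
    def __init__(self, num_vps):
        self.vps = [VirtualProcessor() for _ in range(num_vps)]

    def vector_fetch(self, aib):
        # Vector-fetch: every VP in the vector executes the same AIB.
        for vp in self.vps:
            vp.execute(aib)

def double_aib(regs):
    regs["r0"] = regs.get("r0", 1) * 2
    return None  # no fetch executed, so the VP thread stops

cp = ControlProcessor(4)
cp.vector_fetch(double_aib)
```

A conditional branch would be modeled by `double_aib` returning a different AIB function depending on the register contents, mirroring a predicated fetch instruction.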
5 Outline
- Vector-Thread Architectural Paradigm: abstract model, physical model
- SCALE VT Processor
- Evaluation
- Related Work
6 Virtual Processor Abstraction
- VPs contain a set of registers
- VPs execute RISC-like instructions grouped into atomic instruction blocks (AIBs)
- VPs have no automatic program counter; AIBs must be explicitly fetched
- VPs contain pending vector-fetch and thread-fetch addresses
- A fetch instruction allows a VP to fetch its own AIB, and may be predicated for a conditional branch
- If an AIB does not execute a fetch, the VP thread stops
(figure: a VP's registers fed by vector-fetch and thread-fetch; thread execution as a chain of AIB fetches)
7 Virtual Processor Vector
- A VT architecture includes a control processor and a virtual processor vector: two interacting instruction sets
- A vector-fetch command allows the control processor to fetch an AIB for all the VPs in parallel
- Vector-load and vector-store commands transfer blocks of data between memory and the VP registers
(figure: control processor issuing vector-fetch, vector-load, and vector-store commands to VP0..VPN through a vector memory unit)
8 Cross-VP Data Transfers
- Cross-VP connections provide fine-grain data operand communication and synchronization
- VP instructions may target nextVP as a destination or use prevVP as a source
- The crossVP queue holds wrap-around data; the control processor can push and pop
- The restricted ring communication pattern is cheap to implement, scalable, and matches the software usage model for VPs
(figure: control processor crossVP-push and crossVP-pop at the ends of a ring through VP0..VPN and the crossVP queue)
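The ring pattern described above can be sketched as a software model of a running sum, a classic loop-carried dependence. The queue-based encoding is my own illustration, not SCALE hardware: each VP receives from prevVP, adds its element, and sends to nextVP, while the control processor seeds and drains the crossVP queue.

```python
from collections import deque

def crossvp_scan(inputs, seed):
    """Model of a vector-fetched AIB chain using prevVP/nextVP transfers.
    One VP per loop iteration; the control processor pushes the seed and
    pops the final wrapped-around value."""
    queue = deque([seed])               # control processor push
    results = []
    for x in inputs:                    # VPs execute in ring order
        acc = queue.popleft() + x       # prevvp receive
        results.append(acc)
        queue.append(acc)               # nextvp send
    return results, queue.popleft()     # control processor pop

partials, final = crossvp_scan([1, 2, 3, 4], 0)
```

In hardware the sends and receives of different iterations overlap (slide 16's dynamic scheduling); this sequential model only shows the dataflow, not the timing.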
9 Mapping Loops to VT
- A broad class of loops map naturally to VT: vectorizable loops, loops with loop-carried dependencies, and loops with internal control flow
- Each VP executes one loop iteration; the control processor manages the execution
- Stripmining enables implementation-dependent vector lengths
- The programmer or compiler only schedules one loop iteration on one VP: no cross-iteration scheduling
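Stripmining as described above can be sketched as follows. This is a behavioral model, not SCALE code: `hw_vlen` stands in for the implementation-dependent number of VPs, and the per-element kernel stands in for the AIB mapped onto one VP.

```python
def stripmine(data, hw_vlen, vp_kernel):
    # Control-processor loop: each pass vector-loads one strip of up to
    # hw_vlen elements, vector-fetches vp_kernel onto one VP per element,
    # and collects the vector-stored results. Software never hard-codes
    # the hardware vector length, so the same loop runs on any lane count.
    out = []
    for base in range(0, len(data), hw_vlen):
        strip = data[base:base + hw_vlen]        # vector-load
        out.extend(vp_kernel(x) for x in strip)  # vector-fetch
    return out                                   # vector-store results

result = stripmine(list(range(10)), 4, lambda x: 2 * x)
```

Note the final strip may be shorter than `hw_vlen`; in VT this corresponds to the control processor setting a smaller vector length for the last pass.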
10 Vectorizable Loops
- Data-parallel loops with no internal control flow are mapped using vector commands, with predication for small conditionals
(figure: a loop-iteration DAG of shift, multiply, and add operations replicated across VP0..VPN, driven by vector-loads, a vector-fetch, and a vector-store)
11 Loop-Carried Dependencies
- Loops with cross-iteration dependencies are mapped using vector commands with cross-VP data transfers
- A vector-fetch introduces a chain of prevVP receives and nextVP sends
- Vector-memory commands can still be used with non-vectorizable compute
(figure: the shift/multiply/add DAG chained across VP0..VPN through cross-VP transfers)
12 Loops with Internal Control Flow
- Data-parallel loops with large conditionals or inner loops are mapped using thread-fetches
- Vector commands and thread-fetches are freely intermixed
- Once launched, the VP threads execute to completion before the next control processor command
(figure: compare-and-branch blocks thread-fetched a varying number of times on each of VP0..VPN)
13 VT Physical Model
- A vector-thread unit contains an array of lanes with physical register files and execution units
- VPs map to lanes and share physical resources; VP execution is time-multiplexed on the lanes
- Independent parallel lanes exploit parallelism across VPs and data operand locality within VPs
(figure: control processor and vector memory unit feeding lanes 0-3, with VP0,4,8,12 on lane 0, VP1,5,9,13 on lane 1, and so on)
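The VP-to-lane striping shown in the figure can be captured in a few lines. The round-robin mapping matches the slide's figure; the in-lane order is only illustrative, since the actual interleaving within a lane is scheduled dynamically (slide 15).

```python
def map_vps_to_lanes(num_vps, num_lanes=4):
    # VPs are striped across lanes (VP number modulo lane count) and
    # time-multiplexed within each lane. Listing them in VP order here
    # is a simplification of the dynamic hardware schedule.
    lanes = [[] for _ in range(num_lanes)]
    for vp in range(num_vps):
        lanes[vp % num_lanes].append(vp)
    return lanes

lanes = map_vps_to_lanes(16)   # the 16-VP configuration in the figure
```

Striping keeps a VP's operands local to one lane, which is what lets each lane exploit data operand locality independently.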
14 Lane Execution
- Lanes execute decoupled from each other
- A command management unit handles vector-fetch and thread-fetch commands
- The execution cluster executes instructions in order from a small AIB cache (e.g. 32 instructions)
- AIB caches exploit locality to reduce instruction fetch energy (on par with a register read)
- Execute directives point to AIBs and indicate which VP(s) the AIB should be executed for: for a thread-fetch command, the lane executes the AIB for the requesting VP; for a vector-fetch command, the lane executes the AIB for every VP
- AIBs and vector-fetch commands reduce control overhead: 10s to 100s of instructions execute per fetch address tag-check, even for non-vectorizable loops
(figure: lane 0's AIB cache with tags, fill unit, and execute directives serving VP0,4,8,12)
15 VP Execution Interleaving
- Hardware provides the benefits of loop unrolling by interleaving VPs
- Time-multiplexing can hide thread-fetch, memory, and functional unit latencies
(figure: AIBs from vector-fetches and thread-fetches time-multiplexed across the VPs on lanes 0-3 over time)
16 VP Execution Interleaving
- Dynamic scheduling of cross-VP data transfers automatically adapts to the software critical path (in contrast to static software pipelining)
- No static cross-iteration scheduling; tolerant of variable dynamic latencies
(figure: cross-VP chains overlapping across lanes 0-3 over time)
17 SCALE Vector-Thread Processor
- SCALE is designed to be a complexity-effective all-purpose embedded processor
- Exploit all available forms of parallelism and locality to achieve high performance and low energy
- Constrained to small area (estimated 10 mm² in 0.18 µm); reduce wire delay and complexity; support tiling of multiple SCALE processors for increased throughput
- Careful balance between software and hardware for code mapping and scheduling
- Optimize runtime energy, area efficiency, and performance while maintaining a clean, scalable programming model
18 SCALE Clusters
- VPs are partitioned into four clusters to exploit ILP and allow lane implementations to optimize area, energy, and circuit delay
- Clusters are heterogeneous: c0 can execute loads and stores, c1 can execute fetches, c3 has an integer multiplier/divider
- Clusters execute decoupled from each other
(figure: control processor, AIB fill unit, and L1 cache serving four lanes, each with clusters c0-c3)
19 SCALE Registers and VP Configuration
- Atomic instruction blocks allow VPs to share temporary state that is only valid within the AIB
- VP general registers are divided into private and shared
- Chain registers at cluster inputs avoid reading and writing the general register file, to save energy
- The number of VP registers in each cluster is configurable: the hardware can support more VPs when they each have fewer private registers
- Low overhead: a control processor instruction configures the VPs before entering a stripmine loop; VP state is undefined across reconfigurations
- Example configurations: 4 VPs with 0 shared and 8 private registers; 7 VPs with 4 shared and 4 private registers; 25 VPs with 7 shared and 1 private register
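The three example configurations above are consistent with the 32-register-per-cluster file listed on slide 22. The formula below is my inference from the slide's numbers, not something the slides state explicitly:

```python
def vps_per_lane(private_regs, shared_regs, regfile_size=32):
    """VP count one cluster's register file can support for a given
    private/shared split: the shared registers are allocated once,
    and the remainder is divided into per-VP private sets.
    regfile_size=32 matches SCALE's 32 registers/cluster."""
    return (regfile_size - shared_regs) // private_regs

# Reproduce the slide's three configurations:
configs = [vps_per_lane(8, 0), vps_per_lane(4, 4), vps_per_lane(1, 7)]
```

With 4 lanes, the 1-private-register case also explains the 128-VP maximum on slide 22 (4 lanes x 32 VPs with no shared registers).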
20 SCALE Micro-Ops
- The assembler translates the portable software ISA into hardware micro-ops
- Per-cluster micro-op bundles access local registers only
- Inter-cluster data transfers are broken into transports and writebacks
- Software VP code (cluster: operation -> destinations):
    c0: xor pr0, pr1 -> c1/cr0, c2/cr0
    c1: sll cr0, 2 -> c2/cr1
    c2: add cr0, cr1 -> pr0
- Hardware micro-op bundles (writeback / compute / transport; cluster 3 not shown):
    Cluster 0: compute xor pr0,pr1; transport -> c1, c2
    Cluster 1: writeback c0 -> cr0; compute sll cr0,2; transport -> c2
    Cluster 2: writeback c0 -> cr0; writeback c1 -> cr1; compute add cr0,cr1 -> pr0
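The translation above follows a mechanical pattern that can be sketched in a few lines. This is a simplified model of the idea, not the SCALE assembler: it ignores cluster 3, chain-register allocation, and micro-op scheduling, and the tuple encoding is my own.

```python
def to_micro_ops(vp_code):
    """Split software VP instructions into per-cluster bundles: each
    instruction becomes a compute micro-op on its own cluster, plus a
    transport on the sender and a writeback on each remote destination
    cluster, so every bundle touches only local registers."""
    bundles = {}
    for cluster, op, dests in vp_code:
        bundles.setdefault(cluster, []).append(("compute", op))
        remote = [(dc, dr) for dc, dr in dests if dc != cluster]
        if remote:
            bundles[cluster].append(("transport", [dc for dc, _ in remote]))
        for dc, dr in remote:
            bundles.setdefault(dc, []).append(("writeback", (cluster, dr)))
    return bundles

# The example from this slide:
bundles = to_micro_ops([
    (0, "xor pr0,pr1", [(1, "cr0"), (2, "cr0")]),
    (1, "sll cr0,2", [(2, "cr1")]),
    (2, "add cr0,cr1 -> pr0", [(2, "pr0")]),
])
```

Run on the slide's example, this yields the same three bundles shown above: cluster 0 gets compute + transport, cluster 1 gets writeback + compute + transport, and cluster 2 gets two writebacks + compute.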
21 SCALE Cluster Decoupling
- Cluster execution is decoupled: cluster AIB caches hold micro-op bundles, and each cluster has its own execute-directive queue and local control
- Inter-cluster data transfers synchronize with handshake signals
- Memory access decoupling (see paper): a load-data queue enables continued execution after a cache miss, and a decoupled store queue enables loads to slip ahead of stores
(figure: per-cluster AIB caches with compute, writeback, and transport stages)
22 SCALE Prototype and Simulator
- A prototype SCALE processor is in development
- Control processor: MIPS, 1 instruction/cycle
- VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
- Primary I/D cache: 32 KB, 4x128b per cycle, non-blocking
- DRAM: 64b, 200 MHz DDR2 (64b at 400 Mb/s: 3.2 GB/s)
- Estimated 10 mm² in 0.18 µm at 400 MHz (25 FO4)
- Cycle-level execution-driven microarchitectural simulator with a detailed VTU and memory system model
(figure: 2.5 mm x 4 mm floorplan with four 8 KB cache banks, crossbar, memory unit/cache control, control processor, and the lanes with load-data queues and multiply/divide unit)
23 Benchmarks
- A diverse set of 22 benchmarks chosen to evaluate a range of applications with different types of parallelism: 16 from EEMBC, 6 from MiBench, Mediabench, and SpecInt
- Hand-written VP assembly code linked with C code compiled for the control processor using gcc; this reflects the typical usage model for embedded processors
- EEMBC enables comparison to other processors running hand-optimized code, but it is not an ideal comparison: performance depends greatly on programmer effort; algorithmic changes are allowed for some benchmarks and are often unpublished; performance depends greatly on special compute instructions or sub-word SIMD operations (for the current evaluation, SCALE does not provide these); and processors use unequal silicon area, process technologies, and circuit styles
- Overall results: SCALE is competitive with larger and more complex processors on a wide range of codes from different domains (see paper for detailed results)
- Results are a snapshot; the SCALE microarchitecture and benchmark mappings continue to improve
24 Mapping Parallelism to SCALE
- Data-parallel loops with no complex control flow: use vector-fetch and vector-memory commands; EEMBC rgbcmy, rgbyiq, and hpg execute 6-10 ops/cycle for 12x-32x speedup over the control processor, and performance scales with the number of lanes
- Loops with loop-carried dependencies: use vector-fetched AIBs with cross-VP data transfers; in Mediabench adpcm.dec, two loop-carried dependencies propagate in parallel and on average 7 loop iterations execute in parallel, for an 8x speedup; MiBench sha has 5 loop-carried dependencies and exploits ILP
(figure: speedup vs. the control processor for rgbcmy, rgbyiq, hpg, adpcm.dec, and sha with 1, 2, 4, and 8 lanes, plus the prototype configuration)
25 Mapping Parallelism to SCALE
- Data-parallel loops with large conditionals: use vector-fetched AIBs with conditional thread-fetches; in EEMBC dither, a special case for white pixels takes 18 ops vs. 49
- Data-parallel loops with inner loops: use vector-fetched AIBs with thread-fetches for the inner loop; in EEMBC lookup, VPs execute pointer-chasing IP address lookups in a routing table
- Free-running threads: no control processor interaction; VP worker threads get tasks from a shared queue using atomic memory operations; EEMBC pntrch and MiBench qsort achieve significant speedups
(figure: speedup vs. the control processor for dither, lookup, pntrch, and qsort with 1, 2, 4, and 8 lanes)
26 Comparison to Related Work
- TRIPS and Smart Memories can also exploit multiple types of parallelism, but must explicitly switch modes
- Raw's tiled processors provide much lower compute density than SCALE's clusters, which factor out instruction and data overheads and use direct communication instead of programmed switches
- Multiscalar passes loop-carried register dependencies around a ring network, but it focuses on speculative execution and memory, whereas VT uses simple logic to support common loop types
- Imagine organizes computation into kernels to improve register file locality, but it exposes machine details with a low-level VLIW ISA, in contrast to VT's VP abstraction and AIBs
- CODE uses register renaming to hide its decoupled clusters from software, whereas SCALE simplifies hardware by exposing clusters and statically partitioning inter-cluster transport and writeback ops
- Jesshope's micro-threading is similar in spirit to VT, but its threads are dynamically scheduled and cross-iteration synchronization uses full/empty bits on global registers
27 Summary
- The vector-thread architectural paradigm unifies the vector and multithreaded compute models
- The VT abstraction introduces a small set of primitives that allow software to succinctly encode parallelism and locality and seamlessly intermingle DLP, TLP, and ILP: virtual processors, AIBs, vector-fetch and vector-memory commands, thread-fetches, and cross-VP data transfers
- The SCALE VT processor efficiently achieves high performance on a wide range of embedded applications
The Vector-Thread Architecture: the vector-thread (VT) architecture supports a seamless intermingling of vector and multithreaded computation to flexibly and compactly encode application parallelism and locality, with low power and small area.
CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCore Fusion: Accommodating Software Diversity in Chip Multiprocessors
Core Fusion: Accommodating Software Diversity in Chip Multiprocessors Authors: Engin Ipek, Meyrem Kırman, Nevin Kırman, and Jose F. Martinez Navreet Virk Dept of Computer & Information Sciences University
More informationCS 351 Final Exam Solutions
CS 351 Final Exam Solutions Notes: You must explain your answers to receive partial credit. You will lose points for incorrect extraneous information, even if the answer is otherwise correct. Question
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationAdvanced Computer Architecture
ECE 563 Advanced Computer Architecture Fall 2010 Lecture 6: VLIW 563 L06.1 Fall 2010 Little s Law Number of Instructions in the pipeline (parallelism) = Throughput * Latency or N T L Throughput per Cycle
More informationProblem M5.1: VLIW Programming
Problem M5.1: VLIW Programming Last updated: Ben Bitdiddle and Louis Reasoner have started a new company called Transbeta and are designing a new processor named Titanium. The Titanium processor is a single-issue
More informationVLIW/EPIC: Statically Scheduled ILP
6.823, L21-1 VLIW/EPIC: Statically Scheduled ILP Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationThe Bifrost GPU architecture and the ARM Mali-G71 GPU
The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationData-Level Parallelism in SIMD and Vector Architectures. Advanced Computer Architectures, Laura Pozzi & Cristina Silvano
Data-Level Parallelism in SIMD and Vector Architectures Advanced Computer Architectures, Laura Pozzi & Cristina Silvano 1 Current Trends in Architecture Cannot continue to leverage Instruction-Level parallelism
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationSimultaneous Multithreading: a Platform for Next Generation Processors
Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationProcessor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP
Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Presenter: Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Outline Overview of M.I.T. Raw processor
More informationComputer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationECE 4750 Computer Architecture, Fall 2018 T15 Advanced Processors: VLIW Processors
ECE 4750 Computer Architecture, Fall 2018 T15 Advanced Processors: VLIW Processors School of Electrical and Computer Engineering Cornell University revision: 2018-11-28-13-01 1 Motivating VLIW Processors
More informationCPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor
Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction
More informationUNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.
UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known
More informationChapter 4 The Processor (Part 4)
Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationEITF20: Computer Architecture Part2.1.1: Instruction Set Architecture
EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More information