The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
1 The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
J. Gregory Steffan and Todd C. Mowry, Department of Computer Science, Carnegie Mellon University
The STAMPede Project
2 State-of-the-Art vs. Future Processors
Today: MIPS R10000 (6.8 million transistors). In 15 years: 1 billion transistors.
Challenge: translating these resources into higher performance.
One possibility: multiple processors on a single chip.
3 Performance Benefits of Single-Chip Multiprocessing
Multiprogramming workload: improved throughput (independent programs run on processors P1 through P4 simultaneously).
Single application: to reduce execution time, it must contain parallel threads.
The big question: how do we automatically parallelize all applications?
4 State-of-the-Art in Automatic Parallelization
Numeric applications: dominated by regular array references inside loops:
  FOR i = 1 to N
    FOR j = 1 to N
      FOR k = 1 to N
        C[i][j] += A[i][k] * B[k][j];
Significant progress has been made, e.g., the fastest SPECfp95 number [Bugnion et al. 1996].
Non-numeric applications: access patterns and control flow may be highly irregular (pointer dereferences, recursion, unstructured loops, etc.), and there has been little (if any) success in automatic parallelization. But these applications are important: we must expand the scope of automatic parallelization.
5 Why Is Automatic Parallelization So Difficult?
Current approach: parallelize only if we can statically prove independence:
  FOR i = 1 to N
    A[i] += i;
Transformations can help to eliminate dependences, such as this sequential data dependence:
  FOR i = 1 to N
    A[i] += f(A[i-1]);
For non-numeric codes, understanding memory addresses is extremely difficult:
  while (foo()) {
    x = *q;
    *p = bar();
  }
Major limitation: when the compiler is uncertain, it must be conservative.
6 Expanding the Scope of Automatic Parallelization
The problem: statically proving independence is hopelessly restrictive, and a full understanding of memory addresses is unrealistic. Instead, we should be focusing on performance issues.
Our solution: Thread-Level Data Speculation (TLDS).
7 Overview
Thread-Level Data Speculation (TLDS); An Example: Compress; Experimental Results; Architectural Support; Conclusions.
8 Data Speculation
Basic idea: optimistically perform an access assuming no dependence; if the speculation was unsafe, invoke a recovery action.
Example. Original execution:
  a = f()
  STORE *p = a
  LOAD b = *q
  g(b)
Instruction-level data speculation: the LOAD b = *q is hoisted above the STORE *p = a. If (p != q), the speculation succeeds and g(b) proceeds; if (p == q), recover with b = a before g(b).
9 Thread-Level Data Speculation
Analogous to instruction-level data speculation, except that it involves separate, parallel threads of control.
Original execution (one thread): a = f(); STORE *p = a; LOAD b = *q; g(b).
Thread-level data speculation: Processor 1 executes a = f(); STORE *p = a while Processor 2 speculatively executes LOAD b = *q; g(b) in parallel. If (p != q), the speculation succeeds; if (p == q), Processor 2 re-executes LOAD b = *q and g(b).
10 Related Work
Prior to this study: the Wisconsin Multiscalar architecture [Sohi et al., 1995], a tightly-coupled ring architecture with register forwarding, in which an ARB detects memory dependences and hardware performs rollback.
+ requires relatively little software support
- the large, centralized ARB may increase load latency
- the ring architecture limits flexibility (multiprogramming, locality)
Other recent work: Stanford Hydra [Oplinger et al., 1997]; the Wisconsin Speculative Versioning Cache [Gopal et al., 1997]; Illinois Speculative Run-Time Parallelization [Zhang et al., 1998].
11 Objectives of This Study
Does TLDS offer compelling performance improvements? (a study of the SPEC92 and SPEC95 integer benchmarks)
Can we provide cost-effective hardware support for TLDS? (detecting dependence violations; recovering from failed speculation; goal: the performance of non-TLDS code is not sacrificed)
What compiler support is necessary to exploit TLDS? (optimizations, scheduling, etc.)
12 Overview
Thread-Level Data Speculation (TLDS); An Example: Compress; Experimental Results; Architectural Support; Conclusions.
13 An Example: Compress
  while ((c = getchar()) != EOF) {
    /* perform data compression */
    ... = hash[hash_function(c)];
    ...
    hash[hash_function(...)] = ...;
  }
Potential source of parallelism: data parallelism across the input stream.
From the compiler's perspective: the hash accesses cannot be statically disambiguated.
In reality: consecutive characters rarely hash to the same entry.
14 TLDS Execution of Compress
Each loop iteration becomes an epoch:
  while (...) {
    x = hash[idx1];
    ...
    hash[idx2] = y;
  }
Epochs 1 through 4 run speculatively on Processors 1 through 4, reading hash[3], hash[19], hash[33], and hash[10] respectively, then writing hash[10], hash[21], hash[30], and hash[25], and each calls attempt_commit(). Epoch 1's write to hash[10] violates Epoch 4's speculative read of hash[10], so Epoch 4 must retry (re-reading hash[10] and writing hash[25]) while Epochs 5 through 7 (reading hash[31], hash[9], hash[27]) proceed on the other processors.
15 Other Data Dependences in Compress
  while ((c = getchar()) != EOF) {
    /* perform data compression */
    in_count++;
    if (...) { out_count++; putchar(...); }
    if (free_entries < ...) free_entries = ...;
  }
in_count: an induction variable, implicit in the epoch number.
out_count: a reduction operation; compute partial sums.
getchar(), putchar(): use parallel library routines (also malloc(), etc.).
free_entries: this dependence cannot be eliminated; forward it between epochs.
16 Overview
Thread-Level Data Speculation (TLDS); An Example: Compress; Experimental Results (Relaxing Memory Dependences; Forwarding Data Between Epochs; Speedups); Architectural Support; Conclusions.
17 Benchmark Suite
For each region, the table reports the average dynamic instructions per epoch and the percentage of total dynamic instructions covered.
  SPEC92: compress (r1), gcc (r1, r2), espresso (r1), li (r1, r2), sc (r1)
  SPEC95: m88ksim (r1), ijpeg (r1), perl (r1), go (r1)
  NAS Parallel: buk (r1, r2)
18 Measuring Memory Dependences: Run Lengths
Run length: the number of epochs executed between read-after-write (RAW) dependences.
Example: nine epochs (E1 through E9) run on 4 processors; RAW dependences (at distances d=2 and d=1) cause violations and retries, splitting execution into runs of epochs. With runs of length 3, the average run length = 3.
The average run length is a rough limit on the potential parallelism.
19 Relaxing Memory Data Dependences
Speedup limit for each region (compress.r1, gcc.r1, gcc.r2, li.r1, espresso.r1, li.r2, sc.r1, ijpeg.r1, m88ksim.r1, perl.r1, go.r1, buk.r1, buk.r2) under three configurations: B = the base case; O = compiler optimizations applied to remove dependences; a third configuration additionally removes dependences due to forwardable scalars.
Eliminating dependences and forwarding scalars are important; significant parallelism is available in many cases.
20 Scalar Memory Values: Forwarding Data Between Epochs
Forward a value if dependences on it occur frequently: synchronization is faster than speculation recovery. Forwarding scalar memory values is helpful for performance, but not necessary for correctness.
Register values: these must be forwarded to maintain correctness. Some register dependences may be eliminated (e.g., induction variables, through simple loop rescheduling); all other register dependences are forwarded through memory. What is the impact on performance?
21 Critical Path Lengths
Coarse-grain synchronization: an epoch WAITs for its predecessor, then performs all of its forwarded loads and stores (LOAD A, STORE A, LOAD B, STORE B) before SIGNALing its successor, serializing the entire forwarded region.
Fine-grain synchronization: each forwarded value is synchronized separately (WAIT A, LOAD A, STORE A, SIGNAL A; then WAIT B, LOAD B, STORE B, SIGNAL B), shortening the critical path between epochs.
22 Forwarding Data Between Epochs
Speedup limit for each region under three configurations: c = coarse-grain synchronization; f = fine-grain synchronization; s = fine-grain synchronization with aggressive instruction scheduling.
Fine-grain synchronization is helpful; with aggressive instruction scheduling, forwarding is not a bottleneck.
23 Potential Region Speedup on 4 Processors
Region speedup for each region under two configurations: (1) optimize dependences and forward with fine-grain synchronization; (2) additionally perform aggressive instruction scheduling. Communication latency = 10 cycles.
Aggressive instruction scheduling is a major performance win; potential speedups of twofold or more appear in 11 of 13 regions.
24 Program Speedups
Program speedup for each benchmark (compress, espresso, li, sc, m88ksim, ijpeg, go, gcc, perl, buk), together with the fraction of each program covered by the parallelized regions (coverage ranges from 6.8% to 99.9%) and the corresponding region speedups for all 13 regions.
25 Overview
Thread-Level Data Speculation (TLDS); An Example: Compress; Experimental Results; Architectural Support (Communication Latency; Key Architectural Issues; Detecting Data Dependence Violations; Buffering Speculative State); Conclusions.
26 Base Architecture
A single chip contains four processors, each with its own L1 cache, connected through a 4x4 crossbar switch to an interleaved L2 cache.
Each processor has its own L1 data cache, to maintain single-cycle load latency. The L1 caches are kept coherent, preserving the shared-memory programming model.
27 How Important Is Communication Latency?
Some options: direct L1-to-L1 communication (~2 cycles), or communicating through the L2 cache (~10 cycles).
28 Impact of Communication Latency
Region speedup for each region under two configurations (optimize dependences and forward with fine-grain synchronization; additionally perform aggressive instruction scheduling), at communication latencies of 10 cycles and 2 cycles.
Instruction scheduling reduces the sensitivity to communication latency; communicating through the L2 cache is a viable option.
29 Key Architectural Issues
Thread management: thread creation and epoch scheduling; epoch numbers must be visible to the hardware; speculative vs. non-speculative memory accesses must be distinguished; and recovery from data dependence violations is split (hardware notifies software of the violation; software performs the bulk of the recovery).
Detecting data dependence violations: extend invalidation-based cache coherence.
Buffering speculative state: extend the functionality of the primary data caches.
30 Invalidation-Based Cache Coherence
Example: Processor 2 loads X, and a read request brings X = 1 into its L1 cache. Processor 1 then stores X = 2, which sends an invalidation to Processor 2's cached copy of X.
31 Detecting Data Dependence Violations
Example (p = q = &X): Processor 2, running Epoch 6, calls become_speculative() and speculatively executes LOAD a = *p, marking its cached copy of X as speculatively loaded (SL). Processor 1, running Epoch 5, then executes STORE *q = 2, and the resulting invalidation carries Epoch 5's epoch number. Because Epoch 6 speculatively loaded X and the invalidation comes from a logically earlier epoch, Processor 2's violation flag is set to TRUE, and its attempt_commit() fails.
Each cache line therefore carries two extra bits: speculatively loaded (SL) and speculatively modified (SM).
32 Buffering Speculative State
Speculative stores: software cannot realistically roll back memory side effects. Our solution: buffer speculative stores in the L1 cache until it is safe to commit them to memory.
Speculative loads: if a speculatively loaded line is displaced, we can no longer track dependence violations on it, so the violation flag is set upon eviction of a speculatively accessed line. Correctness is preserved, but performance may suffer.
A 16KB 2-way set-associative L1 cache with a 4-entry victim cache suffices.
33 Overview
Thread-Level Data Speculation (TLDS); An Example: Compress; Experimental Results; Architectural Support; Conclusions.
34 Conclusions
TLDS potentially offers compelling performance improvements: 12 of 13 regions and 7 of 10 programs show speedups on 4 processors.
Only modest hardware modifications are required: the cache coherence protocol is augmented to detect violations, and the primary data cache is used to buffer speculative state.
Compiler support is crucial yet feasible: eliminating data dependences, and aggressive scheduling to minimize the critical path (forwarding).
Ongoing and future work: refining the architecture (described in a technical report) and building the compiler.
Complexity Analysis of A Cache Controller for Speculative Multithreading Chip Multiprocessors Yoshimitsu Yanagawa, Luong Dinh Hung, Chitaka Iwama, Niko Demus Barli, Shuichi Sakai and Hidehiko Tanaka Although
More informationTransactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech
Transactional Memory Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech (Adapted from Stanford TCC group and MIT SuperTech Group) Motivation Uniprocessor Systems Frequency
More informationCOS 318: Operating Systems. NSF, Snapshot, Dedup and Review
COS 318: Operating Systems NSF, Snapshot, Dedup and Review Topics! NFS! Case Study: NetApp File System! Deduplication storage system! Course review 2 Network File System! Sun introduced NFS v2 in early
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationImproving the Performance of Speculatively Parallel Applications on the Hydra CMP
Improving the Performance of Speculatively Parallel Applications on the Hydra CMP Kunle Olukotun, Lance Hammond and Mark Willey Computer Systems Laboratory Stanford University Stanford, CA 94305-4070 http://www-hydra.stanford.edu/
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationThree basic multiprocessing issues
Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationUNIT I (Two Marks Questions & Answers)
UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-
More informationHardware Support for Thread-Level Speculation
Hardware Support for Thread-Level Speculation J. Gregory Steffan CMU-CS-3-122 April 23 School of Computer Science Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 Thesis Committee:
More informationMultiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor
Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor Seon Wook Kim, Chong-Liang Ooi, Il Park, Rudolf Eigenmann, Babak Falsafi, and T. N. Vijaykumar School
More informationOverview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class
Overview of Today s Lecture: Cost & Price, Performance EE176-SJSU Computer Architecture and Organization Lecture 2 Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class EE176
More informationCS 1013 Advance Computer Architecture UNIT I
CS 1013 Advance Computer Architecture UNIT I 1. What are embedded computers? List their characteristics. Embedded computers are computers that are lodged into other devices where the presence of the computer
More informationDeterministic Shared Memory Multiprocessing
Deterministic Shared Memory Multiprocessing Luis Ceze, University of Washington joint work with Owen Anderson, Tom Bergan, Joe Devietti, Brandon Lucia, Karin Strauss, Dan Grossman, Mark Oskin. Safe MultiProcessing
More informationMotivations. Shared Memory Consistency Models. Optimizations for Performance. Memory Consistency
Shared Memory Consistency Models Authors : Sarita.V.Adve and Kourosh Gharachorloo Presented by Arrvindh Shriraman Motivations Programmer is required to reason about consistency to ensure data race conditions
More informationGetting CPI under 1: Outline
CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more
More informationEECS 570 Final Exam - SOLUTIONS Winter 2015
EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32
More informationMemory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System
Center for Information ervices and High Performance Computing (ZIH) Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor ystem Parallel Architectures and Compiler Technologies
More informationHammer Slide: Work- and CPU-efficient Streaming Window Aggregation
Large-Scale Data & Systems Group Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation Georgios Theodorakis, Alexandros Koliousis, Peter Pietzuch, Holger Pirk Large-Scale Data & Systems (LSDS)
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationScalable GPU Graph Traversal!
Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang
More informationMulti-Level Memory Prefetching for Media and Stream Processing
Multi-Level Memory Prefetching for Media and Stream Processing Jason Fritts Assistant Professor, Introduction Multimedia is a dominant computer workload Traditional cache-based memory systems not well-suited
More informationLecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis
Lecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis Credits John Owens / UC Davis 2007 2009. Thanks to many sources for slide material: Computer Organization
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art
More informationLecture 24: Multiprocessing Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationParallel Computer Architecture Spring Memory Consistency. Nikos Bellas
Parallel Computer Architecture Spring 2018 Memory Consistency Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture 1 Coherence vs Consistency
More information