The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization

Slide 1: The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
J. Gregory Steffan and Todd C. Mowry
Department of Computer Science, Carnegie Mellon University
The STAMPede Project

Slide 2: State-of-the-Art vs. Future Processors
- Today: MIPS R10000 (6.8 million transistors); in 15 years: 1 billion transistors
- Challenge: translating these resources into higher performance
- One possibility: multiple processors on a single chip

Slide 3: Performance Benefits of Single-Chip Multiprocessing
- Multiprogramming workload: improved throughput
- Single application: to reduce execution time, the application must contain parallel threads
  [Figure: execution timelines contrasting one processor (P1) with four processors (P1 through P4) for each workload]
- The big question: how do we automatically parallelize all applications?

Slide 4: State-of-the-Art in Automatic Parallelization
- Numeric applications: dominated by regular array references inside loops:

    FOR i = 1 to N
      FOR j = 1 to N
        FOR k = 1 to N
          C[i][j] += A[i][k] * B[k][j];

  Significant progress has been made, e.g., the fastest SPECfp95 number [Bugnion et al. 1996].
- Non-numeric applications: access patterns and control flow may be highly irregular (pointer dereferences, recursion, unstructured loops, etc.); little (if any) success in automatic parallelization, but these applications are important!
- We must expand the scope of automatic parallelization.

Slide 5: Why Is Automatic Parallelization So Difficult?
- Current approach: parallelize only if we can statically prove independence.

    FOR i = 1 to N        (independent: parallel)
      A[i] += i;

    FOR i = 1 to N        (sequential data dependence)
      A[i] += f(A[i-1]);

  Transformations can help to eliminate dependences.
- Major limitation: for non-numeric codes, understanding memory addresses is extremely difficult.

    while (foo()) {
      x = *q;        /* dependence between *q and *p? */
      *p = bar();
    }

- When the compiler is uncertain, it must be conservative.

Slide 6: Expanding the Scope of Automatic Parallelization
- The problem: statically proving independence is hopelessly restrictive
  - a full understanding of memory addresses is unrealistic
  - instead, we should be focusing on performance issues
- Our solution: Thread-Level Data Speculation (TLDS)

Slide 7: Overview
- Thread-Level Data Speculation (TLDS)
- An Example: Compress
- Experimental Results
- Architectural Support
- Conclusions

Slide 8: Data Speculation
- Basic idea: optimistically perform an access assuming no dependence; if the speculation was unsafe, invoke a recovery action.
- Example (instruction-level data speculation):

    Original execution:
      a = f()
      STORE *p = a
      LOAD b = *q
      g(b)

    Speculative execution (LOAD hoisted above the possibly-aliasing STORE):
      LOAD b = *q          ; speculative
      a = f()
      STORE *p = a
      (p == q): b = a      ; recovery action
      (p != q): no action  ; speculation succeeded
      g(b)
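To make the transformation concrete, here is the slide's example rendered as a compilable C sketch; f(), g(), and the driver are illustrative stand-ins, and the explicit p == q test models the recovery check that speculative hardware would perform:

    #include <stdio.h>

    static int f(void) { return 42; }            /* illustrative producer */
    static void g(int b) { printf("%d\n", b); }  /* illustrative consumer */

    static int x;                    /* the location both pointers target */
    static int *p = &x, *q = &x;     /* here p and q alias, so speculation fails */

    /* Original execution: the load must wait for the store. */
    static void original(void) {
        int a = f();
        *p = a;          /* STORE *p = a */
        int b = *q;      /* LOAD  b = *q */
        g(b);
    }

    /* Instruction-level data speculation: the load is hoisted above the
     * possibly-aliasing store, with an explicit recovery action. */
    static void speculated(void) {
        int b = *q;      /* speculative LOAD, performed early      */
        int a = f();
        *p = a;          /* STORE *p = a                           */
        if (p == q)      /* dependence existed: speculation unsafe */
            b = a;       /* recovery: pick up the stored value     */
        g(b);
    }

    int main(void) {
        original();      /* prints 42 */
        speculated();    /* also prints 42, despite the early load */
        return 0;
    }

Hoisting the load this way pays off whenever p and q rarely alias, since the recovery path is then the uncommon case.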

Slide 9: Thread-Level Data Speculation
- Analogous to instruction-level data speculation, except that it involves separate, parallel threads of control.

    Original execution (one thread):
      a = f()
      STORE *p = a
      LOAD b = *q
      g(b)

    Thread-level data speculation:
      Processor 1:  a = f(); STORE *p = a
      Processor 2:  LOAD b = *q; g(b)      ; speculative, in parallel
      (p == q): Processor 2 re-executes LOAD b = *q; g(b)
      (p != q): speculation succeeded

Slide 10: Related Work
- Prior to this study: Wisconsin Multiscalar architecture [Sohi et al., 1995]
  - tightly-coupled ring architecture with register forwarding
  - the ARB detects memory dependences; hardware rollback
  - (+) requires relatively little software support
  - (-) the large, centralized ARB may increase load latency
  - (-) the ring architecture limits flexibility (multiprogramming, locality)
- Other recent work:
  - Stanford Hydra [Oplinger et al., 1997]
  - Wisconsin Speculative Versioning Cache [Gopal et al., 1997]
  - Illinois Speculative Run-Time Parallelization [Zhang et al., 1998]

Slide 11: Objectives of This Study
- Does TLDS offer compelling performance improvements?
  - a study of the SPEC92 and SPEC95 integer benchmarks
- Can we provide cost-effective hardware support for TLDS?
  - detecting dependence violations
  - recovering from failed speculation
  - goal: the performance of non-TLDS code is not sacrificed
- What compiler support is necessary to exploit TLDS?
  - optimizations, scheduling, etc.

Slide 12: Overview
- Thread-Level Data Speculation (TLDS)
- An Example: Compress
- Experimental Results
- Architectural Support
- Conclusions

Slide 13: An Example: Compress

    while ((c = getchar()) != EOF) {
      /* perform data compression */
      ... = hash[hash_function(c)];
      hash[hash_function(...)] = ...;
    }

- Potential source of parallelism: data parallelism across the input stream, treating each iteration as an epoch
- From the compiler's perspective: the hash accesses cannot be statically disambiguated
- In reality: consecutive characters rarely hash to the same entry

Slide 14: TLDS Execution of Compress
- Each iteration of the loop

    while (...) {
      x = hash[idx1];
      hash[idx2] = y;
    }

  becomes an epoch, and epochs execute speculatively in parallel on four processors.
- [Figure: Epochs 1 through 4 read hash[3], hash[19], hash[33], and hash[10], write hash[10], hash[21], hash[30], and hash[25], and each calls attempt_commit(). Epoch 4's read of hash[10] conflicts with Epoch 1's write of hash[10] (Violation!), so Epoch 4 retries, re-reading hash[10], writing hash[25], and committing, while Epochs 5 through 7 (reading hash[31], hash[9], and hash[27]) proceed on the other processors.]
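In software, each epoch's body might look roughly like the sketch below, using the become_speculative() and attempt_commit() primitives that appear on the violation-detection slide; the retry loop and the declared signatures are assumptions, since the talk does not spell out the runtime interface:

    /* Primitives from the violation-detection slide; exact signatures
     * are assumptions. */
    extern void become_speculative(void);
    extern int  attempt_commit(void);      /* nonzero on success */

    extern int hash[];                     /* the shared hash table */

    void run_one_epoch(int idx1, int idx2, int y) {
        int x;
        do {
            become_speculative();          /* start (or restart) this epoch */
            x = hash[idx1];                /* speculative load              */
            hash[idx2] = y;                /* speculative store, held in L1 */
        } while (!attempt_commit());       /* fails on a violation, as      */
                                           /* happens to Epoch 4 above      */
        (void)x;                           /* x would feed the compression  */
    }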

Slide 15: Other Data Dependences in Compress

    while ((c = getchar()) != EOF) {
      /* perform data compression */
      in_count++;
      if (...) { out_count++; putchar(...); }
      if (free_entries < ...) free_entries = ...;
    }

- in_count: an induction variable, implicit in the epoch number
- out_count: a reduction operation; compute partial sums
- getchar(), putchar(): use parallel library routines (also malloc(), etc.)
- free_entries: the dependence cannot be eliminated; forward the value between epochs
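As a minimal sketch of the out_count reduction, assuming one partial counter per processor (NUM_PROCS, my_proc(), and the combining pass are illustrative, not the talk's actual runtime):

    #define NUM_PROCS 4

    extern int my_proc(void);                  /* hypothetical: id of this CPU */
    static long out_count_partial[NUM_PROCS];  /* one counter per processor    */
    static long out_count;

    /* Inside an epoch, instead of out_count++ (a cross-epoch dependence): */
    static void count_output(void) {
        out_count_partial[my_proc()]++;        /* private, dependence-free     */
    }

    /* After the parallel region, combine the partial sums sequentially: */
    static void combine_out_count(void) {
        for (int p = 0; p < NUM_PROCS; p++)
            out_count += out_count_partial[p];
    }

in_count needs no counter at all: epoch i can reconstruct it as the initial value plus i.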

Slide 16: Overview
- Thread-Level Data Speculation (TLDS)
- An Example: Compress
- Experimental Results
  - Relaxing Memory Dependences
  - Forwarding Data Between Epochs
  - Speedups
- Architectural Support
- Conclusions

Slide 17: Benchmark Suite
- SPEC92: compress.r1; gcc.r1, gcc.r2; espresso.r1; li.r1, li.r2; sc.r1
- SPEC95: m88ksim.r1; ijpeg.r1; perl.r1; go.r1
- NAS Parallel: buk.r1, buk.r2
- [Table: for each region, the average dynamic instructions per epoch and the percentage of total dynamic instructions; the numeric columns are not recoverable from this transcription]

Slide 18: Measuring Memory Dependences: Run Lengths
- Run length: the number of epochs between Read-After-Write (RAW) dependences
- [Figure: epochs E1 through E9 executing on 4 processors; RAW dependences of distance d=2 and d=1 cause violations, forcing E3 through E6 and then E6 through E9 to retry; the resulting run lengths average 3]
- The average run length is a rough limit on the potential parallelism
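To make the metric concrete, here is a small sketch of computing the average run length from a trace, assuming a per-epoch flag marking whether that epoch suffered a RAW violation (the flag array and its origin are illustrative):

    /* Average run length: epochs executed between successive RAW violations. */
    static double average_run_length(const int *violated, int num_epochs) {
        int run = 0, runs = 0, total = 0;
        for (int e = 0; e < num_epochs; e++) {
            run++;
            if (violated[e]) {        /* a RAW dependence ends this run */
                total += run;
                runs++;
                run = 0;
            }
        }
        return runs ? (double)total / runs : (double)num_epochs;
    }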

Slide 19: Relaxing Memory Data Dependences
- [Figure: speedup limit for each region (compress.r1, gcc.r1, gcc.r2, li.r1, espresso.r1, li.r2, sc.r1, ijpeg.r1, m88ksim.r1, perl.r1, go.r1, buk.r1, buk.r2) under three configurations: B = base case; O = compiler optimizations applied to remove dependences; S = dependences due to forwardable scalars also removed]
- Eliminating dependences and forwarding scalars are important
- Significant parallelism is available in many cases

Slide 20: Forwarding Data Between Epochs
- Scalar memory values: forward the value if dependences occur frequently
  - synchronization is faster than speculation recovery
  - helpful for performance, not necessary for correctness
- Register values: must be forwarded to maintain correctness
  - some register dependences may be eliminated, e.g., induction variables, through simple loop rescheduling
  - all other register dependences are forwarded through memory
- What is the impact on performance?

Slide 21: Critical Path Lengths
- [Figure: epochs E1 through E4 under two schemes. Coarse-grain synchronization: each epoch executes WAIT; LOAD A; STORE A; LOAD B; STORE B; SIGNAL, so all forwarded values share one long serialized region. Fine-grain synchronization: each value has its own pair, WAIT A; LOAD A; STORE A; SIGNAL A; WAIT B; LOAD B; STORE B; SIGNAL B, shortening the critical path through each forwarded value.]
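The two placements can be sketched as follows; wait_*() and signal_*() are hypothetical primitives standing in for the slide's WAIT/SIGNAL operations, and A and B are two scalars forwarded between epochs:

    extern void wait_all(void), signal_all(void);   /* one pair for everything */
    extern void wait_A(void), signal_A(void);       /* per-value pairs         */
    extern void wait_B(void), signal_B(void);
    extern int  compute_a(int), compute_b(int);     /* illustrative work       */
    extern int  A, B;                /* scalars forwarded between epochs */

    /* Coarse-grain: one WAIT/SIGNAL pair covers every forwarded scalar,
     * so the whole body sits on the inter-epoch critical path. */
    void epoch_body_coarse(void) {
        wait_all();              /* WAIT               */
        A = compute_a(A);        /* LOAD A ... STORE A */
        B = compute_b(B);        /* LOAD B ... STORE B */
        signal_all();            /* SIGNAL             */
    }

    /* Fine-grain: each forwarded scalar gets its own WAIT/SIGNAL pair,
     * so the successor can start using A before this epoch touches B. */
    void epoch_body_fine(void) {
        wait_A();                /* WAIT A             */
        A = compute_a(A);        /* LOAD A ... STORE A */
        signal_A();              /* SIGNAL A           */
        wait_B();                /* WAIT B             */
        B = compute_b(B);        /* LOAD B ... STORE B */
        signal_B();              /* SIGNAL B           */
    }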

Slide 22: Forwarding Data Between Epochs
- [Figure: speedup limit for each region under three schemes: c = coarse-grain synchronization; f = fine-grain synchronization; s = fine-grain synchronization with aggressive instruction scheduling. The per-region values are garbled in this transcription.]
- Fine-grain synchronization is helpful
- With aggressive instruction scheduling, forwarding is not a bottleneck

Slide 23: Potential Region Speedup on 4 Processors
- [Figure: region speedup for all 13 regions under two configurations: optimize dependences and forward with fine-grain synchronization; and additionally perform aggressive instruction scheduling. Communication latency = 10 cycles.]
- Aggressive instruction scheduling is a major performance win
- Potential speedups of twofold or more in 11 of 13 regions

Slide 24: Program Speedups
- [Figure: overall speedup for each program (buk, perl, gcc, go, ijpeg, m88ksim, sc, li, espresso, compress) alongside its region speedups; each program is annotated with its coverage, the fraction of execution spent in the parallelized regions, ranging from 6.8% to 99.9% (the values shown are 99.9%, 12.1%, 19.4%, 73.1%, 69.3%, 99.3%, 15.3%, 35.8%, 6.8%, and 27.9%)]

Slide 25: Overview
- Thread-Level Data Speculation (TLDS)
- An Example: Compress
- Experimental Results
- Architectural Support
  - Communication Latency
  - Key Architectural Issues
  - Detecting Data Dependence Violations
  - Buffering Speculative State
- Conclusions

Slide 26: Base Architecture
- [Figure: a single chip containing four processors, each with its own L1 cache, connected through a 4x4 crossbar switch to an interleaved L2 cache]
- Each processor has its own L1 data cache, to maintain a single-cycle load latency
- The L1 caches are kept coherent: a shared-memory programming model

Slide 27: How Important Is Communication Latency?
- Some options (for the same single-chip architecture):
  - direct L1-to-L1 communication: ~2 cycles
  - communication through the L2 cache: ~10 cycles

Slide 28: Impact of Communication Latency
- [Figure: region speedup for each region at communication latencies of 10 and 2 cycles, both with optimized dependences and fine-grain forwarding, and with or without aggressive instruction scheduling]
- Instruction scheduling reduces the sensitivity to communication latency
- Communicating through the L2 cache is a viable option

Slide 29: Key Architectural Issues
- Thread management:
  - thread creation and epoch scheduling
  - epoch numbers must be visible to the hardware
  - distinguishing speculative vs. non-speculative memory accesses
  - recovering from data dependence violations: the hardware notifies software of the violation, and software performs the bulk of the recovery
- Detecting data dependence violations: extend invalidation-based cache coherence
- Buffering speculative state: extend the functionality of the primary data caches

Slide 30: Invalidation-Based Cache Coherence
- [Figure: Processor 2 executes LOAD a = X, sending a read request (1) that fills its L1 cache with X = 1; Processor 1 then executes STORE X = 2, updating its own L1 copy and sending an invalidation (2) to Processor 2's cache]

Slide 31: Detecting Data Dependence Violations
- [Figure: with p = q = &X, Processor 1 runs Epoch 5 and Processor 2 runs Epoch 6. (1) Epoch 6 calls become_speculative() and executes LOAD a = *p, issuing a read request; the line holding X is marked Speculatively Loaded (SL) in its L1 cache. (2) Epoch 5 executes STORE *q = 2, and the resulting invalidation carries the writer's epoch number (#5); because Epoch 6 speculatively loaded X and 5 < 6, its Violation flag becomes TRUE. (3) Epoch 6's attempt_commit() FAILs.]
- Each L1 cache line carries two extra bits, Speculatively Loaded? (SL) and Speculatively Modified? (SM); each cache also tracks its current epoch number and a violation flag
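As an illustration of how the extended coherence protocol could catch this case, consider the sketch below; the field names, the epoch-number comparison, and on_invalidation() itself are assumptions about the mechanism rather than a published interface:

    /* Per-line speculation state, as suggested by the slide's SL/SM bits. */
    struct line_state {
        unsigned spec_loaded   : 1;   /* SL: speculatively loaded   */
        unsigned spec_modified : 1;   /* SM: speculatively modified */
    };

    static int my_epoch;        /* epoch number of the local processor */
    static int violation_flag;  /* consulted later by attempt_commit() */

    /* Invoked when an invalidation arrives, tagged with the writer's epoch. */
    static void on_invalidation(struct line_state *line, int writer_epoch) {
        if (line->spec_loaded && writer_epoch < my_epoch)
            violation_flag = 1;  /* an earlier epoch wrote what we read: RAW */
        /* ...plus the usual coherence action of invalidating the line... */
    }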

Slide 32: Buffering Speculative State
- Speculative stores: software cannot realistically roll back memory side effects; our solution is to buffer them in the L1 cache until it is safe to commit them to memory
- Speculative loads: if a speculatively accessed line is displaced, we can no longer track dependence violations for it, so the violation flag is set upon eviction; correctness is preserved, but performance may suffer
- [Figure: processor with an L1 cache backed by a victim cache]
- A 16KB 2-way set-associative cache with 4 victim entries suffices

Slide 33: Overview
- Thread-Level Data Speculation (TLDS)
- An Example: Compress
- Experimental Results
- Architectural Support
- Conclusions

Slide 34: Conclusions
- TLDS potentially offers compelling performance improvements
  - 12 of 13 regions show speedups on 4 processors
  - 7 of 10 programs show speedups on 4 processors
- Only modest hardware modifications are required
  - the cache coherence protocol is augmented to detect violations
  - the primary data cache is used to buffer speculative state
- Compiler support is crucial yet feasible
  - eliminating data dependences
  - aggressive scheduling to minimize the critical path (forwarding)
- Ongoing and future work
  - refining the architecture (described in a technical report)
  - building the compiler
