Conventional processor designs run out of steam Complexity (verification) Power (thermal) Physics (CMOS scaling)

Size: px
Start display at page:

Download "Conventional processor designs run out of steam Complexity (verification) Power (thermal) Physics (CMOS scaling)"

Transcription

1 A Gentler, Kinder Guide to the Multi-core Galaxy Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech Guest lecture for ECE4100/6100 for Prof. Yalamanchili Reality Check Conventional processor designs run out of steam Complexity (verification) Power (thermal) Physics (CMOS scaling) Unanimous direction Multi-core Simple cores (massive number) Keep Wire communication on leash Gordon Moore happy (Moore s Law) Architects menace: kick the ball to the other side of the court? What do you (or your customers) want? Performance (and/or availability) Throughput > latency (turnaround time) Total cost of ownership (performance per dollar) Energy (performance per watt) Reliability and dependability, SPAM/spy free 2

2 Multi-core Processor Gala 3 Intel s Multicore Roadmap Desktop processors DC 2/4MB DC 4MB DC 2/4MB shared DC 3MB /6MB shared (45nm) 8C 12MB shared (45nm) Mobile processors DC 2MB SC 1MB DC 3 MB/6 MB shared (45nm) DC 2/4MB shared Enterprise processors DC 16MB DC 4MB DC 2MB 8C 12MB shared (45nm) QC 8/16MB shared QC 4MB SC 512KB/ 1/ 2MB To extend Moore s Law To delay the ultimate limit of physics By 2010 all Intel processors delivered will be multicore Intel s 80-core processor Source: Adapted from Tom s Hardware 4

3 Is a Multi-core really better off? Well, it is hard to say in Computing World 5 Is a Multi-core really better off? DEEP BLUE 480 chess chips Can evaluate 200,000,000 moves per second!! 6

4 Computing Paradigm Evolution Unused Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Execution Time FU1 FU2 FU3 FU4 Conventional Superscalar Single Threaded Fine-grained Coarse-grained Multithreading Multithreading (cycle-by-cycle (Block Interleaving) Interleaving) Why SMT is failing? Simultaneous Chip Multithreading Multiprocessor (or Intel s HT) (CMP) or Multi-Core Processors 7 Major Challenges for Multi-Core Designs Communication Memory hierarchy Data allocation (you have a large shared L2/L3 now) Interconnection network Scalability Bus Bandwidth, how to get there? Power-Performance Win or lose? Borkar s multicore arguments 15% per core performance drop 50% power saving Giant, single core wastes power when task is small How about leakage? Process variation and yield Programming Model 8

5 Intel Core 2 Duo Homogeneous cores Bus based on chip interconnect Shared on-die Cache Memory Traditional I/O Classic OOO: Reservation Stations, Issue ports, Schedulers etc Source: Intel Corp. Large, shared set associative, prefetch, etc. 9 Core 2 Duo Microarchitecture 10

6 Why Sharing on-die L2? What happens when L2 is too large? 11 Intel Core 2 Duo (Merom) 12

7 Core TM µarch Wide Dynamic Execution 13 Core TM µarch Wide Dynamic Execution 14

8 Core TM µarch MACRO Fusion Common Intel 32 (once used to be called x86 or IA-32) instruction pairs are combined 15 Micro(-ops) Fusion (from Pentium M) A misnomer.. Instead of breaking up an Intel32 instruction into µop, they decide not to break it up A better naming scheme would call the previous techniques IA32 fission To fuse Store address and store data µops Load-and-op µops (e.g. ADD (%esp), %eax) Extend each RS entry to take 3 operands To reduce micro-ops (10% reduction in the OOO logic) Decoder bandwidth (simple decoder can decode fusion type instruction) Energy consumption Performance improved by 5% for INT and 9% for FP (Pentium M data) 16

9 Smart Memory Access 17 Sun UltraSparc T1 Eight cores, each 4-way threaded Fine-grained multithreading a thread-selection logic Round-robin cycle-by-cycle 1.2 GHz (90nm) No OOO, 8 instructions per cycle Cache 16K 4-way 32B L1-I 8K 4-way 16B L1-D Blocking cache (reason for MT) 4-banked 12-way 3MB L2 + 4 memory controllers. Data moved between the L2 and the cores using an integrated crossbar switch to provide high throughput (200GB/s) 18

10 Sun UltraSparc T2 A fatter version of T1 1.4GHz (65nm) 8 threads per core, 8 cores on-die 1 FPU per core (1 FPU per die in T1), 16 INT EU (8 in T1) L2 increased to 8-banked 16-way 4MB shared 8 stage integer pipeline ( as opposed to 6 for T1) 16 instructions per cycle One PCI Express port (x8 1.0) Two 10 Gigabit Ethernet ports with packet classification and filtering Eight encryption engines Four dual-channel FBDIMM memory controllers 711 signal I/O,1831 total 19 STI Cell Broadband Engine Heterogeneous! 64-bit PowerPC Eight SPEs In-order, Dual-issue 128-bit SIMD 128x128b RF 256KB LS Fast Local SRAM Globally coherent DMA (128B/cycle) 128+ concurrent transactions to memory per core High bandwidth EIB (96B/cycle) 20

11 Cell Chip Block Diagram Synergistic Memory flow controller 21 Non-Uniform Cache Architecture ASPLOS 2002 proposed by UT-Austin Facts Large shared on-die L2 Wire-delay dominating on-die cache 3 cycles 1MB 180nm, cycles 4MB 90nm, cycles 16MB 50nm,

12 Multi-banked L2 cache Bank=128KB 11 cycles 130nm Bank Access time = 3 cycles Interconnect delay = 8 cycles 23 Multi-banked L2 cache Bank=64KB 47 cycles 50nm Bank Access time = 3 cycles Interconnect delay = 44 cycles 24

13 Static NUCA-1 Sub-bank Bank Data Bus Address Bus Tag Array Predecoder Sense amplifier Wordline driver and decoder Use private per-bank channel Each bank has its distinct access latency Statically decide data location for its given address Average access latency =34.2 cycles Wire overhead = 20.9% an issue 25 Static NUCA-2 Bank Switch Tag Array Data bus Predecoder Wordline driver and decoder Use a 2D switched network to alleviate wire area overhead Average access latency =24.2 cycles Wire overhead = 5.9% 26

14 Dynamic NUCA Data can dynamically migrate Move frequently used cache lines closer to CPU 27 Dynamic NUCA way 0 bank 8 bank sets one set way 1 way 2 way 3 Simple Mapping All 4 ways of each bank set needs to be searched Farther bank sets longer access 28

15 Dynamic NUCA way 0 bank 8 bank sets one set way 1 way 2 way 3 Fair Mapping Average access time across all bank sets are equal 29 Dynamic NUCA way 0 way 1 way 2 way 3 bank 8 bank sets Shared Mapping Sharing the closet banks for farther banks 30

16 Transactional Memory Parallel Software Problems Parallel systems are often programmed with Synchronization through barriers Shared objects access control through locks Lock granularity and organization must balance performance and correctness Coarse-grain locking: Lock contention Fine-grain locking: Extra overhead Must be careful to avoid deadlocks or data races Must be careful not to leave anything unprotected for correctness Performance tuning is not intuitive Performance bottlenecks are related to low level events E.g. false sharing, coherence misses Feedback is often indirect (cache lines, rather than variables) 32

17 Parallel Hardware Complexity (TCC s view) Cache coherence protocols are complex Must track ownership of cache lines Difficult to implement and verify all corner cases Consistency protocols are complex Must provide rules to correctly order individual loads/stores Difficult for both hardware and software Current protocols rely on low latency, not bandwidth Critical short control messages on ownership transfers Latency of short messages unlikely to scale well in the future Bandwidth is likely to scale much better High speed interchip connections Multicore (CMP) = on-chip bandwidth 33 What do we want? A shared memory system with A simple, easy programming model (unlike message passing) A simple, low-complexity hardware implementation (unlike shared memory) Good performance 34

18 Lock Freedom Why lock is bad? Common problems in conventional locking mechanisms in concurrent systems Priority inversion: When low-priority process is preempted while holding a lock needed by a high-priority process Convoying: When a process holding a lock is de-scheduled (e.g. page fault, no more quantum), no forward progress for other processes capable of running Deadlock (or Livelock): Processes attempt to lock the same set of objects in different orders (could be bugs by programmers) Error-prone 35 Using Transactions What is a transaction? A sequence of instructions that is guaranteed to execute and complete only as an atomic unit Begin Transaction Inst #1 Inst #2 Inst #3 End Transaction End Transaction Satisfy the following properties Serializability: Transactions appear to execute serially. Atomicity (or Failure-Atomicity): A transaction either commits changes when complete, visible to all; or aborts, discarding changes (will retry again) 36

19 TCC (Stanford) [ISCA 2004] Transactional Coherence and Consistency Programmer-defined groups of instructions within a program Begin Transaction Inst #1 Inst #2 Inst #3 End Transaction Start Buffering Results Commit Results Now Only commit machine state at the end of each transaction Each must update machine state atomically, all at once To other processors, all instructions within one transaction appear to execute only when the transaction commits These commits impose an order on how processors may modify machine state 37 Transaction Code Example MIT LTM instruction set xstart: XBEGIN on_abort lw r1, 0(r2) addi r1, r1, on_abort: // back off j xstart // retry 38

20 Transactional Memory Transactions appear to execute in commit order Flow (RAW) dependency cause transaction violation and restart Transaction A Time ld 0xdddd... st 0xbeef Arbitrate Commit 0xbeef Transaction B ld 0xdddd... ld 0xbbbb Arbitrate 0xbeef Transaction C ld 0xbeef Violation! Commit ld 0xbeef Re-execute execute with new data 39 Transactional Memory Output and Anti-dependencies are automatically handled WAW are handled by writing buffers only in commit order (think about sequential consistency) Transaction A Store X Commit X Local buffer Shared Memory Transaction B Store X Commit X Local buffer 40

21 Transactional Memory Output and Anti-dependencies are automatically handled WAW are handled by writing buffers only in commit order WAR are handled by keeping all writes private until commit Transaction A Store X Commit X Local buffer Shared Memory Transaction B Store X Commit X Local buffer Transaction A ST X = 1 LD X (=1) Commit X X = 1 X = 3 Local stores supply data Transaction B ST X = 3 LD X (=3) LD X (=3) Commit X 41 TCC System Similar to prior thread-level speculation (TLS) techniques CMU Stampede Stanford Hydra Wisconsin Multiscalar UIUC speculative multithreading CMP Loosely coupled TLS system Completely eliminates conventional cache coherence and consistency models No MESI-style cache coherence protocol But require new hardware support 42

22 The TCC Cycle Transactions run in a cycle Speculatively execute code and buffer Wait for commit permission Phase provides synchronization, if necessary Arbitrate with other processors Commit stores together (as a packet) Provides a well-defined write ordering Can invalidate or update other caches Large packet utilizes bandwidth effectively And repeat 43 Advantages of TCC Trades bandwidth for simplicity and latency tolerance Easier to build Not dependent on timing/latency of loads and stores Transactions eliminate locks Transactions are inherently atomic Catches most common parallel programming errors Shared memory consistency is simplified Conventional model sequences individual loads and stores Now only have hardware sequence transaction commits Shared memory coherence is simplified Processors may have copies of cache lines in any state (no MESI!) Commit order implies an ownership sequence 44

23 How to Use TCC Divide code into potentially parallel tasks Usually loop iterations For initial division, tasks = transactions But can be subdivided up or grouped to match HW limits (buffering) Similar to threading in conventional parallel programming, but: We do not have to verify parallelism in advance Locking is handled automatically Easier to get parallel programs running correctly Programmer then orders transactions as necessary Ordering techniques implemented using phase number Deadlock-free (At least one transaction is the oldest one) Livelock-free (watchdog HW can easily insert barriers anywhere) 45 How to Use TCC Three common ordering scenarios Unordered for purely parallel tasks Fully ordered to specify sequential task (algorithm level) Partially ordered to insert synchronization like barriers 46

24 Basic TCC Transaction Control Bits In each local cache Read bits (per cache line, or per word to eliminate false sharing) Set on speculative loads Snooped by a committing transaction (writes by other CPU) Modified bits (per cache line) Set on speculative stores Indicate what to rollback if a violation is detected Different from dirty bit Renamed bits (optional) At word or byte granularity To indicate local updates (WAR) that do not cause a violation Subsequent reads that read lines with these bits set, they do NOT set read bits because local WAR is not considered a violation 47 During A Transaction Commit Need to collect all of the modified caches together into a commit packet Potential solutions A separate write buffer, or An address buffer maintaining a list of the line tags to be committed Size? Broadcast all writes out as one single (large) packet to the rest of the system 48

25 Re-execute A Transaction Rollback is needed when a transaction cannot commit Checkpoints needed prior to a transaction Checkpoint memory Use local cache Overflow issue Conflict or capacity misses require all the victim lines to be kept somewhere (e.g. victim cache) Checkpoint register state Hardware approach: Flash-copying rename table / arch register file Software approach: extra instruction overheads 49 Sample TCC Hardware Write buffers and L1 Transaction Control Bits Write buffer in processor, before broadcast A broadcast bus or network to distribute commit packets All processors see the commits in a single order Snooping on broadcasts triggers violations, if necessary Commit arbitration/sequence logic 50

26 Ideal Speedups with TCC equake_l : long transactions equake_s : short transactions 51 Speculative Write Buffer Needs Only a few KB of write buffering needed Set by the natural transaction sizes in applications Small write buffer can capture 90% of modified state Infrequent overflow can be always handled by committing early 52

27 Broadcast Bandwidth Broadcast is bursty Average bandwidth Needs ~16 32 processors with whole modified lines Needs ~8 32 processors with dirty data only High, but feasible on-chip 53 TCC vs MESI [PACT 2005] Application, Protocol + Processor count 54

28 Implementation of MIT s LTM [HPCA 05] Transactional Memory should support transactions of arbitrary size and duration LTM Large Transactional Memory No change in cache coherence protocol Abort when a memory conflict is detected Use coherency protocol to check conflicts Abort (younger) transactions during conflict resolution to guarantee forward progress For potential rollback Checkpoint rename table and physical registers Use local cache for all speculative memory operations Use shared L2 (or low level memory) for non-speculative data storage 55 Multiple In-Flight Transactions decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, Saved Set {P1, } During instruction decode: Maintain rename table and saved bits in physical registers Saved bits track registers mentioned in current rename table Constant # of set bits: every time a register is added to saved set we also remove one 56

29 Multiple In-Flight Transactions decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, Saved Set {P1, } {P2, } When XBEGIN is decoded Snapshots taken of current rename table and S bits This snapshot is not active until XBEGIN retires 57 Multiple In-Flight Transactions decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, Saved Set {P1, } {P2, } 58

30 Multiple In-Flight Transactions decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, Saved Set {P1, } {P2, } 59 Multiple In-Flight Transactions retire decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, Saved Set {P1, } {P2, } Active snapshot When XBEGIN retires Snapshots taken at decode become active, which will prevent P1 from reuse 1 st transaction queued to become active in memory To abort, we just restore the active snapshot s rename table 60

31 Multiple In-Flight Transactions retire decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, R1 P3, Saved Set {P1, } {P2, } {P3, } Active snapshot We are only reserving registers in the active set This implies that exactly # of arch registers are saved This number is strictly limited, even as we speculatively execute through multiple transactions 61 Multiple In-Flight Transactions retire decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P1, R1 P2, R1 P3, Saved Set {P1, } {P2, } {P3, } Active snapshot Normally, P1 would be freed here Since it is in the active snapshot s saved set, we place it onto the register reserved list 62

32 Multiple In-Flight Transactions retire decode Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P2, R1 P3, Saved Set {P2, } {P3, } When retires: Reserved physical registers (e.g. P1) are freed, and active snapshot is cleared Store queue is empty 63 Multiple In-Flight Transactions retire Original XBEGIN L1 ADD R1, R1, R1 ST 1000, R1 XBEGIN L2 ADD R1, R1, R1 ST 2000, R1 Rename Table R1 P2, Saved Set {P2, } Active snapshot Second transaction becomes active in memory 64

33 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, 1000 Need to keep Current (speculative) values Rollback values Common case is commit, so keep Current in cache Problem: uncommitted current values do not fit in local cache Solution Overflow hashtable as extension of cache 65 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, 1000 T bit per cache line Set if accessed during a transaction O bit per cache set Indicate set overflow Overflow storage in physical DRAM Allocate and resize by the OS Search when miss : complexity of a page table walk If a line is found, swapped with a line in the set 66

34 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data Start with non-transactional data in the cache ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data Transactional read sets the T bit ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1,

35 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data Expect most transactional writes fit in the cache ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, 1000 A conflict miss Overflow sets O bit Replacement taken place (LRU) Old data spilled to DRAM (hashtable) 70

36 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data Overflow Hashtable key data ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, 1000 Miss to an overflowed line, checks overflow table If found, swap (like a victim cache) Else, proceed as miss 71 Cache Overflow Mechanism Way 0 O T tag data Way 1 T tag data L2 Overflow Hashtable key data ST 1000, 55 XBEGIN L1 LD R1, 1000 ST 2000, 66 ST 3000, 77 LD R1, 1000 Abort Invalidate all lines with T set (assume L2 or lower level memory contains original values) Discard overflow hashtable Clear O and T bits Commit Write back hashtable; NACK interventions during this Clear O and T bits in the cache 72

37 LTM vs. Lock-based 73 Further Readings M. Herlihy and J. E. B. Moss, Transactional Memory: Architectural Support for Lock-Free Data Structures, ISCA R. Rajwar and J. R. Goodman, Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution, MICRO 2001 R. Rajwar and J. R. Goodman, Transactional Lock-Free Execution of Lock-Based Programs, ASPLOS 2002 J. F. Martinez and J. Torrellas, Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications, ASPLOS 2002 L. Hammond, V. Wong, M. Chen, B. D. Calrstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukoton Transactional Memory Coherence and Consistency, ISCA 2004 C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, S. Lie, Unbounded Transactional Memory, HPCA 2005 A. McDonald, J. Chung, H. Chaf, C. C. Minh, B. D. Calrstrom, L. Hammond, C. Kozyrakis and K. Olukotun, Characterization of TCC on a Chip- Multiprocessors, PACT

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech Transactional Memory Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech (Adapted from Stanford TCC group and MIT SuperTech Group) Motivation Uniprocessor Systems Frequency

More information

Transactional Memory Coherence and Consistency

Transactional Memory Coherence and Consistency Transactional emory Coherence and Consistency all transactions, all the time Lance Hammond, Vicky Wong, ike Chen, rian D. Carlstrom, ohn D. Davis, en Hertzberg, anohar K. Prabhu, Honggo Wijaya, Christos

More information

Thread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007

Thread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007 Thread-level Parallelism for the Masses Kunle Olukotun Computer Systems Lab Stanford University 2007 The World has Changed Process Technology Stops Improving! Moore s law but! Transistors don t get faster

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:

More information

Unbounded Transactional Memory

Unbounded Transactional Memory (1) Unbounded Transactional Memory!"#$%&''#()*)+*),#-./'0#(/*)&1+2, #3.*4506#!"#-7/89*75,#!:*.50/#;"#

More information

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 Past Due: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

ATLAS: A Chip-Multiprocessor. with Transactional Memory Support

ATLAS: A Chip-Multiprocessor. with Transactional Memory Support ATLAS: A Chip-Multiprocessor with Transactional Memory Support Njuguna Njoroge, Jared Casper, Sewook Wee, Yuriy Teslyar, Daxia Ge, Christos Kozyrakis, and Kunle Olukotun Transactional Coherence and Consistency

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Portland State University ECE 588/688. Transactional Memory

Portland State University ECE 588/688. Transactional Memory Portland State University ECE 588/688 Transactional Memory Copyright by Alaa Alameldeen 2018 Issues with Lock Synchronization Priority Inversion A lower-priority thread is preempted while holding a lock

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Speculative Synchronization

Speculative Synchronization Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

How to write powerful parallel Applications

How to write powerful parallel Applications How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome

More information

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I Prof. Onur Mutlu Carnegie Mellon University 10/10/2012 Reminder: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Transactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28

Transactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28 Transactional Memory or How to do multiple things at once Benjamin Engel Transactional Memory 1 / 28 Transactional Memory: Architectural Support for Lock-Free Data Structures M. Herlihy, J. Eliot, and

More information

Transactional Memory

Transactional Memory Transactional Memory Architectural Support for Practical Parallel Programming The TCC Research Group Computer Systems Lab Stanford University http://tcc.stanford.edu TCC Overview - January 2007 The Era

More information

Thread-level Parallelism. Synchronization. Explicit multithreading Implicit multithreading Redundant multithreading Summary

Thread-level Parallelism. Synchronization. Explicit multithreading Implicit multithreading Redundant multithreading Summary Chapter 11: Executing Multiple Threads Modern Processor Design: Fundamentals of Superscalar Processors Executing Multiple Threads Thread-level parallelism Synchronization Multiprocessors Explicit multithreading

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu

More information

6 Transactional Memory. Robert Mullins

6 Transactional Memory. Robert Mullins 6 Transactional Memory ( MPhil Chip Multiprocessors (ACS Robert Mullins Overview Limitations of lock-based programming Transactional memory Programming with TM ( STM ) Software TM ( HTM ) Hardware TM 2

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

Portland State University ECE 588/688. IBM Power4 System Microarchitecture

Portland State University ECE 588/688. IBM Power4 System Microarchitecture Portland State University ECE 588/688 IBM Power4 System Microarchitecture Copyright by Alaa Alameldeen 2018 IBM Power4 Design Principles SMP optimization Designed for high-throughput multi-tasking environments

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

COMP 322: Principles of Parallel Programming. Lecture 17: Understanding Parallel Computers (Chapter 2) Fall 2009

COMP 322: Principles of Parallel Programming. Lecture 17: Understanding Parallel Computers (Chapter 2) Fall 2009 COMP 322: Principles of Parallel Programming Lecture 17: Understanding Parallel Computers (Chapter 2) Fall 2009 http://www.cs.rice.edu/~vsarkar/comp322 Vivek Sarkar Department of Computer Science Rice

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Chí Cao Minh 28 May 2008

Chí Cao Minh 28 May 2008 Chí Cao Minh 28 May 2008 Uniprocessor systems hitting limits Design complexity overwhelming Power consumption increasing dramatically Instruction-level parallelism exhausted Solution is multiprocessor

More information

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01. Hyperthreading ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf Hyperthreading is a design that makes everybody concerned believe that they are actually using

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

SGI Challenge Overview

SGI Challenge Overview CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 2 (Case Studies) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Pentium IV-XEON. Computer architectures M

Pentium IV-XEON. Computer architectures M Pentium IV-XEON Computer architectures M 1 Pentium IV block scheme 4 32 bytes parallel Four access ports to the EU 2 Pentium IV block scheme Address Generation Unit BTB Branch Target Buffer I-TLB Instruction

More information

A More Sophisticated Snooping-Based Multi-Processor

A More Sophisticated Snooping-Based Multi-Processor Lecture 16: A More Sophisticated Snooping-Based Multi-Processor Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2014 Tunes The Projects Handsome Boy Modeling School (So... How

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu A Chip Multiprocessor Implementation

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Handout 3 Multiprocessor and thread level parallelism

Handout 3 Multiprocessor and thread level parallelism Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Speculative Locks. Dept. of Computer Science

Speculative Locks. Dept. of Computer Science Speculative Locks José éf. Martínez and djosep Torrellas Dept. of Computer Science University it of Illinois i at Urbana-Champaign Motivation Lock granularity a trade-off: Fine grain greater concurrency

More information

Potential violations of Serializability: Example 1

Potential violations of Serializability: Example 1 CSCE 6610:Advanced Computer Architecture Review New Amdahl s law A possible idea for a term project Explore my idea about changing frequency based on serial fraction to maintain fixed energy or keep same

More information

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 19: Main Memory Prof. Onur Mutlu Carnegie Mellon University Last Time Multi-core issues in caching OS-based cache partitioning (using page coloring) Handling

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM Lecture 6: Lazy Transactional Memory Topics: TM semantics and implementation details of lazy TM 1 Transactions Access to shared variables is encapsulated within transactions the system gives the illusion

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea Programming With Locks s Tricky Multicore processors are the way of the foreseeable future thread-level parallelism anointed as parallelism model of choice Just one problem Writing lock-based multi-threaded

More information

Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University

Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University 18-742 Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core Prof. Onur Mutlu Carnegie Mellon University Research Project Project proposal due: Jan 31 Project topics Does everyone have a topic?

More information

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Intel Architecture for Software Developers

Intel Architecture for Software Developers Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 11

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Chapter 9 Multiprocessors

Chapter 9 Multiprocessors ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols Portland State University ECE 588/688 Directory-Based Cache Coherence Protocols Copyright by Alaa Alameldeen and Haitham Akkary 2018 Why Directory Protocols? Snooping-based protocols may not scale All

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Lecture 11: Large Cache Design

Lecture 11: Large Cache Design Lecture 11: Large Cache Design Topics: large cache basics and An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al., ASPLOS 02 Distance Associativity for High-Performance

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,

More information

CSE 392/CS 378: High-performance Computing - Principles and Practice

CSE 392/CS 378: High-performance Computing - Principles and Practice CSE 392/CS 378: High-performance Computing - Principles and Practice Parallel Computer Architectures A Conceptual Introduction for Software Developers Jim Browne browne@cs.utexas.edu Parallel Computer

More information