Computer Science 246. Computer Architecture


1 Computer Architecture Spring 2010 Harvard University Instructor: Prof. Multiprocessing

2 Multiprocessors: Outline
Multiprocessors: computation taxonomy, interconnection overview, shared memory design criteria, cache coherence (very broad overview)
Reading: H&P Chapter 6 (mostly), 8.5
Next time: More Multiprocessors + Multithreading

3 Parallel Processing
Multiple processors working cooperatively on problems: not the same as multiprogramming
Goals/Motivation
Performance: uniprocessors are hitting the limits of ILP (branch prediction, RAW dependencies, memory)
Cost efficiency: build big systems with commodity parts (uniprocessor cores)
Scalability: just add more cores to get more performance
Fault tolerance: if one processor fails, the others can continue processing

4 Parallel Processing
Sounds great, but as usual software is the bottleneck
Difficult to parallelize applications: compiler parallelization is hard (huge research efforts in the 80s/90s); by-hand parallelization is harder, expensive, and error-prone
Difficult to make parallel applications run fast: communication is expensive, which makes the first task even harder
Inherent problems: insufficient parallelism in some apps, long-latency remote communication

5 Recall Amdahl's Law
Speedup_overall = 1 / [(1 - Frac_enhanced) + Frac_enhanced / Speedup_enhanced]
Example: achieve a speedup of 80x using 100 processors
80 = 1 / [Frac_parallel/100 + (1 - Frac_parallel)]  =>  Frac_parallel ~= 0.9975
=> only 0.25% of the work can be serial!
Assumes linear speedup; superlinear speedup is possible in some cases (increased memory + cache with increased processor count)
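A quick way to sanity-check numbers like the 80x example is to plug them into the formula directly. The snippet below is a minimal C sketch; the function and variable names are mine, not from the slides.

#include <stdio.h>

/* Overall speedup from Amdahl's Law for a given parallel fraction and
   processor count (assumes the parallel part speeds up linearly). */
static double amdahl_speedup(double frac_parallel, double n_procs) {
    return 1.0 / ((1.0 - frac_parallel) + frac_parallel / n_procs);
}

int main(void) {
    /* The slide's example: 100 processors, 99.75% of the work parallel. */
    printf("speedup = %.1f\n", amdahl_speedup(0.9975, 100.0));  /* ~80x */
    return 0;
}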

6 Parallel Application Domains Traditional Parallel Programs True parallelism in one job Regular loop structures Data usually tightly shared Automatic parallelization Called data-level parallelism Can often exploit vectors as well Workloads Scientific simulation codes (FFT, weather, fluid dynamics) Dominant market segment in the 1980s

7 Parallel Program Example: Matrix Multiply
Parameters: matrix size N*N, P processors; each processor computes an (N/P^1/2) * (N/P^1/2) block
Cij needs Aik, Bkj for k = 0 to N-1
Growth functions:
Computation: f(N^3); computation per processor: f(N^3/P)
Data size: f(N^2); data size per processor: f(N^2/P)
Communication: f(N^2/P^1/2)
Computation/Communication: f(N/P^1/2)
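To make the per-processor numbers concrete, here is a rough C sketch of what one of the P processors computes when C is split into sqrt(P) x sqrt(P) blocks. The function name, arguments, and the simple triple loop are my illustration, not code from the course.

#include <math.h>

/* Each processor owns one (N/sqrt(P)) x (N/sqrt(P)) block of C.
   It reads a full block-row of A and block-column of B, which is where
   the f(N^2 / P^1/2) communication term on the slide comes from. */
void matmul_block(int N, int P, int my_row, int my_col,
                  const double *A, const double *B, double *C) {
    int b  = N / (int)sqrt((double)P);     /* block edge length        */
    int r0 = my_row * b, c0 = my_col * b;  /* top-left of my C block   */
    for (int i = r0; i < r0 + b; i++)
        for (int j = c0; j < c0 + b; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)    /* row of A, column of B    */
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}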

8 Application Domains Parallel independent but similar tasks Irregular control structures Loosely shared data locked at different granularities Programmer defines and fine tunes parallelism Cannot exploit vectors Thread-level Parallelism (throughput is important) Workloads Transaction Processing, databases, web-servers Dominant MP market segment today

9 Database Application Example: Bank Database
Parameters: D = number of accounts, P = number of processors in the server, N = number of ATMs (parallel transactions)
Growth functions:
Computation: f(N); computation per processor: f(N/P)
Communication? Lock records while changing them
Communication: f(N); computation/communication: f(1)

10 Computation Taxonomy Proposed by Flynn (1966) Dimensions: Instruction streams: single (SI) or multiple (MI) Data streams: single (SD) or multiple (MD) Cross-product: SISD: uniprocessors (chapters 3-4) SIMD: multimedia/vector processors (MMX, HP-MAX) MISD: no real examples MIMD: multiprocessors + multicomputers (ch. 6)

11 SIMD vs. MIMD (MP vs. SIMD)
Programming model flexibility: could simulate vectors with MIMD but not the other way around; the dominant market segment cannot use vectors
Cost effectiveness: commodity parts are high-volume (cheap) components; MPs can be made up of many uniprocessors; allows easy scaling from small to large systems

12 Taxonomy of Parallel (MIMD) Processors
Centers on the organization of main memory: shared vs. distributed
Appearance of memory to hardware. Q1: is memory access latency uniform?
Shared (UMA): yes, doesn't matter where data goes
Distributed (NUMA): no, makes a big difference
Appearance of memory to software. Q2: can processors communicate directly via memory?
Shared (shared memory): yes, communicate via load/store
Distributed (message passing): no, communicate via messages
The two dimensions are orthogonal, e.g. DSM: (physically) distributed, (logically) shared memory

13 UMA vs. NUMA: Why it matters
[Figure: processors p0-p3 sharing a single low-latency memory]
Ideal model: perfect (single-cycle) memory latency, perfect (infinite) memory bandwidth
Real systems: latencies are long and grow with system size; bandwidth is limited
Add memory banks and an interconnect to hook them up (latency goes up)

14 UMA vs. NUMA: Why it matters
[Figure: in UMA, p0-p3 reach m0-m3 through one interconnect (long latency to every memory); in NUMA, each processor has a local memory (short latency) and uses the interconnect only for remote memories (long latency)]
UMA: uniform memory access. From p0, same latency to m0-m3; data placement doesn't matter; latency gets worse as the system scales; interconnect contention restricts bandwidth; small multiprocessors only
NUMA: non-uniform memory access. From p0, faster to m0 than to m1-m3; low latency to local memory helps performance; data placement is important (software!); less contention => more scalable; large multiprocessor systems

15 Interconnection Networks Need something to connect those processors/memories Direct: endpoints connect directly Indirect: endpoints connect via switches/routers Interconnect issues Latency: average latency more important than max Bandwidth: per processor Cost: #wires, #switches, #ports per switch Scalability: how latency, bandwidth, cost grow with processors Interconnect topology primarily concerns architects

16 Interconnect 1: Bus
Direct interconnect style (all processors p0-p3 and memories m0-m3 attach to one shared bus)
+ Cost: f(1) wires
+ Latency: f(1)
- Bandwidth: not scalable, f(1/P); only works in small systems (<=4)
+ Only interconnect that performs ordered broadcast (which snooping coherence protocols rely on)

17 Interconnect 2: Crossbar Switch
Indirect interconnect (p0-p3 on one side, m0-m3 on the other); switch implemented as big MUXes
+ Latency: f(1)
+ Bandwidth: f(1)
- Cost: f(2P) wires, f(P^2) switches, 4 wires per switch

18 Interconnect 3: Multistage Network
Indirect interconnect (p0-p3 reach m0-m3 through d stages of switches); routing done by address decoding
k: switch arity (#inputs/#outputs per switch); d: number of network stages = log_k P
+ Cost: f(d*P/k) switches, f(P*d) wires, f(k) wires per switch
+ Latency: f(d)
+ Bandwidth: f(1)
Commonly used in large UMA systems
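Routing by address decoding simply peels one base-k digit of the destination off at each stage. A hedged C sketch of that idea for a standard delta/butterfly-style network (the helper is my own, not from the slides):

/* In a k-ary multistage network with d = log_k(P) stages, the switch at
   stage s (0 = first stage) picks its output port from digit s of the
   destination address written in base k, most significant digit first. */
int output_port(int dest, int k, int d, int stage) {
    int divisor = 1;
    for (int i = 0; i < d - 1 - stage; i++)
        divisor *= k;
    return (dest / divisor) % k;
}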

19 Interconnect 4: 2D Torus
Direct interconnect (a 4x4 grid of processor/memory nodes with wraparound links)
+ Cost: f(2P) wires, 4 wires per switch
+ Latency: f(P^1/2)
+ Bandwidth: f(1)
Good scalability (widely used); variants: 1D (ring), 3D, mesh (no wraparound)

20 Interconnect 5: Hypercube
Direct interconnect (figure: 16 processor/memory nodes connected as a hypercube)
k: arity (#nodes per dimension); d: dimension = log_k P
+ Latency: f(d)
+ Bandwidth: f(k*d)
- Cost: f(k*d*P) wires, f(k*d) wires per switch
Good scalability, but expensive switches; a new design is needed to add more nodes

21 Interconnect Routing
Store-and-forward routing: switch buffers the entire message before passing it on
Latency = [(message length / bandwidth) + fixed switch overhead] * #hops
Wormhole routing: message is pipelined through the interconnect; a switch passes the message on before it has completely arrived
Latency = (message length / bandwidth) + (fixed switch overhead * #hops)
No buffering needed at the switch; latency (relatively) independent of the number of intermediate hops
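The two latency formulas are easy to compare numerically; a small C sketch (the parameter names are my own):

/* msg / bw gives the serialization time of the message; 'ovh' is the
   fixed per-switch overhead; 'hops' is the number of switches crossed. */
double store_and_forward_latency(double msg, double bw, double ovh, int hops) {
    return (msg / bw + ovh) * hops;
}

double wormhole_latency(double msg, double bw, double ovh, int hops) {
    return msg / bw + ovh * hops;   /* serialization cost paid once, not per hop */
}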

22 Shared Memory vs. Message Passing (MIMD: appearance of memory to software)
Message passing (multicomputers)
Each processor has its own address space; processors send and receive messages to and from each other
Communication patterns are explicit and precise; explicit messaging forces the programmer to optimize communication
Used for scientific codes (explicit communication)
Message passing libraries: PVM, MPI (OpenMP, by contrast, is a shared-memory programming model)
Simple hardware, difficult programming model

23 Shared Memory vs. Message Passing Shared Memory (multiprocessors) One shared address space Processors use conventional load/stores to access shared data Communication can be complex/dynamic Simpler programming model (compatible with uniprocessors) Hardware controlled caching is useful to reduce latency + contention Has drawbacks Synchronization (discussed later) More complex hardware needed

24 Parallel Systems (80s and 90s)
Machine        Communication   Interconnect   #cpus   Remote latency (us)
SPARCcenter    Shared memory   Bus            <=20    1
SGI Challenge  Shared memory   Bus            <=32    1
CRAY T3D       Shared memory   3D Torus
Convex SPP     Shared memory   X-bar/ring
KSR-1          Shared memory   Bus/ring
TMC CM-5       Messages        Fat tree
Intel Paragon  Messages        2D mesh
IBM SP-2       Messages        Multistage
Trend towards shared memory systems

25 Multiprocessor Trends
Shared memory: easier, more dynamic programming model; can do more to optimize the hardware
Small-to-medium size UMA systems (2-8 processors): processor + memory + switch on a single board (4x Pentium); single-chip multiprocessors (POWER4); commodity parts => soon glueless MP systems
Larger NUMAs built from smaller UMAs: use commodity small UMAs with commodity interconnects (Ethernet, Myrinet) => NUMA clusters

26 Computation Taxonomy (roadmap)
SISD => ILP
SIMD => vectors, MM-ISAs
MISD
MIMD => shared memory vs. message passing; UMA vs. NUMA; cache coherence (write invalidate/update? snooping? directory?); synchronization; memory consistency

27 Major Issues for Shared Memory
Cache coherence: common-sense requirements
P1 Write[X] => P1 Read[X] returns the value P1 wrote (if there are no intervening writes)
P1 Write[X] => P2 Read[X] (sufficiently later) returns the value written by P1
P1 Write[X] => P2 Write[X]: writes are serialized (all processors see the writes in the same order)
Synchronization: atomic read/write operations
Memory consistency model: when will a written value be seen? P1 Write[X], (10 ps later) P2 Read[X]: what happens?
These are not issues for message passing systems. Why?

28 Cache Coherence
Benefits of coherent caches in parallel systems? Migration and replication of shared data
Migration: data moved locally lowers latency + main memory bandwidth
Replication: data being simultaneously read can be replicated to reduce latency + contention
Problem: sharing of writeable data
Processor 0   Processor 1   Correct value of A in:
                            Memory
Read A                      Memory, p0 cache
              Read A        Memory, p0 cache, p1 cache
Write A                     p0 cache (and memory, if write-through)
              Read A        p1 gets a stale value on a cache hit!

29 Solutions to Cache Coherence No caches Not likely :) Make shared data non-cacheable Simple software solution low performance if lots of shared data Software flush at strategic times Relatively simple, but could be costly (frequent syncs) Hardware cache coherence Make memory and caches coherent/consistent with each other

30 HW Coherence Protocols Absolute coherence All copies of each block have same data at all times A little bit overkill Need appearance of absolute coherence Temporary incoherence is ok Similar to write-back cache As long as all loads get their correct value Coherence protocol: FSM that runs at every cache Invalidate protocols: invalidate copies in other caches Update protocols: update copies in other caches Snooping vs. Directory based protocols (HW implementation) Memory is always updated

31 Write Invalidate
Much more common for most systems
Mechanics: broadcast the address of the cache line to invalidate; all processors snoop until they see it, then invalidate the line if it is in their local cache
The same policy can be used to service cache misses in write-back caches
Processor activity    Bus activity           CPU A's cache   CPU B's cache   Memory location X
                                                                             0
CPU A reads X         Cache miss for X       0                               0
CPU B reads X         Cache miss for X       0               0               0
CPU A writes 1 to X   Invalidation for X     1                               0
CPU B reads X         Cache miss for X       1               1               1

32 Write Update (Broadcast)
Processor activity    Bus activity             CPU A's cache   CPU B's cache   Memory location X
                                                                               0
CPU A reads X         Cache miss for X         0                               0
CPU B reads X         Cache miss for X         0               0               0
CPU A writes 1 to X   Write broadcast for X    1               1               1
CPU B reads X                                  1               1               1
Bandwidth requirements are excessive; they can be reduced by checking whether a word is shared (which also reduces write-invalidate traffic)
Comparison with write invalidate:
Multiple writes with no intervening reads require multiple broadcasts
Multiple broadcasts are needed when writing multiple words of a cache line (only one invalidate)
Advantage: other processors can get the data faster

33 Bus-based protocols (Snooping) Snooping All caches see and react to all bus events Protocol relies on global visibility of events (ordered broadcast) Events: Processor (events from own processor) Read (R), Write (W), Writeback (WB) Bus Events (events from other processors) Bus Read (BR), Bus Write (BW)

34 Three-State Invalidate Protocol
Three states:
Invalid: don't have the block
Exclusive: have the block and have written it
Shared: have the block but have only read it
[Figure: state diagram; edges are labeled with CPU activity (read/write hits and misses, "put read/write miss on bus") and bus activity (read/write miss for the block, forcing an invalidate and/or write-back of the block)]
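The three-state protocol on this slide can be written as a small state machine. Below is a rough C sketch of the transitions described above; the state and event names are mine, bus requests and write-backs are only noted in comments, and real controllers handle many more corner cases (as slide 36 points out).

typedef enum { INVALID, SHARED, EXCLUSIVE } LineState;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } Event;

/* Returns the next state of one cache line given a processor or bus event. */
LineState next_state(LineState s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;      /* put read miss on bus   */
        if (e == CPU_WRITE) return EXCLUSIVE;   /* put write miss on bus  */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE)      return EXCLUSIVE; /* write miss on bus   */
        if (e == BUS_WRITE_MISS) return INVALID;   /* another CPU writes  */
        return SHARED;                             /* read hits, bus reads */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  return SHARED;    /* write back block    */
        if (e == BUS_WRITE_MISS) return INVALID;   /* write back block    */
        return EXCLUSIVE;                          /* own read/write hits */
    }
    return s;
}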

35 Three-State Invalidate Protocol: Example
[Figure: the state diagram for Processor 1 and Processor 2, with each operation below marked on the transition it causes]
A: P1 Read   A1: P2 Read   B: P1 Write   C: P2 Read   D: P1 Write   E: P2 Write   F-Z: P2 Write

36 Three-State Invalidate: Problems
Real implementations are much more complicated
We assumed all operations are atomic, i.e. a multiphase operation (e.g. write miss detected, acquire bus, receive response) completes with no intervening operations
Even read misses are non-atomic with split-transaction buses; deadlock is a possibility (see Appendix I)
Other simplifications: write hits and misses treated the same; no distinction between true sharing and a clean block present in only one cache; other caches could supply data on misses to a shared block

37 Scalable Coherence Protocols: Directory-Based Systems
Bus-based systems are not scalable: not enough bus bandwidth for everyone's coherence traffic, and not enough processor snooping bandwidth to handle everyone's traffic
Directories: scalable cache coherence for large MPs
Each memory entry (cache line) has a bit vector (1 bit per processor) tracking which processors have cached copies of the line
Send all requests to the memory directory: if there are no other cached copies, memory returns the data; otherwise memory forwards the request to the correct processor
+ Low bandwidth consumption (communicate only with the relevant processors)
+ Works with a general interconnect (bus not needed)
- Longer latency (3-hop transaction: p0 => directory => p1 => p0)
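A per-line directory entry is little more than a sharer bit vector plus a state. A minimal C sketch for up to 64 processors (the field and state names are assumptions of mine, not from any particular machine):

#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } DirState;

/* One entry per memory line: which processors hold a copy, and in what
   state.  On a request, memory either replies directly or forwards to
   the owner / sends invalidations to the current sharers. */
typedef struct {
    DirState state;
    uint64_t sharers;   /* bit i set => processor i has a cached copy */
} DirEntry;

static inline void add_sharer(DirEntry *e, int cpu)      { e->sharers |= (1ULL << cpu); }
static inline int  is_sharer(const DirEntry *e, int cpu) { return (e->sharers >> cpu) & 1; }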

38 Multiprocessor Performance
OLTP: TPC-C (Oracle), CPI ~ 7.0
DSS: TPC-D, CPI ~ 1.61
AV: AltaVista, CPI ~ 1.3
A huge fraction of time is spent in memory and I/O for TPC-C

39 Coherence Protocols: Performance (OLTP workload)
[Figure: misses per 1000 instructions vs. cache block size]
As block size is increased, coherence misses increase (false sharing)
False sharing: sharing of different data in the same cache line; the block is shared, but the data word is not
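False sharing is easy to provoke: two threads that only ever touch their own counter still invalidate each other's cache line if the counters sit in the same line. A hedged POSIX-threads sketch (the 64-byte line size and the padding trick are assumptions about the target machine):

#include <pthread.h>

/* Both counters land in one cache line: every increment by one thread
   invalidates the other thread's copy (coherence misses), even though no
   data is logically shared.  Uncommenting the pad separates the lines. */
struct counters {
    long a;
    /* char pad[64]; */   /* padding would push b into its own cache line */
    long b;
} shared;

void *worker_a(void *arg) { for (long i = 0; i < 100000000; i++) shared.a++; return arg; }
void *worker_b(void *arg) { for (long i = 0; i < 100000000; i++) shared.b++; return arg; }

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, 0, worker_a, 0);
    pthread_create(&tb, 0, worker_b, 0);
    pthread_join(ta, 0);
    pthread_join(tb, 0);
    return 0;
}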

40 Coherence Protocols: Performance
3C miss model => 4C miss model: capacity, compulsory, conflict, plus coherence (additional misses due to the coherence protocol); complicates uniprocessor cache analysis
As cache size is increased, capacity misses decrease, but coherence misses may increase (more shared data is cached)

41 Coherence Protocols: Performance As processors are added Coherence misses increase (more communication)

42 Today's multiprocessor trend: single-chip multiprocessors, e.g. IBM POWER4
One chip contains two 1.3 GHz processor cores, the L2, and the L3 tags
Can connect 4 chips on 1 MCM to create an 8-way system
Targets threaded server workloads and high-performance computing
[Figure: chip floorplan with Core0, Core1, L2 tags/cache, L3 tags, bus/I/O; the L3 cache itself is off-chip]

43 Single-Chip MP: The Future, but how many cores?
Some companies are betting big on it
Sun's Niagara (8 cores per chip)
Intel's rumored Tukwila or Tanglewood processor (8 or 16?)
Basis in the Piranha research project (ISCA 2000)

44 Multithreading
Another trend: multithreaded processors
Processor utilization = IPC / processor width; decreases as width increases (~50% on a 4-wide machine)
Why? Cache misses, branch mispredicts, RAW dependencies
Idea: two (or more) processes (threads) share one pipeline
Replicate per-thread state: PC, register file, bpred, page table pointer, etc. (~5% area overhead)
One copy of stateless (naturally tagged) structures: caches, functional units, buses, etc.
Switching between hardware thread contexts must be fast: multiple on-chip contexts => no need to load state from memory

45 Multithreading Paradigms
[Figure: issue-slot diagrams for superscalar, coarse-grained MT, fine-grained MT, and simultaneous MT]
Superscalar: Pentium 3, Alpha EV6/7
Coarse-grained MT: IBM Pulsar
Simultaneous MT: Intel P4-HT, Alpha EV8, POWER5

46 Coarse- vs. Fine-Grained MT
Coarse-grained: makes sense for in-order/shorter pipelines; switch threads on long stalls (L2 cache misses); threads don't interfere with each other much; can't improve utilization on L1 misses/branch mispredicts
Fine-grained: out-of-order, deep pipelines; instructions from multiple threads are in a stage at the same time, miss or not; improves utilization in all scenarios; individual thread performance suffers due to interference

47 Pentium 4 Front-End
Front-end resources arbitrate between threads every cycle
ITLB, RAS, BHB (global history), and decode queue are duplicated

48 Pentium 4 Backend
Some queues and buffers are partitioned (only half the entries per thread)
The scheduler is oblivious to instruction thread IDs (but there is a limit on the number of entries per thread per scheduler)

49 Multithreading Performance
Configuration: 9-stage pipeline, 128KB I/D cache, 6 integer ALUs (4 load/store), 4 FP ALUs, 8 SMT threads, 8-instruction fetch (from 2 threads)
Intel claims more like a 20-30% increase on the P4
Debate still exists on the performance benefits

50 Synchronization
Synchronization: an important issue for shared memory; regulates access to shared data
E.g. semaphore, monitor, critical section (s/w constructs)
Synchronization primitive: lock. Well-written parallel programs synchronize on shared data:
acquire(lock);    // while (lock != 0); lock = 1;
critical section;
release(lock);    // lock = 0;
In assembly:
0: ldw r1, lock   // wait for lock to be free
1: bnez r1, #0
2: stw #1, lock   // acquire lock
   // critical section
9: stw #0, lock   // release lock

51 Implementing Locks
This is called a spin lock, and it doesn't work quite right:
Processor 0                           Processor 1
0: ldw r1, lock
1: bnez r1, #0   // p0 sees lock free
                                      0: ldw r1, lock
                                      1: bnez r1, #0   // p1 sees lock free
2: stw #1, lock  // p0 acquires lock
9: stw #0, lock
                                      2: stw #1, lock  // p1 acquires lock
                                      // p0 and p1 in critical section together

52 Implementing Locks
Problem: the acquire sequence (load-test-store) is not atomic
Solution: the ISA provides an atomic lock-acquire operation: load + check + store in one instruction (uninterruptible by definition)
E.g. test&set instruction (also fetch-and-increment, exchange)
t&s r1, lock      // ldw r1, lock; stw #1, lock
0: t&s r1, lock
1: bnez r1, 0
2: ...            // critical section
3: stw #0, lock   // release
Lock-release is already atomic
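On current hardware the same idea is exposed through atomic operations; below is a minimal C11 spin lock built on an atomic test-and-set. This is my sketch of the concept using the standard <stdatomic.h> primitives, not code from the slides.

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

/* test&set in a loop: keep trying until the old value was 0 (free). */
void acquire(void) {
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;  /* spin */
}

/* The release really is a single ordinary-looking store. */
void release(void) {
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}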

53 More Atomic Instructions
All atomic instructions can implement semaphores: Fetch-and-Add (a generalization of T&S), Compare-and-Swap (CAS), Load-Linked and Store-Conditional (LL/SC)
Only CAS and LL/SC can implement lock-free and wait-free algorithms (hard)
Semaphores can deadlock if a thread holding the lock is swapped out
Lock-free: a thread does not require the lock and always makes progress
Wait-free: every operation completes in a finite number of steps
See proof: M. P. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1), January 1991.
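As one example of what CAS buys you, a lock-free increment simply retries until its compare-and-swap succeeds; no thread ever holds a lock, so a descheduled thread cannot block the others. A C11 sketch of mine using the standard compare-exchange primitive:

#include <stdatomic.h>

atomic_int counter;

/* Lock-free increment: read the old value, try to swap in old+1, and
   retry if another thread changed the counter in between. */
void increment(void) {
    int old = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;  /* on failure, 'old' was refreshed with the current value; try again */
}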

54 Load-Linked/Store-Conditional
Example from Prof. Bruning, Universität Mannheim

55 HW Implementation Issues

56 Memory Ordering: Consistency
Memory updates may become reordered by the memory system
Processor 0            Processor 1
A = 0;                 B = 0;
A = 1;                 B = 1;
L1: if (B == 0)        L2: if (A == 0)
    critical section       critical section
Intuitively it is impossible for both to be in the critical section, BUT if memory ops are reordered it could happen
Coherence only says that A's and B's values must become the same everywhere eventually; it doesn't specify the relative timing
See: Sarita V. Adve, Kourosh Gharachorloo, Shared Memory Consistency Models: A Tutorial, IEEE Computer, 1995.
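The same example written with C11 atomics makes the ordering requirement explicit: with the default sequentially consistent operations the "impossible" outcome cannot occur, whereas plain variables (or relaxed ordering) would allow it. This is a sketch of mine, not course code.

#include <stdatomic.h>

atomic_int A, B;   /* both start at 0 */

void processor0(void) {
    atomic_store(&A, 1);              /* sequentially consistent store */
    if (atomic_load(&B) == 0) {
        /* critical section */
    }
}

void processor1(void) {
    atomic_store(&B, 1);
    if (atomic_load(&A) == 0) {
        /* critical section: under SC, at most one of the two branches is taken */
    }
}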

57 Memory Ordering: Sequential Consistency System is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order and the operations of each individual processor appear in this sequence in program order [Lamport] Sequential Consistency All loads and stores in order Delay completion of memory access until all invalidations caused by that access complete Delay next memory access until previous one completes Delay read of A/B until previous write of A/B completes Cannot place writes in a write buffer and continue with read! Simple for programmer Not much room for HW/SW performance optimizations

58 Sequential Consistency: Performance
Suppose a write miss takes 40 cycles to establish ownership, 10 cycles to issue each invalidate after ownership, and 50 cycles to complete/acknowledge the invalidates
Suppose 4 processors share the cache block
Sequential consistency: must wait 40 + 4*10 + 50 = 130 cycles
Ownership time is only 40 cycles; with write buffers the processor could continue after that instead

59 Weaker Consistency Models Assume programs are synchronized Observation: SC only needed for lock variables Other variables? Either in critical section (no parallel access) Or not shared Weaker consistency: can delay/reorder loads and stores More room for hardware optimizations Somewhat trickier programming model (memory barriers)
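Under a weaker model the programmer (or the locking/runtime library) inserts barriers only where ordering matters, typically around the synchronization variable. A hedged C11 sketch of the usual producer/consumer flag idiom (variable names are mine):

#include <stdatomic.h>

int data;            /* ordinary, unsynchronized variable                  */
atomic_int ready;    /* the one "lock-like" variable that needs ordering   */

void producer(void) {
    data = 42;                                                /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* barrier     */
}

void consumer(void) {
    if (atomic_load_explicit(&ready, memory_order_acquire))   /* barrier     */
        (void)data;   /* guaranteed to see 42, even on a weakly ordered machine */
}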

60 One other possibility Many processors have extensive support for speculation recovery Can this be re-used to support strict consistency? Use delayed commit feature for memory references If we receive an invalidate message, back-out the computation and restart Rarely triggered anyway (only on unsynchronized accesses that cause a race)
