TDT 4260 lecture 9 spring semester 2015


1 TDT 4260 lecture 9, spring semester 2015. Lasse Natvig, the CARD group, Dept. of Computer & Information Science, NTNU

2 Lecture overview. Repetition: CMP application classes; Vector MIPS. Today: vector & SIMD processing; sparse matrices; gather/scatter; stride; shared memory multiprocessor intro; cache coherence (snooping)

3 Repetition

4 Repetition

5 Excursion to the Vilje supercomputer? A 5-minute walk. Your interest? If yes: Monday 23/3 in the 12:15-14:00 timeslot?

6 Stride. Consider straightforward matrix multiplication:

    for (i = 0; i < 100; i = i + 1)
        for (j = 0; j < 100; j = j + 1) {
            A[i][j] = 0.0;
            for (k = 0; k < 100; k = k + 1)
                A[i][j] = A[i][j] + B[i][k] * D[k][j];
        }

Vectorize the multiplication of rows of B with columns of D. Matrices are stored row-by-row (row-major order) or column-by-column (column-major order). The distance between elements that are to be gathered into a vector register is the stride. Vector processors must handle strides > 1. Example: LVWS, Load Vector With Stride.
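To make the stride concrete: with D stored row-major, the elements of one column are 100 doubles apart, so a vector load of a column needs a stride of 100 elements. A minimal C sketch of what LVWS does in hardware (the function and variable names are illustrative, not from the slides):

    #include <stdio.h>

    #define N 100

    /* Gather column j of a row-major N x N matrix into a dense vector.
       Base address = &D[0][j], stride = N elements (one full row)
       between consecutive loads -- exactly the LVWS access pattern. */
    void load_column_with_stride(const double D[N][N], int j, double col[N]) {
        const double *base = &D[0][0];
        for (int i = 0; i < N; i++)
            col[i] = base[(long)i * N + j];   /* stride of N elements */
    }

    int main(void) {
        static double D[N][N];
        double col[N];
        D[7][3] = 42.0;
        load_column_with_stride(D, 3, col);
        printf("%f\n", col[7]);               /* prints 42.000000 */
        return 0;
    }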

7 Background: sparse matrices. Sparse matrix: only a few elements have non-zero values. Example: from solving a finite element problem in 2 dimensions (FEM); the nonzero elements are shown in black (figure). «Often no sense in multiplying with zero.» How can non-zero elements be condensed into a dense vector? ==> compact storage formats (using meta-data) and scatter/gather. (Lars-Ivar Hesselberg Simonsen, Master thesis (2013))

8 Background: sparse matrix formats. CSV: compressed sparse vector. CSR: compressed sparse row. CSX (research by CARD & NTNU-IT/HPC & GRNET under PRACE). "Energy-efficient Sparse Matrix Autotuning with CSX - A Trade-off Study", J. C. Meyer, J. M. Cebrian, L. Natvig, V. Karakasis, D. Siakavaras, and K. Nikas, HP-PAC (High-Performance, Power-Aware Computing), Boston, May 2013 (part of PRACE). Scatter-gather instructions: LVI, Load Vector Indexed (gather); SVI, Store Vector Indexed (scatter).
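For a feel of what a compact format buys: in CSR the matrix is one dense array of non-zero values plus two index arrays, and a matrix-vector multiply only ever touches non-zeros. A minimal sketch (the CSR layout is standard; the variable names are mine):

    /* CSR sparse matrix-vector multiply: y = A * x.
       val[]     : the non-zero values, row by row
       col_idx[] : column index of each value in val[]
       row_ptr[] : row_ptr[i]..row_ptr[i+1]-1 index the entries of row i */
    void spmv_csr(int n, const double *val, const int *col_idx,
                  const int *row_ptr, const double *x, double *y) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];   /* indexed read = gather */
            y[i] = sum;
        }
    }

The indexed read x[col_idx[k]] is exactly the gather pattern that LVI supports in hardware, as used on the next slide.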

9 Scatter-Gather. Consider:

    for (i = 0; i < n; i = i + 1)
        A[K[i]] = A[K[i]] + C[M[i]];

Use index vectors:

    LV      Vk, Rk        ;load K
    LVI     Va, (Ra+Vk)   ;load A[K[]]
    LV      Vm, Rm        ;load M
    LVI     Vc, (Rc+Vm)   ;load C[M[]]
    ADDVV.D Va, Va, Vc    ;add them
    SVI     (Ra+Vk), Va   ;store A[K[]]

10 Section 4.3 SIMD INSTRUCTION SET EXTENSIONS FOR MULTIMEDIA

11 SIMD Extensions. Media applications operate on data types narrower than the native word size. Limitations compared to vector instructions: the number of data operands is encoded into the opcode; no sophisticated addressing modes (strided, scatter-gather); no mask registers.

12 SIMD DAXPY

          L.D     F0,a        ;load scalar a
          MOV     F1, F0      ;copy a into F1 for SIMD MUL
          MOV     F2, F0      ;copy a into F2 for SIMD MUL
          MOV     F3, F0      ;copy a into F3 for SIMD MUL
          DADDIU  R4,Rx,#512  ;last address to load
    Loop: L.4D    F4,0[Rx]    ;load X[i], X[i+1], X[i+2], X[i+3]
          MUL.4D  F4,F4,F0    ;a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
          L.4D    F8,0[Ry]    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
          ADD.4D  F8,F8,F4    ;a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
          S.4D    0[Ry],F8    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
          DADDIU  Rx,Rx,#32   ;increment index to X
          DADDIU  Ry,Ry,#32   ;increment index to Y
          DSUBU   R20,R4,Rx   ;compute bound
          BNEZ    R20,Loop    ;check if done

13 SSE and AVX are energy efficient. Recent research in the CARD group. See the PP4EE video (Juan Manuel Cebrian) (link under Its Learning) and the paper in Computing 2013 (link under Its Learning); also a newer and larger paper presented at ISPASS, March 2014. Details not required.
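For comparison with the generic 4-wide code above, here is roughly what DAXPY looks like with real AVX intrinsics. This is a sketch of mine, not code from the cited papers; it assumes AVX support and that n is a multiple of 4:

    #include <immintrin.h>

    /* DAXPY, y[i] = a*x[i] + y[i], 4 doubles per iteration with 256-bit AVX.
       Assumes n % 4 == 0; a production version needs a scalar tail loop. */
    void daxpy_avx(int n, double a, const double *x, double *y) {
        __m256d va = _mm256_set1_pd(a);            /* broadcast a to 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m256d vx = _mm256_loadu_pd(&x[i]);   /* load X[i..i+3] */
            __m256d vy = _mm256_loadu_pd(&y[i]);   /* load Y[i..i+3] */
            vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
            _mm256_storeu_pd(&y[i], vy);           /* store Y[i..i+3] */
        }
    }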

14 CHAP 5 - THREAD LEVEL PARALLELISM

15 Single/ILP to Multi/TLP. Uniprocessor trends: getting too complex; speed of light; diminishing returns from ILP. Multiprocessor (focus in the textbook: 4-32 CPUs): increased performance through parallelism; multichip; multicore ((single-)chip multiprocessors, CMP); cost effective. The right balance of ILP and TLP is unclear today. Desktop vs. server?

16 Other factors favoring multiprocessors. Growth in data-intensive applications: databases, file servers, multimedia, ... Growing interest in servers and server performance; increasing desktop performance is less important (outside of graphics). Improved understanding of how to use multiprocessors effectively, especially in servers with significant natural TLP. The advantage of leveraging design investment by replication rather than a unique design. Power/cooling issues push toward multicore.

17 A bit of history. 60s and 70s: lots of research in multiprocessors. 80s and 90s: relentless focus on single-threaded performance; common belief that parallel computing was dead. Today: almost ALL new systems are parallel systems!

18 MIMD architectures (figure; $ = cache). Centralized: processors P1..Pn, each with a cache, connected through an interconnection network (IN) to a single shared memory. Distributed: each processor has a cache and a local memory module, and the processor-memory nodes are connected by the interconnection network.

19 Centralized Shared-Memory Multiprocessor. Also called Symmetric Multiprocessors (SMPs). Uniform Memory Access (UMA) architecture. Shared memory becomes the bottleneck. With large caches, a single memory can satisfy the memory demands of a small number of processors. Can scale to a few dozen processors by using a switch and many memory banks. Scaling beyond that is hard.

20 Distributed memory (figure: processors P with local memories M connected by a network). 1. Shared address space: logically shared, physically distributed; Distributed Shared Memory (DSM); NUMA architecture (a shared conceptual model implemented on top of per-node memories and a network). 2. Separate address spaces: every processor-memory module is a separate computer; multicomputers, clusters. Not a focus in this course.

21 Distributed Shared-Memory Multiprocessor. Pro: a cost-effective way to scale memory bandwidth, if most accesses are to local memory. Pro: reduces the latency of local memory accesses. Con: communication becomes more complex. Pro/Con: it is possible to change software to take advantage of memory that is close, but this can also make the SW less portable. Non-Uniform Memory Access (NUMA). The classical performance vs. portability trade-off.

22 MP (MIMD): cluster of SMPs (figure). Each node holds several processors with caches connected by a node interconnection network, plus I/O; the nodes are connected by a cluster interconnection network. A combination of the centralized and distributed organizations.

23 Kahoot Quiz no. 2

24 Communication models. Shared memory (centralized or distributed shared memory): communication using LOAD/STORE; coordinated using traditional OS methods (semaphores, monitors, etc.); busy-waiting is more acceptable than on a uniprocessor. Message passing: using send (put) and receive (get); asynchronous/synchronous; libraries and standards: PVM, MPI, ... (a minimal MPI sketch follows below).
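As a contrast to the LOAD/STORE examples on the next slides, here is a minimal message-passing sketch in C with MPI. The MPI calls are standard; the program itself is mine, not from the lecture:

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 sends one integer to rank 1; all communication is explicit. */
    int main(int argc, char **argv) {
        int rank, data;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* put */
        } else if (rank == 1) {
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* get */
            printf("data %i\n", data);
        }
        MPI_Finalize();
        return 0;
    }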

25-31 Shared memory: how do we program these things? (Slides 25-31 animate one example.) Two threads on different cores; main memory initially holds data = 0 and flag = 0:

    Thread 0:      Thread 1 (on a different core):
    data = 42;     while (flag == 0) { }
    flag = 1;      printf("data %i", data);

Step by step: Thread 1 starts first, misses on flag, and caches flag = 0; it then spins on its cached copy. Thread 0 writes data = 42 and flag = 1, and both values land in Thread 0's cache. Now what?? Nothing happens: Thread 1 keeps spinning, since its cached copy of flag is never updated.
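A runnable version of the same example, using pthreads and C11 atomics so the flag write actually becomes visible. This sketch is mine: on real hardware the plain-variable version above may spin forever or print 0 without the kind of coherence and ordering guarantees discussed below:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int data = 0;
    atomic_int flag = 0;   /* release/acquire pair orders the data write */

    void *producer(void *arg) {
        data = 42;
        atomic_store_explicit(&flag, 1, memory_order_release);
        return NULL;
    }

    void *consumer(void *arg) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                        /* spin until the flag write propagates */
        printf("data %i\n", data);   /* guaranteed to print 42 */
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t1, NULL, consumer, NULL);
        pthread_create(&t0, NULL, producer, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }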

32 Enforcing coherence. Separate caches make multiple copies frequent. Migration: data is moved from shared memory to a local cache; speeds up access, reduces memory bandwidth requirements. Replication: several local copies when an item is read by several processors; speeds up access, reduces memory contention. We need coherence protocols to track shared data. Directory based: status is stored in a shared location (centralized or distributed). (Bus) snooping: each cache maintains local status; all caches monitor a broadcast medium; write invalidate / write update.

33 Snooping: write invalidate. Several reads or one write: no change. Writes require exclusive access. Writes to shared data: all other cache copies are invalidated; an invalidate command and the address are broadcast; all caches listen (snoop) and invalidate if necessary. Read miss: with write-through, memory is always up to date; with write-back, caches listen and any exclusive copy is put on the bus.

34 Snooping: write update. Also called write broadcast. Must know which cache blocks are shared. Usually write-through. Write to shared data: broadcast; all caches listen and update their copy (if any). Read miss: main memory is up to date.

35 Snooping: invalidate vs. update. Repeated writes to the same address (with no intervening reads) require several updates, but only one invalidate. Invalidates are done at cache-block level, while updates are done on individual words. Invalidate is most common: less bus traffic and less memory traffic, and bus and memory bandwidth are the typical bottleneck.

36 An example snoopy protocol. Invalidation protocol, write-back caches. Each cache block is in one state. Shared: clean in all caches and up-to-date in memory; the block can be read. Exclusive: one cache has the only copy; it is writeable and dirty. Invalid: the block contains no data. (1) An example below; (2) in more detail next lecture.

37-42 Snooping: invalidation protocol, example (slides 37-42; processors 0..N-1 on a shared interconnection network, main memory initially holds x = 0):
1. Processor 0 reads x: read miss; the block is fetched from main memory, and Processor 0 caches x = 0 in state shared.
2. Processor 2 reads x: read miss; Processor 2 also caches x = 0 in state shared. Both copies are clean and match memory.
3. Processor 0 writes x: an invalidate is broadcast on the interconnect; Processor 2 snoops it and invalidates its copy.
4. Processor 0 now holds x = 1 in state exclusive; main memory still holds the stale x = 0 until the block is written back.
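A compact way to see the protocol just walked through is as a per-block state machine. Below is a simplified C sketch of the transitions for one cache block; this is my simplification of the Shared/Exclusive/Invalid protocol above, and a real controller additionally handles write-backs, bus arbitration, and races:

    typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

    /* Events seen by one cache controller for one block. */
    typedef enum {
        PROC_READ, PROC_WRITE,        /* from the local processor */
        BUS_READ_MISS, BUS_INVALIDATE /* snooped from the bus     */
    } Event;

    /* Next state for a block; *send_invalidate is set when this cache
       must broadcast an invalidate before it may write. */
    BlockState next_state(BlockState s, Event e, int *send_invalidate) {
        *send_invalidate = 0;
        switch (e) {
        case PROC_READ:
            return (s == INVALID) ? SHARED : s;   /* miss: fetch, go shared */
        case PROC_WRITE:
            if (s != EXCLUSIVE)
                *send_invalidate = 1;             /* kill the other copies */
            return EXCLUSIVE;
        case BUS_READ_MISS:
            /* Another cache missed: an exclusive (dirty) copy must be
               supplied on the bus and demoted to shared. */
            return (s == EXCLUSIVE) ? SHARED : s;
        case BUS_INVALIDATE:
            return INVALID;                       /* someone else writes */
        }
        return s;
    }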

43 Directory-Based Coherency. Snooping doesn't scale. Idea: track who is sharing each address in a directory. Used both in large DSM machines and in small multicores. Complicated! Things you have to worry about: deadlock; livelock & starvation; consistency models.

44-71 Directory-Based Coherency, example (slides 44-71 animate it; several processors issue Ld X, then one issues St X=7; the directory tracks which caches hold X):
1. Processor 0 loads X: the directory supplies the block and records sharer list {0}.
2. Processor 1 loads X: sharer list {0,1}.
3. Processor 2 loads X: sharer list {0,1,2}.
4. Processor 4 loads X: sharer list {0,1,2,4}.
5. Processor 4 wants to store X=7 and asks the directory for write permission ("Can I write???? Please?").
6. The directory sends Inval to sharer 0; when 0 acknowledges ("Inval is complete"), it is removed: sharer list {1,2,4}.
7. The same happens for sharer 1 (list becomes {2,4}) and sharer 2 (list becomes {4}).
8. With processor 4 as the only remaining sharer, the directory grants the write ("Yes, you can write!"); processor 4 writes X = 7, and the directory records 4 as the owner of the modified block.
9. Later, processor 1 loads X again ("Give me new X"). The directory forwards the request to the owner ("Give me your modified x"); processor 4 supplies the data ("Here you go!"), X = 7 reaches processor 1 and the directory, and the sharer list becomes {1,4}.

All these different types of messages travel over various interconnects. Interconnect design is a big component of shared memory systems!

72 Directory-based cache coherence. Large multiprocessor systems with lots of CPUs: distributed memory is preferable, since it increases memory bandwidth. A snooping bus with broadcast? A single bus becomes a bottleneck. Other ways of communicating are needed, and with these, broadcasting is hard/expensive. We can avoid broadcast if we know exactly which caches have a copy: a directory.

73 Directory-based cache coherence. The directory knows which blocks are in which cache and their state. The directory can be partitioned and distributed. Typical states: shared, uncached, modified. The protocol is based on messages: invalidates and updates are sent only where needed, avoiding broadcast and reducing traffic (a sketch of a directory entry follows below).
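A minimal sketch of what a directory entry might hold and how a write request is serviced. This is my illustration of the states listed above, with the sharer set as a bitmap and a stand-in message function; a real directory also needs acknowledgement counting and transient states:

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { UNCACHED, SHARED_ST, MODIFIED } DirState;

    /* One directory entry per memory block: state + bitmap of sharers. */
    typedef struct {
        DirState state;
        uint64_t sharers;   /* bit i set => cache i holds a copy */
        int      owner;     /* valid when state == MODIFIED      */
    } DirEntry;

    /* Stand-in for sending an invalidate message over the interconnect. */
    static void send_invalidate(int cache_id) {
        printf("Inval -> cache %d\n", cache_id);
    }

    /* Service a write request from cache `writer` for this block:
       invalidate every other sharer, then hand over exclusive ownership. */
    void handle_write_request(DirEntry *e, int writer) {
        for (int i = 0; i < 64; i++)
            if ((e->sharers & (1ULL << i)) && i != writer)
                send_invalidate(i);   /* only where needed, no broadcast */
        e->state   = MODIFIED;
        e->sharers = 1ULL << writer;
        e->owner   = writer;
    }

    int main(void) {
        /* Mirror the slides: sharers {0,1,2,4}, processor 4 writes. */
        DirEntry e = { SHARED_ST, (1ULL<<0)|(1ULL<<1)|(1ULL<<2)|(1ULL<<4), -1 };
        handle_write_request(&e, 4);  /* prints Inval -> cache 0, 1, 2 */
        return 0;
    }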
