TDT 4260 lecture 9 spring semester 2015
1 1 TDT 4260 lecture 9 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU
2 2 Lecture overview. Repetition: CMP application classes; Vector MIPS. Today: vector & SIMD processing (sparse matrices, gather/scatter, stride); shared memory multiprocessor intro; cache coherence, snooping.
3 3 Repetition
4 4 Repetition
5 5 Excursion to the Vilje supercomputer? It is a 5 min walk. Interested? Monday 23/3 in the 12:15-14:00 timeslot?
6 6 Stride. Consider straightforward matrix multiplication:

    for (i = 0; i < 100; i=i+1)
        for (j = 0; j < 100; j=j+1) {
            A[i][j] = 0.0;
            for (k = 0; k < 100; k=k+1)
                A[i][j] = A[i][j] + B[i][k] * D[k][j];
        }

Vectorize the multiplication of rows of B with columns of D. Matrices are stored row-by-row (row-major order) or column-by-column (column-major order). The distance between elements that are to be gathered into a vector register is the stride. Vector processors must handle strides > 1 (example: LVWS, Load Vector With Stride).
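[Editor's note] A minimal C sketch of the stride in this example (illustrative only): in C's row-major layout, walking down a column of D visits elements one full row apart, while walking along a row of B has stride 1.

    #include <stdio.h>

    #define N 100

    static double D[N][N];    /* row-major, as in C */

    int main(void) {
        /* Walking down column j visits D[0][j], D[1][j], ...;
           consecutive column elements are one full row apart. */
        long elems = sizeof D[0] / sizeof D[0][0];   /* 100 doubles */
        long bytes = sizeof D[0];                    /* 800 bytes   */
        printf("column stride: %ld elements (%ld bytes); row stride: 1 element\n",
               elems, bytes);
        return 0;
    }

An LVWS loading a column of D into a vector register would use exactly this stride.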
7 7 Background, sparse matrices. Sparse matrix: only a few elements have non-zero values. Example: from solving a finite element problem (FEM) in 2 dimensions (the nonzero elements are shown in black). "Often no sense in multiplying with zero." How can the non-zero elements be condensed into a dense vector? => compact storage formats (using meta-data) and scatter/gather. (Lars-Ivar Hesselberg Simonsen, Master thesis (2013))
8 8 Background, sparse matrices. Sparse matrix formats: CSV, compressed sparse vector; CSR, compressed sparse row; CSX (research by CARD & NTNU-IT/HPC & GRNET under PRACE). "Energy-efficient Sparse Matrix Autotuning with CSX - A Trade-off Study", J. C. Meyer, J. M. Cebrian, L. Natvig, V. Karakasis, D. Siakavaras, and K. Nikas, HP-PAC (High-Performance, Power-Aware Computing), Boston, May 2013 (part of PRACE). Scatter-gather: LVI, Load Vector Indexed (gather); SVI, Store Vector Indexed (scatter).
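[Editor's note] To make the compact-storage idea concrete, a minimal C sketch of sparse matrix-vector multiply in the CSR format (the standard three-array layout; the names are illustrative). The indexed read x[col_idx[k]] is exactly the gather that an LVI instruction performs.

    /* y = A*x for a sparse matrix A with n_rows rows, stored in CSR:
       val[]     - the nonzero values, row by row
       col_idx[] - the column index of each nonzero
       row_ptr[] - row i occupies val[row_ptr[i] .. row_ptr[i+1]-1] */
    void spmv_csr(int n_rows, const double *val, const int *col_idx,
                  const int *row_ptr, const double *x, double *y)
    {
        for (int i = 0; i < n_rows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];   /* gather from x */
            y[i] = sum;
        }
    }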
9 9 Scatter-Gather. Consider:

    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];

Use index vectors:

    LV      Vk, Rk          ;load K
    LVI     Va, (Ra+Vk)     ;load A[K[]]
    LV      Vm, Rm          ;load M
    LVI     Vc, (Rc+Vm)     ;load C[M[]]
    ADDVV.D Va, Va, Vc      ;add them
    SVI     (Ra+Vk), Va     ;store A[K[]]
10 10 Section 4.3 SIMD INSTRUCTION SET EXTENSIONS FOR MULTIMEDIA
11 11 SIMD Extensions. Media applications operate on data types narrower than the native word size. Limitations, compared to vector instructions: the number of data operands is encoded into the opcode; no sophisticated addressing modes (strided, scatter-gather); no mask registers.
12 12 SIMD DAXPY

    L.D    F0,a        ;load scalar a
    MOV    F1, F0      ;copy a into F1 for SIMD MUL
    MOV    F2, F0      ;copy a into F2 for SIMD MUL
    MOV    F3, F0      ;copy a into F3 for SIMD MUL
    DADDIU R4,Rx,#512  ;last address to load
    Loop:
    L.4D   F4,0[Rx]    ;load X[i], X[i+1], X[i+2], X[i+3]
    MUL.4D F4,F4,F0    ;a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
    L.4D   F8,0[Ry]    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
    ADD.4D F8,F8,F4    ;a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
    S.4D   0[Ry],F8    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
    DADDIU Rx,Rx,#32   ;increment index to X
    DADDIU Ry,Ry,#32   ;increment index to Y
    DSUBU  R20,R4,Rx   ;compute bound
    BNEZ   R20,Loop    ;check if done
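[Editor's note] For comparison, the same DAXPY written in C with x86 AVX intrinsics; a sketch that assumes AVX hardware and n divisible by 4.

    #include <immintrin.h>

    /* y[i] = a*x[i] + y[i], four doubles per iteration. */
    void daxpy_avx(int n, double a, const double *x, double *y)
    {
        __m256d va = _mm256_set1_pd(a);               /* broadcast a (the MOVs above) */
        for (int i = 0; i < n; i += 4) {
            __m256d vx = _mm256_loadu_pd(&x[i]);      /* like L.4D */
            __m256d vy = _mm256_loadu_pd(&y[i]);      /* like L.4D */
            vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);  /* MUL.4D + ADD.4D */
            _mm256_storeu_pd(&y[i], vy);              /* like S.4D */
        }
    }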
13 13 SSE and AVX are energy efficient. Recent research in the CARD group: see the PP4EE video (Juan Manuel Cebrian) (link under It's Learning) and the paper in Computing 2013 (link under It's Learning); also a newer and larger paper presented at ISPASS, March 2014. Details not required.
14 14 CHAP 5 - THREAD LEVEL PARALLELISM
15 15 Single/ILP vs. Multi/TLP. Uniprocessor trends: getting too complex; speed of light; diminishing returns from ILP. Multiprocessors (focus in the textbook: 4-32 CPUs): increased performance through parallelism; multichip or multicore ((single) chip multiprocessors, CMP); cost effective. The right balance of ILP and TLP is unclear today; desktop vs. server?
16 16 Other factors favoring multiprocessors. Growth in data-intensive applications: databases, file servers, multimedia, ... Growing interest in servers and server performance; increasing desktop performance is less important, outside of graphics. Improved understanding of how to use multiprocessors effectively, especially in servers with significant natural TLP. The advantage of leveraging design investment by replication rather than a unique design. Power/cooling issues favor multicore.
17 17 A bit of history 60s and 70s: Lots of research in multiprocessors. 80s and 90s: Relentless focus on single threaded performance, common belief that parallel computing was dead. Today: Almost ALL new systems are parallel systems!
18 18 MIMD architectures ($ = cache). [Diagram] Centralized: processors P1..Pn, each with a cache, share one memory through an interconnection network (IN). Distributed: each node pairs a processor and cache with its own memory (Mem), and the nodes communicate over the interconnection network.
19 19 Centralized Shared Memory Multiprocessor. Also called Symmetric Multiprocessors (SMPs); Uniform Memory Access (UMA) architecture. The shared memory becomes a bottleneck. With large caches, a single memory can satisfy the memory demands of a small number of processors. Can scale to a few dozen processors by using a switch and many memory banks; scaling beyond that is hard.
20 20 Distributed memory (diagram: processors P, each with local memory M, connected by a network). 1. Shared address space: logically shared, physically distributed; Distributed Shared Memory (DSM); NUMA architecture (conceptual model vs. implementation). 2. Separate address spaces: every P-M module is a separate computer; multicomputers, clusters; not a focus in this course.
21 21 Distributed (Shared Memory) Multiprocessor. Pro: a cost-effective way to scale memory bandwidth, if most accesses are to local memory. Pro: reduces the latency of local memory accesses. Con: communication becomes more complex. Pro/Con: it is possible to change software to take advantage of memory that is close, but this can also make the software less portable; Non-Uniform Memory Access (NUMA); the classical performance vs. portability trade-off.
22 22 MP (MIMD), cluster of SMPs. [Diagram] Each node holds several processors with caches on a node interconnection network, with I/O; the nodes are joined by a cluster interconnection network. A combination of centralized and distributed.
23 23 Kahoot Quiz no 2
24 24 Communication models. Shared memory (centralized or distributed shared memory): communication using LOAD/STORE; coordinated using traditional OS methods (semaphores, monitors, etc.); busy-waiting is more acceptable than on a uniprocessor. Message passing: using send (put) and receive (get); asynchronous / synchronous; libraries, standards: PVM, MPI, ...
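[Editor's note] A minimal message-passing counterpart, in C, of the shared-memory example on the next slides, using standard MPI send/receive (a sketch; compile with mpicc and run with two processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                 /* producer: explicit send */
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {          /* consumer: blocking receive */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("data %d\n", data);
        }
        MPI_Finalize();
        return 0;
    }

Here the synchronization is explicit in the send/receive pair, so no flag variable (and no cache coherence worry) is needed.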
25-31 Shared memory: how do we program these things? (animation, slides 25-31)

    Thread 0:                  Thread 1 (on a different core):
    data = 42;                 while (flag == 0) { }
    flag = 1;                  printf("data %i", data);

Main memory initially holds data = 0, flag = 0. Thread 1 loads flag = 0 into its cache and spins. Thread 0 then writes data = 42 and flag = 1, but only into its own cache. Now what?? Nothing happens, since thread 1's cache is not updated: it spins forever on the stale flag = 0. The caches must be kept coherent.
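[Editor's note] The slide example as runnable C with POSIX threads; a sketch only. volatile merely keeps the compiler from caching flag in a register; it is the hardware coherence protocol (next slides) that makes thread 0's stores visible, and production code would use proper atomics/fences for ordering.

    #include <stdio.h>
    #include <pthread.h>

    volatile int data = 0;
    volatile int flag = 0;

    void *thread0(void *arg) {           /* producer */
        data = 42;
        flag = 1;
        return NULL;
    }

    void *thread1(void *arg) {           /* consumer, on another core */
        while (flag == 0) { }            /* spin on the cached copy of flag */
        printf("data %i\n", data);
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t0, NULL, thread0, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }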
32 32 Enforcing coherence. Separate caches make multiple copies frequent. Migration: data moves from shared memory to a local cache; speeds up access, reduces memory bandwidth requirements. Replication: several local copies when an item is read by several processors; speeds up access, reduces memory contention. We need coherence protocols to track shared data. Directory based: status stored in a shared location (centralized or distributed). (Bus) snooping: each cache maintains local status; all caches monitor a broadcast medium; write invalidate / write update.
33 33 Snooping: write invalidate. Several reads or one write: no change. Writes require exclusive access. Writes to shared data: all other cache copies are invalidated; the invalidate command and address are broadcast; all caches listen (snoop) and invalidate if necessary. Read miss: with write-through, memory is always up to date; with write-back, caches listen and any exclusive copy is put on the bus.
34 34 Snooping: write update. Also called write broadcast. Must know which cache blocks are shared. Usually write-through. Write to shared data: broadcast; all caches listen and update their copy (if any). Read miss: main memory is up to date.
35 35 Snooping: invalidate vs. update. Repeated writes to the same address (with no reads) require several updates, but only one invalidate. Invalidates are done at cache-block level, while updates are done on individual words. Invalidate is most common: less bus traffic and less memory traffic, and bus and memory bandwidth are the typical bottlenecks.
36 36 An example snoopy protocol: an invalidation protocol for write-back caches. Each cache block is in one state. Shared: clean in all caches and up-to-date in memory; the block can be read. Exclusive: one cache has the only copy; it is writeable and dirty. Invalid: the block contains no data. (1) example, (2) in more detail (next lecture).
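[Editor's note] A toy C model of these three states; a sketch of the local transitions only, not of a real bus controller, with illustrative names.

    typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

    /* Transition of one cache block on a local read or write;
       returns the bus transaction the cache must issue. */
    const char *cpu_access(BlockState *s, int is_write)
    {
        switch (*s) {
        case INVALID:                       /* miss: fetch the block */
            *s = is_write ? EXCLUSIVE : SHARED;
            return is_write ? "read miss + invalidate others" : "read miss";
        case SHARED:
            if (is_write) {                 /* need exclusive access */
                *s = EXCLUSIVE;
                return "broadcast invalidate";
            }
            return "none (read hit)";
        case EXCLUSIVE:                     /* only copy: reads and writes hit */
        default:
            return "none (hit)";
        }
    }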
37 37 Snooping: invalidation protocol (1/6). [Diagram: processors 0..N-1 with caches on an interconnection network, plus main memory and an I/O system.] Processor 0 issues read x: a read miss goes out on the network; main memory holds x = 0.
38 38 Snooping: invalidation protocol (2/6). Processor 0 now caches x = 0 in state shared.
39 39 Snooping: invalidation protocol (3/6). Processor 2 issues read x: another read miss on the network.
40 40 Snooping: invalidation protocol (4/6). Processors 0 and 2 both cache x = 0 in state shared.
41 41 Snooping: invalidation protocol (5/6). One of the sharers writes x: an invalidate is broadcast, and the other shared copy is invalidated.
42 42 Snooping: invalidation protocol (6/6). The writer's copy is now x = 1 in state exclusive; main memory still holds the stale x = 0.
43 Directory Based Coherency. Snooping doesn't scale. Idea: track who is sharing the various addresses in a directory. Used both in large DSM machines and in small multicores. Complicated! Things you have to worry about: deadlock, livelock & starvation, consistency models.
44-51 Directory Based Coherency (animation, slides 44-51). [Diagram: several caches, a directory, and a sharer list for block X.] Processors 0, 1, 2 and 4 issue Ld X in turn; after each load, the loading cache is added to the directory's sharer list for X, which grows from empty to 0, then 0,1, then 0,1,2, then 0,1,2,4.
52-64 Directory Based Coherency (animation, slides 52-64). Processor 4 now wants to St X=7 and asks the directory for write permission ("Can I write???? Please?"). The directory sends an invalidate to each of the other sharers (0, 1, 2) in turn; as each replies "Inval complete", it is removed from the sharer list. When only 4 remains in the list, the directory answers "Yes, you can write!", and processor 4 writes X = 7.
65-70 Directory Based Coherency (animation, slides 65-70). Processor 1 issues Ld X again ("Give me new X"). The directory knows processor 4 holds a modified copy, so it asks for it ("Give me your modified x"); processor 4 supplies X = 7 ("Here you go!"), the value is passed on to processor 1, and the sharer list becomes 1,4.
71 Directory Based Coherency. All these different types of messages travel over various interconnects; interconnect design is a big component of shared memory systems!
72 Directory based cache coherence. For large multiprocessor systems with lots of CPUs, distributed memory is preferable, since it increases memory bandwidth. A snooping bus with broadcast? A single bus becomes a bottleneck, so other ways of communicating are needed, and with these, broadcasting is hard/expensive. We can avoid broadcast if we know exactly which caches have a copy: a directory.
73 Directory based cache coherence. The directory knows which blocks are in which cache, and their state; it can be partitioned and distributed. Typical states: shared, uncached, modified. The protocol is based on messages: invalidates and updates are sent only where needed, which avoids broadcast and reduces traffic.
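[Editor's note] A toy C sketch of such a directory entry, under stated assumptions: at most 64 nodes, a full bit-vector of sharers, and a hypothetical send_invalidate message primitive, stubbed out here.

    #include <stdint.h>

    typedef enum { UNCACHED, SHARED_STATE, MODIFIED } DirState;

    typedef struct {
        DirState state;
        uint64_t sharers;     /* bit i set => cache i holds a copy */
    } DirEntry;

    static void send_invalidate(int node) { /* stub: would send a message */ }

    /* Node w wants to write: invalidate every other sharer, then record
       w as the sole (modified) owner - no broadcast needed. */
    void on_write_request(DirEntry *e, int w)
    {
        uint64_t others = e->sharers & ~(1ull << w);
        for (int i = 0; i < 64; i++)
            if ((others >> i) & 1)
                send_invalidate(i);
        e->sharers = 1ull << w;
        e->state = MODIFIED;
    }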