Parallel Architecture. Hwansoo Han
1 Parallel Architecture Hwansoo Han
2 Performance Curve (figure)
3 Unicore Limitations
Performance scaling stopped due to:
- Power
- Wire delay
- DRAM latency
- Limitation in ILP
4 Power Consumption (figure, in watts)
5 Wire Delay
- Range of a wire in one clock cycle (figure)
6 DRAM Latency
- Microprocessor speed: 60% / year (2x / 18 months)
- DRAM latency: 9% / year (2x / 10 years)
7 Instruction Level Parallelism
- 1980s: more transistors, superscalar, pipelining (10 CPI -> 1 CPI)
- 1990s: exploit the last of the implicit parallelism; multi-way issue, out-of-order issue, branch prediction (1 CPI -> 0.5 CPI)
- 2000s: multicore; explicit parallelism is needed
8 Multicore Processors (timeline figure, circa 2001-2006)
- Intel Tejas & Jayhawk (unicore 4 GHz P4): cancelled
- IBM Cell: scalable multicore
- IBM Power 4 & 5: dual cores since 2001
- Intel Montecito: dual-core IA-64
- Intel Pentium D (Smithfield) / Pentium Extreme: 3.2 GHz dual core
- AMD Opteron: dual core
- Intel Yonah: dual-core mobile
- Intel Tanglewood: dual-core IA-64
- Intel Dempsey: dual-core Xeon
- Sun Olympus & Niagara: 8 processor cores
- IBM Power 6: dual core
9 Chip Multiprocessors (Multicores)
Processor / company / target market: cores (threads); PE interconnect; programming model
- Power7, IBM, servers: 4~8x Power7 (16~32 threads); full crossbar to L2$; shared-memory multithreading
- Niagara2, Sun, servers: 8x UltraSPARC (64 threads); full crossbar to L2$; shared-memory multithreading
- Bloomfield (i7), Intel, servers/desktop: 4x Nehalem (8 threads); point-to-point network; traditional SMP
- Barcelona, AMD, servers/desktop: 4x NG-Opteron (4 threads); full crossbar on chip; traditional SMP
- Xenon, IBM/Microsoft, XBox360: 3x PowerPC w/ vmx128 (6 threads); traditional SMP
- Cell, Sony/Toshiba/IBM, game consoles/DTV/HPC: PowerPC + 8x SPE (SIMD) (2+8 threads); 4 rings; shared DRAM, private SRAM
- Tesla, NVIDIA, GPGPU: 240 streaming processors; CUDA
10 Why Multiprocessors?
1. Microprocessors are the fastest CPUs: collecting several CPUs is much easier than redesigning one CPU
2. Complexity of current microprocessors: do we have enough ideas to sustain 1.5x/year? Can we deliver such complexity on schedule?
3. Slow (but steady) improvement in parallel software: scientific apps, databases, OS
4. Emergence of embedded and server markets drives microprocessors in addition to desktops
   - Embedded systems: functional parallelism
   - Server performance: producer/consumer model; transactions/sec vs. latency of one transaction
11 Many Parallel Workloads Exist
- Multiprogramming: OS & multiple programs
- Commercial workloads: OLTP, data mining
- Scientific computing: weather prediction, chemical simulation, ...
- Multimedia: HDTV playback, speech recognition, ...
All interesting workloads are parallel; demand for higher performance drives parallel computers.
12 Challenges of Multiprocessors
- Difficult to write parallel programs
  - Most programmers think sequentially
  - Performance vs. correctness tradeoffs
  - Missing good parallel abstractions
- Automatic parallelization by compilers
  - Works for some applications (loop parallelism, reduction), as in the sketch below
  - Unclear how to apply it to other, more complex applications
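To make the loop-parallelism and reduction case concrete, here is a minimal C sketch using OpenMP (an assumption; the slides later name OpenMP as a shared-memory programming model). Each iteration is independent, and the reduction clause gives every thread a private partial sum that is combined at the end.

```c
#include <stdio.h>
#include <omp.h>          /* assumes an OpenMP-capable compiler, e.g. gcc -fopenmp */

#define N 1000000

static double a[N];

int main(void) {
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    /* Loop parallelism plus a reduction: each thread accumulates a private
       partial sum, and the partial sums are combined at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```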
13 Limitations of Multiprocessors
- Serial portion of applications: Amdahl's law
  - If fraction f is parallelizable on n CPUs: speedup = 1 / ((1 - f) + f/n)
  - If 80% is parallelizable, the maximum speedup is 5 (see the sketch below)
- Latency of communication
  - Often takes 10~1000 cycles for CPUs to communicate
  - CPUs often stall waiting for communication
- Solutions
  - Exploit locality (caches)
  - Overlap communication with independent computation
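A small worked example of the formula above; the helper name is illustrative. With f = 0.80 the speedup approaches 1/0.2 = 5 no matter how many CPUs are added.

```c
#include <stdio.h>

/* Amdahl's law: with parallelizable fraction f and n CPUs,
   speedup = 1 / ((1 - f) + f / n). */
static double speedup(double f, int n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    for (int n = 2; n <= 1024; n *= 2)
        printf("f = 0.80, n = %4d  ->  speedup = %.2f\n", n, speedup(0.80, n));
    return 0;
}
```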
14 Popular Flynn Categories
- SISD (single instruction, single data): uniprocessors
- SIMD (single instruction, multiple data): vector processors (e.g. CM-2, Cray XP/YP, ...), multimedia extensions (Intel MMX/SSE, ...)
- MISD (multiple instruction, single data): systolic arrays
- MIMD (multiple instruction, multiple data): MPP (massively parallel processors, special interconnect), SMP (symmetric multiprocessors), clusters (commodity CPUs connected with, basically, Ethernet)
  - Most successful model; virtually all multiprocessors today (Sun Enterprise 10000, SGI Origin, Cray T3D, ...)
15 Parallel Architectures (MIMD)
- Shared memory: access all data within a single address space
  - SMP, UMA, cc-NUMA
  - Popular programming models: thread APIs (pthreads, ...), OpenMP
- Distributed memory: access only local data; other data are accessed via communication
  - NUMA, clusters
  - Popular programming models: PVM (obsolete), MPI (de facto standard)
(figure: shared-memory CPUs with caches attached to one memory vs. distributed CPU/cache/memory nodes)
16 Machine Abstraction for Programs
Shared-memory:
- Single address space for all CPUs
- Communication through regular load/store (implicit)
- Synchronization using locks and barriers
- Ease of programming
- Complex HW for cache coherence
Message-passing:
- Private address space per CPU
- Communication through message send/receive over the network interface (explicit); see the MPI sketch below
- Synchronization using blocking messages
- Need to program explicit communication
- Simple HW (no cache-coherence hardware)
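A minimal sketch of the message-passing side, assuming an MPI installation (build with mpicc, run with mpirun -np 2): rank 0 explicitly sends one integer that rank 1 explicitly receives, and the blocking receive doubles as synchronization.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit communication: data moves only because we send it. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The blocking receive also synchronizes the two processes. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```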
17 Cache Coherence in SMP
Assume the following sequence (two CPUs, each with its own cache, sharing one memory):
- P0 loads A (A is now in P0's cache)
- P1 loads A (A is now in P1's cache)
- P0 writes a new value to A
- P1 loads A (can P1 get the new value?)
Memory system behavior:
- Cache coherence: what value can be returned by a load
- Memory consistency: when a written value can be read (become visible) by a load
A solution for cache coherence: allow multiple read-only copies or one exclusive modified copy (invalidate the other copies when a CPU needs to update a cache line).
18 Snooping Protocol
- All cache controllers monitor (snoop) the bus
  - All requests for data are sent to all processors
  - Processors snoop to see if they have a copy of the shared block
  - Requires broadcast, since the caching information resides at the processors
  - Works well with a bus (natural broadcast); dominates small-scale machines
- Cache coherence unit
  - The cache block (line) is the unit of management
  - False sharing is possible: two processors share the same cache line but not the actual word (see the sketch below)
- Coherence miss
  - An invalidate can cause a miss for data that was read before
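A sketch of false sharing in C with pthreads, assuming a 64-byte cache line: the two counters never share a word, yet every increment by one thread invalidates the other core's copy of the line, producing coherence misses. The padded variant is one common fix.

```c
#include <pthread.h>

/* Two counters on the same (assumed 64-byte) cache line: false sharing. */
struct counters {
    long a;                        /* written only by thread 0 */
    long b;                        /* written only by thread 1, same line as a */
} shared;

/* One common fix: pad so each counter lives on its own cache line. */
struct padded_counters {
    long a;
    char pad[64 - sizeof(long)];
    long b;
};

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared.a++;
    return 0;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared.b++;
    return 0;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, 0, bump_a, 0);
    pthread_create(&t1, 0, bump_b, 0);
    pthread_join(t0, 0);
    pthread_join(t1, 0);
    return 0;
}
```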
19 Write Invalidate vs. Write Update
Write-invalidate protocol in snooping:
- On a write to shared data, an invalidate is sent on the bus; all caches snoop and invalidate their copies
- On a read miss
  - Write-through: memory is always up-to-date
  - Write-back: snoop to force the write-back of the most recent copy
Write-update protocol in snooping:
- On a write to shared data, the new value is broadcast on the bus; processors snoop and update their copies
- On a read miss
  - Write-through: memory is always up-to-date
  - Write-back: one of the sharers (the owner) updates memory
20 An Example Snoopy Protocol
- Invalidation protocol with write-back caches
- Each cache block is in one of three states (MSI protocol):
  - Modified: this cache has the only copy (writable and dirty)
  - Shared: the block can be read
  - Invalid: the block contains no data
- State changes are driven by actions from both the CPU and the bus
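A toy C sketch of the per-block MSI bookkeeping a snooping controller keeps; the state and event names follow the slide, and the transition function is a simplification of the full diagram on the next slide, not the slides' own code.

```c
typedef enum { MSI_INVALID, MSI_SHARED, MSI_MODIFIED } msi_state_t;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ, BUS_READX } msi_event_t;

/* Next state of one cache block, from the point of view of one cache. */
static msi_state_t msi_next(msi_state_t s, msi_event_t e) {
    switch (e) {
    case CPU_READ:  return (s == MSI_INVALID) ? MSI_SHARED : s;   /* miss issues Bus Read */
    case CPU_WRITE: return MSI_MODIFIED;    /* from I or S this issues Bus ReadX; from M it is a hit */
    case BUS_READ:  return (s == MSI_MODIFIED) ? MSI_SHARED : s;  /* flush dirty data, demote to Shared */
    case BUS_READX: return MSI_INVALID;     /* another CPU is writing: invalidate our copy */
    }
    return s;
}
```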
21 Snoopy-Cache State Machine (state diagram per cache block)
- States: Invalid, Shared (read-only), Modified (read/write)
- Invalid, CPU read miss: issue Bus Read, go to Shared
- Shared, CPU read hit: no bus traffic
- Shared, observed Bus ReadX: go to Invalid (invalidated due to another CPU's write)
- Modified, CPU read/write hit: no bus traffic
- Modified, CPU write miss: Bus WriteBack (flush), then Bus ReadX
22 MESI Protocol
- Adds a 4th state: distinguish Shared from Exclusive
  - MSI protocol: Shared (read-only)
  - MESI protocol: Shared (read-only) and Exclusive (read-only, no other copies)
- Common-case optimization
  - In MSI, [shared -> modified] causes invalidate traffic; writes to non-shared data cause unnecessary invalidates
  - Even for shared data, often only one processor reads and writes it
  - In MESI, [exclusive -> modified] happens without invalidate traffic
23 MESI Protocol State Machine (state diagram)
- Needs a "shared" (S) signal in the physical interconnect
- Invalid, CPU read: issue Bus Read and sample the S-signal; go to Shared if asserted, Exclusive otherwise
- Shared or Exclusive, CPU read hit: no bus traffic
- Shared, CPU write: issue Bus ReadX, go to Modified
- Exclusive, CPU write: go to Modified with no bus traffic (invalidate is not needed)
- Modified, observed Bus Read / Bus ReadX: write back (flush) the modified block
- Shared/Exclusive, observed Bus ReadX: go to Invalid
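A small C sketch of the MESI-specific decisions described above; the function names and the representation of the S-signal are illustrative, not from the slides.

```c
typedef enum { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED } mesi_state_t;

/* On a read miss the requester samples the bus "shared" (S) signal:
   if no other cache asserts it, install the block Exclusive instead of Shared. */
static mesi_state_t mesi_fill_on_read_miss(int s_signal_asserted) {
    return s_signal_asserted ? MESI_SHARED : MESI_EXCLUSIVE;
}

/* On a CPU write: Exclusive (or Modified) upgrades silently; Shared or
   Invalid must first issue Bus ReadX, which invalidates other copies. */
static mesi_state_t mesi_on_cpu_write(mesi_state_t cur, int *needs_bus_readx) {
    *needs_bus_readx = (cur == MESI_SHARED || cur == MESI_INVALID);
    return MESI_MODIFIED;
}
```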
24 Distributed Shared-Memory Architectures
- Non-uniform memory access time (NUMA, e.g. Cray T3D/E)
- Cannot use a snooping protocol for cache coherence
  - Snooping requires all cache-miss traffic to be broadcast on a bus, but NUMA machines have no such central structure
  - Snooping is efficient only for small-scale multiprocessors
- Use a directory per cached memory block (directory protocol)
  - Keeps track of the states of memory blocks cached from local memory
  - Records which processors have the data when in the shared/exclusive state
- Three nodes may be involved
  - Local node: where a request originates
  - Home node: where the original memory block resides
  - Remote node: where a copy of the memory block exists
25 Directory-based Cache Coherence
- A directory is added to each node for cache coherence (figure: processor/cache/memory+I/O/directory nodes connected by an interconnection network)
26 Directory Protocol
- Three block states in the directory
  - Exclusive: exactly 1 processor (the owner) has the data; memory is out-of-date
  - Shared: one or more processors have the data; memory is up-to-date
  - Uncached: no processor has it; not valid in any cache
- In addition to the block state, the directory must track which processors have the data in the shared/exclusive state (Sharers)
  - Usually a bit vector: bit i is 1 if processor i has a copy (see the sketch below)
- Directories at home nodes gather information about all of their memory blocks
  - Instead of bus snooping, home nodes hold all the information required
  - To broadcast a message, send it to the home directory
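A sketch of one directory entry as described above, assuming at most 64 nodes so the sharer set fits in a single bit-vector word; the type and helper names are illustrative.

```c
#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;     /* Uncached / Shared / Exclusive, as above */
    uint64_t    sharers;   /* bit i set => processor i has a copy */
} dir_entry_t;

static void dir_add_sharer(dir_entry_t *e, int p)      { e->sharers |= (uint64_t)1 << p; }
static int  dir_is_sharer(const dir_entry_t *e, int p) { return (int)((e->sharers >> p) & 1); }
static void dir_clear_sharers(dir_entry_t *e)          { e->sharers = 0; }
```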
27 State Transitions in the Cache (per cache block)
Messages: the local cache sends requests to the home directory; the home directory sends messages (Fetch, Invalidate, Fetch/Invalidate) to remote caches holding copies.
- Invalid, CPU read miss: send Read Miss; go to Shared
- Invalid, CPU write miss: send Write Miss; go to Modified
- Shared, CPU read hit: no message
- Shared, CPU write (hit or miss): send Write Miss; go to Modified
- Shared, Invalidate from home: go to Invalid
- Modified, CPU read/write hit: no message
- Modified, Fetch from home: send Data Write Back; go to Shared
- Modified, Fetch/Invalidate from home: send Data Write Back; go to Invalid
- Modified, CPU read miss (address conflict): send Data Write Back and Read Miss
- Modified, CPU write miss (address conflict): send Data Write Back and Write Miss
28 State Transitions in the Directory (per memory block)
Requests arriving at the home directory from caches:
- Uncached, Read Miss: Sharers = {P}; send Data Value Reply; go to Shared
- Uncached, Write Miss: Sharers = {P}; send Data Value Reply; go to Exclusive
- Shared, Read Miss: Sharers += {P}; send Data Value Reply
- Shared, Write Miss: send Invalidate to Sharers; Sharers = {P}; send Data Value Reply; go to Exclusive
- Exclusive, Read Miss: send Fetch to owner (write back); Sharers += {P}; send Data Value Reply; go to Shared
- Exclusive, Write Miss: send Fetch/Invalidate to owner; Sharers = {P}; send Data Value Reply
- Exclusive, Data Write Back: Sharers = {}; write back to memory; go to Uncached
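A hand-written sketch of the home directory's write-miss handling that follows the transitions above, continuing the dir_entry_t sketch from slide 26's example; the send_* and owner_of helpers are hypothetical message-layer functions, not a real API.

```c
/* Hypothetical message helpers, assumed to exist elsewhere in the protocol engine. */
void send_invalidate(int node);
void send_fetch_invalidate(int node);
void send_data_value_reply(int node);
int  owner_of(const dir_entry_t *e);

static void dir_handle_write_miss(dir_entry_t *e, int requester) {
    switch (e->state) {
    case DIR_UNCACHED:
        break;                                  /* no copies exist anywhere */
    case DIR_SHARED:
        for (int node = 0; node < 64; node++)   /* invalidate every current sharer */
            if (dir_is_sharer(e, node) && node != requester)
                send_invalidate(node);
        break;
    case DIR_EXCLUSIVE:
        send_fetch_invalidate(owner_of(e));     /* owner flushes and invalidates its copy */
        break;
    }
    e->sharers = (uint64_t)1 << requester;      /* Sharers = {P} */
    e->state   = DIR_EXCLUSIVE;
    send_data_value_reply(requester);           /* requester becomes the new owner */
}
```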
29 Synchronization
- Why synchronize?
  - Mutual exclusion: need to know when it is safe for other processes to use shared data
  - Event synchronization: keep pace with other processes; wait until other processes have computed the needed results
- Implementation
  - Atomic (uninterruptible) instructions: fetch-and-update, test-and-swap, ...
  - User-level synchronization operations are built from the atomic instructions
- For large-scale MPs, synchronization can be a bottleneck; optimization techniques reduce contention and latency
30 Atomic Instructions
- Atomic exchange: interchange a value in a register with a value in memory
  - 0 => synchronization variable is free
  - 1 => synchronization variable is locked and unavailable
- Test-and-set: tests whether the value in memory is zero and sets it to 1 if the test passes; returns the old value
- Fetch-and-increment: returns the value of a memory location and atomically increments it
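These primitives map onto C11 atomics; a sketch of software-visible equivalents (the exact hardware instructions vary by ISA, and the names here are illustrative).

```c
#include <stdatomic.h>

atomic_int sync_var = 0;                 /* 0 = free, 1 = locked and unavailable */

/* Atomic exchange: interchange a register value with the memory value. */
int atomic_exchange_example(int newval) {
    return atomic_exchange(&sync_var, newval);
}

/* Test-and-set as described above: set to 1 only if the value is 0; return the old value. */
int test_and_set(atomic_int *p) {
    int expected = 0;
    if (atomic_compare_exchange_strong(p, &expected, 1))
        return 0;                        /* was 0, now 1: acquired */
    return expected;                     /* already non-zero: unchanged */
}

/* Fetch-and-increment: return the old value and atomically add 1. */
int fetch_and_increment(atomic_int *p) {
    return atomic_fetch_add(p, 1);
}
```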
31 Implementation of Spin Locks (1)
- Spin lock: keep testing until the lock variable is found to be 0, then proceed
- First version:
        li    R2, #1         ; 0(R1) is the lock variable
  lockit:
        exch  R2, 0(R1)      ; atomic exchange
        bnez  R2, lockit     ; already locked?
- On an MP with a cache coherence protocol
  - Every exch writes the cache block containing 0(R1), so the coherence protocol invalidates all other copies, including those in other processors that may be spinning on the same lock
  - This creates heavy invalidate traffic on the bus; we do not want to disrupt the caches of the other processors
32 Implementation of Spin Locks (2)
- Second version ("test and test-and-set"): repeatedly read the variable; only when it changes, try the exchange
        li    R2, #1
  lockit:
        lw    R3, 0(R1)      ; 0(R1) is the lock variable
        bnez  R3, lockit     ; not free, keep spinning
        exch  R2, 0(R1)      ; atomic exchange
        bnez  R2, lockit     ; already locked?
- Most of the time the processor spins reading the lock variable from its own cache
- Only when the variable changes does it attempt exch (which invalidates the other copies); see the C sketch below
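The same test-and-test-and-set idea rendered as a C11 spin lock, as a sketch rather than the slides' MIPS-style code.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *l) {
    for (;;) {
        while (atomic_load(&l->locked) != 0)
            ;                                        /* "test": spin on the cached copy */
        if (atomic_exchange(&l->locked, 1) == 0)     /* "test-and-set": try the atomic exchange */
            return;
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);
}
```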
33 Barrier Synchronization
- Keep pace with other processes (or threads)
  - Wait until all threads have reached a certain point (the barrier)
  - Makes all updates to shared data visible
  - Then proceed with the next phase until the next barrier
- Example (three threads computing partial sums of A):
  P0: do i = 1,10  : S0 += A[i];  barrier(0);  S = S0+S1+S2;  barrier(1)
  P1: do i = 11,20 : S1 += A[i];  barrier(0);                 barrier(1)
  P2: do i = 21,30 : S2 += A[i];  barrier(0);                 barrier(1)
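A sketch of the partial-sum pattern above using POSIX barriers (assuming pthread_barrier_t is available; compile with -pthread). Array bounds and thread count mirror the slide's three-thread example.

```c
#include <stdio.h>
#include <pthread.h>

#define N        30
#define NTHREADS 3

static double A[N];
static double partial[NTHREADS];
static double total;
static pthread_barrier_t bar0, bar1;

static void *worker(void *arg) {
    long id = (long)arg;
    double s = 0.0;

    /* Each thread sums its own 10-element slice (P0: A[0..9], P1: A[10..19], ...). */
    for (int i = (int)id * (N / NTHREADS); i < (int)(id + 1) * (N / NTHREADS); i++)
        s += A[i];
    partial[id] = s;

    pthread_barrier_wait(&bar0);      /* barrier(0): all partial sums are now visible */
    if (id == 0)
        total = partial[0] + partial[1] + partial[2];
    pthread_barrier_wait(&bar1);      /* barrier(1): the total is visible to everyone */
    return 0;
}

int main(void) {
    pthread_t t[NTHREADS];

    for (int i = 0; i < N; i++) A[i] = 1.0;
    pthread_barrier_init(&bar0, 0, NTHREADS);
    pthread_barrier_init(&bar1, 0, NTHREADS);
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], 0, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], 0);
    printf("total = %.0f\n", total);
    return 0;
}
```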
34 Multithreading
- Superscalar vs. multithreading vs. simultaneous multithreading (SMT)
(figure: issue slots over time, in processor cycles, for the three approaches across five threads; SMT fills issue slots in the same cycle with instructions from different threads)
35 Summary
- Parallel architecture: shared memory vs. distributed memory
- Cache coherence: keep multiple read-only copies or one exclusive modified copy; snooping protocol vs. directory protocol
- Synchronization: implemented with atomic instructions; used for mutual exclusion and event synchronization
- Multithreading architectures