Memory Architectures. Week 2, Lecture 1. Copyright 2009 by W. Feng. Based on material from Matthew Sottile.
2 Directory-Based Coherence. Idea: maintain pointers, not just simple state bits, with each cache block. Ingredients: data owners (home nodes) keep track of the nodes holding cached copies (sharers). Coherence decisions are implemented via distributed linked-list update algorithms: a doubly linked list is formed between caches, with sharers linking to other sharers and the home node acting as the list head. Issues?
3 Directory-Based Coherence. Read miss: transaction from requestor to home node. The home node replies with the data or the head of the sharer list; the head pointer is changed to the requestor, and the requestor sends a transaction to the first sharer to insert itself into the list. Write miss: the requestor obtains head information from the home node and is inserted as the new head (if already on the list, it is deleted and re-inserted as head). The list is then traversed and shared copies are invalidated.
4 Recall: Cache Coherence & Performance. Unlike pipelining details that only concern compiler writers, you, the programmer, really need to be aware that this is going on under the covers. The coherence protocol can impact your performance.
5 Cache Coherence: Performance Demo
6 Abusing Cache Coherence Why the difference in performance? For the same overall computation, change the interleaving of memory accesses between the processing cores.
7 Traversal Patterns

void worker_unfriendly(void *arg) {
    int rank = (int)arg;
    int start = rank;
    int end = N;
    int stride = P;
    int i, iter;
    int numiters = 256;

    for (iter = 0; iter < numiters; iter++) {
        for (i = start; i < end; i += stride) {
            if (i > 0 && i < N-1)
                data[i] = (data[i-1] + data[i] + data[i+1]) / 3.0;
        }
    }
}
8 Explanation of Abusing Cache Coherence. Each block is shared across multiple caches. Each iteration, all threads read from the same block (Shared) and try to write to it (Modified). One thread wins and forces all the other processors to invalidate their copies (Invalid). When one of the others then executes a read, the writer is forced back into the Shared state and must write back. In the unfriendly case, many transitions to the Invalid state mean blocks must be re-read over and over by some cores. Result: effectively serialized code, so we observe significantly worse than sequential performance.
9 Philosophical Perspective. Cache coherence protocols are rarely addressed in the popular literature on multicore. What is the focus? Libraries and languages. Dangerous? Why or why not? Confucius say: understand the hardware consequences of code structure to achieve good performance. :-) Just as you need to understand caches to write good sequential code.
10 Software-Based Cache Coherency (CC)? Question: can one build a shared-memory software layer above a distributed-memory system (such as a cluster) that takes concepts from cache coherence protocols to implement a large-scale shared-memory view of a big cluster? Answer: possible, but slow. Distributed Shared Memory (DSM): distributed at the hardware level, shared at the programming abstraction. Why slow? The bus transactions needed to make a CC protocol work happen in hardware at the bit level, on the time scale of a single clock cycle; a software layer must emulate them with much coarser, far higher-latency operations.
11 To Date: Architectural Enhancements. Coarse-grained parallelism: core replication. Instruction-level parallelism (ILP): deep pipelines, superscalar, and VLIW. Sophisticated memory hierarchy. Lesson? Do sophisticated activities in hardware so that the abstraction layer seen by programmers is as similar as possible to writing plain single-threaded code. Assuming that enough parallelism exists, the key is memory performance.
12 Key: Memory Performance. Hide it or control it. ILP and caches: hide it. Coherence: control memory consistency with respect to multiple viewpoints; hide how this is achieved in hardware.
13 Memory Access Timing. Increase the number of processors and scaling of a bus interconnect becomes an issue. More complexity is needed to avoid excessive bus transfers. Example: avoid flushing or broadcasting to caches that are not affected by a cache coherence conflict. Longer-term solution?
14 Memory Access Timing. Sequential computer: every location in memory takes the same amount of time to access. Small shared-memory systems: ditto; also known as Uniform Memory Access (UMA). Larger shared-memory systems: a more complex memory subsystem makes constant-time access prohibitively difficult, so memory access takes different amounts of time depending on where the physical memory sits in the machine relative to the accessor. Known as Non-Uniform Memory Access (NUMA); cache-coherent NUMA is ccNUMA.
15 ccNUMA Diagram
16 NUMA. A model often found on parallel systems a decade ago. Example: the SGI Origin 2000, composed of 128 CPUs with a hierarchy of bricks from which the entire machine is built. Multicore: Intel is UMA; AMD is NUMA. The world is moving to NUMA. Why?
17 UMA vs. NUMA. UMA: do not care where data actually lies relative to the computations that work on it. NUMA: do care where data is; better performance if data is near computation. Problem: hard to enforce without OS and run-time assistance. Why? A combination of resource scheduling, load balancing, and coupling memory management with process management. (Outside the scope of this course; just be aware that, more and more, not all memory is treated equally from a timing perspective.)
18 NUMA Another Perspective Think of NUMA as simply adding more layers to the cache hierarchy. Memory near computation is faster, much like a local cache. Memory far away from a computation is slower, much like main memory (or a cache deeper in the hierarchy). But why continue to add all these layers? Remember: Memory and bus logic consume an increasingly large amount of real estate on a die, e.g., see Intel Nehalem die (previous lecture). Memory performance becomes critical to code performance.
19 Time for a New Approach? Part I. Let's start over. KISS: Keep It Simple, Stupid. Simplify the hardware: get rid of speculative fetching logic; get rid of cache coherence hardware; get rid of all cache logic. Hmm... let's get rid of caches altogether (at least for most processor cores). Create a fast processor in terms of functional execution, with no fancy memory hierarchy. Use the spare logic real estate to add high-performance functional units and a reasonable, but small, amount of RAM-like memory right next to the processor: much smaller than a cache, but much larger than a register set, and addressable in a RAM fashion.
20 Welcome The Cell
21 Initial Reactions. Blasphemy! The cache and memory subsystem help me write code productively. Why does Cell want to take this away? How can I survive without a memory hierarchy that just works? I want to focus on the functional logic of my program, not memory traversal. Hmmm... maybe not.
22 Re-Think Programming? Current view: create sequences of instructions with memory as a secondary concern. Challenging conventional wisdom: how does my world as a programmer change if data (memory) movement is viewed on an equal footing with computation? Can I shift my way of thinking to exploit this new memory model and exceptional computational capability? Answer: probably, but with effort (which runs counter to code productivity, maybe).
23 How We Currently Program. A perspective from the designer(s) of the Cell: a plumbing project at home. If we repaired plumbing the way we programmed, we'd... and so on. (A sequence of picture slides.)
34 How We Currently Program. Example from a code perspective:

    x = a(1)
    y = b(2)
    z = c(14)
    d(1) = x + y + z

A good compiler optimizes the above away. But it is not difficult to build algorithms complex enough that the compiler cannot optimize it all away, resulting in going back and forth to main memory over and over. Possible causes?
35 How We Currently Program. Wasteful and inefficient? (See previous slides.) We ask for memory when we need it, and it happens magically, but with a potential performance impact. Current solution: performance tuning, e.g., fetch multiple items to the CPU at once by packing into cache-friendly blocks. Not always feasible: mesh-based data structures in finite element method (FEM) codes, where one struggles with structs of arrays versus arrays of structs; arbitrary data layouts and structures; the plumbing example. New solution: instead of a back-and-forth, on-demand memory model, why not plan ahead and hand the system a shopping list?
36 Shopping List Model. More efficient performance (obvious), but it requires programmers to plan ahead. Oh, the pain of planning ahead... The free ride of cache coherency does not exist in this model. Programmers must translate algorithms previously developed assuming lots of hardware on the CPU to support an on-demand memory model. (Remember the pictures of dies?) Not so bad if the algorithm already computes the memory traversal pattern, e.g., the butterfly pattern in the Fast Fourier Transform. (Are we sacrificing programmer productivity for performance?)
38 Shopping List Model. Plumbing project at home, using the shopping list model... (A sequence of picture slides.)
42 Support for the Shopping List Model. Sophisticated memory access hardware and a high-bandwidth, low-latency internal bus on the chip. Difficulty in programming to this model? Not due to a shortcoming of the chip, but due to a lack of programming tools to support it. Historical perspective: back in 2007, you programmed the Cell via assembly; IBM then released an SDK with libraries to help out, and now compiler vendors are helping too (Gedae, RapidMind, and so on). Here to stay?!
43 The Cell A collaboration between Sony, IBM, and Toshiba. Formal name: Cell Broadband Engine Architecture.
44 The Cell. PPE: dual-threaded (SMT) PowerPC processing element. SPE: single-threaded, high-performance synergistic processing element. EIB: circular data bus connecting the PPE and SPEs (and I/O + memory).
45 PPE: PowerPC Processing Element. Multi-purpose: a general-purpose CPU that assists the SPEs by performing computations that help keep the SPEs running with minimal idle time. Anatomy: registers: 64-bit general-purpose and floating-point registers, plus 128-bit registers for AltiVec SIMD operations. Caches: L1: 32-KB instruction, 32-KB data; L2: 512 KB.
46 SPE: Synergistic Processing Element. A high-performance, specialized processor element that can be replicated within the Cell architecture. Structure: a high-performance execution core, the Synergistic Processing Unit (SPU), which interacts with the outside world via the Memory Flow Controller (MFC). MFC: MMU, DMA, and bus interface. SPE: 128-bit registers; 256-KB local-store SRAM that is visible from the PPE (NOT a cache!); instruction and data memory (32-bit addressing).
47 EIB: Element Interconnection Bus. A bi-directional circular bus: four 16-byte channels, two in each direction. 12 devices on the bus: 8 SPEs, 1 PPE, 1 interface to system memory, and 2 I/O units. Each device has a single 16-B read port and a single 16-B write port onto the bus. Maximum (realizable) EIB bandwidth: ~200 GB/s. Ideal EIB bandwidth: ~300 GB/s.
48 Time for an Old New Approach? Part II. The Cell was not the first architecture to re-think how memory is managed. Tera (now Cray Inc.) created the MTA, the MultiThreaded Architecture, in the 1990s: throw out the concept of a cache. Basic idea: exploit massive levels of threading to hide the high latency to main memory.
49 MTA: Example with 6 Threads. (A sequence of picture slides stepping through the schedule.)
57 MTA Overview. A great deal of processor real estate supports the contexts for many threads. That is, remove the cache(s) and hide latency by servicing many threads: dedicate the hardware otherwise used for caching to servicing threads. Remember this when we reach GPGPUs. Good performance requires a high degree of threading, such that memory latency can be hidden by servicing many threads. (The plumbing project only had three threads.) Disclaimer: in reality, the MTA is capable of more than just a round-robin schedule; this just gets the basic idea across.
58 Time for an Old New Approach? Part III. Combining ideas from Part I and Part II: a plug-in architecture, the GPGPU. Explicit data movement between the CPU and the GPGPU card out on PCIe (a la FPGA acceleration): the shopping list model of the Cell, outside the chip. Massive parallelism via threading: the memory model of the MTA, within the chip.
59 Welcome The GPGPU
60 Next Time? Interconnection Networks. Shallow coverage of interconnection networks: important to parallel computing, but fallen out of fashion over the past decade or so. The return of its importance: Tilera 64-core, Intel 80-core, and so on. Thursday should be the last architecturally-driven lecture.
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationCellSs Making it easier to program the Cell Broadband Engine processor
Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of
More informationHigher Level Programming Abstractions for FPGAs using OpenCL
Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*
More information10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems
1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase
More informationObjective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.
CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes
More informationComp. Org II, Spring
Lecture 11 Parallel Processor Architectures Flynn s taxonomy from 1972 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing (Sta09 Fig 17.1) 2 Parallel
More informationWilliam Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers
William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationParallel Processing & Multicore computers
Lecture 11 Parallel Processing & Multicore computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1)
More informationComp. Org II, Spring
Lecture 11 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1) Computer
More informationDEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level
More informationECE/CS 250 Computer Architecture. Summer 2016
ECE/CS 250 Computer Architecture Summer 2016 Multicore Dan Sorin and Tyler Bletsch Duke University Multicore and Multithreaded Processors Why multicore? Thread-level parallelism Multithreaded cores Multiprocessors
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationMultithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others
Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as
More informationA Multiprocessor system generally means that more than one instruction stream is being executed in parallel.
Multiprocessor Systems A Multiprocessor system generally means that more than one instruction stream is being executed in parallel. However, Flynn s SIMD machine classification, also called an array processor,
More informationMulticore Hardware and Parallelism
Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationModule 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains:
The Lecture Contains: Shared Memory Multiprocessors Shared Cache Private Cache/Dancehall Distributed Shared Memory Shared vs. Private in CMPs Cache Coherence Cache Coherence: Example What Went Wrong? Implementations
More informationCOSC4201. Multiprocessors and Thread Level Parallelism. Prof. Mokhtar Aboelaze York University
COSC4201 Multiprocessors and Thread Level Parallelism Prof. Mokhtar Aboelaze York University COSC 4201 1 Introduction Why multiprocessor The turning away from the conventional organization came in the
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working
More informationMo Money, No Problems: Caches #2...
Mo Money, No Problems: Caches #2... 1 Reminder: Cache Terms... Cache: A small and fast memory used to increase the performance of accessing a big and slow memory Uses temporal locality: The tendency to
More informationFlynn s Classification
Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationCSE Opera+ng System Principles
CSE 30341 Opera+ng System Principles Lecture 2 Introduc5on Con5nued Recap Last Lecture What is an opera+ng system & kernel? What is an interrupt? CSE 30341 Opera+ng System Principles 2 1 OS - Kernel CSE
More informationLecture 5: Directory Protocols. Topics: directory-based cache coherence implementations
Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations 1 Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604
More informationUnit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture
More informationWilliam Stallings Computer Organization and Architecture 8th Edition. Cache Memory
William Stallings Computer Organization and Architecture 8th Edition Chapter 4 Cache Memory Characteristics Location Capacity Unit of transfer Access method Performance Physical type Physical characteristics
More informationComputer Architecture Crash course
Computer Architecture Crash course Frédéric Haziza Department of Computer Systems Uppsala University Summer 2008 Conclusions The multicore era is already here cost of parallelism is dropping
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationUC Berkeley CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UC Berkeley CS61C : Machine Structures Lecture 40 Hardware Parallel Computing 2006-12-06 Thanks to John Lazarro for his CS152 slides inst.eecs.berkeley.edu/~cs152/ Head TA
More informationMulti-core processors are here, but how do you resolve data bottlenecks in native code?
Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session
More information