Advanced Parallel Programming I
1 Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH, Johannes Kepler University Linz
2 Levels of Parallelism
3 Motivation for Parallelism - History
4 Motivation for Parallelism CPU Scaling. Moore's Law: doubling of transistor density every 2 years. Dennard Scaling made Moore's Law work: transistors became smaller and faster at constant power density.
5 Motivation for Parallelism Limits of serial execution. End of classical Dennard Scaling: 1. No tox (gate oxide) scaling 2. No voltage scaling 3. No delay time scaling 4. Leakage. P = C_eff * V_DD^2 * f + I_leak * V_DD. Power wall because of critical heat dissipation. Frequency wall because of gate delay time.
6 Level of Parallelism Code Granularity. Large grain (task level - TLP): tasks (Task i-1, Task i, Task i+1) communicating via messages. Medium grain (control level - CLP): functions func1(), func2(), func3(). Fine grain (data level - DLP): data elements a[0]=, b[0]=, a[1]=, b[1]=, ... Very fine grain (instruction level - ILP): single operations (+, *, /).
7 Level of Parallelism Programmer's control. Task level: programmer (MPI, PVM, ...). Control level: programmer (OpenMP, Pthreads, ...). Data level: programmer/automatic (vector intrinsics/compiler). Instruction level: automatic (CPU).
8 Level of Parallelism - Milestones
9 Instruction Level Parallelism. Also known as multiple issue. Implemented in hardware via a superscalar CPU (parallel pipelines). [Diagram: instructions over time in cycles, overlapping IF ID EX WB stages in parallel pipelines - fully utilized pipeline]
10 Instruction Level Parallelism Out of Order Execution. Input: serial instruction stream; identification and execution of instructions that can run in parallel. Pipeline hazards: structural hazards (resource conflicts), data hazards (data dependencies), control hazards (conditional and unconditional jumps - branch prediction). ILP wall: any further parallelization effort yields only a small increase in resource utilization.
11 Instruction Level Parallelism Sandy Bridge Core Pipeline. [Diagram: in-order front end (instruction fetch, 32 kB L1 instruction cache, pre-decode, instruction queue, branch prediction, 4 decoders, uop cache; 4 uops/cycle issued via rename/allocate/retire) feeding an out-of-order back end (reorder buffer, load/store buffers, reservation station dispatching 6 uops/cycle to ports 0-5: ALU, vector/SSE/AVX multiply and add, divide, shuffle, blend, load and store units; memory control at 48 bytes/cycle to the 32 kB L1 data cache, backed by a 256 kB L2 cache)]
12 Data Level Parallelism. The same instruction is executed simultaneously on N > 1 elements of a vector. [Diagram: scalar processing adds one pair of elements (S1 + S2 = S3); vector processing adds four pairs (V1 + V2 = V3) in one operation]
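As a sketch of data level parallelism (not from the slides): a vectorizing compiler can map the loop below onto SIMD instructions, performing one addition on several elements at once, just like the V1 + V2 = V3 picture above.

```cpp
#include <cstddef>

// Element-wise addition: the same "+" is applied independently at
// every index, so the compiler can process several elements per
// instruction (e.g. 4 doubles per AVX operation).
void vector_add(const double* a, const double* b, double* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

With icc, a vectorization report (e.g. -qopt-report) can confirm whether such a loop was actually vectorized.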
13 Data Level Parallelism
14 Control Level Parallelism Simultaneous Multi Threading (SMT). Also known as Hyper-Threading (HT) or hardware threading. Multiple logical processors per core; each logical processor has its own administrative logic and is fed with an independent serial instruction stream. Mapped onto ILP - increases resource utilization of the superscalar CPU.
15 Control Level Parallelism Simultaneous Multi Threading (SMT). [Diagram: execution unit occupancy over time in cycles, without SMT vs. with SMT; each box represents a processor execution unit]
16 Control Level Parallelism Sandy Bridge Multi-Core CPU (Intel Xeon E5-2600). 2 QPI links at 8.0 GT/s. Ring bus: 32 bytes wide, bandwidth 96 GB/s, 1 cycle per hop. Integrated memory controller (IMC): 4 memory channels of 1600 MHz DDR3, bandwidth 4 * 1600 MHz * 8 Byte = 51.2 GB/s.
17 Control Level Parallelism Multi-Socket CPUs (Sandy Bridge). [Table: memory access latency in ns and bandwidth in GB/s for local vs. remote access; the concrete values were part of the slide graphic. *) Remote bandwidth over QPI: 8.0 GT/s * 2 Byte = 16.0 GB/s]
18 Task Level Parallelism. Intra-node: simultaneous multi-threading, multi-core, multi-socket - shared memory model. Inter-node: different compute nodes connected via a fast interconnect (e.g. InfiniBand) in different connection topologies - distributed memory model.
19 Summary
20 Memory Hierarchies
21 Motivation for Memory Hierarchies. Moore's Law: CPU performance doubles roughly every two years. Memory bandwidth does not keep up (doubles roughly every five years) - memory speed gap (memory wall). Year 1980: CPU and memory cycle both about 1 µs. Year 2000: CPU cycle 1 ns, memory cycle 100 ns.
22 Principles of Locality. Every program has to a certain degree some locality: usage of recently accessed data and instructions. Two kinds of locality. Temporal locality: a recently accessed object is used again in the near future, e.g. x is read - x is read or written again soon. Spatial locality: neighbouring objects are accessed at similar times, e.g. y[i] is read - y[i + 1] is read soon.
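A minimal sketch of both kinds of locality (the matrix layout and loop orders are illustrative, not from the slides): both functions compute the same sum over a row-major matrix, but only the first one walks memory with stride 1.

```cpp
#include <cstddef>
#include <vector>

// Row-major traversal: consecutive iterations touch neighbouring
// addresses (spatial locality), and `sum` is reused every iteration
// (temporal locality).
double sum_row_major(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double sum = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            sum += m[i * cols + j];      // stride-1 access
    return sum;
}

// Column-major traversal of the same row-major array jumps `cols`
// elements per step, so each access may land on a new cache line.
double sum_col_major(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double sum = 0.0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            sum += m[i * cols + j];      // stride-`cols` access
    return sum;
}
```

For matrices larger than the cache, the second version is typically much slower even though it does the same arithmetic.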
23 Cache Memory. Sits between processor and main memory. Keeps copies of main memory blocks (data and instructions). Access time is much faster (1 ns vs. 100 ns). Much smaller in size than main memory (costs).
24 Cache Design. Cache block size: Bytes; a cache block is usually called a cache line. Cache design criteria: When to cache? Where to cache? How to find a cache block again? Which cache block is replaced (after a miss)? What to do in case of writes? The methods must be easy to implement (in hardware).
25 When to Cache? Read access: always cache - when a memory location is read and no copy is in the cache (read miss), it is loaded into the cache. Write access: not applicable for instruction caches; different strategies (see later on).
26 Where to Cache? Direct mapped cache: the full address is split into an m-bit block index and an n-bit block offset. Associative cache: the full address is split into an m-bit set index and an n-bit block offset.
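The address split above can be sketched in a few lines of bit manipulation (the concrete bit widths passed in are assumptions for illustration, not values from the slides):

```cpp
#include <cstdint>

// Split a load address into (tag, set index, block offset) for a
// cache with 2^n-byte blocks and 2^m sets.
struct CacheAddress {
    std::uint64_t tag;
    std::uint64_t set;
    std::uint64_t offset;
};

CacheAddress split_address(std::uint64_t addr,
                           unsigned n_offset_bits, unsigned m_set_bits) {
    CacheAddress r;
    r.offset = addr & ((1ULL << n_offset_bits) - 1);               // low n bits
    r.set    = (addr >> n_offset_bits) & ((1ULL << m_set_bits) - 1); // next m bits
    r.tag    = addr >> (n_offset_bits + m_set_bits);               // remaining bits
    return r;
}
```

For example, with 64-byte lines (n = 6) and 64 sets (m = 6), address 0x12345604 has block offset 4, set index 24, and tag 0x12345.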
27 How to Find a Cache Block Again? Load address - find the set for the address. Information per cache block: address tag, valid bit. Check for a matching address tag among the valid blocks. Full address = address tag | m-bit set index | n-bit block offset.
28 Which Cache Block Is Replaced (After a Miss)? Direct mapped cache: no choice, replacement of the indexed block. Associative cache: random replacement, or least recently used (LRU) - better, but more difficult to implement.
29 What to Do in Case of Writes? Write through: write data directly to main memory (no caching); used for data I/O and some multiprocessor cache coherency implementations; the CPU avoids write latencies via a write buffer. Write back: write data only to the cache; on cache block replacement the data is written from cache to main memory (tracked via a dirty bit); reduces memory traffic, but is more difficult to implement than write through.
30 Cache Memory Access Costs. Average memory access time = hit time + miss ratio * miss time. Hit time: time to load data from cache to CPU. Miss ratio: proportion of accesses which cause a miss. Miss time: time to load data from main memory to cache. Goal: optimize all three parameters.
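The formula above is directly computable; the numeric example in the usage note below (1-cycle hit, 25% miss ratio, 100-cycle miss time) is an illustration, not a figure from the slides.

```cpp
// Average memory access time (AMAT):
//   AMAT = hit time + miss ratio * miss time
double amat(double hit_time, double miss_ratio, double miss_time) {
    return hit_time + miss_ratio * miss_time;
}
```

For instance, a 1-cycle hit time with a 25% miss ratio and a 100-cycle miss penalty gives an average of 26 cycles per access, which shows why reducing the miss ratio pays off so strongly.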
31 Three Reasons for Cache Misses (3 Cs). Compulsory (or cold start): first access to data of an address. Capacity: the cache is not big enough to hold all data. Conflict: too many memory addresses are mapped to the same set index of the cache.
32 Memory Hierarchy. Speed (and cost) vs. capacity: CPU registers, ~1 kB, 1 cycle; L1 cache, ~100 kB, 2-3 cycles; L2 cache, ~1-10 MB, ~20 cycles; L3 cache, ~10-50 MB, ~50 cycles; main memory, ~1 GB, ~300 cycles.
33 Example 1 Cache Hierarchy. 1. Go to the directory example_1. 2. Compile the code cache.cpp: icc -std=c++11 cache.cpp. 3. Execute the binary and try to identify the cache boundaries. 4. How big are the different caches?
34 Cache Coherency
35 Shared Memory Model. All processors have access to global memory. Communication with main memory via reads and writes. Caches are automatically kept up to date, i.e. coherent. Scalability is difficult because of the memory bottleneck - low number of processors. Example: multi-core CPU.
36 Shared Memory Model. [Diagram: eight processors P on one chip, each with private L1 and L2 caches, sharing an on-chip L3 cache connected to memory]
37 Cache Coherence. Shared memory model: a shared variable has a unique value at any given point in time. Caching creates multiple copies of a memory location. To avoid different values, caches are kept coherent: a write to a memory location must invalidate all copies in other caches.
38 Coherence Protocols. Bookkeeping of the sharing state of cache blocks: Was the block modified? Is the block stored in more than one cache? Snooping (or broadcast) based protocols: each copy in a cache has a sharing state, there is no centralized state, and each processor sees each request. Directory based protocols: bookkeeping of the sharing state is centralized.
39 Snooping Based Protocols. Usage of the valid tags of cache lines for invalidation, an extra tag for the sharing state, and the dirty bit of write back caches. All processors see all bus transactions. Invalidation message: block in cache - invalidate it. Memory read request: block in cache - provide the data and cancel the memory request. Many different implementations exist.
40 Three State Snoopy Protocol MSI. Simplest protocol allowing multiple copies. Each cache block can have one of the following states. Modified: the only valid copy in any cache, and the value differs from main memory. Shared: several valid copies in caches, and the value is identical to main memory. Invalid: the copy is not valid and cannot be used.
41 MSI Example: P2 Read. [Diagram: P1 holds the value in state S; P2 issues PrRd, its snooper places BusRd on the bus; P2 loads the value in state S] P2 wants to read the value. Its cache does not have the data, so it places a BusRd to notify other processors and ask for the data. The memory controller provides the data.
42 MSI Example: P3 Write. [Diagram: P3 issues PrWr and places BusRdX; P1 and P2 transition S to I; P3 holds the value in state M] P3 wants to write the value. It places a BusRdX to get exclusive access and the most recent copy of the data. The caches of P1 and P2 see the BusRdX and invalidate their copies. Because the value is still up to date in memory, memory provides the data.
43 MSI Example: P2 Read. [Diagram: P2 issues PrRd and places BusRd; P3 transitions M to S and flushes the data; P2 loads the value in state S] P2 wants to read the value. P3's cache has the most up-to-date copy and will provide it. P2's cache puts a BusRd on the bus. P3's cache snoops this and cancels the memory access because it will provide the data. P3's cache flushes the data to the bus.
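The three examples above can be condensed into a next-state function for one cache's copy of a block. This is a simplified sketch of MSI (it ignores bus arbitration, flushes, and upgrade messages), with event names matching the slides:

```cpp
// Simplified MSI next-state function for a single cache's copy.
// Events: own processor read/write (PrRd/PrWr) and snooped bus
// transactions from other caches (BusRd/BusRdX).
enum class State { Modified, Shared, Invalid };
enum class Event { PrRd, PrWr, BusRd, BusRdX };

State msi_next(State s, Event e) {
    switch (e) {
        case Event::PrRd:                       // read miss issues BusRd -> S
            return s == State::Invalid ? State::Shared : s;
        case Event::PrWr:                       // write issues BusRdX if needed -> M
            return State::Modified;
        case Event::BusRd:                      // another reader: flush, demote M -> S
            return s == State::Modified ? State::Shared : s;
        case Event::BusRdX:                     // another writer: invalidate our copy
            return State::Invalid;
    }
    return State::Invalid;                      // unreachable
}
```

Tracing the slides: P2's read takes its copy I -> S (slide 41), P3's write takes the others S -> I and its own copy to M (slide 42), and P2's later read demotes P3's copy M -> S (slide 43).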
44 False Sharing. Coherence operations work on whole cache lines. A cache line can contain multiple units of a data type (e.g. int). When two processors write to different units in the same cache line (the data values are not really shared), this is false sharing: each write invalidates the copies in other caches, causing bus traffic and memory accesses. It is a significant performance problem and difficult to identify. [Diagram: P1 and P2 writing to different words of the same cache line n]
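A common fix is to pad each per-thread counter to its own cache line. This sketch assumes a 64-byte line size (the usual x86 value, to be confirmed with example 2 below) and uses std::thread rather than the OpenMP setup of the example:

```cpp
#include <thread>
#include <cstdint>

// Two counters in one cache line would be "falsely shared": each
// write by one thread invalidates the line in the other core's
// cache. alignas(64) gives every counter its own (assumed 64-byte)
// line, so no coherence ping-pong occurs.
struct Padded {
    alignas(64) std::uint64_t value = 0;   // one counter per cache line
};

void count_padded(Padded* counters, int n_iters) {
    std::thread t0([&] { for (int i = 0; i < n_iters; ++i) ++counters[0].value; });
    std::thread t1([&] { for (int i = 0; i < n_iters; ++i) ++counters[1].value; });
    t0.join();
    t1.join();
}
```

Removing the alignas(64) keeps the program correct but, with the threads pinned to different cores, typically makes it several times slower - which is what makes false sharing so hard to spot.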
45 Example 2 Cache Coherency False Sharing. 1. Go to the directory example_2. 2. Compile the code coherence.cpp: icc -openmp coherence.cpp. 3. Set two execution threads: export OMP_NUM_THREADS=2. 4. Use different cores on the same CPU: export KMP_AFFINITY=granularity=fine,proclist=[0,2],explicit. 5. Execute the binary. 6. Can you see the drop in performance? 7. What is the cache line size?
46 NUMA (Non Uniform Memory Access)
47 Distributed Shared Memory Model. [Diagram: processor pairs P with caches C, each node with its own local memory M, connected via an interconnect]
48 Directory Based Coherence. Scalability: there is no fast bus - snooping is not possible, hence a directory structure. A bit vector for each cache block, one bit per processor (1 = block is in that processor's cache), stored in (distributed) memory. Scalability problem: additional memory for the directory, e.g. a 128 Byte cache block with 256 processors - about 20% overhead.
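The overhead figure on the slide can be checked with a one-line calculation. The sketch below assumes the overhead is measured as the directory's share of total storage (data bits plus directory bits); measured relative to the data alone it would be 25% for the same numbers.

```cpp
// Directory storage overhead: one presence bit per processor for
// every cache-block-sized chunk of memory.
double directory_overhead(int block_bytes, int n_processors) {
    double data_bits = block_bytes * 8.0;                  // bits in one block
    double dir_bits  = static_cast<double>(n_processors);  // one bit per processor
    return dir_bits / (data_bits + dir_bits);              // fraction of total storage
}
```

With 128-byte blocks (1024 data bits) and 256 processors, the directory adds 256 bits per block: 256 / (1024 + 256) = 20% of total storage, matching the slide.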
49 Directory Based Coherence. [Diagram: processors P with caches C; each node holds local memory M plus a directory D, connected via an interconnect]
50 Directory Based Coherence. Home node: the node where the memory (and directory entry) is located. In principle like snoopy based protocols: a cache line has three states (MSI); a directory entry has modified, shared, and uncached states. On a cache miss, the home node is asked for the data, and the directory bits are set according to the read/write miss. The directory can invalidate a copy in a remote cache or fetch the data back from a remote cache. A cache can write back to the home node.
51 NUMA Problem. Access to remote memory takes longer than to local memory. The operating system is responsible for allocating pages. Common allocation policies: first touch (allocate the page on the node which makes the first access) and round robin (allocate cyclically).
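Under a first-touch policy, parallel initialization places each page near the thread that will use it. This is a sketch with std::thread (example 3 below does the same with OpenMP); the thread count and even chunk split are illustrative:

```cpp
#include <vector>
#include <thread>
#include <cstddef>

// First-touch sketch: each thread writes ("touches") its own chunk
// of the array first, so the OS allocates those pages in that
// thread's local NUMA memory. Later accesses with the same chunking
// then hit local memory.
void parallel_first_touch(std::vector<double>& a, unsigned n_threads) {
    std::vector<std::thread> workers;
    std::size_t chunk = a.size() / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = (t == n_threads - 1) ? a.size() : begin + chunk;
        workers.emplace_back([&a, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                a[i] = 0.0;          // first touch allocates the page locally
        });
    }
    for (auto& w : workers) w.join();
}
```

Note that if the array is initialized serially by one thread instead, all pages land on that thread's node, and the other sockets pay the remote-access penalty for the whole run.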
52 Example 3 Memory Bandwidth Saturation & NUMA. 1. Go to the directory example_3. 2. Compile the file cacheomp.cpp: icc -std=c++11 -openmp cacheomp.cpp. 3. Execute the binary. 4. How big is the main memory bandwidth? 5. Activate the parallel initialization of the array in the file cacheomp.cpp and compile/execute the program again. 6. How big is the improvement in the main memory bandwidth?
53 Roofline Model
54 Basic Idea. Compute the attainable peak floating point performance as a function of the arithmetic intensity. Correlates the peak floating point performance and the peak memory bandwidth of a target machine. Enables identification of whether a certain algorithm is memory bound (bandwidth) or compute bound - localization of optimization potential.
55 Attainable Performance Calculation. Peak floating-point performance = #Sockets * #Cores * #Instructions per cycle * Frequency (#Instructions per cycle from DLP and ILP). Peak memory bandwidth = #Sockets * #Channels * Frequency * #Bytes (bandwidth from RAM to LLC). Operational intensity = Flops / Bytes read. Differentiate between double precision (DP) and single precision (SP).
56 Operational Intensity Example Matrix Vector Multiplication in DP. Matrix: a x b, vector: b x 1. Flops: flops ~ 2 * a * b. Bytes read: mem ~ a * b * 8. Operational intensity: i = flops / mem = (2 * a * b) / (a * b * 8) = 1/4.
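The roofline calculations above are simple enough to script. In this sketch, the sample hardware numbers in the usage note (4 channels of DDR3-1600, a compute roof of 172.8 GFlop/s) are illustrative assumptions, except for the 51.2 GB/s bandwidth, which matches slide 16:

```cpp
#include <algorithm>

// Peak compute roof in GFlop/s:
//   sockets * cores * flops per cycle * frequency (GHz)
double peak_gflops(int sockets, int cores, int flops_per_cycle, double ghz) {
    return sockets * cores * flops_per_cycle * ghz;
}

// Peak memory roof in GB/s: sockets * channels * MT/s * bytes per transfer.
double peak_bandwidth_gbs(int sockets, int channels, double mtps, int bytes) {
    return sockets * channels * mtps * bytes / 1000.0;
}

// Roofline: attainable GFlop/s at operational intensity i (flops/byte)
// is the lower of the compute roof and the memory roof.
double attainable_gflops(double intensity, double gflops, double gbs) {
    return std::min(gflops, intensity * gbs);
}
```

At i = 1/4 (the matrix-vector example above), one socket with 51.2 GB/s and a 172.8 GFlop/s compute roof attains only min(172.8, 0.25 * 51.2) = 12.8 GFlop/s: the kernel sits far left of the ridge point and is firmly memory bound.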
57-61 Roofline Model 2 x Xeon E GHz. [Series of roofline plots: attainable performance vs. operational intensity; algorithms left of the ridge point are memory bound, algorithms right of it are compute bound]
62 Roofline Model Comparison
63 Thank You! [Photo: Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn, Valais, Switzerland]
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationMemory Hierarchy. Reading. Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 (2) Lecture notes from MKP, H. H. Lee and S.
Memory Hierarchy Lecture notes from MKP, H. H. Lee and S. Yalamanchili Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 Reading (2) 1 SRAM: Value is stored on a pair of inerting gates Very fast but
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address
More informationEN164: Design of Computing Systems Lecture 24: Processor / ILP 5
EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationS = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4)
1 Cache Design You have already written your civic registration number (personnummer) on the cover page in the format YyMmDd-XXXX. Use the following formulas to calculate the parameters of your caches:
More informationLecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )
Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationIntroduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationIntroduction to cache memories
Course on: Advanced Computer Architectures Introduction to cache memories Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Summary Summary Main goal Spatial and temporal
More informationLecture 20: Multi-Cache Designs. Spring 2018 Jason Tang
Lecture 20: Multi-Cache Designs Spring 2018 Jason Tang 1 Topics Split caches Multi-level caches Multiprocessor caches 2 3 Cs of Memory Behaviors Classify all cache misses as: Compulsory Miss (also cold-start
More informationChapter 6. Parallel Processors from Client to Cloud Part 2 COMPUTER ORGANIZATION AND DESIGN. Homogeneous & Heterogeneous Multicore Architectures
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Part 2 Homogeneous & Heterogeneous Multicore Architectures Intel XEON 22nm
More informationShared Symmetric Memory Systems
Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationAdvanced Memory Organizations
CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More informationCS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationEECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141
EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role
More informationPage 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence
SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it
More informationCaches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first
Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer
More informationAgenda. System Performance Scaling of IBM POWER6 TM Based Servers
System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Daniele Spampinato & Alen Stojanov Left alignment Attractive font (sans serif, avoid Arial) Calibri,
More informationMemory systems. Memory technology. Memory technology Memory hierarchy Virtual memory
Memory systems Memory technology Memory hierarchy Virtual memory Memory technology DRAM Dynamic Random Access Memory bits are represented by an electric charge in a small capacitor charge leaks away, need
More informationLecture-22 (Cache Coherence Protocols) CS422-Spring
Lecture-22 (Cache Coherence Protocols) CS422-Spring 2018 Biswa@CSE-IITK Single Core Core 0 Private L1 Cache Bus (Packet Scheduling) Private L2 DRAM CS422: Spring 2018 Biswabandan Panda, CSE@IITK 2 Multicore
More informationLecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform
More informations complement 1-bit Booth s 2-bit Booth s
ECE/CS 552 : Introduction to Computer Architecture FINAL EXAM May 12th, 2002 NAME: This exam is to be done individually. Total 6 Questions, 100 points Show all your work to receive partial credit for incorrect
More informationLecture-14 (Memory Hierarchy) CS422-Spring
Lecture-14 (Memory Hierarchy) CS422-Spring 2018 Biswa@CSE-IITK The Ideal World Instruction Supply Pipeline (Instruction execution) Data Supply - Zero-cycle latency - Infinite capacity - Zero cost - Perfect
More informationToday. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture
More informationCommunications and Computer Engineering II: Lecturer : Tsuyoshi Isshiki
Communications and Computer Engineering II: Microprocessor 2: Processor Micro-Architecture Lecturer : Tsuyoshi Isshiki Dept. Communications and Computer Engineering, Tokyo Institute of Technology isshiki@ict.e.titech.ac.jp
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More information3Introduction. Memory Hierarchy. Chapter 2. Memory Hierarchy Design. Computer Architecture A Quantitative Approach, Fifth Edition
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 11: Reducing Hit Time Cache Coherence Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer Science Source: Lecture based
More informationShared Memory Multiprocessors
Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O
More informationComputer Architecture CS372 Exam 3
Name: Computer Architecture CS372 Exam 3 This exam has 7 pages. Please make sure you have all of them. Write your name on this page and initials on every other page now. You may only use the green card
More informationHigh performance computing. Memory
High performance computing Memory Performance of the computations For many programs, performance of the calculations can be considered as the retrievability from memory and processing by processor In fact
More information