Modern CPU Architectures
1 Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH Johannes Kepler University Linz
2 Motivation for Parallelism I CPU History
3 Motivation for Parallelism II CPU Scaling Moore's Law (doubling of transistor density every 2 years) Dennard Scaling made Moore's Law work: smaller transistors switch faster at constant power density
4 Motivation for Parallelism III Limits of Serial Code Execution End of classical Dennard Scaling: 1. No t_ox scaling 2. No voltage scaling 3. No delay time scaling 4. Leakage P = C_EFF * V_DD^2 * f + I_LEAK * V_DD Power wall because of critical heat dissipation Frequency wall because of gate delay time
5 Levels of Parallelism I Code Granularity Task level parallelism (TLP): large grain, tasks communicating via messages Control level parallelism (CLP): medium grain, independent functions Data level parallelism (DLP): fine grain, operations on array elements Instruction level parallelism (ILP): very fine grain, individual instructions
6 Levels of Parallelism II Programmer's Control TLP: programmer (MPI, PVM, ...) CLP: programmer (OpenMP, Pthreads, ...) DLP: programmer (vector intrinsics) or automatic (compiler) ILP: automatic (CPU)
7 Levels of Parallelism III Milestones of Parallelization
8 Instruction Level Parallelism I Superscalar Pipeline Also known as multiple issue Implemented in hardware via a superscalar CPU (parallel pipelines) Each instruction passes through the stages IF, ID, EX, WB; with multiple parallel pipelines, several instructions occupy each stage per cycle (fully utilized pipeline)
9 Instruction Level Parallelism II Out of Order Execution Input: serial instruction stream Identification and execution of parallel executable instructions Pipeline hazards: Structural hazards (resource conflicts) Data hazards (data dependencies) Control hazards (conditional and unconditional jumps; mitigated by branch prediction) ILP wall: further effort for parallelization leads to little increase in resource utilization
10 Instruction Level Parallelism III Sandy Bridge Core Pipeline In-order front end: instruction fetch with branch prediction from the L1 instruction cache (32 kB), pre-decode, instruction queue, four decoders, and a uop cache, delivering 4 uops/cycle Out-of-order engine: rename/allocate/retire (4 uops/cycle issued), reorder buffer, load and store buffers, and a reservation station dispatching 6 uops/cycle to ports 0-5 (ALUs, VI MUL/ADD, SSE MUL/ADD, DIV, AVX FP MUL/ADD, shuffle/boolean/blend, JMP, load units, store address and store data units) Memory: L1 data cache (32 kB, 48 bytes/cycle) and L2 cache (256 kB)
11 Data Level Parallelism I Basics The same instruction is executed simultaneously on N > 1 elements of a vector Scalar processing: one operation per pair of scalar operands (S1 + S2 = S3) Vector processing: one operation per pair of operand vectors, applied element-wise (V1 + V2 = V3)
12 Data Level Parallelism II History
13 Control Level Parallelism I Simultaneous Multi Threading (SMT) Also known as Hyper Threading (HT) or hardware threading Multiple logical processors per core Each logical processor has its own administrative logic and is fed an independent serial instruction stream The streams are mapped onto ILP, increasing resource utilization in the superscalar CPU
14 Control Level Parallelism II Simultaneous Multi Threading (SMT) Diagram: execution unit utilization per cycle without and with SMT Note: each box represents a processor execution unit
15 Control Level Parallelism III Sandy Bridge Multi-Core CPU (Intel Xeon E5-2600) 2 QPI links at 8.0 GT/s Ring bus: 32 bytes wide, bandwidth 96 GB/s, 1 cycle per hop Integrated memory controller (IMC): 4 memory channels of 1600 MHz DDR3, bandwidth 4 * 1600 MHz * 8 Byte = 51.2 GB/s, latency 60 ns
16 Control Level Parallelism IV Multi-Socket CPUs Sandy Bridge sockets are connected via QPI Remote memory access has higher latency and lower bandwidth than local access QPI bandwidth: 8.0 GT/s * 2 Byte = 16.0 GB/s
17 Task Level Parallelism Intra-node: simultaneous multi threading, multi core, multi socket; shared memory model Inter-node: different compute nodes connected via a fast interconnect (e.g. InfiniBand) with different connection topologies; distributed memory model
18 Caching
19 Motivation for Caching Moore's Law: CPU performance doubles roughly every two years Memory bandwidth does not keep up (doubles roughly every five years) Memory speed gap (memory wall) Year 1980: CPU and memory cycle both about 1 us Year 2000: CPU cycle 1 ns, memory cycle 100 ns
20 Principles of Locality Every program exhibits some degree of locality: it reuses recently accessed data and instructions Two kinds of locality Temporal locality: a recently accessed object is used again in the near future, e.g. x is read, then x is read or written again soon Spatial locality: neighbouring objects are accessed at similar times, e.g. y[i] is read, then y[i+1] is read soon
21 Cache Memory A cache sits between processor and main memory and keeps copies of main memory blocks (data and instructions) Access time is much faster (1 ns vs. 100 ns) Much smaller in size than main memory (cost)
22 Cache Design Cache block size: e.g. 64 bytes; a cache block is usually called a cache line Cache design criteria When to cache? Where to cache? How to find a cache block again? Which cache block is replaced (after a miss)? What to do in case of writes? Methods must be simple (implemented in hardware)
23 Cache Design When to Cache? Read access: always cached; when a memory location is read and no copy is in the cache (read miss), the block is loaded into the cache Write access: not applicable for instruction caches; different strategies (see later on)
24 Cache Design Where to Cache? Direct mapped cache: the full address is split into a block index (m bits) and a block offset (n bits); each block maps to exactly one cache location Associative cache (e.g. 64-byte lines): the full address is split into a set index (m bits) and a block offset (n bits); a block may be placed in any way of its set
25 Cache Design How to Find a Cache Block Again? Load address and find the set for the address Information per cache block: address tag and valid bit Check for a matching address tag in the valid blocks Full address = address tag + set index (m bits) + block offset (n bits)
26 Cache Design Which Cache Block Is Replaced (After a Miss)? Direct mapped cache: no choice, replacement of the indexed block Associative cache: random replacement, or least recently used (LRU), which is better but more difficult to implement
27 Cache Design What to Do in Case of Writes? Strategy write through: write data directly to main memory (no caching); used for data I/O and some multiprocessor cache coherency implementations; the CPU avoids write latencies via a write buffer Strategy write back: write data only to the cache; on cache block replacement the data is written from cache to main memory (tracked via a dirty bit); reduced memory traffic, but more difficult to implement than write through
28 Cache Memory Access Costs Average memory access time: avg memory access time = hit time + miss ratio * miss time Hit time: time to load data from cache to CPU Miss ratio: proportion of accesses which cause a miss Miss time: time to load data from main memory to cache Goal: optimize all three parameters
29 The Three Reasons for Cache Misses (3 Cs) Compulsory (or cold start): first access to data at an address Capacity: the cache is not big enough to hold all data Conflict: too many memory addresses map to the same set index of the cache
30 Memory Hierarchy Speed (and cost) vs. capacity: CPU registers: 1 cycle, ~1 kB L1 cache: 2-3 cycles, ~100 kB L2 cache: ~20 cycles, ~1-10 MB L3 cache: ~50 cycles, ~10-50 MB Main memory: ~300 cycles, ~1 GB
31 Exercise 1 Cache Hierarchy 1. Look at example_1/cache.c. 2. Compile the code with icc -openmp cache.c. 3. Execute the binary and try to identify the cache boundaries. 4. How big are the different caches?
32 Cache Coherency
33 Shared Memory Model All processors have access to global memory Communication with main memory via reads and writes Caches are automatically kept up to date, i.e. coherent Scaling is difficult because of the memory bottleneck, so the processor count stays low (multi-core CPU)
34 Shared Memory Model Diagram: eight processors (P) on one chip, each with private L1 and L2 caches, sharing an L3 cache connected to memory
35 Cache Coherence In the shared memory model a shared variable has a unique value at any given point in time Caching creates multiple copies of a memory location To avoid diverging values, caches are kept coherent: a write to a memory location must invalidate all copies in other caches
36 Coherence Protocols Bookkeeping of the sharing state of cache blocks: Was the block modified? Is the block stored in more than one cache? Snooping (or broadcast) based protocols: each copy in a cache has a sharing state, there is no centralized state, and each processor sees each request Directory based protocols: bookkeeping of the sharing state is centralized
37 Snooping Based Protocols Usage of the valid tags of cache lines for invalidation, plus an extra tag for the sharing state Usage of the dirty bit of write back caches All processors see all bus transactions Invalidation message: the block in the cache is invalidated Memory read request: if the block is in a cache, that cache provides the data and cancels the memory request Many different implementations
38 Three State Snoopy Protocol MSI Simplest protocol allowing multiple copies Each cache block can have one of the following states: Modified: the only valid copy in any cache, and the value differs from main memory Shared: several valid copies in caches, and the value is identical to main memory Invalid: the copy is not valid and cannot be used
39 MSI Example: R2 P2 wants to read the value while P1 holds it in state S. P2's cache does not have the data, so it places a BusRd to notify other processors and ask for the data. The memory controller provides the data, and P2's copy enters state S.
40 MSI Example: W3 P3 wants to write the value. It places a BusRdX to get exclusive access and the most recent copy of the data. The caches of P1 and P2 see the BusRdX and invalidate their copies (S to I). Because the value is still up-to-date in memory, memory provides the data, and P3's copy enters state M.
41 MSI Example: R2 P2 wants to read the value. P3's cache has the most up-to-date copy (state M) and will provide it. P2's cache puts a BusRd on the bus. P3's cache snoops this and cancels the memory access because it will provide the data. P3's cache flushes the data to the bus; both copies end up in state S.
42 MSI Example: W1 P1 wants to write to its cache. The cache places a BusRdX on the bus to gain exclusive access and the most up-to-date value. Main memory is not stale, so it provides the data. The snoopers of P2 and P3 see the BusRdX and invalidate their copies (S to I); P1's copy enters state M.
43 False Sharing Coherence operations act on whole cache lines A cache line can contain multiple units of a data type (e.g. int) When two processors write to different units in the same cache line, the data values are not really shared: false sharing Each write invalidates the copies in other caches, causing bus traffic and memory accesses A significant performance problem that is difficult to identify
44 Exercise 2 Cache Coherency False Sharing 1. Look at example_2/coherency.c 2. Compile the code with icc -openmp coherency.c location.c 3. Set two execution threads via export OMP_NUM_THREADS=2 4. Use different cores on the same CPU via export KMP_AFFINITY=granularity=fine,proclist=[0,2],explicit 5. Execute the binary. 6. Can you see the drop in performance? 7. What is the cache line size?
45 NUMA (Non Uniform Memory Access)
46 Distributed Shared Memory Model Diagram: processors (P), each with a cache (C) and a local memory (M), connected via an interconnect
47 Directory Based Coherence Scalability: there is no fast bus, so snooping is not possible; hence a directory structure Bit vector for each cache block, one bit per processor (1 = copy in that processor's cache), stored in (distributed) memory Scalability problem: additional memory for the directory, e.g. 128-byte cache blocks and 256 processors give 256 directory bits per 1024 data bits, about 20 % overhead
48 Directory Based Coherence Home node: the node where the memory (and directory entry) is located In principle like snooping based protocols: a cache line has three states (MSI), a directory entry has the states modified, shared, and uncached On a cache miss the home node provides the data, and the directory bits are set accordingly for a read/write miss The directory can invalidate a copy in a remote cache or fetch the data back from a remote cache A cache can write back to the home node
49 NUMA Problem Access to remote memory takes longer than access to local memory The operating system is responsible for allocating pages Common allocation policies are: First touch: allocate the page on the node which makes the first access Round robin: allocate cyclically across the nodes
50 Exercise 3 Memory Bandwidth Saturation & NUMA 1. Look at example_3/cacheomp.c. 2. Compile the file via icc -openmp cacheomp.c location.c 3. Execute the binary. 4. How big is the main memory bandwidth? 5. Activate the parallel initialization of the array in the file cacheomp.c and compile/execute the program again. 6. How big is the improvement in the main memory bandwidth?
51 Roofline Model
52 Roofline Model Basic Idea Computes the attainable peak floating point performance as a function of the arithmetic intensity Correlates peak floating point performance and peak memory bandwidth of a target machine Enables identification of whether a certain algorithm is memory bound (bandwidth) or compute bound, localizing the optimization potential
53 Roofline Model Calculation Peak floating-point performance = #Sockets * #Cores * #Instructions per cycle * Frequency (#Instructions per cycle from DLP and ILP) Peak memory bandwidth = #Sockets * #Channels * Frequency * #Bytes (bandwidth from RAM to LLC) Operational intensity = Flops / Bytes read Differentiation between double precision (DP) and single precision (SP)
54 Roofline Model Example: Matrix Vector Multiplication in DP Matrix: a x b, vector: b x 1 Flops: flops ~ 2 * a * b Bytes read: mem ~ 8 * a * b Operational intensity: i = flops / mem = (2 * a * b) / (8 * a * b) = 1/4
55 Roofline Model Example 2 x Xeon E GHz
56 Roofline Model Example 2 x Xeon E GHz Compute bound
57 Roofline Model Example 2 x Xeon E GHz Memory bound Compute bound
60 Roofline Model Comparison of Different Architectures
61 Thank You! Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn, Valais, Switzerland
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationLecture 20: Multi-Cache Designs. Spring 2018 Jason Tang
Lecture 20: Multi-Cache Designs Spring 2018 Jason Tang 1 Topics Split caches Multi-level caches Multiprocessor caches 2 3 Cs of Memory Behaviors Classify all cache misses as: Compulsory Miss (also cold-start
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 11: Reducing Hit Time Cache Coherence Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer Science Source: Lecture based
More informationEN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)
EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationShared Symmetric Memory Systems
Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationChapter 6. Parallel Processors from Client to Cloud Part 2 COMPUTER ORGANIZATION AND DESIGN. Homogeneous & Heterogeneous Multicore Architectures
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Part 2 Homogeneous & Heterogeneous Multicore Architectures Intel XEON 22nm
More informationIntroducing Multi-core Computing / Hyperthreading
Introducing Multi-core Computing / Hyperthreading Clock Frequency with Time 3/9/2017 2 Why multi-core/hyperthreading? Difficult to make single-core clock frequencies even higher Deeply pipelined circuits:
More informationPage 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence
SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationSRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design
SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationCS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationLecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )
Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationIntroduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico
More informationCSC 631: High-Performance Computer Architecture
CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 10: Memory Part II CSC 631: High-Performance Computer Architecture 1 Two predictable properties of memory references: Temporal Locality:
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationIntroduction to cache memories
Course on: Advanced Computer Architectures Introduction to cache memories Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Summary Summary Main goal Spatial and temporal
More information( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture
( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P
More informationLecture-22 (Cache Coherence Protocols) CS422-Spring
Lecture-22 (Cache Coherence Protocols) CS422-Spring 2018 Biswa@CSE-IITK Single Core Core 0 Private L1 Cache Bus (Packet Scheduling) Private L2 DRAM CS422: Spring 2018 Biswabandan Panda, CSE@IITK 2 Multicore
More informationShared Memory Multiprocessors
Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O
More informationAdvanced Memory Organizations
CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU
More informationHigh performance computing. Memory
High performance computing Memory Performance of the computations For many programs, performance of the calculations can be considered as the retrievability from memory and processing by processor In fact
More informationCache Coherence in Bus-Based Shared Memory Multiprocessors
Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition
More informationMemory Hierarchy. Reading. Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 (2) Lecture notes from MKP, H. H. Lee and S.
Memory Hierarchy Lecture notes from MKP, H. H. Lee and S. Yalamanchili Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 Reading (2) 1 SRAM: Value is stored on a pair of inerting gates Very fast but
More informationSuggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!
1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationEECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141
EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role
More informationComputer Architecture Memory hierarchies and caches
Computer Architecture Memory hierarchies and caches S Coudert and R Pacalet January 23, 2019 Outline Introduction Localities principles Direct-mapped caches Increasing block size Set-associative caches
More informationCaches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first
Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Daniele Spampinato & Alen Stojanov Left alignment Attractive font (sans serif, avoid Arial) Calibri,
More informations complement 1-bit Booth s 2-bit Booth s
ECE/CS 552 : Introduction to Computer Architecture FINAL EXAM May 12th, 2002 NAME: This exam is to be done individually. Total 6 Questions, 100 points Show all your work to receive partial credit for incorrect
More informationMemory systems. Memory technology. Memory technology Memory hierarchy Virtual memory
Memory systems Memory technology Memory hierarchy Virtual memory Memory technology DRAM Dynamic Random Access Memory bits are represented by an electric charge in a small capacitor charge leaks away, need
More informationCommunications and Computer Engineering II: Lecturer : Tsuyoshi Isshiki
Communications and Computer Engineering II: Microprocessor 2: Processor Micro-Architecture Lecturer : Tsuyoshi Isshiki Dept. Communications and Computer Engineering, Tokyo Institute of Technology isshiki@ict.e.titech.ac.jp
More informationEN164: Design of Computing Systems Lecture 24: Processor / ILP 5
EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More information