Master Informatics Eng.
|
|
- Calvin Greene
- 5 years ago
- Views:
Transcription
1 Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8
2 AJProença, Advanced Architectures, MiEI, UMinho, 207/8 2
3 Goals of the Roofline Model AJProença, Advanced Architectures, MiEI, UMinho, 207/8 3
4 Performance Limiting Factors AJProença, Advanced Architectures, MiEI, UMinho, 207/8 4
5 Roofline Performance Model n n Basic idea: n n Plot peak floating-point throughput as a function of arithmetic intensity Ties together floating-point performance and memory performance for a target machine Arithmetic intensity n Floating-point operations per byte read SIMD Instruction Set Extensions for Multimedia Copyright 202, Elsevier Inc. All rights reserved. 5
6 AJProença, Advanced Architectures, MiEI, UMinho, 207/8 6
7 ... but we could also consider SP or int AJProença, Advanced Architectures, MiEI, UMinho, 207/8 7
8 AJProença, Advanced Architectures, MiEI, UMinho, 207/8 8
9 3Cs model for caches AJProença, Advanced Architectures, MiEI, UMinho, 207/8 9
10 Three Classes of Locality v Temporal Locality reusing data (either registers or cache lines) multiple times amortizes the impact of limited bandwidth. transform loops or algorithms to maximize reuse. v Spatial Locality data is transferred from cache to registers in words. However, data is transferred to the cache in 64-28Byte lines using every word in a line maximizes spatial locality. transform data structures into structure of arrays (SoA) layout v Sequential Locality Many memory address patterns access cache lines sequentially. CPU s hardware stream prefetchers exploit this observation to hide speculatively load data to memory latency. Transform loops to generate (a few) long, unit-stride accesses. LAWRENCE BERKELEY NATIONAL LABORATORY 0
11 Preliminary notes in the Roofline Model goal: integrate in-core performance, memory bandwidth, and locality into a single readily understandable performance figure graphically show the penalty associated with not including certain software optimizations Roofline model will be unique to each architecture AJProença, Advanced Architectures, MiEI, UMinho, 207/8
12 Key elements in the Roofline Model x-axis: the operational intensity, operations per byte of RAM traffic, Flops/byte (traffic between caches and memory) y-axis: the attainable floating-point performance, GFlops/sec includes both peak processor/memory performance peak processor FP performance: a horizontal line computed from the processor specs peak memory performance: bounds the max FP performance of the memory system for a given operational intensity for each kernel: its performance is a point on a vertical line that crosses the x-axis on the kernel operational intensity AJProença, Advanced Architectures, MiEI, UMinho, 207/8 2
13 Arithmetic Intensity O( ) O( log(n) ) O( N ) A r i t h m e t i c I n t e n s i t y SpMV, BLAS,2 Stencils (PDEs) Lattice Methods FFTs Dense Linear Algebra (BLAS3) Particle Methods v v v v v True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality) Others have constant intensity Arithmetic intensity is ultimately limited by compulsory traffic Arithmetic intensity is diminished by conflict or capacity misses. LAWRENCE BERKELEY NATIONAL LABORATORY 3
14 NUMA v Recent multicore SMPs have integrated the memory controllers on chip. v As a result, memory-access is non-uniform (NUMA) v That is, the bandwidth to read a given address varies dramatically among between cores v Exploit NUMA (affinity+first touch) when you malloc/init data. v Concept is similar to data decomposition for distributed memory Core Core 4MB L2 FSB Core Core 4MB L2 Core Core 4MB L2 FSB Core 0.66 GB/s 0.66 GB/s Core 4MB L2 Opteron Opteron Opteron Opteron 52K 52K 52K 52K 2MB victim SRI / xbar HyperTransport 4GB/s (each direction) HyperTransport Opteron Opteron Opteron Opteron 52K 52K 52K 52K 2MB victim SRI / xbar MCH (4x64b controllers) 2.33 GB/s 0.66 GB/s 8 x 667MHz FBDIMMs 2x64b controllers 0.66GB/s 667MHz DDR2 DIMMs 2x64b controllers 0.66GB/s 667MHz DDR2 DIMMs LAWRENCE BERKELEY NATIONAL LABORATORY 4
15 Additional notes Memory bandwidth # s collected via micro benchmarks (or the STREAM benchmark) Computation # s derived from optimization manuals (pencil and paper) Assume complete overlap of either communication or computation => AJProença, Advanced Architectures, MiEI, UMinho, 207/8 5
16 AJProença, Advanced Architectures, MiEI, UMinho, 207/8 6
17 Example v Consider the Opteron 2356: Dual Socket (NUMA) limited HW stream prefetchers quad-core (8 total) 2.3GHz 2-way SIMD (DP) separate FPMUL and FPADD datapaths 4-cycle FP latency Opteron 52K Opteron 52K Opteron 52K Opteron 52K 2MB victim SRI / xbar 2x64b controllers 0.66GB/s HyperTransport 4GB/s (each direction) HyperTransport Opteron 52K Opteron 52K Opteron 52K 2x64b controllers 0.66GB/s Opteron 52K 2MB victim SRI / xbar 667MHz DDR2 DIMMs 667MHz DDR2 DIMMs v Assuming expression of parallelism is the challenge on this architecture, what would the roofline model look like? LAWRENCE BERKELEY NATIONAL LABORATORY 7
18 attainable GFLOP/s Roofline Model Basic Concept Opteron 2356 (Barcelona) peak DP v Plot on log-log scale v Given AI, we can easily bound performance v But architectures are much more complicated v We will bound performance as we eliminate specific forms of in-core parallelism 0.5 / 8 / 4 / actual FLOP:Byte ratio LAWRENCE BERKELEY NATIONAL LABORATORY 8
19 attainable GFLOP/s Roofline Model computational ceilings Opteron 2356 (Barcelona) peak DP mul / add imbalance v Opterons have dedicated multipliers and adders. v If the code is dominated by adds, then attainable performance is half of peak. v We call these Ceilings v They act like constraints on performance 0.5 / 8 / 4 / actual FLOP:Byte ratio LAWRENCE BERKELEY NATIONAL LABORATORY 9
20 attainable GFLOP/s Roofline Model computational ceilings Opteron 2356 (Barcelona) peak DP mul / add imbalance w/out SIMD v Opterons have 28-bit datapaths. v If instructions aren t SIMDized, attainable performance will be halved 0.5 / 8 / 4 / actual FLOP:Byte ratio LAWRENCE BERKELEY NATIONAL LABORATORY 20
21 attainable GFLOP/s Roofline Model computational ceilings Opteron 2356 (Barcelona) peak DP mul / add imbalance w/out SIMD w/out ILP v On Opterons, floating-point instructions have a 4 cycle latency. v If we don t express 4-way ILP, performance will drop by as much as 4x 0.5 / 8 / 4 / actual FLOP:Byte ratio LAWRENCE BERKELEY NATIONAL LABORATORY 2
22 attainable GFLOP/s Roofline Model communication ceilings Opteron 2356 (Barcelona) peak DP v We can perform a similar exercise taking away parallelism from the memory subsystem 0.5 / 8 / 4 / actual FLOP:Byte ratio LAWRENCE BERKELEY NATIONAL LABORATORY 22
23 attainable GFLOP/s Roofline Model communication ceilings Opteron 2356 (Barcelona) peak DP v Explicit software prefetch instructions are required to achieve peak bandwidth 0.5 / 8 / 4 / actual FLOP:Byte ratio LAWRENCE BERKELEY NATIONAL LABORATORY 23
24 attainable GFLOP/s Roofline Model communication ceilings Opteron 2356 (Barcelona) peak DP v Opterons are NUMA v As such memory traffic must be correctly balanced among the two sockets to achieve good Stream bandwidth. v We could continue this by examining strided or random memory access patterns 0.5 / 8 / 4 / actual FLOP:Byte ratio LAWRENCE BERKELEY NATIONAL LABORATORY 24
25 attainable GFLOP/s Roofline Model computation + communication ceilings Opteron 2356 (Barcelona) peak DP mul / add imbalance w/out SIMD w/out ILP v We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth. 0.5 / 8 / 4 / actual FLOP:Byte ratio LAWRENCE BERKELEY NATIONAL LABORATORY 25
26 LAWRENCE BERKELEY NATIONAL LABORATORY 26
27 attainable GFLOP/s Roofline Model locality walls / 8 Opteron 2356 (Barcelona) peak DP mul / add imbalance only compulsory miss traffic w/out SIMD w/out ILP / 4 / actual FLOP:Byte ratio LAWRENCE BERKELEY NATIONAL LABORATORY v Remember, memory traffic includes more than just compulsory misses. v As such, actual arithmetic intensity may be substantially lower. v Walls are unique to the architecture-kernel combination AI = FLOPs Compulsory Misses 27
28 attainable GFLOP/s Roofline Model locality walls / 8 Opteron 2356 (Barcelona) mul / add imbalance +write allocation traffic peak DP only compulsory miss traffic w/out SIMD w/out ILP / 4 / actual FLOP:Byte ratio AI = LAWRENCE BERKELEY NATIONAL LABORATORY v Remember, memory traffic includes more than just compulsory misses. v As such, actual arithmetic intensity may be substantially lower. v Walls are unique to the architecture-kernel combination FLOPs Allocations + Compulsory Misses 28
29 attainable GFLOP/s Roofline Model locality walls / 8 Opteron 2356 (Barcelona) +capacity miss traffic mul / add imbalance +write allocation traffic peak DP only compulsory miss traffic w/out SIMD w/out ILP / 4 / actual FLOP:Byte ratio AI = LAWRENCE BERKELEY NATIONAL LABORATORY v Remember, memory traffic includes more than just compulsory misses. v As such, actual arithmetic intensity may be substantially lower. v Walls are unique to the architecture-kernel combination FLOPs Capacity + Allocations + Compulsory 29
30 attainable GFLOP/s Roofline Model locality walls / 8 Opteron 2356 (Barcelona) +conflict miss traffic +capacity miss traffic mul / add imbalance +write allocation traffic peak DP only compulsory miss traffic w/out SIMD w/out ILP / 4 / actual FLOP:Byte ratio LAWRENCE BERKELEY NATIONAL LABORATORY v Remember, memory traffic includes more than just compulsory misses. v As such, actual arithmetic intensity may be substantially lower. v Walls are unique to the architecture-kernel combination FLOPs AI = Conflict + Capacity + Allocations + Compulsory 30
31 attainable GFLOP/s Roofline Model locality walls / 8 Opteron 2356 (Barcelona) +conflict miss traffic +capacity miss traffic mul / add imbalance +write allocation traffic peak DP only compulsory miss traffic w/out SIMD w/out ILP / 4 / actual FLOP:Byte ratio v Optimizations remove these walls and ceilings which act to constrain performance. LAWRENCE BERKELEY NATIONAL LABORATORY 3
32 Optimization Categorization Maximizing In-core Performance Exploit in-core parallelism (ILP, DLP, etc ) Good (enough) floating-point balance? reorder unroll?& jam eliminate?? explicit branches SIMD Maximizing Memory Bandwidth Exploit NUMA Hide memory latency Satisfy Little s Law memory?? affinity? DMA lists unit-stride? streams SW prefetch? TLB blocking Minimizing Memory Traffic Eliminate: Capacity misses Conflict misses Compulsory misses Write allocate behavior cache blocking???? array padding compress data streaming stores LAWRENCE BERKELEY NATIONAL LABORATORY 32
33 No overlap of communication and computation v Previously, we assumed perfect overlap of communication or computation. v What happens if there is a dependency (either inherent or by a lack of optimization) that serializes communication and computation? Byte s / STREAM Bandwidth Byte s / STREAM Bandwidth Flop s / Flop/s Flop s / Flop/s time v Time is the sum of communication time and computation time. v The result is that flop/s grows asymptotically. LAWRENCE BERKELEY NATIONAL LABORATORY 33
34 attainable GFLOP/s No overlap of communication and computation Generic Machine Stream Stream Bandwidth Bandwidth w/out w/out NUMA NUMA peak DP w/out ILP / 8 / 4 / actual flop:byte ratio v Consider a generic machine v If we can perfectly decouple and overlap communication with computation, the roofline is sharp/angular. v However, without overlap, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2x. LAWRENCE BERKELEY NATIONAL LABORATORY 34
35 Alternate Bandwidths v Thus far, we assumed a synergy between streaming applications and bandwidth (proxied by the STREAM benchmark) v STREAM is NOT a good proxy for short stanza/random cacheline access patterns as memory latency (instead of just bandwidth) is being exposed. v Thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling) v Similarly, if data is primarily local in the LLC cache, one should construct rooflines based on LLC bandwidth and flop:llc byte ratios. v For GPUs/accelerators, PCIe bandwidth can be an impediment. Thus one can construct a roofline model based on PCIe bandwidth and the flop:pcie byte ratio. LAWRENCE BERKELEY NATIONAL LABORATORY 35
36 AJProença, Advanced Architectures, MiEI, UMinho, 207/8 36
37 AJProença, Advanced Architectures, MiEI, UMinho, 207/8 37
38 Some more examples AJProença, Advanced Architectures, MiEI, UMinho, 207/8 38
39 Some more examples AJProença, Advanced Architectures, MiEI, UMinho, 207/8 39
40 Some more examples AJProença, Advanced Architectures, MiEI, UMinho, 207/8 40
41 Alternate Computations v Arising from HPC kernels, its no surprise roofline use DP Flop/s. v Of course, it could use SP flop/s, integer ops, bit operations, pairwise comparisons (sorting), graphics operations, etc LAWRENCE BERKELEY NATIONAL LABORATORY 4
The Roofline Model: EECS. A pedagogical tool for program analysis and optimization
P A R A L L E L C O M P U T I N G L A B O R A T O R Y The Roofline Model: A pedagogical tool for program analysis and optimization Samuel Williams,, David Patterson, Leonid Oliker,, John Shalf, Katherine
More informationLawrence Berkeley National Laboratory Recent Work
Lawrence Berkeley National Laboratory Recent Work Title The roofline model: A pedagogical tool for program analysis and optimization Permalink https://escholarship.org/uc/item/3qf33m0 ISBN 9767379 Authors
More informationThe Roofline Model: A pedagogical tool for program analysis and optimization
P A R A L L E L C O M P U T I N G L A B O R A T O R Y The Roofline Model: A pedagogical tool for program analysis and optimization ParLab Summer Retreat Samuel Williams, David Patterson samw@cs.berkeley.edu
More informationSCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING
2/20/13 CS 594: SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING Heike McCraw mccraw@icl.utk.edu 1. Basic Essentials OUTLINE Abstract architecture model Communication, Computation, and Locality
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationInstructor: Leopold Grinberg
Part 1 : Roofline Model Instructor: Leopold Grinberg IBM, T.J. Watson Research Center, USA e-mail: leopoldgrinberg@us.ibm.com 1 ICSC 2014, Shanghai, China The Roofline Model DATA CALCULATIONS (+, -, /,
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationPARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort
PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort rob@cs.vu.nl Schedule 2 1. Introduction, performance metrics & analysis 2. Many-core hardware 3. Cuda class 1: basics 4. Cuda class
More informationPERI - Auto-tuning memory-intensive kernels for multicore
PERI - Auto-tuning memory-intensive kernels for multicore Samuel Williams 1,2, Kaushik Datta 2, Jonathan Carter 1, Leonid Oliker 1,2, John Shalf 1, Katherine Yelick 1,2, David Bailey 1 1 CRD/NERSC, Lawrence
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory
More informationStudy of Performance Issues on a SMP-NUMA System using the Roofline Model
Study of Performance Issues on a SMP-NUMA System using the Roofline Model Juan A. Lorenzo 1, Juan C. Pichel 1, Tomás F. Pena 1, Marcos Suárez 2 and Francisco F. Rivera 1 1 Computer Architecture Group,
More informationModern CPU Architectures
Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes
More informationThe ECM (Execution-Cache-Memory) Performance Model
The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited Loop Kernels. Proceedings of the Workshop Memory issues on Multi- and Manycore
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationMulti-core processors are here, but how do you resolve data bottlenecks in native code?
Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session
More informationPerformance of Variant Memory Configurations for Cray XT Systems
Performance of Variant Memory Configurations for Cray XT Systems Wayne Joubert, Oak Ridge National Laboratory ABSTRACT: In late 29 NICS will upgrade its 832 socket Cray XT from Barcelona (4 cores/socket)
More informationCS560 Lecture Parallel Architecture 1
Parallel Architecture Announcements The RamCT merge is done! Please repost introductions. Manaf s office hours HW0 is due tomorrow night, please try RamCT submission HW1 has been posted Today Isoefficiency
More informationMulti-core is Here! But How Do You Resolve Data Bottlenecks in PC Games? hint: it s all about locality
Multi-core is Here! But How Do You Resolve Data Bottlenecks in PC Games? hint: it s all about locality Michael Wall Principal Member of Technical Staff, Advanced Micro Devices, Inc. Game Developers Conference
More informationEECS Electrical Engineering and Computer Sciences
P A R A L L E L C O M P U T I N G L A B O R A T O R Y Auto-tuning Stencil Codes for Cache-Based Multicore Platforms Kaushik Datta Dissertation Talk December 4, 2009 Motivation Multicore revolution has
More informationBandwidth Avoiding Stencil Computations
Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu
More informationIntel Architecture for HPC
Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationMemory Hierarchy. Slides contents from:
Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory
More informationWhy GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)
Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationVisualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Intel Advisor for vectorization
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationXT Node Architecture
XT Node Architecture Let s Review: Dual Core v. Quad Core Core Dual Core 2.6Ghz clock frequency SSE SIMD FPU (2flops/cycle = 5.2GF peak) Cache Hierarchy L1 Dcache/Icache: 64k/core L2 D/I cache: 1M/core
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationVisualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference London, 2017 Agenda Vectorization is becoming more and more important What is
More informationChapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture
An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationCS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationApplication Performance on Dual Processor Cluster Nodes
Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationSubmission instructions (read carefully): SS17 / Assignment 4 Instructor: Markus Püschel. ETH Zurich
263-2300-00: How To Write Fast Numerical Code Assignment 4: 120 points Due Date: Th, April 13th, 17:00 http://www.inf.ethz.ch/personal/markusp/teaching/263-2300-eth-spring17/course.html Questions: fastcode@lists.inf.ethz.ch
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Daniele Spampinato & Alen Stojanov Left alignment Attractive font (sans serif, avoid Arial) Calibri,
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationMulticore Scaling: The ECM Model
Multicore Scaling: The ECM Model Single-core performance prediction The saturation point Stencil code examples: 2D Jacobi in L1 and L2 cache 3D Jacobi in memory 3D long-range stencil G. Hager, J. Treibig,
More informationCOMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory
COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:
The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention
More informationLecture 2: Single processor architecture and memory
Lecture 2: Single processor architecture and memory David Bindel 30 Aug 2011 Teaser What will this plot look like? for n = 100:10:1000 tic; A = []; for i = 1:n A(i,i) = 1; end times(n) = toc; end ns =
More information1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core
1. Many Core vs Multi Core 2. Performance Optimization Concepts for Many Core 3. Performance Optimization Strategy for Many Core 4. Example Case Studies NERSC s Cori will begin to transition the workload
More informationHow HPC Hardware and Software are Evolving Towards Exascale
How HPC Hardware and Software are Evolving Towards Exascale Kathy Yelick Associate Laboratory Director and NERSC Director Lawrence Berkeley National Laboratory EECS Professor, UC Berkeley NERSC Overview
More informationPerformance of the AMD Opteron LS21 for IBM BladeCenter
August 26 Performance Analysis Performance of the AMD Opteron LS21 for IBM BladeCenter Douglas M. Pase and Matthew A. Eckl IBM Systems and Technology Group Page 2 Abstract In this paper we examine the
More informationCOMPUTER ORGANIZATION AND DESIGN
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationEvaluation of Intel Memory Drive Technology Performance for Scientific Applications
Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Vladimir Mironov, Andrey Kudryavtsev, Yuri Alexeev, Alexander Moskovsky, Igor Kulikov, and Igor Chernykh Introducing
More informationKirill Rogozhin. Intel
Kirill Rogozhin Intel From Old HPC principle to modern performance model Old HPC principles: 1. Balance principle (e.g. Kung 1986) hw and software parameters altogether 2. Compute Density, intensity, machine
More informationMake sure that your exam is not missing any sheets, then write your full name and login ID on the front.
ETH login ID: (Please print in capital letters) Full name: 63-300: How to Write Fast Numerical Code ETH Computer Science, Spring 015 Midterm Exam Wednesday, April 15, 015 Instructions Make sure that your
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationAlgorithms of Scientific Computing
Algorithms of Scientific Computing Fast Fourier Transform (FFT) Michael Bader Technical University of Munich Summer 2018 The Pair DFT/IDFT as Matrix-Vector Product DFT and IDFT may be computed in the form
More informationMulticore Programming
Multi Programming Parallel Hardware and Performance 8 Nov 00 (Part ) Peter Sewell Jaroslav Ševčík Tim Harris Merge sort 6MB input (-bit integers) Recurse(left) ~98% execution time Recurse(right) Merge
More informationMemory hierarchy review. ECE 154B Dmitri Strukov
Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Six basic optimizations Virtual memory Cache performance Opteron example Processor-DRAM gap in latency Q1. How to deal
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationChapter 7. Multicores, Multiprocessors, and Clusters
Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)
More informationChapter 6. Parallel Processors from Client to Cloud Part 2 COMPUTER ORGANIZATION AND DESIGN. Homogeneous & Heterogeneous Multicore Architectures
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Part 2 Homogeneous & Heterogeneous Multicore Architectures Intel XEON 22nm
More informationRoofline Model (Will be using this in HW2)
Parallel Architecture Announcements HW0 is due Friday night, thank you for those who have already submitted HW1 is due Wednesday night Today Computing operational intensity Dwarves and Motifs Stencil computation
More informationKNL tools. Dr. Fabio Baruffa
KNL tools Dr. Fabio Baruffa fabio.baruffa@lrz.de 2 Which tool do I use? A roadmap to optimization We will focus on tools developed by Intel, available to users of the LRZ systems. Again, we will skip the
More informationOptimization Techniques for Parallel Code 2. Introduction to GPU architecture
Optimization Techniques for Parallel Code 2. Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2017 What
More informationAdaptive MPI Multirail Tuning for Non-Uniform Input/Output Access
Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access S. Moreaud, B. Goglin and R. Namyst INRIA Runtime team-project University of Bordeaux, France Context Multicore architectures everywhere
More informationImplicit and Explicit Optimizations for Stencil Computations
Implicit and Explicit Optimizations for Stencil Computations By Shoaib Kamil 1,2, Kaushik Datta 1, Samuel Williams 1,2, Leonid Oliker 2, John Shalf 2 and Katherine A. Yelick 1,2 1 BeBOP Project, U.C. Berkeley
More informationUsing the Roofline Model and Intel Advisor
Using the Roofline Model and Intel Advisor Samuel Williams SWWilliams@lbl.gov Computational Research Division Lawrence Berkeley National Lab Tuomas Koskela TKoskela@lbl.gov NERSC Lawrence Berkeley National
More informationJackson Marusarz Intel
Jackson Marusarz Intel Agenda Motivation Threading Advisor Threading Advisor Workflow Advisor Interface Survey Report Annotations Suitability Analysis Dependencies Analysis Vectorization Advisor & Roofline
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationOptimizing Sparse Data Structures for Matrix-Vector Multiply
Summary Optimizing Sparse Data Structures for Matrix-Vector Multiply William Gropp (UIUC) and Dahai Guo (NCSA) Algorithms and Data Structures need to take memory prefetch hardware into account This talk
More informationCache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance
6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,
More informationGPU programming: Code optimization part 1. Sylvain Collange Inria Rennes Bretagne Atlantique
GPU programming: Code optimization part 1 Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Outline Analytical performance modeling Optimizing host-device data transfers Optimizing
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationStream Processor Architecture. William J. Dally Stanford University August 22, 2003 Streaming Workshop
Stream Processor Architecture William J. Dally Stanford University August 22, 2003 Streaming Workshop Stream Arch: 1 August 22, 2003 Some Definitions A Stream Program expresses a computation as streams
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationPerformance Analysis and Prediction for distributed homogeneous Clusters
Performance Analysis and Prediction for distributed homogeneous Clusters Heinz Kredel, Hans-Günther Kruse, Sabine Richling, Erich Strohmaier IT-Center, University of Mannheim, Germany IT-Center, University
More informationMemories. CPE480/CS480/EE480, Spring Hank Dietz.
Memories CPE480/CS480/EE480, Spring 2018 Hank Dietz http://aggregate.org/ee480 What we want, what we have What we want: Unlimited memory space Fast, constant, access time (UMA: Uniform Memory Access) What
More informationCS 6354: Memory Hierarchy I. 29 August 2016
1 CS 6354: Memory Hierarchy I 29 August 2016 Survey results 2 Processor/Memory Gap Figure 2.2 Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time
More informationCS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationAlgorithms and Architecture. William D. Gropp Mathematics and Computer Science
Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?
More informationFast-multipole algorithms moving to Exascale
Numerical Algorithms for Extreme Computing Architectures Software Institute for Methodologies and Abstractions for Codes SIMAC 3 Fast-multipole algorithms moving to Exascale Lorena A. Barba The George
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationA Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures
A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative
More informationTechniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company
Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor David Johnson Systems Technology Division Hewlett-Packard Company Presentation Overview PA-8500 Overview uction Fetch Capabilities
More informationChapter-5 Memory Hierarchy Design
Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationBOPS, Not FLOPS! A New Metric, Measuring Tool, and Roofline Performance Model For Datacenter Computing. Chen Zheng ICT,CAS
BOPS, Not FLOPS! A New Metric, Measuring Tool, and Roofline Performance Model For Datacenter Computing Chen Zheng ICT,CAS Data Center Computing (DC ) HPC only takes 20% market share Big Data, AI, Internet
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationMulti-Level Cache Hierarchy Evaluation for Programmable Media Processors. Overview
Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors Jason Fritts Assistant Professor Department of Computer Science Co-Author: Prof. Wayne Wolf Overview Why Programmable Media Processors?
More informationCS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook
CS356: Discussion #9 Memory Hierarchy and Caches Marco Paolieri (paolieri@usc.edu) Illustrations from CS:APP3e textbook The Memory Hierarchy So far... We modeled the memory system as an abstract array
More information