The ECM (Execution-Cache-Memory) Performance Model
1 The ECM (Execution-Cache-Memory) Performance Model

J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited Loop Kernels. Proceedings of the Workshop "Memory issues on Multi- and Manycore Platforms" at PPAM 2009, the 8th International Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, September 13-16, 2009. Lecture Notes in Computer Science Volume 6067, 2010.

G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience (2013). DOI: 10.1002/cpe.3180. Preprint on arXiv.

H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Submitted. Preprint on arXiv.
2 Assumptions and shortcomings of the Roofline model

Assumes one of two bottlenecks:
1. In-core execution
2. Bandwidth of a single memory hierarchy level

Latency effects are not modeled; pure data streaming is assumed. In-core execution is sometimes hard to model (e.g., for A(:)=B(:)+C(:)*D(:)). Saturation effects on multicore chips are not explained: Roofline predicts the full socket bandwidth, while the ECM model gives more insight.
3 The Execution-Cache-Memory (ECM) model
4 ECM Model

ECM = Execution-Cache-Memory. Observations:
- Single-core execution time is not the maximum of (1) in-core execution and (2) data transfers through a single bottleneck.
- Data transfers may or may not overlap with each other or with in-core execution.
- Scaling is linear until the relevant bottleneck is reached.

Input to the ECM model: same as for Roofline, plus the data transfer times in the memory hierarchy.
5 Example: Schönauer Vector Triad in L2 cache

REPEAT[ A(:) = B(:) + C(:) * D(:) ] in double precision. Analysis for a Sandy Bridge core with AVX; unit of work: 1 cache line = 16 flops per CL.

Machine characteristics: registers↔L1: 2 x 16-byte LOADs and 1 x 16-byte STORE per cycle; arithmetic: 1 ADD/cy + 1 MULT/cy (AVX: 2 cy/CL); L1↔L2: 32 B/cy, i.e., 2 cy per cache-line stream.

Triad analysis (per CL): in-core execution (loads, half-stores, ADD, MULT) takes 6 cy/CL; transferring the B, C, D streams plus the write-allocate and evict of A between L2 and L1 takes 10 cy/CL.

Roofline prediction: 16/10 F/cy. Measurement: 16 F / 17 cy.
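The cycle counts above can be combined into the no-overlap ECM estimate in a few lines of Python (a sketch; function and parameter names are ours, the 6 cy/CL core time and the five 2-cycle L2 streams come from the analysis above):

```python
def triad_ecm_l2(t_core=6.0, n_streams=5, cy_per_stream=2.0, flops_per_cl=16):
    """No-overlap ECM estimate for the triad with data in L2.

    Streams between L2 and L1: loads of B, C, D, the write-allocate
    for A, and the evict of A -- five cache lines at 2 cy each.
    """
    t_data = n_streams * cy_per_stream       # 10 cy/CL of data delay
    t_ecm = t_core + t_data                  # no overlap: contributions add up
    return t_ecm, flops_per_cl / t_ecm       # cycles/CL and flops/cycle

if __name__ == "__main__":
    t, f = triad_ecm_l2()
    print(f"{t:.0f} cy/CL -> {f:.2f} F/cy")  # 16 cy/CL -> 1.00 F/cy
```

The resulting 16 cy/CL is close to the measured 17 cy/CL, whereas a pure bottleneck (Roofline) view would predict only 10 cy/CL.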
6 Example: ECM model for the Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) on a Sandy Bridge core with AVX

[Figure: cycle diagram of the in-core execution and the cache-line transfers through the hierarchy, including the write-allocate transfer for A.]
7 Testing different overlap hypotheses: the results suggest no overlap!
8 Multicore scaling in the ECM model

Identify the relevant bandwidth bottlenecks: L3 cache and memory interface. Scale single-thread performance until the first bottleneck is hit. With n threads:

    P_n = min(n * P_0, I * b_S)

where P_0 is the single-core performance, I the computational intensity, and b_S the saturated bandwidth of the bottleneck. Example: scalable L3 on Sandy Bridge...
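The scaling law can be written directly in Python (a sketch; function names and the example numbers in the test are ours, not from the slides):

```python
import math

def ecm_multicore(n, p0, intensity, b_s):
    """P_n = min(n * P_0, I * b_S): linear scaling up to the bandwidth roof.

    p0        -- single-core performance (e.g., GF/s)
    intensity -- computational intensity I (e.g., F/B)
    b_s       -- saturated bandwidth of the bottleneck (e.g., GB/s)
    """
    return min(n * p0, intensity * b_s)

def saturation_point(p0, intensity, b_s):
    """Smallest core count that reaches the bandwidth limit."""
    return math.ceil(intensity * b_s / p0)
```

For illustrative numbers p0 = 1.5 GF/s, I = 0.1 F/B, b_s = 40 GB/s, the roof is 4 GF/s and three cores saturate it.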
9 ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:) on a Sandy Bridge socket (no-overlap assumption)

Model: scales until saturation sets in; the saturation point (number of cores) is well predicted. Measurement: scaling is not perfect. Caveat: this is specific to this architecture and this benchmark! Check: use an overlappable kernel code.
10 ECM prediction vs. measurements for A(:)=B(:)+C(:)/D(:) on a Sandy Bridge socket (full-overlap assumption)

In-core execution is dominated by the divide operation (44 cycles with AVX, 22 scalar). Almost perfect agreement with the ECM model. General observation: if the L1 cache is 100% occupied by LOADs, there is no overlap throughout the hierarchy; if there is slack at the L1, there is overlap in the hierarchy.
11 Example 1: A 2D Jacobi stencil in DP with SSE2 on Sandy Bridge
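For reference, a plain scalar form of the 2D Jacobi sweep (a minimal Python sketch of the kernel being modeled; the code analyzed below is the SSE2-vectorized version of the same arithmetic):

```python
def jacobi2d_sweep(src, dst):
    """One Jacobi sweep: each inner point becomes the average of its
    four neighbors -- 3 ADDs and 1 MUL per lattice-site update (LUP)."""
    rows, cols = len(src), len(src[0])
    for j in range(1, rows - 1):
        for i in range(1, cols - 1):
            dst[j][i] = 0.25 * (src[j][i - 1] + src[j][i + 1]
                                + src[j - 1][i] + src[j + 1][i])

if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 0.0, 6.0], [7.0, 8.0, 9.0]]
    b = [row[:] for row in a]
    jacobi2d_sweep(a, b)
    print(b[1][1])   # 0.25 * (4 + 6 + 2 + 8) = 5.0
```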
12 Example 1: 2D Jacobi in DP with SSE2 on SNB

4-way unrolling, 8 LUP / iteration. Instruction count: 13 LOAD, 4 STORE, 12 ADD, 4 MUL.
13 Example 1: 2D Jacobi in DP with SSE2 on SNB

Processor characteristics (SSE instructions per cycle): 2 LOAD (or 1 LOAD + 1 STORE), 1 ADD, 1 MUL.
Code characteristics (SSE instructions per iteration): 13 LOAD, 4 STORE, 12 ADD, 4 MUL.

[Figure: cycle-by-cycle scheduling of the LOAD, STORE, ADD, and MUL instructions on the core's ports.] Core execution: 12 cy.
14 Example 1: 2D Jacobi in DP with SSE2 on SNB

Situation 1: data set fits into the L1 cache. ECM prediction: (8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s. Measurement: 2.2 GLUP/s.

Situation 2: data set fits into the L2 cache (but not into L1). Three additional transfer streams from L2 to L1, including the RFO for the write stream (data delay): 6 cy. Prediction: (8 LUP / (12+6) cy) * 3.5 GHz = 1.5 GLUP/s. Measurement: 1.9 GLUP/s. Overlap?
15 Example 1: 2D Jacobi in DP with SSE2 on SNB

LOAD bottleneck: 8.5 cy. The L1 is single-ported, so there is no overlap during LD/ST; the remaining part of the 12-cycle core execution can overlap with the 6-cycle L2 data delay. ECM prediction with overlap: (8 LUP / (8.5+6) cy) * 3.5 GHz = 1.9 GLUP/s. Measurement: 1.9 GLUP/s.

If the model fails, we learn something!
16 ECM model: the rules

1. LOADs in the L1 cache do not overlap with any other data transfer in the memory hierarchy.
2. Everything else in the core overlaps perfectly with data transfers.
3. The scaling limit is set by the ratio of the overall # cycles per CL to the # cycles per CL at the bottleneck.
4. The Roofline model is recovered when assuming full overlap of all contributions.

Example (time in cycles per CL): ADD 8 cy (with MULT 3 cy and STORE 4 cy hidden underneath), LOAD 6 cy, L2-L1 9 cy, L3-L2 9 cy, MEM-L3 19 cy.
Single-core (data in L1): 8 cy (ADD). Single-core (data in memory): 6 + 9 + 9 + 19 = 43 cy. Scaling limit: 43 / 19 = 2.3 cores.

(c) RRZE 2014
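Rules 1 and 2 translate directly into a per-level prediction: the non-overlapping LOAD time plus all transfer times up to a given level competes with the overlapping in-core time. A Python sketch using the example numbers from this slide (function and parameter names are ours):

```python
def ecm_predict(t_ol, t_nol, transfers):
    """ECM predictions for data in L1, L2, L3, memory (cycles per CL).

    t_ol      -- in-core time that overlaps with transfers (here: ADD)
    t_nol     -- non-overlapping in-core time (here: L1 LOADs)
    transfers -- [L2-L1, L3-L2, MEM-L3] transfer times
    """
    preds, t_data = [], t_nol
    preds.append(max(t_ol, t_data))          # data in L1
    for t in transfers:                      # add one hierarchy level at a time
        t_data += t
        preds.append(max(t_ol, t_data))
    return preds

preds = ecm_predict(t_ol=8, t_nol=6, transfers=[9, 9, 19])
# preds == [8, 15, 24, 43]; scaling limit: 43 / 19 = 2.3 cores
```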
17 ECM model: notation

Core time = overlapping and non-overlapping contributions. ECM prediction = maximum of the overlapping time and the sum of all other contributions.

Convenient shorthand notation for the contributions: {T_OL || T_nOL | T_L1L2 | T_L2L3 | T_L3Mem}. Example from the previous slide: {8 || 6 | 9 | 9 | 19} cy, which yields the predictions {8 | 15 | 24 | 43} cy for data in L1, L2, L3, and memory, respectively. Experimental (measured) data is written in parentheses in the same layout. Saturation assumption for the memory bottleneck: multicore performance is limited by the memory transfer time T_L3Mem.
18 ECM Model for DAXPY (AVX) on SNB 2.7 GHz (phinally)

Loop: A(:) = A(:) + s * B(:)

[Contributions and predictions: formulas not recoverable from the transcription.]
19 ECM Model and measurements for array sum on SNB 2.7 GHz (phinally)

Loop: s = s + A(:). Naive = scalar, no unrolling (full 3 cy penalty per ADD).
20 ECM Model and measurements for 2D Jacobi (AVX) on SNB 2.7 GHz (phinally)

Loop: the 2D Jacobi sweep from Example 1. LC = layer condition satisfied in the indicated cache level.
21 Jacobi 2D: impact of inner loop blocking on SNB (phinally)

[Figure: performance impact of inner loop blocking, compared with the ECM prediction.]
22 Jacobi 2D: Why outer loop blocking? Extra data is prefetched from memory at the block boundaries.
23 Kahan dot product
24 Kahan dot product

Goal: compute large sums (many operands) with controlled numerical error.

    __attribute__((optimize("no-tree-vectorize")))
    void ddot_kahan_scalar_comp(int N, const double* a, const double* b, double* r)
    {
        int i;
        double sum = 0.0;
        double c = 0.0;
        for (i = 0; i < N; ++i) {
            double prod = a[i] * b[i];
            double y = prod - c;
            double t = sum + y;
            c = (t - sum) - y;   /* must be evaluated as written */
            sum = t;
        }
        (*r) = sum;
    }
25 Example (from Wikipedia)

6-digit decimal FP, initial sum = 10000.0, adding 3.14159 and 2.71828.

y = input[i] - c = 3.14159 - 0.00000 = 3.14159
t = sum + y = 10000.0 + 3.14159 = 10003.1        Many digits have been lost!
c = (t - sum) - y                                This must be evaluated as written!
  = (10003.1 - 10000.0) - 3.14159
  = 3.10000 - 3.14159 = -0.0415900               Assimilated part of y recovered, vs. the full y.
sum = t = 10003.1                                Inaccurate result so far.

On the next step, c gives the error:
y = 2.71828 - (-0.0415900) = 2.75987             Shortfall from the previous stage included.
                                                 It is of a size similar to y: most digits meet.
t = 10003.1 + 2.75987 = 10005.85987              But few meet the digits of sum;
                                                 rounds to 10005.9.
c = (10005.9 - 10003.1) - 2.75987                This extracts whatever went in.
  = 2.80000 - 2.75987 = 0.0401300                In this case, too much; the excess
                                                 would be subtracted off next time.
sum = 10005.9

The exact result is 10005.85987; this is correctly rounded to 6 digits.
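The effect is easy to reproduce; the sketch below (ordinary binary doubles instead of the 6-digit decimal format above) sums many increments that are individually too small to change a naive accumulator:

```python
def naive_sum(values):
    s = 0.0
    for x in values:
        s += x
    return s

def kahan_sum(values):
    s, c = 0.0, 0.0              # sum and running compensation
    for x in values:
        y = x - c                # apply the correction from the last step
        t = s + y
        c = (t - s) - y          # must be evaluated as written!
        s = t
    return s

data = [1.0] + [1e-16] * 1000    # each 1e-16 is below half an ulp of 1.0
# naive_sum(data) stays at exactly 1.0 (every increment is lost);
# kahan_sum(data) recovers the increments via c, close to 1 + 1e-13
```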
26 ECM Model and measurements on Emmy (IVB 2.2 GHz, 3 cy/CL from memory)

Standard DP ddot: scalar and AVX. Kahan DP ddot: scalar and AVX. Conclusion: DP Kahan ddot saturates the memory bandwidth even in scalar mode; SP Kahan will not saturate.
27 Performance Modeling of Stencil Codes

Applying the ECM model to stencil updates:
- 3D Jacobi smoother (DP, AVX)
- Long-range stencil (SP, AVX)

(H. Stengel, RRZE)
28 Example 2: A 3D Jacobi smoother with AVX vectorization on an Intel Ivy Bridge processor
29 Jacobi 3D: Manual Analysis

Operation count (1 LUP):      Cycle count (4x unroll + AVX = 16 LUP):
  MUL    1                      MUL    4
  ADD    5                      ADD   20
  LOAD   6                      LOAD  24
  STORE  1                      STORE  8
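The counts in the table correspond to this scalar form of the smoother (a Python sketch of the kernel; the measured code is the AVX-vectorized version): each update reads 6 neighbors, performs 5 ADDs and 1 MUL, and writes 1 result.

```python
def jacobi3d_sweep(src, dst, n):
    """One 3D Jacobi sweep on an n*n*n grid of nested lists:
    each inner point becomes the average of its six neighbors."""
    c = 1.0 / 6.0
    for k in range(1, n - 1):
        for j in range(1, n - 1):
            for i in range(1, n - 1):
                dst[k][j][i] = c * (src[k][j][i - 1] + src[k][j][i + 1]
                                    + src[k][j - 1][i] + src[k][j + 1][i]
                                    + src[k - 1][j][i] + src[k + 1][j][i])
```

On a linear field (value = k*9 + j*3 + i on a 3x3x3 grid) the averaging leaves the interior point unchanged, which makes a convenient sanity check.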
30 Interlude: Intel Architecture Code Analyzer (IACA)

Performs architecture-specific code analysis. Prerequisite: mark start and end of the dominant work loop, either in high-level code (documented) or in assembly code (see iacaMarks.h). The markers do not influence code optimization (e.g., vectorization). Note that the assembly loop might perform multiple updates per iteration (unrolling, SIMD).

Important reports (throughput mode):
- Block throughput: runtime of one loop iteration (core time)
- Throughput bottleneck: the limiting resource for code execution
- Port pressure: the dominant pipeline port
31 16 updates (4x unroll + AVX) = 2 cache lines per loop iteration (#pragma vector aligned)
32 Jacobi 3D: ECM contributions

Intel Xeon (Ivy Bridge), 3.0 GHz, memory bandwidth 47 GB/s. Times in cycles for 8 LUP (DP) = 1 CL update = 0.5 loop iterations of the assembly code, i.e., 0.5 * the IACA output.

Data transfers (non-overlapping): L1-REG (LD) 12 cy, L2-L1 10 cy, L3-L2 10 cy, M-L3 12 cy. Overlapping core contributions: ADD 10 cy, MUL 2 cy, Reg-Reg transfers 6 cy, stores 4 cy; front-end stalls are negligible. IACA throughput: 24.1 cy / 16 LUP.

Single-core performance: the data transfers sum to 44 cy per CL, so 3.0 GHz / (44 cy / 8 LUP) = 545 MLUP/s. Measurement (N=400): 542 MLUP/s (~44 cy).

#pragma vector aligned
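The arithmetic behind the prediction, as a sketch with the numbers from this slide:

```python
def ecm_memory_time(transfers):
    """Sum of the non-overlapping contributions (no-overlap rule for LOADs)."""
    return sum(transfers)

# L1-REG (LD), L2-L1, L3-L2, M-L3 in cycles per cache line (8 LUP)
t_cl = ecm_memory_time([12, 10, 10, 12])          # 44 cy per CL
perf = 3.0e9 / (t_cl / 8)                         # LUP/s at 3.0 GHz, ~545e6
```

The result, about 545 MLUP/s, matches the 542 MLUP/s measurement quoted above.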
33 Socket Scaling: Intel Xeon (Ivy Bridge), memory bandwidth 47 GB/s
34 Example 3: 3D long-range stencil in single precision with AVX on Sandy Bridge
35 Example 3: 3D long-range stencil in SP with AVX on SNB

Core execution: 4 neighbors per direction. Operations per update (code): 27 LOAD (25 V, 1 ROC, 1 U), 1 STORE (U), 26 ADD, 15 MUL. Core time and actual LOAD count: from IACA.

Collaboration with D. Keyes & T. Malas (KAUST)
36 IACA example output

AVX vectorization, no unrolling: one iteration updates 8 SP (float) elements, so all numbers are multiplied by 2 to get the time for updating one cache line (16 floats). Core execution time (16 LUP) = 2 * 34.25 cy = 68.5 cy. Data transfer on the LOAD ports (128-bit loads), REG-L1: 2 * 30.5 cy = 61 cy.
37 Example 3: Data delay

Single-precision data set. Spatial blocking; layer condition in L3 and row condition in L1: OK.

Cycles per CL update: L1-REG 61 cy (from the IACA analysis), L2-L1 24 cy, L3-L2 24 cy, M-L3 17 cy (memory bandwidth 40 GB/s). 8 LOADs to V can be served directly by the L3 cache, plus 1 from main memory. Minimum data transfer to main memory: 4 words/LUP (LD: U, V, ROC; ST: U).
38 Example 3: Putting it all together

Data delay (non-overlapping): L1-REG (LOAD) 61 cy, L2-L1 24 cy, L3-L2 24 cy, M-L3 17 cy; total 126 cy per CL (16 LUP). Core execution (non-LD/ST cycles, overlapping): ADD 52 cy, MULT 38 cy, Reg-Reg transfers 48 cy, stores 4 cy, front-end stalls ~7.5 cy. IACA throughput: 68.5 cy/CL (SP). The 61-cycle LOAD time is the optimization target!

Single-core performance (ECM model): 2.7 GHz / (126 cy / 16 LUP) = 343 MLUP/s. Measurement: 320 MLUP/s. Since the memory contribution (17 cy) is small against the in-cache transfers, temporal blocking is useless!
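The same bookkeeping, with this slide's numbers, also yields the saturation estimate (rule 3 of the model) that explains why temporal blocking does not pay off here. A sketch:

```python
# Non-overlapping contributions in cycles per CL update (16 SP LUP)
data_delay = {"L1-REG": 61, "L2-L1": 24, "L3-L2": 24, "M-L3": 17}

t_cl = sum(data_delay.values())                  # 126 cy per cache line
perf = 2.7e9 / (t_cl / 16)                       # ~343e6 LUP/s (measured: 320 MLUP/s)
cores_to_saturate = t_cl / data_delay["M-L3"]    # ~7.4 cores reach the memory limit
```

With only 17 of 126 cycles spent on the memory interface, many cores are needed before bandwidth saturates, so reducing memory traffic further (temporal blocking) cannot help a single core.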
39 Socket scaling: memory bandwidth limit
40 ECM model: Conclusions & outlook

Saturation effects are ubiquitous; understanding them gives us the opportunity to:
- find optimization opportunities,
- save energy by letting cores idle (see the power model later on),
- put idle cores to better use (communication, functional decomposition).

Simple models work best. Do not try to complicate things unless it is really necessary!

Possible extensions to the ECM model: accommodate latency effects; model simple architectural hazards.
More informationWritten Exam / Tentamen
Written Exam / Tentamen Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of
More informationCache-oblivious Programming
Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix
More informationAdvanced optimizations of cache performance ( 2.2)
Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped
More informationDan Stafford, Justine Bonnot
Dan Stafford, Justine Bonnot Background Applications Timeline MMX 3DNow! Streaming SIMD Extension SSE SSE2 SSE3 and SSSE3 SSE4 Advanced Vector Extension AVX AVX2 AVX-512 Compiling with x86 Vector Processing
More informationCartoon parallel architectures; CPUs and GPUs
Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationA GPU performance estimation model based on micro-benchmarks and black-box kernel profiling
A GPU performance estimation model based on micro-benchmarks and black-box kernel profiling Elias Konstantinidis National and Kapodistrian University of Athens Department of Informatics and Telecommunications
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationSSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals
SSE and SSE2 Timothy A. Chagnon 18 September 2007 All images from Intel 64 and IA 32 Architectures Software Developer's Manuals Overview SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions
More informationCACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás
CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationCache Memories. From Bryant and O Hallaron, Computer Systems. A Programmer s Perspective. Chapter 6.
Cache Memories From Bryant and O Hallaron, Computer Systems. A Programmer s Perspective. Chapter 6. Today Cache memory organization and operation Performance impact of caches The memory mountain Rearranging
More informationWhat is Pipelining? Time per instruction on unpipelined machine Number of pipe stages
What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Moore s law Intel Sandy Bridge EP: 2.3 billion Nvidia
More informationIntel Knights Landing Hardware
Intel Knights Landing Hardware TACC KNL Tutorial IXPUG Annual Meeting 2016 PRESENTED BY: John Cazes Lars Koesterke 1 Intel s Xeon Phi Architecture Leverages x86 architecture Simpler x86 cores, higher compute
More informationCS 433 Homework 4. Assigned on 10/17/2017 Due in class on 11/7/ Please write your name and NetID clearly on the first page.
CS 433 Homework 4 Assigned on 10/17/2017 Due in class on 11/7/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.
More informationLinear Algebra for Modern Computers. Jack Dongarra
Linear Algebra for Modern Computers Jack Dongarra Tuning for Caches 1. Preserve locality. 2. Reduce cache thrashing. 3. Loop blocking when out of cache. 4. Software pipelining. 2 Indirect Addressing d
More information