Basics of Performance Engineering


ERLANGEN REGIONAL COMPUTING CENTER
Basics of Performance Engineering
J. Treibig, HiPerCH 3, 23./24.03.2015

Why hardware should not be exposed
- Such an approach is not portable.
- Hardware issues change frequently.
- Those nasty hardware details are too difficult to learn for the average programmer.
But: the important fundamental concepts are stable and portable (ILP, SIMD, memory organization). The basic principles are simple to understand, and every programmer should know them.

Approaches to performance optimization
- Trial and error: blind, data-driven
- Highly skilled experts: problem-centric
- Automated expert tools: highly complex, tool-centric

Focus on resource utilization
1. Instruction execution: the primary resource of the processor.
2. Data transfer bandwidth: data transfers occur as a consequence of instruction execution.
What is the limiting resource? Do you fully utilize the available resources?
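To make the "limiting resource" question concrete, one can compare the machine balance (bytes per flop the hardware can supply) with the code balance (bytes per flop a loop requests). The following minimal sketch does this for a Schönauer triad a[i] = b[i] + s*c[i]; the peak-performance and bandwidth figures are assumptions, not measurements, so substitute your own machine data.

```c
#include <stdio.h>

int main(void) {
    /* Assumed machine: 8 cores x 4 flops/cycle x 2.5 GHz, 50 GB/s memory BW */
    double peak_flops = 8 * 4 * 2.5e9;             /* flops/s (assumption) */
    double mem_bw     = 50e9;                      /* bytes/s (assumption) */
    double machine_balance = mem_bw / peak_flops;  /* B/F the machine supplies */

    /* Triad a[i] = b[i] + s*c[i]: 2 flops; 24 B of loads/stores plus 8 B
       write-allocate on the store target = 32 B per iteration. */
    double code_balance = 32.0 / 2.0;              /* B/F the code requests */

    printf("machine balance: %.3f B/F, code balance: %.1f B/F\n",
           machine_balance, code_balance);
    printf("limiting resource: %s\n",
           code_balance > machine_balance ? "data transfer (bandwidth)"
                                          : "instruction execution");
    return 0;
}
```

With these assumed numbers the triad requests 16 B/F while the machine can supply only about 0.6 B/F, so data transfer is the limiting resource by a wide margin.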

What needs to be done, on one slide
- Reduce computational work.
- Reduce data volume (over slow data paths).
- Make use of parallel resources: load balancing, serial fraction.
- Identify the relevant bottleneck(s), eliminate the bottleneck, increase resource utilization.
Final goal: fully exploit the offered resources for your specific code!

Basics of Optimization
1. Define relevant test cases.
2. Establish a sensible performance metric.
3. Acquire a runtime profile (sequential).
4. Identify hot kernels (hopefully there are some!).
5. Carry out the optimization process for each kernel, iteratively.
Motivation:
- Understand the observed performance.
- Learn about code characteristics and machine capabilities.
- Decide on optimizations deliberately.

Best Practices: Benchmarking
Preparation
- Reliable timing (what is the minimum time that can be measured? see the sketch after this list)
- Document code generation (compiler version, flags).
- Get an exclusive system.
- Fix the system state (clock frequency, turbo mode, memory, caches).
- Consider automating the runs with a script (shell, Python, Perl).
Doing
- Affinity control.
- Check: Is the result reasonable? Is the result deterministic and reproducible?
- Statistics: mean, best??
- Basic variants: thread count, affinity, working set size (baseline!)
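For the "reliable timing" item, a common trick is to repeat the kernel until the measured interval is well above the timer resolution. A minimal sketch, assuming a placeholder kernel() and a 100 ms threshold:

```c
#include <stdio.h>
#include <time.h>

static void kernel(void) { /* ... code under test ... */ }

/* Wall-clock time in seconds from a monotonic clock. */
static double wtime(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    int reps = 1;
    double elapsed = 0.0;
    for (;;) {
        double t0 = wtime();
        for (int i = 0; i < reps; i++) kernel();
        elapsed = wtime() - t0;
        if (elapsed > 0.1) break;   /* at least 100 ms measured */
        reps *= 2;                  /* too short: double the repetitions */
    }
    printf("%d reps, %.3e s per call\n", reps, elapsed / reps);
    return 0;
}
```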

Best Practices: Benchmarking (cont.)
Postprocessing
- Documentation.
- Try to understand and explain the result.
- Plan variations to gain more information.
- Many things can be better understood if you plot them (gnuplot, xmgrace).

Thinking in Bottlenecks
- A bottleneck is a performance-limiting setting.
- Microarchitectures expose numerous bottlenecks.
Observation 1: Most applications face a single bottleneck at a time!
Observation 2: There is a limited number of relevant bottlenecks!

Process vs. Tool
Reduce complexity! We propose a human-driven process to enable a systematic way to success:
- Executed by humans.
- Uses tools for data acquisition only.
- Uses one of the most powerful tools available: your brain!
You are an investigator making sense of what's going on.

Performance Engineering Process: Analysis
Step 1, analysis: understanding the observed performance.
Inputs:
- Algorithm/code analysis
- Application benchmarking
- Microbenchmarking
- Hardware/instruction set architecture knowledge
Terminology: performance patterns are typical performance-limiting motifs; the set of input data indicating a pattern is its signature.

Performance analysis phase
Understand the observed performance: Where am I?
Input:
- Static code analysis
- HPM data (see the instrumentation sketch below)
- Scaling the data set size
- Scaling the number of used cores
- Microbenchmarking
The input data yields a signature that points to a pattern. Performance patterns are typical performance-limiting motifs; the set of input data indicating a pattern is its signature.
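HPM data for a specific code region can be collected, for example, with the LIKWID marker API used throughout these slides. A minimal sketch follows; the kernel and the region name "triad" are placeholders. Compile with -DLIKWID_PERFMON and -llikwid, and run under likwid-perfctr with the marker switch, e.g. likwid-perfctr -C 0 -g MEM -m ./a.out.

```c
#include <stdio.h>
#include <likwid-marker.h>   /* LIKWID >= 5; older versions ship <likwid.h> */

#define N 10000000
static double a[N], b[N], c[N];

int main(void) {
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    LIKWID_MARKER_INIT;                 /* set up the marker API */
    LIKWID_MARKER_START("triad");       /* begin the measured region */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 1.5 * c[i];       /* kernel under investigation */
    LIKWID_MARKER_STOP("triad");        /* end the measured region */
    LIKWID_MARKER_CLOSE;                /* write out the results */

    printf("a[0] = %f\n", a[0]);        /* keep the compiler honest */
    return 0;
}
```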

Performance Engineering Process: Modelling
Step 2, formulate a model: validate the pattern and get quantitative insight.
- The pattern provides the qualitative view.
- The performance model provides the quantitative view.
- Validate against traces/HW metrics; a wrong pattern sends you back to the analysis step.

Performance Engineering Process: Optimization
Step 3, optimization: improve the utilization of the offered resources.
- Optimize for better resource utilization.
- Eliminate non-expedient activity.
- Performance improves until the next bottleneck is hit.

Performance pattern classification
1. Maximum resource utilization
2. Hazards
3. Work related (application or processor)
The system offers two basic resources: execution of instructions (primary) and transferring data (secondary).

Patterns (I): Bottlenecks & hazards

Pattern: Bandwidth saturation
  Performance behavior: Saturating speedup across cores sharing a data path.
  Metric signature, LIKWID group(s): Bandwidth meets the BW of a suitable streaming benchmark (MEM, L3).

Pattern: ALU saturation
  Performance behavior: Throughput at design limit(s).
  Metric signature, LIKWID group(s): Good (low) CPI, integral ratio of cycles to specific instruction count(s) (FLOPS_*, DATA, CPI).

Pattern: Inefficient data access (excess data volume, latency-bound access)
  Performance behavior: A simple bandwidth performance model is much too optimistic.
  Metric signature, LIKWID group(s): Low BW utilization / low cache hit ratio, frequent CL evicts or replacements (CACHE, DATA, MEM).

Pattern: Micro-architectural anomalies
  Performance behavior: Large discrepancy from a simple performance model based on LD/ST and arithmetic throughput.
  Metric signature, LIKWID group(s): Relevant events are very hardware-specific, e.g., memory aliasing stalls, conflict misses, unaligned LD/ST, requeue events.

Patterns (II): Hazards

Pattern: False sharing of cache lines (see the demo after this list)
  Performance behavior: Large discrepancy from the performance model in the parallel case, bad scalability.
  Metric signature, LIKWID group(s): Frequent (remote) CL evicts (CACHE).

Pattern: Bad ccNUMA page placement
  Performance behavior: Bad or no scaling across NUMA domains; performance improves with interleaved page placement.
  Metric signature, LIKWID group(s): Unbalanced bandwidth on the memory interfaces / high remote traffic (MEM).

Pattern: Pipelining issues
  Performance behavior: In-core throughput far from the design limit, performance insensitive to data set size.
  Metric signature, LIKWID group(s): (Large) integral ratio of cycles to specific instruction count(s), bad (high) CPI (FLOPS_*, DATA, CPI).

Pattern: Control flow issues
  Performance behavior: See above.
  Metric signature, LIKWID group(s): High branch rate and branch miss ratio (BRANCH).
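As an illustration of the false-sharing entry, here is a minimal OpenMP demo (my own sketch, not from the slides): per-thread counters packed into one cache line force the line to bounce between cores, while padding each counter to its own 64-byte line removes the hazard. The thread count and iteration count are arbitrary assumptions.

```c
#include <stdio.h>
#include <omp.h>

#define NT 4
/* One counter per 64-byte cache line: no sharing between threads. */
struct { volatile long val; char pad[56]; } padded[NT];
/* All counters packed into a single cache line: false sharing. */
volatile long packed[NT];

int main(void) {
    long iters = 100000000L;

    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NT)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < iters; i++) packed[id]++;      /* shared line */
    }
    double t_packed = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NT)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < iters; i++) padded[id].val++;  /* private lines */
    }
    printf("packed: %.2fs  padded: %.2fs\n", t_packed, omp_get_wtime() - t0);
    return 0;
}
```

The counters are declared volatile so the compiler cannot collapse the loops into a single addition; the padded variant should run several times faster on a multicore chip.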

Patterns (III): Work-related

Pattern: Load imbalance / serial fraction
  Performance behavior: Saturating/sub-linear speedup.
  Metric signature, LIKWID group(s): Different amount of work on the cores (FLOPS_*); note that the instruction count is not reliable!

Pattern: Synchronization overhead
  Performance behavior: Speedup goes down as more cores are added / no speedup with small problem sizes / cores busy but low FP performance.
  Metric signature, LIKWID group(s): Large non-FP instruction count, growing with the number of cores used / low CPI (FLOPS_*, CPI).

Pattern: Instruction overhead
  Performance behavior: Low application performance, good scaling across cores, performance insensitive to problem size.
  Metric signature, LIKWID group(s): Low CPI near the theoretical limit / large non-FP instruction count, constant vs. the number of cores (FLOPS_*, DATA, CPI).

Pattern: Code composition (expensive instructions, ineffective instructions)
  Performance behavior: Similar to instruction overhead.
  Metric signature, LIKWID group(s): Many cycles per instruction (CPI) if the problem is large-latency arithmetic; scalar instructions dominating in data-parallel loops (FLOPS_*, CPI).

Example: RabbitCT
Result of the effort:
- 5-6x improvement over the initial parallel C code implementation
- >50% of peak performance (SSE)

Optimization without knowledge about the bottleneck

Where to start
Look at the code and understand what it is doing!
Scaling runs:
- Scale the number of cores inside a ccNUMA domain
- Scale across ccNUMA domains
- Scale the working set size (if possible)
HPM measurements:
- Memory bandwidth
- Instruction decomposition: arithmetic, data, branch, other
- SIMD-vectorized fraction
- Data volumes inside the memory hierarchy
- CPI

Most frequent patterns (seen through scientific computing glasses)
Data transfer related:
- Memory bandwidth saturation
- Bad ccNUMA page placement
Parallelization:
- Load imbalance
- Serial fraction
Overhead:
- Synchronization overhead
Excess work:
- Data volume reduction over slow data paths
- Reduction of algorithmic work
Code composition:
- Instruction overhead
- Ineffective instructions
- Expensive instructions

Pattern: Bandwidth Saturation
1. Perform a scaling run inside a ccNUMA domain (see the sketch below).
2. Measure the memory bandwidth with HPM.
3. Compare to a microbenchmark with a similar data access pattern.
Example: measured bandwidth of spmv: 45964 MB/s; synthetic load benchmark: 47022 MB/s. The bandwidth saturates instead of scaling with the number of cores.
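A scaling run as in step 1 can be scripted externally or, as sketched here, driven from within the benchmark itself (my own sketch; the kernel, thread counts, and array size are assumptions). Pin the threads to one ccNUMA domain when running it, e.g. with likwid-pin.

```c
#include <stdio.h>
#include <omp.h>

/* Bandwidth scaling sketch: run a streaming kernel with 1, 2, 4, 8 threads
   and report the effective memory bandwidth for each thread count. */
#define N (1L << 25)                   /* 32 Mi doubles per array = 256 MiB */
static double a[N], b[N];

int main(void) {
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = (double)i; }
    for (int nt = 1; nt <= 8; nt *= 2) {
        omp_set_num_threads(nt);
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = 2.0 * b[i];         /* streaming (scale-type) kernel */
        double t = omp_get_wtime() - t0;
        /* 24 B per iteration: load b, store a, write-allocate a */
        printf("%d threads: %8.0f MB/s\n", nt, 24.0 * N / t / 1e6);
    }
    return 0;
}
```

If the reported bandwidth stops growing after a few cores and matches a synthetic streaming benchmark, as in the spmv example above, the bandwidth-saturation pattern is confirmed.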

Consequences from the saturation pattern
Clearly distinguish between saturating-type and scalable-type performance on the chip level.
[Figure: performance vs. number of cores for a saturating type and a scalable type]

Consequences from the saturation pattern
There is no clear bottleneck for single-core execution. The code profile for a single thread differs from the code profile for multiple threads: the scalable part shrinks as threads are added, so with 8 threads the saturating part dominates the runtime.
-> Single-threaded profiling may be misleading.
[Figure: runtime decomposition into saturating and scalable parts for 1 thread vs. 8 threads]

Pattern: Instruction Overhead
1. Perform an HPM instruction decomposition analysis.
2. Measure the resource utilization.
3. Do a static code analysis.

Instruction decomposition (two typical cases):
                 Inlining failed   Inefficient data structures
Arithmetic FP         12%                21%
Load/Store            30%                50%
Branch                24%                10%
Other                 34%                19%

C++ codes which suffer from overhead (inlining problems, complex abstractions) need many more instructions overall relative to the arithmetic instructions. Typical signature:
- Often (but not always) good (i.e., low) CPI
- Low-ish bandwidth
- Low number of floating-point instructions vs. other instructions
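A hypothetical illustration of how abstraction overhead inflates the instruction mix: the same reduction written with a per-element indirect call, which blocks inlining and SIMD vectorization and adds call and branch instructions per element, and as a plain loop.

```c
#include <stdio.h>

static double mul_op(double a, double b) { return a * b; }

/* Indirect variant: one function-pointer call per element. */
static double dot_indirect(const double *x, const double *y, long n,
                           double (*op)(double, double)) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += op(x[i], y[i]);          /* call overhead, no SIMD */
    return s;
}

/* Plain variant: the compiler can inline and vectorize this loop. */
static double dot_plain(const double *x, const double *y, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

int main(void) {
    enum { N = 1000 };
    double x[N], y[N];
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
    printf("%.1f %.1f\n", dot_indirect(x, y, N, mul_op), dot_plain(x, y, N));
    return 0;
}
```

In the indirect variant the two useful flops per element are buried under call, load, and branch instructions, which is exactly the "large non-FP instruction count" signature from the pattern tables above.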

Pattern: Inefficient Instructions
1. HPM measurement: relation of packed vs. scalar instructions.
2. Static assembly code analysis: search for scalar loads.

likwid-perfctr output (note the small fraction of packed instructions and the absence of AVX):

+--------------------------------------+-------------+-------------+-------------+-------------+-------------+
| Event                                | core 0      | core 1      | core 2      | core 3      | core 4      |
+--------------------------------------+-------------+-------------+-------------+-------------+-------------+
| INSTR_RETIRED_ANY                    | 2.19445e+11 | 1.7674e+11  | 1.76255e+11 | 1.75728e+11 | 1.75578e+11 |
| CPU_CLK_UNHALTED_CORE                | 1.4396e+11  | 1.28759e+11 | 1.28846e+11 | 1.28898e+11 | 1.28905e+11 |
| CPU_CLK_UNHALTED_REF                 | 1.20204e+11 | 1.0895e+11  | 1.09024e+11 | 1.09067e+11 | 1.09074e+11 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | 1.1169e+09  | 1.09639e+09 | 1.09739e+09 | 1.10112e+09 | 1.10033e+09 |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | 3.62746e+10 | 3.45789e+10 | 3.45446e+10 | 3.44553e+10 | 3.44829e+10 |
| SIMD_FP_256_PACKED_DOUBLE            | 0           | 0           | 0           | 0           | 0           |
+--------------------------------------+-------------+-------------+-------------+-------------+-------------+

There is usually no counter for packed vs. scalar (SIMD) loads and stores, and the compiler usually does not distinguish them either. The only solution is to inspect the code at the assembly level.