Performance Engineering
1 Performance Engineering
J. Treibig, Erlangen Regional Computing Center, University of Erlangen-Nuremberg
2 Using the RRZE clusters
Terminal server: cshpc.rrze.uni-erlangen.de
Login nodes: emmy, lima, woody, testfront
Use the batch system to allocate compute resources:
  qsub -I -l nodes=<xxx>:ppn=<xxx>[:xxx] -l walltime=02:00:00
  qsub -I -l nodes=phinally:ppn=32:turbo -l walltime=08:00:00
Module system to control the environment:
  module avail: list available modules
  module load/unload <XXX>: load/unload a specific module
  module list: list loaded modules
  module show <XXX>: show the environment set by a module
3 Best Practices for Benchmarking
Preparation:
  Reliable timing (know the minimum time interval that can be measured)
  Document the code generation (compiler flags, compiler version)
  Get exclusive access to the system
  Document the system state (clock, turbo mode, memory, caches)
  Consider automating runs with a script (shell, Python, Perl)
Doing:
  Affinity control
  Check: Is the result reasonable? Is the result deterministic and reproducible?
  Statistics: mean, best?
Postprocessing:
  Documentation
  Try to understand and explain the result
  Plan variations to gain more information
  Many things can be better understood if you plot them (gnuplot, xmgrace)
4 A(:)=B(:)+C(:)*D(:) on one Sandy Bridge core (3 GHz)
[Figure: performance vs. data set size, with the theoretical limit and the L1D cache (32 KB), L2 cache (256 KB), L3 cache (20 MB), and memory regimes marked]
5 Bandwidth limitations: outer-level cache
[Figure: scalability of the shared data paths in the L3 cache under a varying number of threads]
6 Throughput vector triad on a Sandy Bridge socket (3 GHz)
[Figure: saturation effect in memory vs. scalable bandwidth in the L1, L2, and L3 caches, under varying affinity]
7 Basic process for optimization
Tools:
  Runtime profile: gprof, compiler
  Hardware performance monitoring: likwid-perfctr, perf, PAPI
  Microbenchmarking: STREAM, likwid-bench
  MPI trace tool: Intel ITAC
Process: runtime profiling, algorithm/code analysis, and machine characteristics feed a performance model; a performance pattern is confirmed by kernel benchmarking and metric signatures, and drives the code optimization.
You are an investigator!
8 Using hardware performance metrics likwid-perfctr
9 Hardware performance metrics
  are ubiquitous as a starting point for performance analysis (including automatic analysis)
  are supported by many tools
  are often reduced to cache misses (what could be worse than cache misses?)
Reality: modern parallel computing is plagued by bottlenecks. There are typical performance patterns that cover a large part of the possible performance behaviors.
A performance pattern is identified from HPM signatures, scaling behavior, and other sources of information.
10 Tools for performance engineering
Today: automatic, intelligent tools vs. expert low-level tools (bare-metal approach), which
  enable the user,
  make resource bottlenecks visible,
  and, importantly: don't get in the way!
LIKWID tools: small, flexible and effective tools
  likwid-topology and likwid-pin
  likwid-bench
  likwid-perfctr
  likwid-powermeter
  likwid-mpirun
11 Get to know the machine I: likwid-topology
You need quick and reliable access to all relevant properties of a compute node.
Difficulties:
  Node information is scattered across various places
  The Internet is one source of information, but it may be unreliable
likwid-topology offers:
  All relevant information from one single source
  Reliable data based directly on cpuid
  A quick overview of the thread and memory topology (including Turbo mode steps on Intel CPUs)
12 Get to know the machine I: likwid-topology (cont.)
[Sample output; most numeric fields were lost in transcription:]
  CPU type: Intel Core Westmere processor
  Sockets: 2
  Cores per socket: 6
  Threads per core: …
  HWThread/Thread/Core/Socket mapping for sockets 0 and 1
  Cache topology: 3 levels; L3: 12 MB, unified, 16-way associative, 64-byte cache lines, non-inclusive, shared among 12 threads; cache groups listed per level
  NUMA topology: domains 0 and 1, each with its processor list and free/total memory in MB
13 Controlling affinity of threads: likwid-pin
It is crucial for threaded programs to control thread affinity on today's complex node topologies.
Difficulties:
  Different solutions depending on the threading model or OpenMP implementation
  Either policy-based or using physical processor IDs
  Using environment variables
likwid-pin offers:
  Portable pinning without touching the code
  A simple, accessible command-line interface
  Logical numberings within thread groups
Usage: likwid-pin -c …
14 Controlling affinity of threads: thread groups
Possible unit prefixes for the -c thread list:
  N: node (default if -c is not specified!)
  S: socket
  M: NUMA domain
  C: outer-level cache group
15 Get to know the machine II: likwid-bench
Knowing the performance capabilities of a machine is essential for any optimization effort.
Difficulties:
  Doing time measurements in microbenchmarking is tedious
  Thread and data placement need to be quickly adaptable
  The implementation in a programming language may introduce problems
likwid-bench offers:
  Rapid prototyping of assembly kernels
  Thread management and placement
  Data allocation and NUMA-aware initialization
  Timing and result presentation
  A ready-to-use set of microbenchmarks
16 Get to know the machine II: likwid-bench (cont.)
Benchmarks are simple text files; they are automatically converted, compiled and added to the benchmark application.
  $ likwid-bench -t clcopy -g 1 -i 1000 -w S0:1MB:2
  $ likwid-bench -t copy -g 2 -i 100 -w S1:1GB -w S0:1GB-0:S1,1:S0
Example kernel file (copy; the load/store offsets were lost in transcription):
  STREAMS 2
  TYPE DOUBLE
  FLOPS 0
  BYTES 16
  LOOP 32
  movaps FPR1, [STR0 + GPR1 * 8]
  movaps FPR2, [STR0 + GPR1 * …]
  movaps FPR3, [STR0 + GPR1 * …]
  movaps FPR4, [STR0 + GPR1 * …]
  movaps [STR1 + GPR1 * 8], FPR1
  movaps [STR1 + GPR1 * …], FPR2
  movaps [STR1 + GPR1 * …], FPR3
  movaps [STR1 + GPR1 * …], FPR4
A port to IBM Power is available! Contact me if you are interested.
17 Hardware performance monitoring: likwid-perfctr
Hardware performance monitoring is an indispensable source of information about how your code interacts with the hardware.
Difficulties:
  HPM implementations are volatile and undocumented
  Tools are often vendor-specific
  Events are frequently buggy
  Finding event sets with useful derived metrics is difficult
likwid-perfctr offers:
  Simple end-to-end measurement of hardware performance metrics
  The same tool on any x86 processor
  Portable, ready-to-use, validated performance groups
  Flexible usage modes: wrapper, stethoscope, timeline, marker
  Full coverage of HPM events, including the Uncore
Note: likwid-perfctr supports the energy counters on Sandy Bridge, accessible as a regular event!
18 likwid-perfctr: example usage in wrapper mode
  $ likwid-perfctr -C N:0-3 -g FLOPS_DP ./stream.exe
[Sample output; the counter values were lost in transcription:]
  CPU type: Intel Core Westmere processor
  CPU clock: 2.67 GHz
  (always measured) … YOUR PROGRAM OUTPUT …
  Events (per core 0-3, for the configured group): INSTR_RETIRED_ANY, CPU_CLK_UNHALTED_CORE, CPU_CLK_UNHALTED_REF, FP_COMP_OPS_EXE_SSE_FP_PACKED, FP_COMP_OPS_EXE_SSE_FP_SCALAR, FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION
  Derived metrics (per core): Runtime [s], Clock [MHz], CPI, DP MFlops/s, Packed MUOPS/s, Scalar MUOPS/s, DP MUOPS/s
19 likwid-perfctr: group files
Groups are architecture-specific. They are defined in simple text files, and the code for them is generated during the build. likwid-perfctr -a outputs the list of groups; extensive documentation is available for every group.
Example group file:
  SHORT PSTI
  EVENTSET
  FIXC0 INSTR_RETIRED_ANY
  FIXC1 CPU_CLK_UNHALTED_CORE
  FIXC2 CPU_CLK_UNHALTED_REF
  PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED
  PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR
  PMC2 FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION
  PMC3 FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION
  UPMC0 UNC_QMC_NORMAL_READS_ANY
  UPMC1 UNC_QMC_WRITES_FULL_ANY
  UPMC2 UNC_QHL_REQUESTS_REMOTE_READS
  UPMC3 UNC_QHL_REQUESTS_LOCAL_READS
  METRICS
  Runtime [s] FIXC1*inverseClock
  CPI FIXC1/FIXC0
  Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
  DP MFlops/s (DP assumed) 1.0E-06*(PMC0*2.0+PMC1)/time
  Packed MUOPS/s 1.0E-06*PMC0/time
  Scalar MUOPS/s 1.0E-06*PMC1/time
  SP MUOPS/s 1.0E-06*PMC2/time
  DP MUOPS/s 1.0E-06*PMC3/time
  Memory bandwidth [MBytes/s] 1.0E-06*(UPMC0+UPMC1)*64/time
  Remote Read BW [MBytes/s] 1.0E-06*(UPMC2)*64/time
  LONG
  Formula: DP MFlops/s = (FP_COMP_OPS_EXE_SSE_FP_PACKED*2 + FP_COMP_OPS_EXE_SSE_FP_SCALAR) / runtime
20 Basics for building LIKWID
Download the latest release and read the INSTALL and README files. Also consider a look at the Wiki on the LIKWID website.
LIKWID has no external dependencies and should build on any Linux system with a 2.6 or newer kernel.
Installing with make install is necessary for the pinning functionality and if you want to use the access daemon.
21 Access to the MSR and PCI address spaces
likwid-perfctr and likwid-powermeter require access to MSRs (model-specific registers) and, on Sandy Bridge, to the PCI address space.
MSR registers are accessed on x86 processors via special instructions which can only be executed in kernel space. The Linux kernel allows reading and writing these registers via special device files, which makes it possible to implement LIKWID completely in user space.
The following options are available:
  Direct access to the device files: the user must have read/write access to them.
  Access daemon: the application starts a proxy application for access to the device files (can be enabled in config.mk).
If you want to measure memory bandwidth on Sandy Bridge, you have to use the access daemon (see the Wiki)!
22 Setup for direct access
All modern Linux distributions ship the necessary msr kernel module.
  Check whether the device files exist: ls -l /dev/cpu/0/
  If the msr file is missing, load the module (must be root): modprobe msr
  Allow users access to the msr device files (various solutions are possible; must be root): chmod o+rw /dev/cpu/*/msr
Now you can use likwid-perfctr as a normal user. You can integrate the necessary steps into a startup script or configure udev.
23 likwid-perfctr: marker API
To measure only parts of an application, a marker API is available. The API only turns the counters on and off; the configuration of the counters is still performed by likwid-perfctr. Multiple named regions can be measured, results of multiple calls are accumulated, and inclusive and overlapping regions are possible.
  #include <likwid.h>
  likwid_markerInit();        // must be called from a serial region
  likwid_markerThreadInit();  // only if used in a threaded setting
  likwid_markerStartRegion("Compute");
  ...
  likwid_markerStopRegion("Compute");
  likwid_markerStartRegion("postprocess");
  ...
  likwid_markerStopRegion("postprocess");
  likwid_markerClose();       // must be called from a serial region
24 likwid-perfctr: marker API convenience C preprocessor macros
To enable easy toggling of the instrumentation, a set of macros is provided. Define LIKWID_PERFMON to enable LIKWID instrumentation; if LIKWID_PERFMON is not defined, the instrumentation is not built.
  #define LIKWID_PERFMON  // comment out to disable
  #include <likwid.h>
  LIKWID_MARKER_INIT;
  LIKWID_MARKER_THREADINIT;  // only necessary if measuring threaded code with the access daemon
  LIKWID_MARKER_START("Compute");
  ...
  LIKWID_MARKER_STOP("Compute");
  LIKWID_MARKER_START("postprocess");
  ...
  LIKWID_MARKER_STOP("postprocess");
  LIKWID_MARKER_CLOSE;
25 Using the marker API: tips and pitfalls
It may be convenient to copy the LIKWID header into your project; this makes it possible to build even if LIKWID is not available.
With instrumentation enabled you need to link against liblikwid.[a|so]. If you want to instrument code in shared libraries, you must also build LIKWID as a shared library (can be enabled in config.mk).
The initialization and closing parts of the API must always be in the application part. Use marked regions and start the application with:
  likwid-perfctr -g MEM -C N:0-15 -m ./a.out
If you want to control affinity yourself (-c option), you have to ensure that the number of threads entering each region is equal to the number of cores specified to likwid-perfctr.
26 Measuring energy consumption with LIKWID
27 Measuring energy consumption: likwid-powermeter and likwid-perfctr -g ENERGY
Implements the Intel RAPL interface (Sandy Bridge). RAPL = running average power limit.
[Sample likwid-powermeter output; several numeric fields were lost in transcription:]
  CPU name: Intel Core SandyBridge processor
  CPU clock: 3.49 GHz
  Base clock: … MHz; minimal clock: … MHz
  Turbo Boost steps: … MHz per number of active cores
  Thermal spec power: 95 W; minimum power: 20 W; maximum power: 95 W; maximum time window: … µs
28 Example: a medical image reconstruction code on Sandy Bridge
Sandy Bridge EP (8 cores, 2.7 GHz base frequency). Faster code => less energy!
  Test case               Runtime [s]   Power [W]   Energy [J]
  8 cores, plain C            …            …            …
  … cores, SSE                …            …            …
  … cores (SMT), SSE          …            …            …
  … cores (SMT), AVX          …            …            …
29 Useful performance patterns and metric signatures
J. Treibig, G. Hager, and G. Wellein: Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering. 5th Workshop on Productivity and Performance (PROPER 2012) at Euro-Par 2012, August 28, 2012, Rhodes Island, Greece. Preprint: arXiv:1206.3738
30 Performance patterns (1)
Pattern: Load imbalance
  Behavior: saturating/sub-linear speedup
  Signature: different amount of work on the cores (FLOPS_DP, FLOPS_SP, FLOPS_AVX); note that the instruction count is not reliable!
Pattern: Bandwidth saturation in the outer-level cache
  Behavior: saturating speedup across the cores of an OL cache group
  Signature: OLC bandwidth meets the bandwidth of a suitable streaming benchmark (L3)
Pattern: Memory bandwidth saturation
  Behavior: saturating speedup across the cores on a memory interface
  Signature: memory bandwidth meets the bandwidth of a suitable streaming benchmark (MEM)
Pattern: Strided or erratic data access
  Behavior: a simple bandwidth performance model is much too optimistic
  Signature: low bandwidth utilization / low cache hit ratio, frequent CL evicts or replacements (CACHE, DATA, MEM)
31 Performance patterns (2)
Pattern: Bad instruction mix
  Behavior: performance insensitive to the problem size vs. the cache levels
  Signature: large ratio of instructions retired to FP instructions if the useful work is FP / many cycles per instruction (CPI) if the problem is long-latency arithmetic / scalar instructions dominating in data-parallel loops (FLOPS_DP, FLOPS_SP, CPI)
Pattern: Limited instruction throughput
  Behavior: large discrepancy from a simple performance model based on LD/ST and arithmetic throughput
  Signature: low CPI near the theoretical limit if instruction throughput is the problem / static code analysis predicting large pressure on a single execution port / high CPI due to bad pipelining (FLOPS_DP, FLOPS_SP, DATA)
Pattern: Microarchitectural anomalies
  Behavior: large discrepancy from the performance model
  Signature: the relevant events are very hardware-specific, e.g., stalls due to 4k memory aliasing, conflict misses, unaligned vs. aligned LD/ST, requeue events. Code review required, with architectural features in mind.
32 Performance patterns (3)
Pattern: Synchronization overhead
  Behavior: speedup going down as more cores are added / no speedup with small problem sizes / cores busy but low FP performance
  Signature: large non-FP instruction count, growing with the number of cores used / low CPI (FLOPS_DP, FLOPS_SP, CPI)
Pattern: False sharing of cache lines
  Behavior: small speedup or slowdown when adding cores
  Signature: frequent (remote) CL evicts (CACHE)
Pattern: Bad ccNUMA page placement
  Behavior: bad or no scaling across NUMA domains
  Signature: unbalanced bandwidth on the memory interfaces / high remote traffic (MEM)
33 Example 1: abstraction penalties in C++ code
C++ codes which suffer from overhead (inlining problems, complex abstractions) need many more instructions overall relative to the arithmetic instructions:
  Often (but not always) good, i.e., low, CPI → bad instruction mix pattern
  Low-ish bandwidth
  Low number of floating-point instructions vs. other instructions
  High-level optimizations complex or impossible → strided access pattern
Example: matrix-matrix multiply with expression template frameworks on a 2.93 GHz Westmere core.
  Variant       Total retired instructions [10^11]   CPI   Memory bandwidth [MB/s]   MFlops/s
  Classic                    …                        …             …                   …
  Boost uBLAS                …                        …             …                   …
  Eigen                      …                        …             …                   …
  Blaze/DGEMM                …                        …             …                   …
34 Example 2: image reconstruction by backprojection
Simple roofline analysis → memory-bound algorithm → memory bandwidth saturation pattern.
A closer look via the likwid-perfctr MEM group and the IACA tool → limited instruction throughput pattern.
After a work reduction optimization → load imbalance pattern, identified via the likwid-perfctr FLOPS_SP group and corrected by a round-robin schedule.
35 Conclusions
Performance patterns are more than simple numbers: they combine scaling behavior, bottleneck saturation, and HPM signatures.
The set presented here is just a suggestion; it will have to be tested against more codes.
36 Conclusions
There is no alternative to knowing what is going on between your code and the hardware. Without performance modeling, optimizing code is like stumbling in the dark.
Performance x Flexibility = constant, a.k.a. abstraction is the natural enemy of performance.
Hardware Performance Monitoring Unit Working Group Outbrief CScADS Performance Tools for Extreme Scale Computing August 2011 hpctoolkit.org Topics From HW-centric measurements to application understanding
More informationCommon lore: An OpenMP+MPI hybrid code is never faster than a pure MPI code on the same hybrid hardware, except for obvious cases
Hybrid (i.e. MPI+OpenMP) applications (i.e. programming) on modern (i.e. multi- socket multi-numa-domain multi-core multi-cache multi-whatever) architectures: Things to consider Georg Hager Gerhard Wellein
More informationPerformance analysis basics
Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis
More informationCS3350B Computer Architecture CPU Performance and Profiling
CS3350B Computer Architecture CPU Performance and Profiling Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationMultilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823
More information1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core
1. Many Core vs Multi Core 2. Performance Optimization Concepts for Many Core 3. Performance Optimization Strategy for Many Core 4. Example Case Studies NERSC s Cori will begin to transition the workload
More informationHPC VT Machine-dependent Optimization
HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Multi-core today: Intel Xeon 600v4 (016) Xeon E5-600v4 Broadwell
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationSCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING
2/20/13 CS 594: SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING Heike McCraw mccraw@icl.utk.edu 1. Basic Essentials OUTLINE Abstract architecture model Communication, Computation, and Locality
More informationEE282 Computer Architecture. Lecture 1: What is Computer Architecture?
EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer
More informationIntel VTune Amplifier XE
Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples
More informationPerformance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews
Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models Jason Andrews Agenda System Performance Analysis IP Configuration System Creation Methodology: Create,
More informationTurbo Boost Up, AVX Clock Down: Complications for Scaling Tests
Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests Steve Lantz 12/8/2017 1 What Is CPU Turbo? (Sandy Bridge) = nominal frequency http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/hc23.19.9-desktop-cpus/hc23.19.921.sandybridge_power_10-rotem-intel.pdf
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationCommunication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart
More informationThe Role of Performance
Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware
More informationPerformance-oriented programming on multicore-based systems, with a focus on the Cray XE6/XC30
Performance-oriented programming on multicore-based systems, with a focus on the Cray XE6/XC30 Georg Hager (a), Jan Treibig (a), and Gerhard Wellein (a,b) (a) HPC Services, Erlangen Regional Computing
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Performance Counters and applying the Roofline Model Slides and lecture by Georg Ofenbeck Instructor: Markus Püschel TA: Georg Ofenbeck & Daniele Spampinato Read
More informationComputer Architecture s Changing Definition
Computer Architecture s Changing Definition 1950s Computer Architecture Computer Arithmetic 1960s Operating system support, especially memory management 1970s to mid 1980s Computer Architecture Instruction
More informationIntroduction to HPC and Optimization Tutorial VII
Felix Eckhofer Institut fã 1 4r numerische Mathematik und Optimierung Introduction to HPC and Optimization Tutorial VII January 30, 2013 TU Bergakademie Freiberg OpenMP Case study: Sparse matrix-vector
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationPerformance Analysis of Parallel Scientific Applications In Eclipse
Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains
More informationHeterogeneous Computing and OpenCL
Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction to Intel Xeon Phi
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationWhy you should care about hardware locality and how.
Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient
More informationKevin O Leary, Intel Technical Consulting Engineer
Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."
More informationMemory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System
Center for Information ervices and High Performance Computing (ZIH) Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor ystem Parallel Architectures and Compiler Technologies
More informationPerformance-oriented programming on multicore-based clusters with MPI, OpenMP, and hybrid MPI/OpenMP
Performance-oriented programming on multicore-based clusters with MPI, OpenMP, and hybrid MPI/OpenMP Georg Hager (a), Gabriele Jost (b), Rolf Rabenseifner (c), Jan Treibig (a), and Gerhard Wellein (a,d)
More informationCMAQ PARALLEL PERFORMANCE WITH MPI AND OPENMP**
CMAQ 5.2.1 PARALLEL PERFORMANCE WITH MPI AND OPENMP** George Delic* HiPERiSM Consulting, LLC, P.O. Box 569, Chapel Hill, NC 27514, USA 1. INTRODUCTION This presentation reports on implementation of the
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationGetting Performance from OpenMP Programs on NUMA Architectures
Getting Performance from OpenMP Programs on NUMA Architectures Christian Terboven, RWTH Aachen University terboven@itc.rwth-aachen.de EU H2020 Centre of Excellence (CoE) 1 October 2015 31 March 2018 Grant
More informationPage 1. Multilevel Memories (Improving performance using a little cash )
Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More information