Performance Engineering


1 Performance Engineering. J. Treibig, Erlangen Regional Computing Center, University of Erlangen-Nuremberg

2 Using the RRZE clusters
- Terminal server: cshpc.rrze.uni-erlangen.de
- Login nodes: emmy, lima, woody, testfront
- Using the batch system to allocate compute resources:
    qsub -I -l nodes=<xxx>:ppn=<xxx>[:xxx] -l walltime=02:00:00
    qsub -I -l nodes=phinally:ppn=32:turbo -l walltime=08:00:00
- Module system to control the environment:
    module avail: list available modules
    module load/unload <XXX>: load/unload a specific module
    module list: list loaded modules
    module show <XXX>: show the environment set by a module

3 Best practices for benchmarking
Preparation:
- Reliable timing (know the minimum time interval that can be measured)
- Document the code generation (flags, compiler version)
- Get exclusive access to the system
- Record the system state (clock, turbo mode, memory, caches)
- Consider automating runs with a script (shell, Python, Perl)
Doing:
- Affinity control
- Check: Is the result reasonable? Is it deterministic and reproducible?
- Statistics: mean or best result?
Postprocessing:
- Documentation
- Try to understand and explain the result
- Plan variations to gain more information
- Many things can be better understood if you plot them (gnuplot, xmgrace)

4 A(:)=B(:)+C(:)*D(:) on one Sandy Bridge core (3 GHz)
[Plot: performance vs. data set size, showing the theoretical limit and the L1D cache (32 kB), L2 cache (256 kB), L3 cache (20 MB), and memory regimes.]

5 Bandwidth limitations: outer-level cache
[Plot: scalability of the shared data paths in the L3 cache under variation of the number of threads.]

6 Throughput vector triad on a Sandy Bridge socket (3 GHz)
[Plot: variation of affinity, showing the saturation effect in memory and scalable bandwidth in the L1, L2, and L3 caches.]

7 Basic process for optimization
Tools:
- Runtime profile: gprof, compiler
- Hardware performance monitoring: likwid-perfctr, perf, PAPI
- Microbenchmarking: STREAM, likwid-bench
- MPI trace tool: Intel ITAC
Ingredients of the process: runtime profiling, algorithm/code analysis, machine characteristics, performance model, performance pattern, kernel benchmarking, metric signatures, code optimization.
You are an investigator!

8 Using hardware performance metrics likwid-perfctr

9 Hardware performance metrics
- are ubiquitous as a starting point for performance analysis (including automatic analysis)
- are supported by many tools
- are often reduced to cache misses (what could be worse than cache misses?)
Reality: modern parallel computing is plagued by bottlenecks. There are typical performance patterns that cover a large part of the possible performance behaviors. A performance pattern is identified from HPM signatures, scaling behavior, and other sources of information.

10 Tools for performance engineering
Today: automatic, intelligent tools vs. expert low-level tools (bare-metal approach).
Goals: enable the user, make resource bottlenecks visible, and, importantly, don't get in the way!
LIKWID tools: small, flexible, and effective:
- likwid-topology and likwid-pin
- likwid-bench
- likwid-perfctr
- likwid-powermeter
- likwid-mpirun

11 Get to know the machine I: likwid-topology
You need quick and reliable access to all relevant properties of a compute node.
Difficulties:
- Node information is scattered across various places
- The Internet is one source of information, but it may be unreliable
likwid-topology offers:
- All relevant information from one single source
- Reliable data based directly on cpuid
- A quick overview of the thread and memory topology (and the Turbo mode steps on Intel CPUs)

12 Get to know the machine I: likwid-topology (cont.)
Example output (excerpt):
    CPU type: Intel Core Westmere processor
    Sockets: 2
    Cores per socket: 6
    Threads per core: ...
    Cache topology: per-core 32 kB L1 and 256 kB L2 caches; unified 12 MB L3,
    16-way associative, 64 B cache line size, non-inclusive, shared among 12 threads
    NUMA topology: 2 domains (0 and 1), each listing its processors and its
    free/total memory in MB

13 Controlling the affinity of threads: likwid-pin
It is crucial for threaded programs to control thread affinity on today's complex node topologies.
Difficulties:
- Different solutions depending on the threading model or OpenMP implementation
- Either policy-based or using physical processor IDs
- Using environment variables
likwid-pin offers:
- Portable pinning without touching the code
- A simple, accessible command line interface
- Logical numberings within thread groups
Usage: likwid-pin -c ...

14 Controlling the affinity of threads: thread groups
Possible unit prefixes:
- N: node (default if -c is not specified!)
- S: socket
- M: NUMA domain
- C: outer-level cache group

15 Get to know the machine II: likwid-bench
Knowing the performance capabilities of a machine is essential for any optimization effort.
Difficulties:
- Doing time measurements in microbenchmarking is tedious
- Thread and data placement need to be quickly adaptable
- The implementation in a programming language may introduce problems
likwid-bench offers:
- Rapid prototyping of assembly kernels
- Thread management and placement
- Data allocation and NUMA-aware initialization
- Timing and result presentation
- A ready-to-use set of microbenchmarks

16 Get to know the machine II: likwid-bench (cont.)
Benchmarks are simple text files; they are automatically converted, compiled, and added to the benchmark application.
    $ likwid-bench -t clcopy -g 1 -i 1000 -w S0:1MB:2
    $ likwid-bench -t copy -g 2 -i 100 -w S1:1GB -w S0:1GB-0:S1,1:S0
Benchmark file (excerpt):
    STREAMS 2
    TYPE DOUBLE
    FLOPS 0
    BYTES 16
    LOOP 32
    movaps FPR1, [STR0 + GPR1 * 8]
    movaps FPR2, [STR0 + GPR1 * 8 + ...]
    movaps FPR3, [STR0 + GPR1 * 8 + ...]
    movaps FPR4, [STR0 + GPR1 * 8 + ...]
    movaps [STR1 + GPR1 * 8], FPR1
    movaps [STR1 + GPR1 * 8 + ...], FPR2
    movaps [STR1 + GPR1 * 8 + ...], FPR3
    movaps [STR1 + GPR1 * 8 + ...], FPR4
A port to IBM Power is available! Contact me if you are interested.

17 Hardware performance monitoring: likwid-perfctr
Hardware performance monitoring is an indispensable source of information about how your code interacts with the hardware.
Difficulties:
- HPM implementations are volatile and undocumented
- Tools are often vendor specific
- Events are frequently buggy
- Finding event sets with useful derived metrics is difficult
likwid-perfctr offers:
- Simple end-to-end measurement of hardware performance metrics
- The same tool on any x86 processor
- Portable, ready-to-use, validated performance groups
- Flexible usage modes: wrapper, stethoscope, timeline, marker
- Full coverage of HPM events including the Uncore
Note: likwid-perfctr supports the energy counters on Sandy Bridge, accessible as a regular event!

18 likwid-perfctr: example usage in wrapper mode
    $ likwid-perfctr -C N:0-3 -g FLOPS_DP ./stream.exe
The output reports the CPU type (Intel Core Westmere processor) and clock (2.67 GHz), then your program's own output, followed by a per-core table of raw event counts for the measured group FLOPS_DP: the always-measured fixed counters (INSTR_RETIRED_ANY, CPU_CLK_UNHALTED_CORE, CPU_CLK_UNHALTED_REF) and the events configured by this group (FP_COMP_OPS_EXE_SSE_FP_PACKED, FP_COMP_OPS_EXE_SSE_FP_SCALAR, FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION). A second table lists the derived metrics per core: runtime [s], clock [MHz], CPI, DP MFlops/s, packed MUOPS/s, scalar MUOPS/s, and DP MUOPS/s.

19 likwid-perfctr: group files
    SHORT PSTI
    EVENTSET
    FIXC0 INSTR_RETIRED_ANY
    FIXC1 CPU_CLK_UNHALTED_CORE
    FIXC2 CPU_CLK_UNHALTED_REF
    PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED
    PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR
    PMC2 FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION
    PMC3 FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION
    UPMC0 UNC_QMC_NORMAL_READS_ANY
    UPMC1 UNC_QMC_WRITES_FULL_ANY
    UPMC2 UNC_QHL_REQUESTS_REMOTE_READS
    UPMC3 UNC_QHL_REQUESTS_LOCAL_READS
    METRICS
    Runtime [s] FIXC1*inverseClock
    CPI FIXC1/FIXC0
    Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
    DP MFlops/s (DP assumed) 1.0E-06*(PMC0*2.0+PMC1)/time
    Packed MUOPS/s 1.0E-06*PMC0/time
    Scalar MUOPS/s 1.0E-06*PMC1/time
    SP MUOPS/s 1.0E-06*PMC2/time
    DP MUOPS/s 1.0E-06*PMC3/time
    Memory bandwidth [MBytes/s] 1.0E-06*(UPMC0+UPMC1)*64/time
    Remote Read BW [MBytes/s] 1.0E-06*(UPMC2)*64/time
    LONG
    Formula: DP MFlops/s = (FP_COMP_OPS_EXE_SSE_FP_PACKED*2 + FP_COMP_OPS_EXE_SSE_FP_SCALAR) / runtime
Groups are architecture specific. They are defined in simple text files; the corresponding measurement code is generated at build time. likwid-perfctr -a outputs the list of available groups, and extensive documentation exists for every group.

20 Basics of building LIKWID
- Download the latest release from the project website
- Read the INSTALL and README files; also take a look at the Wiki on the LIKWID website
- LIKWID has no external dependencies and should build on any Linux system with a 2.6 or newer kernel
- Installing with make install is necessary for the pinning functionality and if you want to use the access daemon

21 Access to the MSR and PCI address spaces
likwid-perfctr and likwid-powermeter require access to MSRs (model-specific registers) and, on Sandy Bridge, to the PCI address space.
- MSRs are accessed on x86 processors via special instructions which can only be executed in kernel space
- The Linux kernel allows reading and writing these registers via special device files; this makes it possible to implement LIKWID entirely in user space
The following options are available:
- Direct access to the device files: the user must have read/write access to them
- AccessDaemon: the application starts a proxy application for access to the device files (can be enabled in config.mk)
If you want to measure memory bandwidth on Sandy Bridge you have to use the access daemon (see the Wiki)!

22 Setting up direct access
All modern Linux distributions ship the necessary msr kernel module.
- Check whether the device file exists: ls -l /dev/cpu/0/
- If the msr file is missing, load the module (must be root): modprobe msr
- Allow users access to the msr device files (various solutions possible, must be root): chmod o+rw /dev/cpu/*/msr
Now you can use likwid-perfctr as a normal user. You can integrate the necessary steps into a startup script or configure udev.

23 likwid-perfctr: marker API
- To measure only parts of an application, a marker API is available
- The API only turns counters on and off; the configuration of the counters is still performed by likwid-perfctr
- Multiple named regions can be measured
- Results from multiple calls are accumulated
- Inclusive and overlapping regions are possible

    #include <likwid.h>

    likwid_markerInit();        // must be called from a serial region
    likwid_markerThreadInit();  // only if used in a threaded setting

    likwid_markerStartRegion("Compute");
    ...
    likwid_markerStopRegion("Compute");

    likwid_markerStartRegion("postprocess");
    ...
    likwid_markerStopRegion("postprocess");

    likwid_markerClose();       // must be called from a serial region

24 likwid-perfctr marker API: convenience C preprocessor macros
To enable easy toggling of the instrumentation there is a set of macros. To enable LIKWID instrumentation, define LIKWID_PERFMON; if LIKWID_PERFMON is not defined, the instrumentation is not built.

    #define LIKWID_PERFMON   // comment out to disable
    #include <likwid.h>

    LIKWID_MARKER_INIT;
    LIKWID_MARKER_THREADINIT;  // only necessary if measuring threaded code with the accessdaemon

    LIKWID_MARKER_START("Compute");
    ...
    LIKWID_MARKER_STOP("Compute");

    LIKWID_MARKER_START("postprocess");
    ...
    LIKWID_MARKER_STOP("postprocess");

    LIKWID_MARKER_CLOSE;

25 Using the marker API: tips and pitfalls
- It may be convenient to copy the LIKWID header into your project; this allows building even if LIKWID is not available
- With instrumentation enabled you need to link against liblikwid.[a|so]
- If you want to instrument code in shared libraries, you must also build LIKWID as a shared library (can be enabled in config.mk)
- The initialization and closing parts of the API must always be in the application part
- Use marked regions and start the application with: likwid-perfctr -g MEM -C N:0-15 -m ./a.out
- If you want to control affinity yourself (-c option) you have to ensure that the number of threads entering each region equals the number of cores specified to likwid-perfctr

26 Measuring energy consumption with LIKWID

27 Measuring energy consumption: likwid-powermeter and likwid-perfctr -g ENERGY
Implements the Intel RAPL interface (Sandy Bridge). RAPL = running average power limit.
Example output (excerpt):
    CPU name: Intel Core SandyBridge processor
    CPU clock: 3.49 GHz
    Thermal Spec Power: 95 Watts
    Minimum Power: 20 Watts
    Maximum Power: 95 Watts
The tool also reports the base and minimal clock, the Turbo Boost steps per number of active cores, and the maximum RAPL time window in microseconds.

28 Example: a medical image reconstruction code on Sandy Bridge
Sandy Bridge EP (8 cores, 2.7 GHz base frequency). The test cases (8 cores plain C; 8 cores SSE; 16 cores (SMT) SSE; 16 cores (SMT) AVX) were compared in terms of runtime [s], power [W], and energy [J]. Faster code → less energy.

29 Useful performance patterns and metric signatures
J. Treibig, G. Hager, and G. Wellein: Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering. 5th Workshop on Productivity and Performance (PROPER 2012) at Euro-Par 2012, August 28, 2012, Rhodes Island, Greece. Preprint available on arXiv.

30 Performance patterns (1)
- Load imbalance. Performance behavior: saturating/sub-linear speedup. Metric signature: different amount of work on the cores (FLOPS_DP, FLOPS_SP, FLOPS_AVX); note that the instruction count is not reliable!
- BW saturation in the outer-level cache. Behavior: saturating speedup across the cores of an OL cache group. Signature: OLC bandwidth meets the BW of a suitable streaming benchmark (L3).
- Memory BW saturation. Behavior: saturating speedup across the cores on a memory interface. Signature: memory BW meets the BW of a suitable streaming benchmark (MEM).
- Strided or erratic data access. Behavior: a simple BW performance model is much too optimistic. Signature: low BW utilization / low cache hit ratio, frequent CL evicts or replacements (CACHE, DATA, MEM).

31 Performance patterns (2)
- Bad instruction mix. Behavior: performance insensitive to problem size vs. cache levels. Signature: large ratio of instructions retired to FP instructions if the useful work is FP / many cycles per instruction (CPI) if the problem is large-latency arithmetic / scalar instructions dominating in data-parallel loops (FLOPS_DP, FLOPS_SP, CPI).
- Limited instruction throughput. Behavior: large discrepancy from a simple performance model based on LD/ST and arithmetic throughput. Signature: low CPI near the theoretical limit if instruction throughput is the problem / static code analysis predicting large pressure on a single execution port / high CPI due to bad pipelining (FLOPS_DP, FLOPS_SP, DATA).
- Microarchitectural anomalies. Behavior: large discrepancy from the performance model. Signature: the relevant events are very hardware-specific, e.g., stalls due to 4k memory aliasing, conflict misses, unaligned vs. aligned LD/ST, requeue events. A code review is required, with architectural features in mind.

32 Performance patterns (3)
- Synchronization overhead. Behavior: speedup going down as more cores are added / no speedup with small problem sizes / cores busy but low FP performance. Signature: large non-FP instruction count (growing with the number of cores used) / low CPI (FLOPS_DP, FLOPS_SP, CPI).
- False sharing of cache lines. Behavior: small speedup or slowdown when adding cores. Signature: frequent (remote) CL evicts (CACHE).
- Bad ccNUMA page placement. Behavior: bad or no scaling across NUMA domains. Signature: unbalanced bandwidth on the memory interfaces / high remote traffic (MEM).

33 Example 1: abstraction penalties in C++ code
C++ codes which suffer from overhead (inlining problems, complex abstractions) need a lot more overall instructions relative to the arithmetic instructions:
- Often (but not always) good, i.e., low, CPI → bad instruction mix pattern
- Low-ish bandwidth
- Low number of floating-point instructions vs. other instructions
- High-level optimizations complex or impossible → strided access pattern
Example: matrix-matrix multiply with expression template frameworks on a 2.93 GHz Westmere core. [Table: total retired instructions [10^11], CPI, memory bandwidth [MB/s], and MFlops/s for the Classic, Boost ublas, Eigen, and Blaze/DGEMM variants.]

34 Example 2: image reconstruction by backprojection
- A simple roofline analysis → memory-bound algorithm → memory BW saturation pattern
- A closer look via the likwid-perfctr MEM group and the IACA tool → limited instruction throughput pattern
- Work reduction optimization → load imbalance pattern, identified by the likwid-perfctr FLOPS_SP group and corrected by a round-robin schedule

35 Conclusions
Performance patterns are more than simple numbers: they combine scaling behavior, bottleneck saturation, and HPM signatures. The set presented here is just a suggestion; it will have to be tested against more codes.

36 Conclusions
There is no alternative to knowing what is going on between your code and the hardware. Without performance modeling, optimizing code is like stumbling in the dark. Performance x Flexibility = constant, a.k.a. abstraction is the natural enemy of performance.


More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Philippe Thierry Sr Staff Engineer Intel Corp.

Philippe Thierry Sr Staff Engineer Intel Corp. HPC@Intel Philippe Thierry Sr Staff Engineer Intel Corp. IBM, April 8, 2009 1 Agenda CPU update: roadmap, micro-μ and performance Solid State Disk Impact What s next Q & A Tick Tock Model Perenity market

More information

Intel Architecture for HPC

Intel Architecture for HPC Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

Introduction to Performance Tuning & Optimization Tools

Introduction to Performance Tuning & Optimization Tools Introduction to Performance Tuning & Optimization Tools a[i] a[i+1] + a[i+2] a[i+3] b[i] b[i+1] b[i+2] b[i+3] = a[i]+b[i] a[i+1]+b[i+1] a[i+2]+b[i+2] a[i+3]+b[i+3] Ian A. Cosden, Ph.D. Manager, HPC Software

More information

Intel Architecture for Software Developers

Intel Architecture for Software Developers Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Toward Automated Application Profiling on Cray Systems

Toward Automated Application Profiling on Cray Systems Toward Automated Application Profiling on Cray Systems Charlene Yang, Brian Friesen, Thorsten Kurth, Brandon Cook NERSC at LBNL Samuel Williams CRD at LBNL I have a dream.. M.L.K. Collect performance data:

More information

Optimising for the p690 memory system

Optimising for the p690 memory system Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Non-uniform memory access (NUMA)

Non-uniform memory access (NUMA) Non-uniform memory access (NUMA) Memory access between processor core to main memory is not uniform. Memory resides in separate regions called NUMA domains. For highest performance, cores should only access

More information

Potentials and Limitations for Energy Efficiency Auto-Tuning

Potentials and Limitations for Energy Efficiency Auto-Tuning Center for Information Services and High Performance Computing (ZIH) Potentials and Limitations for Energy Efficiency Auto-Tuning Parco Symposium Application Autotuning for HPC (Architectures) Robert Schöne

More information

Hardware Performance Monitoring Unit Working Group Outbrief

Hardware Performance Monitoring Unit Working Group Outbrief Hardware Performance Monitoring Unit Working Group Outbrief CScADS Performance Tools for Extreme Scale Computing August 2011 hpctoolkit.org Topics From HW-centric measurements to application understanding

More information

Common lore: An OpenMP+MPI hybrid code is never faster than a pure MPI code on the same hybrid hardware, except for obvious cases

Common lore: An OpenMP+MPI hybrid code is never faster than a pure MPI code on the same hybrid hardware, except for obvious cases Hybrid (i.e. MPI+OpenMP) applications (i.e. programming) on modern (i.e. multi- socket multi-numa-domain multi-core multi-cache multi-whatever) architectures: Things to consider Georg Hager Gerhard Wellein

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

CS3350B Computer Architecture CPU Performance and Profiling

CS3350B Computer Architecture CPU Performance and Profiling CS3350B Computer Architecture CPU Performance and Profiling Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core

1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core 1. Many Core vs Multi Core 2. Performance Optimization Concepts for Many Core 3. Performance Optimization Strategy for Many Core 4. Example Case Studies NERSC s Cori will begin to transition the workload

More information

HPC VT Machine-dependent Optimization

HPC VT Machine-dependent Optimization HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Multi-core today: Intel Xeon 600v4 (016) Xeon E5-600v4 Broadwell

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING

SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING 2/20/13 CS 594: SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING Heike McCraw mccraw@icl.utk.edu 1. Basic Essentials OUTLINE Abstract architecture model Communication, Computation, and Locality

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

Intel VTune Amplifier XE

Intel VTune Amplifier XE Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models Jason Andrews Agenda System Performance Analysis IP Configuration System Creation Methodology: Create,

More information

Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests

Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests Steve Lantz 12/8/2017 1 What Is CPU Turbo? (Sandy Bridge) = nominal frequency http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/hc23.19.9-desktop-cpus/hc23.19.921.sandybridge_power_10-rotem-intel.pdf

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart

More information

The Role of Performance

The Role of Performance Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware

More information

Performance-oriented programming on multicore-based systems, with a focus on the Cray XE6/XC30

Performance-oriented programming on multicore-based systems, with a focus on the Cray XE6/XC30 Performance-oriented programming on multicore-based systems, with a focus on the Cray XE6/XC30 Georg Hager (a), Jan Treibig (a), and Gerhard Wellein (a,b) (a) HPC Services, Erlangen Regional Computing

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Performance Counters and applying the Roofline Model Slides and lecture by Georg Ofenbeck Instructor: Markus Püschel TA: Georg Ofenbeck & Daniele Spampinato Read

More information

Computer Architecture s Changing Definition

Computer Architecture s Changing Definition Computer Architecture s Changing Definition 1950s Computer Architecture Computer Arithmetic 1960s Operating system support, especially memory management 1970s to mid 1980s Computer Architecture Instruction

More information

Introduction to HPC and Optimization Tutorial VII

Introduction to HPC and Optimization Tutorial VII Felix Eckhofer Institut fã 1 4r numerische Mathematik und Optimierung Introduction to HPC and Optimization Tutorial VII January 30, 2013 TU Bergakademie Freiberg OpenMP Case study: Sparse matrix-vector

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Performance Analysis of Parallel Scientific Applications In Eclipse

Performance Analysis of Parallel Scientific Applications In Eclipse Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains

More information

Heterogeneous Computing and OpenCL

Heterogeneous Computing and OpenCL Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction to Intel Xeon Phi

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Why you should care about hardware locality and how.

Why you should care about hardware locality and how. Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient

More information

Kevin O Leary, Intel Technical Consulting Engineer

Kevin O Leary, Intel Technical Consulting Engineer Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."

More information

Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System

Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System Center for Information ervices and High Performance Computing (ZIH) Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor ystem Parallel Architectures and Compiler Technologies

More information

Performance-oriented programming on multicore-based clusters with MPI, OpenMP, and hybrid MPI/OpenMP

Performance-oriented programming on multicore-based clusters with MPI, OpenMP, and hybrid MPI/OpenMP Performance-oriented programming on multicore-based clusters with MPI, OpenMP, and hybrid MPI/OpenMP Georg Hager (a), Gabriele Jost (b), Rolf Rabenseifner (c), Jan Treibig (a), and Gerhard Wellein (a,d)

More information

CMAQ PARALLEL PERFORMANCE WITH MPI AND OPENMP**

CMAQ PARALLEL PERFORMANCE WITH MPI AND OPENMP** CMAQ 5.2.1 PARALLEL PERFORMANCE WITH MPI AND OPENMP** George Delic* HiPERiSM Consulting, LLC, P.O. Box 569, Chapel Hill, NC 27514, USA 1. INTRODUCTION This presentation reports on implementation of the

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Getting Performance from OpenMP Programs on NUMA Architectures

Getting Performance from OpenMP Programs on NUMA Architectures Getting Performance from OpenMP Programs on NUMA Architectures Christian Terboven, RWTH Aachen University terboven@itc.rwth-aachen.de EU H2020 Centre of Excellence (CoE) 1 October 2015 31 March 2018 Grant

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization

More information