ERLANGEN REGIONAL COMPUTING CENTER [RRZE] MuCoSim Hands On. Thomas Röhl (HPC @ Uni Erlangen)


1 ERLANGEN REGIONAL COMPUTING CENTER [RRZE] MuCoSim Hands On. Thomas Röhl (HPC @ Uni Erlangen), Thomas.Roehl@fau.de

2 Agenda
HPC systems @ FAU
Login on cluster
Batch system
Modern architectures
LIKWID
Thread affinity
Hardware performance monitoring (end-to-end & Marker API)
CPU frequency

3 HPC systems @ FAU: Production systems
2-socket systems:
- Emmy: 560 nodes, IvyBridge, 20 phy. cores @ 2.2 GHz; 16 Xeon Phi, 16 Nvidia K20
- Lima: 500 nodes, Westmere, 24 phy. cores
- TinyBlue: 84 nodes, Nehalem, 8 phy. cores
1-socket systems:
- Woody: 40 nodes, SandyBridge, 4 phy. cores @ 3.5 GHz; 72 nodes, Haswell, 4 phy. cores @ 3.4 GHz; 64 nodes, Skylake, 4 phy. cores @ 3.5 GHz

4 Access to HPC systems @ FAU
Frontends (emmy, woody, lima) reachable from the FAU network via SSH (only for compilation, don't run applications there!)
From outside, connect to cshpc first
Console access: SSH; X access: NoMachine NX
Login: ssh <user>@<host>
Copy: scp (-r) <file/folder> <user>@<host>:<dest>
      scp (-r) <user>@<host>:<file/folder> <dest>

5 Further information on clusters

6 NOW YOU
Try an SSH login on the cluster frontend emmy
Copy the folder ~unrz139/mucosim to your home

7 $ ssh <user>@cshpc (only needed from outside the FAU network)
$ ssh <user>@emmy
$ cp -r ~unrz139/mucosim $HOME

8 Batch system
Get available nodes with their properties: pbsnodes
See the status of your jobs: qstat or qstat.<clustername>
Submit a job: qsub
-I : get an interactive job (console)
-l : set properties
  nodes=<nodecount> or nodes=<nodename1>,<nodename2>
  ppn=<40, but cluster specific> (SMT threads of the compute node(s))
  walltime=hh:mm:ss (runtime of the job)
  fx.y (set fixed frequency, e.g. f2.0)
  likwid (allow the user to measure hardware counters)

9 Batch system examples
Interactive job on 2 nodes for 3 hours:
$ qsub -I -l nodes=2:ppn=40 -l walltime=03:00:00
Interactive job on 2 nodes for 3 hours, each with 1 Nvidia K20:
$ qsub -I -l nodes=2:ppn=40:k20m1x,walltime=03:00:00
Non-interactive job on 2 nodes for 3 hours with fixed frequency:
$ qsub -l nodes=2:ppn=40:f2.0,walltime=03:00:00 xy.sh
Non-interactive job with properties in the batch script:
$ qsub xy.sh

10 Batch system scripts
#!/bin/csh or #!/bin/bash -l
#PBS -l nodes=2:ppn=40      (set job properties)
#PBS -l walltime=04:00:00   (set job runtime)
#PBS -N <jobname>           (set job name)
#PBS -l likwid              (enable LIKWID)
[ ... ]                     (copy, start, ...)
$ qsub test.batch           (pollux.rrze.uni-erlangen.de)
Outputs in <jobname>.o and <jobname>.e
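As a rough sketch (not the script from the hands-on material; job name and binary are placeholders), a complete batch script combining these directives could look like:

#!/bin/bash -l
#PBS -l nodes=1:ppn=40:likwid
#PBS -l walltime=00:30:00
#PBS -N tmv_example
# bash -l makes the module system available inside the job
module load intel64 likwid
# change to the directory the job was submitted from
cd $PBS_O_WORKDIR
# run the application on the allocated node
./matrix

Submit it with qsub; stdout and stderr end up in the <jobname>.o and <jobname>.e files as described above.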

11 Module system on all FAU systems
Automatically loaded for csh and bash (in batch scripts: bash -l)
Module system commands:
module avail                  (list available modules)
module show <mod> or <mod>/<version>
module load <mod>
module unload <mod>
Common modules: intel64, gcc, intelmpi, openmpi, likwid
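For instance, a typical sequence on a frontend or inside a job might be (module versions here are just examples):

$ module avail                 # list everything that is installed
$ module load intel64 likwid   # load default versions of compiler and LIKWID
$ module list                  # check what is currently loaded
$ module show likwid           # inspect the paths/variables the module sets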

12 Further information on software environment

13 MODERN COMPUTER ARCHITECTURES

14 Intel IvyBridge Architecture (die diagram, source: Intel): CPUs with attached L3 segments, memory controllers

15 Socket architecture
Non-uniform access to other L3 segments; one core can use all L3 segments
Non-uniform access to memory (NUMA in-socket)
Only one ring is attached to PCIe
All units are self-managing (system-on-chip principle)

16 In-core architecture
Example instruction stream:
vmovapd ymm1, [r8]
vaddpd ymm2, ymm1, ymm1
vmovapd [r10], ymm2
1) Load instruction(s) into L2
2) Load instruction(s) into L1I
3) Decode instructions
4) Load [r8] using port 2 (or 3)
5) Data arrives in L1D
6) Retire the load operation
7) Calculate y = 2x in port 1
8) Retire the add operation
9) Store y to [r10]
10) Data in L1D
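The three instructions above correspond to one AVX iteration of a simple doubling loop; a minimal C sketch (the function and array names are chosen here for illustration) would be:

/* y = 2*x, element-wise; with AVX enabled the compiler emits
   vmovapd (load), vaddpd (x + x) and vmovapd (store) per 4 doubles. */
void double_array(double *restrict y, const double *restrict x, long n)
{
    for (long i = 0; i < n; i++)
        y[i] = x[i] + x[i];
}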

17 In-core architecture
On x86: CISC outside, RISC inside; CISC instructions are decoded into RISC instructions and back
Additional buffers:
- in-order to out-of-order (reorder buffer)
- repeating instruction streams (loop buffer)
- out-of-order memory operations (load/store buffers)
Execution ports do the real work, out of order: calculation, data transfer and address ports
Retirement collects all RISC operations of a CISC instruction and commits them

18 Cache hierarchy (diagram): each core (Core 0 ... Core 3) has private L1D, L1I and L2 caches; the shared L3 connects via the intra-socket ring/mesh to the memory controller and QPI. Labeled transfer widths per cycle: 2x 16 bytes, 1x 32 bytes, 1x 32 bytes (between the register/L1, L1/L2 and L2/L3 levels).
Caches are often one-ported, thus only one transfer direction per cycle.

19 Cache hierarchy
HT threads of a core share L1 and L2; L3 is shared by a group of cores
Keep required data as high and as long as possible in the hierarchy (all-time advice!)
Use streaming access patterns if possible (helps the prefetchers)
If stored data is not needed again soon, use non-temporal stores (write directly to memory)
Allocate data on the socket that consumes it (QPI is slower)
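To illustrate the non-temporal store idea, here is a sketch using AVX intrinsics (the copy kernel and the alignment assumptions are made up for this example; with ICC the same effect can be requested globally via -qopt-streaming-stores=always):

#include <immintrin.h>

/* Copy n doubles (n a multiple of 4, both pointers 32-byte aligned)
   without polluting the caches with the destination data. */
void copy_nt(double *dst, const double *src, long n)
{
    for (long i = 0; i < n; i += 4) {
        __m256d v = _mm256_load_pd(src + i);  /* normal aligned load */
        _mm256_stream_pd(dst + i, v);         /* non-temporal store  */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}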

20 Available compilers
The mainly used compilers on the clusters are Intel ICC and GCC. Always test the performance of multiple compilers.
                      ICC                              GCC
OpenMP                -qopenmp                         -fopenmp
Optimization          -O1, -O2, -O3, -Ofast            -O1, -O2, -O3, -Ofast
Activate AVX          -xavx(2)                         -mavx(2) -ftree-vectorize
Non-temporal stores   -qopt-streaming-stores=always    N/A
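Typical compile lines built from this table might look like the following (source and binary names are placeholders):

$ icc -qopenmp -O3 -xAVX -o matrix matrix.c
$ gcc -fopenmp -O3 -mavx -ftree-vectorize -o matrix matrix.c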

21 NOW YOU
Go to folder 01_tmv and submit the job to the cluster
Run the matrix-vector multiplication interactively
- on all CPUs
- with other compile options
- with a different compiler version (module)

22 $ cd $HOME/mucosim/01_tmv
$ qsub matrix.batch
$ qstat
$ qsub -I -l nodes=1:ppn=40 -l walltime=00:10:00
$ make help
$ make run OMP_NUM_THREADS=40
$ module load intel64/x or gcc/y
$ make build CFLAGS_GCC="-O3 -mavx"
$ make build CFLAGS_ICC="-O3 -xhost"

23 Thread pinning & performance analysis

24 Importance of affinity
Bandwidth decreases with each level, latency increases with each level
Pin threads according to data locality
(Diagram: memory hierarchy Register - L1 - L2 - L3 - Memory - SSD - HDD, mapped onto Core - Socket - Node)

25 Importance of affinity: STREAM benchmark on a 16-core Sandy Bridge node
(Plot comparing two runs: pinning to physical cores first, first socket first, vs. no pinning)

26 LIKWID overview
"Like I Knew What I'm Doing"
Set of tools for: topology information, process/thread pinning, hardware performance monitoring, low-level benchmarking, CPU frequency manipulation, CPU feature manipulation (prefetchers)

27 System topology with LIKWID: likwid-topology (here on a Core i (Haswell) system)
Reports thread topology, cache topology, NUMA topology and a graphical topology
(Graphical output: a box per core of Socket 0 with its 32 kB L1 and 256 kB L2 caches plus the shared L3 in MB)
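A quick way to try it on an allocated node (the -g switch requests the additional graphical layout):

$ likwid-topology        # thread, cache and NUMA topology as tables
$ likwid-topology -g     # additionally print the graphical (ASCII) layout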

28 Affinity with LIKWID: likwid-pin
LIKWID defines affinity domains:
- Node (N:0-23)
- Last-level cache (C0:0-5)
- Socket (S1:0-11)
- NUMA domain (M0:0-5)

29 Affinity with LIKWID: likwid-pin
Broken in ...
Physical selection: 0,1,2,3 or 0-3
Logical selection: S0:0-3 or L:<domain>:0-3 (physical cores first)
Function-based selection: E:N:8 = 0,20,1,21,2,22,3,23; E:N:8:1:2 = 0,1,2,3,4,5,6,7
Scattered over affinity domains: M:scatter fills all memory domains, physical cores first: 0,10,1,11,2,12,3,13,...
Combine multiple selections with @: S0:0@S0:1 = 0,1

30 NOW YOU
Look at the system topology
Go to folder 02_tmv and run matrix interactively
- on all physical CPUs
- on the first 5 physical CPUs per socket
- don't use make run but likwid-pin directly

31 $ qsub -I -l nodes=1:ppn=40:likwid,walltime=00:10:00
$ module load likwid/4.2.0
$ likwid-topology
$ make run PINSTR="E:N:20:1:2"
$ make run PINSTR="0,1,2,3,4@E:S1:5:1:2"
$ likwid-pin -h
$ likwid-pin -c E:N:20:1:2 ./matrix
$ likwid-pin -c 0,1,2,3,4,10,11,12,13,14 ./matrix

32 Runtime profile
The Intel compiler provides a simple runtime profiling interface
Build with -profile-functions (and maybe -fno-inline)
No parallel execution!
Finds the hotspots in the code
Creates XML and tabular output files with fields: time and time share per function, call and exit count, file and line of the function
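A possible build-and-run sequence (file names and the exact output file names vary with the compiler version, so treat them as an assumption):

$ icc -O2 -profile-functions -fno-inline -o matrix matrix.c
$ ./matrix                      # writes loop_prof_*.xml and *.dump output files
$ less loop_prof_funcs_*.dump   # tabular per-function profile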

33 Runtime profile
Time(%)  Self(%)  Call count  Function    File:line
...      ...      ...         runloop     matrix.c:...
...      ...      ...         time_init   timer.c:...
...      ...      ...         fillmatrix  matrix.c:...
...      ...      ...         main        matrix.c:71
For GCC: compile with -pg and run gprof <exec> gmon.out
Flat profile (like ICC) and call graph (--graph)

34 Go to folder 03_runtime_profile and submit the job. Which is the hottest function?
Run the matrix example interactively. Which is the hottest function?

35 $ qsub stream.batch   (then qstat, ls runtime_profile*)
$ qsub -I -l nodes=1:ppn=40:likwid,walltime=00:10:00
$ make help (!)
$ make build_matrix / run_matrix
$ gprof --flat-profile matrix gmon.out
$ module load intel64
$ less *.dump

36 HPM - Hardware Performance Monitoring
An additional analysis method complementing software-based analysis (Vampir, TotalView, Intel Trace Analyzer/Collector)
Performance counters implemented in hardware
Low-level data of the CPU's functional units, caches and memory
Partly not accurate (e.g. FLOP/s on SandyBridge or IvyBridge)

37 Each unit has 2-4 counters and possibly a fixed-purpose counter
LIKWID uses different names for the uncore units: CBOX, MBOX, RBOX

38 LIKWID HPM - Hardware Performance Monitoring
Simple end-to-end measurements: likwid-perfctr
- sets up system topology and perfmon
- starts and stops HPM
- executes the application on the given CPU set
- evaluates counter values and derives metrics
$ likwid-perfctr -c E:S0:8:1:2 -g FLOPS_DP ./a.out
Measures CPUs 0 to 7 on socket 0 (-C to pin and measure) with the double-precision FLOP/s performance group
likwid-perfctr -a lists all available groups

39 LIKWID performance groups
Event names are not intuitive -> difficult selection
Performance groups combine an event set with derived metrics (bandwidths, ratios, ...)
Examples:
$ likwid-perfctr -c 0-3@E:S1:4:1:2 -g L3 ./a.out      (no pinning; L2/L3 traffic)
$ likwid-perfctr -C E:N:10:1:2 -g FLOPS_DP ./a.out    (pinning 10 threads, 1 out of 2; double-precision floating-point ops)

40 LIKWID performance groups on emmy
FLOPS_AVX: packed AVX MFlops/s
FLOPS_DP: double-precision MFlops/s
FLOPS_SP: single-precision MFlops/s
DATA: load-to-store ratio
L2: L2 cache bandwidth in MBytes/s
L3: L3 cache bandwidth in MBytes/s
MEM: main memory bandwidth in MBytes/s
ENERGY: power and energy consumption
MEM_DP: memory & DP FLOP/s & energy
MEM_SP: memory & SP FLOP/s & energy

41 NOW YOU
Go to folder 04_tmv and run interactively
- Why is the L3 evict data volume of core 0 larger?
- Measure the memory bandwidth running on both sockets
- Measure DP FLOP/s with different CPU selections
- Force vectorization and measure DP FLOP/s again

42 $ make run
$ make run PINSTR="..." PERFGRP="MEM"
$ make run PINSTR="E:N:20:1:2" PERFGRP="FLOPS_DP"
$ make build CFLAGS_GCC="-O3 -ffast-math"
$ make build CFLAGS_GCC="-O3 -ffast-math -mavx"
$ make build CFLAGS_ICC="-O3 -xavx"

43 likwid-perfctr Marker API mode
Until now, we measured the whole application; the Marker API measures only a code region of an application
The configuration is still done by likwid-perfctr
Multiple named regions can be measured (also nested)
Results of multiple region calls are accumulated

44 Marker API macros
#include <likwid.h>
LIKWID_MARKER_INIT;              // must be called from a serial region
LIKWID_MARKER_THREADINIT;        // must be called from a parallel region
LIKWID_MARKER_START("Compute");
<code>
LIKWID_MARKER_STOP("Compute");
LIKWID_MARKER_CLOSE;             // must be called from a serial region

45 Add the Marker API to code (restructure loops)
Before:
#pragma omp parallel for
<loop>
After:
#pragma omp parallel
{
  LIKWID_MARKER_START("Compute");
  #pragma omp for
  <loop>
  LIKWID_MARKER_STOP("Compute");
}
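Putting the macro list and the restructured loop together, a minimal self-contained example could look like the following sketch (the doubling loop and array size are made up for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <likwid.h>

#define N 10000000

int main(void)
{
    double *x = malloc(N * sizeof(double));
    double *y = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) x[i] = (double)i;

    LIKWID_MARKER_INIT;                    /* serial region   */
    #pragma omp parallel
    {
        LIKWID_MARKER_THREADINIT;          /* parallel region */
        LIKWID_MARKER_START("Compute");
        #pragma omp for
        for (long i = 0; i < N; i++)
            y[i] = x[i] + x[i];
        LIKWID_MARKER_STOP("Compute");
    }
    LIKWID_MARKER_CLOSE;                   /* serial region   */

    printf("y[42] = %f\n", y[42]);
    free(x); free(y);
    return 0;
}

Compiled as shown on slide 47 (with -DLIKWID_PERFMON and -llikwid); without that define the marker macros expand to nothing, so the instrumentation costs nothing in normal builds.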

46 Add the Marker API to code (closed-source library calls)
Before:
calc_some_func()
After:
#pragma omp parallel
{ LIKWID_MARKER_START("foo"); }
calc_some_func()
#pragma omp parallel
{ LIKWID_MARKER_STOP("foo"); }

47 Use it
Compile:
$CC -DLIKWID_PERFMON $LIKWID_INC $LIKWID_LIB code.c \
   -o code -llikwid
LIKWID_INC and LIKWID_LIB are defined by the module system
Run:
likwid-perfctr -C <cpustr> -g <group> -m ./a.out
Use capital -C: the Marker API requires pinned threads
-m tells likwid-perfctr to use Marker API mode

48 Measure a marked code region
$ likwid-perfctr -C 0,1,2 -g L2 -m ./a.out
===================== Region: Compute =====================
Region Info per core (core 0, core 1, core 2): RDTSC runtime [s] (region time of each thread) and call count (region calls of each thread)
[ raw counter results ]
Derived metrics for each thread (core 0, core 1, core 2): Runtime (RDTSC) [s], Runtime unhalted [s], Clock [MHz], CPI, L2 Load [MBytes/s], L2 Evict [MBytes/s], L2 bandwidth [MBytes/s], L2 data volume [GBytes]

49 NOW YOU
Go to folder 05_tmv and run interactively
- measure DP FLOP/s
- measure memory bandwidth
- what's wrong with the code?

50 $ make run
$ make run PINSTR="..." PERFGRP="MEM"
$ make run PINSTR="E:N:20:1:2" PERFGRP="FLOPS_DP"
$ make build CFLAGS_GCC="-O3 -ffast-math"
$ make build CFLAGS_GCC="-O3 -ffast-math -mavx"
$ make build CFLAGS_ICC="-O3 -xavx"
Load imbalance: parallelize the initialization and use smaller chunks for each thread
$ make build DEFINES="-DPARALLEL_CHUNK -DPARALLEL_CHUNK_INIT"

51 CPU frequency: likwid-setfrequencies
Changes the CPU frequency of affinity domains
Request only the likwid job property, no fixed-frequency property
See available frequencies: likwid-setfrequencies -l
See current frequency settings: likwid-setfrequencies -p
Set the frequency of socket 1 to 2.2 GHz: likwid-setfrequencies -c S1 -f 2.2
Set the scaling governor to performance on socket 0: likwid-setfrequencies -c S0 -g performance

52 ERLANGEN REGIONAL COMPUTING CENTER [RRZE] Thank you for your attention! Regionales RechenZentrum Erlangen [RRZE], Martensstraße 1, Erlangen, Thomas.Roehl@fau.de

53 Examples

54 Triangular matrix-vector multiplication
Parallelized with #pragma omp parallel
(Plot: lower is better.) What's happening here? The last thread executes its instructions faster than the first thread?

55 Triangular matrix-vector multiplication
Retired instructions are misleading: waiting in the implicit OpenMP barrier issues many but short instructions
We need to measure actual work
(Plot: higher is better)

56 Triangular matrix-vector multiplication
Floating-point instructions are a reliable metric for useful work
But the floating-point instruction counters since SandyBridge are only approximately correct
(Plot: higher is better)

57 Triangular matrix-vector multiplication
Changing the OpenMP schedule to static with chunk size 16 gives smaller work packages per thread
No imbalance anymore! Is it also faster?
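As an illustration of what such a kernel with the modified schedule can look like (a sketch only; the actual hands-on code, matrix layout and variable names may differ):

/* y = A * x for a lower-triangular N x N matrix A stored row-major.
   Row i touches only i+1 elements, so fixed chunks of 16 rows balance
   the work much better than the default static schedule. */
void tri_mvm(const double *A, const double *x, double *y, long N)
{
    #pragma omp parallel for schedule(static, 16)
    for (long i = 0; i < N; i++) {
        double sum = 0.0;
        for (long j = 0; j <= i; j++)
            sum += A[i * N + j] * x[j];
        y[i] = sum;
    }
}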

58 Triangular matrix-vector multiplication
Scaling run on an Intel SandyBridge node over both sockets (8 phy. cores per socket)

59 ERLANGEN REGIONAL COMPUTING CENTER [RRZE] Thank you for your attention! Regionales RechenZentrum Erlangen [RRZE], Martensstraße 1, Erlangen. LIKWID:
