Experiences Programming and Optimizing for Knights Landing

1 Experiences Programming and Optimizing for Knights Landing John Michalakes, University Corporation for Atmospheric Research, Boulder, Colorado USA. 2016 MultiCore 6 Workshop, September 13th, 2016. (Post-workshop: slides 26 and 36 updated on 2016/09/22)

2 Outline Hardware and software; Applications: NEPTUNE, MPAS, WRF and kernels; Knights Landing overview; Optimizing for KNL; Scaling; Performance; Roofline; Tools; Summary 2

3 Models: NEPTUNE/NUMA Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. Towards Operational Weather Prediction at 3.0km Global Resolution With the Dynamical Core NUMA. 96th Amer. Met. Society Annual Mtg., January 2016. 3

4 Models: NEPTUNE/NUMA Spectral element, 4th, 5th and higher-order* continuous Galerkin (discontinuous planned). Cubed sphere (also icosahedral). Adaptive Mesh Refinement (AMR) in development. NEPTUNE 72 h forecast (5 km resolution) of accumulated precipitation for Hurricane Sandy. Computationally dense but highly scalable: constant width-one halo communication, good locality for next generation HPC. MPI-only, but OpenMP threading in development. Example of adaptive grid tracking a severe event (courtesy: Frank Giraldo, NPS). * This is not the same order as is used to identify the leading term of the error in finite difference schemes, which in fact describes accuracy. Evaluation of Gaussian quadrature over N+1 LGL quadrature points will be exact to machine precision as long as the polynomial integrand is of order 2(N+1)-3 or less. Gabersek et al., MWR, Apr 2012. 4

5 Models: NEPTUNE/NUMA NWP is one of the first HPC applications and, with climate, has been one of its key drivers. Exponential growth in HPC capability has translated directly to better forecasts and steadily improving value to the public. HPC growth continues toward Peta-/Exaflop, but only parallelism is increasing: more floating point capability, proportionately less cache, memory and I/O bandwidth. NWP parallelism scales in one dimension less than complexity. Ensembles scale computationally but move the problem to I/O. Can operational NWP stay on the bus? More scalable formulations; expose and exploit all available parallelism, especially fine-grain. (Figure shows the computational resources needed to maintain fixed solution time with increasing resolution, up to 3,000,000 threads on 47,000 nodes.) Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. Towards Operational Weather Prediction at 3.0km Global Resolution With the Dynamical Core NUMA. 96th Amer. Met. Society Annual Mtg., January 2016. 5

6 Models: NEPTUNE/NUMA See also: Andreas Müller, Michal A. Kopera, Simone Marras, Lucas C. Wilcox, Tobin Isaac, Francis X. Giraldo. Strong Scaling for Numerical Weather Prediction at Petascale with the Atmospheric Model NUMA (last revised 8 Sep 2016, v3). NWP is one of the first HPC applications and, with climate, has been one of its key drivers. Exponential growth in HPC capability has translated directly to better forecasts and steadily improving value to the public. HPC growth continues toward Peta-/Exaflop, but only parallelism is increasing: more floating point capability, proportionately less cache, memory and I/O bandwidth. NWP parallelism scales in one dimension less than complexity; ensembles scale computationally but move the problem to I/O. Can operational NWP stay on the bus? More scalable formulations; expose and exploit all available parallelism, especially fine-grain. (Figure shows the computational resources needed to maintain fixed solution time with increasing resolution, up to 3,000,000 threads on 47,000 nodes.) Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. Towards Operational Weather Prediction at 3.0km Global Resolution With the Dynamical Core NUMA. 96th Amer. Met. Society Annual Mtg., January 2016. 6

7 Models: MPAS Collaborative project between NCAR and LANL for developing atmosphere, ocean and other earth system simulation components for use in climate, regional climate and weather studies. Applications include global NWP and global atmospheric chemistry, regional climate, tropical cyclones, and convection-permitting hazardous weather forecasting. Finite volume, C grid. Refinement capability: centroidal Voronoi tessellated unstructured mesh allows arbitrary in-place horizontal mesh refinement. HPC readiness: current release (v4.0) supports parallelism via MPI (horizontal domain decomposition); hybrid (MPI+OpenMP) parallelism implemented, undergoing testing. Primary funding: National Science Foundation and the DOE Office of Science 7

8 Hardware Intel Xeon Phi 7250 (Knights Landing), announced at ISC'16 in June. 14 nanometer feature size, > 8 billion transistors. 68 cores, 1.4 GHz, modified Silvermont with out-of-order instruction execution. Two 512-bit wide Vector Processing Units per core. Peak ~3 TF/s double precision, ~6 TF/s single precision. 16 GB MCDRAM (on-chip) memory, > 400 GB/s bandwidth. Hostless: no separate host processor and no offload programming. Binary compatible ISA (instruction set architecture) 8

9 Optimizing for Intel Xeon Phi Most work in MIC programming involves optimization to achieve the best share of peak performance. Parallelism: KNL has up to 288 hardware threads (4 per core) and a total of more than 2000 floating point units on the chip; expose coarse, medium and especially fine-grain (vector) parallelism in the application to use the Xeon Phi efficiently. Locality: reorganize and restructure data structures and computation to improve data reuse in cache and reduce the time floating point units sit idle waiting for data (memory latency); kernels that do not have high data reuse, or that do not fit in cache, require high memory bandwidth. The combination of these factors affecting performance on the Xeon Phi (or any contemporary processor) can be characterized in terms of computational intensity, and conceptualized using the Roofline Model. 9
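
As an illustration of the different grains of parallelism, a minimal sketch (not NEPTUNE or MPAS source; the arrays q and rhs, the bounds ncols/nlev/nvec, and the time step dt are hypothetical):

      !$omp parallel do private(k, i)     ! coarse grain: one chunk of columns per thread
      do j = 1, ncols
        do k = 1, nlev                    ! medium grain: levels reused while resident in cache
          !$omp simd                      ! fine grain: 8 DP (16 SP) lanes per 512-bit VPU
          do i = 1, nvec                  ! innermost index is stride-1 for unit-stride vector loads
            q(i,k,j) = q(i,k,j) + dt*rhs(i,k,j)
          end do
        end do
      end do
      !$omp end parallel do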

10 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity*: the number of floating point operations per byte moved between the processor and a level of the memory hierarchy. (KNL roofline measured with the Empirical Roofline Toolkit; chart axes are achievable performance vs. computational intensity. Thanks: Doug Doerfler, LBNL.) Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008. *Sometimes referred to as arithmetic intensity (registers <-> L1: largely algorithmic) or operational intensity (LLC <-> DRAM: improvable).
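
The bound itself can be written out; a hedged worked example (the triad loop is generic, not one of the kernels measured in this talk):

      attainable GFLOP/s ~= min( peak GFLOP/s, computational intensity x memory bandwidth )

      For a double precision triad a(i) = b(i) + s*c(i): 2 flops per ~24 bytes moved (read b and c,
      write a; more if write-allocate traffic is counted), so intensity ~= 0.08 flops/byte. From
      MCDRAM at ~400 GB/s that caps the loop near 0.08 x 400 ~= 33 GFLOP/s, far below the
      2333 GFLOP/s compute ceiling shown on the following slides.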

11 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability: 2333 GFLOP/s double precision with vectors and FMA on KNL. Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008. (Vectorization means the processor can perform 8 double precision operations at once as long as the operations are independent.)
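
A minimal sketch of that independence requirement (the arrays x, y and scalar a are hypothetical): the first loop has independent iterations and can fill the vector lanes; the second carries a dependence and cannot be vectorized as written.

      ! Independent iterations: each element can be computed in a separate vector lane.
      do i = 1, n
        y(i) = a*x(i) + y(i)
      end do

      ! Loop-carried dependence: y(i) needs y(i-1) from the previous iteration,
      ! so the compiler must execute this loop serially (scalar).
      do i = 2, n
        y(i) = a*x(i) + y(i-1)
      end do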

12 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability: 2333 GFLOP/s double precision with vectors and FMA; 1166 GFLOP/s double precision with vectors, no FMA. Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008. (FMA instructions perform a floating point multiply and a floating point addition as a single operation when the compiler can detect expressions of the form result = A*B+C.)
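
A hedged sketch of an FMA-friendly statement (array names hypothetical); with the Intel compiler and a KNL architecture flag such as -xMIC-AVX512, each element of a loop in this form can map to a single vectorized fused multiply-add:

      do i = 1, n
        d(i) = a(i)*b(i) + c(i)    ! multiply and add fuse into one FMA per element
      end do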

13 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability: 2333 GFLOP/s double precision with vectors and FMA; 1166 GFLOP/s double precision with vectors, no FMA; 145 GFLOP/s double precision with neither vectors nor FMA. Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008.

14 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability. If intensity is not enough to satisfy the demand for data by the processor's floating point units, the application is memory bound: 128 GB main memory (DRAM). Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008.

15 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability. If intensity is not enough to satisfy the demand for data by the processor's floating point units, the application is memory bound: 128 GB main memory (DRAM); 16 GB high bandwidth memory (MCDRAM). Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008. Among the ways to use MCDRAM: configure KNL to use MCDRAM as a direct mapped cache, or configure KNL to use MCDRAM as a NUMA region: numactl --membind=0 ./executable (main memory); numactl --membind=1 ./executable (MCDRAM).
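
In flat/NUMA mode, individual arrays can also be placed in MCDRAM from Fortran; a minimal sketch assuming the Intel compiler's FASTMEM attribute and the memkind library are available (the array name state and its bounds are hypothetical):

      real(8), allocatable :: state(:,:,:)
      !dir$ attributes fastmem :: state        ! request allocation from MCDRAM (flat mode)
      allocate(state(nvec,nlev,ncols))         ! behavior when MCDRAM is exhausted depends on the memkind policy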

16 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability. If intensity is not enough to satisfy the demand for data by the processor's floating point units, the application is memory bound: 128 GB main memory (DRAM); 16 GB high bandwidth memory (MCDRAM). KNL is nominally 3 TFLOP/s, but to saturate full floating point capability you need: 0.35 flops per byte from L1 cache; 1 flop per byte from L2 cache; 6 flops per byte from high bandwidth memory; 25 flops per byte from main memory. Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS, October 17, 2008.
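
These break-even intensities are roughly the compute ceiling divided by the bandwidth of each memory level; a hedged back-of-the-envelope check using the numbers on these slides (the ~90 GB/s DRAM bandwidth is an assumption, not stated here):

      2333 GFLOP/s / ~400 GB/s MCDRAM  ~=  6 flops per byte
      2333 GFLOP/s / ~90 GB/s DRAM     ~= 25 flops per byte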

17 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability. If intensity is not enough to satisfy the demand for data by the processor's floating point units, the application is memory bound: 128 GB main memory (DRAM); 16 GB high bandwidth memory (MCDRAM). KNL is nominally 3 TFLOP/s, but to saturate full floating point capability you need: 0.35 flops per byte from L1 cache; 1 flop per byte from L2 cache; 6 flops per byte from high bandwidth memory; 25 flops per byte from main memory. Hard to come by in real applications! NEPTUNE benefits from MCDRAM (it breaks through the DRAM ceiling) but realizes only a fraction of the MCDRAM ceiling: 36.5 GF/s from MCDRAM vs. 20.3 GF/s from DRAM for NEPTUNE E14P3L40 (U.S. Navy Global Model Prototype).

18 Optimizing for Intel Xeon Phi Roofline Model of Processor Performance Bounds application performance as a function of computational intensity. If intensity is high enough, the application is compute bound by floating point capability. If intensity is not enough to satisfy the demand for data by the processor's floating point units, the application is memory bound: 128 GB main memory (DRAM); 16 GB high bandwidth memory (MCDRAM). KNL is nominally 3 TFLOP/s, but to saturate full floating point capability you need: 0.35 flops per byte from L1 cache; 1 flop per byte from L2 cache; 6 flops per byte from high bandwidth memory; 25 flops per byte from main memory. Hard to come by in real applications! Optimizing for KNL involves increasing parallelism and locality. Increasing parallelism and parallel efficiency will raise performance for a given computational intensity; increasing locality will improve computational intensity and move the bar to the right.
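
One common locality transformation is fusing sweeps that touch the same data; a minimal sketch (the arrays q, tend1, tend2 and the step dt are hypothetical) that cuts the traffic on q in half and so raises the operational intensity of this fragment:

      ! Before: two separate sweeps stream q through the memory hierarchy twice.
      do i = 1, n
        q(i) = q(i) + dt*tend1(i)
      end do
      do i = 1, n
        q(i) = q(i) + dt*tend2(i)
      end do

      ! After: one fused sweep reads and writes q only once.
      do i = 1, n
        q(i) = q(i) + dt*(tend1(i) + tend2(i))
      end do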

19 NEPTUNE Performance on 1 Node Knights Landing NEPTUNE GFLOP/second, based on a count of 9.13 billion double precision floating point operations per time step. MPI-only. Higher is better. (Chart annotations: KNL vs. KNC 4.2x; MCDRAM vs. DRAM 2.1x; KNL vs. HSW 1.3x; KNL vs. BDW 0.9x.) 19

20 NEPTUNE Performance on 1 Node Knights Landing KNL peaks at 2 threads per core: thread competition for resources (the Q index is innermost in NEPTUNE), better ILP, running out of work? MPI-only. NEPTUNE GFLOP/second, based on a count of 9.13 billion double precision floating point operations per time step. Higher is better. (Chart annotations: KNL vs. KNC 4.2x; MCDRAM vs. DRAM 2.1x; KNL vs. HSW 1.3x; KNL vs. BDW 0.9x.) 20

21 MPAS Performance on 1 Node Knights Landing MPAS GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step. Higher is better. MPI and OpenMP, but the best time shown here was single-threaded. KNL 7250 B0, 68 cores (MCDRAM). Caveat: plot shows floating point rate, not forecast speed. (Chart x-axis: number of threads.) 21

22 MPAS Performance on 1 Node Knights Landing MPAS GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step. Higher is better. MPI and OpenMP, but the best time shown here was single-threaded. KNL 7250 B0, 68 cores (MCDRAM). (Chart annotations: arithmetic intensity; x-axis: number of threads.) 22

23 MPAS and NEPTUNE Performance on Knights Landing MPAS GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step. NEPTUNE GFLOP/second, based on a count of 9.3 billion double precision floating point operations per dynamics step. Higher is better. Caveat: plot shows floating point rate, not forecast speed. (Chart x-axis: number of threads.) 23

24 Optimizing NEPTUNE NUMA results on Blue Gene/Q (0.7 FLOP/byte) suggest opportunities for improvement from vectorization and reduced parallel overheads. Work in progress with NRL's NEPTUNE model using the Supercell configuration and workload from NGGPS testing: analyzing and restructuring based on vector reports from the compiler; adding multi-threading (OpenMP). Andreas Müller, Michal A. Kopera, Simone Marras, Lucas C. Wilcox, Tobin Isaac, Francis X. Giraldo. Strong Scaling for Numerical Weather Prediction at Petascale with the Atmospheric Model NUMA (submitted 5 Nov 2015 (v1), last revised 8 Sep 2016 (v3)).

25 Optimizing NEPTUNE (chart; higher is better) 25

26 Optimizing NEPTUNE (chart; higher is better) 26

27 Optimizing NEPTUNE Optimizations (see the sketch below):
 - Compile time specification of important loop ranges (number of variables per grid point; dimensions of an element)
 - Flattening nested loops to collapse for vectorization
 - Facilitating inlining (subroutines called from element loops in CREATE_LAPLACIAN and CREATE_RHS; matrix multiplies, need to try MKL DGEMM)
 - Splitting apart loops where part vectorizes and part doesn't
 - Fixing unaligned data (apparently still a benefit on KNL)
 - Drag forces
 - Using double precision, -fp-model precise
 - Need a larger workload
 27
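
A hedged sketch of two of these optimizations (the names ngl, nvar, nelem, rhs and src are hypothetical, not NEPTUNE source): make the short element dimensions compile-time parameters and flatten the i,j,k loops over an element into a single unit-stride loop the compiler can vectorize.

      integer, parameter :: ngl  = 4              ! points per element edge, fixed at compile time
      integer, parameter :: npts = ngl*ngl*ngl    ! flattened element size
      integer, parameter :: nvar = 5              ! variables per grid point

      do ie = 1, nelem
        do m = 1, nvar
          !$omp simd
          do ip = 1, npts                         ! single flattened, stride-1 loop over the element
            rhs(ip,m,ie) = rhs(ip,m,ie) + src(ip,m,ie)
          end do
        end do
      end do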

28 Optimizing NEPTUNE (chart; higher is better) 28

29 Optimizing NEPTUNE SPMD OpenMP**: create the threaded region very high in the call tree; improves locality, behaves more like MPI tasks; avoid starting and stopping threaded regions; avoid threading over different domain dimensions (HOMME uses this to block over elements, tracers and the vertical dimension). MPI-3 shared memory windows*: resyncing the memory model ate up any benefits; Intel MPI is hard to beat on Phi (using a shmem-like transport already?). ** Jacob Weismann Poulsen (DMI), Per Berg (DMI), Karthik Raman (Intel). HBM code modernization insights from a Xeon Phi experiment, CUG 2016, London. See also Torsten Hoefler, ETH Zurich; Brinskiy and Lubin: content/uploads/2015/06/PUM21 2 An_Introduction_to_MPI 3.pdf. * Jed Brown, To Thread or Not To Thread, NCAR Multi-core Workshop, Threads.pdf. 29
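
A minimal sketch of the SPMD OpenMP structure described above (illustrative only; omp_get_thread_num and omp_get_num_threads are standard OpenMP, while my_element_range, compute_rhs and update_solution are hypothetical placeholders):

      use omp_lib
      ...
      !$omp parallel private(tid, ie_start, ie_end, istep)
      tid = omp_get_thread_num()
      ! Each thread keeps the same element range for the whole run, like an MPI rank,
      ! which preserves first-touch locality and avoids region start/stop overhead.
      call my_element_range(tid, omp_get_num_threads(), nelem, ie_start, ie_end)
      do istep = 1, ntimesteps
        call compute_rhs(ie_start, ie_end)
        !$omp barrier                     ! synchronize only where the numerics require it
        call update_solution(ie_start, ie_end)
        !$omp barrier
      end do
      !$omp end parallel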

30 Optimizing NEPTUNE (chart; higher is better) 30

31 NUMA effects on MCDRAM? Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs). Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses? Measure traffic as the number of cache lines read/written, using uncore event counters on each EDC. *Available from Intel and coming in RHEL/CENTOS
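
One standard remedy (a hedged sketch, not the NEPTUNE code; the array q and its bounds are hypothetical) is to first-touch each array from the same thread decomposition that later computes on it, so pages are faulted into the NUMA region whose threads, and EDCs, will use them:

      real(8), allocatable :: q(:,:,:)
      allocate(q(nvec,nlev,ncols))
      !$omp parallel do schedule(static)   ! same static decomposition as the compute loops
      do j = 1, ncols
        q(:,:,j) = 0.0d0                   ! first touch maps these pages near the owning thread
      end do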

32 NUMA effects on MCDRAM? Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs). Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses? Measure traffic as the number of cache lines read/written, using uncore event counters on each EDC. Linux kernel extensions to support counting the KNL uncore in the Perf utility*; functionality being added to PAPI; application level library courtesy of Sumedh Naik, Intel:

      program NEPTUNE
        call ucore_setup
        ...
        if (irank == irank0) then
          call ucore_start_counting
        endif
      !$omp parallel firstprivate(...
        do 1, ntimesteps
          compute a time step
        enddo
      !$omp end parallel
        if (irank == irank0) then
          call ucore_stop_counting
          call ucore_tellall
        endif
        ...

*Available from Intel and coming in RHEL/CENTOS

33 NUMA effects on MCDRAM? Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs). Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses? Measure traffic as the number of cache lines read/written, using uncore event counters on each EDC. Linux kernel extensions to support counting the KNL uncore in the Perf utility*; functionality being added to PAPI; application level ucore library courtesy of Sumedh Naik, Intel.

OUTPUT FROM SUMEDH NAIK'S UCORE EVENT LIBRARY:
      DDR Reads:      DDR Reads Total:
      DDR Writes:     DDR Writes Total:
      MCDRAM Reads:   MCDRAM Reads Total:
      MCDRAM Writes:  MCDRAM Writes Total:

*Available from Intel and coming in RHEL/CENTOS

34 NUMA effects on MCDRAM? (Chart; lower is better.) Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers. Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses?

35 NUMA effects on MCDRAM? (Chart; lower is better.) Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers. Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses?

36 WRF Kernels revisited

WSM5 Microphysics (single precision):
                original    optimized   original -> optimized
  KNC           11 GF/s     154 GF/s    14x
  KNL           68 GF/s     283 GF/s    4.2x
  KNC -> KNL    6.2x        1.8x

RRTMG Radiation (double precision):
                original    optimized   original -> optimized
  KNC           10 GF/s     36 GF/s     3.6x
  KNL           36 GF/s     78 GF/s     2.2x
  KNC -> KNL    3.6x        2.2x

Michalakes, Iacono and Jessup. Optimizing Weather Model Radiative Transfer Physics for Intel's Many Integrated Core (MIC) Architecture. Parallel Processing Letters, World Scientific, Vol. 26, No. 3 (Sept. 2016). Accepted; publication date TBD. 36

37 Tools overview Capabilities surveyed: profiling; op count and FP rate; memory traffic and bandwidth; thread utilization; vector utilization; roofline analysis; execution traces.
 - Intel VTune Amplifier: profiling by routine, line, and instruction; timeseries B/W output and traffic counts; thread utilization histogram; general exploration output; B/W, CPU time, spin time, and MPI comms as a function of time.
 - Intel Software Development Emulator: only way to count FP ops on current Xeon; counts loads/stores between registers and first-level cache; other capabilities not tried.
 - Intel Advisor: routine by routine and line by line analysis; roofline analysis (in beta); several capabilities not tried.
 - MPE/Jumpshot (ANL/LANS): by routine and instrumented region; detailed Gantt plots of execution traces.
 - Empirical Roofline Toolkit (NERSC/LBL): max FP rate; bandwidth at multiple levels of the memory hierarchy; roofline analysis.
 - Linux Perf, PAPI or other: counts loads/stores between LLC and memory; other capabilities possible.
 37

38 Summary New models such as MPAS and NUMA/NEPTUNE: higher order, variable resolution, demonstrated scalability. Focus is on computational efficiency to run faster. Progress with accelerators: early Intel Knights Landing results showing 1-2% of peak. Good news: with the new MCDRAM, we aren't memory bandwidth bound. Need to improve vector utilization and other bottlenecks to reach > 10% of peak. Naval Postgraduate School's results speeding up NUMA show promise. Ongoing efforts: measurement, characterization and optimization of floating point and data movement; explore the effect of different workloads, more aggressive compiler optimizations, and reduced (single) precision. 38

39 Acknowledgements Intel: Mike Greenfield, Ruchira Sasanka, Indraneil Gokhale, Larry Meadows, Zakhar Matveev, Dmitry Petunin, Dmitry Dontsov, Sumedh Naik, Karthik Raman. NRL and NPS: Alex Reinecke, Kevin Viner, Jim Doyle, Sasa Gabersek, Andreas Mueller (now at ECMWF), Frank Giraldo. NCAR: Bill Skamarock, Michael Duda, John Dennis, Chris Kerr. NOAA: Mark Govett, Jim Rosinski, Tom Henderson (now at Spire). Google search, which usually leads to something John McCalpin has already figured out... 39
