
Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi coprocessor

Manel Fernández, Roger Philp, Richard Paul
Bayncore Ltd.
HPCKP'15

Moore's law: how to use so many transistors?

Single thread performance is limited
- Clock frequency constraints
- Near parallelism is harder to expose: instruction level parallelism (ILP)

Hint: exploit distant parallelism
- Data level parallelism (DLP)
- Task level parallelism (TLP)
- It is the programmer's responsibility to expose DLP/TLP parallelism

[Figure: "Transistor Count and Moore's Law - 2011" by Wgsimon, http://en.wikipedia.org/wiki/moore's_law]

The multi- and many-core era: Intel solutions for HPC

Multi-core
- C/C++/Fortran, OMP/MPI/Cilk+/TBB
- Bootable, native execution model
- Up to 18 cores, 3 GHz, 36 threads
- Up to 768 GB, 68 GB/s, 432 GFLOP/s DP
- 256-bit SIMD, FMA, gather (AVX2)
- Targeted at general purpose applications: single thread performance (ILP), memory capacity

Many integrated core (MIC)
- C/C++/Fortran, OMP/MPI/Cilk+/TBB
- PCIe, native and offload execution models
- Up to 61 cores, 1.2 GHz, 244 threads
- Up to 16 GB, 352 GB/s, 1.2 TFLOP/s DP
- 512-bit SIMD, FMA, gather/scatter, EMU (IMCI)
- Targeted at highly parallel applications: high parallelism (DLP, TLP), high memory bandwidth

How to enable parallelism with standard methods
- Intel Parallel Studio XE 2015 tool suite
- Source code + annotations (OMP, MPI, compiler directives)
- Compiler, libraries, parallel programming models
- Single programming model for all your code

Characterizing Polyhedron benchmark suite
- Windows 8
- Intel Core i7-4500U (0,1)(2,3)
- Intel Fortran Compiler 15.0.1.14 [/O3 /fp:fast=2 /align:array64byte /Qipo /QxHost]

Auto-vectorization effectiveness
[Chart: elapsed time speedup (higher is better) vs. the non-vectorized serial version, comparing /Qvec- and /Qvec+]

Auto-parallelization effectiveness
[Chart: elapsed time speedup (higher is better) vs. the serial version, comparing /Qparallel-, /Qparallel+ (default, 4t {0:3}{0:3}{0:3}{0:3}), and /Qparallel+ (compact, 2t {0:1}{2:3})]

Memory bandwidth requirements
[Chart: memory bandwidth (read/write, GB/s) and speedup vs. the serial version, for /Qvec- and /Qparallel (compact, 2t {0:1}{2:3})]

Observations: implicit vs. explicit parallelism

The compiler toolchain is limited in exposing implicit parallelism
- Good for ILP (the microarchitecture is supposed to help)
- Not so bad for DLP: exploited by use of vectors (SIMD), but potentially missing opportunities due to aliasing, etc.
- Disappointing for TLP: hyper-threading is rarely useful for HPC applications

Explicit parallelism relies on the programmer
- DLP: compiler directives, array notation, vector classes, intrinsics
- TLP: multi- and many-cores (OpenMP, Cilk+, TBB)
- Distributed systems with standard methods: clusters, MPI models

Exposing DLP/TLP parallelism

Simplest method: compiler directives (aka "pragmas")

Exposing DLP: vectorization/SIMD pragmas
- #pragma vector {args}: vectorization hints
- #pragma ivdep: ignore assumed vector dependencies
- #pragma simd [clauses]: enforces vectorization with hints

Exposing TLP: OMP pragmas
- #pragma omp parallel for: parallelizes iterations of a given loop
- #pragma omp atomic/critical: thread synchronization

Runtime performance tuning for threaded applications
- OMP_NUM_THREADS: number of threads to run
- OMP_SCHEDULE: how work is distributed among threads
- KMP_AFFINITY: how threads are bound to physical PUs
- KMP_PLACE_THREADS: easy thread placement (Intel Xeon Phi only)
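The case studies that follow are Fortran, so here is a minimal, purely illustrative sketch (not from the slides) of the equivalent Fortran directives on a hypothetical loop; the subroutine and array names are invented for the example.

! Hypothetical example: exposing DLP and TLP with directives in Fortran.
subroutine expose_parallelism(n, a, b, c)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: a(n), b(n)
  real(8), intent(inout) :: c(n)
  integer :: i

  ! DLP: tell the compiler to ignore assumed vector dependences in this loop
  !dir$ ivdep
  do i = 1, n
     c(i) = c(i) + 2.0d0 * a(i)
  end do

  ! TLP: distribute the loop iterations across OpenMP threads
  !$omp parallel do
  do i = 1, n
     c(i) = c(i) + b(i)
  end do
  !$omp end parallel do
end subroutine expose_parallelism

At run time, OMP_NUM_THREADS, OMP_SCHEDULE and KMP_AFFINITY would then control how many threads execute the parallel loop, how its iterations are scheduled, and where the threads are pinned.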

Polyhedron/gas_dyn2
- Linux RHEL 6.6
- Intel Xeon E5-4650L, 2 socket x 8 cores x 2 HTs
- Intel Xeon Phi 7120A, 61 cores x 4 threads
- Intel Fortran Compiler 15.0.1.14 [-O3 -fp-model fast=2 -align array64byte -ipo -xhost/-mmic]

Serial version
- Continuity equation solver that models the flow of a gas in 1D
- Two main hotspots: EOS (66%) and CHOZDT (33%)
- Implicit loops using Fortran 90 array notation
- Both hotspots perfectly fused + vectorized
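To make "implicit loops" concrete, the sketch below (hypothetical names, not the actual gas_dyn2 source) shows the array-notation style: each whole-array expression is an implicit loop that the compiler can fuse with its neighbours and vectorize.

! Illustrative only: EOS/CHOZDT-like kernels written with Fortran 90 array syntax.
subroutine eos_chozdt_sketch(n, gamma, dx, rho, vel, ener, pres, dt)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: gamma, dx
  real(8), intent(in)  :: rho(n), vel(n), ener(n)
  real(8), intent(out) :: pres(n)
  real(8), intent(out) :: dt
  real(8) :: sndspd(n)

  pres   = (gamma - 1.0d0) * rho * ener         ! EOS-like update (implicit loop)
  sndspd = sqrt(gamma * pres / rho)             ! derived sound speed
  dt     = minval(dx / (abs(vel) + sndspd))     ! CHOZDT-like min reduction
end subroutine eos_chozdt_sketch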

OMP workshare construct
- Workshare currently not working (not parallelized)
- Reduction loop in CHOZDT does not even vectorize
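For reference, wrapping array assignments like the ones above in a workshare construct is the standard OpenMP way to thread array syntax; a minimal sketch (same hypothetical names as before) follows. As the slide notes, the compiler used in this study leaves it serial.

! Sketch of !$omp workshare around array syntax (hypothetical names).
subroutine eos_workshare_sketch(n, gamma, rho, ener, pres, sndspd)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: gamma, rho(n), ener(n)
  real(8), intent(out) :: pres(n), sndspd(n)

  !$omp parallel
  !$omp workshare
  pres   = (gamma - 1.0d0) * rho * ener
  sndspd = sqrt(gamma * pres / rho)
  !$omp end workshare
  !$omp end parallel
end subroutine eos_workshare_sketch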

OMP parallel loop (CHOZDT)
- The Intel compiler does not support OMP 4.0 user-defined reductions
- We have to write the parallel reduction ourselves!
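A minimal sketch of such a hand-written reduction, assuming the kernel boils down to a global minimum of a per-cell time step (hypothetical names; the slide's own code is not reproduced in this transcription): each thread accumulates a private partial minimum and the partial results are merged in a critical section.

! Hand-written parallel min reduction (illustrative only).
subroutine chozdt_parallel_sketch(n, dx, vel, sndspd, dt)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: dx, vel(n), sndspd(n)
  real(8), intent(out) :: dt
  real(8) :: dt_local
  integer :: i

  dt = huge(dt)
  !$omp parallel private(i, dt_local)
  dt_local = huge(dt_local)
  !$omp do
  do i = 1, n
     dt_local = min(dt_local, dx / (abs(vel(i)) + sndspd(i)))
  end do
  !$omp end do nowait
  !$omp critical
  dt = min(dt, dt_local)                 ! merge per-thread partial minima
  !$omp end critical
  !$omp end parallel
end subroutine chozdt_parallel_sketch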

OMP parallel loop (EOS)
- Straightforward transformation
- Streaming stores to avoid wasting some read bandwidth
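One way that transformation could look (again with the hypothetical names used above): an explicit loop threaded with OpenMP, with Intel Fortran's !dir$ vector nontemporal requesting streaming stores so the write-only output arrays are not first read into cache.

! Illustrative only: threaded EOS-like loop with streaming (non-temporal) stores.
subroutine eos_parallel_sketch(n, gamma, rho, ener, pres, sndspd)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: gamma, rho(n), ener(n)
  real(8), intent(out) :: pres(n), sndspd(n)
  integer :: i

  !$omp parallel do
  !dir$ vector nontemporal(pres, sndspd)
  do i = 1, n
     pres(i)   = (gamma - 1.0d0) * rho(i) * ener(i)
     sndspd(i) = sqrt(gamma * pres(i) / rho(i))
  end do
  !$omp end parallel do
end subroutine eos_parallel_sketch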

Performance results
[Chart: Polyhedron/gas_dyn2 speed (steps/s, higher is better) and bandwidth (GB/s) evolution, 3M nodes, 20K steps, for the Serial and "OMP loops + streaming stores" versions on Xeon 1t, Xeon Phi 1t, Xeon 1t/c, Xeon Phi 1t/c, Xeon Phi 2t/c, Xeon Phi 4t/c]
Intel Xeon Phi speedup vs. Intel Xeon: 5.8x (serial), 4.4x (parallel)

Polyhedron/linpk
- Linux RHEL 6.6
- Intel Xeon E5-4650L, 2 socket x 8 cores x 2 HTs
- Intel Xeon Phi 7120A, 61 cores x 4 threads
- Intel Fortran Compiler 15.0.1.14 [-O3 -fp-model fast=2 -align array64byte -ipo -xhost/-mmic]

Linpk hotspot: DGEFA
- Matrix decomposition with partial pivoting by Gaussian elimination
- Invokes BLAS routines DAXPY (98%), IDAMAX, DSCAL (all are inlined)

OMP parallel loop
- Inner i loop properly auto-vectorized by the compiler
- Middle j loop can be parallelized
- Outer k loop (diagonal) has dependencies between iterations
- Application is memory bound
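A simplified sketch of the loop nest being described (illustration only: no partial pivoting, invented names, not the actual DGEFA source): the outer k loop stays serial, the middle j loop is threaded, and the inner i loop is the DAXPY-style update that the compiler vectorizes.

! Simplified column-oriented Gaussian elimination (no pivoting).
subroutine dgefa_sketch(n, a)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: a(n, n)
  real(8) :: t
  integer :: i, j, k

  do k = 1, n - 1                           ! outer k loop: serial, loop-carried deps
     a(k+1:n, k) = -a(k+1:n, k) / a(k, k)   ! form multipliers (DSCAL-like)
     !$omp parallel do private(j, i, t)
     do j = k + 1, n                        ! middle j loop: parallelized
        t = a(k, j)
        do i = k + 1, n                     ! inner i loop: DAXPY-like, vectorizes
           a(i, j) = a(i, j) + t * a(i, k)
        end do
     end do
     !$omp end parallel do
  end do
end subroutine dgefa_sketch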

Performance results
[Chart: Polyhedron/linpk speed (million elements/s, higher is better) and bandwidth (GB/s) evolution, 7500 x 7500 elements, for the Serial and OMP loop versions on Xeon 1t, Xeon Phi 1t, Xeon 1t/c, Xeon Phi 1t/c, Xeon Phi 2t/c, Xeon Phi 4t/c]
Intel Xeon Phi speedup vs. Intel Xeon: 3x (serial), 3.6x (parallel)

Summary and conclusions

Programmers are responsible for exposing DLP/TLP parallelism to fully exploit the available hardware in HPC domains

Today's Intel HPC solutions make it easy to expose DLP/TLP parallelism
- Intel Parallel Studio XE 2015 tool suite
- Simple methods (compiler pragmas, OMP, libraries)
- Same source code for multi- and many-core processors

Intel Xeon Phi coprocessors are targeted at highly parallel applications
- Significant speedups achieved in bandwidth-bound applications
- Runtime tuning is key to achieving the best performance

Future work
- Experiment with other benchmarks (not only from Polyhedron)
- Non-memory-bound applications, native/offload execution models
- Extend parallelization to distributed systems with MPI