PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30th July 2014. Iain Bethune (ibethune@epcc.ed.ac.uk)



Outline
- Xeon Phi Overview
- Porting CP2K to Xeon Phi
- Performance Results
- Lessons Learned
- Further Reading

Xeon Phi Overview

Terminology:
- Project Larrabee: Intel research project that aimed to develop a GPGPU architecture; shelved in 2010
- Many Integrated Core (MIC): scalable architecture for many-core computing based on x86 cores
- Intel Xeon Phi(TM): product name, e.g. Xeon Phi 5110P
- Code names:
  - Knights Ferry: MIC prototype (2010), 45 nm
  - Knights Corner: current generation (2013), 22 nm
  - Knights Landing: announced (2015), 14 nm

Xeon Phi Overview

Xeon Phi 5110P:
- Co-processor attached via PCIe bus
- 60 cores @ 1.053 GHz
- 4 virtual threads per core
- 512-bit (8 doubles) vector unit per core
- Coherent L2 cache
- 8 GB main memory
- 1.011 TFLOP/s peak (double precision) per coprocessor

Xeon Phi Overview

Intel Xeon Phi Architecture Overview (image: Intel, ARCHER Xeon Phi Training Course):
- Cores: 61 in-order cores at 1.1 GHz, each supporting 4 threads
- 512-bit Vector Processing Unit with 32 native registers per core
- High-speed bi-directional ring interconnect
- Fully coherent L2 cache
- 8 memory controllers, 16-channel GDDR5
- PCIe Gen2 interface
- Reliability features: parity on L1 cache, ECC on memory, CRC and CAP on memory I/O

Xeon Phi Overview

Operation Modes:
- Native Mode: the Xeon Phi runs a cut-down Linux; SSH in and launch the (parallel) program. The case study later used native mode only.
- Offload Mode: the main program thread runs on the host CPU; marked-up regions are offloaded for execution on the Xeon Phi.
- OpenCL: the main program runs on the CPU and launches OpenCL kernels on the Xeon Phi.

Xeon Phi Overview

Programming Models:
- Intel Compiler: standard C, C++, Fortran; required to generate vector instructions
- MPI: in native mode on a single Xeon Phi (or across several); in offload mode between host CPUs; a single communicator can span host and MIC
- OpenMP: within a kernel in offload mode; in native mode, possibly hybrid with MPI
- Also: Intel TBB, Cilk+, OpenCL, OpenACC

Porting CP2K to Xeon Phi

"CP2K is a program to perform atomistic and molecular simulations of solid state, liquid, molecular, and biological systems. It provides a general framework for different methods such as e.g., density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW) and classical pair and many-body potentials." - from www.cp2k.org (2004!)

Porting CP2K to Xeon Phi

Many force models:
- Classical
- DFT (GPW)
- Hybrid Hartree-Fock
- LS-DFT
- post-HF (MP2, RPA)

Simulation tools:
- MD (various ensembles)
- Monte Carlo
- Minimisation (GEO/CELL_OPT)
- Properties (spectra, excitations, ...)

Open Source:
- GPL, www.cp2k.org
- ~1M lines of code, ~2 commits per day
- ~10 core developers
- 3rd most used code on ARCHER

Porting CP2K to Xeon Phi

Work done thanks to funding from PRACE.

Porting to the Intel Compiler:
- Just add -mmic!
- Fixed several bugs (in CP2K and in the Intel compiler / MKL)
- Recommend using ifort 14.1 and MKL 11.1

Using MIC-optimised libraries:
- MKL up to 5x faster than FFTW 3.3.3 for 1D FFT (n=256)
- MKL up to 3x faster for 3D FFT (n=128)

Porting CP2K to Xeon Phi

Task placement: how to place 240 virtual threads?
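The placement question is usually answered with affinity environment variables. A hypothetical launch for one card, splitting the 240 virtual threads as 8 MPI ranks x 30 OpenMP threads; the variable names are specific to Intel MPI and the Intel OpenMP runtime, and the values are illustrative, not a recommendation from the talk:

```shell
# 8 MPI ranks x 30 OpenMP threads = 240 virtual threads on a 60-core card
export OMP_NUM_THREADS=30
export KMP_AFFINITY=balanced,granularity=thread  # spread threads evenly over cores
export I_MPI_PIN_DOMAIN=omp                      # one contiguous thread domain per rank
mpirun -n 8 ./cp2k.psmp -i input.inp
```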

Performance Results

Initial port (no optimisations):
- Langasite relaxation (DFT) benchmark
- 8 MPI ranks x 16 OpenMP threads (128 threads)
- Around 8x slower than an 8-core Sandy Bridge CPU

Optimisation:
- Main bottleneck was poor OpenMP scaling
- Parallelised two expensive routines
- Total memory usage was also a constraint
- Finally: still 5.4x slower than the CPU!

Lessons Learned

Porting is relatively easy:
- If the source code is portable and already parallelised
- Native mode provides an easy way in to using the Xeon Phi
- But finding enough parallelism to fill 240 threads within a memory limit of 8 GB is a hard strong-scaling problem

A different approach from multi-core, memory-rich on-node parallelism:
- Fine-grained parallelism is required (e.g. data parallel)
- Could scale across multiple MICs to harness more memory, but this comes with additional overhead

Lessons Learned

Offload mode:
- Complex logic and function calls are much slower on a KNC core than on a Xeon
- For Knights Corner, it might be better to use offload mode and only run suitable kernels on the Xeon Phi

Serial optimisation:
- Auto-vectorisation is good, but it is sometimes necessary to align and/or pad arrays
- P54C cores are slow, and memory bandwidth is limited; expect this to improve on KNL

Summary

- Xeon Phi offers promising performance gains: comparable to current GPUs; works best with data-parallel codes with a large amount of fine-grained parallelism
- Easy to use, thanks to support for existing common programming models; but a highly parallel algorithm is still needed
- Knights Landing expected in 2015: self-hosting; higher memory bandwidth (stacked memory) + 384 GB DDR4; AVX-512 vectorisation and Atom cores

Further Reading

- Programming the Xeon Phi (ARCHER training course materials): https://www.archer.ac.uk/training/course-material/2014/06/XeonPhi_Bristol/
- Evaluating CP2K on Exascale Hardware: Intel Xeon Phi: http://www.prace-ri.eu/img/pdf/wp152.pdf
- Optimising CP2K for the Intel Xeon Phi: http://www.prace-ri.eu/img/pdf/wp140.pdf
- Intel Xeon Phi Webinar: https://software.intel.com/en-us/articles/intel-xeon-phi-webinar