IFS RAPS14 benchmark on 2nd generation Intel Xeon Phi processor

D.Sc. Mikko Byckling
17th Workshop on High Performance Computing in Meteorology
October 24th 2016, Reading, UK

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Contents
- Intel Xeon Phi processor codenamed Knights Landing (KNL) overview
  - Architecture, cluster and memory modes
- IFS RAPS14 benchmark overview
- IFS RAPS14/TL159 benchmarks
  - Notes on optimization effort
  - Time to solution, application profile, energy to solution

Intel Xeon Phi processor overview

Intel Xeon Phi architecture
- Instruction set architecture: Intel Xeon processor compatible, adds Intel AVX-512
- On-package memory (MCDRAM): 16GB, up to 490 GB/s STREAM TRIAD
- Platform memory: up to 384GB (6 channels of DDR4 2400)
- I/O: 36 lanes of PCIe* Gen3 (x16, x16, x4)
- Fixed bottlenecks: 2D mesh architecture, out-of-order cores, 3X single-thread performance vs. KNC
- Tile (up to 36): two cores with 2 VPUs each, a hub and 1MB of L2
- Cores: enhanced Intel Atom cores based on the Silvermont microarchitecture
- Bi-directional tile connections (same bit width as the Xeon core interconnect)
- Diagram legend: EDC (embedded DRAM controller), IMC (integrated memory controller), IIO (integrated I/O controller)

KNL cluster modes
- All-2-All: no affinity between tile, directory and memory
- Quadrant: chip divided into 4 virtual, software-transparent quadrants; affinity between directory and memory
- Sub-NUMA clustering: chip visible to the OS as a 4S Xeon; affinity between tile, directory and memory

Cluster modes are BIOS-selectable.

KNL on-package memory modes
- Cache mode: MCDRAM as an L3 cache between CPU and DDR (HW managed); 16GB MCDRAM, direct-mapped, 64B cache lines
- Flat mode: application manages the use of MCDRAM and DDR; 16GB MCDRAM and DRAM share the physical address space
- Hybrid mode: MCDRAM both as cache and as application-managed memory; split options 25/75% or 50/50% (8 or 4 GB MCDRAM as cache, 8 or 12GB MCDRAM as memory)

Memory modes are BIOS-selectable.
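The hybrid-mode split options can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name `hybrid_split` is made up for this example, and the 16 GB capacity and the 25% / 50% cache fractions are taken from the slide above.

```python
# Hybrid mode: part of the 16 GB MCDRAM acts as cache, the rest is
# application-managed (flat) memory. Fractions are those listed on the slide.
MCDRAM_GB = 16

def hybrid_split(cache_fraction):
    """Return (cache_gb, flat_gb) for a given cache fraction of MCDRAM."""
    cache_gb = MCDRAM_GB * cache_fraction
    return cache_gb, MCDRAM_GB - cache_gb

for fraction in (0.25, 0.50):
    cache_gb, flat_gb = hybrid_split(fraction)
    print(f"{int(fraction * 100)}% cache: {cache_gb:.0f} GB cache, {flat_gb:.0f} GB flat")
```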

IFS RAPS14 benchmark

- In development since the early 90s
- Better performance measure than a Linpack run for ECMWF's numerical weather prediction applications
- IFS RAPS14 is based on IFS cycle 41R2
- Includes TL159, TCO639 and TCO1279 models
- Bitwise reproducible results expected
  - With RAPS14, issues either with the compiler or MKL seemed to prevent reproducibility (even on a Xeon)
  - Opted to get a performance baseline instead

IFS RAPS14 benchmark: statistics
- 2.4M lines of code
- Nearly flat profile
- Well parallelized with MPI and OpenMP
- RAPS14/TL159: ~6% in MPI library, ~90% in OpenMP parallel regions
- RAPS14/TL159: 48.7 Gflop/s (~4% of peak of a dual Intel Xeon 2697v4)

[Figure: lines of code by language. Categories: Fortran 90, Fortran 77, C, C header. Values shown: 128 399, 155 931, 216 707, 1 839 561.]
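The "~4% of peak" figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes 16 double-precision flops per cycle per core (two AVX2 FMA units on 4-wide vectors) and the nominal 2.3 GHz clock; neither assumption is stated in the slides, and the real sustained clock under AVX load may differ.

```python
# Theoretical double-precision peak of the dual-socket Xeon test system.
CORES = 36            # 2 sockets x 18 cores (from the test-system slide)
BASE_GHZ = 2.3        # nominal clock (from the test-system slide)
FLOPS_PER_CYCLE = 16  # ASSUMPTION: 2 FMA units x 4 doubles x 2 flops (AVX2)

def peak_gflops(cores, ghz, flops_per_cycle):
    """Peak rate in Gflop/s for the given core count and clock."""
    return cores * ghz * flops_per_cycle

peak = peak_gflops(CORES, BASE_GHZ, FLOPS_PER_CYCLE)
measured = 48.7       # Gflop/s reported for RAPS14/TL159
fraction = measured / peak
print(f"peak = {peak:.1f} Gflop/s, achieved = {100 * fraction:.1f}% of peak")
```

Under these assumptions the peak is about 1325 Gflop/s, so 48.7 Gflop/s is a little under 4% of it, consistent with the slide.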

IFS RAPS14/TL159 benchmarks

KNL runtime configuration
- Optimal runtime configuration found with a search through the parameter space of MPI ranks, OpenMP threads, NPROMA and NRPROMA
- Optimal parameters for KNL are rather different from the optimal parameters for Xeon
- 2MB pages and the tbbmalloc_proxy library are beneficial for both Xeon and KNL
  - For KNL the performance impact is more pronounced, up to ~15-20%
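The slides do not describe how the parameter search was carried out. As a rough illustration only, the sketch below enumerates MPI task / OpenMP thread layouts that exactly fill a node; the function name `candidate_layouts` and the option lists are hypothetical. In a real search each layout would additionally be combined with NPROMA and NRPROMA values and timed.

```python
def candidate_layouts(cores, threads_per_core_options, threads_per_task_options):
    """List (mpi_tasks, threads_per_task, threads_per_core) layouts that
    use every selected hardware thread exactly once."""
    layouts = []
    for threads_per_core in threads_per_core_options:
        total_threads = cores * threads_per_core
        for threads_per_task in threads_per_task_options:
            # Keep only layouts where tasks divide the threads evenly.
            if total_threads % threads_per_task == 0:
                mpi_tasks = total_threads // threads_per_task
                layouts.append((mpi_tasks, threads_per_task, threads_per_core))
    return layouts

# Hypothetical option lists for a 68-core Xeon Phi 7250.
knl_layouts = candidate_layouts(68, (1, 2, 4), (1, 2, 4, 8))
for mpi_tasks, threads_per_task, threads_per_core in knl_layouts:
    print(mpi_tasks, "tasks x", threads_per_task, "threads,",
          threads_per_core, "thread(s) per core")
```

The configuration reported for the Xeon Phi 7250 (34 MPI tasks with 4 threads each) corresponds to the layout with 2 hardware threads per core in this enumeration.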

KNL code optimization effort
- AVX-512 vectorization enabled with the -xmic-avx512 compiler flag
- With -O3 the compiler was too aggressive on optimizations for some routines, so -O2 was used instead
- In some cases the -vec-threshold0 flag was used to change compiler heuristics and ensure vectorization
- Due to assumed dependencies the compiler failed to vectorize some of the key hotspots
  - Added !DIR$ IVDEP to ~10 routines, one loop rewritten
- Less than 100 lines of code modified in total!

Benchmark test systems*

| | Intel Xeon 2697v4 | Intel Xeon Phi 7250 |
|---|---|---|
| Sockets | 2 sockets, 18 cores/socket | 1 socket |
| Cores / threads | 36 cores, 72 threads | 68 cores, 272 threads |
| Clock | 2.3 GHz | 1.4 GHz |
| Memory | DDR4 64GB 2400 MHz | DDR4 96GB 2400 MHz, 16GB of MCDRAM |
| TDP | 145W/socket, 290W in total | 215W |

* For a full list of configuration options, see Single node configuration

IFS RAPS14 runtime configuration*

| | Intel Xeon 2697v4 | Intel Xeon Phi 7250 |
|---|---|---|
| MPI tasks | 18 | 34 |
| Threads per task | 2 | 4 |
| NPROMA | 16 | 48 |
| NRPROMA | 4 | 8 |

Intel Xeon Phi 7250: Quadrant cluster mode, cache memory mode.

* For a full list of configuration options, see Single node configuration

Time to solution
- Best known compiler / runtime settings and the same source code used for both systems
- 24h run: wall-clock time for the whole run as given by IFS RAPS14
- 24 timesteps: total wall-clock time for 16 regular time steps and 8 radiation time steps

RAPS14/TL159, 24h, time (s):

| | 2S Intel Xeon 2697v4 | Intel Xeon Phi 7250 | Xeon Phi vs. Xeon |
|---|---|---|---|
| Full 24h run | 46.67 | 41.98 | 1.11x faster |
| 24 timesteps | 43.23 | 36.3 | 1.19x faster |

For configuration details, see Single node configuration.
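The quoted speedups follow directly from the wall-clock times on this slide. A minimal check, using only those four timings (the helper name `speedup` is just for illustration):

```python
def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s

# Wall-clock times in seconds: (2S Xeon 2697v4, Xeon Phi 7250)
full_run = speedup(46.67, 41.98)
time_steps = speedup(43.23, 36.3)

print(f"Full 24h run: {full_run:.2f}x")
print(f"24 timesteps: {time_steps:.2f}x")
```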

Time to solution, time step breakdown

[Figure: RAPS14/TL159, 24h. Time (s) spent in "Initialization (partly serial) and IO (serial)" and in "Time step", 2S Intel Xeon 2697v4 vs. Intel Xeon Phi 7250.]

For configuration details, see Single node configuration.

Time to solution, main hotspots

Intel Xeon 2697v4:

| Function | CPU time | CPU time / thread | Instr. retired | CPI |
|---|---|---|---|---|
| cloudsc | 101.82 | 2.83 | 4.02E+11 | 0.68 |
| radlswr | 90.12 | 2.50 | 1.56E+11 | 1.57 |
| [libiomp5.so] | 73.37 | 2.04 | 2.07E+11 | 0.98 |
| srtm_spcvrt_mcica | 69.04 | 1.92 | 3.42E+11 | 0.54 |
| cpg | 65.30 | 1.81 | 4.75E+10 | 3.73 |
| [libmkl_avx2.so] | 47.38 | 1.32 | 1.98E+11 | 0.64 |
| laitri | 46.59 | 1.29 | 7.22E+10 | 1.75 |
| laitli | 43.96 | 1.22 | 4.63E+10 | 2.54 |
| rrtm_rtrn1a_140gp_mcica | 39.98 | 1.11 | 1.12E+11 | 0.96 |
| intel_avx_rep_memset | 39.86 | 1.11 | 6.06E+10 | 1.76 |

Intel Xeon Phi 7250:

| Function | CPU time | CPU time / thread | Instr. retired | CPI |
|---|---|---|---|---|
| cloudsc | 407.72 | 3.00 | 2.10E+11 | 2.86 |
| [libmpi.so.12.0] | 298.86 | 2.20 | 2.40E+11 | 1.86 |
| srtm_spcvrt_mcica | 246.48 | 1.81 | 1.41E+11 | 2.61 |
| radlswr | 181.38 | 1.33 | 8.35E+10 | 3.25 |
| [libmkl_avx512_mic.so] | 143.68 | 1.06 | 6.90E+10 | 3.00 |
| srtm_reftra | 124.26 | 0.91 | 7.32E+10 | 2.56 |
| intel_mic_avx512f_memcpy | 119.63 | 0.88 | 4.10E+10 | 4.42 |
| cloudvar | 111.65 | 0.82 | 8.29E+10 | 1.99 |
| laitri | 108.33 | 0.80 | 3.40E+10 | 4.71 |
| mcica_cld_gen | 108.11 | 0.79 | 2.99E+10 | 5.45 |

For configuration details, see Single node configuration.
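The slide does not say how the "CPU time / thread" column was derived. The numbers are consistent with dividing the total CPU time by the number of OpenMP threads in the run (MPI tasks times threads per task); the sketch below checks that reading, which is an inference from the data rather than something the source states.

```python
def cpu_time_per_thread(cpu_time, mpi_tasks, threads_per_task):
    """Total CPU time divided by the number of threads in the run."""
    return cpu_time / (mpi_tasks * threads_per_task)

# cloudsc, the top hotspot on both systems, with the run layouts from
# the runtime-configuration slide.
xeon_cloudsc = cpu_time_per_thread(101.82, mpi_tasks=18, threads_per_task=2)
knl_cloudsc = cpu_time_per_thread(407.72, mpi_tasks=34, threads_per_task=4)

print(f"Xeon cloudsc per thread:     {xeon_cloudsc:.2f}")
print(f"Xeon Phi cloudsc per thread: {knl_cloudsc:.2f}")
```

Read this way, the per-thread cost of cloudsc is similar on the two systems even though the total CPU time on the Xeon Phi is about four times higher, because the Xeon Phi run uses 136 threads against 36.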

Time to solution, memory bandwidth

| Bandwidth (GB/sec) | Intel Xeon 2697v4 | Intel Xeon Phi 7250 |
|---|---|---|
| Peak | 110-120 | 340-360 |
| Average | 80-90 | 200-250 |

For configuration details, see Single node configuration.

Energy to solution

[Figure: power (W) over time (s) for Intel Xeon 2697v4 and Intel Xeon Phi 7250; chart annotations mark 41.98 s and 36.3 s.]

- Intel Xeon 2697v4: average power consumption 420W
- Intel Xeon Phi 7250: average power consumption 353W

For configuration details, see Single node configuration.

KNL performance summary

[Figure: RAPS14/TL159, 24 timesteps. Relative performance and performance / Watt (higher is better), 2S Intel Xeon 2697v4 vs. Intel Xeon Phi 7250, time steps only.]

- Performance per time step: up to 1.19X faster
- Performance / Watt: up to 1.42X more energy efficient

For configuration details, see Single node configuration.
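The performance-per-Watt figure can be approximated from numbers given elsewhere in the deck. The sketch assumes that energy to solution is simply average power multiplied by the 24-timestep wall-clock time; the configuration notes describe a more detailed matching of power samples to timed code segments, so this is only an estimate.

```python
def energy_joules(avg_power_w, time_s):
    """Energy to solution, assuming constant average power."""
    return avg_power_w * time_s

# Average power from the energy slide, times from the 24-timestep results.
xeon_energy = energy_joules(420, 43.23)
knl_energy = energy_joules(353, 36.3)

# Performance per Watt is inversely proportional to energy to solution.
efficiency_gain = xeon_energy / knl_energy
print(f"Xeon energy:     {xeon_energy:.0f} J")
print(f"Xeon Phi energy: {knl_energy:.0f} J")
print(f"Xeon Phi performance / Watt advantage: {efficiency_gain:.2f}x")
```

With these inputs the ratio comes out close to the 1.42X shown on the slide.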

Conclusions
- Intel Xeon Phi processor offers better performance and energy efficiency while maintaining an established codebase
- Performing serial IO or scalar operations should be avoided
- Code optimizations benefit Intel Xeon processors as well
- Future work: bit reproducible results, further multi-node experiments with TCO639

Backup

KNL performance summary

[Figure: RAPS14/TL159, 24h run. Relative performance and performance / Watt (higher is better), 2S Intel Xeon 2697v4 vs. Intel Xeon Phi 7250.]

| Intel Xeon Phi 7250 vs. 2S Intel Xeon 2697v4 | Performance | Performance / Watt |
|---|---|---|
| Per time step | Up to 1.19X faster | Up to 1.42X more energy efficient |
| Full 24h run | Up to 1.11X faster | Up to 1.39X more energy efficient |

For configuration details, see Single node configuration.

Preliminary: TCO639 node scaling

Results computed on ECMWF's Cray* XC40* KNL partition.

RAPS14/TCO639, 5h run, node scaling (16 tasks, 8 threads per node):

| Number of Intel Xeon Phi 7210 nodes | Time (s) | Speedup |
|---|---|---|
| 10 | 234.74 | |
| 12 | 204.62 | 1.15x |
| 16 | 164.13 | 1.43x |
| 20 | 140.27 | 1.67x |
| 24 | 129.09 | 1.82x |
| 28 | 117.43 | 2.00x |
| 32 | 110.45 | 2.13x |

*Other names and brands may be claimed as the property of others.

For configuration details, see Multi-node configuration.
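The speedups on the scaling chart are relative to the 10-node run. Dividing each by the increase in node count gives a parallel efficiency, which the slide does not report; the sketch below derives it from the timings above, with the function name `scaling_table` chosen for illustration.

```python
NODES = [10, 12, 16, 20, 24, 28, 32]
TIMES_S = [234.74, 204.62, 164.13, 140.27, 129.09, 117.43, 110.45]

def scaling_table(nodes, times):
    """Return (nodes, speedup, efficiency) rows relative to the first entry."""
    base_nodes, base_time = nodes[0], times[0]
    rows = []
    for n, t in zip(nodes, times):
        speedup = base_time / t
        efficiency = speedup / (n / base_nodes)  # 1.0 means ideal scaling
        rows.append((n, speedup, efficiency))
    return rows

rows = scaling_table(NODES, TIMES_S)
for n, speedup, efficiency in rows:
    print(f"{n:2d} nodes: speedup {speedup:.2f}x, efficiency {100 * efficiency:.0f}%")
```

Going from 10 to 32 nodes (3.2 times the hardware) gives a 2.13x speedup, or roughly two thirds of ideal scaling.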

Preliminary: TCO639 node scaling

Results computed on ECMWF's Cray* XC40* KNL partition.

[Figure: RAPS14/TCO639, 5h run, time step breakdown. Time (s) per time step for 10 vs. 20 Intel Xeon Phi 7210 nodes.]

For configuration details, see Multi-node configuration.

Configuration details: single node

Intel Xeon processor E5-2697 v4: Dual Intel Xeon processor E5-2697 v4 2.3 GHz, 18 cores/socket, 36 cores, 72 threads (HT and Turbo ON), BIOS GRRFSDP1.86B0271.R00.1510301446, DDR4 64 GB, 2400 MHz, RHEL 7.2, 1.0 TB SATA drive WD1003FZEX-00MK2A0, /proc/sys/vm/nr_hugepages=8000, Intel Compiler 2017, tbbmalloc_proxy

Intel Xeon settings: 18 MPI tasks, 2 OpenMP threads per task, NRPROMA=-4, NPROMA=-16. Environment variables: OMP_STACKSIZE=48M, KMP_AFFINITY=compact, KMP_BLOCKTIME=12, I_MPI_FABRICS=shm, I_MPI_PIN_DOMAIN=omp, I_MPI_PIN_PROCESSOR_LIST=allcores, I_MPI_PIN_ORDER=bunch, TBB_MALLOC_USE_HUGE_PAGES=1

Intel Xeon Phi processor 7250: Intel Xeon Phi processor 7250, 68 cores, 272 threads, 1.4 GHz base core freq., Turbo ON, 1.7 GHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS GVPRCRB1.86B.0011.R02.1608040407, DDR4 96 GB 2400 MHz, Quadrant cluster mode, MCDRAM cache memory mode, RHEL 7.2, XPPSL 1.4.1, 1.0 TB SATA drive WD1003FZEX-00MK2A0, /proc/sys/vm/nr_hugepages=24000, Intel Compiler 2017, tbbmalloc_proxy

Intel Xeon Phi settings: 34 MPI tasks, 4 OpenMP threads per task, NRPROMA=-8, NPROMA=-48. Environment variables: OMP_STACKSIZE=48M, KMP_AFFINITY=compact, KMP_BLOCKTIME=12, KMP_HW_SUBSET=2t, I_MPI_FABRICS=shm, I_MPI_SHM_LMT=direct, I_MPI_PIN_ORDER=scatter, TBB_MALLOC_USE_HUGE_PAGES=1

Compiler settings: Vectorization flags for Intel Xeon: -xcore-avx2. Vectorization flags for Intel Xeon Phi: -xmic-avx512.

Recipe: IFS RAPS14 is available under a license from ECMWF. A full list of code modifications and compiler settings used has been delivered and is available to licensed developers from ECMWF. The same improved source code was used for testing both Intel Xeon and Intel Xeon Phi.

Power data: Total system wall power is measured out-of-band over the IPMI interface, polling the BMC chip every 0.1 seconds. Energy usage is matched to the average of internally timed code segments to arrive at a performance per Watt estimate.

Average time step length: Computed by averaging the timings for 16 normal and 8 radiation time steps in a 24h forecast run.

Average energy consumption: Dual Intel Xeon processor E5-2697 v4 2.3 GHz, 18 cores/socket: 420W (418W 24h run). Intel Xeon Phi processor 7250, 68 cores (272 threads), 1.4 GHz: 352W (340W 24h run).

Configuration details: multi-node

Multi-node benchmarks computed on the Cray* XC40* partition at ECMWF with Intel Xeon Phi processor 7210.

Intel Xeon Phi processor 7210: Intel Xeon Phi processor 7210, 64 cores, 256 threads, 1.3 GHz base core freq., Turbo ON, 1.6 GHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS GVPRCRB1.86B.0011.R02.1608040407, DDR4 96 GB 2400 MHz, Quadrant cluster mode, MCDRAM cache memory mode, Cray CCE 8.5.3, Intel Compiler 2017, tbbmalloc_proxy

Intel Xeon Phi settings, Cray XC40, TL159: 32 MPI tasks, 4 OpenMP threads per task, NRPROMA=-8, NPROMA=-48. Environment variables: OMP_STACKSIZE=96M, KMP_AFFINITY=compact, KMP_BLOCKTIME=12, TBB_MALLOC_USE_HUGE_PAGES=1

Intel Xeon Phi settings, Cray XC40, TCO639: 16 MPI tasks, 8 OpenMP threads per task, NRPROMA=-8, NPROMA=-48. Environment variables: OMP_STACKSIZE=64M, KMP_AFFINITY=compact, KMP_BLOCKTIME=12, TBB_MALLOC_USE_HUGE_PAGES=1

Compiler settings: Vectorization flags for Intel Xeon: -xcore-avx2. Vectorization flags for Intel Xeon Phi: -xmic-avx512.

Recipe: IFS RAPS14 is available under a license from ECMWF. A full list of code modifications and compiler settings used has been delivered and is available to licensed developers from ECMWF. The same improved source code was used for testing both Intel Xeon and Intel Xeon Phi.

Cray* XC40*, TL159 ALPS settings: -m1500h -d 4 -j 2 -N 32 -cc depth -r 1 --p-state=1301000

Cray* XC40*, TCO639 ALPS settings: -m2500h -d 8 -j 2 -N 16 -cc depth -r 1 --p-state=1301000