Does the Intel Xeon Phi processor fit HEP workloads?


Does the Intel Xeon Phi processor fit HEP workloads?
October 17th, CHEP 2013, Amsterdam
Andrzej Nowak, CERN openlab CTO office
On behalf of Georgios Bitzes, Havard Bjerke, Andrea Dotti, Alfio Lazzaro, Sverre Jarp, Pawel Szostek, Liviu Valsan, Mirela-Madalina Botezatu, Julien Leduc

CERN openlab is a framework for evaluating and integrating cutting-edge IT technologies or services in partnership with industry. The Platform Competence Center (PCC) has worked closely with Intel for the past decade and focuses on:
- many-core scalability
- performance tuning and optimization
- benchmarking and thermal optimization
- teaching

Outline
- A brief history of openlab involvement with the Intel MIC project
- Architecture refresher
- HEP benchmarks
- Results
- Conclusions and projections


Brief history of Intel MIC at openlab
- Early access: work since the MIC alpha (under RS-NDA); ISA reviews in 2008
- Results: 3 benchmarks ported from Xeon and delivering results: ROOT, Geant4, ALICE HLT track fitter
- Expertise: understood and compared with Xeon; post-launch dissemination

Architecture refresher (1) [image: Intel]

Architecture refresher (2) [image: semiaccurate.com / Intel]

Architecture refresher (3)
- PCIe card form factor: an SMP machine on a chip, with a Linux OS on board
- PCIe power envelope: ~200-300 W
- Limited on-board memory (16 GB total, limited by GDDR)
- ~60 cores @ ~1 GHz; x86 architecture, 64-bit P54C-derived core: in-order, superscalar
- Shared coherent cache (~256-512 kB L2 per core, KNF/KNC)
- New ISA with new vector instructions; floating-point support through vector units: 512-bit wide, FMA
- 4-wide hardware threading: >200 threads in total
- 1 TFLOP DP
- Still with the programmability of a Xeon?
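The 1 TFLOP DP figure follows directly from the bullet points above. A minimal sketch (function names are ours, not from the talk) of the kind of loop the 512-bit FMA units are built for, plus the back-of-the-envelope peak calculation:

```cpp
#include <cstddef>

// KNC vector units are 512 bits wide with fused multiply-add:
// 8 double-precision lanes, 2 FLOPs per lane per FMA. An
// auto-vectorizing compiler can map this loop onto 8-wide FMAs
// when the arrays are contiguous.
void fma_loop(const double* a, const double* x, const double* y,
              double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * x[i] + y[i];  // a*x + y: one FMA per element
}

// Peak DP GFLOPS = cores * lanes * 2 FLOPs/FMA * GHz.
double peak_dp_gflops(int cores, int dp_lanes, double ghz) {
    return cores * dp_lanes * 2 * ghz;
}
```

With ~60 cores at ~1 GHz, peak_dp_gflops(60, 8, 1.0) gives 960 GFLOPS, which is where the slide's "1 TFLOP DP" comes from.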


HEP software today
- Very limited or no vectorization; online has somewhat better conditions to vectorize
- Sub-optimal instruction-level parallelism (CPI at >1)
- Hardware threading often beneficial
- Cores used well through multiprocessing, bar the stiff memory requirements; however, systems put in production with delays
- Sockets used well; multiple systems used very well
- Relying on in-core improvements and core counts for scaling

Benchmarks (ICC 14.0 and pinning used unless specified otherwise)
- Standard: HEPSPEC06 (partial test); need to look forward to prototype workloads
- Analysis: MLFit, threaded (pthreads, MPI, OpenMP, TBB) and vectorized (Cilk+)
- Simulation: next-gen multi-threaded Geant4 prototype, threaded FullCMS example, no explicit vectorization
- Online: new ALICE/CBM track fitter prototype, threaded (here with OpenMP) and vectorized with Vc

Porting effort

  Benchmark | LOC       | 1st port time | New ports | Tuning
  ----------|-----------|---------------|-----------|---------
  TF        | < 1 000   | days          | N/A       | 2 weeks
  MLFit     | 3 000     | < 1 day       | < 1 day   | weeks
  MTG       | 2 000 000 | 1 month       | < 1 day   | < 1 week

HEPSPEC06 results and extrapolation
- HEPSPEC06 represents our current family of workloads, not optimized for next-gen hardware
- Soplex would not finish properly; ran out of time to investigate
- 32 cores: 57.7 HS06; extrapolating to 61 cores: 110 HS06
- SMT scalability: 1.8 / thread @ 1 core; 3.48 / 4 threads @ core
- SMT under full load scales differently; using the factor from MTG4: ~70%
- Expected throughput for 244 threads: ~190
- Reference IVB-EP with 48 threads: > 450
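The extrapolation above is simple arithmetic. A sketch reconstructing it (function names are ours, and reading the ~70% MTG4 factor as SMT adding roughly 70% throughput under full load is our interpretation of the slide):

```cpp
// Linear extrapolation from the 32-core partial run to all 61 cores.
double extrapolate_hs06(double measured, int cores_measured, int cores_total) {
    return measured * cores_total / cores_measured;
}

// Apply a full-load SMT factor on top of the one-thread-per-core figure.
double with_smt(double base, double smt_factor) {
    return base * smt_factor;
}
```

extrapolate_hs06(57.7, 32, 61) gives ~110 HS06, and with_smt(110, 1.7) lands near 187, consistent with the slide's expected ~190 for 244 threads and well under the IVB-EP reference of > 450.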

MLFit
[chart: MLFit speedup on KNC; speedup (0-60) vs. number of threads (0-240)]
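The speedup curve flattens well before 240 threads. As an illustration (a model of ours, not from the talk), Amdahl's law reproduces that shape: with serial fraction s, speedup(n) = 1 / (s + (1 - s) / n), and a plateau near 50x at 240 threads corresponds to s of roughly 1.6%.

```cpp
// Amdahl's law: speedup with n threads given serial fraction s.
double amdahl(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}
```

For example, amdahl(0.016, 240) is about 49.8, while the same s already limits 60 threads to about 31x, which is why adding cores pays off less and less toward the right of the chart.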

MLFit SMT scaling / BW saturation
[chart: MLFit SMT scaling for 1-4 threads; on 1 core: 1.00, 1.45, 1.61, 1.63; with full load: 1.00, 1.25, 1.26, 1.26]

MLFit kernel prototype (code courtesy of Vincenzo Innocente, CERN)
[chart: optimized MLFit kernel prototype on KNC; speedup (0-100) vs. number of threads (0-240)]

New multi-threaded Geant4 prototype, preliminary results; code courtesy of Andrea Dotti and the Geant4 team
[chart: new MTG4, pi events/s on KNC and Xeon; weak scaling, 100 events per thread; SNB 32 threads @ 2.9 GHz, IVB 48 threads @ 2.4 GHz, KNC 244 threads @ 1.24 GHz; Geant4 v9.6-ref09a]

New multi-threaded Geant4 prototype, continued
[the same weak-scaling chart, annotated with the SMT benefit on KNC: 100%, 144%, 162%, 168%]

The new ALICE/CBM track fitter, preliminary results

  Variant                        | Time (as printed on the slide)
  -------------------------------|-------------------------------
  x87 single                     | 11.587
  vc (vec)                       | 0.811
  OpenMP, 213 threads (max perf) | 0.0098

  DP scalar to SP vector speedup: 14.2x
  OpenMP to SP vector speedup: 82.7x
  OpenMP vec to x87 speedup: 1182x

[chart: track fitter scaling; speedup (0-90) vs. number of threads (0-240)]
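The speedup figures on this slide are plain ratios of the printed times; a small check (helper name ours) reproduces them to within rounding:

```cpp
// Speedup of a new variant over a reference, from their run times.
double speedup(double t_ref, double t_new) {
    return t_ref / t_new;
}
```

speedup(11.587, 0.811) is ~14.3 (the slide rounds to 14.2x), speedup(0.811, 0.0098) is ~82.8, and the combined vectorization-plus-threading gain, speedup(11.587, 0.0098), is the slide's ~1182x.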

Xeon comparison
[charts: KNC throughput relative to SNB-EP (SNB 32 threads @ 2.9 GHz vs. KNC 244 threads @ 1.24 GHz) and to IVB-EP (IVB 48 threads @ 2.4 GHz vs. KNC 244 threads @ 1.24 GHz), for Reference, HS06 extrapolated (no vec), MTG4 (no vec), MLFit and Trackfitter; bars range from roughly 30% to 144% of the Xeon reference]
Higher is better. No manual MIC-specific optimizations applied.

Porting and software: conclusions
- Think in multiple dimensions of performance: vectorization, threading and porting to ICC can and should be done independently, on Xeon
- Optimization: compiler, tuning; threading and performance tools exist, and so do many runtimes
- Math usage (different implementation!)
- Memory: 16 GB limit today
- Build systems: the on-card OS is a simpler Linux, with some, but not all, OSS add-ons
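The advice to vectorize independently on Xeon usually starts with data layout. A hypothetical sketch (ours, not the actual ALICE/CBM or Vc code): a structure-of-arrays layout keeps each coordinate contiguous, so the same loop auto-vectorizes on a Xeon today and on a Xeon Phi later.

```cpp
#include <vector>
#include <cstddef>

// Structure-of-arrays track container: each field is contiguous,
// which is what wide vector units need.
struct TracksSoA {
    std::vector<double> x, y, tx;  // positions and x-slope, per track
};

// Propagate every track by dz along its slope: x += tx * dz.
// Contiguous x[] and tx[] let the compiler emit wide vector FMAs.
void propagate(TracksSoA& t, double dz) {
    for (std::size_t i = 0; i < t.x.size(); ++i)
        t.x[i] += t.tx[i] * dz;
}
```

An array-of-structures layout (one struct per track) would interleave the fields and typically force the compiler back to scalar code or costly gathers.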

VECTORIZATION (hard, but not always impossible)

Performance: conclusions
- Optimized applications surpass dual-socket Xeon performance
- Non-optimized performance reaches approximately a single server socket
- We don't make use of optimized bandwidth
- Control over math function usage and performance is key vis-a-vis Xeon
- Stack (including compiler) maturity is important and improving
- SMT benefit for 1, 2, 3, 4 HW threads changed over time

Were we of any help?
- Pre-silicon feedback (Geant4) -> architectural behavior
- System connectivity -> full system
- System integration -> ongoing (KNL)
- Comments on general OS -> Linux
- Math function usage -> better compilers and guidelines
- Documentation -> improved
- Benchmarks -> delivered
- Testimonials -> delivered
- Comments on stack -> ongoing (OSS)
- Many more

Looking ahead
- KNL: 14 nm, stand-alone or PCIe, integrated memory
- Let's do some math: Stampede gets 8 out of 10 PF from MIC; an upgrade to KNL at 15 PF means a 50-80% improvement over KNC
- ISA convergence between KNL and Xeon?
- New platform connectivity options
- Heterogeneity: the future of accessible performance
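One way to reproduce the slide's "math" (our reading, not spelled out on the slide: the ~2 PF non-MIC share is inferred from "8 out of 10 PF from MIC", and we assume the Xeon part stays unchanged after the upgrade):

```cpp
// Fractional gain of the MIC share when the system total changes
// but the Xeon contribution stays fixed.
double mic_improvement(double total_after, double xeon_pf, double knc_pf) {
    return (total_after - xeon_pf) / knc_pf - 1.0;
}
```

mic_improvement(15.0, 2.0, 8.0) gives 0.625, i.e. the MIC share growing from 8 to 13 PF is a ~62% gain, inside the slide's 50-80% range.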

Thank you (Q & A)
Andrzej.Nowak@cern.ch
Credits go to Georgios Bitzes, Havard Bjerke, Andrea Dotti, Alfio Lazzaro, Sverre Jarp, Pawel Szostek, Liviu Valsan, Mirela-Madalina Botezatu, Julien Leduc and many others. Special thanks to Intel.