Intel Xeon Phi: architecture, programming models, optimization.


Nizhny Novgorod, 2017. Intel Xeon Phi: architecture, programming models, optimization. Dmitry Prohorov, Dmitry Ryabtsev, Intel

Agenda
- What and Why: Intel Xeon Phi - TOP 500 insights, roadmap, architecture
- How: programming models - positioning and spectrum
- How Fast: optimization and tools

What and Why: HPC. High-Performance Computing is the use of supercomputers and parallel processing techniques for solving complex computational problems.

What and Why: TOP 500. Today's TOP 500 systems foreshadow tomorrow's mainstream HPC.

What and Why: TOP 500 highlights - performance projection.

What and Why: TOP 500 highlights - Top 10 list.

What and Why: TOP 500 highlights - accelerators in power efficiency.

What and Why: TOP 500 highlights - accelerators/coprocessors (NVIDIA).

What and Why: Intel Many Integrated Core (MIC) architecture - roots in Larrabee and the Teraflops Research Chip, plus competition with NVIDIA on accelerators.

What and Why: Parallelization and vectorization - scalar, vector, parallel, and parallel + vector execution.

What and Why: Xeon vs. Xeon Phi.


KNL Mesh Interconnect: All-to-All cluster mode (slide diagram: 2D mesh of tiles with EDCs, iMCs/DDR, IIO/PCIe and OPIO blocks). Addresses are uniformly hashed across all distributed directories. Typical read on an L2 miss: 1. L2 miss encountered. 2. Request sent to the distributed directory. 3. Miss in the directory; forward to memory. 4. Memory sends the data to the requestor.

KNL Mesh Interconnect: Quadrant cluster mode (same slide diagram). The chip is divided into four quadrants; the directory for an address resides in the same quadrant as the memory location. Software-transparent. Typical read on an L2 miss: 1. L2 miss encountered. 2. Request sent to the distributed directory. 3. Miss in the directory; forward to memory. 4. Memory sends the data to the requestor.

KNL Mesh Interconnect: Sub-NUMA Clustering (SNC) mode (same slide diagram). Each quadrant (cluster) is exposed as a separate NUMA domain to the OS, analogous to a 4-socket Xeon. Software-visible. Typical read on an L2 miss: 1. L2 miss encountered. 2. Request sent to the distributed directory. 3. Miss in the directory; forward to memory. 4. Memory sends the data to the requestor.
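Because SNC is software-visible, an application can inspect the exposed NUMA topology at run time. A minimal sketch using libnuma (the build line and node counts are assumptions for illustration; in SNC-4 flat mode one would typically see DDR and MCDRAM domains per quadrant):

```cpp
// Sketch: enumerate the NUMA domains that KNL sub-NUMA clustering exposes.
// Assumed build line: g++ -O2 snc_nodes.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {               // libnuma reports no NUMA support
        std::printf("NUMA not available\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();   // e.g. several domains in SNC-4 mode
    std::printf("%d NUMA nodes visible\n", nodes);
    for (int n = 0; n < nodes; ++n) {
        long long freeb = 0;
        long long total = numa_node_size64(n, &freeb);
        std::printf("node %d: %lld MB total, %lld MB free\n",
                    n, total >> 20, freeb >> 20);
    }
    return 0;
}
```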


- Cori, the supercomputer at NERSC (National Energy Research Scientific Computing Center at LBNL/DOE), became the first publicly announced Knights Landing based system, with over 9,300 nodes slated to be deployed in mid-2016.
- Trinity, the supercomputer at NNSA (National Nuclear Security Administration), is a $174 million deal awarded to Cray that features both Haswell and Knights Landing, with acceptance phases in late 2015 and 2016.
- Over 50 system providers are expected for the KNL host processor, in addition to many more PCIe*-card based solutions; >100 petaflops of committed customer deals to date.
- The DOE* and Argonne* awarded Intel contracts for two systems (Theta and Aurora) as part of the CORAL* program, with a combined value of over $200 million; Intel is teaming with Cray* on both systems.
- Scheduled for 2016, Theta is the first system, with greater than 8.5 petaflop/s and more than 2,500 nodes, featuring the Intel Xeon Phi processor (Knights Landing), the Cray* Aries* interconnect and Cray's* XC* supercomputing platform.
- Scheduled for 2018, Aurora is the second and largest system, with 180-450 petaflop/s and approximately 50,000 nodes, featuring the next-generation Intel Xeon Phi processor (Knights Hill), 2nd-generation Intel Omni-Path fabric, Cray's* Shasta* platform, and a new memory hierarchy composed of Intel Lustre*, burst buffer storage, and persistent high-bandwidth on-package memory.

How: KNL positioning. Massive thread and data parallelism and massive memory bandwidth, with good single-thread performance, in an ISA-compatible, standard CPU form factor. Out-of-the-box performance on throughput workloads is about the same as Xeon, with potential for >2X performance when the code is optimized for vectors, threads and memory bandwidth. Same code base, programming models, compilers, tools and libraries as Xeon; boots a standard OS and runs all legacy code.

How: Programming models on Xeon Phi - Native. The main model for well-parallelized applications. A plain x86_64 binary works, but recompilation with -xMIC-AVX512 is needed to:
- use 512-bit vector operations,
- use both VPUs on a core,
- enable MCDRAM and memory tuning.
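For illustration, a minimal native-mode sketch (the kernel and the build line are assumptions, not from the slides): a vectorizable loop compiled directly for KNL with the Intel compiler.

```cpp
// Sketch: a simple kernel built natively for KNL.
// Assumed build line: icpc -qopenmp -O3 -xMIC-AVX512 saxpy.cpp -o saxpy
#include <vector>
#include <cstddef>
#include <cstdio>

void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const long n = static_cast<long>(x.size());
    // -xMIC-AVX512 lets the compiler emit 512-bit vector instructions here,
    // keeping both VPUs of a KNL core busy across the OpenMP threads.
    #pragma omp parallel for simd
    for (long i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main() {
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
    saxpy(3.0f, x, y);
    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```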

How: Programming models on Xeon Phi - Offload. Can be used if an application has a significant sequential part. Host-to-card over PCIe, or host to a self-boot KNL over the interconnect (Offload over Fabric). Three ways to use it:
- OpenMP 4.0 target directives
- MKL automatic offload
- Direct calls to the offload APIs (COI) and libraries built on them (e.g., hStreams)

How: Offload programming model (slide diagram).

How: Offload with the target pragma in OpenMP 4.0.
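The code shown on this slide is not reproduced in the transcription; a minimal sketch of the same idea (the kernel is an assumption), offloading a loop with an OpenMP 4.0 target region:

```cpp
// Sketch: offloading a vector sum with OpenMP 4.0 target directives.
// Assumed build line: icpc -qopenmp -O3 offload.cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    float *a = new float[n], *b = new float[n], *c = new float[n];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // map() clauses copy the inputs to the device and the result back;
    // the parallel loop runs on the Xeon Phi device if one is available,
    // and falls back to the host otherwise.
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("c[0] = %f\n", c[0]);
    delete[] a; delete[] b; delete[] c;
    return 0;
}
```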

How Fast: Optimization BKMs. Optimization techniques are mostly the same as for Xeon and help both:
- Loop unrolling to feed vectorization.
- Loop reorganization to avoid strided accesses.
- Be careful with no-dependency pragmas.
- Data layout changes for more efficient cache usage (see the AoS-to-SoA sketch below).
- Moving from pure MPI to hybrid MPI+OpenMP: avoids data replication, intra-node communication and increased MPI buffer sizes.
- NUMA awareness for sub-NUMA clustering mode: MPI/thread pinning with parallel data initialization.
- Eliminating synchronization on barriers where possible: the more threads, the higher the barrier cost.
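A minimal illustration of the data-layout point (the structures are assumptions, not from the slides): converting an array of structures to a structure of arrays gives unit-stride, vectorizable access.

```cpp
// Sketch: array-of-structures vs. structure-of-arrays for unit-stride access.
// Assumed build line: icpc -qopenmp -O3 -xMIC-AVX512 layout.cpp
#include <vector>
#include <cstddef>

struct ParticleAoS { float x, y, z, w; };      // AoS: consecutive x values are 16 bytes apart

struct ParticlesSoA {                          // SoA: each field is contiguous in memory
    std::vector<float> x, y, z, w;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), w(n) {}
};

// Strided access: the compiler must gather x out of the interleaved structs.
void scale_aos(std::vector<ParticleAoS>& p, float s) {
    for (std::size_t i = 0; i < p.size(); ++i)
        p[i].x *= s;
}

// Unit-stride access: a straightforward 512-bit vector loop on KNL.
void scale_soa(ParticlesSoA& p, float s) {
    #pragma omp simd
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] *= s;
}
```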

How Fast: Tools and Intel hardware features (slide diagram). Hardware features: Omni-Path Architecture, MCDRAM, 3D XPoint, many-core Xeon Phi, AVX-512. Performance aspects by level:
- Cluster (distributed memory): message size, rank placement, rank imbalance, runtime (RTL) overhead, point-to-point vs. collective operations, network bandwidth.
- Node (memory, I/O): latency, bandwidth, NUMA, file I/O, I/O latency, I/O waits, system-wide I/O.
- Node (threading): threaded/serial ratio, thread imbalance, runtime (RTL) overhead (scheduling, forking), synchronization.
- Core (CPU): microarchitecture issues (IPC), FPU usage efficiency, vectorization.

How Fast: Tools mapped to Intel hardware features (slide diagram, continued). Intel ITAC addresses the cluster level (Omni-Path, distributed memory), Intel VTune Amplifier the node level (MCDRAM, 3D XPoint, many-core Xeon Phi: memory, I/O, threading), and Intel Advisor the core level (AVX-512: microarchitecture issues, FPU usage efficiency, vectorization). Intel Parallel Studio XE Cluster Edition covers all aspects of distributed application performance in synergy with Intel hardware and runtimes.

How Fast: Tools - workflow (slide diagram).

How Fast: Tools - Intel MPI Performance Snapshot. A performance triage orchestrator: shows how to tune for efficient utilization of hardware capabilities.
- Scalability: up to 32K ranks (~0.8 GB trace size per 1K ranks).
- Performance characterization: Intel MPI internal statistics and Intel MPI imbalance (unproductive wait time); guidance to ITAC if the application is MPI-bound.
- OpenMP* imbalance (unproductive wait time); guidance to VTune Amplifier OpenMP* efficiency analysis if it is the bottleneck.
- Basic memory efficiency and footprint information; guidance to VTune Memory Access analysis if the application is memory-bound.
- GFLOPS.

How Fast: Tools - VTune Amplifier, exploring scalability and threading/CPU utilization.
- Is the serial time of my application significant enough to prevent scaling?
- How efficient is my parallelization compared to ideal parallel execution?
- How much theoretical gain can I get if I invest in tuning?
- Which regions are the most promising to invest in?
Links to the grid view give more details on the inefficiencies.

How Fast: Tools - High Bandwidth Memory analysis on KNL with VTune.
- Memory bandwidth is often a limiting factor for compute-intensive applications on many-core systems.
- MCDRAM is high-bandwidth memory with much greater bandwidth, offering a speedup that alleviates this problem.
- The limited MCDRAM size may require selective placement of data objects in HBM (for flat and hybrid MCDRAM modes).
- Memory Access analysis helps identify the memory objects whose HBM placement would bring the most benefit.

How Fast: Tools - HBM analysis on KNL with VTune (continued). Explore the DRAM Bandwidth histogram to see whether the application is bandwidth-bound: if a significant portion of application time is spent at high memory bandwidth utilization, the application may benefit from MCDRAM.

How Fast: Tools - HBM analysis on KNL with VTune (continued). Investigate the memory allocations that induce the bandwidth: group by Bandwidth Domain / Bandwidth Utilization Type / Memory Object / Allocation Stack, expand DRAM/High, and sort by L2 Miss Count. Focus on the allocations inducing L2 misses; the allocation stack shows the allocation site in the user's code.

How Fast: Tools - HBM analysis on KNL with VTune (continued). Allocate the identified objects in high-bandwidth memory, for example by specifying a custom memory allocator class for the stored vector elements.
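The allocator code on the slide is not reproduced in the transcription; a minimal sketch of the same idea, assuming the memkind/hbwmalloc library available on KNL systems (the vector contents and sizes are illustrative):

```cpp
// Sketch: placing a hot std::vector into MCDRAM via the memkind hbw allocator.
// Assumed build line: icpc -O3 hbw_vector.cpp -lmemkind
#include <hbw_allocator.h>   // hbw::allocator<T>
#include <hbwmalloc.h>       // hbw_check_available()
#include <vector>
#include <cstddef>
#include <cstdio>

int main() {
    if (hbw_check_available() != 0)
        std::printf("No high-bandwidth memory on this node; allocation may fall back to DDR.\n");

    // The only change versus a normal vector is the allocator template argument:
    // in flat/hybrid MCDRAM mode the elements now live in HBM instead of DDR.
    std::vector<float, hbw::allocator<float>> data(1 << 26, 1.0f);

    double sum = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i)
        sum += data[i];
    std::printf("sum = %f\n", sum);
    return 0;
}
```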

How Fast: Tools - HBM analysis on KNL with VTune (continued). After the change, DRAM bandwidth is significantly decreased, reducing DRAM memory access stalls.

Summary
- Many-core architectures play a main role in reaching Exascale and beyond.
- Intel Many Integrated Core (MIC) offers competitive performance with well-known HPC programming models.
- KNL is a step forward in this direction: more cores with faster single-thread performance, high-bandwidth memory, and self-boot operation with better performance/watt and no data-transfer cost.
