Intel Xeon Phi: architecture, programming models, optimization


Nizhny Novgorod, 2016. Intel Xeon Phi: architecture, programming models, optimization. Dmitry Prohorov, Intel

Agenda. What and Why: Intel Xeon Phi - TOP 500 insights, roadmap, architecture. How: programming models - positioning and spectrum. How Fast: optimization and tools.

"If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" (Seymour Cray)

What and Why: HPC. High-Performance Computing is the use of supercomputers and parallel processing techniques for solving complex computational problems.

What and Why: TOP 500 - today's supercomputing as the future of tomorrow's mainstream HPC.

What and Why: TOP 500 highlights - performance projection.

What and Why: TOP 500 highlights - Top 10 list.

What and Why: TOP 500 highlights - accelerators in power efficiency.

What and Why: TOP 500 highlights - accelerators/coprocessors (NVIDIA).

What and Why: the Intel Many Integrated Core (MIC) architecture - Larrabee + the Teraflops Research Chip + competition with NVIDIA on accelerators.

What and Why: parallelization and vectorization - scalar, vector, parallel, and parallel + vector execution.

What and Why: Xeon vs. Xeon Phi.


KNL Mesh Interconnect: All-to-All mode. Addresses are uniformly hashed across all distributed directories. Typical read on an L2 miss: 1. L2 miss encountered. 2. Request sent to the distributed directory. 3. Miss in the directory; forward to memory. 4. Memory sends the data to the requestor. (Mesh diagram of tiles, EDC/MCDRAM controllers, iMC/DDR controllers, IIO/PCIe and OPIO blocks not transcribed.)

KNL Mesh Interconnect: Quadrant mode. The chip is divided into four quadrants; the directory for an address resides in the same quadrant as the memory location. Software-transparent. Typical read on an L2 miss: 1. L2 miss encountered. 2. Request sent to the distributed directory. 3. Miss in the directory; forward to memory. 4. Memory sends the data to the requestor.

KNL Mesh Interconnect: Sub-NUMA Clustering (SNC) mode. Each quadrant (cluster) is exposed as a separate NUMA domain to the OS, analogous to a four-socket Xeon. Software-visible. Typical read on an L2 miss: 1. L2 miss encountered. 2. Request sent to the distributed directory. 3. Miss in the directory; forward to memory. 4. Memory sends the data to the requestor.
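
Because SNC exposes the quadrants as NUMA domains, data placement matters. Below is a minimal sketch (not from the slides) of NUMA-aware allocation with libnuma; it assumes the library is installed and that the quadrants appear as NUMA nodes 0-3, with the node id and buffer size chosen purely for illustration.

#include <numa.h>
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma is not available on this system\n");
        return 1;
    }
    const std::size_t bytes = 64UL << 20;   // 64 MiB working set (illustrative)
    // Allocate in the DDR of one quadrant so that threads pinned to that
    // quadrant hit a local directory and local memory on an L2 miss.
    double *buf = static_cast<double *>(numa_alloc_onnode(bytes, /*node=*/1));
    if (buf == nullptr) { std::perror("numa_alloc_onnode"); return 1; }
    for (std::size_t i = 0; i < bytes / sizeof(double); ++i)
        buf[i] = static_cast<double>(i);    // touch the pages on node 1
    numa_free(buf, bytes);
    return 0;
}

Build with g++ and -lnuma; in practice the same placement effect is often achieved with numactl or with first-touch initialization by pinned threads.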


Cori, the supercomputer at NERSC (the National Energy Research Scientific Computing Center at LBNL/DOE), became the first publicly announced Knights Landing based system, with over 9,300 nodes slated for deployment in mid-2016.
Trinity, the supercomputer at the NNSA (National Nuclear Security Administration), is a $174 million deal awarded to Cray that will feature Haswell and Knights Landing, with acceptance phases in late 2015 and 2016.
Intel expects over 50 system providers for the KNL host processor, in addition to many more PCIe*-card based solutions, and more than 100 petaflops of committed customer deals to date.
The DOE* and Argonne* awarded Intel contracts for two systems (Theta and Aurora) as part of the CORAL* program, with a combined value of over $200 million; Intel is teaming with Cray* on both systems.
Scheduled for 2016, Theta is the first system, with greater than 8.5 petaflop/s and more than 2,500 nodes, featuring the Intel Xeon Phi processor (Knights Landing), the Cray* Aries* interconnect and Cray's* XC* supercomputing platform.
Scheduled for 2018, Aurora is the second and largest system, with 180-450 petaflop/s and approximately 50,000 nodes, featuring the next-generation Intel Xeon Phi processor (Knights Hill), 2nd-generation Intel Omni-Path fabric, Cray's* Shasta* platform, and a new memory hierarchy composed of Intel Lustre, burst buffer storage, and persistent memory through high-bandwidth on-package memory.

How: programming models.

How positioning works: adoption of coprocessors in the TOP 500.

How positioning works: adoption speed of coprocessors in the TOP 500.

How: KNL positioning. Massive thread and data parallelism and massive memory bandwidth, with good single-thread performance, in an ISA-compatible, standard CPU form factor. Out-of-the-box performance on throughput workloads is about the same as Xeon, with potential for more than 2x the performance when optimized for vectors, threads and memory bandwidth. Same code base, programming models, compilers, tools and libraries as Xeon; boots a standard OS and runs all legacy code.

How: programming models on Xeon Phi - native (on the Xeon Phi) and offload (Xeon -> Xeon Phi).

How: programming models on Xeon Phi - native. Recompilation with -xMIC-AVX512. Vectorization: increased efficiency, use of the new instructions. MCDRAM and memory tuning: tiles, 1 GB pages.
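
As a minimal native-mode sketch (not from the slides), the kernel below gets OpenMP threading across the KNL cores and can be vectorized to AVX-512 when built with the flag mentioned above, e.g. with the Intel compiler: icpc -qopenmp -O3 -xMIC-AVX512 triad.cpp. The array size and the kernel itself are illustrative assumptions.

#include <vector>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 24;                 // illustrative working-set size
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    // Threads spread the iterations over the many cores; the simd clause asks
    // the compiler to use the 512-bit vector lanes within each thread.
    #pragma omp parallel for simd
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + 3.0 * c[i];
    std::printf("a[0] = %f\n", a[0]);
    return 0;
}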

How: the offload programming model.

How: offload with the OpenMP 4.0 target pragma.
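
A hedged sketch of the OpenMP 4.0 target construct named on this slide: the map clauses move the arrays to the coprocessor and copy the result back. The function and variable names (saxpy, x, y, a) are illustrative, not taken from the presentation.

#include <vector>
#include <cstdio>

void saxpy(float a, std::vector<float> &x, std::vector<float> &y) {
    const int n = static_cast<int>(x.size());
    float *xp = x.data();
    float *yp = y.data();
    // Offload the loop to the default device (the Xeon Phi when present);
    // OpenMP falls back to host execution if no device is available.
    #pragma omp target map(to: xp[0:n]) map(tofrom: yp[0:n])
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}

int main() {
    std::vector<float> x(1024, 1.0f), y(1024, 2.0f);
    saxpy(3.0f, x, y);
    std::printf("y[0] = %f\n", y[0]);
    return 0;
}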

How: programming models on Xeon Phi - offload. Applicable mostly to coprocessor cards; data transfers have a cost. Three ways to use it: OpenMP 4.0 target directives; MKL automatic offload; and direct calls to the offload APIs (COI) or layers built on them (e.g., hStreams). An offload-over-fabric implementation exists for self-boot parts.
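
For the second option in that list, MKL Automatic Offload, a hedged sketch is shown below: once AO is enabled, MKL may split sufficiently large BLAS calls between the host and the coprocessor on its own. The mkl_mic_enable() call (equivalently, setting MKL_MIC_ENABLE=1 in the environment) is the documented AO switch; the matrix size here is an illustrative assumption.

#include <mkl.h>
#include <vector>

int main() {
    mkl_mic_enable();                        // enable MKL Automatic Offload
    const MKL_INT n = 4096;                  // large enough for AO to consider offloading
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);
    // A plain DGEMM call; with AO enabled, MKL decides whether part of the
    // work runs on the Xeon Phi coprocessor.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);
    return 0;
}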

How Fast: optimization BKMs. The optimization techniques are the same as for Xeon and help both: loop unrolling to feed vectorization; loop reorganization to avoid strides; care with 'no dependency' pragmas; data layout changes for more efficient cache usage; moving from pure MPI to hybrid MPI+OpenMP (avoids data replication, intra-node communication and larger MPI buffers); NUMA awareness for the sub-NUMA clustering mode (MPI/thread pinning with parallel data initialization); and eliminating synchronization on barriers where possible (the more threads, the higher the barrier cost).
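
To illustrate the data-layout item from that list, here is a hedged sketch (names and sizes are assumptions) of moving from an array of structures to a structure of arrays so that the hot field is accessed with unit stride and vectorizes well.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct ParticleAoS { double x, y, z, w; };            // strided access per field

struct ParticlesSoA {                                  // unit-stride access per field
    std::vector<double> x, y, z, w;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), w(n) {}
};

double sum_x_aos(const std::vector<ParticleAoS> &p) {
    double s = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
        s += p[i].x;                                   // stride of 4 doubles: poor vector use
    return s;
}

double sum_x_soa(const ParticlesSoA &p) {
    double s = 0.0;
    #pragma omp simd reduction(+:s)                    // contiguous loads, vector friendly
    for (std::size_t i = 0; i < p.x.size(); ++i)
        s += p.x[i];
    return s;
}

int main() {
    std::vector<ParticleAoS> aos(1000, ParticleAoS{1.0, 2.0, 3.0, 4.0});
    ParticlesSoA soa(1000);
    std::fill(soa.x.begin(), soa.x.end(), 1.0);
    std::printf("%f %f\n", sum_x_aos(aos), sum_x_soa(soa));
    return 0;
}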

How Fast Tools: Intel hardware features and the performance aspects they expose, from cluster to node to core. Distributed memory (Intel Omni-Path Architecture): message size, rank placement, rank imbalance, runtime (RTL) overhead, point-to-point vs. collective operations, network bandwidth. Memory (MCDRAM, 3D XPoint): latency, bandwidth, NUMA. I/O: file I/O, I/O latency, I/O waits, system-wide I/O. Threading (many-core Xeon Phi): threaded/serial ratio, thread imbalance, RTL overhead (scheduling, forking), synchronization. CPU core (AVX-512): microarchitecture issues (IPC), FPU usage efficiency, vectorization.

How Fast Tools: the same map with the Intel tools overlaid. Intel ITAC covers the distributed-memory/Omni-Path aspects (message size, rank placement, rank imbalance, RTL overhead, point-to-point vs. collective operations, network bandwidth). Intel VTune Amplifier covers memory, I/O, threading and CPU-core aspects (latency, bandwidth, NUMA, file I/O, I/O latency and waits, system-wide I/O, threaded/serial ratio, thread imbalance, scheduling/forking overhead, synchronization, microarchitecture issues (IPC), FPU usage efficiency). Intel Advisor covers vectorization (AVX-512). Intel Parallel Studio XE Cluster Edition covers all aspects of distributed application performance in synergy with Intel hardware and runtimes.

How Fast Tools: workflow.

How Fast Tools: Intel MPI Performance Snapshot - a performance triage orchestrator that shows how to tune for efficient utilization of hardware capabilities. Scales to 32K ranks (~0.8 GB trace size per 1K ranks). Performance characterization: Intel MPI internal statistics and MPI imbalance (unproductive wait time), with guidance to ITAC if the application is MPI-bound; OpenMP* imbalance (unproductive wait time), with guidance to VTune Amplifier's OpenMP efficiency analysis if that is the bottleneck; basic memory efficiency and footprint information, with guidance to VTune Memory Access analysis if memory-bound; and GFLOPS.

How Fast Tools: Intel Trace Analyzer and Collector (inter-node). Summary page with the time interval shown and aggregation of the shown data; tagging and filtering; idealizer; compare; performance assistant; settings; imbalance diagram. The event timeline (blue = computation, red = communication) shows how the MPI processes interact, and the timelines of two communication profiles can be compared.

How Fast Tools: VTune Amplifier - exploring scalability and threading/CPU utilization. Is the serial time of my application significant enough to prevent scaling? How efficient is my parallelization compared to ideal parallel execution? How much theoretical gain can I get if I invest in tuning? Which regions are the most promising to invest in? Links to the grid view provide more details on the inefficiencies.

How Fast Tools: High Bandwidth Memory analysis on KNL with VTune. Memory bandwidth is often a limiting factor for compute-intensive applications on multi-core systems. MCDRAM, the high-bandwidth on-package memory, offers much greater bandwidth to alleviate this problem. The limited MCDRAM size may require selective placement of data objects in HBM (for the flat and hybrid MCDRAM modes). Memory Access analysis helps identify the memory objects that benefit most from HBM placement.

How Fast Tools: High Bandwidth Memory analysis on KNL with VTune. Explore the DRAM bandwidth histogram to see whether the application is bandwidth-bound: if a significant portion of application time is spent at high memory bandwidth utilization, the application may benefit from MCDRAM.

How Fast Tools: High Bandwidth Memory analysis on KNL with VTune. Investigate the memory allocations inducing the bandwidth: use the Bandwidth Domain / Bandwidth Utilization Type / Memory Object / Allocation Stack grouping, expand by DRAM/High and sort by L2 Miss Count. Focus on the allocations inducing L2 misses; the allocation stack shows the allocation site in the user's code.

How Fast Tools: High Bandwidth Memory analysis on KNL with VTune. Allocate the object in high-bandwidth memory by specifying a custom memory allocator class for the stored vector elements.
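
A hedged sketch of that idea using the memkind/hbwmalloc C++ allocator; it assumes the memkind library is installed and the node runs in flat or hybrid MCDRAM mode, and the slide itself does not show this exact code.

#include <hbw_allocator.h>   // hbw::allocator from the memkind library
#include <cstdio>
#include <vector>

int main() {
    // Vector whose element storage is requested from MCDRAM (high-bandwidth
    // memory); on machines without MCDRAM, hbwmalloc can fall back to DDR.
    std::vector<double, hbw::allocator<double>> v(1 << 20, 1.0);
    double s = 0.0;
    for (double x : v) s += x;
    std::printf("sum = %f\n", s);
    return 0;
}

Link with -lmemkind; alternatives are hbw_malloc()/hbw_free() in C code, or binding the whole process to the MCDRAM NUMA nodes with numactl --membind.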

How Fast Tools: High Bandwidth Memory analysis on KNL with VTune. After the change, DRAM bandwidth is significantly decreased, reducing DRAM memory access stalls.

How Fast Tools: Vector Advisor - explore vectorization. 1. Compiler diagnostics + performance data + SIMD efficiency information. 2. Guidance: detect the problem and recommend how to fix it. 3. Accurate trip counts: understand parallelism granularity and overheads. 4. Loop-carried dependency analysis. 5. Memory access patterns analysis. "Intel Advisor's Vectorization Advisor fills a gap in code performance analysis. It can guide the informed user to better exploit the vector capabilities of modern processors and coprocessors." - Dr. Luigi Iapichino, Scientific Computing Expert, Leibniz Supercomputing Centre

Summary. Many-core-based architectures play a main role on the path to Exascale and beyond. Intel Many Integrated Core (MIC) offers competitive performance with well-known HPC programming models. KNL is a step forward in this direction: more cores with faster single-thread performance, high-bandwidth memory, and self-boot operation with better performance/watt and no data transfer cost.
