Nizhny Novgorod, 2017. Intel Xeon Phi: architecture, programming models, optimization. Dmitry Prokhorov, Dmitry Ryabtsev, Intel
Agenda: What and Why - Intel Xeon Phi: Top 500 insights, roadmap, architecture. How - programming models: positioning and spectrum. How Fast - optimization and tools. 2
What and Why HPC. High-Performance Computing: the use of supercomputers and parallel processing techniques for solving complex computational problems. 3
What and Why TOP 500: today's future of tomorrow's mainstream HPC 4
What and Why TOP 500 Highlights Performance Projection 5
What and Why TOP 500 Highlights Top 10 list 6
What and Why TOP 500 Highlights Accelerators in Power Efficiency 7
What and Why TOP 500 Highlights Accelerators/Coprocessors (NVIDIA) 8
What and Why Intel Many Integrated Core (MIC) architecture: Larrabee + Teraflops Research Chip + competition with NVIDIA on accelerators 9
What and Why Parallelization and vectorization: scalar, vector, parallel, parallel + vector 10
What and Why Xeon VS Xeon Phi 11
KNL Mesh Interconnect: All-to-All
(Figure: 2D mesh of tiles with EDC/iMC memory controllers, OPIO and PCIe interfaces.)
Address uniformly hashed across all distributed directories.
Typical read on an L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor
15
KNL Mesh Interconnect: Quadrant
(Figure: chip divided into four quadrants of tiles with EDC/iMC memory controllers, OPIO and PCIe interfaces.)
Chip divided into four quadrants. The directory for an address resides in the same quadrant as the memory location. SW transparent.
Typical read on an L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor
16
KNL Mesh Interconnect: Sub-NUMA Clustering
(Figure: chip divided into four clusters of tiles with EDC/iMC memory controllers, OPIO and PCIe interfaces.)
Each quadrant (cluster) is exposed as a separate NUMA domain to the OS; analogous to a 4-socket Xeon. SW visible.
Typical read on an L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor
17
Cori Supercomputer at NERSC (National Energy Research Scientific Computing Center at LBNL/DOE) became the first publicly announced Knights Landing based system, with over 9,300 nodes slated to be deployed in mid-2016. Trinity Supercomputer at NNSA (National Nuclear Security Administration) is a $174 million deal awarded to Cray that will feature Haswell and Knights Landing, with acceptance phases in both late 2015 and 2016. Over 50 system providers are expected for the KNL host processor, in addition to many more PCIe*-card based solutions; >100 petaflops of committed customer deals to date. The DOE* and Argonne* awarded Intel contracts for two systems (Theta and Aurora) as part of the CORAL* program, with a combined value of over $200 million; Intel is teaming with Cray* on both systems. Scheduled for 2016, Theta is the first system, with greater than 8.5 petaflop/s and more than 2,500 nodes, featuring the Intel Xeon Phi processor (Knights Landing), the Cray* Aries* interconnect and Cray's* XC* supercomputing platform. Scheduled for 2018, Aurora is the second and largest system, with 180-450 petaflop/s and approximately 50,000 nodes, featuring the next-generation Intel Xeon Phi processor (Knights Hill), 2nd-generation Intel Omni-Path fabric, Cray's* Shasta* platform, and a new memory hierarchy composed of Intel Lustre, burst buffer storage, and persistent memory through high-bandwidth on-package memory. 21
How KNL positioning: massive thread and data parallelism and massive memory bandwidth with good single-thread (ST) performance, in an ISA-compatible, standard CPU form factor. Out-of-box performance on throughput workloads is about the same as Xeon, with potential for >2x performance when optimized for vectors, threads and memory bandwidth. Same code base, programming model, compilers, tools and libraries as Xeon. Boots a standard OS and runs all legacy code. 22
How Programming models on Xeon Phi: Native. The main model for well-parallelized applications. A plain x86_64 binary works, but recompilation with -xMIC-AVX512 is needed to: use 512-bit vector operations, keep both VPUs on a core busy, and enable MCDRAM and memory tuning (a minimal sketch follows). 23
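As an illustration of the native model (not from the original slides), the sketch below shows a simple threaded and vectorizable kernel; the file name, problem size and compile line are assumptions, using the Intel compiler flag -xMIC-AVX512 mentioned above.

```cpp
// saxpy.cpp - minimal sketch of a natively compiled, vectorizable kernel.
// Assumed compile line (Intel compiler): icpc -qopenmp -O3 -xMIC-AVX512 saxpy.cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 24;               // hypothetical problem size
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 0.5f;

    // Threads across cores, SIMD within a thread: with -xMIC-AVX512 the inner
    // loop compiles to 512-bit vector operations that can feed both VPUs.
    #pragma omp parallel for simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];

    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```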
How Programming models on Xeon Phi: Offload. Can be used if an application has a significant sequential part. Host->card through PCIe, or host->self-boot node through the interconnect (Offload over Fabric). Three ways to use it: OpenMP 4.0 target directives; MKL automatic offload; direct calls to the offload APIs (COI) and libraries built on them (e.g., hStreams). 24
How Offload programming model 25
How Offload with pragma target in OpenMP 4.0 26
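The code shown on this slide is not preserved in the text; the following is a minimal sketch of what an OpenMP 4.0 target offload of a simple kernel could look like (variable names and sizes are illustrative, not from the original).

```cpp
// Minimal OpenMP 4.0 offload sketch: the loop runs on the target device
// (a Xeon Phi card, or a self-boot KNL node with Offload over Fabric).
#include <cstdio>

int main() {
    const int n = 1024;                           // illustrative size
    float x[1024], y[1024];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // map() clauses copy x to the device and copy y back when the region ends.
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += 0.5f * x[i];

    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```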
How Fast Optimization BKMs. Optimization techniques are mostly the same as for Xeon and help on both:
- Loop unrolling to feed vectorization
- Loop reorganization to avoid strided accesses
- Careful use of no-dependency pragmas
- Data layout changes for more efficient cache usage
- Moving from pure MPI to hybrid MPI+OpenMP: avoids data replication, intra-node communication and increased MPI buffer sizes
- NUMA awareness for sub-NUMA clustering mode: MPI/thread pinning with parallel data initialization (see the sketch after this list)
- Eliminating synchronization on barriers where possible: the more threads, the higher the barrier cost
27
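As an illustration of the NUMA-awareness point (not from the original slides), first-touch allocation can be combined with parallel initialization so that, in sub-NUMA clustering mode, each thread's pages land in the cluster where that thread is pinned; the names and sizes below are hypothetical.

```cpp
// First-touch, NUMA-aware initialization sketch: pages are physically placed
// in the NUMA domain of the thread that first writes them, so the same thread
// pinning should be used for initialization and for the compute loops
// (e.g. OMP_PROC_BIND=close OMP_PLACES=cores).
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t n = 1 << 26;                           // hypothetical size
    double* data = static_cast<double*>(std::malloc(n * sizeof(double)));

    // Parallel initialization: each thread first-touches the pages it will
    // later work on, keeping accesses local to its sub-NUMA cluster.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        data[i] = 0.0;

    // Later compute loops should reuse the same schedule(static) partitioning.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        data[i] += 1.0;

    std::printf("data[0] = %f\n", data[0]);
    std::free(data);
    return 0;
}
```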
How Fast Tools. Intel hardware features and the performance aspects they touch, from cluster to node to core:
- Cluster / distributed memory (Omni-Path Architecture): message size, rank placement, rank imbalance, RTL overhead, point-to-point -> collective ops, network bandwidth
- Node / memory (MCDRAM, 3D XPoint): latency, bandwidth, NUMA
- Node / I/O: file I/O, I/O latency, I/O waits, system-wide I/O
- Node / threading (many-core Xeon Phi): threaded/serial ratio, thread imbalance, RTL overhead (scheduling, forking), synchronization
- Core / CPU (AVX-512): uarch issues (IPC), FPU usage efficiency, vectorization
How Fast Tools. The same stack mapped to Intel tools:
- Cluster / distributed memory (Omni-Path) - Intel ITAC: message size, rank placement, rank imbalance, RTL overhead, point-to-point -> collective ops, network bandwidth
- Node / memory, I/O, threading (MCDRAM, 3D XPoint, many-core Xeon Phi) - Intel VTune Amplifier: latency, bandwidth, NUMA, file I/O, I/O latency, I/O waits, system-wide I/O, threaded/serial ratio, thread imbalance, RTL overhead (scheduling, forking), synchronization
- Core / CPU (AVX-512) - Intel VTune Amplifier for uarch issues (IPC) and FPU usage efficiency, Intel Advisor for vectorization
Intel Parallel Studio XE Cluster Edition covers all aspects of distributed application performance in synergy with Intel hardware and runtimes.
How Fast Tools - Workflow 30
How Fast Tools: Intel MPI Performance Snapshot. A performance triage orchestrator: how to tune for efficient utilization of hardware capabilities.
- Scalability to 32K ranks (~0.8 GB trace size per 1K ranks)
- Performance characterization: Intel MPI internal statistics and Intel MPI imbalance (unproductive wait) time, with guidance to ITAC if MPI-bound
- OpenMP* imbalance (unproductive wait) time, with guidance to VTune Amplifier OpenMP* efficiency analysis if it is the bottleneck
- Basic memory efficiency and footprint information, with guidance to VTune Memory Access analysis if memory-bound
- GFLOPs
How Fast Tools: VTune Amplifier - exploring scalability and threading/CPU utilization. Is the serial time of my application significant enough to prevent scaling? How efficient is my parallelization compared to ideal parallel execution? How much theoretical gain can I get if I invest in tuning? Which regions are the most promising to invest in? Links to the grid view give more details on the inefficiency. 33
How Fast Tools: High Bandwidth Memory Analysis on KNL with VTune. Memory bandwidth is often a limiting factor for compute-intensive applications on multi-core systems. MCDRAM, the on-package high-bandwidth memory, offers much greater bandwidth to alleviate this problem. The limited MCDRAM size might require selective placement of data objects into HBM (for flat and hybrid MCDRAM modes). Memory Access analysis helps to identify the memory objects whose HBM placement benefits the most. 34
How Fast Tools: High Bandwidth Memory Analysis on KNL with VTune. Explore the DRAM Bandwidth histogram to see whether the app is bandwidth-bound: if a significant portion of application time is spent at high memory bandwidth utilization, the app may benefit from MCDRAM. 35
How Fast Tools: High Bandwidth Memory Analysis on KNL with VTune. Investigate the memory allocations inducing the bandwidth: use the Bandwidth Domain / Bandwidth Utilization Type / Memory Object / Allocation Stack grouping, expand DRAM / High, and sort by L2 Miss Count. Focus on the allocations inducing L2 misses; the allocation stack shows the allocation site in the user's code. 36
How Fast Tools: High Bandwidth Memory Analysis on KNL with VTune. Allocate the object in high-bandwidth memory by specifying a custom memory allocator class for the stored vector elements (see the sketch below). 37
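The allocator code on the slide is not preserved in the text; the following is a minimal sketch of the approach, assuming the hbw::allocator provided by the memkind/hbwmalloc library; the vector name and contents are illustrative.

```cpp
// Sketch: place a vector's storage in MCDRAM (flat/hybrid memory mode) by
// swapping the allocator. Assumes libmemkind is installed; link with -lmemkind.
#include <hbw_allocator.h>   // hbw::allocator from the memkind package
#include <cstdio>
#include <vector>

int main() {
    // Elements of this vector are allocated from high-bandwidth memory,
    // falling back to DDR if MCDRAM is exhausted (default hbwmalloc policy).
    std::vector<double, hbw::allocator<double>> field(1 << 20, 0.0);

    for (std::size_t i = 0; i < field.size(); ++i)
        field[i] = static_cast<double>(i);

    std::printf("field.back() = %f\n", field.back());
    return 0;
}
```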
How Fast Tools: High Bandwidth Memory Analysis on KNL with VTune. After the change, DRAM bandwidth is significantly decreased, reducing DRAM memory access stalls. 38
Summary. Many-core architectures play the main role on the path to exascale and beyond. Intel Many Integrated Core (MIC) offers competitive performance with well-known HPC programming models. KNL is a step forward in this direction, with more cores and faster single-thread performance, high-bandwidth memory, and self-boot operation with better performance/watt and no data transfer cost. 39