Nizhny Novgorod, 2016. Intel Xeon Phi: architecture, programming models, optimization. Dmitry Prokhorov, Intel
Agenda
- What and Why: Intel Xeon Phi (TOP 500 insights, roadmap, architecture)
- How: programming models, positioning and spectrum
- How Fast: optimization and tools
"If you were plowing a field, which would you rather use: two strong oxen, or 1024 chickens?"
What and Why: HPC. High-Performance Computing is the use of supercomputers and parallel processing techniques to solve complex computational problems.
What and Why: TOP 500. Today's TOP 500 is the future of tomorrow's mainstream HPC.
What and Why: TOP 500 highlights. Performance projection.
What and Why: TOP 500 highlights. Top 10 list.
What and Why: TOP 500 highlights. Accelerators in power efficiency.
What and Why: TOP 500 highlights. Accelerators/coprocessors (NVIDIA).
What and Why: the Intel Many Integrated Core (MIC) architecture = Larrabee + Teraflops Research Chip + competition with NVIDIA on accelerators.
What and Why: parallelization and vectorization. Scalar, vector, parallel, parallel + vector.
What and Why: Xeon vs. Xeon Phi.
KNL Mesh Interconnect: All-to-All
Addresses are uniformly hashed across all distributed directories.
Typical read (L2 miss):
1. L2 miss encountered
2. Request sent to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor
[Diagram: mesh of tiles surrounded by EDC, IIO, iMC (DDR), OPIO, and PCIe blocks]
KNL Mesh Interconnect: Quadrant
The chip is divided into four quadrants; the directory for an address resides in the same quadrant as the memory location. SW transparent.
Typical read (L2 miss):
1. L2 miss encountered
2. Request sent to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor
KNL Mesh Interconnect: Sub-NUMA Clustering
Each quadrant (cluster) is exposed as a separate NUMA domain to the OS, analogous to a 4-socket Xeon. SW visible.
Typical read (L2 miss):
1. L2 miss encountered
2. Request sent to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor
The Cori supercomputer at NERSC (National Energy Research Scientific Computing Center at LBNL/DOE) became the first publicly announced Knights Landing based system, with over 9,300 nodes slated for deployment in mid-2016.
The Trinity supercomputer at NNSA (National Nuclear Security Administration) is a $174 million deal awarded to Cray that will feature Haswell and Knights Landing, with acceptance phases in both late 2015 and 2016.
Over 50 system providers are expected for the KNL host processor, in addition to many more PCIe*-card based solutions; >100 petaflops of committed customer deals to date.
The DOE* and Argonne* awarded Intel contracts for two systems (Theta and Aurora) as part of the CORAL* program, with a combined value of over $200 million; Intel is teaming with Cray* on both systems. Scheduled for 2016, Theta is the first system, with greater than 8.5 petaflop/s and more than 2,500 nodes, featuring the Intel Xeon Phi processor (Knights Landing), the Cray* Aries* interconnect, and Cray's* XC* supercomputing platform. Scheduled for 2018, Aurora is the second and largest system, with 180-450 petaflop/s and approximately 50,000 nodes, featuring the next-generation Intel Xeon Phi processor (Knights Hill), the 2nd-generation Intel Omni-Path fabric, Cray's* Shasta* platform, and a new memory hierarchy composed of Intel Lustre, burst buffer storage, and persistent memory through high-bandwidth on-package memory.
How: programming models.
How: positioning works. Adoption of coprocessors in the TOP 500.
How: positioning works. Adoption speed of coprocessors in the TOP 500.
How: KNL positioning. Massive thread and data parallelism and massive memory bandwidth, with good single-thread performance, in an ISA-compatible standard CPU form factor. Out-of-box performance on throughput workloads is about the same as Xeon, with potential for >2x performance when optimized for vectors, threads, and memory bandwidth. Same code base, programming model, tools, compilers, and libraries as Xeon; boots a standard OS and runs all legacy code.
How: programming models on Xeon Phi. Native (on the Xeon Phi) and offload (Xeon -> Xeon Phi).
How: programming models on Xeon Phi, native.
- Recompilation with -xMIC-AVX512
- Vectorization: increased efficiency, use of the new instructions
- MCDRAM and memory tuning: tiling, 1 GB pages
How: the offload programming model.
How: offload with the target pragma in OpenMP 4.0.
How: programming models on Xeon Phi, offload.
- Applicable mostly to coprocessor cards; data transfers have a cost
- Three ways to use it:
  - OpenMP 4.0 target directives
  - MKL automatic offload
  - Direct calls to the offload APIs (COI) and layers built on them (e.g., HStreams)
- An offload-over-fabric implementation exists for self-boot parts
How Fast: optimization BKMs. The optimization techniques are the same as for Xeon and help both:
- Loop unrolling to feed vectorization
- Loop reorganization to avoid strided access
- Careful use of no-dependency pragmas
- Data layout changes for more efficient cache usage
- Moving from pure MPI to hybrid MPI+OpenMP: avoids data replication, intra-node communication, and increased MPI buffer sizes
- NUMA awareness for sub-NUMA clustering mode: MPI/thread pinning with parallel data initialization
- Eliminating synchronization on barriers where possible: the more threads, the higher the barrier cost
How Fast: tools and Intel hardware features, by scope.
- Cluster, distributed memory (Omni-Path Architecture): message size, rank placement, rank imbalance, RTL overhead, point-to-point vs. collective ops, network bandwidth
- Node, memory (MCDRAM, 3D XPoint): latency, bandwidth, NUMA
- Node, I/O: file I/O, I/O latency, I/O waits, system-wide I/O
- Node, threading (many-core Xeon Phi): threaded/serial ratio, thread imbalance, RTL overhead (scheduling, forking), synchronization
- Core, CPU (AVX-512): microarchitecture issues (IPC), FPU usage efficiency, vectorization
How Fast: tools mapped to those scopes.
- Intel ITAC: cluster-level distributed-memory issues (Omni-Path)
- Intel VTune Amplifier: node-level memory (MCDRAM, 3D XPoint), I/O, and threading issues (many-core Xeon Phi)
- Intel Advisor: core-level vectorization (AVX)
Intel Parallel Studio XE Cluster Edition covers all aspects of distributed application performance in synergy with Intel hardware and runtimes.
How Fast: tools workflow.
How Fast: tools. Intel MPI Performance Snapshot, a performance triage orchestrator: how to tune for efficient utilization of hardware capabilities.
- Scalability: 32K ranks (~0.8 GB trace size per 1K ranks)
- Performance characterization:
  - Intel MPI internal statistics and Intel MPI imbalance (unproductive wait time); guidance to ITAC if MPI-bound
  - OpenMP* imbalance (unproductive wait time); guidance to VTune Amplifier OpenMP* efficiency analysis if it is the bottleneck
  - Basic memory efficiency and footprint information; guidance to VTune Memory Access analysis if memory-bound
  - GFLOPS
How Fast: tools. Intel Trace Analyzer and Collector (inter-node).
- Summary page; time interval shown; aggregation of shown data
- Tagging & filtering, idealizer, compare, performance assistant, settings, imbalance diagram
- Compare the event timelines of two communication profiles (blue = computation, red = communication)
- Chart showing how the MPI processes interact
How Fast: tools. VTune Amplifier: exploring scalability and threading/CPU utilization.
- Is the serial time of my application significant enough to prevent scaling?
- How efficient is my parallelization compared to ideal parallel execution?
- How much theoretical gain can I get if I invest in tuning?
- Which regions are the most promising to invest in?
- Links to the grid view give more details on inefficiencies
How Fast: tools. High-bandwidth memory analysis on KNL with VTune.
- Memory bandwidth is often the limiting factor for compute-intensive applications on multi-core systems
- MCDRAM, high-bandwidth memory with much greater bandwidth, alleviates this problem
- The limited MCDRAM size may require selective placement of data objects in HBM (for flat and hybrid MCDRAM modes)
- Memory Access analysis helps identify the memory objects that benefit most from HBM placement
3/18/2016
How Fast: tools. HBM analysis on KNL with VTune, step 1: explore the DRAM bandwidth histogram to see whether the app is bandwidth-bound. If a significant portion of application time is spent at high memory bandwidth utilization, the app may benefit from MCDRAM.
How Fast: tools. HBM analysis on KNL with VTune, step 2: investigate the memory allocations inducing the bandwidth. Use the Bandwidth Domain / Bandwidth Utilization Type / Memory Object / Allocation Stack grouping, expand by DRAM/High, and sort by L2 miss count. Focus on allocations inducing L2 misses; the allocation stack shows the allocation site in user code.
How Fast: tools. HBM analysis on KNL with VTune, step 3: allocate the identified object in high-bandwidth memory, e.g. by specifying a custom memory allocator class for the stored vector elements.
How Fast: tools. HBM analysis on KNL with VTune, result: DRAM bandwidth decreases significantly, reducing DRAM memory access stalls.
How Fast: tools. Vectorization Advisor: explore vectorization.
1. Compiler diagnostics + performance data + SIMD efficiency information
2. Guidance: detect the problem and recommend how to fix it
3. Accurate trip counts: understand parallelism granularity and overheads
4. Loop-carried dependency analysis
5. Memory access patterns analysis
"Intel Advisor's Vectorization Advisor fills a gap in code performance analysis. It can guide the informed user to better exploit the vector capabilities of modern processors and coprocessors." Dr. Luigi Iapichino, Scientific Computing Expert, Leibniz Supercomputing Centre
Summary
- Many-core architectures play the main role in reaching exascale and beyond
- Intel Many Integrated Core (MIC) offers competitive performance with well-known HPC programming models
- KNL is a step forward in this direction: more cores, faster single-thread performance, high-bandwidth memory, and self-boot operation with better performance/watt and no data transfer cost
Intel Confidential