Intel Knights Landing Hardware


TACC KNL Tutorial, IXPUG Annual Meeting 2016
Presented by: John Cazes, Lars Koesterke

Intel's Xeon Phi Architecture
- Leverages x86 architecture
  - Simpler x86 cores, higher compute throughput per watt
  - Supports legacy programming models: Fortran, C/C++; MPI, OpenMP, pthreads
- Designed for floating point performance
- Provides high memory bandwidth
- Runs an operating system
- Many-core design rather than multi-core
  - Designed to run hundreds of execution threads in parallel

2nd Generation Intel Xeon Phi: Knights Landing
- Many Integrated Cores (MIC) architecture
- Up to 72 cores (based on Silvermont)
  - 4 H/W threads per core
  - Possible 288 threads of execution
- 16 GB MCDRAM* (high bandwidth) on-package
- 1 socket, self hosted (no more PCI bottleneck!)
- 3+ TF DP peak performance
- 6+ TF SP peak performance
- 400+ GB/s STREAM performance
- Supports Intel Omni-Path Fabric
* Multi-Channel DRAM
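The peak figures above can be sanity-checked with simple arithmetic. The sketch below is an illustration, not Intel's published derivation: it assumes each of the 2 VPUs per core retires one fused multiply-add per 512-bit lane per cycle (counted as 2 floating-point operations), and it uses the 72-core count and the 1.4 GHz clock quoted elsewhere in this deck. The function name `peak_gflops` is hypothetical.

```c
#include <assert.h>

/* Theoretical peak in GFLOP/s for a chip with 512-bit vector units.
 * Assumption (not stated on the slide): every VPU completes one fused
 * multiply-add per lane per cycle, which counts as 2 flops. */
double peak_gflops(int cores, double ghz, int vpus_per_core, int element_bits)
{
    assert(cores > 0 && vpus_per_core > 0);
    assert(element_bits > 0 && 512 % element_bits == 0);

    int lanes = 512 / element_bits;   /* 8 doubles or 16 floats per vector */
    int flops_per_fma = 2;            /* one multiply plus one add */

    return cores * ghz * vpus_per_core * lanes * flops_per_fma;
}
```

Under these assumptions, `peak_gflops(72, 1.4, 2, 64)` gives about 3226 GFLOP/s in double precision and `peak_gflops(72, 1.4, 2, 32)` about 6451 GFLOP/s in single precision, consistent with the "3+ TF" and "6+ TF" bullets. Sustained clocks under heavy vector load may be lower, so treat these as upper bounds.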

Knights Corner → Knights Landing

KNC                                    KNL
Co-processor                           Self hosted
Stripped down Linux                    Centos 7
Binary incompatible with other         Binary compatible with prior Xeon
  architectures                          (non Phi) architectures
1.1 GHz processor                      1.4 GHz processor
8 GB RAM                               Up to 400 GB RAM (including 16 GB MCDRAM)
22 nm process                          14 nm process
1 512-bit VPU                          2 512-bit VPUs
No support for:                        Support for:
  Out of order                           Out of order
  Branch prediction                      Branch prediction
  Fast unaligned memory access           Fast unaligned memory access

KNL Diagram
- Cores are grouped in pairs (tiles)
- Up to 36 tiles (72 cores)
- 2D mesh interconnect
- 2 DDR memory controllers
  - 6 channels DDR4
  - Up to 90 GB/s
- 16 GB MCDRAM
  - 8 embedded DRAM controllers
  - Up to 475 GB/s
(Knights Landing: Second-Generation Intel Xeon Phi Product, A. Sodani et al., IEEE Micro, March/April 2016)
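Bandwidth figures like these are usually compared against a STREAM-style triad. The following is a minimal sketch of that kernel and of the bandwidth arithmetic, not the official STREAM benchmark: the function names are hypothetical, timing is left to the caller, and the byte count follows the STREAM convention of three arrays touched per iteration (two reads, one write).

```c
#include <assert.h>
#include <stddef.h>

/* STREAM-style triad: a[i] = b[i] + s * c[i].
 * The OpenMP pragma is ignored if OpenMP is not enabled at compile time. */
void triad(size_t n, double *restrict a, const double *restrict b,
           const double *restrict c, double s)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}

/* Bandwidth in GB/s for one triad pass over n elements that took
 * `seconds` of wall-clock time (measured by the caller). */
double triad_gb_per_s(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double bytes = 3.0 * (double)n * sizeof(double);  /* read b, read c, write a */
    return bytes / seconds / 1e9;
}
```

For a meaningful measurement the arrays should be much larger than the 36 MB of aggregate L2, and the pass should be repeated several times with the best time kept.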

KNL Tile
- Each core (based on Intel Silvermont):
  - Local L1 cache
  - 2 512-bit VPUs (almost symmetric)
- 2 cores/tile
- 1 MB shared L2 cache (up to 36 MB L2 per KNL)
- Shared mesh connection
(Knights Landing: Second-Generation Intel Xeon Phi Product, A. Sodani et al., IEEE Micro, March/April 2016)

KNL Core
- 8-way 32 KB instruction cache
- 8-way 32 KB data cache
- 2 VPUs, only one has support for legacy floating point ops
  - Compile with -xmic-avx512 to use both VPUs (only supported by Intel compilers)
(Knights Landing: Second-Generation Intel Xeon Phi Product, A. Sodani et al., IEEE Micro, March/April 2016)

KNL ISA

Sandy Bridge    Haswell     KNL
x87/MMX         x87/MMX     x87/MMX
SSE             SSE         SSE
AVX             AVX         AVX
                AVX2        AVX2
                BMI         BMI
                TSX
                            AVX-512F
                            AVX-512CD
                            AVX-512PF
                            AVX-512ER

- KNL supports all legacy instructions (the rows from x87/MMX through BMI)
- Introduces AVX-512 extensions:
  - Foundations, AVX-512F (common between Xeon and Xeon Phi)
  - Conflict Detection, AVX-512CD
  - Prefetch, AVX-512PF
  - Exponential and Reciprocal, AVX-512ER
- AVX-512 ISA: compile with -xmic-avx512
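One way to confirm which of these AVX-512 subsets a build actually targets is to check the compiler's predefined macros. The sketch below assumes the `__AVX512F__`, `__AVX512CD__`, `__AVX512ER__` and `__AVX512PF__` macros, which the Intel, GCC and Clang compilers define when the matching subset is enabled. The enum names and the function name are hypothetical. It reports the compile-time target, not what the CPU running the binary supports.

```c
#include <assert.h>

enum {
    HAS_AVX512F  = 1,   /* Foundation */
    HAS_AVX512CD = 2,   /* Conflict Detection */
    HAS_AVX512ER = 4,   /* Exponential and Reciprocal */
    HAS_AVX512PF = 8    /* Prefetch */
};

/* Returns a bitmask of the AVX-512 subsets enabled for this compilation. */
int avx512_target_mask(void)
{
    int mask = 0;
#ifdef __AVX512F__
    mask |= HAS_AVX512F;
#endif
#ifdef __AVX512CD__
    mask |= HAS_AVX512CD;
#endif
#ifdef __AVX512ER__
    mask |= HAS_AVX512ER;
#endif
#ifdef __AVX512PF__
    mask |= HAS_AVX512PF;
#endif
    /* The other subsets build on Foundation, so they should never
     * appear without it. */
    assert(!(mask & (HAS_AVX512CD | HAS_AVX512ER | HAS_AVX512PF))
           || (mask & HAS_AVX512F));
    return mask;
}
```

A build made with the KNL flag from the slide would be expected to return all four bits, while a generic x86-64 build would return 0.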

KNL
- C/C++/Fortran and Python/Java/...
- Feels like a traditional node (not a co-processor!)
- However:
  - Many-core approach
  - Cores relatively slow
  - Intra-node parallelization required
- Binary compatible with previous Xeon, but not the other way around (when compiled with -xmic-avx512)
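Because the individual cores are relatively slow, a serial loop leaves most of the node idle. A minimal sketch of intra-node parallelization with OpenMP, one of the programming models the deck lists, might look like the following. The function name `dot` is hypothetical, and the pragma is ignored (leaving a correct serial loop) if OpenMP is not enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with the iterations shared among OpenMP threads.
 * Each thread accumulates a private partial sum, and the reduction
 * clause combines the partial sums at the end of the loop. */
double dot(size_t n, const double *x, const double *y)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

With 4 hardware threads on each of up to 72 cores, the thread count is normally set through the `OMP_NUM_THREADS` environment variable. The best value depends on the application.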

Stampede KNL Upgrade
- Upgrade to TACC's Stampede cluster
- ~1.5 PF additional performance
- 117 in Top 500
  - First KNL system on the list
- [...] core KNL nodes
- Intel's Omni-Path Fabric Network
- Separate cluster that shares filesystems with Stampede
Funded by the National Science Foundation (NSF) through grant #ACI

Stampede KNL Upgrade
[Diagram: the two clusters and their shared filesystems]
- Stampede's original components (Sandy Bridge cluster):
  - login1 through login4 (Sandy Bridge), Centos 6, reached by ssh from the Internet
  - Infiniband network
  - Sandy Bridge compute nodes with KNC MIC coprocessors
  - Sandy Bridge largemem and GPU compute nodes
  - Jobs started with sbatch or idev
- Stampede upgrade (KNL cluster):
  - login-knl1 (Haswell), Centos 7, reached by ssh
  - OmniPath network
  - KNL compute nodes
  - Jobs started with sbatch or idev
- Shared by both clusters: $HOME, $SCRATCH, $WORK

Vectorization
Differences with KNC and understanding vector reports

Vectorization on KNL

Similarities to KNC
- Supports 512-bit vectors:
  - 16 32-bit floats/integers
  - 8 64-bit doubles
- 32 addressable registers
- Supports masked operations

Differences from KNC
- 2 VPUs
- Full support for packed 64-bit integer arithmetic
- Supports unaligned loads & stores
- Supports SSE/2/3/4, AVX, and AVX2 instruction sets
  - Only on 1 of the 2 vector units
- Many other improvements:
  - Improved Gather/Scatter
  - Hardware FP Divide
  - Hardware FP Inverse square root
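Two of these points can be made concrete in a few lines of C. The sketch below uses hypothetical function names. The first function is the lane-count arithmetic for a 512-bit vector. The second is a loop with a per-element condition, the kind of loop that masked operations let a compiler vectorize by computing every lane and storing only where the condition holds. Whether a given compiler actually vectorizes it depends on its cost model.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of the given width that fit in one 512-bit vector. */
int lanes_per_vector(int element_bits)
{
    assert(element_bits > 0 && 512 % element_bits == 0);
    return 512 / element_bits;
}

/* Replaces negative entries with zero and returns how many were changed.
 * The if-statement makes the store conditional per element, which maps
 * naturally onto a masked vector store. */
size_t clamp_negative_to_zero(size_t n, double *a)
{
    size_t changed = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] < 0.0) {
            a[i] = 0.0;
            changed++;
        }
    }
    return changed;
}
```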

Vectorization Procedure
- Compile with -xmic-avx512 to target KNL
- Add -qopt-report=[234] to get optimization reports
  - 2: brief overview of which loops are vectorized and not vectorized (search for "dependence")
  - 3: summaries of load and store streams, alignment, and estimated speedup for each loop
  - 4: load and store stream information by array name, and estimated overhead of vectorization
- The primary inhibitor of vectorization is possible aliasing
  - Learn how to use the restrict keyword in C
  - Vectorization can be forced using a pragma
  - This may give incorrect results if aliasing is actually present!
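The two aliasing remedies named above can be sketched in C as follows. The function names are hypothetical. The slide does not say which pragma it means, so the second function uses the standard OpenMP `simd` pragma as an assumption. Intel compilers also offer their own vectorization pragmas.

```c
#include <assert.h>
#include <stddef.h>

/* restrict promises the compiler that x and y never overlap, so it can
 * vectorize without a runtime overlap check or a scalar fallback. */
void scale_add_restrict(size_t n, double a,
                        const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Same loop without restrict, but with vectorization requested by pragma.
 * If x and y actually overlap, the result may be wrong: the pragma
 * overrides the compiler's dependence analysis rather than fixing it. */
void scale_add_forced(size_t n, double a, const double *x, double *y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Compiling a file like this with the flags from the slide (`-xmic-avx512 -qopt-report=2`) and searching the report for "dependence" is a quick way to see whether aliasing was the obstacle.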

Optimization Reports
- Sample code: swim.f ([...] lines of Fortran)
- Optimization report sizes (default == all phases):
  - Level 2: 386 lines
  - Level 3: 696 lines
  - Level 4: 1253 lines
- Most of the length is from the vectorization report.
- The combined report includes other important information, so you probably don't want to exclude the other phases (-qopt-report-phase).

Example loop nest from swim.f

!$omp PARALLEL DO
      do j=1,n
        do i=1,m
          cu(i+1,j) = .5d0*(p(i+1,j,mid)+p(i,j,mid))*u(i+1,j,mid)
          cv(i,j+1) = .5d0*(p(i,j+1,mid)+p(i,j,mid))*v(i,j+1,mid)
          z(i+1,j+1) = (fsdx*(v(i+1,j+1,mid)-v(i,j+1,mid))-fsdy*
     &      (u(i+1,j+1,mid)-u(i+1,j,mid)))/
     &      (p(i,j,mid)+p(i+1,j,mid)+p(i+1,j+1,mid)+p(i,j+1,mid))
          h(i,j) = p(i,j,mid)+.25d0*
     &      (u(i+1,j,mid)*u(i+1,j,mid)+u(i,j,mid)*u(i,j,mid)
     &      +v(i,j+1,mid)*v(i,j+1,mid)+v(i,j,mid)*v(i,j,mid))
        end do
      end do

Details don't matter; just note that:
- These are 2 nested loops.
- There are a lot of array references on the right-hand sides.
- There are 4 arrays being stored.

Example Level 2 optimization report

LOOP BEGIN at swim.f(318,7)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at swim.f(319,11)
   <Peeled loop for vectorization>
      remark #15301: PEEL LOOP WAS VECTORIZED
   LOOP END

   LOOP BEGIN at swim.f(319,11)
      remark #15300: LOOP WAS VECTORIZED
      remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   LOOP END

   LOOP BEGIN at swim.f(319,11)
   <Remainder loop for vectorization>
      remark #15301: REMAINDER LOOP WAS VECTORIZED
   LOOP END
LOOP END

Slide annotations:
- swim.f(318,7): outer loop not vectorized
- Peel loop (prolog) reported separately
- Main loop: only very high level info here
- Remainder loop (epilog) reported separately
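The peel / main / remainder split in this report can be pictured by writing the three parts out by hand. The sketch below is an illustration of the idea, not what the Intel compiler emits: it assumes a vector length of 8 doubles and 64-byte alignment, it uses plain scalar code in the peel and remainder parts (the report shows the compiler vectorized those too), and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VL = 8, ALIGN_BYTES = 64 };   /* 8 doubles fill one 512-bit vector */

typedef struct { size_t peel, main, remainder; } loop_split;

/* Multiplies a[0..n-1] by s in three parts and reports how many
 * elements each part handled. */
loop_split scale_in_three_parts(size_t n, double *a, double s)
{
    assert(a != NULL);
    loop_split parts = {0, 0, 0};
    size_t i = 0;

    /* Peel (prolog): advance until a[i] sits on a 64-byte boundary. */
    while (i < n && ((uintptr_t)&a[i] % ALIGN_BYTES) != 0) {
        a[i] *= s;
        i++;
        parts.peel++;
    }

    /* Main loop: whole groups of VL elements, all aligned accesses. */
    for (; i + VL <= n; i += VL) {
        for (int lane = 0; lane < VL; lane++)
            a[i + lane] *= s;
        parts.main += VL;
    }

    /* Remainder (epilog): the fewer-than-VL elements left over. */
    for (; i < n; i++) {
        a[i] *= s;
        parts.remainder++;
    }
    return parts;
}
```

For a 250-element array that already starts on a 64-byte boundary, this gives no peel iterations, 248 elements in the main loop (31 vectors of 8) and a remainder of 2.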

Example Level 3 optimization report

LOOP BEGIN at swim.f(318,7)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at swim.f(319,11)
   <Peeled loop for vectorization>
      remark #15301: PEEL LOOP WAS VECTORIZED
   LOOP END

   LOOP BEGIN at swim.f(319,11)
      remark #15300: LOOP WAS VECTORIZED
      [Lots more stuff added here, see next slide]
      remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   LOOP END

   LOOP BEGIN at swim.f(319,11)
   <Remainder loop for vectorization>
      remark #15301: REMAINDER LOOP WAS VECTORIZED
   LOOP END
LOOP END

Level 3 optimization report: extra info

LOOP BEGIN at swim.f(319,11)
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 14
   remark #15449: unmasked aligned unit stride stores: 2
   remark #15450: unmasked unaligned unit stride loads: 9
   remark #15451: unmasked unaligned unit stride stores: 2
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 98
   remark #15477: vector loop cost:
   remark #15478: estimated potential speedup:
   remark #15488: --- end vector loop cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=250
LOOP END

Slide annotations:
- Remarks #15448 to #15451: memory reference info
- Remarks #15475 to #15488: estimated cycle cost & speedup
- Remark #25015: compiler cost model based on this assumed trip count

Level 4 optimization report

LOOP BEGIN at swim.f(319,11)
   remark #15300: LOOP WAS VECTORIZED
   [Lots more stuff added here, see next slide]
   remark #15448: unmasked aligned unit stride loads: 14
   remark #15449: unmasked aligned unit stride stores: 2
   remark #15450: unmasked unaligned unit stride loads: 9
   remark #15451: unmasked unaligned unit stride stores: 2
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 98
   remark #15477: vector loop cost:
   remark #15478: estimated potential speedup:
   remark #15488: --- end vector loop cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=250
LOOP END

Level 4 optimization report: extra info

LOOP BEGIN at swim.f(319,11)
   remark #15300: LOOP WAS VECTORIZED
   remark #15389: vectorization support: reference cu has unaligned access [ swim.f(320,15) ]
   remark #15389: vectorization support: reference p has unaligned access [ swim.f(320,15) ]
   remark #15388: vectorization support: reference p has aligned access [ swim.f(320,15) ]
   remark #15389: vectorization support: reference u has unaligned access [ swim.f(320,15) ]
   remark #15388: vectorization support: reference cv has aligned access [ swim.f(321,15) ]
   remark #15388: vectorization support: reference p has aligned access [ swim.f(321,15) ]
   [ lots of similar lines omitted here ]
   remark #15389: vectorization support: reference u has unaligned access [ swim.f(325,15) ]
   remark #15389: vectorization support: reference u has unaligned access [ swim.f(325,15) ]
   remark #15388: vectorization support: reference u has aligned access [ swim.f(325,15) ]
   remark #15388: vectorization support: reference u has aligned access [ swim.f(325,15) ]
   remark #15388: vectorization support: reference v has aligned access [ swim.f(325,15) ]
   remark #15388: vectorization support: reference v has aligned access [ swim.f(325,15) ]
   remark #15388: vectorization support: reference v has aligned access [ swim.f(325,15) ]
   remark #15388: vectorization support: reference v has aligned access [ swim.f(325,15) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 8
   remark #15309: vectorization support: normalized vectorization overhead

Slide annotations:
- Remarks #15388 and #15389: alignment status for every array used
- Remark #15305: vector length used (typically 8 or 16)
- Remark #15309: vectorization overhead will be relatively high in peel and remainder loops, but should be low in the main loop
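The aligned/unaligned distinction in these remarks depends partly on where each array starts in memory. In C, one way to give the compiler a 64-byte-aligned starting address is C11 `aligned_alloc`, sketched below with a hypothetical helper name. This addresses only the base address: references such as `u(i+1,j)` are offset by one element from `u(i,j)`, so both cannot be aligned in the same iteration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for n doubles starting on a 64-byte boundary, the size
 * of one 512-bit vector. Returns NULL on failure; release with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);

    /* aligned_alloc expects the size to be a multiple of the alignment,
     * so round up to the next multiple of 64 (and never request 0). */
    size_t rounded = (bytes + 63) / 64 * 64;
    if (rounded == 0)
        rounded = 64;

    double *p = aligned_alloc(64, rounded);
    assert(p == NULL || (uintptr_t)p % 64 == 0);
    return p;
}
```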

And after all this...

Does it really work?
[Figure: FLASH performance in steps/second versus number of threads on Sandy Bridge, Haswell, and KNL]
