Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL

1 Sabela Ramos, Torsten Hoefler: Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL (spcl.inf.ethz.ch)

2 Microarchitectures are becoming more and more complex. [Figure: block diagram of CPUs with L1 and L2 caches]

3 How to optimize codes for these complex architectures? Performance engineering encompasses the set of roles, skills, activities, practices, tools, and deliverables applied at every phase of the systems development life cycle, which ensures that a solution will be designed, implemented, and operationally supported to meet the non-functional requirements for performance (such as throughput, latency, or memory usage). In practice: manually profile codes and tune them to the given architecture. This requires highly skilled performance engineers who are familiar with NUMA (topology, bandwidths, etc.), caches (associativity, sizes, etc.), and the microarchitecture (number of outstanding loads, etc.). "Trust me, I'm an engineer!"

4 An engineering example: Tacoma Narrows Bridge

5 Scientific Performance Engineering: 1) Observe, 2) Model, 3) Understand, 4) Build

6 Modeling by example: KNL Architecture (mesh). [Figure: KNL die diagram: MCDRAM, EDC, IMC, IIO, PCIe, and Misc blocks around a mesh of tiles; each tile has 2 cores with 2 VPUs each, a 1 MB L2, and a CHA]

7 KNL Architecture (memory: Flat & Cache). [Figure: same KNL die diagram]

8 KNL Architecture (all-to-all mode). [Figure: same KNL die diagram]

9 KNL Architecture (Quadrant or Hemisphere). [Figure: same KNL die diagram]

10 KNL Architecture (SNC-4 or SNC-2). [Figure: same KNL die diagram]

11 KNL Architecture. How much does this all matter? What is the real cost of accessing cache? What is the cost of accessing memory? [Figure: KNL die diagram]

12 Step 1: Understand core-to-core transfers. MESIF cache coherence. Write-back overhead. Location: only 5-15% difference. Contention effects? "That is curious!" All values are medians within 10% of the 95% nonparametric CI, cf. TH, RB: Scientific Benchmarking of Parallel Computing Systems, SC16
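The reporting rule on this slide (medians with a 95% nonparametric confidence interval) can be sketched as follows. This is a minimal illustration of the standard order-statistic interval for a median, not the authors' benchmarking code. The function names `median_ci` and `within_tolerance` are hypothetical, and reading "within 10%" as "both CI bounds lie within 10% of the median" is an assumption.

```python
from fractions import Fraction
from math import comb
from statistics import median


def median_ci(samples, confidence=0.95):
    """Distribution-free CI for the median, taken from order statistics."""
    xs = sorted(samples)
    n = len(xs)
    tail = (1 - Fraction(confidence)) / 2  # probability allowed in each tail
    r = 0    # how many ranks we can move in from each end
    cum = 0  # running sum of binomial coefficients C(n, 0..k)
    for k in range(n + 1):
        cum += comb(n, k)
        # Stop once P(Binomial(n, 1/2) <= k) exceeds the allowed tail.
        if Fraction(cum, 2 ** n) > tail:
            break
        r = k + 1
    if r == 0:
        raise ValueError("too few samples for the requested confidence")
    # r-th smallest and r-th largest sample bound the median.
    return xs[r - 1], xs[n - r]


def within_tolerance(samples, rel=0.10, confidence=0.95):
    """ASSUMED reading of the slide: both CI bounds within rel of the median."""
    lo, hi = median_ci(samples, confidence)
    m = median(samples)
    return lo >= m * (1 - rel) and hi <= m * (1 + rel)
```

A benchmark loop could keep collecting latency samples until `within_tolerance(samples)` returns true and then report `median(samples)`.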

13 Step 2: Understand core-to-memory transfers (DRAM and MCDRAM). Latency: MCDRAM 20% slower! Bandwidth: MCDRAM 4-6x faster! Need to read and write for full bandwidth. Cache mode: >20% slower; bandwidth suffers a bit. All values are medians within 10% of the 95% nonparametric CI, cf. TH, RB: Scientific Benchmarking of Parallel Computing Systems, SC16

14 Performance engineers optimize your code! "Are you kidding me?"

15 A principled approach to designing cache-to-cache broadcast algorithms. Multi-ary tree example: depth d = 2, level sizes k_1 = k_2 = 3. [Formula annotated with: tree depth, level size, tree cost, reached threads] S. Ramos, TH: Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi, ACM HPDC'13. S. Ramos, TH: Cache line aware optimizations for ccNUMA systems, IEEE TPDS'17.
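To make the multi-ary tree example concrete, here is a minimal sketch of how tree depth and level sizes determine the number of reached threads, plus a toy search for a cheap tree. The slide's actual cost formula did not survive extraction, so `tree_cost` is a placeholder linear contention model: an assumption, not the model from the cited papers. All function names and the nanosecond constants are hypothetical.

```python
from math import prod


def reached_threads(level_sizes):
    """Threads reached (root included) when every node at level i
    forwards the data to level_sizes[i] children."""
    return sum(prod(level_sizes[:i]) for i in range(len(level_sizes) + 1))


def tree_cost(level_sizes, line_transfer_ns=100.0, contention_ns=20.0):
    """ASSUMPTION: each level costs one cache-line transfer plus a linear
    penalty per additional child reading the same line. Placeholder only."""
    return sum(line_transfer_ns + contention_ns * (k - 1) for k in level_sizes)


def best_uniform_tree(n_threads, max_fanout=16, **cost_params):
    """Try trees with the same fan-out k at every level; return the
    cheapest (cost, k, depth) that reaches at least n_threads."""
    best = None
    for k in range(2, max_fanout + 1):
        depth = 1
        while reached_threads([k] * depth) < n_threads:
            depth += 1
        cost = tree_cost([k] * depth, **cost_params)
        if best is None or cost < best[0]:
            best = (cost, k, depth)
    return best
```

For the slide's example, `reached_threads([3, 3])` gives 1 + 3 + 9 = 13 threads. With measured transfer and contention costs substituted for the placeholder constants, the same search would trade tree depth against per-level contention.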

16 Model-driven performance engineering for broadcast. Binary or binomial trees? "Oh! But sure I can do that."

17 Model-driven performance engineering for broadcast. Binary or binomial trees? 13x faster, lower variance. "Oh! But sure I can do that."

18 Easy to generalize to similar algorithms: Barrier (7x faster than OpenMP), Reduce (5x faster than OpenMP)

19 Sorting in complex memories. Triad, Cache Mode SNC4: Check! "Let's do something Big Data!" Triad, Flat Mode SNC4. "We'll need a parallel sort on all cores, right?"

20 Memory model: Bitonic Mergesort. [Figure: timeline of CPU0 and CPU1] Slices of 16 elements go through a bitonic network. Communication: CPU0 accesses data from local and remote caches. Synchronization: CPU0 waits for CPU1. Memory accesses: latency vs. bandwidth.
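For readers unfamiliar with bitonic networks, the following is a plain scalar sketch of the textbook bitonic sorting network applied to a 16-element slice. It illustrates the compare-exchange pattern only and is not the implementation measured in the talk. The names `bitonic_sort` and `network_depth` are hypothetical.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length sequence with a bitonic network."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # distance between compared elements
        while j >= 1:
            # One network stage: all compare-exchanges are independent.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def network_depth(n):
    """Number of compare-exchange stages for n = 2**m inputs."""
    m = n.bit_length() - 1
    return m * (m + 1) // 2
```

The stage count grows as log2(n) * (log2(n) + 1) / 2, which is the log^2 n depth that the later slides refer to. Because the compare-exchanges within one stage are independent, each stage maps naturally onto vector or multi-core execution.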

21 Modeling Bitonic Sorting. [Cost formula annotated with: L1 access cost, elements in L1, L2 access cost, elements in L2, memory access cost]
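The formula on this slide was lost in extraction, so the sketch below only shows how its annotated terms could combine: the cost of touching n elements once, split by the memory level that holds them. The additive form, the function name `access_cost_ns`, the 4-byte element size behind the capacities, and every per-access cost are assumptions made for illustration, not measured KNL values.

```python
def access_cost_ns(n_elements,
                   l1_elems=8192,      # ASSUMED: 32 KB L1 holding 4-byte elements
                   l2_elems=262144,    # ASSUMED: 1 MB L2 holding 4-byte elements
                   c_l1=1.0,           # placeholder cost per L1 access (ns)
                   c_l2=5.0,           # placeholder cost per L2 access (ns)
                   c_mem=50.0):        # placeholder cost per memory access (ns)
    """Cost of touching n_elements once, charging each element the cost
    of the closest level that can still hold it."""
    in_l1 = min(n_elements, l1_elems)
    in_l2 = min(n_elements - in_l1, l2_elems - l1_elems)
    in_mem = n_elements - in_l1 - in_l2
    return in_l1 * c_l1 + in_l2 * c_l2 + in_mem * c_mem
```

Replacing the placeholder costs with measured latencies or inverse bandwidths for a given KNL mode would turn this into a per-pass estimate for the sort.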

22 Bitonic Sort of 1 KiB. [Plot: measurement; model including synchronization cost; model (just) memory costs] Synchronization outweighs memory costs for small data! "So don't parallelize too much!"

23 Bitonic Sort of 4 MiB. [Plot: model including synchronization cost; model (just) memory costs; parallelization boundary]
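The idea of a parallelization boundary can be illustrated with a toy model in which memory cost shrinks with the thread count while synchronization cost grows with it. The functional form (work divided by threads, plus a synchronization term proportional to log2 of the thread count), the function names, and both constants are assumptions chosen for illustration. They are not the model or the numbers from the talk.

```python
from math import log2


def predicted_time_ns(n_elements, threads, per_element_ns=2.0, sync_ns=2000.0):
    """ASSUMED toy model: memory work is split evenly across threads,
    and each doubling of the thread count adds one synchronization step."""
    memory = per_element_ns * n_elements / threads
    sync = sync_ns * log2(threads)
    return memory + sync


def best_threads(n_elements, max_threads=64, **params):
    """Pick the power-of-two thread count with the lowest predicted time."""
    candidates = []
    t = 1
    while t <= max_threads:
        candidates.append(t)
        t *= 2
    return min(candidates, key=lambda p: predicted_time_ns(n_elements, p, **params))
```

Under these placeholder constants, small inputs are fastest on a single thread, large inputs use all 64 threads, and intermediate sizes land in between. That is the qualitative behavior the slides describe for 1 KiB, 4 MiB, and 1 GiB.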

24 Bitonic Sort of 1 GiB. Synchronization negligible. "Always use all cores!"

25 The most surprising result last. "Hey, but which memory? DRAM or MCDRAM?" The model (and practical measurements) indicate that it does not matter. Thesis: the higher bandwidth of MCDRAM did not help due to the higher latency (log^2 n depth). Disclaimer: this is NOT the best sorting algorithm for Xeon Phi KNL. It is the best we found with limited effort. We suspect that a combination of algorithms will perform best.

26 Scientific performance engineering for complex memory systems. Questions/Discussions?
