1 XEON PHI BASICS
2 Reusing this material This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This means you are free to copy and redistribute the material and to adapt and build on the material under the following terms: you must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must distribute your work under the same license as the original. Note that presentations may contain images owned by others. Please seek their permission before reusing these images.
3 LESSON PLAN Programming models; Parallelisation; Compilers and Tools; Performance Considerations
4 Programming models
5 Programming models (Diagram: Host with Main Memory, connected to a Coprocessor)
6 Programming models Three basic programming models: Native mode, Offload execution, Symmetric execution
8 Programming models Native Mode: Xeon Phi only (Diagram: Host with Main Memory connected via ssh over PCIe to the Coprocessor, which runs int main() { do_stuff(); }) The host is used for preparation work (e.g. compiling, data copy). The user initiates the run from the host, or can use the host to connect to the Xeon Phi via ssh. The programme runs on the Xeon Phi from start to finish as usual.
10 Programming models Native Mode: Xeon Phi only Pros: requires minimal effort to port; works well with flat-profile applications; no memory copy required. Cons: poor performance on codes with large serial regions and on complex codes; limited Xeon Phi memory.
13 Programming models Offload Execution: Hotspot eliminator (Diagram: Host with Main Memory runs int main() { #pragma offload do_stuff(); }; the Coprocessor, reached over PCIe, runs do_stuff()) The application is initiated on the host. Embarrassingly parallel hotspots are offloaded to the Xeon Phi. Results of the offload region are returned to the host, where execution continues.
15 Programming models Offload Execution: Hotspot eliminator Pros: serial code handled by advanced CPU cores; embarrassingly parallel hotspots are executed efficiently on the Xeon Phi; more efficient use of (limited) Xeon Phi memory. Cons: data must be copied to and from the Xeon Phi via the (slow) PCIe bus; may lead to poor utilisation of CPU/Xeon Phi (idle time).
18 Programming models Symmetric Execution: Phi-as-a-node (Diagram: host and coprocessor each run int main() { do_stuff(); } under distinct MPI ranks, connected via ssh over PCIe) The application is initiated on the host but runs across both CPU and Xeon Phi cores, effectively using the Xeon Phi as just another node for MPI to use.
20 Programming models Symmetric Execution: Phi-as-a-node Pros: promise of full hardware utilisation; no need for offloading pragmas and memory copies. Cons: tricky load-balancing; code is rarely optimal for both CPU and Xeon Phi.
21 Parallelisation
22 Parallelisation MPI and/or OpenMP
23 Parallelisation MPI+OpenMP with Offload MPI runs only on hosts MPI processes offload to Xeon Phi OpenMP in MPI processes OpenMP in offload regions Image from Colfax training material
24 Parallelisation Symmetric Pure MPI MPI processes on host MPI processes (native) on Xeon Phi No OpenMP Image from Colfax training material
25 Parallelisation Symmetric hybrid MPI+OpenMP MPI processes on host MPI processes (native) on Xeon Phi All MPI processes use OpenMP multithreading Image from Colfax training material
26 Parallelisation What is best? It depends: what is your goal, what is your system, what is your application? Generally OpenMP is faster than MPI on the Xeon Phi: MPI performs poorly on the Xeon Phi, and OpenMP uses less memory (especially important on the Xeon Phi). Worth checking affinity settings (more later).
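A symmetric MPI launch with Intel MPI might look like the sketch below. The hostnames (node0, node0-mic0), rank counts and the .mic postfix are illustrative assumptions, not values from the slides, and the mpirun command line is shown via echo so the fragment is harmless to run on a machine without a Xeon Phi.

```shell
# Let Intel MPI place ranks on the coprocessor as well as the host.
export I_MPI_MIC=enable
# Host binary is ./app; Intel MPI appends this postfix for Phi ranks.
export I_MPI_MIC_POSTFIX=.mic

# One MPMD launch line: 16 ranks on the host, 60 native ranks on the
# Phi (shown with echo here; drop the echo on a real system):
echo mpirun -host node0 -n 16 ./app : -host node0-mic0 -n 60 ./app
```

The single colon-separated mpirun line is what makes this "symmetric": one MPI job spanning the CPU and the coprocessor, with no offload pragmas.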
27 Compilers & Tools
29 Compilers & Tools Compilers In a word: Intel. Intel C Compiler, Intel C++ Compiler, Intel Fortran Compiler.
30 Compilers & Tools Tools In two words: Intel & Allinea (but mainly Intel)
31 Compilers & Tools Tools Intel Parallel Studio XE: Intel C, C++ and Fortran compilers (MIC-capable); Intel Math Kernel Library (MKL); Intel MPI Library (only in Cluster Edition); Intel Trace Analyzer and Collector / ITAC (MPI profiler); Intel VTune Amplifier XE (multi-threaded profiler); Intel Inspector XE (memory and threading debugging); Intel Threading Building Blocks / TBB (threading library); Intel Performance Primitives / IPP (media and data); Intel Advisor XE (guided parallelism design). Allinea: MAP (lightweight profiler); DDT (debugger); Forge (unified UI for DDT & MAP).
33 Compilers & Tools Tools Runtime: MPSS (Intel Manycore Platform Software Stack), environment variables, Linux commands
34 Compilers & Tools Runtime MPSS tools: micnativeloadex, micinfo, miccheck, micsmc (GUI), micrasd (root). Environment variables: MKL_MIC_ENABLE, MIC_ENV_PREFIX, MIC_LD_LIBRARY_PATH, I_MPI_MIC, I_MPI_MIC_POSTFIX, OFFLOAD_REPORT, KMP_AFFINITY, KMP_BLOCKTIME, MIC_USE_2MB_BUFFERS. Linux commands: lspci | grep Phi; cat /etc/hosts | grep mic; cat /proc/cpuinfo | grep proc | tail -n 3. For more details: E1EC94AE-A13D-463E-B3C3-6D7A7205F5A1.htm
35 Performance Considerations
36 Performance Considerations Four things to consider first: execution mode, vectorisation, alignment, affinity. Then: application design.
37 Performance Considerations Mode of execution: Native, Offload or Symmetric. The mode chosen should depend on the application and the system configuration (as discussed previously).
38 Performance Considerations Vectorisation Xeon Phi performance is greatly dependent on the vector units. Intel Xeon CPUs also use (smaller) vector units, so code optimised for Intel Xeon will run faster on Intel Xeon Phi. KNL (the next generation Xeon Phi) will also use 512-bit (AVX-512) vector units, so code optimised for Intel Xeon Phi KNC will also run faster on Intel Xeon Phi KNL (though KNC and KNL are not binary compatible).
41 Performance Considerations Data Alignment Loop is vectorised != faster. Data alignment is critical for vectorisation to be beneficial. Remember not only to align the data, but also to tell the compiler at the loop that the data is aligned.
44 Performance Considerations Affinity All data moves over the high-speed ring interconnect, so affinity is critical for good performance, and the default settings are not always optimal. In offload mode you may accidentally use poor settings, e.g. 240 threads competing for the use of 30 cores while 30 other cores sit idle.
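For an offload run, the affinity pitfall above is usually addressed through environment variables like those below. The specific values (balanced placement, 60 cores x 4 hardware threads) are illustrative assumptions about the card, not prescriptions from the slides; MIC_ENV_PREFIX tells the offload runtime which host-side variables to forward to the coprocessor.

```shell
# Forward any MIC_-prefixed variables to the coprocessor environment.
export MIC_ENV_PREFIX=MIC
# Spread threads evenly over cores instead of piling them onto a few.
export MIC_KMP_AFFINITY="balanced,granularity=fine"
# Illustrative: 60 cores x 4 hardware threads = 240 OpenMP threads.
export MIC_OMP_NUM_THREADS=240
# Ask the offload runtime to report data transfers and timings.
export OFFLOAD_REPORT=2
```

With settings like these, the 240 offloaded threads are distributed over all 60 cores rather than contending for a subset of them.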
48 Performance Considerations Application Design Design >> Optimisation. Consider all levels of parallelism available and adapt your algorithm to exploit as many of them, as fully, as possible.
49 Performance Considerations Levels of parallelism (diagram, built up across slides 49-57): a Machine contains Nodes; each Node contains NUMA regions and Co-processors; each NUMA region contains Cores; each Core runs Threads; and each Thread drives Vector Units (the Co-processors have their own Vector Units).
58 Summary
60 Summary Programming models: Native, Offload, Symmetric - what's best for you? Parallelisation: MPI, OpenMP -> OpenMP better on Xeon Phi; many ways to mix and match. Compilers and Tools: use Intel compilers (C, C++, Fortran); Intel and Allinea tools: VTune, MAP, etc.; wide variety of runtime tools and environment variables: micinfo, KMP_AFFINITY. Performance Considerations: programming model; vectorisation - needed to exploit Xeon Phi compute; data alignment - needed to make vectorisation useful; thread/process affinity - can be critical for performance; application design: consider levels of parallelism.
64 Thank You!
More informationAddressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer
Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2
More informationMunara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.
Munara Tolubaeva Technical Consulting Engineer 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries. notices and disclaimers Intel technologies features and benefits depend
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More informationDebugging, benchmarking, tuning i.e. software development tools. Martin Čuma Center for High Performance Computing University of Utah
Debugging, benchmarking, tuning i.e. software development tools Martin Čuma Center for High Performance Computing University of Utah m.cuma@utah.edu SW development tools Development environments Compilers
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson adrianj@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past, computers
More informationIntel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel
Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Which performance analysis tool should I use first? Intel Application
More informationCode optimization in a 3D diffusion model
Code optimization in a 3D diffusion model Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 18 th 2016, Barcelona Agenda Background Diffusion
More informationParallel Programming on Ranger and Stampede
Parallel Programming on Ranger and Stampede Steve Lantz Senior Research Associate Cornell CAC Parallel Computing at TACC: Ranger to Stampede Transition December 11, 2012 What is Stampede? NSF-funded XSEDE
More informationPRACE PATC Course: Intel MIC Programming Workshop, MKL. Ostrava,
PRACE PATC Course: Intel MIC Programming Workshop, MKL Ostrava, 7-8.2.2017 1 Agenda A quick overview of Intel MKL Usage of MKL on Xeon Phi Compiler Assisted Offload Automatic Offload Native Execution Hands-on
More informationDmitry Durnov 15 February 2017
Cовременные тенденции разработки высокопроизводительных приложений Dmitry Durnov 15 February 2017 Agenda Modern cluster architecture Node level Cluster level Programming models Tools 2/20/2017 2 Modern
More informationTutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012
More informationextreme XQCD Bern Aug 5th, 2013 Edmund Preiss Manager Business Development, EMEA
extreme XQCD Bern Aug 5th, 2013 Edmund Preiss Manager Business Development, EMEA Topics Covered Today 2 Intel s offerings to HPC Update on Intel Architecture Roadmap Overview on Intel Development Tools
More informationPreparing for Highly Parallel, Heterogeneous Coprocessing
Preparing for Highly Parallel, Heterogeneous Coprocessing Steve Lantz Senior Research Associate Cornell CAC Workshop: Parallel Computing on Ranger and Lonestar May 17, 2012 What Are We Talking About Here?
More informationIntel Knights Landing Hardware
Intel Knights Landing Hardware TACC KNL Tutorial IXPUG Annual Meeting 2016 PRESENTED BY: John Cazes Lars Koesterke 1 Intel s Xeon Phi Architecture Leverages x86 architecture Simpler x86 cores, higher compute
More information6/14/2017. The Intel Xeon Phi. Setup. Xeon Phi Internals. Fused Multiply-Add. Getting to rabbit and setting up your account. Xeon Phi Peak Performance
The Intel Xeon Phi 1 Setup 2 Xeon system Mike Bailey mjb@cs.oregonstate.edu rabbit.engr.oregonstate.edu 2 E5-2630 Xeon Processors 8 Cores 64 GB of memory 2 TB of disk NVIDIA Titan Black 15 SMs 2880 CUDA
More informationA Simple Path to Parallelism with Intel Cilk Plus
Introduction This introductory tutorial describes how to use Intel Cilk Plus to simplify making taking advantage of vectorization and threading parallelism in your code. It provides a brief description
More informationUnderstanding Dynamic Parallelism
Understanding Dynamic Parallelism Know your code and know yourself Presenter: Mark O Connor, VP Product Management Agenda Introduction and Background Fixing a Dynamic Parallelism Bug Understanding Dynamic
More informationLaurent Duhem Intel Alain Dominguez - Intel
Laurent Duhem Intel Alain Dominguez - Intel Agenda 2 What are Intel Xeon Phi Coprocessors? Architecture and Platform overview Intel associated software development tools Execution and Programming model
More informationIntel Manycore Platform Software Stack (Intel MPSS) User's Guide
Intel Manycore Platform Software Stack (Intel MPSS) User's Guide Copyright 2013-2015 Intel Corporation All Rights Reserved US Revision: 3.5 World Wide Web: http://www.intel.com Table of Contents Disclaimer
More informationScalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany
Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP
More informationDouble Rewards of Porting Scientific Applications to the Intel MIC Architecture
Double Rewards of Porting Scientific Applications to the Intel MIC Architecture Troy A. Porter Hansen Experimental Physics Laboratory and Kavli Institute for Particle Astrophysics and Cosmology Stanford
More informationPerformance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures
Performance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures Dr. Fabio Baruffa Dr. Luigi Iapichino Leibniz Supercomputing Centre fabio.baruffa@lrz.de Outline of the talk
More informationIntel Manycore Platform Software Stack (Intel MPSS)
Intel Manycore Platform Software Stack (Intel MPSS) Copyright 2013-2015 Intel Corporation All Rights Reserved US Revision: 3.6.1 World Wide Web: http://www.intel.com Disclaimer and Legal Information You
More informationTechnical Report. Document Id.: CESGA Date: July 28 th, Responsible: Andrés Gómez. Status: FINAL
Technical Report Abstract: This technical report presents CESGA experience of porting three applications to the new Intel Xeon Phi coprocessor. The objective of these experiments was to evaluate the complexity
More informationDebugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.
Debugging CUDA Applications with Allinea DDT Ian Lumb Sr. Systems Engineer, Allinea Software Inc. ilumb@allinea.com GTC 2013, San Jose, March 20, 2013 Embracing GPUs GPUs a rival to traditional processors
More informationIntel + Parallelism Everywhere. James Reinders Intel Corporation
Intel + Parallelism Everywhere James Reinders Intel Corporation How to win at parallel programming 2 My Talk Hardware Parallelism and some insights INNOVATION: vectorization INNOVATION: tasking 3 Helping
More informationIntel VTune Amplifier XE
Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance
More informationAn Introduction to the Intel Xeon Phi. Si Liu Feb 6, 2015
Training Agenda Session 1: Introduction 8:00 9:45 Session 2: Native: MIC stand-alone 10:00-11:45 Lunch break Session 3: Offload: MIC as coprocessor 1:00 2:45 Session 4: Symmetric: MPI 3:00 4:45 1 Last
More informationIntel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python
Intel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python Python Landscape Adoption of Python continues to grow among domain specialists and developers for its productivity benefits Challenge#1:
More informationHands-on with Intel Xeon Phi
Hands-on with Intel Xeon Phi Lab 2: Native Computing and Vector Reports Bill Barth Kent Milfeld Dan Stanzione 1 Lab 2 What you will learn about Evaluating and Analyzing vector performance. Controlling
More informationParallel Applications on Distributed Memory Systems. Le Yan HPC User LSU
Parallel Applications on Distributed Memory Systems Le Yan HPC User Services @ LSU Outline Distributed memory systems Message Passing Interface (MPI) Parallel applications 6/3/2015 LONI Parallel Programming
More informationIntel Manycore Platform Software Stack (Intel MPSS)
Intel Manycore Platform Software Stack (Intel MPSS) User's Guide April 2017 Copyright 2013-2017 Intel Corporation All Rights Reserved US Revision: 3.8 World Wide Web: http://www.intel.com Disclaimer and
More information