1 Specifics of software development for systems with distributed memory. Dmitry Durnov, 17 February 2016

2 Agenda: Modern cluster architecture (node level, cluster level); Programming models; Tools

3 Modern cluster architecture

4 Modern cluster architecture. Node level. [Diagram: two-socket node; each CPU with its memory controller (MC) and local memory, sockets linked by QPI, PCIe-attached co-processors and fast interconnect adapter, PCH (via DMI) with Ethernet and local SSD]

5 Modern cluster architecture. Xeon Phi. [Diagram: many-core Xeon Phi; cores with on-package MCDRAM, DDR4 memory controllers, PCIe-attached fast interconnect, PCH (via DMI) with Ethernet and SSD]

6 Modern cluster architecture. Node level
- HW:
  - Several multicore CPU sockets (2-4 sockets), a few GB of memory per core
  - Accelerator/Co-processor (12.8% of the total Top500 list)
  - Fast interconnect adapter (communication and I/O)
  - Slow interconnect adapter (management/ssh)
  - Local storage
- SW:
  - Linux OS (RHEL/SLES/CentOS/...)
  - Parallel file system (PVFS/PanFS/GPFS/Lustre/...)
  - Job manager, node level (LSF/PBS/Torque/SLURM/...)

7 Modern cluster architecture. Cluster level. Fat Tree topology. [Diagram: head node and compute nodes connected through a fat-tree hierarchy of switches]

8 Modern cluster architecture. Cluster level
- HW:
  - Interconnect switches/cables (Fat tree/Dragonfly/Butterfly/... topology)
- SW:
  - Parallel file system (PVFS/PanFS/GPFS/Lustre/...)
  - Job manager (LSF/PBS/Torque/SLURM/...)

9 Modern cluster architecture. Node level. CPU
- 64-bit architecture
- Out-of-order execution
- Xeon: up to 18 cores per socket (36 threads with Hyper-Threading)
- Xeon Phi: 60+ cores (240+ threads with Hyper-Threading)
- 1, 2, 4 socket configurations (QPI)
- Vectorization (AVX instruction sets, 256/512-bit vector length)
- 2 or 3 cache levels
- And many other features

10 Modern cluster architecture. Node level. Memory hierarchy
- Several levels of hierarchy:
  - L1 cache latency: ~4-5 cycles
  - L2 cache latency: ~10-12 cycles
  - L3 (LLC) cache latency: ~36-38 cycles
  - Local memory latency: on the order of hundreds of cycles
- NUMA impact: remote LLC and remote memory latencies are higher still
- Data locality is very important
[Diagram: two-socket NUMA node; per-core L1/L2 caches, shared LLC, memory controller (MC) with local memory, sockets connected via QPI]
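Because data locality matters so much, the way memory is first touched decides which NUMA node its pages land on. The following is an illustrative sketch only (array size and schedule are arbitrary assumptions, not from the slides): with the common first-touch policy, initializing the array from the same threads that later use it keeps most accesses on the local memory controller.

    #include <stdlib.h>

    #define N (64 * 1024 * 1024)

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double sum = 0.0;

        /* First touch: a page is mapped on the NUMA node of the thread that
         * writes it first, so initialize in parallel with the same schedule
         * that the compute loop will use. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = (double)i;

        /* Compute loop: same static schedule, so accesses stay mostly local. */
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i];

        free(a);
        return sum > 0 ? 0 : 1;
    }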

11 Modern cluster architecture. Interconnect
- InfiniBand technologies/APIs:
  - RDMA (ibverbs, udapl, mxm)
  - PSM (True Scale)
- Ethernet technologies/APIs:
  - TCP/IP (sockets)
  - RoCE (ibverbs, udapl, ...)
- Remote memory latency: ~1 usec (thousands of cycles)
- Accelerators/co-processors can use the interconnect directly (e.g. CCL-direct)
- OS-bypassing and zero-copy mechanisms
[Diagram: two nodes, each with CPU, memory and OS, connected through HCAs]

12 Modern cluster architecture. Interconnect. Intel Omni-Path HFI
- Key features:
  - Link speed: 100 Gbit/s
  - MPI latency: less than 1 usec end-to-end
  - High MPI message rate (160 mmps)
  - Scalable to tens of thousands of nodes
- APIs:
  - PSM2 (compatible with PSM)
  - OFI (Open Fabrics Interface)
  - ibverbs

13 Modern cluster architecture. Interconnect. OFI API [diagram]

14 Programming models

15 Programming models
- Node level:
  - Pthreads
  - OpenMP
  - TBB
  - Cilk Plus
- Cluster level:
  - MPI
  - Different PGAS models
- Hybrid model: MPI+X, e.g. MPI+OpenMP (see the sketch below)
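A minimal hybrid MPI+OpenMP skeleton, as an illustrative sketch (not taken from the slides): MPI covers the cluster level, OpenMP the node level. MPI_THREAD_FUNNELED is enough here because only the main thread calls MPI.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, rank, size;

        /* Request threading support; only the main thread will call MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Node-level parallelism inside each MPI rank. */
        #pragma omp parallel
        {
            #pragma omp critical
            printf("rank %d of %d, thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }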

17 What is MPI?
- MPI: Message Passing Interface
- Version 1.0 of the standard was released in June 1994; the current version is MPI 3.1
- Provides a language-independent API for point-to-point, collective and many other operations across distributed-memory systems
- Many implementations exist (MPICH, Intel MPI, MVAPICH, Cray MPT, Platform MPI, MS MPI, Open MPI, HPC-X, etc.)

18 MPI basics
- MPI provides a powerful, efficient and portable way of parallel programming
- MPI typically supports the SPMD model (MPMD is possible though), i.e. the same sub-program runs on each processor. The total program (all sub-programs of the program) must be started with the MPI startup tool.
- An MPI program communicates by means of messages (not streams)
- Rich API:
  - MPI environment
  - Point-to-point communication
  - Collective communication
  - One-sided communication (Remote Memory Access)
  - MPI datatypes
  - Application topologies
  - Profiling interface
  - File I/O
  - Dynamic processes

19 MPI Program

C:
    #include "mpi.h"
    #include <stdio.h>
    int main( int argc, char *argv[] )
    {
        MPI_Init( &argc, &argv );
        printf( "Hello, world!\n" );
        MPI_Finalize();
        return 0;
    }

Fortran:
    program main
    use MPI
    integer ierr
    call MPI_INIT( ierr )
    print *, 'Hello, world!'
    call MPI_FINALIZE( ierr )
    end

20 Point-to-Point Communication
- Messages are matched by the triplet of source, tag and communicator
- The tag is just a message mark (MPI_Recv may pass MPI_ANY_TAG to match a message with any tag)
- MPI_Recv may receive from any process by using MPI_ANY_SOURCE as the source
- A communicator represents two things: a group of processes and a communication context
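An illustrative point-to-point sketch (not from the slides) showing how source, tag and communicator identify a message: rank 0 sends one integer to rank 1, which could also match with MPI_ANY_SOURCE/MPI_ANY_TAG. Run with at least 2 ranks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* dest = 1, tag = 7, communicator = MPI_COMM_WORLD */
            MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            /* Could also match with MPI_ANY_SOURCE / MPI_ANY_TAG. */
            MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d from rank %d (tag %d)\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }

        MPI_Finalize();
        return 0;
    }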

21 Collective communication
- Represents different communication patterns, which may involve an arbitrary number of ranks
- Why wouldn't plain send and receive be enough? Optimization
- All collective operations involve every process in a given communicator
- MPI implementations may contain several algorithms for every collective
- Typically based on point-to-point functionality (but not necessarily)
- Can be divided into 3 categories: one-to-all, all-to-one, all-to-all
- There are regular, nonblocking and neighbor collectives

22 Collective communication. MPI_Bcast: one process (root) sends some chunk of data to the rest of the processes in the given communicator. [Diagram: possible broadcast algorithms]
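An illustrative MPI_Bcast call (a sketch, not from the slides): the root fills a buffer and every rank in the communicator ends up with the same data, whatever algorithm the implementation chooses underneath.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, data[4] = {0, 0, 0, 0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* root prepares the chunk of data */
            for (int i = 0; i < 4; i++)
                data[i] = i + 1;
        }

        /* Every rank calls MPI_Bcast; afterwards all ranks hold {1,2,3,4}. */
        MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d: %d %d %d %d\n", rank, data[0], data[1], data[2], data[3]);
        MPI_Finalize();
        return 0;
    }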

23 One-sided communication
- One process specifies all communication parameters, both for the sending side and for the receiving side
- Separate communication and synchronization
- No matching
- The process that initiates a one-sided communication operation is the origin process; the process that contains the memory being accessed is the target process
- Memory is exposed via the window concept
- Quite a rich API: a bunch of different window creation, communication, synchronization and atomic routines
- Example of a communication call:
  MPI_Put(const void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Win win)

24 One-sided communication. Window
- Memory that a process allows other processes to access via one-sided communication is called a window
- A group of processes expose their local windows to other processes by calling a collective function (e.g. MPI_Win_create, MPI_Win_allocate, etc.)
[Diagram: processes P1, P2, P3 each exposing a window]
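A sketch of the active-target (fence) style, complementing the passive-mode example on the next slide; illustrative only, not from the slides, and it assumes at least 2 ranks. Each rank exposes one integer through a window, and rank 0 writes into rank 1's window between two fences.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, winbuf = -1, value = 123;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Collective: every rank exposes one int as its local window. */
        MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open access epoch */
        if (rank == 0)
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);                 /* close epoch, data is visible */

        if (rank == 1)
            printf("rank 1 window now holds %d\n", winbuf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }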

25 Passive mode

    int b1 = 1, b2 = 2;
    int winbuf = -1;
    int rank;
    MPI_Win win;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* every rank exposes one int as its window */
    MPI_Win_create(&winbuf, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    if (rank == 0) {
        /* passive target: lock rank 1's window, put twice, unlock */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Put(&b1, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Put(&b2, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 1) {
        printf("my win %d\n", winbuf);
    }

26 MPI+MPI: Shared Memory Windows

    MPI_Comm_split_type(TYPE_SHARED)
    MPI_Win_allocate_shared(shr_comm)
    MPI_Win_shared_query(&baseptr)
    MPI_Win_lock_all(shr_win)
    // access baseptr[]
    MPI_Win_sync()
    // ...
    MPI_Win_unlock_all(shr_win)

- Leverage RMA to incorporate node-level programming
- RMA provides portable atomics, synchronization, ...
- Eliminates the X in MPI+X when only shared memory is needed
- Memory per core is not increasing
- Allows NUMA-aware mapping
- Each window piece is associated with the process that allocated it
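An illustrative expansion of the pseudocode above into a runnable sketch (names like shr_comm and baseptr follow the slide; everything else is an assumption): ranks on the same node allocate a shared window, and each rank reads its neighbour's slot through a pointer obtained with MPI_Win_shared_query.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, shr_rank, shr_size;
        MPI_Comm shr_comm;
        MPI_Win shr_win;
        int *baseptr;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Sub-communicator of ranks that share a node. */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shr_comm);
        MPI_Comm_rank(shr_comm, &shr_rank);
        MPI_Comm_size(shr_comm, &shr_size);

        /* Each rank contributes one int to a node-wide shared window. */
        MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                                shr_comm, &baseptr, &shr_win);

        MPI_Win_lock_all(0, shr_win);
        baseptr[0] = 100 + shr_rank;            /* write own slot */
        MPI_Win_sync(shr_win);                  /* make the store visible */
        MPI_Barrier(shr_comm);
        MPI_Win_sync(shr_win);

        /* Read the neighbour's slot via plain load/store, no MPI_Put needed. */
        MPI_Aint size; int disp; int *nbr;
        MPI_Win_shared_query(shr_win, (shr_rank + 1) % shr_size,
                             &size, &disp, &nbr);
        printf("rank %d sees neighbour value %d\n", rank, nbr[0]);
        MPI_Win_unlock_all(shr_win);

        MPI_Win_free(&shr_win);
        MPI_Comm_free(&shr_comm);
        MPI_Finalize();
        return 0;
    }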

27 Tools

28 How Intel Parallel Studio XE 2016 helps make Faster Code Faster for HPC
- Cluster Edition (HPC cluster): multi-fabric MPI library; MPI error checking and tuning; MPI messages
- Professional Edition: threading design & prototyping; parallel performance tuning; memory & thread correctness
- Composer Edition (vectorized & threaded node): Intel C++ and Fortran compilers; parallel models (e.g., OpenMP*); optimized libraries

29 Performance Tuning Tools for Distributed Applications
- Intel Trace Analyzer and Collector (tune cross-node MPI): visualize MPI behavior; evaluate MPI load balancing; find communication hotspots
- Intel VTune Amplifier XE (tune single-node threading): visualize thread behavior; evaluate thread load balancing; find thread sync bottlenecks

30 Intel Trace Analyzer and Collector. Overview
Intel Trace Analyzer and Collector helps the developer:
- Visualize and understand parallel application behavior
- Evaluate profiling statistics and load balancing
- Identify communication hotspots
Features:
- Event-based approach
- Low overhead
- Excellent scalability
- Powerful aggregation and filtering functions
- Performance Assistance and Imbalance Tuning
- NEW in 9.1: MPI Performance Snapshot

31 Using the Intel Trace Analyzer and Collector is Easy!
Step 1: Run your binary and create a tracefile
    $ mpirun -trace -n 2 ./test
Step 2: View the results:
    $ traceanalyzer &

32 Intel Trace Analyzer and Collector. Compare the event timelines of two communication profiles. Blue = computation, red = communication. Chart showing how the MPI processes interact.

33 Improving Load Balance: Real World Case Collapsed data per node and coprocessor card Too high load on Host = too low load on coprocessor Host 16 MPI procs x 1 OpenMP thread Coprocessor 8 MPI procs x 28 OpenMP threads

34 Improving Load Balance: Real World Case Collapsed data per node and coprocessor card Too low load on Host = too high load on coprocessor Host 16 MPI procs x 1 OpenMP thread Coprocessor 24 MPI procs x 8 OpenMP threads

35 Improving Load Balance: Real World Case Collapsed data per node and coprocessor card Perfect balance Host load = Coprocessor load Host 16 MPI procs x 1 OpenMP thread Coprocessor 16 MPI procs x 12 OpenMP thrds

36 Ideal Interconnect Simulator (Idealizer). Helps to figure out the application's imbalance by simulating its behavior in an ideal communication environment. Actual trace vs. idealized trace. An easy way to identify application bottlenecks.

37 MPI Performance Assistance. Automatic Performance Assistant: detects common MPI performance issues and gives automated tips on potential solutions. Automatically detect performance issues and their impact on runtime.

38 MPI Performance Snapshot. High-capacity MPI profiler.
- Lightweight: low-overhead profiling for 100K+ ranks
- Scalability: performance variation at scale can be detected sooner
- Identifying key metrics: shows PAPI counters and MPI/OpenMP imbalances

39 MPI Correctness Checking. Highlights:
- Checks and pinpoints hard-to-find run-time errors
- Unique feature to identify run-time errors
- Displays the correctness (parameter passing) of MPI communication for more robust and reliable MPI-based HPC applications
Run-time errors and warnings can be identified easily; a single mouse-click brings up more detailed information that helps to identify root causes. MPI statistics are shown as well.

40 Intel MPI Benchmarks 4.1
- Standard benchmarks with an OSI-compatible CPL license
- Enables testing of interconnects, systems, and MPI implementations
- Comprehensive set of MPI kernels that provide performance measurements for: point-to-point message passing; global data movement and computation routines; one-sided communications; file I/O
- Supports the MPI-1.x, MPI-2.x, and MPI-3.x standards
- What's new: introduction of new benchmarks that measure cumulative bandwidth and message rate values
The Intel MPI Benchmarks provide a simple and easy way to measure MPI performance on your cluster.

41 Online Resources
- Intel MPI Library product page
- Intel Trace Analyzer and Collector product page
- Intel Clusters and HPC Technology forums
- Intel Xeon Phi Coprocessor Developer Community

43 Backup

44 PGAS

45 GPI and MCTP
- GPI: Global Address Space Programming Interface (implements the GASPI standard)
- MCTP: Multi-Core Thread Package
- Developed by the Fraunhofer Institute for Industrial Mathematics ITWM
- GPI has been evolving since 2005 (it was called FVM, Fraunhofer Virtual Machine, prior to 2009)
- GPI is intended for distributed-memory systems; MCTP supplements GPI on the node level
- GPI/MCTP completely replaced MPI in the Fraunhofer Institutes
- They have big customers like Shell and some government assignments
- About 40 people work in the HPC area at Fraunhofer

46 OpenSHMEM
- Specification of an API for programming in the PGAS style (plus a reference implementation of this API). The specification is about 80 pages.
- It is an attempt to standardize different SHMEM implementations (Cray, SGI, Quadrics, HP, IBM, etc.)
- It is a library (like MPI), available for C and Fortran
- Main features: one-sided communication, natural overlap of computation and communication, atomic memory operations, collective calls, etc.
- Several implementations are available:
  - Mellanox: based on OpenMPI
  - TSHMEM: over the Tilera Tile-Gx architecture and libraries (many-core processors); a prototype for Xeon Phi has been developed and will be integrated into TSHMEM soon
  - OpenSHMEM on top of MPI-3: first official release in January; shows quite poor performance values
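A small illustrative OpenSHMEM sketch in C (assuming the OpenSHMEM 1.2+ API; not taken from the slides): each PE writes its ID into a symmetric variable on the next PE with a one-sided put.

    #include <stdio.h>
    #include <shmem.h>

    int dest = -1;   /* symmetric: exists on every PE */

    int main(void)
    {
        shmem_init();
        int me = shmem_my_pe();
        int npes = shmem_n_pes();

        /* One-sided put into the symmetric variable on the next PE. */
        shmem_int_put(&dest, &me, 1, (me + 1) % npes);

        shmem_barrier_all();   /* ensure all puts are complete and visible */

        printf("PE %d received %d\n", me, dest);
        shmem_finalize();
        return 0;
    }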

47 CAF
- CAF (Coarray Fortran) was an extension of Fortran 95/2003, dating back to the 1990s
- The Fortran 2008 standard includes coarrays, but with syntax slightly different from the original CAF
- Utilizes the concept of images; each image executes the same program independently from the others
- Coarray Fortran is usually implemented on top of MPI for better portability
- Coarray Fortran 2.0 is being developed by Rice University; it includes several additional features compared to the emerging Fortran 2008
- Several implementations exist: Cray CAF; Los Alamos Computer Science Institute; Intel Fortran Compiler XE supports CAF; the OpenUH compiler supports coarrays (in the form of Fortran 2008)

48 UPC
- UPC (Unified Parallel C) is an extension of the C language which assumes the SPMD programming model
- The desired number of threads (the UPC unit of execution) can be specified either at compile time or at run time
- UPC distinguishes two types of data: private and shared; shared data is accessible from all UPC threads
- UPC also provides some collectives, parallel I/O, synchronization primitives, etc.
- The latest specification version is 1.3 (16 Nov 2013)
- Several implementations exist: GWU UPC (also provides the UPC specification as well as the UPC collective and parallel I/O specifications); Berkeley UPC; GCC UPC; Florida UPC; MTU UPC

49 Some More PGAS
- Chapel: a PGAS language, a core language (not an extension)
- X10: an object-oriented programming language (IBM, open sourced); can use MPI as a network transport; MPI collectives are upcoming in X10 (Dec 2013)
- XcalableMP (U. of Tsukuba, U. of Tokyo, U. of Kyoto, etc.): a directive-based language extension similar to OpenMP; SPMD execution model
- Charm++: a C++ based language
- ARMCI (Aggregate Remote Memory Copy Interface): a library
- Global Arrays: a library
- etc.

50 Application optimization/tuning example

51 Project goals
- Port to Intel Xeon Phi and reach tangible performance gains vs. the initial Xeon-only baseline
- Test-drive Intel Cluster Studio XE on Xeon Phi
- Create a case study with practical recommendations reusable in other cases
- Not a goal: to create the best-performing ray tracer; refer to dedicated projects (e.g. Embree by Intel Labs)

52 Tachyon ray tracer
- Open source ray tracing demo
- Part of the SPEC MPI suite
- Supports parallelism (MPI + OpenMP)

53 Computational modes: real-time rendering; throughput computing. Images (c) Audi, DreamWorks. Production of Puss in Boots required 69 million render hours.

54 Tachyon algorithm
- The 3D model is a set of primitives (e.g. triangles)
- The 3D space is pre-divided into a grid; each voxel points to the list of triangles contained in or crossing it
- An image pixel is calculated using ray intersections (lights, reflections, shadows)
- Hybrid parallelism: each frame is divided into chunks processed by MPI processes; a chunk is divided into lines processed by OpenMP threads
[Diagram: MPI ranks 0..4, each spawning OpenMP threads 0..n]

55 Known issues of the algorithm
- Communication profile: 1 master and n workers; workers communicate with the master only; the master performs the same computations plus processing. A bottleneck and limited scalability.
- Each frame starts after the previous one. There is no explicit synchronization, but because of communication channel saturation the workers have to wait for the master.
- Work imbalance: lines and frames have different complexities. Hybrid parallelism with dynamic OpenMP scheduling helps to relieve it; static MPI scheduling still exhibits the issue across frames.
- Limited scalability across a Xeon cluster; the MPI+OpenMP hybrid is better than MPI only.

56 Recap of Part 1
- Communication profile: 1 master and n workers; the master performs the same computations
- # of communications reduced thanks to buffered messages
- Work imbalance: lines and frames have different complexities; hybrid parallelism with dynamic OpenMP scheduling helps to relieve it; static MPI scheduling still exhibits the issue across frames
- Even if improved, scalability will still be limited

57 Extra challenge: imbalance across Xeon and Xeon Phi
- Xeon and Xeon Phi have different performance
- How to split up the work?
- Which execution model to choose?
- Is ray tracing good for Xeon Phi?

58 Porting: efficient apps for Xeon Phi. Your application needs to meet certain requirements to use Xeon Phi best:
1. Allow massive parallelism (to load 60+ cores x 4 threads)
2. Run intensive computations (to efficiently use 512-bit vectors)
3. Provide memory efficiency (to meet current 6-16 GB constraints)
Tachyon's profile: no slack, available parallel work (frame height) ~ # of threads; no vectorizable loops, only scalar computations.

59 Target execution model: symmetric MPI. [Diagram: NATIVE model (run everything on Xeon Phi), OFFLOAD model (Xeon offloads via directives), SYMMETRIC model (MPI ranks on both Xeon and Xeon Phi)]. Most flexible; least number of code changes.

60 Build for Xeon Phi. No code changes, only the makefile:
- -mmic: target platform is Xeon Phi
- -fp-model fast=2: trade-off between accuracy and performance, OK for ray tracing
Very easy! Running code in a minute.

61 Why -fp-model fast=2? With the default flag, a reciprocal (1/x) computation unexpectedly became a hotspot on Phi (not on Xeon): the compiler generated heavy-weight code for higher precision. -fp-model fast=2 is a trade-off favoring performance (precision is still fine for ray tracing). Reciprocal calculation time reduced by >2x.

62 Run

    export I_MPI_MIC=enable
    mpiexec.hydra \
      -n 2 -host mynode1 <command-line> : \
      -n 2 -host mynode2 <command-line> : \
      -n 2 -host mynodeN <command-line> : \
      -n 2 -host mynode1-mic0 <command-line> : \
      -n 2 -host mynode1-mic1 <command-line> : \
      -n 2 -host mynode2-mic0 <command-line> : \
      -n 2 -host mynodeN-mic1 <command-line>

Same syntax. A Phi card is just like another node.

63 First results
- 4 nodes x 2 SNB: ... FPS
- 4 nodes x 1 KNC: 38 FPS ???
- 4 nodes x (2 SNB + 1 KNC): 39 FPS !!!
SNB = Sandy Bridge, 2nd generation Intel Core processors; KNC = Knights Corner, Intel Xeon Phi co-processors.
The heterogeneous run slows down. Need to understand what happens.

64 Using Intel Trace Analyzer and Collector
- Multiple synchronizations: all processes wait for the master
- MPI overhead is significant compared to useful work

65 Using VTune Amplifier XE OpenMP overhead within each frame due to work imbalance (result collected on 61 threads; 244 threads will worsen the imbalance)

66 Using VTune Amplifier XE. Poor vectorization of the hotspots. Recall: good vectorization is a prerequisite for efficient Xeon Phi use.

67 Conclusions
- No vectorization: 512-bit registers (able to hold 16 floats) are wasted
- Insufficient parallelism: 240 hyper-threads are wasted
- Ranks on Xeon Phi run slower than on Xeon
- Due to static MPI scheduling within each frame and frame-by-frame computation, the Xeons cannot start a new frame until the Xeon Phis complete their lines. Total performance suffers.

68 Improvement directions
- Dynamic balancing across MPI ranks
- SIMD: exploit vectors
- Efficient intra-process OpenMP parallelism
This works for both Xeon Phi and Xeon.

69 #1 - Dynamic MPI scheduling
- Each worker computes an entire frame: it asks the master for a frame #, computes and sends back the entire frame
- The master maintains a circular buffer, dispatches frame #s and displays frames. No computation by the master.
- Circular buffer to avoid memory growth
- Significantly reduces the # of communications
- Reduced synchronizations: a worker no longer waits for the others
- Compensates the Xeon vs. Xeon Phi difference
- Increases scalability
- Improves both Xeon Phi and Xeon-only! (see the sketch below)
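A hedged sketch of the producer-consumer scheme described above (not the actual Tachyon code; render_frame, the tags and the frame count are illustrative placeholders): the master hands out frame numbers on request and workers pull work until they receive a stop marker.

    #include <mpi.h>

    #define NFRAMES  100
    #define TAG_REQ  1
    #define TAG_WORK 2

    /* Hypothetical stand-in for rendering one frame. */
    static void render_frame(int frame) { (void)frame; }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                       /* thin master: dispatch only */
            int next = 0, active = size - 1;
            while (active > 0) {
                int dummy, frame;
                MPI_Status st;
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                         MPI_COMM_WORLD, &st);
                frame = (next < NFRAMES) ? next++ : -1;   /* -1 = no more work */
                if (frame < 0) active--;
                MPI_Send(&frame, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            }
        } else {                               /* worker: pull frames until done */
            for (;;) {
                int dummy = 0, frame;
                MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
                MPI_Recv(&frame, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (frame < 0) break;
                render_frame(frame);           /* ...and would send pixels back */
            }
        }

        MPI_Finalize();
        return 0;
    }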

70 Code change
- Producer-consumer-like algorithm
- New algorithm: ~250 lines in the main loop
- Not Xeon Phi specific: it could have been implemented to address limited Xeon scalability; Xeon Phi just triggered it. This is important: you optimize for Xeon, benefit everywhere!
- Non-trivial, but not rocket science. Double ROI.

71 Re-running Intel Trace Analyzer and Collector
- MPI processes are doing useful work, not waiting for each other
- The thin master is quickly dispatching the work and polling for completion status

72 Re-running Intel Trace Analyzer and Collector (cont'd). Each Xeon process (P1 and P2) processes 2x the data of each Xeon Phi process (P3-P10). Processes are no longer gated by each other.

73 #2. Improve OpenMP parallelism
- Create parallel slack by reducing the chunk size: from a line to a few pixels, but >= a cache line (to avoid false sharing)
- Keep dynamic scheduling (OMP_SCHEDULE=dynamic)
- Enables massive parallelism (# of chunks >> HW threads)
- Compensates different line complexities
- Also helps on Xeon

74 Code change: 6 new lines, an OpenMP for-loop by pixel # instead of by line #. A straightforward change, the same parallel model (OpenMP). Again, double ROI. (A sketch follows below.)
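A hedged sketch of what such a change could look like (render_pixel, the image size and the chunk constant are illustrative assumptions, not the real Tachyon code): iterate over pixels instead of lines and let dynamic scheduling hand out cache-line-sized chunks.

    #define WIDTH  1920
    #define HEIGHT 1080
    #define CHUNK  16   /* >= a cache line worth of pixels, avoids false sharing */

    /* Hypothetical stand-in for the real per-pixel ray tracing kernel. */
    static float render_pixel(int x, int y) { return (float)(x + y); }

    void render_frame_omp(float *image)
    {
        /* Loop over pixels, not lines: # of chunks >> # of HW threads. */
        #pragma omp parallel for schedule(dynamic, CHUNK)
        for (int p = 0; p < WIDTH * HEIGHT; p++) {
            int x = p % WIDTH;
            int y = p / WIDTH;
            image[p] = render_pixel(x, y);
        }
    }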

75 Re-running with Amplifier XE. OpenMP overhead significantly reduced. The timeline is clean, reflecting good work balance.

76 #3. Exploiting SIMD (Single Instruction Multiple Data) How to utilize vectorization when: there are no loops in a hotspot function (tri_intersect)? the hotspot function is called on a linked list (grid_intersect)?

77 Code change: new data structures
- Composite triangles: SSE holds 4 triangles, AVX 8, Xeon Phi 16
- Structure Of Arrays: a register holding 4/8/16 float coordinates (x, y or z) per vertex (V1, V2, V3)
- A bit mask describes real / void triangle lanes
- A small library of vector operations (+, -, dot-, cross-product, ...) using intrinsics; reused from Embree for SSE/AVX, extended for Phi
- A single C++ template intersection (et al.) function, no code duplication
- Again, double ROI: improves both Xeon Phi and Xeon!
[Diagram: packed X1..X16 / Y1..Y16 coordinate registers]
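A hedged sketch of the structure-of-arrays idea (names and the 16-wide constant are illustrative, not the project's actual code): coordinates of up to 16 triangles are stored component-wise so one loop computes, for example, a dot product for all lanes at once and can be auto-vectorized or mapped to intrinsics.

    #define SIMD_WIDTH 16     /* 4 for SSE, 8 for AVX, 16 for Xeon Phi */

    /* Structure Of Arrays: one register-friendly lane per triangle. */
    typedef struct {
        float x[SIMD_WIDTH], y[SIMD_WIDTH], z[SIMD_WIDTH];
    } vec_soa;

    typedef struct {
        vec_soa v1, v2, v3;      /* triangle vertices */
        unsigned short mask;     /* bit i set => lane i holds a real triangle */
    } composite_tri;

    /* Dot product of one ray direction against all lanes at once.
     * The compiler can vectorize this loop; intrinsics would do the same by hand. */
    static inline void dot_all(const vec_soa *a,
                               float bx, float by, float bz,
                               float out[SIMD_WIDTH])
    {
        for (int i = 0; i < SIMD_WIDTH; i++)
            out[i] = a->x[i] * bx + a->y[i] * by + a->z[i] * bz;
    }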

78 Code change (cont'd)

79 SIMD benefits
- One intersection with multiple triangles at once
- The approach can be used for multi-ray intersections, as used by Embree and Autodesk's ray tracer
- A small extra overhead during scene load (each grid cell rebuilds its list of simple triangles into composites), but a benefit in heavy computations
- Intrinsics can be replaced with direct loops and the compiler's auto-vectorization to improve portability
- Again, double ROI: improves both Xeon Phi and Xeon!

80 Re-running with Amplifier XE High Vector Unit usage (13.7) (although some memory latency issues remain)

81 Updated results
- 4 nodes x 2 SNB: ... FPS
- 4 nodes x 1 KNC: ... FPS
- 4 nodes x (2 SNB + 1 KNC): ... FPS, a 9.7x speed-up!
Speed-up on both Xeon and Xeon Phi (1.5x and 4.6x); Xeon and Xeon Phi add to each other.

82 Parallel programming for Intel architecture (Intel Xeon and Intel Xeon Phi)
- Intel Xeon E5: 8 cores, 16 threads, SIMD-256
- Intel Xeon Phi: 60 cores, 240 threads, SIMD-512
- Parallelism at all levels (application, MPI processes, threads, vectors/SIMD), with Intel software tools. Maximize your ROI!

83 Next steps
- Experiment with prefetching
- Replace intrinsics with plain C and rely on vectorization by the compiler
- Experiment with replacing linked lists with arrays
- Fine-tune with affinity settings (e.g. KMP_AFFINITY=balanced)

84 Summary
- The application must meet certain criteria to benefit from Xeon Phi
- You might need to apply reasonable effort to achieve that
- Good news: you can optimize for Xeon and benefit on Xeon Phi, and vice versa
- You use the same tools and programming models, the same code
