1 Specifics of software development for systems with distributed memory. Dmitry Durnov, 17 February 2016

2 Agenda: Modern cluster architecture (node level, cluster level); Programming models; Tools

3 Modern cluster architecture

4 Modern cluster architecture. Node level. [Diagram: two-socket node; each CPU with its memory controller (MC) and local memory, sockets linked by QPI, PCIe-attached co-processors and fast interconnect adapter, PCH (via DMI) with Ethernet and local SSD]

5 Modern cluster architecture. Xeon Phi. [Diagram: many-core Xeon Phi; cores with on-package MCDRAM, DDR4 memory controllers, PCIe-attached fast interconnect, PCH (via DMI) with Ethernet and SSD]

6 Modern cluster architecture. Node level
- HW:
  - Several multicore CPU sockets (2-4 sockets), a few GB of memory per core
  - Accelerator/Co-processor (12.8% of the total Top500 list)
  - Fast interconnect adapter (communication and I/O)
  - Slow interconnect adapter (management/ssh)
  - Local storage
- SW:
  - Linux OS (RHEL/SLES/CentOS/...)
  - Parallel file system (PVFS/PanFS/GPFS/Lustre/...)
  - Job manager, node level (LSF/PBS/Torque/SLURM/...)

7 Modern cluster architecture. Cluster level. Fat Tree topology. [Diagram: head node and compute nodes connected through a fat-tree hierarchy of switches]

8 Modern cluster architecture. Cluster level
- HW:
  - Interconnect switches/cables (Fat tree/Dragonfly/Butterfly/... topology)
- SW:
  - Parallel file system (PVFS/PanFS/GPFS/Lustre/...)
  - Job manager (LSF/PBS/Torque/SLURM/...)

9 Modern cluster architecture. Node level. CPU
- 64-bit architecture
- Out-of-order execution
- Xeon: up to 18 cores per socket (36 threads with Hyper-Threading)
- Xeon Phi: 60+ cores (240+ threads with Hyper-Threading)
- 1, 2, 4 socket configurations (QPI)
- Vectorization (AVX instruction sets, 256/512-bit vector length)
- 2 or 3 cache levels
- And many other features

10 Modern cluster architecture. Node level. Memory hierarchy
- Several levels of hierarchy:
  - L1 cache latency: ~4-5 cycles
  - L2 cache latency: ~10-12 cycles
  - L3 (LLC) cache latency: ~36-38 cycles
  - Local memory latency: on the order of hundreds of cycles
- NUMA impact: remote LLC and remote memory latencies are higher still
- Data locality is very important
[Diagram: two-socket NUMA node; per-core L1/L2 caches, shared LLC, memory controller (MC) with local memory, sockets connected via QPI]
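Because data locality matters so much, the way memory is first touched decides which NUMA node its pages land on. The following is an illustrative sketch only (array size and schedule are arbitrary assumptions, not from the slides): with the common first-touch policy, initializing the array from the same threads that later use it keeps most accesses on the local memory controller.

    #include <stdlib.h>

    #define N (64 * 1024 * 1024)

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double sum = 0.0;

        /* First touch: a page is mapped on the NUMA node of the thread that
         * writes it first, so initialize in parallel with the same schedule
         * that the compute loop will use. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = (double)i;

        /* Compute loop: same static schedule, so accesses stay mostly local. */
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i];

        free(a);
        return sum > 0 ? 0 : 1;
    }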

11 Modern cluster architecture. Interconnect
- InfiniBand technologies/APIs:
  - RDMA (ibverbs, udapl, mxm)
  - PSM (True Scale)
- Ethernet technologies/APIs:
  - TCP/IP (sockets)
  - RoCE (ibverbs, udapl, ...)
- Remote memory latency: ~1 usec (thousands of cycles)
- Accelerators/co-processors can use the interconnect directly (e.g. CCL-direct)
- OS-bypassing and zero-copy mechanisms
[Diagram: two nodes, each with CPU, memory and OS, connected through HCAs]

12 Modern cluster architecture. Interconnect. Intel Omni-Path HFI
- Key features:
  - Link speed: 100 Gbit/s
  - MPI latency: less than 1 usec end-to-end
  - High MPI message rate (160 mmps)
  - Scalable to tens of thousands of nodes
- APIs:
  - PSM2 (compatible with PSM)
  - OFI (Open Fabrics Interface)
  - ibverbs

13 Modern cluster architecture. Interconnect. OFI API [diagram]

14 Programming models

15 Programming models
- Node level:
  - Pthreads
  - OpenMP
  - TBB
  - Cilk Plus
- Cluster level:
  - MPI
  - Different PGAS models
- Hybrid model: MPI+X, e.g. MPI+OpenMP (see the sketch below)
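A minimal hybrid MPI+OpenMP skeleton, as an illustrative sketch (not taken from the slides): MPI covers the cluster level, OpenMP the node level. MPI_THREAD_FUNNELED is enough here because only the main thread calls MPI.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, rank, size;

        /* Request threading support; only the main thread will call MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Node-level parallelism inside each MPI rank. */
        #pragma omp parallel
        {
            #pragma omp critical
            printf("rank %d of %d, thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }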

17 What is MPI?
- MPI: Message Passing Interface
- Version 1.0 of the standard was released in June 1994; the current version is MPI 3.1
- Provides a language-independent API for point-to-point, collective and many other operations across distributed-memory systems
- Many implementations exist (MPICH, Intel MPI, MVAPICH, Cray MPT, Platform MPI, MS MPI, Open MPI, HPC-X, etc.)

18 MPI basics
- MPI provides a powerful, efficient and portable way of parallel programming
- MPI typically supports the SPMD model (MPMD is possible though), i.e. the same sub-program runs on each processor. The total program (all sub-programs of the program) must be started with the MPI startup tool.
- An MPI program communicates by means of messages (not streams)
- Rich API:
  - MPI environment
  - Point-to-point communication
  - Collective communication
  - One-sided communication (Remote Memory Access)
  - MPI datatypes
  - Application topologies
  - Profiling interface
  - File I/O
  - Dynamic processes

19 MPI Program

C:
    #include "mpi.h"
    #include <stdio.h>
    int main( int argc, char *argv[] )
    {
        MPI_Init( &argc, &argv );
        printf( "Hello, world!\n" );
        MPI_Finalize();
        return 0;
    }

Fortran:
    program main
    use MPI
    integer ierr
    call MPI_INIT( ierr )
    print *, 'Hello, world!'
    call MPI_FINALIZE( ierr )
    end

20 Point-to-Point Communication
- Messages are matched by the triplet of source, tag and communicator
- The tag is just a message mark (MPI_Recv may pass MPI_ANY_TAG to match a message with any tag)
- MPI_Recv may receive from any process by using MPI_ANY_SOURCE as the source
- A communicator represents two things: a group of processes and a communication context
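An illustrative point-to-point sketch (not from the slides) showing how source, tag and communicator identify a message: rank 0 sends one integer to rank 1, which could also match with MPI_ANY_SOURCE/MPI_ANY_TAG. Run with at least 2 ranks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* dest = 1, tag = 7, communicator = MPI_COMM_WORLD */
            MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            /* Could also match with MPI_ANY_SOURCE / MPI_ANY_TAG. */
            MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d from rank %d (tag %d)\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }

        MPI_Finalize();
        return 0;
    }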

21 Collective communication
- Represents different communication patterns, which may involve an arbitrary number of ranks
- Why wouldn't plain send and receive be enough? Optimization
- All collective operations involve every process in a given communicator
- MPI implementations may contain several algorithms for every collective
- Typically based on point-to-point functionality (but not necessarily)
- Can be divided into 3 categories: one-to-all, all-to-one, all-to-all
- There are regular, nonblocking and neighbor collectives

22 Collective communication. MPI_Bcast: one process (root) sends some chunk of data to the rest of the processes in the given communicator. [Diagram: possible broadcast algorithms]
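An illustrative MPI_Bcast call (a sketch, not from the slides): the root fills a buffer and every rank in the communicator ends up with the same data, whatever algorithm the implementation chooses underneath.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, data[4] = {0, 0, 0, 0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* root prepares the chunk of data */
            for (int i = 0; i < 4; i++)
                data[i] = i + 1;
        }

        /* Every rank calls MPI_Bcast; afterwards all ranks hold {1,2,3,4}. */
        MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d: %d %d %d %d\n", rank, data[0], data[1], data[2], data[3]);
        MPI_Finalize();
        return 0;
    }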

23 One-sided communication
- One process specifies all communication parameters, both for the sending side and for the receiving side
- Separate communication and synchronization
- No matching
- The process that initiates a one-sided communication operation is the origin process; the process that contains the memory being accessed is the target process
- Memory is exposed via the window concept
- Quite a rich API: a bunch of different window creation, communication, synchronization and atomic routines
- Example of a communication call:
  MPI_Put(const void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Win win)

24 One-sided communication. Window
- Memory that a process allows other processes to access via one-sided communication is called a window
- A group of processes expose their local windows to other processes by calling a collective function (e.g. MPI_Win_create, MPI_Win_allocate, etc.)
[Diagram: processes P1, P2, P3 each exposing a window]
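A sketch of the active-target (fence) style, complementing the passive-mode example on the next slide; illustrative only, not from the slides, and it assumes at least 2 ranks. Each rank exposes one integer through a window, and rank 0 writes into rank 1's window between two fences.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, winbuf = -1, value = 123;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Collective: every rank exposes one int as its local window. */
        MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open access epoch */
        if (rank == 0)
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);                 /* close epoch, data is visible */

        if (rank == 1)
            printf("rank 1 window now holds %d\n", winbuf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }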

25 Passive mode

    int b1 = 1, b2 = 2;
    int winbuf = -1;
    int rank;
    MPI_Win win;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* every rank exposes one int as its window */
    MPI_Win_create(&winbuf, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    if (rank == 0) {
        /* passive target: lock rank 1's window, put twice, unlock */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Put(&b1, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Put(&b2, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 1) {
        printf("my win %d\n", winbuf);
    }

26 MPI+MPI: Shared Memory Windows

    MPI_Comm_split_type(TYPE_SHARED)
    MPI_Win_allocate_shared(shr_comm)
    MPI_Win_shared_query(&baseptr)
    MPI_Win_lock_all(shr_win)
    // access baseptr[]
    MPI_Win_sync()
    // ...
    MPI_Win_unlock_all(shr_win)

- Leverage RMA to incorporate node-level programming
- RMA provides portable atomics, synchronization, ...
- Eliminates the X in MPI+X when only shared memory is needed
- Memory per core is not increasing
- Allows NUMA-aware mapping
- Each window piece is associated with the process that allocated it
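An illustrative expansion of the pseudocode above into a runnable sketch (names like shr_comm and baseptr follow the slide; everything else is an assumption): ranks on the same node allocate a shared window, and each rank reads its neighbour's slot through a pointer obtained with MPI_Win_shared_query.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, shr_rank, shr_size;
        MPI_Comm shr_comm;
        MPI_Win shr_win;
        int *baseptr;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Sub-communicator of ranks that share a node. */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shr_comm);
        MPI_Comm_rank(shr_comm, &shr_rank);
        MPI_Comm_size(shr_comm, &shr_size);

        /* Each rank contributes one int to a node-wide shared window. */
        MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                                shr_comm, &baseptr, &shr_win);

        MPI_Win_lock_all(0, shr_win);
        baseptr[0] = 100 + shr_rank;            /* write own slot */
        MPI_Win_sync(shr_win);                  /* make the store visible */
        MPI_Barrier(shr_comm);
        MPI_Win_sync(shr_win);

        /* Read the neighbour's slot via plain load/store, no MPI_Put needed. */
        MPI_Aint size; int disp; int *nbr;
        MPI_Win_shared_query(shr_win, (shr_rank + 1) % shr_size,
                             &size, &disp, &nbr);
        printf("rank %d sees neighbour value %d\n", rank, nbr[0]);
        MPI_Win_unlock_all(shr_win);

        MPI_Win_free(&shr_win);
        MPI_Comm_free(&shr_comm);
        MPI_Finalize();
        return 0;
    }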

27 Tools

28 How Intel Parallel Studio XE 2016 helps make Faster Code Faster for HPC
- Cluster Edition (HPC cluster): multi-fabric MPI library; MPI error checking and tuning; MPI messages
- Professional Edition: threading design & prototyping; parallel performance tuning; memory & thread correctness
- Composer Edition (vectorized & threaded node): Intel C++ and Fortran compilers; parallel models (e.g., OpenMP*); optimized libraries

29 Performance Tuning Tools for Distributed Applications
- Intel Trace Analyzer and Collector (tune cross-node MPI): visualize MPI behavior; evaluate MPI load balancing; find communication hotspots
- Intel VTune Amplifier XE (tune single-node threading): visualize thread behavior; evaluate thread load balancing; find thread sync bottlenecks

30 Intel Trace Analyzer and Collector. Overview
Intel Trace Analyzer and Collector helps the developer:
- Visualize and understand parallel application behavior
- Evaluate profiling statistics and load balancing
- Identify communication hotspots
Features:
- Event-based approach
- Low overhead
- Excellent scalability
- Powerful aggregation and filtering functions
- Performance Assistance and Imbalance Tuning
- NEW in 9.1: MPI Performance Snapshot

31 Using the Intel Trace Analyzer and Collector is Easy!
Step 1: Run your binary and create a tracefile
    $ mpirun -trace -n 2 ./test
Step 2: View the results:
    $ traceanalyzer &

32 Intel Trace Analyzer and Collector. Compare the event timelines of two communication profiles. Blue = computation, red = communication. Chart showing how the MPI processes interact.

33 Improving Load Balance: Real World Case Collapsed data per node and coprocessor card Too high load on Host = too low load on coprocessor Host 16 MPI procs x 1 OpenMP thread Coprocessor 8 MPI procs x 28 OpenMP threads

34 Improving Load Balance: Real World Case Collapsed data per node and coprocessor card Too low load on Host = too high load on coprocessor Host 16 MPI procs x 1 OpenMP thread Coprocessor 24 MPI procs x 8 OpenMP threads

35 Improving Load Balance: Real World Case Collapsed data per node and coprocessor card Perfect balance Host load = Coprocessor load Host 16 MPI procs x 1 OpenMP thread Coprocessor 16 MPI procs x 12 OpenMP thrds

36 Ideal Interconnect Simulator (Idealizer). Helps to figure out the application's imbalance by simulating its behavior in an ideal communication environment. Actual trace vs. idealized trace. An easy way to identify application bottlenecks.

37 MPI Performance Assistance. Automatic Performance Assistant: detects common MPI performance issues and gives automated tips on potential solutions. Automatically detect performance issues and their impact on runtime.

38 MPI Performance Snapshot. High-capacity MPI profiler.
- Lightweight: low-overhead profiling for 100K+ ranks
- Scalability: performance variation at scale can be detected sooner
- Identifying key metrics: shows PAPI counters and MPI/OpenMP imbalances

39 MPI Correctness Checking. Highlights:
- Checks and pinpoints hard-to-find run-time errors
- Unique feature to identify run-time errors
- Displays the correctness (parameter passing) of MPI communication for more robust and reliable MPI-based HPC applications
Run-time errors and warnings can be identified easily; a single mouse-click brings up more detailed information that helps to identify root causes. MPI statistics are shown as well.

40 Intel MPI Benchmarks 4.1
- Standard benchmarks with an OSI-compatible CPL license
- Enables testing of interconnects, systems, and MPI implementations
- Comprehensive set of MPI kernels that provide performance measurements for: point-to-point message passing; global data movement and computation routines; one-sided communications; file I/O
- Supports the MPI-1.x, MPI-2.x, and MPI-3.x standards
- What's new: introduction of new benchmarks that measure cumulative bandwidth and message rate values
The Intel MPI Benchmarks provide a simple and easy way to measure MPI performance on your cluster.

41 Online Resources
- Intel MPI Library product page
- Intel Trace Analyzer and Collector product page
- Intel Clusters and HPC Technology forums
- Intel Xeon Phi Coprocessor Developer Community

43 Backup

44 PGAS

45 GPI and MCTP
- GPI: Global Address Space Programming Interface (implements the GASPI standard)
- MCTP: Multi-Core Thread Package
- Developed by the Fraunhofer Institute for Industrial Mathematics ITWM
- GPI has been evolving since 2005 (it was called FVM, Fraunhofer Virtual Machine, prior to 2009)
- GPI is intended for distributed-memory systems; MCTP supplements GPI on the node level
- GPI/MCTP completely replaced MPI in the Fraunhofer Institutes
- They have big customers like Shell and some government assignments
- About 40 people work in the HPC area at Fraunhofer

46 OpenSHMEM
- Specification of an API for programming in the PGAS style (plus a reference implementation of this API). The specification is about 80 pages.
- It is an attempt to standardize different SHMEM implementations (Cray, SGI, Quadrics, HP, IBM, etc.)
- It is a library (like MPI), available for C and Fortran
- Main features: one-sided communication, natural overlap of computation and communication, atomic memory operations, collective calls, etc.
- Several implementations are available:
  - Mellanox: based on OpenMPI
  - TSHMEM: over the Tilera Tile-Gx architecture and libraries (many-core processors); a prototype for Xeon Phi has been developed and will be integrated into TSHMEM soon
  - OpenSHMEM on top of MPI-3: first official release in January; shows quite poor performance values
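A small illustrative OpenSHMEM sketch in C (assuming the OpenSHMEM 1.2+ API; not taken from the slides): each PE writes its ID into a symmetric variable on the next PE with a one-sided put.

    #include <stdio.h>
    #include <shmem.h>

    int dest = -1;   /* symmetric: exists on every PE */

    int main(void)
    {
        shmem_init();
        int me = shmem_my_pe();
        int npes = shmem_n_pes();

        /* One-sided put into the symmetric variable on the next PE. */
        shmem_int_put(&dest, &me, 1, (me + 1) % npes);

        shmem_barrier_all();   /* ensure all puts are complete and visible */

        printf("PE %d received %d\n", me, dest);
        shmem_finalize();
        return 0;
    }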

47 CAF
- CAF (Coarray Fortran) was an extension of Fortran 95/2003, dating back to the 1990s
- The Fortran 2008 standard includes coarrays, but with syntax slightly different from the original CAF
- Utilizes the concept of images; each image executes the same program independently from the others
- Coarray Fortran is usually implemented on top of MPI for better portability
- Coarray Fortran 2.0 is being developed by Rice University; it includes several additional features compared to the emerging Fortran 2008
- Several implementations exist: Cray CAF; Los Alamos Computer Science Institute; Intel Fortran Compiler XE supports CAF; the OpenUH compiler supports coarrays (in the form of Fortran 2008)

48 UPC
- UPC (Unified Parallel C) is an extension of the C language which assumes the SPMD programming model
- The desired number of threads (the UPC unit of execution) can be specified either at compile time or at run time
- UPC distinguishes two types of data: private and shared; shared data is accessible from all UPC threads
- UPC also provides some collectives, parallel I/O, synchronization primitives, etc.
- The latest specification version is 1.3 (16 Nov 2013)
- Several implementations exist: GWU UPC (also provides the UPC specification as well as the UPC collective and parallel I/O specifications); Berkeley UPC; GCC UPC; Florida UPC; MTU UPC

49 Some More PGAS
- Chapel: a PGAS language, a core language (not an extension)
- X10: an object-oriented programming language (IBM, open sourced); can use MPI as a network transport; MPI collectives are upcoming in X10 (Dec 2013)
- XcalableMP (U. of Tsukuba, U. of Tokyo, U. of Kyoto, etc.): a directive-based language extension similar to OpenMP; SPMD execution model
- Charm++: a C++ based language
- ARMCI (Aggregate Remote Memory Copy Interface): a library
- Global Arrays: a library
- etc.

50 Application optimization/tuning example

51 Project goals
- Port to Intel Xeon Phi and reach tangible performance gains vs. the initial Xeon-only baseline
- Test-drive Intel Cluster Studio XE on Xeon Phi
- Create a case study with practical recommendations reusable in other cases
- Not a goal: to create the best-performing ray tracer; refer to dedicated projects (e.g. Embree by Intel Labs)

52 Tachyon ray tracer
- Open source ray tracing demo
- Part of the SPEC MPI suite
- Supports parallelism (MPI + OpenMP)

53 Computational modes: real-time rendering; throughput computing. Images (c) Audi, DreamWorks. Production of Puss in Boots required 69 million render hours.

54 Tachyon algorithm
- The 3D model is a set of primitives (e.g. triangles)
- The 3D space is pre-divided into a grid; each voxel points to the list of triangles contained in or crossing it
- An image pixel is calculated using ray intersections (lights, reflections, shadows)
- Hybrid parallelism: each frame is divided into chunks processed by MPI processes; a chunk is divided into lines processed by OpenMP threads
[Diagram: MPI ranks 0..4, each spawning OpenMP threads 0..n]

55 Known issues of the algorithm
- Communication profile: 1 master and n workers; workers communicate with the master only; the master performs the same computations plus processing. A bottleneck and limited scalability.
- Each frame starts after the previous one. There is no explicit synchronization, but because of communication channel saturation the workers have to wait for the master.
- Work imbalance: lines and frames have different complexities. Hybrid parallelism with dynamic OpenMP scheduling helps to relieve it; static MPI scheduling still exhibits the issue across frames.
- Limited scalability across a Xeon cluster; the MPI+OpenMP hybrid is better than MPI only.

56 Recap of Part 1
- Communication profile: 1 master and n workers; the master performs the same computations
- # of communications reduced thanks to buffered messages
- Work imbalance: lines and frames have different complexities; hybrid parallelism with dynamic OpenMP scheduling helps to relieve it; static MPI scheduling still exhibits the issue across frames
- Even if improved, scalability will still be limited

57 Extra challenge: imbalance across Xeon and Xeon Phi
- Xeon and Xeon Phi have different performance
- How to split up the work?
- Which execution model to choose?
- Is ray tracing good for Xeon Phi?

58 Porting: efficient apps for Xeon Phi. Your application needs to meet certain requirements to use Xeon Phi best:
1. Allow massive parallelism (to load 60+ cores x 4 threads)
2. Run intensive computations (to efficiently use 512-bit vectors)
3. Provide memory efficiency (to meet current 6-16 GB constraints)
Tachyon's profile: no slack, available parallel work (frame height) ~ # of threads; no vectorizable loops, only scalar computations.

59 Target execution model: symmetric MPI. [Diagram: NATIVE model (run everything on Xeon Phi), OFFLOAD model (Xeon offloads via directives), SYMMETRIC model (MPI ranks on both Xeon and Xeon Phi)]. Most flexible; least number of code changes.

60 Build for Xeon Phi. No code changes, only the makefile:
- -mmic: target platform is Xeon Phi
- -fp-model fast=2: trade-off between accuracy and performance, OK for ray tracing
Very easy! Running code in a minute.

61 Why -fp-model fast=2? With the default flag, a reciprocal (1/x) computation unexpectedly became a hotspot on Phi (not on Xeon): the compiler generated heavy-weight code for higher precision. -fp-model fast=2 is a trade-off favoring performance (precision is still fine for ray tracing). Reciprocal calculation time reduced by >2x.

62 Run

    export I_MPI_MIC=enable
    mpiexec.hydra \
      -n 2 -host mynode1 <command-line> : \
      -n 2 -host mynode2 <command-line> : \
      -n 2 -host mynodeN <command-line> : \
      -n 2 -host mynode1-mic0 <command-line> : \
      -n 2 -host mynode1-mic1 <command-line> : \
      -n 2 -host mynode2-mic0 <command-line> : \
      -n 2 -host mynodeN-mic1 <command-line>

Same syntax. A Phi card is just like another node.

63 First results
- 4 nodes x 2 SNB: ... FPS
- 4 nodes x 1 KNC: 38 FPS ???
- 4 nodes x (2 SNB + 1 KNC): 39 FPS !!!
SNB = Sandy Bridge, 2nd generation Intel Core processors; KNC = Knights Corner, Intel Xeon Phi co-processors.
The heterogeneous run slows down. Need to understand what happens.

64 Using Intel Trace Analyzer and Collector
- Multiple synchronizations: all processes wait for the master
- MPI overhead is significant compared to useful work

65 Using VTune Amplifier XE OpenMP overhead within each frame due to work imbalance (result collected on 61 threads; 244 threads will worsen the imbalance)

66 Using VTune Amplifier XE. Poor vectorization of the hotspots. Recall: good vectorization is a prerequisite for efficient Xeon Phi use.

67 Conclusions
- No vectorization: 512-bit registers (able to hold 16 floats) are wasted
- Insufficient parallelism: 240 hyper-threads are wasted
- Ranks on Xeon Phi run slower than on Xeon
- Due to static MPI scheduling within each frame and frame-by-frame computation, the Xeons cannot start a new frame until the Xeon Phis complete their lines. Total performance suffers.

68 Improvement directions
- Dynamic balancing across MPI ranks
- SIMD: exploit vectors
- Efficient intra-process OpenMP parallelism
This works for both Xeon Phi and Xeon.

69 #1 - Dynamic MPI scheduling
- Each worker computes an entire frame: it asks the master for a frame #, computes and sends back the entire frame
- The master maintains a circular buffer, dispatches frame #s and displays frames. No computation by the master.
- Circular buffer to avoid memory growth
- Significantly reduces the # of communications
- Reduced synchronizations: a worker no longer waits for the others
- Compensates the Xeon vs. Xeon Phi difference
- Increases scalability
- Improves both Xeon Phi and Xeon-only! (see the sketch below)
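A hedged sketch of the producer-consumer scheme described above (not the actual Tachyon code; render_frame, the tags and the frame count are illustrative placeholders): the master hands out frame numbers on request and workers pull work until they receive a stop marker.

    #include <mpi.h>

    #define NFRAMES  100
    #define TAG_REQ  1
    #define TAG_WORK 2

    /* Hypothetical stand-in for rendering one frame. */
    static void render_frame(int frame) { (void)frame; }

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                       /* thin master: dispatch only */
            int next = 0, active = size - 1;
            while (active > 0) {
                int dummy, frame;
                MPI_Status st;
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                         MPI_COMM_WORLD, &st);
                frame = (next < NFRAMES) ? next++ : -1;   /* -1 = no more work */
                if (frame < 0) active--;
                MPI_Send(&frame, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            }
        } else {                               /* worker: pull frames until done */
            for (;;) {
                int dummy = 0, frame;
                MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
                MPI_Recv(&frame, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (frame < 0) break;
                render_frame(frame);           /* ...and would send pixels back */
            }
        }

        MPI_Finalize();
        return 0;
    }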

70 Code change
- Producer-consumer-like algorithm
- New algorithm: ~250 lines in the main loop
- Not Xeon Phi specific: it could have been implemented to address limited Xeon scalability; Xeon Phi just triggered it. This is important: you optimize for Xeon, benefit everywhere!
- Non-trivial, but not rocket science. Double ROI.

71 Re-running Intel Trace Analyzer and Collector
- MPI processes are doing useful work, not waiting for each other
- The thin master is quickly dispatching the work and polling for completion status

72 Re-running Intel Trace Analyzer and Collector (cont'd). Each Xeon process (P1 and P2) processes 2x the data of each Xeon Phi process (P3-P10). Processes are no longer gated by each other.

73 #2. Improve OpenMP parallelism
- Create parallel slack by reducing the chunk size: from a line to a few pixels, but >= a cache line (to avoid false sharing)
- Keep dynamic scheduling (OMP_SCHEDULE=dynamic)
- Enables massive parallelism (# of chunks >> HW threads)
- Compensates different line complexities
- Also helps on Xeon

74 Code change: 6 new lines, an OpenMP for-loop by pixel # instead of by line #. A straightforward change, the same parallel model (OpenMP). Again, double ROI. (A sketch follows below.)
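A hedged sketch of what such a change could look like (render_pixel, the image size and the chunk constant are illustrative assumptions, not the real Tachyon code): iterate over pixels instead of lines and let dynamic scheduling hand out cache-line-sized chunks.

    #define WIDTH  1920
    #define HEIGHT 1080
    #define CHUNK  16   /* >= a cache line worth of pixels, avoids false sharing */

    /* Hypothetical stand-in for the real per-pixel ray tracing kernel. */
    static float render_pixel(int x, int y) { return (float)(x + y); }

    void render_frame_omp(float *image)
    {
        /* Loop over pixels, not lines: # of chunks >> # of HW threads. */
        #pragma omp parallel for schedule(dynamic, CHUNK)
        for (int p = 0; p < WIDTH * HEIGHT; p++) {
            int x = p % WIDTH;
            int y = p / WIDTH;
            image[p] = render_pixel(x, y);
        }
    }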

75 Re-running with Amplifier XE. OpenMP overhead significantly reduced. The timeline is clean, reflecting good work balance.

76 #3. Exploiting SIMD (Single Instruction Multiple Data) How to utilize vectorization when: there are no loops in a hotspot function (tri_intersect)? the hotspot function is called on a linked list (grid_intersect)?

77 Code change: new data structures
- Composite triangles: SSE holds 4 triangles, AVX 8, Xeon Phi 16
- Structure Of Arrays: a register holding 4/8/16 float coordinates (x, y or z) per vertex (V1, V2, V3)
- A bit mask describes real / void triangle lanes
- A small library of vector operations (+, -, dot-, cross-product, ...) using intrinsics; reused from Embree for SSE/AVX, extended for Phi
- A single C++ template intersection (et al.) function, no code duplication
- Again, double ROI: improves both Xeon Phi and Xeon!
[Diagram: packed X1..X16 / Y1..Y16 coordinate registers]
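A hedged sketch of the structure-of-arrays idea (names and the 16-wide constant are illustrative, not the project's actual code): coordinates of up to 16 triangles are stored component-wise so one loop computes, for example, a dot product for all lanes at once and can be auto-vectorized or mapped to intrinsics.

    #define SIMD_WIDTH 16     /* 4 for SSE, 8 for AVX, 16 for Xeon Phi */

    /* Structure Of Arrays: one register-friendly lane per triangle. */
    typedef struct {
        float x[SIMD_WIDTH], y[SIMD_WIDTH], z[SIMD_WIDTH];
    } vec_soa;

    typedef struct {
        vec_soa v1, v2, v3;      /* triangle vertices */
        unsigned short mask;     /* bit i set => lane i holds a real triangle */
    } composite_tri;

    /* Dot product of one ray direction against all lanes at once.
     * The compiler can vectorize this loop; intrinsics would do the same by hand. */
    static inline void dot_all(const vec_soa *a,
                               float bx, float by, float bz,
                               float out[SIMD_WIDTH])
    {
        for (int i = 0; i < SIMD_WIDTH; i++)
            out[i] = a->x[i] * bx + a->y[i] * by + a->z[i] * bz;
    }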

78 Code change (cont'd)

79 SIMD benefits
- One intersection with multiple triangles at once
- The approach can be used for multi-ray intersections, as used by Embree and Autodesk's ray tracer
- A small extra overhead during scene load (each grid cell rebuilds its list of simple triangles into composites), but a benefit in heavy computations
- Intrinsics can be replaced with direct loops and the compiler's auto-vectorization to improve portability
- Again, double ROI: improves both Xeon Phi and Xeon!

80 Re-running with Amplifier XE High Vector Unit usage (13.7) (although some memory latency issues remain)

81 Updated results
- 4 nodes x 2 SNB: ... FPS
- 4 nodes x 1 KNC: ... FPS
- 4 nodes x (2 SNB + 1 KNC): ... FPS, a 9.7x speed-up!
Speed-up on both Xeon and Xeon Phi (1.5x and 4.6x); Xeon and Xeon Phi add to each other.

82 Parallel programming for Intel architecture (Intel Xeon and Intel Xeon Phi)
- Intel Xeon E5: 8 cores, 16 threads, SIMD-256
- Intel Xeon Phi: 60 cores, 240 threads, SIMD-512
- Parallelism at all levels (application, MPI processes, threads, vectors/SIMD), with Intel software tools. Maximize your ROI!

83 Next steps
- Experiment with prefetching
- Replace intrinsics with plain C and rely on vectorization by the compiler
- Experiment with replacing linked lists with arrays
- Fine-tune with affinity settings (e.g. KMP_AFFINITY=balanced)

84 Summary
- The application must meet certain criteria to benefit from Xeon Phi
- You might need to apply reasonable effort to achieve that
- Good news: you can optimize for Xeon and benefit on Xeon Phi, and vice versa
- You use the same tools and programming models, the same code
