Programming Models for Multi-Threading
Brian Marshall, Advanced Research Computing
Why Do Parallel Computing? Single-CPU computing is limited in performance, available memory, and I/O rates. Parallel computing allows one to solve problems that don't fit on a single CPU and problems that can't be solved in a reasonable time, so we can solve larger problems, solve them faster, and run more cases.
A Change in Moore's Law!
Parallelism is the New Moore's Law. Power and energy efficiency impose a key constraint on the design of micro-architectures. Clock speeds have plateaued, and hardware parallelism is increasing rapidly to make up the difference.
Cluster System Architecture (diagram): login nodes and a home server (HOME) reached from the internet; compute and I/O nodes connected through an InfiniBand switch hierarchy (TopSpin 120/270) and a GigE switch hierarchy; 16 I/O nodes serving the WORK file system; interconnects shown: GigE, InfiniBand, Fibre Channel.
Blade : Rack : System. 1 node: 2 x 8 cores = 16 cores; 1 chassis: 10 nodes = 120 cores; 1 rack (frame): 4 chassis = 480 cores; system: 10 racks = 4,800 cores.
HPC Trends
Architecture:  Single core | Multicore | GPU  | Cluster
Code:          Serial      | OpenMP    | CUDA | MPI
Multi-core Systems. Current processors place multiple processor cores on a die, and the communication details are increasingly complex: cache access, main memory access, QuickPath / HyperTransport socket connections, and node-to-node connections via the network.
Accelerator-based Systems. Calculations are made on both the CPUs and the graphics processing units (GPUs), which are no longer limited to single-precision calculations. Load balancing is critical for performance, and specific libraries and compilers are required (CUDA, OpenCL). Intel's co-processor alternative is the MIC (Many Integrated Core).
Motivation. Where is the unrealized performance and how do we extract it? How broad is the performance impact? Hierarchical parallelism, the increased importance of fine-grained and data parallelism, and more cores available per processor.
Where is the Parallelism? Level 1: single instruction, multiple data (SIMD) vector registers within individual CPU cores. Level 2: an increasing number of cores per CPU. Level 3: accelerator-equipped systems, such as general-purpose graphics processors (GPGPU) and the Intel Xeon Phi / Many Integrated Core (MIC). Level 4: supercomputing resources with a large number of compute nodes, multiple levels of parallelism, and increasing heterogeneity in hardware components.
Motivations for Multithreading and Vectorization. Expose parallelism that is inaccessible using MPI alone: fine-grained parallelism and task parallelism. Automatic vectorization (single instruction, multiple data): vector processors are more prevalent and getting wider, and compilers will vectorize automatically if possible. Accelerators such as the GPU and Intel Xeon Phi. Multi-threaded code is important for using multi-core processors efficiently, and multi-core CPUs are present in laptops, desktops, and supercomputers.
Multi-threaded Programs. OpenMP: most widely used for CPU-based parallelization and for targeting the Intel Xeon Phi. OpenACC: primarily used in the development of GPU-based codes. pthreads and C++11 multithreading features: part of the C++ standard, but not yet fully supported. CUDA, OpenCL. Intel Threading Building Blocks (TBB), Cilk++.
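Of the options above, C++11 threads need no runtime beyond a conforming compiler. A minimal sketch, assuming a C++11 toolchain; the partial-sum decomposition and every name here are illustrative, not taken from the slides:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t N = 1 << 20;
    std::vector<double> data(N, 1.0);

    // One worker per hardware thread (fall back to 1 if unknown).
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());

    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread sums its own contiguous chunk of the array.
            const std::size_t begin = t * N / nthreads;
            const std::size_t end   = (t + 1) * N / nthreads;
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto &w : workers) w.join();

    std::cout << "sum = "
              << std::accumulate(partial.begin(), partial.end(), 0.0) << "\n";
    return 0;
}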
What is OpenMP? An API for parallel programming on shared memory systems using parallel threads. It is implemented through the use of compiler directives, a runtime library, and environment variables. Supported in C, C++, and Fortran. Maintained by the OpenMP Architecture Review Board (http://www.openmp.org/).
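As a rough illustration (not from the slides), the three ingredients look like this in a toy program; the file name and compile lines are assumptions about a typical Intel or GNU toolchain:

// hello_omp.cpp (illustrative only)
// compile: icc -openmp hello_omp.cpp    or    g++ -fopenmp hello_omp.cpp
// run:     OMP_NUM_THREADS=4 ./a.out    <- environment variable
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel                  // compiler directive: create a team of threads
    {
        int tid = omp_get_thread_num();   // runtime library call
        int nth = omp_get_num_threads();  // runtime library call
        std::printf("Hello from thread %d of %d\n", tid, nth);
    }
    return 0;
}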
Shared Memory. Examples: your laptop (multicore), a multiple-memory NUMA system such as HokieOne (SGI UV), or one node on BlueRidge.
OpenMP Constructs (OpenMP language extensions)
Parallel control structures (govern the flow of control in the program): parallel directive
Work sharing (distributes work among threads): do / parallel do and sections directives
Data environment (specifies variables as shared or private): shared and private clauses
Synchronization (coordinates thread execution): critical and atomic directives, barrier directive
Runtime environment (runtime functions and environment variables): omp_set_num_threads(), omp_get_thread_num(), OMP_NUM_THREADS, OMP_SCHEDULE
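A hedged sketch (not from the slides) that exercises several rows of this table at once: a work-sharing loop, shared data, a critical section, and a runtime call:

#include <cstdio>
#include <omp.h>

int main() {
    const int N = 1000;
    double a[1000];
    double sum = 0.0;                        // shared among the threads

    omp_set_num_threads(4);                  // runtime function

    #pragma omp parallel for shared(a, sum)  // parallel control + work sharing + data clauses
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;                      // iterations are divided among the threads
        #pragma omp critical                 // synchronization: one thread at a time
        sum += a[i];
    }
    // (a reduction clause would be the idiomatic way to build 'sum';
    //  critical is used here only to show the synchronization directive)

    std::printf("sum = %f\n", sum);
    return 0;
}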
Factors Affecting Multi-thread Performance. Avoid the overhead of initializing new threads wherever possible. Bind threads to physical hardware cores. Cache coherence issues can cause serious performance degradation when memory is written by different cores, so data for a calculation performed by a particular core should be local to that core. Avoid synchronization; try to enforce thread safety without serializing code.
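One common way to act on the first two points (a sketch under assumptions, not the slides' code): create the thread team once, outside a time-step loop, and request binding from the runtime, e.g. OMP_PROC_BIND=true (OpenMP 4.0) or the vendor-specific KMP_AFFINITY / GOMP_CPU_AFFINITY variables.

#include <omp.h>

// Hypothetical time-stepping driver: the parallel region (and its thread
// start-up cost) is paid once, while the work-sharing loop runs every step.
void run(double *u, int n, int nsteps) {
    #pragma omp parallel             // team created once
    {
        for (int step = 0; step < nsteps; step++) {
            #pragma omp for          // split the i-loop over the existing team
            for (int i = 0; i < n; i++)
                u[i] = 0.5 * (u[i] + 1.0);   // placeholder update
            // implicit barrier at the end of 'omp for' keeps the steps in order
        }
    }
}

int main() {
    const int n = 1 << 20;
    double *u = new double[n]();     // zero-initialized
    run(u, n, 100);
    delete[] u;
    return 0;
}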
Single Instruction, Multiple Data (SIMD). Each clock cycle, a processor loads instructions and the data on which those instructions operate. SIMD processors can apply a single instruction to multiple pieces of data in a single clock cycle. Modern processors increasingly enable or rely on SIMD to achieve high performance: Intel Sandy Bridge / Ivy Bridge / Haswell, AMD Opteron, IBM BlueGene/Q, and accelerators such as GPUs and the Intel Xeon Phi.
Auto-Vectorization Summary. Performance gains from auto-vectorization are not guaranteed: certain algorithms vectorize while others do not, problem details also impact performance, and the compiler and hardware combination affects the efficiency of vectorization. However, SIMD is becoming more prevalent, the speedup can be significant, and SIMD data structure optimizations provide benefits on both CPUs and accelerators (GPU, Intel Xeon Phi).
Software Challenges for Multithreading. Programming models for multi-threading are actively evolving, and compiler support and performance for different implementations can vary widely. There are tradeoffs between portability (C++11, OpenMP) and performance (architecture-specific programming models: Intel Threading Building Blocks, Cilk++, CUDA, OpenCL, etc.).
Compiler Auto-Vectorization. Many compilers can automatically generate vector instructions: Intel 13.0, gcc 4.7, llvm 3.4, pgi 14.0, IBM XL. How you write your code has a huge impact on whether or not the compiler will generate vector instructions (and on how optimal they will be), and the performance of the various compilers will vary.
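Typical ways to ask some of these compilers for a vectorization report (assumed, version-dependent flags; check your compiler's documentation):

icc -O3 -vec-report2 VecAdd.cpp                          (Intel)
g++ -O3 -ftree-vectorize -fopt-info-vec VecAdd.cpp       (GCC; older versions use -ftree-vectorizer-verbose=n)
clang++ -O3 -Rpass=loop-vectorize VecAdd.cpp             (LLVM)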
Programming Practices that Inhibit Auto-Vectorization. Loops without a single point of entry and exit (branching prevents vectorization). Data dependencies: read after write, write after read. Aliasing, which may cause the compiler to assume data dependencies exist for safety. Non-contiguous memory accesses. Function calls within loops.
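Two of these inhibitors sketched in code (assumed examples, not from the slides):

// 1) Read-after-write dependence: iteration i needs the result of
//    iteration i-1, so the loop cannot be vectorized as written.
void prefix_sum(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];
}

// 2) Aliasing: without more information the compiler must assume that
//    'out' may overlap 'in' and be conservative. The restrict qualifier
//    (__restrict as a common C++ extension) promises no overlap.
void scale(double *__restrict out, const double *__restrict in, int n) {
    for (int i = 0; i < n; i++)
        out[i] = 2.0 * in[i];
}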
Data Structures and Auto-Vectorization. A structure of arrays is preferred over an array of structures, and memory alignment has a big impact on how efficiently vectorization is performed. Example task: add two vectors together to obtain a third vector: C[i] = A[i] + B[i].
Single Instruction, Multiple Data (SIMD) (figure)
Data Structures and Auto-Vectorization (array of structures):

struct ArrayOfStruct {
    double A, B, C;
    void add() { C = A + B; }
};

/* some code */
ArrayOfStruct *AOS;
AOS = new ArrayOfStruct[SIZE];
for (int i = 0; i < SIZE; i++)
    AOS[i].add();
Data Structures and Auto-Vectorization (structure of arrays):

struct StructOfArrays {
    /* ... constructor allocates A, B, C and sets size ... */
    double *A, *B, *C;
    int size;
    void add() {
        for (int i = 0; i < size; i++)
            C[i] = A[i] + B[i];
    }
};

/* some code */
StructOfArrays SOA(SIZE);
SOA.add();    // same calculation, different data layout
Data Structures and Auto-Vectorization. Compilers can often be prompted to print out information about whether vectorization is performed:

icc -vec-report2 -restrict VecAdd.cpp

For the array of structures loop:

for (int i = 0; i < SIZE; i++)
    AOS[i].add();

the compiler prints the following:

remark: loop was not vectorized: vectorization possible but seems inefficient.
Data Structures and Auto-Vectorization. For the structure of arrays loop:

for (int i = 0; i < SIZE; i++)
    C[i] = A[i] + B[i];

the compiler prints the following:

remark: LOOP WAS VECTORIZED

(A structure of arrays is preferred for SIMD computations, including on accelerators like GPUs.)
Data Structures and Auto-Vectorization

// Memory alignment and auto-vectorization
// Little things can make a big difference
double *A = new double[SIZE];
double *B = new double[SIZE];
double *C = new double[SIZE];

// Explicitly aligning memory is advantageous!
__declspec(align(16)) double A[SIZE];
__declspec(align(16)) double B[SIZE];
__declspec(align(16)) double C[SIZE];
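For the heap-allocated (new double[SIZE]) version, alignment has to be requested from the allocator instead. A minimal sketch assuming an Intel or GNU toolchain with _mm_malloc available; the 32-byte alignment is an assumption that matches the 256-bit AVX registers on the next slide (16 bytes would match SSE):

#include <xmmintrin.h>   // _mm_malloc / _mm_free

int main() {
    const int SIZE = 1024;
    double *A = static_cast<double*>(_mm_malloc(SIZE * sizeof(double), 32));
    double *B = static_cast<double*>(_mm_malloc(SIZE * sizeof(double), 32));
    double *C = static_cast<double*>(_mm_malloc(SIZE * sizeof(double), 32));

    for (int i = 0; i < SIZE; i++) { A[i] = i; B[i] = 2.0 * i; }
    for (int i = 0; i < SIZE; i++) C[i] = A[i] + B[i];   // the vector-add loop from the slides

    _mm_free(A);
    _mm_free(B);
    _mm_free(C);
    return 0;
}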
Data Structures and Auto-Vectorization. Compare the performance on an Intel Sandy Bridge CPU with the Intel 13.0 compiler and 256-bit SIMD registers (4 doubles per instruction). The aligned structure of arrays is a clear winner: array of structures = 2.1 seconds; structure of arrays = 0.99 seconds (~2x speedup); aligned structure of arrays = 0.6 seconds (~3.5x speedup).
Questions???