More Data Locality for Static Control Programs on NUMA Architectures


More Data Locality for Static Control Programs on NUMA Architectures
Adilla Susungi (1), Albert Cohen (2), Claude Tadonki (1)
(1) MINES ParisTech, PSL Research University
(2) Inria and DI, École Normale Supérieure
7th International Workshop on Polyhedral Compilation Techniques (IMPACT'17), Stockholm, Sweden, January 23, 2017


Motivations

Data locality
- Interest in any kind of technique that can produce data locality
- Combining several types: loop transformations, layout transformations, data placement on NUMA architectures

Automatic polyhedral parallelizers
- Current tools do not consider the integration of control, data flow, memory mapping and placement optimizations

Data Locality for SCoPs on NUMA 1 / 25


Example: Pluto [1], for multicore CPUs
- Optimizations using loop transformations only
- Speedups: 16 cores: 3x; 36 cores: 2.6x
- How can we provide more data locality through additional transpositions and NUMA placements?

[1] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. PLDI 2008.


NUMA architectures
Traffic contention and remote access issues. They can be dealt with:
- At the programming level, using an API (libnuma, hwloc)
- Using extended programming languages [2]
- At execution time, using environment variables (GOMP_CPU_AFFINITY, KMP_AFFINITY) or runtime solutions (e.g., MPC [3])

What is the most convenient way to explore NUMA placement decisions at compile-time?

[2] A. Muddukrishna, P. A. Jonsson, and M. Brorsson. Locality-Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors. Scientific Programming, 2015.
[3] M. Pérache, H. Jourdren, and R. Namyst. MPC: A Unified Parallel Runtime for Clusters of NUMA Machines. Euro-Par 2008.


Roadmap

Goals
1. Transpositions and NUMA placements in Pluto outputs for more locality
2. A convenient way to explore optimization decisions at compile-time

Our solution
- Propose a parallel intermediate language (PIL): Ivie
  - Manipulate meta-programs for space exploration
  - Makes prototyping easier than a unified polyhedral approach
  - Future use beyond SCoPs
- Prototype an extension of the Pluto tool flow involving the PIL
- Case studies on PolyBench programs: Gemver, Gesummv, Covariance, Gemm


About our PIL: Ivie

Main idea: manipulate arrays in parallel programs
- Transpositions: data transposition, index permutation
- NUMA placements: interleaved allocation, replications

Design
- Declarative/functional
- Decoupled manipulation of array characteristics
- Physical and virtual memory abstraction
- Meta-language embedded in Python; possible interfacing with islpy for affine transformations

What Ivie is not: a new programming language or domain-specific language


Implementation

    C code -> Pluto -> OpenMP C code
           -> OpenMP C-to-Ivie code generation -> Ivie code
           -> meta-programming -> modified Ivie code
           -> Ivie-to-OpenMP C code generation -> final code

No more control flow optimization after Pluto!

Loop abstraction
Declaration of iterator types: parallel (piter) or sequential (siter)

    with i as piter:
        with j as siter:
            A[i][j] = k1(A[i][j], u1[i], v1[j], u2[i], v2[j])

- Output variable on the left-hand side; arguments are array elements
- Implicit loop bounds
- Anonymous functions performing element-wise operations; accumulations made explicit
- Arrays follow either a physical or a virtual memory abstraction

Array declarations: using the default declaration construct

Declaration of array A:

    A = array(2, double, [N,N])

Parameters: number of dimensions, type, dimension sizes.
Used when generating Ivie code from the input source.

Input C code:

    int A[N][N]; int B[N][N]; int C[N][N];
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];

Output:

    A = array(2, int, [N,N])
    B = array(2, int, [N,N])
    C = array(2, int, [N,N])
    with i as siter:
        with j as siter:
            C[i][j] = f(A[i][j], B[i][j])

Array declarations: via data replication

Replication of array A:

    A = array(2, double, [N,N])
    Ar = replicate(A)

- Replication of read-only arrays
- Ar inherits all characteristics of A: shape, content
- Used when meta-programming

Replicating A and B:

    A = array(2, int, [N,N])
    B = array(2, int, [N,N])
    C = array(2, int, [N,N])
    Ar = replicate(A)
    Br = replicate(B)

Resulting C code:

    int A[N][N], B[N][N], C[N][N];
    int Ar[N][N];
    int Br[N][N];
    for (i = 0; i < N; i++) {
        memcpy(Ar[i], A[i], N * sizeof(int));
        memcpy(Br[i], B[i], N * sizeof(int));
    }

Array declarations: via explicit transposition

Transposition of array A:

    A = array(2, double, [N,N])
    Atp = transpose(A, 1, 2)

Parameters: array of origin, dimension ranks to be permuted.
- Atp inherits the content of A, with a transposed shape
- Atp is physical
- Used when meta-programming

Transposing A:

    A = array(2, int, [N,N])
    Atp = transpose(A, 1, 2)
    with i as siter:
        with j as siter:
            ... = f(Atp[i][j], ...)

Resulting C code:

    int A[N][N];
    int Atp[N][N];
    /* Initialization of A */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            Atp[i][j] = A[j][i];
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            ... = Atp[i][j];

Array declarations: via virtual index permutation

Transposition of array A:

    A = array(2, double, [N,N])
    Atv = vtranspose(A, 1, 2)

Parameters: array of origin, dimension ranks to be permuted.
- Atv inherits the content of A, with a transposed shape
- Atv is virtual: no physical copy is materialized
- Used when meta-programming

Transposing A:

    A = array(2, int, [N,N])
    Atv = vtranspose(A, 1, 2)
    with i as siter:
        with j as siter:
            ... = f(Atv[i][j], ...)

Resulting C code (indices simply permuted at the access):

    int A[N][N];
    /* Initialization of A */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            ... = A[j][i];

Array declarations: for concise abstraction of several arrays

Abstracting arrays A and Ar:

    A = array(2, double, [N,N])
    Ar = replicate(A)
    As = select([({it} <= val), A], [({it} > val), Ar])

- Parameters: pairs of condition and array
- As is virtual
- Allows explicit control in partitioning, e.g. for NUMA management

    with i as piter:
        with j as siter:
            ... = f(As[i][j])

Resulting C code:

    int A[N][N];
    int Ar[N][N];
    /* Initialization of A and Ar */
    #pragma omp parallel for schedule(...)
    for (i = 0; i < N; i++) {
        if (i <= val)
            for (j = 0; j < N; j++)
                ... = A[i][j];
        if (i > val)
            for (j = 0; j < N; j++)
                ... = Ar[i][j];
    }

Data placement on NUMA
Constructs based on API functions available in libnuma:

    A.map_interleaved(1)      # interleaved allocation:
                              #   A = numa_alloc_interleaved(size)
    A.map_onnode(node_id)     # allocation on a node:
                              #   A = numa_alloc_onnode(size, node_id)


Experimental setup
- Machine: Intel Xeon E5-2697 v4 (Broadwell), 4 nodes, 36 cores
- Compilation: gcc -O3 -march=native (enables vectorization)
- Thread binding: OMP_PROC_BIND
- Default Pluto options: tiling for L1 cache, parallelization, vectorization

Possible loop fusion heuristics:
- No loop fusion (no fuse)
- Maximum fusion (max fuse)
- In-between fusion (smart fuse)

Different program versions:
- Default Pluto output (default)
- Pluto output + NUMA only (NUMA)
- Pluto output + transposition only (Layout)
- Pluto output + NUMA + transposition (NUMA-Layout)

Gemver

Code snippet:

    for (i = 0; i < _PB_N; i++)
        for (j = 0; j < _PB_N; j++)
            A[i][j] = A[i][j] + u1[i] * v1[j] + u2[i] * v2[j];
    for (i = 0; i < _PB_N; i++)
        for (j = 0; j < _PB_N; j++)
            x[i] = x[i] + beta * A[j][i] * y[j];
    /* ... */
    for (i = 0; i < _PB_N; i++)
        for (j = 0; j < _PB_N; j++)
            w[i] = w[i] + alpha * A[i][j] * x[j];

Interesting properties:
- Permutation profitable with loop fusions
- Several choices: need to find the best permutation
- May lose some parallelism depending on the chosen loop fusion
- Bandwidth-bound

Gemver
Meta-program example: smart fuse vs. no fuse

    A = array(2, DATA_TYPE, [n, n])
    u1 = array(1, DATA_TYPE, [n])
    v1 = array(1, DATA_TYPE, [n])
    A_v = vtranspose(A, 1, 2)
    u1_1 = replicate(u1)
    u1_2 = replicate(u1)
    u1_3 = replicate(u1)
    A.map_interleaved(1)
    u1.map_onnode(0)
    u1_1.map_onnode(1)
    u1_2.map_onnode(2)
    u1_3.map_onnode(3)
    u1_s = select([0 <= {it} <= 8, u1],
                  [9 <= {it} <= 17, u1_1],
                  ...)
    with i as siter:
        with j as siter:
            A_v[i][j] = init()

No fuse:

    with t2 as cpiter:
        with t3 as siter:
            with t4 as siter:
                with t5 as siter:
                    A[t4][t5] = f3(A[t4][t5], u1_s[t4], v1[t5], u2_s[t4], v2[t5])
    with t2 as piter:
        with t3 as siter:
            with t4 as siter:
                with t5 as siter:
                    x[t5] = f7(x[t5], A[t4][t5], y[t4])

Smart fuse:

    with t2 as cpiter:
        with t3 as siter:
            with t4 as siter:
                with t5 as siter:
                    A[t4][t5] = f3(A[t4][t5], u1_s[t4], v1[t5], u2_s[t4], v2[t5])
                    x[t5] = f7(x[t5], A[t4][t5], y[t4])

Gemver: different Pluto versions with no loop fusion
- More speed-up with NUMA
- Much less speed-up with transposition
- No added value with thread binding

Gemver: different Pluto versions with smart loop fusion
- More speed-up with NUMA
- More speed-up with transposition
- No added value with thread binding

Gesummv (results figure)


Gemm: different naive versions

Interesting property: column-major access to B.
Modifications:
- Transposed initialization of B
- NUMA placement: interleaved allocation only

    # Default declarations
    C = array(2, DATA_TYPE, [ni, nj])
    A = array(2, DATA_TYPE, [ni, nk])
    B = array(2, DATA_TYPE, [nk, nj])
    # Meta-programmed declaration
    B_v = vtranspose(B, 1, 2)
    # Initializations
    with i as siter:
        with j as siter:
            B_v[i][j] = init()
    # ... other initializations
    with t2 as piter:
        with t3 as siter:
            with t4 as siter:
                with t5 as siter:
                    with t7 as siter:
                        with t6 as siter:
                            C[t5][t6] = f9(C[t5][t6], A[t5][t7], B[t6][t7])

Gemm: different naive versions
- Some speed-up with transposition, but loop interchange is better
- No speed-up with NUMA


Gemm: different Pluto versions
- Pluto's solution: loop interchange
- No speed-up with NUMA
- Transposition makes things worse

Covariance (results figure)


Balance sheet

Advantages
- NUMA placements help bandwidth-bound programs
- More speed-up, and new opportunities, with transpositions
- Wider space exploration for combining different types of optimizations

Disadvantages
- Multiple conditional branches
- Copy overheads

"No more control flow optimization after Pluto!" OK, we definitely still need some.


Some future work

Deeper investigation
- For the case studies
- More experiments (other SCoPs, non-SCoPs)

PIL design and implementation
- Revisit or extend some constructs
- Interfacing with islpy
- Memory and control flow optimizations: more integrated composition
- Polyhedral analysis to help determine the interleaving granularity and generate different schedules for transpositions

A parallel intermediate language in Pluto's framework?
- Pure post-processing is difficult: Pluto outputs may be (very) complex
- An ad hoc implementation is probably the best solution for Pluto
- But an intermediate language is necessary for space exploration of optimizations