Task-based programming models to support hierarchical algorithms
1 Task-based programming models to support hierarchical algorithms. Rosa M. Badia, Barcelona Supercomputing Center. SHAXC 2016, KAUST, 11 May 2016
2 Outline
- BSC
- Overview of superscalar programming model
- OmpSs overview
- Use of OmpSs in numerical examples
- PyCOMPSs overview
- Use of PyCOMPSs in numerical examples
- Conclusions
3 Barcelona Supercomputing Center - Centro Nacional de Supercomputación BSC-CNS objectives: R&D in Computer, Life, Earth and Engineering Sciences. Supercomputing services and support to Spanish and European researchers. BSC-CNS is a consortium that includes: Spanish Government 60% Catalonian Government 30% Universitat Politècnica de Catalunya (UPC) 10% 425 people, 40 countries 3
4 Mission of BSC R&D Departments COMPUTER SCIENCES To influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency EARTH SCIENCES To develop and implement global and regional stateof-the-art models for shortterm air quality forecast and long-term climate applications LIFE SCIENCES To understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics) CASE To develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations) 4
5 Reducing the gap between applications and architectures
New algorithms and data structures require programming models and runtimes that:
- Reduce programming complexity
- Keep portability
- Enable communication-avoiding, asynchronous algorithms, ...
[Diagram: applications sit on a high-level, clean, abstract programming-model interface; the runtime API mediates access to the computing platform and the storage platform.]
6 So, what is a superscalar programming model?
- High-level sequential programming
- Executes following the superscalar processor model: out of order
- Task is the unit of work
- Builds a task graph at runtime that expresses potential concurrency; a large number of in-flight tasks exposes distant parallelism
- Based on a runtime that makes decisions and executes the workflow, and offers an abstraction to plug applications into different resources (computing, storage)
[Diagram keywords: superscalar processor, dataflow, high-level sequential programming, workflows, utilities]
7 The StarSs programming model
StarSs instantiations: CellSs, SMPSs, GPUSs, ClusterSs, ClearSpeedSs, GridSs, OmpSs, PyCOMPSs/COMPSs
Different implementations, targeting different platforms:
- OmpSs: multicore, GPUs, clusters
- COMPSs: clusters, federated clouds, old grids
Open source: pm.bsc.es, compss.bsc.es
8 Main elements of superscalar programming model syntax
Superscalar program:
- Sequential code, single shared memory space
- Identification of tasks
Task, the main element of the programming model:
- Computation unit that operates on given parameters and local variables
- Amount of work (granularity) may vary in a wide range (from μsecs to minutes or hours) and may depend on input arguments
- Once started, executes to completion independently of other tasks
Syntax:
- Task annotations
- Task arguments directionality
- Synchronizations
9 Task annotations
Different languages, same idea: annotations are designed according to the standards of each programming language.
C/C++ (pragmas):
    #pragma omp task inout(A[k][k]) priority(10)
    spotrf (A[k][k]);
    for (i=k+1; i<nt; i++) {
        #pragma omp task in(A[k][k]) inout(...)
        strsm (A[k][k], ...
Python (decorators):
    @task(priority=True)
    def potrf(a):
        A = dpotrf(a, lower=True)[0].tolist()
Java (annotated interface):
    @Method(declaringClass = "cholesky.objects.Block")
    void ... (@Parameter(direction = Direction.INOUT) Block diag, ...)
10 Task arguments directionality
- Input, Output, Inout: indicate that the argument is read, written, or read and written by the task
- Used at execution time to determine the data dependences between tasks
- A directionality annotation is not a direct edge, but may generate one or more edges
- Gives information about locality and data to be transferred
    @task(o1=OUT)
    def meta(o1): ...

    @task(o2=IN)
    def metb(o2): ...

    def main():
        meta(myobject)
        metb(myobject)
        metb(otherobject)
Resulting graph: meta -> metb (through myobject); the metb call on otherobject is independent.
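How directionality hints turn into graph edges can be sketched in a few lines of plain Python. This is a toy model with assumed names, not the actual StarSs runtime: each task call records which objects it reads and writes, and read-after-write, write-after-read, and write-after-write pairs each generate an edge.

```python
# Toy dependence tracker (hypothetical, for illustration only): directionality
# of each argument decides which edges are added to the task graph.
class TaskGraph:
    def __init__(self):
        self.edges = []        # (producer_task, consumer_task) pairs
        self.last_writer = {}  # object id -> task that last wrote it
        self.readers = {}      # object id -> tasks that read it since that write

    def add_task(self, name, inputs=(), outputs=()):
        for obj in inputs:     # read-after-write: depend on the last writer
            w = self.last_writer.get(id(obj))
            if w is not None:
                self.edges.append((w, name))
            self.readers.setdefault(id(obj), []).append(name)
        for obj in outputs:    # write-after-read and write-after-write
            for r in self.readers.pop(id(obj), []):
                self.edges.append((r, name))
            w = self.last_writer.get(id(obj))
            if w is not None:
                self.edges.append((w, name))
            self.last_writer[id(obj)] = name

# Mirror the slide's example: meta writes myobject, metb reads it,
# and a second metb call reads an unrelated object.
g = TaskGraph()
myobject, otherobject = object(), object()
g.add_task("meta", outputs=[myobject])
g.add_task("metb", inputs=[myobject])      # edge meta -> metb
g.add_task("metb2", inputs=[otherobject])  # no producer, so no edge
```

Run on the slide's three calls, the tracker produces exactly one edge, meta -> metb, leaving the second metb call free to run concurrently.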
11 Impact of synchronization in task-based programming
- Ideally, tasks execute according to data dependences; however, synchronizations cannot always be avoided, e.g. when task results are needed
- Granularity should not be the only parameter to decide what is a task; the semantics of synchronization matter too, because synchronizations stop task graph generation
- Synchronization syntax is designed according to each language's standards; in Java, synchronizations are added by interception
    #pragma omp taskwait          (C)
    foo = compss_wait_on(foo)     (Python)
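The deferred-execution semantics that a synchronization cuts short can be sketched with a toy queue. The names here (`task`, `wait_on`) are hypothetical stand-ins, not the COMPSs API: task calls only enqueue work, and the wait drains everything submitted so far before handing back a value.

```python
# Toy deferred executor: calls build up work, wait_on is the sync point.
pending = []   # submitted-but-not-executed task invocations
results = {}   # values produced by executed tasks

def task(fn):
    def submit(*args):
        pending.append((fn, args))   # record the call, do not run it yet
    return submit

@task
def add(key, a, b):
    results[key] = a + b

def wait_on(key):
    # Synchronization: everything submitted so far must run before we
    # can return the value -- graph generation stops here.
    while pending:
        fn, args = pending.pop(0)
        fn(*args)
    return results[key]

add("x", 1, 2)                        # enqueued, not executed
add("y", 3, 4)                        # enqueued, not executed
total = wait_on("x") + wait_on("y")   # first wait drains both tasks; 3 + 7
```

Note how the first `wait_on` already forces both queued tasks to run: the synchronization serializes everything submitted before it, which is why the slide warns against unnecessary waits.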
12 Performance analysis. Runtimes are instrumented with the Extrae library to generate Paraver trace files.
13 PART I: OMPSS 13
14 OmpSs environment: Mercurium compiler
- Recognizes constructs and transforms them into calls to the runtime
- Manages code restructuring for different target devices through device-specific handlers; may generate code in a separate file
- Invokes different back-end compilers (e.g. nvcc for NVIDIA) for C/C++/Fortran
15 OmpSs environment: Nanos++ runtime
[Diagram: source code with #pragma omp task goes through the Mercurium compiler into the application binary; new tasks enter the Nanos++ runtime, which provides dependency support (task graph), coherence support (directory/cache) for data requests, and a scheduler feeding ready tasks to worker threads (local execution) and helper threads (device operations, e.g. executing tasks on CUDA/GPU or MIC threads).]
16 Example 1: Communication-avoiding QR* in OmpSs
    #pragma omp task inout( A[0;br*bc] ) output( T[0;br*bc] ) priority(3)
    void dgeqrf_dlarft (int br, int bc, int skip, double *A, double *T);

    #pragma omp task input( T[0;br*bc] ) inout( C[0;br*bc] )
    void dlarfb (int br, int bc, int skip, double *V, double *T, double *C);

    #pragma omp task inout( C[0;br*bc], D[0;br*bc] ) output( T[0;br*bc] ) priority(2)
    void dgeqrf_split (int br, int bc, int skip, double *C, double *D, double *T);

    #pragma omp task input( T[0;br*bc] ) inout( F[0;br*bc], G[0;br*bc] ) priority(1)
    void dlarfb_split (int br, int bc, int skip, double *D, double *T, double *F, double *G);

    for ( int k=0; k<nt ; k++ ) {
        dgeqrf_dlarft(br, bc, 0, A[k][k], T[k][k]);
        for ( int j=k+1; j<nt ; j++ ) {
            dlarfb(br, bc, 0, A[k][k], T[k][k], A[k][j]);
        }
        for ( int i=k+1; i<mt ; i++ ) {
            dgeqrf_split(br, bc, 0, A[k][k], A[i][k], T[i][k]);
            for ( int j=k+1; j<nt ; j++ ) {
                dlarfb_split(br, bc, 0, A[i][k], T[i][k], A[k][j], A[i][j]);
            }
        }
    }
* Demmel et al., Communication-optimal parallel and sequential QR and LU factorizations.
17 Example 2: Heterogeneous
    #pragma omp target device (hstreams) implements(dgemm_task) copy_deps
    #pragma omp task input([ts][ts] A, [ts][ts] B) inout([ts][ts] C) priority (10)
    void dgemm_phi (int ts, double *A, double *B, double *C) {
        double alpha = 1.0;
        const char trans = 'N';
        dgemm(&trans, &trans, &ts, &ts, &ts, &alpha, A, &ts, B, &ts, &alpha, C, &ts);
    }

    #pragma omp task input([ts][ts] A, [ts][ts] B) inout([ts][ts] C) priority (10)
    void dgemm_task (int ts, double *A, double *B, double *C) {
        cblas_dgemm (CblasRowMajor, CblasNoTrans, CblasNoTrans,
                     ts, ts, ts, 1, A, ts, B, ts, 1, C, ts);
    }

    void matmul (int N, int TS, double *A[N][N], double *B[N][N], double *C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    dgemm_task(TS, A[i][k], B[k][j], C[i][j]);
        #pragma omp taskwait
    }
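The `implements` clause above registers an alternative backend (here, an Xeon Phi version) for the same task, and the runtime picks one per device at execution time. The selection idea can be sketched as a toy dispatcher in Python; all names here are hypothetical, not the OmpSs runtime.

```python
# Toy version-selection sketch: several implementations are registered for
# one task, and a scheduler picks whichever matches an available device.
implementations = {}  # task name -> {device: function}

def register(task_name, device):
    def wrap(fn):
        implementations.setdefault(task_name, {})[device] = fn
        return fn
    return wrap

@register("dgemm_task", "cpu")
def dgemm_cpu(a, b, c):
    # c + a @ b for square blocks stored as lists of lists
    n = len(a)
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

@register("dgemm_task", "phi")
def dgemm_phi(a, b, c):
    # Same contract, different backend; identical math in this toy version.
    return dgemm_cpu(a, b, c)

def run(task_name, free_devices, *args):
    # Pick the first registered implementation whose device is free.
    impls = implementations[task_name]
    device = next(d for d in impls if d in free_devices)
    return impls[device](*args)

out = run("dgemm_task", {"phi"}, [[1.0]], [[2.0]], [[0.0]])  # phi version runs
```

The real runtime also weighs data transfer cost and per-device performance history when choosing among versions; this sketch only shows the registration-plus-dispatch shape.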
18 Example 3: Task parallelism in ILUPACK's preconditioned CG solver*. Exploitation of nested parallelism: tasks are split into finer-granularity tasks.
19 Example 3: Task parallelism in ILUPACK's preconditioned CG solver*. NUMA-aware execution thanks to an OmpSs-specific NUMA-aware scheduler. The code records on which socket each task executes during the initial calculation of the preconditioner; during all subsequent iterations, tasks that operate on data generated or accessed during the preconditioner calculation are mapped to the socket where they were originally executed.
20 Some results: 8 CPUs + 1 GPGPU + 1 MIC
[Charts: performance (Gflop/s) vs. matrix size (#elements on a side) for Matrix Multiply and Cholesky Factorization, comparing OmpSs configurations (OCUbf, OCUaf, OhSbf, OhSaf, OHTsm) against plain CUDA and hStreams.]
The best performance is achieved when all processing units (CPUs, GPGPU and Xeon Phi) cooperate.
21 Some results: NUMA-aware scheduler
[Chart: results with the distance-aware scheduler.]
Al-Omairy et al., Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing.
22 PART II: PYCOMPSS/COMPSS
23 Why Python?
"Python is powerful... and fast; plays well with others; runs everywhere; is friendly & easy to learn; is Open."*
- Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C
- Large community using it, including the scientific and numeric communities
- Object-oriented programming and structured programming are fully supported
- Large number of software modules available (38,000 as of January 2014)
* From python.org
24 Task annotations
- Use of decorators to annotate tasks and indicate argument directionality
- Other annotations: constraints
- Small API
    @task(priority=True)
    def potrf(a):
        A = dpotrf(a, lower=True)[0].tolist()

    foo = compss_wait_on(foo)
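The decorator mechanics can be mimicked in a few lines of plain Python. This is a stand-in sketch, not the real PyCOMPSs binding (which also serializes arguments and ships tasks to remote workers): the decorator records the directionality hints and, as a sequential fallback, just runs the function.

```python
# Stand-in @task decorator (assumed names): directionality hints are keyword
# arguments to the decorator; the wrapped function runs unchanged here.
registry = {}

def task(**directions):
    def wrap(fn):
        registry[fn.__name__] = directions   # e.g. {"c": "INOUT"}
        return fn                            # sequential fallback: call directly
    return wrap

@task(c="INOUT", priority=True)
def multiply(a, b, c):
    # Elementwise multiply-accumulate into c (stands in for a block kernel).
    for i in range(len(c)):
        c[i] += a[i] * b[i]

v = [0.0, 0.0]
multiply([1.0, 2.0], [3.0, 4.0], v)   # v is updated in place: [3.0, 8.0]
```

A real binding would use the recorded directions (here, `c` marked INOUT) to build dependence edges, exactly as on the directionality slide earlier.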
25 PyCOMPSs runtime behavior
[Diagram: Python user code with task annotations goes through the binding, which builds the task dependency graph (TDG); the runtime schedules the tasks and moves files/objects to grids, clusters and clouds.]
26 PyCOMPSs stack + NumPy and MKL
MKL is parallelized with OpenMP, so there are two levels of parallelism to exploit: task level and thread level.
[Diagram: on the host node, the application main runs in a Python interpreter with NumPy/SciPy on top of the COMPSs master; on each worker node, task code runs in a Python interpreter with NumPy/SciPy on top of the COMPSs worker, with MKL (parallelized with OpenMP) underneath.]
27 Example 1: Matrix multiply
    @task(c=INOUT)
    def multiply(a, b, c):
        c += a*b

    def initialize_variables():
        import numpy as np
        for matrix in [A, B, C]:
            for i in range(MSIZE):
                matrix.append([])
                for j in range(MSIZE):
                    if matrix == C:
                        block = np.array(np.zeros((BSIZE, BSIZE)), dtype=np.double, copy=False)
                    ...

    initialize_variables()
    eti = time.time()
    for i in range(MSIZE):
        for j in range(MSIZE):
            for k in range(MSIZE):
                multiply(A[i][k], B[k][j], C[i][j])
    C = compss_wait_on(C)
    print "Compute Time {} s".format(time.time()-eti)
[Figure: the task dependency graph built at runtime, ending in the synchronization introduced by compss_wait_on.]
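Stripped of the runtime, the computation above is just a blocked matrix multiply. A minimal sequential sketch in plain Python (toy sizes, no PyCOMPSs) mirrors the slide's triple loop; each `multiply(...)` call is what the runtime would turn into a task.

```python
# Sequential sketch of the slide's blocked multiply: c += a @ b per block.
def multiply(a, b, c):
    # a, b, c are square blocks stored as lists of lists
    n = len(a)
    for i in range(n):
        for j in range(n):
            c[i][j] += sum(a[i][k] * b[k][j] for k in range(n))

MSIZE, BSIZE = 2, 2  # 2x2 grid of 2x2 blocks -> a 4x4 matrix

def blocks(value):
    # Build an MSIZE x MSIZE grid of BSIZE x BSIZE blocks filled with value.
    return [[[[value] * BSIZE for _ in range(BSIZE)]
             for _ in range(MSIZE)] for _ in range(MSIZE)]

A, B, C = blocks(1.0), blocks(2.0), blocks(0.0)
for i in range(MSIZE):
    for j in range(MSIZE):
        for k in range(MSIZE):
            multiply(A[i][k], B[k][j], C[i][j])
# every element of the 4x4 product is 4 * (1.0 * 2.0) = 8.0
```

Because each inner call only touches blocks A[i][k], B[k][j] and C[i][j], the INOUT annotation on `c` is enough for the runtime to chain the k-iterations on each C block while running different (i, j) blocks concurrently.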
28 Example 2: Cholesky
    @task(priority=True)
    def potrf(a):
        from scipy.linalg.lapack import dpotrf
        A = dpotrf(a, lower=True)[0].tolist()
        return A

    @task(priority=True)
    def trsm(a, B):
        from scipy.linalg import solve_triangular
        from numpy import transpose
        B = solve_triangular(a, B, lower=True, trans='T')
        ...

    @task()
    def gemm(a, B, C):
        from scipy.linalg.blas import dgemm
        alpha = -1.0
        beta = 1.0
        C = dgemm(alpha, a, B, c=C, beta=beta, trans_b=1).tolist()
        ...

    def cholesky_blocked(A):
        from pycompss.api.api import compss_wait_on
        n = len(A)
        for k in range(n):
            # Diagonal block factorization
            A[k][k] = potrf(A[k][k])
            # Triangular systems
            for i in range(k+1, n):
                A[k][i] = trsm(A[k][k], A[k][i])
            # Update trailing matrix
            for i in range(k+1, n):
                for j in range(k+1, i):
                    A[j][i] = gemm(A[k][i], A[k][j], A[j][i])
                A[i][i] = syrk(A[k][i], A[i][i])
        A = compss_wait_on(A)
        return A
[Figure: the generated task dependency graph, ending in the synchronization at compss_wait_on.]
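As a self-contained illustration of the algorithm these tasks implement, here is a sequential blocked Cholesky in plain Python. It is a sketch of the standard lower-triangular right-looking variant, with toy block kernels in place of LAPACK/BLAS; the names and the lower-block storage layout are my own, not the slide's.

```python
import math

def potrf(a):
    # Cholesky of one dense block: a = L L^T, returns lower-triangular L.
    b = len(a)
    L = [[0.0] * b for _ in range(b)]
    for i in range(b):
        for j in range(i + 1):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / L[j][j]
    return L

def trsm(Lkk, a):
    # Solve X * Lkk^T = a for one off-diagonal block (forward substitution).
    b = len(a)
    X = [[0.0] * b for _ in range(b)]
    for r in range(b):
        for c in range(b):
            s = sum(X[r][p] * Lkk[c][p] for p in range(c))
            X[r][c] = (a[r][c] - s) / Lkk[c][c]
    return X

def gemm(c, a, b):
    # Trailing update: c -= a @ b^T (also covers the syrk diagonal case).
    n = len(c)
    for r in range(n):
        for col in range(n):
            c[r][col] -= sum(a[r][p] * b[col][p] for p in range(n))
    return c

def cholesky_blocked(A):
    # A is an n x n grid of dense blocks; its lower part is overwritten with L.
    n = len(A)
    for k in range(n):
        A[k][k] = potrf(A[k][k])                       # diagonal block
        for i in range(k + 1, n):
            A[i][k] = trsm(A[k][k], A[i][k])           # panel solves
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                A[i][j] = gemm(A[i][j], A[i][k], A[j][k])  # trailing update
    return A

# Usage: factor a 4x4 SPD matrix split into a 2x2 grid of 2x2 blocks.
M = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
bs = 2
A = [[[row[bj * bs:bj * bs + bs] for row in M[bi * bs:bi * bs + bs]]
      for bj in range(2)] for bi in range(2)]
cholesky_blocked(A)

# Reassemble the dense lower factor and check L @ L^T reproduces M.
L = [[A[r // bs][c // bs][r % bs][c % bs] if c // bs <= r // bs else 0.0
      for c in range(4)] for r in range(4)]
err = max(abs(sum(L[r][p] * L[c][p] for p in range(4)) - M[r][c])
          for r in range(4) for c in range(4))
```

In a task-based version each `potrf`/`trsm`/`gemm` call becomes a task, and the argument directions alone give the runtime the familiar Cholesky dependence pattern: panel solves wait on the diagonal block, trailing updates wait on their two panels.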
29 Some results: matrix multiply
Matrix size: ...; block size: 4096; 1 OpenMP thread per task; 16 tasks per node.
[Charts: matrix multiply execution time (secs), speedup and performance (GFlops) vs. number of cores/nodes.]
30 Some results: matrix multiply
Matrix size: ...; block size: 4096; 8 OpenMP threads per task; 2 tasks per node.
[Charts: matrix multiply execution time (secs), speedup and performance (GFlops) vs. number of cores/nodes, 8 threads per task.]
31 Some insight
[Paraver traces comparing BSIZE 2048 vs. BSIZE ..., 1 thread vs. 8 threads per task.]
32 Some insight
[Paraver traces comparing BSIZE 2048 vs. BSIZE ..., 1 thread vs. 8 threads per task.]
33 Conclusions
- StarSs is a family of task-based programming models targeting parallel systems
- OmpSs focuses on more traditional HPC systems, including heterogeneous nodes with GPUs and accelerators
- PyCOMPSs provides a high-level, easy interface focused on distributed computing, including cloud and Big Data
- Both systems come with a whole environment of tools: performance analysis, monitoring
- Ongoing work: integrating PyCOMPSs/COMPSs with new storage approaches to enable convergence between HPC and Big Data
- Open source: pm.bsc.es, compss.bsc.es
34 POP CoE A Center of Excellence On Performance Optimization and Productivity Promoting best practices in performance analysis and parallel programming Providing Services Precise understanding of application and system behavior Suggestion/support on how to refactor code in the most productive way Horizontal Transversal across application areas, platforms, scales For academic AND industrial codes and users! 36
35 Thank you! 37
More informationOpenMP 4.0/4.5. Mark Bull, EPCC
OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all
More informationHeterogeneous Multicore Parallel Programming
Innovative software for manycore paradigms Heterogeneous Multicore Parallel Programming S. Chauveau & L. Morin & F. Bodin Introduction Numerous legacy applications can benefit from GPU computing Many programming
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationOpenACC 2.6 Proposed Features
OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively
More informationAddressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer
Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2
More informationAnalysis of the Task Superscalar Architecture Hardware Design
Available online at www.sciencedirect.com Procedia Computer Science 00 (2013) 000 000 International Conference on Computational Science, ICCS 2013 Analysis of the Task Superscalar Architecture Hardware
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture
More informationPARALUTION - a Library for Iterative Sparse Methods on CPU and GPU
- a Library for Iterative Sparse Methods on CPU and GPU Dimitar Lukarski Division of Scientific Computing Department of Information Technology Uppsala Programming for Multicore Architectures Research Center
More informationOmpCloud: Bridging the Gap between OpenMP and Cloud Computing
OmpCloud: Bridging the Gap between OpenMP and Cloud Computing Hervé Yviquel, Marcio Pereira and Guido Araújo University of Campinas (UNICAMP), Brazil A bit of background qguido Araujo, PhD Princeton University
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationTask based parallelization of recursive linear algebra routines using Kaapi
Task based parallelization of recursive linear algebra routines using Kaapi Clément PERNET joint work with Jean-Guillaume DUMAS and Ziad SULTAN Université Grenoble Alpes, LJK-CASYS January 20, 2017 Journée
More informationOverview: Emerging Parallel Programming Models
Overview: Emerging Parallel Programming Models the partitioned global address space paradigm the HPCS initiative; basic idea of PGAS the Chapel language: design principles, task and data parallelism, sum
More informationJCudaMP: OpenMP/Java on CUDA
JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationParallelism V. HPC Profiling. John Cavazos. Dept of Computer & Information Sciences University of Delaware
Parallelism V HPC Profiling John Cavazos Dept of Computer & Information Sciences University of Delaware Lecture Overview Performance Counters Profiling PAPI TAU HPCToolkit PerfExpert Performance Counters
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationAdvanced OpenMP Features
Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group {terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Vectorization 2 Vectorization SIMD =
More informationPyCOMPSs: Parallel computational workflows in Python
Original Article PyCOMPSs: Parallel computational workflows in Python The International Journal of High Performance Computing Applications 2017, Vol. 31(1) 66 82 Ó The Author(s) 2015 Reprints and permissions:
More informationMetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores
MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Presented by Xiaohui Chen Joint work with Marc Moreno Maza, Sushek Shekar & Priya Unnikrishnan University of Western Ontario,
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationOptimising the Mantevo benchmark suite for multi- and many-core architectures
Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of
More informationOpenMP Tutorial. Dirk Schmidl. IT Center, RWTH Aachen University. Member of the HPC Group Christian Terboven
OpenMP Tutorial Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de IT Center, RWTH Aachen University Head of the HPC Group terboven@itc.rwth-aachen.de 1 Tasking
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationEasy Programming the Cloud with PyCOMPSs
www.bsc.es Easy Programming the Cloud with PyCOMPSs FiCLOUD 2014 Barcelona, August 28 Barcelona Supercomputing Center The BSC-CNS objectives: R&D in Computer Sciences, Life Sciences and Earth Sciences
More informationSPOC : GPGPU programming through Stream Processing with OCaml
SPOC : GPGPU programming through Stream Processing with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte January 23rd, 2012 GPGPU Programming Two main frameworks Cuda OpenCL Different Languages
More informationParallel Programming. Libraries and Implementations
Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationGuiding the optimization of parallel codes on multicores using an analytical cache model
Guiding the optimization of parallel codes on multicores using an analytical cache model Diego Andrade, Basilio B. Fraguela, and Ramón Doallo Universidade da Coruña, Spain {diego.andrade,basilio.fraguela,ramon.doalllo}@udc.es
More information