Task-based programming models to support hierarchical algorithms


1 Task-based programming models to support hierarchical algorithms Rosa M Badia, Barcelona Supercomputing Center SHAXC 2016, KAUST, 11 May 2016

2 Outline BSC Overview of superscalar programming model OmpSs overview Use of OmpSs in numerical examples PyCOMPSs overview Use of PyCOMPSs in numerical examples Conclusions 2

3 Barcelona Supercomputing Center - Centro Nacional de Supercomputación BSC-CNS objectives: R&D in Computer, Life, Earth and Engineering Sciences. Supercomputing services and support to Spanish and European researchers. BSC-CNS is a consortium that includes: Spanish Government 60% Catalonian Government 30% Universitat Politècnica de Catalunya (UPC) 10% 425 people, 40 countries 3

4 Mission of BSC R&D Departments COMPUTER SCIENCES To influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency EARTH SCIENCES To develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications LIFE SCIENCES To understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics) CASE To develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations) 4

5 Reducing the gap between applications and architectures New algorithms and data structures require programming models and runtimes that: Reduce programming complexity Keep portability Enable communication-avoiding, asynchronous algorithms, ... (Stack: Applications; PM: high-level, clean, abstract interface; Runtime API; Computing Platform / Storage Platform) 5

6 So, what is a superscalar programming model? High-level sequential programming Executes following the superscalar processor model (out of order) Task is the unit of work Builds a task graph at runtime that expresses potential concurrency Large number of in-flight tasks Exposes distant parallelism Based on a runtime Makes decisions and executes the workflow Offers an abstraction to plug applications to different resources (computing, storage) (Keywords: superscalar processor, dataflow, high-level sequential programming, workflows, utilities) 6

7 The StarSs programming model CellSs SMPSs GPUSs ClusterSs ClearSpeedSs StarSs OmpSs GridSs ClusterSs PyCOMPSs/COMPSs Different implementations, targeting different platforms OmpSs: multicore, GPUs, clusters COMPSs: clusters, federated clouds, old grids Open source pm.bsc.es compss.bsc.es 7

8 Main elements of superscalar programming model syntax Superscalar program: sequential code, single shared memory space, identification of tasks Task: main element of the programming model, the computation unit Operates on given parameters and local variables Amount of work (granularity) may vary in a wide range (from μsecs to minutes or hours) and may depend on input arguments Once started, executes to completion independently of other tasks Syntax: task annotations, task arguments directionality, synchronizations 8
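A minimal sketch of how these elements look together, written here with PyCOMPSs syntax (one of the StarSs implementations covered later); the increment task, data sizes and names are illustrative, not taken from the slides:

    from pycompss.api.task import task
    from pycompss.api.parameter import INOUT
    from pycompss.api.api import compss_wait_on

    @task(block=INOUT)                    # task annotation + argument directionality
    def increment(block, value):
        # The computation unit: operates on its parameters and local variables
        for i in range(len(block)):
            block[i] += value

    def main():
        blocks = [[0.0] * 8 for _ in range(4)]
        for b in blocks:                  # sequential-looking code; each call spawns a task
            increment(b, 1.0)
        blocks = [compss_wait_on(b) for b in blocks]   # synchronization: retrieve results
        print(blocks)

    if __name__ == "__main__":
        main()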

9 Task annotations Different languages, same idea: annotations designed according to the standards of each programming language.

Java (annotated task interface):

    @Method(declaringClass = "cholesky.objects.Block")
    void ...(@Parameter(direction = Direction.INOUT) Block diag, ...);

C (OmpSs pragmas):

    #pragma omp task inout(A[k][k]) priority(10)
    spotrf(A[k][k]);
    for (i = k+1; i < nt; i++) {
        #pragma omp task in(A[k][k]) inout(...)
        strsm(A[k][k], ...);
    }

Python (decorators):

    @task(priority=True)
    def potrf(a):
        a = dpotrf(a, lower=True)[0].tolist()

10 Task arguments directionality Input, Output, Inout: indicate that the argument is read, written, or read and written by the task. Used at execution time to determine the data dependences between tasks. A directionality annotation is not a direct edge, but may generate one or more edges. Gives information about locality and about the data to be transferred.

    @task(o1 = output)
    def meta(o1): ...

    @task(o2 = input)
    def metb(o2): ...

    def main():
        meta(myobject)
        metb(myobject)
        metb(otherobject)

(The slide shows the resulting graph: the meta task followed by the metb task operating on myobject, while the metb task on otherobject has no incoming edge.) 10
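As an illustration of edge generation (PyCOMPSs syntax; the accumulate task and the objects x and y are hypothetical):

    from pycompss.api.task import task
    from pycompss.api.parameter import INOUT

    @task(acc=INOUT)
    def accumulate(acc, value):
        acc[0] += value

    x = [0.0]
    y = [0.0]
    accumulate(x, 1.0)   # writes x
    accumulate(x, 2.0)   # reads and writes x: the runtime adds an edge from the previous task
    accumulate(y, 3.0)   # only touches y: no edge, free to run in parallel with the chain on x
    # (results would later be retrieved with compss_wait_on)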

11 Impact of synchronization in task-based programming Ideally, tasks are executed according to the data dependences. However, synchronizations cannot always be avoided, for example when task results are needed. Semantics of a synchronization: synchronizations stop task graph generation, so granularity should not be the only parameter used to decide what is a task. Synchronization syntax is designed according to the standards of each language (in Java, synchronizations are added by interception): #pragma omp taskwait in C, foo = compss_wait_on(foo) in Python 11
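A sketch of the difference in Python (illustrative; the increment task is hypothetical): synchronizing inside the loop stops graph generation at every iteration, while a single synchronization after the loop lets the runtime see the whole graph.

    from pycompss.api.task import task
    from pycompss.api.parameter import INOUT
    from pycompss.api.api import compss_wait_on

    @task(block=INOUT)
    def increment(block, value):
        block[0] += value

    blocks = [[0.0] for _ in range(8)]

    # Avoid: waiting inside the loop forces each task to finish before the next one is created
    #   for b in blocks:
    #       increment(b, 1.0)
    #       b = compss_wait_on(b)

    # Prefer: generate all tasks first, synchronize only when the results are really needed
    for b in blocks:
        increment(b, 1.0)
    blocks = [compss_wait_on(b) for b in blocks]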

12 Performance analysis Runtimes instrumented with Extrae library to generate Paraver tracefiles 12

13 PART I: OMPSS 13

14 OmpSs environment: Mercurium compiler Recognizes constructs and transforms them to calls to the runtime Manages code restructuring for different target devices Device-specific handlers May generate code in a separate file Invokes different back-end compilers (e.g. nvcc for NVIDIA) Supports C/C++/Fortran

15 OmpSs environment: Nanos++ runtime (Architecture: source code with #pragma omp task is compiled by the Mercurium compiler into the application binary; the binary submits new tasks to the Nanos++ runtime, which provides dependency support (task graph), coherence support (directory/cache) for data requests, and a scheduler that feeds ready tasks to worker threads (executing tasks locally) and helper threads (driving device operations, e.g. executing tasks on GPUs via CUDA threads or on MIC via MIC threads).) 15

16 Example 1: Communication-avoiding QR* in OmpSs

    #pragma omp task inout( A[0;br*bc] ) output( T[0;br*bc] ) priority(3)
    void dgeqrf_dlarft (int br, int bc, int skip, double *A, double *T);

    #pragma omp task input( T[0;br*bc] ) inout( C[0;br*bc] )
    void dlarfb (int br, int bc, int skip, double *V, double *T, double *C);

    #pragma omp task inout( C[0;br*bc], D[0;br*bc] ) output( T[0;br*bc] ) priority(2)
    void dgeqrf_split (int br, int bc, int skip, double *C, double *D, double *T);

    #pragma omp task input( T[0;br*bc] ) inout( F[0;br*bc], G[0;br*bc] ) priority(1)
    void dlarfb_split (int br, int bc, int skip, double *D, double *T, double *F, double *G);

    for ( int k=0; k<nt; k++ ) {
        dgeqrf_dlarft(br, bc, 0, A[k][k], T[k][k]);
        for ( int j=k+1; j<nt; j++ ) {
            dlarfb(br, bc, 0, A[k][k], T[k][k], A[k][j]);
        }
        for ( int i=k+1; i<mt; i++ ) {
            dgeqrf_split(br, bc, 0, A[k][k], A[i][k], T[i][k]);
            for ( int j=k+1; j<nt; j++ ) {
                dlarfb_split(br, bc, 0, A[i][k], T[i][k], A[k][j], A[i][j]);
            }
        }
    }

* Demmel et al., Communication-optimal parallel and sequential QR and LU factorizations. 16

17 Example 2: Heterogeneous

    #pragma omp target device (hstreams) implements(dgemm_task) copy_deps
    #pragma omp task input([ts][ts] A, [ts][ts] B) inout([ts][ts] C) priority(10)
    void dgemm_phi (int ts, double *A, double *B, double *C)
    {
        double alpha = 1.0;
        const char trans = 'N';
        dgemm(&trans, &trans, &ts, &ts, &ts, &alpha, A, &ts, B, &ts, &alpha, C, &ts);
    }

    #pragma omp task input([ts][ts] A, [ts][ts] B) inout([ts][ts] C) priority(10)
    void dgemm_task (int ts, double *A, double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, ts, ts, ts,
                    1.0, A, ts, B, ts, 1.0, C, ts);
    }

    void matmul (int N, int TS, double *A[N][N], double *B[N][N], double *C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    dgemm_task(TS, A[i][k], B[k][j], C[i][j]);
        #pragma omp taskwait
    }
17

18 Example 3: Task parallelism ILUPACK's preconditioned CG solver* Exploitation of nested parallelism: tasks are split into finer-granularity tasks 18

19 Example 3: Task parallelism ILUPACK's preconditioned CG solver* NUMA-aware execution thanks to an OmpSs-specific NUMA-aware scheduler The code records on which socket each task is executed during the initial calculation of the preconditioner During the iterations, all tasks that operate on the same data that was generated/accessed during the preconditioner calculation are mapped to the socket where they were originally executed 19
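The bookkeeping behind this mapping can be pictured with a small sketch (an illustration of the idea only, not the actual Nanos++ scheduler code; all names are hypothetical):

    # Illustration only: remember where data was produced, reuse that placement later.
    socket_of_data = {}

    def record_placement(data_id, socket):
        # During the preconditioner calculation: note the socket where each piece of data lives
        socket_of_data[data_id] = socket

    def preferred_socket(task_data_ids, default=0):
        # During the CG iterations: send the task to the socket that already holds its data
        for d in task_data_ids:
            if d in socket_of_data:
                return socket_of_data[d]
        return default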

20 Some results: 8 CPUs + 1 GPGPU + 1 MIC (Plots: performance [Gflop/s] vs. matrix size [#elements on a side] for Matrix Multiply and Cholesky Factorization, comparing several OmpSs configurations (OCUbf, OCUaf, OhSbf, OhSaf, OHTsm) against CUDA and hStreams baselines.) The best performance is achieved when all processing units (CPUs, GPGPU and Xeon Phi) cooperate. 22

21 Some results: NUMA-aware scheduler results Distance-aware scheduler Al-Omairy et al., Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing 23

22 PART II: PYCOMPSS/COMPSS

23 Why Python? Python is powerful... and fast; plays well with others; runs everywhere; is friendly & easy to learn; is Open. * Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C Large community using it, including the scientific and numeric communities Object-oriented programming and structured programming are fully supported Large number of software modules available (38,000 as of January 2014) * From python.org 25

24 Task annotations Use of decorators to annotate tasks and indicate argument directionality Other annotations: constraints Small API:

    @task(priority=True)
    def potrf(a):
        a = dpotrf(a, lower=True)[0].tolist()

    foo = compss_wait_on(foo)
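For instance, a constrained and prioritized task could look like the sketch below (the constraint parameter name ComputingUnits is how it was spelled in COMPSs releases of that period and may differ in other versions; the block_multiply task is illustrative):

    import numpy as np
    from pycompss.api.task import task
    from pycompss.api.constraint import constraint
    from pycompss.api.parameter import INOUT

    @constraint(ComputingUnits="4")      # only run where 4 computing units can be reserved
    @task(c=INOUT, priority=True)        # directionality + priority through the decorator
    def block_multiply(a, b, c):
        c += np.dot(a, b)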

25 PyCOMPSs runtime behavior (Diagram: Python user code with task annotations goes through the PyCOMPSs binding, which builds the task dependency graph (TDG); the COMPSs runtime schedules the tasks and moves files and objects to the execution resources: grids, clusters, clouds.)

26 PyCOMPSs stack + NumPy and MKL MKL is parallelized with OpenMP, so there are two levels of parallelism to exploit: task level and thread level. (Stack diagram: on the host node, the application main runs in a Python interpreter with NumPy/SciPy over MKL, next to the COMPSs master; on each worker node, the task code runs in a Python interpreter with NumPy/SciPy over MKL, next to the COMPSs worker; MKL is parallelized with OpenMP.) 28
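A hedged sketch of how the two levels combine (illustrative sizes and names; it assumes the worker is configured so that the MKL/OpenMP threads match the cores reserved per task):

    import numpy as np
    from pycompss.api.task import task
    from pycompss.api.constraint import constraint
    from pycompss.api.api import compss_wait_on

    @constraint(ComputingUnits="8")      # task level: each task reserves 8 cores on a worker
    @task(returns=1)
    def block_dgemm(a, b):
        return np.dot(a, b)              # thread level: the BLAS call runs multithreaded via MKL

    pairs = [(np.random.rand(2048, 2048), np.random.rand(2048, 2048)) for _ in range(4)]
    partial = [block_dgemm(a, b) for a, b in pairs]      # several tasks in flight at once
    partial = [compss_wait_on(p) for p in partial]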

27 Example 1: Matrix multiply

    @task(c=INOUT)
    def multiply(a, b, c):
        c += a * b

    def initialize_variables():
        import numpy as np
        for matrix in [A, B, C]:
            for i in range(MSIZE):
                matrix.append([])
                for j in range(MSIZE):
                    if matrix == C:
                        block = np.array(np.zeros((BSIZE, BSIZE)), dtype=np.double, copy=False)
                    ...

    initialize_variables()
    eti = time.time()
    for i in range(MSIZE):
        for j in range(MSIZE):
            for k in range(MSIZE):
                multiply(A[i][k], B[k][j], C[i][j])
    C = compss_wait_on(C)
    print "Compute Time {} s".format(time.time() - eti)

(The slide also shows the resulting task dependency graph: the initialization tasks, the chains of multiply tasks on each C block, and the final synchronization.) 29

28 Example 2: Cholesky

    @task(returns=list, priority=True)
    def potrf(a):
        from scipy.linalg.lapack import dpotrf
        a = dpotrf(a, lower=True)[0].tolist()
        return a

    @task(returns=list, priority=True)
    def trsm(a, b):
        from scipy.linalg import solve_triangular
        from numpy import transpose
        b = solve_triangular(a, b, lower=True, trans='T')
        ...

    @task(returns=list)
    def gemm(a, b, c):
        from scipy.linalg.blas import dgemm
        from numpy import transpose
        alpha = -1.0
        beta = 1.0
        c = dgemm(alpha, a, b, c=c, beta=beta, trans_b=1).tolist()
        ...

    def cholesky_blocked(A):
        from pycompss.api.api import compss_wait_on
        n = len(A)
        for k in range(n):
            # Diagonal block factorization
            A[k][k] = potrf(A[k][k])
            # Triangular systems
            for i in range(k+1, n):
                A[k][i] = trsm(A[k][k], A[k][i])
            # Update trailing matrix
            for i in range(k+1, n):
                for j in range(k+1, i):
                    A[j][i] = gemm(A[k][i], A[k][j], A[j][i])
                A[i][i] = syrk(A[k][i], A[i][i])
        A = compss_wait_on(A)
        return A

(The slide also shows the generated task dependency graph of the blocked factorization, ending in a synchronization node.) 30
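A possible driver for the code above, assuming the blocked layout is a list of lists of NumPy blocks and that a syrk task is defined along the same lines as gemm (block and matrix sizes are illustrative):

    import numpy as np

    BSIZE = 512    # block size (illustrative)
    MSIZE = 4      # blocks per dimension (illustrative)

    def gen_spd_blocked(msize, bsize):
        # Build a symmetric positive definite matrix and split it into blocks
        n = msize * bsize
        m = np.random.rand(n, n)
        m = m @ m.T + n * np.eye(n)
        return [[m[i*bsize:(i+1)*bsize, j*bsize:(j+1)*bsize].copy()
                 for j in range(msize)] for i in range(msize)]

    A = gen_spd_blocked(MSIZE, BSIZE)
    A = cholesky_blocked(A)    # tasks are spawned block by block as the loops run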

29 Some results: matrix multiply Matrix size: ..., block size: 4096, 1 OpenMP thread / task, 16 tasks per node (Plots: execution time [s], speedup and performance [GFlops] as a function of the number of cores and nodes.) 31

30 Some results: matrix multiply Matrix size: ..., block size: 4096, 8 OpenMP threads / task, 2 tasks per node (Plots: execution time [s], speedup and performance [GFlops] as a function of the number of cores and nodes, with 8 threads per task.)

31 Some insight (Figures comparing executions with different block sizes, BSIZE 2048 shown, and different numbers of OpenMP threads per task, 8 shown.) 33

32 Some insight (continued) (Further figures for BSIZE 2048 and other block sizes, with different numbers of OpenMP threads per task.) 34

33 Conclusions StarSs is a family of task-based programming models targeting parallel systems OmpSs focuses on more traditional HPC systems, including heterogeneous nodes with GPUs and accelerators PyCOMPSs provides a high-level, easy interface focusing on distributed computing, including clouds and Big Data Both systems come with a whole environment of tools Performance analysis Monitor Ongoing work on integrating PyCOMPSs/COMPSs with new storage approaches to enable convergence between HPC and Big Data Open source: pm.bsc.es compss.bsc.es 35

34 POP CoE A Center of Excellence on Performance Optimization and Productivity Promoting best practices in performance analysis and parallel programming Providing services: Precise understanding of application and system behavior Suggestions/support on how to refactor code in the most productive way Horizontal: transversal across application areas, platforms, scales For academic AND industrial codes and users! 36

35 Thank you! 37
