PEZY-SC Omni OpenACC GPU. Green500[1] Shoubu( ) ExaScaler PEZY-SC. [4] Omni OpenACC NVIDIA GPU. ExaScaler PEZY-SC PZCL PZCL OpenCL[2]

Size: px
Start display at page:

Download "PEZY-SC Omni OpenACC GPU. Green500[1] Shoubu( ) ExaScaler PEZY-SC. [4] Omni OpenACC NVIDIA GPU. ExaScaler PEZY-SC PZCL PZCL OpenCL[2]"

Transcription

1 ZY-SC Omni 1,a) ,4 1,5 ZY-SC ZY-SC OpenCL ZCL ZY-SC Suiren Blue ZCL N-Body 98%NB CG 88% ZCL 1. Green500[1] Shoubu( ) xascaler ZY-SC MIMD xascaler ZY-SC ZCL ZCL OpenCL[2] ZCL ZY-SC 1 Graduate School of Systems and Information ngineering, University of Tsukuba 2 xascaler xascaler Inc. 3 (KK) Computing Research Center, High nergy Accelerator Research Organization (KK) 4 Center for Computational Sciences, University of Tsukuba 5 RIKN Advanced Institute for Computational Science a) tabuchi@hpcs.cs.tsukuba.ac.jp [3] GU ZY-SC ZY-SC ZY-SC Omni [4] Omni NVIDIA GU CUDA[5] source-to-source N-body NAS arallel Benchmarks CG (NB CG)[6] ZCL 2 3 ZY-SC ZCL 4 ZY-SC Omni ZY-SC [7] OpenCL OpenCL ZCL c 2016 Information rocessing Society of Japan 1

2 OpenCL ZCL OpenCL accull[8] OpenUH-[9] OpenARC[10] RoseACC[11] GCC [12] accull ython YaCF CUDA OpenCL OpenUH- OpenUH CUDA OpenCL OpenARC 1.0 Cetus compiler infrastructure RoseACC Rose Compiler OpenCL GU OpenCL CU MIC FGA ZY-SC ZCL OpenCL Omni ZCL ZY-SC 3. ZY-SC ZY-SC ZCL DDR4 DDR4 DDR4 1 L1 (2KB) L1 (2KB) Village (4) L2 (64KB) City (16) L3 (2MB) refecture (256) DDR4 ZY-SC T7 T6 2 T3 T2 T0 T1 1 refecture T4 T L1 L3 3.1 ZY-SC ZY rocessing lement () MIMD 8 SMT (Simultaneous MultiThreading) 8192 MIMD Village City refecture 8 16KB 2 ALU FU Village 4 2 2KB L1 City 4 Village Unit (SFU) 64KB L2 refecture 16 City 2MB L3 3.2 ZCL ZCL ZY ZY-SC OpenCL OpenCL AI OpenCL 1.1 OpenCL AI OpenCL ZCL c 2016 Information rocessing Society of Japan 2

3 City (128 ) OpenCL 3 ZCL C/C++OpenCL OpenCL kernel global local ZCL pzc ZCL OpenCL ID ID get pid() get tid() OpenCL get group id(0) get local id(0) get maxpid() get maxtid() OpenCL get num groups(0) get local size(0) ZCL chgthread() sync() flush() chgthread() sync() sync L1() Village sync L2() City sync L3() refecture flush() flush L1() Village L1 flush L2() City L1,L2 4. ZY-SC Omni ZY-SC Omni 4.1 C/C++/Fortran C C with 3 translator C with ACC AI call ZCL kernel C compiler ZCL compiler Run7me Library xecu7on file Kernel binary Omni Compiler load at runtime ZCL ZCL ZY-SC ZCL CU GU ZY-SC ZCL 4.2 C Fortran95 Omni Compiler Infrastructure[13] 3 C translator C ZCL C ZCL 2 Omni runtime library data data 4 data data a b copy a copyout b ACC init data lower length DV ADDR name name c 2016 Information rocessing Society of Japan 3

4 int a[100], b; #pragma acc data copy(a) copyout(b) /* some codes using a and b */ (a) int a[100], b; void *DSC_a,*DV_ADDR_a,*DSC_b,*DV_ADDR_b; unsigned long long _lower[] = 0; unsigned long long _length[] = 100; _ACC_init_data(&(DSC_a),&(DV_ADDR_a),a,sizeof(int),1,_lower,length); _ACC_init_data(&(DSC_b),&(DV_ADDR_b),&(b),sizeof(int),0,NULL, NULL); _ACC_copy_data(DSC_a,_ACC_HOST_TO_DVIC,_ACC_ASYNC_SYNC); /* some codes using a and b */ _ACC_copy_data(DSC_a,_ACC_DVIC_TO_HOST,_ACC_ASYNC_SYNC); _ACC_copy_data(DSC_b,_ACC_DVIC_TO_HOST,_ACC_ASYNC_SYNC); _ACC_finalize_data(DSC_a); _ACC_finalize_data(DSC_b); 4 (b) data DSC name name ACC copy data ACC finalize data 4.3 parallel parallel gang, worker, vector 3 ZCL gang vector firstprivate private #pragma acc parallel present(a) num_gangs(16) /* codes in parallel region */ (a) /* host code */ int _ACC_ngangs = 16; int _ACC_nworkers = 1; int _ACC_veclen = 8; int _ACC_conf[] = _ACC_ngangs, _ACC_nworkers, _ACC_veclen; void* _ACC_args[] = &DV_ADDR_a; size_t _ACC_argsizes[] = sizeof(void*); _ACC_launch(_ACC_program, 0, _ACC_conf, ACC_ASYNC_SYNC, 1, args, arg_sizes); /* kernel function in device code */ void pzc ACC_kernel_0(int *a) /* codes in parallel region */ 5 (b) parallel num gangs 5 parallel pzc ACC kernel 0 gang ACC args ACC argsizes ACC launch 1 ACC program cl program cl kernel 2 ACC launch clnqueuendrangekernel ZY-SC loop loop for gang vector cyclic loop reduction c 2016 Information rocessing Society of Japan 4

5 /* inside parallel region */ #pragma acc loop vector reduction(+:sum) for(i = 0; i < N; i++) a[i]++; sum += a[i]; (a) /* inside kernel function */ int _niter_i, _idx, _init, _cond, _step, _red_sum; _ACC_init_reduction_var(&_red_sum,0); _ACC_calc_niter(&_niter_i, 0, N, 1); _ACC_init_thread_iter(&_init,&_cond,&_step,_niter_i); for(_idx = _init; _idx < _cond; _idx += _step) int i; _ACC_calc_idx(_idx, &i, 0, N, 1); a[i]++; _red_sum += a[i]; _ACC_reduction_thread(sum,_red_sum, 0); 6 (b) loop 6 loop ACC calc niter ACC init thread iter ACC calc idx ACC init reduction var ACC reduction thread 5. ZCL N-Body) NB CG N-Body NB CG 5.1 KK Suiren Blue 1 N-Body 7 ZCL ZCL (merged kernel) 2 1 ZCL (merged kernel, chgthread) CU Memory Accelerator 1 (Suiren Blue) Intel Xeon Lv3 2.3GHz DDR4 1866MHz, 64GB ZY-SC (DDR4 1866MHz 16GB) Compiler ICC , ZSDK 2.1, Omni compiler for ZY-SC chgthread() ZCL % chgthread() N-Body NB CG 8 mop/s Mega Operations er Second 1 ZCL ZCL (merged kernel) conj grad 7 1 ZCL (merged kernel, chgthread) chgthread() ZCL % CG 0 ClassB ZCL % parallel parallel kernels 2 parallel 1 parallel parallel kernels 1 ZCL sync() kernels 1 kernels GU ZCL c 2016 Information rocessing Society of Japan 5

6 実行時間 (s) ZCL ZCL(merged kernel) ZCL(merged kernel, chgthread) 2 N-Body NB CG N-Body NB CG ZCL ZCL(merged kernel) ZCL(merged kernel, chgthread) 114 (5) 447 (25) K 16K 32K 64K 128K 256K 512K 1024K 粒子数 7 N-Body 48% NB CG 45% ZCL 6. mop / s A (14000) B (75000) C (150000) Class ( 行列サイズ ) 8 NB CG ZCL ZCL (merged kernel) ZCL (merged kernel, chgthread) chgthread() ZCL ZCL 11 38% CG chgthread() ZCL % ZCL kernels chgthread() 5.2 ZCL AI (SLOC) N-Body NB CG 2 ZCL N-Body ZY-SC ZY-SC NVIDIA GU CUDA Omni C ZY-SC ZCL N-Body ZCL 98% NB CG ZCL 88% ZCL N-Body 48% NB CG 45% kernels chgthread() [1] The green [2] Khronos Group, OpenCL - The open standard for parallel programming of heterogeneous systems. [3] -Standard.org, Home. [4] Akihiro Tabuchi, Masahiro Nakao, and Mitsuhisa Sato. A source-to-source openacc compiler for cuda. In uro- ar Workshops, pp , [5] NVIDIA, home_new.html. arallel rogramming and Computing latform CUDA. [6] NASA Advanced Supercomputing Division, http: // NAS arallel Benchmarks. [7]. Suiren c 2016 Information rocessing Society of Japan 6

7 .. [ ], No. 11, dec [8] Ruymán Reyes, Iván López-Rodríguez, JuanJ. Fumero, and Francisco de Sande. accull: An openacc implementation with cuda and opencl support. In uro-ar 2012 arallel rocessing, Vol of Lecture Notes in Computer Science, pp Springer Berlin Heidelberg, [9] Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and Barbara Chapman. Compiling a high-level directive-based programming model for gpgpus. In Languages and Compilers for arallel Computing, Lecture Notes in Computer Science, pp Springer International ublishing, [10] Seyong Lee and Jeffrey S. Vetter. Openarc: Open accelerator research compiler for directive-based, efficient heterogeneous computing. In roceedings of the 23rd International Symposium on High-performance arallel and Distributed Computing, HDC 14, pp , New York, NY, USA, ACM. [11] University of Delaware and LLNL, org/. RoseACC. [12] GCC, - GCC Wiki. [13] RIKN AICS and University of Tsukuba, omni-compiler.org. Omni Compiler roject. c 2016 Information rocessing Society of Japan 7

Omni Compiler and XcodeML: An Infrastructure for Source-to- Source Transformation

Omni Compiler and XcodeML: An Infrastructure for Source-to- Source Transformation http://omni compiler.org/ Omni Compiler and XcodeML: An Infrastructure for Source-to- Source Transformation MS03 Code Generation Techniques for HPC Earth Science Applications Mitsuhisa Sato (RIKEN / Advanced

More information

Compiling a High-level Directive-Based Programming Model for GPGPUs

Compiling a High-level Directive-Based Programming Model for GPGPUs Compiling a High-level Directive-Based Programming Model for GPGPUs Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and Barbara Chapman Department of Computer Science, University

More information

Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators

Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu

More information

An Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters

An Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters An Extension of XcalableMP PGAS Lanaguage for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction

More information

GPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3

GPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 /CPU,a),2,2 2,2 Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 XMP XMP-dev CPU XMP-dev/StarPU XMP-dev XMP CPU StarPU CPU /CPU XMP-dev/StarPU N /CPU CPU. Graphics Processing Unit GP General-Purpose

More information

HPC Challenge Awards 2010 Class2 XcalableMP Submission

HPC Challenge Awards 2010 Class2 XcalableMP Submission HPC Challenge Awards 2010 Class2 XcalableMP Submission Jinpil Lee, Masahiro Nakao, Mitsuhisa Sato University of Tsukuba Submission Overview XcalableMP Language and model, proposed by XMP spec WG Fortran

More information

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware OpenACC Standard Directives for Accelerators Credits http://www.openacc.org/ o V1.0: November 2011 Specification OpenACC, Directives for Accelerators, Nvidia Slideware CAPS OpenACC Compiler, HMPP Workbench

More information

Objective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC

Objective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC GPU Teaching Kit Accelerated Computing OpenACC Introduction to OpenACC Objective To understand the OpenACC programming model basic concepts and pragma types simple examples 2 2 OpenACC The OpenACC Application

More information

The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer

The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer William Killian Tom Scogland, Adam Kunen John Cavazos Millersville University of Pennsylvania

More information

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

More information

Getting Started with Directive-based Acceleration: OpenACC

Getting Started with Directive-based Acceleration: OpenACC Getting Started with Directive-based Acceleration: OpenACC Ahmad Lashgar Member of High-Performance Computing Research Laboratory, School of Computer Science Institute for Research in Fundamental Sciences

More information

OpenACC. Part 2. Ned Nedialkov. McMaster University Canada. CS/SE 4F03 March 2016

OpenACC. Part 2. Ned Nedialkov. McMaster University Canada. CS/SE 4F03 March 2016 OpenACC. Part 2 Ned Nedialkov McMaster University Canada CS/SE 4F03 March 2016 Outline parallel construct Gang loop Worker loop Vector loop kernels construct kernels vs. parallel Data directives c 2013

More information

GPU. OpenMP. OMPCUDA OpenMP. forall. Omni CUDA 3) Global Memory OMPCUDA. GPU Thread. Block GPU Thread. Vol.2012-HPC-133 No.

GPU. OpenMP. OMPCUDA OpenMP. forall. Omni CUDA 3) Global Memory OMPCUDA. GPU Thread. Block GPU Thread. Vol.2012-HPC-133 No. GPU CUDA OpenMP 1 2 3 1 1 OpenMP CUDA OM- PCUDA OMPCUDA GPU CUDA CUDA 1. GPU GPGPU 1)2) GPGPU CUDA 3) CPU CUDA GPGPU CPU GPU OpenMP GPU CUDA OMPCUDA 4)5) OMPCUDA GPU OpenMP GPU CUDA OMPCUDA/MG 2 GPU OMPCUDA

More information

Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels

Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels National Aeronautics and Space Administration Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels Jose M. Monsalve Diaz (UDEL), Gabriele Jost (NASA), Sunita Chandrasekaran

More information

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance

More information

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016 OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators

More information

Early Experiences With The OpenMP Accelerator Model

Early Experiences With The OpenMP Accelerator Model Early Experiences With The OpenMP Accelerator Model Chunhua Liao 1, Yonghong Yan 2, Bronis R. de Supinski 1, Daniel J. Quinlan 1 and Barbara Chapman 2 1 Center for Applied Scientific Computing, Lawrence

More information

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017 INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and

More information

COMP Parallel Computing. Programming Accelerators using Directives

COMP Parallel Computing. Programming Accelerators using Directives COMP 633 - Parallel Computing Lecture 15 October 30, 2018 Programming Accelerators using Directives Credits: Introduction to OpenACC and toolkit Jeff Larkin, Nvidia COMP 633 - Prins Directives for Accelerator

More information

Research Article Multi-GPU Support on Single Node Using Directive-Based Programming Model

Research Article Multi-GPU Support on Single Node Using Directive-Based Programming Model Scientific Programming Volume 2015, Article ID 621730, 15 pages http://dx.doi.org/10.1155/2015/621730 Research Article Multi-GPU Support on Single Node Using Directive-Based Programming Model Rengan Xu,

More information

JCudaMP: OpenMP/Java on CUDA

JCudaMP: OpenMP/Java on CUDA JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems

More information

C PGAS XcalableMP(XMP) Unified Parallel

C PGAS XcalableMP(XMP) Unified Parallel PGAS XcalableMP Unified Parallel C 1 2 1, 2 1, 2, 3 C PGAS XcalableMP(XMP) Unified Parallel C(UPC) XMP UPC XMP UPC 1 Berkeley UPC GASNet 1. MPI MPI 1 Center for Computational Sciences, University of Tsukuba

More information

A Comparative Study of OpenACC Implementations

A Comparative Study of OpenACC Implementations A Comparative Study of OpenACC Implementations Ruymán Reyes, Iván López, Juan J. Fumero and Francisco de Sande 1 Abstract GPUs and other accelerators are available on many different devices, while GPGPU

More information

OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel

OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es Exascale in BSC Marenostrum 4 (13.7 Petaflops ) General purpose cluster (3400

More information

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller

More information

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information

OpenACC (Open Accelerators - Introduced in 2012)

OpenACC (Open Accelerators - Introduced in 2012) OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in

More information

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs

More information

Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015

Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Abstract As both an OpenMP and OpenACC insider I will present my opinion of the current status of these two directive sets for programming

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information

Advanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2016

Advanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2016 Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2016 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait

More information

OpenACC Support in Score-P and Vampir

OpenACC Support in Score-P and Vampir Center for Information Services and High Performance Computing (ZIH) OpenACC Support in Score-P and Vampir Hands-On for the Taurus GPU Cluster February 2016 Robert Dietrich (robert.dietrich@tu-dresden.de)

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

An OpenACC construct is an OpenACC directive and, if applicable, the immediately following statement, loop or structured block.

An OpenACC construct is an OpenACC directive and, if applicable, the immediately following statement, loop or structured block. API 2.6 R EF ER ENC E G U I D E The OpenACC API 2.6 The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran

More information

page migration Implementation and Evaluation of Dynamic Load Balancing Using Runtime Performance Monitoring on Omni/SCASH

page migration Implementation and Evaluation of Dynamic Load Balancing Using Runtime Performance Monitoring on Omni/SCASH Omni/SCASH 1 2 3 4 heterogeneity Omni/SCASH page migration Implementation and Evaluation of Dynamic Load Balancing Using Runtime Performance Monitoring on Omni/SCASH Yoshiaki Sakae, 1 Satoshi Matsuoka,

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa

OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa OpenMP on the FDSM software distributed shared memory Hiroya Matsuba Yutaka Ishikawa 1 2 Software DSM OpenMP programs usually run on the shared memory computers OpenMP programs work on the distributed

More information

A ROSE-based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries

A ROSE-based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries A ROSE-based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries Chunhua Liao, Daniel J. Quinlan, Thomas Panas and Bronis R. de Supinski Center for Applied Scientific Computing Lawrence

More information

OpenACC Course Lecture 1: Introduction to OpenACC September 2015

OpenACC Course Lecture 1: Introduction to OpenACC September 2015 OpenACC Course Lecture 1: Introduction to OpenACC September 2015 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15:

More information

An Introduction to OpenACC. Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel

An Introduction to OpenACC. Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel An Introduction to OpenACC Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel Chapter 1 Introduction OpenACC is a software accelerator that uses the host and the device. It uses compiler

More information

Advanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2017

Advanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2017 Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2017 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait

More information

Design Decisions for a Source-2-Source Compiler

Design Decisions for a Source-2-Source Compiler Design Decisions for a Source-2-Source Compiler Roger Ferrer, Sara Royuela, Diego Caballero, Alejandro Duran, Xavier Martorell and Eduard Ayguadé Barcelona Supercomputing Center and Universitat Politècnica

More information

Introduction to OpenACC. 16 May 2013

Introduction to OpenACC. 16 May 2013 Introduction to OpenACC 16 May 2013 GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers Oil & Gas CAE CFD Finance Rendering Data Analytics

More information

Masahiro Nakao, Hitoshi Murai, Takenori Shimosaka, Mitsuhisa Sato

Masahiro Nakao, Hitoshi Murai, Takenori Shimosaka, Mitsuhisa Sato Masahiro Nakao, Hitoshi Murai, Takenori Shimosaka, Mitsuhisa Sato Center for Computational Sciences, University of Tsukuba, Japan RIKEN Advanced Institute for Computational Science, Japan 2 XMP/C int array[16];

More information

S Comparing OpenACC 2.5 and OpenMP 4.5

S Comparing OpenACC 2.5 and OpenMP 4.5 April 4-7, 2016 Silicon Valley S6410 - Comparing OpenACC 2.5 and OpenMP 4.5 James Beyer, NVIDIA Jeff Larkin, NVIDIA GTC16 April 7, 2016 History of OpenMP & OpenACC AGENDA Philosophical Differences Technical

More information

ECE 574 Cluster Computing Lecture 10

ECE 574 Cluster Computing Lecture 10 ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular

More information

Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs

Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs Jacob Lambert University of Oregon jlambert@cs.uoregon.edu Advisor: Allen D. Malony University of Oregon

More information

Advanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2018

Advanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2018 Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2018 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

OpenACC2 vs.openmp4. James Lin 1,2 and Satoshi Matsuoka 2

OpenACC2 vs.openmp4. James Lin 1,2 and Satoshi Matsuoka 2 2014@San Jose Shanghai Jiao Tong University Tokyo Institute of Technology OpenACC2 vs.openmp4 he Strong, the Weak, and the Missing to Develop Performance Portable Applica>ons on GPU and Xeon Phi James

More information

Early Experiences with the OpenMP Accelerator Model

Early Experiences with the OpenMP Accelerator Model Early Experiences with the OpenMP Accelerator Model Canberra, Australia, IWOMP 2013, Sep. 17th * University of Houston LLNL-PRES- 642558 This work was performed under the auspices of the U.S. Department

More information

Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA

Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle,

More information

Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels

Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels National Aeronautics and Space Administration Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels Jose M. Monsalve Diaz (UDEL), Gabriele Jost (NASA), Sunita Chandrasekaran

More information

OpenACC and the Cray Compilation Environment James Beyer PhD

OpenACC and the Cray Compilation Environment James Beyer PhD OpenACC and the Cray Compilation Environment James Beyer PhD Agenda A brief introduction to OpenACC Cray Programming Environment (PE) Cray Compilation Environment, CCE An in depth look at CCE 8.2 and OpenACC

More information

Lecture 4: OpenMP Open Multi-Processing

Lecture 4: OpenMP Open Multi-Processing CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP

More information

An Introduction to OpenAcc

An Introduction to OpenAcc An Introduction to OpenAcc ECS 158 Final Project Robert Gonzales Matthew Martin Nile Mittow Ryan Rasmuss Spring 2016 1 Introduction: What is OpenAcc? OpenAcc stands for Open Accelerators. Developed by

More information

INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC. Jeff Larkin, NVIDIA Developer Technologies

INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC. Jeff Larkin, NVIDIA Developer Technologies INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC Jeff Larkin, NVIDIA Developer Technologies AGENDA Accelerated Computing Basics What are Compiler Directives? Accelerating Applications with OpenACC Identifying

More information

OpenACC Fundamentals. Steve Abbott November 15, 2017

OpenACC Fundamentals. Steve Abbott November 15, 2017 OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)

More information

A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA

A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région

More information

Introduction to OpenACC

Introduction to OpenACC Introduction to OpenACC Alexander Fu, David Lin, Russell Miller June 2016 Taking advantage of the processing power of the GPU is what makes CUDA relevant. However, using CUDA and constraining oneself to

More information

A cache-aware performance prediction framework for GPGPU computations

A cache-aware performance prediction framework for GPGPU computations A cache-aware performance prediction framework for GPGPU computations The 8th Workshop on UnConventional High Performance Computing 215 Alexander Pöppl, Alexander Herz August 24th, 215 UCHPC 215, August

More information

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra) CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming

More information

MPI_Send(a,..., MPI_COMM_WORLD); MPI_Recv(a,..., MPI_COMM_WORLD, &status);

MPI_Send(a,..., MPI_COMM_WORLD); MPI_Recv(a,..., MPI_COMM_WORLD, &status); $ $ 2 global void kernel(int a[max], int llimit, int ulimit) {... } : int main(int argc, char *argv[]){ MPI_Int(&argc, &argc); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size);

More information

Programming Environment Research Team

Programming Environment Research Team Chapter 2 Programming Environment Research Team 2.1 Members Mitsuhisa Sato (Team Leader) Hitoshi Murai (Research Scientist) Miwako Tsuji (Research Scientist) Masahiro Nakao (Research Scientist) Jinpil

More information

Identification and Elimination of the Overhead of Accelerate with a Super-resolution Application

Identification and Elimination of the Overhead of Accelerate with a Super-resolution Application Regular Paper Identification and Elimination of the Overhead of Accelerate with a Super-resolution Application Izumi Asakura 1,a) Hidehiko Masuhara 1 Takuya Matsumoto 2 Kiminori Matsuzaki 3 Received: April

More information

INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC

INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC DR. CHRISTOPH ANGERER, NVIDIA *) THANKS TO JEFF LARKIN, NVIDIA, FOR THE SLIDES 3 APPROACHES TO GPU PROGRAMMING Applications Libraries Compiler Directives

More information

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast

More information

Runtime Address Space Computation for SDSM Systems

Runtime Address Space Computation for SDSM Systems Runtime Address Space Computation for SDSM Systems Jairo Balart Outline Introduction Inspector/executor model Implementation Evaluation Conclusions & future work 2 Outline Introduction Inspector/executor

More information

Introduction to OpenACC. Shaohao Chen Research Computing Services Information Services and Technology Boston University

Introduction to OpenACC. Shaohao Chen Research Computing Services Information Services and Technology Boston University Introduction to OpenACC Shaohao Chen Research Computing Services Information Services and Technology Boston University Outline Introduction to GPU and OpenACC Basic syntax and the first OpenACC program:

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

LLVM-based Communication Optimizations for PGAS Programs

LLVM-based Communication Optimizations for PGAS Programs LLVM-based Communication Optimizations for PGAS Programs nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15 Akihiro Hayashi (Rice University) Jisheng Zhao (Rice University) Michael Ferguson

More information

Introduction to GPU (Graphics Processing Unit) Architecture & Programming

Introduction to GPU (Graphics Processing Unit) Architecture & Programming Introduction to GU (Graphics rocessing Unit) Architecture & rogramming C240A. 2017 T. Yang ome of slides are from M. Hall of Utah C6235 Overview Hardware architecture rogramming model Example Historical

More information

Parallelism III. MPI, Vectorization, OpenACC, OpenCL. John Cavazos,Tristan Vanderbruggen, and Will Killian

Parallelism III. MPI, Vectorization, OpenACC, OpenCL. John Cavazos,Tristan Vanderbruggen, and Will Killian Parallelism III MPI, Vectorization, OpenACC, OpenCL John Cavazos,Tristan Vanderbruggen, and Will Killian Dept of Computer & Information Sciences University of Delaware 1 Lecture Overview Introduction MPI

More information

GPU programming made easier

GPU programming made easier GPU programming made easier Jacob Jepsen 6. June 2014 University of Copenhagen Department of Computer Science 6. June 2014 Introduction We created a tool that reduces the development time of GPU code.

More information

OpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu

OpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu OpenMP 4.0/4.5: New Features and Protocols Jemmy Hu SHARCNET HPC Consultant University of Waterloo May 10, 2017 General Interest Seminar Outline OpenMP overview Task constructs in OpenMP SIMP constructs

More information

A COMPILER OPTIMIZATION FRAMEWORK FOR DIRECTIVE-BASED GPU COMPUTING

A COMPILER OPTIMIZATION FRAMEWORK FOR DIRECTIVE-BASED GPU COMPUTING A COMPILER OPTIMIZATION FRAMEWORK FOR DIRECTIVE-BASED GPU COMPUTING A Dissertation Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements

More information

Scientific discovery, analysis and prediction made possible through high performance computing.

Scientific discovery, analysis and prediction made possible through high performance computing. Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013

More information

CSC573: TSHA Introduction to Accelerators

CSC573: TSHA Introduction to Accelerators CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures

More information

OpenACC Fundamentals. Steve Abbott November 13, 2016

OpenACC Fundamentals. Steve Abbott November 13, 2016 OpenACC Fundamentals Steve Abbott , November 13, 2016 Who Am I? 2005 B.S. Physics Beloit College 2007 M.S. Physics University of Florida 2015 Ph.D. Physics University of New Hampshire

More information

GRVI Phalanx Update: A Massively Parallel RISC-V FPGA Accelerator Framework. Jan Gray CARRV2017: 2017/10/14

GRVI Phalanx Update: A Massively Parallel RISC-V FPGA Accelerator Framework. Jan Gray   CARRV2017: 2017/10/14 GRVI halanx Update: A Massively arallel RISC-V FGA Accelerator Framework Jan Gray jan@fpga.org http://fpga.org CARRV2017: 2017/10/14 FGA Datacenter Accelerators Are Almost Mainstream Catapult v2. Intel

More information

Accelerating Financial Applications on the GPU

Accelerating Financial Applications on the GPU Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming

More information

OPENACC DIRECTIVES FOR ACCELERATORS NVIDIA

OPENACC DIRECTIVES FOR ACCELERATORS NVIDIA OPENACC DIRECTIVES FOR ACCELERATORS NVIDIA Directives for Accelerators ABOUT OPENACC GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

OpenMP Doacross Loops Case Study

OpenMP Doacross Loops Case Study National Aeronautics and Space Administration OpenMP Doacross Loops Case Study November 14, 2017 Gabriele Jost and Henry Jin www.nasa.gov Background Outline - The OpenMP doacross concept LU-OMP implementations

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Vectorisation and Portable Programming using OpenCL

Vectorisation and Portable Programming using OpenCL Vectorisation and Portable Programming using OpenCL Mitglied der Helmholtz-Gemeinschaft Jülich Supercomputing Centre (JSC) Andreas Beckmann, Ilya Zhukov, Willi Homberg, JSC Wolfram Schenck, FH Bielefeld

More information

LLVM for the future of Supercomputing

LLVM for the future of Supercomputing LLVM for the future of Supercomputing Hal Finkel hfinkel@anl.gov 2017-03-27 2017 European LLVM Developers' Meeting What is Supercomputing? Computing for large, tightly-coupled problems. Lots of computational

More information

Power 7. Dan Christiani Kyle Wieschowski

Power 7. Dan Christiani Kyle Wieschowski Power 7 Dan Christiani Kyle Wieschowski History 1980-2000 1980 RISC Prototype 1990 POWER1 (Performance Optimization With Enhanced RISC) (1 um) 1993 IBM launches 66MHz POWER2 (.35 um) 1997 POWER2 Super

More information

Automatic Testing of OpenACC Applications

Automatic Testing of OpenACC Applications Automatic Testing of OpenACC Applications Khalid Ahmad School of Computing/University of Utah Michael Wolfe NVIDIA/PGI November 13 th, 2017 Why Test? When optimizing or porting Validate the optimization

More information

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows

More information

OpenMP 3.0 Tasking Implementation in OpenUH

OpenMP 3.0 Tasking Implementation in OpenUH Open64 Workshop @ CGO 09 OpenMP 3.0 Tasking Implementation in OpenUH Cody Addison Texas Instruments Lei Huang University of Houston James (Jim) LaGrone University of Houston Barbara Chapman University

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

Multicore-aware parallelization strategies for efficient temporal blocking (BMBF project: SKALB)

Multicore-aware parallelization strategies for efficient temporal blocking (BMBF project: SKALB) Multicore-aware parallelization strategies for efficient temporal blocking (BMBF project: SKALB) G. Wellein, G. Hager, M. Wittmann, J. Habich, J. Treibig Department für Informatik H Services, Regionales

More information

Joe Hummel, PhD. Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago.

Joe Hummel, PhD. Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago. Joe Hummel, PhD Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago email: joe@joehummel.net stuff: http://www.joehummel.net/downloads.html Async programming:

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Scientific Programming in C XIV. Parallel programming

Scientific Programming in C XIV. Parallel programming Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence

More information