PEZY-SC Omni OpenACC GPU. Green500[1] Shoubu( ) ExaScaler PEZY-SC. [4] Omni OpenACC NVIDIA GPU. ExaScaler PEZY-SC PZCL PZCL OpenCL[2]
Transcription
PEZY-SC Omni OpenACC
1,a) ,4 1,5

PEZY-SC PEZY-SC OpenCL PZCL PEZY-SC Suiren Blue PZCL N-Body 98% NPB CG 88% PZCL

1.
Green500[1] Shoubu( ) ExaScaler PEZY-SC MIMD ExaScaler PEZY-SC PZCL PZCL OpenCL[2] PZCL PEZY-SC
OpenACC[3] GPU PEZY-SC PEZY-SC PEZY-SC Omni OpenACC [4] Omni OpenACC NVIDIA GPU CUDA[5] source-to-source N-body NAS Parallel Benchmarks CG (NPB CG)[6] PZCL
2 3 PEZY-SC PZCL 4 PEZY-SC Omni PEZY-SC [7] OpenCL OpenCL PZCL

1 Graduate School of Systems and Information Engineering, University of Tsukuba
2 ExaScaler Inc.
3 Computing Research Center, High Energy Accelerator Research Organization (KEK)
4 Center for Computational Sciences, University of Tsukuba
5 RIKEN Advanced Institute for Computational Science
a) tabuchi@hpcs.cs.tsukuba.ac.jp

© 2016 Information Processing Society of Japan
OpenCL PZCL OpenCL accULL[8] OpenUH-[9] OpenARC[10] RoseACC[11] GCC [12]
accULL Python YaCF CUDA OpenCL
OpenUH- OpenUH CUDA OpenCL
OpenARC 1.0 Cetus compiler infrastructure
RoseACC Rose Compiler OpenCL
GPU OpenCL CPU MIC FPGA PEZY-SC PZCL OpenCL Omni PZCL PEZY-SC

3. PEZY-SC
PEZY-SC PZCL

[Figure 1: PEZY-SC. L1 (2KB), Village (4), L2 (64KB), City (16), L3 (2MB), Prefecture (256), DDR4]
[Figure 2: Prefecture. Threads T0 to T7, L1, L3]

3.1 PEZY-SC
Processing Element (PE) MIMD 8 SMT (Simultaneous MultiThreading) 8192 MIMD Village City Prefecture
PE: 8 16KB 2 ALU FPU
Village: 4 2 2KB L1
City: 4 Village, Unit (SFU), 64KB L2
Prefecture: 16 City, 2MB L3

3.2 PZCL
PZCL PEZY PEZY-SC OpenCL OpenCL API OpenCL 1.1 OpenCL API OpenCL PZCL
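The hierarchy figures in Section 3.1 (8 threads per PE, 4 PEs per Village, 4 Villages per City, 16 Cities per Prefecture, 8192 threads in total) can be checked with a few lines of C. This is only an illustrative sketch: the type `pezy_coord`, the function names, and the thread-minor numbering are hypothetical, and the count of 4 Prefectures is an assumption derived from 8192 / (256 x 8), not a figure taken from the text.

```c
#include <assert.h>

/* Sizes read from Section 3.1. NUM_PREFECTURES is an assumption derived
   from 8192 threads / (256 PEs per Prefecture * 8 threads per PE). */
enum {
    THREADS_PER_PE        = 8,
    PES_PER_VILLAGE       = 4,
    VILLAGES_PER_CITY     = 4,
    CITIES_PER_PREFECTURE = 16,
    NUM_PREFECTURES       = 4
};

/* Hypothetical coordinate of one hardware thread in the hierarchy. */
typedef struct {
    int prefecture, city, village, pe, thread;
} pezy_coord;

int pes_per_city(void)       { return PES_PER_VILLAGE * VILLAGES_PER_CITY; }
int pes_per_prefecture(void) { return pes_per_city() * CITIES_PER_PREFECTURE; }
int total_threads(void)      { return pes_per_prefecture() * NUM_PREFECTURES * THREADS_PER_PE; }

/* Decompose a flat thread number, assuming threads are numbered
   thread-minor, then PE, Village, City, Prefecture. */
pezy_coord locate_thread(int flat)
{
    assert(flat >= 0 && flat < total_threads());
    pezy_coord c;
    c.thread = flat % THREADS_PER_PE;        flat /= THREADS_PER_PE;
    c.pe = flat % PES_PER_VILLAGE;           flat /= PES_PER_VILLAGE;
    c.village = flat % VILLAGES_PER_CITY;    flat /= VILLAGES_PER_CITY;
    c.city = flat % CITIES_PER_PREFECTURE;   flat /= CITIES_PER_PREFECTURE;
    c.prefecture = flat;
    return c;
}
```

Calling `pes_per_city()` and `pes_per_prefecture()` reproduces the 16 and 256 shown in Figure 1, and `locate_thread()` shows which Village, City, and Prefecture a given thread would share caches with under the assumed numbering.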
City (128) OpenCL 3 PZCL C/C++ OpenCL OpenCL kernel global local PZCL pzc

PZCL OpenCL ID ID
get_pid() and get_tid() correspond to OpenCL get_group_id(0) and get_local_id(0).
get_maxpid() and get_maxtid() correspond to OpenCL get_num_groups(0) and get_local_size(0).

PZCL chgthread() sync() flush()
chgthread()
sync(): sync_L1() Village, sync_L2() City, sync_L3() Prefecture
flush(): flush_L1() Village L1, flush_L2() City L1, L2

4. PEZY-SC Omni OpenACC
PEZY-SC Omni

4.1
C/C++/Fortran C

[Figure 3: Omni Compiler. C with OpenACC, OpenACC translator, C with ACC API call, PZCL kernel, C compiler, PZCL compiler, Runtime Library, Execution file, Kernel binary (load at runtime)]

PZCL PZCL PEZY-SC PZCL CPU GPU PEZY-SC PZCL

4.2
C Fortran95 Omni Compiler Infrastructure[13] 3 C translator C PZCL C PZCL 2 Omni runtime library
data data 4 data data a b copy a copyout b _ACC_init_data lower length DEV_ADDR_name
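The correspondence between the PZCL ID functions and the OpenCL work-item functions listed in Section 3.2 can be illustrated with a small host-side sketch. It does not call PZCL or OpenCL; the struct `thread_id` and both functions are hypothetical stand-ins that hold the values `get_pid()`, `get_tid()`, `get_maxpid()`, and `get_maxtid()` would return. The flat index uses the same formula OpenCL uses for `get_global_id(0)` with a zero global offset (group ID times local size plus local ID).

```c
#include <assert.h>

/* Hypothetical stand-in for the four PZCL ID values of one thread. */
typedef struct {
    int pid;     /* get_pid():    analogous to get_group_id(0)   */
    int tid;     /* get_tid():    analogous to get_local_id(0)   */
    int maxpid;  /* get_maxpid(): analogous to get_num_groups(0) */
    int maxtid;  /* get_maxtid(): analogous to get_local_size(0) */
} thread_id;

/* Flat index of a thread: group * local_size + local. */
int flat_thread_id(thread_id t)
{
    assert(t.pid >= 0 && t.pid < t.maxpid);
    assert(t.tid >= 0 && t.tid < t.maxtid);
    return t.pid * t.maxtid + t.tid;
}

/* Total number of threads launched: groups * local_size. */
int flat_thread_count(thread_id t)
{
    return t.maxpid * t.maxtid;
}
```

With `maxpid = 1024` and `maxtid = 8`, the thread with `pid = 3` and `tid = 5` gets flat index 29 out of 8192, which is the kind of index a kernel would use to pick its share of a loop.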
    int a[100], b;
    #pragma acc data copy(a) copyout(b)
    {
        /* some codes using a and b */
    }
(a)

    int a[100], b;
    void *DESC_a, *DEV_ADDR_a, *DESC_b, *DEV_ADDR_b;
    unsigned long long _lower[] = {0};
    unsigned long long _length[] = {100};
    _ACC_init_data(&(DESC_a), &(DEV_ADDR_a), a, sizeof(int), 1, _lower, _length);
    _ACC_init_data(&(DESC_b), &(DEV_ADDR_b), &(b), sizeof(int), 0, NULL, NULL);
    _ACC_copy_data(DESC_a, _ACC_HOST_TO_DEVICE, _ACC_ASYNC_SYNC);
    /* some codes using a and b */
    _ACC_copy_data(DESC_a, _ACC_DEVICE_TO_HOST, _ACC_ASYNC_SYNC);
    _ACC_copy_data(DESC_b, _ACC_DEVICE_TO_HOST, _ACC_ASYNC_SYNC);
    _ACC_finalize_data(DESC_a);
    _ACC_finalize_data(DESC_b);
(b)
Figure 4: data

DESC_name _ACC_copy_data _ACC_finalize_data

4.3 parallel
parallel gang, worker, vector 3 PZCL gang vector firstprivate private

    #pragma acc parallel present(a) num_gangs(16)
    {
        /* codes in parallel region */
    }
(a)

    /* host code */
    int _ACC_ngangs = 16;
    int _ACC_nworkers = 1;
    int _ACC_veclen = 8;
    int _ACC_conf[] = {_ACC_ngangs, _ACC_nworkers, _ACC_veclen};
    void* _ACC_args[] = {&DEV_ADDR_a};
    size_t _ACC_argsizes[] = {sizeof(void*)};
    _ACC_launch(_ACC_program, 0, _ACC_conf, _ACC_ASYNC_SYNC, 1, _ACC_args, _ACC_argsizes);

    /* kernel function in device code */
    void pzc_ACC_kernel_0(int *a)
    {
        /* codes in parallel region */
    }
(b)
Figure 5: parallel

num_gangs
Figure 5 parallel pzc_ACC_kernel_0 gang _ACC_args _ACC_argsizes _ACC_launch 1 _ACC_program cl_program cl_kernel 2 _ACC_launch clEnqueueNDRangeKernel PEZY-SC

loop
loop for gang vector cyclic loop reduction
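The sequence of runtime calls in Figure 4 (b), that is initialize, copy in, compute, copy out, finalize, can be mimicked on the host alone. The sketch below is not the Omni runtime: every `mock_` name is hypothetical, the "device" buffer is ordinary heap memory, and the computation inside the region is an arbitrary stand-in. It only illustrates what `copy(a) copyout(b)` implies for data movement.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor pairing a host array with its "device" copy. */
typedef struct {
    void  *host;
    void  *dev;    /* plain heap memory standing in for device memory */
    size_t bytes;
} mock_desc;

typedef enum { MOCK_HOST_TO_DEVICE, MOCK_DEVICE_TO_HOST } mock_dir;

/* Allocate the device-side buffer for a host array (role of _ACC_init_data). */
mock_desc *mock_init_data(void *host, size_t elem_size, size_t length)
{
    mock_desc *d = malloc(sizeof *d);
    if (d == NULL) return NULL;
    d->host  = host;
    d->bytes = elem_size * length;
    d->dev   = malloc(d->bytes);
    if (d->dev == NULL) { free(d); return NULL; }
    return d;
}

/* Transfer the whole array in the given direction (role of _ACC_copy_data). */
void mock_copy_data(mock_desc *d, mock_dir dir)
{
    assert(d != NULL);
    if (dir == MOCK_HOST_TO_DEVICE) memcpy(d->dev, d->host, d->bytes);
    else                            memcpy(d->host, d->dev, d->bytes);
}

/* Release the device-side buffer (role of _ACC_finalize_data). */
void mock_finalize_data(mock_desc *d)
{
    if (d != NULL) { free(d->dev); free(d); }
}

/* Emulates "#pragma acc data copy(a) copyout(b)" around a toy computation. */
int run_data_region(int *a, size_t n, int *b)
{
    mock_desc *da = mock_init_data(a, sizeof(int), n);
    mock_desc *db = mock_init_data(b, sizeof(int), 1);
    if (da == NULL || db == NULL) {
        mock_finalize_data(da);
        mock_finalize_data(db);
        return -1;
    }
    mock_copy_data(da, MOCK_HOST_TO_DEVICE);  /* copy(a): copied in at entry; b is not */

    int *dev_a = da->dev, *dev_b = db->dev;   /* stand-in device computation */
    int sum = 0;
    for (size_t i = 0; i < n; i++) { dev_a[i] += 1; sum += dev_a[i]; }
    *dev_b = sum;

    mock_copy_data(da, MOCK_DEVICE_TO_HOST);  /* copy(a): copied out at exit */
    mock_copy_data(db, MOCK_DEVICE_TO_HOST);  /* copyout(b): copied out at exit */
    mock_finalize_data(da);
    mock_finalize_data(db);
    return 0;
}
```

Note that `b` is never transferred to the device at region entry, which matches the absence of a host-to-device `_ACC_copy_data` call for `DESC_b` in Figure 4 (b).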
    /* inside parallel region */
    #pragma acc loop vector reduction(+:sum)
    for(i = 0; i < N; i++){
        a[i]++;
        sum += a[i];
    }
(a)

    /* inside kernel function */
    int _niter_i, _idx, _init, _cond, _step, _red_sum;
    _ACC_init_reduction_var(&_red_sum, 0);
    _ACC_calc_niter(&_niter_i, 0, N, 1);
    _ACC_init_thread_iter(&_init, &_cond, &_step, _niter_i);
    for(_idx = _init; _idx < _cond; _idx += _step){
        int i;
        _ACC_calc_idx(_idx, &i, 0, N, 1);
        a[i]++;
        _red_sum += a[i];
    }
    _ACC_reduction_thread(sum, _red_sum, 0);
(b)
Figure 6: loop

Figure 6 loop _ACC_calc_niter _ACC_init_thread_iter _ACC_calc_idx _ACC_init_reduction_var _ACC_reduction_thread

5.
PZCL N-Body NPB CG N-Body NPB CG

5.1
KEK Suiren Blue 1 N-Body 7 PZCL PZCL (merged kernel) 2 1 PZCL (merged kernel, chgthread)

Table 1: (Suiren Blue)
CPU          Intel Xeon Lv3 2.3GHz
Memory       DDR4 1866MHz, 64GB
Accelerator  PEZY-SC (DDR4 1866MHz 16GB)
Compiler     ICC , PZSDK 2.1, Omni OpenACC compiler for PEZY-SC

chgthread() PZCL % chgthread() N-Body
NPB CG 8 Mop/s (Mega Operations Per Second) 1 PZCL PZCL (merged kernel) conj_grad 7 1 PZCL (merged kernel, chgthread) chgthread() PZCL % CG 0 Class B PZCL %
parallel parallel kernels 2 parallel 1 parallel parallel kernels 1 PZCL sync() kernels 1 kernels GPU PZCL
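The loop translation shown in Figure 6 (b) relies on two ideas: each thread walks the iteration space with its own start and step (the cyclic distribution mentioned for the loop directive), and each thread accumulates into a private variable that is combined at the end. The sketch below emulates this sequentially on the host. It is an assumption about what `_ACC_init_thread_iter` and `_ACC_reduction_thread` compute, not their actual implementation, and `iter_range`, `cyclic_range`, and `emulate_vector_loop` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical per-thread loop bounds, in the spirit of _init/_cond/_step. */
typedef struct {
    int init, cond, step;
} iter_range;

/* Cyclic distribution: thread t handles iterations t, t+nthreads, t+2*nthreads, ... */
iter_range cyclic_range(int thread, int nthreads, int niter)
{
    assert(nthreads > 0 && thread >= 0 && thread < nthreads);
    iter_range r = { thread, niter, nthreads };
    return r;
}

/* Sequential emulation of the loop in Figure 6: increments every a[i] and
   returns the sum, computed through per-thread partial sums. */
int emulate_vector_loop(int *a, int n, int nthreads)
{
    int sum = 0;
    for (int t = 0; t < nthreads; t++) {
        int red = 0;                               /* private partial sum */
        iter_range r = cyclic_range(t, nthreads, n);
        for (int idx = r.init; idx < r.cond; idx += r.step) {
            int i = 0 + idx * 1;                   /* lower bound 0, stride 1 */
            a[i]++;
            red += a[i];
        }
        sum += red;                                /* combine partial results */
    }
    return sum;
}
```

Because every iteration index is visited by exactly one emulated thread, the result equals that of the original sequential loop for any thread count; on real hardware the final combination step would additionally require synchronization between threads.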
[Figure 7: N-Body. Y axis: execution time (s). X axis: number of particles (16K, 32K, 64K, 128K, 256K, 512K, 1024K). Series: PZCL, PZCL (merged kernel), PZCL (merged kernel, chgthread)]

[Table 2: N-Body, NPB CG. PZCL, PZCL (merged kernel), PZCL (merged kernel, chgthread). Surviving values: 114 (5), 447 (25)]

N-Body 48% NPB CG 45% PZCL

6.

[Figure 8: NPB CG. Y axis: Mop/s. X axis: Class (matrix size): A (14000), B (75000), C (150000). Series: PZCL, PZCL (merged kernel), PZCL (merged kernel, chgthread)]

chgthread() PZCL PZCL 11 38% CG chgthread() PZCL % PZCL kernels chgthread()

5.2
PZCL API (SLOC) N-Body NPB CG 2 PZCL N-Body
PEZY-SC PEZY-SC NVIDIA GPU CUDA Omni C PEZY-SC PZCL N-Body PZCL 98% NPB CG PZCL 88% PZCL N-Body 48% NPB CG 45% kernels chgthread()

[1] The Green500.
[2] Khronos Group. OpenCL - The open standard for parallel programming of heterogeneous systems.
[3] OpenACC-Standard.org. Home.
[4] Akihiro Tabuchi, Masahiro Nakao, and Mitsuhisa Sato. A source-to-source OpenACC compiler for CUDA. In Euro-Par Workshops.
[5] NVIDIA, home_new.html. Parallel Programming and Computing Platform CUDA.
[6] NASA Advanced Supercomputing Division, http:// NAS Parallel Benchmarks.
[7] Suiren. No. 11, Dec.
[8] Ruymán Reyes, Iván López-Rodríguez, Juan J. Fumero, and Francisco de Sande. accULL: An OpenACC implementation with CUDA and OpenCL support. In Euro-Par 2012 Parallel Processing, Lecture Notes in Computer Science. Springer Berlin Heidelberg.
[9] Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and Barbara Chapman. Compiling a high-level directive-based programming model for GPGPUs. In Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science. Springer International Publishing.
[10] Seyong Lee and Jeffrey S. Vetter. OpenARC: Open accelerator research compiler for directive-based, efficient heterogeneous computing. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC '14, New York, NY, USA. ACM.
[11] University of Delaware and LLNL, org/. RoseACC.
[12] GCC. OpenACC - GCC Wiki.
[13] RIKEN AICS and University of Tsukuba, omni-compiler.org. Omni Compiler Project.
Omni Compiler and XcodeML: An Infrastructure for Source-to- Source Transformation
http://omni compiler.org/ Omni Compiler and XcodeML: An Infrastructure for Source-to- Source Transformation MS03 Code Generation Techniques for HPC Earth Science Applications Mitsuhisa Sato (RIKEN / Advanced
More informationCompiling a High-level Directive-Based Programming Model for GPGPUs
Compiling a High-level Directive-Based Programming Model for GPGPUs Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and Barbara Chapman Department of Computer Science, University
More informationLecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators
Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu
More informationAn Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters
An Extension of XcalableMP PGAS Lanaguage for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction
More informationGPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3
/CPU,a),2,2 2,2 Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 XMP XMP-dev CPU XMP-dev/StarPU XMP-dev XMP CPU StarPU CPU /CPU XMP-dev/StarPU N /CPU CPU. Graphics Processing Unit GP General-Purpose
More informationHPC Challenge Awards 2010 Class2 XcalableMP Submission
HPC Challenge Awards 2010 Class2 XcalableMP Submission Jinpil Lee, Masahiro Nakao, Mitsuhisa Sato University of Tsukuba Submission Overview XcalableMP Language and model, proposed by XMP spec WG Fortran
More informationOpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware
OpenACC Standard Directives for Accelerators Credits http://www.openacc.org/ o V1.0: November 2011 Specification OpenACC, Directives for Accelerators, Nvidia Slideware CAPS OpenACC Compiler, HMPP Workbench
More informationObjective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC
GPU Teaching Kit Accelerated Computing OpenACC Introduction to OpenACC Objective To understand the OpenACC programming model basic concepts and pragma types simple examples 2 2 OpenACC The OpenACC Application
More informationThe Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer
The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer William Killian Tom Scogland, Adam Kunen John Cavazos Millersville University of Pennsylvania
More informationOpenACC 2.6 Proposed Features
OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively
More informationGetting Started with Directive-based Acceleration: OpenACC
Getting Started with Directive-based Acceleration: OpenACC Ahmad Lashgar Member of High-Performance Computing Research Laboratory, School of Computer Science Institute for Research in Fundamental Sciences
More informationOpenACC. Part 2. Ned Nedialkov. McMaster University Canada. CS/SE 4F03 March 2016
OpenACC. Part 2 Ned Nedialkov McMaster University Canada CS/SE 4F03 March 2016 Outline parallel construct Gang loop Worker loop Vector loop kernels construct kernels vs. parallel Data directives c 2013
More informationGPU. OpenMP. OMPCUDA OpenMP. forall. Omni CUDA 3) Global Memory OMPCUDA. GPU Thread. Block GPU Thread. Vol.2012-HPC-133 No.
GPU CUDA OpenMP 1 2 3 1 1 OpenMP CUDA OM- PCUDA OMPCUDA GPU CUDA CUDA 1. GPU GPGPU 1)2) GPGPU CUDA 3) CPU CUDA GPGPU CPU GPU OpenMP GPU CUDA OMPCUDA 4)5) OMPCUDA GPU OpenMP GPU CUDA OMPCUDA/MG 2 GPU OMPCUDA
More informationIs OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels
National Aeronautics and Space Administration Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels Jose M. Monsalve Diaz (UDEL), Gabriele Jost (NASA), Sunita Chandrasekaran
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationEarly Experiences With The OpenMP Accelerator Model
Early Experiences With The OpenMP Accelerator Model Chunhua Liao 1, Yonghong Yan 2, Bronis R. de Supinski 1, Daniel J. Quinlan 1 and Barbara Chapman 2 1 Center for Applied Scientific Computing, Lawrence
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationCOMP Parallel Computing. Programming Accelerators using Directives
COMP 633 - Parallel Computing Lecture 15 October 30, 2018 Programming Accelerators using Directives Credits: Introduction to OpenACC and toolkit Jeff Larkin, Nvidia COMP 633 - Prins Directives for Accelerator
More informationResearch Article Multi-GPU Support on Single Node Using Directive-Based Programming Model
Scientific Programming Volume 2015, Article ID 621730, 15 pages http://dx.doi.org/10.1155/2015/621730 Research Article Multi-GPU Support on Single Node Using Directive-Based Programming Model Rengan Xu,
More informationJCudaMP: OpenMP/Java on CUDA
JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems
More informationC PGAS XcalableMP(XMP) Unified Parallel
PGAS XcalableMP Unified Parallel C 1 2 1, 2 1, 2, 3 C PGAS XcalableMP(XMP) Unified Parallel C(UPC) XMP UPC XMP UPC 1 Berkeley UPC GASNet 1. MPI MPI 1 Center for Computational Sciences, University of Tsukuba
More informationA Comparative Study of OpenACC Implementations
A Comparative Study of OpenACC Implementations Ruymán Reyes, Iván López, Juan J. Fumero and Francisco de Sande 1 Abstract GPUs and other accelerators are available on many different devices, while GPGPU
More informationOmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel
www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es Exascale in BSC Marenostrum 4 (13.7 Petaflops ) General purpose cluster (3400
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationProfiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015
Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationGPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler
GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs
More informationComparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015
Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Abstract As both an OpenMP and OpenACC insider I will present my opinion of the current status of these two directive sets for programming
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2016
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2016 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationOpenACC Support in Score-P and Vampir
Center for Information Services and High Performance Computing (ZIH) OpenACC Support in Score-P and Vampir Hands-On for the Taurus GPU Cluster February 2016 Robert Dietrich (robert.dietrich@tu-dresden.de)
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationAn OpenACC construct is an OpenACC directive and, if applicable, the immediately following statement, loop or structured block.
API 2.6 R EF ER ENC E G U I D E The OpenACC API 2.6 The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran
More informationpage migration Implementation and Evaluation of Dynamic Load Balancing Using Runtime Performance Monitoring on Omni/SCASH
Omni/SCASH 1 2 3 4 heterogeneity Omni/SCASH page migration Implementation and Evaluation of Dynamic Load Balancing Using Runtime Performance Monitoring on Omni/SCASH Yoshiaki Sakae, 1 Satoshi Matsuoka,
More informationA Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function
A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao
More informationOpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa
OpenMP on the FDSM software distributed shared memory Hiroya Matsuba Yutaka Ishikawa 1 2 Software DSM OpenMP programs usually run on the shared memory computers OpenMP programs work on the distributed
More informationA ROSE-based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries
A ROSE-based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries Chunhua Liao, Daniel J. Quinlan, Thomas Panas and Bronis R. de Supinski Center for Applied Scientific Computing Lawrence
More informationOpenACC Course Lecture 1: Introduction to OpenACC September 2015
OpenACC Course Lecture 1: Introduction to OpenACC September 2015 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15:
More informationAn Introduction to OpenACC. Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel
An Introduction to OpenACC Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel Chapter 1 Introduction OpenACC is a software accelerator that uses the host and the device. It uses compiler
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2017
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2017 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationDesign Decisions for a Source-2-Source Compiler
Design Decisions for a Source-2-Source Compiler Roger Ferrer, Sara Royuela, Diego Caballero, Alejandro Duran, Xavier Martorell and Eduard Ayguadé Barcelona Supercomputing Center and Universitat Politècnica
More informationIntroduction to OpenACC. 16 May 2013
Introduction to OpenACC 16 May 2013 GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers Oil & Gas CAE CFD Finance Rendering Data Analytics
More informationMasahiro Nakao, Hitoshi Murai, Takenori Shimosaka, Mitsuhisa Sato
Masahiro Nakao, Hitoshi Murai, Takenori Shimosaka, Mitsuhisa Sato Center for Computational Sciences, University of Tsukuba, Japan RIKEN Advanced Institute for Computational Science, Japan 2 XMP/C int array[16];
More informationS Comparing OpenACC 2.5 and OpenMP 4.5
April 4-7, 2016 Silicon Valley S6410 - Comparing OpenACC 2.5 and OpenMP 4.5 James Beyer, NVIDIA Jeff Larkin, NVIDIA GTC16 April 7, 2016 History of OpenMP & OpenACC AGENDA Philosophical Differences Technical
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More informationDirective-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs
Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs Jacob Lambert University of Oregon jlambert@cs.uoregon.edu Advisor: Allen D. Malony University of Oregon
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2018
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2018 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationOpenACC2 vs.openmp4. James Lin 1,2 and Satoshi Matsuoka 2
2014@San Jose Shanghai Jiao Tong University Tokyo Institute of Technology OpenACC2 vs.openmp4 he Strong, the Weak, and the Missing to Develop Performance Portable Applica>ons on GPU and Xeon Phi James
More informationEarly Experiences with the OpenMP Accelerator Model
Early Experiences with the OpenMP Accelerator Model Canberra, Australia, IWOMP 2013, Sep. 17th * University of Houston LLNL-PRES- 642558 This work was performed under the auspices of the U.S. Department
More informationTowards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA
Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle,
More informationIs OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels
National Aeronautics and Space Administration Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels Jose M. Monsalve Diaz (UDEL), Gabriele Jost (NASA), Sunita Chandrasekaran
More informationOpenACC and the Cray Compilation Environment James Beyer PhD
OpenACC and the Cray Compilation Environment James Beyer PhD Agenda A brief introduction to OpenACC Cray Programming Environment (PE) Cray Compilation Environment, CCE An in depth look at CCE 8.2 and OpenACC
More informationLecture 4: OpenMP Open Multi-Processing
CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP
More informationAn Introduction to OpenAcc
An Introduction to OpenAcc ECS 158 Final Project Robert Gonzales Matthew Martin Nile Mittow Ryan Rasmuss Spring 2016 1 Introduction: What is OpenAcc? OpenAcc stands for Open Accelerators. Developed by
More informationINTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC. Jeff Larkin, NVIDIA Developer Technologies
INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC Jeff Larkin, NVIDIA Developer Technologies AGENDA Accelerated Computing Basics What are Compiler Directives? Accelerating Applications with OpenACC Identifying
More informationOpenACC Fundamentals. Steve Abbott November 15, 2017
OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)
More informationA Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA
A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région
More informationIntroduction to OpenACC
Introduction to OpenACC Alexander Fu, David Lin, Russell Miller June 2016 Taking advantage of the processing power of the GPU is what makes CUDA relevant. However, using CUDA and constraining oneself to
More informationA cache-aware performance prediction framework for GPGPU computations
A cache-aware performance prediction framework for GPGPU computations The 8th Workshop on UnConventional High Performance Computing 215 Alexander Pöppl, Alexander Herz August 24th, 215 UCHPC 215, August
More informationCS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)
CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming
More informationMPI_Send(a,..., MPI_COMM_WORLD); MPI_Recv(a,..., MPI_COMM_WORLD, &status);
$ $ 2 global void kernel(int a[max], int llimit, int ulimit) {... } : int main(int argc, char *argv[]){ MPI_Int(&argc, &argc); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size);
More informationProgramming Environment Research Team
Chapter 2 Programming Environment Research Team 2.1 Members Mitsuhisa Sato (Team Leader) Hitoshi Murai (Research Scientist) Miwako Tsuji (Research Scientist) Masahiro Nakao (Research Scientist) Jinpil
More informationIdentification and Elimination of the Overhead of Accelerate with a Super-resolution Application
Regular Paper Identification and Elimination of the Overhead of Accelerate with a Super-resolution Application Izumi Asakura 1,a) Hidehiko Masuhara 1 Takuya Matsumoto 2 Kiminori Matsuzaki 3 Received: April
More informationINTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC
INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC DR. CHRISTOPH ANGERER, NVIDIA *) THANKS TO JEFF LARKIN, NVIDIA, FOR THE SLIDES 3 APPROACHES TO GPU PROGRAMMING Applications Libraries Compiler Directives
More informationOpenCL TM & OpenMP Offload on Sitara TM AM57x Processors
OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast
More informationRuntime Address Space Computation for SDSM Systems
Runtime Address Space Computation for SDSM Systems Jairo Balart Outline Introduction Inspector/executor model Implementation Evaluation Conclusions & future work 2 Outline Introduction Inspector/executor
More informationIntroduction to OpenACC. Shaohao Chen Research Computing Services Information Services and Technology Boston University
Introduction to OpenACC Shaohao Chen Research Computing Services Information Services and Technology Boston University Outline Introduction to GPU and OpenACC Basic syntax and the first OpenACC program:
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics
More informationLLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS Programs nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15 Akihiro Hayashi (Rice University) Jisheng Zhao (Rice University) Michael Ferguson
More informationIntroduction to GPU (Graphics Processing Unit) Architecture & Programming
Introduction to GU (Graphics rocessing Unit) Architecture & rogramming C240A. 2017 T. Yang ome of slides are from M. Hall of Utah C6235 Overview Hardware architecture rogramming model Example Historical
More informationParallelism III. MPI, Vectorization, OpenACC, OpenCL. John Cavazos,Tristan Vanderbruggen, and Will Killian
Parallelism III MPI, Vectorization, OpenACC, OpenCL John Cavazos,Tristan Vanderbruggen, and Will Killian Dept of Computer & Information Sciences University of Delaware 1 Lecture Overview Introduction MPI
More informationGPU programming made easier
GPU programming made easier Jacob Jepsen 6. June 2014 University of Copenhagen Department of Computer Science 6. June 2014 Introduction We created a tool that reduces the development time of GPU code.
More informationOpenMP 4.0/4.5: New Features and Protocols. Jemmy Hu
OpenMP 4.0/4.5: New Features and Protocols Jemmy Hu SHARCNET HPC Consultant University of Waterloo May 10, 2017 General Interest Seminar Outline OpenMP overview Task constructs in OpenMP SIMP constructs
More informationA COMPILER OPTIMIZATION FRAMEWORK FOR DIRECTIVE-BASED GPU COMPUTING
A COMPILER OPTIMIZATION FRAMEWORK FOR DIRECTIVE-BASED GPU COMPUTING A Dissertation Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements