A Comparison of the PGAS Languages XcalableMP (XMP) and Unified Parallel C (UPC)


1 Center for Computational Sciences, University of Tsukuba
2 Graduate School of Systems and Information Engineering, University of Tsukuba
3 RIKEN Advanced Institute for Computational Science

Abstract: XcalableMP (XMP) and Unified Parallel C (UPC) are PGAS extensions of the C language. This paper compares XMP and UPC in terms of language features and performance. The UPC implementation evaluated is Berkeley UPC, which runs on the GASNet communication layer.

1. Introduction

Parallel programs for distributed-memory systems are most commonly written with MPI, but MPI programming places a heavy burden on the programmer. The Partitioned Global Address Space (PGAS) model reduces this burden by presenting a global view of data that is physically distributed, while still exposing data locality. Like MPI programs, PGAS programs follow the Single Program Multiple Data (SPMD) execution model. XcalableMP (XMP) 1),2) and Unified Parallel C (UPC) 3) are PGAS extensions of C; this paper compares the two. Section 2 introduces the PGAS model, XMP, and UPC. Section 3 compares their language features. Section 4 evaluates their performance with read/write microbenchmarks on a global array, a Laplace solver, and the Conjugate Gradient (CG) kernel of the NAS Parallel Benchmarks (NPB) 4). Section 5 concludes.

2. Partitioned Global Address Space

2.1 The PGAS Model

In the PGAS model, each process owns one partition of a single, global address space. A process can access data in remote partitions, but accesses to its own partition are the fastest, so the model exposes locality to the programmer.

An instance of execution is called a node in XMP and a thread in UPC; in the implementations evaluated here, both correspond to MPI processes.

2.2 XcalableMP

XcalableMP (XMP) is a directive-based PGAS language designed by the XcalableMP Specification Working Group of the e-Science project 5). XMP inherits many ideas from High Performance Fortran (HPF) 7); however, where HPF leaves communication to compiler analysis, XMP makes all communication explicit through directives. XMP extends both C and Fortran; this paper uses the C version.

Data distribution in XMP is described with a template, a virtual index space onto which both nodes and arrays are mapped (Fig. 1):

    #pragma xmp template t(0:N-1)
    #pragma xmp nodes p(4)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align a[i] with t(i)

The template t spans indexes 0 to N-1 and is distributed block-wise onto four nodes, so that node 1 owns indexes 0 to N/4-1, node 2 owns N/4 to N/2-1, and so on; the array a[] is then aligned with t, giving it the same distribution.

Fig. 1 Conceptual diagram of template (XMP)

The gmove directive expresses communication as a global array assignment. In Fig. 2, elements N/2 to N-1 of a2[] are copied into elements 0 to N/2-1 of a1[]; the compiler generates the necessary communication. The array-section notation follows Fortran:

    #pragma xmp gmove
    a1[0:N/2-1] = a2[N/2:N-1];

Fig. 2 Example of gmove directive (XMP)

The loop directive parallelizes a for loop: each node executes only the iterations whose template indexes it owns (Fig. 3):

    #pragma xmp loop on t(i)
    for (i = 0; i < N; i++) {
        a[i] = func(i);
    }

Fig. 3 Example of loop directive (XMP)

2.3 Unified Parallel C

Unified Parallel C (UPC) is a PGAS extension of C specified by the UPC Consortium 6). Shared data are declared with the shared qualifier (Fig. 4). In the declaration below, the elements of a1[] and a2[] are distributed cyclically over the threads, because the default block size is 1; a block distribution is obtained by giving an explicit block size, as in shared [10] double a[100];. The library function upc_memcpy() copies 100*sizeof(double) bytes of shared data from a2 to a1; upc_memget() and upc_memput() transfer data between shared and private memory.

    shared double a1[100], a2[100];
    upc_memcpy(a1, a2, 100*sizeof(double));

Fig. 4 How to declare and transfer shared data (UPC)

A for loop is parallelized with the upc_forall construct (Fig. 5). Its fourth argument is the affinity expression: each iteration is executed by the thread that has affinity to the given shared address.

    upc_forall (i = 0; i < N; i++; &a[i]) {
        a[i] = func(i);
    }

Fig. 5 Example of upc_forall (UPC)
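To show how the pieces of Sections 2.2 and 2.3 fit together in practice, the following is a minimal, self-contained XMP/C sketch (our illustration, not code from the paper): it distributes a[] over four nodes, fills it in parallel with the loop directive, and sums it with a reduction clause. N and func() are placeholders.

    #include <stdio.h>
    #define N 1000

    int a[N];
    #pragma xmp nodes p(4)
    #pragma xmp template t(0:N-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align a[i] with t(i)

    /* Placeholder computation. */
    static int func(int i) { return 2 * i; }

    int main(void)
    {
        int i, sum = 0;

        /* Each node executes only the iterations whose template
           indexes it owns (cf. Fig. 3). */
    #pragma xmp loop on t(i)
        for (i = 0; i < N; i++)
            a[i] = func(i);

        /* The reduction clause combines the node-local partial
           sums into a global sum available on every node. */
    #pragma xmp loop on t(i) reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %d\n", sum);
        return 0;
    }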

3. Comparison of XcalableMP and Unified Parallel C

3.1 Data Distribution

Both XMP and UPC support cyclic, block, and block-cyclic distributions; XMP additionally supports gblock, a generalized block distribution with per-node block sizes. XMP can also distribute an array in more than one dimension. Fig. 6 distributes a two-dimensional array block-wise in one dimension and cyclically in the other over a 2 x 2 node grid; Table 1 lists the indexes each process then owns.

    #pragma xmp nodes p(2, 2)
    #pragma xmp template t(0:9, 0:9)
    #pragma xmp distribute t(block, cyclic) onto p
    int a[10][10];
    #pragma xmp align a[i][j] with t(j, i)

Fig. 6 Example of distribution of two-dimensional array (XMP)

Table 1 Indexes of each process in Fig. 6

    Process   1st indexes of a[][]   2nd indexes of a[][]
    p(1,1)    0, 1, 2, 3, 4          0, 2, 4, 6, 8
    p(2,1)    5, 6, 7, 8, 9          0, 2, 4, 6, 8
    p(1,2)    0, 1, 2, 3, 4          1, 3, 5, 7, 9
    p(2,2)    5, 6, 7, 8, 9          1, 3, 5, 7, 9

UPC, by contrast, distributes a shared array in only one dimension, over its row-major linearization; UPC also provides upc_alloc() for allocating shared memory dynamically.

3.2 Communication and Synchronization

In UPC, a remote element of a shared array can be read or written by an ordinary expression, so communication is implicit in ordinary code. Each shared access has one of two consistency modes, strict or relaxed; relaxed accesses may be reordered and optimized by the compiler and runtime. In XMP, communication occurs only where a directive such as gmove (Fig. 2) specifies it, so all communication is explicit in the program text.

3.3 Co-arrays

XMP also provides a co-array feature modeled on Co-Array Fortran (CAF) 8). In Fig. 7, elements 2 to 4 of x[] on node 3 are copied into elements 1 to 3 of the local array y[]. Where CAF declares co-arrays with a codimension, XMP uses the coarray directive; UPC has no equivalent feature.

    #pragma xmp coarray
    y[1:3] = x[2:4]:[3];

Fig. 7 Example of co-array function (XMP)
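To make the one-dimensional restriction of Section 3.1 concrete, here is a small UPC sketch (our illustration, not code from the paper). It assumes Berkeley UPC compiled with a fixed thread count of 4 (e.g. upcc -T 4) and prints which thread owns each element of a 10 x 10 shared array declared with block size 25; the owners form stripes of the row-major linearization, not the 2-D tiles of Fig. 6.

    #include <upc.h>
    #include <stdio.h>

    #define N 10

    /* The array is linearized row-major and split into blocks of 25
       consecutive elements, one block per thread: thread 0 gets rows
       0-1 plus half of row 2, and so on.  The (block, cyclic) mapping
       of Fig. 6 cannot be expressed this way. */
    shared [25] double a[N][N];

    int main(void)
    {
        int i, j;

        if (MYTHREAD == 0)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    /* upc_threadof() reports which thread has
                       affinity to a shared object. */
                    printf("a[%d][%d] -> thread %d\n",
                           i, j, (int)upc_threadof(&a[i][j]));
        return 0;
    }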

4. Performance Evaluation

4.1 Experimental Setup

For XMP we use the Omni XMP Compiler 9) version 0.5.3 (hereafter TXMP), and for UPC we use Berkeley UPC 10) (hereafter BUPC), developed at Lawrence Berkeley National Laboratory and UC Berkeley. TXMP translates an XMP program into C with MPI calls, while BUPC runs on the GASNet communication layer 11),12). All experiments were run on the T2K Tsukuba System; Table 2 gives the specification of each node. BUPC uses GASNet's InfiniBand conduit (ibv). BUPC programs were compiled with -O3 --param max-inline-insns-single=35000 --param inline-unit-growth=10000 --param large-function-growth=

Table 2 Specifications of each node on experimental environment

    CPU        AMD Opteron Quad-Core 8000 series, 2.3 GHz (4 sockets)
    Memory     DDR2 667 MHz, 32 GB
    Network    InfiniBand DDR (4 rails), 8 GB/s
    OS         Linux
    Compiler   gcc
    MPI        mvapich2-1.7a

4.2 Read/Write Performance on a Global Array

This benchmark reads and writes a global array of 2^20 doubles under Block and Cyclic distributions. The TXMP version uses the loop directive (Fig. 3) and the BUPC version uses upc_forall (Fig. 5). Fig. 8 shows the measured access speed, where "Native" denotes a sequential version compiled with gcc.

Fig. 8 Access speed in global region

Under the Block distribution, TXMP comes close to Native speed: like an HPF compiler 13), the XMP translator converts the distributed loop into accesses to an ordinary local array. BUPC is slower than Native under the Block distribution and slower still under Cyclic, because every access to a shared array goes through the runtime's shared-pointer arithmetic 14); performance drops for both implementations under the Cyclic distribution.
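For reference, the BUPC side of such a read/write benchmark can be sketched as follows (our reconstruction under stated assumptions, not the paper's code): a shared array of 2^20 doubles is written with upc_forall and the loop is timed. The default block size of 1 gives the Cyclic case; shared [N/THREADS] would give the Block case. A fixed thread count at compile time is assumed so that the declaration is legal.

    #include <upc.h>
    #include <upc_relaxed.h>   /* relaxed consistency, cf. Section 3.2 */
    #include <stdio.h>
    #include <sys/time.h>

    #define N (1 << 20)

    /* Block size 1 (cyclic layout): element i lives on thread i%THREADS. */
    shared double a[N];

    static double walltime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void)
    {
        int i;
        double t0, t1;

        upc_barrier;
        t0 = walltime();
        /* Each thread writes only the elements it has affinity to,
           as in Fig. 5. */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = (double)i;
        upc_barrier;
        t1 = walltime();

        if (MYTHREAD == 0)
            printf("write: %.3f s\n", t1 - t0);
        return 0;
    }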

4.3 Laplace Solver

We next compare the two languages on a Laplace solver. The TXMP source is shown in Fig. 9 and the BUPC source in Fig. 12; both distribute the arrays block-wise. In the XMP version, the halo region is declared with the shadow directive and exchanged with the reflect directive. In the BUPC version, THREADS and MYTHREAD are built-in values giving the total number of threads and the index of the executing thread, analogous to the size and rank of MPI. The problem size is SIZE = 512 with TIMES = 100 iterations.

Fig. 9 Source of Laplace solver (TXMP)

Fig. 12 Source of Laplace solver (BUPC)

Fig. 10 shows the results for TXMP and BUPC; the gap between them follows the Block-distribution behavior observed in Section 4.2.

Fig. 10 Result of Laplace solver
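Since the TXMP source of Fig. 9 is not reproduced here, the following sketch (our reconstruction, assuming XMP 1.x C syntax; initialization, boundary conditions, and convergence testing omitted) shows the shape of a Laplace solver built on the shadow and reflect directives, with the SIZE and TIMES values quoted above.

    #define SIZE  512
    #define TIMES 100

    double u[SIZE][SIZE], uu[SIZE][SIZE];
    #pragma xmp nodes p(*)
    #pragma xmp template t(0:SIZE-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align u[i][*] with t(i)
    #pragma xmp align uu[i][*] with t(i)
    /* One row of shadow (halo) cells on each side of the
       distributed dimension. */
    #pragma xmp shadow uu[1:1][0:0]

    int main(void)
    {
        int i, j, k;

        /* Initialization of u[] omitted. */
        for (k = 0; k < TIMES; k++) {
    #pragma xmp loop on t(i)
            for (i = 0; i < SIZE; i++)
                for (j = 0; j < SIZE; j++)
                    uu[i][j] = u[i][j];

            /* Exchange the halo rows of uu[] with neighbour nodes. */
    #pragma xmp reflect (uu)

    #pragma xmp loop on t(i)
            for (i = 1; i < SIZE - 1; i++)
                for (j = 1; j < SIZE - 1; j++)
                    u[i][j] = (uu[i-1][j] + uu[i+1][j]
                             + uu[i][j-1] + uu[i][j+1]) / 4.0;
        }
        return 0;
    }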

4.4 NAS Parallel Benchmarks: Conjugate Gradient

Finally, we compare the languages on the Conjugate Gradient (CG) kernel. The UPC version is taken from the UPC implementation of the NAS Parallel Benchmarks (UPC-NPB) 15), and the XMP version was written based on it. CG distributes the work over a two-dimensional grid of PROC_COLS x PROC_ROWS processes; its main kernel is a doubly nested for loop that accumulates the partial results of the sparse matrix-vector product into w[] (w1[] in BUPC-1), followed by a second loop nest that reduces them into q[]. We compare three UPC variants: (1) a version that accesses the shared array w[] directly, called BUPC-1; (2) a version that applies privatization to (1), called BUPC-2; and (3) a further optimization of (2), called BUPC-3. Privatization (Fig. 11) casts the part of a shared array that has affinity to the executing thread into an ordinary private pointer: the shared array w[] of SIZE elements is accessed through a private pointer w_ptr covering the local SIZE/THREADS elements, so that local accesses bypass the shared-pointer overhead 16). The TXMP source is shown in Fig. 13 and the BUPC source in Fig. 15.

Fig. 11 Sample code of privatization (UPC)

Fig. 13 Source of conjugate gradient (TXMP)

Fig. 15 Source of conjugate gradient (BUPC)

The problem size is CLASS C. Fig. 14 compares TXMP, BUPC-1, BUPC-2, BUPC-3, and the reference MPI implementation (MPI-CG) on 2, 8, 32, and 128 cores, and Tables 3 and 4 break the measurements down into CPU time and communication time per implementation.

Fig. 14 Result of conjugate gradient

Table 3 CPU time of each implementation

Table 4 Comm. time of each implementation
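Fig. 11 is likewise not reproduced here, so the following sketch reconstructs the privatization idiom described above (our illustration; the function scale_local and the factor f are hypothetical): the part of the shared array w[] local to the executing thread is accessed through the private pointer w_ptr. A fixed thread count and SIZE divisible by THREADS are assumed.

    #include <upc.h>

    #define SIZE 1024

    /* Block distribution: each thread owns the SIZE/THREADS
       consecutive elements starting at MYTHREAD * (SIZE/THREADS). */
    shared [SIZE/THREADS] double w[SIZE];

    void scale_local(double f)
    {
        int i;
        /* Privatization: cast the thread-local part of the shared
           array to an ordinary C pointer.  Accesses through w_ptr
           compile to plain loads and stores instead of shared-pointer
           arithmetic and runtime calls, which is the source of the
           BUPC-2 speedup over BUPC-1. */
        double *w_ptr = (double *)&w[MYTHREAD * (SIZE / THREADS)];

        for (i = 0; i < SIZE / THREADS; i++)
            w_ptr[i] *= f;
    }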

Table 5 Language features of XcalableMP and Unified Parallel C

                       XcalableMP       Unified Parallel C
    Base languages     C, Fortran       C
    Work sharing       loop directive   upc_forall

5. Conclusion

This paper compared two PGAS extensions of the C language, XcalableMP (XMP) and Unified Parallel C (UPC); Table 5 summarizes their language features. We evaluated both with read/write microbenchmarks on a global array, a Laplace solver, and the CG kernel of the NAS Parallel Benchmarks, comparing XMP and UPC in both performance and how the programs are written. This work was carried out in part under the e-Science project of the XcalableMP Specification Working Group.

References

1) XcalableMP Specification Working Group: XcalableMP Specification DRAFT 0.7.
2) XcalableMP, Vol. 3, No. 3 (in Japanese).
3) UPC Consortium: UPC Language Specifications V1.2, Technical Report, Lawrence Berkeley National Laboratory.
4) Bailey, D. H., et al.: The NAS Parallel Benchmarks, Technical Report, NASA Ames Research Center.
5) The e-Science project, go.jp/bmenu/boshu/detail/ /002.htm
6) Unified Parallel C at George Washington University.
7) Koelbel, C. H., Loveman, D. B., Schreiber, R., Steele Jr., G. L. and Zosel, M. E.: The High Performance Fortran Handbook, MIT Press.
8) Numrich, R. and Reid, J.: Co-array Fortran for parallel programming, Technical Report RAL-TR, Rutherford Appleton Laboratory.
9) Omni XcalableMP Compiler.
10) Berkeley UPC, http://upc.lbl.gov/
11) Bell, C., Bonachea, D., Nishtala, R. and Yelick, K.: Optimizing bandwidth limited problems using one-sided communication and overlap, Proc. 20th International Parallel and Distributed Processing Symposium (IPDPS) (2006).
12) GASNet, http://gasnet.lbl.gov/
13) fhpf, J. JSSAC, Vol. 11, No. 3-4 (in Japanese).
14) Chen, W.-Y., Bonachea, D., Duell, J., Husbands, P., Iancu, C. and Yelick, K.: A Performance Analysis of the Berkeley UPC Compiler, Proc. 17th Annual International Conference on Supercomputing (ICS '03) (2003).
15) UPC implementation of the NAS Parallel Benchmarks (UPC-NPB).
16) El-Ghazawi, T. and Chauvin, S.: UPC benchmarking issues, Proc. International Conference on Parallel Processing (2001).

© 2011 Information Processing Society of Japan
