Implementing a Scalable Parallel Reduction in Unified Parallel C

Introduction

A reduction is the process of combining the elements of a vector (or array) to yield a single aggregate element. Reductions are common in scientific computations. For instance, the inner product of two n-dimensional vectors x and y is given by:

    x^T y = Σ_{i=1}^{n} x_i y_i

This computation requires n multiplications and n-1 additions. The n multiplications are independent of each other and can therefore be executed concurrently. Once the additive terms have been computed, they can be summed together to yield the final result.

Given the importance of reduction operations in scientific algorithms, many parallel languages provide support for user-defined reductions. For example, OpenMP provides a reduction clause to be used on OMP parallel constructs. In the following example the reduction clause indicates that "sum" is a reduction variable:

int A[N];
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
    sum = sum + A[i];
}

In the code snippet above each thread performs a portion of the additions that make up the final sum. At the end of the parallel loop, the partial sums are combined into the final result.

Reduction Implementation in Unified Parallel C

In this article we explore different implementations of a global sum reduction in Unified Parallel C, in an attempt to find an efficient and scalable implementation. Unified Parallel C (UPC) is an extension to the C programming language that allows users to express parallelism in their code. The UPC language subscribes to the Partitioned Global Address Space (PGAS) programming model. PGAS languages such as UPC are increasingly seen as a convenient way to enhance programmer productivity for High-Performance Computing (HPC) applications on large-scale machines.

A UPC program is executed by a fixed number of threads (THREADS) which are created at program startup. In general, a program statement might be executed concurrently by all threads. To facilitate the distribution and coordination of work among threads, UPC provides a rich set of language features. A tutorial on Unified Parallel C is available at http://domino.research.ibm.com/comm/research_projects.nsf/pages/xlupc.confs.html/$file/pact08tutorial.pdf.
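The snippet above is only a fragment. As a point of reference, here is a minimal complete OpenMP program built around it; the array size and the all-ones initialization are our own choices for illustration:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int A[N];                /* file scope so the large array is not on the stack */

int main(void) {
    int i, sum = 0;

    for (i = 0; i < N; i++)
        A[i] = 1;        /* assumed initialization, so the expected sum is N */

    /* Each thread gets a private copy of "sum" initialized to 0; OpenMP
       combines the private copies with "+" when the loop completes. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum = sum + A[i];

    printf("sum = %d (expected %d)\n", sum, N);
    return 0;
}

Compiling with OpenMP enabled (for example, -fopenmp with GCC) and varying OMP_NUM_THREADS shows the additions being spread across threads while the final result stays correct.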

Let's consider how a naive sum reduction could be written in Unified Parallel C:

01 #include <upc.h>
02 #include <stdio.h>
03 #define N 100000000
04
05 shared int A[N];
06 shared int sum = 0;
07 int main() {
08     int i;
09
10     upc_forall (i = 0; i < N; i++; &A[i]) { A[i] = 1; }
11     upc_barrier;
12
13     upc_forall (i = 0; i < N; i++; &A[i]) { sum = sum + A[i]; }
14     upc_barrier;
15
16     if (MYTHREAD == 0 && sum != N)
17     {
18         printf("Thread:%d,result=%d,expect=%d\n", MYTHREAD, sum, N);
19         return 1;
20     }
21
22     return 0;
23 }

At lines 05-06 we declare a shared array "A" and a shared variable "sum". At line 10 we initialize all elements of A to 1. At line 13 we attempt to perform the reduction by accumulating the sum of A's elements in the shared variable "sum". Is this program correct? A possible program output is:

Thread:0,result=27176662,expect=100000000

The result is obviously wrong, but what is the problem? The keen reader might point out that the program as written contains a race condition: multiple threads can write into the shared variable "sum" concurrently, possibly overwriting a partial value previously stored. In order to eliminate the race condition we could protect writes to "sum" with a critical section. In UPC this is accomplished by using a "lock" variable, as follows:

upc_forall (i = 0; i < N; i++; &A[i]) {
    upc_lock(lock);
    sum = sum + A[i];
    upc_unlock(lock);
}

The modified version of the program outputs the correct result. However, what are the implications of this "solution"? The use of the lock effectively serializes the upc_forall loop iterations, preventing any performance gain from parallel execution. To confirm this theory we measured how long it takes for the upc_forall loop above to compute the sum of the array elements. Our experiments were conducted on a POWER system running AIX, using up to 32 threads (Figure 1).
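The fragment above uses a lock variable whose declaration the article does not show. Below is a sketch of what the full locked version might look like, assuming the lock is obtained with the standard collective upc_all_lock_alloc (our assumption; the article never shows the allocation, and the smaller N is our choice for a quick run):

#include <upc.h>
#include <stdio.h>

#define N 1000000            /* assumed: smaller than the article's N */

shared int A[N];
shared int sum = 0;
upc_lock_t *lock;            /* private pointer on each thread to one shared lock */

int main() {
    int i;

    /* Collective call: every thread participates and receives a pointer
       to the same shared lock object. */
    lock = upc_all_lock_alloc();

    upc_forall (i = 0; i < N; i++; &A[i]) { A[i] = 1; }
    upc_barrier;

    /* The article's locked loop: correct, but effectively serialized. */
    upc_forall (i = 0; i < N; i++; &A[i]) {
        upc_lock(lock);
        sum = sum + A[i];
        upc_unlock(lock);
    }
    upc_barrier;

    if (MYTHREAD == 0) {
        printf("result=%d, expect=%d\n", sum, N);
        upc_lock_free(lock); /* free the lock's resources exactly once */
    }
    return 0;
}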

[Chart: TIME(S) on the y-axis versus THREADS (1 to 32) on the x-axis.]

Figure 1: Effect of using a lock inside a upc_forall loop.

From the results illustrated in Figure 1 we can infer that the time it takes to execute the upc_forall loop does not improve considerably as the number of threads used to execute the program increases. This is what we expected, because the lock in the loop prevents concurrent execution of the loop iterations.

To get better scalability (increased program performance as the number of threads increases), it is critical to remove the lock from the upc_forall loop. This can be done by accumulating the partial sum computed by each thread into a thread-local variable. A thread-local variable is allocated in the private memory space of each thread, so there are THREADS instances of the variable. Each instance of the thread-local variable can be used to accumulate the sum of the array elements having affinity to that thread:

upc_forall (i = 0; i < N; i++; &A[i]) {
    partialsum += A[i];
}
upc_barrier;

upc_lock(lock);
sum = sum + partialsum;
upc_unlock(lock);

In the code fragment shown above the thread-local variable "partialsum" is used to store the sum of the array elements having affinity to the executing thread (MYTHREAD). For example, THREAD 0 adds array elements A[0], A[THREADS], A[2*THREADS], etc. into its instance of "partialsum". In order to compute the final result it is necessary to add the partialsum contributions from each thread. To avoid a race condition (a write-after-write hazard on variable "sum"), we use the UPC lock functions to serialize the accesses to "sum".
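Putting the pieces together, one possible complete version of the scalable reduction follows. The declarations of "partialsum" and "lock", the smaller array size, and the final barrier before thread 0 reads "sum" are our additions; the article shows only the loop body:

#include <upc.h>
#include <stdio.h>

#define N 1000000            /* assumed array size for illustration */

shared int A[N];
shared int sum = 0;
upc_lock_t *lock;

int main() {
    int i;
    int partialsum = 0;      /* thread-local: one private instance per thread */

    lock = upc_all_lock_alloc();

    upc_forall (i = 0; i < N; i++; &A[i]) { A[i] = 1; }
    upc_barrier;

    /* Phase 1: each thread sums the elements with affinity to it,
       entirely in private memory: no lock, no contention. */
    upc_forall (i = 0; i < N; i++; &A[i])
        partialsum += A[i];
    upc_barrier;

    /* Phase 2: the lock is acquired only THREADS times in total. */
    upc_lock(lock);
    sum = sum + partialsum;
    upc_unlock(lock);
    upc_barrier;             /* ensure every contribution has landed */

    if (MYTHREAD == 0)
        printf("result=%d, expect=%d\n", sum, N);
    return 0;
}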

[Chart: TIME(S) on the y-axis versus THREADS (1 to 32) on the x-axis.]

Figure 2: Effect of using thread-local variables to compute partial sums.

The performance results illustrated by Figure 2 demonstrate that the program is now scalable; that is, the time taken to compute the reduction diminishes as the number of threads used to execute the program increases. The reason for this improvement is simple: the lock is now acquired THREADS times in total, instead of being acquired by each thread in every loop iteration.

Conclusion

In this article we illustrated the concept of a reduction operation and explained how to implement a parallel reduction in Unified Parallel C. We showed how the UPC lock primitives can be used to guarantee program correctness. We then compared the performance of two distinct, correct reduction implementations: one using a lock inside a upc_forall loop, the other using thread-local variables to accumulate partial results on each thread. The performance measurements obtained clearly indicate that locks should be used judiciously (if at all) inside loops.

May 2010

References in this document to IBM products, programs, or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM program product in this publication is not intended to state or imply that only IBM's program product may be used. Any functionally equivalent program may be used instead.

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

© Copyright International Business Machines Corporation 2010.

US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.