SWARM Tutorial. Chen Chen 4/12/2012


SWARM Tutorial Chen Chen 4/12/2012 1

Outline Introduction to SWARM Programming in SWARM Atomic Operations in SWARM Parallel For Loop in SWARM 2


What is SWARM? SWARM (SWift Adaptive Runtime Machine) is a dynamic adaptive runtime system that minimizes user exposure to physical parallelism and system complexity. SWARM is designed to enable programming on many-core architectures by utilizing a dynamic, message-driven model of execution instead of the static scheduling and sequential computing method of conventional programming models. (Cited from the ETI SWARM webpage, http://www.etinternational.com/index.php/products/swarmbeta/) 4

Difference Between OpenMP and SWARM. OpenMP uses a coarse-grain execution model: a master thread executes each sequential task, wakes up the slave threads for the parallel tasks, and joins them at a barrier before the next sequential task. SWARM uses a fine-grain execution model: the work is decomposed into small tasks (codelets) whose execution is driven by dependencies. 5

How Codelets Work in SWARM. Each codelet is attached to a dependency counter with some initial positive value. A codelet is in the dormant state while its counter is greater than 0, and in the enabled state when its counter reaches 0. An enabled codelet is scheduled to a free thread and executed (the firing state). A codelet can call satisfy() to decrease the value of any dependency counter. The SWARM runtime handles codelet scheduling and dependency maintenance. 6

How Codelets Work in SWARM: an Example. Suppose we have 3 codelets, where codelet3 cannot start until both codelet1 and codelet2 are done, and one thread executes all 3 codelets. Initially, codelet1 and codelet2 are enabled (dep1 = 0, dep2 = 0) and codelet3 is dormant (dep3 = 2), so the free thread picks up codelet1 and fires it. When codelet1 finishes, it calls satisfy() to decrease dep3 by 1, leaving dep3 = 1; codelet3 stays dormant. The thread then fires codelet2, which on completion also calls satisfy() to decrease dep3 by 1. Now dep3 = 0, so codelet3 becomes enabled; the thread fires it, and when it completes, all three codelets are done. (slides 7-13)

Outline Introduction to SWARM Programming in SWARM Atomic Operations in SWARM Parallel For Loop in SWARM 14

Programming in SWARM. Partition the problem into codelets: break the problem into codelets, then classify them. Entry codelet: no parent. Single-dependent codelet: one parent. Multiple-dependent codelet: multiple parents. Sink codelet: executed last. 15

Programming in SWARM (cont.) Set up the dependencies in the program. Call swarm_enterruntime() to start the entry codelet; the entry codelet has no parent, so it is executed at the beginning of the SWARM runtime. Call swarm_schedulegeneral() to create a single-dependent codelet at the end of its parent; since it has only one parent, no dependency counter needs to be set. Call swarm_dependency_init() to create a multiple-dependent codelet and set up its dependencies before any of its parents is created (e.g., create it at some common ancestor of all its parents); performing the creation and setup before any parent starts avoids conflicts on the dependency counter. The sink codelet is created in the same way as either a single-dependent or a multiple-dependent codelet, according to the number of its parents; call swarm_shutdownruntime() at the end of the sink codelet to terminate the SWARM runtime. (slides 16-19)

Example: Hello World. Codelet graph of hello world: a startup codelet spawns 8 hello codelets (8 in all), each hello schedules a world codelet, and the 8 world codelets feed into a single done codelet. 20

Example: Hello World

    #include <stdio.h>
    #include <swarm/runtime.h>
    #include <swarm/scheduler.h>

    #define COUNT 8

    static void startup(void *);
    static void hello(void *);
    static void world(void *);
    static void done(void *);

    static swarm_dependency_t dep;

    int main(void)
    {
        return !swarm_enterruntime(NULL, startup, NULL);
    }

    static void startup(void *unused)
    {
        unsigned i;
        (void)unused;
        swarm_dependency_init(&dep, COUNT, done, NULL);
        for (i = 0; i < COUNT; i++)
            swarm_schedulegeneral(hello, (void *)(size_t)i);
    }

    static void hello(void *_i)
    {
        const unsigned i = (size_t)_i;
        printf("%u: Hello,\n", i);
        swarm_schedulegeneral(world, _i);
    }

    static void world(void *_i)
    {
        const unsigned i = (size_t)_i;
        printf("%u: world!\n", i);
        swarm_satisfy(&dep, 1);
    }

    static void done(void *unused)
    {
        (void)unused;
        puts("All done!");
        swarm_shutdownruntime(NULL);
    }

21


SWARM APIs: Enter the SWARM Runtime. swarm_enterruntime(params, codelet, context). params: pointer to a swarm_runtime_params_t that configures the SWARM runtime (e.g., the maximum number of threads). codelet: the entry codelet, a function of the form void fname(void *context). context: pointer to a data structure holding the parameters passed to the codelet. 26

Review Hello World (code on slide 21): main() enters the runtime with swarm_enterruntime(NULL, startup, NULL), making startup the entry codelet. 27

SWARM APIs: Codelet Creation (1). Create a free codelet, i.e., one that does not depend on other codelets: swarm_schedulegeneral(codelet, context). codelet: a function of the form void fname(void *context). context: pointer to a data structure holding the parameters passed to the codelet. 28

Review Hello World (code on slide 21): startup() creates the 8 free hello codelets with swarm_schedulegeneral(hello, (void *)(size_t)i). 29

SWARM APIs: Codelet Creation (2). Create a dependent codelet, i.e., one that depends on other codelets: swarm_dependency_init(dep, count, codelet, context). dep: pointer to a swarm_dependency_t, a variable that tracks the number of satisfied dependencies. count: an integer specifying the number of required dependencies. codelet: a function of the form void fname(void *context). context: pointer to a data structure holding the parameters passed to the codelet. 30

Review Hello World (code on slide 21): startup() creates the dependent done codelet with swarm_dependency_init(&dep, COUNT, done, NULL). 31

SWARM APIs: Satisfy a Dependency. swarm_satisfy(dep, num). dep: pointer to the swarm_dependency_t to satisfy. num: an integer specifying the number of times to satisfy dep. 32

Review Hello World (code on slide 21): each world() codelet satisfies one of done's dependencies with swarm_satisfy(&dep, 1). 33

SWARM APIs: Terminate the SWARM Runtime. swarm_shutdownruntime(NULL). Shuts down the runtime in which the caller is executing. 34

Review Hello World (code on slide 21): done() terminates the runtime with swarm_shutdownruntime(NULL). 35

Outline Introduction to SWARM Programming in SWARM Atomic Operations in SWARM Parallel For Loop in SWARM 36

Example: Data Race in Computing Sum. Goal: compute sum = A1 + A2 + A3. Codelet 0 initializes sum = 0. Codelets 1, 2, and 3 each execute three instructions (shown here for codelet 1): tmp1 = sum; tmp1 = tmp1 + A1; sum = tmp1;. Codelet 4 outputs sum. One possible interleaving: all three codelets first read sum, so tmp1 = tmp2 = tmp3 = 0. Each then adds its element, giving tmp1 = A1, tmp2 = A2, tmp3 = A3. Finally the three writes to sum land one after another: sum = A1, then sum = A2, then sum = A3. Codelet 4 outputs A3: a wrong result! (slides 37-43)

Example: Data Race in Computing Sum: Solution. In each codelet i, the three instructions (tmp1 = sum; tmp1 = tmp1 + A1; sum = tmp1;) must be executed atomically, i.e., as if they were one instruction: while a codelet is computing sum, the other codelets must not change the value of sum. 44

Example: Data Race in Computing Sum: Solution (cont.) Codelet 0: sum = 0. Codelets 1, 2, and 3 each perform their three-instruction update of sum atomically. Codelet 4: output(sum). One possible execution: codelet 2 runs its atomic block first (tmp2 = 0, then sum = A2), then codelet 3 (tmp3 = A2, then sum = A2 + A3), then codelet 1 (tmp1 = A2 + A3, then sum = A1 + A2 + A3). Codelet 4 outputs A1 + A2 + A3, and because each update is atomic, the result is correct regardless of the order in which the codelets execute. (slides 45-49)

Atomic Operations in SWARM. swarm_atomic_getandadd(var, val): atomically do ret = var; var = var + val; return ret. swarm_atomic_cmpandset(var, val, newval): atomically do if var == val, then var = newval and return true; otherwise return false. 50

More Information About Atomic Operations. Read the SWARM documentation. General information about atomic operations: share/doc/swarm/programmersguide/sec_coding_atomics.htm and share/doc/swarm/programmersguide/sec_coding_atomics_naming.htm. Atomic get-and-add: share/doc/swarm/programmersguide/sec_coding_atomics_rmw.htm. Atomic compare-and-set: share/doc/swarm/programmersguide/sec_coding_atomics_access.htm. 51

Outline Introduction to SWARM Programming in SWARM Atomic Operations in SWARM Parallel For Loop in SWARM 52

Parallel For Loop in SWARM. Problem formulation: suppose we have the following for loop, where the loop iterations can be executed in arbitrary order. How can we parallelize the for loop in SWARM? for (i = 0; i < N; i++) foo(i); Methodology 1: spawn N codelets, each executing one foo(i). Not recommended due to heavy overhead. Methodology 2: spawn k codelets, each executing N/k foo(i)s. Good for a balanced workload, but not for an unbalanced one. Methodology 3: spawn k codelets, each dynamically executing foo(i)s. Good for an unbalanced workload. (slides 53-56)

Parallel For Loop in SWARM (5). Example of using methodology 2 for vector dot product:

    static void startup(void *unused)
    {
        unsigned i;
        (void)unused;
        // COUNT is the total number of threads
        swarm_dependency_init(&dep, COUNT, done, NULL);
        for (i = 0; i < COUNT; i++)
            swarm_schedulegeneral(dotproduct, (void *)(size_t)i);
    }

    static void dotproduct(void *_tid)
    {
        const unsigned tid = (size_t)_tid;
        unsigned i;
        // LEN is the length of the arrays
        sum[tid] = 0;
        for (i = tid * LEN / COUNT; i < (tid + 1) * LEN / COUNT; i++)
            sum[tid] += v1[i] * v2[i];
        swarm_satisfy(&dep, 1);
    }

    static void done(void *unused)
    {
        unsigned i;
        int result;
        (void)unused;
        result = 0;
        for (i = 0; i < COUNT; i++)
            result += sum[i];
        printf("Result is: %d\n", result);
        swarm_shutdownruntime(NULL);
    }

57

Parallel For Loop in SWARM (6). for (i = 0; i < N; i++) foo(i); Methodology 3: spawn k codelets, each dynamically executing foo(i)s. How? Each codelet repeatedly: (1) gets the first index of an unexecuted loop iteration and stores it in i; (2) increases the shared index by CHUNK_SIZE; (3) executes foo(i), foo(i+1), ..., foo(i+CHUNK_SIZE-1). Hints: steps (1) and (2) must be done atomically; once i >= N, the codelet is complete; correctly handle the case where i + CHUNK_SIZE >= N. 58

Parallel For Loop in SWARM (7). Set the maximum number of threads:

    swarm_runtime_params_t p;
    swarm_runtime_params_init(&p);
    if (M_NUM_THREADS > 0)
        p.maxthreadcount = M_NUM_THREADS;
    if (!swarm_enterruntime(&p, startup, _ctxt)) {
        // startup: entry codelet
        // _ctxt: parameter passed to startup
        fprintf(stderr, "%s: unable to start SWARM runtime\n", *argv);
        return 64;
    }

59

Parallel For Loop in SWARM (8). Get the maximum number of threads:

    unsigned GetSwarmThreadCount()  // returns the maximum number of threads
    {
        const swarm_threadlocale_t *top = swarm_topmostlocale;
        size_t k;
        if (!top)
            return 1;
        k = swarm_locale_getchildren(
                swarm_threadlocale_to_locale(top), NULL, 0);
        return k + !k;
    }

60