Introduction to SWARM Software and Algorithms for Running on Multicore Processors

David A. Bader
Georgia Institute of Technology

Tutorial compiled by Rucheek H. Sangani
M.S. Student, College of Computing
Georgia Institute of Technology

SWARM is a portable open source library of basic primitives that provide a framework for designing algorithms on multicore systems. Using this framework, we have implemented efficient parallel algorithms for important primitive operations such as prefix-sums, pointer-jumping, symmetry breaking, and list ranking; combinatorial problems such as sorting and selection; parallel graph theoretic algorithms such as spanning tree, minimum spanning tree, graph decomposition, and tree contraction; and computational genomics applications such as maximum parsimony. This documentation describes the variables, functions and macros that make up the API for developing parallel code, and is supported by the explanation of an example code.

Index of Contents

1. Introduction
   1.1 Motivation for SWARM
   1.2 What is SWARM?
2. SWARM Application Programming Interface Tutorial
   2.1 SWARM Program Structure
   2.2 SWARM Variables and Predefines
   2.3 SWARM Functions and Primitives
       2.3.1 Initialization, Execution and Termination Functions
       2.3.2 Barrier Functions
       2.3.3 Memory Management Functions
       2.3.4 Broadcast Functions
       2.3.5 Replicate Functions
       2.3.6 Reduce Functions
       2.3.7 Scan Functions
   2.4 SWARM Macros
3. Working with the Example Code
   3.1 Standard Deviation
   3.2 SWARM for Standard Deviation Calculation
   3.3 Example Code itself
   3.4 Step by Step Explanation of the Code
4. Conclusion

1. Introduction

1.1 Motivation for SWARM

Since the inception of the desktop computer, software performance has improved at an exponential rate, driven primarily by rapid growth in processing power. The performance of an algorithm kept improving simply with the arrival of new and faster processors. However, we can no longer rely solely on Moore's law for performance improvements. Fundamental physical limitations such as the size of the transistor and power constraints have now necessitated a radical change in commodity microprocessor architecture to multicore designs. Dual- and quad-core processors are steadily finding their way into desktops and laptops. Software developers and programmers are now required to exploit this concurrency at the algorithmic level.

1.2 What is SWARM?

SWARM (SoftWare and Algorithms for Running on Multicore) has been introduced as an open source parallel programming framework. It is a library of primitives that fully exploit multicore processors. The SWARM programming framework is a descendant of the symmetric multiprocessor (SMP) node library component of SIMPLE (Journal of Parallel and Distributed Computing, 58(1):92-108, 1999). SWARM is built on POSIX threads and allows the user to use either the already developed primitives or direct thread primitives. SWARM has constructs for parallelization, restricting control of threads, allocation and de-allocation of shared memory, and communication primitives for synchronization, replication and broadcasting. The framework has been used successfully to implement efficient parallel versions of primitive algorithms such as list ranking, prefix sums and symmetry breaking.

To use the SWARM library, the programmer needs to make only minimal modifications to existing sequential code. After identifying compute-intensive routines in the program, work can be assigned to each core using an efficient multicore algorithm. Independent operations such as those arising in functional parallelism or loop parallelism can typically be threaded. For functional parallelism, this means that each thread acts as a functional process for that task, and for loop parallelism, each thread computes its portion of the computation concurrently. Note that it might be necessary to apply loop transformations to reduce data dependencies between threads. SWARM contains efficient implementations of commonly used primitives in parallel programming. These computation and communication primitives are discussed below, followed by their usage in an example code.

2. SWARM Application Programming Interface Tutorial

2.1 SWARM Program Structure

Before beginning with the SWARM API functions and variables, we first present the structure of a typical SWARM program. This will help in understanding the functions and their usage described later. The code executes the function routine in parallel, synchronizes at a certain point and then prints, for each thread, the id of the thread on which it is running. Eight threads/cores are assumed for all the examples presented in this section.

    #include <stdio.h>
    #include <swarm.h>

    static void routine(THREADED)
    {
        SWARM_Barrier_sync(TH);
        printf("My thread id = %d, Total threads = %d\n", MYTHREAD, THREADS);
    }

    int main(int argc, char **argv)
    {
        SWARM_Init(&argc, &argv);

        /* sequential code */

        /* parallelize a routine using SWARM */
        SWARM_Run(routine);

        /* more sequential code */

        /* Terminate cleanly */
        SWARM_Finalize();
        return 0;
    }

Corresponding output (assuming 8 threads):

    My thread id = 2, Total threads = 8
    My thread id = 5, Total threads = 8
    My thread id = 0, Total threads = 8
    My thread id = 1, Total threads = 8
    My thread id = 7, Total threads = 8
    My thread id = 4, Total threads = 8
    My thread id = 3, Total threads = 8
    My thread id = 6, Total threads = 8

The order in which the threads print varies between executions.

2.2 SWARM Variables and Predefines

THREADED: THREADED is defined to be a structure that contains all the required information for a particular thread. The parallel routine must be invoked with THREADED as its argument.

TH: TH is an instance of THREADED. As seen in the above code, TH is passed as an argument to SWARM_Barrier_sync. The function that accepts TH is declared as SWARM_Barrier_sync(THREADED), which can be seen in the swarm.h header file. From the programming point of view, TH and THREADED need little attention, as they are declared for internal implementation purposes.

MYTHREAD: Provides the thread id of the thread that evaluates it. MYTHREAD is printed by the test code for each of the threads.

THREADS: Specifies the total number of threads executing in parallel. Both THREADS and MYTHREAD are useful variables from the programmer's point of view.

The figure below explains the above concepts:

[Figure: struct THREADED, with TH an instance of THREADED. Each of the threads 0, 1, ..., 7 holds its own TH carrying its MYTHREAD value; THREADS (extern int, here 8) is the same on every thread.]

2.3 SWARM Functions and Primitives

2.3.1 Initialization, Execution and Termination Functions

These functions are called from the main of the program. As the code below (repeated from Section 2.1) shows, most SWARM applications will have a main structured around these three functions.

    int main(int argc, char **argv)
    {
        SWARM_Init(&argc, &argv);

        /* sequential code */

        /* parallelize a routine using SWARM */
        SWARM_Run(routine);

        /* more sequential code */

        /* Terminate cleanly */
        SWARM_Finalize();
        return 0;
    }

Each of them is further described below.

SWARM_Init(&argc, &argv)
This function is responsible for initializing the parallel environment and allocating the requisite memory. It looks at the number of processors available on the machine and sets up the threads accordingly. The user can also specify the number of threads by using the option -t.

SWARM_Run((void *)routine)
Takes the routine that we want to parallelize across the threads. Thread creation and execution take place here.

SWARM_Finalize()
Performs the clean-up task by freeing the allocated memory.

2.3.2 Barrier Functions

Barrier functions are used to synchronize execution across threads. The threads execute asynchronously in parallel until this function is called. When one of the threads reaches the barrier, it waits until all the threads reach this point, thus synchronizing all the threads. Once synchronized, all threads start executing asynchronously in parallel again.

[Figure: timeline of threads 0, 1, ..., 7, each alternating between parallel code and Barrier() calls; at every barrier a thread waits for the other threads to reach the same point.]

void SWARM_Barrier()
By default SWARM_Barrier uses SWARM_Barrier_sync, which is described next.

Usage:

    static void routine(THREADED)
    {
        /* parallel code */

        /* use the SWARM Barrier for synchronization */
        SWARM_Barrier();

        /* more parallel code */
    }

void SWARM_Barrier_sync(THREADED)
This function achieves thread synchronization using a conditional wait on a mutex lock until all threads have reached the barrier.

Usage:

    static void routine(THREADED)
    {
        /* parallel code */

        /* use the SWARM Barrier for synchronization */
        SWARM_Barrier_sync(TH);

        /* more parallel code */
    }

void SWARM_Barrier_tree(THREADED)
This function achieves thread synchronization using shared buffers. Threads are viewed as a binary tree based on their thread id. Each thread sends a message to its parent thread when it invokes the barrier. This proceeds bottom-up until the root detects completion of the barrier. The root then releases the threads for further execution in top-down order.

Usage:

    static void routine(THREADED)
    {
        /* parallel code */

        /* use the SWARM Barrier for synchronization */
        SWARM_Barrier_tree(TH);

        /* more parallel code */
    }
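
The condition-wait idea behind SWARM_Barrier_sync can be illustrated with plain POSIX threads. The following is only a minimal sketch under stated assumptions, not SWARM's implementation: the names barrier_t, barrier_init, barrier_wait and barrier_demo are hypothetical, and the generation counter is one common way to make such a barrier reusable.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 64

/* Hypothetical barrier type: a counter protected by a mutex, plus a
   condition variable on which early arrivals sleep. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t all_here;
    int nthreads;        /* number of participating threads */
    int waiting;         /* threads that have arrived so far */
    unsigned generation; /* incremented each time the barrier opens */
} barrier_t;

void barrier_init(barrier_t *b, int nthreads)
{
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->nthreads = nthreads;
    b->waiting = 0;
    b->generation = 0;
}

void barrier_wait(barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    unsigned gen = b->generation;
    if (++b->waiting == b->nthreads) {
        /* Last arrival: reset the counter and wake everyone. */
        b->waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        /* The loop guards against spurious wakeups. */
        while (gen == b->generation)
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

typedef struct {
    int id, nthreads, saw_all;
    int *flags;
    barrier_t *barrier;
} worker_t;

static void *worker(void *arg)
{
    worker_t *w = arg;
    w->flags[w->id] = 1;      /* phase 1: announce arrival */
    barrier_wait(w->barrier); /* nobody passes until all have arrived */
    w->saw_all = 1;           /* phase 2: every flag must now be set */
    for (int i = 0; i < w->nthreads; i++)
        if (!w->flags[i])
            w->saw_all = 0;
    return NULL;
}

/* Returns how many threads observed all flags set after the barrier. */
int barrier_demo(int nthreads)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tids[MAX_THREADS];
    worker_t workers[MAX_THREADS];
    int flags[MAX_THREADS] = {0};
    barrier_t b;
    int count = 0;

    barrier_init(&b, nthreads);
    for (int i = 0; i < nthreads; i++) {
        workers[i] = (worker_t){ i, nthreads, 0, flags, &b };
        pthread_create(&tids[i], NULL, worker, &workers[i]);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
        count += workers[i].saw_all;
    }
    pthread_mutex_destroy(&b.lock);
    pthread_cond_destroy(&b.all_here);
    return count;
}
```

Calling barrier_demo(8) starts eight threads; because no thread leaves barrier_wait before the last one has arrived, every thread should see all eight flags set, so the call returns 8.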

2.3.3 Memory Management Functions

These are wrapper functions for dynamic allocation and release of shared structures. Memory is allocated globally, i.e. all threads access the same memory location.

[Figure: threads 0, 1, ..., 7 each hold a local pointer A to the same global memory location, which stores the shared array.]

void *SWARM_malloc(int bytes, THREADED)
Dynamically allocates the specified number of bytes. The underlying implementation ensures that the allocation is done on only one of the threads. The return type is a void pointer, which can be cast to the desired type.

Usage:

    static void routine(THREADED)
    {
        int *A;

        /* example: allocate a shared array of size n */
        A = (int *)SWARM_malloc(n * sizeof(int), TH);
    }

void *SWARM_malloc_l(long bytes, THREADED)
Similar to SWARM_malloc, except that it is useful for allocating amounts of memory too large to be specified by an int. Thus, if sizeof(int) is 4 bytes and we want to allocate more than 2^31 - 1 bytes (i.e. 2 GB of memory or more), SWARM_malloc_l should be used.

Usage:

    static void routine(THREADED)
    {
        int *A;

        /* example: allocate a shared array of size n */
        A = (int *)SWARM_malloc_l(n * sizeof(int), TH);
    }

void SWARM_free(void *ptr, THREADED)
Used for de-allocating dynamically allocated memory. The underlying implementation ensures that the de-allocation is done on only one of the threads.

Usage:

    static void routine(THREADED)
    {
        int *A;

        A = (int *)SWARM_malloc(n * sizeof(int), TH);

        /* Free the memory allocated for A */
        SWARM_free(A, TH);
    }
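
Because SWARM_malloc takes its size as an int, a request such as n * sizeof(int) can silently exceed what an int can hold. The helper below is a hypothetical sketch (fits_in_int_bytes is not part of SWARM) of how a caller might decide between the two allocators before making the call.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper, not part of SWARM.
   Returns 1 if `count` elements of `elem_size` bytes fit in an int byte
   count (an int-sized request suffices), or 0 if a long-sized request
   is needed. The division avoids overflowing the multiplication. */
int fits_in_int_bytes(long long count, long long elem_size)
{
    assert(count >= 0 && elem_size > 0);
    return count <= (long long)INT_MAX / elem_size;
}
```

With a 4-byte int, fits_in_int_bytes(n, sizeof(int)) is true up to n = 536870911; from n = 536870912 (exactly 2 GB) onward the request would go to SWARM_malloc_l.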

2.3.4 Broadcast Functions

There are many situations where a value or a memory location on one of the threads or cores needs to be distributed to the other cores. Broadcast functions serve this purpose.

[Figure: two panels over threads 0, 1, ..., 7. "Broadcast copying pointer content" shows the pointer A on every thread; "Broadcast copying data value" shows the value i on every thread.]

int SWARM_Bcast_i(int myval, THREADED)
Broadcasts an integer value to all the cores.

Usage:

    static void routine(THREADED)
    {
        int i = 5;

        on_one_thread {
            i = 10;
            printf("Before broadcast:\n\n");
        }
        printf("Value of i = %d on thread %d\n", i, MYTHREAD);

        SWARM_Barrier();

        /* Broadcasting value of i to all cores */
        i = SWARM_Bcast_i(i, TH);

        on_one_thread {
            printf("\n\n");
            printf("After broadcast:\n\n");
        }
        printf("Value of i = %d on thread %d\n", i, MYTHREAD);
    }

Output for the above code:

    Before broadcast:

    Value of i = 5 on thread 1
    Value of i = 10 on thread 0
    Value of i = 5 on thread 7
    Value of i = 5 on thread 2
    Value of i = 5 on thread 3
    Value of i = 5 on thread 4
    Value of i = 5 on thread 6
    Value of i = 5 on thread 5

    After broadcast:

    Value of i = 10 on thread 1
    Value of i = 10 on thread 0
    Value of i = 10 on thread 7
    Value of i = 10 on thread 2
    Value of i = 10 on thread 3
    Value of i = 10 on thread 4
    Value of i = 10 on thread 6
    Value of i = 10 on thread 5

int SWARM_Bcast_from_i(int myval, int source, THREADED)
The broadcast routine described previously broadcasts the value held on thread/core 0. In cases where the value must be copied from a specific thread, use the Bcast_from function. Functionally it is the same as Bcast, except that the source thread id must be given as an argument.

long SWARM_Bcast_l(long myval, THREADED)
Broadcasts a value of type long to all cores.

long SWARM_Bcast_from_l(long myval, int source, THREADED)
Broadcasts a value of type long from a specific core to all cores.

double SWARM_Bcast_d(double myval, THREADED)
Broadcasts a value of type double to all cores.

double SWARM_Bcast_from_d(double myval, int source, THREADED)
Broadcasts a value of type double from a specific core to all cores.

char SWARM_Bcast_c(char myval, THREADED)
Broadcasts a char value to all the cores.

char SWARM_Bcast_from_c(char myval, int source, THREADED)
Broadcasts a char value from a specific core to all cores.

int *SWARM_Bcast_ip(int *myval, THREADED)
Provides each processing core with the address of a shared buffer. Care should be taken to ensure that the buffer whose address is being broadcast is shared.

Usage:

    static void routine(THREADED)
    {
        int *a = (int *)SWARM_malloc(sizeof(int), TH);
        int *b = (int *)SWARM_malloc(sizeof(int), TH);
        int *c;

        *a = 5;
        *b = 10;
        c = a;
        SWARM_Barrier();

        on_one_thread {
            printf("Before broadcast:\n\n");
            c = b;   /* b is shared */
        }
        printf("Value of *c = %d on thread %d\n", *c, MYTHREAD);

        /* Broadcasting address of b to all cores */
        c = SWARM_Bcast_ip(c, TH);

        on_one_thread {
            printf("\n\n");
            printf("After broadcast:\n\n");
        }
        SWARM_Barrier();
        printf("Value of *c = %d on thread %d\n", *c, MYTHREAD);
    }

Output for the above code:

    Before broadcast:

    Value of *c = 5 on thread 1
    Value of *c = 10 on thread 0
    Value of *c = 5 on thread 7
    Value of *c = 5 on thread 2
    Value of *c = 5 on thread 3
    Value of *c = 5 on thread 4
    Value of *c = 5 on thread 6
    Value of *c = 5 on thread 5

    After broadcast:

    Value of *c = 10 on thread 1
    Value of *c = 10 on thread 0
    Value of *c = 10 on thread 7
    Value of *c = 10 on thread 2
    Value of *c = 10 on thread 3
    Value of *c = 10 on thread 4
    Value of *c = 10 on thread 6
    Value of *c = 10 on thread 5

int *SWARM_Bcast_from_ip(int *myval, int source, THREADED)
Broadcasts a pointer to an integer from a specific core to all cores.

long *SWARM_Bcast_lp(long *myval, THREADED)
Broadcasts a pointer to a long integer.

long *SWARM_Bcast_from_lp(long *myval, int source, THREADED)
Broadcasts a pointer to a long integer from a specific core to all cores.

double *SWARM_Bcast_dp(double *myval, THREADED)
Broadcasts a pointer to a double.

double *SWARM_Bcast_from_dp(double *myval, int source, THREADED)
Broadcasts a pointer to a double from a specific core to all cores.

char *SWARM_Bcast_cp(char *myval, THREADED)
Broadcasts a pointer to a character.

char *SWARM_Bcast_from_cp(char *myval, int source, THREADED)
Broadcasts a pointer to a character from a specific core to all cores.

2.3.5 Replicate Functions

The basic difference between replicate and broadcast is that, while broadcast copies the value into pre-existing memory locations on all cores, replicate allocates memory during the function call and creates a replica of the object passed by the calling thread.

[Figure: the contents of A on the source thread are replicated to threads 0, 1, ..., 7; memory for each replica is allocated during the call.]

void *SWARM_Replicate(void *myval, int source, int bytes, THREADED)
The replicate function takes a pointer to the object to be replicated, the thread/core id of the thread that holds the object (the source) and the size of the object to be replicated. The returned pointer must be cast to the type of the passed object.

Usage:

    static void routine(THREADED)
    {
        double *a = NULL;
        int size = 0;

        on_thread(1) {
            a = (double *)malloc(3 * sizeof(double));
            a[0] = 9.99;
            a[1] = 8.88;
            a[2] = 7.77;
            size = sizeof(double) * 3;
            printf("Before Replicate\n\n");
            printf("On thread %d values = %lf %lf %lf\n", MYTHREAD, *a, *(a+1), *(a+2));
        }

        /* Replicating the double object 'a' on all cores */
        a = (double *)SWARM_Replicate(a, 1, size, TH);

        SWARM_Barrier();
        on_one_thread {
            printf("\n\n");
            printf("After Replicate\n\n");
        }
        SWARM_Barrier();
        printf("On thread %d values = %lf %lf %lf\n", MYTHREAD, *a, *(a+1), *(a+2));
    }

Output:

    Before Replicate

    On thread 1 values = 9.990000 8.880000 7.770000

    After Replicate

    On thread 0 values = 9.990000 8.880000 7.770000
    On thread 1 values = 9.990000 8.880000 7.770000
    On thread 2 values = 9.990000 8.880000 7.770000
    On thread 3 values = 9.990000 8.880000 7.770000
    On thread 4 values = 9.990000 8.880000 7.770000
    On thread 5 values = 9.990000 8.880000 7.770000
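
The contrast between broadcasting a pointer and replicating an object can be modeled in a few lines of sequential C. This is a sketch of the semantics only, not of SWARM's implementation: ptrs[t] stands for the private pointer held by thread t, the names model_bcast_ptr and model_replicate are hypothetical, and it is an assumption of the model that the source thread keeps its original object.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Model of a pointer broadcast: every thread's pointer is set to the
   source thread's pointer, so all threads alias the same buffer. */
void model_bcast_ptr(double **ptrs, int nthreads, int source)
{
    assert(source >= 0 && source < nthreads);
    for (int t = 0; t < nthreads; t++)
        ptrs[t] = ptrs[source];
}

/* Model of replicate: every thread other than the source receives
   freshly allocated memory holding a copy of the source's object.
   Returns 0 on success, -1 if an allocation fails. */
int model_replicate(double **ptrs, int nthreads, int source, size_t bytes)
{
    assert(source >= 0 && source < nthreads);
    for (int t = 0; t < nthreads; t++) {
        if (t == source)
            continue;
        ptrs[t] = malloc(bytes);
        if (ptrs[t] == NULL)
            return -1;
        memcpy(ptrs[t], ptrs[source], bytes);
    }
    return 0;
}
```

After model_replicate, a write through one thread's pointer leaves the other threads' copies untouched; after model_bcast_ptr, the same write is visible through every pointer. This is why a broadcast address must refer to shared memory.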

2.3.6 Reduce Functions

These are a set of functions used to obtain the sum, maximum or minimum of values calculated across different threads. Each thread provides the value in its local copy to this primitive, and the operation specified by the op argument is performed on these values. op can take the values SUM, MAX or MIN, depending on the task to be performed.

int SWARM_Reduce_i(int myval, reduce_t op, THREADED)
Used for integer-valued operations. The value provided by each thread is of type integer.

Usage:

    static void routine(THREADED)
    {
        int sum = 0;

        sum = SWARM_Reduce_i(MYTHREAD, SUM, TH);
        printf("Value of sum = %d on thread %d\n", sum, MYTHREAD);
    }

Output:

    Value of sum = 28 on thread 0
    Value of sum = 28 on thread 1
    Value of sum = 28 on thread 2
    Value of sum = 28 on thread 3
    Value of sum = 28 on thread 4
    Value of sum = 28 on thread 5
    Value of sum = 28 on thread 6
    Value of sum = 28 on thread 7

long SWARM_Reduce_l(long myval, reduce_t op, THREADED)
Performs the operation on long values.

double SWARM_Reduce_d(double myval, reduce_t op, THREADED)
Performs the operation on double-precision floating point values.
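
What a reduce delivers to every thread can be modeled sequentially: take the value contributed by each thread and fold them with the chosen operation. The sketch below is illustrative only; model_reduce and the OP_* constants are hypothetical stand-ins for the SWARM call and its SUM, MAX and MIN arguments.

```c
#include <assert.h>

/* Hypothetical stand-ins for the SUM, MAX and MIN operations. */
typedef enum { OP_SUM, OP_MAX, OP_MIN } model_op_t;

/* vals[t] is the value contributed by thread t. The returned result is
   what every thread would receive from the reduce. */
int model_reduce(const int *vals, int nthreads, model_op_t op)
{
    assert(nthreads > 0);
    int acc = vals[0];
    for (int t = 1; t < nthreads; t++) {
        switch (op) {
        case OP_SUM: acc += vals[t]; break;
        case OP_MAX: if (vals[t] > acc) acc = vals[t]; break;
        case OP_MIN: if (vals[t] < acc) acc = vals[t]; break;
        }
    }
    return acc;
}
```

With eight threads each contributing its own id (0 through 7), the sum is 0 + 1 + ... + 7 = 28, the maximum is 7 and the minimum is 0.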

2.3.7 Scan Functions

Scan functions perform a task similar to that of the Reduce functions. However, these are prefix operations, in the sense that the output on each thread is based only on that thread's own value and the values of the threads before it (i.e. those threads having a smaller thread id).

int SWARM_Scan_i(int myval, reduce_t op, THREADED)
The value provided by each thread is of type integer.

Usage:

    static void routine(THREADED)
    {
        int sum = 0;

        sum = SWARM_Scan_i(MYTHREAD, SUM, TH);
        printf("Value of sum = %d on thread %d\n", sum, MYTHREAD);
    }

Output:

    Value of sum = 0 on thread 0
    Value of sum = 1 on thread 1
    Value of sum = 3 on thread 2
    Value of sum = 6 on thread 3
    Value of sum = 10 on thread 4
    Value of sum = 15 on thread 5
    Value of sum = 21 on thread 6
    Value of sum = 28 on thread 7

long SWARM_Scan_l(long myval, reduce_t op, THREADED)
Performs the operation on long values.

double SWARM_Scan_d(double myval, reduce_t op, THREADED)
Performs the operation on double-precision floating point values.
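
The prefix behaviour of a SUM scan can be written down as a sequential model. The function below is a hypothetical sketch (model_scan_sum is not a SWARM function): in[t] is the value contributed by thread t, and out[t] is the running total that thread t would receive, its own value included.

```c
#include <assert.h>

/* Inclusive prefix sum: out[t] = in[0] + in[1] + ... + in[t]. */
void model_scan_sum(const int *in, int *out, int nthreads)
{
    assert(nthreads > 0);
    int running = 0;
    for (int t = 0; t < nthreads; t++) {
        running += in[t];
        out[t] = running;
    }
}
```

When each of eight threads contributes its own id, thread 2 receives 0 + 1 + 2 = 3 and the last thread receives 28, the same total a reduce would give to every thread.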

2.4 Macros for SWARM

on_thread, on_one_thread
Control can be given to a particular thread using these macros. on_thread(i) restricts the code that follows to thread i; on_one_thread gives control to thread 0.

Usage:

    static void routine(THREADED)
    {
        /* example: execute code on the last thread */
        on_thread(THREADS - 1) {
            printf("Reached here in only one of the threads\n");
            printf("In thread %d\n\n", MYTHREAD);
        }

        SWARM_Barrier();

        /* example: execute code on one thread */
        on_one_thread {
            printf("Reached here in only one of the threads\n");
            printf("In thread %d\n", MYTHREAD);
        }
    }

Output:

    Reached here in only one of the threads
    In thread 7

    Reached here in only one of the threads
    In thread 0

SWARM_pardo
The SWARM library contains several basic pardo directives for executing loops concurrently on one or more processing cores. Typically, this is useful when an independent operation is to be applied to every location in an array, for example element-wise addition of two arrays. Pardo implicitly partitions the loop among the cores without the need for coordinating overheads such as synchronization or communication between the cores.

[Figure: array A with indices 0 to n-1, divided among threads 0, 1, ..., 7. pardo implicitly ensures that each thread works on its part of the array independently and in parallel.]

Usage:

    static void routine(THREADED)
    {
        /* example: partitioning a "for" loop among cores */
        SWARM_pardo(i, start, end, incr) {
            A[i] = A[i] * A[i];
        }
    }
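
The text does not spell out how pardo divides the iterations. The sketch below assumes the simplest scheme, contiguous blocks of equal size with the last thread absorbing the remainder, which agrees with the 3, 3 and 4 element split described for the 10-element, 3-thread example in Section 3.4. The names block_range and square_block are hypothetical.

```c
#include <assert.h>

/* Assumed block partition: thread `tid` of `nthreads` handles indices
   [*lo, *hi) out of [start, end); the last thread takes the remainder. */
void block_range(int tid, int nthreads, int start, int end, int *lo, int *hi)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads && start <= end);
    int chunk = (end - start) / nthreads;
    *lo = start + tid * chunk;
    *hi = (tid == nthreads - 1) ? end : *lo + chunk;
}

/* What one thread would execute for the squaring loop over its block. */
void square_block(int *a, int tid, int nthreads, int n)
{
    int lo, hi;
    block_range(tid, nthreads, 0, n, &lo, &hi);
    for (int i = lo; i < hi; i++)
        a[i] = a[i] * a[i];
}
```

Since the blocks do not overlap and together cover the whole range, the threads can run square_block concurrently without any locking.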

3. Working with the Example Code

The example code presented below is used to demonstrate the SWARM API. At least one function of each type (synchronization, replication, broadcast etc.) has been incorporated into the example code. The code calculates, in a parallel environment, the standard deviation of a set of numbers stored in an array. The standard deviation measures how widely the data is spread in a distribution.

3.1 Standard Deviation (σ)

If x_1, x_2, x_3, ..., x_n represents a sequence of numbers, the standard deviation σ is given mathematically as:

\[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2} \]

where μ is the mean of the distribution, given by

\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \]

3.2 SWARM for Standard Deviation Calculation

The basic idea is to distribute the elements of the array across the processors/threads. Each thread calculates the sum over its part of the array; these partial sums are used to calculate the total sum and hence the mean. The mean is then distributed to all threads, after which each thread computes the sum of squared differences from the mean over its part of the array. These are combined to calculate the final standard deviation. The example code is reproduced below, followed by a detailed explanation.
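
As a worked instance of the two formulas, take the ten values 3, 4, 7, 8, 5, 9, 0, 6, 2, 2, which are the array contents shown in the sample output of Section 3.3. The arithmetic below is computed by hand from those values and is included only as a check on the definitions.

```latex
\[ \mu = \frac{3+4+7+8+5+9+0+6+2+2}{10} = \frac{46}{10} = 4.6 \]
\[ \sum_{i=1}^{10} (x_i - \mu)^2
   = 2.56 + 0.36 + 5.76 + 11.56 + 0.16 + 19.36 + 21.16 + 1.96 + 6.76 + 6.76
   = 76.4 \]
\[ \sigma = \sqrt{\frac{76.4}{10}} = \sqrt{7.64} \approx 2.764 \]
```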

3.3 The Code itself

    #include <stdio.h>
    #include <math.h>
    #include <swarm.h>
    #include <swarm_random.h>

    static void stddev_routine(THREADED)
    {
        int i, n = 10;
        int max_random = 10, partial_sum = 0, prefix_sum = 0, global_sum = 0;
        int *arr;
        double global_mean = 0, partial_squared_sum = 0, squared_sum = 0, std_dev = 0;

        arr = SWARM_malloc(n * sizeof(int), TH);
        on_one_thread
            printf("Array memory allocated on a single thread...\n\n");
        /***************** Figure 1 ************************/

        SWARM_random_init(TH);
        SWARM_srandom(MYTHREAD + 1, TH);
        pardo(i, 0, n, 1)
            arr[i] = SWARM_random(TH) % max_random;
        SWARM_Barrier();
        /***************** Figure 2 ************************/

        on_one_thread {
            printf("Randomly generated array is:\n");
            for (i = 0; i < n; i++)
                printf("[%d] = %d\n", i, arr[i]);
            printf("\n\n");
        }
        SWARM_Barrier();

        pardo(i, 0, n, 1)
            partial_sum += arr[i];

        on_one_thread
            printf("Partial Sum:\n");
        printf("Thread %d: %d\n", MYTHREAD, partial_sum);
        /***************** Figure 3 ************************/

        SWARM_Barrier();
        on_one_thread {
            printf("\n\n");
            printf("Global sum calculated using Reduce operation...\n\n");
            printf("Global Sum:\n");
        }
        SWARM_Barrier();
        global_sum = SWARM_Reduce_i(partial_sum, SUM, TH);
        printf("Thread %d: %d\n", MYTHREAD, global_sum);
        /***************** Figure 4 ************************/

        on_one_thread {
            global_mean = 1.0 * global_sum / n;
            printf("\n\n");
            printf("Mean calculated on a thread = %f\n", global_mean);
            printf("\n\n");
            printf("Broadcasting mean...\n\n");
            printf("Global Mean:\n");
        }
        /***************** Figure 5 ************************/
        global_mean = SWARM_Bcast_d(global_mean, TH);
        /***************** Figure 6 ************************/
        SWARM_Barrier();
        printf("Thread %d: %f\n", MYTHREAD, global_mean);
        SWARM_Barrier();

        pardo(i, 0, n, 1)
            partial_squared_sum += (arr[i] - global_mean) * (arr[i] - global_mean);
        /***************** Figure 7 ************************/

        SWARM_Barrier();
        on_one_thread {
            printf("\n\n");
            printf("Partial Squared Sum:\n");
        }
        SWARM_Barrier();
        printf("Thread %d: %f\n", MYTHREAD, partial_squared_sum);
        SWARM_Barrier();
        on_one_thread {
            printf("\n\nCalculating total squared sum from partial squared sums using Scan operation...\n\n");
            printf("Total Squared Sum:\n");
        }
        SWARM_Barrier();
        squared_sum = SWARM_Scan_d(partial_squared_sum, SUM, TH);
        printf("Thread %d: %f\n", MYTHREAD, squared_sum);
        /***************** Figure 8 ************************/

        SWARM_Barrier();
        on_thread(THREADS - 1) {
            printf("\n\n");
            printf("Calculating Standard Deviation on last thread...\n\n");
            std_dev = sqrt(squared_sum / n);
            printf("Standard Deviation = %f\n", std_dev);
        }
        SWARM_Barrier();
        /***************** Figure 9 ************************/

        on_one_thread {
            printf("\n\n");
            printf("Releasing array memory...\n\n");
        }
        SWARM_free(arr, TH);
    }

    int main(int argc, char **argv)
    {
        SWARM_Init(&argc, &argv);
        SWARM_Run((void *)stddev_routine);
        SWARM_Finalize();
        return 0;
    }

Output:

    THREADS: 3
    Array memory allocated on a single thread...

    Randomly generated array is:
    [0] = 3
    [1] = 4
    [2] = 7
    [3] = 8
    [4] = 5
    [5] = 9
    [6] = 0
    [7] = 6
    [8] = 2
    [9] = 2

    Partial Sum:
    Thread 1: 22
    Thread 2: 10
    Thread 0: 14

    Global sum calculated using Reduce operation...

    Global Sum:
    Thread 2: 46
    Thread 0: 46
    Thread 1: 46

    Mean calculated on a thread = 4.600000

    Broadcasting mean...

    Global Mean:
    Thread 2: 4.600000
    Thread 1: 4.600000
    Thread 0: 4.600000

    Partial Squared Sum:
    Thread 0: 8.680000
    Thread 1: 31.080000
    Thread 2: 36.640000

    Calculating total squared sum from partial squared sums using Scan operation...

    Total Squared Sum:
    Thread 1: 39.760000
    Thread 0: 8.680000
    Thread 2: 76.400000

    Calculating Standard Deviation on last thread...

    Standard Deviation = 2.764055

    Releasing array memory...

3.4 Step by Step Explanation of the Code

Code execution begins with main. The main construct will be almost the same for all the examples using SWARM. It begins with the initialization of the SWARM parallel environment and concludes with the clean-up of this environment. In between, we have a call to a routine that is to be executed in this parallel environment. All this is achieved using three functions, SWARM_Init, SWARM_Run and SWARM_Finalize, which are further detailed in the API explanation.

The routine defined for parallelizing the calculation of the standard deviation is stddev_routine. This routine is executed on each of the threads/cores. We now describe one instance of execution. For the purpose of explanation, we assume that there are 3 threads and that the size of the data set is 10 (THREADS = 3, n = 10). Within the code, comments indicate which figure to refer to for a detailed view of the execution state at that point.

The memory that will hold the data set of numbers is dynamically allocated. This ensures that only one copy of this list is maintained across all the threads. On the other hand, variables defined locally (viz. n, partial_sum etc.) are replicated within all the threads. The figure below supports the explanation.

[Figure 1: a single shared array of 10 elements; threads 0, 1 and 2 each hold their own local partial_sum.]

Next, the array is populated randomly. SWARM also incorporates randomization functions that mimic the standard random function. A seed has to be specified for generating the random sequence; varying the seed varies the sequence created. The populating process is also done in parallel: thread 0 fills the first 3 elements, thread 1 fills the next 3 and the last thread fills the remaining elements.

[Figure 2: the shared array filled with random values by threads 0, 1 and 2; each thread still holds its own partial_sum.]

Each thread now works on its part of the array to calculate the partial sum.

[Figure 3: partial_sum computed on each of threads 0, 1 and 2 from its part of the shared array.]

Once the partial sum is calculated on each thread, the global sum is calculated using the reduce operation. The reduce function picks up a value from each thread and performs a binary operation on those values. Here we request a sum operation on the partial sum values to obtain the global sum.

[Figure 4: global_sum available on each of threads 0, 1 and 2, shown above each thread's partial_sum.]

The global average, or global mean, is now calculated on thread 0.

[Figure 5: global_mean computed on thread 0; global_sum = 46 on threads 0, 1 and 2.]

This global mean on thread 0 has to be broadcast to all other threads for the further calculation of the standard deviation.

[Figure 6: global_mean now present on threads 0, 1 and 2; global_sum = 46 on every thread.]

As each thread receives the value of the mean, it starts calculating the sum of squared differences from the mean over its part of the array. This step is directly analogous to the earlier calculation of the partial sum on each thread.

[Figure 7: partial_squared_sum computed on each of threads 0, 1 and 2.]

The scan operation is now used to add up the partial squared sums. The basic difference between the scan and reduce operations is that scan performs a prefix operation. Thus thread 0 has its own value, thread 1 has the sum of the values of threads 0 and 1, while thread 2 has the sum of the values of threads 0, 1 and 2. The figure below describes this.

[Figure 8: squared_sum on each of threads 0, 1 and 2, shown above each thread's partial_squared_sum.]

Finally, the last thread (thread 2 here) has the total squared sum of differences from the mean. We calculate the standard deviation on this thread by dividing the total by the number of elements (10) and then taking the square root.

[Figure 9: std_dev computed on thread 2 from its squared_sum; std_dev and squared_sum shown for threads 0, 1 and 2.]

Once the standard deviation is calculated, the memory allocated for the array is released back to the operating system and we exit the parallel routine.
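
The whole sequence of steps can also be traced in an ordinary sequential program, with each per-thread variable turned into an array indexed by thread id. This is a simulation sketch for following the data flow, not SWARM code: simulate_variance is a hypothetical name, the block split (last thread takes the remainder) is an assumption consistent with the description above, and the function stops at the variance so that the final step, taking the square root, is left to the caller.

```c
#include <assert.h>

#define NTHREADS 3
#define N 10

/* Sequential simulation of the example's steps. Fills scan[] with the
   per-thread scan results and returns the variance, i.e. the value whose
   square root is the standard deviation. */
double simulate_variance(const int arr[N], double scan[NTHREADS])
{
    int partial_sum[NTHREADS] = {0};
    double partial_sq[NTHREADS] = {0.0};
    double global_mean[NTHREADS];
    int chunk = N / NTHREADS; /* assumed split; last thread takes the rest */
    int global_sum = 0;
    double running = 0.0;

    /* Each thread sums its own block of the shared array. */
    for (int t = 0; t < NTHREADS; t++) {
        int lo = t * chunk;
        int hi = (t == NTHREADS - 1) ? N : lo + chunk;
        for (int i = lo; i < hi; i++)
            partial_sum[t] += arr[i];
    }

    /* Reduce: combine the partial sums into one global sum. */
    for (int t = 0; t < NTHREADS; t++)
        global_sum += partial_sum[t];

    /* Thread 0 computes the mean, then broadcasts it to every thread. */
    global_mean[0] = (double)global_sum / N;
    for (int t = 1; t < NTHREADS; t++)
        global_mean[t] = global_mean[0];

    /* Each thread accumulates squared differences over its block. */
    for (int t = 0; t < NTHREADS; t++) {
        int lo = t * chunk;
        int hi = (t == NTHREADS - 1) ? N : lo + chunk;
        for (int i = lo; i < hi; i++) {
            double d = arr[i] - global_mean[t];
            partial_sq[t] += d * d;
        }
    }

    /* Scan: thread t receives the sum of partial_sq[0..t]. */
    for (int t = 0; t < NTHREADS; t++) {
        running += partial_sq[t];
        scan[t] = running;
    }

    /* The last thread holds the total and divides by N. */
    return scan[NTHREADS - 1] / N;
}
```

For the array 3, 4, 7, 8, 5, 9, 0, 6, 2, 2 the scan values come out as approximately 8.68, 39.76 and 76.40, and the returned variance is approximately 7.64.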

4. Conclusion

SWARM thus provides a sound interface for program developers to build parallel applications without worrying about the underlying thread-level aspects. We have already implemented several important parallel primitives and algorithms using this framework. In future, we intend to add to the functionality of the basic primitives in SWARM, as well as build more multicore applications using this library.

For further details, see the paper: "SWARM: A Parallel Programming Framework for Multicore Processors", David A. Bader, Varun Kanade and Kamesh Madduri.
