Performance and Overhead Measurements on the Makbilan. Department of Computer Science, The Hebrew University of Jerusalem


Performance and Overhead Measurements on the Makbilan

Yosi Ben-Asher    Dror G. Feitelson

Department of Computer Science
The Hebrew University of Jerusalem
Jerusalem, Israel
yosi,dror@cs.huji.ac.il

Technical Report 91-5
October 1991

Abstract

The Makbilan is a research multiprocessor with 12 single-board computers in a Multibus-II cage. Each single-board computer is a 386 processor running at 20 MHz with 4 MB of memory. The software includes a local kernel (Intel's RMK) on each board, and a parallel run-time library that supports the constructs of the ParC language. We report measurements of bus contention effects, context-switch overhead, overhead for spawning new tasks, the effectiveness of load balancing, and synchronization overhead on this system.

1 Introduction

When evaluating a new system, it is important to make accurate measurements of the overheads involved in various system operations. This is not always as easy as it sounds. On the Makbilan, and bus-based shared-memory machines in general, the problem is one of isolating the measurements from irrelevant external effects. For example, the measured results depend on the following:

- Whether data structures are local or remote, and must be accessed through the bus.
- If remote accesses are made, what the contention for the bus is at that time.
- What the contention for a specific memory module is. This influences both local and remote references.
- What the load on the PEs is, i.e. how many tasks are being time shared.

The methodology used is to write application programs that perform a certain operation a large number of times, and measure their execution times. The applications are written so that the transient effects at startup are negligible relative to the total execution time, thus enabling us to attribute the whole time to the repeated performance of the operation being measured. However, care must be taken that some additional operation does not "slip in" together with the one we are interested in.

Hardware Configuration

The hardware configuration of the Makbilan, as used in these measurements, was as follows:

- A Multibus-II backplane with 20 slots.
- A CSM-001 bus controller.
- 12 single-board computers (SBC) acting as PEs. Each has a 386 processor running at 20 MHz, a 387 coprocessor, an MPC, and 4 MB of memory.
- Another SBC acting as a Unix host.
- A 186/410 board with six terminal ports.
- A 186/224 peripheral controller board interfacing the bus with the SYP 500 chassis. This chassis contains the I/O devices (disk and tape).

2 Processing Rates

Each processor in the Makbilan has an independent clock. The first experiment is designed to check how well these clocks are synchronized. The code used is as follows:

    lparfor int n; 0; proc_no-1; -1; {
        addr[n] = (int*) malloc( sizeof(int) );
    }
    for (i=0 ; i<proc_no ; i++)
      for (j=i+1 ; j<proc_no ; j++)
        for (k=0 ; k<expr ; k++) {
          lparfor int n; 0; proc_no-1; -1; {
            int *loc, rem, it1, it2;
            if ((n != i) && (n != j)) pcontinue;   /* check i against j */
            it1 = ITER1;
            it2 = ITER2;
            loc = addr[n];
            *loc = 0;
            if (n == i) {
                while (*loc < it2) {
                    if (*loc == it1)
                        rem = *(addr[j]);
                    (*loc)++;
                }
                res[i][j][k] = rem;
            } else {
                while (*loc < it2) {
                    if (*loc == it1)
                        rem = *(addr[i]);
                    (*loc)++;
                }
                res[j][i][k] = rem;
            }
          }
        }

All this code does is to check all pairs of processors against each other. This is done by spawning a pair of activities, one on each processor, which iterate a large number of times; 12,000,000 was used. A bit before the end, at 10,000,000 iterations, each activity checks the progress of the other activity and tabulates it. Everything is local, with no interference from any other processor in the system. The results are shown in Table 1. These are averages of five measurements for each pair. Row i of the table contains the rates of all the processors relative to processor number i. Thus a row with many positive numbers, e.g. row 3, indicates a relatively slow processor. A row with many negative numbers, e.g. row 7, indicates a relatively fast processor. The greatest difference measured,

Table 1: Relative rates of the 12 processors currently in the Makbilan, in parts per million. (The individual entries were not recovered from the transcription.)

between processor 3 and processor 6, is only 80 parts per million. At least the last digit of these measurements should not be considered accurate. For example, the table shows that processor #8 is slower than #11 by 5 ppm, and processor #4 is slower than #8 by an additional 4 ppm, but processor #4 is faster than #11 by 3 ppm.

3 Bus Contention

The Makbilan backplane is a Multibus-II, with a throughput of 40 MB/s. This seems to be more than enough to service a number of transfers within one processor cycle, implying that the bus should not be a bottleneck. But the Makbilan uses the bus mainly for remote memory access, and the bus protocol requires that the bus be held until the transaction terminates. The memory has dual ports, and access from the bus may suffer several wait states. Thus the memory response time may be a bottleneck that limits the bus throughput. The experiments in this section are designed to check this.

3.1 Contention for Bus Access

This subsection contains two experiments: the first shows how performance degrades when more processors try to use the bus simultaneously, and the second shows how the mix of local and global accesses influences the performance.

Experiment 1

This experiment is repeated with different numbers of processors. The code for each run is as follows:

    int g;
    lparfor int j; 1; num_of_procs; -1; {
        int k;
        /*register*/ int *p, l;
        p = &g;
        for (k=0 ; k<iter ; k++) {
            *p=l;
            /* repeated 100 times */
            *p=l;
        }
    }

The assignment in the loop is repeated 100 times in-line to reduce the effect of the loop handling itself. The experiment was conducted twice, once with p and l defined as local variables and once as registers. Note that the global variable is actually in the local memory of the first processor, so in the measurement for i processors only i - 1 are really using the bus. In particular, the measurement for one processor is really a local access, not a global one. The results are shown in Fig. 1.
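The measurement kernel can be sketched in portable C (an illustrative stand-in, not the ParC original; the report repeats the store 100 times, 10 is used here):

```c
#include <assert.h>

/* Stand-in for the measurement kernel: the store is expanded in-line
 * via a macro so that loop-control overhead is negligible relative to
 * the stores being timed. */
#define REPEAT10(stmt) stmt stmt stmt stmt stmt stmt stmt stmt stmt stmt

long store_kernel(long iter)
{
    volatile long g = 0;   /* stands in for the shared variable */
    long l = 1, k;
    for (k = 0; k < iter; k++) {
        REPEAT10(g = l;)   /* 10 in-line stores per loop iteration */
    }
    return g;
}
```

The volatile qualifier keeps the compiler from collapsing the repeated stores, which is what a modern compiler would otherwise do to this pattern.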
It is obvious that a linear relationship exists between the number of processors and the elapsed time, indicating that the bus cannot support multiple requests within a single processor cycle. Thus the bus is indeed a bottleneck. As may be expected, when the

local variables are in registers the elapsed time is slightly shorter. Note, however, that as more processors are added the difference decreases until it disappears. The reason for this is that using registers reduces the time needed to issue the next global access, but the bottleneck is the global access itself. Issuing the instruction faster does not help.

Figure 1: Measurement showing that the bus is not faster than the processors. The time for global accesses from each processor is shown, for local variables in memory and in registers.

Experiment 2

The load on the bus depends, of course, on the access rate from the different processors. This in turn depends on the ratio of local to global memory references. This experiment shows how a larger percentage of local references makes the bus seem faster. The basic code is similar to the previous experiment, except that each processor does global accesses to a different global variable. The global variables are arranged so that each is in the memory of a different processor. This ensures that the bottleneck is indeed the bus (and its protocol), and not any overloading effects on a specific memory module.

    /* allocate memory in every processor */
    lparfor int j; 1; proc_no; -1; {
        M[j] = (int *) malloc(sizeof(int));
    }

    /* repeat for different numbers of PEs */
    lparfor int j; 1; num-of-procs; -1;

        int l, k, *p;
        /* arrange pointers */
        if (get_pid() < proc_no)
            p = M[get_pid()+1];
        else
            p = M[1];
        /* perform local and global accesses */
        for (k=0 ; k<iter ; k++)
            DOx( *p, l )

The DOx macro includes 100 instructions, out of which x are assignments to the global variable, of the form *p=l, and the rest are increments of the local variable, i.e. l++. Results for five values of x are reported: 1, 10, 20, 50, and 100. For example, the macro DO50(x,y) looks like this:

    #define DO50(x,y) \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y;

As may be expected, when practically all the accesses are local, the number of processors working in parallel has no influence (x = 1). When a certain fraction of accesses are global, bus contention begins to take its toll (x = 10 to x = 50), and the larger the fraction of global accesses the larger the elapsed time becomes (x = 100). The results for heavy bus activity are practically identical to those from experiment 1. This means that it is not possible to decouple the degradation due to bus contention from the degradation due to contention for the same memory module.

3.2 Bus Access Priorities

The Multibus-II arbitration mechanism gives higher priority to PEs with lower serial numbers. This experiment was designed to check how big the difference in performance can be. The code is:

    /* allocate a counter in every processor */
    lparfor int j; 1; proc_no; -1; {
        C[j] = (int *) malloc(sizeof(int));
        *(C[j]) = 0;
    }

Figure 2: Effect of different ratios of global to local accesses on the degradation due to bus contention.

    flag = 1;
    /* increment neighbor's counter until threshold reached */
    lparfor int j; 1; proc_no; -1; {
        int k, *p;
        p = C[(j<proc_no)? j+1 : 1];
        while (flag) {
            for (k=0 ; k<iter ; k++)
                *p = *p + 1;
            if (*p >= THRESHOLD)
                flag = 0;
        }
    }

Again, each processor increments a counter in another processor's local memory. Note that the accesses are done in a double loop: the outer loop checks the termination condition (when the first PE reaches the threshold value), and the inner one performs a certain constant number of iterations. This is done to reduce the effect of checking the termination condition. The results are tabulated in Table 2. The relative performance displays a step-like behavior. Using PE number 1 as a reference point, we find that PE number 2 achieves the same number of accesses, PE number 3 achieves 90% of that, PEs numbers 4 through 6 achieve about 2/3, and

Table 2: Number of accesses performed by different PEs, by serial number, when all compete for the bus. (The numeric entries were not recovered from the transcription.)

PEs 7 through 10 achieve only 1/2. The reason for this behavior is that the arbitration mechanism operates in a gated manner. First, all the contending PEs are noted. Then they are serviced according to their serial numbers. While this is being done, new requests that arrive are blocked out. Only when the previous set have all been serviced does a new round begin. Thus PEs with low serial numbers, which are serviced early in the round, manage to generate a new request before the round is over, and therefore get into the next round as well. PEs late in line are serviced towards the end of the round, and do not manage to generate a new request before the round is over. Therefore they miss alternate rounds. PEs in the middle sometimes make it and sometimes not.

4 Remote Access Penalty

Obviously a remote access takes more time than a local one. But how much more? The experiments in this section are designed to answer this question, and to characterize the performance in different operating conditions.

4.1 Experiments

All these experiments have the same structure. A set of activities is spawned, and they access some variables in a loop. In each set there is always one activity that accesses a local variable, such that no other activity accesses any variable in its local memory. This activity is therefore isolated from all the others, and serves as a reference point. When it completes its iterations, it notes the number of iterations that the other activities have completed in the same time. This gives the relative execution rates. Of course the measured time is for the whole instruction execution cycle, not only the memory reference to get the operands. To reduce this effect, the instruction that is executed is the increment instruction, which performs an increment on a memory address. This requires two memory accesses, and only minimal decoding.
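The gated-round behavior can be illustrated with a toy simulation (purely illustrative, not the real arbitration logic; the "think time" parameter, the number of slots a PE needs to regenerate a request, is a hypothetical model parameter):

```c
#include <assert.h>

#define NPE 10

/* Toy model of gated arbitration: each round services all currently
 * pending PEs in serial-number order.  A PE serviced with at least
 * `think` service slots left in the round re-requests in time for the
 * next round; otherwise it misses one round. */
void gated_rounds(int rounds, int think, int acc[NPE])
{
    int next[NPE], i, r;          /* next[i]: first round PE i can join */
    for (i = 0; i < NPE; i++) { next[i] = 0; acc[i] = 0; }
    for (r = 0; r < rounds; r++) {
        int set[NPE], n = 0, p;
        for (i = 0; i < NPE; i++)         /* pending PEs, in id order */
            if (next[i] <= r) set[n++] = i;
        for (p = 0; p < n; p++) {
            i = set[p];
            acc[i]++;
            /* slots remaining after this PE's service: n - 1 - p */
            next[i] = (n - 1 - p >= think) ? r + 1 : r + 2;
        }
    }
}
```

With 10 PEs and a think time of 3 slots, six rounds yield access counts 6,6,6,6,4,4,4,3,3,3: the low-numbered PEs make every round, the middle ones about 2/3 of them, and the last ones every other round, qualitatively matching the step-like behavior reported above.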
In addition, the overhead for executing the loop itself (which is local) may obviously bias the performance figures. To reduce this effect,

the loop contains in-line code for 250 consecutive increment instructions. The code for the first experiment is:

    pparblock
    {
        /* local undisturbed */
        register int *p;
        int i, res2, res3;

        p = &i;
        while (flag1 == 0) ;
        for ((*p)=0 ; (*p)<iter ; ) {
            (*p)++;
            /* repeated 250 times */
            (*p)++;
        }
        res2 = *p2;
        res3 = *p3;
        sync;
    }
    :
    {
        /* local disturbed */
        register int *p;
        int i, j;

        p3 = &j;
        flag2 = 1;
        p2 = &i;
        p = &i;
        while (flag1 == 0) ;
        for ((*p)=0 ; (*p)<iter ; ) {
            (*p)++;
            /* repeated 250 times */
            (*p)++;
        }
        sync;
    }
    :
    {
        /* remote */
        register int *p;

        while (flag2 == 0) ;
        p = p3;
        flag1 = 1;
        for ((*p)=0 ; (*p)<iter ; ) {
            (*p)++;
            /* repeated 250 times */
            (*p)++;
        }
        sync;
    }

    access conditions                                       rate
    local   free memory                                     1.00
            contention from a single remote activity        0.74
            contention from multiple remote activities      0.57
    remote  to unused memory, no bus contention             0.11
            to locally used memory, no bus contention       0.11
            with heavy bus contention                       0.01-0.03

Table 3: Relative access rates under different conditions.

The variable flag1 is used to signal that all are ready and the measurement can begin. flag2 signals that the global pointer p3 now points to a local variable, and can be used by the process that performs the remote accesses. res2 and res3 take a snapshot of the progress of the locally disturbed activity and the remote activity, respectively, at the instant that the locally undisturbed activity finishes its iterations. The other experiments follow the same principle. To measure remote access to an unused memory, the middle (local disturbed) activity is terminated with a pcontinue after it allocates local memory and assigns its address to p3. malloc is used so that this memory will persist after the activity terminates. To measure contention from multiple activities, the third activity (remote) is replaced by a pparfor that spawns proc_no - 2 additional activities, which all access the same variable.

4.2 Results

The results of these experiments are summarized in Table 3. Remote references cause a degradation in the local access rate, and reduce it to 56-74% of the rate when there are no remote references. Under ideal conditions, remote references take nine times as long as local references to an unused memory, or 5-7 times as long as local references that are disturbed by remote references. Under less ideal conditions, where bus contention also takes its toll, remote references are much slower. The most extreme difference is between local references to an unused memory and remote references that suffer severe contention; in this case the remote references are from thirty to seventy times slower.
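The slowdown factors quoted above follow directly from the relative rates in Table 3 (the function name is illustrative):

```c
#include <assert.h>
#include <math.h>

/* Ratio of access times implied by two relative rates from Table 3:
 * e.g. a rate of 0.11 relative to 1.00 means each access takes about
 * nine times as long. */
double slowdown(double ref_rate, double rate)
{
    return ref_rate / rate;
}
```

For example, 1.00/0.11 gives roughly 9, and 0.74/0.11 gives roughly 6.7, matching the "nine times" and "5-7 times" figures in the text.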
The large variance is due to the fact that different PEs have different priorities, and therefore suffer different degrees of degradation under contention. Under reasonable conditions, where all the memories are accessed both locally and through the bus, and there is moderate contention, the ratio between local and remote references is about a factor of ten to twenty.

5 Scheduling

The scheduling overhead is the time required to perform a context switch. The main problem with measuring this quantity is that the run-time library has a few tasks that compete with the user tasks for the CPU. This adds overhead that is unaccounted for.

5.1 The Default Library

This experiment measures the scheduling overhead by spawning a set of activities, one per PE. Each activity performs an empty loop, yielding the processor in each iteration. Thus the average time required per iteration is actually the time required to reschedule the activity. This scenario is repeated for different loads, so as to distinguish between the constant overhead due to the library tasks and the scheduling overhead that is proportional to the number of activities that are being time shared. The code used is the following:

    /* control measurement: the empty loop itself */
    lparfor int i; 1; proc_no; -1; {
        start = get_loctime();
        for (j=0 ; j<iter ; j++) ;
        stop = get_loctime();
        control[i] = (float) stop - start;
    }

    if (load == 1)
        sem_init( &ready, 1 );
    else
        sem_init( &ready, 0 );

    pparblock
    {
        /* create loading conditions */
        pparfor int l; 2; load; 1; {
            lparfor int i; 1; proc_no; -1; {
                int j;
                if ((l == 2) && (i == 1))
                    V( &ready );   /* signal when ready */
                for (j=0 ; j<iter+load+3 ; j++)
                    KN_sleep( 0 );
            }
        }
    }
    :
    {
        /* perform measurements */
        P( &ready );   /* wait for load to be ready */
        lparfor int i; 1; proc_no; -1; {
            int j, start, stop;
            start = get_loctime();
            for (j=0 ; j<iter ; j++)

                KN_sleep( 0 );
            stop = get_loctime();
            for (j=0 ; j<cushion ; j++) ;
            res[i] = (float) stop - start;
            res[i] = res[i] - control[i];
        }
    }

Figure 3: Overhead for scheduling: time per iteration vs. load (tasks per PE), for the default and gang-scheduling libraries.

The control measurement evaluates the time needed to do the loop itself. The ready semaphore is used to trigger the measurement after the additional loading activities have been spawned. The cushion delays writing the results into the shared array, so as not to interfere with other measurements that have not terminated yet. The results of this experiment are shown in Figure 3. The average over 12 PEs was measured. As expected, the total scheduling overhead in a scheduling round is a linear function of the load, obeying the expression 0.140 + 0.072 * load (where the measurement is in ms). This means that each RMK context switch takes about 72 microseconds. The additional constant overhead of 140 microseconds is attributed to the scheduling and execution of the get_work task. This overhead would be bigger if this task actually had something to do; 140 microseconds is the time needed to find out that there is nothing to do.
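The fitted linear model, using the two constants stated in the text (140 microseconds fixed cost, 72 microseconds per context switch), can be written as:

```c
#include <assert.h>
#include <math.h>

/* Total per-round scheduling overhead in ms as a function of the
 * number of tasks time-shared on each PE, per the measured fit:
 * a 0.140 ms constant for the get_work task plus 0.072 ms per
 * RMK context switch. */
double sched_overhead_ms(int load)
{
    return 0.140 + 0.072 * load;
}
```

At a load of one task per PE this predicts about 0.21 ms per scheduling round, dominated by the constant term; at ten tasks per PE the per-switch term dominates.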

It should be noted that the measurements show that the overhead on the master PE (PE #1) is between 2 and 4% higher than on the other PEs. This difference may be due to the fact that the get_work tasks on all the other PEs access global data structures that are located in the memory of the first PE, thus slowing it down.

5.2 The Gang Scheduling Library

This experiment is designed to measure the context-switching overhead when the gang scheduling library is used. With this library, context switching is coordinated and done simultaneously on all the PEs, by means of a broadcast interrupt. To measure the overhead, a set of activities is created, one per PE. All the activities loop endlessly on a local variable, but one of them calls the library function ts_sched in each iteration; this function sends the scheduling broadcast. The code used is:

    int flag=1;
    pparfor int l; 1; load; 1; {
        lparfor int i; 1; proc_no; -1; {
            int j, start, stop;
            if (i == 2) {   /* to avoid flag traffic */
                start = get_loctime();
                for (j=0 ; j<iter ; j++)
                    ts_sched( 0 );
                stop = get_loctime();
                flag = 0;
                res = (float) stop - start;
            } else {
                while (flag)
                    for (j=0 ; j<1000 ; j++) ;
            }
        }
    }

The results of this experiment are also shown in Figure 3. Again the overhead is more or less linear with the load, but in this case there is no added constant because there is no get_work task. However, the overhead is significantly higher than for the default library: it is approximately 0.2 * load. This is reasonable, since every scheduling operation here involves two RMK scheduling operations, a broadcast, and the execution of the scheduler task.

6 Spawning New Activities

The question here is "how much time does it take to create a new activity, execute it, and terminate it?" This is important, since activities that are too short relative to this overhead usually cause a degradation rather than an improvement in performance.

6.1 Experiment 1

In this experiment a set of empty activities is spawned, one per PE. This is repeated a large number of times, and the average time needed for a spawn is reported. The following code was used:

    start = get_loctime();
    for (j=0 ; j<lim ; j++) ;
    stop = get_loctime();
    control = (float) stop - start;

    start = get_loctime();
    for (j=0 ; j<lim ; j++) {
        lparfor int i; 1; proc_no; -1; ;
    }
    stop = get_loctime();
    res = (float) stop - start;

    time_per_spawn = (res - control)/lim;

Various values of lim were used; the rounded results were typically the same to within less than 1%. The sequence of events in each iteration, in the default library, is as follows:

1. The parent activity runs on the master PE, and
   (a) Starts another loop iteration.
   (b) Spawns a new set of activities. This involves
       i. Initialization of the spawn descriptor.
       ii. Linking the descriptor to the global list.
       iii. Suspending execution.
2. The get_work task runs (on all PEs), doing
   (a) Get a lock on the global descriptor list and get a new activity.
   (b) Create a new RMK task to execute the activity.
   (c) Yield the processor.
3. The new activities execute on all the PEs. This includes

   (a) The selector runs and gets whatever is needed from the parent.
   (b) The activity terminates and the task is deleted. The last one links the parent's descriptor to the resume list.
4. The get_work task runs again (on the master PE), doing
   (a) Resume the parent.
   (b) Yield the processor.

Figure 4: Results for experiment 1: overhead for spawning empty activities on all the PEs, for the default and gang-scheduling libraries.

The results are shown in Figure 4. These results indicate that as more PEs are used, the overhead increases. This increase is slightly superlinear, probably due to contention for the bus and for the spawn descriptor, and possibly also to the fact that the additional PEs have to perform remote accesses. The increased overhead in the gang scheduling library is probably due to the fact that the library scheduler must run to perform all the necessary context switches.

6.2 Experiment 2

The main problem with the previous experiment is that it measures the total time to perform a parallel construct, not just the time to spawn an activity. To measure the net time required just to get an activity and execute it, without the time needed for the spawn, we need to have one PE that just creates activities all the time, while other PEs keep the global list of spawn descriptors from becoming empty. This is achieved by the following code:

Figure 5: Results for experiment 2: overhead for creating empty activities from a remote descriptor, as a function of the number of generators.

    start = get_loctime();
    lparfor int j; 1; proc_no; -1; {
        int k, lim;
        if (j == 1) pcontinue;   /* leave PE #1 free */
        lim = ITER;
        for (k=0 ; k<lim ; k++) {
            /* spawn single task on PE #1 */
            lparfor int l; 1; 1; -1; ;
        }
    }
    stop = get_loctime();
    res = (float) stop - start;

    time_per_spawn = res/(iter*(proc_no-1));

The results are shown in Figure 5. When only one activity generates new descriptors, the PE that creates the activities has to wait for the next one each time. When there are two or more, it does not have to wait, so the time per spawn is reduced. If there are too many generators, however, they contend for the global descriptor list, causing a degradation in performance.
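The normalization in the last line of the code amortizes the elapsed time over all spawned activities; a sketch of the same arithmetic (function name illustrative):

```c
#include <assert.h>
#include <math.h>

/* Amortized cost of one spawn in experiment 2: total elapsed time
 * divided by the number of activities created, i.e. iter spawn loops
 * on each of the proc_no - 1 generator PEs. */
double time_per_spawn(double elapsed, long iter, int proc_no)
{
    return elapsed / (double)(iter * (proc_no - 1));
}
```

For example, 9 seconds of elapsed time over 1000 iterations on 9 generators corresponds to 1 ms per spawn, which is the order of magnitude reported in the summary below.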

Figure 6: Results for experiment 3: overhead for creating empty activities from a large descriptor.

6.3 Experiment 3

As a final touch, we also consider a simpler version of the previous experiment. In this case a large number of activities are spawned at once, and executed by all the processors. Thus there is no contention for the global descriptor list due to the addition of new descriptors, but only contention due to the execution of activities. However, one of the PEs has the descriptor in its local memory, while the others have to access it via the bus. The code is:

    lim = ITER*proc_no;
    start = get_loctime();
    pparfor int i; 1; lim; 1; ;
    stop = get_loctime();
    res = (float) stop - start;
    time_per_spawn = res/iter;

The results of this experiment are shown in Fig. 6.

6.4 Summary

The experiments described in this section show that the effective overhead for spawning an activity is highly variable. The reason for this is that spawning is communication intensive, and hence

depends on the loading conditions. The minimum time needed just to create a task, start it, and terminate it is about 1 ms. This number grows when there is contention for the bus, either due to other tasks being spawned, or due to computation.

7 Load Balancing and Granularity

The ParC run-time library spawns tasks as follows: the activity that performs the spawn places a spawn descriptor in global memory, and each processor checks the descriptor and creates a new activity once every scheduling round. This should provide a measure of load balancing, because processors that are overloaded have a longer scheduling round, so they will take additional work at a slower rate. It should be noted that load balancing can be defined in two ways: according to the numerical load and according to the work/performance load. Balancing the numerical load means that each processor will have the same number of activities. Balancing the work/performance means that each processor will do work proportional to its performance: inefficient processors will do less, so as not to cause a delay. The two are not the same, even if all the activities are identical, because the efficiency of different processors is indeed different. This is a result of the global memory access penalty and the different bus access priorities. The question of load balancing is further complicated by the issue of granularity. In particular, the system might be prevented from performing any load balancing because the program is partitioned into activities with the wrong granularity. The experiment reported in this section shows how all of this is tied together. The methodology used is to measure how much time it takes to perform a constant amount of work using different numbers of activities. The work is 100,000,000 assignments to a global variable. This is divided equally among a certain number of activities, which are then executed on a 10-processor system.
The code for this experiment is simply:

    /* N = 100,000,000, the total number of assignments */
    exp = 0;
    for (i=1 ; i<=N ; i=i*10) {
        exp++;
        start = get_loctime();
        pparfor int j; 1; i; 1; {
            int l, k, n;
            l = N/i;
            for (k=0 ; k<l ; k++)
                n = g;
            res[exp][get_pid()]++;
        }
        stop = get_loctime();
        time[exp] = stop - start;
    }

The experiment is divided into phases, each of which does the same work with a different number of activities. The res array counts the number of activities performed on each processor in each

phase. The time array tabulates the elapsed time of each phase.

Table 4: Elapsed time for computations with different numbers of activities. (The numeric entries were not recovered from the transcription.)

Figure 7: Number of activities executed on each processor (percent of activities vs. PE serial number), for different total numbers of activities.

The results for the elapsed time are given in Table 4. The best speedup, which is about 4.5, is achieved when on the order of a thousand activities are used (a more accurate measurement showed the optimal number to be 1900 activities). At the finest granularity measured, the computation is three times slower than doing the work serially, because the granularity is much too fine. Most of the time is spent in overhead just creating all those activities. For each measurement (identified by the number of activities used), the percentage of activities executed on each processor is shown in Fig. 7. The results indicate that the best performance is obtained when the loads are not numerically equal. Three cases may be distinguished:

1. The number of activities is too small and their granularity is too large. In this case a more efficient processor may finish executing its activity before less efficient processors, but by the time it finishes all the activities have been picked up already. Therefore it idles, and the time

is determined by the slower processors.

2. The optimal case: a medium number of activities with medium granularity. In this case processor #1 does 2.5 times more work than the others. Its increased efficiency is due to the fact that both the spawn descriptor and the global variable that are used in the computation are in its local memory. Processor #2 also does a relatively large amount of work. This is due to the fact that it has the highest bus priority of all the processors that get work through the bus.

3. The number of activities is too large and their granularity is too small. In this case the contention for the spawn descriptor takes its toll, and limits the performance. All the processors execute about the same number of activities, but it takes them a long time because they have to wait for access to the descriptor. Processor #1 becomes relatively slow, because the heavy burden on its local memory slows it down.

8 Synchronization Overhead

ParC has three synchronization mechanisms: an atomic fetch-and-add instruction (faa), a barrier synchronization instruction (sync), and semaphores. In this section we measure the performance of the first two, and show how it depends on the number of activities involved.

8.1 Fetch-And-Add

The faa instruction is implemented by a set of locks. The address of the variable in question is hashed to one of the locks. The atomic test-and-set instruction from the 386 instruction set, which is supported across the Multibus-II, is used to obtain the lock. Busy waiting is used if the lock is not available. Once the lock is obtained, the faa operation is simulated. Obviously this implementation serializes concurrent faa instructions to the same variable. The following code was used to gauge the effect of this serialization:

    start = get_loctime();
    lparfor int i; 1; proc_no; -1; {
        int *p, j;
        p = &g;   /* or p = ptr[ (i==proc_no)?1:i+1 ]; */
        for (j=0 ; j<iter ; j++)
            faa( p, 1 );
    }
    stop = get_loctime();

As indicated by the comment, two cases were checked.
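A minimal sketch of the lock-based faa implementation described above, in C11 rather than ParC (the number of locks and the hash function are illustrative, and the priority-raising mentioned later is omitted):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NLOCKS 16   /* illustrative; the report does not give the real size */

/* Lock array; zero-initialized atomic_flag is clear on common implementations. */
static atomic_flag locks[NLOCKS];

/* faa: hash the target address to a lock, busy-wait on a test-and-set
 * lock, then simulate the fetch-and-add under the lock. */
int faa(int *p, int v)
{
    atomic_flag *lk = &locks[((uintptr_t)p >> 2) % NLOCKS];
    int old;
    while (atomic_flag_test_and_set(lk))
        ;                        /* busy wait, as in the report */
    old = *p;                    /* simulate the atomic add */
    *p = old + v;
    atomic_flag_clear(lk);
    return old;
}
```

Hashing the address spreads independent faa operations over different locks; only concurrent faas on addresses that hash to the same lock serialize, which is the behavior the experiment measures.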
In the first, all the activities access the same variable. In the second, each accessed a distinct variable placed in the memory of another processor. The results are plotted in Fig. 8. As expected, the overhead rises with the number of processors. When the faas are directed at distinct variables, the relationship is linear: it reflects the global

operations on the bus. When all the faas are directed at the same variable, the relationship becomes superlinear for a large number of processors. This is a result of the contention for the same lock. Note that the overhead is quite large relative to the simple nature of the operation (about 0.4 ms per faa). The extra overhead is due to raising the priority of the activity while it holds the lock, so as to prevent it from being preempted.

Figure 8: Time for faa instructions by each of a set of processors, to the same variable and to different variables.

8.2 Barrier Synchronization

The sync instruction imposes a barrier synchronization across sibling activities, i.e. across activities that were spawned together by the same construct. The implementation uses a counter and a flag in the spawn descriptor: each activity that reaches the barrier decrements the counter and then spins on the flag. The last to arrive raises the flag and releases all its waiting siblings (it also sets things up for the next barrier). In order to avoid excessive waste if there are more activities than processors, the busy-wait loop includes an instruction to yield the processor if the flag is not up. The counter is decremented using faa, so there is an obvious serialization of this procedure. In addition, when there are more activities than processors, context switches are needed so that all will run and reach the barrier. The experiments in this subsection quantify these two effects.

Experiment 1

In this experiment there is exactly one activity on each processor. These activities synchronize repeatedly, and the time required is measured. The code used is:

    start = get_loctime();

    lparfor int i; 1; proc_no; -1; {
        int j;
        for (j=0 ; j<iter ; j++)
            sync;
    }
    stop = get_loctime();

Figure 9: Time for the sync instruction by a set of processors. The average is shown.

The results are shown in Fig. 9. The measurement for one processor is much lower than the rest, because everything is local. Note that the measured time includes the time for the loop itself. Measurements show the loop overhead per iteration to be small, so the difference is minor.

Experiment 2

When the number of activities exceeds the number of processors, context switches are needed to allow them all to run and reach the barrier. This is greatly exacerbated if all the activities do not yet exist when the first ones reach the barrier. In this experiment 10 processors are used. Three types of measurements are made:

1. A large number of activities are spawned, and they synchronize repeatedly a large number of times. Thus in effect we can measure the average time to perform a sync when all the activities have been spawned already. This is a direct extension of the previous experiment; it is shown by the "just sync" graph in Fig. 10. The results indicate a linear relation between

the number of activities and the time required, which is expected considering that the implementation of the sync is actually serial. The constant of proportionality includes the time for the faa on the counter and the context switch.

[Figure 10: Time for sync instruction by a large number of activities. Performance degrades sharply if the activities have to be spawned. Curves: just sync, spawn & sync, just spawn; time per operation [ms] against number of activities.]

2. A large number of activities are spawned and they synchronize once. Thus the price of spawning also enters the picture. This is shown by the "spawn & sync" graph in Fig. 10.

3. To calibrate the previous measurement, we also measured the time just to spawn the tasks (the "just spawn" graph in Fig. 10). As expected, this is linear in the number of activities, and the constant of proportionality is about 1 ms per activity.

The important effect to notice is that the time to spawn and sync is much larger than the sum of its parts. In fact, it seems to have a quadratic relation to the number of activities. This is a result of the way that the sync is implemented, and specifically, a result of using busy waiting with a yield instruction. The system then operates according to the following scenario. First, one activity is spawned by the get_work task. This activity tries to sync, finds that its siblings have not arrived yet, and yields. get_work runs again, and spawns another activity. This activity too tries to sync, fails, and yields. The first activity is then rescheduled, tries again, fails again, and yields again. This pattern is repeated until all the activities are spawned: after each one, all the existing activities try to sync. Thus the time to spawn and sync N activities is proportional to the sum of N times the spawn overhead, plus (N/P)^2/2 times the overhead to perform a context switch and check the sync condition.


More information

Lecture 17: Threads and Scheduling. Thursday, 05 Nov 2009

Lecture 17: Threads and Scheduling. Thursday, 05 Nov 2009 CS211: Programming and Operating Systems Lecture 17: Threads and Scheduling Thursday, 05 Nov 2009 CS211 Lecture 17: Threads and Scheduling 1/22 Today 1 Introduction to threads Advantages of threads 2 User

More information

CSE 410 Final Exam 6/09/09. Suppose we have a memory and a direct-mapped cache with the following characteristics.

CSE 410 Final Exam 6/09/09. Suppose we have a memory and a direct-mapped cache with the following characteristics. Question 1. (10 points) (Caches) Suppose we have a memory and a direct-mapped cache with the following characteristics. Memory is byte addressable Memory addresses are 16 bits (i.e., the total memory size

More information

Module 6: INPUT - OUTPUT (I/O)

Module 6: INPUT - OUTPUT (I/O) Module 6: INPUT - OUTPUT (I/O) Introduction Computers communicate with the outside world via I/O devices Input devices supply computers with data to operate on E.g: Keyboard, Mouse, Voice recognition hardware,

More information

Operating Systems Comprehensive Exam. There are five questions on this exam. Please answer any four questions total

Operating Systems Comprehensive Exam. There are five questions on this exam. Please answer any four questions total Operating Systems Comprehensive Exam There are five questions on this exam. Please answer any four questions. ID 1 2 3 4 5 total 1) The following questions pertain to McKusick et al's A Fast File System

More information

Computer Organization ECE514. Chapter 5 Input/Output (9hrs)

Computer Organization ECE514. Chapter 5 Input/Output (9hrs) Computer Organization ECE514 Chapter 5 Input/Output (9hrs) Learning Outcomes Course Outcome (CO) - CO2 Describe the architecture and organization of computer systems Program Outcome (PO) PO1 Apply knowledge

More information

THE FINANCIAL CALCULATOR

THE FINANCIAL CALCULATOR Starter Kit CHAPTER 3 Stalla Seminars THE FINANCIAL CALCULATOR In accordance with the AIMR calculator policy in eect at the time o this writing, CFA candidates are permitted to use one o two approved calculators

More information

The performance of single-keyword and multiple-keyword. pattern matching algorithms. Bruce W. Watson. Eindhoven University of Technology

The performance of single-keyword and multiple-keyword. pattern matching algorithms. Bruce W. Watson. Eindhoven University of Technology The performance of sinle-keyword and multiple-keyword pattern matchin alorithms Bruce W. Watson Faculty of Mathematics and Computin Science Eindhoven University of Technoloy P.O. Box 513, 5600 MB Eindhoven,

More information

Multiprocessor and Real- Time Scheduling. Chapter 10

Multiprocessor and Real- Time Scheduling. Chapter 10 Multiprocessor and Real- Time Scheduling Chapter 10 Classifications of Multiprocessor Loosely coupled multiprocessor each processor has its own memory and I/O channels Functionally specialized processors

More information

1 Theory 1 (CSc 473): Automata, Grammars, and Lanuaes For any strin w = w 1 w 2 :::w n, define w R to be the reversal of w: w R = w n w n 1 :::w 1. Su

1 Theory 1 (CSc 473): Automata, Grammars, and Lanuaes For any strin w = w 1 w 2 :::w n, define w R to be the reversal of w: w R = w n w n 1 :::w 1. Su Masters Examination Department of Computer Science March 27, 1999 Instructions This examination consists of nine problems. The questions are in three areas: 1. Theory and Alorithms: CSc 473, 545, and 573;

More information

Partial Marking GC. A traditional parallel mark and sweep GC algorithm has, however,

Partial Marking GC. A traditional parallel mark and sweep GC algorithm has, however, Partial Marking GC Yoshio Tanaka 1 Shogo Matsui 2 Atsushi Maeda 1 and Masakazu Nakanishi 1 1 Keio University, Yokohama 223, Japan 2 Kanagawa University, Hiratsuka 259-12, Japan Abstract Garbage collection

More information

General Design of Grid-based Data Replication. Schemes Using Graphs and a Few Rules. availability of read and write operations they oer and

General Design of Grid-based Data Replication. Schemes Using Graphs and a Few Rules. availability of read and write operations they oer and General esin of Grid-based ata Replication Schemes Usin Graphs and a Few Rules Oliver Theel University of California epartment of Computer Science Riverside, C 92521-0304, US bstract Grid-based data replication

More information

The Art and Science of Memory Allocation

The Art and Science of Memory Allocation Logical Diagram The Art and Science of Memory Allocation Don Porter CSE 506 Binary Formats RCU Memory Management Memory Allocators CPU Scheduler User System Calls Kernel Today s Lecture File System Networking

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

G52CON: Concepts of Concurrency

G52CON: Concepts of Concurrency G52CON: Concepts of Concurrency Lecture 11: Semaphores I" Brian Logan School of Computer Science bsl@cs.nott.ac.uk Outline of this lecture" problems with Peterson s algorithm semaphores implementing semaphores

More information