Performance and Overhead Measurements on the Makbilan. Department of Computer Science, The Hebrew University of Jerusalem


Performance and Overhead Measurements on the Makbilan

Yosi Ben-Asher    Dror G. Feitelson

Department of Computer Science
The Hebrew University of Jerusalem
Jerusalem, Israel
yosi,dror@cs.huji.ac.il

Technical Report 91-5
October 1991

Abstract

The Makbilan is a research multiprocessor with 12 single-board computers in a Multibus-II cage. Each single-board computer is a 386 processor running at 20 MHz with 4 MB of memory. The software includes a local kernel (Intel's RMK) on each board, and a parallel run-time library that supports the constructs of the ParC language. We report measurements of bus contention effects, context-switch overhead, overhead for spawning new tasks, the effectiveness of load balancing, and synchronization overhead on this system.

1 Introduction

When evaluating a new system, it is important to make accurate measurements of the overheads involved in various system operations. This is not always as easy as it sounds. On the Makbilan, and bus-based shared-memory machines in general, the problem is one of isolating the measurements from irrelevant external effects. For example, the measured results depend on the following:

- Whether data structures are local or remote, and must be accessed through the bus.
- If remote accesses are made, what the contention for the bus is at that time.
- What the contention for a specific memory module is. This influences both local and remote references.
- What the load on the PEs is, i.e. how many tasks are being time shared.

The methodology used is to write application programs that perform a certain operation a large number of times, and measure their execution times. The applications are written so that the transient effects at startup are negligible relative to the total execution time, thus enabling us to attribute the whole time to the repeated performance of the operation being measured. However, care must be taken that some additional operation does not "slip in" together with the one we are interested in.

Hardware Configuration

The hardware configuration of the Makbilan, as used in these measurements, was as follows:

- A Multibus-II backplane with 20 slots.
- A CSM-001 bus controller.
- 12 single-board computers (SBC) acting as PEs. Each has a 386 processor running at 20 MHz, a 387 coprocessor, an MPC, and 4 MB of memory.
- Another SBC acting as a Unix host.
- A 186/410 board with six terminal ports.
- A 186/224 peripheral controller board interfacing the bus with the SYP 500 chassis. This chassis contains the I/O devices (disk and tape).

2 Processing Rates

Each processor in the Makbilan has an independent clock. The first experiment is designed to check how well these clocks are synchronized. The code used is as follows:

    lparfor int n; 0; proc_no-1; -1; {
        addr[n] = (int*) malloc( sizeof(int) );
    }
    for (i=0 ; i<proc_no ; i++)
      for (j=i+1 ; j<proc_no ; j++)
        for (k=0 ; k<expr ; k++) {
          lparfor int n; 0; proc_no-1; -1; {
            int *loc, rem, it1, it2;
            if ((n != i) && (n != j)) pcontinue;   /* check i against j */
            it1 = ITER1;
            it2 = ITER2;
            loc = addr[n];
            *loc = 0;
            if (n == i) {
                while (*loc < it2) {
                    if (*loc == it1)
                        rem = *(addr[j]);
                    (*loc)++;
                }
                res[i][j][k] = rem;
            } else {
                while (*loc < it2) {
                    if (*loc == it1)
                        rem = *(addr[i]);
                    (*loc)++;
                }
                res[j][i][k] = rem;
            }
          }
        }

All this code does is to check all pairs of processors against each other. This is done by spawning a pair of activities, one on each processor, which iterate a large number of times; 12,000,000 was used. A bit before the end, at 10,000,000 iterations, each activity checks the progress of the other activity and tabulates it. Everything is local, with no interference from any other processor in the system. The results are shown in Table 1. These are averages of five measurements for each pair. Row i of the table contains the rates of all the processors relative to processor number i. Thus a row with many positive numbers, e.g. row 3, indicates a relatively slow processor. A row with many negative numbers, e.g. row 7, indicates a relatively fast processor. The greatest difference measured,

Table 1: Relative rates of the 12 processors currently in the Makbilan, in parts per million. (The individual entries were not recovered from the transcription.)

between processor 3 and processor 6, is only 80 parts per million. At least the last digit of these measurements should not be considered accurate. For example, the table shows that processor #8 is slower than #11 by 5 ppm, and processor #4 is slower than #8 by an additional 4 ppm, but processor #4 is faster than #11 by 3 ppm.

3 Bus Contention

The Makbilan backplane is a Multibus-II, with a throughput of 40 MB/s. This seems to be more than enough to service a number of transfers within one processor cycle, implying that the bus should not be a bottleneck. But the Makbilan uses the bus mainly for remote memory access, and the bus protocol requires that the bus be held until the transaction terminates. The memory has dual ports, and access from the bus may suffer several wait states. Thus the memory response time may be a bottleneck that limits the bus throughput. The experiments in this section are designed to check this.

3.1 Contention for Bus Access

This subsection contains two experiments: the first shows how performance degrades when more processors try to use the bus simultaneously, and the second shows how the mix of local and global accesses influences the performance.

Experiment 1

This experiment is repeated with different numbers of processors. The code for each run is as follows:

    int g;
    lparfor int j; 1; num_of_procs; -1; {
        int k;
        /*register*/ int *p, l;
        p = &g;
        for (k=0 ; k<iter ; k++) {
            *p=l;
            /* repeated 100 times */
            *p=l;
        }
    }

The assignment in the loop is repeated 100 times in-line to reduce the effect of the loop handling itself. The experiment was conducted twice, once with p and l defined as local variables and once as registers. Note that the global variable is actually in the local memory of the first processor, so in the measurement for i processors only i - 1 are really using the bus. In particular, the measurement for one processor is really a local access, not a global one. The results are shown in Fig. 1.
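The measurement kernel can be sketched in portable C (an illustrative stand-in, not the ParC original; the report repeats the store 100 times, 10 is used here):

```c
#include <assert.h>

/* Stand-in for the measurement kernel: the store is expanded in-line
 * via a macro so that loop-control overhead is negligible relative to
 * the stores being timed. */
#define REPEAT10(stmt) stmt stmt stmt stmt stmt stmt stmt stmt stmt stmt

long store_kernel(long iter)
{
    volatile long g = 0;   /* stands in for the shared variable */
    long l = 1, k;
    for (k = 0; k < iter; k++) {
        REPEAT10(g = l;)   /* 10 in-line stores per loop iteration */
    }
    return g;
}
```

The volatile qualifier keeps the compiler from collapsing the repeated stores, which is what a modern compiler would otherwise do to this pattern.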
It is obvious that a linear relationship exists between the number of processors and the elapsed time, indicating that the bus cannot support multiple requests within a single processor cycle. Thus the bus is indeed a bottleneck. As may be expected, when the

local variables are in registers the elapsed time is slightly shorter. Note, however, that as more processors are added the difference decreases until it disappears. The reason for this is that using registers reduces the time needed to issue the next global access, but the bottleneck is the global access itself. Issuing the instruction faster does not help.

Figure 1: Measurement showing that the bus is not faster than the processors. The time for global accesses from each processor is shown, for local variables in memory and in registers.

Experiment 2

The load on the bus depends, of course, on the access rate from the different processors. This in turn depends on the ratio of local to global memory references. This experiment shows how a larger percentage of local references makes the bus seem faster. The basic code is similar to the previous experiment, except that each processor does global accesses to a different global variable. The global variables are arranged so that each is in the memory of a different processor. This ensures that the bottleneck is indeed the bus (and its protocol), and not any overloading effects on a specific memory module.

    /* allocate memory in every processor */
    lparfor int j; 1; proc_no; -1; {
        M[j] = (int *) malloc(sizeof(int));
    }

    /* repeat for different numbers of PEs */
    lparfor int j; 1; num-of-procs; -1;

        int l, k, *p;
        /* arrange pointers */
        if (get_pid() < proc_no)
            p = M[get_pid()+1];
        else
            p = M[1];
        /* perform local and global accesses */
        for (k=0 ; k<iter ; k++)
            DOx( *p, l )

The DOx macro includes 100 instructions, out of which x are assignments to the global variable, of the form *p=l, and the rest are increments of the local variable, i.e. l++. Results for five values of x are reported: 1, 10, 20, 50, and 100. For example, the macro DO50(x,y) looks like this:

    #define DO50(x,y) \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y; \
        l++; x=y; l++; x=y; l++; x=y; l++; x=y; l++; x=y;

As may be expected, when practically all the accesses are local, the number of processors working in parallel has no influence (x = 1). When a certain fraction of accesses are global, bus contention begins to take its toll (x = 10 to x = 50), and the larger the fraction of global accesses the larger the elapsed time becomes (x = 100). The results for heavy bus activity are practically identical to those from experiment 1. This means that it is not possible to decouple the degradation due to bus contention from the degradation due to contention for the same memory module.

3.2 Bus Access Priorities

The Multibus-II arbitration mechanism gives higher priority to PEs with lower serial numbers. This experiment was designed to check how big the difference in performance can be. The code is:

    /* allocate a counter in every processor */
    lparfor int j; 1; proc_no; -1; {
        C[j] = (int *) malloc(sizeof(int));
        *(C[j]) = 0;
    }

Figure 2: Effect of different ratios of global to local accesses on the degradation due to bus contention.

    flag = 1;
    /* increment neighbor's counter until threshold reached */
    lparfor int j; 1; proc_no; -1; {
        int k, *p;
        p = C[(j<proc_no)? j+1 : 1];
        while (flag) {
            for (k=0 ; k<iter ; k++)
                *p = *p + 1;
            if (*p >= THRESHOLD)
                flag = 0;
        }
    }

Again, each processor increments a counter in another processor's local memory. Note that the accesses are done in a double loop: the outer loop checks the termination condition (when the first PE reaches the threshold value), and the inner one performs a certain constant number of iterations. This is done to reduce the effect of checking the termination condition. The results are tabulated in Table 2. The relative performance displays a step-like behavior. Using PE number 1 as a reference point, we find that PE number 2 achieves the same number of accesses, PE number 3 achieves 90% of that, PEs numbers 4 through 6 achieve about 2/3, and

Table 2: Number of accesses performed by different PEs, by serial number, when all compete for the bus. (The numeric entries were not recovered from the transcription.)

PEs 7 through 10 achieve only 1/2. The reason for this behavior is that the arbitration mechanism operates in a gated manner. First, all the contending PEs are noted. Then they are serviced according to their serial numbers. While this is being done, new requests that arrive are blocked out. Only when the previous set have all been serviced does a new round begin. Thus PEs with low serial numbers, which are serviced early in the round, manage to generate a new request before the round is over, and therefore get into the next round as well. PEs late in line are serviced towards the end of the round, and do not manage to generate a new request before the round is over. Therefore they miss alternate rounds. PEs in the middle sometimes make it and sometimes not.

4 Remote Access Penalty

Obviously a remote access takes more time than a local one. But how much more? The experiments in this section are designed to answer this question, and to characterize the performance in different operating conditions.

4.1 Experiments

All these experiments have the same structure. A set of activities is spawned, and they access some variables in a loop. In each set there is always one activity that accesses a local variable, such that no other activity accesses any variable in its local memory. This activity is therefore isolated from all the others, and serves as a reference point. When it completes its iterations, it notes the number of iterations that the other activities have completed in the same time. This gives the relative execution rates. Of course the measured time is for the whole instruction execution cycle, not only the memory reference to get the operands. To reduce this effect, the instruction that is executed is the increment instruction, which performs an increment on a memory address. This requires two memory accesses, and only minimal decoding.
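The gated-round behavior can be illustrated with a toy simulation (purely illustrative, not the real arbitration logic; the "think time" parameter, the number of slots a PE needs to regenerate a request, is a hypothetical model parameter):

```c
#include <assert.h>

#define NPE 10

/* Toy model of gated arbitration: each round services all currently
 * pending PEs in serial-number order.  A PE serviced with at least
 * `think` service slots left in the round re-requests in time for the
 * next round; otherwise it misses one round. */
void gated_rounds(int rounds, int think, int acc[NPE])
{
    int next[NPE], i, r;          /* next[i]: first round PE i can join */
    for (i = 0; i < NPE; i++) { next[i] = 0; acc[i] = 0; }
    for (r = 0; r < rounds; r++) {
        int set[NPE], n = 0, p;
        for (i = 0; i < NPE; i++)         /* pending PEs, in id order */
            if (next[i] <= r) set[n++] = i;
        for (p = 0; p < n; p++) {
            i = set[p];
            acc[i]++;
            /* slots remaining after this PE's service: n - 1 - p */
            next[i] = (n - 1 - p >= think) ? r + 1 : r + 2;
        }
    }
}
```

With 10 PEs and a think time of 3 slots, six rounds yield access counts 6,6,6,6,4,4,4,3,3,3: the low-numbered PEs make every round, the middle ones about 2/3 of them, and the last ones every other round, qualitatively matching the step-like behavior reported above.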
In addition, the overhead for executing the loop itself (which is local) may obviously bias the performance figures. To reduce this effect,

the loop contains in-line code for 250 consecutive increment instructions. The code for the first experiment is:

    pparblock
    {
        /* local undisturbed */
        register int *p;
        int i, res2, res3;

        p = &i;
        while (flag1 == 0) ;
        for ((*p)=0 ; (*p)<iter ; ) {
            (*p)++;
            /* repeated 250 times */
            (*p)++;
        }
        res2 = *p2;
        res3 = *p3;
        sync;
    }
    :
    {
        /* local disturbed */
        register int *p;
        int i, j;

        p3 = &j;
        flag2 = 1;
        p2 = &i;
        p = &i;
        while (flag1 == 0) ;
        for ((*p)=0 ; (*p)<iter ; ) {
            (*p)++;
            /* repeated 250 times */
            (*p)++;
        }
        sync;
    }
    :
    {
        /* remote */
        register int *p;

        while (flag2 == 0) ;
        p = p3;
        flag1 = 1;
        for ((*p)=0 ; (*p)<iter ; ) {
            (*p)++;
            /* repeated 250 times */
            (*p)++;
        }
        sync;
    }

    access conditions                                       rate
    local   free memory                                     1.00
            contention from a single remote activity        0.74
            contention from multiple remote activities      0.57
    remote  to unused memory, no bus contention             0.11
            to locally used memory, no bus contention       0.11
            with heavy bus contention                       0.01-0.03

Table 3: Relative access rates under different conditions.

The variable flag1 is used to signal that all are ready and the measurement can begin. flag2 signals that the global pointer p3 now points to a local variable, and can be used by the process that performs the remote accesses. res2 and res3 take a snapshot of the progress of the locally disturbed activity and the remote activity, respectively, at the instant that the locally undisturbed activity finishes its iterations. The other experiments follow the same principle. To measure remote access to an unused memory, the middle (local disturbed) activity is terminated with a pcontinue after it allocates local memory and assigns its address to p3. malloc is used so that this memory will persist after the activity terminates. To measure contention from multiple activities, the third activity (remote) is replaced by a pparfor that spawns proc_no - 2 additional activities, which all access the same variable.

4.2 Results

The results of these experiments are summarized in Table 3. Remote references cause a degradation in the local access rate, and reduce it to 56-74% of the rate when there are no remote references. Under ideal conditions, remote references take nine times as long as local references to an unused memory, or 5-7 times as long as local references that are disturbed by remote references. Under less ideal conditions, where bus contention also takes its toll, remote references are much slower. The most extreme difference is between local references to an unused memory and remote references that suffer severe contention; in this case the remote references are from thirty to seventy times slower.
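The slowdown factors quoted above follow directly from the relative rates in Table 3 (the function name is illustrative):

```c
#include <assert.h>
#include <math.h>

/* Ratio of access times implied by two relative rates from Table 3:
 * e.g. a rate of 0.11 relative to 1.00 means each access takes about
 * nine times as long. */
double slowdown(double ref_rate, double rate)
{
    return ref_rate / rate;
}
```

For example, 1.00/0.11 gives roughly 9, and 0.74/0.11 gives roughly 6.7, matching the "nine times" and "5-7 times" figures in the text.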
The large variance is due to the fact that different PEs have different priorities, and therefore suffer different degrees of degradation under contention. Under reasonable conditions, where all the memories are accessed both locally and through the bus, and there is moderate contention, the ratio between local and remote references is about a factor of ten to twenty.

5 Scheduling

The scheduling overhead is the time required to perform a context switch. The main problem with measuring this quantity is that the run-time library has a few tasks that compete with the user tasks for the CPU. This adds overhead that is unaccounted for.

5.1 The Default Library

This experiment measures the scheduling overhead by spawning a set of activities, one per PE. Each activity performs an empty loop, yielding the processor in each iteration. Thus the average time required per iteration is actually the time required to reschedule the activity. This scenario is repeated for different loads, so as to distinguish between the constant overhead due to the library tasks and the scheduling overhead that is proportional to the number of activities that are being time shared. The code used is the following:

    /* control measurement: the empty loop itself */
    lparfor int i; 1; proc_no; -1; {
        start = get_loctime();
        for (j=0 ; j<iter ; j++) ;
        stop = get_loctime();
        control[i] = (float) stop - start;
    }

    if (load == 1)
        sem_init( &ready, 1 );
    else
        sem_init( &ready, 0 );

    pparblock
    {
        /* create loading conditions */
        pparfor int l; 2; load; 1; {
            lparfor int i; 1; proc_no; -1; {
                int j;
                if ((l == 2) && (i == 1))
                    V( &ready );   /* signal when ready */
                for (j=0 ; j<iter+load+3 ; j++)
                    KN_sleep( 0 );
            }
        }
    }
    :
    {
        /* perform measurements */
        P( &ready );   /* wait for load to be ready */
        lparfor int i; 1; proc_no; -1; {
            int j, start, stop;
            start = get_loctime();
            for (j=0 ; j<iter ; j++)

                KN_sleep( 0 );
            stop = get_loctime();
            for (j=0 ; j<cushion ; j++) ;
            res[i] = (float) stop - start;
            res[i] = res[i] - control[i];
        }
    }

Figure 3: Overhead for scheduling: time per iteration vs. load (tasks per PE), for the default and gang-scheduling libraries.

The control measurement evaluates the time needed to do the loop itself. The ready semaphore is used to trigger the measurement after the additional loading activities have been spawned. The cushion delays writing the results into the shared array, so as not to interfere with other measurements that have not terminated yet. The results of this experiment are shown in Figure 3. The average over 12 PEs was measured. As expected, the total scheduling overhead in a scheduling round is a linear function of the load, obeying the expression 0.140 + 0.072 * load (where the measurement is in ms). This means that each RMK context switch takes about 72 microseconds. The additional constant overhead of 140 microseconds is attributed to the scheduling and execution of the get_work task. This overhead would be bigger if this task actually had something to do; 140 microseconds is the time needed to find out that there is nothing to do.
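The fitted linear model, using the two constants stated in the text (140 microseconds fixed cost, 72 microseconds per context switch), can be written as:

```c
#include <assert.h>
#include <math.h>

/* Total per-round scheduling overhead in ms as a function of the
 * number of tasks time-shared on each PE, per the measured fit:
 * a 0.140 ms constant for the get_work task plus 0.072 ms per
 * RMK context switch. */
double sched_overhead_ms(int load)
{
    return 0.140 + 0.072 * load;
}
```

At a load of one task per PE this predicts about 0.21 ms per scheduling round, dominated by the constant term; at ten tasks per PE the per-switch term dominates.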

It should be noted that the measurements show that the overhead on the master PE (PE #1) is between 2 and 4% higher than on the other PEs. This difference may be due to the fact that the get_work tasks on all the other PEs access global data structures that are located in the memory of the first PE, thus slowing it down.

5.2 The Gang Scheduling Library

This experiment is designed to measure the context-switching overhead when the gang scheduling library is used. With this library, context switching is coordinated and done simultaneously on all the PEs, by means of a broadcast interrupt. To measure the overhead, a set of activities is created, one per PE. All the activities loop endlessly on a local variable, but one of them calls the library function ts_sched in each iteration; this function sends the scheduling broadcast. The code used is:

    int flag=1;
    pparfor int l; 1; load; 1; {
        lparfor int i; 1; proc_no; -1; {
            int j, start, stop;
            if (i == 2) {   /* to avoid flag traffic */
                start = get_loctime();
                for (j=0 ; j<iter ; j++)
                    ts_sched( 0 );
                stop = get_loctime();
                flag = 0;
                res = (float) stop - start;
            } else {
                while (flag)
                    for (j=0 ; j<1000 ; j++) ;
            }
        }
    }

The results of this experiment are also shown in Figure 3. Again the overhead is more or less linear with the load, but in this case there is no added constant because there is no get_work task. However, the overhead is significantly higher than for the default library: it is approximately 0.2 * load. This is reasonable, since every scheduling operation here involves two RMK scheduling operations, a broadcast, and the execution of the scheduler task.

6 Spawning New Activities

The question here is "how much time does it take to create a new activity, execute it, and terminate it?" This is important, since activities that are too short relative to this overhead usually cause a degradation rather than an improvement in performance.

6.1 Experiment 1

In this experiment a set of empty activities is spawned, one per PE. This is repeated a large number of times, and the average time needed for a spawn is reported. The following code was used:

    start = get_loctime();
    for (j=0 ; j<lim ; j++) ;
    stop = get_loctime();
    control = (float) stop - start;

    start = get_loctime();
    for (j=0 ; j<lim ; j++) {
        lparfor int i; 1; proc_no; -1; ;
    }
    stop = get_loctime();
    res = (float) stop - start;

    time_per_spawn = (res - control)/lim;

Various values of lim were used; the rounded results were typically the same to within less than 1%. The sequence of events in each iteration, in the default library, is as follows:

1. The parent activity runs on the master PE, and
   (a) Starts another loop iteration.
   (b) Spawns a new set of activities. This involves
       i. Initialization of the spawn descriptor.
       ii. Linking the descriptor to the global list.
       iii. Suspending execution.
2. The get_work task runs (on all PEs), doing
   (a) Get a lock on the global descriptor list and get a new activity.
   (b) Create a new RMK task to execute the activity.
   (c) Yield the processor.
3. The new activities execute on all the PEs. This includes

   (a) The selector runs and gets whatever is needed from the parent.
   (b) The activity terminates and the task is deleted. The last one links the parent's descriptor to the resume list.
4. The get_work task runs again (on the master PE), doing
   (a) Resume the parent.
   (b) Yield the processor.

Figure 4: Results for experiment 1: overhead for spawning empty activities on all the PEs, for the default and gang-scheduling libraries.

The results are shown in Figure 4. These results indicate that as more PEs are used, the overhead increases. This increase is slightly superlinear, probably due to contention for the bus and for the spawn descriptor, and possibly also to the fact that the additional PEs have to perform remote accesses. The increased overhead in the gang scheduling library is probably due to the fact that the library scheduler must run to perform all the necessary context switches.

6.2 Experiment 2

The main problem with the previous experiment is that it measures the total time to perform a parallel construct, not just the time to spawn an activity. To measure the net time required just to get an activity and execute it, without the time needed for the spawn, we need to have one PE that just creates activities all the time, while other PEs keep the global list of spawn descriptors from becoming empty. This is achieved by the following code:

Figure 5: Results for experiment 2: overhead for creating empty activities from a remote descriptor, as a function of the number of generators.

    start = get_loctime();
    lparfor int j; 1; proc_no; -1; {
        int k, lim;
        if (j == 1) pcontinue;   /* leave PE #1 free */
        lim = ITER;
        for (k=0 ; k<lim ; k++) {
            /* spawn single task on PE #1 */
            lparfor int l; 1; 1; -1; ;
        }
    }
    stop = get_loctime();
    res = (float) stop - start;

    time_per_spawn = res/(iter*(proc_no-1));

The results are shown in Figure 5. When only one activity generates new descriptors, the PE that creates the activities has to wait for the next one each time. When there are two or more, it does not have to wait, so the time per spawn is reduced. If there are too many generators, however, they contend for the global descriptor list, causing a degradation in performance.
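The normalization in the last line of the code amortizes the elapsed time over all spawned activities; a sketch of the same arithmetic (function name illustrative):

```c
#include <assert.h>
#include <math.h>

/* Amortized cost of one spawn in experiment 2: total elapsed time
 * divided by the number of activities created, i.e. iter spawn loops
 * on each of the proc_no - 1 generator PEs. */
double time_per_spawn(double elapsed, long iter, int proc_no)
{
    return elapsed / (double)(iter * (proc_no - 1));
}
```

For example, 9 seconds of elapsed time over 1000 iterations on 9 generators corresponds to 1 ms per spawn, which is the order of magnitude reported in the summary below.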

Figure 6: Results for experiment 3: overhead for creating empty activities from a large descriptor.

6.3 Experiment 3

As a final touch, we also consider a simpler version of the previous experiment. In this case a large number of activities are spawned at once, and executed by all the processors. Thus there is no contention for the global descriptor list due to the addition of new descriptors, but only contention due to the execution of activities. However, one of the PEs has the descriptor in its local memory, while the others have to access it via the bus. The code is:

    lim = ITER*proc_no;
    start = get_loctime();
    pparfor int i; 1; lim; 1; ;
    stop = get_loctime();
    res = (float) stop - start;
    time_per_spawn = res/iter;

The results of this experiment are shown in Fig. 6.

6.4 Summary

The experiments described in this section show that the effective overhead for spawning an activity is highly variable. The reason for this is that spawning is communication intensive, and hence

depends on the loading conditions. The minimum time needed just to create a task, start it, and terminate it is about 1 ms. This number grows when there is contention for the bus, either due to other tasks being spawned, or due to computation.

7 Load Balancing and Granularity

The ParC run-time library spawns tasks as follows: the activity that performs the spawn places a spawn descriptor in global memory, and each processor checks the descriptor and creates a new activity once every scheduling round. This should provide a measure of load balancing, because processors that are overloaded have a longer scheduling round, so they will take additional work at a slower rate. It should be noted that load balancing can be defined in two ways: according to the numerical load and according to the work/performance load. Balancing the numerical load means that each processor will have the same number of activities. Balancing the work/performance means that each processor will do work proportional to its performance: inefficient processors will do less, so as not to cause a delay. The two are not the same, even if all the activities are identical, because the efficiency of different processors is indeed different. This is a result of the global memory access penalty and the different bus access priorities. The question of load balancing is further complicated by the issue of granularity. In particular, the system might be prevented from performing any load balancing because the program is partitioned into activities with the wrong granularity. The experiment reported in this section shows how all of this is tied together. The methodology used is to measure how much time it takes to perform a constant amount of work using different numbers of activities. The work is 100,000,000 assignments to a global variable. This is divided equally among a certain number of activities, which are then executed on a 10-processor system.
The code for this experiment is simply:

    /* N = 100,000,000, the total number of assignments */
    exp = 0;
    for (i=1 ; i<=N ; i=i*10) {
        exp++;
        start = get_loctime();
        pparfor int j; 1; i; 1; {
            int l, k, n;
            l = N/i;
            for (k=0 ; k<l ; k++)
                n = g;
            res[exp][get_pid()]++;
        }
        stop = get_loctime();
        time[exp] = stop - start;
    }

The experiment is divided into phases, each of which does the same work with a different number of activities. The res array counts the number of activities performed on each processor in each

phase. The time array tabulates the elapsed time of each phase.

Table 4: Elapsed time for computations with different numbers of activities. (The numeric entries were not recovered from the transcription.)

Figure 7: Number of activities executed on each processor (percent of activities vs. PE serial number), for different total numbers of activities.

The results for the elapsed time are given in Table 4. The best speedup, which is about 4.5, is achieved when on the order of a thousand activities are used (a more accurate measurement showed the optimal number to be 1900 activities). At the finest granularity measured, the computation is three times slower than doing the work serially, because the granularity is much too fine. Most of the time is spent in overhead just creating all those activities. For each measurement (identified by the number of activities used), the percentage of activities executed on each processor is shown in Fig. 7. The results indicate that the best performance is obtained when the loads are not numerically equal. Three cases may be distinguished:

1. The number of activities is too small and their granularity is too large. In this case a more efficient processor may finish executing its activity before less efficient processors, but by the time it finishes all the activities have been picked up already. Therefore it idles, and the time

is determined by the slower processors.

2. The optimal case: a medium number of activities with medium granularity. In this case processor #1 does 2.5 times more work than the others. Its increased efficiency is due to the fact that both the spawn descriptor and the global variable that are used in the computation are in its local memory. Processor #2 also does a relatively large amount of work. This is due to the fact that it has the highest bus priority of all the processors that get work through the bus.

3. The number of activities is too large and their granularity is too small. In this case the contention for the spawn descriptor takes its toll, and limits the performance. All the processors execute about the same number of activities, but it takes them a long time because they have to wait for access to the descriptor. Processor #1 becomes relatively slow, because the heavy burden on its local memory slows it down.

8 Synchronization Overhead

ParC has three synchronization mechanisms: an atomic fetch-and-add instruction (faa), a barrier synchronization instruction (sync), and semaphores. In this section we measure the performance of the first two, and show how it depends on the number of activities involved.

8.1 Fetch-And-Add

The faa instruction is implemented by a set of locks. The address of the variable in question is hashed to one of the locks. The atomic test-and-set instruction from the 386 instruction set, which is supported across the Multibus-II, is used to obtain the lock. Busy waiting is used if the lock is not available. Once the lock is obtained, the faa operation is simulated. Obviously this implementation serializes concurrent faa instructions to the same variable. The following code was used to gauge the effect of this serialization:

    start = get_loctime();
    lparfor int i; 1; proc_no; -1; {
        int *p, j;
        p = &g;   /* or p = ptr[ (i==proc_no)?1:i+1 ]; */
        for (j=0 ; j<iter ; j++)
            faa( p, 1 );
    }
    stop = get_loctime();

As indicated by the comment, two cases were checked.
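A minimal sketch of the lock-based faa implementation described above, in C11 rather than ParC (the number of locks and the hash function are illustrative, and the priority-raising mentioned later is omitted):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NLOCKS 16   /* illustrative; the report does not give the real size */

/* Lock array; zero-initialized atomic_flag is clear on common implementations. */
static atomic_flag locks[NLOCKS];

/* faa: hash the target address to a lock, busy-wait on a test-and-set
 * lock, then simulate the fetch-and-add under the lock. */
int faa(int *p, int v)
{
    atomic_flag *lk = &locks[((uintptr_t)p >> 2) % NLOCKS];
    int old;
    while (atomic_flag_test_and_set(lk))
        ;                        /* busy wait, as in the report */
    old = *p;                    /* simulate the atomic add */
    *p = old + v;
    atomic_flag_clear(lk);
    return old;
}
```

Hashing the address spreads independent faa operations over different locks; only concurrent faas on addresses that hash to the same lock serialize, which is the behavior the experiment measures.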
In the first, all the activities access the same variable. In the second, each accessed a distinct variable placed in the memory of another processor. The results are plotted in Fig. 8. As expected, the overhead rises with the number of processors. When the faas are directed at distinct variables, the relationship is linear: it reflects the global

operations on the bus. When all the faas are directed at the same variable, the relationship becomes superlinear for a large number of processors. This is a result of the contention for the same lock. Note that the overhead is quite large relative to the simple nature of the operation (about 0.4 ms per faa). The extra overhead is due to raising the priority of the activity while it holds the lock, so as to prevent it from being preempted.

Figure 8: Time for faa instructions by each of a set of processors, to the same variable and to different variables.

8.2 Barrier Synchronization

The sync instruction imposes a barrier synchronization across sibling activities, i.e. across activities that were spawned together by the same construct. The implementation uses a counter and a flag in the spawn descriptor: each activity that reaches the barrier decrements the counter and then spins on the flag. The last to arrive raises the flag and releases all its waiting siblings (it also sets things up for the next barrier). In order to avoid excessive waste if there are more activities than processors, the busy-wait loop includes an instruction to yield the processor if the flag is not up. The counter is decremented using faa, so there is an obvious serialization of this procedure. In addition, when there are more activities than processors, context switches are needed so that all will run and reach the barrier. The experiments in this subsection quantify these two effects.

Experiment 1

In this experiment there is exactly one activity on each processor. These activities synchronize repeatedly, and the time required is measured. The code used is:

    start = get_loctime();

    lparfor int i; 1; proc_no; -1; {
        int j;
        for (j=0 ; j<iter ; j++)
            sync;
    }
    stop = get_loctime();

Figure 9: Time for the sync instruction by a set of processors. The average is shown.

The results are shown in Fig. 9. The measurement for one processor is much lower than the rest, because everything is local. Note that the measured time includes the time for the loop itself. Measurements show the loop overhead per iteration to be small, so the difference is minor.

Experiment 2

When the number of activities exceeds the number of processors, context switches are needed to allow them all to run and reach the barrier. This is greatly exacerbated if all the activities do not yet exist when the first ones reach the barrier. In this experiment 10 processors are used. Three types of measurements are made:

1. A large number of activities are spawned, and they synchronize repeatedly a large number of times. Thus in effect we can measure the average time to perform a sync when all the activities have been spawned already. This is a direct extension of the previous experiment; it is shown by the "just sync" graph in Fig. 10. The results indicate a linear relation between

the number of activities and the time required, which is expected considering that the implementation of the sync is actually serial. The constant of proportionality includes the time for the faa on the counter and the context switch.

[Figure 10: Time for sync instruction by a large number of activities. Performance degrades sharply if the activities have to be spawned. Curves: just sync, spawn & sync, just spawn; time per operation [ms] against number of activities.]

2. A large number of activities are spawned and they synchronize once. Thus the price of spawning also enters the picture. This is shown by the "spawn & sync" graph in Fig. 10.

3. To calibrate the previous measurement, we also measured the time just to spawn the tasks (the "just spawn" graph in Fig. 10). As expected, this is linear in the number of activities, and the constant of proportionality is about 1 ms per activity.

The important effect to notice is that the time to spawn and sync is much larger than the sum of its parts. In fact, it seems to have a quadratic relation to the number of activities. This is a result of the way that the sync is implemented, and specifically, a result of using busy waiting with a yield instruction. The system then operates according to the following scenario. First, one activity is spawned by the get_work task. This activity tries to sync, finds that its siblings have not arrived yet, and yields. get_work runs again, and spawns another activity. This activity too tries to sync, fails, and yields. The first activity is then rescheduled, tries again, fails again, and yields again. This pattern is repeated until all the activities are spawned: after each one, all the existing activities try to sync. Thus the time to spawn and sync N activities is proportional to the sum of N times the spawn overhead, plus (N/P)^2/2 times the overhead to perform a context switch and check the sync condition.


More information

Lecture 17: Threads and Scheduling. Thursday, 05 Nov 2009

Lecture 17: Threads and Scheduling. Thursday, 05 Nov 2009 CS211: Programming and Operating Systems Lecture 17: Threads and Scheduling Thursday, 05 Nov 2009 CS211 Lecture 17: Threads and Scheduling 1/22 Today 1 Introduction to threads Advantages of threads 2 User

More information

CSE 410 Final Exam 6/09/09. Suppose we have a memory and a direct-mapped cache with the following characteristics.

CSE 410 Final Exam 6/09/09. Suppose we have a memory and a direct-mapped cache with the following characteristics. Question 1. (10 points) (Caches) Suppose we have a memory and a direct-mapped cache with the following characteristics. Memory is byte addressable Memory addresses are 16 bits (i.e., the total memory size

More information

Module 6: INPUT - OUTPUT (I/O)

Module 6: INPUT - OUTPUT (I/O) Module 6: INPUT - OUTPUT (I/O) Introduction Computers communicate with the outside world via I/O devices Input devices supply computers with data to operate on E.g: Keyboard, Mouse, Voice recognition hardware,

More information

Operating Systems Comprehensive Exam. There are five questions on this exam. Please answer any four questions total

Operating Systems Comprehensive Exam. There are five questions on this exam. Please answer any four questions total Operating Systems Comprehensive Exam There are five questions on this exam. Please answer any four questions. ID 1 2 3 4 5 total 1) The following questions pertain to McKusick et al's A Fast File System

More information

Computer Organization ECE514. Chapter 5 Input/Output (9hrs)

Computer Organization ECE514. Chapter 5 Input/Output (9hrs) Computer Organization ECE514 Chapter 5 Input/Output (9hrs) Learning Outcomes Course Outcome (CO) - CO2 Describe the architecture and organization of computer systems Program Outcome (PO) PO1 Apply knowledge

More information

THE FINANCIAL CALCULATOR

THE FINANCIAL CALCULATOR Starter Kit CHAPTER 3 Stalla Seminars THE FINANCIAL CALCULATOR In accordance with the AIMR calculator policy in eect at the time o this writing, CFA candidates are permitted to use one o two approved calculators

More information

The performance of single-keyword and multiple-keyword. pattern matching algorithms. Bruce W. Watson. Eindhoven University of Technology

The performance of single-keyword and multiple-keyword. pattern matching algorithms. Bruce W. Watson. Eindhoven University of Technology The performance of sinle-keyword and multiple-keyword pattern matchin alorithms Bruce W. Watson Faculty of Mathematics and Computin Science Eindhoven University of Technoloy P.O. Box 513, 5600 MB Eindhoven,

More information

Multiprocessor and Real- Time Scheduling. Chapter 10

Multiprocessor and Real- Time Scheduling. Chapter 10 Multiprocessor and Real- Time Scheduling Chapter 10 Classifications of Multiprocessor Loosely coupled multiprocessor each processor has its own memory and I/O channels Functionally specialized processors

More information

1 Theory 1 (CSc 473): Automata, Grammars, and Lanuaes For any strin w = w 1 w 2 :::w n, define w R to be the reversal of w: w R = w n w n 1 :::w 1. Su

1 Theory 1 (CSc 473): Automata, Grammars, and Lanuaes For any strin w = w 1 w 2 :::w n, define w R to be the reversal of w: w R = w n w n 1 :::w 1. Su Masters Examination Department of Computer Science March 27, 1999 Instructions This examination consists of nine problems. The questions are in three areas: 1. Theory and Alorithms: CSc 473, 545, and 573;

More information

Partial Marking GC. A traditional parallel mark and sweep GC algorithm has, however,

Partial Marking GC. A traditional parallel mark and sweep GC algorithm has, however, Partial Marking GC Yoshio Tanaka 1 Shogo Matsui 2 Atsushi Maeda 1 and Masakazu Nakanishi 1 1 Keio University, Yokohama 223, Japan 2 Kanagawa University, Hiratsuka 259-12, Japan Abstract Garbage collection

More information

General Design of Grid-based Data Replication. Schemes Using Graphs and a Few Rules. availability of read and write operations they oer and

General Design of Grid-based Data Replication. Schemes Using Graphs and a Few Rules. availability of read and write operations they oer and General esin of Grid-based ata Replication Schemes Usin Graphs and a Few Rules Oliver Theel University of California epartment of Computer Science Riverside, C 92521-0304, US bstract Grid-based data replication

More information

The Art and Science of Memory Allocation

The Art and Science of Memory Allocation Logical Diagram The Art and Science of Memory Allocation Don Porter CSE 506 Binary Formats RCU Memory Management Memory Allocators CPU Scheduler User System Calls Kernel Today s Lecture File System Networking

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

G52CON: Concepts of Concurrency

G52CON: Concepts of Concurrency G52CON: Concepts of Concurrency Lecture 11: Semaphores I" Brian Logan School of Computer Science bsl@cs.nott.ac.uk Outline of this lecture" problems with Peterson s algorithm semaphores implementing semaphores

More information