Lutz Prechelt Fakultíat fíur Informatik. Universitíat Karlsruhe, Postfach Dí7500 Karlsruhe, Germany.

Size: px

Start display at page:

Download "Lutz Prechelt Fakultíat fíur Informatik. Universitíat Karlsruhe, Postfach Dí7500 Karlsruhe, Germany."

Myles Basil McCoy
5 years ago
Views:

1 Comparison of MasPar MP-1 and MP- Communication Operations Lutz Prechelt Institut fíur Programmstrukturen und Datenorganisation Fakultíat fíur Informatik Universitíat Karlsruhe, Postfach 9 Dí75 Karlsruhe, Germany ++9è71è-, Fax: ++9è71è99 April 3, 1993 Technical Report 1è93 Abstract Report 1è93 ëpre93ë describes the ændings of a series of communication measurements performed on a MasPar MP-1 series MP-11A machine. The current report covers the same measurements performed on a MP- series MP-1 machine. It compares the results and outlines and discusses the main diæerences. In these measurements, raw router communication was sometimes faster, sometimes slower on the MP- than on the MP-1, depending on the parameters of the communication requested. The relative performance of the MP- varied between 93è and 1è. Xnet communication was faster in all cases èperformance 1è to 175èè. Complex functions from the communication library were also always faster èperformance 1è to 1èè. Some of these results contradict technical speciæcations for the MP-1 and MP- published by MasPar. 1

2 CONTENTS Contents 1 Introduction 1.1 The scope of this report : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1. Technical speciæcation comparison of MP-1 and MP- : : : : : : : : : : : : : : : : : : Raw router communication 5.1 The router statement : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 5. Using other activation patterns : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 7.3 Using other communication patterns : : : : : : : : : : : : : : : : : : : : : : : : : : : : 7. rsend and rfetch : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 9..1 sending large packets : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 9.. rsend vs. rfetch : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 9..3 rsend vs. router statement : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1.. singular vs. plural pointers : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1 3 xnet communication 11 Communication library functions 1.1 reduce : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1. scan : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1.3 sendwith : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 13. enumerate, selectone, selectfirst : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 13.5 psort, rank : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 13 5 Summary of Results 1

3 LIST OF FIGURES 3 List of Figures 1 router permutation sensitivity : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : router send : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 3 router fetch : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : router send vs. fetch : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 7 5 sendèfetch routercount by PE activity : : : : : : : : : : : : : : : : : : : : : : : : : : : 7 router sendèfetch by routercount : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 7 7 router send analytic approximation èlinear scaleè : : : : : : : : : : : : : : : : : : : : : router send analytic approximation èlogarithmic scaleè : : : : : : : : : : : : : : : : : : 9 router send with probabilistic vs. regular activity : : : : : : : : : : : : : : : : : : : : : 1 router fetch with diæerent communication patterns : : : : : : : : : : : : : : : : : : : : 11 router send large packets : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 9 1 router send very large packets : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 9 13 rsend vs. rfetch : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 9 1 router send vs. ss rsend : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 9 15 rsend with probabilistic vs. regular activity : : : : : : : : : : : : : : : : : : : : : : : : 1 1 rsend with singular and plural pointers è bytesè : : : : : : : : : : : : : : : : : : : : : 1 17 rfetch with singular and plural pointers è bytesè : : : : : : : : : : : : : : : : : : : : : 1 1 rsend with singular and plural pointers è3 bytesè : : : : : : : : : : : : : : : : : : : : : 1 19 xnet send : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 11 xnet send vs. fetch : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 11 1 xfetch in nonstandard direction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 11 ss xfetchèsend vs. pp xfetchèsend : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1 3 sp xfetchèsend vs. ps xfetchèsend : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1 pp xfetchèsend for small and large packets : : : : : : : : : : : : : : : : : : : : : : : : : 1 5 reduce : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 13 scan : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 13 7 sendwithadd vs. sendwithmax : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 13 sendwithadd with diæerent PE activity : : : : : : : : : : : : : : : : : : : : : : : : : : 13 9 sendwithadd with diæerent destination patterns : : : : : : : : : : : : : : : : : : : : : 1 3 rank and psort : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1

4 1 INTRODUCTION 1 Introduction 1.1 The scope of this report A few weeks after my technical report ëpre93ë had been published, Jan Berger Henriksen from the University of Bergen, Norway, contacted me. He asked whether he could run my measurement program on Bergen's new MasPar MP- machine. I agreed and he made the results available to me. I compiled the results into this report. The machine used was a MP-1 èthe 13 processor version of the MP-è, which corresponds directly to the MP-11A èthe 13 processor version of the MP-1è that was used for the measurements in the old report. This history has the following consequences: æ This report covers exactly the same measurements as ëpre93ë, which I will now call the old report. æ For easier overview, it also contains exactly the same set of diagrams. All diagrams use the same scales and nomenclature as in the old report. Only the actual data points are diæerent. æ Since this model of comparison tends to introduce a lot of redundancy, I have stripped most of the textual contents of the old report from this one. Thus, it is not useful to read this report alone, you need the old report as a reference to understand what the measurements mean. In particular, often-used phrases such as ëas we can see in ægure x"usually meann ëas we can see by comparing ægure x from this report with ægure x from the old report". æ The discussion of the MP- measurement results does not cover the absolute eæects that can be observed in MP- communication behavior but only the relative eæects in comparison to the MP-1. The report is structured as follows: After this introduction follows a comparison of the technical speciæcations of the two machines and a very short discussion of the eæects that can be expected from their diæerences. The other three sections è to è follow the same structure as in the old report: They discuss, in order, the raw router communication, the xnet communication, and the higher communication library functions. A short summary of the observations follows in section 5. To ease reading, the sequence of the diagrams from the old report was preserved, as was the substructuring of the individual sections. This decision resulted in some ægures not being referenced at all and some subsections being almost completely empty. All comparisons are expressed in percent relative performance of the MP- compared to the MP-1, i.e. the performance of the MP-1 is 1è for each operation measured. 1. Technical speciæcation comparison of MP-1 and MP- The main diæerence between the MP-1 and MP- series of MasPar machines is that for the MP- the CPU chip has been completely redesigned. This greatly enhances its computational performance. The communication network, on the other hand, was kept almost without èexternally visibleè change. Table 1 shows some technical speciæcations taken from MasPar marketing material. These speciæcations give an idea of the diæerences between the machines that we can expect to observe in communication measurements. The most important lines to look at are those that specify the router and xnet bandwith and the memory bandwith.

5 5 Unit MP-1 MP- relative Raw computation 3 bit Integer MIPS 1è 3 bit Floating Point MFLOPS è bit Floating Point MFLOPS 55 3è Memory bandwith direct addressing MBès 11 1è indirect addressing MBès 7 195è Communication xnet bandwith MBès 3 7è router bandwith MBès è Table 1: MasPar MP-11 and MP-1 technical speciæcations They tell us that we can expect to ænd memory-to-memory router communication slightly faster on the MP- than on the MP-1, in particular when the packet size is high èdue to higher memory bandwithè and in particular when indirect addressing is used. Xnet communication may be somewhat faster on MP- than on MP-1 for the same reasons, but may as well sometimes be a bit slower, because of the reduced bandwith of the xnet on the MP-. For more complex library operations that use communication we can expect to ænd increased performance on the MP-, because the high raw processing speed of its CPU will accelerate the computations embedded within these operations by factors of or 3. As we will see, these expectations were not always met by the actual measurements: Router communication happened to be slower on the MP- in some cases. We can only speculate about the reasons for this lack of speed: one possibility is that the speciæcations MasPar published are simply not quite correct, another possibility would be that the MP- machine had some hardware problems during the measurements 1, a third explanation would be that synchronization of CPU and communication network is more diæcult on the MP- and eats up a lot of time. Whatever the actual reason was, in the following I assume that the diæerences are real, i.e., not produced by hardware problems. This assumption is supported by ægure 1. It shows that over 1 repetitions of the same communication operation èa permutation with changing communication patternè, the MP- averages about 9 ticks while the MP-1 averages about ticks and the variance of both samples is very similar. From this I conclude that the MP- is indeed about è slower on this operation than the MP-1. Raw router communication In this section we will examine the behavior of the router statement and the rsend and rfetch library functions..1 The router statement From ægure we ænd that router send èusing a permutation as the communication pattern and moving data from memory and to memoryè is slightly slower on the MP-. The relative performance for full 1 Since the MasPar machines can transparently perform automatic retries when communication operations fail, a machine can get a little slower èinstead of failingè when some component has intermittant errors.

6 RAW ROUTER COMMUNICATION 7 send..ap1 send..ap Repetition Nr. Figure 1: router permutation sensitivity 7 send.1.pp send..pp send..pp 7 fetch.1.pp fetch..pp fetch..pp log(#active_pes) Figure : router send Figure 3: router fetch log(#active_pes) PE activity are on the order of 3è7 è=9èè for byte packets, è9 è=9èè for byte packets, and 3è395 è=9èè for 1 byte packets. Note that these values are derived from averaging of small samples of about measurement values which diæer by up to 11è. These variations mean that the results can only give an idea of the overall behavior but are not precise in the sense of statistical signiæcance. The absolute values given in the text are also rounded somewhat to ease their perception. For the router fetch corresponding to the above send, on the other hand, we ænd that the MP- is slightly faster, as ægure 3 shows. The relative performance is 13è for byte packets, 15è for byte packets, and 1è for 1 byte packets. That send performance is below 1è while fetch performance is above 1è is no accident, as we can see from ægure : For a constant number of router steps, that is, independent of the actual permutation used, the same diæerences show up: For instance for 35 router steps, the send took about 39 ticks on the MP-1 but almost ticks on the MP-, a relative performance of about 9è. For 39 fetch router steps, on the other hand, the MP-1 took almost ticks while the MP- needed only 39 a relative performance of 13è. As expected, ægure 5 indicates that the same number of router steps are required for a given problem This comment applies to most other comparisons as well. Sometimes, however, the number of relevant data points is much larger.

7 . Using other activation patterns 7 7 send.1.pp send..pp fetch.1.pp fetch..pp log(#active_pes) Figure : router send vs. fetch 5 5 send..pp fetch..pp 7 send..pp fetch..pp 35 routercount log(#active_pes) 5 Figure 5: sendèfetch routercount by PE activity Figure : router sendèfetch by routercount routercount on both machines. This was to be expected since the communication network functions algorithmically identical on both machines. Figures 7 and depict the analytic approximation describing the dependency between the number of active PEs and the time used for a -byte permutation send on the MP-1. We see that for the MP- this approximation is a bit too small for high PE activity èright part of ægure 7è and a bit too large for low PE activity èleft part of ægure è.. Using other activation patterns Using regular activation patterns instead of stochastic ones does not reveal any interesting new eæects èægure 9è..3 Using other communication patterns Using regular communication patterns instead of random permutations also does not reveal any interesting new eæects èægure 1è.

8 RAW ROUTER COMMUNICATION 7 send..pp.*x + 1*sqrt(x) send..pp.*x + 1*sqrt(x) # active PEs # active PEs Figure 7: router send analytic approximation èlinear scaleè Figure : router send analytic approximation èlogarithmic scaleè 7 send..pp send..rp send..bp 7 fetch..pp fetch..ps1 fetch..ps1 fetch..pr fetch..p log(#active_pes) log(#active_pes) Figure 9: router send with probabilistic vs. regular activity Figure 1: router fetch with diæerent communication patterns

9 . rsend and rfetch send..pp send..pp ss_rsend.3.pp ss_rsend.1.pp 15 send..pp ss_rsend.3.pp ss_rsend.1.pp ss_rsend.5.pp log(#active_pes) Figure 11: router send large packets Figure 1: router send very large packets 1 log(#active_pes) 1 1 ss_rsend..pp ss_rfetch..pp pp_rsend..pp pp_rfetch..pp 7 send..pp ss_rsend..pp log(#active_pes) Figure 13: rsend vs. rfetch Figure 1: router send vs. ss rsend 1 log(#active_pes). rsend and rfetch..1 sending large packets The timings for sending larger packets of data are given in ægures 11 and 1. They indicate that large packets may result in signiæcant increase in the relative performance of the MP-. This is due to the higher memory bandwith. For instance, a 5-byte send with 13 PEs active uses only about 1 ticks instead of 1 on the MP-1 è1è performanceè. 1-byte send with 1 PEs active uses 3 instead of 3 è1è performanceè. 3-byte sends perform at about 115è... rsend vs. rfetch Figure 13 indicates that ss rsend performs at 1è for -byte packets and ss rfetch at about 13è. For the indirect-addressing versions pp rsend and pp rfetch, the values are similar.

10 1 RAW ROUTER COMMUNICATION 3 ss_rsend.3.pp ss_rsend.3.rp ss_rsend.3.bp 1 1 ss_rsend..pp sp_rsend..pp ps_rsend..pp pp_rsend..pp log(#active_pes) log(#active_pes) Figure 15: rsend with probabilistic vs. regular activity Figure 1: rsend with singular and plural pointers è bytesè ss_rfetch..pp sp_rfetch..pp ps_rfetch..pp pp_rfetch..pp ss_rsend.3.pp sp_rsend.3.pp ps_rsend.3.pp pp_rsend.3.pp log(#active_pes) log(#active_pes) Figure 17: rfetch with singular and plural pointers è bytesè Figure 1: rsend with singular and plural pointers è3 bytesè..3 rsend vs. router statement Given the previous results, the comparison of rsend versus router statements èægure 1è yields no additional information. The results using diæerent activation patterns on rsend èægure 15è are consistent with the results from ægure 9 and singular vs. plural pointers In ægures 17 and 17 we ænd all four variants of the xx rsend as well as the xx rfetch command to perform at roughly 1è for -byte packets. When sending larger packets of 3 bytes, very surprising observations can be made èægure 1è: While ss rsend performs at about 115è, as seen before, the other variants are not above 1è on the MP-! For instance at 1 PEs active, sp rsend performs at about 9è, ps rsend at 1è, and pp rsend at 93è. For more than 1 PEs active, the measurements deviate very much, so that it is not possible to give a good estimation of performance, but it seems that the performance tends to be below 1è. I am not able to explain this eæect.

11 11 7 xnetesend.1 xnetesend. xnetesend. 7 xnetefetch. xnetesend. ss_xfetch. ss_xsend distance 1 Figure 19: xnet send Figure : xnet send vs. fetch 1 distance 7 ss_xfetch. ss_xfetch..y pp_xfetch. pp_xfetch..y distance Figure 1: xfetch in nonstandard direction 3 xnet communication For a simple xnet send operation we ænd in ægure 19 that for 1-byte packets the MP performs at 1è, for -byte packets at 15è, and for -byte packtets at about 11è Figure shows that both, ss xsend and ss xfetch perform at about 1è for -byte packets. A bit surprising èalthough only when judged in the light of the previous resultsè is the high performance for pp xfetch èægure 1è of 11è. For xfetch in non-straight directions, ægure 1 indicates a constant acceleration of about 1 ticks for the ss xfetch variant and 55 ticks pp xfetch, resulting in a performance of 19è for distance. For standard directions, we ænd 1è performance for the ss xsend and ss xfetch operations èægure è. The pp xsendèpp xfetch operations show a performance increase that consists of two parts: a constant acceleration of 1 and 5 ticks, respectively, plus a slightly faster communication per distance step. This results in relative performances of 175è and 1è for distance or 1è and 11è for distance èalso ægure è. A similar although much less drastic eæect occurs for the sp xxx and ps xxx variants of xsend and xfetch. The relative performance for these cases varies between 133è and 1è èægure 3è. For larger 3-byte packets, we get a performance of 11è from pp xfetch and 1è from pp xsend for

12 1 COMMUNICATION LIBRARY FUNCTIONS 7 ss_xfetch. ss_xsend. pp_xfetch. pp_xsend. 7 sp_xfetch. sp_xsend. ps_xfetch. ps_xsend distance Figure : ss xfetchèsend vs. pp xfetchèsend Figure 3: sp xfetchèsend vs. ps xfetchèsend distance 3 5 pp_xfetch. pp_xsend. pp_xfetch.3 pp_xsend distance Figure : pp xfetchèsend for small and large packets distance èægure è. Communication library functions.1 reduce The performance of the various reduce operations that were tested èægure 5è were 11è for reduce- Max, 133è for reducemaxd, 17è for reduceadd, and 1è for reduceaddd.. scan Scan operations èægure è are not faster on the MP- for bit integers. For -bit æoating point values there is a small acceleration, provided there are only a few segments, i.e. the computation part of the call is dominant. In this case è segmentsè the performance is 17è for scanmaxd and 11è for scanaddd. For æoating point scans on a large number of small segments è 1 è, performance drops to about 1è again.

13 .3 sendwith 13 7 reduceadd.p reducemax.p reduceaddd.p reducemaxd.p 3 scanadd.p scanmax.p scanaddd.p scanmaxd.p log(#active_pes) Figure 5: reduce Figure : scan log(#segments) sendwithadd.1r sendwithmax.1r sendwithaddd.1r sendwithmaxd.1r 11 sendwithadd.1r sendwithadd.r sendwithadd.rr sendwithadd.r log(#destinations) log(#destinations) Figure 7: sendwithadd vs. sendwithmax Figure : sendwithadd with diæerent PE activity.3 sendwith For the variants of the sendwith operation, the performance diæerences are not as dramatic as ægures 7 and may suggest, because these diagrams do not use as the lower end of the vertical axis. Nevertheless, the MP- performs at about 1è to 19è, regardless of the kind of sendwith operation performed and regardless of the number of destinations.. enumerate, selectone, selectfirst The performance for enumerate is about 13è, for selectone and for selectfirst 113è. actual diagrams are still boring, they are still not shown here. Since the.5 psort, rank For the rank and psort operations, as shown in ægure 3, we ænd roughly the following performance scores: rankd 19è, psortd 1è, rank 13è, and rankd 15è.

14 1 REFERENCES sendwithadd.1r sendwithadd.1c sendwithadd.1b 5 rank.p rankd.p psort.p psortd.p log(#destinations) log(#active_pes) Figure 9: sendwithadd with diæerent destination patterns Figure 3: rank and psort 5 Summary of Results Similar to the measurements made on the MP-1, the comparison of the same measurements for MP-1 and MP- yielded some results that were surprising. The most signiæcant of these is the fact that raw router communication can be slower on the MP- than it is on the MP-1. Looking at the big picture, the message that can be derived from the results in this report is the following: For realistic programs, one can usually expect a relative performance of between 1è and 1è from the communication operations, provided they are fed from and deliver to memory. This value is better than one would expect at ærst when looking at the technical speciæcations alone and stems mostly from the almost-twice-as-high memory bandwith of the MP-. Other observations to be made from the measurements are: 1. Router fetch is slightly cheaper than send on the MP-1. This diæerence is noticeably bigger on the MP-.. Router send is slower on the MP- than on the MP-1 for small packets è9è relative performance for -byte packetsè, but faster for large packets è115è relative performance for 3-byte packetsè. Even for 3-byte packets, relative performance mysteriously drops below 1è when indirect addressing èvia a plural pointerè is used. 3. Xnet communication was not slower on the MP- than on the MP-1 for any of the measurements used in this report, despite the fact that the technical speciæcations published by MasPar announced reduced bandwith of the xnet on the MP-.. The constant cost of an xfetch or xsend operation is signiæcantly lower on an MP-. This can result in a performance above 15è for small distances. 5. All complex communication functions from the MPL library are faster on the MP- than on the MP-1. For some cases this advantage is only in the per mille range èe.g. scan with very many segmentsè, for others it is almost factor èe.g. reduceaddd: 1èè. References ëpre93ë Lutz Prechelt. Measurements of MasPar MP-11A communication operations. Technical Report 1è93, Institut fíur Programmstrukturen und Datenorganisation, Universitíat Karlsruhe, D-75 Karlsruhe, Germany, January 1993.

Modular Arithmetic. is just the set of remainders we can get when we divide integers by n

Modular Arithmetic. is just the set of remainders we can get when we divide integers by n 20181004 Modular Arithmetic We are accustomed to performing arithmetic on infinite sets of numbers. But sometimes we need to perform arithmetic on a finite set, and we need it to make sense and be consistent