EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION


EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION WITH THE ARMEN ARCHITECTURE

C. Beaumont, B. Pottier
LIBr, Université de Bretagne Occidentale
BP 452 Brest, France

J.M. Filloque
I.U.T. de Brest and LIBr, Télécom Bretagne
BP 832 Brest, France

ABSTRACT

Accelerating discrete event simulation can be achieved in two principal ways: either by using dedicated coprocessors to speed up event evaluation or control task execution (such as enqueue/dequeue), and/or by developing or improving algorithms and protocols which take advantage of today's widely used general purpose parallel computers. In this paper, we consider the second point. More precisely, we show how some hardware complement to a conventional parallel machine (here a Transputer network) significantly improves global control operations in distributed simulation. We outline the first results obtained with the synchronous simulation kernel and expose the further developments they imply with asynchronous protocols.

INTRODUCTION

It is now well known that one of the best ways to speed up the execution of simulations is to use parallel architectures. Indeed, the use of dedicated coprocessors to accelerate computation is only possible in some special cases like logic simulation. Parallel discrete event simulation (PDES) requires the management of a virtual time, shared by all the processors, instead of the single logical clock of sequential algorithms. Many protocols have been proposed in order to maintain the causality constraint between events executed on each node. Reynolds has shown that the knowledge of the global minimum of local virtual times would improve PDES performance (Reynolds 1991). For a conventional MIMD machine like a Transputer network, the implementation of global controllers performing such computations implies significant overheads, and involves important delays bound to message transmissions. Some new MIMD parallel machines, like the CM5 (TMC 1991) or the Volvox Machine from Archipel Company, offer a specific high speed control network.
For example, the CM5 is designed with a fat tree network in which each node has computing capabilities. All global information is available in less than 2.5 microseconds. The machine ArMen (Pottier 1991) is the first implementation of an architecture where a programmable logic layer is associated with a Transputer network in order to counter the limitations of standard parallel machines. This configurable logic layer (CLL) is used to synthesize hardware programmable accelerators. This paper will show that a critical part of the control computations can be undertaken by the CLL, in parallel with the simulation productive work executed on node processors and on the primary communication network. We also present first results obtained with the ArMen architecture executing a synchronous simulation. The very first results show a speed-up in the range 40 to 20,000 for the execution of global control operations, depending on the kind of implementation we choose in the logic layer.

The paper is structured as follows: in the first section, we briefly describe the ArMen architecture and the services it provides with its logic layer. Then we present the implemented synchronous protocol which ensures properly ordered event execution. We show where the overhead of the protocol is located and how we implement these functions in hardware. The preliminary results obtained with a simple synchronous simulation kernel are presented in the following section, pointing out the improvement achievable by the use of the logic layer of ArMen. Finally, we discuss these results and conclude with an outline of the future work they imply.

ARMEN ARCHITECTURE

The ArMen machine is the prototype of an original architecture developed at Laboratoire d'Informatique de Brest. The basic idea is to combine a reprogrammable logic layer with a conventional MIMD machine. The current implementation involves modules built with a T800 Transputer, a Xilinx 3090 Logic Cell Array (LCA) (XILINX 1990), and a 1 Mbyte static memory. Logic Cell Arrays are reprogrammable circuits organized as an array of 16 × 20 logic cells and with four 36-bit ports.
Cells can implement boolean functions of up to 5 variables. The processor network is configured and controlled by link switches from a host computer interface board. Each processor is in charge of the configuration of its local LCA, an operation performed in less than 100 ms. Programmable circuits are connected together to form a

ring with a 32-bit data path. Figure 1 shows the structure of the machine and a two-node interconnection.

[Figure 1 diagram: host and primary network; each node comprises a T800 processor, local bus, memory and LCA; CLL control; MIMD level and CLL level]

FIGURE 1: TWO ARMEN NODES CONNECTED THROUGH THE PROGRAMMABLE LOGIC LAYER

An LCA can be used as a local coprocessor of its associated T800, or as part of a global coprocessor of the whole MIMD network. This architecture is composed of three parallel layers: the communication system, the sequential processor, and the synthesized operator arrays which implement the Configurable Logic Layer (CLL). The CLL is fully available to provide application speed-ups or additional services like:

1. Local acceleration for data intensive computations with the synthesis of dedicated data-flow parallel operator networks (Pottier and Lavenier 1989).

2. The synthesis of a global synchronous shared operator which processes data from the nodes array (Bouazza et al. 1991). This kind of operator is dedicated to one or several calculations performed by all nodes at the same time, in an SPMD mode. It can be seen as a large ALU fed by the processor array. Very good results have been obtained with the synthesis of image processing operators based on cellular automata theory.

3. Global support for the control of distributed algorithms (Filloque et al. 1991). Global autonomous processors can be built. They can detect stable properties or perform distributed computations, so as to give complementary services to the MIMD machine with very little overhead.

It is worth noting that these different services are complementary and not exclusive.

Programming such a machine is difficult, and thus software tools are being developed to take over the hardware work. A compiler for global operators based on cellular automata (Bouazza et al. 1991) and a compiler for distributed control algorithms based on the UNITY formalism (Dhaussy and Rubini 1992) are under development. They enable the use of high level languages to specify the behavior of the hard-wired coprocessors. The processors of the MIMD machine have access to the specified services by simple memory-like reads and writes or via interrupt channels.

SYNCHRONOUS DISTRIBUTED SIMULATION

The main problem in PDES is the management of distributed virtual time imposed by the central scheduler's distribution among processors. The causality constraint imposes that the messages received by a process have to be considered in globally non-decreasing timestamp order. In order to respect this rule, two main approaches (also called "protocols") may be used:

- the synchronous approach (Peacock et al. 1979), which we implement and discuss later in this paper;
- the asynchronous approach, where each processor is free to progress in virtual time as quickly as it wants. The respect of the causality constraint (i.e., the consistency in event execution order) is obtained either by a pessimistic (Chandy and Misra 1979; Chandy and Misra 1981) or an optimistic (Jefferson 1985) method.

Implementation of the phase algorithm

The protocol implemented in order to distribute the global scheduler of sequential simulation among the processor network consists in allowing the parallel execution of all events with the same date in virtual time. We call it the phase protocol because one has to wait until all event computations for the current date (i.e., phase) are completed before one can proceed to the next date (i.e., phase). Determining which will be the next date in virtual time to be simulated is achieved by a global computation. This next date, called Global Virtual Time (GVT), is the global minimum of all Local Virtual Times (LVT) (i.e., all event timestamps on each node). Once this minimum is calculated and broadcast, the protocol allows the parallel execution of the corresponding events on each node. The main loop of this algorithm is given in table 1. Note that if a processor has no more events to process in its local scheduler, it sends a huge value HUGEVAL as its local virtual time.
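As an illustration only, the phase loop of table 1 can be emulated sequentially in a few lines of Python. This is a sketch under stated assumptions, not the ArMen kernel: the names `Node`, `simulate` and `relay` are hypothetical, the value chosen for `HUGEVAL` is arbitrary, generated messages are assumed to be dated strictly after the current phase, and the barrier is modelled by delivering all messages only once every node has finished the phase.

```python
import heapq

HUGEVAL = 2**31 - 1  # assumed sentinel value meaning "no local event left"


class Node:
    """One simulated processor with its local scheduler (a min-heap)."""

    def __init__(self, events):
        self.queue = list(events)  # (timestamp, payload) pairs
        heapq.heapify(self.queue)

    def lvt(self):
        # Local virtual time: earliest pending timestamp, or HUGEVAL if idle.
        return self.queue[0][0] if self.queue else HUGEVAL


def simulate(nodes, handler):
    """Run the phase loop and return the list of simulated dates (GVT values)."""
    dates = []
    gvt = min(node.lvt() for node in nodes)  # global minimum and broadcast
    while gvt != HUGEVAL:
        dates.append(gvt)
        outbox = []
        for node_id, node in enumerate(nodes):
            # Model evaluation: execute every local event dated exactly GVT.
            while node.queue and node.queue[0][0] == gvt:
                _, payload = heapq.heappop(node.queue)
                outbox.extend(handler(node_id, gvt, payload))
        # Barrier reached: every node is done, deliver the generated messages.
        for dest, timestamp, payload in outbox:
            if timestamp <= gvt:
                raise ValueError("messages must be dated after the current phase")
            heapq.heappush(nodes[dest].queue, (timestamp, payload))
        gvt = min(node.lvt() for node in nodes)  # next date to simulate
    return dates


def relay(node_id, now, hops_left):
    # Hypothetical model: forward a token to the other node two ticks later.
    if hops_left > 0:
        return [((node_id + 1) % 2, now + 2, hops_left - 1)]
    return []


print(simulate([Node([(1, 2)]), Node([(2, 0)])], relay))  # [1, 2, 3, 5]
```

The loop ends as soon as the global minimum equals the sentinel, which is the only termination test this sketch needs.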
This huge value won't bother the GVT computation (remember that $GVT = \min_{i \in P}(LVT_i)$), and if this value becomes the result of the global computation, it means that the simulation is completed. Indeed, there are no more events to process in the schedulers, and no messages are in transit due to the protocol. Thus, the loop test from line (I) can be rewritten as:

$GVT \neq \mathrm{HUGEVAL}$    (I)

This protocol clearly respects the causality constraint since all processors are always executing events with

the same timestamp. This approach was proposed in (Peacock et al. 1979) and is discussed in (Filloque 1992).

Synchronous DES:
    GVT = GVT_Computation(tWakeUp_min);      /* Global minimum computation */
                                             /* and broadcast */
    while (¬ End_Simulation) {               (I)
        if (GVT == tWakeUp_min) {
            Model evaluation;
            Sending of generated messages;
            Waiting for acknowledgments;
        }
        Global_Synchronization();            /* ... in order to be sure that */
                                             /* every execution is over in */
                                             /* the current time step */
        tWakeUp_min evaluation;              /* Local minimum search */
        GVT = GVT_Computation(tWakeUp_min);  /* New global minimum */
                                             /* computation and broadcast */
    }

TABLE 1: MAIN LOOP OF OUR PHASE PROTOCOL

In such a distributed synchronous simulation kernel, the two main global control operations are: (i) the synchronization barrier that every processor must reach before the global virtual time can progress (i.e., function Global_Synchronization in table 1), (ii) the calculation of the global minimum of all the local virtual times in order to determine the next date to be simulated (i.e., function GVT_Computation). These two global operations have been implemented in the CLL of the ArMen machine.

Hardware implementation

Considering the synchronization barrier (fig. 2), every Transputer sends a flag to its LCA indicating its wish to synchronize. Observing results from a pipelined AND-function along the ring structure of the CLL, an automaton on node 0 can send a flag when every processor has reached the barrier, that is, when it receives TRUE from the pipeline. Once woken up, the processors are sure that no more messages will be sent for the current time step, so they have to determine which will be the next date in virtual time to be simulated.

The global minimum calculation can then be implemented in either a digit-serial or a parallel way. In the digit-serial method, all processors contribute to the calculation at the same time under control of the simulation kernel. The ability of ArMen to shift in an SPMD mode is efficient for this kind of computation. The function implemented in the LCA (see fig.
3) computes the minimum between three values: the local one and the two adjacent ones. For each node, the Transputer writes its local minimum in the LCA beginning with the high order bit and reads back the result. The number of iterated read/write operations depends on the number of processors, on the size of the minimum, and on the width of the digits written by processors. Limiting the number of accesses of the T800s to this global operator is achieved by pipelining the function in the CLL and/or by considering wider digits (2, 4, 8 bits).

[Figure 2 diagram: ready and start signals of nodes 0 to n-1 combined through AND gates along the LCA ring, and the finite state automaton of node 0]

FIGURE 2: SYNCHRONIZER'S IMPLEMENTATION ON THE CLL AND NODE 0 FINITE STATE AUTOMATON

The parallel method uses the ring structure of the CLL as a systolic communication channel. Two global operations are executed: the first computes the global minimum in a systolic fashion ($LCA_i$ receives $\min_{0 \le j \le i-1}(val_j)$ from $LCA_{i-1}$ and sends $\min_{0 \le j \le i}(val_j)$ to $LCA_{i+1}$). The second operation broadcasts the global result to each node. The latter method has not been implemented yet.

PERFORMANCES

The model we chose to test our synchronous simulation kernel is a 2D-torus where every process executes the same code (table 2).

REPEAT
    receive msg from port N or W at time t;
    pseudo-treatment;
    send msg to port E or S (alternately) for time t+1;
UNTIL local_time == end_simulation;

TABLE 2: CODE OF THE SIMULATED PROCESSES

Clearly, with such processes the volume of messages is proportional to the number of simulated processes. The pseudo-treatment is inserted to represent the effective simulation computation (one unit of pseudo-treatment lasts about 0.5 ms). The execution times and relative speed-ups obtained for the simulation of such a torus of 64 processes are given in figures 4 and 5.
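To make the workload of table 2 concrete, here is a rough Python sketch of the torus model. It is not the authors' test program: the function name `run_torus` is hypothetical, and the initial conditions are assumptions (one pending message per process at the first date, and every process sending east first). It only counts how many events are executed at each date.

```python
def run_torus(n, steps):
    """Count the events executed per time step on an n x n torus."""
    # inbox[r][c]: messages waiting at process (r, c) for the current date.
    inbox = [[1] * n for _ in range(n)]          # assumed initial condition
    send_east = [[True] * n for _ in range(n)]   # assumed first direction
    events_per_step = []
    for _ in range(steps):
        next_inbox = [[0] * n for _ in range(n)]
        executed = 0
        for r in range(n):
            for c in range(n):
                for _msg in range(inbox[r][c]):
                    executed += 1  # one pseudo-treatment per received message
                    if send_east[r][c]:
                        next_inbox[r][(c + 1) % n] += 1   # port E, wraps around
                    else:
                        next_inbox[(r + 1) % n][c] += 1   # port S, wraps around
                    send_east[r][c] = not send_east[r][c]  # alternate E and S
        inbox = next_inbox
        events_per_step.append(executed)
    return events_per_step


print(run_torus(8, 5))  # [64, 64, 64, 64, 64]
```

Because each received message produces exactly one message for the next date, the number of events per step stays equal to the number of processes, which is what makes the message volume proportional to the size of the torus.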
The main results to notice are:

- the use of the CLL always improves execution times, compared to pure software parallel execution;
- when the CLL is used, the distribution overhead is hidden behind computation as soon as there are event calculations;
- when the volume of pseudo-treatment grows, the curves converge. This is due to the fact that the CLL only speeds up the global control operations, and not event evaluations.

[Figure 3 diagram: global and local clocks, processors a to f, and the cellular automata network for minimum computation]

FIGURE 3: DIGIT-SERIAL IMPLEMENTATION OF THE GLOBAL MINIMUM COMPUTATION AND BROADCAST

[Figure 4 plot: time (in msec.) versus computation volume (in units of pseudo-treatment); curves for 1, 2 and 4 nodes in pure software and for nodes + LCA]

FIGURE 4: EXECUTION TIME OF 64 PROCESS SIMULATION

[Figure 5 plot: speed-up relative to sequential execution versus number of nodes; curves for no treatment, 1, 2, 4, 8 and 16 treatments, and the maximum achievable speed-up]

FIGURE 5: RELATIVE SPEED-UP OBTAINED FOR A 64 PROCESS SIMULATION USING CLL

Another point worth mentioning is the speed-up obtained on control operations (see table 3). Using the CLL, we achieve (compared to a pure software solution using the Trollius operating system (Burns et al. 1990)) speed-ups in the order of:

- 40 for a global minimum computation on four 32-bit integer values using the digit-serial method, with 1-bit digits and one level of pipeline in the LCA. Considering, for example, 8-bit digits and 2 levels of pipeline, the speed-up is then 600, and 650 when computed on eight values with 8-bit digits and 4 levels of pipeline. It is also important to notice that we do not really need 32-bit values, as we can subtract the last value of virtual time from the local minimum;
- 1,000 to 3,500 for the synchronization barrier (these relatively poor results are due to the performance of the Transputer: we use interrupt signals here to synchronize processors, whereas the signal transfer time in an LCA is in fact about 20 ns);
- 20,000 for a global minimum computation on four 32-bit integer values using the parallel method (estimated result considering a single-LCA traversal delay of 50 ns).
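The effect of the digit width on the number of write/read rounds can be illustrated with a simplified software model of a digit-serial minimum. This is a sketch, not the cellular automaton of fig. 3: it ignores the ring topology and the pipeline levels, the function name `digit_serial_min` is hypothetical, and the elimination rule (a node drops out once its digit exceeds the smallest digit presented in that round) is a generic most-significant-digit-first scheme assumed here for illustration.

```python
def digit_serial_min(values, bits=32, width=8):
    """Return (minimum, rounds) using most-significant-digit-first elimination."""
    if bits % width != 0:
        raise ValueError("bits must be a multiple of the digit width")
    if any(v < 0 or v >= (1 << bits) for v in values):
        raise ValueError("every value must fit in the given number of bits")
    candidates = list(range(len(values)))  # nodes still able to hold the minimum
    mask = (1 << width) - 1
    result = 0
    rounds = 0
    for shift in range(bits - width, -1, -width):
        # Each remaining node presents its next digit, high order digit first.
        digits = {i: (values[i] >> shift) & mask for i in candidates}
        smallest = min(digits.values())
        # Nodes whose digit is larger than the smallest one drop out.
        candidates = [i for i in candidates if digits[i] == smallest]
        result = (result << width) | smallest
        rounds += 1
    return result, rounds


lvts = [900, 17, 4242, 17]
print(digit_serial_min(lvts, bits=32, width=1))  # (17, 32)
print(digit_serial_min(lvts, bits=32, width=8))  # (17, 4)

# Subtracting the last GVT first lets narrower values be exchanged.
last_gvt = 1_000_000
offsets = [t - last_gvt for t in (1_000_005, 1_000_002, 1_000_009)]
smallest_offset, rounds = digit_serial_min(offsets, bits=8, width=8)
print(last_gvt + smallest_offset, rounds)  # 1000002 1
```

In this model the number of rounds is simply the value width divided by the digit width, so wider digits or narrower values (obtained by subtracting the last virtual time) both reduce the number of accesses each processor has to make.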

Number of nodes / Global operation
Synchronization: pure soft, with CLL
Global minimum: pure soft, with CLL

TABLE 3: EXECUTION TIME FOR GLOBAL OPERATIONS ON THE ARMEN ARCHITECTURE (TIMES ARE GIVEN IN S, GLOBAL MINIMUM IS COMPUTED IN BIT-SERIAL WAY ON 32-BIT VALUES)

CONCLUSION AND FURTHER WORK

The speed-ups obtained for control computation by the use of the CLL significantly improve the performance of the simulation when a large number of events are processed at each time step: considering an n × n torus, the simulation computes n × n events at each time step. As the number of events decreases, the synchronous approach will be surpassed by the asynchronous one. This last method requires other implementations because global control does not act the same way. Corresponding work has begun with a study of possible approaches implementable in the CLL (Filloque 1992). In particular, the CLL provides efficient possibilities to compute local functions of selected contributed values. It is intended to implement asynchronous protocols based on the minima of virtual time from processor subsets (the neighbourhood of each processor). The effective implementations and tests are still to be carried out.

Another possibility to improve PDES is studied by Dirkx (Dirkx 1993) and consists in implementing a hardware scheduler in an ASIC associated with a Transputer node. With a large LCA, it will be possible to combine both control and event scheduling in hardware to obtain better performance. We plan to work in the same direction.

REFERENCES

Bouazza, K.; Champeau, J.; Ng, P.; Pottier, B.; and Rubini, S. (1991). Implementing Cellular Automata on the ArMen Machine. In Quinton, P. and Robert, Y., editors, Proceedings of the Workshop on Algorithms and Parallel VLSI Architectures II. Elsevier.

Burns, G.; Radiya, V.; Daoud, R.; and Machiraju, R. (1990). All about Trollius. Occam User's Group Newsletter.

Chandy, K. and Misra, J. (1979). Distributed simulation: A case study in design and verification of distributed programs. IEEE Transactions on Software Engineering, 5(5).

Chandy, K. and Misra, J. (1981). Asynchronous distributed simulation via a sequence of parallel computations. Communications of the ACM, 24(11).

Dhaussy, P. and Rubini, S. (1992). Specification and Compilation of Distributed Algorithms for the ArMen Machine. Technical Report 92-02, LIBr - Télécom Bretagne.

Dirkx, E. (1993). Discrete Event Simulation on a MIMD Parallel Computer: Algorithm Optimization or Hardware Acceleration? In Proceedings of EWPDP 93, Gran Canaria.

Filloque, J. (1992). Synchronisation répartie sur une machine à couche logique reconfigurable. PhD thesis, Université de Rennes 1.

Filloque, J.; Gautrin, E.; and Pottier, B. (1991). Efficient computation on processor network with programmable logic. In Proceedings of PARLE 91, number 505 in LNCS. Springer-Verlag.

Jefferson, D. (1985). Virtual Time. ACM Transactions on Programming Languages and Systems, 7(3).

Peacock, J.; Manning, E.; and Wong, J. (1979). Distributed simulation using a network of processors. Computer Networks, 3(1).

Pottier, B. (1991). ArMen : Une machine parallèle intégrant un réseau de circuits logiques programmables. PhD thesis, Université de Rennes 1.

Pottier, B. and Lavenier, D. (1989). High rate sigma filtering, feasibility studies on processors networks. In Proceedings of IFIP Workshop on Parallel Architectures on Silicon, Grenoble, France.

Reynolds, P. (1991). Efficient Framework for Parallel Simulations. In Proceedings of the SCS Multiconference on Advances in Parallel and Distributed Simulation, San Diego, USA.

TMC (1991). The Connection Machine CM-5 Technical Summary. Technical report, Thinking Machines Corporation, Cambridge, Massachusetts.

XILINX (1990). The Programmable Gate Array Data Book. Xilinx, San Jose, USA.
