A Framework for Distributed Computation Over a Heterogeneous Beowulf Cluster

Jared A. Heuschele, Computer Science, University of Wisconsin-Eau Claire, heuschja@uwec.edu
Andrew T. Phillips, Computer Science, University of Wisconsin-Eau Claire, phillipa@uwec.edu

Abstract

We present a problem-independent framework's software design that takes a description of some computational problem, consisting of both the mathematical model and its data, and then performs the calculations in a distributed computing environment. The MPI standard for distributed computing over a network of heterogeneous workstations is used, but the software framework is fully application independent. Specific goals of the project were dynamic process control and load balancing, and the development of a C++ object oriented framework that would take a description of the computational problem and its data and distribute the computations over a heterogeneous Beowulf cluster. That is, the distributed computing aspect of the calculation is completely separated from the problem and its description. We will demonstrate the success of the application framework for distributed computation, including dynamic load balancing and process management on a network of Linux and IRIX workstations, all in the context of a non-trivial application, namely the Protein Folding Problem.
This paper describes a problem-independent, framework-based implementation of a variation of the traditional master-slave distributed computing model. Our framework approach takes a description of the problem, consisting of the data and the mathematical/algorithmic model, and performs all calculations in a distributed computing environment without requiring any knowledge or understanding of distributed computation on the part of the user. Hence, the framework approach completely divorces the distributed computation from the user-defined problem statement and solution. Furthermore, our distributed computing framework includes dynamic process control and load balancing using a C++ object oriented model.

Distributed computation has been practiced since the advent of multiple processor supercomputers in the early 1980s. With the widespread growth of computer networks and the availability of message passing software, it is also now the domain of common desktop PCs. In this paper we describe a model for distributed computing that makes this process both easier and problem independent. We do this by creating a framework for distributed computing which separates the distributed computing aspect of the implementation from the problem and its description/model.

Model for Distributed Computation

Distributed computation is a method of breaking up a problem into several computational parts and meting out those parts to several independent processors. There are several hardware configurations on which to implement distributed computing, and we have chosen to use a heterogeneous Beowulf cluster. A Beowulf cluster consists of several independent computers, known as "nodes," linked together via ethernet or other network technology in an attempt to create the gestalt effect of a supercomputer. MPI (Message Passing Interface), a standard protocol for distributed computation, is then implemented on the Beowulf cluster in order to run distributed computations (MPI Forum, 1997). While we have selected the Beowulf cluster implementation of distributed computation, none of our design is in any way dependent on this choice. We assume only that whatever distributed computation implementation is chosen, it supports message passing via MPI.

There are many different distributed computing models that can be used on a Beowulf cluster. We have chosen a variation of the traditional master-slave model (Breshears, 1998). In the typical master-slave scenario, the master assigns tasks for the slaves to process in parallel. Slaves, in turn, perform most or all of the computation relevant to the application. After the computations are done by a slave, the results are reported back to the master, which then combines the results and continues the cycle of distributing the tasks until the entire computation is complete. The key point is that the master coordinates, or manages, the distribution of work and the compilation of the results, while the slaves are solely responsible for the actual computational kernel.

Our design for distributed computation over N nodes on a Beowulf cluster differs slightly from this master-slave scenario. Our model involves dividing the individual processors into a hierarchy that consists of one node running the master process, another running an assistant process, and the other N-2 nodes running the slave processes. Once started, the master node/process is responsible only for assigning the workload, which is dynamically determined, to the slave nodes and monitoring their progress. As before, the slaves do the bulk of the actual computation, but report their results to the assistant (not the master). The assistant's duty is then strictly to gather and process the individual results reported to it directly by the slaves. This division of labor between master and assistant reduces the amount of communication/computation the master is required to perform, and frees it to focus exclusively on managerial activities.

To use the processing power of a Beowulf cluster at or near its greatest potential, the master program must also monitor the slave nodes' progress in order to deal with the difficulties that arise in the environment created by the hardware and software of heterogeneous machines and operating systems. Difficulties encountered include, but are not limited to: network failures and bottlenecks, memory failures, and scheduling bottlenecks associated with the operating system. An effective way of circumventing these potential difficulties is to incorporate on the master node a dynamic process control mechanism that optimizes job distribution by monitoring the load balance among the N-2 slave nodes. Thus, the master process is responsible for more than distributing work to the slaves.

In our model, the majority of this mechanism is accomplished via an observer thread that runs on the same node as the master process. The observer keeps a record of the tasks assigned to slaves and the status of each task, including whether it is completed, in progress, or stalled. On a more fundamental level, it acts as an iterator for the master. That is, the master is not responsible for determining to which slave to send the next task; instead, it just asks the observer for the next slave. The observer therefore provides the basic operation of iteration among the list of all active slaves. The observer also tracks the status of each slave. Should one of the slaves die, the observer notes this and ensures via its iterator methods that the master avoids assigning tasks to the dead node. Furthermore, the observer measures the effective speed of a slave by tracking task completion times, and enables the master to assign the appropriate amount of work to each slave. Thus, the load balancing and process management is abstracted out of the master into the observer.

Of course, the master is responsible for initially assigning tasks to each of the slaves, which it does with the help of the observer's iterator methods. As slaves complete their tasks, the master assigns them more work and coordinates the load balancing based on the recommendations of the observer. Thus, load balancing is an inherent part of the master process; faster slaves will be able to request and receive more tasks. Finally, task completion is tracked via verification from the slaves as well as node monitoring by the observer.

In contrast to the managerial nature of the master, slaves do the actual computational work. They await data from the master, work on it in a user-defined way, and then send the results to the assistant, and a corresponding confirmation of completion to the master, which then prompts the master to send that slave another task. Note that slaves need know nothing about the existence of other slaves, nothing about the workload distribution, and nothing about the overall value/use of the results that they compute.

The assistant is responsible for processing the results generated by the slaves. It needs to communicate with the master only at the end of the entire computation in order to confirm to the master that all work has been completed and the program is ready to end. The communication that occurs in our master-assistant-slave model is therefore more complicated than in a typical master-slave model (Figure 1):

[Figure 1: the master-assistant-slave communication pattern. Arrows 1 and 3 connect the master and a slave, arrow 2 connects the slave and the assistant, and arrows 4 and 5 connect the master and the assistant.]
In step one, the master sends a data set to the slave for processing. The slave then processes that data set and, in step two, passes the result to the assistant, and immediately (step three) sends a confirmation to the master in order to receive another data set. When the master has assigned all work to the slaves, it sends a single message to the assistant indicating that it has finished assigning tasks (step four). When the assistant finishes processing the results received from the slaves, step five is to send a final confirmation indicating problem completion to the master.

The Framework

The design and construction of the dynamic process control and communication in the context of a problem-independent framework is the key to our model. Much of the software that implements the rudimentary communication between, and distribution to, the nodes is widely available in software library packages like MPI, which we use. However, the implementation of an object oriented framework for distributed computation is not included in this and other library packages.

A "framework" in the context of computer software is an extension of the principle of code reuse. The four primary benefits of object oriented frameworks are modularity, reusability, extensibility, and inversion of control (Fayad, Schmidt, and Johnson, 1999, p. 8). Of the four, inversion of control is the benefit unique to frameworks. Conventional code reuse consists of inserting preexisting modules into the code developed to solve a problem in order to save time in the construction process. A framework is the figurative inverse of this process: instead of plugging the preexisting modules into the problem, the problem is plugged into the framework. Frameworks are pieces/suites of code that implement a certain process or model and execute independent of the specifics of a problem. Frameworks serve as a template for a problem solution technique.

In the case of distributed computation, our framework implements dynamic load balancing, slave process management, and all communication via low-level message passing, all of which can execute independent of the details of the user-defined problem. These aspects of distributing the computation must be maintained independently and remain unrelated to the actual calculation. Of course, details are necessary in the distribution of problem-specific data to each slave process (which is the responsibility of the master process), the gathering and processing of the individual results (which is the role of the assistant process), and in the calculations specific to the problem (which is the duty of the slave processes). Thus, when a framework is constructed, all that is needed is a description of certain specific master, assistant, and slave process responsibilities. These descriptions are encapsulated in the four user-defined functions used by the framework, which encompass the user-defined solution method, its associated data to be calculated, and how the calculations are to be distributed. These four functions are:

1. UserNextTask: defines and provides a specific data set on which to perform computations. The master requests this data set from the user prior to communicating with a slave.

2. UserWork: defines the computational kernel of work to be done. Each slave is responsible for providing a specific data set (obtained from the master) on which to perform this work.
3. UserCombine: combines results (as reported by the slaves) in a user-defined way. The assistant is responsible for providing a specific solution vector for combining by the user.

4. UserTasksDone: determines when there are no more data sets/tasks to be completed. The master makes this request of the user prior to asking for a next data set (via UserNextTask).

Thus, the user's problem, confined to these methods, can actually be written and executed sequentially. The user does not have to know anything about MPI, message passing, or the vagaries of distributed computing. This is not to say that using the framework to solve the user's problem is plug and play. The user must understand and identify the parallelism inherent in the solution method.

Sample Computation

As a simple example of this model, we use the following elementary computation:

    s = Σ_{i=1}^{n} σ_i²,  where each σ_i ∈ ℝ and n is considered to be large.

Our framework approach would then require the following simple set of user-defined functions/behaviors:

    UserWork(σ_i): returns σ_i²
    UserCombine(σ_partial): updates s = s + σ_partial
    UserNextTask(): returns the next σ_i
    UserTasksDone(): returns true if and only if all n σ_i have already been requested

In this case, each slave simply computes σ_i² (via UserWork) given a σ_i from the master (the σ_i is obtained via UserNextTask). Upon completion of this task, the slave sends its result to the assistant, which in turn keeps a running total of the results via UserCombine. When the master has determined that all tasks have been completed (via UserTasksDone and confirmation from all slaves), the assistant is notified to prepare to report the final result. Notice that the user is responsible for understanding the inherent parallelism in the computation, but that no reference to distributed computation or MPI is expected or required.

Reliable Message Passing

Distributed computation models rely heavily on the reliable transportation of user data. In our framework, data buffers are sent from the master to be processed by the slaves, and results are stored in buffers that are sent from the slaves to be gathered and processed by the assistant. Memory cannot be shared by these processes due to the distributed nature of the computing environment; that is, we are assuming a distributed memory model, not a shared memory one. Therefore, each of these transfers requires that separate buffers be allocated by the master, assistant, and slaves on each of their respective nodes. Master and assistant need only manage one buffer, whereas the slave has two buffers to coordinate: one to store data received from the master, and another in which to send the results to the assistant.

Because of the need for reliable transport of the buffers containing the user data, the user must specify the maximum length of data buffer that each individual task might require. By having the user specify this maximum length, the allocation, passing, and referencing of these data buffers is greatly simplified. All data buffers used in message passing are thus allocated by the framework, an abstraction that hides one more complicated aspect of MPI operation from the user. There is only one type of data that can be reliably transferred between multiple systems in a heterogeneous cluster: the IEEE double. By using the IEEE standard 64-bit floating point representation, the framework avoids the overhead of tracking various architecture differences that can result in non-portable or incorrect message passing.

The Object Oriented Framework

Our framework runs on each machine using a single program, multiple data (SPMD) format; each machine compiles and runs the same code, but executes that code differently depending on which personality (master, slave, assistant) is initiated (Almasi & Gottlieb, 1994). Each node, whether master, slave, or assistant, shares common attributes and behaviors, which are abstracted in a base class Mpi_t. Each node personality (Master_t, Assistant_t, or Slave_t) then inherits from this base class, and then determines if it is the master, the assistant, or a slave. Each node also determines the identity (node number) of the master and assistant. A node's identity is easily established in the Mpi_t constructor with the function MPI_Comm_rank(), which finds the node's rank in the overall model (OSC, 1996). A node's identity does not change throughout the run of the computation. The master node always has a rank of 0, and the assistant node always has a rank of N-1 in an N node cluster.

The run methods of each node type can be summarized in the following C++ pseudocode:

    void Master_t::Run() {
        while (observer->MoreSlaves() && !UserTasksDone()) {
            allocate workBuffer memory for message passing;
            UserNextTask(workBuffer);
            send workBuffer to observer->CurrentSlave();
            deallocate the workBuffer memory;
            observer->NextSlave();
        }
        while (!UserTasksDone()) {
            receive confirmation of work completed by slave #i;
            allocate workBuffer memory for message passing;
            UserNextTask(workBuffer);
            send workBuffer to slave #i;
            deallocate the workBuffer memory;
        }
        while (there are still tasks unfinished) {
            receive confirmation of work completed from any slave;
        }
        send message to assistant declaring end of task allocation,
            including count of tasks completed by slaves;
        receive confirmation from the assistant;
        StopAllSlaves();
        StopMeNow();
    }

The master run method begins by sending off a task to each slave. It uses the observer to iterate through the list of slaves, and does so as long as a task exists for a slave to complete (determined by calling UserTasksDone). After sending a job to each of the slaves, it waits for a slave to complete a task, and then sends that slave a new task if one remains. Tasks are continuously assigned as slaves complete their work, with the help of UserNextTask. After all tasks have been assigned, the master waits for the slaves to confirm that the remainder of the tasks have been completed.

    void Slave_t::Work() {
        allocate memory for resultsBuffer for message passing;
        UserWork(resultsBuffer, workBuffer);
        send resultsBuffer to assistant;
        send confirmation to master;
        deallocate the resultsBuffer memory;
    }

    void Slave_t::Run() {
        bool keep_working = true;
        while (keep_working) {
            allocate memory for workBuffer for message passing;
            switch (wait for message and buffer from master) {
                case stop:    StopMeNow(); keep_working = false; break;
                case more2do: Work(); break;
                default:      keep_working = false; break;
            }
            deallocate the workBuffer memory;
        }
    }

The slave run method waits in a loop for a message from the master. In that message, a tag determines whether the slave continues working or whether it exits and terminates on the node. If it is told to do work, it calls its work method, which executes the UserWork function to accomplish the computation. The slave next sends the results to the assistant and a confirmation to the master, upon which it exits the work method and returns to the run method awaiting more work.
    void Assistant_t::Work() {
        allocate memory for resultsBuffer for message passing;
        wait for a message and resultsBuffer from anybody;
        if (sender == master) {
            masterNotifiedUs = true;
            mastersTally = count of completed tasks as sent by master;
        } else {  // it was from a slave
            UserCombine(resultsBuffer);
            completedJobs++;
        }
        deallocate resultsBuffer memory;
    }

    bool Assistant_t::IsAllWorkDone() {
        return (masterNotifiedUs && (mastersTally == completedJobs));
    }

    void Assistant_t::Run() {
        while (!IsAllWorkDone()) {
            Work();
        }
        send confirmation to master;
        StopMeNow();
    }

The assistant executes similarly to the slave; its run method consists primarily of a waiting-for-message loop, but in a different form. It checks to see if it has received the message to terminate from the master, and checks the master's tally of tasks assigned against the tally of tasks collated by the assistant. If neither of these conditions is met, it executes its work method, which waits for a message from anybody, processing the results via UserCombine if the message was from a slave, or checking the tally from the master otherwise. If the exit conditions are met, it sends a confirmation of completion to the master and terminates. At this point in the execution, computation is complete, and the program exits normally on each node.

Note that message passing is completely abstracted out of the framework with the class Message_t. It contains the necessary handle instances and other members to facilitate message passing. Its methods include two sending and two receiving functions, one each for merely sending and receiving a tag (a message with no associated buffer), and another two to send and receive both tag and data. Here is the class definition for Message_t, which is used to encapsulate all message passing behavior:

    enum message_type {none, stop, more2do, confirmed, completed};

    class Message_t {
    public:
        Message_t();
        ~Message_t();
        message_type RecvTag();
        message_type RecvTagFrom(int fromWhom);
        void SendTag2(int toWhom, message_type tag);
        void SendMsg2(int toWhom, message_type tag, double *buffer, int len);
        message_type RecvMsg(double *buffer, int len);
        message_type GetTag();
        int GetSender();
    private:
        int sender;
        message_type tagRecvd;
        MPI_Request theRequest;
        MPI_Status theStatus;
    };

Conclusion

By developing the distributed computation framework, the ease of utilizing parallel computational power is increased. Users need only be able to understand what aspects of their problem can be run independently and in parallel in order to provide the details of the template functions UserWork, UserNextTask, UserCombine, and UserTasksDone. In no case is there any need on the part of the user to understand or use any principles/methods of distributed computing. Furthermore, by abstracting the message passing out of the master-slave-assistant model, the framework is more adaptable to the changing world of MPI standards and implementations, and also allows itself to be moved to other computing environments by simply changing the implementation of the message passing class.

References

Almasi, G. S., & Gottlieb, A. (1994). Highly Parallel Computing, 2nd Ed. California: The Benjamin/Cummings Publishing Company, Inc.

Breshears, Clay. (1998). Detailed Examples. A Beginner's Guide to PVM Parallel Virtual Machine [Online]. Available: http://www-jics.cs.utk.edu/pvm/pvm_guide.html [2000, February 29].

Fayad, M., Schmidt, D., & Johnson, R. (1999). Building Application Frameworks: Object Oriented Foundations of Framework Design. New York: Wiley Computer Publishing.

Message Passing Interface Forum. (1997). MPI-2: Extensions to the Message-Passing Interface [Online]. Available: http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html. Knoxville, Tennessee: University of Tennessee. [2000, February 29].

OSC (Ohio Supercomputer Center), The Ohio State University. (1996). Basic Parallel Information. MPI Primer/Developing with LAM (p. 21). Columbus, Ohio: The Ohio State University.

Acknowledgements

We would like to acknowledge the help of the many members of the lam@mpi.nd.edu mailing list.