Chapter 1. Introduction


Lindsay Welch
5 years ago

1.1 Parallel Processing

There is a continual demand for greater computational speed than current computer systems (i.e., sequential systems) can deliver. Areas that need great computational speed include numerical modeling and simulation of scientific and engineering problems, for example weather forecasting, predicting the motion of astronomical bodies in space, and virtual reality. Such problems are known as grand challenge problems: a grand challenge problem is one that cannot be solved in a reasonable amount of time [1].

One way of increasing computational speed is to use multiple processors, either in a single enclosure or in a network of computers such as a cluster, operating together on a single problem. The overall problem is split into partitions, and each partition is processed by a separate processor in parallel. Writing programs for this form of computation is known as parallel programming [1].

Parallel processing addresses the question of how to execute application programs quickly and concurrently. It requires underlying parallel architectures as well as parallel programming languages and algorithms.

1.2 Parallel Architectures

The main feature of a parallel architecture is that it has more than one processor. These processors may communicate and cooperate with one another to execute the program instructions. There are diverse classifications of parallel architectures; the most popular is the Flynn taxonomy (see Figure 1.1) [2].
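As a small illustration of the partitioning idea from Section 1.1, the following Python sketch splits a list into partitions, lets each worker sum one partition, and combines the partial results. The function names `partition` and `parallel_sum` are hypothetical, and the thread pool only stands in for separate processors (in CPython, threads do not run CPU-bound code truly in parallel), so this is a sketch of the structure rather than a performance demonstration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, parts):
    """Split data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for k in range(parts):
        # The first `extra` chunks receive one additional element.
        end = start + size + (1 if k < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum(data, workers=4):
    """Sum each partition in its own worker, then combine the partial sums."""
    chunks = partition(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(sum, chunks))
    return sum(partial)


if __name__ == "__main__":
    print(partition(list(range(10)), 3))
    print(parallel_sum(list(range(1000))))
```

The combining step at the end is the sequential part of the computation; the per-partition sums are the part that separate processors could carry out concurrently.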

1.2.1 The Flynn Taxonomy

Michael Flynn [2] introduced a taxonomy of computer architectures based on the notions of Instruction Streams (IS) and Data Streams (DS). According to this taxonomy, architectures are classified into four categories: Single Instruction Single Data (SISD), Multiple Instruction Single Data (MISD), Single Instruction Multiple Data (SIMD), and Multiple Instruction Multiple Data (MIMD).

Figure 1.1: Taxonomy of Parallel Processing Architectures. (Labels in the figure: SISD, Von Neumann model; parallel computers: SIMD, MISD, MIMD; array processors; pipelined vector processors; systolic arrays; multiprocessors; multicomputers; data flow machines; network; cluster; grid computing.)

Single Instruction Single Data (SISD)

In this category, only one instruction stream is executed on one data stream at a time. Conventional sequential machines are SISD architectures. The prototype of the sequential machine is shown in Figure 1.2.

Figure 1.2: SISD Uniprocessor Architecture. (Legend: CU: Control Unit; PE: Processor Element; IS: Instruction Stream; DS: Data Stream. Other labels in the figure: I/O, MU.)

Single Instruction Multiple Data (SIMD)

SIMD architectures consist of N processors (see Figure 1.3). The processors operate synchronously: at each cycle all processors execute the same instruction, each on a different datum. Some popular commercial SIMD architectures are the ILLIAC IV, the DAP, and the Connection Machines CM-1 and CM-2. SIMD architectures can also support vector processing, which is accomplished by assigning the elements of a vector to individual processors for concurrent computation [3]. Because a SIMD machine is synchronized, all of its processors execute only one instruction at a time, and the fastest processor has to wait for the slowest one before it starts executing the next instruction. This type of synchronization is called lockstep [3]. These architectures are used mostly for problems with a high degree of data parallelism.

Figure 1.3: SIMD Architecture. (Labels in the figure: one control unit CU issuing one instruction stream IS; processing elements PE 1 to PE N with data streams DS 1 to DS N; interconnection network.)

Multiple Instruction Single Data (MISD)

MISD architectures consist of N processors, each with its own control unit, sharing a common memory where the data reside (see Figure 1.4) [3]. MISD computers fall into two categories:

1. A class of machines that requires distinct processors receiving distinct instructions to be performed on the same data.
2. A class of machines in which data flows through a series of processors (i.e., pipelined architectures). One type of such a machine is the systolic array.

A systolic array is formed from a network of locally connected functional units. The array operates synchronously with multidimensional pipelining. This class of multidimensional pipelined array architectures is designed for implementing fixed algorithms, and it offers good performance for special applications such as signal and image processing, with simple, regular, and modular layouts.

Figure 1.4: MISD Architecture. (Labels in the figure: control units CU 1 to CU N issuing instruction streams IS 1 to IS N to processing elements PE 1 to PE N; a single data stream DS from memory.)

Multiple Instruction Multiple Data (MIMD)

This class of architectures is the most general and powerful; most practical applications need MIMD machines [3]. MIMD architectures consist of N processors, each operating under the control of an instruction stream issued by its own control unit (see Figure 1.5). All processors are potentially executing different subproblems on different data while solving a single problem, which means that the processors typically operate asynchronously. Communication between processors is performed through a shared memory or an interconnection network. MIMD computers with a shared common memory are often referred to as multiprocessors (or tightly coupled machines), while those with an interconnection network connecting the processors are known as multicomputers (or loosely coupled machines) [4]. Gordon Bell [5] provided a taxonomy of MIMD machines (see Figure 1.6). He considers shared-memory multiprocessors as having a single address space (i.e., a global address space) and multicomputers, which use distributed memories, as having multiple address spaces (i.e., local address spaces).

Figure 1.5: MIMD Architecture. (Labels in the figure: control units CU 1 to CU N issuing instruction streams IS 1 to IS N to processing elements PE 1 to PE N; data streams DS 1 to DS N; shared memory or interconnection network.)

Multicomputer Architectures

According to Bell's taxonomy of MIMD architectures, multicomputer architectures are classified into distributed multicomputers and central multicomputers.

A distributed multicomputer (see Figure 1.7) consists of multiple computers interconnected by a message-passing network. Each computer consists of a processor, local memory, and attached I/O peripherals. All local memories are private and are not accessible by other processors. Communication between processors is achieved through a message-passing interface protocol over a general interconnection network. The disadvantage of these architectures is that the message-passing programming model imposes a significant burden on the programmer, which induces considerable software overhead. On the other hand, these architectures are claimed to have better scalability and cost-effectiveness [6].

Figure 1.6: Bell's Taxonomy of MIMD Computers. Labels recovered from the figure:

MIMD
  Multiprocessors
    Distributed memory multiprocessors
      Dynamic binding of addresses to processors: KSR
      Static binding, ring multi: IEEE SCI standard proposal
      Static binding, caching: Alliant, DASH
      Static program binding: BBN, Cedar, CM*
    Central memory multiprocessors
      Cross-point or multi-stage: Cray, Fujitsu, Hitachi, IBM, NEC, Tera
      Simple, ring multi, bus multi replacement
      Bus multis: DEC, Encore, NCR, Sequent, SGI, Sun
  Multicomputers
    Distributed multicomputers
      Mesh connected: Intel
      Butterfly/Fat Tree: CM5
      Hypercubes: NCUBE
      Fast LANs for high availability and high capacity clusters: DEC, Tandem
      LANs for distributed processing: workstations, PCs
    Central multicomputers
  Additional labels: PC clusters; Clusters of Workstations (CoWs)

Figure 1.7: Generic Model of Message-Passing Multicomputer. (Labels in the figure: processing elements PE 1 to PE N with local memories M 1 to M N; interconnection network.)

In central multicomputers (see Figure 1.8), one computer is considered the master, and the whole database as well as the application programs reside there. Before executing a program, the master computer distributes the program partitions to the other processors (slaves). After execution finishes, the final results from the slaves are received and organized by the master. This type of multicomputer is referred to as the farm model.

Figure 1.8: Central Multicomputer. (Labels in the figure: a master PE with local memory LM; slaves PE 1 to PE N with local memories LM 1 to LM N.)

Multiprocessor Architecture

A multiprocessor architecture is also called a shared memory architecture, in which a single global memory is shared among all processors (i.e., a global address space memory model). Multiprocessors are classified into two categories: shared memory multiprocessors and distributed shared memory (DSM) multiprocessors [6]. In a shared memory multiprocessor, a single memory is equally accessible by all processors (see Figure 1.9). The main advantages of these architectures are the simplicity of designing algorithms for them and transparent data access for the user. However, they suffer from increasing contention in accessing the shared memory, which limits scalability [7]. A relatively new concept, distributed shared memory (DSM), tries to combine the advantages of the multicomputer and multiprocessor architectures [8].

The DSM architecture can generally be viewed as a set of nodes or clusters connected by an interconnection network (see Figure 1.10). Each cluster, organized around a shared bus, contains a physically local memory module that is partially or entirely mapped to the DSM global address space. Private caches attached to the processors are essential for reducing memory latency. Regardless of the network topology, a specific interconnection controller within each cluster is needed to connect it to the system. In this way, the DSM architecture logically implements the shared memory model on a physically distributed memory architecture [8].

Figure 1.9: Generic Model of Shared Memory Architecture. (Labels in the figure: processing elements PE 1 to PE N; shared memory.)

Figure 1.10: Structure and Organization of a DSM System. (Labels in the figure: interconnection network; clusters 1 to N, each with an interconnection controller, a directory, a DSM portion, processors, and caches; DSM shared address space.)

1.3 The Parallel Algorithms

Generally, a parallel algorithm can be defined as a set of instructions that may be executed concurrently and may communicate with each other in order to solve a given problem [3]. The term task or process refers to a part of a program that can be executed on a processor. The design of a parallel algorithm comprises four main phases: partitioning, communication, agglomeration, and mapping (see Figure 1.11) [3].

Figure 1.11: Different Phases of a Parallel Algorithm. (Labels in the figure: problem; partition; communication; agglomeration; mapping and scheduling; Processor 1, Processor 2, Processor 3.)

1.3.1 Partitioning Phase

The partitioning phase is intended to expose opportunities for parallel execution by decomposing the computations of the given problem into small tasks. A good partitioning method divides both the computation and the data associated with a problem into small pieces [3].

Communication Phase

The communication phase establishes an appropriate communication structure and the algorithms needed to coordinate task execution. To allow computation to proceed, data must be transferred between tasks; specifying these transfers is the outcome of the communication phase of a design. There are two communication structures: the channel structure and the message-passing structure [3]. In the channel structure, there is a link (direct or indirect) between tasks that require data and the tasks that possess those data. In the message-passing structure, messages are sent and received on these channels.

Agglomeration Phase

The first two phases of the design process are concerned with dividing the computation into a set of tasks and with the communication that provides the data required by these tasks. The agglomeration phase evaluates the tasks and the required communication with respect to performance and implementation costs. For example, one critical issue influencing parallel performance is the communication cost: performance may be improved by reducing the amount of time spent communicating (i.e., by sending less data) [3].

Mapping and Scheduling Phase

The mapping and scheduling phase specifies the processor on which each task is to be executed and determines a sequence for their execution such that processor utilization is maximized and the communication costs, as well as the execution time, are minimized [3].

If these four phases (problems) are not properly handled, parallelization of an application may not be beneficial. A large body of research addressing these problems has been reported in the literature [9, 10]. Our work is mainly concerned with studying and implementing the mapping and scheduling phase. Throughout the thesis, we refer to this phase as the task scheduling problem. Two important issues are involved in the task scheduling problem: enhancing concurrency and increasing locality. Enhancing concurrency deals with placing tasks that can be executed simultaneously on different processors, while increasing locality refers to placing tasks that are likely to communicate frequently on the same processor [3].

1.4 The Task Scheduling Problem

The task scheduling problem is concerned with determining which computational tasks will be executed on which processor and at what time. The scheduling and allocation of the multiple interacting tasks of a single parallel program is a highly important issue, since an inappropriate scheduling of tasks can fail to exploit the true performance of the system and can offset the gain from parallelization [10]. Therefore, this problem has received considerable attention in recent years [11, 12]. The main objective of task scheduling is to assign tasks to available processors such that the precedence requirements between tasks are satisfied and, at the same time, the overall completion time, known as the schedule length (or makespan), is minimized [13].

Task scheduling techniques are mainly classified as static or dynamic. In static scheduling, the characteristics of a parallel program, including task processing times, data dependencies, and synchronization, are known before program execution [10]. In dynamic scheduling, few assumptions about the parallel program can be made before execution, and thus scheduling decisions have to be made on the fly [13].

Scheduling can also be classified as preemptive or nonpreemptive [13]. Preemptive scheduling permits a task to be interrupted and removed from the processor under the assumption that it will eventually receive the execution time it requires. This interruption of tasks contributes to system overhead.
With nonpreemptive scheduling, a task cannot be interrupted once it has begun execution on a specific processor. If there are no precedence relations among the tasks forming a program, the problem is known as the task allocation problem [13]. Our research work focuses on static, nonpreemptive scheduling.

The Problem Model

The model of the underlying parallel system considered in this research work can be described as follows [14]. The architecture is a network of an arbitrary number of homogeneous processors. Let $P = \{P_1, P_2, P_3, \ldots, P_m\}$ denote the set of $m$ processors. Let the task graph $G$ be a Directed Acyclic Graph (DAG) composed of $N$ nodes $n_1, n_2, \ldots, n_N$. Each node represents a task of the graph (see footnote 1), which in turn is a set of instructions that must be executed sequentially, without preemption, on the same processor. A node has one or more inputs; when all inputs are available, the node is triggered to execute. A node with no predecessor is called an entry node, and a node with no successor is called an exit node. The weight of node $n_i$ is called its computation cost and is denoted by $weight(n_i)$.

The graph also has a set $E$ of directed edges, where each edge $e(n_i, n_j) \in E$ represents a partial order among the tasks. The partial order introduces a precedence-constrained DAG and implies that if $n_i \rightarrow n_j$, then $n_j$ is a successor of $n_i$ and cannot be started until its predecessor $n_i$ finishes. The weight of an edge is called the communication cost of the edge and is denoted by $c(n_i, n_j)$. This cost is incurred if $n_i$ and $n_j$ are scheduled on different processors and is considered to be zero if $n_i$ and $n_j$ are scheduled on the same processor.

Let $p(n_i)$ and $Sc(n_i)$ be the set of immediate predecessors and the set of successors of node $n_i$, respectively, where $p(n_i) = \{ n_j : e(n_j, n_i) \in E \}$ and $Sc(n_i) = \{ n_j : e(n_i, n_j) \in E \}$. If a node $n_i$ is scheduled on processor $P_j$, its start time and finish time are denoted by $ST(n_i)$ and $FT(n_i)$, respectively. After all nodes have been scheduled, the schedule length is defined as $\max_i \{ FT(n_i) \}$ across all processors. The objective is to find an assignment of tasks to processors, together with their start times, such that the schedule length is minimized while the precedence constraints are preserved.
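The problem model can be made concrete with a short Python sketch that computes the start times, finish times, and schedule length of a given assignment. The function name `evaluate_schedule`, the dictionary-based graph encoding, and the four-task example graph are assumptions made for illustration; they do not come from the thesis.

```python
def evaluate_schedule(weight, comm, assignment, order):
    """Return (ST, FT, schedule_length) for a nonpreemptive schedule.

    weight:     {node: computation cost}
    comm:       {(u, v): communication cost} for every edge u -> v
    assignment: {node: processor id}
    order:      nodes in a precedence-respecting order; nodes placed on
                the same processor run in this order
    """
    preds = {n: [] for n in weight}
    for (u, v) in comm:
        preds[v].append(u)

    st, ft = {}, {}
    proc_free = {}  # processor id -> time at which it becomes idle
    for n in order:
        p = assignment[n]
        # Data from a predecessor is available at its finish time, plus the
        # edge cost only when the predecessor ran on a different processor.
        ready = 0
        for u in preds[n]:
            delay = 0 if assignment[u] == p else comm[(u, n)]
            ready = max(ready, ft[u] + delay)
        st[n] = max(ready, proc_free.get(p, 0))
        ft[n] = st[n] + weight[n]
        proc_free[p] = ft[n]
    return st, ft, max(ft.values())


# Hypothetical four-task graph: n1 -> n2 -> n4 and n1 -> n3 -> n4.
weight = {"n1": 2, "n2": 3, "n3": 3, "n4": 1}
comm = {("n1", "n2"): 4, ("n1", "n3"): 1, ("n2", "n4"): 1, ("n3", "n4"): 5}
order = ["n1", "n2", "n3", "n4"]

two_procs = {"n1": 0, "n2": 0, "n3": 1, "n4": 0}
one_proc = {n: 0 for n in weight}

if __name__ == "__main__":
    print(evaluate_schedule(weight, comm, two_procs, order)[2])  # 12
    print(evaluate_schedule(weight, comm, one_proc, order)[2])   # 9
```

In this example the two-processor assignment is slower than running everything on one processor, because the edge from n3 to n4 costs 5 time units once its endpoints sit on different processors. This is the trade-off between concurrency and locality that the scheduling objective has to balance.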
Footnote 1: The terms task and node are used interchangeably throughout the thesis.

A Critical Path (CP) of a task graph is defined as the path with the maximum sum of node and edge weights from an entry node to an exit node. The nodes lying on the CP are denoted CPNs. The communication-to-computation ratio (CCR) of a parallel program is defined as its average edge weight divided by its average task weight.

Task scheduling in general is known to be an NP-complete problem, except for some special cases. The suggested taxonomy of static task scheduling algorithms is presented in Figure 1.12. According to this taxonomy, scheduling methods are divided into optimal and non-optimal solutions. Among the non-optimal solutions, many heuristics have been suggested to tackle the problem under more pragmatic situations. These heuristic algorithms can be divided into two categories: greedy and non-greedy (iterative). Greedy algorithms start from a partial solution and extend it until a complete task schedule is achieved [15]. At each step, one task assignment is made, and it cannot be changed in the remaining steps. Iterative algorithms start from a complete schedule and try to improve it by moving a task from one processor to another or by exchanging the mapping of two tasks [15].

Optimal Task Scheduling Algorithms

Optimal solutions are known only in a few restricted cases, for which polynomial-time algorithms exist: Hu's algorithm [13], the Coffman and Graham algorithm [16], and the Papadimitriou and Yannakakis algorithm [13].

Figure 1.12: A Partial Taxonomy of Static Scheduling Methods. (Labels in the figure: static scheduling; optimal; not optimal; heuristics; approximate; iterative; greedy.)
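The critical path and CCR definitions given at the start of this part can be computed directly from a weighted DAG. The Python sketch below is one possible way to do so; the function names `critical_path` and `ccr`, the graph encoding, and the example graph are hypothetical.

```python
def critical_path(weight, comm):
    """Return (length, path) of the heaviest entry-to-exit path,
    counting both node weights and edge weights."""
    succs = {n: [] for n in weight}
    indeg = {n: 0 for n in weight}
    for (u, v) in comm:
        succs[u].append(v)
        indeg[v] += 1
    entries = [n for n in weight if indeg[n] == 0]

    # Kahn's algorithm yields a topological order of the DAG.
    topo, stack = [], list(entries)
    while stack:
        u = stack.pop()
        topo.append(u)
        for v in succs[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)

    # best[u] = weight of the heaviest path from u down to an exit node.
    best, nxt = {}, {}
    for u in reversed(topo):
        best[u], nxt[u] = weight[u], None
        for v in succs[u]:
            cand = weight[u] + comm[(u, v)] + best[v]
            if nxt[u] is None or cand > best[u]:
                best[u], nxt[u] = cand, v

    node = max(entries, key=best.get)
    length, path = best[node], []
    while node is not None:
        path.append(node)
        node = nxt[node]
    return length, path


def ccr(weight, comm):
    """Average edge weight divided by average task weight."""
    avg_edge = sum(comm.values()) / len(comm)
    avg_task = sum(weight.values()) / len(weight)
    return avg_edge / avg_task


# Hypothetical four-task graph: n1 -> n2 -> n4 and n1 -> n3 -> n4.
cp_weight = {"n1": 2, "n2": 3, "n3": 3, "n4": 1}
cp_comm = {("n1", "n2"): 4, ("n1", "n3"): 1, ("n2", "n4"): 1, ("n3", "n4"): 5}

if __name__ == "__main__":
    print(critical_path(cp_weight, cp_comm))  # (12, ['n1', 'n3', 'n4'])
    print(ccr(cp_weight, cp_comm))
```

Here the path through n3 (weight 2 + 1 + 3 + 5 + 1 = 12) outweighs the path through n2 (weight 11), so n1, n3, and n4 are the CPNs of this example.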

1. Hu's Algorithm

Hu [13] devised a linear-time algorithm for the scheduling problem, called the level algorithm. The algorithm finds the optimal schedule in linear time when the task graph is either an in-forest, i.e., each task has at most one immediate successor, or an out-forest, i.e., each task has at most one immediate predecessor. Communication between tasks is ignored, and all tasks have unit computation cost. The algorithm begins by calculating the level of each node $n_i$, which is defined as the maximum number of nodes (including $n_i$) on any path from $n_i$ to an exit node. The pseudocode of the algorithm is as follows:

Hu Algorithm
Step 1: Calculate the level of each node in the DAG and use it as the node's priority.
Step 2: Whenever a processor becomes available, assign it the ready node with the highest priority. If the number of nodes in a level is greater than the number of processors in the system, schedule the tasks in round-robin fashion.

The complexity of the algorithm is $O(v)$, where $v$ is the number of tasks in the DAG [13].

Example: Applying Hu's algorithm to the DAG shown in Figure 1.13a using three processors, the list of ready tasks ordered by level is {1, 2, 3, 4, 5, 6, 7, 8, 9}. The generated schedule is shown in Figure 1.13b.

Figure 1.13: (a) A Simple In-Forest Task Graph (nodes n1 to n9); (b) The Optimal Schedule of the Task Graph on a Three-Processor System (P0: n1, n4, n7; P1: n2, n5, n8; P2: n3, n6, n9).
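The two steps of Hu's algorithm can be sketched in Python as follows. This is a minimal illustration for in-forests with unit-time tasks, not a linear-time implementation: the ready list is rebuilt in every time slot for clarity. The function name `hu_schedule` and the six-task example are hypothetical, since the edges of Figure 1.13a are not recoverable from the text.

```python
def hu_schedule(succ, num_procs):
    """Level-based list scheduling for unit-time tasks on an in-forest.

    succ maps each task to its single immediate successor (None for an
    exit task). Returns a list of time slots; slot t holds the tasks that
    run during [t, t + 1).
    """
    # Step 1: level = number of nodes on the path from a task to its exit.
    level = {}

    def get_level(n):
        if n not in level:
            level[n] = 1 if succ[n] is None else 1 + get_level(succ[n])
        return level[n]

    for n in succ:
        get_level(n)

    # Count the unfinished predecessors of every task.
    pending = {n: 0 for n in succ}
    for n, s in succ.items():
        if s is not None:
            pending[s] += 1

    # Step 2: in every slot, run the ready tasks with the highest levels.
    done, slots = set(), []
    while len(done) < len(succ):
        ready = [n for n in succ if n not in done and pending[n] == 0]
        ready.sort(key=lambda n: -level[n])
        running = ready[:num_procs]
        slots.append(running)
        for n in running:
            done.add(n)
            if succ[n] is not None:
                pending[succ[n]] -= 1
    return slots


# Hypothetical in-forest: a, b -> d; c -> e; d, e -> f.
forest = {"a": "d", "b": "d", "c": "e", "d": "f", "e": "f", "f": None}

if __name__ == "__main__":
    print(hu_schedule(forest, 3))  # [['a', 'b', 'c'], ['d', 'e'], ['f']]
    print(hu_schedule(forest, 2))
```

With three processors the schedule has length 3, which equals the highest level in the forest; with two processors it has length 4, which is still optimal here because the five tasks that precede f cannot fit into two slots.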

2. Coffman and Graham Algorithm

Coffman and Graham [16] devised a quadratic-time algorithm for scheduling an arbitrarily structured DAG with unit-weighted tasks and zero-weighted edges on a two-processor system. The algorithm starts by assigning labels to the DAG's tasks, beginning with an exit task. A list of tasks is then constructed by sorting them in descending order of label. Finally, each task is scheduled on whichever of the two processors allows the earliest start time. The pseudocode of the algorithm is as follows:

Coffman and Graham Algorithm
Step 1: Assign the label 1 to one of the exit tasks.
Step 2: Suppose labels $1, 2, \ldots, j-1$ have been assigned. Let $S$ be the set of unlabeled tasks with no unlabeled successor. Select an element of $S$ to be assigned label $j$ as follows. For each task $t$ in $S$, define $l(t)$: let $n_1, n_2, \ldots, n_k$ be the immediate successors of $t$; then $l(t)$ is the decreasing sequence of integers formed by ordering the set $\{L(n_1), L(n_2), \ldots, L(n_k)\}$. Let $t'$ be an element of $S$ such that for all $t$ in $S$, $l(t') \le l(t)$ (lexicographically). Define $L(t')$ to be $j$.
Step 3: When all tasks have been labeled, use the list $(T_n, T_{n-1}, \ldots, T_1)$, where $L(T_j) = j$ for all $j$, $1 \le j \le n$, to schedule the tasks.

The complexity of the algorithm is $O(v^2)$, where $v$ is the number of tasks in the DAG [16].

Example: Consider the task graph shown in Figure 1.14(a). The two terminal tasks are assigned the labels 1 and 2, respectively. The set $S$ of unlabeled tasks with no unlabeled successors becomes {4, 5}. Written in terms of successor tasks, $l(4) = \{6\}$ and $l(5) = \{7, 6\}$. Since {6} < {7, 6} lexicographically, tasks 4 and 5 are assigned labels 3 and 4, as shown in Figure 1.14a. The schedule begins by assigning the tasks with no predecessors; the remaining tasks are then assigned in decreasing order of label. Figure 1.14b shows the resulting schedule on two processors.
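The labeling and scheduling steps above can be sketched in Python as follows. The function name `coffman_graham` and the edge set of the example graph are assumptions: the edges of Figure 1.14a cannot be fully recovered from the text, so the graph below is a hypothetical one chosen to be consistent with the labels quoted in the example.

```python
def coffman_graham(succs):
    """Label tasks as in Steps 1-2, then list-schedule them on two processors.

    succs maps each task to the list of its immediate successors.
    Returns (labels, slots), where slots[t] lists the tasks run in slot t.
    """
    labels = {}
    for j in range(1, len(succs) + 1):
        # S: unlabeled tasks whose successors are all labeled already.
        candidates = [t for t in succs
                      if t not in labels and all(s in labels for s in succs[t])]
        # l(t): successor labels in decreasing order. Python compares lists
        # lexicographically, so min() picks the task required by Step 2.
        chosen = min(
            candidates,
            key=lambda t: sorted((labels[s] for s in succs[t]), reverse=True),
        )
        labels[chosen] = j

    # Step 3: highest label first; a task is ready once its predecessors ran.
    preds = {t: [] for t in succs}
    for t, ss in succs.items():
        for s in ss:
            preds[s].append(t)
    by_label = sorted(succs, key=lambda t: -labels[t])
    done, slots = set(), []
    while len(done) < len(succs):
        ready = [t for t in by_label
                 if t not in done and all(p in done for p in preds[t])]
        slots.append(ready[:2])
        done.update(ready[:2])
    return labels, slots


# Hypothetical seven-task graph (task number -> immediate successors).
graph = {1: [4], 2: [5], 3: [4, 5], 4: [6], 5: [6, 7], 6: [], 7: []}

if __name__ == "__main__":
    labels, slots = coffman_graham(graph)
    print(labels)
    print(slots)  # [[3, 2], [1, 5], [4, 7], [6]]
```

Exit tasks have an empty successor sequence, which is lexicographically smallest, so they receive the lowest labels without a separate Step 1. On this graph the sketch yields labels 5, 6, 7, 3, 4, 1, 2 for tasks 1 to 7.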

Figure 1.14: (a) A Simple Task Graph with Unit Tasks and No-Cost Communication Edges (labels shown in the figure: n1: 5, n2: 6, n3: 7, n4: 3, n5: 4); (b) The Optimal Schedule of the Task Graph on a Two-Processor System (P0 and P1; tasks appear in the order n3, n2, n1, n5, n4, n7, n6).

3. Papadimitriou and Yannakakis Algorithm

Optimal scheduling has also been addressed by Papadimitriou and Yannakakis [13]. They designed a linear-time algorithm for scheduling an interval-ordered DAG with unit-weight nodes on an arbitrary number of processors. In an interval-ordered DAG, two nodes are precedence-related if and only if the nodes can be mapped to non-overlapping intervals on the real line. The pseudocode of the algorithm is as follows:

Papadimitriou and Yannakakis Algorithm
1. The number of successors of each node is used as its priority.
2. Whenever a processor becomes available, assign it the ready task with the highest priority.

This algorithm solves the scheduling problem for an interval order $(V, A)$ in $O(|A| + |V|)$ time.

Example: Consider the problem of scheduling the interval order given in Figure 1.15a on two identical processors. The result of the Papadimitriou and Yannakakis algorithm is shown in Figure 1.15b.
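The priority rule of this algorithm can be sketched in Python as follows. The sketch builds the precedence relation from explicit intervals and then list-schedules unit tasks by successor count; it recomputes the ready list in every slot, so it illustrates the rule rather than the linear-time bound. The function name `interval_order_schedule` and the six intervals are hypothetical and are not taken from Figure 1.15.

```python
def interval_order_schedule(intervals, num_procs):
    """List-schedule unit tasks of an interval order by successor count.

    intervals maps each task to a (left, right) pair; task u precedes
    task v exactly when u's interval lies entirely to the left of v's.
    Returns a list of time slots, each holding the tasks run in that slot.
    """
    tasks = list(intervals)
    succs = {u: [v for v in tasks if intervals[u][1] < intervals[v][0]]
             for u in tasks}
    preds = {v: [u for u in tasks if v in succs[u]] for v in tasks}

    # Priority: number of successors, larger first (ties keep input order).
    order = sorted(tasks, key=lambda t: -len(succs[t]))
    done, slots = set(), []
    while len(done) < len(tasks):
        ready = [t for t in order
                 if t not in done and all(p in done for p in preds[t])]
        slots.append(ready[:num_procs])
        done.update(ready[:num_procs])
    return slots


# Hypothetical interval order on six unit tasks.
example_intervals = {
    "a": (0, 1), "b": (0, 2), "c": (0, 4),
    "d": (2, 3), "e": (3, 5), "f": (6, 7),
}

if __name__ == "__main__":
    print(interval_order_schedule(example_intervals, 2))
    # [['a', 'b'], ['c', 'd'], ['e'], ['f']]
```

In this example task a has three successors (d, e, f), task b has two, and c, d, and e have one each, so a and b are scheduled first. The resulting length of four slots is optimal, because the five tasks that precede f need at least three slots on two processors.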

Figure 1.15: (a) A Unit-Computational Interval-Ordered DAG (nodes n1 to n7); (b) An Optimal Schedule of the DAG (processors P0 and P1, time axis from 0 to 4).

1.5 Performance Criteria

The performance of task scheduling algorithms for parallel systems is generally evaluated by means of criteria such as speedup, efficiency, and normalized schedule length [6, 17, 18].

Speedup

Speedup is a good measure of the execution of an application program on a parallel system. It relates the time for executing the program on a single processor to the time for executing the same program on a parallel system. Linear speedup means that the speedup increases in proportion to the number of processors in the parallel system [6]. It is known that linear speedup does not occur in distributed memory architectures because of the communication overhead. Let $T(1)$ be the time required to execute a program on a uniprocessor computer and $T(P)$ the time required to execute the same program on a parallel computer containing $P$ processors. The speedup can then be estimated as:

$S(P) = \frac{T(1)}{T(P)}$
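The speedup formula, together with the efficiency and normalized schedule length criteria defined in the subsections that follow, reduces to a few lines of arithmetic. The Python sketch below uses hypothetical function names and made-up timings purely for illustration.

```python
def speedup(t1, tp):
    """S(P) = T(1) / T(P)."""
    return t1 / tp


def efficiency(t1, tp, p):
    """E(P) = S(P) / P."""
    return speedup(t1, tp) / p


def normalized_schedule_length(schedule_length, cp_weights):
    """NSL = schedule length / sum of computation costs on the critical path."""
    return schedule_length / sum(cp_weights)


if __name__ == "__main__":
    # Made-up timings: 120 time units sequentially, 40 on 4 processors.
    print(speedup(120, 40))        # 3.0
    print(efficiency(120, 40, 4))  # 0.75
    # Made-up schedule of length 12 whose critical-path nodes cost 2, 3, and 1.
    print(normalized_schedule_length(12, [2, 3, 1]))  # 2.0
```

An NSL of 1.0 would mean that the schedule length equals the sum of the computation costs on the critical path, which is the lower bound discussed in the text; larger values indicate how far a schedule is from that bound.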

Speedup Folklore Theorem [17]: For a given computational problem, the speedup provided by a parallel program using $P$ processors, over the fastest possible sequential program for the problem, is at most equal to $P$:

$1 \le S(P) \le P$

Efficiency

Efficiency indicates what percentage of a processor's time is being spent in useful computation. The efficiency of a parallel computer containing $P$ processors is defined as:

$E(P) = \frac{S(P)}{P}$

Since $1 \le S(P) \le P$, it follows that $1/P \le E(P) \le 1$. The maximum efficiency is achieved when all $P$ processors are fully utilized throughout the execution [6].

Normalized Schedule Length

The main performance measure of a task scheduling algorithm is the schedule length. The Normalized Schedule Length (NSL) of an algorithm is defined as:

$NSL = \frac{S\_Length}{\sum_{n_i \in CP} weight(n_i)}$

where $S\_Length$ is the schedule length, i.e., the maximum finish time over all tasks, and $weight(n_i)$ is the execution time of node $n_i$. The sum of the computation costs on the CP represents a lower bound on the schedule length [18]. This lower bound may not always be achievable, and the optimal schedule length may be larger than the bound.

1.6 Objective of the Thesis

The main objective of the thesis is to develop and implement task scheduling algorithms based on a genetic approach that improve on the performance of existing genetic algorithms as well as heuristic ones. Two genetic algorithms have been developed and implemented: the Critical Path Genetic Algorithm (CPGA) and the Task Duplication Genetic Algorithm (TDGA).

1.7 Organization of the Thesis

The thesis is organized as follows. Chapter two briefly reviews some of the early and recent task scheduling algorithms from the literature. Chapter three presents the first proposed algorithm, the Critical Path Genetic Algorithm (CPGA): it briefly presents the basic idea of the algorithm with some definitions, followed by the details of the algorithm, and finally compares CPGA with the Modified Critical Path (MCP) algorithm. Chapter four presents the second proposed algorithm, the Task Duplication Genetic Algorithm (TDGA). TDGA is based on the task duplication technique, in an attempt to reduce communication delays and thus minimize the overall execution time, which improves the performance of the genetic algorithm. The performance of TDGA is compared with a common heuristic task scheduling technique based on task duplication (DSH).

1.8 Conclusion

This chapter surveyed the main types of parallel architectures as well as the main concepts of task scheduling. The next chapter gives a complete survey of the main task scheduling algorithms. One of these approaches uses Genetic Algorithms (GAs), which are considered in this research work; the principles of GAs are also covered in the next chapter.
