[CZ89]), communication costs such asnetwork latency and bandwidth (e.g., the LPRAM [ACS89], Postal Model [BNK92], BSP [Val90], and LogP [CKP + 93]), a

Size: px

Start display at page:

Download "[CZ89]), communication costs such asnetwork latency and bandwidth (e.g., the LPRAM [ACS89], Postal Model [BNK92], BSP [Val90], and LogP [CKP + 93]), a"

Rosa McKenzie
5 years ago
Views:

2 [CZ89]), communication costs such asnetwork latency and bandwidth (e.g., the LRAM [ACS89], ostal Model [BK9], BS [Val90], and Log [CK + 93]), and memory hierarchy, reecting the eects of multileveled memory such as diering access times for registers, local cache, main memory and disk I/O (e.g., the -HMM [VS94], MH [AC94], and -UMH [V91]). The aroach followed by these models is that of a arameterized (or generic) model, which abstracts the architectural details into several generic arameters which we call resource metrics. Tyical resource metrics include the number of rocessors, communication latency, bandwidth, block transfer caability, network toology, memory hierarchy, memory organization and degree of asynchrony. Using such a arameterized model one can design broadly alicable arameterized algorithms that can be tailored to secic machines by instantiating the arameters, such as latency and bandwidth, to match machine characteristics. The more recent models tyically try to use more arameters to more nely cature the resource characteristics of arallel machines. However, a careful balance must be struck between incororating detail and being too nely arameterized (in too many dimensions) so as to render otimal algorithm design imossible. Therefore, identifying resource metrics and aroriately choosing them is critical in the design of models of arallel comutation. In this aer we identify resources and resource metrics which are imortant for the erformance of arallel machines and use them as a framework to characterize the variety of arallel models. Within this framework we will categorize models into four classes: basic synchronous models, asynchronous models, models which incororate notions of latency and bandwidth, and models which address hierarchical memory. The models discussed here are generally in an increasing order of the number of resource metrics considered. Throughout the discussion, we use the roblem of Fast Fourier Transform (FFT) comutation to illustrate the rinciles of algorithm design and comlexity analysis for many of the dierent models. The data movement of the FFT comutation forms a buttery grah, also called a FFT grah. The -oint FFT grah with = m, can be dened as follows: there are inut and outut oints denoted as x 0;0 ;x 0;1 ; :::; x 0;,1 and x m;0 ;x m;1 ; :::; x m;,1 resectively. The comutation is often called ebbling and denoted as x j;q = f(x j,1;q;x j,1;r); where f is a constant cost function and the only dierence of q and r is in the (j, 1)th osition when reresented as binary numbers. The examination of the resource metrics chosen by revious models reveals a void in models that accurately treat both network communication and multilevel memory. As a simle examle of the rocess of develoing imroved erformance models, we roose a new hybrid model of arallel comutation, the Log- HMM model, whose evolution naturally lls this void. The Log-HMM extends an existing arameterized network model (the Log, with resource metrics of latency, bandwidth, and overhead on handling messages) with memory hierarchy at each rocessor (there following the sequential HMM model). The Log-HMM reresents a ragmatic renement of arallel comutation models within the framework of resource metrics to accommodate more detailed erformance measures. We examine the otential utility of Log-HMM model in the design of near otimal sorting and FFT algorithms. It turns out that one of the near otimal FFT algorithms, the hybrid layout method, can run otimally in the -HMM model. The remainder of this aer is organized as follows. Section identies resources and resource metrics for arallel comutation. Section 3 discusses basic synchronous models. Section 4 discusses extensions of the basic synchronous models, including asynchronous models, models incororating communication cost, and hierarchical models. In Section 5 we summarize the resource metrics chosen by dierent models of arallel comutation, dene the new Log-HMM model, and resent near otimal sorting and FFT algorithms for this model. We conclude by reviewing the critical role the careful choice of resource metrics lays in the design of arallel models, as demonstrated through the Log-HMM, and discuss ongoing research.. Resource metrics We rst resent denitions of resources, resource metrics and models of arallel comutation. Deniton 1 Aresource refers to an architectural feature that signicantly aects the erformance of a arallel machine. A resource metric is a measure of the corresonding resource, which could be quantitative or qualitative. The value of a quantitative resource metric is normally a multile of the unit rocessor execution time. Deniton Acomutational model is an abstraction of a comuting machine which is characterized by the choice of several resources and the corresonding resource metrics. A comutational model may thus be identied with a set of resource metrics. Moreover, algorithms will be designed and analyzed based on these resource metrics.

3 For examle, a sequential comuter is suitably characterized by the resources of sequential comutation time and sace usage. It is commonly acceted that the sequential comutational models, such as RAM and its hierarchical memory extensions HMM, BT f and UMH [AACS87, ACS87, ACF90], reect these resources quite well and therefore rovide a common base for sequential comutation. Resource metrics for a ractical arallel machine are far more comlicated than those of sequential machines. We identify the following list of signicant resources and resource metrics for arallel comutation: umber of rocessors. A theoretical model normally assumes that there is an unlimited number of rocessors available, while a more ractical model assumes a bounded number of rocessors, such ashun- dreds or thousands. Memory organization. Machines may be characterized as having hysically shared memory or distributed memory. The former often has a local cache. The latter tyically uses exlicit message assing rimitives. Communication latency. The costs of accessing local memory and global memory in a shared memory machine are quite dierent; similarly, in a distributed memory machine, the costs of accessing local memory and communication with other rocessors are quite different. Latency is one measure of the cost of global memory access. Degree of asynchrony. rocessors may run synchronously, semi-synchronously (loosely synchronously) or asynchronously. Semi-asynchrony refers to a comutation that is divided into a sequence of indeendent execution hases; within each hase, the rogram runs asynchronously, but all of the rocessors are synchronized at the end of each hase (i.e., barrier synchronization occurs). Bandwidth. Communication bandwidth and memory bandwidth are both limited in ractice. Currently, communication bandwidth lags far behind internal rocessor memory bandwidth. The bisection bandwidth, dened as the bandwidth across a line that searates the network into two arts, is normally used for the bandwidth resource metric. Overhead of a rocessor for message handling. The communication overhead is the time that the rocessor engages in sending and receiving a message. In most cases, the value is deendent on the communication rotocol imlemented in a ractical machine. For examle, in the CM-5 it could be a linear function of the message size [CK + 93]. Block transfer caability. In most architectures, a signicant cost (latency) is incurred to access the rst of a contiguous block of words, but after that, successive words can be accessed in unit time. Memory hierarchy. Many sohisticated machines have several layers of memory with diering access times, such as register, cache, main memory and secondary memory. A model caturing this memory hierarchy can organize the memory as a tree or a sequence of layers with increasing sizes, where each nodeorlevel is arameterized by memory size, block size, and intermodule bandwidth. Memory contention. Memory can be accessed as a block or a set of banks. When the bank is unit size, the memory locations may be accessed simultaneously. This assumtion is adoted by most of the models discussed in this aer. rotocols for resolving conict in concurrent memory access include EREW, CREW, CRCW and QRQW. etwork toology. The rocessors may be interconnected using a mesh, cube, fat tree, ring, or other toology. The most common metric for this resource is the diameter of the network. In the subsequent sections we resent a detailed discussion of each model and its choice of resource metrics. Because each resource has a metric associated with it, we will henceforth not distinguish resources and resource metrics. 3. Basic synchronous models RAM. The RAM model extends the sequential RAM model by relicating the rocessor art. A RAM machine is a set of sequential rocessors sharing a global memory and each having its own rivate unbounded local memory. A RAM comutation is a sequence of read, write and comutation stes. All rocessors execute in lock-ste, that is, they are synchronized before they execute the next instruction. The costs of memory access, either to local or global memory, and comutation stes are uniform. The cost of synchronization is free. Several variations of the RAM model use dierent rotocols to handle simultaneous access of several rocessors to the same location of global memory. rotocols include EREW (exclusive read - exclusive write), CREW (concurrent read - exclusive write), and CRCW (concurrent read - concurrent write). The latter rotocol can be further divided

4 into several classes by the semantics of the concurrent write. The most recent variation is QRQW [GMR94], which assumes that simultaneous access to the same memory block will be inserted into a request queue and served in a FIFO manner. VRAM. Another extension of the serial RAM model is the Vector Random Access Machine (VRAM) [Ble90]. The VRAM is a serial random access machine with the addition of avector memory, avector rocessor, and vector inut and outut orts. Tyical vector instructions include elementwise oerations, data movement oerations, scans, and acks. Two measures, ste and element comlexity, can be derived for a roblem of size in the VRAM model. Ste comlexity (s) is the total number of instructions executed and element comlexity (e) is the vector length er rimitive instruction call, summed over the number of the calls. These values give a measure of the arallel time and the total work resectively. The RAM comlexity of an algorithm designed in the VRAM model can be derived as O(e= + s). The RAM has roven to be useful by ermitting algorithm designers to focus on the structure of comutational tasks rather than the architecture details of a currently available machine. A large number of ef- cient algorithms have been develoed by exloiting its simlistic assumtions, but in ractice, some of the architectural issues that the RAM ignores are imortant. 4. Extensions of the basic models The RAM model rovides an abstraction that ignores concerns such as asynchrony, communication delay and memory hierarchy. In this section we discuss several extensions of the RAM model which incororate some of these measures. The extensions may be viewed as adding more resource metrics to the RAM model in order to gain imroved erformance measures Asynchronous models Among the rst extensions to the RAM were the hase RAM and ARAM models, which incororate some notion of asynchronous execution. hase RAM. The hase RAM [Gib89] extends the RAM model with semi-asynchrony. A hase RAM machine consists of a shared global memory, a set of sequential rocessors, and a rivate local memory for each rocessor. The comutational task is searated into a set of hases of asynchronous execution, each ended by an exlicit barrier synchronization. The cost of global read, global write and local oerations are the same constant. The cost of a synchronization ste is deendent on the number of rocessors. Sace limitations reclude resentation of a hase RAM algorithm design for FFT comutation; an examle may be found in [Gib89]. It is worth noting that a variantof the hase RAM, the hase LRAM model, accounts as well for the cost of communication latency. ARAM. The Asynchronous RAM (ARAM) is a \fully" asynchronous model [CZ89]. The ARAM model consists of a global shared memory and a set of rocesses with their own local memories. The basic oerations executed by the ARAM rocess are called events. An ARAM comutation is denoted as the set of ossible serializations of events executed by the rocesses. A virtual clock is associated with each serialization. This virtual clock assigns a time t(e) to each event e. The clock \ticks" when each rocess has executed at least one event. Events may be read and write events, which oerate on the shared and local memory, or local events. All events are charged unit cost. The air [round comlexity, number of rocesses] is used to measure the comlexity of an ARAM algorithm, where a round is dened as the sequence of events between two clock ticks in a comutation. The round comlexity for a comutation is dened to be the maximum number of ossible ticks for that comutation. For an algorithm the round comlexity is dened as the maximum round comlexity over all of the ossible comutations. An examle of ARAM algorithm design using these measures for the roblem of summation may be found in [CZ89]. 4.. Models incororating communication latency and bandwidth The models discussed here consist of two subclasses: the synchronous latency models such as the LRAM and BRAM which add the notion of latency into the RAM model, and models such as the BS and Log which not only incororate asynchrony and latency but also address the issue of bandwidth limitation. LRAM. The Local-Memory RAM (LRAM) model [ACS89] consists of a shared global memory and a set of rocessors with unbounded local memory executing in lock-ste. The access rotocol to global memory is CREW. At every time ste, each rocessor can erform either a communication ste, in which it can write and then read a word from the global memory, or a comutation comutation ste, which is an oeration which accesses at most two words from its local

5 memory. BRAM. An extension of the LRAM, the Block RAM (BRAM), is described in [ACS90]. The BRAM takes into account the reduced cost for transferring a contiguous block of data. The BRAM model is dened with two arameters L (latency or startu time) and b (block size). The cost of accessing local memory is unit time. However the cost of transmitting a block of size b of contiguous locations from global memory is L + b. ostal model. The ostal model [BK9] is a distributed memory model with the constraint that the oint-to-oint communication has latency. Several elegant otimal broadcast and summation algorithms have been designed based on this model, which were then extended for the Log model [KSSS93]. Algorithms other than broadcast and summation have largely not been resented for this model. Bulk-Synchronous arallel model. The BS is a distributed memory model [Val90]. Like the hase RAM, the BS is also a semi-asynchronous model because it requires synchronization after each \suerste", within which the rocesses can run asynchronously. The BS model is described in terms of three elements: rocesses/memory modules, a router which delivers the messages between airs of comonents and a synchronizer which synchronizes all or a subset of the comonents. Within the framework of resource metrics, a router is just an abstraction of network bandwidth and latency. A comutational task in the BS model consists of a sequence of suerstes. In each suerste, every comonent is allocated a task which contains some combination of local comutation stes and message transmissions. The local comutations, including reads and writes to local memory, are charged unit cost. The message transmission is accomlished by the router, which can send and receive a certain number of messages in each suerste (or in BS terminology, the router can realize an h-relation). The cost of realizing such anh-relation is assumed to be gh + s time units, where g can be thought as the recirocal of the communication bandwidth and s denotes the startu cost or the latency. If the length of a suerste is L, then L local oerations and a b L g c - relation message attern can be realized. The arameters of the machine are therefore L, g and (the number of rocessors). A FFT algorithm for the BS is described in [Val90]. Log model. The Log model is motivated by current technological trends in high erformance comuting towards networks of large-grained sohisticated rocessors. The Log model uses the arameters L (an uer bound of latency for transmitting a single message), o (the comutation overhead of handling a message), g (a lower bound of time interval between consecutive message transmissions at a rocessor) and (the number of rocessors) [CK + 93]. In contrast to the BS model, it removes the barrier synchronization requirement (h-relation in BS) and allows the rocessors to run asynchronously. The network of a Log machine has a nite caacity such that at any time at most b L g c messages can be in transit from or to any rocessor. It can suort shared or distributed memory. The Log model encourages well-known general techniques of designing algorithms for distributed memory machines including exloiting locality, reducing communication comlexity and overlaing communication and comutation. The Log model also romotes balanced communication atterns by introducing the limitation on network caacity so that no rocessor is overloaded with incoming messages. Moreover, it is often reasonable to ignore the arameter of o in a ractical machine, such as in a machine with low bandwidth (high g value). Examles using this strategy can be found in the FFT algorithm discussed below and also in [KSSS93]. FFT. The data layout and communication scheduling are two key asects to achieving an eective algorithm for the FFT roblem under the Log model. Three methods of data layout are discussed in [CK + 93]. The cyclic layout assigns the ith row of the buttery to the ith rocessor. The block layout laces the rst rows on the rst rocessor, the next rows on the second rocessor, and so on. Under either layout, each rocessor sends log time comuting and sends and receives messages, which needs ( g + L) log time. The third method is called hybrid layout, which switches from cyclic to blocked layout at any column between the log -th and the log -th (assuming > ). With this layout, each rocessor sends messages to every other rocessor, requiring only g(, )+L time. Therefore, this method leads to an algorithm with running time O( log +(, )g+l):which is within a factor (1 + g log ) of otimal. The naive communication schedule stalls on the rst send. The technique of overlaing comutation and communication can be used to eliminate this stall for the large roblem instances. The method introduced in [Sah9] staggers the dierent starting rows for different rocessors: rocessor i starts with its i -th row,

6 roceeds to the last row, and wras around. This leads an otimal algorithm for large roblem instances and reasonable g value Hierarchical models The arallel Memory Hierarchy model (MH) [AC94] and arallel Hierarchical Memory model (- HMM) [VS94] discussed in this section address the concerns of memory hierarchy in a arallel setting. The -HMM rimarily originates from considering in a arallel network the existence of secondary or disk memory. MH on the other hand uses \memory hierarchy" as a more general technique to model not only the hierarchy within a rocessor but also the communication characteristics of a arallel machine. arallel Memory Hierarchy model. The MH is a so-called generic model which denes a class of secic models [AC94]. In the MH model, a arallel comuter is modeled as a tree and each node of the tree is called a module. All of the leaf modules are used to denote the rocessors and the internal modules hold the data. Achild module is connected to its arent by a unique channel with a certain amount of bandwidth. Data in a module is artitioned into blocks which are the basic unit of data transfer between the child and arent. Thus communication between two rocessors roceeds u the tree by some ath and then down the tree to the target rocessor, somewhat resembling the fat tree architecture. The model is characterized by four arameters secied for each module m: blocksize s m, blockcount m which denotes the number of blocks in each module, childcount c m which denotes the number of children for each module, and transfer time t m which denotes the number of cycles used to transfer a block between the current module and its arent. To model a articular comuter, one chooses a tree structure and values for the arameters aroriate for the machine's communication caabilities and memory hierarchy. Even though this model is termed a arallel memory hierarchy, the internal modules need not necessarily corresond to actual memory modules of the real machine; when modeling the CM-5 for examle, many of the modules are used to cature the interrocessor communication caabilities [AC94]. The MH model derives from the develoers' exerience in tuning code for memory hierarchy. Many sequential algorithms have been develoed for the original sequential UMH model. However, erhas because of the comlexity and generality of the model, not too many algorithms have been develoed for the MH arallel model. arallel Hierarchical Memory model. The - HMM model is also called the arallel I/O model [VS90, VS94]. It originates from the consideration that data must often reside in secondary storage rather than main memory; in a arallel setting this may involve the arallel access of multile disks. Therefore it is necessary to design arallel algorithms which consider the ossible data movement between main and secondary memory [AV88], and more generally which consider multile levels of memory including register and cache. In the -HMM model, each rocessor has a memory hierarchy organized into discrete levels, much like the memory organization in the HMM, and all of searate memories are connected together at the base level of each hierarchy. A further assumtion is that the hierarchies can each function indeendently. Communication between hierarchies takes lace at the base memory level (level 1) which consists of location 1 from each of the hierarchies. The interconnection network for the base memory level locations is normally assumed to be a hyercube (or cube-connected cycles) so that the records in the base memory level can be sorted in O(log ) time. The model can be extended to allow block transfer; the resulting model is called the -BT model [VS94]. Several other extensions which use the UMH memory model and RAM interconnection resectively are discussed in [V91]. Two factors are critical for develoing eective - HMM algorithms: data lacement and movement between the levels of memory hierarchy, and data movement among the rocessors. FFT. We resent the following -HMM algorithm from [VS94], erforming the -inut FFT when : 1. Comute -inuts FFTs. Assume that ith FFT is on the ith contiguous grou of tracks (or memory levels).. Shift the records in the kth hierarchy to hierarchy 1 +(k+oset,1) mod, where oset = (i, 1) mod for ith grou. 3. Shue the records to form new contiguous grous of tracks. For every 1 i, the ith new grou consists of the ith record from each of the original grous. 4. Do -inut FFTs for the new grous. When,we rst do -inut FFTs, followed by ashue-merge, and then do -inut FFTs.

7 roc Synchrony/ Memory Latency Bandwidth Block Overhead Memory asynchrony Organization Transfer Hierarchy RAM Synchronous Shared VRAM 1 Synchronous Vector LRAM Synchronous Shared BRAM Synchronous Shared ostal Asynchronous Distributed HASE Semi-asynch Shared RAM HASE Semi-asynch Shared LRAM ARAM Asynchronous Shared BS Semi-asynch Distributed Log Asynchronous Both MH Asynchronous Distributed -HMM Asynchronous Distributed H-RAM Semi-asynch Both Log-HMM Asynchronous Distributed Table 1: Resource metrics chosen by dierent models of arallel comutation. The time required by this algorithm, T (; ), is exlained as follows. Ste 1 and 4 take T( ;)+ log time. The rst term in this formula is easy to understand. The second term arises because the different -inuts FFTs sit in dierent memory levels (or tracks). We need to move them into the lowest memory levels in order to recursively comute the -inut FFT. Similarly, the shifted elements need to be brought to the base memory and to be transmitted by the network which connects the base memory. Therefore, the shift time is O( (log + log )). The shuing can be done in the order of inut size times the data movement in the hierarchy, which iso( log ). Therefore, we have following recurrence T (; ) = T( ;)+O( log ); which gives the result O( log log log log ): This matches the FFT lower bound in [VS94] and so is otimal. 5. Log-HMM model The resource metrics chosen by dierent arallel models discussed so far are summarized in Table 1. It is aarent that each of the models addresses diering asects of resource usage, which suggests that resource metrics are aroriate tools to categorize and examine dierent models of arallel comutation. Using the framework of resource metrics, we can also build new models by adding or deleting several resource metrics from the existing models. However, this exibility otentially causes diculty. If we consider each resource metric as a dimension in the design sace of a machine's architecture, then we may introduce too many dimensions with which to deal. Therefore, a careful analysis and choice of resource metrics is necessary. The Log-HMM model, dened below, serves as an illustration of how the framework of resource metrics can be used to guide the design of ragmatically re- ned models that ll a need for more detailed erformance measures. From an examination of the resource metrics of existing models, it is aarent that a void exists in arallel models that accurately treat both network communication (as does the Log) and multi-level memory (as does the -HMM). The Log model does not address the roblem of several layers of memory. Yet it is imortant to model the several layers of memories which exist in many machines, since diering access times to local cache and disk may strongly eect erformance. On the other hand, models such as the -HMM or -UMH address in detail the resource of memory hierarchy, but not so much the accurate characterization of the communication network. These models normally use simle assumtions about the network, such as a RAM connection or an abstracted hyercube connection. The Log-HMM model lls the ga by combining both kinds of models together (or one might say by adding more resource metrics into either model). We take the aroach of extending an existing arallel model with memory hierarchy. The resulting model consists of two arts: the network art and the memory art. The network art can be any of the arallel models such as BS and Log, while the memory art can be any of the sequential hierarchical memory models such as HMM and UMH. In this aer we will focus on the extension of the Log model with HMM, which we call Log-HMM.

8 M/ M/ M/ M M M M THE ETWORK M M Log arameters Hierarchical arameters Figure 1: Structure of the Log-HMM model Denition of the model The Log-HMM model, ictured in Figure 1, is de- ned much like the arallel hierarchy memory model [VS94]. A Log-HMM machine consists of a set of asynchronously executing rocessors, each with an unlimited local memory. The local memory is organized as a sequence of layers with increasing size, where the size of layer i is i. Each memory location can be accessed randomly; the cost of accessing a memory location at address x is log x (using access cost function f(x) = log x). The rocessors are connected by a Log network at level 0. In other words, the four Log arameters, L, o, g and, are used to describe the interconnection network. A further assumtion is that the network has a nite caacity such that at any time at most b L g c messages can be in transit from or to any rocessor. 5.. Algorithm design and analysis Exloiting locality is the key to designing ecient algorithms for the Log-HMM model. In Log-HMM, there are two otential sources of data locality: the network art and the memory art. In the network art, we need to layout data in such a manner that each rocessor will use the data intensively before it needs the data in other rocessors. A similar situation exists in the memory art: before we move data to higher memory levels, the data should have been used intensively and should not be needed after moving. Several algorithms resented below illustrate these ideas. Before we resent the algorithms, we rst rove the following theorem. Theorem 5.1 The lower bound of sorting elements (or comuting a FFT grah) in the Log-HMM model with memory access cost f(x)=log x is ( log log log ): roof: In a Log-HMM machine with rocessors, the best way we can sort elements is to divide elements into sets of equal size and lace the elements among the rocessors already in interrocessorsorted order. Then each rocessor simultaneously sorts elements in its local memory and no communication between rocessors are required. In this case, the sorting lower bound for elements in each rocessor is also the lower bound for the whole sorting rocedure. By the result in [AACS87], the sorting lower bound for elements in HMM model with memory access cost f(x)=log x is exactly ( log log log ): A similar argument can be used for FFT comutation. Therefore we rove the theorem. FFT { Algorithm 1. Using the technique in [VS94] and the algorithm given in section 4.3., we can comute the FFT on a Log-HMM machine. The time needed by this algorithm is O( L+o L g log log log ): Stes 1 and 4 take T( ;)+ log time. The shuing ste sends time O( log ) after each grou is \shifted" by an aroriate oset, because the shuing can be done in the local memory hierarchy and does not cause any communication. The shifting can be done using following method. Each rocessor sends the \shift" messages to the other rocessors simultaneously. When d L g e messages have been sent, every rocessor begins to receive the d L g e messages. This rocedure is continued until each rocessor has sent and received all of the messages. The time required for this communication is O(( =d L L+o g e)(4l +4o)) = O( L g ): But each record sent may reside at a dierent level of the memory hierarchy. In order to move them into the lower memory levels, we need another log factor, which gives O( L+o L g log ): Therefore we have following recurrence T (; ) = T( ;)+O( L+o L g log ); which yields the bound given above. The algorithm is thus within a factor of g(l+o) L otimal. FFT { Algorithm. Algorithm 1 basically uses block layout for the inut data. This algorithm will use the hybrid data layout and a tighter uer bound can be derived. The idea of the hybrid method has been discussed in section 4.. A more detailed discussion can be found in [Sah9].

9 Using the similar notation given in [Sah9], we denote m = and l = m. Two stes are used to comute the -inut FFT. In Ste I, each rocessor comutes a m-inut buttery and after each rocessor nishes its comutation, it sends messages to each of the other rocessors. After each rocessor accets the, messages, it begins Ste II which comrises the comutation of the non-inut nodes of l dis- joint -inut butteries. By the result of [AACS87], Ste I comutation needs O(m log m log log m) time. Ste II comutation needs O(l log log log ) = O(m log log log ) time lus the time to move the l -inut FFTs to the lower memory levels, which is bounded by O( log )oro( log ). When there is no memory hierarchy, the communication time is (, )=d L L+o g e)(4l +4o)) = L g(, ). However, the messages sent may reside at dierent memory levels. This gives the communication time to be L+o L g(, ) log Thus the total running time is: O( log log log + L+o L g(, ) log ); Because, the running time is within a factor 1+ g(l+o) of otimal. Llog log Sorting. A near otimal sorting algorithm can be obtained by using columnsort [Lei85] and the modied median-sort algorithm roosed in [AACS87]. We will denote this modied median-sort algorithm as HMMsorting in the rest of the aer and we will use the result that the HMM-sorting can sort elements in O( log log log ) time. The columnsort erforms sorting on a column-major r c matrix with the conditions of c%r = 0 and r > (c, 1). It executes eight consecutive stes, of which the odd-numbered stes sort the r elements in each column and the evennumber stes ermute the data among the rocessors. In order to devise an ecient algorithm in the Log- HMM model, we take c =, and r =. The columnsort condition is satised if > (, 3). Each rocessor will be resonsible for sorting a column. Therefore, by using HMM-sorting, each of oddnumbered stes will take time O( log log log ). In ste and 4, each rocessor sends and receives, elements. These elements may reside at or send to dierent memory levels, therefore the communication overhead would be L+o L g(, ) log : In ste 6 and 8, each rocessor sends and receives elements. The communication overhead would be L+o L g log : Therefore the sorting can be done in time O( log log log + L+o L g log ); when > (, 3), which is within a factor 1+ g(l+o) of otimal. Llog log 5.3. An alternative otimal FFT algorithm for the -HMM model The hybrid algorithm discussed above can be adated for the -HMM model and an otimal running time can be achieved. The analysis is summarized below: 1. The time for m-inut FFT is O(m log m log log m) on a single rocessor.. In the shue stage, each rocessor sends, elements. In -HMM, sending a message over the network takes log time. Therefore, the time for this stage is (, ) (log +log ), where the second term of log is for transferring the data in the memory hierarchy. 3. The time to comute l -inut FFTs is (O(l( log log log )+ log ). Again the second term is for transferring the data in the memory hierarchy. Summing all of them together, we get the following result: O( log log log ); which matches the lower bound O( log log log log ) roven in [VS94]. And therefore it is otimal. 6. Conclusions This aer resents a framework of using resource metrics to characterize the various models of arallel comutation. Using the roerties of resource metrics, we can classify models into the basic synchronous models and extensions which incororate notions of asynchrony, communication costs, and memory hierarchy. The merits and disadvantages of these models are brought out by examining their characteristics in terms of resource metrics and their utility in designing such algorithms as FFT. Resource metrics can be used not only to understand existing models, but also to guide the design of

10 new models. The Log-HMM model, roosed in this aer, serves as an illustration of a model designed to ll the ga observed between models which address communication and those which address memory hierarchy. The design of near otimal sorting and FFT algorithms for Log-HMM gives romise that the Log- HMM model has the otential to serve as a viable tool for the design and analysis of what may rove to be ractical algorithms for a large class of machines. The Log-HMM model leaves large room for further study. Ongoing investigations include examining the utility of relacing the memory art by a more ractical model like the UMH. We are also ursuing the design of algorithms besides FFT and sorting for the Log-HMM model. Finally, we are ursuing the comarative evaluation and exerimental measurement of how accurately these models reect real arallel machines by imlementing the algorithms on machines such as the IBM S-. References [AACS87] A. Aggarwal, B. Alern, A. Chandra, and M. Snir, \A model for hierarchical memory," in roc. 19th ACM Sym. on Theory of Comuting,. 305{314, May [AC94] [ACF90] [ACS87] [ACS89] [ACS90] [AV88] [Ble90] [BK9] B. Alern and L. Carter, \Towards a model for ortable arallel erformance: exosing the memory hierarchy," in ortability and erformance for arallel rocessing,. 1{41, John Wiley & Sons, B. Alern, L. Carter, and E. Feig, \Uniform memory hierarchies," in roc. 31st IEEE Sym. on Foundations of Comuter Science, A. Aggarwal, A. Chandra, and M. Snir, \Hierarchical memory with block transfer," in roc. 8th Sym. on Foundations of Comuter Science,. 04{16, Oct A. Aggarwal, A. Chandra, and M. Snir, \On communication latency in RAM comutation," in roc. 1st ACM Sym. on arallel Algorithms and Architectures,. 11{1, A. Aggarwal, A. K. Chandra, and M. Snir, \Communication comlexity of RAMs," J. Theoretical Comuter Science, Mar A. Aggarwal and J. Vitter, \The inut/outut comlexity of sorting and related roblems," Comm. ACM, vol. 31,. 1116{117, Set G. E. Blelloch, Vector Models for Data-arallel Comuting. MIT ress, A. Bar-oy and S. Kinis, \Designing broadcasting algorithms in the ostal model for message-assing systems," in roc. 4th ACM Sym. on arallel Algorithms and Architectures,. 11{, ACM, June 199. [CK + 93] D. Culler, R. Kar, D. atterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, \Log: Towards a realistic model of arallel comutation," in roc. 4th ACM Sym. on rinciles and ractice of arallel rogramming,. 1{1, ACM, [CZ89] [FW78] R. Cole and O. Zajicek, \The ARAM: Incororating asynchrony into the RAM model," in roc. 1st ACM Sym. on arallel Algorithms and Architectures,. 169{178, ACM, S. Fortune and J. Wyllie, \arallelism in random access machines," in roc.10th ACM Sym. on Theory of Comuting,. 114{118, [Gib89]. B. Gibbons, \A more ractical RAM model," in roc. 1st ACM Sym. on arallel Algorithms and Architectures,. 158{168, ACM, [GMR94]. B. Gibbons, Y. Matias, and V. Ramachandran, \The QRQW RAM : Accounting for contention in arallel algorithms," in roc. 5th ACM-SIAM Sym. on Discrete Algorithms,. 638{647, [Goo93] M. Goodrich, \arallel algorithms column I: Models of comutation," SIGACT ews, vol. 4,. 16{1, Dec [KSSS93] R. Kar, A. Sahay, E. Santos, and K. Schauser, \Otimal broadcast and summation in the Log model," in roc. 5th ACM Sym. on arallel Algorithms and Architectures,. 14{153, [Lei85] [V91] T. Leighton, \Tight bounds on the comlexity of arallel sorting," IEEE Transactions on Comuters, vol. 34, no. 3,. 344{354, M. odine and J. Vitter, \Large-scale sorting in arallel memories," in roc. 3rd ACM Sym. on arallel Algorithms and Architectures,. 9{ 39, July [Sah9] A. Sahay, \Hiding communication costs in bandwidth-limited arallel FFT comutation," Technical Reort UCB/CSD 9/7, UC Berkeley, 199. [Ski91] D. Skillicorn, \Models for ractical arallel comutation," International Journal of arallel rogramming, vol. 0, no.,. 133{158, [Val90] [VS90] [VS94] L. Valiant, \A bridging model for arallel comutation," Comm. ACM, vol. 33,. 103{111, Aug J. Vitter and E. Shriver, \Otimal disk I/O with arallel block transfer," in roc. nd ACM Sym. on Theory of Comuting,. 159{169, J. S. Vitter and E. A. M. Shriver, \Algorithms for arallel memory II: Hierarchical multilevel memories," Algorithmica, 1994.

Models and Resource Metrics for Parallel and Distributed Computationt

Models and Resource Metrics for arallel and Distributed Computationt Zhiyong Li, eter H. Mills, John H. Reif Department of Computer Science, Duke University, Durham, N.C. 27708-0129 Abstract This paper