[CZ89]), communication costs such asnetwork latency and bandwidth (e.g., the LPRAM [ACS89], Postal Model [BNK92], BSP [Val90], and LogP [CKP + 93]), a

Size: px
Start display at page:

Download "[CZ89]), communication costs such asnetwork latency and bandwidth (e.g., the LPRAM [ACS89], Postal Model [BNK92], BSP [Val90], and LogP [CKP + 93]), a"

Transcription

1

2 [CZ89]), communication costs such asnetwork latency and bandwidth (e.g., the LRAM [ACS89], ostal Model [BK9], BS [Val90], and Log [CK + 93]), and memory hierarchy, reecting the eects of multileveled memory such as diering access times for registers, local cache, main memory and disk I/O (e.g., the -HMM [VS94], MH [AC94], and -UMH [V91]). The aroach followed by these models is that of a arameterized (or generic) model, which abstracts the architectural details into several generic arameters which we call resource metrics. Tyical resource metrics include the number of rocessors, communication latency, bandwidth, block transfer caability, network toology, memory hierarchy, memory organization and degree of asynchrony. Using such a arameterized model one can design broadly alicable arameterized algorithms that can be tailored to secic machines by instantiating the arameters, such as latency and bandwidth, to match machine characteristics. The more recent models tyically try to use more arameters to more nely cature the resource characteristics of arallel machines. However, a careful balance must be struck between incororating detail and being too nely arameterized (in too many dimensions) so as to render otimal algorithm design imossible. Therefore, identifying resource metrics and aroriately choosing them is critical in the design of models of arallel comutation. In this aer we identify resources and resource metrics which are imortant for the erformance of arallel machines and use them as a framework to characterize the variety of arallel models. Within this framework we will categorize models into four classes: basic synchronous models, asynchronous models, models which incororate notions of latency and bandwidth, and models which address hierarchical memory. The models discussed here are generally in an increasing order of the number of resource metrics considered. Throughout the discussion, we use the roblem of Fast Fourier Transform (FFT) comutation to illustrate the rinciles of algorithm design and comlexity analysis for many of the dierent models. The data movement of the FFT comutation forms a buttery grah, also called a FFT grah. The -oint FFT grah with = m, can be dened as follows: there are inut and outut oints denoted as x 0;0 ;x 0;1 ; :::; x 0;,1 and x m;0 ;x m;1 ; :::; x m;,1 resectively. The comutation is often called ebbling and denoted as x j;q = f(x j,1;q;x j,1;r); where f is a constant cost function and the only dierence of q and r is in the (j, 1)th osition when reresented as binary numbers. The examination of the resource metrics chosen by revious models reveals a void in models that accurately treat both network communication and multilevel memory. As a simle examle of the rocess of develoing imroved erformance models, we roose a new hybrid model of arallel comutation, the Log- HMM model, whose evolution naturally lls this void. The Log-HMM extends an existing arameterized network model (the Log, with resource metrics of latency, bandwidth, and overhead on handling messages) with memory hierarchy at each rocessor (there following the sequential HMM model). The Log-HMM reresents a ragmatic renement of arallel comutation models within the framework of resource metrics to accommodate more detailed erformance measures. We examine the otential utility of Log-HMM model in the design of near otimal sorting and FFT algorithms. It turns out that one of the near otimal FFT algorithms, the hybrid layout method, can run otimally in the -HMM model. The remainder of this aer is organized as follows. Section identies resources and resource metrics for arallel comutation. Section 3 discusses basic synchronous models. Section 4 discusses extensions of the basic synchronous models, including asynchronous models, models incororating communication cost, and hierarchical models. In Section 5 we summarize the resource metrics chosen by dierent models of arallel comutation, dene the new Log-HMM model, and resent near otimal sorting and FFT algorithms for this model. We conclude by reviewing the critical role the careful choice of resource metrics lays in the design of arallel models, as demonstrated through the Log-HMM, and discuss ongoing research.. Resource metrics We rst resent denitions of resources, resource metrics and models of arallel comutation. Deniton 1 Aresource refers to an architectural feature that signicantly aects the erformance of a arallel machine. A resource metric is a measure of the corresonding resource, which could be quantitative or qualitative. The value of a quantitative resource metric is normally a multile of the unit rocessor execution time. Deniton Acomutational model is an abstraction of a comuting machine which is characterized by the choice of several resources and the corresonding resource metrics. A comutational model may thus be identied with a set of resource metrics. Moreover, algorithms will be designed and analyzed based on these resource metrics.

3 For examle, a sequential comuter is suitably characterized by the resources of sequential comutation time and sace usage. It is commonly acceted that the sequential comutational models, such as RAM and its hierarchical memory extensions HMM, BT f and UMH [AACS87, ACS87, ACF90], reect these resources quite well and therefore rovide a common base for sequential comutation. Resource metrics for a ractical arallel machine are far more comlicated than those of sequential machines. We identify the following list of signicant resources and resource metrics for arallel comutation: umber of rocessors. A theoretical model normally assumes that there is an unlimited number of rocessors available, while a more ractical model assumes a bounded number of rocessors, such ashun- dreds or thousands. Memory organization. Machines may be characterized as having hysically shared memory or distributed memory. The former often has a local cache. The latter tyically uses exlicit message assing rimitives. Communication latency. The costs of accessing local memory and global memory in a shared memory machine are quite dierent; similarly, in a distributed memory machine, the costs of accessing local memory and communication with other rocessors are quite different. Latency is one measure of the cost of global memory access. Degree of asynchrony. rocessors may run synchronously, semi-synchronously (loosely synchronously) or asynchronously. Semi-asynchrony refers to a comutation that is divided into a sequence of indeendent execution hases; within each hase, the rogram runs asynchronously, but all of the rocessors are synchronized at the end of each hase (i.e., barrier synchronization occurs). Bandwidth. Communication bandwidth and memory bandwidth are both limited in ractice. Currently, communication bandwidth lags far behind internal rocessor memory bandwidth. The bisection bandwidth, dened as the bandwidth across a line that searates the network into two arts, is normally used for the bandwidth resource metric. Overhead of a rocessor for message handling. The communication overhead is the time that the rocessor engages in sending and receiving a message. In most cases, the value is deendent on the communication rotocol imlemented in a ractical machine. For examle, in the CM-5 it could be a linear function of the message size [CK + 93]. Block transfer caability. In most architectures, a signicant cost (latency) is incurred to access the rst of a contiguous block of words, but after that, successive words can be accessed in unit time. Memory hierarchy. Many sohisticated machines have several layers of memory with diering access times, such as register, cache, main memory and secondary memory. A model caturing this memory hierarchy can organize the memory as a tree or a sequence of layers with increasing sizes, where each nodeorlevel is arameterized by memory size, block size, and intermodule bandwidth. Memory contention. Memory can be accessed as a block or a set of banks. When the bank is unit size, the memory locations may be accessed simultaneously. This assumtion is adoted by most of the models discussed in this aer. rotocols for resolving conict in concurrent memory access include EREW, CREW, CRCW and QRQW. etwork toology. The rocessors may be interconnected using a mesh, cube, fat tree, ring, or other toology. The most common metric for this resource is the diameter of the network. In the subsequent sections we resent a detailed discussion of each model and its choice of resource metrics. Because each resource has a metric associated with it, we will henceforth not distinguish resources and resource metrics. 3. Basic synchronous models RAM. The RAM model extends the sequential RAM model by relicating the rocessor art. A RAM machine is a set of sequential rocessors sharing a global memory and each having its own rivate unbounded local memory. A RAM comutation is a sequence of read, write and comutation stes. All rocessors execute in lock-ste, that is, they are synchronized before they execute the next instruction. The costs of memory access, either to local or global memory, and comutation stes are uniform. The cost of synchronization is free. Several variations of the RAM model use dierent rotocols to handle simultaneous access of several rocessors to the same location of global memory. rotocols include EREW (exclusive read - exclusive write), CREW (concurrent read - exclusive write), and CRCW (concurrent read - concurrent write). The latter rotocol can be further divided

4 into several classes by the semantics of the concurrent write. The most recent variation is QRQW [GMR94], which assumes that simultaneous access to the same memory block will be inserted into a request queue and served in a FIFO manner. VRAM. Another extension of the serial RAM model is the Vector Random Access Machine (VRAM) [Ble90]. The VRAM is a serial random access machine with the addition of avector memory, avector rocessor, and vector inut and outut orts. Tyical vector instructions include elementwise oerations, data movement oerations, scans, and acks. Two measures, ste and element comlexity, can be derived for a roblem of size in the VRAM model. Ste comlexity (s) is the total number of instructions executed and element comlexity (e) is the vector length er rimitive instruction call, summed over the number of the calls. These values give a measure of the arallel time and the total work resectively. The RAM comlexity of an algorithm designed in the VRAM model can be derived as O(e= + s). The RAM has roven to be useful by ermitting algorithm designers to focus on the structure of comutational tasks rather than the architecture details of a currently available machine. A large number of ef- cient algorithms have been develoed by exloiting its simlistic assumtions, but in ractice, some of the architectural issues that the RAM ignores are imortant. 4. Extensions of the basic models The RAM model rovides an abstraction that ignores concerns such as asynchrony, communication delay and memory hierarchy. In this section we discuss several extensions of the RAM model which incororate some of these measures. The extensions may be viewed as adding more resource metrics to the RAM model in order to gain imroved erformance measures Asynchronous models Among the rst extensions to the RAM were the hase RAM and ARAM models, which incororate some notion of asynchronous execution. hase RAM. The hase RAM [Gib89] extends the RAM model with semi-asynchrony. A hase RAM machine consists of a shared global memory, a set of sequential rocessors, and a rivate local memory for each rocessor. The comutational task is searated into a set of hases of asynchronous execution, each ended by an exlicit barrier synchronization. The cost of global read, global write and local oerations are the same constant. The cost of a synchronization ste is deendent on the number of rocessors. Sace limitations reclude resentation of a hase RAM algorithm design for FFT comutation; an examle may be found in [Gib89]. It is worth noting that a variantof the hase RAM, the hase LRAM model, accounts as well for the cost of communication latency. ARAM. The Asynchronous RAM (ARAM) is a \fully" asynchronous model [CZ89]. The ARAM model consists of a global shared memory and a set of rocesses with their own local memories. The basic oerations executed by the ARAM rocess are called events. An ARAM comutation is denoted as the set of ossible serializations of events executed by the rocesses. A virtual clock is associated with each serialization. This virtual clock assigns a time t(e) to each event e. The clock \ticks" when each rocess has executed at least one event. Events may be read and write events, which oerate on the shared and local memory, or local events. All events are charged unit cost. The air [round comlexity, number of rocesses] is used to measure the comlexity of an ARAM algorithm, where a round is dened as the sequence of events between two clock ticks in a comutation. The round comlexity for a comutation is dened to be the maximum number of ossible ticks for that comutation. For an algorithm the round comlexity is dened as the maximum round comlexity over all of the ossible comutations. An examle of ARAM algorithm design using these measures for the roblem of summation may be found in [CZ89]. 4.. Models incororating communication latency and bandwidth The models discussed here consist of two subclasses: the synchronous latency models such as the LRAM and BRAM which add the notion of latency into the RAM model, and models such as the BS and Log which not only incororate asynchrony and latency but also address the issue of bandwidth limitation. LRAM. The Local-Memory RAM (LRAM) model [ACS89] consists of a shared global memory and a set of rocessors with unbounded local memory executing in lock-ste. The access rotocol to global memory is CREW. At every time ste, each rocessor can erform either a communication ste, in which it can write and then read a word from the global memory, or a comutation comutation ste, which is an oeration which accesses at most two words from its local

5 memory. BRAM. An extension of the LRAM, the Block RAM (BRAM), is described in [ACS90]. The BRAM takes into account the reduced cost for transferring a contiguous block of data. The BRAM model is dened with two arameters L (latency or startu time) and b (block size). The cost of accessing local memory is unit time. However the cost of transmitting a block of size b of contiguous locations from global memory is L + b. ostal model. The ostal model [BK9] is a distributed memory model with the constraint that the oint-to-oint communication has latency. Several elegant otimal broadcast and summation algorithms have been designed based on this model, which were then extended for the Log model [KSSS93]. Algorithms other than broadcast and summation have largely not been resented for this model. Bulk-Synchronous arallel model. The BS is a distributed memory model [Val90]. Like the hase RAM, the BS is also a semi-asynchronous model because it requires synchronization after each \suerste", within which the rocesses can run asynchronously. The BS model is described in terms of three elements: rocesses/memory modules, a router which delivers the messages between airs of comonents and a synchronizer which synchronizes all or a subset of the comonents. Within the framework of resource metrics, a router is just an abstraction of network bandwidth and latency. A comutational task in the BS model consists of a sequence of suerstes. In each suerste, every comonent is allocated a task which contains some combination of local comutation stes and message transmissions. The local comutations, including reads and writes to local memory, are charged unit cost. The message transmission is accomlished by the router, which can send and receive a certain number of messages in each suerste (or in BS terminology, the router can realize an h-relation). The cost of realizing such anh-relation is assumed to be gh + s time units, where g can be thought as the recirocal of the communication bandwidth and s denotes the startu cost or the latency. If the length of a suerste is L, then L local oerations and a b L g c - relation message attern can be realized. The arameters of the machine are therefore L, g and (the number of rocessors). A FFT algorithm for the BS is described in [Val90]. Log model. The Log model is motivated by current technological trends in high erformance comuting towards networks of large-grained sohisticated rocessors. The Log model uses the arameters L (an uer bound of latency for transmitting a single message), o (the comutation overhead of handling a message), g (a lower bound of time interval between consecutive message transmissions at a rocessor) and (the number of rocessors) [CK + 93]. In contrast to the BS model, it removes the barrier synchronization requirement (h-relation in BS) and allows the rocessors to run asynchronously. The network of a Log machine has a nite caacity such that at any time at most b L g c messages can be in transit from or to any rocessor. It can suort shared or distributed memory. The Log model encourages well-known general techniques of designing algorithms for distributed memory machines including exloiting locality, reducing communication comlexity and overlaing communication and comutation. The Log model also romotes balanced communication atterns by introducing the limitation on network caacity so that no rocessor is overloaded with incoming messages. Moreover, it is often reasonable to ignore the arameter of o in a ractical machine, such as in a machine with low bandwidth (high g value). Examles using this strategy can be found in the FFT algorithm discussed below and also in [KSSS93]. FFT. The data layout and communication scheduling are two key asects to achieving an eective algorithm for the FFT roblem under the Log model. Three methods of data layout are discussed in [CK + 93]. The cyclic layout assigns the ith row of the buttery to the ith rocessor. The block layout laces the rst rows on the rst rocessor, the next rows on the second rocessor, and so on. Under either layout, each rocessor sends log time comuting and sends and receives messages, which needs ( g + L) log time. The third method is called hybrid layout, which switches from cyclic to blocked layout at any column between the log -th and the log -th (assuming > ). With this layout, each rocessor sends messages to every other rocessor, requiring only g(, )+L time. Therefore, this method leads to an algorithm with running time O( log +(, )g+l):which is within a factor (1 + g log ) of otimal. The naive communication schedule stalls on the rst send. The technique of overlaing comutation and communication can be used to eliminate this stall for the large roblem instances. The method introduced in [Sah9] staggers the dierent starting rows for different rocessors: rocessor i starts with its i -th row,

6 roceeds to the last row, and wras around. This leads an otimal algorithm for large roblem instances and reasonable g value Hierarchical models The arallel Memory Hierarchy model (MH) [AC94] and arallel Hierarchical Memory model (- HMM) [VS94] discussed in this section address the concerns of memory hierarchy in a arallel setting. The -HMM rimarily originates from considering in a arallel network the existence of secondary or disk memory. MH on the other hand uses \memory hierarchy" as a more general technique to model not only the hierarchy within a rocessor but also the communication characteristics of a arallel machine. arallel Memory Hierarchy model. The MH is a so-called generic model which denes a class of secic models [AC94]. In the MH model, a arallel comuter is modeled as a tree and each node of the tree is called a module. All of the leaf modules are used to denote the rocessors and the internal modules hold the data. Achild module is connected to its arent by a unique channel with a certain amount of bandwidth. Data in a module is artitioned into blocks which are the basic unit of data transfer between the child and arent. Thus communication between two rocessors roceeds u the tree by some ath and then down the tree to the target rocessor, somewhat resembling the fat tree architecture. The model is characterized by four arameters secied for each module m: blocksize s m, blockcount m which denotes the number of blocks in each module, childcount c m which denotes the number of children for each module, and transfer time t m which denotes the number of cycles used to transfer a block between the current module and its arent. To model a articular comuter, one chooses a tree structure and values for the arameters aroriate for the machine's communication caabilities and memory hierarchy. Even though this model is termed a arallel memory hierarchy, the internal modules need not necessarily corresond to actual memory modules of the real machine; when modeling the CM-5 for examle, many of the modules are used to cature the interrocessor communication caabilities [AC94]. The MH model derives from the develoers' exerience in tuning code for memory hierarchy. Many sequential algorithms have been develoed for the original sequential UMH model. However, erhas because of the comlexity and generality of the model, not too many algorithms have been develoed for the MH arallel model. arallel Hierarchical Memory model. The - HMM model is also called the arallel I/O model [VS90, VS94]. It originates from the consideration that data must often reside in secondary storage rather than main memory; in a arallel setting this may involve the arallel access of multile disks. Therefore it is necessary to design arallel algorithms which consider the ossible data movement between main and secondary memory [AV88], and more generally which consider multile levels of memory including register and cache. In the -HMM model, each rocessor has a memory hierarchy organized into discrete levels, much like the memory organization in the HMM, and all of searate memories are connected together at the base level of each hierarchy. A further assumtion is that the hierarchies can each function indeendently. Communication between hierarchies takes lace at the base memory level (level 1) which consists of location 1 from each of the hierarchies. The interconnection network for the base memory level locations is normally assumed to be a hyercube (or cube-connected cycles) so that the records in the base memory level can be sorted in O(log ) time. The model can be extended to allow block transfer; the resulting model is called the -BT model [VS94]. Several other extensions which use the UMH memory model and RAM interconnection resectively are discussed in [V91]. Two factors are critical for develoing eective - HMM algorithms: data lacement and movement between the levels of memory hierarchy, and data movement among the rocessors. FFT. We resent the following -HMM algorithm from [VS94], erforming the -inut FFT when : 1. Comute -inuts FFTs. Assume that ith FFT is on the ith contiguous grou of tracks (or memory levels).. Shift the records in the kth hierarchy to hierarchy 1 +(k+oset,1) mod, where oset = (i, 1) mod for ith grou. 3. Shue the records to form new contiguous grous of tracks. For every 1 i, the ith new grou consists of the ith record from each of the original grous. 4. Do -inut FFTs for the new grous. When,we rst do -inut FFTs, followed by ashue-merge, and then do -inut FFTs.

7 roc Synchrony/ Memory Latency Bandwidth Block Overhead Memory asynchrony Organization Transfer Hierarchy RAM Synchronous Shared VRAM 1 Synchronous Vector LRAM Synchronous Shared BRAM Synchronous Shared ostal Asynchronous Distributed HASE Semi-asynch Shared RAM HASE Semi-asynch Shared LRAM ARAM Asynchronous Shared BS Semi-asynch Distributed Log Asynchronous Both MH Asynchronous Distributed -HMM Asynchronous Distributed H-RAM Semi-asynch Both Log-HMM Asynchronous Distributed Table 1: Resource metrics chosen by dierent models of arallel comutation. The time required by this algorithm, T (; ), is exlained as follows. Ste 1 and 4 take T( ;)+ log time. The rst term in this formula is easy to understand. The second term arises because the different -inuts FFTs sit in dierent memory levels (or tracks). We need to move them into the lowest memory levels in order to recursively comute the -inut FFT. Similarly, the shifted elements need to be brought to the base memory and to be transmitted by the network which connects the base memory. Therefore, the shift time is O( (log + log )). The shuing can be done in the order of inut size times the data movement in the hierarchy, which iso( log ). Therefore, we have following recurrence T (; ) = T( ;)+O( log ); which gives the result O( log log log log ): This matches the FFT lower bound in [VS94] and so is otimal. 5. Log-HMM model The resource metrics chosen by dierent arallel models discussed so far are summarized in Table 1. It is aarent that each of the models addresses diering asects of resource usage, which suggests that resource metrics are aroriate tools to categorize and examine dierent models of arallel comutation. Using the framework of resource metrics, we can also build new models by adding or deleting several resource metrics from the existing models. However, this exibility otentially causes diculty. If we consider each resource metric as a dimension in the design sace of a machine's architecture, then we may introduce too many dimensions with which to deal. Therefore, a careful analysis and choice of resource metrics is necessary. The Log-HMM model, dened below, serves as an illustration of how the framework of resource metrics can be used to guide the design of ragmatically re- ned models that ll a need for more detailed erformance measures. From an examination of the resource metrics of existing models, it is aarent that a void exists in arallel models that accurately treat both network communication (as does the Log) and multi-level memory (as does the -HMM). The Log model does not address the roblem of several layers of memory. Yet it is imortant to model the several layers of memories which exist in many machines, since diering access times to local cache and disk may strongly eect erformance. On the other hand, models such as the -HMM or -UMH address in detail the resource of memory hierarchy, but not so much the accurate characterization of the communication network. These models normally use simle assumtions about the network, such as a RAM connection or an abstracted hyercube connection. The Log-HMM model lls the ga by combining both kinds of models together (or one might say by adding more resource metrics into either model). We take the aroach of extending an existing arallel model with memory hierarchy. The resulting model consists of two arts: the network art and the memory art. The network art can be any of the arallel models such as BS and Log, while the memory art can be any of the sequential hierarchical memory models such as HMM and UMH. In this aer we will focus on the extension of the Log model with HMM, which we call Log-HMM.

8 M/ M/ M/ M M M M THE ETWORK M M Log arameters Hierarchical arameters Figure 1: Structure of the Log-HMM model Denition of the model The Log-HMM model, ictured in Figure 1, is de- ned much like the arallel hierarchy memory model [VS94]. A Log-HMM machine consists of a set of asynchronously executing rocessors, each with an unlimited local memory. The local memory is organized as a sequence of layers with increasing size, where the size of layer i is i. Each memory location can be accessed randomly; the cost of accessing a memory location at address x is log x (using access cost function f(x) = log x). The rocessors are connected by a Log network at level 0. In other words, the four Log arameters, L, o, g and, are used to describe the interconnection network. A further assumtion is that the network has a nite caacity such that at any time at most b L g c messages can be in transit from or to any rocessor. 5.. Algorithm design and analysis Exloiting locality is the key to designing ecient algorithms for the Log-HMM model. In Log-HMM, there are two otential sources of data locality: the network art and the memory art. In the network art, we need to layout data in such a manner that each rocessor will use the data intensively before it needs the data in other rocessors. A similar situation exists in the memory art: before we move data to higher memory levels, the data should have been used intensively and should not be needed after moving. Several algorithms resented below illustrate these ideas. Before we resent the algorithms, we rst rove the following theorem. Theorem 5.1 The lower bound of sorting elements (or comuting a FFT grah) in the Log-HMM model with memory access cost f(x)=log x is ( log log log ): roof: In a Log-HMM machine with rocessors, the best way we can sort elements is to divide elements into sets of equal size and lace the elements among the rocessors already in interrocessorsorted order. Then each rocessor simultaneously sorts elements in its local memory and no communication between rocessors are required. In this case, the sorting lower bound for elements in each rocessor is also the lower bound for the whole sorting rocedure. By the result in [AACS87], the sorting lower bound for elements in HMM model with memory access cost f(x)=log x is exactly ( log log log ): A similar argument can be used for FFT comutation. Therefore we rove the theorem. FFT { Algorithm 1. Using the technique in [VS94] and the algorithm given in section 4.3., we can comute the FFT on a Log-HMM machine. The time needed by this algorithm is O( L+o L g log log log ): Stes 1 and 4 take T( ;)+ log time. The shuing ste sends time O( log ) after each grou is \shifted" by an aroriate oset, because the shuing can be done in the local memory hierarchy and does not cause any communication. The shifting can be done using following method. Each rocessor sends the \shift" messages to the other rocessors simultaneously. When d L g e messages have been sent, every rocessor begins to receive the d L g e messages. This rocedure is continued until each rocessor has sent and received all of the messages. The time required for this communication is O(( =d L L+o g e)(4l +4o)) = O( L g ): But each record sent may reside at a dierent level of the memory hierarchy. In order to move them into the lower memory levels, we need another log factor, which gives O( L+o L g log ): Therefore we have following recurrence T (; ) = T( ;)+O( L+o L g log ); which yields the bound given above. The algorithm is thus within a factor of g(l+o) L otimal. FFT { Algorithm. Algorithm 1 basically uses block layout for the inut data. This algorithm will use the hybrid data layout and a tighter uer bound can be derived. The idea of the hybrid method has been discussed in section 4.. A more detailed discussion can be found in [Sah9].

9 Using the similar notation given in [Sah9], we denote m = and l = m. Two stes are used to comute the -inut FFT. In Ste I, each rocessor comutes a m-inut buttery and after each rocessor nishes its comutation, it sends messages to each of the other rocessors. After each rocessor accets the, messages, it begins Ste II which comrises the comutation of the non-inut nodes of l dis- joint -inut butteries. By the result of [AACS87], Ste I comutation needs O(m log m log log m) time. Ste II comutation needs O(l log log log ) = O(m log log log ) time lus the time to move the l -inut FFTs to the lower memory levels, which is bounded by O( log )oro( log ). When there is no memory hierarchy, the communication time is (, )=d L L+o g e)(4l +4o)) = L g(, ). However, the messages sent may reside at dierent memory levels. This gives the communication time to be L+o L g(, ) log Thus the total running time is: O( log log log + L+o L g(, ) log ); Because, the running time is within a factor 1+ g(l+o) of otimal. Llog log Sorting. A near otimal sorting algorithm can be obtained by using columnsort [Lei85] and the modied median-sort algorithm roosed in [AACS87]. We will denote this modied median-sort algorithm as HMMsorting in the rest of the aer and we will use the result that the HMM-sorting can sort elements in O( log log log ) time. The columnsort erforms sorting on a column-major r c matrix with the conditions of c%r = 0 and r > (c, 1). It executes eight consecutive stes, of which the odd-numbered stes sort the r elements in each column and the evennumber stes ermute the data among the rocessors. In order to devise an ecient algorithm in the Log- HMM model, we take c =, and r =. The columnsort condition is satised if > (, 3). Each rocessor will be resonsible for sorting a column. Therefore, by using HMM-sorting, each of oddnumbered stes will take time O( log log log ). In ste and 4, each rocessor sends and receives, elements. These elements may reside at or send to dierent memory levels, therefore the communication overhead would be L+o L g(, ) log : In ste 6 and 8, each rocessor sends and receives elements. The communication overhead would be L+o L g log : Therefore the sorting can be done in time O( log log log + L+o L g log ); when > (, 3), which is within a factor 1+ g(l+o) of otimal. Llog log 5.3. An alternative otimal FFT algorithm for the -HMM model The hybrid algorithm discussed above can be adated for the -HMM model and an otimal running time can be achieved. The analysis is summarized below: 1. The time for m-inut FFT is O(m log m log log m) on a single rocessor.. In the shue stage, each rocessor sends, elements. In -HMM, sending a message over the network takes log time. Therefore, the time for this stage is (, ) (log +log ), where the second term of log is for transferring the data in the memory hierarchy. 3. The time to comute l -inut FFTs is (O(l( log log log )+ log ). Again the second term is for transferring the data in the memory hierarchy. Summing all of them together, we get the following result: O( log log log ); which matches the lower bound O( log log log log ) roven in [VS94]. And therefore it is otimal. 6. Conclusions This aer resents a framework of using resource metrics to characterize the various models of arallel comutation. Using the roerties of resource metrics, we can classify models into the basic synchronous models and extensions which incororate notions of asynchrony, communication costs, and memory hierarchy. The merits and disadvantages of these models are brought out by examining their characteristics in terms of resource metrics and their utility in designing such algorithms as FFT. Resource metrics can be used not only to understand existing models, but also to guide the design of

10 new models. The Log-HMM model, roosed in this aer, serves as an illustration of a model designed to ll the ga observed between models which address communication and those which address memory hierarchy. The design of near otimal sorting and FFT algorithms for Log-HMM gives romise that the Log- HMM model has the otential to serve as a viable tool for the design and analysis of what may rove to be ractical algorithms for a large class of machines. The Log-HMM model leaves large room for further study. Ongoing investigations include examining the utility of relacing the memory art by a more ractical model like the UMH. We are also ursuing the design of algorithms besides FFT and sorting for the Log-HMM model. Finally, we are ursuing the comarative evaluation and exerimental measurement of how accurately these models reect real arallel machines by imlementing the algorithms on machines such as the IBM S-. References [AACS87] A. Aggarwal, B. Alern, A. Chandra, and M. Snir, \A model for hierarchical memory," in roc. 19th ACM Sym. on Theory of Comuting,. 305{314, May [AC94] [ACF90] [ACS87] [ACS89] [ACS90] [AV88] [Ble90] [BK9] B. Alern and L. Carter, \Towards a model for ortable arallel erformance: exosing the memory hierarchy," in ortability and erformance for arallel rocessing,. 1{41, John Wiley & Sons, B. Alern, L. Carter, and E. Feig, \Uniform memory hierarchies," in roc. 31st IEEE Sym. on Foundations of Comuter Science, A. Aggarwal, A. Chandra, and M. Snir, \Hierarchical memory with block transfer," in roc. 8th Sym. on Foundations of Comuter Science,. 04{16, Oct A. Aggarwal, A. Chandra, and M. Snir, \On communication latency in RAM comutation," in roc. 1st ACM Sym. on arallel Algorithms and Architectures,. 11{1, A. Aggarwal, A. K. Chandra, and M. Snir, \Communication comlexity of RAMs," J. Theoretical Comuter Science, Mar A. Aggarwal and J. Vitter, \The inut/outut comlexity of sorting and related roblems," Comm. ACM, vol. 31,. 1116{117, Set G. E. Blelloch, Vector Models for Data-arallel Comuting. MIT ress, A. Bar-oy and S. Kinis, \Designing broadcasting algorithms in the ostal model for message-assing systems," in roc. 4th ACM Sym. on arallel Algorithms and Architectures,. 11{, ACM, June 199. [CK + 93] D. Culler, R. Kar, D. atterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, \Log: Towards a realistic model of arallel comutation," in roc. 4th ACM Sym. on rinciles and ractice of arallel rogramming,. 1{1, ACM, [CZ89] [FW78] R. Cole and O. Zajicek, \The ARAM: Incororating asynchrony into the RAM model," in roc. 1st ACM Sym. on arallel Algorithms and Architectures,. 169{178, ACM, S. Fortune and J. Wyllie, \arallelism in random access machines," in roc.10th ACM Sym. on Theory of Comuting,. 114{118, [Gib89]. B. Gibbons, \A more ractical RAM model," in roc. 1st ACM Sym. on arallel Algorithms and Architectures,. 158{168, ACM, [GMR94]. B. Gibbons, Y. Matias, and V. Ramachandran, \The QRQW RAM : Accounting for contention in arallel algorithms," in roc. 5th ACM-SIAM Sym. on Discrete Algorithms,. 638{647, [Goo93] M. Goodrich, \arallel algorithms column I: Models of comutation," SIGACT ews, vol. 4,. 16{1, Dec [KSSS93] R. Kar, A. Sahay, E. Santos, and K. Schauser, \Otimal broadcast and summation in the Log model," in roc. 5th ACM Sym. on arallel Algorithms and Architectures,. 14{153, [Lei85] [V91] T. Leighton, \Tight bounds on the comlexity of arallel sorting," IEEE Transactions on Comuters, vol. 34, no. 3,. 344{354, M. odine and J. Vitter, \Large-scale sorting in arallel memories," in roc. 3rd ACM Sym. on arallel Algorithms and Architectures,. 9{ 39, July [Sah9] A. Sahay, \Hiding communication costs in bandwidth-limited arallel FFT comutation," Technical Reort UCB/CSD 9/7, UC Berkeley, 199. [Ski91] D. Skillicorn, \Models for ractical arallel comutation," International Journal of arallel rogramming, vol. 0, no.,. 133{158, [Val90] [VS90] [VS94] L. Valiant, \A bridging model for arallel comutation," Comm. ACM, vol. 33,. 103{111, Aug J. Vitter and E. Shriver, \Otimal disk I/O with arallel block transfer," in roc. nd ACM Sym. on Theory of Comuting,. 159{169, J. S. Vitter and E. A. M. Shriver, \Algorithms for arallel memory II: Hierarchical multilevel memories," Algorithmica, 1994.

Models and Resource Metrics for Parallel and Distributed Computationt

Models and Resource Metrics for Parallel and Distributed Computationt Models and Resource Metrics for arallel and Distributed Computationt Zhiyong Li, eter H. Mills, John H. Reif Department of Computer Science, Duke University, Durham, N.C. 27708-0129 Abstract This paper

More information

COMP Parallel Computing. BSP (1) Bulk-Synchronous Processing Model

COMP Parallel Computing. BSP (1) Bulk-Synchronous Processing Model COMP 6 - Parallel Comuting Lecture 6 November, 8 Bulk-Synchronous essing Model Models of arallel comutation Shared-memory model Imlicit communication algorithm design and analysis relatively simle but

More information

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model.

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model. U.C. Berkeley CS273: Parallel and Distributed Theory Lecture 18 Professor Satish Rao Lecturer: Satish Rao Last revised Scribe so far: Satish Rao (following revious lecture notes quite closely. Lecture

More information

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka Parallel Construction of Multidimensional Binary Search Trees Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka School of CIS and School of CISE Northeast Parallel Architectures Center Syracuse

More information

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Journal of Comuting and Information Technology - CIT 8, 2000, 1, 1 12 1 Comlexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Eunice E. Santos Deartment of Electrical

More information

split split (a) (b) split split (c) (d)

split split (a) (b) split split (c) (d) International Journal of Foundations of Comuter Science c World Scientic Publishing Comany ON COST-OPTIMAL MERGE OF TWO INTRANSITIVE SORTED SEQUENCES JIE WU Deartment of Comuter Science and Engineering

More information

level 0 level 1 level 2 level 3

level 0 level 1 level 2 level 3 Communication-Ecient Deterministic Parallel Algorithms for Planar Point Location and 2d Voronoi Diagram? Mohamadou Diallo 1, Afonso Ferreira 2 and Andrew Rau-Chalin 3 1 LIMOS, IFMA, Camus des C zeaux,

More information

Introduction to Parallel Algorithms

Introduction to Parallel Algorithms CS 1762 Fall, 2011 1 Introduction to Parallel Algorithms Introduction to Parallel Algorithms ECE 1762 Algorithms and Data Structures Fall Semester, 2011 1 Preliminaries Since the early 1990s, there has

More information

Sensitivity Analysis for an Optimal Routing Policy in an Ad Hoc Wireless Network

Sensitivity Analysis for an Optimal Routing Policy in an Ad Hoc Wireless Network 1 Sensitivity Analysis for an Otimal Routing Policy in an Ad Hoc Wireless Network Tara Javidi and Demosthenis Teneketzis Deartment of Electrical Engineering and Comuter Science University of Michigan Ann

More information

An Indexing Framework for Structured P2P Systems

An Indexing Framework for Structured P2P Systems An Indexing Framework for Structured P2P Systems Adina Crainiceanu Prakash Linga Ashwin Machanavajjhala Johannes Gehrke Carl Lagoze Jayavel Shanmugasundaram Deartment of Comuter Science, Cornell University

More information

Collective communication: theory, practice, and experience

Collective communication: theory, practice, and experience CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Comutat.: Pract. Exer. 2007; 19:1749 1783 Published online 5 July 2007 in Wiley InterScience (www.interscience.wiley.com)..1206 Collective

More information

AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS. Ren Chen and Viktor K.

AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS. Ren Chen and Viktor K. inuts er clock cycle Streaming ermutation oututs er clock cycle AUTOMATIC GENERATION OF HIGH THROUGHPUT ENERGY EFFICIENT STREAMING ARCHITECTURES FOR ARBITRARY FIXED PERMUTATIONS Ren Chen and Viktor K.

More information

10. Parallel Methods for Data Sorting

10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting... 1 10.1. Parallelizing Princiles... 10.. Scaling Parallel Comutations... 10.3. Bubble Sort...3 10.3.1. Sequential Algorithm...3

More information

Efficient Parallel Hierarchical Clustering

Efficient Parallel Hierarchical Clustering Efficient Parallel Hierarchical Clustering Manoranjan Dash 1,SimonaPetrutiu, and Peter Scheuermann 1 Deartment of Information Systems, School of Comuter Engineering, Nanyang Technological University, Singaore

More information

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22 Collective Communication: Theory, Practice, and Exerience FLAME Working Note # Ernie Chan Marcel Heimlich Avi Purkayastha Robert van de Geijn Setember, 6 Abstract We discuss the design and high-erformance

More information

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level,

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level, [9] J. J. Dongarra, R. Hemel, A. J. G. Hey, and D. W. Walker, \A Proosal for a User-Level, Message Passing Interface in a Distributed-Memory Environment," Tech. Re. TM-3, Oak Ridge National Laboratory,

More information

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model Ad Hoc Networks (4) 5 68 Contents lists available at SciVerse ScienceDirect Ad Hoc Networks journal homeage: www.elsevier.com/locate/adhoc Latency-minimizing data aggregation in wireless sensor networks

More information

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees 1

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees 1 Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Sanning Trees 1 Honge Wang y and Douglas M. Blough z y Myricom Inc., 325 N. Santa Anita Ave., Arcadia, CA 916, z School of Electrical and

More information

Shuigeng Zhou. May 18, 2016 School of Computer Science Fudan University

Shuigeng Zhou. May 18, 2016 School of Computer Science Fudan University Query Processing Shuigeng Zhou May 18, 2016 School of Comuter Science Fudan University Overview Outline Measures of Query Cost Selection Oeration Sorting Join Oeration Other Oerations Evaluation of Exressions

More information

An Efficient Video Program Delivery algorithm in Tree Networks*

An Efficient Video Program Delivery algorithm in Tree Networks* 3rd International Symosium on Parallel Architectures, Algorithms and Programming An Efficient Video Program Delivery algorithm in Tree Networks* Fenghang Yin 1 Hong Shen 1,2,** 1 Deartment of Comuter Science,

More information

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications OMNI: An Efficient Overlay Multicast Infrastructure for Real-time Alications Suman Banerjee, Christoher Kommareddy, Koushik Kar, Bobby Bhattacharjee, Samir Khuller Abstract We consider an overlay architecture

More information

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap Aeared in \Journal of Parallel and Distributed Comuting, July 1995 " Overlaing Comutations, Communications and I/O in Parallel Sorting y Mark J. Clement Michael J. Quinn Comuter Science Deartment Deartment

More information

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model IMS Network Deloyment Cost Otimization Based on Flow-Based Traffic Model Jie Xiao, Changcheng Huang and James Yan Deartment of Systems and Comuter Engineering, Carleton University, Ottawa, Canada {jiexiao,

More information

RST(0) RST(1) RST(2) RST(3) RST(4) RST(5) P4 RSR(0) RSR(1) RSR(2) RSR(3) RSR(4) RSR(5) Processor 1X2 Switch 2X1 Switch

RST(0) RST(1) RST(2) RST(3) RST(4) RST(5) P4 RSR(0) RSR(1) RSR(2) RSR(3) RSR(4) RSR(5) Processor 1X2 Switch 2X1 Switch Sub-logarithmic Deterministic Selection on Arrays with a Recongurable Otical Bus 1 Yijie Han Electronic Data Systems, Inc. 750 Tower Dr. CPS, Mail Sto 7121 Troy, MI 48098 Yi Pan Deartment of Comuter Science

More information

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs The International Journal of Parallel, Emergent and Distributed Systems Vol. 00, No. 00, Month 2011, 1 22 RESEARCH ARTICLE Simle Memory Machine Models for GPUs Koji Nakano a a Deartment of Information

More information

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2

An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 An Efficient Coding Method for Coding Region-of-Interest Locations in AVS2 Mingliang Chen 1, Weiyao Lin 1*, Xiaozhen Zheng 2 1 Deartment of Electronic Engineering, Shanghai Jiao Tong University, China

More information

GEOMETRIC CONSTRAINT SOLVING IN < 2 AND < 3. Department of Computer Sciences, Purdue University. and PAMELA J. VERMEER

GEOMETRIC CONSTRAINT SOLVING IN < 2 AND < 3. Department of Computer Sciences, Purdue University. and PAMELA J. VERMEER GEOMETRIC CONSTRAINT SOLVING IN < AND < 3 CHRISTOPH M. HOFFMANN Deartment of Comuter Sciences, Purdue University West Lafayette, Indiana 47907-1398, USA and PAMELA J. VERMEER Deartment of Comuter Sciences,

More information

PRO: a Model for Parallel Resource-Optimal Computation

PRO: a Model for Parallel Resource-Optimal Computation PRO: a Model for Parallel Resource-Otimal Comutation Assefaw Hadish Gebremedhin Isabelle Guérin Lassous Jens Gustedt Jan Arne Telle Abstract We resent a new arallel comutation model that enables the design

More information

Truth Trees. Truth Tree Fundamentals

Truth Trees. Truth Tree Fundamentals Truth Trees 1 True Tree Fundamentals 2 Testing Grous of Statements for Consistency 3 Testing Arguments in Proositional Logic 4 Proving Invalidity in Predicate Logic Answers to Selected Exercises Truth

More information

Relations with Relation Names as Arguments: Algebra and Calculus. Kenneth A. Ross. Columbia University.

Relations with Relation Names as Arguments: Algebra and Calculus. Kenneth A. Ross. Columbia University. Relations with Relation Names as Arguments: Algebra and Calculus Kenneth A. Ross Columbia University kar@cs.columbia.edu Abstract We consider a version of the relational model in which relation names may

More information

EE678 Application Presentation Content Based Image Retrieval Using Wavelets

EE678 Application Presentation Content Based Image Retrieval Using Wavelets EE678 Alication Presentation Content Based Image Retrieval Using Wavelets Grou Members: Megha Pandey megha@ee. iitb.ac.in 02d07006 Gaurav Boob gb@ee.iitb.ac.in 02d07008 Abstract: We focus here on an effective

More information

Lecture 8: Orthogonal Range Searching

Lecture 8: Orthogonal Range Searching CPS234 Comutational Geometry Setember 22nd, 2005 Lecture 8: Orthogonal Range Searching Lecturer: Pankaj K. Agarwal Scribe: Mason F. Matthews 8.1 Range Searching The general roblem of range searching is

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Distrib. Comut. 71 (2011) 288 301 Contents lists available at ScienceDirect J. Parallel Distrib. Comut. journal homeage: www.elsevier.com/locate/jdc Quality of security adatation in arallel

More information

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation To aear in IEEE VLSI Test Symosium, 1997 SITFIRE: Scalable arallel Algorithms for Test Set artitioned Fault Simulation Dili Krishnaswamy y Elizabeth M. Rudnick y Janak H. atel y rithviraj Banerjee z y

More information

Simulating Ocean Currents. Simulating Galaxy Evolution

Simulating Ocean Currents. Simulating Galaxy Evolution Simulating Ocean Currents (a) Cross sections (b) Satial discretization of a cross section Model as two-dimensional grids Discretize in sace and time finer satial and temoral resolution => greater accuracy

More information

Hardware-Accelerated Formal Verification

Hardware-Accelerated Formal Verification Hardare-Accelerated Formal Verification Hiroaki Yoshida, Satoshi Morishita 3 Masahiro Fujita,. VLSI Design and Education Center (VDEC), University of Tokyo. CREST, Jaan Science and Technology Agency 3.

More information

Contents 1 Introduction 2 2 Outline of the SAT Aroach Performance View Abstraction View

Contents 1 Introduction 2 2 Outline of the SAT Aroach Performance View Abstraction View Abstraction and Performance in the Design of Parallel Programs Der Fakultat fur Mathematik und Informatik der Universitat Passau vorgelegte Zusammenfassung der Veroentlichungen zur Erlangung der venia

More information

Equality-Based Translation Validator for LLVM

Equality-Based Translation Validator for LLVM Equality-Based Translation Validator for LLVM Michael Ste, Ross Tate, and Sorin Lerner University of California, San Diego {mste,rtate,lerner@cs.ucsd.edu Abstract. We udated our Peggy tool, reviously resented

More information

Directed File Transfer Scheduling

Directed File Transfer Scheduling Directed File Transfer Scheduling Weizhen Mao Deartment of Comuter Science The College of William and Mary Williamsburg, Virginia 387-8795 wm@cs.wm.edu Abstract The file transfer scheduling roblem was

More information

Space-efficient Region Filling in Raster Graphics

Space-efficient Region Filling in Raster Graphics "The Visual Comuter: An International Journal of Comuter Grahics" (submitted July 13, 1992; revised December 7, 1992; acceted in Aril 16, 1993) Sace-efficient Region Filling in Raster Grahics Dominik Henrich

More information

Randomized Selection on the Hypercube 1

Randomized Selection on the Hypercube 1 Randomized Selection on the Hyercube 1 Sanguthevar Rajasekaran Det. of Com. and Info. Science and Engg. University of Florida Gainesville, FL 32611 ABSTRACT In this aer we resent randomized algorithms

More information

A Model-Adaptable MOSFET Parameter Extraction System

A Model-Adaptable MOSFET Parameter Extraction System A Model-Adatable MOSFET Parameter Extraction System Masaki Kondo Hidetoshi Onodera Keikichi Tamaru Deartment of Electronics Faculty of Engineering, Kyoto University Kyoto 66-1, JAPAN Tel: +81-7-73-313

More information

A Novel Iris Segmentation Method for Hand-Held Capture Device

A Novel Iris Segmentation Method for Hand-Held Capture Device A Novel Iris Segmentation Method for Hand-Held Cature Device XiaoFu He and PengFei Shi Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200030, China {xfhe,

More information

arxiv: v1 [cs.dc] 13 Nov 2018

arxiv: v1 [cs.dc] 13 Nov 2018 Task Grah Transformations for Latency Tolerance arxiv:1811.05077v1 [cs.dc] 13 Nov 2018 Victor Eijkhout November 14, 2018 Abstract The Integrative Model for Parallelism (IMP) derives a task grah from a

More information

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan Key Laboratory of Comuter System and Architecture Institute

More information

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1 10 File System 1 We will examine this chater in three subtitles: Mass Storage Systems OERATING SYSTEMS FILE SYSTEM 1 File System Interface File System Imlementation 10.1.1 Mass Storage Structure 3 2 10.1

More information

The Anubis Service. Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL June 8, 2005*

The Anubis Service. Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL June 8, 2005* The Anubis Service Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL-2005-72 June 8, 2005* timed model, state monitoring, failure detection, network artition Anubis is a fully

More information

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data Efficient Processing of To-k Dominating Queries on Multi-Dimensional Data Man Lung Yiu Deartment of Comuter Science Aalborg University DK-922 Aalborg, Denmark mly@cs.aau.dk Nikos Mamoulis Deartment of

More information

TOPP Probing of Network Links with Large Independent Latencies

TOPP Probing of Network Links with Large Independent Latencies TOPP Probing of Network Links with Large Indeendent Latencies M. Hosseinour, M. J. Tunnicliffe Faculty of Comuting, Information ystems and Mathematics, Kingston University, Kingston-on-Thames, urrey, KT1

More information

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets An imroved algorithm for Hausdorff Voronoi diagram for non-crossing sets Frank Dehne, Anil Maheshwari and Ryan Taylor May 26, 2006 Abstract We resent an imroved algorithm for building a Hausdorff Voronoi

More information

12) United States Patent 10) Patent No.: US 6,321,328 B1

12) United States Patent 10) Patent No.: US 6,321,328 B1 USOO6321328B1 12) United States Patent 10) Patent No.: 9 9 Kar et al. (45) Date of Patent: Nov. 20, 2001 (54) PROCESSOR HAVING DATA FOR 5,961,615 10/1999 Zaid... 710/54 SPECULATIVE LOADS 6,006,317 * 12/1999

More information

Privacy Preserving Moving KNN Queries

Privacy Preserving Moving KNN Queries Privacy Preserving Moving KNN Queries arxiv:4.76v [cs.db] 4 Ar Tanzima Hashem Lars Kulik Rui Zhang National ICT Australia, Deartment of Comuter Science and Software Engineering University of Melbourne,

More information

Lecture 3: Geometric Algorithms(Convex sets, Divide & Conquer Algo.)

Lecture 3: Geometric Algorithms(Convex sets, Divide & Conquer Algo.) Advanced Algorithms Fall 2015 Lecture 3: Geometric Algorithms(Convex sets, Divide & Conuer Algo.) Faculty: K.R. Chowdhary : Professor of CS Disclaimer: These notes have not been subjected to the usual

More information

Implementations of Partial Document Ranking Using. Inverted Files. Wai Yee Peter Wong. Dik Lun Lee

Implementations of Partial Document Ranking Using. Inverted Files. Wai Yee Peter Wong. Dik Lun Lee Imlementations of Partial Document Ranking Using Inverted Files Wai Yee Peter Wong Dik Lun Lee Deartment of Comuter and Information Science, Ohio State University, 36 Neil Ave, Columbus, Ohio 4321, U.S.A.

More information

Compiling for Heterogeneous Systems: A Survey and an. Approach

Compiling for Heterogeneous Systems: A Survey and an. Approach Comiling for Heterogeneous Systems: A Survey and an Aroach Kathryn S. McKinley, J. Eliot B. Moss, Sharad K. Singhai, Glen E. Weaver, Charles C. Weems CMPSCI Techincal Reort 95-59 Setember 1995 Deartment

More information

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, ith GPU imlementations Akihiko Kasagi, Koji Nakano, and Yasuaki Ito Deartment of Information Engineering Hiroshima

More information

Implementation of Evolvable Fuzzy Hardware for Packet Scheduling Through Online Context Switching

Implementation of Evolvable Fuzzy Hardware for Packet Scheduling Through Online Context Switching Imlementation of Evolvable Fuzzy Hardware for Packet Scheduling Through Online Context Switching Ju Hui Li, eng Hiot Lim and Qi Cao School of EEE, Block S Nanyang Technological University Singaore 639798

More information

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Min Hu, Saad Ali and Mubarak Shah Comuter Vision Lab, University of Central Florida {mhu,sali,shah}@eecs.ucf.edu Abstract Learning tyical

More information

Source-to-Source Code Generation Based on Pattern Matching and Dynamic Programming

Source-to-Source Code Generation Based on Pattern Matching and Dynamic Programming Source-to-Source Code Generation Based on Pattern Matching and Dynamic Programming Weimin Chen, Volker Turau TR-93-047 August, 1993 Abstract This aer introduces a new technique for source-to-source code

More information

Randomized algorithms: Two examples and Yao s Minimax Principle

Randomized algorithms: Two examples and Yao s Minimax Principle Randomized algorithms: Two examles and Yao s Minimax Princile Maximum Satisfiability Consider the roblem Maximum Satisfiability (MAX-SAT). Bring your knowledge u-to-date on the Satisfiability roblem. Maximum

More information

Design Trade-offs in Customized On-chip Crossbar Schedulers

Design Trade-offs in Customized On-chip Crossbar Schedulers J Sign Process Syst () 8:9 8 DOI.7/s-8--x Design Trade-offs in Customized On-chi Crossbar Schedulers Jae Young Hur Stehan Wong Todor Stefanov Received: October 7 / Revised: June 8 / cceted: ugust 8 / Published

More information

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal International Journal of Information and Electronics Engineering, Vol. 1, No. 1, July 011 An Efficient VLSI Architecture for Adative Rank Order Filter for Image Noise Removal M. C Hanumantharaju, M. Ravishankar,

More information

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY Andrew Lam 1, Steven J.E. Wilton 1, Phili Leong 2, Wayne Luk 3 1 Elec. and Com. Engineering 2 Comuter Science

More information

Optimization of Collective Communication Operations in MPICH

Optimization of Collective Communication Operations in MPICH To be ublished in the International Journal of High Performance Comuting Alications, 5. c Sage Publications. Otimization of Collective Communication Oerations in MPICH Rajeev Thakur Rolf Rabenseifner William

More information

S16-02, URL:

S16-02, URL: Self Introduction A/Prof ay Seng Chuan el: Email: scitaysc@nus.edu.sg Office: S-0, Dean s s Office at Level URL: htt://www.hysics.nus.edu.sg/~hytaysc I was a rogrammer from to. I have been working in NUS

More information

CMSC 425: Lecture 16 Motion Planning: Basic Concepts

CMSC 425: Lecture 16 Motion Planning: Basic Concepts : Lecture 16 Motion lanning: Basic Concets eading: Today s material comes from various sources, including AI Game rogramming Wisdom 2 by S. abin and lanning Algorithms by S. M. LaValle (Chats. 4 and 5).

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH Jin Lu, José M. F. Moura, and Urs Niesen Deartment of Electrical and Comuter Engineering Carnegie Mellon University, Pittsburgh, PA 15213 jinlu, moura@ece.cmu.edu

More information

PREDICTING LINKS IN LARGE COAUTHORSHIP NETWORKS

PREDICTING LINKS IN LARGE COAUTHORSHIP NETWORKS PREDICTING LINKS IN LARGE COAUTHORSHIP NETWORKS Kevin Miller, Vivian Lin, and Rui Zhang Grou ID: 5 1. INTRODUCTION The roblem we are trying to solve is redicting future links or recovering missing links

More information

in Distributed Systems Department of Computer Science, Keio University into four forms according to asynchrony and real-time properties.

in Distributed Systems Department of Computer Science, Keio University into four forms according to asynchrony and real-time properties. Asynchrony and Real-Time in Distributed Systems Mario Tokoro? and Ichiro Satoh?? Deartment of Comuter Science, Keio University 3-14-1, Hiyoshi, Kohoku-ku, Yokohama, 223, Jaan Tel: +81-45-56-115 Fax: +81-45-56-1151

More information

Distributed Algorithms

Distributed Algorithms Course Outline With grateful acknowledgement to Christos Karamanolis for much of the material Jeff Magee & Jeff Kramer Models of distributed comuting Synchronous message-assing distributed systems Algorithms

More information

Patterned Wafer Segmentation

Patterned Wafer Segmentation atterned Wafer Segmentation ierrick Bourgeat ab, Fabrice Meriaudeau b, Kenneth W. Tobin a, atrick Gorria b a Oak Ridge National Laboratory,.O.Box 2008, Oak Ridge, TN 37831-6011, USA b Le2i Laboratory Univ.of

More information

Extracting Optimal Paths from Roadmaps for Motion Planning

Extracting Optimal Paths from Roadmaps for Motion Planning Extracting Otimal Paths from Roadmas for Motion Planning Jinsuck Kim Roger A. Pearce Nancy M. Amato Deartment of Comuter Science Texas A&M University College Station, TX 843 jinsuckk,ra231,amato @cs.tamu.edu

More information

Control plane and data plane. Computing systems now. Glacial process of innovation made worse by standards process. Computing systems once upon a time

Control plane and data plane. Computing systems now. Glacial process of innovation made worse by standards process. Computing systems once upon a time Classical work Architecture A A A Intro to SDN A A Oerating A Secialized Packet A A Oerating Secialized Packet A A A Oerating A Secialized Packet A A Oerating A Secialized Packet Oerating Secialized Packet

More information

Stereo Disparity Estimation in Moment Space

Stereo Disparity Estimation in Moment Space Stereo Disarity Estimation in oment Sace Angeline Pang Faculty of Information Technology, ultimedia University, 63 Cyberjaya, alaysia. angeline.ang@mmu.edu.my R. ukundan Deartment of Comuter Science, University

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

A Study of Protocols for Low-Latency Video Transport over the Internet

A Study of Protocols for Low-Latency Video Transport over the Internet A Study of Protocols for Low-Latency Video Transort over the Internet Ciro A. Noronha, Ph.D. Cobalt Digital Santa Clara, CA ciro.noronha@cobaltdigital.com Juliana W. Noronha University of California, Davis

More information

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Stehan Baumann, Kai-Uwe Sattler Databases and Information Systems Grou Technische Universität Ilmenau, Ilmenau, Germany

More information

CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE

CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE Petra Surynková Charles University in Prague, Faculty of Mathematics and Physics, Sokolovská 83,

More information

Brief Contributions. A Geometric Theorem for Network Design 1 INTRODUCTION

Brief Contributions. A Geometric Theorem for Network Design 1 INTRODUCTION IEEE TRANSACTIONS ON COMPUTERS, VOL. 53, NO., APRIL 00 83 Brief Contributions A Geometric Theorem for Network Design Massimo Franceschetti, Member, IEEE, Matthew Cook, and Jehoshua Bruck, Fellow, IEEE

More information

Computing the Convex Hull of. W. Hochstattler, S. Kromberg, C. Moll

Computing the Convex Hull of. W. Hochstattler, S. Kromberg, C. Moll Reort No. 94.60 A new Linear Time Algorithm for Comuting the Convex Hull of a Simle Polygon in the Plane by W. Hochstattler, S. Kromberg, C. Moll 994 W. Hochstattler, S. Kromberg, C. Moll Universitat zu

More information

An FPGA Implementation of ( )-Regular Low-Density. Parity-Check Code Decoder

An FPGA Implementation of ( )-Regular Low-Density. Parity-Check Code Decoder An FPGA Imlementation of ( )-Regular Low-Density Parity-Check Code Decoder Tong Zhang ECSE Det., RPI Troy, NY tzhang@ecse.ri.edu Keshab K. Parhi ECE Det., Univ. of Minnesota Minneaolis, MN arhi@ece.umn.edu

More information

Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4

Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 M. Gajbe a A. Canning, b L-W. Wang, b J. Shalf, b H. Wasserman, b and R. Vuduc, a a Georgia Institute of Technology,

More information

AN INTEGER LINEAR MODEL FOR GENERAL ARC ROUTING PROBLEMS

AN INTEGER LINEAR MODEL FOR GENERAL ARC ROUTING PROBLEMS AN INTEGER LINEAR MODEL FOR GENERAL ARC ROUTING PROBLEMS Philie LACOMME, Christian PRINS, Wahiba RAMDANE-CHERIF Université de Technologie de Troyes, Laboratoire d Otimisation des Systèmes Industriels (LOSI)

More information

A Scalable Parallel Sorting Algorithm Using Exact Splitting

A Scalable Parallel Sorting Algorithm Using Exact Splitting A Scalable Parallel Sorting Algorithm Using Exact Slitting Christian Siebert 1,2 and Felix Wolf 1,2,3 1 German Research School for Simulation Sciences, 52062 Aachen, Germany 2 RWTH Aachen University, Comuter

More information

Record Route IP Traceback: Combating DoS Attacks and the Variants

Record Route IP Traceback: Combating DoS Attacks and the Variants Record Route IP Traceback: Combating DoS Attacks and the Variants Abdullah Yasin Nur, Mehmet Engin Tozal University of Louisiana at Lafayette, Lafayette, LA, US ayasinnur@louisiana.edu, metozal@louisiana.edu

More information

Interactive Image Segmentation

Interactive Image Segmentation Interactive Image Segmentation Fahim Mannan (260 266 294) Abstract This reort resents the roject work done based on Boykov and Jolly s interactive grah cuts based N-D image segmentation algorithm([1]).

More information

The Transportation Primitive

The Transportation Primitive Syracuse University SURFACE College of Engineering and Comuter Science - Former Deartments, Centers, Institutes and Projects College of Engineering and Comuter Science 1994 The Transortation Primitive

More information

Figure 8.1: Home age taken from the examle health education site (htt:// Setember 14, 2001). 201

Figure 8.1: Home age taken from the examle health education site (htt://  Setember 14, 2001). 201 200 Chater 8 Alying the Web Interface Profiles: Examle Web Site Assessment 8.1 Introduction This chater describes the use of the rofiles develoed in Chater 6 to assess and imrove the quality of an examle

More information

Improving Parallel-Disk Buffer Management using Randomized Writeback æ

Improving Parallel-Disk Buffer Management using Randomized Writeback æ Aears in Proceedings of ICPP 98 1 Imroving Parallel-Disk Buffer Management using Randomized Writeback æ Mahesh Kallahalla Peter J Varman Detartment of Electrical and Comuter Engineering Rice University

More information

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties Imroved heuristics for the single machine scheduling roblem with linear early and quadratic tardy enalties Jorge M. S. Valente* LIAAD INESC Porto LA, Faculdade de Economia, Universidade do Porto Postal

More information

The degree constrained k-cardinality minimum spanning tree problem: a lexisearch

The degree constrained k-cardinality minimum spanning tree problem: a lexisearch Decision Science Letters 7 (2018) 301 310 Contents lists available at GrowingScience Decision Science Letters homeage: www.growingscience.com/dsl The degree constrained k-cardinality minimum sanning tree

More information

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island A GPU Heterogeneous Cluster Scheduling Model for Preventing Temerature Heat Island Yun-Peng CAO 1,2,a and Hai-Feng WANG 1,2 1 School of Information Science and Engineering, Linyi University, Linyi Shandong,

More information

A Reconfigurable Architecture for Quad MAC VLIW DSP

A Reconfigurable Architecture for Quad MAC VLIW DSP A Reconfigurable Architecture for Quad MAC VLIW DSP Sangwook Kim, Sungchul Yoon, Jaeseuk Oh, Sungho Kang Det. of Electrical & Electronic Engineering, Yonsei University 132 Shinchon-Dong, Seodaemoon-Gu,

More information

I ACCEPT NO RESPONSIBILITY FOR ERRORS ON THIS SHEET. I assume that E = (V ).

I ACCEPT NO RESPONSIBILITY FOR ERRORS ON THIS SHEET. I assume that E = (V ). 1 I ACCEPT NO RESPONSIBILITY FOR ERRORS ON THIS SHEET. I assume that E = (V ). Data structures Sorting Binary heas are imlemented using a hea-ordered balanced binary tree. Binomial heas use a collection

More information

ISO/IEC , the standard for Modula-2: Changes, Clarications and Additions. June 27, 1996

ISO/IEC , the standard for Modula-2: Changes, Clarications and Additions. June 27, 1996 ISO/IEC 10514-1, the standard for Modula-2: Changes, Clarications and Additions M. Schonhacker Vienna University of Technology Austria schoenhacker@eiunix.tuwien.ac.at C. Pronk Delft University of Technology

More information

aclass: ClassDeclaration (name=name) aspectstatement: Statement (place=name) methods method: MethodDeclaration

aclass: ClassDeclaration (name=name) aspectstatement: Statement (place=name) methods method: MethodDeclaration Asect Weaving with Grah Rewriting Uwe A mann and Andreas Ludwig Institut f r Programmstrukturen und Datenorganisation Universit t Karlsruhe, RZ, Postfach 6980, 76128 Karlsruhe, Germany (assmannjludwig)@id.info.uni-karlsruhe.de

More information

An empirical analysis of loopy belief propagation in three topologies: grids, small-world networks and random graphs

An empirical analysis of loopy belief propagation in three topologies: grids, small-world networks and random graphs An emirical analysis of looy belief roagation in three toologies: grids, small-world networks and random grahs R. Santana, A. Mendiburu and J. A. Lozano Intelligent Systems Grou Deartment of Comuter Science

More information

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScrit Objects Shiyi Wei and Barbara G. Ryder Deartment of Comuter Science, Virginia Tech, Blacksburg, VA, USA. {wei,ryder}@cs.vt.edu

More information

Single character type identification

Single character type identification Single character tye identification Yefeng Zheng*, Changsong Liu, Xiaoqing Ding Deartment of Electronic Engineering, Tsinghua University Beijing 100084, P.R. China ABSTRACT Different character recognition

More information