Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform


Uzi Vishkin    George C. Caragea    Bryant Lee
April 2006
University of Maryland, College Park, MD
UMIACS-TR

Justin Rattner, CTO, Intel, Electronic News, March 13, 2006: "It is better for Intel to get involved in this now so when we get to the point of having 10s and 100s of cores we will have the answers. There is a lot of architecture work to do to release the potential, and we will not bring these products to market until we have good solutions to the programming problem." [underline added]

Abstract

A bold vision that guided this work is as follows: (i) a parallel algorithms and programming course could become a standard course in every undergraduate computer science program, and (ii) this course could be coupled with a so-called PRAM-On-Chip architecture, a commodity high-end multi-core computer architecture. In fact, the current paper is a tutorial on how to convert PRAM algorithms into efficient PRAM-On-Chip programs. Coupled with a text on PRAM algorithms as well as an available PRAM-On-Chip tool-chain, comprising a compiler and a simulator, the paper provides the missing link for upgrading a standard theoretical PRAM algorithms class to a parallel algorithms and programming class. Having demonstrated that such a course could cover similar programming projects and material to what is covered by a typical first serial algorithms and programming course, the paper suggests that parallel programming in the emerging multi-core era does not need to be more difficult than serial programming. If true, a powerful answer to the so-called parallel programming open problem is being provided. This open problem is currently the main stumbling block for the industry in getting the upcoming generation of multi-core architectures to improve single task completion time using easy-to-program application programmer interfaces.
Known constraints of this open problem, such as backwards compatibility on serial code, are also addressed by the overall approach. More concretely, a widely used methodology for advancing parallel algorithmic thinking into parallel algorithms is revisited, and is extended into a methodology for advancing parallel algorithms to PRAM-On-Chip programs. A performance cost model for the PRAM-On-Chip is also presented. It uses as complexity metrics the length of sequence of round trips to memory (LSRTM) and queuing delay (QD) from memory access queues, in addition to standard PRAM computation costs. Highlighting the importance of LSRTM in determining performance is another contribution of the paper. Finally, some alternatives to PRAM algorithms, which, on one hand, are easier-to-think, but, on the other hand, suppress more architecture details, are also discussed.

1 Introduction

Parallel programming is currently a difficult task. Current methods tend to be coarse-grained and use either a shared memory or a message passing model. These methods often require the programmer to think in a way that takes into account details of memory layout or architectural implementation. It has been a common

[Partially supported by NSF grant]

sentiment that the development of an easy way for parallel programming would be a major breakthrough; see, e.g., Culler and Singh [CS99]. Indeed, to date the outreach of parallel computing has fallen short of historical expectations. Overall, there is a strong renewed interest in inventing new programming languages that accommodate simple representation of concurrency. However, during the previous decades thousands of papers have been written on this topic. This effort brought about a fierce debate among a considerable number of schools of thought. One of these approaches, the PRAM approach, emerged as a clear winner in this battle of ideas. In fact, we would like to defend an even stronger premise: had a parallel architecture that can look to the performance programmer like a PRAM been feasible in the early 1990s, its parallel programming approach would have become common knowledge and the prevailing standard by now. As evidence to support this premise we point out that 3 of the main algorithms textbooks (taught in standard undergraduate computer science courses everywhere by 1990) [Baa88, CLR90, Man89] chose to include large chapters on PRAM algorithms. The PRAM was the model of choice for parallel algorithms in all major algorithms/theory communities and was taught everywhere. The only reason that this win did not register in the collective memory as the clear and decisive victory it really is, is that, at about the same time (early 1990s), it became clear that it would not be possible to build such a machine (i.e., one that can look to the performance programmer as a PRAM) using early 1990s technology. The Parallel Random Access Model (PRAM) is an easy model for parallel algorithmic thinking and for programming. It abstracts away architecture details by assuming that many memory accesses to a shared memory can be satisfied within the same time as a single access. As noted above, the PRAM was developed during the 1980s and 1990s in anticipation of a parallel programmability challenge.
It provides the second largest algorithmic knowledge base, right next to the standard serial knowledge base. With the continuing increase of silicon capacity, it becomes possible to build a single-chip parallel processor. Such a demonstration has been the purpose of the Explicit Multi-Threading (XMT) project [VDBN98, NNTV03] that seeks to prototype a PRAM-On-Chip vision, as on-chip interconnection networks provide enough bandwidth for connecting processors to memories. Thread-level parallelism (TLP) allows multiple threads of execution to proceed concurrently. There is a long record of compiler efforts for parallelizing serial code; two representative efforts are [AALT95, ACK87]. While there have been some success stories, it is now recognized that automatic parallelization by compilers is generally insufficient. The PRAM-On-Chip platform, to be discussed later in the current paper, is quite broad. The current paper will focus on a thread-level parallelism (TLP) approach for programming it. However, instead of using operating system threads, as in most current systems, threads are defined by the programming language and handled by its implementation. Also, threads are short and the overall objective for multi-threading is reducing single-task completion time. Several multi-chip multiprocessor architectures targeted implementation of PRAM algorithms, or came close to that: (i) The NYU Ultracomputer project sought to approximate the PRAM [AG94], viewing the PRAM as providing a theoretical yardstick for limits of parallelism as opposed to a practical programming model [Sch80]. (ii) The Tera/Cray Multi-Threaded Architecture (MTA) advanced Burton Smith's novel 1978 HEP hardware design. Seeking to hide latencies to memory [SCB+98], each processor has sufficiently many hardware threads (128 was a typical number) that can context-switch quickly. The paper [BCF05] suggests that the MTA is close to a PRAM and may allow more efficient implementation of algorithms with irregular memory access, such as those from graph theory.
Some authors have stated that an MTA with a large number of processors looks almost like a PRAM [CFS99]. (iii) The SB-PRAM may be the first multi-chip multiprocessor architecture whose declared objective was to provide emulation of the PRAM [KKT00]. It allows writing computer programs that are similar to the original PRAM algorithms. A 64-processor prototype has been built [DKP02]. (iv) Although a language rather than an architecture, NESL also made a contribution to implementing PRAM algorithms by making the algorithms easier to express using the NESL functional language [Ble96]. NESL programs are compiled and run on standard multi-chip parallel architectures. However, the fact remains that the PRAM theory has generally not reached out beyond the ivory towers of academia. For example, the jury is still out on whether the PRAM can provide an effective abstraction for a proper design of multi-chip multi-processors. The main difficulty [CS99] appears to be the limits on the bandwidth of such a multi-chip architecture.

More of the case for a lower hanging fruit, PRAM-On-Chip, is presented next. Guided by the fact that the number of transistors on a chip already exceeds one billion, up from less than 30,000 circa 1980, and keeps growing, the main insight behind PRAM-On-Chip is as follows. The billion-transistor chip era allows for the first time a low-overhead on-chip multi-processor, thereby avoiding concerns regarding the higher overhead of multi-chip multiprocessors. It also allows an evolutionary path from serial computing. The drastic recent slowdown in clock rate improvement for commodity processors will force vendors to seek single-task performance improvements through parallelism. While some have already noted likely growth to 100-core chips by 2015, they are yet to choose programming languages and architectures for harnessing these enormous hardware resources toward single-task completion time. PRAM-On-Chip addresses these issues. Some key differences between the PRAM-On-Chip and the above multi-chip approaches are: (i) its larger bandwidth, benefiting from the on-chip environment; (ii) lower latencies to shared memory, since an on-chip approach allows on-chip shared caches; (iii) effective support for serial code; this may be needed for backward compatibility for serial programs, or for serial sections in PRAM-like programs; (iv) effective support for parallel execution where the amount of parallelism is low; certain algorithms (e.g., breadth-first search (BFS) on graphs, presented later) have particularly simple parallel algorithms, some being only a minor variation of the serial algorithm; since they may not offer sufficient parallelism for some multi-chip architectures, such important algorithms had no merit for these architectures; and (v) PRAM-On-Chip introduced a so-called Independence of Order Semantics (IOS): each thread executes at its own pace and any ordering of interactions among threads is valid.
If more than one thread seeks to write to the same shared variable, the outcome is in line with the PRAM Arbitrary CRCW convention (see Section 2.1). This feature improves performance, as it allows processing with whatever data is available at the processing elements, and saves power, as it reduces synchronization needs. The feature could have been added to multi-chip approaches, providing some, but apparently not all, of the benefits. Other PRAM-related approaches tended to emphasize competition with (massively parallel) parallel computing approaches and have not paid that much attention to serial code, serial mode in a parallel program, or even parallel execution where the amount of parallelism is low. The approach could also support standard application programming interfaces (APIs) such as those used for graphics (e.g., OpenGL) or circuit design (e.g., VHDL). Use of high-level APIs can allow automatic extraction of much more parallelism than from code written for performance programming languages such as C. With an effective implementation of such an API for a PRAM-On-Chip (see Figure 17.b), an application programmer could take advantage of parallel hardware with few or no changes to an existing API. See [GV06] for a recent example of speedups exceeding a hundred fold over serial computing for gate-level VHDL simulations on PRAM-On-Chip. The main contribution of this paper is presenting a programming methodology for converting PRAM algorithms to PRAM-On-Chip programs. An overview of some alternatives to PRAM algorithms, which are easier-to-think but suppress more architecture details, is also given. Performance models used in developing a PRAM-On-Chip algorithm are described in Section 2. An example of using the models is given in Section 3. Section 4 explains compiler optimizations that could affect the actual execution of programs. Section 5 gives another example of applying the models, to the prefix sums problem. Section 6 presents Breadth-First Search (BFS) in the PRAM-On-Chip programming model.
Section 7 explains the application of compiler optimizations to BFS and compares the performance of several BFS implementations. Section 8 discusses the Adaptive Bitonic Sorting algorithm and its implementation, while Section 9 introduces a variant of Sample Sort that runs on a PRAM-On-Chip. Section 10 discusses matrix-vector multiplication. Some empirical validation of the models is presented in Section 11. We conclude in Section 12.

2 Model descriptions

Given a problem, a recipe for developing an efficient PRAM-On-Chip program from concept to implementation is proposed. In particular, the stages through which such development needs to pass are presented. Figure 1 depicts the proposed methodology. For context, the figure also depicts the widely used Work-Depth methodology for advancing from concept to a PRAM algorithm; namely, the sequence of models in the figure illustrates progression from a high-level description to a PRAM algorithm.

[Figure 1: Proposed Methodology for Developing PRAM-On-Chip Programs in view of the Work-Depth Paradigm for Developing PRAM algorithms. The figure contrasts the original "parallel thinking to PRAM" methodology (High-Level Work-Depth Description, Work-Depth Model, PRAM Model) with the proposed "parallel thinking to PRAM-On-Chip program" methodology (PRAM-On-Chip Programming Model, without or with nesting, and PRAM-On-Chip Execution Model), including a proposed shortcut for PRAM-On-Chip programmers.]

For developing a PRAM-On-Chip implementation, we propose following the sequence of models in the figure: given a specific problem, an algorithm design stage will produce a high-level description of the parallel algorithm; this informal description is fleshed out as a sequence of steps, each comprising a set of concurrent operations. In a first draft, the set of concurrent operations can be implicitly defined; see the BFS example in Section 2.2.1. This first draft is refined to a sequence of steps, each now comprising a sequence of concurrent operations. Such a formal Work-Depth description fully spells out how to advance in a given step whose sequence of concurrent operations includes j operations indexed by the integers 1 through j: it maps each index i, where 1 <= i <= j, to an operation. The programming effort amounts to translating this description into a single-program multiple-data (SPMD) program using a high-level PRAM-On-Chip programming language.
From this SPMD program, a compiler will transform and reorganize the code to achieve the best performance in the target PRAM-On-Chip execution model. As a PRAM-On-Chip programmer gains experience, he/she will be able to skip box 2 (the Work-Depth model) and directly advance from box 1 (high-level Work-Depth description) to box 4 (high-level PRAM-On-Chip program). We also demonstrate some instances where it may be advantageous to skip box 2 because of some features of the programming model (such as some ability to handle nesting of parallelism). In Figure 1 this shortcut is depicted by the arrow from box 1 to box 4. Much of the current paper is devoted to presenting the methodology and demonstrating it. We start by elaborating on each model.

2.1 PRAM Model

The PRAM (for Parallel Random Access Machine, or Model) augments the standard serial model of computation, known as the RAM [AU94], with parallelism. A PRAM consists of p synchronous processors and a global shared memory accessible in unit time from each of the processors. The only means of inter-processor communication is through the shared memory. Different conventions exist regarding concurrent access to the memory, including: (i) exclusive-read exclusive-write (EREW), under which simultaneous access to the same memory location for read or write purposes is forbidden; (ii) concurrent-read exclusive-write (CREW), which allows concurrent reads but not writes; and (iii) concurrent-read concurrent-write (CRCW), where both are permitted and a convention regarding how concurrent writes are resolved is specified. One of these conventions, Arbitrary CRCW, stipulates that concurrent writes into a common memory location result in an arbitrary processor, among those attempting to write, succeeding, but it is not known in advance which

processor. There are quite a few sources for PRAM algorithms, including [JáJ92, KR90, EG88, Vis02]. An algorithm in the PRAM model is described as a sequence of parallel time units, or rounds; each round consists of exactly p instructions to be performed concurrently, one per processor. Producing such a description imposes a significant burden on the algorithm designer. Luckily, this burden can be somewhat mitigated using the Work-Depth methodology.

2.2 The Work-Depth Methodology

Introduced in [SV82], the Work-Depth methodology for designing PRAM algorithms has proved to be quite useful as a framework for describing parallel algorithms and reasoning about their performance. For example, it was used as the description framework in [JáJ92]. The methodology is guided by seeking to optimize two quantities in a parallel algorithm: depth and work. Depth represents the number of steps the algorithm would take if unlimited parallel hardware were available, while work is the total number of operations performed, over all parallel steps. The methodology suggests starting by producing an informal description of the algorithm in a high-level work-depth model (HLWD), and then advancing this description into a fuller presentation in a model of computation called Work-Depth. We proceed to describe these two models next.

2.2.1 High-Level Work-Depth Description

An HLWD description consists of a succession of parallel rounds, each round being a set of any number of instructions to be performed concurrently. Descriptions can come in several flavors, and even implicit descriptions, where the number of instructions is not obvious, are acceptable. Example: Given an undirected graph G(V, E), where the length of every edge in E is 1, and a source node s in V, the breadth-first search (BFS) algorithm finds the lengths of the shortest paths from s to every node in V. An informal work-depth description of the parallel BFS algorithm can look as follows.
Suppose that V, the set of vertices of the graph G, is partitioned into layers, where layer L_i includes all vertices of V whose shortest path from s includes exactly i edges. The algorithm works in iterations. In iteration i, layer L_i is found. Iteration 0: node s forms layer L_0. Iteration i, i > 0: assume inductively that layer L_{i-1} has already been found. In parallel, consider all the edges (u, v) that have an endpoint u in layer L_{i-1}; if v is not in a layer L_j, j < i, it must be in layer L_i. As more than one edge may lead from a vertex in layer L_{i-1} to v, vertex v is marked as belonging to layer L_i by one of these edges, using the arbitrary concurrent-write convention. This ends an informal, high-level work-depth verbal description. A pseudocode description of an iteration of this algorithm could look as follows:

for all vertices v in L(i) pardo
    for all edges e = (v, w) pardo
        if w unvisited
            mark w as part of L(i+1)

The above HLWD description challenges us to try to find an efficient PRAM implementation for an iteration. Namely, given a p-processor PRAM, how do we allocate processors to tasks to finish all operations of an iteration as quickly as possible? As noted earlier, a more detailed description in the Work-Depth model would address these issues.

2.2.2 Work-Depth Model

In the Work-Depth model, the description is to be cast in terms of successive time steps, where the concurrent operations in a time step form a sequence; each element in the sequence is indexed by a different index between 1 and the number of operations in the step. The Work-Depth model is formally equivalent to the PRAM. For example, a work-depth algorithm with T(n) depth (or time) and W(n) work runs on a p-processor PRAM in at most T(n) + W(n)/p time steps. The simple equivalence proof follows Brent's scheduling principle, which

was introduced in [Bre74] for a model of parallel computation that was much more abstract than the PRAM (counting arithmetic operations, but suppressing anything else). For example, summing n numbers has W(n) = O(n) and T(n) = O(log n), and hence runs on a p-processor PRAM in O(n/p + log n) time steps. Example (continued): We only note here the challenge of coming up with a Work-Depth description for the BFS algorithm. The challenge would be to find a way of listing in a single sequence all the edges that have as an endpoint a vertex of layer L_i. In other words, the Work-Depth model does not allow us to leave nesting of parallelism unresolved. On the other hand, PRAM-On-Chip programming should allow nesting, since this mechanism provides an easy way for parallel programming. It is also important to note that the PRAM-On-Chip architecture includes some limited support for nesting of parallelism. The way in which we suggest to resolve this problem is as follows. The ideal long-term solution is: (a) allow the programmer free, unlimited use of nesting; (b) have it implemented as efficiently as possible by the compiler; and (c) make the programmer (especially the "performance programmer") fully aware of the cost of using nesting. However, since our compiler is not yet mature enough to handle this matter, our tentative short-term solution is presented in Section 6, which shows how to build on the support for nesting provided by the architecture. There is merit to this manual solution beyond its tentative role until the compiler matures: it should still be taught (even after the ideal compiler solution is in place) in order to explain the cost of nesting to programmers. The reason for bringing this issue up this early in the discussion is that it suggests that our methodology does not necessarily need to make a complete stop at the Work-Depth model, but can perhaps bypass it and proceed directly to the PRAM-like programming methodology.

2.3 PRAM-on-chip Programming Model

The PRAM-on-chip programming model is a framework for a high-level programming language.
It can be used to implement an algorithm described in the Work-Depth presentation model, but as noted before it also offers shortcuts from higher-level descriptions. The overall objective of the programming model is to balance two goals: (i) Programmability: given an algorithm in the HLWD or Work-Depth model, the programmer's effort should be minimized; and (ii) Implementability: effective compiler translation into the PRAM-on-chip execution model should be feasible. A fine-grained, SPMD-type model, in which execution frequently alternates between serial and parallel execution modes, is presented. As illustrated in Figure 2, a Spawn command prompts a switch from serial mode to parallel mode. The Spawn command can specify any number of threads. Ideally, each such thread can proceed until termination (a Join command) without ever having to busy-wait or synchronize with other threads. To facilitate that, an independence of order semantics (IOS) was introduced: the programmer can use commands (e.g., prefix-sum) that permit threads to proceed even if they try to write into the same memory location. This was inspired by the PRAM arbitrary concurrent-write convention noted earlier. The following are some of the primitives in the PRAM-on-chip programming model:

Spawn instruction. Used to start a parallel section. Accepts as parameter the number of parallel threads to start.

Thread-id. A special variable name used inside a parallel section, which evaluates to the thread ID. This allows SPMD-style programming.

Prefix-sum instruction. The prefix-sum instruction defines an atomic operation. Operating on two variables, a base variable B and an increment variable R, the result of a prefix-sum is that B gets the value B + R, while R gets the original value of B. Some interesting uses of the prefix-sum instruction arise when several concurrent threads use it with respect to the same base. It provides a tool for implementing IOS as well as for inter-thread coordination.
While the basic definition of prefix-sum follows the fetch-and-add of the NYU Ultracomputer [GGK+82], PRAM-On-Chip uses a fast parallel hardware implementation (ps()) if R is from a small range (e.g., one bit) and B can fit in one of a small number of global registers; otherwise, prefix-sums are done using a prefix-sum-to-memory (psm()) instruction and are resolved by queuing at memory.

Nested parallelism. A parallel thread can be programmed to initiate more threads. However, as noted in Section 2.2.2, this comes with some (tentative) restrictions and cost caveats, due to compiler and

[Figure 2: Switching between serial and parallel execution modes (Spawn ... Join ... Spawn ... Join) in the PRAM-on-chip programming model. Each parallel thread executes at its own speed, without ever needing to synchronize with another thread.]

hardware support issues. As illustrated with the breadth-first search example, nesting of parallelism could improve the programmer's ability to describe an algorithm in a clear and concise way. Nesting is discussed in several places in the current paper, including Section 4.1. Note that Figure 1 depicts two alternative PRAM-On-Chip programming models: without nesting and with nesting. The Work-Depth model maps directly into the programming model without nesting. Allowing nesting could make it easier to turn a description in the High-Level Work-Depth model into a program. Since our current embodiment of PRAM-On-Chip is called XMT, for Explicit Multi-Threading, we call the illustration of this programming model XMTC. XMTC is a superset of the language C, obtained from it by adding structures for the above primitives.

Examples of XMTC code. Several examples of actual implementations of PRAM algorithms using XMTC are presented in Figure 3. While each of these programs is discussed in greater detail in the following sections, the purpose of the figure is to convey, to readers familiar with other parallel programming frameworks, the relative conciseness of these programs. Some language constructs, such as variable and function declarations, have been left out in this figure, but they would need to be included in a valid XMTC program. Next, the language features of XMTC are demonstrated using the array compaction problem, presented in Figure 3.a: given an array of integers T[0..n-1], copy all its non-zero elements into another array S; any order will do. The special variable $ denotes the thread-id. The command spawn(0,n-1) spawns n threads whose ids are the integers in the range 0...n-1.
The ps(increment, length) instruction executes an atomic prefix-sum command using length as the base and increment as the increment value. The variable increment is local to a thread, while length is a global variable which will hold the number of non-zero elements copied at the end of the spawn block. Variables declared inside a spawn block are local to each thread, and are usually much faster to access than the shared memory.(1)

To evaluate performance in this model, a language-based performance model is used: performance costs are assigned to each primitive instruction in the language, and rules are specified for combining them into expressions. Such performance modeling was used by Aho and Ullman [AU94] and was generalized for parallelism by Blelloch [Ble96]. The paper [DV00] used language-based modeling for studying parallel list ranking relative to an earlier performance model for XMT.

2.4 PRAM-on-chip Execution Model

The execution model depends heavily on particulars of the PRAM-on-chip implementation. For illustration purposes, we will use the XMT PRAM-on-chip platform (see [NNTV03]). A bird's eye view of XMT is presented in Figure 4. A number of (say 1024) Thread Control Units (TCUs) are grouped into (say 64) clusters. Clusters are connected to the memory subsystem by a high-throughput, low-latency interconnection network; they also interface with specialized units such as the prefix-sum unit and global registers. A hash function is applied to memory addresses in order to provide better load balancing over the shared memory modules. An important component of a cluster is the read-only cache included at cluster level; this is used to store values read from memory by a TCU and also holds the values read by prefetch instructions. The memory system consists of memory modules, each having several levels of cache memories.

(1) On XMT, local thread variables are typically stored in local registers of the executing hardware thread control unit (TCU).
The programmer is encouraged to use local variables to store frequently used values; this type of optimization can also be performed by an optimizing compiler.

(a) Array compaction

length = 0;
spawn(0, n-1) {              // start one thread per array element
    int increment = 1;
    if (T[$] != 0) {
        // execute prefix-sum to allocate one entry in array S
        ps(increment, length);
        S[increment] = T[$];
    }
}

(b) k-ary Tree Summation

// Input:  N numbers in sum[0..N-1]; the sum array is a 1-D complete
//         tree representation (see the Summation section)
// Output: the sum of the numbers, in sum[0]
level = 0;
// process the levels of the tree from leaves to root
while (level < log_k(N)) {
    level++;
    spawn(current_level_start_index, current_level_end_index) {
        int count, local_sum = 0;
        for (count = 0; count < k; count++)
            local_sum += sum[k*$ + count + 1];
        sum[$] = local_sum;
    }
}

(c) k-ary Tree Prefix-Sums

// Input:  N numbers in sum[0..N-1]
// Output: the prefix-sums of the numbers, in
//         prefix_sum[offset_to_1st_leaf .. offset_to_1st_leaf + N - 1];
//         the prefix_sum array is a 1-D complete tree representation
//         (see Summation)
kary_tree_summation(sum);    // run the k-ary tree summation algorithm
prefix_sum[0] = 0;
level = log_k(N);
while (level > 0) {          // all levels from root to leaves
    spawn(current_level_start_index, current_level_end_index) {
        int count, local_ps = prefix_sum[$];
        for (count = 0; count < k; count++) {
            prefix_sum[k*$ + count + 1] = local_ps;
            local_ps += sum[k*$ + count + 1];
        }
    }
    level--;
}

(d) Breadth-First Search

// Input:  graph G = (E, V) using adjacency lists (see the Programming
//         BFS section)
// Output: distance[N], the distance from the start vertex to each vertex
// Uses:   level[L][N], the sets of vertices at each BFS level
// run prefix-sums on the degrees to determine the position of the
// start edge for each vertex
start_edge = kary_prefix_sums(degrees);
level[0] = start_node; i = 0;
while (level[i] not empty) {
    spawn(0, level_size[i] - 1) {   // one thread per vertex in level[i]
        v = level[i][$];            // read one vertex
        spawn(0, degree[v] - 1) {   // one thread per edge of each vertex
            int w = edges[start_edge[v] + $][2];  // read one edge (v,w)
            psm(gatekeeper[w], 1);  // check the gatekeeper of end vertex w
            if gatekeeper[w] was 0 {
                psm(level_size[i+1], 1);  // allocate one entry in level[i+1]
                store w in level[i+1];
            }
        }
    }
    i++;
}

(e) Sparse Matrix - Dense Vector Multiplication

// Input:  vector b[n]; sparse matrix A[m][n], given in Compact Sparse
//         Row form, as in Figure 12
// Output: vector c[m] = A*b
spawn(0, m) {                 // start one thread for each row of A
    int row_start = row[$], elements_on_row = row[$+1] - row_start;
    spawn(0, elements_on_row - 1) {  // one thread per non-zero element on row
        // compute A[i][j]*b[j] for all non-zero elements on the current row
        tmp_sum[$] = values[row_start + $] * b[columns[row_start + $]];
    }
    c[$] = kary_tree_summation(tmp_sum[0 .. elements_on_row - 1]);  // sum up
}

Figure 3: Implementation of some PRAM algorithms in the XMT PRAM-on-chip framework, to demonstrate compactness.

[Figure 4: An overview of the XMT PRAM-on-chip Architecture.]

In general, each logical memory address can reside in only one memory module, alleviating cache coherence problems. This explains why only read-only caches are used at the clusters. The Master TCU runs serial code, or more generally the serial mode of XMT. When it hits a Spawn command, it initiates parallel mode by broadcasting the same SPMD parallel code segment to all the TCUs. As each TCU captures its copy, it executes it based on a thread-id assigned to it. A separate distributed hardware system, reported in [NNTV03] but not shown in Figure 4, ensures that all the thread-ids mandated by the current Spawn command are allocated to the TCUs. A sufficient part of this allocation is done dynamically, to ensure that no TCU needs to execute more than one thread-id while another TCU is already idle. A program in the high-level PRAM-on-chip programming model needs to be translated by an optimizing compiler in order to take advantage of features of the architecture. A program in the execution model could include prefetch instructions, as well as broadcast instructions, where some values needed by all, or nearly all, TCUs are broadcast to all. More advanced optimizations, such as combining shorter virtual threads into a longer thread (a mechanism called thread clustering), are also considered at this optimization stage. If the programming model allows nested parallelism, the compiler will use the mechanisms supported by the architecture to implement or emulate it. Compiler optimizations and issues such as nesting and thread clustering are discussed in Section 4. To evaluate the performance of a program in this model, we use an extension of the notions of work and depth to include measurements appropriate for an execution model, and then proceed to give a formula for estimating execution time based on them.
The depth of an application in the PRAM-on-chip Execution model must include the following three quantities: (i) Computation Depth, given by the number of operations that have to be performed sequentially, either by a thread or while in serial mode. (ii) Length of Sequence of Round-Trips to Memory (or LSRTM), which represents the number of cycles on the critical path spent by execution units waiting for data from memory. A read request from a TCU usually causes a round-trip to memory (or RTM); memory writes in general proceed without acknowledgment and are thus not counted as round-trips, but ending a parallel section implies one RTM, used to flush all the data still in the interconnection network to the memory. (iii) Queuing Delay (or QD), which is caused by concurrent requests to the same memory location; the response time is proportional to the size of the queue. The prefix-sum ps() primitive is supported by a special hardware unit that combines ps() calls from multiple threads into a single multi-operand prefix-sum operation. In one thread, a ps() instruction causes one RTM and no queuing delay. In addition, a prefix-sum-to-memory instruction, psm(), is supported. Its syntax is similar to the ps() instruction except that the base variable is a memory location instead of a global register. This instruction is executed by queued updates to the memory location rather than by special hardware, due to the difficulty of creating multi-operand hardware that would operate on arbitrary memory locations. The psm() command costs 1 RTM and additionally has a queuing delay equal to the number of threads calling psm() on the same location. We can now define the PRAM-on-chip execution depth and execution time. PRAM-On-Chip Execution Depth represents the time spent on the critical path (that is, the time assuming an unlimited amount of hardware) and is the sum of the PRAM computation depth, LSRTM, and QD on the critical path. Assuming that a round-trip to memory takes R cycles:

Execution Depth = Computation Depth + LSRTM * R + QD    (1)

Sometimes more Work (the total number of instructions executed) can be executed in parallel than what the hardware can handle concurrently. For the additional time spent executing operations outside the critical path (i.e. beyond the Execution Depth), the work of each parallel section needs to be considered separately. Suppose that one such parallel section could employ in parallel up to p_i TCUs, and let Work_i = p_i * ComputationDepth_i be the total computation work of parallel section i. If our architecture has p TCUs and p_i < p, we will be able to use only p_i of them, while if p_i >= p, only p TCUs can be used to start the threads, and the remaining p_i - p threads will be allocated to TCUs as they become available; each concurrent allocation of p threads to TCUs is charged as one RTM to the Execution Time, as denoted by relation 2. The total time spent executing instructions outside the critical path over all parallel sections is given in relation 3.
ThreadStartOverhead_i = (p_i / p) * R    (2)

Additional Work = sum over spawn blocks i of [ Work_i / min(p, p_i) + ThreadStartOverhead_i ]    (3)

Adding up, the execution time of the entire program is:

Execution Time = Execution Depth + Additional Work    (4)

2.5 Clarifications of the modeling

Our model of performance attempts to distill the major factors affecting runtime specifically for the PRAM-On-Chip platform. The performance modeling for PRAM-On-Chip has the advantage of being close to the Work-Depth algorithmic framework, with additional accounting for memory costs using the LSRTM and QD. First, we would like to present a somewhat subtle point: following the path from the HLWD model to the PRAM-On-Chip models in Figure 1 may be important not only for the purpose of developing a PRAM-On-Chip program, but also for optimizing performance. Note that bandwidth is not accounted for in the PRAM-On-Chip performance modeling, since a PRAM-On-Chip architecture should be able to provide sufficient bandwidth for an algorithm that is efficient in the Work-Depth model. In other words, the only way in which our modeling accounts for bandwidth is indirect: by first screening an algorithm through the Work-Depth performance modeling, where we account for work. Let us examine what could happen if PRAM-On-Chip performance modeling is not coupled with Work-Time performance modeling. The program could include excessive speculative prefetching to supposedly improve performance (reduce LSRTM). The subtle point is that the extra prefetches add to the overall work count; in other words, accounting for them in the Work-Depth model prevents this loophole. It is also important to recognize that the model abstracts away some significant details. The PRAM-On-Chip hardware has a limited number of memory modules, and if multiple requests attempt to access the same module, queuing will occur. Although the model accounts for queuing to the same memory location, it does not account for queuing that may occur for accesses to different locations (in the same module).
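To make relations (1)-(3) concrete, the cost estimate can be sketched in a few lines of C. The function names and the truncating integer arithmetic for the p_i/p term are our illustrative choices, not part of the model's definition:

```c
#include <assert.h>

#define R 24   /* cycles per round-trip to memory, the value adopted later in the paper */

/* Relation (1): critical-path cycles, given computation depth,
 * number of round-trips on the critical path, and queuing delay. */
int execution_depth(int comp_depth, int lsrtm, int qd) {
    return comp_depth + lsrtm * R + qd;
}

/* One term of relation (3): additional work of a parallel section with
 * p_i potential threads and total work work_i, on p physical TCUs.
 * The (p_i / p) * R part is relation (2), truncated to an int. */
int additional_work(int work_i, int p_i, int p) {
    int usable = p_i < p ? p_i : p;          /* min(p, p_i) */
    int thread_start = (p_i * R) / p;        /* relation (2) */
    return work_i / usable + thread_start;
}
```

Summing additional_work over all spawn blocks and adding execution_depth gives relation (4).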
However, hashing memory addresses among modules lessens the problems that would occur for accesses with high spatial locality and generally mitigates this type of hot spot. If functional units within a cluster are shared between the TCUs, threads can be delayed while waiting for functional units to become available. The model also does not account for these delays. To some limited extent, the effect of these approximations on running times can be observed from the experimental results in section 11, where a comparison with simulations is presented. Similar to some serial performance modeling, the above modeling assumes that data is found in the (shared) caches. This allows proper comparison to serial computing where data is found in the cache, as the number of clocks to reach the cache for PRAM-On-Chip is assumed to be significantly higher than in serial computing; for example, our prototype XMT architecture suggests values that range between 6 and 24 cycles for a round-trip to the first level of cache, depending on the characteristics of the interconnection network and its load level; we took the conservative approach of using the value R = 24 cycles for one RTM for the rest of this paper. We note that the number of clocks to access main memory should be about the same as for serial computing, and that large caches can be built both for serial computing and for PRAM-On-Chip. However, this modeling is inappropriate if PRAM-On-Chip is to be compared to the Cray MTA, where no shared caches are used: for the MTA the number of clocks to access main memory is important, and it would then not be appropriate to omit this figure for cache misses on PRAM-On-Chip. Note that some of the computation work is counted twice in our Execution Time, once as part of the critical path under Execution Depth and once in the Additional Work factor. We could further refine our analysis and propose a more accurate model, but with much more involved modeling. For the sake of clarity, we chose to stop at the level of detail that allows for a concise presentation while providing relevant results.
Other researchers who worked on performance modeling of parallel algorithms have typically focused on factors different from those we have identified here, because they dealt with other platforms. Helman and JáJá [HJ99] measured the complexity of algorithms running on SMPs using the triplet of maximum number of non-contiguous accesses by any processor to main memory, number of barrier synchronizations, and local computation cost. However, these quantities are less important in a PRAM-like environment. Bader, Cong, and Feo [BCF05] found that in some experiments on the Cray MTA, the costs of non-contiguous memory access and barrier synchronization were reduced almost to zero by multithreading, and that performance was best modeled by computation alone. For the latest generation of the MTA architecture, researchers have developed a calculator for performance that includes the parameters of count of trips to memory, number of instructions, and number of accesses to local memory [FHKK05]. Our measures are still different, because the RTMs that we count are round trips to the shared cache, and we also count queuing at the shared cache. In addition, we consider the effect of optimizations such as prefetching and thread clustering. Nevertheless, the calculator should provide an interesting basis for comparison between the performance of applications on the MTA and on PRAM-On-Chip.

3 An Example for Using the Methodology: Summation

Consider the problem of computing the sum of n numbers. Given as input an array A of size n, the output provides the sum of its values. Developing a parallel program for this simple problem is presented next as an example of the methodology of the previous section, progressing through the models. A High-Level Work-Depth description of the algorithm is presented in figure 5.a. A non-recursive Work-Depth presentation of this algorithm can be derived from it, as presented in figure 5.b.
In the WD algorithm, we use a one-dimensional array to store all the elements of the tree, as shown in figure 6. For the more general case of a complete k-ary tree, we store the root at element 0, followed by the k elements of the first level, listed from left to right, then the k^2 elements of the second level, etc. The array is densely packed, with no gaps, thus (a) the children of node i are at indices k*i + 1, k*i + 2, ..., k*i + k and (b) the parent of node i is at index (i-1)/k (rounded down). Note that this simple relationship between a node and its children is helpful for improving performance.

SUM(A, n)
    If n = 1 then sum = A[1]; exit
    For 1 <= i <= n/2 pardo
        B[i] = A[2i-1] + A[2i]
    Call SUM(B, n/2)
(a)

For 1 <= i <= n pardo          // B is a 1D array
    B[n-1+i] = A[i]            // representation of a tree
For h = log n to 1 do
    For 2^(h-1) <= i < 2^h pardo
        B[i] = B[2i] + B[2i+1]
sum = B[1]
(b)

Figure 5: The Summation Algorithm. (a) A High-Level Work-Depth presentation. Pairs of values of A are summed up and stored into array B, followed by a recursive call on array B. (b) A Work-Depth description.

Figure 6: The array representation of a complete ternary tree. The array is densely packed, with the root coming first, then the elements at level 1, and then the elements at level 2.

We now proceed to express this algorithm in the PRAM-On-Chip Programming Model. Note that the WD algorithm uses a balanced binary tree approach, repeatedly adding pairs of values in parallel. Alternatively, k values can be summed serially; this constitutes a k-ary tree approach. The k-ary tree is shorter when k > 2, having log_k n instead of log_2 n levels; this reduces the number of iterations at the cost of increased iteration complexity. The optimum k is chosen as the value that minimizes the estimated running time in the performance model for a particular n. The k-ary tree is represented as a 1D array in the complete tree representation, similar to the Work-Depth description. The PRAM-on-chip implementation of this algorithm is presented in figure 3.b using the XMTC programming language. We will consider the performance of the algorithm in the PRAM-On-Chip Execution Model in Section 4.4, after describing compiler optimizations.

4 Compiler Optimizations

Given a program in the PRAM-On-Chip Programming Model, an optimizing compiler can perform various transformations on it to better fit the target PRAM-On-Chip Execution Model and reduce execution time. We describe several possible optimizations and demonstrate their effect using the Summation algorithm described above.

4.1 Nested Parallel Sections

Quite a few PRAM algorithms can be expressed with greater clarity and conciseness when nested parallelism is allowed [Ble96]. For this reason, nesting parallel sections with arbitrary numbers of threads needs to be allowed in the PRAM-On-Chip Programming Model.
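The index arithmetic behind figures 5(b) and 6 can be checked with a short serial C sketch; the helper names are ours, and the pardo levels of the Work-Depth description are replaced here by a single descending serial loop (valid because every parent index is smaller than its children's indices):

```c
#include <assert.h>

/* Serial rendering of the Work-Depth summation of figure 5(b), assuming n is
 * a power of two: leaves A[1..n] are copied into B[n..2n-1], each internal
 * node i gets B[2i] + B[2i+1], and B[1] ends up holding the total. B needs
 * 2n entries; index 0 is unused, matching the 1-based figure. */
int tree_sum(const int *A, int n, int *B) {
    for (int i = 1; i <= n; i++)
        B[n - 1 + i] = A[i];                   /* leaves */
    for (int i = n - 1; i >= 1; i--)           /* serial stand-in for the pardo levels */
        B[i] = B[2 * i] + B[2 * i + 1];
    return B[1];
}

/* Index relations of the densely packed k-ary layout of figure 6 (0-based root). */
int kary_child(int k, int i, int c) { return k * i + 1 + c; }   /* 0 <= c < k */
int kary_parent(int k, int i)       { return (i - 1) / k; }
```

Note that the two layouts use different conventions: figure 5(b) roots the binary tree at index 1, while figure 6 roots the k-ary tree at index 0.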
However, hardware implementation of nesting is not free, and the programmer needs to be aware of the implementation overheads. In order to explain a key implementation problem, we need to review the hardware mechanism that allocates code threads to the physical TCUs. Consider an SPMD parallel code section that starts with a spawn(1,n) command, where each of the n threads ends with a join command and there are no nested spawns. As noted before, the Master TCU broadcasts the parallel code section to all TCUs. In addition it broadcasts the number n to all TCUs. TCU i, 1 <= i <= p, will check whether i > n, and if not it will execute thread i; once TCU i hits a join, it executes a special system ps() command with an increment of 1 relative to a counter that holds the number of threads started so far; denote the result it gets back by j; if j > n, TCU i is done, and if not it will execute thread j; this process is repeated each time a TCU hits a join, until all TCUs are done, when a transition back into serial mode occurs. Allowing nesting of spawn() commands would require: (i) Upgrading this thread allocation mechanism. First, the number n representing the total number of threads will be repeatedly updated and broadcast to the TCUs. (ii) Since a TCU gets just an integer result through the system ps() command, more information is needed to link this integer to a new thread that needs to execute. In addition, we need to facilitate a way for the parent (spawning) thread to forward initialization data to a child (spawned) thread. In our prototype XMT PRAM-On-Chip Programming Model, we allow nested spawns of a small fixed number of threads through the single-spawn and k-spawn instructions; sspawn() starts one single additional thread, while kspawn() starts exactly k threads, where k is a small constant (such as 2 or 4). Each of these instructions causes a delay of one RTM before the parent can proceed, and an additional delay of 1-2 RTMs before the child thread can proceed (or actually get started). Suppose that a parent thread wants to create another thread whose virtual thread number (as referenced from the SPMD code) is v. First, the parent uses a prefix-sum instruction to a global thread-counter register to create a unique thread ID i for the child. The parent then enters the value v in A(i), where A is a specially designated array in memory. As a result of the parent thread executing an sspawn (or a kspawn command, see below): (i) n will be incremented, and at some point in the future (ii) the thread allocation mechanism will generate virtual thread i. The program for thread i starts by reading v through A(i); it can then be programmed to use v as its effective thread ID. An algorithm that could benefit from nested spawns is the BFS algorithm. Each iteration of the algorithm takes as input L_{i-1}, the vertices whose distance from starting vertex s is i-1, and outputs L_i. As noted in section 2.2, a simple way to do this is to spawn one thread for each vertex in L_{i-1}, and have each thread spawn as many threads as the number of its edges, one per edge. In the BFS example, the parent thread needs to pass information, such as which edge to traverse, to child threads.
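The single-spawn bookkeeping just described can be sketched in C. For illustration only, the hardware prefix-sum to the global thread-counter register is modeled as a plain (non-atomic, single-threaded) fetch-and-add; the names sspawn, A, and thread_counter mirror the text's notation rather than the actual XMT instruction set:

```c
#include <assert.h>

enum { MAX_THREADS = 64 };
static int thread_counter;        /* models the global thread-counter register */
static int A[MAX_THREADS];        /* A(i) holds the child's effective thread ID v */

/* Parent side of a single-spawn: obtain a unique ID i for the child via a
 * fetch-and-add (standing in for the hardware ps() to the counter register),
 * then deposit the child's virtual thread number v in A(i). */
int sspawn(int v) {
    int i = thread_counter++;     /* ps(thread_counter, 1): returns old value */
    A[i] = v;                     /* non-blocking write the child will read */
    return i;
}

/* Child side: the first action of generated thread i is to read v through A(i). */
int child_effective_id(int i) {
    return A[i];
}
```

In the real architecture the child's read may arrive before the parent's write commits, which is why the sleep-waiting primitive described next is needed; this single-threaded sketch cannot exhibit that race.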
To pass data to the child, the parent writes data in memory at locations indexed by the child's ID, using non-blocking writes (namely, the parent sends out a write request and can proceed immediately to its next instruction, without waiting for any confirmation that the write has completed). Since it is possible that the child tries to read this data before it is available, it should be possible to recognize that the data is not yet there and to wait until the data is committed to memory. One possible solution for that is described in the next paragraph. The kspawn instruction uses a prefix-sum instruction with increment k to get k thread IDs and proceeds similarly; the delays on the parent and children threads are similar, though a few additional cycles are required for the parent to initialize the data for all k children. When starting threads using single-spawn or k-spawn, a synchronization step between the parent and the child is necessary to ensure the proper initialization of the latter. Since we would rather not use a busy-wait synchronization technique that could overload the interconnection network and waste power, our envisioned PRAM-on-chip architecture would include a special primitive, called sleep-waiting: the memory system holds the read request from the child thread until the data is actually committed by the parent thread, and only then satisfies the request. When advancing from the programming to the execution model, a compiler can automatically transform a nested spawn of n threads, where n can be any number, into a recursive application of single-spawns (or k-spawns). The recursive application divides much of the task of spawning n threads among the newly spawned threads. When a thread starts a new child, it assigns to it half (or 1/(k+1) for k-spawn) of the n-1 remaining threads that still need to be spawned. This process proceeds in a recursive manner.
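Under the assumption that every live thread performs one single-spawn per round and hands half of its remaining quota to the child, the number of live threads roughly doubles each round; the following C sketch (names ours, a simplification of the recursive transformation) counts the rounds needed to reach n threads:

```c
#include <assert.h>

/* Number of single-spawn rounds until n threads exist, assuming every live
 * thread spawns one child per round (so the population doubles each round).
 * This equals ceil(log2 n), which is why the transformation adds only a
 * logarithmic number of sequential spawn steps. */
int spawn_rounds(int n) {
    int live = 1, rounds = 0;
    while (live < n) {
        live *= 2;     /* each live thread performs one sspawn */
        rounds++;
    }
    return rounds;
}
```

With kspawn the population multiplies by k+1 per round instead, reducing the round count to roughly log base k+1 of n.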
4.2 Clustering

The PRAM-On-Chip Programming Model allows spawning an arbitrary number of virtual threads, but the architecture has only a limited number of TCUs to run these threads. In the progression from the Programming Model to the Execution Model, we often need to choose between two options: spawn fewer threads, each doing more computation, or run the shorter threads as is. Combining short threads into a longer thread is called clustering and offers several advantages: (a) we can pipeline memory accesses that had previously been in separate threads; this can reduce extra costs from serialization of RTMs and QDs that are not on the critical path; (b) spawning fewer threads means reducing thread allocation overheads, i.e. the time required to start a new thread on a recently freed TCU; (c) each spawned thread (even one that is waiting for a TCU) usually takes up space in the system memory, to store the local data for the thread. If the code provides fewer threads than the hardware can support, there are fewer advantages, if any, to using fewer longer threads. Also, running fewer, longer threads


More information

The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing

The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing Mikael Taveniku 2,3, Anders Åhlander 1,3, Magnus Jonsson 1 and Bertil Svensson 1,2

More information

Efficient Sequence Generator Mining and its Application in Classification

Efficient Sequence Generator Mining and its Application in Classification Efficient Sequence Generator Mining and its Alication in Classification Chuancong Gao, Jianyong Wang 2, Yukai He 3 and Lizhu Zhou 4 Tsinghua University, Beijing 0084, China {gaocc07, heyk05 3 }@mails.tsinghua.edu.cn,

More information

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Journal of Comuting and Information Technology - CIT 8, 2000, 1, 1 12 1 Comlexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Eunice E. Santos Deartment of Electrical

More information

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY Andrew Lam 1, Steven J.E. Wilton 1, Phili Leong 2, Wayne Luk 3 1 Elec. and Com. Engineering 2 Comuter Science

More information

EE678 Application Presentation Content Based Image Retrieval Using Wavelets

EE678 Application Presentation Content Based Image Retrieval Using Wavelets EE678 Alication Presentation Content Based Image Retrieval Using Wavelets Grou Members: Megha Pandey megha@ee. iitb.ac.in 02d07006 Gaurav Boob gb@ee.iitb.ac.in 02d07008 Abstract: We focus here on an effective

More information

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Stehan Baumann, Kai-Uwe Sattler Databases and Information Systems Grou Technische Universität Ilmenau, Ilmenau, Germany

More information

1.5 Case Study. dynamic connectivity quick find quick union improvements applications

1.5 Case Study. dynamic connectivity quick find quick union improvements applications . Case Study dynamic connectivity quick find quick union imrovements alications Subtext of today s lecture (and this course) Stes to develoing a usable algorithm. Model the roblem. Find an algorithm to

More information

Using Rational Numbers and Parallel Computing to Efficiently Avoid Round-off Errors on Map Simplification

Using Rational Numbers and Parallel Computing to Efficiently Avoid Round-off Errors on Map Simplification Using Rational Numbers and Parallel Comuting to Efficiently Avoid Round-off Errors on Ma Simlification Maurício G. Grui 1, Salles V. G. de Magalhães 1,2, Marcus V. A. Andrade 1, W. Randolh Franklin 2,

More information

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method ITB J. Eng. Sci. Vol. 39 B, No. 1, 007, 1-19 1 Leak Detection Modeling and Simulation for Oil Pieline with Artificial Intelligence Method Pudjo Sukarno 1, Kuntjoro Adji Sidarto, Amoranto Trisnobudi 3,

More information

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka Parallel Construction of Multidimensional Binary Search Trees Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka School of CIS and School of CISE Northeast Parallel Architectures Center Syracuse

More information

Improving Trust Estimates in Planning Domains with Rare Failure Events

Improving Trust Estimates in Planning Domains with Rare Failure Events Imroving Trust Estimates in Planning Domains with Rare Failure Events Colin M. Potts and Kurt D. Krebsbach Det. of Mathematics and Comuter Science Lawrence University Aleton, Wisconsin 54911 USA {colin.m.otts,

More information

Optimizing Dynamic Memory Management!

Optimizing Dynamic Memory Management! Otimizing Dynamic Memory Management! 1 Goals of this Lecture! Hel you learn about:" Details of K&R hea mgr" Hea mgr otimizations related to Assignment #6" Faster free() via doubly-linked list, redundant

More information

Truth Trees. Truth Tree Fundamentals

Truth Trees. Truth Tree Fundamentals Truth Trees 1 True Tree Fundamentals 2 Testing Grous of Statements for Consistency 3 Testing Arguments in Proositional Logic 4 Proving Invalidity in Predicate Logic Answers to Selected Exercises Truth

More information

Extracting Optimal Paths from Roadmaps for Motion Planning

Extracting Optimal Paths from Roadmaps for Motion Planning Extracting Otimal Paths from Roadmas for Motion Planning Jinsuck Kim Roger A. Pearce Nancy M. Amato Deartment of Comuter Science Texas A&M University College Station, TX 843 jinsuckk,ra231,amato @cs.tamu.edu

More information

This version of the software

This version of the software Sage Estimating (SQL) (formerly Sage Timberline Estimating) SQL Server Guide Version 16.11 This is a ublication of Sage Software, Inc. 2015 The Sage Grou lc or its licensors. All rights reserved. Sage,

More information

Collective communication: theory, practice, and experience

Collective communication: theory, practice, and experience CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Comutat.: Pract. Exer. 2007; 19:1749 1783 Published online 5 July 2007 in Wiley InterScience (www.interscience.wiley.com)..1206 Collective

More information

12) United States Patent 10) Patent No.: US 6,321,328 B1

12) United States Patent 10) Patent No.: US 6,321,328 B1 USOO6321328B1 12) United States Patent 10) Patent No.: 9 9 Kar et al. (45) Date of Patent: Nov. 20, 2001 (54) PROCESSOR HAVING DATA FOR 5,961,615 10/1999 Zaid... 710/54 SPECULATIVE LOADS 6,006,317 * 12/1999

More information

Sage Estimating. (formerly Sage Timberline Estimating) Getting Started Guide

Sage Estimating. (formerly Sage Timberline Estimating) Getting Started Guide Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide This is a ublication of Sage Software, Inc. Document Number 20001S14030111ER 09/2012 2012 Sage Software, Inc. All rights reserved.

More information

Randomized Selection on the Hypercube 1

Randomized Selection on the Hypercube 1 Randomized Selection on the Hyercube 1 Sanguthevar Rajasekaran Det. of Com. and Info. Science and Engg. University of Florida Gainesville, FL 32611 ABSTRACT In this aer we resent randomized algorithms

More information

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics structure arises in many alications of geometry. The dual structure, called a Delaunay triangulation also has many interesting roerties. Figure 3: Voronoi diagram and Delaunay triangulation. Search: Geometric

More information

An integrated system for virtual scene rendering, stereo reconstruction, and accuracy estimation.

An integrated system for virtual scene rendering, stereo reconstruction, and accuracy estimation. An integrated system for virtual scene rendering, stereo reconstruction, and accuracy estimation. Marichal-Hernández J.G., Pérez Nava F*., osa F., estreo., odríguez-amos J.M. Universidad de La Laguna,

More information

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScrit Objects Shiyi Wei and Barbara G. Ryder Deartment of Comuter Science, Virginia Tech, Blacksburg, VA, USA. {wei,ryder}@cs.vt.edu

More information

Tiling for Performance Tuning on Different Models of GPUs

Tiling for Performance Tuning on Different Models of GPUs Tiling for Performance Tuning on Different Models of GPUs Chang Xu Deartment of Information Engineering Zhejiang Business Technology Institute Ningbo, China colin.xu198@gmail.com Steven R. Kirk, Samantha

More information

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal International Journal of Information and Electronics Engineering, Vol. 1, No. 1, July 011 An Efficient VLSI Architecture for Adative Rank Order Filter for Image Noise Removal M. C Hanumantharaju, M. Ravishankar,

More information

Hardware-Accelerated Formal Verification

Hardware-Accelerated Formal Verification Hardare-Accelerated Formal Verification Hiroaki Yoshida, Satoshi Morishita 3 Masahiro Fujita,. VLSI Design and Education Center (VDEC), University of Tokyo. CREST, Jaan Science and Technology Agency 3.

More information

Learning Robust Locality Preserving Projection via p-order Minimization

Learning Robust Locality Preserving Projection via p-order Minimization Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Learning Robust Locality Preserving Projection via -Order Minimization Hua Wang, Feiing Nie, Heng Huang Deartment of Electrical

More information

has been retired This version of the software Sage Timberline Office Get Started Document Management 9.8 NOTICE

has been retired This version of the software Sage Timberline Office Get Started Document Management 9.8 NOTICE This version of the software has been retired Sage Timberline Office Get Started Document Management 9.8 NOTICE This document and the Sage Timberline Office software may be used only in accordance with

More information

Privacy Preserving Moving KNN Queries

Privacy Preserving Moving KNN Queries Privacy Preserving Moving KNN Queries arxiv:4.76v [cs.db] 4 Ar Tanzima Hashem Lars Kulik Rui Zhang National ICT Australia, Deartment of Comuter Science and Software Engineering University of Melbourne,

More information

PRO: a Model for Parallel Resource-Optimal Computation

PRO: a Model for Parallel Resource-Optimal Computation PRO: a Model for Parallel Resource-Otimal Comutation Assefaw Hadish Gebremedhin Isabelle Guérin Lassous Jens Gustedt Jan Arne Telle Abstract We resent a new arallel comutation model that enables the design

More information

Interactive Image Segmentation

Interactive Image Segmentation Interactive Image Segmentation Fahim Mannan (260 266 294) Abstract This reort resents the roject work done based on Boykov and Jolly s interactive grah cuts based N-D image segmentation algorithm([1]).

More information

Sage Document Management Version 17.1

Sage Document Management Version 17.1 Sage Document Management Version 17.1 User's Guide This is a ublication of Sage Software, Inc. 2017 The Sage Grou lc or its licensors. All rights reserved. Sage, Sage logos, and Sage roduct and service

More information

Constrained Path Optimisation for Underground Mine Layout

Constrained Path Optimisation for Underground Mine Layout Constrained Path Otimisation for Underground Mine Layout M. Brazil P.A. Grossman D.H. Lee J.H. Rubinstein D.A. Thomas N.C. Wormald Abstract The major infrastructure comonent reuired to develo an underground

More information

Lecture 3: Geometric Algorithms(Convex sets, Divide & Conquer Algo.)

Lecture 3: Geometric Algorithms(Convex sets, Divide & Conquer Algo.) Advanced Algorithms Fall 2015 Lecture 3: Geometric Algorithms(Convex sets, Divide & Conuer Algo.) Faculty: K.R. Chowdhary : Professor of CS Disclaimer: These notes have not been subjected to the usual

More information

Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide. Version has been retired. This version of the software

Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide. Version has been retired. This version of the software Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide Version 14.12 This version of the software has been retired This is a ublication of Sage Software, Inc. Coyright 2014. Sage Software,

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

Sage Estimating (SQL) (formerly Sage Timberline Estimating) Installation and Administration Guide. Version 16.11

Sage Estimating (SQL) (formerly Sage Timberline Estimating) Installation and Administration Guide. Version 16.11 Sage Estimating (SQL) (formerly Sage Timberline Estimating) Installation and Administration Guide Version 16.11 This is a ublication of Sage Software, Inc. 2016 The Sage Grou lc or its licensors. All rights

More information

Optimization of Collective Communication Operations in MPICH

Optimization of Collective Communication Operations in MPICH To be ublished in the International Journal of High Performance Comuting Alications, 5. c Sage Publications. Otimization of Collective Communication Oerations in MPICH Rajeev Thakur Rolf Rabenseifner William

More information

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level,

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level, [9] J. J. Dongarra, R. Hemel, A. J. G. Hey, and D. W. Walker, \A Proosal for a User-Level, Message Passing Interface in a Distributed-Memory Environment," Tech. Re. TM-3, Oak Ridge National Laboratory,

More information

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs The International Journal of Parallel, Emergent and Distributed Systems Vol. 00, No. 00, Month 2011, 1 22 RESEARCH ARTICLE Simle Memory Machine Models for GPUs Koji Nakano a a Deartment of Information

More information

Control plane and data plane. Computing systems now. Glacial process of innovation made worse by standards process. Computing systems once upon a time

Control plane and data plane. Computing systems now. Glacial process of innovation made worse by standards process. Computing systems once upon a time Classical work Architecture A A A Intro to SDN A A Oerating A Secialized Packet A A Oerating Secialized Packet A A A Oerating A Secialized Packet A A Oerating A Secialized Packet Oerating Secialized Packet

More information

Relations with Relation Names as Arguments: Algebra and Calculus. Kenneth A. Ross. Columbia University.

Relations with Relation Names as Arguments: Algebra and Calculus. Kenneth A. Ross. Columbia University. Relations with Relation Names as Arguments: Algebra and Calculus Kenneth A. Ross Columbia University kar@cs.columbia.edu Abstract We consider a version of the relational model in which relation names may

More information

521493S Computer Graphics Exercise 3 (Chapters 6-8)

521493S Computer Graphics Exercise 3 (Chapters 6-8) 521493S Comuter Grahics Exercise 3 (Chaters 6-8) 1 Most grahics systems and APIs use the simle lighting and reflection models that we introduced for olygon rendering Describe the ways in which each of

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans Available online at htt://ijdea.srbiau.ac.ir Int. J. Data Enveloment Analysis (ISSN 2345-458X) Vol.5, No.2, Year 2017 Article ID IJDEA-00422, 12 ages Research Article International Journal of Data Enveloment

More information

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model Ad Hoc Networks (4) 5 68 Contents lists available at SciVerse ScienceDirect Ad Hoc Networks journal homeage: www.elsevier.com/locate/adhoc Latency-minimizing data aggregation in wireless sensor networks

More information

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH Jin Lu, José M. F. Moura, and Urs Niesen Deartment of Electrical and Comuter Engineering Carnegie Mellon University, Pittsburgh, PA 15213 jinlu, moura@ece.cmu.edu

More information

Non-Strict Independence-Based Program Parallelization Using Sharing and Freeness Information

Non-Strict Independence-Based Program Parallelization Using Sharing and Freeness Information Non-Strict Indeendence-Based Program Parallelization Using Sharing and Freeness Information Daniel Cabeza Gras 1 and Manuel V. Hermenegildo 1,2 Abstract The current ubiuity of multi-core rocessors has

More information

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22 Collective Communication: Theory, Practice, and Exerience FLAME Working Note # Ernie Chan Marcel Heimlich Avi Purkayastha Robert van de Geijn Setember, 6 Abstract We discuss the design and high-erformance

More information

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip Downloaded from orbit.dtu.dk on: Jan 25, 2019 A Metaheuristic Scheduler for Time Division Multilexed Network-on-Chi Sørensen, Rasmus Bo; Sarsø, Jens; Pedersen, Mark Ruvald; Højgaard, Jasur Publication

More information

Record Route IP Traceback: Combating DoS Attacks and the Variants

Record Route IP Traceback: Combating DoS Attacks and the Variants Record Route IP Traceback: Combating DoS Attacks and the Variants Abdullah Yasin Nur, Mehmet Engin Tozal University of Louisiana at Lafayette, Lafayette, LA, US ayasinnur@louisiana.edu, metozal@louisiana.edu

More information

A BICRITERION STEINER TREE PROBLEM ON GRAPH. Mirko VUJO[EVI], Milan STANOJEVI] 1. INTRODUCTION

A BICRITERION STEINER TREE PROBLEM ON GRAPH. Mirko VUJO[EVI], Milan STANOJEVI] 1. INTRODUCTION Yugoslav Journal of Oerations Research (00), umber, 5- A BICRITERIO STEIER TREE PROBLEM O GRAPH Mirko VUJO[EVI], Milan STAOJEVI] Laboratory for Oerational Research, Faculty of Organizational Sciences University

More information

A Model-Adaptable MOSFET Parameter Extraction System

A Model-Adaptable MOSFET Parameter Extraction System A Model-Adatable MOSFET Parameter Extraction System Masaki Kondo Hidetoshi Onodera Keikichi Tamaru Deartment of Electronics Faculty of Engineering, Kyoto University Kyoto 66-1, JAPAN Tel: +81-7-73-313

More information

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap Aeared in \Journal of Parallel and Distributed Comuting, July 1995 " Overlaing Comutations, Communications and I/O in Parallel Sorting y Mark J. Clement Michael J. Quinn Comuter Science Deartment Deartment

More information

CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE

CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE Petra Surynková Charles University in Prague, Faculty of Mathematics and Physics, Sokolovská 83,

More information

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree To aear in IEEE TKDE Title: Efficient Skyline and To-k Retrieval in Subsaces Keywords: Skyline, To-k, Subsace, B-tree Contact Author: Yufei Tao (taoyf@cse.cuhk.edu.hk) Deartment of Comuter Science and

More information

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Min Hu, Saad Ali and Mubarak Shah Comuter Vision Lab, University of Central Florida {mhu,sali,shah}@eecs.ucf.edu Abstract Learning tyical

More information