Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform


Uzi Vishkin    George C. Caragea    Bryant Lee
April 2006
University of Maryland, College Park, MD
UMIACS-TR

Justin Rattner, CTO, Intel, Electronic News, March 13, 2006: "It is better for Intel to get involved in this now so when we get to the point of having 10s and 100s of cores we will have the answers. There is a lot of architecture work to do to release the potential, and we will not bring these products to market until we have good solutions to the programming problem." [underline added]

Abstract

A bold vision that guided this work is as follows: (i) a parallel algorithms and programming course could become a standard course in every undergraduate computer science program, and (ii) this course could be coupled with a so-called PRAM-On-Chip architecture, a commodity high-end multi-core computer architecture. In fact, the current paper is a tutorial on how to convert PRAM algorithms into efficient PRAM-On-Chip programs. Coupled with a text on PRAM algorithms as well as an available PRAM-On-Chip tool-chain, comprising a compiler and a simulator, the paper provides the missing link for upgrading a standard theoretical PRAM algorithms class to a parallel algorithms and programming class. Having demonstrated that such a course could cover similar programming projects and material to what is covered by a typical first serial algorithms and programming course, the paper suggests that parallel programming in the emerging multi-core era does not need to be more difficult than serial programming. If true, a powerful answer to the so-called parallel programming open problem is being provided. This open problem is currently the main stumbling block for the industry in getting the upcoming generation of multi-core architectures to improve single task completion time using easy-to-program application programmer interfaces.
Known constraints of this open problem, such as backwards compatibility on serial code, are also addressed by the overall approach. More concretely, a widely used methodology for advancing parallel algorithmic thinking into parallel algorithms is revisited, and is extended into a methodology for advancing parallel algorithms to PRAM-On-Chip programs. A performance cost model for the PRAM-On-Chip is also presented. It uses as complexity metrics the length of sequence of round trips to memory (LSRTM) and queuing delay (QD) from memory access queues, in addition to standard PRAM computation costs. Highlighting the importance of LSRTM in determining performance is another contribution of the paper. Finally, some alternatives to PRAM algorithms, which, on one hand, are easier-to-think, but, on the other hand, suppress more architecture details, are also discussed.

1 Introduction

Parallel programming is currently a difficult task. Current methods tend to be coarse-grained and use either a shared memory or a message passing model. These methods often require the programmer to think in a way that takes into account details of memory layout or architectural implementation. It has been a common

[Partially supported by NSF grant]

sentiment that the development of an easy way for parallel programming would be a major breakthrough; see, e.g., Culler and Singh [CS99]. Indeed, to date the outreach of parallel computing has fallen short of historical expectations. Overall, there is a strong renewed interest in inventing new programming languages that accommodate simple representation of concurrency. However, during the previous decades thousands of papers have been written on this topic. This effort brought about a fierce debate among a considerable number of schools of thought. One of these approaches, the PRAM approach, emerged as a clear winner in this battle of ideas. In fact, we would like to defend an even stronger premise: had a parallel architecture that can look to the performance programmer like a PRAM been feasible in the early 1990s, its parallel programming approach would have become common knowledge and the prevailing standard by now. As evidence to support this premise we point out that 3 of the main algorithms textbooks (taught in standard undergraduate computer science courses everywhere by 1990) [Baa88, CLR90, Man89] chose to include large chapters on PRAM algorithms. The PRAM was the model of choice for parallel algorithms in all major algorithms/theory communities and was taught everywhere. The only reason that this win did not register in the collective memory as the clear and decisive victory it really is, is that, at about the same time (early 1990s), it became clear that it would not be possible to build such a machine (i.e., one that can look to the performance programmer as a PRAM) using early 1990s technology. The Parallel Random Access Model (PRAM) is an easy model for parallel algorithmic thinking and for programming. It abstracts away architecture details by assuming that many memory accesses to a shared memory can be satisfied within the same time as a single access. As noted above, the PRAM was developed during the 1980s and 1990s in anticipation of a parallel programmability challenge.
It provides the second largest algorithmic knowledge base, right next to the standard serial knowledge base. With the continuing increase of silicon capacity, it becomes possible to build a single-chip parallel processor. Such a demonstration has been the purpose of the Explicit Multi-Threading (XMT) project [VDBN98, NNTV03] that seeks to prototype a PRAM-On-Chip vision, as on-chip interconnection networks provide enough bandwidth for connecting processors to memories. Thread-level parallelism (TLP) allows multiple threads of execution to proceed concurrently. There is a long record of compiler efforts for parallelizing serial code; two representative efforts are [AALT95, ACK87]. While there have been some success stories, it is now recognized that automatic parallelization by compilers is generally insufficient. The PRAM-On-Chip platform, to be discussed later in the current paper, is quite broad. The current paper will focus on a thread-level parallelism (TLP) approach for programming it. However, instead of using operating system threads, as in most current systems, threads are defined by the programming language and handled by its implementation. Also, threads are short and the overall objective for multi-threading is reducing single-task completion time. Several multi-chip multiprocessor architectures targeted implementation of PRAM algorithms, or came close to that: (i) The NYU Ultracomputer project sought to approximate the PRAM [AG94], viewing the PRAM as providing a theoretical yardstick for limits of parallelism as opposed to a practical programming model [Sch80]. (ii) The Tera/Cray Multi-Threaded Architecture (MTA) advanced Burton Smith's novel 1978 HEP hardware design. Seeking to hide latencies to memory [SCB+98], each processor has sufficiently many hardware threads (128 was a typical number) that can context-switch quickly. The paper [BCF05] suggests that the MTA is close to a PRAM and may allow more efficient implementation of algorithms with irregular memory access, such as those from graph theory.
Some authors have stated that an MTA with a large number of processors looks almost like a PRAM [CFS99]. (iii) The SB-PRAM may be the first multi-chip multiprocessor architecture whose declared objective was to provide emulation of the PRAM [KKT00]. It allows writing computer programs that are similar to the original PRAM algorithms. A 64-processor prototype has been built [DKP02]. (iv) Although a language rather than an architecture, NESL also made a contribution to implementing PRAM algorithms by making the algorithms easier to express using the NESL functional language [Ble96]. NESL programs are compiled and run on standard multi-chip parallel architectures. However, the fact remains that the PRAM theory has generally not reached out beyond the ivory towers of academia. For example, the jury is still out on whether the PRAM can provide an effective abstraction for a proper design of multi-chip multi-processors. The main difficulty [CS99] appears to be the limits on the bandwidth of such a multi-chip architecture.

More of the case for a lower hanging fruit, PRAM-On-Chip, is presented next. Guided by the fact that the number of transistors on a chip already exceeds one billion, up from less than 30,000 circa 1980, and keeps growing, the main insight behind PRAM-On-Chip is as follows. The billion-transistor chip era allows for the first time a low-overhead on-chip multi-processor, thereby avoiding concerns regarding the higher overhead of multi-chip multiprocessors. It also allows an evolutionary path from serial computing. The drastic recent slowdown in clock rate improvement for commodity processors will force vendors to seek single-task performance improvements through parallelism. While some have already noted likely growth to 100-core chips by 2015, they are yet to choose programming languages and architectures for harnessing these enormous hardware resources toward single-task completion time. PRAM-On-Chip addresses these issues. Some key differences between the PRAM-On-Chip and the above multi-chip approaches are: (i) its larger bandwidth, benefiting from the on-chip environment; (ii) lower latencies to shared memory, since an on-chip approach allows on-chip shared caches; (iii) effective support for serial code; this may be needed for backward compatibility for serial programs, or for serial sections in PRAM-like programs; (iv) effective support for parallel execution where the amount of parallelism is low; certain algorithms (e.g., breadth-first search (BFS) on graphs, presented later) have particularly simple parallel algorithms, some being only a minor variation of the serial algorithm; since they may not offer sufficient parallelism for some multi-chip architectures, such important algorithms had no merit for these architectures; and (v) PRAM-On-Chip introduced a so-called Independence of Order Semantics (IOS): each thread executes at its own pace and any ordering of interactions among threads is valid.
If more than one thread seeks to write to the same shared variable, the outcome is in line with the PRAM Arbitrary CRCW convention (see Section 2.1). This feature improves performance, as it allows processing with whatever data is available at the processing elements, and saves power, as it reduces synchronization needs. The feature could have been added to multi-chip approaches, providing some, but apparently not all, of the benefits. Other PRAM-related approaches tended to emphasize competition with (massively parallel) parallel computing approaches and have not paid that much attention to serial code, serial mode in a parallel program, or even parallel execution where the amount of parallelism is low. The approach could also support standard application programming interfaces (APIs) such as those used for graphics (e.g., OpenGL) or circuit design (e.g., VHDL). Use of high-level APIs can allow automatic extraction of much more parallelism than from code written for performance programming languages such as C. With an effective implementation of such an API for a PRAM-On-Chip (see Figure 17.b), an application programmer could take advantage of parallel hardware with few or no changes to an existing API. See [GV06] for a recent example of speedups exceeding a hundred fold over serial computing for gate-level VHDL simulations on PRAM-On-Chip. The main contribution of this paper is presenting a programming methodology for converting PRAM algorithms to PRAM-On-Chip programs. An overview of some alternatives to PRAM algorithms, which are easier-to-think but suppress more architecture details, is also given. Performance models used in developing a PRAM-On-Chip algorithm are described in Section 2. An example of using the models is given in Section 3. Section 4 explains compiler optimizations that could affect the actual execution of programs. Section 5 gives another example of applying the models, to the prefix sums problem. Section 6 presents Breadth-First Search (BFS) in the PRAM-On-Chip programming model.
Section 7 explains the application of compiler optimizations to BFS and compares the performance of several BFS implementations. Section 8 discusses the Adaptive Bitonic Sorting algorithm and its implementation, while Section 9 introduces a variant of Sample Sort that runs on a PRAM-On-Chip. Section 10 discusses matrix-vector multiplication. Some empirical validation of the models is presented in Section 11. We conclude in Section 12.

2 Model descriptions

Given a problem, a recipe for developing an efficient PRAM-On-Chip program from concept to implementation is proposed. In particular, the stages through which such development needs to pass are presented. Figure 1 depicts the proposed methodology. For context, the figure also depicts the widely used Work-Depth methodology for advancing from concept to a PRAM algorithm; namely, the sequence of models in the figure illustrates progression from a high-level description to a PRAM algorithm.

[Figure 1: Proposed Methodology for Developing PRAM-On-Chip Programs in view of the Work-Depth Paradigm for Developing PRAM algorithms. The figure contrasts the original "parallel thinking to PRAM" methodology (High-Level Work-Depth Description, Work-Depth Model, PRAM Model) with the proposed "parallel thinking to PRAM-On-Chip program" methodology (PRAM-On-Chip Programming Model, without or with nesting, and PRAM-On-Chip Execution Model), including a proposed shortcut for PRAM-On-Chip programmers.]

For developing a PRAM-On-Chip implementation, we propose following the sequence of models in the figure: given a specific problem, an algorithm design stage will produce a high-level description of the parallel algorithm; this informal description is fleshed out as a sequence of steps, each comprising a set of concurrent operations. In a first draft, the set of concurrent operations can be implicitly defined; see the BFS example in Section 2.2.1. This first draft is refined to a sequence of steps, each now comprising a sequence of concurrent operations. Such a formal Work-Depth description fully spells out how to advance in a given step whose sequence of concurrent operations includes j operations indexed by the integers 1 through j: it maps each index i, where 1 <= i <= j, to an operation. The programming effort amounts to translating this description into a single-program multiple-data (SPMD) program using a high-level PRAM-On-Chip programming language.
From this SPMD program, a compiler will transform and reorganize the code to achieve the best performance in the target PRAM-On-Chip execution model. As a PRAM-On-Chip programmer gains experience, he/she will be able to skip box 2 (the Work-Depth model) and directly advance from box 1 (high-level Work-Depth description) to box 4 (high-level PRAM-On-Chip program). We also demonstrate some instances where it may be advantageous to skip box 2 because of some features of the programming model (such as some ability to handle nesting of parallelism). In Figure 1 this shortcut is depicted by the arrow from box 1 to box 4. Much of the current paper is devoted to presenting the methodology and demonstrating it. We start by elaborating on each model.

2.1 PRAM Model

The PRAM (for Parallel Random Access Machine, or Model) augments the standard serial model of computation, known as the RAM [AU94], with parallelism. A PRAM consists of p synchronous processors and a global shared memory accessible in unit time from each of the processors. The only means of inter-processor communication is through the shared memory. Different conventions exist regarding concurrent access to the memory, including: (i) exclusive-read exclusive-write (EREW), under which simultaneous access to the same memory location for read or write purposes is forbidden; (ii) concurrent-read exclusive-write (CREW), which allows concurrent reads but not writes; and (iii) concurrent-read concurrent-write (CRCW), where both are permitted and a convention regarding how concurrent writes are resolved is specified. One of these conventions, Arbitrary CRCW, stipulates that concurrent writes into a common memory location result in an arbitrary processor, among those attempting to write, succeeding, but it is not known in advance which

processor. There are quite a few sources for PRAM algorithms, including [JáJ92, KR90, EG88, Vis02]. An algorithm in the PRAM model is described as a sequence of parallel time units, or rounds; each round consists of exactly p instructions to be performed concurrently, one per processor. Producing such a description imposes a significant burden on the algorithm designer. Luckily, this burden can be somewhat mitigated using the Work-Depth methodology.

2.2 The Work-Depth Methodology

Introduced in [SV82], the Work-Depth methodology for designing PRAM algorithms has proved to be quite useful as a framework for describing parallel algorithms and reasoning about their performance. For example, it was used as the description framework in [JáJ92]. The methodology is guided by seeking to optimize two quantities in a parallel algorithm: depth and work. Depth represents the number of steps the algorithm would take if unlimited parallel hardware were available, while work is the total number of operations performed, over all parallel steps. The methodology suggests starting by producing an informal description of the algorithm in a high-level work-depth model (HLWD), and then advancing this description into a fuller presentation in a model of computation called Work-Depth. We proceed to describe these two models next.

2.2.1 High-Level Work-Depth Description

An HLWD description consists of a succession of parallel rounds, each round being a set of any number of instructions to be performed concurrently. Descriptions can come in several flavors, and even implicit descriptions, where the number of instructions is not obvious, are acceptable. Example: Given an undirected graph G(V, E), where the length of every edge in E is 1, and a source node s in V, the breadth-first search (BFS) algorithm finds the lengths of the shortest paths from s to every node in V. An informal work-depth description of the parallel BFS algorithm can look as follows.
Suppose that V, the set of vertices of the graph G, is partitioned into layers, where layer L_i includes all vertices of V whose shortest path from s includes exactly i edges. The algorithm works in iterations. In iteration i, layer L_i is found. Iteration 0: node s forms layer L_0. Iteration i, i > 0: assume inductively that layer L_{i-1} has already been found. In parallel, consider all the edges (u, v) that have an endpoint u in layer L_{i-1}; if v is not in a layer L_j, j < i, it must be in layer L_i. As more than one edge may lead from a vertex in layer L_{i-1} to v, vertex v is marked as belonging to layer L_i by one of these edges, using the arbitrary concurrent-write convention. This ends an informal, high-level work-depth verbal description. A pseudocode description of an iteration of this algorithm could look as follows:

for all vertices v in L(i) pardo
    for all edges e = (v, w) pardo
        if w unvisited
            mark w as part of L(i+1)

The above HLWD description challenges us to try to find an efficient PRAM implementation for an iteration. Namely, given a p-processor PRAM, how do we allocate processors to tasks to finish all operations of an iteration as quickly as possible? As noted earlier, a more detailed description in the Work-Depth model would address these issues.

2.2.2 Work-Depth Model

In the Work-Depth model, the description is to be cast in terms of successive time steps, where the concurrent operations in a time step form a sequence; each element in the sequence is indexed by a different index between 1 and the number of operations in the step. The Work-Depth model is formally equivalent to the PRAM. For example, a work-depth algorithm with T(n) depth (or time) and W(n) work runs on a p-processor PRAM in at most T(n) + W(n)/p time steps. The simple equivalence proof follows Brent's scheduling principle, which

was introduced in [Bre74] for a model of parallel computation that was much more abstract than the PRAM (counting arithmetic operations, but suppressing anything else). For example, summing n numbers has W(n) = O(n) and T(n) = O(log n), and hence runs on a p-processor PRAM in O(n/p + log n) time steps. Example (continued): We only note here the challenge of coming up with a Work-Depth description for the BFS algorithm. The challenge would be to find a way of listing in a single sequence all the edges that have as an endpoint a vertex of layer L_i. In other words, the Work-Depth model does not allow us to leave nesting of parallelism unresolved. On the other hand, PRAM-On-Chip programming should allow nesting, since this mechanism provides an easy way for parallel programming. It is also important to note that the PRAM-On-Chip architecture includes some limited support for nesting of parallelism. The way in which we suggest to resolve this problem is as follows. The ideal long-term solution is: (a) allow the programmer free, unlimited use of nesting; (b) have it implemented as efficiently as possible by the compiler; and (c) make the programmer (especially the "performance programmer") fully aware of the cost of using nesting. However, since our compiler is not yet mature enough to handle this matter, our tentative short-term solution is presented in Section 6, which shows how to build on the support for nesting provided by the architecture. There is merit to this manual solution beyond its tentative role until the compiler matures: it should still be taught (even after the ideal compiler solution is in place) in order to explain the cost of nesting to programmers. The reason for bringing this issue up this early in the discussion is that it suggests that our methodology does not necessarily need to make a complete stop at the Work-Depth model, but can perhaps bypass it and proceed directly to the PRAM-like programming methodology.

2.3 PRAM-on-chip Programming Model

The PRAM-on-chip programming model is a framework for a high-level programming language.
It can be used to implement an algorithm described in the Work-Depth presentation model, but as noted before it also offers shortcuts from higher-level descriptions. The overall objective of the programming model is to balance two goals: (i) Programmability: given an algorithm in the HLWD or Work-Depth model, the programmer's effort should be minimized; and (ii) Implementability: effective compiler translation into the PRAM-on-chip execution model should be feasible. A fine-grained, SPMD-type model, in which execution frequently alternates between serial and parallel execution modes, is presented. As illustrated in Figure 2, a Spawn command prompts a switch from serial mode to parallel mode. The Spawn command can specify any number of threads. Ideally, each such thread can proceed until termination (a Join command) without ever having to busy-wait or synchronize with other threads. To facilitate that, an independence of order semantics (IOS) was introduced: the programmer can use commands (e.g., prefix-sum) that permit threads to proceed even if they try to write into the same memory location. This was inspired by the PRAM arbitrary concurrent-write convention noted earlier. The following are some of the primitives in the PRAM-on-chip programming model:

Spawn instruction. Used to start a parallel section. Accepts as parameter the number of parallel threads to start.

Thread-id. A special variable name used inside a parallel section, which evaluates to the thread ID. This allows SPMD-style programming.

Prefix-sum instruction. The prefix-sum instruction defines an atomic operation. Operating on two variables, a base variable B and an increment variable R, the result of a prefix-sum is that B gets the value B + R, while R gets the original value of B. Some interesting uses of the prefix-sum instruction arise when several concurrent threads use it with respect to the same base. It provides a tool for implementing IOS as well as for inter-thread coordination.
While the basic definition of prefix-sum follows the fetch-and-add of the NYU Ultracomputer [GGK+82], PRAM-On-Chip uses a fast parallel hardware implementation (ps()) if R is from a small range (e.g., one bit) and B can fit in one of a small number of global registers; otherwise, prefix-sums are done using a prefix-sum-to-memory (psm()) instruction and are resolved by queuing at memory.

Nested parallelism. A parallel thread can be programmed to initiate more threads. However, as noted in Section 2.2.2, this comes with some (tentative) restrictions and cost caveats, due to compiler and

[Figure 2: Switching between serial and parallel execution modes (Spawn ... Join ... Spawn ... Join) in the PRAM-on-chip programming model. Each parallel thread executes at its own speed, without ever needing to synchronize with another thread.]

hardware support issues. As illustrated with the breadth-first search example, nesting of parallelism could improve the programmer's ability to describe an algorithm in a clear and concise way. Nesting is discussed in several places in the current paper, including Section 4.1. Note that Figure 1 depicts two alternative PRAM-On-Chip programming models: without nesting and with nesting. The Work-Depth model maps directly into the programming model without nesting. Allowing nesting could make it easier to turn a description in the High-Level Work-Depth model into a program. Since our current embodiment of PRAM-On-Chip is called XMT, for Explicit Multi-Threading, we call the illustration of this programming model XMTC. XMTC is a superset of the language C, obtained from it by adding structures for the above primitives.

Examples of XMTC code. Several examples of actual implementations of PRAM algorithms using XMTC are presented in Figure 3. While each of these programs is discussed in greater detail in the following sections, the purpose of the figure is to convey, to readers familiar with other parallel programming frameworks, the relative conciseness of these programs. Some language constructs, such as variable and function declarations, have been left out in this figure, but they would need to be included in a valid XMTC program. Next, the language features of XMTC are demonstrated using the array compaction problem, presented in Figure 3.a: given an array of integers T[0..n-1], copy all its non-zero elements into another array S; any order will do. The special variable $ denotes the thread-id. The command spawn(0,n-1) spawns n threads whose ids are the integers in the range 0...n-1.
The ps(increment, length) instruction executes an atomic prefix-sum command using length as the base and increment as the increment value. The variable increment is local to a thread, while length is a global variable which will hold the number of non-zero elements copied at the end of the spawn block. Variables declared inside a spawn block are local to each thread, and are usually much faster to access than the shared memory.(1)

To evaluate performance in this model, a language-based performance model is used: performance costs are assigned to each primitive instruction in the language, and rules are specified for combining them into expressions. Such performance modeling was used by Aho and Ullman [AU94] and was generalized for parallelism by Blelloch [Ble96]. The paper [DV00] used language-based modeling for studying parallel list ranking relative to an earlier performance model for XMT.

2.4 PRAM-on-chip Execution Model

The execution model depends heavily on particulars of the PRAM-on-chip implementation. For illustration purposes, we will use the XMT PRAM-on-chip platform (see [NNTV03]). A bird's eye view of XMT is presented in Figure 4. A number of (say 1024) Thread Control Units (TCUs) are grouped into (say 64) clusters. Clusters are connected to the memory subsystem by a high-throughput, low-latency interconnection network; they also interface with specialized units such as the prefix-sum unit and global registers. A hash function is applied to memory addresses in order to provide better load balancing over the shared memory modules. An important component of a cluster is the read-only cache included at cluster level; this is used to store values read from memory by a TCU and also holds the values read by prefetch instructions. The memory system consists of memory modules, each having several levels of cache memories.

(1) On XMT, local thread variables are typically stored in local registers of the executing hardware thread control unit (TCU).
The programmer is encouraged to use local variables to store frequently used values; this type of optimization can also be performed by an optimizing compiler.

(a) Array compaction

length = 0;
spawn(0, n-1) {              // start one thread per array element
    int increment = 1;
    if (T[$] != 0) {
        // execute prefix-sum to allocate one entry in array S
        ps(increment, length);
        S[increment] = T[$];
    }
}

(b) k-ary Tree Summation

// Input:  N numbers in sum[0..N-1]; the sum array is a 1-D complete
//         tree representation (see the Summation section)
// Output: the sum of the numbers, in sum[0]
level = 0;
// process the levels of the tree from leaves to root
while (level < log_k(N)) {
    level++;
    spawn(current_level_start_index, current_level_end_index) {
        int count, local_sum = 0;
        for (count = 0; count < k; count++)
            local_sum += sum[k*$ + count + 1];
        sum[$] = local_sum;
    }
}

(c) k-ary Tree Prefix-Sums

// Input:  N numbers in sum[0..N-1]
// Output: the prefix-sums of the numbers, in
//         prefix_sum[offset_to_1st_leaf .. offset_to_1st_leaf + N - 1];
//         the prefix_sum array is a 1-D complete tree representation
//         (see Summation)
kary_tree_summation(sum);    // run the k-ary tree summation algorithm
prefix_sum[0] = 0;
level = log_k(N);
while (level > 0) {          // all levels from root to leaves
    spawn(current_level_start_index, current_level_end_index) {
        int count, local_ps = prefix_sum[$];
        for (count = 0; count < k; count++) {
            prefix_sum[k*$ + count + 1] = local_ps;
            local_ps += sum[k*$ + count + 1];
        }
    }
    level--;
}

(d) Breadth-First Search

// Input:  graph G = (E, V) using adjacency lists (see the Programming
//         BFS section)
// Output: distance[N], the distance from the start vertex to each vertex
// Uses:   level[L][N], the sets of vertices at each BFS level
// run prefix-sums on the degrees to determine the position of the
// start edge for each vertex
start_edge = kary_prefix_sums(degrees);
level[0] = start_node; i = 0;
while (level[i] not empty) {
    spawn(0, level_size[i] - 1) {   // one thread per vertex in level[i]
        v = level[i][$];            // read one vertex
        spawn(0, degree[v] - 1) {   // one thread per edge of each vertex
            int w = edges[start_edge[v] + $][2];  // read one edge (v,w)
            psm(gatekeeper[w], 1);  // check the gatekeeper of end vertex w
            if gatekeeper[w] was 0 {
                psm(level_size[i+1], 1);  // allocate one entry in level[i+1]
                store w in level[i+1];
            }
        }
    }
    i++;
}

(e) Sparse Matrix - Dense Vector Multiplication

// Input:  vector b[n]; sparse matrix A[m][n], given in Compact Sparse
//         Row form, as in Figure 12
// Output: vector c[m] = A*b
spawn(0, m) {                 // start one thread for each row of A
    int row_start = row[$], elements_on_row = row[$+1] - row_start;
    spawn(0, elements_on_row - 1) {  // one thread per non-zero element on row
        // compute A[i][j]*b[j] for all non-zero elements on the current row
        tmp_sum[$] = values[row_start + $] * b[columns[row_start + $]];
    }
    c[$] = kary_tree_summation(tmp_sum[0 .. elements_on_row - 1]);  // sum up
}

Figure 3: Implementation of some PRAM algorithms in the XMT PRAM-on-chip framework, to demonstrate compactness.

[Figure 4: An overview of the XMT PRAM-on-chip Architecture.]

In general, each logical memory address can reside in only one memory module, alleviating cache coherence problems. This explains why only read-only caches are used at the clusters. The Master TCU runs serial code, or more generally the serial mode of XMT. When it hits a Spawn command, it initiates parallel mode by broadcasting the same SPMD parallel code segment to all the TCUs. As each TCU captures its copy, it executes it based on a thread-id assigned to it. A separate distributed hardware system, reported in [NNTV03] but not shown in Figure 4, ensures that all the thread-ids mandated by the current Spawn command are allocated to the TCUs. A sufficient part of this allocation is done dynamically, to ensure that no TCU needs to execute more than one thread-id while another TCU is already idle. A program in the high-level PRAM-on-chip programming model needs to be translated by an optimizing compiler in order to take advantage of features of the architecture. A program in the execution model could include prefetch instructions, as well as broadcast instructions, where some values needed by all, or nearly all, TCUs are broadcast to all. More advanced optimizations, such as combining shorter virtual threads into a longer thread (a mechanism called thread clustering), are also considered at this optimization stage. If the programming model allows nested parallelism, the compiler will use the mechanisms supported by the architecture to implement or emulate it. Compiler optimizations and issues such as nesting and thread clustering are discussed in Section 4. To evaluate the performance of a program in this model, we use an extension of the notions of work and depth to include measurements appropriate for an execution model, and then proceed to give a formula for estimating execution time based on them.
The depth of an application in the PRAM-on-chip Execution model must include the following three quantities: (i) Computation Depth, given by the number of operations that have to be performed sequentially, either by a thread or while in serial mode. (ii) Length of Sequence of Round-Trips to Memory (or LSRTM), which represents the number of cycles on the critical path spent by execution units waiting for data from memory. A read request from a TCU usually causes a round-trip to memory (or RTM); memory writes in general proceed without acknowledgment and are thus not counted as round-trips, but ending a parallel section implies one RTM, used to flush all the data still in the interconnection network to the memory. (iii) Queuing Delay (or QD), which is caused by concurrent requests to the same memory location; the response time is proportional to the size of the queue. The prefix-sum ps() primitive is supported by a special hardware unit that combines ps() calls from multiple threads into a single multi-operand prefix-sum operation. In one thread, a ps() instruction causes one RTM and no queuing delay. In addition, a prefix-sum-to-memory instruction, psm(), is supported. Its syntax is similar to the ps() instruction except that the base variable is a memory location instead of a global register. This instruction is executed by queued updates to the memory location rather than by special hardware, due to the difficulty of creating multi-operand hardware that would operate on arbitrary memory locations. The psm() command costs 1 RTM and additionally has a queuing delay equal to the number of threads calling psm() on the same location. We can now define the PRAM-on-chip execution depth and execution time. PRAM-On-Chip Execution Depth represents the time spent on the critical path (that is, the time assuming an unlimited amount of hardware) and is the sum of the PRAM computation depth, LSRTM, and QD on the critical path. Assuming that a round-trip to memory takes R cycles:

Execution Depth = Computation Depth + LSRTM * R + QD    (1)

Sometimes more Work (the total number of instructions executed) can be executed in parallel than what the hardware can handle concurrently. For the additional time spent executing operations outside the critical path (i.e. beyond the Execution Depth), the work of each parallel section needs to be considered separately. Suppose that one such parallel section could employ in parallel up to p_i TCUs, and let Work_i = p_i * ComputationDepth_i be the total computation work of parallel section i. If our architecture has p TCUs and p_i < p, we will be able to use only p_i of them, while if p_i >= p, only p TCUs can be used to start the threads, and the remaining p_i - p threads will be allocated to TCUs as they become available; each concurrent allocation of p threads to TCUs is charged as one RTM to the Execution Time, as denoted by relation 2. The total time spent executing instructions outside the critical path over all parallel sections is given in relation 3.
ThreadStartOverhead_i = (p_i / p) * R    (2)

Additional Work = sum over spawn blocks i of [ Work_i / min(p, p_i) + ThreadStartOverhead_i ]    (3)

Adding up, the execution time of the entire program is:

Execution Time = Execution Depth + Additional Work    (4)

2.5 Clarifications of the modeling

Our model of performance attempts to distill the major factors affecting runtime specifically for the PRAM-On-Chip platform. The performance modeling for PRAM-On-Chip has the advantage of being close to the Work-Depth algorithmic framework, with additional accounting for memory costs using the LSRTM and QD. First, we would like to present a somewhat subtle point: following the path from the HLWD model to the PRAM-On-Chip models in Figure 1 may be important not only for the purpose of developing a PRAM-On-Chip program, but also for optimizing performance. Note that bandwidth is not accounted for in the PRAM-On-Chip performance modeling, since a PRAM-On-Chip architecture should be able to provide sufficient bandwidth for an algorithm that is efficient in the Work-Depth model. In other words, the only way in which our modeling accounts for bandwidth is indirect: by first screening an algorithm through the Work-Depth performance modeling, where we account for work. Let us examine what could happen if PRAM-On-Chip performance modeling is not coupled with Work-Time performance modeling. The program could include excessive speculative prefetching to supposedly improve performance (reduce LSRTM). The subtle point is that the extra prefetches add to the overall work count; in other words, accounting for them in the Work-Depth model prevents this loophole. It is also important to recognize that the model abstracts away some significant details. The PRAM-On-Chip hardware has a limited number of memory modules, and if multiple requests attempt to access the same module, queuing will occur. Although the model accounts for queuing to the same memory location, it does not account for queuing that may occur for accesses to different locations (in the same module).
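To make relations (1)-(3) concrete, the cost estimate can be sketched in a few lines of C. The function names and the truncating integer arithmetic for the p_i/p term are our illustrative choices, not part of the model's definition:

```c
#include <assert.h>

#define R 24   /* cycles per round-trip to memory, the value adopted later in the paper */

/* Relation (1): critical-path cycles, given computation depth,
 * number of round-trips on the critical path, and queuing delay. */
int execution_depth(int comp_depth, int lsrtm, int qd) {
    return comp_depth + lsrtm * R + qd;
}

/* One term of relation (3): additional work of a parallel section with
 * p_i potential threads and total work work_i, on p physical TCUs.
 * The (p_i / p) * R part is relation (2), truncated to an int. */
int additional_work(int work_i, int p_i, int p) {
    int usable = p_i < p ? p_i : p;          /* min(p, p_i) */
    int thread_start = (p_i * R) / p;        /* relation (2) */
    return work_i / usable + thread_start;
}
```

Summing additional_work over all spawn blocks and adding execution_depth gives relation (4).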
However, hashing memory addresses among modules lessens the problems that would occur for accesses with high spatial locality and generally mitigates this type of hot spot. If functional units within a cluster are shared between the TCUs, threads can be delayed while waiting for functional units to become available. The model also does not account for these delays. To some limited extent, the effect of these approximations on running times can be observed from the experimental results in section 11, where a comparison with simulations is presented. Similar to some serial performance modeling, the above modeling assumes that data is found in the (shared) caches. This allows proper comparison to serial computing where data is found in the cache, as the number of clocks to reach the cache for PRAM-On-Chip is assumed to be significantly higher than in serial computing; for example, our prototype XMT architecture suggests values that range between 6 and 24 cycles for a round-trip to the first level of cache, depending on the characteristics of the interconnection network and its load level; we took the conservative approach of using the value R = 24 cycles for one RTM for the rest of this paper. We note that the number of clocks to access main memory should be about the same as for serial computing, and that large caches can be built both for serial computing and for PRAM-On-Chip. However, this modeling is inappropriate if PRAM-On-Chip is to be compared to the Cray MTA, where no shared caches are used: for the MTA the number of clocks to access main memory is important, and it would then not be appropriate to omit this figure for cache misses on PRAM-On-Chip. Note that some of the computation work is counted twice in our Execution Time, once as part of the critical path under Execution Depth and once in the Additional Work factor. We could further refine our analysis and propose a more accurate model, but with much more involved modeling. For the sake of clarity, we chose to stop at the level of detail that allows for a concise presentation while providing relevant results.
Other researchers who worked on performance modeling of parallel algorithms have typically focused on factors different from those we have identified here, because they dealt with other platforms. Helman and JáJá [HJ99] measured the complexity of algorithms running on SMPs using the triplet of maximum number of non-contiguous accesses by any processor to main memory, number of barrier synchronizations, and local computation cost. However, these quantities are less important in a PRAM-like environment. Bader, Cong, and Feo [BCF05] found that in some experiments on the Cray MTA, the costs of non-contiguous memory access and barrier synchronization were reduced almost to zero by multithreading, and that performance was best modeled by computation alone. For the latest generation of the MTA architecture, researchers have developed a calculator for performance that includes the parameters of count of trips to memory, number of instructions, and number of accesses to local memory [FHKK05]. Our measures are still different, because the RTMs that we count are round trips to the shared cache, and we also count queuing at the shared cache. In addition, we consider the effect of optimizations such as prefetching and thread clustering. Nevertheless, the calculator should provide an interesting basis for comparison between the performance of applications on the MTA and on PRAM-On-Chip.

3 An Example for Using the Methodology: Summation

Consider the problem of computing the sum of n numbers. Given as input an array A of size n, the output provides the sum of its values. Developing a parallel program for this simple problem is presented next as an example of the methodology of the previous section, progressing through the models. A High-Level Work-Depth description of the algorithm is presented in figure 5.a. A non-recursive Work-Depth presentation of this algorithm can be derived from it, as presented in figure 5.b.
In the WD algorithm, we use a one-dimensional array to store all the elements of the tree, as shown in figure 6. For the more general case of a complete k-ary tree, we store the root at element 0, followed by the k elements of the first level, listed from left to right, then the k^2 elements of the second level, etc. The array is densely packed, with no gaps, thus (a) the children of node i are at indices k*i + 1, k*i + 2, ..., k*i + k and (b) the parent of node i is at index (i-1)/k (rounded down). Note that this simple relationship between a node and its children is helpful for improving performance.

SUM(A, n)
    If n = 1 then sum = A[1]; exit
    For 1 <= i <= n/2 pardo
        B[i] = A[2i-1] + A[2i]
    Call SUM(B, n/2)
(a)

For 1 <= i <= n pardo          // B is a 1D array
    B[n-1+i] = A[i]            // representation of a tree
For h = log n to 1 do
    For 2^(h-1) <= i < 2^h pardo
        B[i] = B[2i] + B[2i+1]
sum = B[1]
(b)

Figure 5: The Summation Algorithm. (a) A High-Level Work-Depth presentation. Pairs of values of A are summed up and stored into array B, followed by a recursive call on array B. (b) A Work-Depth description.

Figure 6: The array representation of a complete ternary tree. The array is densely packed, with the root coming first, then the elements at level 1, and then the elements at level 2.

We now proceed to express this algorithm in the PRAM-On-Chip Programming Model. Note that the WD algorithm uses a balanced binary tree approach, repeatedly adding pairs of values in parallel. Alternatively, k values can be summed serially; this constitutes a k-ary tree approach. The k-ary tree is shorter when k > 2, having log_k n instead of log_2 n levels; this reduces the number of iterations at the cost of increased iteration complexity. The optimum k is chosen as the value that minimizes the estimated running time in the performance model for a particular n. The k-ary tree is represented as a 1D array in the complete tree representation, similar to the Work-Depth description. The PRAM-on-chip implementation of this algorithm is presented in figure 3.b using the XMTC programming language. We will consider the performance of the algorithm in the PRAM-On-Chip Execution Model in Section 4.4, after describing compiler optimizations.

4 Compiler Optimizations

Given a program in the PRAM-On-Chip Programming Model, an optimizing compiler can perform various transformations on it to better fit the target PRAM-On-Chip Execution Model and reduce execution time. We describe several possible optimizations and demonstrate their effect using the Summation algorithm described above.

4.1 Nested Parallel Sections

Quite a few PRAM algorithms can be expressed with greater clarity and conciseness when nested parallelism is allowed [Ble96]. For this reason, nesting parallel sections with arbitrary numbers of threads needs to be allowed in the PRAM-On-Chip Programming Model.
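The index arithmetic behind figures 5(b) and 6 can be checked with a short serial C sketch; the helper names are ours, and the pardo levels of the Work-Depth description are replaced here by a single descending serial loop (valid because every parent index is smaller than its children's indices):

```c
#include <assert.h>

/* Serial rendering of the Work-Depth summation of figure 5(b), assuming n is
 * a power of two: leaves A[1..n] are copied into B[n..2n-1], each internal
 * node i gets B[2i] + B[2i+1], and B[1] ends up holding the total. B needs
 * 2n entries; index 0 is unused, matching the 1-based figure. */
int tree_sum(const int *A, int n, int *B) {
    for (int i = 1; i <= n; i++)
        B[n - 1 + i] = A[i];                   /* leaves */
    for (int i = n - 1; i >= 1; i--)           /* serial stand-in for the pardo levels */
        B[i] = B[2 * i] + B[2 * i + 1];
    return B[1];
}

/* Index relations of the densely packed k-ary layout of figure 6 (0-based root). */
int kary_child(int k, int i, int c) { return k * i + 1 + c; }   /* 0 <= c < k */
int kary_parent(int k, int i)       { return (i - 1) / k; }
```

Note that the two layouts use different conventions: figure 5(b) roots the binary tree at index 1, while figure 6 roots the k-ary tree at index 0.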
However, hardware implementation of nesting is not free, and the programmer needs to be aware of the implementation overheads. In order to explain a key implementation problem, we need to review the hardware mechanism that allocates code threads to the physical TCUs. Consider an SPMD parallel code section that starts with a spawn(1,n) command, where each of the n threads ends with a join command and there are no nested spawns. As noted before, the Master TCU broadcasts the parallel code section to all TCUs. In addition it broadcasts the number n to all TCUs. TCU i, 1 <= i <= p, will check whether i > n, and if not it will execute thread i; once TCU i hits a join, it executes a special system ps() command with an increment of 1 relative to a counter that holds the number of threads started so far; denote the result it gets back by j; if j > n, TCU i is done, and if not it will execute thread j; this process is repeated each time a TCU hits a join, until all TCUs are done, when a transition back into serial mode occurs. Allowing nesting of spawn() commands would require: (i) Upgrading this thread allocation mechanism. First, the number n representing the total number of threads will be repeatedly updated and broadcast to the TCUs. (ii) Since a TCU gets just an integer result through the system ps() command, more information is needed to link this integer to a new thread that needs to execute. In addition, we need to facilitate a way for the parent (spawning) thread to forward initialization data to a child (spawned) thread. In our prototype XMT PRAM-On-Chip Programming Model, we allow nested spawns of a small fixed number of threads through the single-spawn and k-spawn instructions; sspawn() starts one single additional thread, while kspawn() starts exactly k threads, where k is a small constant (such as 2 or 4). Each of these instructions causes a delay of one RTM before the parent can proceed, and an additional delay of 1-2 RTMs before the child thread can proceed (or actually get started). Suppose that a parent thread wants to create another thread whose virtual thread number (as referenced from the SPMD code) is v. First, the parent uses a prefix-sum instruction to a global thread-counter register to create a unique thread ID i for the child. The parent then enters the value v in A(i), where A is a specially designated array in memory. As a result of the parent thread executing an sspawn (or a kspawn command, see below): (i) n will be incremented, and at some point in the future (ii) the thread allocation mechanism will generate virtual thread i. The program for thread i starts by reading v through A(i); it can then be programmed to use v as its effective thread ID. An algorithm that could benefit from nested spawns is the BFS algorithm. Each iteration of the algorithm takes as input L_{i-1}, the vertices whose distance from starting vertex s is i-1, and outputs L_i. As noted in section 2.2, a simple way to do this is to spawn one thread for each vertex in L_{i-1}, and have each thread spawn as many threads as the number of its edges, one per edge. In the BFS example, the parent thread needs to pass information, such as which edge to traverse, to child threads.
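The single-spawn bookkeeping just described can be sketched in C. For illustration only, the hardware prefix-sum to the global thread-counter register is modeled as a plain (non-atomic, single-threaded) fetch-and-add; the names sspawn, A, and thread_counter mirror the text's notation rather than the actual XMT instruction set:

```c
#include <assert.h>

enum { MAX_THREADS = 64 };
static int thread_counter;        /* models the global thread-counter register */
static int A[MAX_THREADS];        /* A(i) holds the child's effective thread ID v */

/* Parent side of a single-spawn: obtain a unique ID i for the child via a
 * fetch-and-add (standing in for the hardware ps() to the counter register),
 * then deposit the child's virtual thread number v in A(i). */
int sspawn(int v) {
    int i = thread_counter++;     /* ps(thread_counter, 1): returns old value */
    A[i] = v;                     /* non-blocking write the child will read */
    return i;
}

/* Child side: the first action of generated thread i is to read v through A(i). */
int child_effective_id(int i) {
    return A[i];
}
```

In the real architecture the child's read may arrive before the parent's write commits, which is why the sleep-waiting primitive described next is needed; this single-threaded sketch cannot exhibit that race.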
To pass data to the child, the parent writes data in memory at locations indexed by the child's ID, using non-blocking writes (namely, the parent sends out a write request and can proceed immediately to its next instruction, without waiting for any confirmation that the write has completed). Since it is possible that the child tries to read this data before it is available, it should be possible to recognize that the data is not yet there and to wait until the data is committed to memory. One possible solution for that is described in the next paragraph. The kspawn instruction uses a prefix-sum instruction with increment k to get k thread IDs and proceeds similarly; the delays on the parent and children threads are similar, though a few additional cycles are required for the parent to initialize the data for all k children. When starting threads using single-spawn or k-spawn, a synchronization step between the parent and the child is necessary to ensure the proper initialization of the latter. Since we would rather not use a busy-wait synchronization technique that could overload the interconnection network and waste power, our envisioned PRAM-on-chip architecture would include a special primitive, called sleep-waiting: the memory system holds the read request from the child thread until the data is actually committed by the parent thread, and only then satisfies the request. When advancing from the programming to the execution model, a compiler can automatically transform a nested spawn of n threads, where n can be any number, into a recursive application of single-spawns (or k-spawns). The recursive application divides much of the task of spawning n threads among the newly spawned threads. When a thread starts a new child, it assigns to it half (or 1/(k+1) for k-spawn) of the n-1 remaining threads that still need to be spawned. This process proceeds in a recursive manner.
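Under the assumption that every live thread performs one single-spawn per round and hands half of its remaining quota to the child, the number of live threads roughly doubles each round; the following C sketch (names ours, a simplification of the recursive transformation) counts the rounds needed to reach n threads:

```c
#include <assert.h>

/* Number of single-spawn rounds until n threads exist, assuming every live
 * thread spawns one child per round (so the population doubles each round).
 * This equals ceil(log2 n), which is why the transformation adds only a
 * logarithmic number of sequential spawn steps. */
int spawn_rounds(int n) {
    int live = 1, rounds = 0;
    while (live < n) {
        live *= 2;     /* each live thread performs one sspawn */
        rounds++;
    }
    return rounds;
}
```

With kspawn the population multiplies by k+1 per round instead, reducing the round count to roughly log base k+1 of n.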
4.2 Clustering

The PRAM-On-Chip Programming Model allows spawning an arbitrary number of virtual threads, but the architecture has only a limited number of TCUs to run these threads. In the progression from the Programming Model to the Execution Model, we often need to choose between two options: spawn fewer threads, each doing more computation, or run the shorter threads as is. Combining short threads into a longer thread is called clustering and offers several advantages: (a) we can pipeline memory accesses that had previously been in separate threads; this can reduce extra costs from serialization of RTMs and QDs that are not on the critical path; (b) spawning fewer threads means reducing thread allocation overheads, i.e. the time required to start a new thread on a recently freed TCU; (c) each spawned thread (even one that is waiting for a TCU) usually takes up space in the system memory, to store the local data for the thread. If the code provides fewer threads than the hardware can support, there are fewer advantages, if any, to using fewer longer threads. Also, running fewer, longer threads


More information

The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing

The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing Mikael Taveniku 2,3, Anders Åhlander 1,3, Magnus Jonsson 1 and Bertil Svensson 1,2

More information

Efficient Sequence Generator Mining and its Application in Classification

Efficient Sequence Generator Mining and its Application in Classification Efficient Sequence Generator Mining and its Alication in Classification Chuancong Gao, Jianyong Wang 2, Yukai He 3 and Lizhu Zhou 4 Tsinghua University, Beijing 0084, China {gaocc07, heyk05 3 }@mails.tsinghua.edu.cn,

More information

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Journal of Comuting and Information Technology - CIT 8, 2000, 1, 1 12 1 Comlexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Eunice E. Santos Deartment of Electrical

More information

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY Andrew Lam 1, Steven J.E. Wilton 1, Phili Leong 2, Wayne Luk 3 1 Elec. and Com. Engineering 2 Comuter Science

More information

EE678 Application Presentation Content Based Image Retrieval Using Wavelets

EE678 Application Presentation Content Based Image Retrieval Using Wavelets EE678 Alication Presentation Content Based Image Retrieval Using Wavelets Grou Members: Megha Pandey megha@ee. iitb.ac.in 02d07006 Gaurav Boob gb@ee.iitb.ac.in 02d07008 Abstract: We focus here on an effective

More information

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Stehan Baumann, Kai-Uwe Sattler Databases and Information Systems Grou Technische Universität Ilmenau, Ilmenau, Germany

More information

1.5 Case Study. dynamic connectivity quick find quick union improvements applications

1.5 Case Study. dynamic connectivity quick find quick union improvements applications . Case Study dynamic connectivity quick find quick union imrovements alications Subtext of today s lecture (and this course) Stes to develoing a usable algorithm. Model the roblem. Find an algorithm to

More information

Using Rational Numbers and Parallel Computing to Efficiently Avoid Round-off Errors on Map Simplification

Using Rational Numbers and Parallel Computing to Efficiently Avoid Round-off Errors on Map Simplification Using Rational Numbers and Parallel Comuting to Efficiently Avoid Round-off Errors on Ma Simlification Maurício G. Grui 1, Salles V. G. de Magalhães 1,2, Marcus V. A. Andrade 1, W. Randolh Franklin 2,

More information

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method ITB J. Eng. Sci. Vol. 39 B, No. 1, 007, 1-19 1 Leak Detection Modeling and Simulation for Oil Pieline with Artificial Intelligence Method Pudjo Sukarno 1, Kuntjoro Adji Sidarto, Amoranto Trisnobudi 3,

More information

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka

Parallel Construction of Multidimensional Binary Search Trees. Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka Parallel Construction of Multidimensional Binary Search Trees Ibraheem Al-furaih, Srinivas Aluru, Sanjay Goil Sanjay Ranka School of CIS and School of CISE Northeast Parallel Architectures Center Syracuse

More information

Improving Trust Estimates in Planning Domains with Rare Failure Events

Improving Trust Estimates in Planning Domains with Rare Failure Events Imroving Trust Estimates in Planning Domains with Rare Failure Events Colin M. Potts and Kurt D. Krebsbach Det. of Mathematics and Comuter Science Lawrence University Aleton, Wisconsin 54911 USA {colin.m.otts,

More information

Optimizing Dynamic Memory Management!

Optimizing Dynamic Memory Management! Otimizing Dynamic Memory Management! 1 Goals of this Lecture! Hel you learn about:" Details of K&R hea mgr" Hea mgr otimizations related to Assignment #6" Faster free() via doubly-linked list, redundant

More information

Truth Trees. Truth Tree Fundamentals

Truth Trees. Truth Tree Fundamentals Truth Trees 1 True Tree Fundamentals 2 Testing Grous of Statements for Consistency 3 Testing Arguments in Proositional Logic 4 Proving Invalidity in Predicate Logic Answers to Selected Exercises Truth

More information

Extracting Optimal Paths from Roadmaps for Motion Planning

Extracting Optimal Paths from Roadmaps for Motion Planning Extracting Otimal Paths from Roadmas for Motion Planning Jinsuck Kim Roger A. Pearce Nancy M. Amato Deartment of Comuter Science Texas A&M University College Station, TX 843 jinsuckk,ra231,amato @cs.tamu.edu

More information

This version of the software

This version of the software Sage Estimating (SQL) (formerly Sage Timberline Estimating) SQL Server Guide Version 16.11 This is a ublication of Sage Software, Inc. 2015 The Sage Grou lc or its licensors. All rights reserved. Sage,

More information

Collective communication: theory, practice, and experience

Collective communication: theory, practice, and experience CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Comutat.: Pract. Exer. 2007; 19:1749 1783 Published online 5 July 2007 in Wiley InterScience (www.interscience.wiley.com)..1206 Collective

More information

12) United States Patent 10) Patent No.: US 6,321,328 B1

12) United States Patent 10) Patent No.: US 6,321,328 B1 USOO6321328B1 12) United States Patent 10) Patent No.: 9 9 Kar et al. (45) Date of Patent: Nov. 20, 2001 (54) PROCESSOR HAVING DATA FOR 5,961,615 10/1999 Zaid... 710/54 SPECULATIVE LOADS 6,006,317 * 12/1999

More information

Sage Estimating. (formerly Sage Timberline Estimating) Getting Started Guide

Sage Estimating. (formerly Sage Timberline Estimating) Getting Started Guide Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide This is a ublication of Sage Software, Inc. Document Number 20001S14030111ER 09/2012 2012 Sage Software, Inc. All rights reserved.

More information

Randomized Selection on the Hypercube 1

Randomized Selection on the Hypercube 1 Randomized Selection on the Hyercube 1 Sanguthevar Rajasekaran Det. of Com. and Info. Science and Engg. University of Florida Gainesville, FL 32611 ABSTRACT In this aer we resent randomized algorithms

More information

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics structure arises in many alications of geometry. The dual structure, called a Delaunay triangulation also has many interesting roerties. Figure 3: Voronoi diagram and Delaunay triangulation. Search: Geometric

More information

An integrated system for virtual scene rendering, stereo reconstruction, and accuracy estimation.

An integrated system for virtual scene rendering, stereo reconstruction, and accuracy estimation. An integrated system for virtual scene rendering, stereo reconstruction, and accuracy estimation. Marichal-Hernández J.G., Pérez Nava F*., osa F., estreo., odríguez-amos J.M. Universidad de La Laguna,

More information

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScrit Objects Shiyi Wei and Barbara G. Ryder Deartment of Comuter Science, Virginia Tech, Blacksburg, VA, USA. {wei,ryder}@cs.vt.edu

More information

Tiling for Performance Tuning on Different Models of GPUs

Tiling for Performance Tuning on Different Models of GPUs Tiling for Performance Tuning on Different Models of GPUs Chang Xu Deartment of Information Engineering Zhejiang Business Technology Institute Ningbo, China colin.xu198@gmail.com Steven R. Kirk, Samantha

More information

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal International Journal of Information and Electronics Engineering, Vol. 1, No. 1, July 011 An Efficient VLSI Architecture for Adative Rank Order Filter for Image Noise Removal M. C Hanumantharaju, M. Ravishankar,

More information

Hardware-Accelerated Formal Verification

Hardware-Accelerated Formal Verification Hardare-Accelerated Formal Verification Hiroaki Yoshida, Satoshi Morishita 3 Masahiro Fujita,. VLSI Design and Education Center (VDEC), University of Tokyo. CREST, Jaan Science and Technology Agency 3.

More information

Learning Robust Locality Preserving Projection via p-order Minimization

Learning Robust Locality Preserving Projection via p-order Minimization Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Learning Robust Locality Preserving Projection via -Order Minimization Hua Wang, Feiing Nie, Heng Huang Deartment of Electrical

More information

has been retired This version of the software Sage Timberline Office Get Started Document Management 9.8 NOTICE

has been retired This version of the software Sage Timberline Office Get Started Document Management 9.8 NOTICE This version of the software has been retired Sage Timberline Office Get Started Document Management 9.8 NOTICE This document and the Sage Timberline Office software may be used only in accordance with

More information

Privacy Preserving Moving KNN Queries

Privacy Preserving Moving KNN Queries Privacy Preserving Moving KNN Queries arxiv:4.76v [cs.db] 4 Ar Tanzima Hashem Lars Kulik Rui Zhang National ICT Australia, Deartment of Comuter Science and Software Engineering University of Melbourne,

More information

PRO: a Model for Parallel Resource-Optimal Computation

PRO: a Model for Parallel Resource-Optimal Computation PRO: a Model for Parallel Resource-Otimal Comutation Assefaw Hadish Gebremedhin Isabelle Guérin Lassous Jens Gustedt Jan Arne Telle Abstract We resent a new arallel comutation model that enables the design

More information

Interactive Image Segmentation

Interactive Image Segmentation Interactive Image Segmentation Fahim Mannan (260 266 294) Abstract This reort resents the roject work done based on Boykov and Jolly s interactive grah cuts based N-D image segmentation algorithm([1]).

More information

Sage Document Management Version 17.1

Sage Document Management Version 17.1 Sage Document Management Version 17.1 User's Guide This is a ublication of Sage Software, Inc. 2017 The Sage Grou lc or its licensors. All rights reserved. Sage, Sage logos, and Sage roduct and service

More information

Constrained Path Optimisation for Underground Mine Layout

Constrained Path Optimisation for Underground Mine Layout Constrained Path Otimisation for Underground Mine Layout M. Brazil P.A. Grossman D.H. Lee J.H. Rubinstein D.A. Thomas N.C. Wormald Abstract The major infrastructure comonent reuired to develo an underground

More information

Lecture 3: Geometric Algorithms(Convex sets, Divide & Conquer Algo.)

Lecture 3: Geometric Algorithms(Convex sets, Divide & Conquer Algo.) Advanced Algorithms Fall 2015 Lecture 3: Geometric Algorithms(Convex sets, Divide & Conuer Algo.) Faculty: K.R. Chowdhary : Professor of CS Disclaimer: These notes have not been subjected to the usual

More information

Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide. Version has been retired. This version of the software

Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide. Version has been retired. This version of the software Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide Version 14.12 This version of the software has been retired This is a ublication of Sage Software, Inc. Coyright 2014. Sage Software,

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

Sage Estimating (SQL) (formerly Sage Timberline Estimating) Installation and Administration Guide. Version 16.11

Sage Estimating (SQL) (formerly Sage Timberline Estimating) Installation and Administration Guide. Version 16.11 Sage Estimating (SQL) (formerly Sage Timberline Estimating) Installation and Administration Guide Version 16.11 This is a ublication of Sage Software, Inc. 2016 The Sage Grou lc or its licensors. All rights

More information

Optimization of Collective Communication Operations in MPICH

Optimization of Collective Communication Operations in MPICH To be ublished in the International Journal of High Performance Comuting Alications, 5. c Sage Publications. Otimization of Collective Communication Oerations in MPICH Rajeev Thakur Rolf Rabenseifner William

More information

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level,

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level, [9] J. J. Dongarra, R. Hemel, A. J. G. Hey, and D. W. Walker, \A Proosal for a User-Level, Message Passing Interface in a Distributed-Memory Environment," Tech. Re. TM-3, Oak Ridge National Laboratory,

More information

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs The International Journal of Parallel, Emergent and Distributed Systems Vol. 00, No. 00, Month 2011, 1 22 RESEARCH ARTICLE Simle Memory Machine Models for GPUs Koji Nakano a a Deartment of Information

More information

Control plane and data plane. Computing systems now. Glacial process of innovation made worse by standards process. Computing systems once upon a time

Control plane and data plane. Computing systems now. Glacial process of innovation made worse by standards process. Computing systems once upon a time Classical work Architecture A A A Intro to SDN A A Oerating A Secialized Packet A A Oerating Secialized Packet A A A Oerating A Secialized Packet A A Oerating A Secialized Packet Oerating Secialized Packet

More information

Relations with Relation Names as Arguments: Algebra and Calculus. Kenneth A. Ross. Columbia University.

Relations with Relation Names as Arguments: Algebra and Calculus. Kenneth A. Ross. Columbia University. Relations with Relation Names as Arguments: Algebra and Calculus Kenneth A. Ross Columbia University kar@cs.columbia.edu Abstract We consider a version of the relational model in which relation names may

More information

521493S Computer Graphics Exercise 3 (Chapters 6-8)

521493S Computer Graphics Exercise 3 (Chapters 6-8) 521493S Comuter Grahics Exercise 3 (Chaters 6-8) 1 Most grahics systems and APIs use the simle lighting and reflection models that we introduced for olygon rendering Describe the ways in which each of

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans Available online at htt://ijdea.srbiau.ac.ir Int. J. Data Enveloment Analysis (ISSN 2345-458X) Vol.5, No.2, Year 2017 Article ID IJDEA-00422, 12 ages Research Article International Journal of Data Enveloment

More information

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model Ad Hoc Networks (4) 5 68 Contents lists available at SciVerse ScienceDirect Ad Hoc Networks journal homeage: www.elsevier.com/locate/adhoc Latency-minimizing data aggregation in wireless sensor networks

More information

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH Jin Lu, José M. F. Moura, and Urs Niesen Deartment of Electrical and Comuter Engineering Carnegie Mellon University, Pittsburgh, PA 15213 jinlu, moura@ece.cmu.edu

More information

Non-Strict Independence-Based Program Parallelization Using Sharing and Freeness Information

Non-Strict Independence-Based Program Parallelization Using Sharing and Freeness Information Non-Strict Indeendence-Based Program Parallelization Using Sharing and Freeness Information Daniel Cabeza Gras 1 and Manuel V. Hermenegildo 1,2 Abstract The current ubiuity of multi-core rocessors has

More information

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22 Collective Communication: Theory, Practice, and Exerience FLAME Working Note # Ernie Chan Marcel Heimlich Avi Purkayastha Robert van de Geijn Setember, 6 Abstract We discuss the design and high-erformance

More information

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip

A Metaheuristic Scheduler for Time Division Multiplexed Network-on-Chip Downloaded from orbit.dtu.dk on: Jan 25, 2019 A Metaheuristic Scheduler for Time Division Multilexed Network-on-Chi Sørensen, Rasmus Bo; Sarsø, Jens; Pedersen, Mark Ruvald; Højgaard, Jasur Publication

More information

Record Route IP Traceback: Combating DoS Attacks and the Variants

Record Route IP Traceback: Combating DoS Attacks and the Variants Record Route IP Traceback: Combating DoS Attacks and the Variants Abdullah Yasin Nur, Mehmet Engin Tozal University of Louisiana at Lafayette, Lafayette, LA, US ayasinnur@louisiana.edu, metozal@louisiana.edu

More information

A BICRITERION STEINER TREE PROBLEM ON GRAPH. Mirko VUJO[EVI], Milan STANOJEVI] 1. INTRODUCTION

A BICRITERION STEINER TREE PROBLEM ON GRAPH. Mirko VUJO[EVI], Milan STANOJEVI] 1. INTRODUCTION Yugoslav Journal of Oerations Research (00), umber, 5- A BICRITERIO STEIER TREE PROBLEM O GRAPH Mirko VUJO[EVI], Milan STAOJEVI] Laboratory for Oerational Research, Faculty of Organizational Sciences University

More information

A Model-Adaptable MOSFET Parameter Extraction System

A Model-Adaptable MOSFET Parameter Extraction System A Model-Adatable MOSFET Parameter Extraction System Masaki Kondo Hidetoshi Onodera Keikichi Tamaru Deartment of Electronics Faculty of Engineering, Kyoto University Kyoto 66-1, JAPAN Tel: +81-7-73-313

More information

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap Aeared in \Journal of Parallel and Distributed Comuting, July 1995 " Overlaing Comutations, Communications and I/O in Parallel Sorting y Mark J. Clement Michael J. Quinn Comuter Science Deartment Deartment

More information

CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE

CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE Petra Surynková Charles University in Prague, Faculty of Mathematics and Physics, Sokolovská 83,

More information

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree To aear in IEEE TKDE Title: Efficient Skyline and To-k Retrieval in Subsaces Keywords: Skyline, To-k, Subsace, B-tree Contact Author: Yufei Tao (taoyf@cse.cuhk.edu.hk) Deartment of Comuter Science and

More information

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Min Hu, Saad Ali and Mubarak Shah Comuter Vision Lab, University of Central Florida {mhu,sali,shah}@eecs.ucf.edu Abstract Learning tyical

More information