
Worker-Checker: A Framework for Run-time Parallelization on Multiprocessors

Kuang-Chih Liu    Chung-Ta King
Department of Computer Science
National Tsing Hua University
Hsinchu, Taiwan 300, R.O.C.
{kcliu,king}@cs.nthu.edu.tw

Abstract

Run-time parallelization is a technique for solving problems whose data access patterns are difficult to analyze at compile time. In this paper we propose a worker-checker framework to classify different run-time parallelization schemes. Under the framework, operations performed during run-time parallelization are classified loosely into a worker and a checker. Different schemes are then cast into the framework based on the relative execution order of their worker and checker. From the framework, we identified several new run-time parallelization methods. In the second part of the paper we then examine the implementation of one such method, derived from speculative parallelization [10]. The implementation is based on the idea of embedding hardware checkers inside memory controllers. We will present the design of the hardware checker and evaluate the effectiveness of the design on run-time parallelizing DOALL and DOACROSS loops.

Keywords: run-time parallelization, speculative parallelization, inspector-executor, irregular problem, smart memory

1 Introduction

Run-time parallelization is a technique for solving problems whose data access patterns are difficult to analyze at compile time. The programs of these problems usually have array elements that are accessed via nonlinear subscripts, pointers, or subscripted subscripts. Analyzing such program constructs has been a major challenge to current compiler techniques. A large class of problems, including molecular dynamics, fluid dynamics, astrophysics and device simulation, exhibit this characteristic. To exploit the parallelism inherent in these irregular applications, we need to resort to run-time solutions.

(Footnote: A preliminary version of this paper appears in Proc. of the Eighth IASTED International Conference on Parallel and Distributed Computing and Systems, Oct. …. This work was supported in part by National Science Council grants NSC …-E-… and NSC …-E-….)

Since loops usually consume the most execution time of a program and they potentially contain the most parallelism, the majority of existing run-time parallelization schemes focus on loops. There are two basic approaches to run-time parallelization of irregular loops: speculative parallelization [9, 10] and inspector-executor [2, 7, 11, 12, 13, 15]. Speculative parallelization is a run-time strategy in which the given loop is executed speculatively as if it were a DOALL loop. While the loop is executing, data accesses are recorded. At the end of the execution, the access records are checked to discover any violation of data dependences. If there is a violation, the execution is rolled back and the loop is executed sequentially again. Obviously, this approach can yield good results when the loop is in fact fully parallel.

Inspector-executor is a classic paradigm for run-time parallelization. Given a sequential loop, the compiler (or programmer) generates two pieces of code from the loop: an inspector and an executor. The inspector examines data access patterns and generates a schedule for executing the loop. The executors use the schedule to access data and synchronize with each other so as to execute the operations in parallel, as the sketch below illustrates.
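To make the paradigm concrete, here is a minimal, self-contained C sketch of the inspector-executor idea, not taken from any of the cited systems. It assumes a loop of the form A[w[i]] = f(A[r[i]]) whose index arrays are known only at run time; all names (inspect, execute, the stage array) are illustrative, and the stage assignment is deliberately conservative: it orders any two iterations that touch a common element, even read-after-read.

#include <stdio.h>

#define N 8          /* iterations */
#define NELEM 8      /* size of the shared array A */

/* Inspector: assign each iteration to a wavefront (stage).  An iteration
   must run after every earlier iteration that touched one of its elements,
   so its stage is 1 + the largest stage recorded for those elements. */
static void inspect(const int w[], const int r[], int stage[]) {
    int last_stage[NELEM] = {0};   /* per-element bookkeeping */
    for (int i = 0; i < N; i++) {
        int s = last_stage[w[i]];
        if (last_stage[r[i]] > s) s = last_stage[r[i]];
        stage[i] = s + 1;
        last_stage[w[i]] = stage[i];
        last_stage[r[i]] = stage[i];
    }
}

/* Executor: run the stages in order; iterations within one stage are
   independent and could run in parallel (sequential here for brevity). */
static void execute(double A[], const int w[], const int r[],
                    const int stage[], int max_stage) {
    for (int s = 1; s <= max_stage; s++)
        for (int i = 0; i < N; i++)
            if (stage[i] == s)
                A[w[i]] = A[r[i]] + 1.0;   /* the loop body */
}

int main(void) {
    double A[NELEM] = {0};
    int w[N] = {0, 1, 2, 0, 3, 1, 2, 3};   /* run-time write indices */
    int r[N] = {1, 2, 3, 2, 0, 0, 1, 2};   /* run-time read indices  */
    int stage[N], max_stage = 0;
    inspect(w, r, stage);
    for (int i = 0; i < N; i++)
        if (stage[i] > max_stage) max_stage = stage[i];
    execute(A, w, r, stage, max_stage);
    printf("stages:");
    for (int i = 0; i < N; i++) printf(" %d", stage[i]);
    printf("\n");
    return 0;
}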

Although these two approaches look different, they in fact have many common characteristics. For example, in both approaches there are operations for recording, analyzing, and checking dependences. There are also operations for actually evaluating the loop iterations. If we could come up with a proper taxonomy to classify different run-time parallelization schemes, then we might be able to observe these strategies more closely and perhaps identify other opportunities for parallelization. In this paper we propose one such taxonomy. From the taxonomy several variations of the two basic run-time parallelization strategies are identified, and in the latter part of the paper we will examine one variation in detail.

In order to come up with a simple but useful classification, we partition the activities in a run-time parallelization scheme loosely into two general classes: worker and checker. The worker contains roughly the operations involved in executing the actual loop iterations, while the checker contains the remaining operations, including all bookkeeping and checking operations. Different run-time parallelization schemes are then classified based on the relative execution order of their worker and checker. In our framework, the execution order may be strict, interleaved, or overlapped. According to this taxonomy, the speculative parallelization scheme proposed in [9, 10] is a strict worker-then-checker strategy, while the classic inspector-executor scheme [15, 12] is a strict checker-then-worker strategy. Details of the taxonomy will be described in Section 2.

One use of the taxonomy is to identify new opportunities for performing run-time parallelization. From our classification we can see that the operations done in the checker are mostly overheads: they do not perform the actual computations. Although previous approaches have considered the parallelization of the checker and/or the worker, these two classes of activities are primarily performed sequentially. The opportunity of overlapping the execution of the two, so as to hide the overheads in the checker, was not exploited. In this paper, we will investigate such opportunities and examine one possible solution.

Our proposed scheme extends speculative parallelization and supports overlapped execution of worker and checker with smart memories. The basic idea is to put some checker functions into the memory controllers, so that the memories can play the role of the checker while the processors perform the work of the workers. With architectural support, the operations in the checker and the worker are overlapped naturally. In addition, we can also avoid the possibly long execution time of a software implementation of the checker. We will show that the checker logic added to the memory controllers is quite simple but is very effective for improving the performance of run-time parallelization. Our simulation results also show that the extra circuits in the controllers help to exploit run-time loop-level parallelism but do not degrade memory performance much.

The contributions of this paper are twofold. First, we propose the worker-checker framework, which gives directions for improving current techniques and identifies possible overheads. Second, we demonstrate a smart memory design for run-time parallelization based on the speculative parallelization strategy.

The rest of this paper is organized as follows: Section 2 provides background on related work and the worker-checker framework. In Section 3 we examine one worker-checker algorithm for speculative parallelization as a case study. Section 4 evaluates the performance of our design. Section 5 concludes the paper.

2 Worker-Checker Framework

2.1 Previous Run-time Parallelization Schemes and Their Overheads

In this subsection we briefly review previous run-time parallelization schemes. The main operations in these schemes are analyzed and classified, which serves as a basis for our worker-checker framework.

Possible overheads in these schemes are also identified, from which new approaches to run-time parallelization are motivated. To exploit loop-level parallelism at run time, there are two general approaches: speculative parallelization and inspector-executor. Their main operations are summarized in Table 1. For simplicity, speculative parallelization is assumed to be successful, so that there is no roll-back and re-execution.

Scheme                      | Phase      | Access Marking | Execution of Source Loop | Synchronization Primitives | Dependence Testing | Parallel Schedule
Speculative Parallelization | Specu.Run  |       x        |            x             |                            |                    |
                            | Specu.Test |                |                          |                            |         x          |
Inspector-Executor          | Inspector  |                |                          |                            |         x          |        x
                            | Executor   |                |            x             |             x              |                    |
                            |            |    (Worker)    |         (Worker)         |          (Worker)          |     (Checker)      |    (Checker)

Table 1: Typical operations in run-time parallelization.

Speculatively executing loops in parallel at run time was originally proposed by Rauchwerger and Padua [9, 10]. Their algorithm uses three shadow arrays A_r, A_w and A_np for recording array accesses. Two techniques, privatization and reduction parallelization, are used to enhance the chance of correct speculative execution. Their LRPD-test (or the PD-test in [9]) has shown promising results in parallelizing the PERFECT benchmarks.

Speculative parallelization typically consists of two phases: Specu.Run and Specu.Test. Specu.Run, while executing the source loop in parallel (Execution of Source Loop), uses some data structures to mark shared data references (Access Marking). If a loop-carried dependence is violated in the execution, called a dependence hazard, then Specu.Test can report this violation by analyzing these data structures (Dependence Testing) at the end of the execution. According to the characteristics of the operations done in each phase, Specu.Run can be classified loosely as performing the operations of the worker, while Specu.Test performs those of the checker. Note that in speculative parallelization, the checker follows the worker sequentially.

Next, let us analyze the overhead in speculative execution. Access marking in the worker causes extra memory references and consumes some CPU resources. However, most overheads are in the checker. In fact, in speculative parallelization, the checker is pure overhead.

This is because the checker performs only checking, not any computation of the loop. In addition, the operations of the checker and the worker are not overlapped. From the viewpoint of overhead hiding, it is possible to let the worker overlap with the checker. We shall discuss one implementation in detail in Section 3.

Zhu and Yew first proposed the inspector-executor strategy on shared-memory multiprocessors [15]. Saltz et al. extended the idea and contributed significantly to run-time scheduling of irregular loops, either on shared-memory multiprocessors [12, 13] or on distributed-memory multicomputers [4, 8]. Leung and Zahorjan proposed methods to parallelize the inspector by sectioning and bootstrapping [7]. Chen and Yew extended the results in [15] and discussed how to obtain optimal schedules in parallel [2]. A recent work by Rauchwerger and Padua improved previous results and discussed the ideas of interleaved and overlapped inspector-executor [11].

In inspector-executor, the inspector plays the role of the checker, and the executor plays the role of the worker. The checker has to analyze data dependences (Dependence Testing) and then prepare an execution schedule (Parallel Schedule) for the worker. The worker then executes the loop in parallel (Execution of Source Loop) according to the schedule. Unlike speculative parallelization, inspector-executor can handle partially parallel loops, i.e., DOACROSS loops. Overheads in inspector-executor include the time spent in the inspector and in synchronization (Synchronization Primitives) during the executor. As pointed out in [7], processor utilization and load balance are worthwhile criteria for achieving good performance without a globally optimal schedule. How these criteria can be addressed from the viewpoint of overhead hiding is an interesting research topic. We shall discuss this issue based on the worker-checker framework later.

2.2 The Worker-Checker Model

Under the worker-checker model, we can identify at least six different run-time parallelization schemes, as shown in Table 2.

Type | Scheme                           | Example                                  | References
1    | strict worker-then-checker       | speculative parallelization              | [9, 10]
2    | strict checker-then-worker       | inspector-executor                       | [2, 7, 12]
3    | interleaved worker-then-checker  | our algorithm                            | this paper
4    | interleaved checker-then-worker  | sectioning inspector-executor            | [7, 11]
5    | overlapped worker-then-checker   | hardware-supported checker               | this paper
6    | overlapped checker-then-worker   | dynamically scheduled inspector-executor | [11]

Table 2: Run-time parallelization schemes under the worker-checker framework.

This classification is based on the relative execution order of the worker and the checker. There are three possible orders: strict, interleaved, and overlapped. The six schemes further fall into two general categories: speculative parallelization and inspector-executor. Types 1, 3 and 5 belong to speculative parallelization, and Types 2, 4 and 6 belong to inspector-executor.

[Figure 1: Six types of run-time parallelization schemes. Panels (a) through (f) depict the six worker/checker execution orders listed in Table 2.]

2.2.1 Run-Time Parallelization Based on Speculative Parallelization

Strict worker-then-checker: The term strict means that there is a strict sequential execution order between the worker and the checker, i.e., the checker does not start until the worker finishes the speculative execution. Figure 1(a) illustrates such an execution order. Note, however, that the worker and the checker may themselves run in parallel. This scheme is the run-time parallelization method originally proposed in [9, 10]. Apparently, this scheme has the drawback that the worker always runs through the whole execution even if violations of dependences have occurred in the middle of the execution. This is because the checker has to wait until the worker finishes; it cannot check the hazards and inform the worker earlier.

Interleaved worker-then-checker: In this scheme the worker only works on a portion of the whole computation, called a chunk, and then the checker checks the partial results immediately.

If the checker does not discover any hazard, the worker continues with the next chunk of the computation. Otherwise, the current chunk is executed again sequentially. The execution order of the worker and the checker is shown in Figure 1(c). This scheme is a straightforward extension of the Type 1 scheme. It has the benefit that the checker can detect hazards earlier and needs only to test partial results. In addition, the worker only has to sequentially re-execute the chunks in which the checker fails. In this way, even DOACROSS loops can be executed with some degree of parallelism. In this paper, such an execution scheme will be referred to as the greedy-chunk approach; a sketch of its driver loop is given at the end of this subsection.

Overlapped worker-then-checker: In this scheme, the operations of the worker and the checker are overlapped, with the worker ahead of the checker. The greedy-chunk approach introduced above can also be used here. Again the whole computation is partitioned into chunks. While the worker is executing the current chunk, the checker simultaneously checks for hazards in the generated results. If a hazard is detected, the current chunk is re-executed sequentially. In this way, the overhead of executing the checker can be hidden by the worker. Figure 1(e) illustrates such an execution scenario.

It is apparent that in order for this scheme to work, the system must be able to execute the worker and the checker concurrently, either logically or physically. For example, in a system supporting multi-threading, the checker can be executed in one thread while the worker runs in another. Alternatively, the checker can be realized with hardware that works concurrently with the main CPU, which executes the worker. In the second part of this paper we will address the design issues of this hardware approach.

One critical issue in the Type 5 scheme is when the checkers are invoked. If the checkers are invoked at the end of each chunk, then this is equivalent to the Type 3 scheme. To increase the overlap between the worker and the checker, there are at least two possible strategies: check-on-iteration and check-on-reference. In the check-on-iteration strategy, the checkers are invoked after the workers have executed a number of iterations of the loop. On the other hand, in the check-on-reference strategy hazards are checked at each shared array reference. The difference between these two strategies is mainly a tradeoff between the degree of parallelism and the checker overhead. A fine-grain strategy such as check-on-reference will incur a large overhead in invoking the checkers but allows hazards to be detected as early as possible. Section 4 will give a comparison between these two strategies.
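The following is a minimal C sketch of the greedy-chunk driver for the interleaved worker-then-checker scheme. The helpers speculate_chunk, check_chunk, run_chunk_sequentially and reset_marks are hypothetical stand-ins for the worker, the checker, the sequential recovery path, and the clearing of the per-chunk shadow state; they are not part of the paper's design.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins; real implementations would be parallel. */
static void speculate_chunk(int first, int last) { (void)first; (void)last; }
static bool check_chunk(void) { return true; }   /* true = no hazard found */
static void run_chunk_sequentially(int first, int last) { (void)first; (void)last; }
static void reset_marks(void) { }

/* Greedy-chunk driver: speculate one chunk at a time as a DOALL, check it,
   and fall back to sequential re-execution of just that chunk on a hazard. */
static void greedy_chunk(int n_iters, int chunk_size) {
    for (int first = 0; first < n_iters; first += chunk_size) {
        int last = first + chunk_size < n_iters ? first + chunk_size : n_iters;
        speculate_chunk(first, last);              /* worker: parallel execution   */
        if (!check_chunk())                        /* checker: test this chunk only */
            run_chunk_sequentially(first, last);   /* hazard: redo this chunk in order */
        reset_marks();                             /* shadow state is per-chunk    */
    }
}

int main(void) { greedy_chunk(6400, 32); puts("done"); return 0; }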

2.2.2 Run-Time Parallelization Based on Inspector-Executor

Strict checker-then-worker: This scheme is the conventional inspector-executor method [7, 12, 13, 15]. The checker (i.e., the inspector) and the worker (i.e., the executor) can both be parallelized, but the worker must wait for the execution schedule and cannot start until the checker finishes. The behavior of this scheme is shown in Figure 1(b). One important issue in this scheme is to obtain an optimal execution schedule for the worker, and to obtain it quickly. The inherently sequential nature of deciding the schedule makes it difficult to fully utilize all the processors. It is also difficult to come up with a schedule which balances the load of the processors in executing the worker. Hints for solving these problems may be found in [11].

Interleaved checker-then-worker: Similar to the interleaved worker-then-checker scheme, the operations in the checker and the worker are interleaved in this scheme. Figure 1(d) illustrates the behavior. This scheme is a simple extension of the conventional inspector-executor methods. The greedy-chunk approach introduced above can be applied here, in cooperation with the sectioning algorithm [7]. Note that with sectioning it is not possible to have a globally optimal schedule, but the worker can still be parallelized to some extent, and good speedup can be obtained if the processors are well utilized.

Overlapped checker-then-worker: In this scheme, while the checker is inspecting the iterations in the next chunk, the worker executes the iterations in the current chunk according to the schedule generated by the checker previously. Operations in the checker and the worker are overlapped, and the checker overhead can be hidden by the worker. Figure 1(f) illustrates the scenario. Again the system needs to support concurrent execution of the checker and the worker, and allow them to synchronize efficiently. This can be done with multi-threading or special hardware. In [11] a rough sketch of an implementation with multi-threading and dynamic processor assignment was given. We will leave the investigation of this scheme for future research.

3 A Case Study: Design of a Hardware Checker

From the previous section, we can see that by interleaving or overlapping the worker and the checker, we can exploit further opportunities to optimize run-time parallelization. As a case study, we will examine one design of the Type 5 scheme in this section.

We will discuss the overall design concept, the check-on-reference algorithm in detail, and the considerations in hardware implementation.

[Figure 2: A shared-memory multiprocessor with hardware checkers. Processors (the workers) connect through an interconnection network to the memory modules of the global main memory; each memory module contains a checker.]

3.1 Overall Organization

Figure 2 shows the organization of a shared-memory multiprocessor which implements the Type 5 scheme of the worker-checker model. There is a hardware checker embedded in the memory controller of each memory module. In this system the processors play the role of the worker and the memory controllers play the role of the checker. Given a loop to be executed speculatively, the processors first initialize the checkers before entering the source loop. While the processors are executing the iterations, the checkers monitor the accesses to the shared array in memory at the same time. Based on the types of the accesses and the iterations in which the accesses occur, the checkers can detect any hazard due to loop-carried data dependences. Since the checkers are invoked on every access to the shared array, they follow the check-on-reference strategy. In the following, we will first examine the hazard conditions and introduce the check-on-reference algorithm, which is used by the hardware checkers to detect hazards. Implementation details of the hardware checkers are then presented.

3.2 Hazard Conditions

In this subsection, we examine the conditions under which dependence hazards may occur. Note first that data dependences may or may not cause hazards, depending mainly on the order of accesses to the shared array.

To illustrate the idea, consider the example loop shown in Figure 3. In the loop there are two references to the shared array A, which are labeled a and b respectively. Every access to the shared array can be represented using an access identifier x(p, q), where x is either r for read or w for write, p is the iteration number, and q is the label of the reference. Let addr(x(p, q)) denote the address of the array element accessed in x(p, q). The operator → denotes a partial order, and r(p, q) → w(r, s) means that the access r(p, q) happens before w(r, s).

    DO i = 1, N
(a)    A(I1(i)) = ...
(b)    ... = A(I2(i))
    END

Figure 3: An example to illustrate shared data accesses and possible hazards. (The Data part of the figure lists the contents of the index arrays I1 and I2.)

Suppose the memory consistency model is sequential consistency [5]. From Figure 3, we can see that there is a loop-carried flow dependence from w(2, a) to r(3, b), because they both access the same location A(6), i.e., addr(w(2, a)) = addr(r(3, b)) = A(6). Now, if w(2, a) occurs before r(3, b), then the dependence is satisfied and there will be no run-time hazard. However, if r(3, b) is performed before w(2, a), i.e., r(3, b) → w(2, a), then a run-time hazard will occur. This is because r(3, b) will fetch the old value of A(6) instead of the most up-to-date value written by w(2, a). In the following discussion, we will say that r(3, b) → w(2, a) forms a violated access sequence.

Depending on the order and types of the accesses, there are three kinds of violated access sequences, which are shown in Table 3. From Table 3 we can see that a read access will cause a violated access sequence with a previous write if (1) they both access the same array element, and (2) the write is labeled with a larger iteration number, i.e., it is supposed to be invoked in a later iteration. This is a WAR (write after read) hazard, which typically violates loop-carried anti-dependences. Similarly, a Type 2 hazard is a WAW (write after write) hazard, which violates output dependences, and a Type 3 hazard is a RAW (read after write) hazard, which violates flow dependences.

Type | Condition                                                         | Hazard
1    | w(i, x) → r(j, y), where i > j and addr(w(i, x)) = addr(r(j, y)) | WAR
2    | w(i, x) → w(j, y), where i > j and addr(w(i, x)) = addr(w(j, y)) | WAW
3    | r(i, x) → w(j, y), where i > j and addr(r(i, x)) = addr(w(j, y)) | RAW

Table 3: Three types of violated access sequences.

3.3 The Checker Algorithm

In this subsection, we present the checker algorithm, which is to be executed in each checker. The algorithm is based on the check-on-reference strategy. From Table 3 we can see that the iteration number associated with an access indicates when the access is supposed to happen. This information should match the time when the access actually occurs; otherwise run-time hazards might happen. Thus, for example, to check if a read access causes a run-time hazard, we need to compare its associated iteration number iter with the iteration numbers of all previous writes which access the same array element. If iter is smaller than any of those iteration numbers, then there is a write access which occurred earlier but is supposed to occur later. As a result the read operation causes a violated access sequence and a run-time hazard occurs. Similar arguments apply to write accesses. In practice, we need not check "all" iteration numbers, only the largest iteration number recorded so far. These ideas lead to the checker algorithm shown in Figure 4.

The algorithm is invoked on every access to the shared array, because it implements the check-on-reference strategy. The access must indicate the type of access op, the address of the array element to be accessed A[k], and the iteration iter in which the access occurs. To check for run-time hazards, the algorithm references two shadow arrays. The array A_w (A_r) records for each array element the maximum iteration number at which an access has ever written to (read from) the element. In the algorithm we assume that the index of the given array is normalized and positive. Thus, if the shadow arrays have been properly initialized to zero, then a nonzero value in a shadow element indicates that the corresponding array element has been accessed before.

The primary task of the algorithm is to maintain the two shadow arrays and to use their information to check for hazards according to Table 3. For a read access to A[k] in iteration iter, the algorithm checks the maximum iteration number stored in A_w[k]. If iter is smaller than the value in A_w[k], then a WAR hazard occurs. Otherwise, A_r[k] is updated to the maximum iteration number. The operations performed during a write access are similar.

/* CHECKER ALGORITHM:
   Input:  type of access op (READ or WRITE), array base address A,
           array index k, iteration number when the access occurs iter.
   Output: inform the worker if a hazard occurs. */
 1  MEMORY_CHECK(op, A, k, iter)
 2  {
 3      get A_r[k] and A_w[k] from shadow arrays A_r and A_w;
 4      switch (op) {
 5      case READ:
 6          /* if A_w[k] != 0, A[k] has been written before */
 7          if (iter < A_w[k]) hazard = VIOLATE_ANTI;      /* WAR */
 8          if (A_r[k] < iter) A_r[k] = iter;
 9          if (hazard) inform the worker and exit.
10      case WRITE:
11          /* if A_w[k] != 0, A[k] has been written before */
12          if (A_w[k] == 0) backup(A[k]);  /* backup on the first write */
13          if (iter < A_w[k]) hazard = VIOLATE_OUTPUT;    /* WAW */
14          if (hazard) inform the worker and exit.
15          /* if A_r[k] != 0, A[k] has been read before */
16          if (iter < A_r[k]) hazard = VIOLATE_FLOW;      /* RAW */
17          if (A_w[k] < iter) A_w[k] = iter;
18          if (hazard) inform the worker and exit.
19      }
20  }

Figure 4: The checker algorithm.
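For readers who want to experiment with the algorithm in software, the following is a directly compilable C rendering of Figure 4. Returning hazard codes to the caller, the backup() stub, and the main() driver are this sketch's assumptions, not the paper's hardware interface.

#include <stdio.h>

#define ARRAY_SIZE 1024
enum { READ, WRITE };
enum { NO_HAZARD = 0, VIOLATE_ANTI, VIOLATE_OUTPUT, VIOLATE_FLOW };

static int A_r[ARRAY_SIZE];   /* max iteration that has read each element    */
static int A_w[ARRAY_SIZE];   /* max iteration that has written each element */

static void backup(int k) { (void)k; /* save old value for roll-back */ }

/* One call per shared-array access.  Iterations are assumed normalized to
   start at 1, so 0 in a shadow entry means "never accessed". */
static int memory_check(int op, int k, int iter) {
    if (op == READ) {
        int hz = (iter < A_w[k]) ? VIOLATE_ANTI : NO_HAZARD;  /* WAR: a later
                                                                 write already ran */
        if (A_r[k] < iter) A_r[k] = iter;
        return hz;
    }
    /* op == WRITE */
    if (A_w[k] == 0) backup(k);                  /* backup on the first write */
    if (iter < A_w[k]) return VIOLATE_OUTPUT;    /* WAW: a later write already ran */
    {
        int hz = (iter < A_r[k]) ? VIOLATE_FLOW : NO_HAZARD;  /* RAW: a later
                                                                 read already ran */
        if (A_w[k] < iter) A_w[k] = iter;
        return hz;
    }
}

int main(void) {
    /* Re-create the Figure 3 scenario: r(3, b) happens before w(2, a), both
       touching element 6, so the write reports VIOLATE_FLOW (code 3). */
    printf("read  in iter 3: hazard=%d\n", memory_check(READ,  6, 3));
    printf("write in iter 2: hazard=%d\n", memory_check(WRITE, 6, 2));
    return 0;
}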

3.4 Considerations for Implementing Checkers

The checker algorithm shown in Figure 4 can be implemented entirely in software. One possible implementation on a multiprocessor is to use one thread or process on each processor to execute the checker algorithm, while another thread or process executes the worker. Since the checker threads or processes are executed on different processors in parallel, updating the shadow arrays A_w and A_r must be protected as a critical section. The critical section serializes the operations of the checkers and results in a system bottleneck.

The checker algorithm can also be implemented in hardware. The primary reason is that the Type 5 scheme (overlapped worker-then-checker) can then actually be realized. Besides, a recent trend in memory design is to put more functionality into the memory. Memory controllers are becoming more intelligent for specific applications, e.g., irregular memory access [14], locality analysis using prefetch caches, fast graphics display with smart Z-buffers, etc. It is thus feasible to embed some checker hardware inside memory controllers to support run-time parallelization. Details of the hardware checkers will be given in the following subsections. Note that an alternative to hardware checkers is to implement the checkers inside the caches. The checker algorithm would then operate cooperatively with the cache-coherence protocol. This approach is more complex and will be left for future investigation.

[Figure 5: Circuit diagram of the hardware checker. (a) The checker circuit: two selectors fed with the shadow entries A_r and A_w, the iteration number iter, and the read/write request lines; their outputs, write ? A_r : max(A_r, iter) and read ? A_w : max(A_w, iter), update the shadow arrays, and hazards assert the interrupt line INT. (b) The selector circuit: a comparator (B < A?) and a multiplexer controlled by input C.]

3.5 The Checker Circuit

Figure 5 shows the checker circuit, which implements lines 4-19 of the checker algorithm in Figure 4. Again, the circuit operates whenever there is an access to the shared array. Each access supplies the circuit with the following inputs: the type of request, the iteration number, and the elements of the shadow arrays A_w and A_r corresponding to the accessed array element.

Note that a shadow array element stores the maximum iteration number at which an access to the corresponding array element has ever occurred. This data will be loaded from memory for checking and then stored back for updating. The checker circuit consists of two identical selectors, one for read accesses and the other for write accesses. The selectors are used to compare the iteration numbers, report hazards, and update the maximum iteration number. In each selector, the input A contains the value from the shadow array (A_r or A_w) and B contains the iteration number in which the current access occurs. A comparator (CMP) is used to compare these two values. A multiplexer is used to select a value from A or B to update the shadow array element: A is selected when the control input is 1; otherwise B is selected.

The checker circuit works as follows. When a read request to a shared array element A[k] arrives (read = 1 and write = 0), the read selector compares inputs A (the shadow array element A_w[k]) and B (the iteration number). If B < A, then a WAR hazard occurs and the hazard interrupt signal INT is asserted. Since the C input to the selector is now 1, A_w[k] is unchanged. In the meantime, the write selector has input C = 0. Thus, if the iteration number iter is larger than the shadow element A_r[k], the write selector selects the B input to store into A_r[k]. In this way, A_r[k] is updated with the maximum iteration number. Note that the write selector would signal a RAR (read after read) hazard if iter < A_r[k]; an inverter is used to suppress such a signal. The operations performed during a write access are similar: the read selector now checks for WAW hazards, the write selector checks for RAW hazards, and the read selector is also responsible for updating the shadow element A_w[k].
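The circuit's behavior can be summarized in a small behavioral C model. The sketch below follows the description above (two comparators and two multiplexers computing write ? A_r : max(A_r, iter) and read ? A_w : max(A_w, iter)); the function and signal names are this sketch's own, not the paper's.

#include <stdbool.h>
#include <stdio.h>

/* One invocation per shared-array access; a_r/a_w point to the shadow
   entries of the referenced element.  Returns the state of the INT line. */
static bool checker_circuit(bool is_write, int iter, int *a_r, int *a_w) {
    bool hz_aw = iter < *a_w;              /* selector #1: WAR on a read, WAW on a write */
    bool hz_ar = is_write && iter < *a_r;  /* selector #2: RAW; the inverter masks
                                              the spurious RAR case on reads   */

    /* Multiplexer outputs written back to the shadow memory. */
    if (!is_write && iter > *a_r) *a_r = iter;  /* read side:  max(A_r, iter) */
    if ( is_write && iter > *a_w) *a_w = iter;  /* write side: max(A_w, iter) */

    return hz_aw || hz_ar;                 /* INT asserted on any hazard */
}

int main(void) {
    int a_r = 0, a_w = 0;   /* shadow entries for one element, e.g. A(6) */
    printf("r in iter 3: INT=%d\n", checker_circuit(false, 3, &a_r, &a_w));
    printf("w in iter 2: INT=%d\n", checker_circuit(true,  2, &a_r, &a_w)); /* RAW */
    return 0;
}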

3.6 Memory Controllers with the Checker Circuit

The checker circuit in Figure 5 can easily be embedded in the memory controller. The architecture of such a memory controller is shown in Figure 6. There are two buffers: one is a FIFO buffer for queuing memory access requests, i.e., regular accesses; the other is used by the checker and enqueues the requests for checking hazards, i.e., checker accesses. A filter is used to enable/disable checking and to translate referenced addresses into shadow array addresses.

[Figure 6: A memory controller with the checker circuit. Commands and addresses pass through the filter, which forwards regular addresses to the memory banks (#1-#4) and shadow addresses, along with the read/write/iter information, to the checker; hazards are reported on an interrupt line, and data is transferred over the data bus.]

In our design, the shadow arrays are stored in the main memory. The filter contains a number of registers. The CSW (Checker Status Word) register contains a flag indicating whether the checker is enabled or disabled. The MSA (Marked Start Address) and MEA (Marked End Address) registers contain the starting and ending addresses of the shared array. The SBA register contains the starting address of the shadow arrays. The filter also has a register file to record which processor is executing which iteration. The register file has at least N registers, where N is the number of processors.

Now let us see how the memory controller works. For simplicity, we assume that only one shared array needs to be checked in the speculated loop. When the processors start executing the speculated loop, they first initialize the registers in the filter. When a processor p begins executing an iteration, it broadcasts the iteration number i to all the memory controllers. The filter in each controller records i in the p-th entry of its register file. When the filter receives a request from processor p to access a shared array element A[k], it will (1) obtain the iteration number from the p-th entry of the register file, (2) translate the shared array address A[k] into the shadow array addresses A_r[k] and A_w[k], and (3) pass this information into the checking queue for the checker circuit to check for hazards.

As mentioned, the shadow arrays are stored in the main memory. If the memory system is dual-port, then the checker can load or update the shadow arrays without affecting regular accesses. However, dual-port memories are very expensive. Thus, we assume one-port memories in the following discussions. Unfortunately, with such memories we need to arbitrate between regular and checker accesses. Our design is to serve the regular access first and then the corresponding checker access. In the next section we will study how regular accesses may be affected by checker accesses. Note that if the referenced shared element A[k] is placed in one memory module and its corresponding shadow elements A_r[k] and A_w[k] are placed in another module, then the checker access can be performed at the same time as the regular access. Such a checking scheme will be referred to as neighbor checking and can achieve an effect similar to dual-port memories. In Section 4 we will study the performance of neighbor checking as well.
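As an illustration of what the filter has to do per request, the following C sketch captures steps (1)-(3) above. It assumes element-granularity addresses and a shadow layout in which A_r is followed immediately by A_w at the SBA base; these assumptions, like all the names here, are this sketch's own.

#include <stdbool.h>
#include <stdint.h>

#define NPROC 32

struct filter {
    bool     csw;           /* Checker Status Word: checking enabled?        */
    uint32_t msa, mea;      /* Marked Start/End Address of the shared array  */
    uint32_t sba;           /* Shadow Base Address (A_r, then A_w, assumed)  */
    uint32_t iter[NPROC];   /* register file: iteration broadcast per processor */
};

struct checker_req {        /* what gets enqueued for the checker circuit */
    bool     is_write;
    uint32_t iter;
    uint32_t shadow_r, shadow_w;   /* addresses of A_r[k] and A_w[k] */
};

/* Returns true (and fills *out) when the access must also be checked.
   Under the one-port design described above, the regular access is served
   first and the checker request is served afterwards. */
bool filter_access(const struct filter *f, int pid, uint32_t addr,
                   bool is_write, struct checker_req *out) {
    if (!f->csw || addr < f->msa || addr >= f->mea)
        return false;                        /* not a monitored shared access */
    uint32_t k = addr - f->msa;              /* element offset k within A     */
    out->is_write = is_write;
    out->iter     = f->iter[pid];            /* set earlier by SET_ITER       */
    out->shadow_r = f->sba + k;                          /* A_r[k] */
    out->shadow_w = f->sba + (f->mea - f->msa) + k;      /* A_w[k] */
    return true;
}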

3.7 Software Supports

Given the hardware checker described in the previous subsections, the multiprocessor requires a number of system supports to accomplish speculative parallelization of nested loops. Table 4 summarizes the necessary supports.

Programmer       | Gives a directive SP_LOOP(addr, size) before the loop to be speculated
Compiler         | Transforms the specified loop into a speculative DOALL loop with recovery code by inserting the primitives MEMCHECK_ON, MEMCHECK_OFF, MEMCHECK_CONTINUE and SET_ITER
Operating System | Called by the primitives to initialize the filter in the checker and to invoke the user-level interrupt handler that handles hazards
Hardware         | Checks hazards and reports to the processors by interrupt

Table 4: Operations necessary for executing a speculative loop.

As an example, consider the code in Figure 7. In the source loop (Figure 7(a)), the programmer must mark the loop to be speculated with a directive SP_LOOP(addr, size), where addr is the starting address of the shared array and size is the range to be monitored/checked. The compiler then transforms the source loop into a speculative DOALL loop and a segment of recovery code, as shown in Figure 7(b). The speculative DOALL loop is enclosed by two primitives, MEMCHECK_ON and MEMCHECK_OFF. At run time the primitive MEMCHECK_ON is invoked when the processors enter the speculative loop. The primitive specifies parameters from which the starting and ending addresses of the shared array are calculated, and it initializes the registers MSA and MEA in the filter of the checkers. The content of the SBA register is initialized automatically and all the shadow array elements are initialized to zero. The filter is also enabled by asserting the flag in the CSW register. Finally, the recovery code is set up as the interrupt routine for hazards.

1 SP_LOOP(A, N);
2 for (i = 0; i < N; i++) {
3     A[I1[i]] = ...;
4     ...;
5     ... = A[I2[i]];
6 }

(a) the source loop

 1 MEMCHECK_ON(A, N);
 2 forall (pid = 0; pid < NPROC; pid++) {
 3     for (chunk = 0; chunk < total_chunks; chunk++) {
 4         base = pid * partition_size + chunk * chunk_size;
 5         for (i = base; i < base + partition_size; i++) {
 6             SET_ITER(i);
 7             A[I1[i]] = ...;
 8             ...;
 9             ... = A[I2[i]];
10         }
11     }
12     goto DONE;
13 RECOVER: /* this is the hazard interrupt handler routine */
14     BARRIER(barrier, NPROC);
15     if (pid == 0) {
16         /* recovery code here */
17         MEMCHECK_CONTINUE();
18     }
19 } /* forall */
20 DONE:
21 BARRIER(barrier, NPROC);
22 MEMCHECK_OFF();

(b) the transformed speculative DOALL loop

Figure 7: The transformed speculative DOALL loop by the compiler.

When a processor starts executing an iteration i, the primitive SET_ITER(i) (line 6 in Figure 7(b)) is called to inform the checkers of the iteration number and set the corresponding register file entry. The hardware checkers then check for hazards whenever there is an access to the shared array. When a checker finds a hazard, it interrupts all the processors. This causes the processors to execute the recovery code. In the meantime, the other checkers are also informed to turn off their checking mechanisms. After recovery, another primitive, MEMCHECK_CONTINUE (line 17 in Figure 7(b)), is used to wake up the checkers and continue with the speculative execution.

4 Performance Evaluation

In this section, we study the performance of our hardware checker design. The environment of our experiments is introduced first, followed by the performance results.

4.1 Experimental Environment

We used Augmint [1] to simulate the proposed hardware checker. Augmint is an execution-driven multiprocessor memory simulator for the Intel Pentium architecture. It augments the assembly code of the compiled application by inserting extra instructions at memory reference points in order to emit memory events. The augmented object code consists roughly of two parts: a front-end and a back-end. The front-end simulates the execution of multiple processors with user-level threads and generates events of interest. The back-end consists of user-specified simulation routines, which are invoked with the events generated by the front-end. Our checker circuit was embedded in the back-end routines. The simulation programs, i.e., the simulator and the synthetic benchmark, were compiled with GNU C and executed on a Pentium processor running Linux.

The hardware checkers were assumed to run inside a uniform memory access (UMA) multiprocessor, as shown in Figure 8. Shared data were stored in the global shared memory without local copies. Private caches and local memories were assumed to store instructions, stacks and unshared data. To simplify the experiments, virtual memory operations (such as page swaps) were not allowed while checkers were enabled. We also ignored network and bus contention so as to focus only on memory contention. Write operations were assumed to be non-blocking and there was a write buffer in each memory controller for queuing regular write accesses.

[Figure 8: The simulated architecture: a UMA shared-memory multiprocessor without bus/network contention. Each processor has a private cache and a local memory, and each memory module of the global main memory contains a checker.]

The main memory cycle time was assumed to be 5 cycles, and the memory controller served requests in units of one word. The memory consistency model was sequential consistency. Unless stated explicitly, we assume that the shared array elements and their corresponding shadow elements are stored in the same memory module.

In the experiments we used a synthetic program to evaluate the performance of the checker. The program is shown in Figure 9 and was adopted from [2]. The iteration count N was assumed to be the same as the size of the shared array A. There were P processors in the system. We used the greedy-chunk method to parallelize the loop. A chunk size of P iterations means that the loop is partitioned into N/P chunks and each processor executes one iteration in each chunk. In other words, the loop is partitioned in a CYCLIC fashion. Note that there are barrier synchronizations when the execution moves from one chunk to another. A chunk size of N means that there is only one chunk and each processor executes N/P iterations. In other words, the loop is partitioned in a BLOCK fashion.

Data access patterns were controlled by the array INDEX. By setting appropriate values in the array INDEX, we could make the loop DOALL or DOACROSS. Following [2], two parameters H_frac and H_size were used to control the DOACROSS access pattern. In brief, H_frac is the probability that an access is a hot access, where a hot access is one that easily causes data dependences. The parameter H_size is the fraction of the entire array A occupied by the hot area, where the hot area is the range of A that hot accesses reference. Thus, if an access is a hot access, it references the hot area; otherwise its reference can be directed to the entire array A. A sketch of such an index generator is given below.
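As an illustration, the INDEX array of the benchmark in Figure 9 (shown next) can be generated from these two knobs as in the following sketch; the use of rand() and the placement of the hot area at the front of A are assumptions of this sketch, not details given in the paper.

#include <stdlib.h>

/* Fill INDEX with n_refs references into an array of array_size elements.
   With probability h_frac a reference is "hot" and falls inside the hot
   area, a region covering an h_size fraction of the array; otherwise it
   may fall anywhere in the array. */
void make_index(int INDEX[], int n_refs, int array_size,
                double h_frac, double h_size) {
    int hot_len = (int)(h_size * array_size);   /* size of the hot area */
    if (hot_len < 1) hot_len = 1;
    for (int i = 0; i < n_refs; i++) {
        int hot = (double)rand() / RAND_MAX < h_frac;   /* hot access? */
        INDEX[i] = hot ? rand() % hot_len       /* confined to the hot area */
                       : rand() % array_size;   /* anywhere in A            */
    }
}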

for (i = 0; i < N; i++) {
    for (j = 0; j < r; j++) {
        if (odd(j)) A[INDEX[i*r+j]] = tmp1;
        else        tmp2 = A[INDEX[i*r+j]];
    }
    for (k = 0; k < W; k++) {
        /* simulate useful work; assume one       */
        /* processor cycle cost in each iteration */
    }
}

Figure 9: The synthetic benchmark program.

DOACROSS Pattern | Description
EARLY-RECOVER    | Hazards occur at the beginning of the loop
LATE-RECOVER     | Hazards occur near the end of the loop
MOST-SERIAL      | Heavy data dependences, with (H_frac, H_size) = (0.9, 0.1)
MOST-PARALLEL    | Light data dependences, with (H_frac, H_size) = (0.1, 0.9)

Table 5: Different types of DOACROSS loops used in the experiments.

We experimented with a number of DOACROSS loops, which are listed in Table 5. In the experiments, we set N = 6400, r = 2, and W = ….

4.2 Experimental Results

4.2.1 Overall Performance

Let us first study the performance of our checkers in executing DOALL loops. Note that DOALL loops do not cause any hazards and thus induce no recovery overhead. Figure 10(a) shows the speedup of executing the synthetic benchmark using different speculative parallelization schemes when the loop is DOALL. From the figure we can see that our hardware checker improves on the software implementation of Rauchwerger's algorithm [9, 10] by a factor of 2. We can also see that the speedup scales well when hardware checkers are used.

The overhead in our design comes primarily from software, such as the barriers between chunks, the chunk setup time, etc. When the chunk size is set to P, the speculative DOALL loop has the worst speedup.

[Figure 10: Performance of various run-time parallelization methods. (a) Speedup of DOALL loops for: without checkers (chunk size = N), with checkers (chunk size = N), the neighbor checker, with checkers (chunk size = P), and Rauchwerger's algorithm (chunk size = N). (b) Execution cycles of DOACROSS loops for the EARLY-RECOVER and LATE-RECOVER patterns and the Type 1 method.]

This is because a smaller chunk size results in a larger number of chunks. As a result, there are more barriers and chunk setups. In the figure there is a curve denoted "Neighbor Checker". This curve is the result of allocating the checkers to memory modules different from those holding the shared elements to be checked. We shall explain this curve later.

Next, let us consider the worst case in speculative parallelization, i.e., the loop is DOACROSS and the chunk size is set to N. Figure 10(b) shows the execution time of the synthetic loop in such a case. Two extreme dependence patterns were used: EARLY-RECOVER and LATE-RECOVER (see Table 5). In the EARLY-RECOVER case the hazard is found immediately with our hardware checker, and the execution does not waste much time in failure recovery. The resulting execution time is close to that of the sequential version regardless of the number of processors used. However, in the LATE-RECOVER case run-time parallelization performs even worse than the sequential version. This is because the sequential re-execution of the source loop, after speculation fails, dominates the total execution time. The situation becomes even worse if the Type 1 scheme is used. Note that in the Type 1 scheme, the whole loop needs to be executed before dependence hazards are checked, and the loop is rolled back on any hazard.

4.2.2 Effects of Chunk Size on DOACROSS Loops

From the above discussion, we can see that the chunk size is a very important parameter in speculative parallelization. Figure 11 shows the effects of the chunk size on MOST-PARALLEL and MOST-SERIAL loops.

[Figure 11: Effects of chunk size on two patterns of DOACROSS loop. Speedup versus processors for chunk sizes of 1xP, 2xP, 3xP and 4xP (plus the neighbor checker at 1xP) on (a) a MOST-PARALLEL DOACROSS loop and (b) a MOST-SERIAL DOACROSS loop.]

Figure 11(a) shows the speedup curves for MOST-PARALLEL loops. As we can see, the performance degrades rapidly as the chunk size increases. This is because a larger chunk size has a higher probability of causing hazards, which in turn degrades the performance. The same reasoning holds for MOST-SERIAL loops (Figure 11(b)). Comparing these two kinds of loops, MOST-SERIAL loops attain a maximum speedup of only about 4 and are less scalable. Moreover, as the number of processors increases, the performance of MOST-SERIAL loops degrades more rapidly than that of MOST-PARALLEL loops. The reason is that MOST-SERIAL loops have a higher probability of having hazards in a chunk than MOST-PARALLEL loops do.

The figure also shows that a chunk size of P results in the best performance. This is different from the case for DOALL loops, in which a chunk size of N is preferred (see Figure 10(a)). This implies that the CYCLIC partitioning of iterations is more suitable for DOACROSS loops.

4.2.3 Hot-Spot Effects and Neighbor Checkers

We say that a hot spot occurs when there is a large number of references contending for the same memory module during a short period of time. We measured the hot-spot effect by counting the maximum number of references ever buffered in the request queue of a memory controller. In Figure 12, we show how different ways of assigning the checkers may affect the hot spots. One approach to assigning the checkers, which was used in the previous experiments, is to store the shadow array elements in the same memory module as their corresponding shared array elements and use the local checker to check for hazards.

[Figure 12: Hot-spot effects due to different ways of embedding the checkers. Maximum queue length versus processors for (a) DOALL loops (with checkers, with neighbor checkers, and with no checkers) and (b) DOACROSS loops (MOST-PARALLEL and MOST-SERIAL, each with checkers and with neighbor checkers).]

Since a memory module can serve one request at a time, checker accesses can only be performed after the corresponding regular accesses complete. Another approach is to let the checker in another memory module check the local shared elements. This scheme is called the neighbor checker. In our experiment, if the regular access is performed in memory module i, then the checker access is performed in module (i + 1) % M, where M is the number of memory modules.

Figure 12(a) shows the effects of the different checker assignments on hot spots for DOALL loops. The figure shows that the neighbor checker scheme helps to reduce hot spots. When the number of processors is 32, the hot-spot effect is reduced by about 40% (from 114 to 46 queued requests). Note that the curves oscillate as the number of processors varies. This is because memory access patterns vary widely when the source loop is partitioned for different numbers of processors.

Figure 12(b) shows the results for DOACROSS loops using the greedy-chunk approach. The hot-spot effect is not very significant for DOACROSS loops. This is because chunks in which hazards occur must be re-executed sequentially. Sequential executions do not cause memory contention, which in turn minimizes hot spots. Thus the neighbor checker scheme is not very helpful in this case, though it can still reduce some memory hot spots.
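For completeness, the neighbor-checking placement amounts to a one-line mapping; here module_of is a hypothetical stand-in for the system's address-interleaving function, not an interface defined in the paper.

/* Hypothetical word-interleaved placement of regular accesses. */
static int module_of(unsigned addr, int M) { return (int)(addr % (unsigned)M); }

/* Neighbor checking: the checker access for an element served by module i
   goes to module (i + 1) % M, so the two can proceed concurrently. */
int checker_module(unsigned addr, int M) {
    return (module_of(addr, M) + 1) % M;
}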

[Figure 13: Effects of the checkers on the average memory latency. Average memory latency versus processors for a DOALL loop with no checkers, with checkers, and with neighbor checkers, and for MOST-PARALLEL and MOST-SERIAL DOACROSS loops with checkers and with neighbor checkers.]

4.2.4 Effects of Checkers on Memory Latency

As the number of processors increases, the memory modules have to serve more memory requests. In Figure 13 we compare the average memory latency while varying the number of processors. These curves show that when a fixed number of memory modules is used (4 memory modules in this case), the average latency of each reference increases when more processors are used. The curve with the lowest average memory latency corresponds to the DOALL loop without checkers. This curve serves as a baseline for reference purposes. When checkers are used, the average memory latency is close to the baseline as long as the number of processors is smaller than 10. In that range, the memory bandwidth is not saturated by the extra checking traffic. This also indicates that our checkers do not induce excessive overhead.

For DOACROSS loops, we can see that they have a longer average memory latency than DOALL loops when the number of processors is small. However, the memory latency of DOALL loops becomes longer than that of the MOST-SERIAL loop when the number of processors exceeds 25, and longer than that of the MOST-PARALLEL case when the number of processors exceeds 30. This is due to the heavier memory contention caused by the higher degree of parallelism in the DOALL loops. This is also evident in comparing the latencies of MOST-PARALLEL and MOST-SERIAL loops: the former have a higher memory latency than the latter.

The neighbor checker scheme is useful in reducing memory latency, but the effect is not very significant in DOALL loops. The reason is that regular accesses tend to be distributed evenly among the memory modules. Thus, even with the neighbor checker scheme, checker accesses will contend with regular accesses in all the modules. On the other hand, for DOACROSS loops, the neighbor checker scheme can improve the average memory latency by about 2 clock cycles.

[Figure 14: Effects of adding more memory checkers on MOST-PARALLEL DOACROSS loops. (a) Average memory latency for the DOALL no-checker baseline (M = 4) and MOST-PARALLEL loops with M = 4, 8, 12 and 16 memory modules; (b) speedup on MOST-PARALLEL loops with neighbor checkers for M = 4, 8, 12 and 16.]

4.2.5 Effects of Adding More Checkers

Intuitively, if we increase the number of memory modules, each having a hardware checker, memory requests will be served more quickly. Figure 14 shows the performance results for the MOST-PARALLEL loops when different numbers of memory controllers are used. From Figure 14(a) we can see that increasing the number of memory modules reduces the average memory latency. Of course, this performance gain comes at the cost of additional hardware. From Figure 14(b) we can see that increasing the number of memory modules enhances the speedup, but the improvement rate drops as more checkers are used. For example, when the number of processors in the system is 16, the speedup using four checkers is about 7, but the speedup is only 8 when eight checkers are used. Furthermore, when 16 checkers are used, the speedup only improves to 8.7. This is mainly due to the limited parallelism within the individual chunks. Thus, adding more checkers does improve the memory latency, but its effectiveness is constrained by the inherent parallelism in the loop. It should be noted that our experiments did not model network contention; thus adding more memory modules can always increase memory bandwidth. The amount of exploitable parallelism is limited only by the number of processors used and the chunk size selected.


More information

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department of Computer Science and Information Engineering National

More information

Chapter 7 The Potential of Special-Purpose Hardware

Chapter 7 The Potential of Special-Purpose Hardware Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture

More information

Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading

Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading Mario Almeida, Liang Wang*, Jeremy Blackburn, Konstantina Papagiannaki, Jon Crowcroft* Telefonica

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Main Points of the Computer Organization and System Software Module

Main Points of the Computer Organization and System Software Module Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network Congestion-free Routing of Streaming Multimedia Content in BMIN-based Parallel Systems Harish Sethu Department of Electrical and Computer Engineering Drexel University Philadelphia, PA 19104, USA sethu@ece.drexel.edu

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems On Object Orientation as a Paradigm for General Purpose Distributed Operating Systems Vinny Cahill, Sean Baker, Brendan Tangney, Chris Horn and Neville Harris Distributed Systems Group, Dept. of Computer

More information

III Data Structures. Dynamic sets

III Data Structures. Dynamic sets III Data Structures Elementary Data Structures Hash Tables Binary Search Trees Red-Black Trees Dynamic sets Sets are fundamental to computer science Algorithms may require several different types of operations

More information

A Scalable Method for Run Time Loop Parallelization

A Scalable Method for Run Time Loop Parallelization A Scalable Method for Run Time Loop Parallelization Lawrence Rauchwerger Nancy M. Amato David A. Padua Center for Supercomputing R&D Department of Computer Science Center for Supercomputing R&D University

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

to automatically generate parallel code for many applications that periodically update shared data structures using commuting operations and/or manipu

to automatically generate parallel code for many applications that periodically update shared data structures using commuting operations and/or manipu Semantic Foundations of Commutativity Analysis Martin C. Rinard y and Pedro C. Diniz z Department of Computer Science University of California, Santa Barbara Santa Barbara, CA 93106 fmartin,pedrog@cs.ucsb.edu

More information

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Ameya Velingker and Dougal J. Sutherland {avelingk, dsutherl}@cs.cmu.edu http://www.cs.cmu.edu/~avelingk/compilers/

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

Processes and Non-Preemptive Scheduling. Otto J. Anshus

Processes and Non-Preemptive Scheduling. Otto J. Anshus Processes and Non-Preemptive Scheduling Otto J. Anshus Threads Processes Processes Kernel An aside on concurrency Timing and sequence of events are key concurrency issues We will study classical OS concurrency

More information

COMPUTER SCIENCE 4500 OPERATING SYSTEMS

COMPUTER SCIENCE 4500 OPERATING SYSTEMS Last update: 3/28/2017 COMPUTER SCIENCE 4500 OPERATING SYSTEMS 2017 Stanley Wileman Module 9: Memory Management Part 1 In This Module 2! Memory management functions! Types of memory and typical uses! Simple

More information

Characteristics of Mult l ip i ro r ce c ssors r

Characteristics of Mult l ip i ro r ce c ssors r Characteristics of Multiprocessors A multiprocessor system is an interconnection of two or more CPUs with memory and input output equipment. The term processor in multiprocessor can mean either a central

More information

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

T H. Runable. Request. Priority Inversion. Exit. Runable. Request. Reply. For T L. For T. Reply. Exit. Request. Runable. Exit. Runable. Reply.

T H. Runable. Request. Priority Inversion. Exit. Runable. Request. Reply. For T L. For T. Reply. Exit. Request. Runable. Exit. Runable. Reply. Experience with Real-Time Mach for Writing Continuous Media Applications and Servers Tatsuo Nakajima Hiroshi Tezuka Japan Advanced Institute of Science and Technology Abstract This paper describes the

More information

Virtual Memory Management

Virtual Memory Management Virtual Memory Management CS-3013 Operating Systems Hugh C. Lauer (Slides include materials from Slides include materials from Modern Operating Systems, 3 rd ed., by Andrew Tanenbaum and from Operating

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

Parallel Processors. Session 1 Introduction

Parallel Processors. Session 1 Introduction Parallel Processors Session 1 Introduction Applications of Parallel Processors Structural Analysis Weather Forecasting Petroleum Exploration Fusion Energy Research Medical Diagnosis Aerodynamics Simulations

More information

Hardware for Speculative Run-Time Parallelization.

Hardware for Speculative Run-Time Parallelization. Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors 1 Ye Zhang y, Lawrence Rauchwerger z, and Josep Torrellas y y Computer Science Department University of Illinois

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Relaxed Memory Consistency

Relaxed Memory Consistency Relaxed Memory Consistency Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Makuhari, Chiba 273, Japan Kista , Sweden. Penny system [2] can then exploit the parallelism implicitly

Makuhari, Chiba 273, Japan Kista , Sweden. Penny system [2] can then exploit the parallelism implicitly Dynamic Scheduling in an Implicit Parallel System Haruyasu Ueda Johan Montelius Institute of Social Information Science Fujitsu Laboratories Ltd. Swedish Institute of Computer Science Makuhari, Chiba 273,

More information

Compiler and Runtime Support for Programming in Adaptive. Parallel Environments 1. Guy Edjlali, Gagan Agrawal, and Joel Saltz

Compiler and Runtime Support for Programming in Adaptive. Parallel Environments 1. Guy Edjlali, Gagan Agrawal, and Joel Saltz Compiler and Runtime Support for Programming in Adaptive Parallel Environments 1 Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim Humphries, and Joel Saltz UMIACS and Dept. of Computer Science University

More information

Cache Injection on Bus Based Multiprocessors

Cache Injection on Bus Based Multiprocessors Cache Injection on Bus Based Multiprocessors Aleksandar Milenkovic, Veljko Milutinovic School of Electrical Engineering, University of Belgrade E-mail: {emilenka,vm@etf.bg.ac.yu, Http: {galeb.etf.bg.ac.yu/~vm,

More information

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs Evaluation of Communication Mechanisms in Invalidate-based Shared Memory Multiprocessors Gregory T. Byrd and Michael J. Flynn Computer Systems Laboratory Stanford University, Stanford, CA Abstract. Producer-initiated

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

A Feasibility Study for Methods of Effective Memoization Optimization

A Feasibility Study for Methods of Effective Memoization Optimization A Feasibility Study for Methods of Effective Memoization Optimization Daniel Mock October 2018 Abstract Traditionally, memoization is a compiler optimization that is applied to regions of code with few

More information

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions) EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems COP 4610: Introduction to Operating Systems (Spring 2015) Chapter 13: I/O Systems Zhi Wang Florida State University Content I/O hardware Application I/O interface Kernel I/O subsystem I/O performance Objectives

More information

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration MULTIPROCESSORS Characteristics of Multiprocessors Interconnection Structures Interprocessor Arbitration Interprocessor Communication and Synchronization Cache Coherence 2 Characteristics of Multiprocessors

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Hazard Pointers. Number of threads unbounded time to check hazard pointers also unbounded! difficult dynamic bookkeeping! thread B - hp1 - hp2

Hazard Pointers. Number of threads unbounded time to check hazard pointers also unbounded! difficult dynamic bookkeeping! thread B - hp1 - hp2 Hazard Pointers Store pointers of memory references about to be accessed by a thread Memory allocation checks all hazard pointers to avoid the ABA problem thread A - hp1 - hp2 thread B - hp1 - hp2 thread

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Effects of Parallelism Degree on Run-Time Parallelization of Loops

Effects of Parallelism Degree on Run-Time Parallelization of Loops Effects of Parallelism Degree on Run-Time Parallelization of Loops Chengzhong Xu Department of Electrical and Computer Engineering Wayne State University, Detroit, MI 4822 http://www.pdcl.eng.wayne.edu/czxu

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Speculative Synchronization

Speculative Synchronization Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization

More information

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

Frequently asked questions from the previous class survey

Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [THREADS] Shrideep Pallickara Computer Science Colorado State University L7.1 Frequently asked questions from the previous class survey When a process is waiting, does it get

More information

Parallelization Principles. Sathish Vadhiyar

Parallelization Principles. Sathish Vadhiyar Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs

More information

Part IV. Chapter 15 - Introduction to MIMD Architectures

Part IV. Chapter 15 - Introduction to MIMD Architectures D. Sima, T. J. Fountain, P. Kacsuk dvanced Computer rchitectures Part IV. Chapter 15 - Introduction to MIMD rchitectures Thread and process-level parallel architectures are typically realised by MIMD (Multiple

More information

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:

More information

CS Operating Systems

CS Operating Systems CS 4500 - Operating Systems Module 9: Memory Management - Part 1 Stanley Wileman Department of Computer Science University of Nebraska at Omaha Omaha, NE 68182-0500, USA June 9, 2017 In This Module...

More information

CS Operating Systems

CS Operating Systems CS 4500 - Operating Systems Module 9: Memory Management - Part 1 Stanley Wileman Department of Computer Science University of Nebraska at Omaha Omaha, NE 68182-0500, USA June 9, 2017 In This Module...

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Unit 2 : Computer and Operating System Structure

Unit 2 : Computer and Operating System Structure Unit 2 : Computer and Operating System Structure Lesson 1 : Interrupts and I/O Structure 1.1. Learning Objectives On completion of this lesson you will know : what interrupt is the causes of occurring

More information

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination 1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu

More information

MDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science

MDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science MDP Routing in ATM Networks Using the Virtual Path Concept 1 Ren-Hung Hwang, James F. Kurose, and Don Towsley Department of Computer Science Department of Computer Science & Information Engineering University

More information

Java Virtual Machine

Java Virtual Machine Evaluation of Java Thread Performance on Two Dierent Multithreaded Kernels Yan Gu B. S. Lee Wentong Cai School of Applied Science Nanyang Technological University Singapore 639798 guyan@cais.ntu.edu.sg,

More information

PE PE PE. Network Interface. Processor Pipeline + Register Windows. Thread Management & Scheduling. N e t w o r k P o r t s.

PE PE PE. Network Interface. Processor Pipeline + Register Windows. Thread Management & Scheduling. N e t w o r k P o r t s. Latency Tolerance: A Metric for Performance Analysis of Multithreaded Architectures Shashank S. Nemawarkar, Guang R. Gao School of Computer Science McGill University 348 University Street, Montreal, H3A

More information

Sistemi in Tempo Reale

Sistemi in Tempo Reale Laurea Specialistica in Ingegneria dell'automazione Sistemi in Tempo Reale Giuseppe Lipari Introduzione alla concorrenza Fundamentals Algorithm: It is the logical procedure to solve a certain problem It

More information

Parallel Computer Architecture and Programming Written Assignment 3

Parallel Computer Architecture and Programming Written Assignment 3 Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Intel Hyper-Threading technology

Intel Hyper-Threading technology Intel Hyper-Threading technology technology brief Abstract... 2 Introduction... 2 Hyper-Threading... 2 Need for the technology... 2 What is Hyper-Threading?... 3 Inside the technology... 3 Compatibility...

More information

Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems

Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems The following paper was originally published in the Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation Seattle, Washington, October 1996 Performance Evaluation of Two

More information

160 M. Nadjarbashi, S.M. Fakhraie and A. Kaviani Figure 2. LUTB structure. each block-level track can be arbitrarily connected to each of 16 4-LUT inp

160 M. Nadjarbashi, S.M. Fakhraie and A. Kaviani Figure 2. LUTB structure. each block-level track can be arbitrarily connected to each of 16 4-LUT inp Scientia Iranica, Vol. 11, No. 3, pp 159{164 c Sharif University of Technology, July 2004 On Routing Architecture for Hybrid FPGA M. Nadjarbashi, S.M. Fakhraie 1 and A. Kaviani 2 In this paper, the routing

More information

CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review

CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, 2003 Review 1 Overview 1.1 The definition, objectives and evolution of operating system An operating system exploits and manages

More information

Reducing The De-linearization of Data Placement to Improve Deduplication Performance

Reducing The De-linearization of Data Placement to Improve Deduplication Performance Reducing The De-linearization of Data Placement to Improve Deduplication Performance Yujuan Tan 1, Zhichao Yan 2, Dan Feng 2, E. H.-M. Sha 1,3 1 School of Computer Science & Technology, Chongqing University

More information