
Worker-Checker: A Framework for Run-time Parallelization on Multiprocessors

Kuang-Chih Liu    Chung-Ta King
Department of Computer Science
National Tsing Hua University
Hsinchu, Taiwan 300, R.O.C.
{kcliu,king}@cs.nthu.edu.tw

Abstract

Run-time parallelization is a technique for solving problems whose data access patterns are difficult to analyze at compile time. In this paper we propose a worker-checker framework to classify different run-time parallelization schemes. Under the framework, operations performed during run-time parallelization are classified loosely into a worker and a checker. Different schemes are then cast into the framework based on the relative execution order of their worker and checker. From the framework, we identified several new run-time parallelization methods. In the second part of the paper we then examine the implementation of one such method, derived from speculative parallelization [10]. The implementation is based on the idea of embedding hardware checkers inside memory controllers. We will present the design of the hardware checker and evaluate the effectiveness of the design on run-time parallelizing DOALL and DOACROSS loops.

Keywords: run-time parallelization, speculative parallelization, inspector-executor, irregular problem, smart memory

1 Introduction

Run-time parallelization is a technique for solving problems whose data access patterns are difficult to analyze at compile time. The programs of these problems usually have array elements that are accessed via nonlinear subscripts, pointers, or subscripted subscripts. Analyzing such program constructs has been a major challenge to current compiler techniques. A large class of problems, including molecular dynamics, fluid dynamics, astrophysics and device simulation, exhibit this characteristic. To exploit the parallelism inherent in these irregular applications, we need to resort to run-time solutions.

(Footnote: A preliminary version of this paper appears in Proc. of the Eighth IASTED International Conference on Parallel and Distributed Computing and Systems, Oct. …. This work was supported in part by National Science Council grants NSC …-E-… and NSC …-E-….)

Since loops usually consume the most execution time of a program and they potentially contain the most parallelism, the majority of existing run-time parallelization schemes focus on loops. There are two basic approaches to run-time parallelization of irregular loops: speculative parallelization [9, 10] and inspector-executor [2, 7, 11, 12, 13, 15]. Speculative parallelization is a run-time strategy in which the given loop is executed speculatively as if it were a DOALL loop. While the loop is executing, data accesses are recorded. At the end of the execution, the access records are checked to discover any violation of data dependences. If there is a violation, the execution is rolled back and the loop is executed sequentially again. Obviously, this approach can yield good results when the loop is in fact fully parallel.

Inspector-executor is a classic paradigm for run-time parallelization. Given a sequential loop, the compiler (or programmer) generates two pieces of code from the loop: an inspector and an executor. The inspector examines data access patterns and generates a schedule for executing the loop. The executors use the schedule to access data and synchronize with each other so as to execute the operations in parallel, as the sketch below illustrates.
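To make the paradigm concrete, here is a minimal, self-contained C sketch of the inspector-executor idea, not taken from any of the cited systems. It assumes a loop of the form A[w[i]] = f(A[r[i]]) whose index arrays are known only at run time; all names (inspect, execute, the stage array) are illustrative, and the stage assignment is deliberately conservative: it orders any two iterations that touch a common element, even read-after-read.

#include <stdio.h>

#define N 8          /* iterations */
#define NELEM 8      /* size of the shared array A */

/* Inspector: assign each iteration to a wavefront (stage).  An iteration
   must run after every earlier iteration that touched one of its elements,
   so its stage is 1 + the largest stage recorded for those elements. */
static void inspect(const int w[], const int r[], int stage[]) {
    int last_stage[NELEM] = {0};   /* per-element bookkeeping */
    for (int i = 0; i < N; i++) {
        int s = last_stage[w[i]];
        if (last_stage[r[i]] > s) s = last_stage[r[i]];
        stage[i] = s + 1;
        last_stage[w[i]] = stage[i];
        last_stage[r[i]] = stage[i];
    }
}

/* Executor: run the stages in order; iterations within one stage are
   independent and could run in parallel (sequential here for brevity). */
static void execute(double A[], const int w[], const int r[],
                    const int stage[], int max_stage) {
    for (int s = 1; s <= max_stage; s++)
        for (int i = 0; i < N; i++)
            if (stage[i] == s)
                A[w[i]] = A[r[i]] + 1.0;   /* the loop body */
}

int main(void) {
    double A[NELEM] = {0};
    int w[N] = {0, 1, 2, 0, 3, 1, 2, 3};   /* run-time write indices */
    int r[N] = {1, 2, 3, 2, 0, 0, 1, 2};   /* run-time read indices  */
    int stage[N], max_stage = 0;
    inspect(w, r, stage);
    for (int i = 0; i < N; i++)
        if (stage[i] > max_stage) max_stage = stage[i];
    execute(A, w, r, stage, max_stage);
    printf("stages:");
    for (int i = 0; i < N; i++) printf(" %d", stage[i]);
    printf("\n");
    return 0;
}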

Although these two approaches look different, they in fact have many common characteristics. For example, in both approaches there are operations for recording, analyzing, and checking dependences. There are also operations for actually evaluating the loop iterations. If we could come up with a proper taxonomy to classify different run-time parallelization schemes, then we might be able to observe these strategies more closely and perhaps identify other opportunities for parallelization. In this paper we propose one such taxonomy. From the taxonomy several variations of the two basic run-time parallelization strategies are identified, and in the latter part of the paper we will examine one variation in detail.

In order to come up with a simple but useful classification, we partition the activities in a run-time parallelization scheme loosely into two general classes: worker and checker. The worker contains roughly the operations involved in executing the actual loop iterations, while the checker contains the remaining operations, including all bookkeeping and checking operations. Different run-time parallelization schemes are then classified based on the relative execution order of their worker and checker. In our framework, the execution order may be strict, interleaved, or overlapped. According to this taxonomy, the speculative parallelization scheme proposed in [9, 10] is a strict worker-then-checker strategy, while the classic inspector-executor scheme [15, 12] is a strict checker-then-worker strategy. Details of the taxonomy will be described in Section 2.

One use of the taxonomy is to identify new opportunities for performing run-time parallelization. From our classification we can see that the operations done in the checker are mostly overheads: they do not perform the actual computations. Although previous approaches have considered the parallelization of the checker and/or the worker, these two classes of activities are primarily performed sequentially. The opportunity of overlapping the execution of the two, so as to hide the overheads in the checker, was not exploited. In this paper, we will investigate such opportunities and examine one possible solution.

Our proposed scheme extends speculative parallelization and supports overlapped execution of worker and checker with smart memories. The basic idea is to put some checker functions into the memory controllers, so that the memories can play the role of the checker while the processors perform the work of the workers. With architectural support, the operations in the checker and the worker are overlapped naturally. In addition, we can also avoid the possibly long execution time of a software implementation of the checker. We will show that the checker logic added to the memory controllers is quite simple but is very effective for improving the performance of run-time parallelization. Our simulation results also show that the extra circuits in the controllers help to exploit run-time loop-level parallelism but do not degrade memory performance much.

The contributions of this paper are twofold. First, we propose the worker-checker framework, which gives directions for improving current techniques and identifies possible overheads. Second, we demonstrate a smart memory design for run-time parallelization based on the speculative parallelization strategy.

The rest of this paper is organized as follows: Section 2 provides background on related work and the worker-checker framework. In Section 3 we examine one worker-checker algorithm for speculative parallelization as a case study. Section 4 evaluates the performance of our design. Section 5 concludes the paper.

2 Worker-Checker Framework

2.1 Previous Run-time Parallelization Schemes and Their Overheads

In this subsection we briefly review previous run-time parallelization schemes. The main operations in these schemes are analyzed and classified, which serves as a basis for our worker-checker framework.

Possible overheads in these schemes are also identified, from which new approaches to run-time parallelization are motivated. To exploit loop-level parallelism at run time, there are two general approaches: speculative parallelization and inspector-executor. Their main operations are summarized in Table 1. For simplicity, speculative parallelization is assumed to be successful, so that there is no roll-back and re-execution.

Scheme                      | Phase      | Access Marking | Execution of Source Loop | Synchronization Primitives | Dependence Testing | Parallel Schedule
Speculative Parallelization | Specu.Run  |       x        |            x             |                            |                    |
                            | Specu.Test |                |                          |                            |         x          |
Inspector-Executor          | Inspector  |                |                          |                            |         x          |        x
                            | Executor   |                |            x             |             x              |                    |
                            |            |    (Worker)    |         (Worker)         |          (Worker)          |     (Checker)      |    (Checker)

Table 1: Typical operations in run-time parallelization.

Speculatively executing loops in parallel at run time was originally proposed by Rauchwerger and Padua [9, 10]. Their algorithm uses three shadow arrays A_r, A_w and A_np for recording array accesses. Two techniques, privatization and reduction parallelization, are used to enhance the chance of correct speculative execution. Their LRPD-test (or the PD-test in [9]) has shown promising results in parallelizing the PERFECT benchmarks.

Speculative parallelization typically consists of two phases: Specu.Run and Specu.Test. Specu.Run, while executing the source loop in parallel (Execution of Source Loop), uses some data structures to mark shared data references (Access Marking). If a loop-carried dependence is violated in the execution, called a dependence hazard, then Specu.Test can report this violation by analyzing these data structures (Dependence Testing) at the end of the execution. According to the characteristics of the operations done in each phase, Specu.Run can be classified loosely as performing the operations of the worker, while Specu.Test performs those of the checker. Note that in speculative parallelization, the checker follows the worker sequentially.

Next, let us analyze the overhead in speculative execution. Access marking in the worker causes extra memory references and consumes some CPU resources. However, most overheads are in the checker. In fact, in speculative parallelization, the checker is pure overhead.

This is because the checker performs only checking, not any computation of the loop. In addition, the operations of the checker and the worker are not overlapped. From the viewpoint of overhead hiding, it is possible to let the worker overlap with the checker. We shall discuss one implementation in detail in Section 3.

Zhu and Yew first proposed the inspector-executor strategy on shared-memory multiprocessors [15]. Saltz et al. extended the idea and contributed significantly to run-time scheduling of irregular loops, either on shared-memory multiprocessors [12, 13] or on distributed-memory multicomputers [4, 8]. Leung and Zahorjan proposed methods to parallelize the inspector by sectioning and bootstrapping [7]. Chen and Yew extended the results in [15] and discussed how to obtain optimal schedules in parallel [2]. A recent work by Rauchwerger and Padua improved previous results and discussed the ideas of interleaved and overlapped inspector-executor [11].

In inspector-executor, the inspector plays the role of the checker, and the executor plays the role of the worker. The checker has to analyze data dependences (Dependence Testing) and then prepare an execution schedule (Parallel Schedule) for the worker. The worker then executes the loop in parallel (Execution of Source Loop) according to the schedule. Unlike speculative parallelization, inspector-executor can handle partially parallel loops, i.e., DOACROSS loops. Overheads in inspector-executor include the time spent in the inspector and in synchronization (Synchronization Primitives) during the executor. As pointed out in [7], processor utilization and load balance are worthwhile criteria for achieving good performance without a globally optimal schedule. How these criteria can be addressed from the viewpoint of overhead hiding is an interesting research topic. We shall discuss this issue based on the worker-checker framework later.

2.2 The Worker-Checker Model

Under the worker-checker model, we can identify at least six different run-time parallelization schemes, as shown in Table 2.

Type | Scheme                           | Example                                  | References
1    | strict worker-then-checker       | speculative parallelization              | [9, 10]
2    | strict checker-then-worker       | inspector-executor                       | [2, 7, 12]
3    | interleaved worker-then-checker  | our algorithm                            | this paper
4    | interleaved checker-then-worker  | sectioning inspector-executor            | [7, 11]
5    | overlapped worker-then-checker   | hardware-supported checker               | this paper
6    | overlapped checker-then-worker   | dynamically scheduled inspector-executor | [11]

Table 2: Run-time parallelization schemes under the worker-checker framework.

This classification is based on the relative execution order of the worker and the checker. There are three possible orders: strict, interleaved, and overlapped. The six schemes further fall into two general categories: speculative parallelization and inspector-executor. Types 1, 3 and 5 belong to speculative parallelization, and Types 2, 4 and 6 belong to inspector-executor.

[Figure 1: Six types of run-time parallelization schemes. Panels (a) through (f) depict the six worker/checker execution orders listed in Table 2.]

2.2.1 Run-Time Parallelization Based on Speculative Parallelization

Strict worker-then-checker: The term strict means that there is a strict sequential execution order between the worker and the checker, i.e., the checker does not start until the worker finishes the speculative execution. Figure 1(a) illustrates such an execution order. Note, however, that the worker and the checker may themselves run in parallel. This scheme is the run-time parallelization method originally proposed in [9, 10]. Apparently, this scheme has the drawback that the worker always runs through the whole execution even if violations of dependences have occurred in the middle of the execution. This is because the checker has to wait until the worker finishes; it cannot check the hazards and inform the worker earlier.

Interleaved worker-then-checker: In this scheme the worker only works on a portion of the whole computation, called a chunk, and then the checker checks the partial results immediately.

If the checker does not discover any hazard, the worker continues with the next chunk of the computation. Otherwise, the current chunk is executed again sequentially. The execution order of the worker and the checker is shown in Figure 1(c). This scheme is a straightforward extension of the Type 1 scheme. It has the benefit that the checker can detect hazards earlier and needs only to test partial results. In addition, the worker only has to sequentially re-execute the chunks in which the checker fails. In this way, even DOACROSS loops can be executed with some degree of parallelism. In this paper, such an execution scheme will be referred to as the greedy-chunk approach; a sketch of its driver loop is given at the end of this subsection.

Overlapped worker-then-checker: In this scheme, the operations of the worker and the checker are overlapped, with the worker ahead of the checker. The greedy-chunk approach introduced above can also be used here. Again the whole computation is partitioned into chunks. While the worker is executing the current chunk, the checker simultaneously checks for hazards in the generated results. If a hazard is detected, the current chunk is re-executed sequentially. In this way, the overhead of executing the checker can be hidden by the worker. Figure 1(e) illustrates such an execution scenario.

It is apparent that in order for this scheme to work, the system must be able to execute the worker and the checker concurrently, either logically or physically. For example, in a system supporting multi-threading, the checker can be executed in one thread while the worker runs in another. Alternatively, the checker can be realized with hardware that works concurrently with the main CPU, which executes the worker. In the second part of this paper we will address the design issues of this hardware approach.

One critical issue in the Type 5 scheme is when the checkers are invoked. If the checkers are invoked at the end of each chunk, then this is equivalent to the Type 3 scheme. To increase the overlap between the worker and the checker, there are at least two possible strategies: check-on-iteration and check-on-reference. In the check-on-iteration strategy, the checkers are invoked after the workers have executed a number of iterations of the loop. On the other hand, in the check-on-reference strategy hazards are checked at each shared array reference. The difference between these two strategies is mainly a tradeoff between the degree of parallelism and the checker overhead. A fine-grain strategy such as check-on-reference will incur a large overhead in invoking the checkers but allows hazards to be detected as early as possible. Section 4 will give a comparison between these two strategies.
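The following is a minimal C sketch of the greedy-chunk driver for the interleaved worker-then-checker scheme. The helpers speculate_chunk, check_chunk, run_chunk_sequentially and reset_marks are hypothetical stand-ins for the worker, the checker, the sequential recovery path, and the clearing of the per-chunk shadow state; they are not part of the paper's design.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins; real implementations would be parallel. */
static void speculate_chunk(int first, int last) { (void)first; (void)last; }
static bool check_chunk(void) { return true; }   /* true = no hazard found */
static void run_chunk_sequentially(int first, int last) { (void)first; (void)last; }
static void reset_marks(void) { }

/* Greedy-chunk driver: speculate one chunk at a time as a DOALL, check it,
   and fall back to sequential re-execution of just that chunk on a hazard. */
static void greedy_chunk(int n_iters, int chunk_size) {
    for (int first = 0; first < n_iters; first += chunk_size) {
        int last = first + chunk_size < n_iters ? first + chunk_size : n_iters;
        speculate_chunk(first, last);              /* worker: parallel execution   */
        if (!check_chunk())                        /* checker: test this chunk only */
            run_chunk_sequentially(first, last);   /* hazard: redo this chunk in order */
        reset_marks();                             /* shadow state is per-chunk    */
    }
}

int main(void) { greedy_chunk(6400, 32); puts("done"); return 0; }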

2.2.2 Run-Time Parallelization Based on Inspector-Executor

Strict checker-then-worker: This scheme is the conventional inspector-executor method [7, 12, 13, 15]. The checker (i.e., the inspector) and the worker (i.e., the executor) can both be parallelized, but the worker must wait for the execution schedule and cannot start until the checker finishes. The behavior of this scheme is shown in Figure 1(b). One important issue in this scheme is to obtain an optimal execution schedule for the worker, and to obtain it quickly. The inherently sequential nature of deciding the schedule makes it difficult to fully utilize all the processors. It is also difficult to come up with a schedule which balances the load of the processors in executing the worker. Hints for solving these problems may be found in [11].

Interleaved checker-then-worker: Similar to the interleaved worker-then-checker scheme, the operations in the checker and the worker are interleaved in this scheme. Figure 1(d) illustrates the behavior. This scheme is a simple extension of the conventional inspector-executor methods. The greedy-chunk approach introduced above can be applied here, in cooperation with the sectioning algorithm [7]. Note that with sectioning it is not possible to have a globally optimal schedule, but the worker can still be parallelized to some extent, and good speedup can be obtained if the processors are well utilized.

Overlapped checker-then-worker: In this scheme, while the checker is inspecting the iterations in the next chunk, the worker executes the iterations in the current chunk according to the schedule generated by the checker previously. Operations in the checker and the worker are overlapped, and the checker overhead can be hidden by the worker. Figure 1(f) illustrates the scenario. Again the system needs to support concurrent execution of the checker and the worker, and allow them to synchronize efficiently. This can be done with multi-threading or special hardware. In [11] a rough sketch of an implementation with multi-threading and dynamic processor assignment was given. We will leave the investigation of this scheme for future research.

3 A Case Study: Design of a Hardware Checker

From the previous section, we can see that by interleaving or overlapping the worker and the checker, we can exploit further opportunities to optimize run-time parallelization. As a case study, we will examine one design of the Type 5 scheme in this section.

We will discuss the overall design concept, the check-on-reference algorithm in detail, and the considerations in hardware implementation.

[Figure 2: A shared-memory multiprocessor with hardware checkers. Processors (the workers) connect through an interconnection network to the memory modules of the global main memory; each memory module contains a checker.]

3.1 Overall Organization

Figure 2 shows the organization of a shared-memory multiprocessor which implements the Type 5 scheme of the worker-checker model. There is a hardware checker embedded in the memory controller of each memory module. In this system the processors play the role of the worker and the memory controllers play the role of the checker. Given a loop to be executed speculatively, the processors first initialize the checkers before entering the source loop. While the processors are executing the iterations, the checkers monitor the accesses to the shared array in memory at the same time. Based on the types of the accesses and the iterations in which the accesses occur, the checkers can detect any hazard due to loop-carried data dependences. Since the checkers are invoked on every access to the shared array, they follow the check-on-reference strategy. In the following, we will first examine the hazard conditions and introduce the check-on-reference algorithm, which is used by the hardware checkers to detect hazards. Implementation details of the hardware checkers are then presented.

3.2 Hazard Conditions

In this subsection, we examine the conditions under which dependence hazards may occur. Note first that data dependences may or may not cause hazards, depending mainly on the order of accesses to the shared array.

To illustrate the idea, consider the example loop shown in Figure 3. In the loop there are two references to the shared array A, which are labeled a and b respectively. Every access to the shared array can be represented using an access identifier x(p, q), where x is either r for read or w for write, p is the iteration number, and q is the label of the reference. Let addr(x(p, q)) denote the address of the array element accessed in x(p, q). The operator → denotes a partial order, and r(p, q) → w(r, s) means that the access r(p, q) happens before w(r, s).

    DO i = 1, N
(a)    A(I1(i)) = ...
(b)    ... = A(I2(i))
    END

Figure 3: An example to illustrate shared data accesses and possible hazards. (The Data part of the figure lists the contents of the index arrays I1 and I2.)

Suppose the memory consistency model is sequential consistency [5]. From Figure 3, we can see that there is a loop-carried flow dependence from w(2, a) to r(3, b), because they both access the same location A(6), i.e., addr(w(2, a)) = addr(r(3, b)) = A(6). Now, if w(2, a) occurs before r(3, b), then the dependence is satisfied and there will be no run-time hazard. However, if r(3, b) is performed before w(2, a), i.e., r(3, b) → w(2, a), then a run-time hazard will occur. This is because r(3, b) will fetch the old value of A(6) instead of the most up-to-date value written by w(2, a). In the following discussion, we will say that r(3, b) → w(2, a) forms a violated access sequence.

Depending on the order and types of the accesses, there are three kinds of violated access sequences, which are shown in Table 3. From Table 3 we can see that a read access will cause a violated access sequence with a previous write if (1) they both access the same array element, and (2) the write is labeled with a larger iteration number, i.e., it is supposed to be invoked in a later iteration. This is a WAR (write after read) hazard, which typically violates loop-carried anti-dependences. Similarly, a Type 2 hazard is a WAW (write after write) hazard, which violates output dependences, and a Type 3 hazard is a RAW (read after write) hazard, which violates flow dependences.

Type | Condition                                                         | Hazard
1    | w(i, x) → r(j, y), where i > j and addr(w(i, x)) = addr(r(j, y)) | WAR
2    | w(i, x) → w(j, y), where i > j and addr(w(i, x)) = addr(w(j, y)) | WAW
3    | r(i, x) → w(j, y), where i > j and addr(r(i, x)) = addr(w(j, y)) | RAW

Table 3: Three types of violated access sequences.

3.3 The Checker Algorithm

In this subsection, we present the checker algorithm, which is to be executed in each checker. The algorithm is based on the check-on-reference strategy. From Table 3 we can see that the iteration number associated with an access indicates when the access is supposed to happen. This information should match the time when the access actually occurs; otherwise run-time hazards might happen. Thus, for example, to check if a read access causes a run-time hazard, we need to compare its associated iteration number iter with the iteration numbers of all previous writes which access the same array element. If iter is smaller than any of those iteration numbers, then there is a write access which occurred earlier but is supposed to occur later. As a result the read operation causes a violated access sequence and a run-time hazard occurs. Similar arguments apply to write accesses. In practice, we need not check "all" iteration numbers, only the largest iteration number recorded so far. These ideas lead to the checker algorithm shown in Figure 4.

The algorithm is invoked on every access to the shared array, because it implements the check-on-reference strategy. The access must indicate the type of access op, the address of the array element to be accessed A[k], and the iteration iter in which the access occurs. To check for run-time hazards, the algorithm references two shadow arrays. The array A_w (A_r) records for each array element the maximum iteration number at which an access has ever written to (read from) the element. In the algorithm we assume that the index of the given array is normalized and positive. Thus, if the shadow arrays have been properly initialized to zero, then a nonzero value in a shadow element indicates that the corresponding array element has been accessed before.

The primary task of the algorithm is to maintain the two shadow arrays and to use their information to check for hazards according to Table 3. For a read access to A[k] in iteration iter, the algorithm checks the maximum iteration number stored in A_w[k]. If iter is smaller than the value in A_w[k], then a WAR hazard occurs. Otherwise, A_r[k] is updated to the maximum iteration number. The operations performed during a write access are similar.

/* CHECKER ALGORITHM:
   Input:  type of access op (READ or WRITE), array base address A,
           array index k, iteration number when the access occurs iter.
   Output: inform the worker if a hazard occurs. */
 1  MEMORY_CHECK(op, A, k, iter)
 2  {
 3      get A_r[k] and A_w[k] from shadow arrays A_r and A_w;
 4      switch (op) {
 5      case READ:
 6          /* if A_w[k] != 0, A[k] has been written before */
 7          if (iter < A_w[k]) hazard = VIOLATE_ANTI;      /* WAR */
 8          if (A_r[k] < iter) A_r[k] = iter;
 9          if (hazard) inform the worker and exit.
10      case WRITE:
11          /* if A_w[k] != 0, A[k] has been written before */
12          if (A_w[k] == 0) backup(A[k]);  /* backup on the first write */
13          if (iter < A_w[k]) hazard = VIOLATE_OUTPUT;    /* WAW */
14          if (hazard) inform the worker and exit.
15          /* if A_r[k] != 0, A[k] has been read before */
16          if (iter < A_r[k]) hazard = VIOLATE_FLOW;      /* RAW */
17          if (A_w[k] < iter) A_w[k] = iter;
18          if (hazard) inform the worker and exit.
19      }
20  }

Figure 4: The checker algorithm.
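For readers who want to experiment with the algorithm in software, the following is a directly compilable C rendering of Figure 4. Returning hazard codes to the caller, the backup() stub, and the main() driver are this sketch's assumptions, not the paper's hardware interface.

#include <stdio.h>

#define ARRAY_SIZE 1024
enum { READ, WRITE };
enum { NO_HAZARD = 0, VIOLATE_ANTI, VIOLATE_OUTPUT, VIOLATE_FLOW };

static int A_r[ARRAY_SIZE];   /* max iteration that has read each element    */
static int A_w[ARRAY_SIZE];   /* max iteration that has written each element */

static void backup(int k) { (void)k; /* save old value for roll-back */ }

/* One call per shared-array access.  Iterations are assumed normalized to
   start at 1, so 0 in a shadow entry means "never accessed". */
static int memory_check(int op, int k, int iter) {
    if (op == READ) {
        int hz = (iter < A_w[k]) ? VIOLATE_ANTI : NO_HAZARD;  /* WAR: a later
                                                                 write already ran */
        if (A_r[k] < iter) A_r[k] = iter;
        return hz;
    }
    /* op == WRITE */
    if (A_w[k] == 0) backup(k);                  /* backup on the first write */
    if (iter < A_w[k]) return VIOLATE_OUTPUT;    /* WAW: a later write already ran */
    {
        int hz = (iter < A_r[k]) ? VIOLATE_FLOW : NO_HAZARD;  /* RAW: a later
                                                                 read already ran */
        if (A_w[k] < iter) A_w[k] = iter;
        return hz;
    }
}

int main(void) {
    /* Re-create the Figure 3 scenario: r(3, b) happens before w(2, a), both
       touching element 6, so the write reports VIOLATE_FLOW (code 3). */
    printf("read  in iter 3: hazard=%d\n", memory_check(READ,  6, 3));
    printf("write in iter 2: hazard=%d\n", memory_check(WRITE, 6, 2));
    return 0;
}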

3.4 Considerations for Implementing Checkers

The checker algorithm shown in Figure 4 can be implemented entirely in software. One possible implementation on a multiprocessor is to use one thread or process on each processor to execute the checker algorithm, while another thread or process executes the worker. Since the checker threads or processes are executed on different processors in parallel, updating the shadow arrays A_w and A_r must be protected as a critical section. The critical section serializes the operations of the checkers and results in a system bottleneck.

The checker algorithm can also be implemented in hardware. The primary reason is that the Type 5 scheme (overlapped worker-then-checker) can then actually be realized. Besides, a recent trend in memory design is to put more functionality into the memory. Memory controllers are becoming more intelligent for specific applications, e.g., irregular memory access [14], locality analysis using prefetch caches, fast graphics display with smart Z-buffers, etc. It is thus feasible to embed some checker hardware inside memory controllers to support run-time parallelization. Details of the hardware checkers will be given in the following subsections. Note that an alternative to hardware checkers is to implement the checkers inside the caches. The checker algorithm would then operate cooperatively with the cache-coherence protocol. This approach is more complex and will be left for future investigation.

[Figure 5: Circuit diagram of the hardware checker. (a) The checker circuit: two selectors fed with the shadow entries A_r and A_w, the iteration number iter, and the read/write request lines; their outputs, write ? A_r : max(A_r, iter) and read ? A_w : max(A_w, iter), update the shadow arrays, and hazards assert the interrupt line INT. (b) The selector circuit: a comparator (B < A?) and a multiplexer controlled by input C.]

3.5 The Checker Circuit

Figure 5 shows the checker circuit, which implements lines 4-19 of the checker algorithm in Figure 4. Again, the circuit operates whenever there is an access to the shared array. Each access supplies the circuit with the following inputs: the type of request, the iteration number, and the elements of the shadow arrays A_w and A_r corresponding to the accessed array element.

Note that a shadow array element stores the maximum iteration number at which an access to the corresponding array element has ever occurred. This data will be loaded from memory for checking and then stored back for updating. The checker circuit consists of two identical selectors, one for read accesses and the other for write accesses. The selectors are used to compare the iteration numbers, report hazards, and update the maximum iteration number. In each selector, the input A contains the value from the shadow array (A_r or A_w) and B contains the iteration number in which the current access occurs. A comparator (CMP) is used to compare these two values. A multiplexer is used to select a value from A or B to update the shadow array element: A is selected when the control input is 1; otherwise B is selected.

The checker circuit works as follows. When a read request to a shared array element A[k] arrives (read = 1 and write = 0), the read selector compares inputs A (the shadow array element A_w[k]) and B (the iteration number). If B < A, then a WAR hazard occurs and the hazard interrupt signal INT is asserted. Since the C input to the selector is now 1, A_w[k] is unchanged. In the meantime, the write selector has input C = 0. Thus, if the iteration number iter is larger than the shadow element A_r[k], the write selector selects the B input to store into A_r[k]. In this way, A_r[k] is updated with the maximum iteration number. Note that the write selector would signal a RAR (read after read) hazard if iter < A_r[k]; an inverter is used to suppress such a signal. The operations performed during a write access are similar: the read selector now checks for WAW hazards, the write selector checks for RAW hazards, and the read selector is also responsible for updating the shadow element A_w[k].
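The circuit's behavior can be summarized in a small behavioral C model. The sketch below follows the description above (two comparators and two multiplexers computing write ? A_r : max(A_r, iter) and read ? A_w : max(A_w, iter)); the function and signal names are this sketch's own, not the paper's.

#include <stdbool.h>
#include <stdio.h>

/* One invocation per shared-array access; a_r/a_w point to the shadow
   entries of the referenced element.  Returns the state of the INT line. */
static bool checker_circuit(bool is_write, int iter, int *a_r, int *a_w) {
    bool hz_aw = iter < *a_w;              /* selector #1: WAR on a read, WAW on a write */
    bool hz_ar = is_write && iter < *a_r;  /* selector #2: RAW; the inverter masks
                                              the spurious RAR case on reads   */

    /* Multiplexer outputs written back to the shadow memory. */
    if (!is_write && iter > *a_r) *a_r = iter;  /* read side:  max(A_r, iter) */
    if ( is_write && iter > *a_w) *a_w = iter;  /* write side: max(A_w, iter) */

    return hz_aw || hz_ar;                 /* INT asserted on any hazard */
}

int main(void) {
    int a_r = 0, a_w = 0;   /* shadow entries for one element, e.g. A(6) */
    printf("r in iter 3: INT=%d\n", checker_circuit(false, 3, &a_r, &a_w));
    printf("w in iter 2: INT=%d\n", checker_circuit(true,  2, &a_r, &a_w)); /* RAW */
    return 0;
}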

3.6 Memory Controllers with the Checker Circuit

The checker circuit in Figure 5 can easily be embedded in the memory controller. The architecture of such a memory controller is shown in Figure 6. There are two buffers: one is a FIFO buffer for queuing memory access requests, i.e., regular accesses; the other is used by the checker and enqueues the requests for checking hazards, i.e., checker accesses. A filter is used to enable/disable checking and to translate referenced addresses into shadow array addresses.

[Figure 6: A memory controller with the checker circuit. Commands and addresses pass through the filter, which forwards regular addresses to the memory banks (#1-#4) and shadow addresses, along with the read/write/iter information, to the checker; hazards are reported on an interrupt line, and data is transferred over the data bus.]

In our design, the shadow arrays are stored in the main memory. The filter contains a number of registers. The CSW (Checker Status Word) register contains a flag indicating whether the checker is enabled or disabled. The MSA (Marked Start Address) and MEA (Marked End Address) registers contain the starting and ending addresses of the shared array. The SBA register contains the starting address of the shadow arrays. The filter also has a register file to record which processor is executing which iteration. The register file has at least N registers, where N is the number of processors.

Now let us see how the memory controller works. For simplicity, we assume that only one shared array needs to be checked in the speculated loop. When the processors start executing the speculated loop, they first initialize the registers in the filter. When a processor p begins executing an iteration, it broadcasts the iteration number i to all the memory controllers. The filter in each controller records i in the p-th entry of its register file. When the filter receives a request from processor p to access a shared array element A[k], it will (1) obtain the iteration number from the p-th entry of the register file, (2) translate the shared array address A[k] into the shadow array addresses A_r[k] and A_w[k], and (3) pass this information into the checking queue for the checker circuit to check for hazards.

As mentioned, the shadow arrays are stored in the main memory. If the memory system is dual-port, then the checker can load or update the shadow arrays without affecting regular accesses. However, dual-port memories are very expensive. Thus, we assume one-port memories in the following discussions. Unfortunately, with such memories we need to arbitrate between regular and checker accesses. Our design is to serve the regular access first and then the corresponding checker access. In the next section we will study how regular accesses may be affected by checker accesses. Note that if the referenced shared element A[k] is placed in one memory module and its corresponding shadow elements A_r[k] and A_w[k] are placed in another module, then the checker access can be performed at the same time as the regular access. Such a checking scheme will be referred to as neighbor checking and can achieve an effect similar to dual-port memories. In Section 4 we will study the performance of neighbor checking as well.
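As an illustration of what the filter has to do per request, the following C sketch captures steps (1)-(3) above. It assumes element-granularity addresses and a shadow layout in which A_r is followed immediately by A_w at the SBA base; these assumptions, like all the names here, are this sketch's own.

#include <stdbool.h>
#include <stdint.h>

#define NPROC 32

struct filter {
    bool     csw;           /* Checker Status Word: checking enabled?        */
    uint32_t msa, mea;      /* Marked Start/End Address of the shared array  */
    uint32_t sba;           /* Shadow Base Address (A_r, then A_w, assumed)  */
    uint32_t iter[NPROC];   /* register file: iteration broadcast per processor */
};

struct checker_req {        /* what gets enqueued for the checker circuit */
    bool     is_write;
    uint32_t iter;
    uint32_t shadow_r, shadow_w;   /* addresses of A_r[k] and A_w[k] */
};

/* Returns true (and fills *out) when the access must also be checked.
   Under the one-port design described above, the regular access is served
   first and the checker request is served afterwards. */
bool filter_access(const struct filter *f, int pid, uint32_t addr,
                   bool is_write, struct checker_req *out) {
    if (!f->csw || addr < f->msa || addr >= f->mea)
        return false;                        /* not a monitored shared access */
    uint32_t k = addr - f->msa;              /* element offset k within A     */
    out->is_write = is_write;
    out->iter     = f->iter[pid];            /* set earlier by SET_ITER       */
    out->shadow_r = f->sba + k;                          /* A_r[k] */
    out->shadow_w = f->sba + (f->mea - f->msa) + k;      /* A_w[k] */
    return true;
}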

3.7 Software Supports

Given the hardware checker described in the previous subsections, the multiprocessor requires a number of system supports to accomplish speculative parallelization of nested loops. Table 4 summarizes the necessary supports.

Programmer       | Gives a directive SP_LOOP(addr, size) before the loop to be speculated
Compiler         | Transforms the specified loop into a speculative DOALL loop with recovery code by inserting the primitives MEMCHECK_ON, MEMCHECK_OFF, MEMCHECK_CONTINUE and SET_ITER
Operating System | Called by the primitives to initialize the filter in the checker and to invoke the user-level interrupt handler that handles hazards
Hardware         | Checks hazards and reports to the processors by interrupt

Table 4: Operations necessary for executing a speculative loop.

As an example, consider the code in Figure 7. In the source loop (Figure 7(a)), the programmer must mark the loop to be speculated with a directive SP_LOOP(addr, size), where addr is the starting address of the shared array and size is the range to be monitored/checked. The compiler then transforms the source loop into a speculative DOALL loop and a segment of recovery code, as shown in Figure 7(b). The speculative DOALL loop is enclosed by two primitives, MEMCHECK_ON and MEMCHECK_OFF. At run time the primitive MEMCHECK_ON is invoked when the processors enter the speculative loop. The primitive specifies parameters from which the starting and ending addresses of the shared array are calculated, and it initializes the registers MSA and MEA in the filter of the checkers. The content of the SBA register is initialized automatically and all the shadow array elements are initialized to zero. The filter is also enabled by asserting the flag in the CSW register. Finally, the recovery code is set up as the interrupt routine for hazards.

1 SP_LOOP(A, N);
2 for (i = 0; i < N; i++) {
3     A[I1[i]] = ...;
4     ...;
5     ... = A[I2[i]];
6 }

(a) the source loop

 1 MEMCHECK_ON(A, N);
 2 forall (pid = 0; pid < NPROC; pid++) {
 3     for (chunk = 0; chunk < total_chunks; chunk++) {
 4         base = pid * partition_size + chunk * chunk_size;
 5         for (i = base; i < base + partition_size; i++) {
 6             SET_ITER(i);
 7             A[I1[i]] = ...;
 8             ...;
 9             ... = A[I2[i]];
10         }
11     }
12     goto DONE;
13 RECOVER: /* this is the hazard interrupt handler routine */
14     BARRIER(barrier, NPROC);
15     if (pid == 0) {
16         /* recovery code here */
17         MEMCHECK_CONTINUE();
18     }
19 } /* forall */
20 DONE:
21 BARRIER(barrier, NPROC);
22 MEMCHECK_OFF();

(b) the transformed speculative DOALL loop

Figure 7: The transformed speculative DOALL loop by the compiler.

When a processor starts executing an iteration i, the primitive SET_ITER(i) (line 6 in Figure 7(b)) is called to inform the checkers of the iteration number and set the corresponding register file entry. The hardware checkers then check for hazards whenever there is an access to the shared array. When a checker finds a hazard, it interrupts all the processors. This causes the processors to execute the recovery code. In the meantime, the other checkers are also informed to turn off their checking mechanisms. After recovery, another primitive, MEMCHECK_CONTINUE (line 17 in Figure 7(b)), is used to wake up the checkers and continue with the speculative execution.

4 Performance Evaluation

In this section, we study the performance of our hardware checker design. The environment of our experiments is introduced first, followed by the performance results.

4.1 Experimental Environment

We used Augmint [1] to simulate the proposed hardware checker. Augmint is an execution-driven multiprocessor memory simulator for the Intel Pentium architecture. It augments the assembly code of the compiled application by inserting extra instructions at memory reference points in order to emit memory events. The augmented object code consists roughly of two parts: a front-end and a back-end. The front-end simulates the execution of multiple processors with user-level threads and generates events of interest. The back-end consists of user-specified simulation routines, which are invoked with the events generated by the front-end. Our checker circuit was embedded in the back-end routines. The simulation programs, i.e., the simulator and the synthetic benchmark, were compiled with GNU C and executed on a Pentium processor running Linux.

The hardware checkers were assumed to run inside a uniform memory access (UMA) multiprocessor, as shown in Figure 8. Shared data were stored in the global shared memory without local copies. Private caches and local memories were assumed to store instructions, stacks and unshared data. To simplify the experiments, virtual memory operations (such as page swaps) were not allowed while checkers were enabled. We also ignored network and bus contention so as to focus only on memory contention. Write operations were assumed to be non-blocking and there was a write buffer in each memory controller for queuing regular write accesses.

[Figure 8: The simulated architecture: a UMA shared-memory multiprocessor without bus/network contention. Each processor has a private cache and a local memory, and each memory module of the global main memory contains a checker.]

The main memory cycle time was assumed to be 5 cycles, and the memory controller served requests in units of one word. The memory consistency model was sequential consistency. Unless stated explicitly, we assume that the shared array elements and their corresponding shadow elements are stored in the same memory module.

In the experiments we used a synthetic program to evaluate the performance of the checker. The program is shown in Figure 9 and was adopted from [2]. The iteration count N was assumed to be the same as the size of the shared array A. There were P processors in the system. We used the greedy-chunk method to parallelize the loop. A chunk size of P iterations means that the loop is partitioned into N/P chunks and each processor executes one iteration in each chunk. In other words, the loop is partitioned in a CYCLIC fashion. Note that there are barrier synchronizations when the execution moves from one chunk to another. A chunk size of N means that there is only one chunk and each processor executes N/P iterations. In other words, the loop is partitioned in a BLOCK fashion.

Data access patterns were controlled by the array INDEX. By setting appropriate values in the array INDEX, we could make the loop DOALL or DOACROSS. Following [2], two parameters H_frac and H_size were used to control the DOACROSS access pattern. In brief, H_frac is the probability that an access is a hot access, where a hot access is one that easily causes data dependences. The parameter H_size is the fraction of the entire array A occupied by the hot area, where the hot area is the range of A that hot accesses reference. Thus, if an access is a hot access, it references the hot area; otherwise its reference can be directed to the entire array A. A sketch of such an index generator is given below.
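As an illustration, the INDEX array of the benchmark in Figure 9 (shown next) can be generated from these two knobs as in the following sketch; the use of rand() and the placement of the hot area at the front of A are assumptions of this sketch, not details given in the paper.

#include <stdlib.h>

/* Fill INDEX with n_refs references into an array of array_size elements.
   With probability h_frac a reference is "hot" and falls inside the hot
   area, a region covering an h_size fraction of the array; otherwise it
   may fall anywhere in the array. */
void make_index(int INDEX[], int n_refs, int array_size,
                double h_frac, double h_size) {
    int hot_len = (int)(h_size * array_size);   /* size of the hot area */
    if (hot_len < 1) hot_len = 1;
    for (int i = 0; i < n_refs; i++) {
        int hot = (double)rand() / RAND_MAX < h_frac;   /* hot access? */
        INDEX[i] = hot ? rand() % hot_len       /* confined to the hot area */
                       : rand() % array_size;   /* anywhere in A            */
    }
}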

for (i = 0; i < N; i++) {
    for (j = 0; j < r; j++) {
        if (odd(j)) A[INDEX[i*r+j]] = tmp1;
        else        tmp2 = A[INDEX[i*r+j]];
    }
    for (k = 0; k < W; k++) {
        /* simulate useful work; assume one       */
        /* processor cycle cost in each iteration */
    }
}

Figure 9: The synthetic benchmark program.

DOACROSS Pattern | Description
EARLY-RECOVER    | Hazards occur at the beginning of the loop
LATE-RECOVER     | Hazards occur near the end of the loop
MOST-SERIAL      | Heavy data dependences, with (H_frac, H_size) = (0.9, 0.1)
MOST-PARALLEL    | Light data dependences, with (H_frac, H_size) = (0.1, 0.9)

Table 5: Different types of DOACROSS loops used in the experiments.

We experimented with a number of DOACROSS loops, which are listed in Table 5. In the experiments, we set N = 6400, r = 2, and W = ….

4.2 Experimental Results

4.2.1 Overall Performance

Let us first study the performance of our checkers in executing DOALL loops. Note that DOALL loops do not cause any hazards and thus induce no recovery overhead. Figure 10(a) shows the speedup of executing the synthetic benchmark using different speculative parallelization schemes when the loop is DOALL. From the figure we can see that our hardware checker improves on the software implementation of Rauchwerger's algorithm [9, 10] by a factor of 2. We can also see that the speedup scales well when hardware checkers are used.

The overhead in our design comes primarily from software, such as the barriers between chunks, the chunk setup time, etc. When the chunk size is set to P, the speculative DOALL loop has the worst speedup.

[Figure 10: Performance of various run-time parallelization methods. (a) Speedup of DOALL loops for: without checkers (chunk size = N), with checkers (chunk size = N), the neighbor checker, with checkers (chunk size = P), and Rauchwerger's algorithm (chunk size = N). (b) Execution cycles of DOACROSS loops for the EARLY-RECOVER and LATE-RECOVER patterns and the Type 1 method.]

This is because a smaller chunk size results in a larger number of chunks. As a result, there are more barriers and chunk setups. In the figure there is a curve denoted "Neighbor Checker". This curve is the result of allocating the checkers to memory modules different from those holding the shared elements to be checked. We shall explain this curve later.

Next, let us consider the worst case in speculative parallelization, i.e., the loop is DOACROSS and the chunk size is set to N. Figure 10(b) shows the execution time of the synthetic loop in such a case. Two extreme dependence patterns were used: EARLY-RECOVER and LATE-RECOVER (see Table 5). In the EARLY-RECOVER case the hazard is found immediately with our hardware checker, and the execution does not waste much time in failure recovery. The resulting execution time is close to that of the sequential version regardless of the number of processors used. However, in the LATE-RECOVER case run-time parallelization performs even worse than the sequential version. This is because the sequential re-execution of the source loop, after speculation fails, dominates the total execution time. The situation becomes even worse if the Type 1 scheme is used. Note that in the Type 1 scheme, the whole loop needs to be executed before dependence hazards are checked, and the loop is rolled back on any hazard.

4.2.2 Effects of Chunk Size on DOACROSS Loops

From the above discussion, we can see that the chunk size is a very important parameter in speculative parallelization. Figure 11 shows the effects of the chunk size on MOST-PARALLEL and MOST-SERIAL loops.

[Figure 11: Effects of chunk size on two patterns of DOACROSS loop. Speedup versus processors for chunk sizes of 1xP, 2xP, 3xP and 4xP (plus the neighbor checker at 1xP) on (a) a MOST-PARALLEL DOACROSS loop and (b) a MOST-SERIAL DOACROSS loop.]

Figure 11(a) shows the speedup curves for MOST-PARALLEL loops. As we can see, the performance degrades rapidly as the chunk size increases. This is because a larger chunk size has a higher probability of causing hazards, which in turn degrades the performance. The same reasoning holds for MOST-SERIAL loops (Figure 11(b)). Comparing these two kinds of loops, MOST-SERIAL loops attain a maximum speedup of only about 4 and are less scalable. Moreover, as the number of processors increases, the performance of MOST-SERIAL loops degrades more rapidly than that of MOST-PARALLEL loops. The reason is that MOST-SERIAL loops have a higher probability of having hazards in a chunk than MOST-PARALLEL loops do.

The figure also shows that a chunk size of P results in the best performance. This is different from the case for DOALL loops, in which a chunk size of N is preferred (see Figure 10(a)). This implies that the CYCLIC partitioning of iterations is more suitable for DOACROSS loops.

4.2.3 Hot-Spot Effects and Neighbor Checkers

We say that a hot spot occurs when there is a large number of references contending for the same memory module during a short period of time. We measured the hot-spot effect by counting the maximum number of references ever buffered in the request queue of a memory controller. In Figure 12, we show how different ways of assigning the checkers may affect the hot spots. One approach to assigning the checkers, which was used in the previous experiments, is to store the shadow array elements in the same memory module as their corresponding shared array elements and use the local checker to check for hazards.

[Figure 12: Hot-spot effects due to different ways of embedding the checkers. Maximum queue length versus processors for (a) DOALL loops (with checkers, with neighbor checkers, and with no checkers) and (b) DOACROSS loops (MOST-PARALLEL and MOST-SERIAL, each with checkers and with neighbor checkers).]

Since a memory module can serve one request at a time, checker accesses can only be performed after the corresponding regular accesses complete. Another approach is to let the checker in another memory module check the local shared elements. This scheme is called the neighbor checker. In our experiment, if the regular access is performed in memory module i, then the checker access is performed in module (i + 1) % M, where M is the number of memory modules.

Figure 12(a) shows the effects of the different checker assignments on hot spots for DOALL loops. The figure shows that the neighbor checker scheme helps to reduce hot spots. When the number of processors is 32, the hot-spot effect is reduced by about 40% (from 114 to 46 queued requests). Note that the curves oscillate as the number of processors varies. This is because memory access patterns vary widely when the source loop is partitioned for different numbers of processors.

Figure 12(b) shows the results for DOACROSS loops using the greedy-chunk approach. The hot-spot effect is not very significant for DOACROSS loops. This is because chunks in which hazards occur must be re-executed sequentially. Sequential executions do not cause memory contention, which in turn minimizes hot spots. Thus the neighbor checker scheme is not very helpful in this case, though it can still reduce some memory hot spots.
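For completeness, the neighbor-checking placement amounts to a one-line mapping; here module_of is a hypothetical stand-in for the system's address-interleaving function, not an interface defined in the paper.

/* Hypothetical word-interleaved placement of regular accesses. */
static int module_of(unsigned addr, int M) { return (int)(addr % (unsigned)M); }

/* Neighbor checking: the checker access for an element served by module i
   goes to module (i + 1) % M, so the two can proceed concurrently. */
int checker_module(unsigned addr, int M) {
    return (module_of(addr, M) + 1) % M;
}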

[Figure 13: Effects of the checkers on the average memory latency. Average memory latency versus processors for a DOALL loop with no checkers, with checkers, and with neighbor checkers, and for MOST-PARALLEL and MOST-SERIAL DOACROSS loops with checkers and with neighbor checkers.]

4.2.4 Effects of Checkers on Memory Latency

As the number of processors increases, the memory modules have to serve more memory requests. In Figure 13 we compare the average memory latency while varying the number of processors. These curves show that when a fixed number of memory modules is used (4 memory modules in this case), the average latency of each reference increases when more processors are used. The curve with the lowest average memory latency corresponds to the DOALL loop without checkers. This curve serves as a baseline for reference purposes. When checkers are used, the average memory latency is close to the baseline as long as the number of processors is smaller than 10. In that range, the memory bandwidth is not saturated by the extra checking traffic. This also indicates that our checkers do not induce excessive overhead.

For DOACROSS loops, we can see that they have a longer average memory latency than DOALL loops when the number of processors is small. However, the memory latency of DOALL loops becomes longer than that of the MOST-SERIAL loop when the number of processors exceeds 25, and longer than that of the MOST-PARALLEL case when the number of processors exceeds 30. This is due to the heavier memory contention caused by the higher degree of parallelism in the DOALL loops. This is also evident in comparing the latencies of MOST-PARALLEL and MOST-SERIAL loops: the former have a higher memory latency than the latter.

The neighbor checker scheme is useful in reducing memory latency, but the effect is not very significant in DOALL loops. The reason is that regular accesses tend to be distributed evenly among the memory modules. Thus, even with the neighbor checker scheme, checker accesses will contend with regular accesses in all the modules. On the other hand, for DOACROSS loops, the neighbor checker scheme can improve the average memory latency by about 2 clock cycles.

[Figure 14: Effects of adding more memory checkers on MOST-PARALLEL DOACROSS loops. (a) Average memory latency for the DOALL no-checker baseline (M = 4) and MOST-PARALLEL loops with M = 4, 8, 12 and 16 memory modules; (b) speedup on MOST-PARALLEL loops with neighbor checkers for M = 4, 8, 12 and 16.]

4.2.5 Effects of Adding More Checkers

Intuitively, if we increase the number of memory modules, each having a hardware checker, memory requests will be served more quickly. Figure 14 shows the performance results for the MOST-PARALLEL loops when different numbers of memory controllers are used. From Figure 14(a) we can see that increasing the number of memory modules reduces the average memory latency. Of course, this performance gain comes at the cost of additional hardware. From Figure 14(b) we can see that increasing the number of memory modules enhances the speedup, but the improvement rate drops as more checkers are used. For example, when the number of processors in the system is 16, the speedup using four checkers is about 7, but the speedup is only 8 when eight checkers are used. Furthermore, when 16 checkers are used, the speedup only improves to 8.7. This is mainly due to the limited parallelism within the individual chunks. Thus, adding more checkers does improve the memory latency, but its effectiveness is constrained by the inherent parallelism in the loop. It should be noted that our experiments did not model network contention; thus adding more memory modules can always increase memory bandwidth. The amount of exploitable parallelism is limited only by the number of processors used and the chunk size selected.


More information

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department of Computer Science and Information Engineering National

More information

Chapter 7 The Potential of Special-Purpose Hardware

Chapter 7 The Potential of Special-Purpose Hardware Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture

More information

Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading

Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading Mario Almeida, Liang Wang*, Jeremy Blackburn, Konstantina Papagiannaki, Jon Crowcroft* Telefonica

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Main Points of the Computer Organization and System Software Module

Main Points of the Computer Organization and System Software Module Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network Congestion-free Routing of Streaming Multimedia Content in BMIN-based Parallel Systems Harish Sethu Department of Electrical and Computer Engineering Drexel University Philadelphia, PA 19104, USA sethu@ece.drexel.edu

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems On Object Orientation as a Paradigm for General Purpose Distributed Operating Systems Vinny Cahill, Sean Baker, Brendan Tangney, Chris Horn and Neville Harris Distributed Systems Group, Dept. of Computer

More information

III Data Structures. Dynamic sets

III Data Structures. Dynamic sets III Data Structures Elementary Data Structures Hash Tables Binary Search Trees Red-Black Trees Dynamic sets Sets are fundamental to computer science Algorithms may require several different types of operations

More information

A Scalable Method for Run Time Loop Parallelization

A Scalable Method for Run Time Loop Parallelization A Scalable Method for Run Time Loop Parallelization Lawrence Rauchwerger Nancy M. Amato David A. Padua Center for Supercomputing R&D Department of Computer Science Center for Supercomputing R&D University

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

to automatically generate parallel code for many applications that periodically update shared data structures using commuting operations and/or manipu

to automatically generate parallel code for many applications that periodically update shared data structures using commuting operations and/or manipu Semantic Foundations of Commutativity Analysis Martin C. Rinard y and Pedro C. Diniz z Department of Computer Science University of California, Santa Barbara Santa Barbara, CA 93106 fmartin,pedrog@cs.ucsb.edu

More information

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Ameya Velingker and Dougal J. Sutherland {avelingk, dsutherl}@cs.cmu.edu http://www.cs.cmu.edu/~avelingk/compilers/

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

Processes and Non-Preemptive Scheduling. Otto J. Anshus

Processes and Non-Preemptive Scheduling. Otto J. Anshus Processes and Non-Preemptive Scheduling Otto J. Anshus Threads Processes Processes Kernel An aside on concurrency Timing and sequence of events are key concurrency issues We will study classical OS concurrency

More information

COMPUTER SCIENCE 4500 OPERATING SYSTEMS

COMPUTER SCIENCE 4500 OPERATING SYSTEMS Last update: 3/28/2017 COMPUTER SCIENCE 4500 OPERATING SYSTEMS 2017 Stanley Wileman Module 9: Memory Management Part 1 In This Module 2! Memory management functions! Types of memory and typical uses! Simple

More information

Characteristics of Mult l ip i ro r ce c ssors r

Characteristics of Mult l ip i ro r ce c ssors r Characteristics of Multiprocessors A multiprocessor system is an interconnection of two or more CPUs with memory and input output equipment. The term processor in multiprocessor can mean either a central

More information

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

T H. Runable. Request. Priority Inversion. Exit. Runable. Request. Reply. For T L. For T. Reply. Exit. Request. Runable. Exit. Runable. Reply.

T H. Runable. Request. Priority Inversion. Exit. Runable. Request. Reply. For T L. For T. Reply. Exit. Request. Runable. Exit. Runable. Reply. Experience with Real-Time Mach for Writing Continuous Media Applications and Servers Tatsuo Nakajima Hiroshi Tezuka Japan Advanced Institute of Science and Technology Abstract This paper describes the

More information

Virtual Memory Management

Virtual Memory Management Virtual Memory Management CS-3013 Operating Systems Hugh C. Lauer (Slides include materials from Slides include materials from Modern Operating Systems, 3 rd ed., by Andrew Tanenbaum and from Operating

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

Parallel Processors. Session 1 Introduction

Parallel Processors. Session 1 Introduction Parallel Processors Session 1 Introduction Applications of Parallel Processors Structural Analysis Weather Forecasting Petroleum Exploration Fusion Energy Research Medical Diagnosis Aerodynamics Simulations

More information

Hardware for Speculative Run-Time Parallelization.

Hardware for Speculative Run-Time Parallelization. Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors 1 Ye Zhang y, Lawrence Rauchwerger z, and Josep Torrellas y y Computer Science Department University of Illinois

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Relaxed Memory Consistency

Relaxed Memory Consistency Relaxed Memory Consistency Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

Makuhari, Chiba 273, Japan Kista , Sweden. Penny system [2] can then exploit the parallelism implicitly

Makuhari, Chiba 273, Japan Kista , Sweden. Penny system [2] can then exploit the parallelism implicitly Dynamic Scheduling in an Implicit Parallel System Haruyasu Ueda Johan Montelius Institute of Social Information Science Fujitsu Laboratories Ltd. Swedish Institute of Computer Science Makuhari, Chiba 273,

More information

Compiler and Runtime Support for Programming in Adaptive. Parallel Environments 1. Guy Edjlali, Gagan Agrawal, and Joel Saltz

Compiler and Runtime Support for Programming in Adaptive. Parallel Environments 1. Guy Edjlali, Gagan Agrawal, and Joel Saltz Compiler and Runtime Support for Programming in Adaptive Parallel Environments 1 Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim Humphries, and Joel Saltz UMIACS and Dept. of Computer Science University

More information

Cache Injection on Bus Based Multiprocessors

Cache Injection on Bus Based Multiprocessors Cache Injection on Bus Based Multiprocessors Aleksandar Milenkovic, Veljko Milutinovic School of Electrical Engineering, University of Belgrade E-mail: {emilenka,vm@etf.bg.ac.yu, Http: {galeb.etf.bg.ac.yu/~vm,

More information

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs Evaluation of Communication Mechanisms in Invalidate-based Shared Memory Multiprocessors Gregory T. Byrd and Michael J. Flynn Computer Systems Laboratory Stanford University, Stanford, CA Abstract. Producer-initiated

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

A Feasibility Study for Methods of Effective Memoization Optimization

A Feasibility Study for Methods of Effective Memoization Optimization A Feasibility Study for Methods of Effective Memoization Optimization Daniel Mock October 2018 Abstract Traditionally, memoization is a compiler optimization that is applied to regions of code with few

More information

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions) EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems COP 4610: Introduction to Operating Systems (Spring 2015) Chapter 13: I/O Systems Zhi Wang Florida State University Content I/O hardware Application I/O interface Kernel I/O subsystem I/O performance Objectives

More information

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration MULTIPROCESSORS Characteristics of Multiprocessors Interconnection Structures Interprocessor Arbitration Interprocessor Communication and Synchronization Cache Coherence 2 Characteristics of Multiprocessors

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Hazard Pointers. Number of threads unbounded time to check hazard pointers also unbounded! difficult dynamic bookkeeping! thread B - hp1 - hp2

Hazard Pointers. Number of threads unbounded time to check hazard pointers also unbounded! difficult dynamic bookkeeping! thread B - hp1 - hp2 Hazard Pointers Store pointers of memory references about to be accessed by a thread Memory allocation checks all hazard pointers to avoid the ABA problem thread A - hp1 - hp2 thread B - hp1 - hp2 thread

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Effects of Parallelism Degree on Run-Time Parallelization of Loops

Effects of Parallelism Degree on Run-Time Parallelization of Loops Effects of Parallelism Degree on Run-Time Parallelization of Loops Chengzhong Xu Department of Electrical and Computer Engineering Wayne State University, Detroit, MI 4822 http://www.pdcl.eng.wayne.edu/czxu

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Speculative Synchronization

Speculative Synchronization Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization

More information

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

Frequently asked questions from the previous class survey

Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [THREADS] Shrideep Pallickara Computer Science Colorado State University L7.1 Frequently asked questions from the previous class survey When a process is waiting, does it get

More information

Parallelization Principles. Sathish Vadhiyar

Parallelization Principles. Sathish Vadhiyar Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs

More information

Part IV. Chapter 15 - Introduction to MIMD Architectures

Part IV. Chapter 15 - Introduction to MIMD Architectures D. Sima, T. J. Fountain, P. Kacsuk dvanced Computer rchitectures Part IV. Chapter 15 - Introduction to MIMD rchitectures Thread and process-level parallel architectures are typically realised by MIMD (Multiple

More information

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:

More information

CS Operating Systems

CS Operating Systems CS 4500 - Operating Systems Module 9: Memory Management - Part 1 Stanley Wileman Department of Computer Science University of Nebraska at Omaha Omaha, NE 68182-0500, USA June 9, 2017 In This Module...

More information

CS Operating Systems

CS Operating Systems CS 4500 - Operating Systems Module 9: Memory Management - Part 1 Stanley Wileman Department of Computer Science University of Nebraska at Omaha Omaha, NE 68182-0500, USA June 9, 2017 In This Module...

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Unit 2 : Computer and Operating System Structure

Unit 2 : Computer and Operating System Structure Unit 2 : Computer and Operating System Structure Lesson 1 : Interrupts and I/O Structure 1.1. Learning Objectives On completion of this lesson you will know : what interrupt is the causes of occurring

More information

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination 1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu

More information

MDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science

MDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science MDP Routing in ATM Networks Using the Virtual Path Concept 1 Ren-Hung Hwang, James F. Kurose, and Don Towsley Department of Computer Science Department of Computer Science & Information Engineering University

More information

Java Virtual Machine

Java Virtual Machine Evaluation of Java Thread Performance on Two Dierent Multithreaded Kernels Yan Gu B. S. Lee Wentong Cai School of Applied Science Nanyang Technological University Singapore 639798 guyan@cais.ntu.edu.sg,

More information

PE PE PE. Network Interface. Processor Pipeline + Register Windows. Thread Management & Scheduling. N e t w o r k P o r t s.

PE PE PE. Network Interface. Processor Pipeline + Register Windows. Thread Management & Scheduling. N e t w o r k P o r t s. Latency Tolerance: A Metric for Performance Analysis of Multithreaded Architectures Shashank S. Nemawarkar, Guang R. Gao School of Computer Science McGill University 348 University Street, Montreal, H3A

More information

Sistemi in Tempo Reale

Sistemi in Tempo Reale Laurea Specialistica in Ingegneria dell'automazione Sistemi in Tempo Reale Giuseppe Lipari Introduzione alla concorrenza Fundamentals Algorithm: It is the logical procedure to solve a certain problem It

More information

Parallel Computer Architecture and Programming Written Assignment 3

Parallel Computer Architecture and Programming Written Assignment 3 Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Intel Hyper-Threading technology

Intel Hyper-Threading technology Intel Hyper-Threading technology technology brief Abstract... 2 Introduction... 2 Hyper-Threading... 2 Need for the technology... 2 What is Hyper-Threading?... 3 Inside the technology... 3 Compatibility...

More information

Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems

Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems The following paper was originally published in the Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation Seattle, Washington, October 1996 Performance Evaluation of Two

More information

160 M. Nadjarbashi, S.M. Fakhraie and A. Kaviani Figure 2. LUTB structure. each block-level track can be arbitrarily connected to each of 16 4-LUT inp

160 M. Nadjarbashi, S.M. Fakhraie and A. Kaviani Figure 2. LUTB structure. each block-level track can be arbitrarily connected to each of 16 4-LUT inp Scientia Iranica, Vol. 11, No. 3, pp 159{164 c Sharif University of Technology, July 2004 On Routing Architecture for Hybrid FPGA M. Nadjarbashi, S.M. Fakhraie 1 and A. Kaviani 2 In this paper, the routing

More information

CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review

CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, 2003 Review 1 Overview 1.1 The definition, objectives and evolution of operating system An operating system exploits and manages

More information

Reducing The De-linearization of Data Placement to Improve Deduplication Performance

Reducing The De-linearization of Data Placement to Improve Deduplication Performance Reducing The De-linearization of Data Placement to Improve Deduplication Performance Yujuan Tan 1, Zhichao Yan 2, Dan Feng 2, E. H.-M. Sha 1,3 1 School of Computer Science & Technology, Chongqing University

More information