Real-Time Systems, 13(1):47-65, July 1997.


(c) Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Threaded Prefetching: A New Instruction Memory Hierarchy for Real-Time Systems*

MINSUK LEE (mslee@ice.hansung.ac.kr)
Dept. of Computer Engineering, Hansung University, 389 Samsun-dong 2 ga, Sungbook-gu, Seoul, Korea

SANG LYUL MIN, HEONSHIK SHIN, CHONG SANG KIM (symin@dandelion.snu.ac.kr)
Dept. of Computer Engineering, Seoul National University, San 56-1 Shinlim-dong, Kwanak-gu, Seoul, Korea

CHANG YUN PARK (cypark@dandelion.snu.ac.kr)
Dept. of Computer Engineering, Chung-Ang University, 221 Heuksok-dong, Dongjak-gu, Seoul, Korea

Abstract. Cache memories have been extensively used to bridge the speed gap between high-speed processors and relatively slow main memory. However, they are not widely used in real-time systems because of their unpredictable performance. This paper proposes an instruction prefetching scheme called threaded prefetching as an alternative to instruction caching in real-time systems. In the proposed threaded prefetching, an instruction block pointer called a thread is assigned to each instruction memory block and is made to point to the next block on the worst case execution path, which is determined by a compile-time analysis. Moreover, the thread is not updated during program execution, which guarantees predictability. This paper also compares the worst case performance of various previous instruction prefetching schemes with that of the proposed threaded prefetching. By analyzing several benchmark programs, we show that the worst case performance of the proposed scheme is significantly better than those of previous instruction prefetching schemes. The results also show that when the block size is large enough, the worst case performance of the proposed threaded prefetching scheme is almost as good as that of an instruction cache with a 100% hit ratio.

Keywords: Real-Time System, Instruction Prefetching, Worst Case Execution Time, Predictability, Timing Schema

1. Introduction

Advances in VLSI technology have drastically improved processor speed. In the case of the DRAMs used as main memory, the advances have been applied to improving density rather than speed. The resulting speed gap between high-speed processors and relatively slow main memory has been bridged by high-speed buffers such as cache memories. Caches, however, have not been widely used in hard real-time systems, where guaranteed worst case performance is much more important than average case performance. In order to locate the worst case execution path of a program in a cache-based system, we need to know the cache

*This work was supported in part by ADD under contract ADD

hit or miss of each instruction reference in the program. Unfortunately, the cache hit/miss of an individual instruction reference is known only after the worst case execution path has been found. This cyclic dependency, in many cases, yields a pessimistic estimation of the WCET (worst case execution time). Moreover, the burst of cache misses that usually occurs after a context switch further complicates the analysis of a program's worst case behavior.

In this paper, we propose to use instruction prefetching, in particular threaded prefetching, as an alternative to instruction caching in real-time systems. Instruction prefetching, as its name implies, fetches in advance the instruction memory blocks (see Note 1) that are likely to be requested by the CPU. This prefetching can be performed in parallel with instruction execution, thus hiding instruction memory latency behind the execution time of the block currently being executed. In the proposed threaded prefetching scheme, an instruction block pointer called a thread is assigned to each instruction memory block. This thread points to the instruction block that is to be prefetched while the CPU is executing the block. The thread is generated in such a way that prefetching is always made towards the worst case execution path, thus improving the WCET of the task. Further, the thread is not updated during program execution, which guarantees predictability.

This paper is organized as follows. In Section 2, we survey related work. Section 3 describes the proposed threaded prefetching scheme in detail, along with two extensions of the scheme. Section 4 compares the worst case performance of various instruction prefetching schemes, including the proposed threaded prefetching scheme. Finally, we give our concluding remarks in Section 5.

2. Related works

2.1. Cache memory

Caches are small buffer memories used to speed up memory access. They hold the parts of main memory that are expected to be accessed by the CPU in the near future [21]. The use of caches has been an effective means of bridging the speed gap between high speed processors and relatively slow main memory. Although caches are very effective at boosting the average case performance, they are of little help in improving the "guaranteed" worst case performance because that performance is unpredictable. The unpredictability results mainly from two sources: intra-task interference and inter-task interference.

Intra-task interference in caches occurs when a memory block of a task competes with another memory block of the same task for the same cache block. This intra-task interference results in two types of cache misses: capacity misses and conflict misses [9]. Capacity misses are due to the finite cache size. Conflict misses, on the other hand, are caused by limited set associativity. These types of cache misses cannot be avoided if the cache has a limited size and/or set associativity

and make it difficult to accurately predict the WCET of a task because of the cyclic dependency explained earlier. Recently, substantial progress has been made in this area, and interested readers are referred to [2], [16], [17], [19].

Inter-task interference is caused by task preemption. When a task is preempted, most of its cache blocks are displaced by the newly scheduled task and the tasks scheduled thereafter. When the preempted task resumes execution, it requests the previously displaced blocks and experiences a burst of cache misses. This type of cache miss results in a wide variation in task execution times. The unpredictability of caches due to such misses can be eliminated by partitioning the cache and dedicating one or more partitions to each real-time task [14], [23]. Although this cache partitioning approach eliminates the unpredictability caused by task preemption, it has the disadvantage of limiting the total caching capacity available to a task.

2.2. Instruction prefetching

Prefetching techniques improve system performance by fetching memory blocks before they are actually needed [22]. Both instruction prefetching and data prefetching [3], [4], [15] have been studied in the literature, although we restrict ourselves to instruction prefetching in this paper. In the past, instruction prefetching has been limited to sequential prefetching. Smith, in [21], studies the following three sequential prefetching schemes and analyzes their average case performance. The always prefetch scheme prefetches the physically sequential block at every memory access. The prefetch on miss scheme, an improvement over the first scheme, prefetches the sequential block only when a cache miss occurs. The tagged prefetch scheme associates a tag bit with each memory block; this tag bit allows sequential prefetches not only on a cache miss but also on a cache hit to a prefetched block. Other studies on instruction prefetching include the stream-buffer approach, in which several sequential blocks are prefetched on a cache miss to hide the ever increasing memory latency [11]. Examples of computer systems and microprocessors that make use of instruction prefetching include the IBM System/370 Models 145, 158, 168, and 195 [18]; the CDC 6600 [18]; the Manchester University MU5 [18]; the VAX-11/780 [7]; the M68020/68030 [8]; and the Intel 8086/80286/80386/80486 [10].

By always prefetching the next sequential block, sequential prefetching is predictable. It also performs reasonably well in terms of average case performance because of the spatial locality of programs. However, sequential prefetching is inefficient in terms of worst case performance, since the sequential execution path is not always the worst case execution path.
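The three sequential policies differ only in when a prefetch of the next physical block is triggered. The following C fragment is a minimal sketch of that decision logic under an intentionally simplified cache model; the type and field names are illustrative and do not come from the paper.

    /* Illustrative trigger logic for the three sequential prefetching
     * policies surveyed above.  Each cached block carries a tag bit that
     * is set while the block is an as-yet-unreferenced prefetch. */
    enum policy { ALWAYS_PREFETCH, PREFETCH_ON_MISS, TAGGED_PREFETCH };

    struct block_state {
        int present;   /* block is currently cached                        */
        int tag;       /* set if the block was prefetched and not yet used */
    };

    /* Returns nonzero if the physically sequential successor of the
     * referenced block should be prefetched on this reference. */
    static int should_prefetch_next(enum policy p, struct block_state *b)
    {
        switch (p) {
        case ALWAYS_PREFETCH:
            return 1;                        /* prefetch on every reference */
        case PREFETCH_ON_MISS:
            return !b->present;              /* prefetch only on a miss     */
        case TAGGED_PREFETCH:
            if (!b->present || b->tag) {     /* miss, or first hit to a     */
                b->tag = 0;                  /* prefetched block            */
                return 1;
            }
            return 0;
        }
        return 0;
    }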

2.3. Timing schema

The timing schema is a set of formulas for reasoning about the timing behavior of various language constructs [20]. In the timing schema approach, the worst case timing behavior of a program construct is abstracted into a WCET bound. For example, the formula for the WCET bound of an if statement S: if (exp) then S1 else S2 is

    T(S) = max(T(exp) + T(S1), T(exp) + T(S2))

where T(exp), T(S1), and T(S2) are the WCET bounds of exp, S1, and S2, respectively. The timing schema approach is simple and allows for an efficient hierarchical timing analysis of programs written in a high-level language. One problem with this approach, however, is that in its purest form it lacks provisions for the case where instruction execution time varies depending on the surrounding instructions. Such a case occurs, for example, when the processor has pipelined functional units: due to conflicts in the use of pipeline stages and data dependencies among instructions, the execution time of an instruction is affected by the surrounding instructions.

To rectify the problem associated with variable instruction execution times, Lim et al. extend the original timing schema in [16]. In the extended timing schema, the WCET bound of the original timing schema is replaced with what is called the Worst Case Timing Abstract (WCTA) [16]. Since a program construct may have more than one execution path, as in the case of an if statement, and the WCETs of these execution paths differ significantly depending on the surrounding program constructs, the worst case execution path of a program construct cannot always be determined by analyzing the construct independently of its surroundings. For this reason, the WCTA of a program construct contains timing information for every execution path in the construct that might be its worst case execution path. Each element of a WCTA is the timing information of one execution path in the corresponding program construct and is called the Path Abstraction (PA) of that execution path. The PA of an execution path encodes the factors that affect the WCET of the path. The encoding is done in a way that allows the path's WCET to be refined when detailed information about the surrounding execution paths becomes available. For example, in the case of pipelined execution analysis, the PA describes the use of pipeline stages in the associated execution path; this information allows the path's execution time to be refined once the pipeline usage information of the surrounding execution paths becomes available.

This extended timing information structure leads to timing formulas that differ from those of the original timing schema in that concatenation (⊕) and prune operations on PAs are newly defined to replace the + and max operations on WCET bounds in the original timing schema. The ⊕ operation between two PAs models the execution of one path followed by that of another path and yields the PA of the combined path. During this operation, the execution times of

both paths are revised using each other's timing information encoded in their PAs, and the revised execution times are reflected in the PA of the combined path. The prune operation, the counterpart of the max operation of the original timing schema, is performed on the set of PAs of a program construct and removes the PAs whose associated execution paths cannot be the worst case execution path of the construct. In other words, a PA of a program construct can be pruned if its WCET is always smaller than the WCET of another PA of the same program construct, regardless of what the surrounding program constructs are.

Table 1. Comparison of the extended timing schema with the original timing schema

  Construct                      Original Timing Schema                        Extended Timing Schema
  S: S1; S2                      T(S) = T(S1) + T(S2)                          W(S) = W(S1) ⊕ W(S2)
  S: if (exp) then S1 else S2    T(S) = max(T(exp) + T(S1), T(exp) + T(S2))    W(S) = (W(exp) ⊕ W(S1)) ∪ (W(exp) ⊕ W(S2))
  S: while (exp) S1              T(S) = N (T(exp) + T(S1)) + T(exp)            W(S) = (⊕ for i = 1..N of (W(exp) ⊕ W(S1))) ⊕ W(exp)
  S: f(exp1, ..., expn)          T(S) = T(exp1) + ... + T(expn) + T(f())       W(S) = W(exp1) ⊕ ... ⊕ W(expn) ⊕ W(f())

  Here T(S), T(S1), and T(S2) are the WCET bounds of S, S1, and S2, respectively, and W(S), W(S1), and W(S2) are the
  WCTAs of S, S1, and S2, respectively. The concatenation of two WCTAs is defined as W1 ⊕ W2 = { w1 ⊕ w2 | w1 ∈ W1, w2 ∈ W2 }.

Table 1 contrasts the timing formulas of the extended timing schema with those of the original timing schema. The timing formula for S: S1; S2 first enumerates all the possible execution paths within S. The prune operation applied after the enumeration (although it is not shown in the formula) removes the resulting execution paths that cannot be the worst case execution path of the sequential statement. Similarly, the timing formula for an if statement enumerates all the execution paths in the then path together with those in the else path; as in the case of the sequential statement, the execution paths that cannot be the worst case execution path of the if statement are pruned. The timing formula for a loop statement with loop bound N models the unrolling of the loop N times. This approach is exact but computationally intractable for a large N. In [16], Lim et al. give an efficient approximate loop timing analysis method using the maximum cycle mean algorithm due to Karp [13]. This approximate analysis has O(|P|^3) time complexity, where P is the set of execution paths in the loop body that might be its worst case execution path (i.e., those in W(exp) ⊕ W(S1)). Function calls are processed like sequential statements. In the extended timing schema approach, functions are processed in reverse topological order of the call graph (see Note 2), since the WCTA of a function should be calculated before the functions that call it are processed.
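To make the original-schema column of Table 1 concrete, the sketch below computes WCET bounds bottom-up over a small program-construct tree. It is only an illustration of the recursion: the node representation is invented here, and the extended schema of [16] would replace the scalar bound with a WCTA (a set of PAs) and the + and max operations with ⊕ and prune.

    /* Minimal sketch of the original timing schema (left column of Table 1).
     * The tree representation and field names are illustrative only. */
    typedef enum { LEAF, SEQ, IF, LOOP } kind_t;

    typedef struct node {
        kind_t kind;
        long   t;              /* LEAF: WCET bound of a straight-line fragment */
        long   bound;          /* LOOP: maximum iteration count N              */
        struct node *exp;      /* IF/LOOP: condition expression                */
        struct node *s1, *s2;  /* SEQ: S1 and S2; IF: then/else; LOOP: body    */
    } node_t;

    static long max_l(long a, long b) { return a > b ? a : b; }

    long wcet(const node_t *n)
    {
        switch (n->kind) {
        case LEAF: return n->t;
        case SEQ:  return wcet(n->s1) + wcet(n->s2);          /* T(S1) + T(S2) */
        case IF:   return wcet(n->exp) +
                          max_l(wcet(n->s1), n->s2 ? wcet(n->s2) : 0);
        case LOOP: return n->bound * (wcet(n->exp) + wcet(n->s1))
                          + wcet(n->exp);                      /* unrolled loop */
        }
        return 0;
    }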

Figure 1. Sample C program fragment and its pseudo assembly code. (The conditional branch of the if statement occupies block i, the then clause block i+1, the else clause blocks i+2 and i+3, and the code at the join label next block i+4.)

3. Threaded Prefetching Scheme

3.1. Overview

In this section, we describe a new instruction prefetching scheme, called the threaded prefetching scheme, that always prefetches towards the worst case execution path. In this scheme, each instruction block is assigned a tag called a thread that indicates the worst case execution path. The thread is generated at compile time through a worst case timing analysis and is not changed during task execution, which guarantees predictability.

The performance of sequential prefetching and that of threaded prefetching are the same as long as the execution flow is sequential. However, whenever a statement that may change the execution flow is encountered, the two schemes behave differently and, therefore, perform differently. As an example, consider the program fragment and its assembly code given in Figure 1. Assume that the if statement is part of a real-time program and that the then clause (block i+1) and the else clause (blocks i+2 and i+3) require one and two memory blocks, respectively. Further assume that the WCETs of the then and else clauses are t_then and t_else, respectively, and that t_else > t_then (i.e., the else path is the worst case execution path).

In sequential prefetching, the block following block i (i.e., block i+1) is prefetched while the CPU is executing block i. If the branch in block i is not taken, in other words if the then clause is executed, the prefetch is successful. However, if the branch is taken, in other words if execution proceeds along the worst case execution path, the prefetch fails and the WCET of this if statement becomes t_else + t_fetch, where t_fetch is the time needed to perform a demand fetch after a prefetch failure. In threaded prefetching, the thread in each instruction block is made to point towards the worst case execution path. Thus, the thread of block i in this example will point to the block on the worst case execution path, in this case the first block of the else clause (i.e., block i+2). As a result, block i+2 will be prefetched during the execution of block i, and the WCET of this if statement becomes max(t_else, t_then + t_fetch). Note that this WCET is always less than t_else + t_fetch, the WCET of this if statement under sequential prefetching.
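To make the comparison concrete, the following worked example uses hypothetical cycle counts that do not come from the paper's benchmarks; only the relationship t_else > t_then matters.

    % Hypothetical values: t_then = 40 cycles, t_else = 70 cycles, t_fetch = 15 cycles.
    \begin{align*}
    \text{sequential prefetching:} \quad & t_{else} + t_{fetch} = 70 + 15 = 85 \text{ cycles} \\
    \text{threaded prefetching:}   \quad & \max(t_{else},\ t_{then} + t_{fetch}) = \max(70,\ 40 + 15) = 70 \text{ cycles}
    \end{align*}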

3.2. Support Mechanism

Our threaded prefetching scheme requires the following hardware and software support mechanisms.

3.2.1. Hardware support

Figure 2 shows the hardware organization of a conventional processor augmented with threaded prefetching. The additional hardware components required for threaded prefetching are two instruction buffers (the current buffer and the prefetch buffer), the prefetch control unit (PCU), and the prefetch request queue. So that each instruction block can store its own thread, a separate memory module called the thread memory is used. The thread memory is read by the prefetch control unit simultaneously with the corresponding instruction memory block. We assume that the instruction memory bus width is equal to the instruction memory block size plus the thread size.

Figure 2. Processor architecture augmented by threaded prefetching. (The CPU is served by a current buffer and a prefetch buffer; the prefetch control unit and the prefetch request queue connect these buffers to the instruction memory bus, which carries blocks from the instruction memory and threads from the thread memory; the data memory hierarchy is unchanged.)

The current buffer contains the instruction block that the CPU is currently executing, and the prefetch buffer holds the block being prefetched. Both buffers have the following structure:

  | V | Tag | VB | I-Block | Thread |
    V: tag valid    VB: block valid

The V bit indicates whether the buffer contents are valid. The Tag is the block address of the I-block (instruction block) in the buffer. The VB bit indicates the validity of the I-block and its thread.

The threaded prefetching hardware operates as follows. For each instruction reference, a check is made to see whether the referenced instruction is in the current buffer. If it is in the current buffer, the reference is serviced immediately. Otherwise, the referenced block will be either in the prefetch buffer or in main memory.

1. Hit in the prefetch buffer: The PCU checks whether the VB bit is set.

   (a) VB = 1: The PCU services the CPU request with the I-block stored in the prefetch buffer. A prefetch request for the instruction block pointed to by the thread in the prefetch buffer is placed into the prefetch request queue and is later issued to main memory. Afterwards, the prefetch buffer becomes the current buffer and the current buffer becomes the prefetch buffer.

   (b) VB = 0: Wait until VB becomes 1 and then proceed as in (a).

2. Miss in the prefetch buffer: The PCU aborts all the outstanding prefetch requests stored in the prefetch request queue and then issues a demand fetch request for the missed block. After the requested instruction block and the corresponding thread are read from main memory and stored in the prefetch buffer, the PCU services the CPU request. As before, a prefetch request for the instruction block indicated by the thread is placed into the prefetch request queue, and the prefetch buffer becomes the current buffer while the current buffer becomes the prefetch buffer.

When a context switch occurs, the first instruction reference after the context switch will always miss in the prefetch buffer. However, the predictability of the subsequent prefetches remains intact. In short, the predictability of the proposed threaded prefetching scheme is not affected by inter-task interference.
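The following C fragment is a behavioral sketch of the buffer layout and the reference-handling rules above. It is not the paper's hardware description: the type names, the 32-byte block size, and the memory-access helpers are assumptions made purely for illustration, and bus-level details are abstracted away.

    #include <stdint.h>
    #include <stdbool.h>

    /* One instruction buffer: V / Tag / VB / I-Block / Thread (see above). */
    typedef struct {
        bool     v;           /* buffer holds a valid entry                  */
        uint32_t tag;         /* block address of the I-block in the buffer  */
        bool     vb;          /* I-block and thread have arrived from memory */
        uint32_t iblock[8];   /* instruction block (8 words = 32 bytes here) */
        uint32_t thread;      /* block address to prefetch next              */
    } buffer_t;

    static buffer_t cur, pre;               /* current and prefetch buffers  */

    /* Hypothetical helpers standing in for the queue, bus, and memory. */
    extern void enqueue_prefetch(uint32_t block_addr);
    extern void abort_outstanding_prefetches(void);
    extern void demand_fetch_into(uint32_t block_addr, buffer_t *dst); /* blocking */
    extern void wait_until_block_valid(buffer_t *b);                   /* VB == 1  */

    static void swap_buffers(void) { buffer_t t = cur; cur = pre; pre = t; }

    /* Invoked when a reference misses in the current buffer. */
    void handle_current_buffer_miss(uint32_t block_addr)
    {
        if (pre.v && pre.tag == block_addr) {    /* 1. hit in the prefetch buffer  */
            wait_until_block_valid(&pre);        /*    case (b): stall until VB=1  */
            enqueue_prefetch(pre.thread);        /*    follow the thread           */
        } else {                                 /* 2. miss in the prefetch buffer */
            abort_outstanding_prefetches();
            demand_fetch_into(block_addr, &pre); /*    fills I-block and thread    */
            enqueue_prefetch(pre.thread);
        }
        swap_buffers();   /* prefetch buffer becomes current, and vice versa */
        /* the CPU request is now serviced from the (new) current buffer */
    }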

Figure 3. Examples of control structures. (S1;S2, if-then, if-then-else, do-until, and a function call, with sequential flows, unconditional branches, and conditional branches marked.)

3.3. Thread generation

As previously mentioned, the PCU always prefetches the block pointed to by the thread of the currently executing instruction block. Since the worst case performance is much more important than the average case performance in real-time systems, our approach to thread generation is geared towards minimizing the WCET. Figure 3 shows several well-known control structures. These control structures consist of sequential flows, unconditional branches, and conditional branches. Threads of instruction blocks that do not contain any branch are made to point to the next sequential block. Blocks ending in an unconditional branch have their threads point to the branch target. Blocks ending in a conditional branch have their threads point to whichever of the two possible paths takes more time to execute. For example, the thread of an instruction block containing the condition part of an if statement will point to the then or the else clause, whichever has the longer WCET. If the WCETs of the two clauses are the same, the thread is made to point to the branch target (the else clause in our case), and the WCET of the if statement is calculated for the case in which the prefetch fails (i.e., when the then path is taken). In the case of a loop statement, the thread is made to point towards repetition, which is obviously the worst case execution path.
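The rules above can be summarized in a small compile-time routine. The sketch below is illustrative only: the block descriptor is invented here, and in the actual tool the per-path WCETs would come from the extended timing schema analysis of Section 2.3.

    /* Compile-time sketch of the thread-generation rules above. */
    typedef enum { FALLTHROUGH, UNCOND_BRANCH, COND_BRANCH } exit_kind_t;

    typedef struct iblock {
        exit_kind_t    exit;        /* how control leaves this block           */
        struct iblock *next;        /* physically next block                   */
        struct iblock *target;      /* branch target (else clause, loop head)  */
        long           wcet_next;   /* WCET of the path through `next`         */
        long           wcet_target; /* WCET of the path through `target`       */
    } iblock_t;

    /* Returns the block whose address is stored as this block's thread. */
    iblock_t *generate_thread(const iblock_t *b)
    {
        switch (b->exit) {
        case FALLTHROUGH:   return b->next;     /* sequential flow            */
        case UNCOND_BRANCH: return b->target;   /* always follow the jump     */
        case COND_BRANCH:
            /* point towards the costlier path; on a tie, take the branch
             * target and charge the prefetch failure to the other path
             * when the WCET of the construct is computed */
            return (b->wcet_next > b->wcet_target) ? b->next : b->target;
        }
        return b->next;
    }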

3.4. Extensions

3.4.1. Lookahead prefetching

In a prefetching scheme, the PCU reads the next block required by the CPU into the prefetch buffer while the CPU is executing an instruction block, thus hiding memory access latency. However, if the execution time of the instruction block is shorter than the memory access latency, the CPU has to stall until the block is completely read, even when the prefetch is successful. This stall time can be reduced by looking ahead two blocks rather than a single block. When this lookahead technique is used in sequential prefetching, the PCU always prefetches the block physically situated two blocks after the current block. On the other hand, when this technique is applied to threaded prefetching, the thread points to the block that is two blocks ahead of the current block on the worst case execution path. In this case, two outstanding prefetch requests are possible (one issued by the current instruction block and the other by the previous instruction block). The additional outstanding prefetch request can be accommodated by adding one more prefetch buffer.

Figure 4 compares lookahead threaded prefetching with normal threaded prefetching using an example in which a loop statement is followed by a sequential statement. In the figure, it can be noted that although lookahead threaded prefetching provides more time for prefetching, it increases the number of demand fetches. Thus, the lookahead feature can be expected to be beneficial only when the performance improvement from hiding more memory latency outweighs the performance degradation due to the increased demand fetches.

Figure 4. Threaded prefetching with and without lookahead. (A loop statement followed by a sequential statement, showing which blocks are prefetched and which are demand fetched in each case.)

3.4.2. Threads embedded in the instruction block

As explained in Section 3.2.1, the threads are stored in a separate memory module called the thread memory. This extra memory module requires all the components needed to implement a memory module (e.g., a bus interface, a memory module controller, etc.). These added complexities can be completely eliminated by using a word in each instruction block to store the associated thread. This thread allocation makes the hardware simple, but it reduces the number of instructions that can be placed in an instruction block. This reduction not only

increases the number of instruction block fetches but also reduces the amount of memory latency hidden by instruction prefetching. However, as we will see in Section 4, the resulting performance degradation is not significant, which makes this allocation scheme attractive as a cost-effective alternative.

4. Evaluation of Instruction Prefetching Schemes in Real-Time Systems

In order to evaluate the worst case performance of various instruction prefetching schemes, we chose a set of four simple programs as our benchmarks and compared their predicted worst case execution times (PWCETs) using a timing tool based on the extended timing schema. The tool consists of a compiler and a timing analyzer. The compiler is a modified version of an ANSI C compiler called lcc [6]. It accepts a C source program and generates assembly code along with program structure information. The timing analyzer uses the generated assembly code and program structure information, together with user-provided information (e.g., iteration counts of loop statements, WCETs of the library functions used in the program, etc.), to compute the WCET of the given program. The processor model assumed in the timing analyzer is the MIPS R3000 CPU with the R3010 FPA (Floating Point Accelerator) [12].

The instruction prefetching schemes compared in this section are sequential prefetching, threaded prefetching, and their variations. For comparison purposes, we also evaluated the ideal case where the instruction memory latency is zero. Note that this ideal case is the most predictable case and yields the best possible performance.

The four benchmark programs used in the experiment are Clock, FFT, I-Sort, and S-Matrix. The Clock benchmark implements a periodic timer: it periodically checks 10 linked-list timers and, if any of them has expired, calls the corresponding handler function. The FFT benchmark performs the FFT (Fast Fourier Transform) on an array of 100 double precision floating point numbers. The I-Sort benchmark sorts an array of 100 integers using an insertion sort algorithm, and the S-Matrix program multiplies two sparse matrices and puts the result into another sparse matrix.

In the experiment, we used the following three policies to place machine code into instruction memory blocks. B-placement places instructions from at most one basic block (see Note 3) into each instruction memory block. P-placement allows instructions from more than one basic block to be placed into the same instruction memory block. I-placement uses a word in each instruction memory block to store the associated thread (threaded prefetching only).
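As a rough illustration of what embedding the thread costs in capacity (assuming the 4-byte MIPS R3000 instructions and the one-word, 4-byte thread size assumed later in this section):

    % Instructions per block under P-placement versus I-placement.
    \begin{align*}
    \text{32-byte block:} \quad & 32/4 = 8 \text{ instructions (P)}, \qquad 8 - 1 = 7 \text{ instructions (I)} \\
    \text{64-byte block:} \quad & 64/4 = 16 \text{ instructions (P)}, \qquad 16 - 1 = 15 \text{ instructions (I)}
    \end{align*}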

In the B-placement, each instruction block contains at most one branch instruction. Thus, when threaded prefetching is used, prefetching will always result in a hit if the program follows the worst case execution path. This placement, however, has the drawback of requiring a larger number of instruction block fetches than the other placement policies. The P-placement allows more than one branch instruction in a single instruction block; therefore, a single instruction block may have several branch instructions and, thus, several prefetch targets. In this case, the aforementioned timing tool generates a thread that points to the instruction block that has the greatest effect on the WCET. Unlike the B- and P-placements, which store threads in a separate memory module called the thread memory, the I-placement places the thread in each instruction block. For I-placement, we assume that the size of each thread is equal to the size of an instruction, i.e., one word (4 bytes).

Figure 5 shows the three code placement policies applied to the example of Figure 1. In this example, we assume that the basic blocks corresponding to the then and else clauses require less than one memory block and more than one but less than two memory blocks, respectively.

Figure 5. Three code placement policies. (B-placement, P-placement, and I-placement applied to the example of Figure 1; the figure marks the then clause, the else clause, the embedded THREAD words, and unused areas within blocks.)

Table 2 shows the number of demand fetches performed by each prefetching scheme for the four benchmark programs when the programs follow their worst case execution paths. The upper half of the table is for a block size (BS) of 32 bytes and the lower half for a block size of 64 bytes. The abbreviations used in the table have the following meanings:

  Abbr.   Meaning
  N       No prefetching
  S       Sequential prefetching
  T       Threaded prefetching
  L       Prefetching with two-block lookahead
  B       B-placement
  P       P-placement
  I       I-placement

Table 2. Number of demand fetches in the prefetching schemes. (Rows: Clock, FFT, I-Sort, S-Matrix; columns: NB, SB, TB, NP, SP, TP, SLP, TLP, TI, TLI; one table for BS = 32 bytes and one for BS = 64 bytes.)

For example, NB means no prefetching with B-placement, while TLI means threaded prefetching with two-block lookahead and I-placement. In the table, the difference between column N (NB and NP) and the remaining columns is the number of prefetch hits obtained by the prefetching schemes for the given code placement policy. The table shows that the number of demand fetches in any prefetching scheme is much smaller than in the no-prefetching case for all four benchmarks. It can also be seen that threaded prefetching and its variations require far fewer demand fetches than their sequential prefetching counterparts. In one extreme case (I-Sort with a block size of 64 bytes), TP requires only one demand fetch while SP requires 9901 demand fetches. The I-Sort benchmark consists of a doubly nested loop that fits into two 64-byte instruction blocks, and the inner loop extends from the middle of the first block into the second block. Thus, when sequential prefetching is used, a prefetch miss occurs at each iteration of the inner loop (because at the end of the loop, the PCU always prefetches the block after the second block). Threaded prefetching, on the other hand, correctly prefetches the first block, which is indicated by the thread. This results in the large difference in the number of demand fetches between threaded prefetching and sequential prefetching.

However, the results shown in Table 2 should be interpreted with caution, since not only prefetch misses (which result in demand fetches) but also prefetch hits in which the memory latency is only partially hidden may affect the worst case performance. Thus, it is fairer to base the comparison of the schemes on their PWCETs rather than on their numbers of demand fetches. Figure 6 shows the PWCETs of the prefetching schemes normalized to the PWCET of the ideal case. In calculating the PWCETs of the prefetching schemes, both the instruction and the data memory latency are assumed to be 15 cycles.

Figure 6. Comparison of prefetching schemes. (Normalized execution times of NB, SB, TB, NP, SP, and TP for the Clock, FFT, I-Sort, and S-Matrix benchmarks.)

In the figure, threaded prefetching always shows better performance than sequential prefetching with the same code placement policy. The reason for this superior worst case performance is the ability of threaded prefetching to prefetch along the worst case execution path. The figure also shows that the worst case performance of the prefetching schemes improves when the block size increases from 32 bytes to 64 bytes. This is because, as the block size increases, the execution time of a block also increases, and, therefore, more prefetching time can

be hidden by the execution time of the block. When the block size is 64 bytes, the PWCETs of most of the benchmark programs with threaded prefetching approach the PWCET of the ideal case. In other words, the worst case performance of threaded prefetching approaches that of an instruction cache with a 100% hit ratio. For the Clock and S-Matrix benchmarks, however, there is a noticeable difference between the PWCET of threaded prefetching and that of the ideal case. This can be explained by the large number of relatively small basic blocks found in these benchmarks. These small basic blocks do not provide enough prefetching time to hide the memory latency, resulting in frequent stalls.

Figure 7 compares the performance gain of lookahead threaded prefetching with that of normal threaded prefetching as the instruction memory latency increases. The performance gain of a prefetching scheme is the performance increase provided by the scheme and is calculated by the following equation:

    Performance Gain = (PWCET(no prefetching) - PWCET(prefetching scheme)) / PWCET(prefetching scheme)

Figure 7. Effects of lookahead in threaded prefetching. (Performance gain versus instruction memory latency for TP and TLP on the Clock, FFT, I-Sort, and S-Matrix benchmarks, for block sizes of 32 and 64 bytes.)
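For reference, a direct transcription of the performance gain formula above (the function name is illustrative; the figures plot the result as a percentage):

    /* Performance gain of a prefetching scheme relative to no prefetching,
     * expressed as a percentage as in Figures 7 and 8. */
    double performance_gain_pct(double pwcet_no_prefetch, double pwcet_scheme)
    {
        return 100.0 * (pwcet_no_prefetch - pwcet_scheme) / pwcet_scheme;
    }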


As previously mentioned, lookahead prefetching has the advantage of hiding more memory latency. It has, however, the disadvantage of increasing the number of demand fetches. If the memory latency is very low, it can easily be hidden even without lookahead, which lessens the effectiveness of lookahead prefetching. On the other hand, if the memory latency is too high, the performance gain of lookahead prefetching is overshadowed by the performance degradation due to the additional demand fetches. Therefore, lookahead prefetching can be expected to show a higher performance gain than normal prefetching over the range of latencies that are too high for normal prefetching to hide completely but still low enough that the penalty of the increased demand fetches does not dominate. When the block size is 32 bytes, all the benchmarks except I-Sort show this trend. In the I-Sort benchmark, the innermost loop fits into two 32-byte instruction blocks, so the CPU can execute the loop from the blocks held in the two instruction buffers without any instruction memory access. Thus, the performance gain increases linearly regardless of whether the lookahead technique is used. When the block size is 64 bytes, there is enough time for prefetching even without lookahead, so the performance gains of threaded prefetching with and without lookahead are almost the same when the memory latency is below about 30 cycles. When the memory latency exceeds 30 cycles, however, lookahead prefetching shows a better performance gain.

One interesting point to notice is the linear increase in the performance gain of the Clock benchmark when lookahead is used. A careful inspection of the generated assembly code shows that this linear increase is due to the additional prefetch buffer used in lookahead prefetching. The innermost loop of the Clock benchmark fits into three 64-byte instruction blocks. With the three instruction buffers provided in lookahead threaded prefetching (one current buffer plus two prefetch buffers), the CPU can execute the innermost loop without any instruction memory access. Thus, the performance gain increases linearly as the instruction memory latency increases.

Figure 8 compares the performance gain of TI (threaded prefetching with I-placement) with that of TP (threaded prefetching with P-placement). As previously mentioned, both the number of instructions in each instruction block and the execution time of an instruction block decrease in the TI scheme. This, in turn, reduces the memory latency hidden by prefetching. Thus, the TI scheme can be expected to perform worse than the TP scheme. However, as we can see in Figure 8, the resulting performance degradation is not significant. In two cases (Clock and FFT), TI even outperforms TP. A careful analysis of the results reveals that in these two benchmarks the performance improvement due to the reduced number of prefetch targets per instruction block in the TI scheme outweighs the aforementioned performance degradation. Considering its reduced hardware complexity and competitive performance, the TI scheme appears to be the most preferable among the prefetching schemes discussed in this paper.

Figure 8. Effects of threads embedded in instruction blocks. (Performance gain versus instruction memory latency for TP and TI on the Clock, FFT, I-Sort, and S-Matrix benchmarks, for block sizes of 32 and 64 bytes.)

5. Conclusions

In this paper, we propose an instruction prefetching scheme called threaded prefetching as an alternative to instruction caching in real-time systems. In the proposed scheme, prefetching is dictated by a tag called a thread that always points towards the worst case execution path. The proposed threaded prefetching scheme has the following advantages. First, by always prefetching predetermined blocks, the proposed scheme is predictable.


Moreover, this predictability is preserved across task switches, thus allowing an accurate schedulability analysis of tasks. Secondly, the scheme performs nearly as well as an infinite cache in terms of worst case performance when the block size is large enough. Finally, the scheme requires only a minimal hardware addition.

Notes

1. A block is the minimum unit of information that can be either present or not present in a memory hierarchy [7].
2. A call graph contains the information on how functions call each other [5]. For example, if f calls g, then an arc connects f's vertex to that of g in the call graph.
3. A basic block is a sequence of consecutive instructions in which flow of control enters at the beginning and leaves at the end without halt or possibility of branching except at the end [1].

References

1. A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Publishing Company, Reading, MA.
2. R. Arnold, F. Mueller, D. Whalley, and M. Harmon. Bounding worst-case instruction cache performance. In Proceedings of the 15th Real-Time Systems Symposium, pages 172-181.
3. D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40-52.
4. T.-F. Chen and J.-L. Baer. Reducing memory latency via non-blocking and prefetching caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51-61.
5. C. N. Fischer and R. J. LeBlanc. Crafting a Compiler with C. The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA.
6. C. W. Fraser and D. R. Hanson. A code generation interface for ANSI C. Technical Report CSL-TR, Dept. of Computer Science, Princeton University, July.
7. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA.
8. W. Hilf and A. Nausch. The M68000 Family, Volume 1, page 48. Prentice Hall, Englewood Cliffs, NJ.
9. M. D. Hill. Aspects of Cache Memory and Instruction Buffer Performance. PhD thesis, University of California, Berkeley, Nov.
10. Intel. Microprocessors Handbook. Intel Corporation.
11. N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 364-373.
12. G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, Englewood Cliffs, NJ.
13. R. M. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics, 23:309-311.
14. D. B. Kirk. SMART (Strategic Memory Allocation for Real-Time) cache design. In Proceedings of the 10th Real-Time Systems Symposium, pages 229-237, 1989.

15. A. C. Klaiber and H. M. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 43-63.
16. S.-S. Lim, Y. H. Bae, G. T. Jang, B.-D. Rhee, S. L. Min, C. Y. Park, H. Shin, K. Park, and C. S. Kim. An accurate worst case timing analysis technique for RISC processors. In Proceedings of the 15th Real-Time Systems Symposium, pages 97-108.
17. J.-C. Liu and H.-J. Lee. Deterministic upperbounds of the worst-case execution times of cached programs. In Proceedings of the 15th Real-Time Systems Symposium, pages 182-191.
18. B. R. Rau and G. E. Rossman. The effect of instruction fetch strategies upon the performance of pipelined instruction units. In Proceedings of the 4th Annual International Symposium on Computer Architecture, pages 80-89.
19. J. Rawat. Static Analysis of Cache Performance for Real-Time Programming. Master's thesis, Iowa State University.
20. A. C. Shaw. Reasoning about time in higher-level language software. IEEE Transactions on Software Engineering, 15(7):875-889, July 1989.
21. A. J. Smith. Cache memories. ACM Computing Surveys, 14(3):473-530.
22. A. J. Smith. Sequential program prefetching in memory hierarchies. IEEE Computer, pages 7-21, Dec.
23. A. Wolfe. Software-based cache partitioning for real-time applications. In Proceedings of the Third Workshop on Responsive Computer Systems, Sept.


More information

1 Introduction The demand on the performance of memory subsystems is rapidly increasing with the advances in microprocessor architecture. The growing

1 Introduction The demand on the performance of memory subsystems is rapidly increasing with the advances in microprocessor architecture. The growing Tango: a Hardware-based Data Prefetching Technique for Superscalar Processors 1 Shlomit S. Pinter IBM Science and Technology MATAM Advance Technology ctr. Haifa 31905, Israel E-mail: shlomit@vnet.ibm.com

More information

ECE4680 Computer Organization and Architecture. Virtual Memory

ECE4680 Computer Organization and Architecture. Virtual Memory ECE468 Computer Organization and Architecture Virtual Memory If I can see it and I can touch it, it s real. If I can t see it but I can touch it, it s invisible. If I can see it but I can t touch it, it

More information

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Soner Onder Michigan Technological University Randy Katz & David A. Patterson University of California, Berkeley Four Questions for Memory Hierarchy Designers

More information

The of these simple branch prediction strategies is about 3%, but some benchmark programs have a of. A more sophisticated implementation of static bra

The of these simple branch prediction strategies is about 3%, but some benchmark programs have a of. A more sophisticated implementation of static bra Improving Semi-static Branch Prediction by Code Replication Andreas Krall Institut fur Computersprachen Technische Universitat Wien Argentinierstrae 8 A-4 Wien andimips.complang.tuwien.ac.at Abstract Speculative

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard. COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped

More information

Automatic Counterflow Pipeline Synthesis

Automatic Counterflow Pipeline Synthesis Automatic Counterflow Pipeline Synthesis Bruce R. Childers, Jack W. Davidson Computer Science Department University of Virginia Charlottesville, Virginia 22901 {brc2m, jwd}@cs.virginia.edu Abstract The

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science CPUtime = IC CPI Execution + Memory accesses Instruction

More information

Code Placement Techniques for Cache Miss Rate Reduction

Code Placement Techniques for Cache Miss Rate Reduction Code Placement Techniques for Cache Miss Rate Reduction HIROYUKI TOMIYAMA and HIROTO YASUURA Kyushu University In the design of embedded systems with cache memories, it is important to minimize the cache

More information

CS 61C: Great Ideas in Computer Architecture Caches Part 2

CS 61C: Great Ideas in Computer Architecture Caches Part 2 CS 61C: Great Ideas in Computer Architecture Caches Part 2 Instructors: Nicholas Weaver & Vladimir Stojanovic http://insteecsberkeleyedu/~cs61c/fa15 Software Parallel Requests Assigned to computer eg,

More information

Complexity-effective Enhancements to a RISC CPU Architecture

Complexity-effective Enhancements to a RISC CPU Architecture Complexity-effective Enhancements to a RISC CPU Architecture Jeff Scott, John Arends, Bill Moyer Embedded Platform Systems, Motorola, Inc. 7700 West Parmer Lane, Building C, MD PL31, Austin, TX 78729 {Jeff.Scott,John.Arends,Bill.Moyer}@motorola.com

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Jong Wook Kwak 1, Seong Tae Jhang 2, and Chu Shik Jhon 1 1 Department of Electrical Engineering and Computer Science, Seoul

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

UNIT- 5. Chapter 12 Processor Structure and Function

UNIT- 5. Chapter 12 Processor Structure and Function UNIT- 5 Chapter 12 Processor Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data CPU With Systems Bus CPU Internal Structure Registers

More information

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx Microprogram Control Practice Problems (Con t) The following microinstructions are supported by each CW in the CS: RR ALU opx RA Rx RB Rx RB IR(adr) Rx RR Rx MDR MDR RR MDR Rx MAR IR(adr) MAR Rx PC IR(adr)

More information

Memory Hierarchy Basics

Memory Hierarchy Basics Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases

More information

Cache Controller with Enhanced Features using Verilog HDL

Cache Controller with Enhanced Features using Verilog HDL Cache Controller with Enhanced Features using Verilog HDL Prof. V. B. Baru 1, Sweety Pinjani 2 Assistant Professor, Dept. of ECE, Sinhgad College of Engineering, Vadgaon (BK), Pune, India 1 PG Student

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Memory. Principle of Locality. It is impossible to have memory that is both. We create an illusion for the programmer. Employ memory hierarchy

Memory. Principle of Locality. It is impossible to have memory that is both. We create an illusion for the programmer. Employ memory hierarchy Datorarkitektur och operativsystem Lecture 7 Memory It is impossible to have memory that is both Unlimited (large in capacity) And fast 5.1 Intr roduction We create an illusion for the programmer Before

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Performance Analysis of Embedded Software Using Implicit Path Enumeration

Performance Analysis of Embedded Software Using Implicit Path Enumeration Performance Analysis of Embedded Software Using Implicit Path Enumeration Yau-Tsun Steven Li Sharad Malik Department of Electrical Engineering, Princeton University, NJ 08544, USA. Abstract Embedded computer

More information

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Agenda Introduction Memory Hierarchy Design CPU Speed vs.

More information

ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS

ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS Christian Ferdinand and Reinhold Heckmann AbsInt Angewandte Informatik GmbH, Stuhlsatzenhausweg 69, D-66123 Saarbrucken, Germany info@absint.com

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

Dynamic Voltage Scaling of Periodic and Aperiodic Tasks in Priority-Driven Systems Λ

Dynamic Voltage Scaling of Periodic and Aperiodic Tasks in Priority-Driven Systems Λ Dynamic Voltage Scaling of Periodic and Aperiodic Tasks in Priority-Driven Systems Λ Dongkun Shin Jihong Kim School of CSE School of CSE Seoul National University Seoul National University Seoul, Korea

More information

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Topics to be covered. EEC 581 Computer Architecture. Virtual Memory. Memory Hierarchy Design (II)

Topics to be covered. EEC 581 Computer Architecture. Virtual Memory. Memory Hierarchy Design (II) EEC 581 Computer Architecture Memory Hierarchy Design (II) Department of Electrical Engineering and Computer Science Cleveland State University Topics to be covered Cache Penalty Reduction Techniques Victim

More information

The CPU Design Kit: An Instructional Prototyping Platform. for Teaching Processor Design. Anujan Varma, Lampros Kalampoukas

The CPU Design Kit: An Instructional Prototyping Platform. for Teaching Processor Design. Anujan Varma, Lampros Kalampoukas The CPU Design Kit: An Instructional Prototyping Platform for Teaching Processor Design Anujan Varma, Lampros Kalampoukas Dimitrios Stiliadis, and Quinn Jacobson Computer Engineering Department University

More information

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com

More information

Agenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!)

Agenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!) 7/4/ CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches II Instructor: Michael Greenbaum New-School Machine Structures (It s a bit more complicated!) Parallel Requests Assigned to

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

A Reconfigurable Cache Design for Embedded Dynamic Data Cache

A Reconfigurable Cache Design for Embedded Dynamic Data Cache I J C T A, 9(17) 2016, pp. 8509-8517 International Science Press A Reconfigurable Cache Design for Embedded Dynamic Data Cache Shameedha Begum, T. Vidya, Amit D. Joshi and N. Ramasubramanian ABSTRACT Applications

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum Simultaneous Way-footprint Prediction and Branch Prediction for Energy Savings in Set-associative Instruction Caches Weiyu Tang Rajesh Gupta Alexandru Nicolau Alexander V. Veidenbaum Department of Information

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722 Dynamic Branch Prediction Dynamic branch prediction schemes run-time behavior of branches to make predictions. Usually information about outcomes of previous occurrences of branches are used to predict

More information

Cycle accurate transaction-driven simulation with multiple processor simulators

Cycle accurate transaction-driven simulation with multiple processor simulators Cycle accurate transaction-driven simulation with multiple processor simulators Dohyung Kim 1a) and Rajesh Gupta 2 1 Engineering Center, Google Korea Ltd. 737 Yeoksam-dong, Gangnam-gu, Seoul 135 984, Korea

More information

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers. CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information