Real-Time Systems, 13(1):47-65, July 1997.


(c) Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Threaded Prefetching: A New Instruction Memory Hierarchy for Real-Time Systems*

MINSUK LEE (mslee@ice.hansung.ac.kr)
Dept. of Computer Engineering, Hansung University, 389 Samsun-dong 2 ga, Sungbook-gu, Seoul, Korea

SANG LYUL MIN, HEONSHIK SHIN, CHONG SANG KIM (symin@dandelion.snu.ac.kr)
Dept. of Computer Engineering, Seoul National University, San 56-1 Shinlim-dong, Kwanak-gu, Seoul, Korea

CHANG YUN PARK (cypark@dandelion.snu.ac.kr)
Dept. of Computer Engineering, Chung-Ang University, 221 Heuksok-dong, Dongjak-gu, Seoul, Korea

Abstract. Cache memories have been extensively used to bridge the speed gap between high-speed processors and relatively slow main memory. However, they are not widely used in real-time systems because of their unpredictable performance. This paper proposes an instruction prefetching scheme called threaded prefetching as an alternative to instruction caching in real-time systems. In the proposed threaded prefetching, an instruction block pointer called a thread is assigned to each instruction memory block and is made to point to the next block on the worst case execution path, which is determined by a compile-time analysis. Moreover, the thread is not updated during program execution, which guarantees predictability. This paper also compares the worst case performance of various previous instruction prefetching schemes with that of the proposed threaded prefetching. By analyzing several benchmark programs, we show that the worst case performance of the proposed scheme is significantly better than those of previous instruction prefetching schemes. The results also show that when the block size is large enough, the worst case performance of the proposed threaded prefetching scheme is almost as good as that of an instruction cache with a 100% hit ratio.

Keywords: Real-Time System, Instruction Prefetching, Worst Case Execution Time, Predictability, Timing Schema

1. Introduction

Advances in VLSI technology have drastically improved processor speed. In the case of the DRAMs used as main memory, the advances have been applied to improving density rather than speed. The resulting speed gap between high-speed processors and relatively slow main memory has been bridged by high-speed buffers such as cache memories. Caches, however, have not been widely used in hard real-time systems, where guaranteed worst case performance is much more important than average case performance. In order to locate the worst case execution path of a program in a cache-based system, we need to know the cache

*This work was supported in part by ADD under contract ADD

hit or miss of each instruction reference in the program. Unfortunately, the cache hit/miss of an individual instruction reference is known only after the worst case execution path has been found. This cyclic dependency, in many cases, yields a pessimistic estimation of the WCET (worst case execution time). Moreover, the burst of cache misses that usually occurs after a context switch further complicates the analysis of a program's worst case behavior.

In this paper, we propose to use instruction prefetching, in particular threaded prefetching, as an alternative to instruction caching in real-time systems. Instruction prefetching, as its name implies, fetches in advance the instruction memory blocks (see Note 1) that are likely to be requested by the CPU. This prefetching can be performed in parallel with instruction execution, thus hiding instruction memory latency behind the execution time of the block currently being executed. In the proposed threaded prefetching scheme, an instruction block pointer called a thread is assigned to each instruction memory block. This thread points to the instruction block that is to be prefetched while the CPU is executing the block. The thread is generated in such a way that prefetching is always made towards the worst case execution path, thus improving the WCET of the task. Further, the thread is not updated during program execution, which guarantees predictability.

This paper is organized as follows. In Section 2, we survey related work. Section 3 describes the proposed threaded prefetching scheme in detail, along with two extensions of the scheme. Section 4 compares the worst case performance of various instruction prefetching schemes, including the proposed threaded prefetching scheme. Finally, we give our concluding remarks in Section 5.

2. Related works

2.1. Cache memory

Caches are small buffer memories used to speed up memory access. They hold the parts of main memory that are expected to be accessed by the CPU in the near future [21]. The use of caches has been an effective means of bridging the speed gap between high speed processors and relatively slow main memory. Although caches are very effective at boosting the average case performance, they are of little help in improving the "guaranteed" worst case performance because that performance is unpredictable. The unpredictability results mainly from two sources: intra-task interference and inter-task interference.

Intra-task interference in caches occurs when a memory block of a task competes with another memory block of the same task for the same cache block. This intra-task interference results in two types of cache misses: capacity misses and conflict misses [9]. Capacity misses are due to the finite cache size. Conflict misses, on the other hand, are caused by limited set associativity. These types of cache misses cannot be avoided if the cache has a limited size and/or set associativity

and make it difficult to accurately predict the WCET of a task because of the cyclic dependency explained earlier. Recently, substantial progress has been made in this area, and interested readers are referred to [2], [16], [17], [19].

Inter-task interference is caused by task preemption. When a task is preempted, most of its cache blocks are displaced by the newly scheduled task and the tasks scheduled thereafter. When the preempted task resumes execution, it requests the previously displaced blocks and experiences a burst of cache misses. This type of cache miss results in a wide variation in task execution times. The unpredictability of caches due to such misses can be eliminated by partitioning the cache and dedicating one or more partitions to each real-time task [14], [23]. Although this cache partitioning approach eliminates the unpredictability caused by task preemption, it has the disadvantage of limiting the total caching capacity available to a task.

2.2. Instruction prefetching

Prefetching techniques improve system performance by fetching memory blocks before they are actually needed [22]. Both instruction prefetching and data prefetching [3], [4], [15] have been studied in the literature, although we restrict ourselves to instruction prefetching in this paper. In the past, instruction prefetching has been limited to sequential prefetching. Smith, in [21], studies the following three sequential prefetching schemes and analyzes their average case performance. The always prefetch scheme prefetches the physically sequential block at every memory access. The prefetch on miss scheme, an improvement over the first scheme, prefetches the sequential block only when a cache miss occurs. The tagged prefetch scheme associates a tag bit with each memory block; this tag bit allows sequential prefetches not only on a cache miss but also on a cache hit to a prefetched block. Other studies on instruction prefetching include the stream-buffer approach, in which several sequential blocks are prefetched on a cache miss to hide the ever increasing memory latency [11]. Examples of computer systems and microprocessors that make use of instruction prefetching include the IBM System/370 Models 145, 158, 168, and 195 [18]; the CDC 6600 [18]; the Manchester University MU5 [18]; the VAX-11/780 [7]; the M68020/68030 [8]; and the Intel 8086/80286/80386/80486 [10].

By always prefetching the next sequential block, sequential prefetching is predictable. It also performs reasonably well in terms of average case performance because of the spatial locality of programs. However, sequential prefetching is inefficient in terms of worst case performance, since the sequential execution path is not always the worst case execution path.
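The three sequential policies differ only in when a prefetch of the next physical block is triggered. The following C fragment is a minimal sketch of that decision logic under an intentionally simplified cache model; the type and field names are illustrative and do not come from the paper.

    /* Illustrative trigger logic for the three sequential prefetching
     * policies surveyed above.  Each cached block carries a tag bit that
     * is set while the block is an as-yet-unreferenced prefetch. */
    enum policy { ALWAYS_PREFETCH, PREFETCH_ON_MISS, TAGGED_PREFETCH };

    struct block_state {
        int present;   /* block is currently cached                        */
        int tag;       /* set if the block was prefetched and not yet used */
    };

    /* Returns nonzero if the physically sequential successor of the
     * referenced block should be prefetched on this reference. */
    static int should_prefetch_next(enum policy p, struct block_state *b)
    {
        switch (p) {
        case ALWAYS_PREFETCH:
            return 1;                        /* prefetch on every reference */
        case PREFETCH_ON_MISS:
            return !b->present;              /* prefetch only on a miss     */
        case TAGGED_PREFETCH:
            if (!b->present || b->tag) {     /* miss, or first hit to a     */
                b->tag = 0;                  /* prefetched block            */
                return 1;
            }
            return 0;
        }
        return 0;
    }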

2.3. Timing schema

The timing schema is a set of formulas for reasoning about the timing behavior of various language constructs [20]. In the timing schema approach, the worst case timing behavior of a program construct is abstracted into a WCET bound. For example, the formula for the WCET bound of an if statement S: if (exp) then S1 else S2 is

    T(S) = max(T(exp) + T(S1), T(exp) + T(S2))

where T(exp), T(S1), and T(S2) are the WCET bounds of exp, S1, and S2, respectively. The timing schema approach is simple and allows for an efficient hierarchical timing analysis of programs written in a high-level language. One problem with this approach, however, is that in its purest form it lacks provisions for the case where instruction execution time varies depending on the surrounding instructions. Such a case occurs, for example, when the processor has pipelined functional units: due to conflicts in the use of pipeline stages and data dependencies among instructions, the execution time of an instruction is affected by the surrounding instructions.

To rectify the problem associated with variable instruction execution times, Lim et al. extend the original timing schema in [16]. In the extended timing schema, the WCET bound of the original timing schema is replaced with what is called the Worst Case Timing Abstract (WCTA) [16]. Since a program construct may have more than one execution path, as in the case of an if statement, and the WCETs of these execution paths differ significantly depending on the surrounding program constructs, the worst case execution path of a program construct cannot always be determined by analyzing the construct independently of its surroundings. For this reason, the WCTA of a program construct contains timing information for every execution path in the construct that might be its worst case execution path. Each element of a WCTA is the timing information of one execution path in the corresponding program construct and is called the Path Abstraction (PA) of that execution path. The PA of an execution path encodes the factors that affect the WCET of the path. The encoding is done in a way that allows the path's WCET to be refined when detailed information about the surrounding execution paths becomes available. For example, in the case of pipelined execution analysis, the PA describes the use of pipeline stages in the associated execution path; this information allows the path's execution time to be refined once the pipeline usage information of the surrounding execution paths becomes available.

This extended timing information structure leads to timing formulas that differ from those of the original timing schema in that concatenation (⊕) and prune operations on PAs are newly defined to replace the + and max operations on WCET bounds in the original timing schema. The ⊕ operation between two PAs models the execution of one path followed by that of another path and yields the PA of the combined path. During this operation, the execution times of

both paths are revised using each other's timing information encoded in their PAs, and the revised execution times are reflected in the PA of the combined path. The prune operation, the counterpart of the max operation of the original timing schema, is performed on the set of PAs of a program construct and removes the PAs whose associated execution paths cannot be the worst case execution path of the construct. In other words, a PA of a program construct can be pruned if its WCET is always smaller than the WCET of another PA of the same program construct, regardless of what the surrounding program constructs are.

Table 1. Comparison of the extended timing schema with the original timing schema

  Construct                      Original Timing Schema                        Extended Timing Schema
  S: S1; S2                      T(S) = T(S1) + T(S2)                          W(S) = W(S1) ⊕ W(S2)
  S: if (exp) then S1 else S2    T(S) = max(T(exp) + T(S1), T(exp) + T(S2))    W(S) = (W(exp) ⊕ W(S1)) ∪ (W(exp) ⊕ W(S2))
  S: while (exp) S1              T(S) = N (T(exp) + T(S1)) + T(exp)            W(S) = (⊕ for i = 1..N of (W(exp) ⊕ W(S1))) ⊕ W(exp)
  S: f(exp1, ..., expn)          T(S) = T(exp1) + ... + T(expn) + T(f())       W(S) = W(exp1) ⊕ ... ⊕ W(expn) ⊕ W(f())

  Here T(S), T(S1), and T(S2) are the WCET bounds of S, S1, and S2, respectively, and W(S), W(S1), and W(S2) are the
  WCTAs of S, S1, and S2, respectively. The concatenation of two WCTAs is defined as W1 ⊕ W2 = { w1 ⊕ w2 | w1 ∈ W1, w2 ∈ W2 }.

Table 1 contrasts the timing formulas of the extended timing schema with those of the original timing schema. The timing formula for S: S1; S2 first enumerates all the possible execution paths within S. The prune operation applied after the enumeration (although it is not shown in the formula) removes the resulting execution paths that cannot be the worst case execution path of the sequential statement. Similarly, the timing formula for an if statement enumerates all the execution paths in the then path together with those in the else path; as in the case of the sequential statement, the execution paths that cannot be the worst case execution path of the if statement are pruned. The timing formula for a loop statement with loop bound N models the unrolling of the loop N times. This approach is exact but computationally intractable for a large N. In [16], Lim et al. give an efficient approximate loop timing analysis method using the maximum cycle mean algorithm due to Karp [13]. This approximate analysis has O(|P|^3) time complexity, where P is the set of execution paths in the loop body that might be its worst case execution path (i.e., those in W(exp) ⊕ W(S1)). Function calls are processed like sequential statements. In the extended timing schema approach, functions are processed in reverse topological order of the call graph (see Note 2), since the WCTA of a function should be calculated before the functions that call it are processed.
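To make the original-schema column of Table 1 concrete, the sketch below computes WCET bounds bottom-up over a small program-construct tree. It is only an illustration of the recursion: the node representation is invented here, and the extended schema of [16] would replace the scalar bound with a WCTA (a set of PAs) and the + and max operations with ⊕ and prune.

    /* Minimal sketch of the original timing schema (left column of Table 1).
     * The tree representation and field names are illustrative only. */
    typedef enum { LEAF, SEQ, IF, LOOP } kind_t;

    typedef struct node {
        kind_t kind;
        long   t;              /* LEAF: WCET bound of a straight-line fragment */
        long   bound;          /* LOOP: maximum iteration count N              */
        struct node *exp;      /* IF/LOOP: condition expression                */
        struct node *s1, *s2;  /* SEQ: S1 and S2; IF: then/else; LOOP: body    */
    } node_t;

    static long max_l(long a, long b) { return a > b ? a : b; }

    long wcet(const node_t *n)
    {
        switch (n->kind) {
        case LEAF: return n->t;
        case SEQ:  return wcet(n->s1) + wcet(n->s2);          /* T(S1) + T(S2) */
        case IF:   return wcet(n->exp) +
                          max_l(wcet(n->s1), n->s2 ? wcet(n->s2) : 0);
        case LOOP: return n->bound * (wcet(n->exp) + wcet(n->s1))
                          + wcet(n->exp);                      /* unrolled loop */
        }
        return 0;
    }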

Figure 1. Sample C program fragment and its pseudo assembly code. (The conditional branch of the if statement occupies block i, the then clause block i+1, the else clause blocks i+2 and i+3, and the code at the join label next block i+4.)

3. Threaded Prefetching Scheme

3.1. Overview

In this section, we describe a new instruction prefetching scheme, called the threaded prefetching scheme, that always prefetches towards the worst case execution path. In this scheme, each instruction block is assigned a tag called a thread that indicates the worst case execution path. The thread is generated at compile time through a worst case timing analysis and is not changed during task execution, which guarantees predictability.

The performance of sequential prefetching and that of threaded prefetching are the same as long as the execution flow is sequential. However, whenever a statement that may change the execution flow is encountered, the two schemes behave differently and, therefore, perform differently. As an example, consider the program fragment and its assembly code given in Figure 1. Assume that the if statement is part of a real-time program and that the then clause (block i+1) and the else clause (blocks i+2 and i+3) require one and two memory blocks, respectively. Further assume that the WCETs of the then and else clauses are t_then and t_else, respectively, and that t_else > t_then (i.e., the else path is the worst case execution path).

In sequential prefetching, the block following block i (i.e., block i+1) is prefetched while the CPU is executing block i. If the branch in block i is not taken, in other words if the then clause is executed, the prefetch is successful. However, if the branch is taken, in other words if execution proceeds along the worst case execution path, the prefetch fails and the WCET of this if statement becomes t_else + t_fetch, where t_fetch is the time needed to perform a demand fetch after a prefetch failure. In threaded prefetching, the thread in each instruction block is made to point towards the worst case execution path. Thus, the thread of block i in this example will point to the block on the worst case execution path, in this case the first block of the else clause (i.e., block i+2). As a result, block i+2 will be prefetched during the execution of block i, and the WCET of this if statement becomes max(t_else, t_then + t_fetch). Note that this WCET is always less than t_else + t_fetch, the WCET of this if statement under sequential prefetching.
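To make the comparison concrete, the following worked example uses hypothetical cycle counts that do not come from the paper's benchmarks; only the relationship t_else > t_then matters.

    % Hypothetical values: t_then = 40 cycles, t_else = 70 cycles, t_fetch = 15 cycles.
    \begin{align*}
    \text{sequential prefetching:} \quad & t_{else} + t_{fetch} = 70 + 15 = 85 \text{ cycles} \\
    \text{threaded prefetching:}   \quad & \max(t_{else},\ t_{then} + t_{fetch}) = \max(70,\ 40 + 15) = 70 \text{ cycles}
    \end{align*}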

3.2. Support Mechanism

Our threaded prefetching scheme requires the following hardware and software support mechanisms.

3.2.1. Hardware support

Figure 2 shows the hardware organization of a conventional processor augmented with threaded prefetching. The additional hardware components required for threaded prefetching are two instruction buffers (the current buffer and the prefetch buffer), the prefetch control unit (PCU), and the prefetch request queue. So that each instruction block can store its own thread, a separate memory module called the thread memory is used. The thread memory is read by the prefetch control unit simultaneously with the corresponding instruction memory block. We assume that the instruction memory bus width is equal to the instruction memory block size plus the thread size.

Figure 2. Processor architecture augmented by threaded prefetching. (The CPU is served by a current buffer and a prefetch buffer; the prefetch control unit and the prefetch request queue connect these buffers to the instruction memory bus, which carries blocks from the instruction memory and threads from the thread memory; the data memory hierarchy is unchanged.)

The current buffer contains the instruction block that the CPU is currently executing, and the prefetch buffer holds the block being prefetched. Both buffers have the following structure:

  | V | Tag | VB | I-Block | Thread |
    V: tag valid    VB: block valid

The V bit indicates whether the buffer contents are valid. The Tag is the block address of the I-block (instruction block) in the buffer. The VB bit indicates the validity of the I-block and its thread.

The threaded prefetching hardware operates as follows. For each instruction reference, a check is made to see whether the referenced instruction is in the current buffer. If it is in the current buffer, the reference is serviced immediately. Otherwise, the referenced block will be either in the prefetch buffer or in main memory.

1. Hit in the prefetch buffer: The PCU checks whether the VB bit is set.

   (a) VB = 1: The PCU services the CPU request with the I-block stored in the prefetch buffer. A prefetch request for the instruction block pointed to by the thread in the prefetch buffer is placed into the prefetch request queue and is later issued to main memory. Afterwards, the prefetch buffer becomes the current buffer and the current buffer becomes the prefetch buffer.

   (b) VB = 0: Wait until VB becomes 1 and then proceed as in (a).

2. Miss in the prefetch buffer: The PCU aborts all the outstanding prefetch requests stored in the prefetch request queue and then issues a demand fetch request for the missed block. After the requested instruction block and the corresponding thread are read from main memory and stored in the prefetch buffer, the PCU services the CPU request. As before, a prefetch request for the instruction block indicated by the thread is placed into the prefetch request queue, and the prefetch buffer becomes the current buffer while the current buffer becomes the prefetch buffer.

When a context switch occurs, the first instruction reference after the context switch will always miss in the prefetch buffer. However, the predictability of the subsequent prefetches remains intact. In short, the predictability of the proposed threaded prefetching scheme is not affected by inter-task interference.
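The following C fragment is a behavioral sketch of the buffer layout and the reference-handling rules above. It is not the paper's hardware description: the type names, the 32-byte block size, and the memory-access helpers are assumptions made purely for illustration, and bus-level details are abstracted away.

    #include <stdint.h>
    #include <stdbool.h>

    /* One instruction buffer: V / Tag / VB / I-Block / Thread (see above). */
    typedef struct {
        bool     v;           /* buffer holds a valid entry                  */
        uint32_t tag;         /* block address of the I-block in the buffer  */
        bool     vb;          /* I-block and thread have arrived from memory */
        uint32_t iblock[8];   /* instruction block (8 words = 32 bytes here) */
        uint32_t thread;      /* block address to prefetch next              */
    } buffer_t;

    static buffer_t cur, pre;               /* current and prefetch buffers  */

    /* Hypothetical helpers standing in for the queue, bus, and memory. */
    extern void enqueue_prefetch(uint32_t block_addr);
    extern void abort_outstanding_prefetches(void);
    extern void demand_fetch_into(uint32_t block_addr, buffer_t *dst); /* blocking */
    extern void wait_until_block_valid(buffer_t *b);                   /* VB == 1  */

    static void swap_buffers(void) { buffer_t t = cur; cur = pre; pre = t; }

    /* Invoked when a reference misses in the current buffer. */
    void handle_current_buffer_miss(uint32_t block_addr)
    {
        if (pre.v && pre.tag == block_addr) {    /* 1. hit in the prefetch buffer  */
            wait_until_block_valid(&pre);        /*    case (b): stall until VB=1  */
            enqueue_prefetch(pre.thread);        /*    follow the thread           */
        } else {                                 /* 2. miss in the prefetch buffer */
            abort_outstanding_prefetches();
            demand_fetch_into(block_addr, &pre); /*    fills I-block and thread    */
            enqueue_prefetch(pre.thread);
        }
        swap_buffers();   /* prefetch buffer becomes current, and vice versa */
        /* the CPU request is now serviced from the (new) current buffer */
    }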

Figure 3. Examples of control structures. (S1;S2, if-then, if-then-else, do-until, and a function call, with sequential flows, unconditional branches, and conditional branches marked.)

3.3. Thread generation

As previously mentioned, the PCU always prefetches the block pointed to by the thread of the currently executing instruction block. Since the worst case performance is much more important than the average case performance in real-time systems, our approach to thread generation is geared towards minimizing the WCET. Figure 3 shows several well-known control structures. These control structures consist of sequential flows, unconditional branches, and conditional branches. Threads of instruction blocks that do not contain any branch are made to point to the next sequential block. Blocks ending in an unconditional branch have their threads point to the branch target. Blocks ending in a conditional branch have their threads point to whichever of the two possible paths takes more time to execute. For example, the thread of an instruction block containing the condition part of an if statement will point to the then or the else clause, whichever has the longer WCET. If the WCETs of the two clauses are the same, the thread is made to point to the branch target (the else clause in our case), and the WCET of the if statement is calculated for the case in which the prefetch fails (i.e., when the then path is taken). In the case of a loop statement, the thread is made to point towards repetition, which is obviously the worst case execution path.
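The rules above can be summarized in a small compile-time routine. The sketch below is illustrative only: the block descriptor is invented here, and in the actual tool the per-path WCETs would come from the extended timing schema analysis of Section 2.3.

    /* Compile-time sketch of the thread-generation rules above. */
    typedef enum { FALLTHROUGH, UNCOND_BRANCH, COND_BRANCH } exit_kind_t;

    typedef struct iblock {
        exit_kind_t    exit;        /* how control leaves this block           */
        struct iblock *next;        /* physically next block                   */
        struct iblock *target;      /* branch target (else clause, loop head)  */
        long           wcet_next;   /* WCET of the path through `next`         */
        long           wcet_target; /* WCET of the path through `target`       */
    } iblock_t;

    /* Returns the block whose address is stored as this block's thread. */
    iblock_t *generate_thread(const iblock_t *b)
    {
        switch (b->exit) {
        case FALLTHROUGH:   return b->next;     /* sequential flow            */
        case UNCOND_BRANCH: return b->target;   /* always follow the jump     */
        case COND_BRANCH:
            /* point towards the costlier path; on a tie, take the branch
             * target and charge the prefetch failure to the other path
             * when the WCET of the construct is computed */
            return (b->wcet_next > b->wcet_target) ? b->next : b->target;
        }
        return b->next;
    }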

3.4. Extensions

3.4.1. Lookahead prefetching

In a prefetching scheme, the PCU reads the next block required by the CPU into the prefetch buffer while the CPU is executing an instruction block, thus hiding memory access latency. However, if the execution time of the instruction block is shorter than the memory access latency, the CPU has to stall until the block is completely read, even when the prefetch is successful. This stall time can be reduced by looking ahead two blocks rather than a single block. When this lookahead technique is used in sequential prefetching, the PCU always prefetches the block physically situated two blocks after the current block. On the other hand, when this technique is applied to threaded prefetching, the thread points to the block that is two blocks ahead of the current block on the worst case execution path. In this case, two outstanding prefetch requests are possible (one issued by the current instruction block and the other by the previous instruction block). The additional outstanding prefetch request can be accommodated by adding one more prefetch buffer.

Figure 4 compares lookahead threaded prefetching with normal threaded prefetching using an example in which a loop statement is followed by a sequential statement. In the figure, it can be noted that although lookahead threaded prefetching provides more time for prefetching, it increases the number of demand fetches. Thus, the lookahead feature can be expected to be beneficial only when the performance improvement from hiding more memory latency outweighs the performance degradation due to the increased demand fetches.

Figure 4. Threaded prefetching with and without lookahead. (A loop statement followed by a sequential statement, showing which blocks are prefetched and which are demand fetched in each case.)

3.4.2. Threads embedded in the instruction block

As explained in Section 3.2.1, the threads are stored in a separate memory module called the thread memory. This extra memory module requires all the components needed to implement a memory module (e.g., a bus interface, a memory module controller, etc.). These added complexities can be completely eliminated by using a word in each instruction block to store the associated thread. This thread allocation makes the hardware simple, but it reduces the number of instructions that can be placed in an instruction block. This reduction not only

increases the number of instruction block fetches but also reduces the amount of memory latency hidden by instruction prefetching. However, as we will see in Section 4, the resulting performance degradation is not significant, which makes this allocation scheme attractive as a cost-effective alternative.

4. Evaluation of Instruction Prefetching Schemes in Real-Time Systems

In order to evaluate the worst case performance of various instruction prefetching schemes, we chose a set of four simple programs as our benchmarks and compared their predicted worst case execution times (PWCETs) using a timing tool based on the extended timing schema. The tool consists of a compiler and a timing analyzer. The compiler is a modified version of an ANSI C compiler called lcc [6]. It accepts a C source program and generates assembly code along with program structure information. The timing analyzer uses the generated assembly code and program structure information, together with user-provided information (e.g., iteration counts of loop statements, WCETs of the library functions used in the program, etc.), to compute the WCET of the given program. The processor model assumed in the timing analyzer is the MIPS R3000 CPU with the R3010 FPA (Floating Point Accelerator) [12].

The instruction prefetching schemes compared in this section are sequential prefetching, threaded prefetching, and their variations. For comparison purposes, we also evaluated the ideal case where the instruction memory latency is zero. Note that this ideal case is the most predictable case and yields the best possible performance.

The four benchmark programs used in the experiment are Clock, FFT, I-Sort, and S-Matrix. The Clock benchmark implements a periodic timer: it periodically checks 10 linked-list timers and, if any of them has expired, calls the corresponding handler function. The FFT benchmark performs the FFT (Fast Fourier Transform) on an array of 100 double precision floating point numbers. The I-Sort benchmark sorts an array of 100 integers using an insertion sort algorithm, and the S-Matrix program multiplies two sparse matrices and puts the result into another sparse matrix.

In the experiment, we used the following three policies to place machine code into instruction memory blocks. B-placement places instructions from at most one basic block (see Note 3) into each instruction memory block. P-placement allows instructions from more than one basic block to be placed into the same instruction memory block. I-placement uses a word in each instruction memory block to store the associated thread (threaded prefetching only).
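As a rough illustration of what embedding the thread costs in capacity (assuming the 4-byte MIPS R3000 instructions and the one-word, 4-byte thread size assumed later in this section):

    % Instructions per block under P-placement versus I-placement.
    \begin{align*}
    \text{32-byte block:} \quad & 32/4 = 8 \text{ instructions (P)}, \qquad 8 - 1 = 7 \text{ instructions (I)} \\
    \text{64-byte block:} \quad & 64/4 = 16 \text{ instructions (P)}, \qquad 16 - 1 = 15 \text{ instructions (I)}
    \end{align*}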

In the B-placement, each instruction block contains at most one branch instruction. Thus, when threaded prefetching is used, prefetching will always result in a hit if the program follows the worst case execution path. This placement, however, has the drawback of requiring a larger number of instruction block fetches than the other placement policies. The P-placement allows more than one branch instruction in a single instruction block; therefore, a single instruction block may have several branch instructions and, thus, several prefetch targets. In this case, the aforementioned timing tool generates a thread that points to the instruction block that has the greatest effect on the WCET. Unlike the B- and P-placements, which store threads in a separate memory module called the thread memory, the I-placement places the thread in each instruction block. For I-placement, we assume that the size of each thread is equal to the size of an instruction, i.e., one word (4 bytes).

Figure 5 shows the three code placement policies applied to the example of Figure 1. In this example, we assume that the basic blocks corresponding to the then and else clauses require less than one memory block and more than one but less than two memory blocks, respectively.

Figure 5. Three code placement policies. (B-placement, P-placement, and I-placement applied to the example of Figure 1; the figure marks the then clause, the else clause, the embedded THREAD words, and unused areas within blocks.)

Table 2 shows the number of demand fetches performed by each prefetching scheme for the four benchmark programs when the programs follow their worst case execution paths. The upper half of the table is for a block size (BS) of 32 bytes and the lower half for a block size of 64 bytes. The abbreviations used in the table have the following meanings:

  Abbr.   Meaning
  N       No prefetching
  S       Sequential prefetching
  T       Threaded prefetching
  L       Prefetching with two-block lookahead
  B       B-placement
  P       P-placement
  I       I-placement

Table 2. Number of demand fetches in the prefetching schemes. (Rows: Clock, FFT, I-Sort, S-Matrix; columns: NB, SB, TB, NP, SP, TP, SLP, TLP, TI, TLI; one table for BS = 32 bytes and one for BS = 64 bytes.)

For example, NB means no prefetching with B-placement, while TLI means threaded prefetching with two-block lookahead and I-placement. In the table, the difference between column N (NB and NP) and the remaining columns is the number of prefetch hits obtained by the prefetching schemes for the given code placement policy. The table shows that the number of demand fetches in any prefetching scheme is much smaller than in the no-prefetching case for all four benchmarks. It can also be seen that threaded prefetching and its variations require far fewer demand fetches than their sequential prefetching counterparts. In one extreme case (I-Sort with a block size of 64 bytes), TP requires only one demand fetch while SP requires 9901 demand fetches. The I-Sort benchmark consists of a doubly nested loop that fits into two 64-byte instruction blocks, and the inner loop extends from the middle of the first block into the second block. Thus, when sequential prefetching is used, a prefetch miss occurs at each iteration of the inner loop (because at the end of the loop, the PCU always prefetches the block after the second block). Threaded prefetching, on the other hand, correctly prefetches the first block, which is indicated by the thread. This results in the large difference in the number of demand fetches between threaded prefetching and sequential prefetching.

However, the results shown in Table 2 should be interpreted with caution, since not only prefetch misses (which result in demand fetches) but also prefetch hits in which the memory latency is only partially hidden may affect the worst case performance. Thus, it is fairer to base the comparison of the schemes on their PWCETs rather than on their numbers of demand fetches. Figure 6 shows the PWCETs of the prefetching schemes normalized to the PWCET of the ideal case. In calculating the PWCETs of the prefetching schemes, both the instruction and the data memory latency are assumed to be 15 cycles.

Figure 6. Comparison of prefetching schemes. (Normalized execution times of NB, SB, TB, NP, SP, and TP for the Clock, FFT, I-Sort, and S-Matrix benchmarks.)

In the figure, threaded prefetching always shows better performance than sequential prefetching with the same code placement policy. The reason for this superior worst case performance is the ability of threaded prefetching to prefetch along the worst case execution path. The figure also shows that the worst case performance of the prefetching schemes improves when the block size increases from 32 bytes to 64 bytes. This is because, as the block size increases, the execution time of a block also increases, and, therefore, more prefetching time can

be hidden by the execution time of the block. When the block size is 64 bytes, the PWCETs of most of the benchmark programs with threaded prefetching approach the PWCET of the ideal case. In other words, the worst case performance of threaded prefetching approaches that of an instruction cache with a 100% hit ratio. For the Clock and S-Matrix benchmarks, however, there is a noticeable difference between the PWCET of threaded prefetching and that of the ideal case. This can be explained by the large number of relatively small basic blocks found in these benchmarks. These small basic blocks do not provide enough prefetching time to hide the memory latency, resulting in frequent stalls.

Figure 7 compares the performance gain of lookahead threaded prefetching with that of normal threaded prefetching as the instruction memory latency increases. The performance gain of a prefetching scheme is the performance increase provided by the scheme and is calculated by the following equation:

    Performance Gain = (PWCET(no prefetching) - PWCET(prefetching scheme)) / PWCET(prefetching scheme)

Figure 7. Effects of lookahead in threaded prefetching. (Performance gain versus instruction memory latency for TP and TLP on the Clock, FFT, I-Sort, and S-Matrix benchmarks, for block sizes of 32 and 64 bytes.)
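For reference, a direct transcription of the performance gain formula above (the function name is illustrative; the figures plot the result as a percentage):

    /* Performance gain of a prefetching scheme relative to no prefetching,
     * expressed as a percentage as in Figures 7 and 8. */
    double performance_gain_pct(double pwcet_no_prefetch, double pwcet_scheme)
    {
        return 100.0 * (pwcet_no_prefetch - pwcet_scheme) / pwcet_scheme;
    }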


As previously mentioned, lookahead prefetching has the advantage of hiding more memory latency. It has, however, the disadvantage of increasing the number of demand fetches. If the memory latency is very low, it can easily be hidden even without lookahead, which lessens the effectiveness of lookahead prefetching. On the other hand, if the memory latency is too high, the performance gain of lookahead prefetching is overshadowed by the performance degradation due to the additional demand fetches. Therefore, lookahead prefetching can be expected to show a higher performance gain than normal prefetching over the range of latencies that are too high for normal prefetching to hide completely but still low enough that the penalty of the increased demand fetches does not dominate. When the block size is 32 bytes, all the benchmarks except I-Sort show this trend. In the I-Sort benchmark, the innermost loop fits into two 32-byte instruction blocks, so the CPU can execute the loop from the blocks held in the two instruction buffers without any instruction memory access. Thus, the performance gain increases linearly regardless of whether the lookahead technique is used. When the block size is 64 bytes, there is enough time for prefetching even without lookahead, so the performance gains of threaded prefetching with and without lookahead are almost the same when the memory latency is below about 30 cycles. When the memory latency exceeds 30 cycles, however, lookahead prefetching shows a better performance gain.

One interesting point to notice is the linear increase in the performance gain of the Clock benchmark when lookahead is used. A careful inspection of the generated assembly code shows that this linear increase is due to the additional prefetch buffer used in lookahead prefetching. The innermost loop of the Clock benchmark fits into three 64-byte instruction blocks. With the three instruction buffers provided in lookahead threaded prefetching (one current buffer plus two prefetch buffers), the CPU can execute the innermost loop without any instruction memory access. Thus, the performance gain increases linearly as the instruction memory latency increases.

Figure 8 compares the performance gain of TI (threaded prefetching with I-placement) with that of TP (threaded prefetching with P-placement). As previously mentioned, both the number of instructions in each instruction block and the execution time of an instruction block decrease in the TI scheme. This, in turn, reduces the memory latency hidden by prefetching. Thus, the TI scheme can be expected to perform worse than the TP scheme. However, as we can see in Figure 8, the resulting performance degradation is not significant. In two cases (Clock and FFT), TI even outperforms TP. A careful analysis of the results reveals that in these two benchmarks the performance improvement due to the reduced number of prefetch targets per instruction block in the TI scheme outweighs the aforementioned performance degradation. Considering its reduced hardware complexity and competitive performance, the TI scheme appears to be the most preferable among the prefetching schemes discussed in this paper.

Figure 8. Effects of threads embedded in instruction blocks. (Performance gain versus instruction memory latency for TP and TI on the Clock, FFT, I-Sort, and S-Matrix benchmarks, for block sizes of 32 and 64 bytes.)

5. Conclusions

In this paper, we propose an instruction prefetching scheme called threaded prefetching as an alternative to instruction caching in real-time systems. In the proposed scheme, prefetching is dictated by a tag called a thread that always points towards the worst case execution path. The proposed threaded prefetching scheme has the following advantages. First, by always prefetching predetermined blocks, the proposed scheme is predictable.


Moreover, this predictability is preserved across task switches, thus allowing an accurate schedulability analysis of tasks. Secondly, the scheme performs nearly as well as an infinite cache in terms of worst case performance when the block size is large enough. Finally, the scheme requires only a minimal hardware addition.

Notes

1. A block is the minimum unit of information that can be either present or not present in a memory hierarchy [7].
2. A call graph contains the information on how functions call each other [5]. For example, if f calls g, then an arc connects f's vertex to that of g in the call graph.
3. A basic block is a sequence of consecutive instructions in which flow of control enters at the beginning and leaves at the end without halt or possibility of branching except at the end [1].

References

1. A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Publishing Company, Reading, MA.
2. R. Arnold, F. Mueller, D. Whalley, and M. Harmon. Bounding worst-case instruction cache performance. In Proceedings of the 15th Real-Time Systems Symposium, pages 172-181.
3. D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40-52.
4. T.-F. Chen and J.-L. Baer. Reducing memory latency via non-blocking and prefetching caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51-61.
5. C. N. Fischer and R. J. LeBlanc. Crafting a Compiler with C. The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA.
6. C. W. Fraser and D. R. Hanson. A code generation interface for ANSI C. Technical Report CSL-TR, Dept. of Computer Science, Princeton University, July.
7. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA.
8. W. Hilf and A. Nausch. The M68000 Family, Volume 1, page 48. Prentice Hall, Englewood Cliffs, NJ.
9. M. D. Hill. Aspects of Cache Memory and Instruction Buffer Performance. PhD thesis, University of California, Berkeley, Nov.
10. Intel. Microprocessors Handbook. Intel Corporation.
11. N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 364-373.
12. G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, Englewood Cliffs, NJ.
13. R. M. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics, 23:309-311.
14. D. B. Kirk. SMART (Strategic Memory Allocation for Real-Time) cache design. In Proceedings of the 10th Real-Time Systems Symposium, pages 229-237, 1989.

15. A. C. Klaiber and H. M. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 43-63.
16. S.-S. Lim, Y. H. Bae, G. T. Jang, B.-D. Rhee, S. L. Min, C. Y. Park, H. Shin, K. Park, and C. S. Kim. An accurate worst case timing analysis technique for RISC processors. In Proceedings of the 15th Real-Time Systems Symposium, pages 97-108.
17. J.-C. Liu and H.-J. Lee. Deterministic upperbounds of the worst-case execution times of cached programs. In Proceedings of the 15th Real-Time Systems Symposium, pages 182-191.
18. B. R. Rau and G. E. Rossman. The effect of instruction fetch strategies upon the performance of pipelined instruction units. In Proceedings of the 4th Annual International Symposium on Computer Architecture, pages 80-89.
19. J. Rawat. Static Analysis of Cache Performance for Real-Time Programming. Master's thesis, Iowa State University.
20. A. C. Shaw. Reasoning about time in higher-level language software. IEEE Transactions on Software Engineering, 15(7):875-889, July 1989.
21. A. J. Smith. Cache memories. ACM Computing Surveys, 14(3):473-530.
22. A. J. Smith. Sequential program prefetching in memory hierarchies. IEEE Computer, pages 7-21, Dec.
23. A. Wolfe. Software-based cache partitioning for real-time applications. In Proceedings of the Third Workshop on Responsive Computer Systems, Sept.


More information

1 Introduction The demand on the performance of memory subsystems is rapidly increasing with the advances in microprocessor architecture. The growing

1 Introduction The demand on the performance of memory subsystems is rapidly increasing with the advances in microprocessor architecture. The growing Tango: a Hardware-based Data Prefetching Technique for Superscalar Processors 1 Shlomit S. Pinter IBM Science and Technology MATAM Advance Technology ctr. Haifa 31905, Israel E-mail: shlomit@vnet.ibm.com

More information

ECE4680 Computer Organization and Architecture. Virtual Memory

ECE4680 Computer Organization and Architecture. Virtual Memory ECE468 Computer Organization and Architecture Virtual Memory If I can see it and I can touch it, it s real. If I can t see it but I can touch it, it s invisible. If I can see it but I can t touch it, it

More information

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Soner Onder Michigan Technological University Randy Katz & David A. Patterson University of California, Berkeley Four Questions for Memory Hierarchy Designers

More information

The of these simple branch prediction strategies is about 3%, but some benchmark programs have a of. A more sophisticated implementation of static bra

The of these simple branch prediction strategies is about 3%, but some benchmark programs have a of. A more sophisticated implementation of static bra Improving Semi-static Branch Prediction by Code Replication Andreas Krall Institut fur Computersprachen Technische Universitat Wien Argentinierstrae 8 A-4 Wien andimips.complang.tuwien.ac.at Abstract Speculative

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard. COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped

More information

Automatic Counterflow Pipeline Synthesis

Automatic Counterflow Pipeline Synthesis Automatic Counterflow Pipeline Synthesis Bruce R. Childers, Jack W. Davidson Computer Science Department University of Virginia Charlottesville, Virginia 22901 {brc2m, jwd}@cs.virginia.edu Abstract The

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science CPUtime = IC CPI Execution + Memory accesses Instruction

More information

Code Placement Techniques for Cache Miss Rate Reduction

Code Placement Techniques for Cache Miss Rate Reduction Code Placement Techniques for Cache Miss Rate Reduction HIROYUKI TOMIYAMA and HIROTO YASUURA Kyushu University In the design of embedded systems with cache memories, it is important to minimize the cache

More information

CS 61C: Great Ideas in Computer Architecture Caches Part 2

CS 61C: Great Ideas in Computer Architecture Caches Part 2 CS 61C: Great Ideas in Computer Architecture Caches Part 2 Instructors: Nicholas Weaver & Vladimir Stojanovic http://insteecsberkeleyedu/~cs61c/fa15 Software Parallel Requests Assigned to computer eg,

More information

Complexity-effective Enhancements to a RISC CPU Architecture

Complexity-effective Enhancements to a RISC CPU Architecture Complexity-effective Enhancements to a RISC CPU Architecture Jeff Scott, John Arends, Bill Moyer Embedded Platform Systems, Motorola, Inc. 7700 West Parmer Lane, Building C, MD PL31, Austin, TX 78729 {Jeff.Scott,John.Arends,Bill.Moyer}@motorola.com

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Jong Wook Kwak 1, Seong Tae Jhang 2, and Chu Shik Jhon 1 1 Department of Electrical Engineering and Computer Science, Seoul

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

UNIT- 5. Chapter 12 Processor Structure and Function

UNIT- 5. Chapter 12 Processor Structure and Function UNIT- 5 Chapter 12 Processor Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data CPU With Systems Bus CPU Internal Structure Registers

More information

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx Microprogram Control Practice Problems (Con t) The following microinstructions are supported by each CW in the CS: RR ALU opx RA Rx RB Rx RB IR(adr) Rx RR Rx MDR MDR RR MDR Rx MAR IR(adr) MAR Rx PC IR(adr)

More information

Memory Hierarchy Basics

Memory Hierarchy Basics Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases

More information

Cache Controller with Enhanced Features using Verilog HDL

Cache Controller with Enhanced Features using Verilog HDL Cache Controller with Enhanced Features using Verilog HDL Prof. V. B. Baru 1, Sweety Pinjani 2 Assistant Professor, Dept. of ECE, Sinhgad College of Engineering, Vadgaon (BK), Pune, India 1 PG Student

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Memory. Principle of Locality. It is impossible to have memory that is both. We create an illusion for the programmer. Employ memory hierarchy

Memory. Principle of Locality. It is impossible to have memory that is both. We create an illusion for the programmer. Employ memory hierarchy Datorarkitektur och operativsystem Lecture 7 Memory It is impossible to have memory that is both Unlimited (large in capacity) And fast 5.1 Intr roduction We create an illusion for the programmer Before

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Performance Analysis of Embedded Software Using Implicit Path Enumeration

Performance Analysis of Embedded Software Using Implicit Path Enumeration Performance Analysis of Embedded Software Using Implicit Path Enumeration Yau-Tsun Steven Li Sharad Malik Department of Electrical Engineering, Princeton University, NJ 08544, USA. Abstract Embedded computer

More information

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009

Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming. Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Analyzing Memory Access Patterns and Optimizing Through Spatial Memory Streaming Ogün HEPER CmpE 511 Computer Architecture December 24th, 2009 Agenda Introduction Memory Hierarchy Design CPU Speed vs.

More information

ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS

ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS Christian Ferdinand and Reinhold Heckmann AbsInt Angewandte Informatik GmbH, Stuhlsatzenhausweg 69, D-66123 Saarbrucken, Germany info@absint.com

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

Dynamic Voltage Scaling of Periodic and Aperiodic Tasks in Priority-Driven Systems Λ

Dynamic Voltage Scaling of Periodic and Aperiodic Tasks in Priority-Driven Systems Λ Dynamic Voltage Scaling of Periodic and Aperiodic Tasks in Priority-Driven Systems Λ Dongkun Shin Jihong Kim School of CSE School of CSE Seoul National University Seoul National University Seoul, Korea

More information

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Topics to be covered. EEC 581 Computer Architecture. Virtual Memory. Memory Hierarchy Design (II)

Topics to be covered. EEC 581 Computer Architecture. Virtual Memory. Memory Hierarchy Design (II) EEC 581 Computer Architecture Memory Hierarchy Design (II) Department of Electrical Engineering and Computer Science Cleveland State University Topics to be covered Cache Penalty Reduction Techniques Victim

More information

The CPU Design Kit: An Instructional Prototyping Platform. for Teaching Processor Design. Anujan Varma, Lampros Kalampoukas

The CPU Design Kit: An Instructional Prototyping Platform. for Teaching Processor Design. Anujan Varma, Lampros Kalampoukas The CPU Design Kit: An Instructional Prototyping Platform for Teaching Processor Design Anujan Varma, Lampros Kalampoukas Dimitrios Stiliadis, and Quinn Jacobson Computer Engineering Department University

More information

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com

More information

Agenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!)

Agenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!) 7/4/ CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches II Instructor: Michael Greenbaum New-School Machine Structures (It s a bit more complicated!) Parallel Requests Assigned to

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

A Reconfigurable Cache Design for Embedded Dynamic Data Cache

A Reconfigurable Cache Design for Embedded Dynamic Data Cache I J C T A, 9(17) 2016, pp. 8509-8517 International Science Press A Reconfigurable Cache Design for Embedded Dynamic Data Cache Shameedha Begum, T. Vidya, Amit D. Joshi and N. Ramasubramanian ABSTRACT Applications

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum

for Energy Savings in Set-associative Instruction Caches Alexander V. Veidenbaum Simultaneous Way-footprint Prediction and Branch Prediction for Energy Savings in Set-associative Instruction Caches Weiyu Tang Rajesh Gupta Alexandru Nicolau Alexander V. Veidenbaum Department of Information

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722 Dynamic Branch Prediction Dynamic branch prediction schemes run-time behavior of branches to make predictions. Usually information about outcomes of previous occurrences of branches are used to predict

More information

Cycle accurate transaction-driven simulation with multiple processor simulators

Cycle accurate transaction-driven simulation with multiple processor simulators Cycle accurate transaction-driven simulation with multiple processor simulators Dohyung Kim 1a) and Rajesh Gupta 2 1 Engineering Center, Google Korea Ltd. 737 Yeoksam-dong, Gangnam-gu, Seoul 135 984, Korea

More information

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers. CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information