An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors


An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors

Myoung Kwon Tcheun, Hyunsoo Yoon, Seung Ryoul Maeng
Department of Computer Science, CAIR, Korea Advanced Institute of Science and Technology (KAIST), Kusong-Dong, Yusung-Gu, Taejon, Korea

Abstract

The sequential prefetching scheme is a simple hardware-controlled scheme that exploits the sequentiality of memory accesses to predict which blocks will be read in the near future. We analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching on shared-memory multiprocessors. We also propose a simple hardware scheme that selects the prefetching degree on each miss by adding a small table (PDS: Prefetching Degree Selector) to the sequential prefetching scheme. This scheme prefetches consecutive blocks aggressively for applications with high sequentiality and conservatively for applications with low sequentiality.

1 Introduction

In large-scale multiprocessors with a general interconnection network, program execution time depends significantly on the shared-memory access latency, which consists of the memory latency and the network latency. Caches are quite effective at reducing and hiding the shared-memory access latency by reducing the number of shared-memory accesses. However, the remaining shared-memory accesses are still a serious bottleneck to high-performance computing, because the cache miss penalty reaches tens to hundreds of processor cycles with the advent of very fast uniprocessors and massively parallel systems [1, 2]. Prefetching is an attractive way to reduce the cache miss penalty by overlapping processor computation with data accesses. Especially in multiprocessors, the cache miss penalty can be decreased significantly by overlapping the network latency of the fetched block with those of the prefetched blocks.

Many prefetching schemes based on software or hardware have been proposed. Software prefetching schemes [3, 4, 5, 6] perform static program analysis and insert explicit prefetch instructions into the program code, which increases the program size. In contrast, hardware prefetching schemes control the prefetch activity purely in hardware, according to the program execution. Several hardware prefetching schemes [7, 8, 9] prefetch blocks once a regular access pattern is detected; these schemes require complex hardware to detect such a pattern. One-block-lookahead (OBL) schemes [10] are simple hardware prefetching schemes in which only the next block is considered upon referencing a block. There are three types of OBL schemes. Always prefetch prefetches the next block on each reference, so it makes excessive cache lookups. Tagged prefetch prefetches the next block when the referenced block is accessed for the first time; this scheme requires an extra bit per cache block. Prefetch on misses prefetches the next block only on a miss; since it requires no extra bit per cache block and no prefetch activity on cache hits, it is simple to implement, but it is conservative in that it needs a cache miss to prefetch even a single block.

The sequential prefetching schemes [10, 11, 12, 13] are an extension of prefetch on misses: on each miss, they prefetch several consecutive blocks following the missed block into the cache.
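As a concrete reference point, the following minimal sketch (ours, not code from the paper) expresses prefetch on misses generalized to a fixed prefetching degree k; the cache set and the issue_fetch/issue_prefetch callbacks are hypothetical stand-ins for the cache lookup and the network interface.

```python
# Minimal sketch (ours) of fixed-degree sequential prefetching layered on
# prefetch-on-miss. `cache` is a set of resident block addresses (hypothetical);
# `issue_fetch`/`issue_prefetch` stand in for network-interface requests.

def on_read(cache, addr, k, issue_fetch, issue_prefetch):
    """Handle a read of block `addr`, prefetching the next k blocks on a miss."""
    if addr in cache:
        return                        # hit: prefetch-on-miss does nothing
    issue_fetch(addr)                 # fetch the missed block itself
    for i in range(1, k + 1):         # look up the k consecutive successor blocks
        if addr + i not in cache:
            issue_prefetch(addr + i)  # prefetch only the blocks that are absent
```

With k = 1 this reduces to plain prefetch on misses; a fixed k > 1 gives the sequential prefetching scheme discussed next.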
Fu and Patel [11] show that the performance of sequential prefetching improves as the number of blocks to prefetch increases. Their scheme prefetches several blocks on misses that access a scalar or a short-stride vector, while it does not prefetch on misses that access a long-stride vector. The scheme uses the stride information carried by vector instructions; for a scalar processor, it requires complex hardware to obtain the stride information. Dahlgren, Dubois, and Stenstrom [12] note that sequential prefetching which prefetches more than one block on each miss can be useful if the system uses a high-bandwidth network. They propose an adaptive sequential prefetching scheme in which the prefetching degree, i.e., the number of blocks to prefetch on each miss, is controlled by the prefetch efficiency, measured by counting the prefetched blocks that turn out to be useful. To count the useful prefetches, this scheme uses two extra bits for each cache block and needs some activity on cache hits. Bianchini and LeBlanc [13] simulate the performance of the sequential prefetching scheme with various prefetching degrees under various memory and network latencies and analyze the effect of fine-grain sharing and write operations. While aggressive sequential prefetching prefetches many blocks on each miss and improves the miss rates of application programs, it may not reduce the stall time; in some cases, prefetching even one block increases the execution time of application programs with fine-grain sharing and many write operations. They propose a hybrid prefetching scheme combining hardware and software support for stride-directed prefetching, in which the compiler assigns to each instruction the number of blocks to prefetch according to the data type: one block for instructions that access fine-grain shared data, and several blocks for the other instructions.

Each of these studies concludes that the performance of sequential prefetching depends on several factors, such as the characteristics of the application programs, the prefetching degree, and the memory and network latency.

In this paper, we analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching under various memory and network latencies, and we propose a new, simple adaptive sequential prefetching scheme that needs neither compiler assistance nor extra activity on cache hits. The scheme prefetches many blocks on cache misses in long sequential streams and few blocks on misses in short sequential streams, by adjusting the prefetching degree according to the length of the sequential stream. Therefore, for application programs with high sequentiality, the proposed scheme prefetches blocks aggressively and achieves better execution time than the sequential prefetching scheme with a prefetching degree of one, as well as better execution time at low bandwidth than the adaptive sequential prefetching scheme proposed by Dahlgren et al. For application programs with low sequentiality, the scheme prefetches blocks conservatively and avoids prefetching useless blocks; it achieves execution time similar to the sequential prefetching scheme with a prefetching degree of one, and better execution time at high bandwidth than the adaptive scheme of Dahlgren et al.

The remainder of this paper is organized as follows. Section 2 describes the overview and implementation of the new adaptive sequential prefetching scheme. In Section 3, we present the simulation methodology and workloads used in this study. We analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching in Section 4. In Section 5, the proposed scheme is compared with the other simple hardware sequential prefetching schemes. Finally, conclusions are presented in Section 6.

2 An Adaptive Sequential Prefetching Scheme

2.1 Overview

When a cache miss occurs, the processor accesses the memory. If cache misses occur on consecutive blocks, the processor accesses those blocks in memory sequentially. We define the sequence of such memory accesses as a sequential stream, and the number of consecutive blocks as the length of the sequential stream. The effectiveness of sequential prefetching depends on the length of the sequential streams and on the prefetching degree. In a long sequential stream, aggressive prefetching reduces the miss rate efficiently, and the miss penalty can also be decreased significantly by overlapping the network latency of the fetched block with those of the prefetched blocks. However, aggressive prefetching in a short sequential stream increases the traffic for useless blocks and may increase the network latency.

The proposed scheme increases the prefetching degree according to the length of the sequential stream. Since the length of the sequential stream is not known in advance, the prefetching degree is increased by one on each miss in a sequential stream, as shown in Figure 1. After K misses have occurred in a sequential stream, K + 1 blocks are prefetched on the next miss in that stream. Thus, many blocks can be prefetched on a miss in a long sequential stream.
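The following small sketch (ours) makes the progression concrete by listing which blocks of a single stream still miss when each miss prefetches `degree` consecutive blocks and then raises the degree by one.

```python
# Sketch (ours) of the degree progression: misses land on the 1st, 3rd, 6th,
# 10th, ... blocks of a stream, since each miss covers one more block than
# the previous one.

def miss_positions(stream_length):
    """Return the 1-based block indices that miss within one sequential stream."""
    misses, pos, degree = [], 1, 1
    while pos <= stream_length:
        misses.append(pos)
        pos += degree + 1   # skip the missed block plus `degree` prefetched blocks
        degree += 1         # the next miss in this stream prefetches one more block
    return misses

print(miss_positions(30))   # -> [1, 3, 6, 10, 15, 21, 28]
```

Misses thus fall on triangular positions, so a stream of L blocks incurs only about sqrt(2L) misses.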
Since sequential streams can be interleaved, the scheme detects misses in the same sequential stream by using a small table containing the addresses on which the next miss of each stream will occur. (Figure 1: The missed-block sequence b1, b2, ..., where b_{i+1} = b_i + 1 and b_i is the ith block address of a sequential stream; successive misses prefetch one block, two blocks, three blocks, and so on.) When there is a miss on the first block of a sequential stream, the scheme stores the third block address, together with a prefetching degree of 2, in the table; the prefetching degree is kept in the table so that the increased degree is remembered for each stream. A block is considered the first of a sequential stream if there is no previous miss in the same stream; in the implementation, a first miss is detected by checking whether the missed block address exists in the table. If a miss really belongs to a sequential stream, there will be references to the second block, the third block, and so on. A reference to the second block is a cache hit, because the second block has already been prefetched. When the miss on the third block occurs, the scheme compares the miss address with the addresses stored in the table. Since the third block address is present with a prefetching degree of 2, the scheme prefetches the two consecutive blocks following the third block, i.e., the fourth and fifth blocks; the third block address is replaced with the sixth block address, and the prefetching degree is increased to 3. Because the prefetching degree is increased by one on each miss, the misses occur on the 1st, 3rd, 6th, and 10th blocks of a sequential stream, as shown in Figure 1. When the references reach the end of the sequential stream, the processor will not access the saved address in the near future; since the table size is finite, the saved address and its increased prefetching degree will eventually be replaced by the first address of another sequential stream. Each entry in the table thus keeps a different prefetching degree for one of the recent sequential streams. The proposed scheme prefetches one block on first misses in a sequential stream, which may access a long-stride vector or fine-grain shared data, and several blocks on the other misses in a sequential stream, which access scalar variables or a short-stride vector. Thus, there is no need to decrease the prefetching degree.

2.2 Processing Node Architecture

The processing node shown in Figure 2.a consists of a processor, a cache, a prefetching unit, a prefetching degree selector (PDS), and a network interface; Figure 2.a without the PDS shows the mechanism of a plain sequential prefetching scheme. The processor is blocked only for the time it takes to handle a read miss. A cache read miss for block address A issues a fetch request to the network interface and activates the prefetching unit. The prefetching unit increments the block address A to A + 1 and looks up the cache for block address A + 1. If the block is not present in the cache, the prefetching unit issues a prefetch request to the network interface; during this time, the block address A + 1 is incremented to A + 2.

In the next cache cycle, a cache lookup is made for block address A + 2, and if the block is not present in the cache, a prefetch request is issued to the network interface. In this way, K cache lookups are made for the K consecutive blocks, and prefetch requests are issued for the blocks that are not present in the cache, where K is the prefetching degree. The Prefetch Controller controls the number of cache lookups and issues prefetch requests depending on the result of each cache lookup. Since the prefetch requests are issued one at a time and are pipelined in the network and memory system, they can be overlapped with the fetch request. The Prefetching Degree Selector (PDS) selects a prefetching degree for each cache miss and sends the selected degree to the prefetching unit. (Figure 2: The prefetching mechanism — a. the processing node architecture; b. the Prefetching Degree Selector (PDS).)

2.3 Prefetching Degree Selector

The PDS uses a table to increase the prefetching degree by one on each miss in a sequential stream. Each entry of the table contains a block address, used to detect the next miss in the same sequential stream, and the prefetching degree to use on that next miss. Figure 2.b shows the block diagram of the PDS. The compare logic compares the miss address A with the addresses stored in the table. If there is a hit, the associated prefetching degree K is selected, and the entry (A, K) is replaced with (A + K + 1, K + 1): the prefetching degree K is incremented to K + 1, and the incremented degree is added to the address to form the next expected miss address. If the address A + K + 1 already exists in the table, the PDS invalidates that entry. If the miss address does not exist in the table, an entry (A + 2, 2) is inserted into the table; if the address A + 2 already exists in the table, its entry is replaced by (A + 2, 2). Since the table is finite, a new entry may cause the least recently used (LRU) entry to be evicted by the entry selection logic. The number of entries affects the performance of the proposed scheme; our experimental results suggest that a table of four entries is appropriate.
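These update rules can be condensed into a short behavioral sketch (ours; the paper specifies hardware, not code), with an OrderedDict modeling the LRU entry-selection logic of the four-entry table.

```python
from collections import OrderedDict

# Behavioral sketch (ours) of the PDS: a small LRU table maps an expected
# next-miss address to the prefetching degree to use when that miss occurs.

class PDS:
    def __init__(self, entries=4):        # four entries, per the text above
        self.entries = entries
        self.table = OrderedDict()        # address -> degree, oldest first

    def select_degree(self, miss_addr):
        """Return the prefetching degree for this miss and update the table."""
        if miss_addr in self.table:       # a later miss of a known stream
            k = self.table.pop(miss_addr)
            self._insert(miss_addr + k + 1, k + 1)  # next miss lies past the k prefetches
            return k
        self._insert(miss_addr + 2, 2)    # first miss: expect the 3rd block next
        return 1                          # and prefetch a single block for now

    def _insert(self, addr, degree):
        self.table.pop(addr, None)        # invalidate a clashing entry
        if len(self.table) >= self.entries:
            self.table.popitem(last=False)  # evict the least recently used entry
        self.table[addr] = degree
```

Replaying the worked example below through select_degree reproduces the table states of Figure 3.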
As an example, consider the arbitrary miss sequence a1, a2, b1, b2, c1, c2, d1, d2, b3, b4, b5, b6, e1, e2, e3, where each sequence of x_i is sequential, i.e., x_{i+1} = x_i + 1, and x_i is the address of the ith block of its sequential stream. For example, a1 is the first block address of the sequential stream a1, a2, and b3 is the third block address of the sequential stream b1, b2, b3, b4, b5, b6. Five sequential streams, the sequences of a_i, b_i, c_i, d_i, and e_i, are interleaved. The read miss on a1 prefetches a2 and inserts the address a3 with a prefetching degree of 2 into the table (Figure 3.a). The reference to a2 is a hit, because a2 has already been prefetched. The read misses on b1, c1, and d1 likewise prefetch b2, c2, and d2 and insert the addresses b3, c3, and d3, each with a prefetching degree of 2. The read miss on b3 prefetches b4 and b5, because the prefetching degree stored in the table is 2; the address b3 is replaced with b6, and the prefetching degree is increased from 2 to 3 (Figure 3.b). After the references to b4 and b5, a miss occurs on block b6, which replaces the entry (b6, 3) with (b10, 4) (Figure 3.c). Finally, a miss on block e1, the first block of the stream e1, e2, e3, replaces the least recently used entry (a3, 2) with (e3, 2) (Figure 3.d). (Figure 3: Updating the table entries on the misses of blocks a1, b3, b6, and e1, showing each entry's block address and prefetching degree before and after.)

3 Simulation Environment

We build the architectural simulation model using MINT [14]. The architecture is a scalable direct-connected multiprocessor with 16 nodes. Cache coherence is maintained by a full directory protocol [15] distributed among the memory modules.

(Table 1: The characteristics of the application programs. Recovered rows — LU: decompose a dense matrix, 200x200 matrix; FFT: FFT algorithm, 32K complex data points; Radix: integer radix sort, radix 1024.)

(Table 2: a. the distribution of sequential stream lengths (unit: %); b. the fraction of misses in streams of length one, the fraction in streams of length greater than 30, and the average stream length per program. The numeric entries are not recoverable from this transcription.)

Prefetched data are loaded into the caches so that the data can be invalidated by the cache coherence protocol. The page size is 4 Kbytes, and the pages are allocated in a round-robin fashion. The interconnection network is a bi-directional wormhole-routed mesh with dimension-ordered routing. We simulate different memory and network bandwidths, taken from [13], to explore the effect of the bandwidth. The latency of a memory module is 24 processor cycles. The memory transfer rates are 0.5, 2, and 4 cycles per word for high, medium, and low bandwidth, respectively. The network latency is 1 cycle per link and 5 cycles per switch. The path widths are 128, 32, and 16 bits for high, medium, and low bandwidth, respectively. In the simulations, high bandwidth means the high transfer rate together with the wide path width; medium and low bandwidth likewise pair the corresponding memory transfer rate and path width.

Six benchmark programs are considered for the simulation. Four of them (LU, FFT, Radix, and Ocean) are taken from the SPLASH-2 suite [16], and the other two (MP3D and PTHOR) from the SPLASH suite [17]. We summarize them in Table 1. Simulations are carried out with 16-Kbyte caches and a block size of 32 bytes.

We focus on four metrics: the read miss rate, the prefetch efficiency, the number of transferred blocks, and the execution time. The read miss rate is computed solely with respect to shared references; since the purpose of prefetching is to reduce cache misses, the read miss rate is a proper metric. Cache pollution is one cost of prefetching; we use the prefetch efficiency to measure how many prefetched blocks are useful. The prefetch efficiency is defined as the number of prefetched and useful blocks divided by the total number of prefetched blocks, where a prefetched and useful block is a prefetched block that is accessed later. The traffic for prefetched blocks is the other cost of prefetching, and high traffic may increase the memory access latency; we use the number of transferred blocks to measure the traffic for read cache misses and prefetches.

4 The Impact of Sequentiality on Sequential Prefetching

The performance of the sequential prefetching schemes depends on the distribution of sequential stream lengths of the application programs, because the effectiveness of sequential prefetching depends on the length of the sequential streams. Table 2.a shows the distribution of sequential stream lengths, and Table 2.b shows the fraction of misses belonging to sequential streams of length one, the fraction belonging to streams of length greater than 30, and the average length of the sequential streams. If the length of a sequential stream is one, most of the prefetched blocks will be useless, while in a long sequential stream many useful blocks can be prefetched on each miss. We define the sequentiality of an application program as the average length of its sequential streams.
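The metric can be stated operationally; the sketch below (ours) splits a miss-address trace into maximal runs of consecutive addresses and averages their lengths, ignoring for simplicity the interleaving of streams that the PDS table handles.

```python
# Sketch (ours) of the sequentiality metric: the average length of the
# sequential streams (maximal runs of consecutive block addresses) in a trace.

def sequentiality(miss_addrs):
    """Average sequential-stream length of a (non-interleaved) miss trace."""
    if not miss_addrs:
        return 0.0
    lengths = [1]
    for prev, cur in zip(miss_addrs, miss_addrs[1:]):
        if cur == prev + 1:
            lengths[-1] += 1    # the run of consecutive addresses continues
        else:
            lengths.append(1)   # a non-consecutive address starts a new stream
    return sum(lengths) / len(lengths)

print(sequentiality([1, 2, 3, 7, 8, 20]))   # -> 2.0: streams of length 3, 2, 1
```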
FFT, LU, and Radix show high sequentiality, because a large fraction of their misses belong to sequential streams of length greater than 30, while MP3D, Ocean, and PTHOR show low sequentiality, because a large fraction of their misses belong to sequential streams of length one. The benchmark programs are simulated with prefetching degrees of 1, 2, 4, 8, and 16, and with no prefetching, to examine the impact of sequentiality on sequential prefetching at various prefetching degrees. Figure 4.a shows the reduction of misses, with the number of misses normalized to no prefetching; the reduction of misses is larger for application programs with higher sequentiality at most prefetching degrees. Figure 4.b shows that the prefetch efficiency is higher for application programs with higher sequentiality. Figure 4.c shows the number of transferred blocks, normalized to no prefetching; as can be seen, the increase in transferred blocks is larger for application programs with lower sequentiality. Figure 5 shows the execution times, normalized to no prefetching; as a function of the prefetching degree, they follow U-shaped curves. The execution time depends on various factors: the cache size and cache block size, the memory access latency, the memory and network bandwidth, the sequentiality of the application programs, and the number of write operations.

(Figure 4: The impact of the sequentiality — a. the reduction of misses, b. the prefetch efficiency, c. the number of transferred blocks, each plotted against the prefetching degree; the legend annotates each program with its average stream length, e.g., Radix (2.98), Ocean (1.39), PTHOR (1.06).)

(Figure 5: The execution time against the prefetching degree — a. high bandwidth, b. medium bandwidth, c. low bandwidth.)

In this paper, we concentrate on the sequentiality of the application programs and on the memory and network bandwidth. Figures 5.b and 5.c clearly show the distinction between application programs with high sequentiality and those with low sequentiality. For application programs with high sequentiality, the reductions in execution time are large; this is expected, because the reduction of misses is large while the traffic increases only slightly. The reduction in execution time of FFT is smaller than that of LU in spite of a larger reduction of read misses and a higher prefetch efficiency; the reason is that the stall time of FFT increases more than that of LU, because the write traffic of FFT is much higher. In general, application programs with high sequentiality show a large reduction of misses, a high prefetch efficiency, and a small increase of transferred blocks; therefore, aggressive sequential prefetching works well for them. On the other hand, for application programs with low sequentiality, sequential prefetching degrades the execution time in some cases.

5 Comparison with Other Sequential Prefetching Schemes

In this section, we compare the proposed scheme with the other simple hardware sequential prefetching schemes, i.e., the sequential prefetching scheme with a prefetching degree of one and the adaptive sequential prefetching scheme proposed by Dahlgren et al. We refer to these two schemes as SP1 and ASP1, and to our adaptive sequential prefetching scheme as ASP2. SP1 is the simplest hardware sequential prefetching scheme; a sequential prefetching scheme with a prefetching degree greater than one needs hardware to count the number of prefetch requests on each miss. In the previous section, higher prefetching degrees showed better execution times for application programs with high sequentiality, but SP1 shows the best execution time for application programs with low sequentiality.

ASP1 controls the prefetching degree so that the prefetch efficiency lies between two predefined values; we use 75% and 50% in the simulations, the same values used in [12]. As shown in the previous section, the prefetch efficiency decreases as the prefetching degree increases. When the prefetch efficiency is high, ASP1 increases the prefetching degree, and when it is low, ASP1 decreases it; in some cases, prefetching can be stopped altogether. This scheme needs two extra bits for each cache block, plus some activity on cache hits to count the useful prefetches; 1 Kbit is needed for a 16-Kbyte cache with a 32-byte block size.

ASP2 increases the prefetching degree to prefetch many blocks on misses in long sequential streams. A small table is required to increase the prefetching degree, each entry consisting of a block address and a prefetching degree. A block address needs 27 bits for 32-byte blocks, and the prefetching degree needs 5 bits, which allows degrees from 1 to 32 and is sufficient to cover the longest sequential stream of the six application programs. Since 32 bits are used for one entry, 4 x 32 bits are required for a table with four entries. Unlike SP1 and ASP1, this scheme differentiates the misses according to the sequentiality. The characteristics of the three prefetching schemes are summarized in Table 3.

Table 3: Characteristics of the sequential prefetching schemes.
SP1: no extra hardware; no activity on cache hits; does not differentiate the misses.
ASP1: two extra bits per cache block (1 Kbit for a 16-Kbyte cache); sets the bits and counts useful blocks on cache hits; does not differentiate the misses.
ASP2: a small table (4 x 32 bits); no activity on cache hits; differentiates sequential misses (scalar variables, short-stride vectors) from non-sequential misses (long-stride vectors, fine-grain shared data).
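The quoted table cost is easy to verify; the sketch below (ours) assumes 32-bit block-frame addresses, which is what 27 address bits for 32-byte blocks implies.

```python
import math

# Quick check (ours) of the ASP2 table cost, assuming 32-bit addresses and
# the paper's 32-byte cache blocks and four-entry table.

ADDR_BITS, BLOCK_BYTES, ENTRIES = 32, 32, 4
block_addr_bits = ADDR_BITS - int(math.log2(BLOCK_BYTES))  # 32 - 5 = 27 bits
degree_bits = 5                          # encodes prefetching degrees up to 32
entry_bits = block_addr_bits + degree_bits                 # 32 bits per entry
table_bits = ENTRIES * entry_bits                          # 4 x 32 = 128 bits
print(block_addr_bits, entry_bits, table_bits)             # -> 27 32 128
```

The 128 bits of table state compare with the 1 Kbit of per-block tag bits that ASP1 adds to a 16-Kbyte cache.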
(Figure 6: The miss rates of FFT, LU, Radix, MP3D, Ocean, and PTHOR, normalized to SP1 and broken into cold, coherence, replace, and replace-p misses.)

(Figure 7: The prefetch efficiency of FFT, LU, Radix, MP3D, Ocean, and PTHOR.)

Figures 6 through 8 show the simulation results at high bandwidth: the miss rates, the prefetch efficiencies, and the numbers of transferred blocks, respectively. These results are similar to those at medium and low bandwidth. Figure 6 shows the relative numbers of misses, normalized to SP1. Each bar consists of four sections that, from bottom to top, correspond to cold misses, coherence misses, replace misses, and replace-p misses: a miss is classified as a cold miss if the block has never been fetched into the cache, as a coherence miss if the block was invalidated by the cache coherence protocol, and as a replace miss if the block was replaced out of the cache; a read miss to a block that was replaced out of the cache by a prefetched block is classified as a replace-p miss. As expected, the reduction of miss rates under ASP2 is large for FFT, LU, and Radix, because their sequentiality is high, while for application programs like MP3D, Ocean, and PTHOR the miss rates are only slightly lower than those of SP1, because their sequentiality is low. The reduction of miss rates under ASP1 is also large for FFT, LU, and Radix, because the prefetch efficiency is high; indeed, since the prefetch efficiency of FFT is very high, as shown in Figure 4.b, the miss rate of ASP2 is higher than that of ASP1 for FFT.

Figure 7 shows the prefetch efficiency. The prefetch efficiency of ASP2 is better than that of SP1 for all programs except FFT, because ASP2 prefetches many blocks on misses in long sequential streams; for FFT, although the prefetch efficiency of ASP2 is worse than that of SP1, it is still as high as 90.4%. The prefetch efficiency of ASP1 is worse than that of SP1 for four of the six application programs, because ASP1 forces the prefetch efficiencies of those programs to lie between 75% and 50%. For Ocean and PTHOR, the prefetch efficiency of ASP1 is better than that of SP1, because their prefetch efficiencies are lower than 50% at most prefetching degrees.

Figure 8 shows the number of transferred blocks. We break the transferred blocks down into three components that, from bottom to top, correspond to the prefetched and useful blocks, the prefetched but useless blocks, and the fetched blocks, where a prefetched but useless block is one that is replaced or invalidated before being accessed. The number of transferred blocks of ASP2 is slightly higher than that of SP1.

(Figure 8: The number of transferred blocks of FFT, LU, Radix, MP3D, Ocean, and PTHOR, broken into prefetched and useful, prefetched but useless, and fetched blocks.)

(Figure 9: Execution time at high bandwidth. Figure 10: Execution time at medium bandwidth. Figure 11: Execution time at low bandwidth.)

For FFT, LU, and Radix, the portion of fetched blocks under ASP2 is smaller than that under SP1, since the miss rates are reduced significantly, and the portion of useful prefetched blocks is larger than that of SP1, because the sequentiality is high. For MP3D, Ocean, and PTHOR, each portion is similar to that of SP1. The number of transferred blocks of ASP1 is larger than that of ASP2 for all application programs except PTHOR; PTHOR shows an extremely small portion of prefetched blocks because its prefetch efficiency is extremely low. For FFT, LU, Radix, and MP3D, the portion of useless blocks under ASP1 is much larger than that under ASP2, because the prefetch efficiencies are high; for PTHOR, the portion of useless blocks under ASP1 is smaller than that under ASP2, because the prefetch efficiency is extremely low. These results are similar to those at medium and low bandwidth, except for Ocean: at medium and low bandwidth, ASP2 transfers more blocks than ASP1.

Figures 9 through 11 show the relative execution times, normalized to SP1, for the three bandwidths. The execution time depends on the miss rate and on the number of transferred blocks: the blocked time is proportional to the miss rate, while the memory access latency is affected by the number of transferred blocks. For application programs with high sequentiality, ASP2 shows a much lower miss rate than SP1 while transferring a similar number of blocks, so ASP2 shows better execution time than SP1; for LU, the execution time of ASP2 is reduced to 77% of that of SP1. For application programs with low sequentiality, ASP2 shows execution time similar to SP1: for MP3D, ASP2 is slightly better, because it shows a slightly lower miss rate while transferring a similar number of blocks; for Ocean, ASP2 is similar, because it shows a slightly lower miss rate but a slightly higher number of transferred blocks; for PTHOR, ASP2 is similar, because the two schemes show similar miss rates and similar numbers of transferred blocks.

ASP2 shows better execution time than ASP1 in some cases, and worse in others. For FFT, ASP2 performs better than ASP1 at low bandwidth, while ASP1 shows better execution time at high bandwidth; the reason is that ASP1 shows a much lower miss rate than ASP2 but transfers many more blocks, and the execution time is more sensitive to the miss rate at high bandwidth but more sensitive to the number of transferred blocks at low bandwidth. LU, Radix, and MP3D show better execution times under ASP2 at all bandwidths, because ASP2 shows a similar miss rate with a smaller number of transferred blocks.

Ocean shows better execution times under ASP2 at high bandwidth, because ASP2 shows a lower miss rate with a smaller number of transferred blocks; but at medium and low bandwidth, ASP2 shows slightly worse execution time than ASP1, because ASP2 shows a lower miss rate but transfers many more blocks. PTHOR prefetches a very small number of blocks, because this program shows extremely low sequentiality; ASP2 shows a slightly lower miss rate but prefetches more blocks than ASP1, so ASP1 shows better execution time at all bandwidths. As expected, ASP2 shows better execution time than SP1 for application programs with high sequentiality and similar execution time for application programs with low sequentiality. For four of the six application programs, ASP2 shows higher prefetch efficiency and better execution time than ASP1, because ASP2 prefetches fewer useless blocks.

6 Conclusion

In this paper, we have analyzed the impact of sequentiality on the sequential prefetching scheme. For application programs with high sequentiality, aggressive sequential prefetching reduces the miss rate significantly and keeps the prefetch efficiency high with a small increase of transferred blocks; we therefore conclude that aggressive sequential prefetching works well for such programs. On the other hand, for application programs with low sequentiality, aggressive sequential prefetching degrades the prefetch efficiency, greatly increases the number of transferred blocks, and yields only a small reduction of the miss rate. We have also proposed a simple hardware sequential prefetching scheme that increases the prefetching degree according to the length of the sequential streams. By simply adding a small table to the sequential prefetching scheme, the proposed scheme reduces the execution time to as little as 77% of that of the sequential prefetching scheme with a prefetching degree of one. For four of the six application programs, the proposed scheme shows better execution time than the scheme proposed by Dahlgren et al.

References

[1] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber, "Comparative evaluation of latency reducing and tolerating techniques," in Proc. of the 18th Annual International Symposium on Computer Architecture, 1991.
[2] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, "The directory-based cache coherence protocol for the DASH multiprocessor," in Proc. of the 17th Annual International Symposium on Computer Architecture, 1990.
[3] E. Gornish, E. Granston, and A. Veidenbaum, "Compiler-directed data prefetching in multiprocessors with memory hierarchies," in Proc. of the International Conference on Supercomputing, 1990.
[4] T. Mowry and A. Gupta, "Tolerating latency through software-controlled prefetching in shared-memory multiprocessors," Journal of Parallel and Distributed Computing, vol. 12, no. 2, 1991.
[5] T. Mowry, M. Lam, and A. Gupta, "Design and evaluation of a compiler algorithm for prefetching," in Proc. of the 5th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1992.
[6] D. Callahan, K. Kennedy, and A. Porterfield, "Software prefetching," in Proc. of the 4th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1991.
[7] J. Baer and T. Chen, "An effective on-chip preloading scheme to reduce data access penalty," in Proc. of Supercomputing '91, 1991.
[8] J. Baer and T. Chen, "Reducing memory latency via non-blocking and prefetching caches," in Proc. of the 5th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1992.
[9] J. Fu, J. Patel, and B. Janssens, "Stride directed prefetching in scalar processors," in Proc. of the 25th International Symposium on Microarchitecture, 1992.
[10] A. Smith, "Cache memories," ACM Computing Surveys, vol. 14, Sep. 1982.
[11] J. Fu and J. Patel, "Data prefetching in multiprocessor vector cache memories," in Proc. of the 18th Annual International Symposium on Computer Architecture, 1991.
[12] F. Dahlgren, M. Dubois, and P. Stenstrom, "Fixed and adaptive sequential prefetching in shared memory multiprocessors," in Proc. of the International Conference on Parallel Processing, 1993.
[13] R. Bianchini and T. LeBlanc, "A preliminary evaluation of cache-miss-initiated prefetching techniques in scalable multiprocessors," Tech. Rep. 515, University of Rochester, 1994.
[14] J. Veenstra, "MINT Tutorial and User Manual," Tech. Rep. 452, Department of Computer Science, University of Rochester.
[15] L. Censier and P. Feautrier, "A new solution to coherence problems in multicache systems," IEEE Transactions on Computers, vol. C-27, no. 12, Dec. 1978.
[16] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in Proc. of the 22nd Annual International Symposium on Computer Architecture, 1995.
[17] J. P. Singh, W. D. Weber, and A. Gupta, "SPLASH: Stanford parallel applications for shared-memory," Computer Architecture News, vol. 20, pp. 5-44, Mar. 1992.


Neighborhood Prefetching on Multiprocessors Using Instruction History Neighborhood Prefetching on Multiprocessors Using Instruction History David M. Koppelman Department of Electrical & Computer Engineering, Louisiana State University koppel@ee.lsu.edu Abstract A multiprocessor

More information

A Best-Offset Prefetcher

A Best-Offset Prefetcher A Best-Offset Prefetcher Pierre Michaud Inria pierre.michaud@inria.fr The Best-Offset (BO) prefetcher submitted to the DPC contest prefetches one line into the level-two (L) cache on every cache miss or

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Speculative Sequential Consistency with Little Custom Storage

Speculative Sequential Consistency with Little Custom Storage Journal of Instruction-Level Parallelism 5(2003) Submitted 10/02; published 4/03 Speculative Sequential Consistency with Little Custom Storage Chris Gniady Babak Falsafi Computer Architecture Laboratory,

More information

Memory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Memory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Memory Hierarchy Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele Large Scale Multiprocessors and Scientific Applications By Pushkar Ratnalikar Namrata Lele Agenda Introduction Interprocessor Communication Characteristics of Scientific Applications Synchronization: Scaling

More information

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

/ : Computer Architecture and Design Fall Final Exam December 4, Name: ID #:

/ : Computer Architecture and Design Fall Final Exam December 4, Name: ID #: 16.482 / 16.561: Computer Architecture and Design Fall 2014 Final Exam December 4, 2014 Name: ID #: For this exam, you may use a calculator and two 8.5 x 11 double-sided page of notes. All other electronic

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Organization Prof. Michel A. Kinsy The course has 4 modules Module 1 Instruction Set Architecture (ISA) Simple Pipelining and Hazards Module 2 Superscalar Architectures

More information

Speculative Sequential Consistency with Little Custom Storage

Speculative Sequential Consistency with Little Custom Storage Appears in Journal of Instruction-Level Parallelism (JILP), 2003. Speculative Sequential Consistency with Little Custom Storage Chris Gniady and Babak Falsafi Computer Architecture Laboratory Carnegie

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection topologies

Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection topologies Computers and Electrical Engineering 26 (2000) 207±220 www.elsevier.com/locate/compeleceng Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection

More information

An Efficient Lock Protocol for Home-based Lazy Release Consistency

An Efficient Lock Protocol for Home-based Lazy Release Consistency An Efficient Lock Protocol for Home-based Lazy ease Consistency Hee-Chul Yun Sang-Kwon Lee Joonwon Lee Seungryoul Maeng Computer Architecture Laboratory Korea Advanced Institute of Science and Technology

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 12: Advanced Caching Prof. Onur Mutlu Carnegie Mellon University Announcements Chuck Thacker (Microsoft Research) Seminar Tomorrow RARE: Rethinking Architectural

More information

Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions

Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions Resit Sendag 1, David J. Lilja 1, and Steven R. Kunkel 2 1 Department of Electrical and Computer Engineering Minnesota

More information

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture

More information

A Performance Study of Instruction Cache Prefetching Methods

A Performance Study of Instruction Cache Prefetching Methods IEEE TRANSACTIONS ON COMPUTERS, VOL. 47, NO. 5, MAY 1998 497 A Performance Study of Instruction Cache Prefetching Methods Wei-Chung Hsu and James E. Smith Abstract Prefetching methods for instruction caches

More information

SF-LRU Cache Replacement Algorithm

SF-LRU Cache Replacement Algorithm SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

thester UNIVERSITY OF ' COMPUTER SCIENCE AD-A

thester UNIVERSITY OF ' COMPUTER SCIENCE AD-A AD-A281 502 -- A Preliminary Evaluation of Cache-Miss-Initiated Prefetching Techniques in Scalable Multiprocessors Ricaxdo Bianchini and Thomas J. LeBlanc Technical Report 515 May 1994 DTIC ELECTE JUL1

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor

Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing Dallas, October 26-28 1994, pp. 612-619. Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic

Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic Fall 2011 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Reading: Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000) If memory

More information

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors Maged M. Michael y, Ashwini K. Nanda z, Beng-Hong Lim z, and Michael L. Scott y y University of Rochester z IBM Research Department

More information

Memory Hierarchy Basics. Ten Advanced Optimizations. Small and Simple

Memory Hierarchy Basics. Ten Advanced Optimizations. Small and Simple Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases capacity and conflict misses, increases miss penalty Larger total cache capacity to reduce miss

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information