An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors


An Adaptive Sequential Prefetching Scheme in Shared-Memory Multiprocessors

Myoung Kwon Tcheun, Hyunsoo Yoon, Seung Ryoul Maeng
Department of Computer Science, CAIR, Korea Advanced Institute of Science and Technology (KAIST), Kusong-Dong, Yusung-Gu, Taejon, Korea

Abstract

The sequential prefetching scheme is a simple hardware-controlled scheme that exploits the sequentiality of memory accesses to predict which blocks will be read in the near future. We analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching on shared-memory multiprocessors. We also propose a simple hardware scheme that selects the prefetching degree on each miss by adding a small table (PDS: Prefetching Degree Selector) to the sequential prefetching scheme. This scheme prefetches consecutive blocks aggressively for applications with high sequentiality and conservatively for applications with low sequentiality.

1 Introduction

In large-scale multiprocessors with a general interconnection network, program execution time depends significantly on the shared-memory access latency, which consists of the memory latency and the network latency. Caches are quite effective at reducing and hiding the shared-memory access latency by reducing the number of shared-memory accesses. However, the remaining shared-memory accesses are still a serious bottleneck to high-performance computing, because the cache miss penalty reaches tens to hundreds of processor cycles with the advent of very fast uniprocessors and massively parallel systems [1, 2]. Prefetching is an attractive way to reduce the cache miss penalty by overlapping processor computation with data accesses. Especially in multiprocessors, the cache miss penalty can be decreased significantly by overlapping the network latency of the fetched block with those of the prefetched blocks.

Many prefetching schemes based on software or hardware have been proposed. Software prefetching schemes [3, 4, 5, 6] perform static program analysis and insert explicit prefetch instructions into the program code, which increases the program size. In contrast, hardware prefetching schemes control the prefetch activity purely in hardware, according to the program execution. Several hardware prefetching schemes [7, 8, 9] prefetch blocks once a regular access pattern is detected; these schemes require complex hardware to detect such a pattern. One-block-lookahead (OBL) schemes [10] are simple hardware prefetching schemes in which only the next block is considered upon referencing a block. There are three types of OBL schemes. Always prefetch prefetches the next block on each reference, so it makes excessive cache lookups. Tagged prefetch prefetches the next block when the referenced block is accessed for the first time; this scheme requires an extra bit per cache block. Prefetch on misses prefetches the next block only on a miss; since it requires no extra bit per cache block and no prefetch activity on cache hits, it is simple to implement, but it is conservative in that it needs a cache miss to prefetch even a single block.

The sequential prefetching schemes [10, 11, 12, 13] are an extension of prefetch on misses: on each miss, they prefetch several consecutive blocks following the missed block into the cache.
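As a concrete reference point, the following minimal sketch (ours, not code from the paper) expresses prefetch on misses generalized to a fixed prefetching degree k; the cache set and the issue_fetch/issue_prefetch callbacks are hypothetical stand-ins for the cache lookup and the network interface.

```python
# Minimal sketch (ours) of fixed-degree sequential prefetching layered on
# prefetch-on-miss. `cache` is a set of resident block addresses (hypothetical);
# `issue_fetch`/`issue_prefetch` stand in for network-interface requests.

def on_read(cache, addr, k, issue_fetch, issue_prefetch):
    """Handle a read of block `addr`, prefetching the next k blocks on a miss."""
    if addr in cache:
        return                        # hit: prefetch-on-miss does nothing
    issue_fetch(addr)                 # fetch the missed block itself
    for i in range(1, k + 1):         # look up the k consecutive successor blocks
        if addr + i not in cache:
            issue_prefetch(addr + i)  # prefetch only the blocks that are absent
```

With k = 1 this reduces to plain prefetch on misses; a fixed k > 1 gives the sequential prefetching scheme discussed next.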
Fu and Patel [11] show that the performance of sequential prefetching improves as the number of blocks to prefetch increases. Their scheme prefetches several blocks on misses that access a scalar or a short-stride vector, while it does not prefetch on misses that access a long-stride vector. The scheme uses the stride information carried by vector instructions; for a scalar processor, it requires complex hardware to obtain the stride information. Dahlgren, Dubois, and Stenstrom [12] note that sequential prefetching which prefetches more than one block on each miss can be useful if the system uses a high-bandwidth network. They propose an adaptive sequential prefetching scheme in which the prefetching degree, i.e., the number of blocks to prefetch on each miss, is controlled by the prefetch efficiency, measured by counting the prefetched blocks that turn out to be useful. To count the useful prefetches, this scheme uses two extra bits for each cache block and needs some activity on cache hits. Bianchini and LeBlanc [13] simulate the performance of the sequential prefetching scheme with various prefetching degrees under various memory and network latencies and analyze the effect of fine-grain sharing and write operations. While aggressive sequential prefetching prefetches many blocks on each miss and improves the miss rates of application programs, it may not reduce the stall time; in some cases, prefetching even one block increases the execution time of application programs with fine-grain sharing and many write operations. They propose a hybrid prefetching scheme combining hardware and software support for stride-directed prefetching, in which the compiler assigns to each instruction the number of blocks to prefetch according to the data type: one block for instructions that access fine-grain shared data, and several blocks for the other instructions.

Each of these studies concludes that the performance of sequential prefetching depends on several factors, such as the characteristics of the application programs, the prefetching degree, and the memory and network latency.

In this paper, we analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching under various memory and network latencies, and we propose a new, simple adaptive sequential prefetching scheme that needs neither compiler assistance nor extra activity on cache hits. The scheme prefetches many blocks on cache misses in long sequential streams and few blocks on misses in short sequential streams, by adjusting the prefetching degree according to the length of the sequential stream. Therefore, for application programs with high sequentiality, the proposed scheme prefetches blocks aggressively and achieves better execution time than the sequential prefetching scheme with a prefetching degree of one, as well as better execution time at low bandwidth than the adaptive sequential prefetching scheme proposed by Dahlgren et al. For application programs with low sequentiality, the scheme prefetches blocks conservatively and avoids prefetching useless blocks; it achieves execution time similar to the sequential prefetching scheme with a prefetching degree of one, and better execution time at high bandwidth than the adaptive scheme of Dahlgren et al.

The remainder of this paper is organized as follows. Section 2 describes the overview and implementation of the new adaptive sequential prefetching scheme. In Section 3, we present the simulation methodology and workloads used in this study. We analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching in Section 4. In Section 5, the proposed scheme is compared with the other simple hardware sequential prefetching schemes. Finally, conclusions are presented in Section 6.

2 An Adaptive Sequential Prefetching Scheme

2.1 Overview

When a cache miss occurs, the processor accesses the memory. If cache misses occur on consecutive blocks, the processor accesses those blocks in memory sequentially. We define the sequence of such memory accesses as a sequential stream, and the number of consecutive blocks as the length of the sequential stream. The effectiveness of sequential prefetching depends on the length of the sequential streams and on the prefetching degree. In a long sequential stream, aggressive prefetching reduces the miss rate efficiently, and the miss penalty can also be decreased significantly by overlapping the network latency of the fetched block with those of the prefetched blocks. However, aggressive prefetching in a short sequential stream increases the traffic for useless blocks and may increase the network latency.

The proposed scheme increases the prefetching degree according to the length of the sequential stream. Since the length of the sequential stream is not known in advance, the prefetching degree is increased by one on each miss in a sequential stream, as shown in Figure 1. After K misses have occurred in a sequential stream, K + 1 blocks are prefetched on the next miss in that stream. Thus, many blocks can be prefetched on a miss in a long sequential stream.
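The following small sketch (ours) makes the progression concrete by listing which blocks of a single stream still miss when each miss prefetches `degree` consecutive blocks and then raises the degree by one.

```python
# Sketch (ours) of the degree progression: misses land on the 1st, 3rd, 6th,
# 10th, ... blocks of a stream, since each miss covers one more block than
# the previous one.

def miss_positions(stream_length):
    """Return the 1-based block indices that miss within one sequential stream."""
    misses, pos, degree = [], 1, 1
    while pos <= stream_length:
        misses.append(pos)
        pos += degree + 1   # skip the missed block plus `degree` prefetched blocks
        degree += 1         # the next miss in this stream prefetches one more block
    return misses

print(miss_positions(30))   # -> [1, 3, 6, 10, 15, 21, 28]
```

Misses thus fall on triangular positions, so a stream of L blocks incurs only about sqrt(2L) misses.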
Since sequential streams can be interleaved, the scheme detects misses in the same sequential stream by using a small table containing the addresses on which the next miss of each stream will occur. (Figure 1: The missed-block sequence b1, b2, ..., where b_{i+1} = b_i + 1 and b_i is the ith block address of a sequential stream; successive misses prefetch one block, two blocks, three blocks, and so on.) When there is a miss on the first block of a sequential stream, the scheme stores the third block address, together with a prefetching degree of 2, in the table; the prefetching degree is kept in the table so that the increased degree is remembered for each stream. A block is considered the first of a sequential stream if there is no previous miss in the same stream; in the implementation, a first miss is detected by checking whether the missed block address exists in the table. If a miss really belongs to a sequential stream, there will be references to the second block, the third block, and so on. A reference to the second block is a cache hit, because the second block has already been prefetched. When the miss on the third block occurs, the scheme compares the miss address with the addresses stored in the table. Since the third block address is present with a prefetching degree of 2, the scheme prefetches the two consecutive blocks following the third block, i.e., the fourth and fifth blocks; the third block address is replaced with the sixth block address, and the prefetching degree is increased to 3. Because the prefetching degree is increased by one on each miss, the misses occur on the 1st, 3rd, 6th, and 10th blocks of a sequential stream, as shown in Figure 1. When the references reach the end of the sequential stream, the processor will not access the saved address in the near future; since the table size is finite, the saved address and its increased prefetching degree will eventually be replaced by the first address of another sequential stream. Each entry in the table thus keeps a different prefetching degree for one of the recent sequential streams. The proposed scheme prefetches one block on first misses in a sequential stream, which may access a long-stride vector or fine-grain shared data, and several blocks on the other misses in a sequential stream, which access scalar variables or a short-stride vector. Thus, there is no need to decrease the prefetching degree.

2.2 Processing Node Architecture

The processing node shown in Figure 2.a consists of a processor, a cache, a prefetching unit, a prefetching degree selector (PDS), and a network interface; Figure 2.a without the PDS shows the mechanism of a plain sequential prefetching scheme. The processor is blocked only for the time it takes to handle a read miss. A cache read miss for block address A issues a fetch request to the network interface and activates the prefetching unit. The prefetching unit increments the block address A to A + 1 and looks up the cache for block address A + 1. If the block is not present in the cache, the prefetching unit issues a prefetch request to the network interface; during this time, the block address A + 1 is incremented to A + 2.

In the next cache cycle, a cache lookup is made for block address A + 2, and if the block is not present in the cache, a prefetch request is issued to the network interface. In this way, K cache lookups are made for the K consecutive blocks, and prefetch requests are issued for the blocks that are not present in the cache, where K is the prefetching degree. The Prefetch Controller controls the number of cache lookups and issues prefetch requests depending on the result of each cache lookup. Since the prefetch requests are issued one at a time and are pipelined in the network and memory system, they can be overlapped with the fetch request. The Prefetching Degree Selector (PDS) selects a prefetching degree for each cache miss and sends the selected degree to the prefetching unit. (Figure 2: The prefetching mechanism — a. the processing node architecture; b. the Prefetching Degree Selector (PDS).)

2.3 Prefetching Degree Selector

The PDS uses a table to increase the prefetching degree by one on each miss in a sequential stream. Each entry of the table contains a block address, used to detect the next miss in the same sequential stream, and the prefetching degree to use on that next miss. Figure 2.b shows the block diagram of the PDS. The compare logic compares the miss address A with the addresses stored in the table. If there is a hit, the associated prefetching degree K is selected, and the entry (A, K) is replaced with (A + K + 1, K + 1): the prefetching degree K is incremented to K + 1, and the incremented degree is added to the address to form the next expected miss address. If the address A + K + 1 already exists in the table, the PDS invalidates that entry. If the miss address does not exist in the table, an entry (A + 2, 2) is inserted into the table; if the address A + 2 already exists in the table, its entry is replaced by (A + 2, 2). Since the table is finite, a new entry may cause the least recently used (LRU) entry to be evicted by the entry selection logic. The number of entries affects the performance of the proposed scheme; our experimental results suggest that a table of four entries is appropriate.
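These update rules can be condensed into a short behavioral sketch (ours; the paper specifies hardware, not code), with an OrderedDict modeling the LRU entry-selection logic of the four-entry table.

```python
from collections import OrderedDict

# Behavioral sketch (ours) of the PDS: a small LRU table maps an expected
# next-miss address to the prefetching degree to use when that miss occurs.

class PDS:
    def __init__(self, entries=4):        # four entries, per the text above
        self.entries = entries
        self.table = OrderedDict()        # address -> degree, oldest first

    def select_degree(self, miss_addr):
        """Return the prefetching degree for this miss and update the table."""
        if miss_addr in self.table:       # a later miss of a known stream
            k = self.table.pop(miss_addr)
            self._insert(miss_addr + k + 1, k + 1)  # next miss lies past the k prefetches
            return k
        self._insert(miss_addr + 2, 2)    # first miss: expect the 3rd block next
        return 1                          # and prefetch a single block for now

    def _insert(self, addr, degree):
        self.table.pop(addr, None)        # invalidate a clashing entry
        if len(self.table) >= self.entries:
            self.table.popitem(last=False)  # evict the least recently used entry
        self.table[addr] = degree
```

Replaying the worked example below through select_degree reproduces the table states of Figure 3.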
As an example, consider the arbitrary miss sequence a1, a2, b1, b2, c1, c2, d1, d2, b3, b4, b5, b6, e1, e2, e3, where each sequence of x_i is sequential, i.e., x_{i+1} = x_i + 1, and x_i is the address of the ith block of its sequential stream. For example, a1 is the first block address of the sequential stream a1, a2, and b3 is the third block address of the sequential stream b1, b2, b3, b4, b5, b6. Five sequential streams, the sequences of a_i, b_i, c_i, d_i, and e_i, are interleaved. The read miss on a1 prefetches a2 and inserts the address a3 with a prefetching degree of 2 into the table (Figure 3.a). The reference to a2 is a hit, because a2 has already been prefetched. The read misses on b1, c1, and d1 likewise prefetch b2, c2, and d2 and insert the addresses b3, c3, and d3, each with a prefetching degree of 2. The read miss on b3 prefetches b4 and b5, because the prefetching degree stored in the table is 2; the address b3 is replaced with b6, and the prefetching degree is increased from 2 to 3 (Figure 3.b). After the references to b4 and b5, a miss occurs on block b6, which replaces the entry (b6, 3) with (b10, 4) (Figure 3.c). Finally, a miss on block e1, the first block of the stream e1, e2, e3, replaces the least recently used entry (a3, 2) with (e3, 2) (Figure 3.d). (Figure 3: Updating the table entries on the misses of blocks a1, b3, b6, and e1, showing each entry's block address and prefetching degree before and after.)

3 Simulation Environment

We build the architectural simulation model using MINT [14]. The architecture is a scalable direct-connected multiprocessor with 16 nodes. Cache coherence is maintained by a full directory protocol [15] distributed among the memory modules.

(Table 1: The characteristics of the application programs. Recovered rows — LU: decompose a dense matrix, 200x200 matrix; FFT: FFT algorithm, 32K complex data points; Radix: integer radix sort, radix 1024.)

(Table 2: a. the distribution of sequential stream lengths (unit: %); b. the fraction of misses in streams of length one, the fraction in streams of length greater than 30, and the average stream length per program. The numeric entries are not recoverable from this transcription.)

Prefetched data are loaded into the caches so that the data can be invalidated by the cache coherence protocol. The page size is 4 Kbytes, and the pages are allocated in a round-robin fashion. The interconnection network is a bi-directional wormhole-routed mesh with dimension-ordered routing. We simulate different memory and network bandwidths, taken from [13], to explore the effect of the bandwidth. The latency of a memory module is 24 processor cycles. The memory transfer rates are 0.5, 2, and 4 cycles per word for high, medium, and low bandwidth, respectively. The network latency is 1 cycle per link and 5 cycles per switch. The path widths are 128, 32, and 16 bits for high, medium, and low bandwidth, respectively. In the simulations, high bandwidth means the high transfer rate together with the wide path width; medium and low bandwidth likewise pair the corresponding memory transfer rate and path width.

Six benchmark programs are considered for the simulation. Four of them (LU, FFT, Radix, and Ocean) are taken from the SPLASH-2 suite [16], and the other two (MP3D and PTHOR) from the SPLASH suite [17]. We summarize them in Table 1. Simulations are carried out with 16-Kbyte caches and a block size of 32 bytes.

We focus on four metrics: the read miss rate, the prefetch efficiency, the number of transferred blocks, and the execution time. The read miss rate is computed solely with respect to shared references; since the purpose of prefetching is to reduce cache misses, the read miss rate is a proper metric. Cache pollution is one cost of prefetching; we use the prefetch efficiency to measure how many prefetched blocks are useful. The prefetch efficiency is defined as the number of prefetched and useful blocks divided by the total number of prefetched blocks, where a prefetched and useful block is a prefetched block that is accessed later. The traffic for prefetched blocks is the other cost of prefetching, and high traffic may increase the memory access latency; we use the number of transferred blocks to measure the traffic for read cache misses and prefetches.

4 The Impact of Sequentiality on Sequential Prefetching

The performance of the sequential prefetching schemes depends on the distribution of sequential stream lengths of the application programs, because the effectiveness of sequential prefetching depends on the length of the sequential streams. Table 2.a shows the distribution of sequential stream lengths, and Table 2.b shows the fraction of misses belonging to sequential streams of length one, the fraction belonging to streams of length greater than 30, and the average length of the sequential streams. If the length of a sequential stream is one, most of the prefetched blocks will be useless, while in a long sequential stream many useful blocks can be prefetched on each miss. We define the sequentiality of an application program as the average length of its sequential streams.
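The metric can be stated operationally; the sketch below (ours) splits a miss-address trace into maximal runs of consecutive addresses and averages their lengths, ignoring for simplicity the interleaving of streams that the PDS table handles.

```python
# Sketch (ours) of the sequentiality metric: the average length of the
# sequential streams (maximal runs of consecutive block addresses) in a trace.

def sequentiality(miss_addrs):
    """Average sequential-stream length of a (non-interleaved) miss trace."""
    if not miss_addrs:
        return 0.0
    lengths = [1]
    for prev, cur in zip(miss_addrs, miss_addrs[1:]):
        if cur == prev + 1:
            lengths[-1] += 1    # the run of consecutive addresses continues
        else:
            lengths.append(1)   # a non-consecutive address starts a new stream
    return sum(lengths) / len(lengths)

print(sequentiality([1, 2, 3, 7, 8, 20]))   # -> 2.0: streams of length 3, 2, 1
```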
FFT, LU, and Radix show high sequentiality, because a large fraction of their misses belong to sequential streams of length greater than 30, while MP3D, Ocean, and PTHOR show low sequentiality, because a large fraction of their misses belong to sequential streams of length one. The benchmark programs are simulated with prefetching degrees of 1, 2, 4, 8, and 16, and with no prefetching, to examine the impact of sequentiality on sequential prefetching at various prefetching degrees. Figure 4.a shows the reduction of misses, with the number of misses normalized to no prefetching; the reduction of misses is larger for application programs with higher sequentiality at most prefetching degrees. Figure 4.b shows that the prefetch efficiency is higher for application programs with higher sequentiality. Figure 4.c shows the number of transferred blocks, normalized to no prefetching; as can be seen, the increase in transferred blocks is larger for application programs with lower sequentiality. Figure 5 shows the execution times, normalized to no prefetching; as a function of the prefetching degree, they follow U-shaped curves. The execution time depends on various factors: the cache size and cache block size, the memory access latency, the memory and network bandwidth, the sequentiality of the application programs, and the number of write operations.

(Figure 4: The impact of the sequentiality — a. the reduction of misses, b. the prefetch efficiency, c. the number of transferred blocks, each plotted against the prefetching degree; the legend annotates each program with its average stream length, e.g., Radix (2.98), Ocean (1.39), PTHOR (1.06).)

(Figure 5: The execution time against the prefetching degree — a. high bandwidth, b. medium bandwidth, c. low bandwidth.)

In this paper, we concentrate on the sequentiality of the application programs and on the memory and network bandwidth. Figures 5.b and 5.c clearly show the distinction between application programs with high sequentiality and those with low sequentiality. For application programs with high sequentiality, the reductions in execution time are large; this is expected, because the reduction of misses is large while the traffic increases only slightly. The reduction in execution time of FFT is smaller than that of LU in spite of a larger reduction of read misses and a higher prefetch efficiency; the reason is that the stall time of FFT increases more than that of LU, because the write traffic of FFT is much higher. In general, application programs with high sequentiality show a large reduction of misses, a high prefetch efficiency, and a small increase of transferred blocks; therefore, aggressive sequential prefetching works well for them. On the other hand, for application programs with low sequentiality, sequential prefetching degrades the execution time in some cases.

5 Comparison with Other Sequential Prefetching Schemes

In this section, we compare the proposed scheme with the other simple hardware sequential prefetching schemes, i.e., the sequential prefetching scheme with a prefetching degree of one and the adaptive sequential prefetching scheme proposed by Dahlgren et al. We refer to these two schemes as SP1 and ASP1, and to our adaptive sequential prefetching scheme as ASP2. SP1 is the simplest hardware sequential prefetching scheme; a sequential prefetching scheme with a prefetching degree greater than one needs hardware to count the number of prefetch requests on each miss. In the previous section, higher prefetching degrees showed better execution times for application programs with high sequentiality, but SP1 shows the best execution time for application programs with low sequentiality.

ASP1 controls the prefetching degree so that the prefetch efficiency lies between two predefined values; we use 75% and 50% in the simulations, the same values used in [12]. As shown in the previous section, the prefetch efficiency decreases as the prefetching degree increases. When the prefetch efficiency is high, ASP1 increases the prefetching degree, and when it is low, ASP1 decreases it; in some cases, prefetching can be stopped altogether. This scheme needs two extra bits for each cache block, plus some activity on cache hits to count the useful prefetches; 1 Kbit is needed for a 16-Kbyte cache with a 32-byte block size.

ASP2 increases the prefetching degree to prefetch many blocks on misses in long sequential streams. A small table is required to increase the prefetching degree, each entry consisting of a block address and a prefetching degree. A block address needs 27 bits for 32-byte blocks, and the prefetching degree needs 5 bits, which allows degrees from 1 to 32 and is sufficient to cover the longest sequential stream of the six application programs. Since 32 bits are used for one entry, 4 x 32 bits are required for a table with four entries. Unlike SP1 and ASP1, this scheme differentiates the misses according to the sequentiality. The characteristics of the three prefetching schemes are summarized in Table 3.

Table 3: Characteristics of the sequential prefetching schemes.
SP1: no extra hardware; no activity on cache hits; does not differentiate the misses.
ASP1: two extra bits per cache block (1 Kbit for a 16-Kbyte cache); sets the bits and counts useful blocks on cache hits; does not differentiate the misses.
ASP2: a small table (4 x 32 bits); no activity on cache hits; differentiates sequential misses (scalar variables, short-stride vectors) from non-sequential misses (long-stride vectors, fine-grain shared data).
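The quoted table cost is easy to verify; the sketch below (ours) assumes 32-bit block-frame addresses, which is what 27 address bits for 32-byte blocks implies.

```python
import math

# Quick check (ours) of the ASP2 table cost, assuming 32-bit addresses and
# the paper's 32-byte cache blocks and four-entry table.

ADDR_BITS, BLOCK_BYTES, ENTRIES = 32, 32, 4
block_addr_bits = ADDR_BITS - int(math.log2(BLOCK_BYTES))  # 32 - 5 = 27 bits
degree_bits = 5                          # encodes prefetching degrees up to 32
entry_bits = block_addr_bits + degree_bits                 # 32 bits per entry
table_bits = ENTRIES * entry_bits                          # 4 x 32 = 128 bits
print(block_addr_bits, entry_bits, table_bits)             # -> 27 32 128
```

The 128 bits of table state compare with the 1 Kbit of per-block tag bits that ASP1 adds to a 16-Kbyte cache.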
(Figure 6: The miss rates of FFT, LU, Radix, MP3D, Ocean, and PTHOR, normalized to SP1 and broken into cold, coherence, replace, and replace-p misses.)

(Figure 7: The prefetch efficiency of FFT, LU, Radix, MP3D, Ocean, and PTHOR.)

Figures 6 through 8 show the simulation results at high bandwidth: the miss rates, the prefetch efficiencies, and the numbers of transferred blocks, respectively. These results are similar to those at medium and low bandwidth. Figure 6 shows the relative numbers of misses, normalized to SP1. Each bar consists of four sections that, from bottom to top, correspond to cold misses, coherence misses, replace misses, and replace-p misses: a miss is classified as a cold miss if the block has never been fetched into the cache, as a coherence miss if the block was invalidated by the cache coherence protocol, and as a replace miss if the block was replaced out of the cache; a read miss to a block that was replaced out of the cache by a prefetched block is classified as a replace-p miss. As expected, the reduction of miss rates under ASP2 is large for FFT, LU, and Radix, because their sequentiality is high, while for application programs like MP3D, Ocean, and PTHOR the miss rates are only slightly lower than those of SP1, because their sequentiality is low. The reduction of miss rates under ASP1 is also large for FFT, LU, and Radix, because the prefetch efficiency is high; indeed, since the prefetch efficiency of FFT is very high, as shown in Figure 4.b, the miss rate of ASP2 is higher than that of ASP1 for FFT.

Figure 7 shows the prefetch efficiency. The prefetch efficiency of ASP2 is better than that of SP1 for all programs except FFT, because ASP2 prefetches many blocks on misses in long sequential streams; for FFT, although the prefetch efficiency of ASP2 is worse than that of SP1, it is still as high as 90.4%. The prefetch efficiency of ASP1 is worse than that of SP1 for four of the six application programs, because ASP1 forces the prefetch efficiencies of those programs to lie between 75% and 50%. For Ocean and PTHOR, the prefetch efficiency of ASP1 is better than that of SP1, because their prefetch efficiencies are lower than 50% at most prefetching degrees.

Figure 8 shows the number of transferred blocks. We break the transferred blocks down into three components that, from bottom to top, correspond to the prefetched and useful blocks, the prefetched but useless blocks, and the fetched blocks, where a prefetched but useless block is one that is replaced or invalidated before being accessed. The number of transferred blocks of ASP2 is slightly higher than that of SP1.

(Figure 8: The number of transferred blocks of FFT, LU, Radix, MP3D, Ocean, and PTHOR, broken into prefetched and useful, prefetched but useless, and fetched blocks.)

(Figure 9: Execution time at high bandwidth. Figure 10: Execution time at medium bandwidth. Figure 11: Execution time at low bandwidth.)

For FFT, LU, and Radix, the portion of fetched blocks under ASP2 is smaller than that under SP1, since the miss rates are reduced significantly, and the portion of useful prefetched blocks is larger than that of SP1, because the sequentiality is high. For MP3D, Ocean, and PTHOR, each portion is similar to that of SP1. The number of transferred blocks of ASP1 is larger than that of ASP2 for all application programs except PTHOR; PTHOR shows an extremely small portion of prefetched blocks because its prefetch efficiency is extremely low. For FFT, LU, Radix, and MP3D, the portion of useless blocks under ASP1 is much larger than that under ASP2, because the prefetch efficiencies are high; for PTHOR, the portion of useless blocks under ASP1 is smaller than that under ASP2, because the prefetch efficiency is extremely low. These results are similar to those at medium and low bandwidth, except for Ocean: at medium and low bandwidth, ASP2 transfers more blocks than ASP1.

Figures 9 through 11 show the relative execution times, normalized to SP1, for the three bandwidths. The execution time depends on the miss rate and on the number of transferred blocks: the blocked time is proportional to the miss rate, while the memory access latency is affected by the number of transferred blocks. For application programs with high sequentiality, ASP2 shows a much lower miss rate than SP1 while transferring a similar number of blocks, so ASP2 shows better execution time than SP1; for LU, the execution time of ASP2 is reduced to 77% of that of SP1. For application programs with low sequentiality, ASP2 shows execution time similar to SP1: for MP3D, ASP2 is slightly better, because it shows a slightly lower miss rate while transferring a similar number of blocks; for Ocean, ASP2 is similar, because it shows a slightly lower miss rate but a slightly higher number of transferred blocks; for PTHOR, ASP2 is similar, because the two schemes show similar miss rates and similar numbers of transferred blocks.

ASP2 shows better execution time than ASP1 in some cases, and worse in others. For FFT, ASP2 performs better than ASP1 at low bandwidth, while ASP1 shows better execution time at high bandwidth; the reason is that ASP1 shows a much lower miss rate than ASP2 but transfers many more blocks, and the execution time is more sensitive to the miss rate at high bandwidth but more sensitive to the number of transferred blocks at low bandwidth. LU, Radix, and MP3D show better execution times under ASP2 at all bandwidths, because ASP2 shows a similar miss rate with a smaller number of transferred blocks.

Ocean shows better execution times under ASP2 at high bandwidth, because ASP2 shows a lower miss rate with a smaller number of transferred blocks; but at medium and low bandwidth, ASP2 shows slightly worse execution time than ASP1, because ASP2 shows a lower miss rate but transfers many more blocks. PTHOR prefetches a very small number of blocks, because this program shows extremely low sequentiality; ASP2 shows a slightly lower miss rate but prefetches more blocks than ASP1, so ASP1 shows better execution time at all bandwidths. As expected, ASP2 shows better execution time than SP1 for application programs with high sequentiality and similar execution time for application programs with low sequentiality. For four of the six application programs, ASP2 shows higher prefetch efficiency and better execution time than ASP1, because ASP2 prefetches fewer useless blocks.

6 Conclusion

In this paper, we have analyzed the impact of sequentiality on the sequential prefetching scheme. For application programs with high sequentiality, aggressive sequential prefetching reduces the miss rate significantly and keeps the prefetch efficiency high with a small increase of transferred blocks; we therefore conclude that aggressive sequential prefetching works well for such programs. On the other hand, for application programs with low sequentiality, aggressive sequential prefetching degrades the prefetch efficiency, greatly increases the number of transferred blocks, and yields only a small reduction of the miss rate. We have also proposed a simple hardware sequential prefetching scheme that increases the prefetching degree according to the length of the sequential streams. By simply adding a small table to the sequential prefetching scheme, the proposed scheme reduces the execution time to as little as 77% of that of the sequential prefetching scheme with a prefetching degree of one. For four of the six application programs, the proposed scheme shows better execution time than the scheme proposed by Dahlgren et al.

References

[1] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber, "Comparative evaluation of latency reducing and tolerating techniques," in Proc. of the 18th Annual International Symposium on Computer Architecture, 1991.
[2] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, "The directory-based cache coherence protocol for the DASH multiprocessor," in Proc. of the 17th Annual International Symposium on Computer Architecture, 1990.
[3] E. Gornish, E. Granston, and A. Veidenbaum, "Compiler-directed data prefetching in multiprocessors with memory hierarchies," in Proc. of the International Conference on Supercomputing, 1990.
[4] T. Mowry and A. Gupta, "Tolerating latency through software-controlled prefetching in shared-memory multiprocessors," Journal of Parallel and Distributed Computing, vol. 12, no. 2, 1991.
[5] T. Mowry, M. Lam, and A. Gupta, "Design and evaluation of a compiler algorithm for prefetching," in Proc. of the 5th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1992.
[6] D. Callahan, K. Kennedy, and A. Porterfield, "Software prefetching," in Proc. of the 4th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1991.
[7] J. Baer and T. Chen, "An effective on-chip preloading scheme to reduce data access penalty," in Proc. of Supercomputing '91, 1991.
[8] J. Baer and T. Chen, "Reducing memory latency via non-blocking and prefetching caches," in Proc. of the 5th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1992.
[9] J. Fu, J. Patel, and B. Janssens, "Stride directed prefetching in scalar processors," in Proc. of the 25th International Symposium on Microarchitecture, 1992.
[10] A. Smith, "Cache memories," ACM Computing Surveys, vol. 14, Sep. 1982.
[11] J. Fu and J. Patel, "Data prefetching in multiprocessor vector cache memories," in Proc. of the 18th Annual International Symposium on Computer Architecture, 1991.
[12] F. Dahlgren, M. Dubois, and P. Stenstrom, "Fixed and adaptive sequential prefetching in shared memory multiprocessors," in Proc. of the International Conference on Parallel Processing, 1993.
[13] R. Bianchini and T. LeBlanc, "A preliminary evaluation of cache-miss-initiated prefetching techniques in scalable multiprocessors," Tech. Rep. 515, University of Rochester, 1994.
[14] J. Veenstra, "MINT Tutorial and User Manual," Tech. Rep. 452, Department of Computer Science, University of Rochester.
[15] L. Censier and P. Feautrier, "A new solution to coherence problems in multicache systems," IEEE Transactions on Computers, vol. C-27, no. 12, Dec. 1978.
[16] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in Proc. of the 22nd Annual International Symposium on Computer Architecture, 1995.
[17] J. P. Singh, W. D. Weber, and A. Gupta, "SPLASH: Stanford parallel applications for shared-memory," Computer Architecture News, vol. 20, pp. 5-44, Mar. 1992.


Neighborhood Prefetching on Multiprocessors Using Instruction History Neighborhood Prefetching on Multiprocessors Using Instruction History David M. Koppelman Department of Electrical & Computer Engineering, Louisiana State University koppel@ee.lsu.edu Abstract A multiprocessor

More information

A Best-Offset Prefetcher

A Best-Offset Prefetcher A Best-Offset Prefetcher Pierre Michaud Inria pierre.michaud@inria.fr The Best-Offset (BO) prefetcher submitted to the DPC contest prefetches one line into the level-two (L) cache on every cache miss or

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Speculative Sequential Consistency with Little Custom Storage

Speculative Sequential Consistency with Little Custom Storage Journal of Instruction-Level Parallelism 5(2003) Submitted 10/02; published 4/03 Speculative Sequential Consistency with Little Custom Storage Chris Gniady Babak Falsafi Computer Architecture Laboratory,

More information

Memory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Memory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Memory Hierarchy Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele Large Scale Multiprocessors and Scientific Applications By Pushkar Ratnalikar Namrata Lele Agenda Introduction Interprocessor Communication Characteristics of Scientific Applications Synchronization: Scaling

More information

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

/ : Computer Architecture and Design Fall Final Exam December 4, Name: ID #:

/ : Computer Architecture and Design Fall Final Exam December 4, Name: ID #: 16.482 / 16.561: Computer Architecture and Design Fall 2014 Final Exam December 4, 2014 Name: ID #: For this exam, you may use a calculator and two 8.5 x 11 double-sided page of notes. All other electronic

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Organization Prof. Michel A. Kinsy The course has 4 modules Module 1 Instruction Set Architecture (ISA) Simple Pipelining and Hazards Module 2 Superscalar Architectures

More information

Speculative Sequential Consistency with Little Custom Storage

Speculative Sequential Consistency with Little Custom Storage Appears in Journal of Instruction-Level Parallelism (JILP), 2003. Speculative Sequential Consistency with Little Custom Storage Chris Gniady and Babak Falsafi Computer Architecture Laboratory Carnegie

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection topologies

Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection topologies Computers and Electrical Engineering 26 (2000) 207±220 www.elsevier.com/locate/compeleceng Evaluation of memory latency in cluster-based cachecoherent multiprocessor systems with di erent interconnection

More information

An Efficient Lock Protocol for Home-based Lazy Release Consistency

An Efficient Lock Protocol for Home-based Lazy Release Consistency An Efficient Lock Protocol for Home-based Lazy ease Consistency Hee-Chul Yun Sang-Kwon Lee Joonwon Lee Seungryoul Maeng Computer Architecture Laboratory Korea Advanced Institute of Science and Technology

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 12: Advanced Caching. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 12: Advanced Caching Prof. Onur Mutlu Carnegie Mellon University Announcements Chuck Thacker (Microsoft Research) Seminar Tomorrow RARE: Rethinking Architectural

More information

Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions

Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions Resit Sendag 1, David J. Lilja 1, and Steven R. Kunkel 2 1 Department of Electrical and Computer Engineering Minnesota

More information

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture

More information

A Performance Study of Instruction Cache Prefetching Methods

A Performance Study of Instruction Cache Prefetching Methods IEEE TRANSACTIONS ON COMPUTERS, VOL. 47, NO. 5, MAY 1998 497 A Performance Study of Instruction Cache Prefetching Methods Wei-Chung Hsu and James E. Smith Abstract Prefetching methods for instruction caches

More information

SF-LRU Cache Replacement Algorithm

SF-LRU Cache Replacement Algorithm SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

thester UNIVERSITY OF ' COMPUTER SCIENCE AD-A

thester UNIVERSITY OF ' COMPUTER SCIENCE AD-A AD-A281 502 -- A Preliminary Evaluation of Cache-Miss-Initiated Prefetching Techniques in Scalable Multiprocessors Ricaxdo Bianchini and Thomas J. LeBlanc Technical Report 515 May 1994 DTIC ELECTE JUL1

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor

Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared Memory Multiprocessor In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing Dallas, October 26-28 1994, pp. 612-619. Modelling Accesses to Migratory and Producer-Consumer Characterised Data in a Shared

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic

Fall 2011 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic Fall 2011 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Reading: Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000) If memory

More information

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors Maged M. Michael y, Ashwini K. Nanda z, Beng-Hong Lim z, and Michael L. Scott y y University of Rochester z IBM Research Department

More information

Memory Hierarchy Basics. Ten Advanced Optimizations. Small and Simple

Memory Hierarchy Basics. Ten Advanced Optimizations. Small and Simple Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases capacity and conflict misses, increases miss penalty Larger total cache capacity to reduce miss

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information