Texture Caching

Héctor Antonio Villa Martínez, Universidad de Sonora

April 2006


1. Introduction

This report presents a review of caching architectures used for texture mapping in Computer Graphics. Texture mapping is a technique aimed at giving more realism to computer-generated images, and by its nature it is a memory-intensive process. However, texture mapping exhibits some special features, like locality, which make a cache-based approach desirable. The report is organized as follows: section 2 presents a description of texture mapping and a justification for using a cache dedicated to texturing. Section 3 describes the basic texture caching architecture. The rest of the report reviews the successive improvements to this basic architecture found in the literature: multilevel caching (section 4), prefetching (section 5), parallel texture caching (section 6), hybrid access caching (section 7), multilevel parallel texture caching (section 8), and adaptive indexing (section 9).

2. Texturing

2.1 The graphics pipeline

Defining Computer Graphics is a challenging task. In this report, we will understand Computer Graphics, or more specifically, 3D Computer Graphics, as the field of Computer Science that studies how to generate a 2D computer image from a 3D model. Most Computer Graphics books, for example [2, p. 9] and [16, p. 142], divide the process of creating a 2D image into a series of steps and call these steps the graphics pipeline. There is no consensus on the number or the names of the graphics pipeline stages. To be consistent, we will follow the nomenclature of the book by Akenine-Möller and Haines [2]. They divide the pipeline into three main stages: application, geometry, and rasterizer.

The application stage is where the 3D model is defined. This stage is generally implemented in software. It can be interactive, if the user can create or modify a scene interactively; or it can be static, just reading the scene definition from a text file. In any case, the output of this stage is the primitives defined for the graphics system.

The geometry stage receives the primitives from the application stage and applies a series of geometric operations, like transforms, projections, and clipping, in order to pass the transformed vertices and colors to the rasterizer stage. This stage can be implemented in software, in hardware, or part in software and part in hardware. Because of the complexity of the geometry stage, it is divided into five sub-stages, as depicted in the next figure.

Figure 1. The geometry stage: Model & View Transform, Lighting, Projection, Clipping, and Screen Mapping (based on [2, p. 14]).

The input to the rasterizer stage is the transformed vertices, colors, and texture coordinates from the geometry stage; its job is to assign the correct color to the pixels and render the image correctly. This stage is also divided into sub-stages, which can be implemented in hardware or parallelized in high-end graphics systems.

Figure 2. The rasterizer stage: Triangle Setup, Hidden Surface Removal, Texturing, and Combine (based on [2, p. 20]).

2.2 Texture mapping

Of all the above stages and sub-stages, we are interested only in the texturing sub-stage. Texturing is the process that modifies the appearance of a surface using some image, function, or any other method [2, p. 117]. For example, to render a wall, the graphics system can use a plane and glue the image of a wall onto it. Another example is the rendering of terrain using a random function to decide where a feature will be located. In any case, there is a mapping process between the texture and the object to be textured. For this reason, this process is called texture mapping. As with many Computer Graphics concepts, there is no single definition of texture mapping. Some authors, like Akenine-Möller and Haines [2], make a distinction between image texture mapping (obtaining the texture from an image) and procedural texture mapping (obtaining the texture using a procedure or function).

Other authors, most notably almost all the authors of the texture caching papers reviewed here, consider texture mapping as the process of mapping an image onto an object. Thus, in the rest of this paper, the term texture mapping will mean image texture mapping.

The problem of texture mapping involves mapping a 2D point in the image, called a texel (short for texture element), to a 3D point in the object. It is customary to specify the address of a texel with two coordinates, one horizontal, u, and one vertical, v.

Figure 3. A water image texture, with horizontal coordinate u and vertical coordinate v.

Akenine-Möller and Haines [2, p. 117] and Watt [16, p. 223] present many texture mapping techniques. One of the most popular is mipmapping, which is briefly explained in the next section.

2.3 Mipmapping

Mipmapping was introduced in 1983 by Williams [18]. MIP is an acronym of the Latin phrase multum in parvo, which can be roughly translated as "much in little". Mipmapping has two stages: in the pre-computing step, it stores the original texture at different levels of detail, creating a mipmap; in the rendering step, the program uses this mipmap to texture an object.

The mipmap is generated as follows. The original texture is considered to be the level-0 sub-texture. Then, the level-1 sub-texture is computed as the level-0 sub-texture reduced to a quarter of the original area. This is done by taking four neighboring texels from the level-0 sub-texture, generally a 2 x 2 square, computing their average, and storing the result as one level-1 texel. The process is repeated, generating in each step one new sub-texture with one-quarter of the area of the previous sub-texture, until one or both of the most recent sub-texture's dimensions (u or v) has only one texel.

At rendering time, the first step is deciding which level of detail (LOD) d will be used. Williams recommends a method based on the cell obtained from projecting the screen pixel onto the texture [18]. Akenine-Möller and Haines [2, p. 135] describe another method based on differentials. Regardless of the method used, the result is d as a real number. The meaning is that two sub-textures will be used: one with the LOD given by floor(d), the other with the LOD given by ceiling(d). In each of these two sub-textures, four texels are sampled and bilinearly interpolated. The two results are then linearly interpolated, depending on d, and this is the final value of the screen pixel. This interpolation method is called trilinear interpolation [2, p. 136].
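To make the two steps concrete, here is a minimal Python sketch of mipmap construction and trilinear lookup, under simplifying assumptions: a power-of-two grayscale texture stored as a list of rows, no edge wrapping, and function names that are ours, not taken from the literature.

import math

def build_mipmap(level0):
    """Pre-computing step: each level averages 2 x 2 texel squares of the
    previous level, quartering the area, until one dimension reaches 1 texel."""
    levels = [level0]
    while len(levels[-1]) > 1 and len(levels[-1][0]) > 1:
        prev = levels[-1]
        h, w = len(prev) // 2, len(prev[0]) // 2
        levels.append([[(prev[2*v][2*u] + prev[2*v][2*u+1] +
                         prev[2*v+1][2*u] + prev[2*v+1][2*u+1]) / 4.0
                        for u in range(w)] for v in range(h)])
    return levels

def bilinear(tex, u, v):
    """Sample four neighboring texels of one sub-texture and blend them."""
    u0, v0 = int(u), int(v)
    u1, v1 = min(u0 + 1, len(tex[0]) - 1), min(v0 + 1, len(tex) - 1)
    fu, fv = u - u0, v - v0
    top = tex[v0][u0] * (1 - fu) + tex[v0][u1] * fu
    bot = tex[v1][u0] * (1 - fu) + tex[v1][u1] * fu
    return top * (1 - fv) + bot * fv

def trilinear(levels, u, v, d):
    """Rendering step: bilinearly sample levels floor(d) and ceiling(d)
    (eight texels in total) and linearly interpolate between them on d."""
    lo = max(0, min(int(math.floor(d)), len(levels) - 1))
    hi = max(0, min(int(math.ceil(d)), len(levels) - 1))
    # Texture coordinates shrink by half per mipmap level.
    s_lo = bilinear(levels[lo], u / 2**lo, v / 2**lo)
    s_hi = bilinear(levels[hi], u / 2**hi, v / 2**hi)
    return s_lo + (d - math.floor(d)) * (s_hi - s_lo)

The eight texel reads per output pixel visible in trilinear() are what makes mipmapping memory-intensive, which motivates the caches discussed next.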

2.4 Texture caching

Mipmapping is one of the most practical and efficient methods to filter images [18, p. 142] and reduces many aliasing problems [2, p. 133] [7]. The storage overhead of the mipmap is only about 33% of the original texture [7]. One of the problems of mipmapping is overblurring, although this is most noticeable when the texture is viewed edge-on [2, p. 136]. Another problem is that mipmapping is a memory-intensive process: to compute one pixel, the algorithm needs to access eight texels. Caches are known to improve memory bandwidth on systems that exhibit locality [1] [6, p. 390].

3. Basic texture caching

Hakura and Gupta [5] realized that mipmapping presents both spatial locality and temporal locality. Spatial locality is present because the movement of one pixel on the screen maps to the movement of one texel in the texture. Temporal locality is present due to two features of mipmapping: first, the final value (i.e., color) of the interpolated texel depends on the values of its neighbors, so it is highly probable that, while computing the value of a texel, the process will need the value of a neighboring texel computed recently. The second feature comes from the common practice of repeating a texture to cover a larger geometry.

Hakura and Gupta defined an architecture with a single fragment generator (see figure 4) and studied the impact of using an SRAM (static RAM) cache to store textures.

Figure 4. Hakura and Gupta's architecture: triangles enter a fragment generator, which produces fragments and accesses texels through an SRAM cache backed by DRAM texture memory (based on [5]).
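As a rough illustration of the approach, the following Python sketch simulates a two-way set-associative texture cache over a stream of texel addresses and measures its miss rate. The 16KB capacity and 64-byte lines echo the configurations discussed in this report; the LRU replacement policy and everything else are assumptions of the example, not details of Hakura and Gupta's design.

class SetAssociativeCache:
    """Counts hits and misses for a stream of byte addresses."""
    def __init__(self, capacity=16 * 1024, line_size=64, ways=2):
        self.line_size = line_size
        self.ways = ways
        self.num_sets = capacity // (line_size * ways)
        self.sets = [[] for _ in range(self.num_sets)]  # per set: tags, MRU last
        self.hits = self.misses = 0

    def access(self, address):
        line = address // self.line_size
        index, tag = line % self.num_sets, line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)        # refresh to most-recently-used position
            self.hits += 1
            return True
        self.misses += 1
        if len(tags) == self.ways:
            tags.pop(0)             # evict the least-recently-used line
        tags.append(tag)
        return False

# Toy usage: sequential 4-byte texel reads hit 15 times out of every 16,
# since each 64-byte line holds sixteen such texels.
cache = SetAssociativeCache()
for addr in range(0, 1 << 16, 4):
    cache.access(addr)
print("miss rate:", cache.misses / (cache.hits + cache.misses))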

In traditional mipmapping, the texel's red, blue, and green components are stored separately. Hakura and Gupta found that an alternative technique, which they call the 6D blocked representation, improves the spatial locality and reduces the texture cache conflict misses when the block size is the same as the texture cache line size. Using the 6D blocked representation and tiled rasterization [2, p. 690], Hakura and Gupta conclude that the memory bandwidth of a system with a two-way set-associative 16KB cache is between one-third and one-fifteenth of the memory bandwidth of an equivalent system accessing the texels directly from DRAM memory.

4. Multi-level caching

Cox et al. [4] extended the work of Hakura and Gupta [5] by proposing a two-level cache architecture. They studied the feasibility of using an external texture cache (L2 cache) between the texture memory and the internal texture cache (L1 cache, closer to the GPU). The argument is that in a single-level texture cache system, because of its small size, the texture cache can only handle what the authors call intra-triangle and intra-object locality, that is, the locality exhibited by texturing one triangle or one object with the same texture. However, there are also inter-object and inter-frame localities, if we consider that two or more objects, or even two consecutive frames, can share blocks of texture. The size of these latter working sets is on the order of megabytes, not kilobytes. The goal of the L2 cache is to absorb L1 misses when the intra-triangle and intra-object working set exceeds the L1 size, and to absorb the inter-object and inter-frame working sets. Figure 5 is a block diagram of the proposed architecture.

Figure 5. Two-level cache architecture: CPU, main memory, core logic, texture memory (L3), texture cache (L2), texture cache (L1), and GPU (based on [4]).
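A back-of-the-envelope sketch of the two-level lookup path follows, reusing the SetAssociativeCache class from the previous example. Modeling L2 as a large set-associative cache with 4KB lines is purely an assumption for illustration; as discussed next, Cox et al. actually organize their L2 as virtual memory.

class TwoLevelTextureCache:
    def __init__(self):
        self.l1 = SetAssociativeCache(capacity=16 * 1024)            # on-chip
        self.l2 = SetAssociativeCache(capacity=2 * 1024 * 1024,      # external
                                      line_size=4096, ways=8)

    def access(self, address):
        """Returns the level that serviced the request."""
        if self.l1.access(address):
            return "L1"          # intra-triangle and intra-object locality
        if self.l2.access(address):
            return "L2"          # inter-object and inter-frame locality
        return "memory"          # both levels missed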

In this architecture, the L1 texture cache is a regular two-way set-associative cache, as reported by Hakura and Gupta [5]. The L2 cache size, on the order of megabytes, raises some organization problems. Cox et al. argue that a fully associative cache of this size is not feasible. On the other hand, the two-level cache architecture aggravates the problem of conflict misses (called collisions in that paper) when compared with a single-level cache architecture. It is not trivial to find a good hashing function that leads to good replacement behavior; therefore, it is difficult to organize L2 as a direct-mapped or even as a set-associative cache. The solution is to organize L2 as virtual memory, with a mechanism to translate from virtual texture addresses to physical addresses, and a replacement policy, in this case LRU. Cox et al. report that a 2MB L2 cache, coupled with a 16KB L1 cache, uses 3 to 5 times less local memory and 18 to 140 times less download bandwidth, compared with a single-level texture cache.

5. Prefetching

Igehy et al. [7] noted that one of the problems with texture memory access is the high latency of memory systems. They explain that, while some computing aspects, like memory and logic density, have experienced tremendous growth, memory speed has seen only slight growth. That means that instructions cannot be read from memory as fast as they can be executed by the processor. Thus, sometimes memory latency (the time the memory takes to deliver the data requested) or memory bandwidth (the amount of data per second that can be transferred to or from memory) becomes a bottleneck.

Caching can alleviate the problem of memory bandwidth. Hakura and Gupta [5] showed that a two-way set-associative 16KB texture cache can reduce the memory bandwidth requirements to between one-third and one-fifteenth of those of a system with no cache. However, a cache does not solve the problem of memory latency. Igehy et al. [7] propose a texture cache architecture with prefetching that takes advantage of the access characteristics of texture mapping. Prefetching is a technique where the processor retrieves data or instructions before they are used. With this approach, the processor reduces its waiting time and hides the memory latency to some extent [15, p. 8]. Experimental results show that their architecture can hide most of the memory latency with 97% utilization of hardware resources. Furthermore, the number of pipeline stalls due to multiple misses per fragment is typically less than 1%.

The architecture described by Igehy et al. includes the following components:

- A fragment FIFO, to store the fragments to be textured while the system receives all the texels needed, either from the texture cache or from the texture memory.
- A texture cache, to store some texels according to some policy (most recently used, most recently fetched, et cetera).
- A request FIFO, to store the requests to the texture memory when the texture cache misses on the needed texels.
- A reorder buffer, to store and reorder the texels coming from the texture memory. It can be a FIFO if responses from the texture memory always come in the same order the requests were made.

The architecture processes the fragments as follows:

1. For each fragment, all of its texels are looked up in the texture cache.
2. If all the texels are in the cache, the fragment is forwarded to the fragment FIFO. Otherwise, the missing texels are requested from the texture memory through the request FIFO, and then the fragment is forwarded to the fragment FIFO to wait for the arrival of its missing texels.
3. The missing texels arrive from the texture memory and are stored in the reorder buffer. In order to avoid conflicts with other texels in the cache, the new texels are sent to the cache only when their corresponding fragment is at the head of the fragment FIFO and ready to be textured.
4. The fragment removed from the head of the fragment FIFO has all of its texels in the cache, either because they were already there, or because they were just retrieved from the reorder buffer.
5. The fragment and all of its data are moved to the texture applicator step.

This procedure is shown in figure 6.

Figure 6. Texture caching with prefetching (based on [7]).
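The following Python sketch walks a stream of fragments through the steps above at a coarse level: one new fragment and one batch of memory responses per cycle, a fixed memory latency, and a cache modeled as a plain set of resident texel-line ids. All of these are simplifications for illustration, not parameters from Igehy et al. [7].

from collections import deque

MEM_LATENCY = 20   # cycles from texel request to arrival in the reorder buffer

def simulate(fragments):
    """fragments: list of fragments, each a list of texel-line ids."""
    cache = set()                  # ids of texel lines resident in the cache
    fragment_fifo = deque()        # entries: (fragment, set of in-flight ids)
    request_fifo = deque()         # entries: (texel id, arrival cycle)
    reorder = set()                # texel ids waiting in the reorder buffer
    pending = deque(fragments)
    textured, cycle = [], 0
    while pending or fragment_fifo:
        cycle += 1
        # Steps 1-2: look up the next fragment's texels; request the misses.
        if pending:
            frag = pending.popleft()
            missing = {t for t in frag if t not in cache and t not in reorder}
            for t in missing:      # duplicate in-flight requests can occur
                request_fifo.append((t, cycle + MEM_LATENCY))
            fragment_fifo.append((frag, missing))
        # Step 3: in-order memory responses land in the reorder buffer.
        while request_fifo and request_fifo[0][1] <= cycle:
            reorder.add(request_fifo.popleft()[0])
        # Steps 4-5: when the head fragment's texels are all present, commit
        # them to the cache and send the fragment to the texture applicator.
        if fragment_fifo:
            frag, missing = fragment_fifo[0]
            if all(t in reorder or t in cache for t in missing):
                for t in missing:
                    cache.add(t)
                    reorder.discard(t)
                fragment_fifo.popleft()
                textured.append(frag)
    return textured, cycle

# Toy usage: two fragments sharing texel lines 3 and 4.
done, cycles = simulate([[1, 2, 3, 4], [3, 4, 5, 6]])

Note how the fragment FIFO lets the pipeline keep accepting new fragments while earlier misses are still in flight; this is the latency hiding the prefetching architecture is designed for.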

6. Parallel texture caching

Another way of improving performance is replicating some or all of the stages of the basic graphics pipeline. Parallel rasterization is the process in charge of distributing work to the copies of the rasterization stages. Igehy et al. [8] were among the first to study the benefits of texture caching in a parallel architecture. They argue that previous serial architectures have been effective in minimizing bandwidth and/or memory latency because of the locality of reference exhibited by the texture mapping process, and that this locality decreases in a parallel architecture. They describe a real-time graphics system which uses 1, 2, or 4 fragment generators. Each fragment generator is coupled with an independent texture memory. The texture memory is replicated up to 4 times because any fragment generator needs to be able to access any texture data. As a result, the texture subsystems made minimal use of texture locality.

In order to study the effect of parallel rasterization on texture locality, Igehy et al. classified parallel rasterization architectures along two main axes: texture memory architecture and rasterization algorithm.

Regarding texture memory architecture, the authors describe two schemes:

a) Dedicated texture memory. Each texturing unit has its own texture memory that holds a copy of the texture data. Caching can reduce texture memory bandwidth.
b) Shared texture memory. There is one copy of the texture data, distributed among the distinct texture memories. A texture sorting network allows any texturing unit to access any texture memory. Caching can reduce network and memory bandwidth.

Regarding rasterization algorithms, there are three considerations:

a) How work is partitioned. There are two options: by image-space, where the screen is divided into blocks and each block is assigned to a texturing unit; or by object-space, where an object is decomposed into fragments and each fragment is assigned to a texturing unit (a small sketch of the image-space option appears at the end of this section).
b) The order in which a texturing unit processes fragments. It can be the original order in which primitives are presented (primitive order), or some other order, like block order, where all fragments of a block are processed before any fragment of the next block.
c) The order in which fragments destined for the same location are processed. Here, the authors only consider that fragments are processed in the order they are presented by the application.

As a result of their study, Igehy et al. demonstrated that parallel texture caching works well from 1 to 64 texture units. They also showed that the shared texture memory architecture has better performance than dedicated texture memory, not only because it does not replicate textures, but also because it distributes contention over the different memories. In this way, shared texture memory uses bandwidth that would be unused in a dedicated texture memory architecture. They confirmed that increasing the parallelism decreases the working set size; a 16KB texture cache produces good results in parallel architectures with 2 up to 64 texture units. Finally, they concluded that parallel texture caching is general enough to work with both image-space and object-space partitioning, as well as with primitive order and block order architectures.
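As promised above, here is a small Python sketch of the image-space option: the screen is divided into fixed-size tiles and each tile is statically assigned to one of N texturing units. The tile size and the interleaving formula are assumptions for illustration, not details taken from Igehy et al. [8].

TILE = 32                                   # tile edge, in pixels (assumed)
NUM_UNITS = 4                               # parallel texturing units

def unit_for_fragment(x, y):
    """Statically map a fragment's screen position to a texturing unit."""
    tile_x, tile_y = x // TILE, y // TILE
    # Interleave neighboring tiles across units so a large primitive
    # spreads its fragments over several texture caches.
    return (tile_x + 3 * tile_y) % NUM_UNITS

# Each unit keeps its own texture cache and processes only the fragments
# routed to it; with shared texture memory, its misses may be serviced by
# any memory partition through the sorting network.
queues = [[] for _ in range(NUM_UNITS)]
for x, y in [(5, 7), (40, 7), (200, 130)]:  # toy fragment positions
    queues[unit_for_fragment(x, y)].append((x, y))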

7. Hybrid access caching

Theobald et al. [14] observe that while direct-mapped caches have better access times, set-associative caches have lower miss rates. In order to get the best from each of the two architectures, they propose the Hybrid Access Cache (HAC) model. In the HAC model, a hybrid cache consists of two parts: a direct-mapped primary section and a slower, usually fully-associative, secondary section.

Choi et al. [3] analyzed the distribution of texture cache misses and observed that in many texture mapping applications there is a significant number of conflict misses. To reduce these conflict misses, they evaluated three hybrid access cache systems: victim cache, half-and-half cache, and cooperative cache. Before showing the results, here is a short review of these three approaches (a sketch of the first appears after this list):

- Victim cache. First proposed by Jouppi [9]. It is a small fully-associative cache (1 to 5 lines in Jouppi's paper) that is added to the main cache. The victim cache stores the most recently discarded elements (victims) in case they are needed again. Hennessy and Patterson [6, p. 449] affirm that the victim cache helps to reduce both miss penalty and miss rate.
- Half-and-half cache. First proposed by Theobald et al. [14]. Half of the lines in this cache are direct-mapped and the other half are set-associative. The authors claim that for moderate-sized caches, the half-and-half cache has better performance than other hybrid caches in most applications.
- Cooperative cache. Explained in [12]. It consists of a direct-mapped temporal-oriented cache and a four-way set-associative spatial-oriented cache. Each cache has an 8KB capacity but a different block size. This arrangement is aimed at reducing power consumption, because the two caches are designed to help each other.

Choi et al. compared the three architectures by miss rate and AMAC (average memory access cycles) for various cache sizes, associativities, and cache line sizes, and concluded that the victim cache is the most suitable for a texture cache system. For small caches (8KB), the victim cache shows a performance improvement of 19% when compared with a conventional cache system. However, from the authors' tables, it is not clear which architecture, between victim cache and half-and-half cache, is the best one for a cache size of 16KB.
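Since Choi et al. single out the victim cache, here is a minimal Python sketch of the idea: a direct-mapped main cache whose evicted lines drop into a small fully-associative buffer and can be swapped back on a later miss. The 4-entry buffer follows Jouppi's 1-to-5-line range [9]; the remaining sizes and the FIFO victim replacement are assumptions for illustration, not Choi et al.'s configuration.

from collections import OrderedDict

class VictimCache:
    def __init__(self, lines=128, line_size=64, victim_entries=4):
        self.line_size = line_size
        self.lines = [None] * lines            # direct-mapped tag array
        self.victims = OrderedDict()           # small fully-associative buffer
        self.victim_entries = victim_entries
        self.hits = self.misses = 0

    def access(self, address):
        tag = address // self.line_size        # full line number as the tag
        index = tag % len(self.lines)
        if self.lines[index] == tag:           # direct-mapped hit
            self.hits += 1
            return True
        if tag in self.victims:                # victim hit: swap the two lines
            del self.victims[tag]
            self._evict_to_victims(self.lines[index])
            self.lines[index] = tag
            self.hits += 1
            return True
        self.misses += 1                       # full miss: fetch from memory
        self._evict_to_victims(self.lines[index])
        self.lines[index] = tag
        return False

    def _evict_to_victims(self, tag):
        if tag is None:
            return
        if len(self.victims) == self.victim_entries:
            self.victims.popitem(last=False)   # drop the oldest victim
        self.victims[tag] = True

The point of the swap on a victim hit is that two lines mapping to the same direct-mapped slot can ping-pong between the main cache and the victim buffer instead of repeatedly missing to memory.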

8. Multilevel parallel texture caching

Park et al. [13] combined two previous ideas, multi-level caching [4] and parallel texture caching [8], into what they call multilevel parallel texture caching. The architecture has two levels of cache memory and four types of components overall, all integrated on a single chip: an 8MB DRAM L2 cache memory, eight 8-way 16KB SRAM L1 cache memories in parallel, eight pipelined texture filter modules, and a serial-to-parallel loader. See the next figure.

Figure 7. Multilevel parallel texture cache: the AGP or PCI bus feeds a serial-to-parallel latch and the 8MB L2 DRAM, which serves eight 16KB L1 SRAM caches (C0 to C7) over the IBUS; each L1 cache feeds a texture filter and its graphics pipeline (based on [13]).

The main results and features of the multilevel parallel texture cache are as follows:

- The L2 cache reduces the required bandwidth of the AGP (Accelerated Graphics Port) or PCI (Peripheral Component Interconnect) bus by 20 times, for a 1024 x 768 screen resolution, by exploiting the texture data coherency between successive frames.
- The eight independent L1 caches remove the access conflicts in the parallel graphics pipelines and allow them to run at full speed.
- Using a new data transfer scheme proposed by Park et al., the internal bus (IBUS) has a bandwidth of up to 75 GB/s. With this bandwidth, the L2 cache is able to service the L1 caches without starvation.
- The architecture is reconfigurable, because the cache line sizes of L1 and L2 can be 4 x 4, 8 x 8, or 16 x 16 pixels. The intention is to maintain optimal cache performance, in terms of cache misses, for different graphics applications.

9. Adaptive indexing

Adaptive indexing, first proposed as ACI (adaptive cache indexing) by Kim et al. [10] and later renamed A-index by Kim and Kim [11], is a technique aimed at reducing the miss rate in a texture cache.

Kim and Kim [11] note that most texture caching methods use the texture u coordinate as the cache index (u-index). This is fine in most cases, because u is the horizontal coordinate and most rasterization algorithms are horizontally oriented. However, there are cases where a texture access pattern is vertically oriented or, in their nomenclature, v-major, and using the u coordinate to index the texture cache results in a higher cache miss rate. In this case, using the v-index gives better cache performance. To understand the concepts of an access being u-major or v-major, consider the following figure, which represents two lines.

Figure 8. Example of u-major and v-major access directions.

The blue (left) line is v-major because Δv > Δu, i.e., the vertical coordinate changes faster than the horizontal coordinate. On the other hand, the red (right) line is u-major because, in that case, Δv < Δu.

The decision of which index, between u-index and v-index, will be used to access the texture cache is made at address-generation time, based on the current and the previous texture sampling points. If Δu, the difference between the current and previous u coordinates, is greater than Δv, the difference between the current and previous v coordinates, then the direction is u-major; otherwise, the direction is v-major.

Using adaptive indexing with a 16KB, two-way set-associative cache with a 64-byte (4 x 4 texels) line size, and scenes of 640 x 480 pixels, Kim et al. [10] report a reduction in cache misses of about 23% compared with a traditional texture caching architecture with no adaptive indexing. The total number of cycles is reduced by 8.9%. Kim and Kim [11] report reductions of 21.6% and 8.8% in cache misses and cycles, respectively.
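The selection rule can be written down directly. In the Python sketch below, the number of cache sets and the way the chosen coordinate is folded into a set index are assumptions for illustration, not the hardware of [10, 11].

NUM_SETS = 256   # assumed; a 16KB two-way cache with 64-byte lines has 128 sets

def choose_index(curr, prev):
    """curr and prev are (u, v) texture sampling points."""
    du = abs(curr[0] - prev[0])
    dv = abs(curr[1] - prev[1])
    # If du > dv the direction is u-major; otherwise it is v-major.
    major = curr[0] if du > dv else curr[1]
    return int(major) % NUM_SETS    # derive the set index from the major axis

# Example: a nearly vertical access pattern is detected as v-major, so
# consecutive samples spread across different cache sets instead of
# colliding in the few sets a u-index would select.
print(choose_index((10.2, 57.0), (10.1, 53.0)))   # dv > du, so v-major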

Using adaptive indexing in a texture cache system requires adding a 1-bit register per cache line, to distinguish between a u-index and a v-index, and some logic to compare two texture sampling points. Kim et al. [10] report that for a 16KB texture cache, the total hardware overhead is about 7.7% extra NAND-equivalent gates.

References

[1] Akeley K, Hanrahan P. CS448A: Real-Time Graphics Architectures. Course notes, Stanford University. Accessed 2/21/06.
[2] Akenine-Möller T, Haines E. Real-Time Rendering, 2nd edition. A. K. Peters: Natick, Mass.
[3] Choi CJ, Park GH, Lee JH, Park WC, Han TD. Performance comparison of various cache systems for texture mapping. The Fourth International Conference on High-Performance Computing in the Asia-Pacific Region.
[4] Cox M, Bhandari N, Shantz M. Multi-level texture caching for 3D graphics hardware. Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA).
[5] Hakura ZS, Gupta A. The design and analysis of a cache architecture for texture mapping. Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA).
[6] Hennessy JL, Patterson DA. Computer Architecture: A Quantitative Approach, 3rd edition. Morgan Kaufmann: San Francisco.
[7] Igehy H, Eldridge M, Proudfoot K. Prefetching in a texture cache architecture. Proceedings of the ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware.
[8] Igehy H, Eldridge M, Hanrahan P. Parallel texture caching. Proceedings of the ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware.
[9] Jouppi NP. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA).
[10] Kim CH, Im YH, Kim LS. Miss-rate reduction in texture cache by adaptive cache indexing. Electronics Letters, vol. 40.
[11] Kim CH, Kim LS. Adaptive selection of an index in a texture cache. Proceedings of the IEEE International Conference on Computer Design (ICCD).

[12] Park GH, Lee KW, Lee JH, Han TD, Kim SD. A power efficient cache structure for embedded processors based on the dual cache structure. ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems (LCTES).
[13] Park SJ, Kim JS, Woo R, Lee SJ, Lee KM, Yang TH, Jung JY, Yoo HJ. A reconfigurable multilevel parallel texture cache memory with 75GB/s parallel cache replacement bandwidth. IEEE Journal of Solid-State Circuits, vol. 37.
[14] Theobald KB, Hum HHJ, Gao GR. A design framework for hybrid-access caches. 1st IEEE Symposium on High-Performance Computer Architecture (HPCA).
[15] Van der Pas R. Memory Hierarchy in Cache-Based Systems. Technical report, Sun Microsystems. Accessed 2/27/2006.
[16] Watt A. 3D Computer Graphics, 3rd edition. Addison-Wesley: Harlow, England.
[17] Watt A, Watt M. Advanced Animation and Rendering Techniques: Theory and Practice. Addison-Wesley: Harlow, England.
[18] Williams L. Pyramidal parametrics. Proceedings of the 10th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 1983.
