Texture Caching

Héctor Antonio Villa Martínez, Universidad de Sonora

April 2006


1. Introduction

This report presents a review of caching architectures used for texture mapping in Computer Graphics. Texture mapping is a technique aimed at giving more realism to computer-generated images, and by its nature it is a memory-intensive process. However, texture mapping exhibits some special features, like locality, which make a cache-based approach desirable. The report is organized as follows: section 2 presents a description of texture mapping and a justification for using a cache dedicated to texturing. Section 3 describes the basic texture caching architecture. The rest of the report reviews the successive improvements to this basic architecture found in the literature: multilevel caching (section 4), prefetching (section 5), parallel texture caching (section 6), hybrid access caching (section 7), multilevel parallel texture caching (section 8), and adaptive indexing (section 9).

2. Texturing

2.1 The graphics pipeline

Defining Computer Graphics is a challenging task. In this report, we will understand Computer Graphics, or more specifically, 3D Computer Graphics, as the field of Computer Science that studies how to generate a 2D computer image from a 3D model. Most Computer Graphics books, for example [2, p. 9] and [16, p. 142], divide the process of creating a 2D image into a series of steps and call these steps the graphics pipeline. There is no consensus on the number or the names of the graphics pipeline stages. To be consistent, we will follow the nomenclature of the book by Akenine-Möller and Haines [2]. They divide the pipeline into three main stages: application, geometry, and rasterizer.

The application stage is where the 3D model is defined. This stage is generally implemented in software. It can be interactive, if the user can create or modify a scene interactively; or it can be static, just reading the scene definition from a text file. In any case, the output of this stage is the primitives defined for the graphics system.

The geometry stage receives the primitives from the application stage and applies a series of geometric operations, like transforms, projections, and clipping, in order to pass the transformed vertices and colors to the rasterizer stage. This stage can be implemented in software, in hardware, or part in software and part in hardware. Because of the complexity of the geometry stage, it is divided into five sub-stages, as depicted in the next figure.

Figure 1. The geometry stage: Model & View Transform, Lighting, Projection, Clipping, and Screen Mapping (based on [2, p. 14]).

The input to the rasterizer stage is the transformed vertices, colors, and texture coordinates from the geometry stage; its job is to assign the correct color to the pixels and render the image correctly. This stage is also divided into sub-stages, which can be implemented in hardware or parallelized in high-end graphics systems.

Figure 2. The rasterizer stage: Triangle Setup, Hidden Surface Removal, Texturing, and Combine (based on [2, p. 20]).

2.2 Texture mapping

Of all the above stages and sub-stages, we are interested only in the texturing sub-stage. Texturing is the process that modifies the appearance of a surface using some image, function, or any other method [2, p. 117]. For example, to render a wall, the graphics system can use a plane and glue the image of a wall onto it. Another example is the rendering of terrain using a random function to decide where a feature will be located. In any case, there is a mapping process between the texture and the object to be textured. For this reason, this process is called texture mapping. As with many Computer Graphics concepts, there is no single definition of texture mapping. Some authors, like Akenine-Möller and Haines [2], make a distinction between image texture mapping (obtaining the texture from an image) and procedural texture mapping (obtaining the texture using a procedure or function).

Other authors, most notably almost all the authors of the texture caching papers reviewed here, consider texture mapping as the process of mapping an image onto an object. Thus, in the rest of this paper, the term texture mapping will mean image texture mapping.

The problem of texture mapping involves mapping a 2D point in the image, called a texel (short for texture element), to a 3D point in the object. It is customary to specify the address of a texel with two coordinates, one horizontal, u, and one vertical, v.

Figure 3. A water image texture, with horizontal coordinate u and vertical coordinate v.

Akenine-Möller and Haines [2, p. 117] and Watt [16, p. 223] present many texture mapping techniques. One of the most popular is mipmapping, which is briefly explained in the next section.

2.3 Mipmapping

Mipmapping was introduced in 1983 by Williams [18]. MIP is an acronym of the Latin phrase multum in parvo, which can be roughly translated as "much in little". Mipmapping has two stages: in the pre-computing step, it stores the original texture at different levels of detail, creating a mipmap; in the rendering step, the program uses this mipmap to texture an object.

The mipmap is generated as follows. The original texture is considered to be the level-0 sub-texture. Then, the level-1 sub-texture is computed as the level-0 sub-texture reduced to a quarter of the original area. This is done by taking four neighboring texels from the level-0 sub-texture, generally a 2 x 2 square, computing their average, and storing the result as one level-1 texel. The process is repeated, generating in each step one new sub-texture with one-quarter of the area of the previous sub-texture, until one or both of the most recent sub-texture's dimensions (u or v) has only one texel.

At rendering time, the first step is deciding which level of detail (LOD) d will be used. Williams recommends a method based on the cell obtained from projecting the screen pixel onto the texture [18]. Akenine-Möller and Haines [2, p. 135] describe another method based on differentials. Regardless of the method used, the result is d as a real number. The meaning is that two sub-textures will be used: one with the LOD given by floor(d), the other with the LOD given by ceiling(d). In each of these two sub-textures, four texels are sampled and bilinearly interpolated. The two results are then linearly interpolated, depending on d, and this is the final value of the screen pixel. This interpolation method is called trilinear interpolation [2, p. 136].
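To make the two steps concrete, here is a minimal Python sketch of mipmap construction and trilinear lookup, under simplifying assumptions: a power-of-two grayscale texture stored as a list of rows, no edge wrapping, and function names that are ours, not taken from the literature.

import math

def build_mipmap(level0):
    """Pre-computing step: each level averages 2 x 2 texel squares of the
    previous level, quartering the area, until one dimension reaches 1 texel."""
    levels = [level0]
    while len(levels[-1]) > 1 and len(levels[-1][0]) > 1:
        prev = levels[-1]
        h, w = len(prev) // 2, len(prev[0]) // 2
        levels.append([[(prev[2*v][2*u] + prev[2*v][2*u+1] +
                         prev[2*v+1][2*u] + prev[2*v+1][2*u+1]) / 4.0
                        for u in range(w)] for v in range(h)])
    return levels

def bilinear(tex, u, v):
    """Sample four neighboring texels of one sub-texture and blend them."""
    u0, v0 = int(u), int(v)
    u1, v1 = min(u0 + 1, len(tex[0]) - 1), min(v0 + 1, len(tex) - 1)
    fu, fv = u - u0, v - v0
    top = tex[v0][u0] * (1 - fu) + tex[v0][u1] * fu
    bot = tex[v1][u0] * (1 - fu) + tex[v1][u1] * fu
    return top * (1 - fv) + bot * fv

def trilinear(levels, u, v, d):
    """Rendering step: bilinearly sample levels floor(d) and ceiling(d)
    (eight texels in total) and linearly interpolate between them on d."""
    lo = max(0, min(int(math.floor(d)), len(levels) - 1))
    hi = max(0, min(int(math.ceil(d)), len(levels) - 1))
    # Texture coordinates shrink by half per mipmap level.
    s_lo = bilinear(levels[lo], u / 2**lo, v / 2**lo)
    s_hi = bilinear(levels[hi], u / 2**hi, v / 2**hi)
    return s_lo + (d - math.floor(d)) * (s_hi - s_lo)

The eight texel reads per output pixel visible in trilinear() are what makes mipmapping memory-intensive, which motivates the caches discussed next.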

2.4 Texture caching

Mipmapping is one of the most practical and efficient methods to filter images [18, p. 142] and reduces many aliasing problems [2, p. 133] [7]. The storage overhead of the mipmap is only about 33% of the original texture [7]. One of the problems of mipmapping is overblurring, although this is most noticeable when the texture is viewed edge-on [2, p. 136]. Another problem is that mipmapping is a memory-intensive process: to compute one pixel, the algorithm needs to access eight texels. Caches are known to improve memory bandwidth on systems that exhibit locality [1] [6, p. 390].

3. Basic texture caching

Hakura and Gupta [5] realized that mipmapping presents both spatial locality and temporal locality. Spatial locality is present because the movement of one pixel on the screen maps to the movement of one texel in the texture. Temporal locality is present due to two features of mipmapping: first, the final value (i.e., color) of the interpolated texel depends on the values of its neighbors, so it is highly probable that, while computing the value of a texel, the process will need the value of a neighboring texel computed recently. The second feature comes from the common practice of repeating a texture to cover a larger geometry.

Hakura and Gupta defined an architecture with a single fragment generator (see figure 4) and studied the impact of using an SRAM (static RAM) cache to store textures.

Figure 4. Hakura and Gupta's architecture: triangles enter a fragment generator, which produces fragments and accesses texels through an SRAM cache backed by DRAM texture memory (based on [5]).
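As a rough illustration of the approach, the following Python sketch simulates a two-way set-associative texture cache over a stream of texel addresses and measures its miss rate. The 16KB capacity and 64-byte lines echo the configurations discussed in this report; the LRU replacement policy and everything else are assumptions of the example, not details of Hakura and Gupta's design.

class SetAssociativeCache:
    """Counts hits and misses for a stream of byte addresses."""
    def __init__(self, capacity=16 * 1024, line_size=64, ways=2):
        self.line_size = line_size
        self.ways = ways
        self.num_sets = capacity // (line_size * ways)
        self.sets = [[] for _ in range(self.num_sets)]  # per set: tags, MRU last
        self.hits = self.misses = 0

    def access(self, address):
        line = address // self.line_size
        index, tag = line % self.num_sets, line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)        # refresh to most-recently-used position
            self.hits += 1
            return True
        self.misses += 1
        if len(tags) == self.ways:
            tags.pop(0)             # evict the least-recently-used line
        tags.append(tag)
        return False

# Toy usage: sequential 4-byte texel reads hit 15 times out of every 16,
# since each 64-byte line holds sixteen such texels.
cache = SetAssociativeCache()
for addr in range(0, 1 << 16, 4):
    cache.access(addr)
print("miss rate:", cache.misses / (cache.hits + cache.misses))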

In traditional mipmapping, the texel's red, blue, and green components are stored separately. Hakura and Gupta found that an alternative technique, which they call the 6D blocked representation, improves the spatial locality and reduces the texture cache conflict misses when the block size is the same as the texture cache line size. Using the 6D blocked representation and tiled rasterization [2, p. 690], Hakura and Gupta conclude that the memory bandwidth of a system with a two-way set-associative 16KB cache is between one-third and one-fifteenth of the memory bandwidth of an equivalent system accessing the texels directly from DRAM memory.

4. Multi-level caching

Cox et al. [4] extended the work of Hakura and Gupta [5] by proposing a two-level cache architecture. They studied the feasibility of using an external texture cache (L2 cache) between the texture memory and the internal texture cache (L1 cache, closer to the GPU). The argument is that in a single-level texture cache system, because of its small size, the texture cache can only handle what the authors call intra-triangle and intra-object locality, that is, the locality exhibited by texturing one triangle or one object with the same texture. However, there are also inter-object and inter-frame localities, if we consider that two or more objects, or even two consecutive frames, can share blocks of texture. The size of these latter working sets is on the order of megabytes, not kilobytes. The goal of the L2 cache is to absorb L1 misses when the intra-triangle and intra-object working set exceeds the L1 size, and to absorb the inter-object and inter-frame working sets. Figure 5 is a block diagram of the proposed architecture.

Figure 5. Two-level cache architecture: CPU, main memory, core logic, texture memory (L3), texture cache (L2), texture cache (L1), and GPU (based on [4]).
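A back-of-the-envelope sketch of the two-level lookup path follows, reusing the SetAssociativeCache class from the previous example. Modeling L2 as a large set-associative cache with 4KB lines is purely an assumption for illustration; as discussed next, Cox et al. actually organize their L2 as virtual memory.

class TwoLevelTextureCache:
    def __init__(self):
        self.l1 = SetAssociativeCache(capacity=16 * 1024)            # on-chip
        self.l2 = SetAssociativeCache(capacity=2 * 1024 * 1024,      # external
                                      line_size=4096, ways=8)

    def access(self, address):
        """Returns the level that serviced the request."""
        if self.l1.access(address):
            return "L1"          # intra-triangle and intra-object locality
        if self.l2.access(address):
            return "L2"          # inter-object and inter-frame locality
        return "memory"          # both levels missed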

In this architecture, the L1 texture cache is a regular two-way set-associative cache, as reported by Hakura and Gupta [5]. The L2 cache size, on the order of megabytes, raises some organization problems. Cox et al. argue that a fully associative cache of this size is not feasible. On the other hand, the two-level cache architecture aggravates the problem of conflict misses (called collisions in that paper) when compared with a single-level cache architecture. It is not trivial to find a good hashing function that leads to good replacement behavior; therefore, it is difficult to organize L2 as a direct-mapped or even as a set-associative cache. The solution is to organize L2 as virtual memory, with a mechanism to translate from virtual texture addresses to physical addresses, and a replacement policy, in this case LRU. Cox et al. report that a 2MB L2 cache, coupled with a 16KB L1 cache, uses 3 to 5 times less local memory and 18 to 140 times less download bandwidth, compared with a single-level texture cache.

5. Prefetching

Igehy et al. [7] noted that one of the problems with texture memory access is the high latency of memory systems. They explain that, while some computing aspects, like memory and logic density, have experienced tremendous growth, memory speed has seen only slight growth. That means that instructions cannot be read from memory as fast as they can be executed by the processor. Thus, sometimes memory latency (the time the memory takes to deliver the data requested) or memory bandwidth (the amount of data per second that can be transferred to or from memory) becomes a bottleneck.

Caching can alleviate the problem of memory bandwidth. Hakura and Gupta [5] showed that a two-way set-associative 16KB texture cache can reduce the memory bandwidth requirements to between one-third and one-fifteenth of those of a system with no cache. However, a cache does not solve the problem of memory latency. Igehy et al. [7] propose a texture cache architecture with prefetching that takes advantage of the access characteristics of texture mapping. Prefetching is a technique where the processor retrieves data or instructions before they are used. With this approach, the processor reduces its waiting time and hides the memory latency to some extent [15, p. 8]. Experimental results show that their architecture can hide most of the memory latency with 97% utilization of hardware resources. Furthermore, the number of pipeline stalls due to multiple misses per fragment is typically less than 1%.

The architecture described by Igehy et al. includes the following components:

- A fragment FIFO, to store the fragments to be textured while the system receives all the texels needed, either from the texture cache or from the texture memory.
- A texture cache, to store some texels according to some policy (most recently used, most recently fetched, et cetera).
- A request FIFO, to store the requests to the texture memory when the texture cache misses on the needed texels.
- A reorder buffer, to store and reorder the texels coming from the texture memory. It can be a FIFO if responses from the texture memory always come in the same order the requests were made.

The architecture processes the fragments as follows:

1. For each fragment, all of its texels are looked up in the texture cache.
2. If all the texels are in the cache, the fragment is forwarded to the fragment FIFO. Otherwise, the missing texels are requested from the texture memory through the request FIFO, and then the fragment is forwarded to the fragment FIFO to wait for the arrival of its missing texels.
3. The missing texels arrive from the texture memory and are stored in the reorder buffer. In order to avoid conflicts with other texels in the cache, the new texels are sent to the cache only when their corresponding fragment is at the head of the fragment FIFO and ready to be textured.
4. The fragment removed from the head of the fragment FIFO has all of its texels in the cache, either because they were already there, or because they were just retrieved from the reorder buffer.
5. The fragment and all of its data are moved to the texture applicator step.

This procedure is shown in figure 6.

Figure 6. Texture caching with prefetching (based on [7]).
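The following Python sketch walks a stream of fragments through the steps above at a coarse level: one new fragment and one batch of memory responses per cycle, a fixed memory latency, and a cache modeled as a plain set of resident texel-line ids. All of these are simplifications for illustration, not parameters from Igehy et al. [7].

from collections import deque

MEM_LATENCY = 20   # cycles from texel request to arrival in the reorder buffer

def simulate(fragments):
    """fragments: list of fragments, each a list of texel-line ids."""
    cache = set()                  # ids of texel lines resident in the cache
    fragment_fifo = deque()        # entries: (fragment, set of in-flight ids)
    request_fifo = deque()         # entries: (texel id, arrival cycle)
    reorder = set()                # texel ids waiting in the reorder buffer
    pending = deque(fragments)
    textured, cycle = [], 0
    while pending or fragment_fifo:
        cycle += 1
        # Steps 1-2: look up the next fragment's texels; request the misses.
        if pending:
            frag = pending.popleft()
            missing = {t for t in frag if t not in cache and t not in reorder}
            for t in missing:      # duplicate in-flight requests can occur
                request_fifo.append((t, cycle + MEM_LATENCY))
            fragment_fifo.append((frag, missing))
        # Step 3: in-order memory responses land in the reorder buffer.
        while request_fifo and request_fifo[0][1] <= cycle:
            reorder.add(request_fifo.popleft()[0])
        # Steps 4-5: when the head fragment's texels are all present, commit
        # them to the cache and send the fragment to the texture applicator.
        if fragment_fifo:
            frag, missing = fragment_fifo[0]
            if all(t in reorder or t in cache for t in missing):
                for t in missing:
                    cache.add(t)
                    reorder.discard(t)
                fragment_fifo.popleft()
                textured.append(frag)
    return textured, cycle

# Toy usage: two fragments sharing texel lines 3 and 4.
done, cycles = simulate([[1, 2, 3, 4], [3, 4, 5, 6]])

Note how the fragment FIFO lets the pipeline keep accepting new fragments while earlier misses are still in flight; this is the latency hiding the prefetching architecture is designed for.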

6. Parallel texture caching

Another way of improving performance is replicating some or all of the stages of the basic graphics pipeline. Parallel rasterization is the process in charge of distributing work to the copies of the rasterization stages. Igehy et al. [8] were among the first to study the benefits of texture caching in a parallel architecture. They argue that previous serial architectures have been effective in minimizing bandwidth and/or memory latency because of the locality of reference exhibited by the texture mapping process, and that this locality decreases in a parallel architecture. They describe a real-time graphics system which uses 1, 2, or 4 fragment generators. Each fragment generator is coupled with an independent texture memory. The texture memory is replicated up to 4 times because any fragment generator needs to be able to access any texture data. As a result, the texture subsystems made minimal use of texture locality.

In order to study the effect of parallel rasterization on texture locality, Igehy et al. classified parallel rasterization architectures along two main axes: texture memory architecture and rasterization algorithm.

Regarding texture memory architecture, the authors describe two schemes:

a) Dedicated texture memory. Each texturing unit has its own texture memory that holds a copy of the texture data. Caching can reduce texture memory bandwidth.
b) Shared texture memory. There is one copy of the texture data, distributed among the distinct texture memories. A texture sorting network allows any texturing unit to access any texture memory. Caching can reduce network and memory bandwidth.

Regarding rasterization algorithms, there are three considerations:

a) How work is partitioned. There are two options: by image-space, where the screen is divided into blocks and each block is assigned to a texturing unit; or by object-space, where an object is decomposed into fragments and each fragment is assigned to a texturing unit (a small sketch of the image-space option appears at the end of this section).
b) The order in which a texturing unit processes fragments. It can be the original order in which primitives are presented (primitive order), or some other order, like block order, where all fragments of a block are processed before any fragment of the next block.
c) The order in which fragments destined for the same location are processed. Here, the authors only consider that fragments are processed in the order they are presented by the application.

As a result of their study, Igehy et al. demonstrated that parallel texture caching works well from 1 to 64 texture units. They also showed that the shared texture memory architecture has better performance than dedicated texture memory, not only because it does not replicate textures, but also because it distributes contention over the different memories. In this way, shared texture memory uses bandwidth that would be unused in a dedicated texture memory architecture. They confirmed that increasing the parallelism decreases the working set size; a 16KB texture cache produces good results in parallel architectures with 2 up to 64 texture units. Finally, they concluded that parallel texture caching is general enough to work with both image-space and object-space partitioning, as well as with primitive order and block order architectures.
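As promised above, here is a small Python sketch of the image-space option: the screen is divided into fixed-size tiles and each tile is statically assigned to one of N texturing units. The tile size and the interleaving formula are assumptions for illustration, not details taken from Igehy et al. [8].

TILE = 32                                   # tile edge, in pixels (assumed)
NUM_UNITS = 4                               # parallel texturing units

def unit_for_fragment(x, y):
    """Statically map a fragment's screen position to a texturing unit."""
    tile_x, tile_y = x // TILE, y // TILE
    # Interleave neighboring tiles across units so a large primitive
    # spreads its fragments over several texture caches.
    return (tile_x + 3 * tile_y) % NUM_UNITS

# Each unit keeps its own texture cache and processes only the fragments
# routed to it; with shared texture memory, its misses may be serviced by
# any memory partition through the sorting network.
queues = [[] for _ in range(NUM_UNITS)]
for x, y in [(5, 7), (40, 7), (200, 130)]:  # toy fragment positions
    queues[unit_for_fragment(x, y)].append((x, y))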

7. Hybrid access caching

Theobald et al. [14] observe that while direct-mapped caches have better access times, set-associative caches have lower miss rates. In order to get the best from each of the two architectures, they propose the Hybrid Access Cache (HAC) model. In the HAC model, a hybrid cache consists of two parts: a direct-mapped primary section and a slower, usually fully-associative, secondary section.

Choi et al. [3] analyzed the distribution of texture cache misses and observed that in many texture mapping applications there is a significant number of conflict misses. To reduce these conflict misses, they evaluated three hybrid access cache systems: victim cache, half-and-half cache, and cooperative cache. Before showing the results, here is a short review of these three approaches (a sketch of the first appears after this list):

- Victim cache. First proposed by Jouppi [9]. It is a small fully-associative cache (1 to 5 lines in Jouppi's paper) that is added to the main cache. The victim cache stores the most recently discarded elements (victims) in case they are needed again. Hennessy and Patterson [6, p. 449] affirm that the victim cache helps to reduce both miss penalty and miss rate.
- Half-and-half cache. First proposed by Theobald et al. [14]. Half of the lines in this cache are direct-mapped and the other half are set-associative. The authors claim that for moderate-sized caches, the half-and-half cache has better performance than other hybrid caches in most applications.
- Cooperative cache. Explained in [12]. It consists of a direct-mapped temporal-oriented cache and a four-way set-associative spatial-oriented cache. Each cache has an 8KB capacity but a different block size. This arrangement is aimed at reducing power consumption, because the two caches are designed to help each other.

Choi et al. compared the three architectures by miss rate and AMAC (average memory access cycles) for various cache sizes, associativities, and cache line sizes, and concluded that the victim cache is the most suitable for a texture cache system. For small caches (8KB), the victim cache shows a performance improvement of 19% when compared with a conventional cache system. However, from the authors' tables, it is not clear which architecture, between victim cache and half-and-half cache, is the best one for a cache size of 16KB.
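Since Choi et al. single out the victim cache, here is a minimal Python sketch of the idea: a direct-mapped main cache whose evicted lines drop into a small fully-associative buffer and can be swapped back on a later miss. The 4-entry buffer follows Jouppi's 1-to-5-line range [9]; the remaining sizes and the FIFO victim replacement are assumptions for illustration, not Choi et al.'s configuration.

from collections import OrderedDict

class VictimCache:
    def __init__(self, lines=128, line_size=64, victim_entries=4):
        self.line_size = line_size
        self.lines = [None] * lines            # direct-mapped tag array
        self.victims = OrderedDict()           # small fully-associative buffer
        self.victim_entries = victim_entries
        self.hits = self.misses = 0

    def access(self, address):
        tag = address // self.line_size        # full line number as the tag
        index = tag % len(self.lines)
        if self.lines[index] == tag:           # direct-mapped hit
            self.hits += 1
            return True
        if tag in self.victims:                # victim hit: swap the two lines
            del self.victims[tag]
            self._evict_to_victims(self.lines[index])
            self.lines[index] = tag
            self.hits += 1
            return True
        self.misses += 1                       # full miss: fetch from memory
        self._evict_to_victims(self.lines[index])
        self.lines[index] = tag
        return False

    def _evict_to_victims(self, tag):
        if tag is None:
            return
        if len(self.victims) == self.victim_entries:
            self.victims.popitem(last=False)   # drop the oldest victim
        self.victims[tag] = True

The point of the swap on a victim hit is that two lines mapping to the same direct-mapped slot can ping-pong between the main cache and the victim buffer instead of repeatedly missing to memory.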

8. Multilevel parallel texture caching

Park et al. [13] combined two previous ideas, multi-level caching [4] and parallel texture caching [8], into what they call multilevel parallel texture caching. The architecture has two levels of cache memory and four types of components overall, all integrated on a single chip: an 8MB DRAM L2 cache memory, eight 8-way 16KB SRAM L1 cache memories in parallel, eight pipelined texture filter modules, and a serial-to-parallel loader. See the next figure.

Figure 7. Multilevel parallel texture cache: the AGP or PCI bus feeds a serial-to-parallel latch and the 8MB L2 DRAM, which serves eight 16KB L1 SRAM caches (C0 to C7) over the IBUS; each L1 cache feeds a texture filter and its graphics pipeline (based on [13]).

The main results and features of the multilevel parallel texture cache are as follows:

- The L2 cache reduces the required bandwidth of the AGP (Accelerated Graphics Port) or PCI (Peripheral Component Interconnect) bus by 20 times, for a 1024 x 768 screen resolution, by exploiting the texture data coherency between successive frames.
- The eight independent L1 caches remove the access conflicts in the parallel graphics pipelines and allow them to run at full speed.
- Using a new data transfer scheme proposed by Park et al., the internal bus (IBUS) has a bandwidth of up to 75 GB/s. With this bandwidth, the L2 cache is able to service the L1 caches without starvation.
- The architecture is reconfigurable, because the cache line sizes of L1 and L2 can be 4 x 4, 8 x 8, or 16 x 16 pixels. The intention is to maintain optimal cache performance, in terms of cache misses, for different graphics applications.

9. Adaptive indexing

Adaptive indexing, first proposed as ACI (adaptive cache indexing) by Kim et al. [10] and later renamed A-index by Kim and Kim [11], is a technique aimed at reducing the miss rate in a texture cache.

Kim and Kim [11] note that most texture caching methods use the texture u coordinate as the cache index (u-index). This is fine in most cases, because u is the horizontal coordinate and most rasterization algorithms are horizontally oriented. However, there are cases where a texture access pattern is vertically oriented or, in their nomenclature, v-major, and using the u coordinate to index the texture cache results in a higher cache miss rate. In this case, using the v-index gives better cache performance. To understand the concepts of an access being u-major or v-major, consider the following figure, which represents two lines.

Figure 8. Example of u-major and v-major access directions.

The blue (left) line is v-major because Δv > Δu, i.e., the vertical coordinate changes faster than the horizontal coordinate. On the other hand, the red (right) line is u-major because, in that case, Δv < Δu.

The decision of which index, between u-index and v-index, will be used to access the texture cache is made at address-generation time, based on the current and the previous texture sampling points. If Δu, the difference between the current and previous u coordinates, is greater than Δv, the difference between the current and previous v coordinates, then the direction is u-major; otherwise, the direction is v-major.

Using adaptive indexing with a 16KB, two-way set-associative cache with a 64-byte (4 x 4 texels) line size, and scenes of 640 x 480 pixels, Kim et al. [10] report a reduction in cache misses of about 23% compared with a traditional texture caching architecture with no adaptive indexing. The total number of cycles is reduced by 8.9%. Kim and Kim [11] report reductions of 21.6% and 8.8% in cache misses and cycles, respectively.
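The selection rule can be written down directly. In the Python sketch below, the number of cache sets and the way the chosen coordinate is folded into a set index are assumptions for illustration, not the hardware of [10, 11].

NUM_SETS = 256   # assumed; a 16KB two-way cache with 64-byte lines has 128 sets

def choose_index(curr, prev):
    """curr and prev are (u, v) texture sampling points."""
    du = abs(curr[0] - prev[0])
    dv = abs(curr[1] - prev[1])
    # If du > dv the direction is u-major; otherwise it is v-major.
    major = curr[0] if du > dv else curr[1]
    return int(major) % NUM_SETS    # derive the set index from the major axis

# Example: a nearly vertical access pattern is detected as v-major, so
# consecutive samples spread across different cache sets instead of
# colliding in the few sets a u-index would select.
print(choose_index((10.2, 57.0), (10.1, 53.0)))   # dv > du, so v-major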

Using adaptive indexing in a texture cache system requires adding a 1-bit register per cache line, to distinguish between a u-index and a v-index, and some logic to compare two texture sampling points. Kim et al. [10] report that for a 16KB texture cache, the total hardware overhead is about 7.7% extra NAND-equivalent gates.

References

[1] Akeley K, Hanrahan P. CS448A: Real-Time Graphics Architectures. Course notes, Stanford University. Accessed 2/21/06.
[2] Akenine-Möller T, Haines E. Real-Time Rendering, 2nd edition. A. K. Peters: Natick, Mass.
[3] Choi CJ, Park GH, Lee JH, Park WC, Han TD. Performance comparison of various cache systems for texture mapping. The Fourth International Conference on High-Performance Computing in the Asia-Pacific Region.
[4] Cox M, Bhandari N, Shantz M. Multi-level texture caching for 3D graphics hardware. Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA).
[5] Hakura ZS, Gupta A. The design and analysis of a cache architecture for texture mapping. Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA).
[6] Hennessy JL, Patterson DA. Computer Architecture: A Quantitative Approach, 3rd edition. Morgan Kaufmann: San Francisco.
[7] Igehy H, Eldridge M, Proudfoot K. Prefetching in a texture cache architecture. Proceedings of the ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware.
[8] Igehy H, Eldridge M, Hanrahan P. Parallel texture caching. Proceedings of the ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware.
[9] Jouppi NP. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA).
[10] Kim CH, Im YH, Kim LS. Miss-rate reduction in texture cache by adaptive cache indexing. Electronics Letters, vol. 40.
[11] Kim CH, Kim LS. Adaptive selection of an index in a texture cache. Proceedings of the IEEE International Conference on Computer Design (ICCD).

[12] Park GH, Lee KW, Lee JH, Han TD, Kim SD. A power efficient cache structure for embedded processors based on the dual cache structure. ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems (LCTES).
[13] Park SJ, Kim JS, Woo R, Lee SJ, Lee KM, Yang TH, Jung JY, Yoo HJ. A reconfigurable multilevel parallel texture cache memory with 75GB/s parallel cache replacement bandwidth. IEEE Journal of Solid-State Circuits, vol. 37.
[14] Theobald KB, Hum HHJ, Gao GR. A design framework for hybrid-access caches. 1st IEEE Symposium on High-Performance Computer Architecture (HPCA).
[15] Van der Pas R. Memory Hierarchy in Cache-Based Systems. Technical report, Sun Microsystems. Accessed 2/27/2006.
[16] Watt A. 3D Computer Graphics, 3rd edition. Addison-Wesley: Harlow, England.
[17] Watt A, Watt M. Advanced Animation and Rendering Techniques: Theory and Practice. Addison-Wesley: Harlow, England.
[18] Williams L. Pyramidal parametrics. Proceedings of the 10th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 1983.
