Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching


by Sean Rea

A thesis presented to Lakehead University in partial fulfillment of the requirement for the degree of Master of Science in Electrical and Computer Engineering

Thunder Bay, Ontario, Canada
September 1, 2017

Copyright 2017 Sean Rea

Abstract

Expanding cache size is a common approach for reducing cache miss rates and increasing performance in processors. This approach, however, comes at a cost of increased static and dynamic power consumption by the cache. Static power scales with the number of transistors in the design, while dynamic power increases with the number of transistors being switched and the effective operating frequency of the cache. Cache compression is a technique that can increase the effective capacity of cache memory without incurring the same growth in static and dynamic power consumption. Alternatively, this technique can reduce the physical size, and therefore the static and dynamic energy usage, of the cache while maintaining reasonable effective cache capacity. A drawback of compression is that a delay, or decompression latency, is experienced when accessing the compressed data, which affects the critical execution path of the processor. This latency can have a noticeable impact on processor performance, especially when implemented in first-level caches. Cache prefetching techniques have been used to hide the latency of lower-level memory accesses. This work investigates the combination of current prefetching techniques and cache compression techniques to reduce the effect of decompression latency and therefore improve the feasibility of power reduction via compression in high-level caches. We propose an architecture that combines L1 data cache compression with table-based prefetching to predict which cache lines will require decompression. The architecture then performs decompression in parallel, moving the delay due to decompression off the critical path of the processor. The architecture is verified using 90nm CMOS technology simulations in a new branch of SimpleScalar, using Wattch as a baseline, and cache model inputs from CACTI. Compression and decompression hardware are synthesized using the 90nm Cadence GPDK and verified at the register-transfer level. Our results demonstrate that using Base-Delta-Immediate (BΔI) compression, in combination with Last Outcome (LO), Stride (S), and Two-Level (2L) prefetch methods, or hybrid combinations of these methods (S/LO or 2L/S), provides a performance improvement over BΔI compression alone in L1 data cache. On average, across the SPEC CPU 2000 benchmarks tested, BΔI compression results in a slowdown of 3.6%. Implementing a 1K-set Last Outcome prefetch mechanism improves the slowdown to 2.1% and reduces the energy consumption of the L1 data cache by 21% versus a baseline scheme with no compression.

Acknowledgements

For introducing me to computer architecture and providing me with an engaging topic to study while at Lakehead, thank you Dr. Atoofian. I have learned a lot under your supervision. Thank you to Dr. Mansour and Dr. Tayebi for endorsing my application for part-time studies back in 2012, and to Dr. Natarajan for keeping an eye on me for the first few years. Thank you Dr. Christoffersen for your active role in enabling students at Lakehead to have access to Cadence tools. Thank you to all my family who have given their support and made every accommodation possible for me to pursue this degree. Most importantly, thank you Marcelle and Adam for supporting my decision to continue with my studies.

Sean Rea
rea@ieee.org

Contents

List of Figures
List of Tables
List of Symbols
List of Abbreviations
Chapter 1 Introduction
Chapter 2 Background and Related Work
  2.1 Memory Hierarchy
  2.2 Cache Compression
    2.2.1 Related Work in Cache Compression
    2.2.2 Base-Delta
    2.2.3 Base-Delta-Immediate
  2.3 Data Prefetching and Data Value Prediction
    2.3.1 Related Work in Prefetching
    2.3.2 Last Outcome
    2.3.3 Stride and Hybrid Stride / Last Outcome
    2.3.4 Two-Level and Hybrid Two-Level / Stride
  2.4 Thesis Motivation
Chapter 3 Cache Compression and Prefetching
  3.1 Compression Architecture
    3.1.1 Power Considerations
  3.2 Prefetching Architecture
    3.2.1 FETCH
    3.2.2 MEM
  3.3 Hardware Design
    3.3.1 Hierarchical Carry-Lookahead Adder
    3.3.2 Implementation in Verilog
Chapter 4 Simulation Methodology
  4.1 Methodology
    4.1.1 SimPoint
    4.1.2 CACTI
    4.1.3 SimpleScalar Environment
    4.1.4 Synthesis and Static Power Analysis
    4.1.5 Dynamic Power Analysis
    4.1.6 Place and Route in Cadence Innovus
Chapter 5 Results
  5.1 Compression
  5.2 Prefetching
  5.3 Compression and Prefetching
Chapter 6 Summary and Future Work
  6.1 Contributions
  6.2 Future Work
Bibliography
Appendix A Verilog Source

List of Figures

Figure 2.1 Memory Hierarchy
Figure 2.2 32-Byte Cache Line Compressed with Base-Delta
Figure 2.3 32-Byte Cache Line Compressed with Base-Delta (2 Bases)
Figure 2.4 Changes to Tag and Data Architecture for BΔI Compression
Figure 2.5 Last Outcome Prefetching
Figure 2.6 Stride Prefetching
Figure 2.7 Stride State Machine
Figure 2.8 Two-Level Prefetch Table and Pattern History Table
Figure 3.1 Prefetching Applied to Classic RISC Architecture
Figure 3.2 Prefetch Table Structure
Figure 3.3 Compressor Design
Figure 3.4 Decompressor Design
Figure 3.5 Truth Table for Full Adder
Figure 3.6 Karnaugh Map for Full Adder
Figure 3.7 Adder Design
Figure 3.8 HDL Structure of Compressor
Figure 3.9 Testbench Waveforms for Compressor in Xilinx ISE
Figure 3.10 HDL Structure of Decompressor
Figure 4.1 Simulation Flow Diagram
Figure 4.2 CACTI Output
Figure 4.3 Compression Model Verification Results
Figure 4.4 Compressor VCD Header
Figure 4.5 Compressor VCD ASCII Value
Figure 4.6 Decompressor VCD Header Variables
Figure 4.7 Decompressor VCD ASCII Value
Figure 4.8 Genus Gates Report for Compressor (Condensed)
Figure 4.9 Compressor Routing in Innovus
Figure 5.1 Percentage of L1 Data Cache Lines Compressed by Each Scheme
Figure 5.2 Compression Ratio of L1 Data Cache
Figure 5.3 IPC of Baseline Scheme
Figure 5.4 IPC of Compressed Scheme
Figure 5.5 Speedup of Compressed Scheme vs Baseline
Figure 5.6 L1 Data Cache Static Energy (Baseline Scheme)
Figure 5.7 L1 Data Cache Static Energy (Compressed Scheme)
Figure 5.8 L1 Data Cache Static Energy Ratio Compressed vs Baseline
Figure 5.9 L1 Data Cache Dynamic Energy (Baseline Scheme)
Figure 5.10 L1 Data Cache Dynamic Energy (Compressed Scheme)
Figure 5.11 L1 Data Cache Dynamic Energy Ratio Compressed vs Baseline
Figure 5.12 Hit Percentage of Load Instructions by Prefetch Table (128 Set)
Figure 5.13 Hit Percentage of Load Instructions by Prefetch Table (1K Set)
Figure 5.14 Prediction Accuracy of 10 Prefetch Table Configurations
Figure 5.15 Static Energy by Prefetch Table (128 Set)
Figure 5.16 Static Energy by Prefetch Table (1K Set)
Figure 5.17 Dynamic Energy by Prefetch Table (128 Set)
Figure 5.18 Dynamic Energy by Prefetch Table (1K Set)
Figure 5.19 L1 Data Cache Energy vs Performance
Figure 5.20 CPU Energy vs. Performance
Figure 5.21 Speedup Due to Prefetching (128 Set, vs. Compressed Only)
Figure 5.22 Speedup Due to Prefetching (1K Set, vs. Compressed Only)
Figure 5.23 Energy-Delay Product (CPU)

List of Tables

Table 3.1 Compression Events
Table 3.2 Power Events
Table 3.3 Prefetch Table Power Events
Table 3.4 Two-Level Table Power Events
Table 3.5 Decompression Buffer Power Events
Table 3.6 Compressor Test Cases
Table 4.1 gzip CPI Values by Simulation Point
Table 4.2 SimPoint Error by Maximum Number of Clusters
Table 4.3 SimPoint Error
Table 4.4 SimPoint Results
Table 4.5 CACTI L1 Cache Configurations and Power Results
Table 4.6 CACTI L2 Cache Timing
Table 4.7 CACTI Prefetch Table Configurations and Power Results
Table 4.8 CACTI Pattern History Table Power Results
Table 4.9 CACTI Decompression Buffer Power Results
Table 4.10 Delta Datatype and Overflow Information
Table 4.11 Boundary Conditions for Compression
Table 4.12 Initial Static Power Analysis of Decompressor by PDK
Table 4.13 Compressor Static Power Determination
Table 4.14 Decompressor Static Power Determination
Table 4.15 gzip Compressor Static Power Values by Simulation Point
Table 4.16 Static Power for Compressor and Decompressor
Table 4.17 gzip Compressor Dynamic Power Values by Simulation Point
Table 4.18 Compressor Dynamic Power Results from Cadence Genus
Table 4.19 gzip Decompressor Power Values by Simulation Point
Table 4.20 Decompressor Power Results from Cadence Genus

List of Symbols

c_i          carry-in
c_{i+1}      carry-out
E            energy
g_i          generate function
G_i          block generate function
M            maximum number of SimPoint clusters
p_i          propagate function
P_i          block propagate function
t            time
t_{VCD,ps}   value change dump timestamp (in picoseconds)

List of Abbreviations

2L        two-level prefetcher
BΔI       base-delta-immediate compression
BBV       basic block vector
CMC       Canadian Microelectronics Corporation
CMOS      complementary metal-oxide-semiconductor
CPI       cycles per instruction
CPU       central processing unit
EDP       energy-delay product
FIFO      first-in, first-out
FLAC      free lossless audio codec file format
FPC       frequent pattern compression
GHB       global history buffer
GPDK      generic process design kit
IPC       instructions per cycle
L1        level-1 cache
L2        level-2 cache
L3        level-3 cache
LO        last outcome prefetcher
LRU       least recently used
PC        program counter
PDK       process design kit
PNG       portable network graphics file format
RISC      reduced instruction set computer
SimPoint  simulation point
S         stride prefetcher
S/LO      hybrid stride/last outcome prefetcher
SPEC      Standard Performance Evaluation Corporation
VCD       value change dump
ZCA       zero-content augmented cache
ZIP       ZIP file format

Chapter 1
Introduction

In 2016, it was estimated that the world creates 2.5 quintillion bytes of data per day [1]. At the time, that estimate suggested that 90% of the world's data had been created in the previous two years alone. As early as 2013, it was approximated that Information and Communication Technologies were consuming nearly 10% of the world's electricity generation [2]. With this rate of data growth, and the current impact computing has on the world's energy consumption, there is a need to investigate ways to improve how we store and process data.

Linked with the growth of our data generation is the emergence of a rapidly growing mobile device market. This market relies heavily on low-power, battery-supplied devices. Unavoidably, the best way to provide a longer battery life for these devices is for the devices themselves to consume less power. While improvements can be made to the devices themselves (e.g. supply voltage and device size), in many applications, the best way to reduce power is to find efficiencies at the architectural level [3]. Redefining the architecture can result in orders of magnitude of power reduction, depending on the specific application.

When we look at the data we are creating, it is clear that patterns exist that create inefficiencies in the way it is stored and processed [4]. Data patterns may consist of values that are repeated over and over again, values that are very close to each other, and even large sets of null data. Some of the greatest sources of data in the world today are the cameras on our mobile devices. Images are a great example of data that consists of patterns. Pixel data contains sequences of values, which can be identical or very close in magnitude (when looking at colour value, brightness, etc.). Programs that manipulate this image data generally handle large data arrays. These arrays are frequently initialized to some repeated value, often zero. And frequently enough, the developers of those programs may over-provision data types to hold that data such that most of it goes unused (narrow data). These inefficiencies in our data contribute to unnecessary storage and processing.

In computing architecture, we utilize a memory hierarchy to ensure that the data we use most frequently is closest to the CPU and therefore accessible as fast as possible. The closest memory spaces, L1 and L2 cache, take up large areas on-chip and consume large amounts of power. Depending on the architecture, cache memory in a processor can account for upwards of 40% of the total power budget [5]. Because the cache must handle our data, which is full of inefficiencies, a significant portion of this energy consumption could be avoided via compression.

Compression is possible by replacing the most inefficient patterns with a set of smaller representations, known as encoding. Encoding can be done with a fixed dictionary, or a compression scheme can dynamically and iteratively assign code words to patterns. Most of our data is compressible to some extent, whether it is text, image, or audio data. This is why, in main memory, it is common to store and transfer large files in a compressed format (e.g. ZIP, PNG, FLAC).

Cache-level compression in a processor is a technique that can increase the effective capacity of the cache, and therefore improve performance of the processor, by compressing cache lines before they are stored in the cache. Alternatively, this same method can be used to reduce the physical size of the cache and therefore reduce the power consumption. Typically, cache lines consist of several bytes of data. In set-associative caches, multiple lines, or ways, may be stored at a given cache index. Each of these ways stores a complete uncompressed cache line. If we can compress these cache lines, we can store more data in each set. Alternatively, we could reduce the physical size of the cache and store the same, or similar, amounts of data. The power of a cache depends on its size and the frequency of accesses. By reducing the total size of the cache, we can significantly reduce the power consumption. In addition, the data lines we read from the cache are potentially reduced in size as well. Therefore, we can model the savings in dynamic power consumption in the cache by considering the size ratio of compressed data versus an uncompressed cache line.

In the past, researchers have avoided implementing compression in high-level caches such as L1 data cache because of the impact decompression latency has on the overall performance of the processor [4, 6, 7]. Access times at this level are on the order of a few clock cycles. Adding even a few clock cycles to this access time will cause significant performance delays and defeat the intent of the high-level cache.

However, it is possible to implement additional techniques, such as prefetching, to remove some of the burden of decompression from the processor's critical path. If this approach is successful, it could improve the feasibility of implementing compression in high-level caches (specifically L1 data cache).

Our work focusses on the combination of data cache compression and table-based prefetching to explore the feasibility of implementing compression in L1 data cache. We evaluate this new architecture by examining what impact it has on the performance and power consumption of a CPU during the execution of standard benchmarks. Specifically, this work makes three key contributions:

(1) An architecture is proposed that combines compression, specifically Base-Delta-Immediate compression, in L1 data cache with table-based prefetching methods, such as Last Outcome, Stride, and Two-Level, to predict which cache lines will require decompression. The architecture then performs decompression in parallel, therefore moving the delay due to decompression off the critical path of the processor.

(2) Modifications are made to Wattch [5], a branch of SimpleScalar [8] that is an open-source processor modelling tool for analyzing and optimizing power consumption at the architectural level. This tool is extended to model Base-Delta-Immediate compression in combination with table-based prefetching to show the benefit of performing decompression as a parallel activity to execution and increase the feasibility of implementing compression in L1 caches.

(3) 64-byte compressor and decompressor hardware is designed in 90nm CMOS and tested for implementation with Base-Delta compression. Static and dynamic power analysis is performed on the new hardware, reinforcing its suitability for use in a power-reducing compressed cache scheme.

The remainder of this thesis is organized into five chapters. Chapter 2 provides an overview of cache compression and prefetching and recent research that has been done in these areas. In Chapter 3, we define the proposed compression and prefetching architecture, discuss what changes are necessary to accommodate the new architecture in a conventional superscalar processor, and provide the details of the hardware design for the compressor and decompressor units. Chapter 4 discusses the tools used and modified to model the compression and prefetching architecture. In Chapter 5, the results of simulation are provided and discussed. Finally, Chapter 6 provides a summary of the work done, the significance of the results, and future work that could be done to advance this research.

Chapter 2
Background and Related Work

In this chapter, we review the concepts of memory hierarchy, compression, data value prediction, and prefetching. We review existing work in the areas of cache compression, data value prediction, and prefetching, and detail the operation of one compression scheme, Base-Delta-Immediate, and three prediction schemes, Last Outcome, Stride, and Two-Level, as well as hybrid combinations of these schemes. Finally, the motivation behind this work is presented.

2.1 Memory Hierarchy

The speed at which a processor can read information from memory has a great impact on the performance of that processor. Fast memory, however, is expensive. For this reason, memory is organized into levels that exploit small amounts of fast memory close to the processor, and larger, slower levels of memory farther away. This organization is referred to as the memory hierarchy. In this hierarchy, between the central processing unit (CPU) and main memory are various levels of cache memory.

Figure 2.1 Memory Hierarchy

When the data for a given address is stored in the high-level cache (i.e. L1 cache), the processor has fast access to this information. If the data is not there, the processor must retrieve it from lower levels. This is referred to as a cache miss. Cache misses are classified by three types: compulsory misses, capacity misses, and conflict misses [9]. Compulsory misses occur during start-up, when no information exists in

the cache. Capacity misses occur if the cache is not large enough for all the blocks required during the execution of a program and blocks are discarded. If these discarded blocks must be read again, they must be fetched from lower levels. Conflict misses occur if the cache is not fully associative. In this case, blocks may be discarded even before the cache is full. Increasing the capacity of the cache can reduce the number of capacity and conflict misses and therefore increase performance. This, however, comes at a cost of increased power consumption.

2.2 Cache Compression

Similar to data files in main memory, the data within cache memory consists of patterns that can be exploited by compression techniques to save space. Cache compression is a method that can be used to increase the capacity of the cache without experiencing the same increase in power consumption. Because the intent of cache memory is to provide low-latency access to data, compression and decompression must be performed at the hardware level in the processor rather than at the software level, as is commonly done for files in main memory. This same requirement for low-latency cache access is why most of the previous work on the topic of cache compression focusses on low-level cache (i.e. L2 cache and L3 cache). Because L1 caches typically have access latencies on the order of a few clock cycles, adding a decompression latency on top of that can degrade performance beyond acceptable levels.

The ideal compression scheme for implementation in L1 cache is one that is, of course, fast, but is also capable of encoding the most common patterns that exist within data stored in memory. These most common patterns can be grouped into four main categories: zeros, repeated values, narrow values, and other patterns [4].

Zeros: Zero values are widely used throughout programs, primarily in variable initializations, null pointers, false boolean values, and sparse matrices [4].

Repeated Values: Similar to zero values, repeated values may appear in the form of variable initializations. Another cause for repeated values is image data. Adjacent pixels tend to contain similar information such as colour data [4].

Narrow Values: It is common for developers to over-allocate space to variables to protect from overflow during execution of a program. In some cases, these variables never come close to their maximum values. A small value stored as a large data type is considered a narrow value [4].

Other Patterns: This group is not meant to include all other patterns, but rather patterns that specifically have low dynamic range. For example, an array of pointers that all point to the same region of memory [4].

2.2.1 Related Work in Cache Compression

Research has been done to evaluate hardware-based data compression in CPU caches [4, 6, 7, 10, 11, 12] as well as in GPGPUs [13]. The following papers have explored which methods exploit the most opportunities in data patterns and at which levels in cache they are most beneficial. A common understanding among this work is that decompression latency is a problem when implementing compression in fast caches, and it is cited as the reason for avoiding L1 compression in some works [4, 6, 7].

In [7], the authors present Frequent Pattern Compression (FPC), which compresses data that fits into one of seven patterns. Each 32-bit word is evaluated separately, so data is not compressed spanning multiple words. The scheme is evaluated in L2 cache with L1 cache being left uncompressed. The design is evaluated against the Wisconsin Commercial Workload Suite and six benchmarks from the SPEC CPU 2000 suite. The scheme provides compression ratios ranging from 1.0 to 2.4 over all their benchmarks. This scheme captures the main three groups of patterns that exist in data. However, it does not address the low dynamic range behaviour that data exhibits.

In [11], the authors exploit the common scenario of storing null data in caches by augmenting the uncompressed cache with an additional cache that is only required to store the addresses of zero-content cache lines. The authors evaluate this zero-content augmented (ZCA) cache in every combination of cache level from L1 to L3. They found that implementing ZCA in L3 alone was sufficient to experience most of the benefit, and found up to a 22% speedup when run against SPEC CPU 2000 benchmarks. Because this scheme only looks at zeros, decompression latency is not an issue, and the authors are able to explore this technique in all levels of cache without affecting the read latency. This scheme, however, does not address most of the patterns that exist within cache data.

In [12], the authors compress the cache line by encoding 32-bit words that appear in a predefined list of frequent values. The scheme requires that a cache line be compressible to 50% of its size or less, or it is not compressed at all. Encoding bits are required for each word in the cache line. The authors determine that their scheme can improve the miss rate for six integer benchmarks from SPEC CPU 95 by as much as 36.4%. Due to the table-lookup nature of the scheme, it cannot capture all repeating values efficiently. As well, it misses the important narrow values that occur within the data.

In [4], the authors present a new compression scheme that this work builds upon, called Base-Delta-Immediate. In addition to null data and repeating values, this work exploits two trends in data called narrow values and low dynamic range. The scheme compresses cache lines that can be represented as a single base and an array of small deltas. The authors evaluated their scheme against the SPEC CPU 2006 benchmark suite, among other benchmarks. The authors achieve an average compression ratio of 1.53 across all benchmarks when compressing L2 cache. Due to the impact decompression latency would have on L1 cache, the authors focus on L2. Because this scheme addresses all the patterns discussed in the other works, and more, it is the compression scheme we use. We will address the issue of implementing this scheme in L1 cache by combining common prefetching techniques to mask the effect of the decompression latency.

2.2.2 Base-Delta

Returning to [4], the authors first describe the foundation of their scheme, called Base-Delta. Base-Delta is a cache compression scheme that stores a cache line as one large base value along with an array of smaller deltas. The concept behind the scheme is that, for many cache lines, the data values have a low dynamic range (the difference between values is small). For example, Figure 2.2 shows how a 32-byte cache line may be compressed using a Base 4 Delta 1 compression scheme.

Figure 2.2 32-Byte Cache Line Compressed with Base-Delta

From the figure, you can see how this cache line benefits from low dynamic range. In this example, the Base 4 Delta 1 scheme is used. This means the chosen size of the base is 4 bytes, and the size of the deltas is 1 byte. In some cases, compression may benefit from having multiple bases. For example, the cache line in Figure 2.3 clearly shows patterns with low dynamic range around two bases.

Figure 2.3 32-Byte Cache Line Compressed with Base-Delta (2 Bases)

Determining two optimized bases is a high-latency task that is not feasible during execution. The authors resolve this issue by implementing the immediate portion of their compression, making it Base-Delta-Immediate [4].

2.2.3 Base-Delta-Immediate

Base-Delta-Immediate (BΔI) compression implements a 2-base Base-Delta scheme where one base is always zero. This method sees much of the benefit of a 2-base system without adding the need to store a second base. To implement this immediate base, an array of flag bits called the immediate mask is included in the tag to identify which deltas refer to the base and which refer to zero.

To change a conventional cache into a Base-Delta-Immediate cache, we double the number of tags. This allows us to utilize the vacant space in the cache created during compression. Next, we modify how the tags point to the data stored in the cache. The data array is divided into 8-byte segments rather than 64-byte blocks. Rather than pointing to a constant 64-byte block, the tag now points to a variable-size compressed block. The location of the compressed block at a given cache index is determined by summing the sizes of the cache blocks stored in front of it. Lastly, we add the encoding bits (and immediate mask, as mentioned above) to the tag. This allows us to define the type of compression applied to the data for a given way. These changes to the architecture are shown in Figure 2.4; the offset arithmetic is sketched below.

Figure 2.4 Changes to Tag and Data Architecture for BΔI Compression
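To make the offset arithmetic concrete, the sketch below (written in the style of the Verilog modules of Chapter 3) computes the starting segment of each way by summing the compressed sizes of the ways in front of it. The four-way arrangement, widths, and names are illustrative assumptions only:

// Hypothetical sketch: locating variable-size compressed blocks within a
// set. Each way's size is measured in 8-byte segments; a way's starting
// segment is the sum of the sizes of the ways in front of it.
module seg_offset_sketch (
    input  wire [3:0] size0, size1, size2,   // sizes of ways 0..2 (segments)
    output wire [5:0] start1, start2, start3 // starting segment of ways 1..3
);
    assign start1 = size0;                   // way 1 begins after way 0
    assign start2 = size0 + size1;           // way 2 after ways 0 and 1
    assign start3 = size0 + size1 + size2;   // way 3 after ways 0, 1, and 2
endmodule

In the architecture evaluated here, this summation is modelled functionally in the simulator rather than shown as dedicated hardware.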

These changes are implemented functionally in SimpleScalar and have their power and access time impact modeled using CACTI (see Chapter 4, Simulation Methodology). The additional hardware required for compressing and decompressing the BΔI cache lines is discussed separately in Chapter 3.

2.3 Data Prefetching and Data Value Prediction

Prefetching is a method that can be used to reduce the miss rate of all three types of cache misses. Data can be prefetched (read in parallel to execution, before it is requested) either directly into the next-level-up cache or into a custom buffer that can be accessed faster than main memory [9]. While successful prefetches can reduce memory latency and improve overall processor performance, unused prefetches will have a negative impact on power consumption while having no positive impact on performance.

Data value prediction is similar to prefetching in that it employs a table-based predictor to improve performance of the processor. Unlike prefetching, data value predictor tables do not store the address in memory where the data exists. Rather, they store the data itself, specifically the result of single-register-producing instructions. The processor then continues execution using this predicted result. If an incorrect prediction is made, the processor pipeline must be flushed of any instructions that depend on this data.

The approach we take is neither a direct application of prefetching nor of data value prediction. Rather, our prefetcher predicts which address in the L1 data cache will be accessed based on the program counter of each load instruction. Then, in parallel, the processor decompresses this data in the cache, if it is compressed, and inserts it into an external buffer.

The key similarities between prefetching and data value prediction are the prediction table methods used. These methods have been explored extensively for both applications. We look at this past research to determine which tables are best suited for our application. Reviewing the prediction schemes presented in the literature, five types stand out as candidates for this work, as discussed in [14]. The simplest, Last Outcome, is evaluated first to determine how quickly we can recover the cost of decompression with the lowest possible complexity. Next, Stride and the hybrid Stride/Last Outcome predictors are used and evaluated. Lastly, the Two-Level and hybrid Two-Level/Stride predictors are looked at to capture more complex patterns. The Global History Buffer is discussed as a potential extension by evaluating the benefit of predictor tables with a depth greater than one.

2.3.1 Related Work in Prefetching

In [14], data value predictors are discussed using Last Outcome, Stride, and 2-Level Pattern History methods, and two hybrids of these methods. Data value prediction uses prediction tables in the same way as prefetching. Data value predictors predict the data value rather than the address of the data in memory. Accuracy is critical for data value predictors because if an incorrect prediction is made, any in-flight instructions dependent on this value must be flushed out of the pipeline. The authors found that the Last Outcome scheme was correct 28-62% of the time and incorrect up to 72% of the time. Stride was correct 35-77% and incorrect 3-6% of the time. Two-Level makes minor improvements in prediction accuracy over Last Outcome (1-3%). However, the two-level table scheme greatly improves the misprediction rate, to 1-13%. The first hybrid scheme implements Last Outcome when Stride is not in steady state. This results in a correct prediction rate of 49-80% and incorrect predictions 20-51% of the time. The second hybrid predictor combines the 2-Level scheme with Stride. This scheme made correct predictions 50-82% of the time and mispredictions only 5-18% of the time. This scheme is, however, the most complex to implement.

An important takeaway from this research is the incorrect prediction rates. In the data value predictor method, if we make an incorrect prediction, we must flush the processor of any instructions that depend on the incorrect value. In our compression-prefetching method, we do not have to purge any information. However, incorrect predictions will cause unnecessary cache accesses, which will increase power and may evict useful data from the decompression buffer (depending on the depth of the buffer).

In [6], the authors present a two-table prefetching scheme called the Global History Buffer (GHB). GHB itself, as with the above research on data value predictors, explores multiple prediction schemes: Address Correlation, Distance Correlation, and Constant Stride. The key benefit of GHB is the two independently sized tables. The first is the Index Table, which only stores the tag and a pointer to the head of a list stored in the GHB Table. The GHB acts as a circular buffer, keeping only the latest information. The authors investigate different table configurations with a degree of four (values prefetched each access). They found that GHB Distance prefetching resulted in a 20% speedup over conventional-table Distance prefetching when indexed by the miss address, and 6% when indexed by the program counter (PC). The key enabler of the Global History Buffer is to minimize space and hold the latest information about cache misses. The same technique can be applied to our prediction table. However, this approach will only be useful if tables with a depth greater than 1 prove to be valuable. The Global History Buffer approach

will not be explored directly in this work, but this work can easily be extended to explore this possibility in a later study.

2.3.2 Last Outcome

For this thesis, prefetching techniques are used to predict which cache lines may require decompression from L1 data cache before the instruction is decoded, using the PC of the instruction. The intent is to load the data from a compressed cache, decompress it, and make it available to the CPU in parallel with other stages to remove the bottleneck that is decompression latency. The simplest scheme that will be implemented is Last Outcome. Figure 2.5 shows the traditional implementation of this scheme.

Figure 2.5 Last Outcome Prefetching

In the figure, you see that a table exists to store two values for each entry: Tag and Value. Tag identifies the load instruction address, and Value identifies the memory address loaded by that instruction. The configuration of the table can be varied similarly to that of cache memory: associativity, depth (multiple values per tag), etc. Unlike the traditional architecture, in this work it is not necessary to verify if the prediction is correct in order to validate instructions with dependencies. If it is incorrect, the processor will merely suffer the full latency of decompression. A minimal hardware sketch of such a table follows.
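The sketch below shows one way a direct-mapped Last Outcome table could be built. The table depth, 32-bit widths, and port names are illustrative assumptions, not the configurations evaluated in Chapter 5:

// Hypothetical sketch of a direct-mapped Last Outcome table: indexed by
// the PC of a load, each entry holds a tag and the last address loaded
// by that instruction. Looked up in FETCH; overwritten on update in MEM.
module last_outcome_sketch #(
    parameter IDX = 7                      // log2(sets), e.g. 128 sets
)(
    input  wire        clk,
    input  wire [31:0] pc,                 // PC of the load instruction
    input  wire        update,             // assert in MEM with the true address
    input  wire [31:0] load_addr,          // address actually loaded
    output wire        hit,                // table hit during FETCH
    output wire [31:0] pred_addr           // predicted load address (Value)
);
    localparam TAGW = 32 - IDX - 2;        // remaining PC bits form the tag
    reg [TAGW-1:0] tag   [0:(1<<IDX)-1];
    reg [31:0]     value [0:(1<<IDX)-1];
    reg            vld   [0:(1<<IDX)-1];

    integer j;
    initial for (j = 0; j < (1<<IDX); j = j + 1) vld[j] = 1'b0;

    wire [IDX-1:0]  idx = pc[IDX+1:2];     // word-aligned PC selects the set
    wire [TAGW-1:0] t   = pc[31:IDX+2];

    assign hit       = vld[idx] && (tag[idx] == t);
    assign pred_addr = value[idx];

    always @(posedge clk) begin
        if (update) begin                  // Last Outcome: always overwrite
            tag[idx]   <= t;
            value[idx] <= load_addr;
            vld[idx]   <= 1'b1;
        end
    end
endmodule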

2.3.3 Stride and Hybrid Stride / Last Outcome

As mentioned earlier, the authors in [14] propose a hybrid prediction table that implements a Constant Stride prefetcher and uses the Last Outcome result when the stride prefetcher is not in a steady state. We have already reviewed the behaviour of the Last Outcome table. So, let us review the functionality of a stride prefetcher.

Stride Prefetching

Similar to Last Outcome, we will be indexing the Stride table by the program counter (PC) of each load instruction. When an entry is updated in the table, the value of the stride is calculated as the difference between the current and last memory addresses that are loaded. The state of the prefetcher can be Init, Transient, or Steady. Therefore, as shown in Figure 2.6, the table requires four columns: Tag, State, Value, and Stride.

Figure 2.6 Stride Prefetching

When a line in the table is first entered, there is no previous data from which to calculate the stride. The entry is in an initialized state. After this line is updated at least once, a stride can be calculated and the entry is in the transient state. The line will remain in this transient state until an update occurs that produces the same stride value as is currently stored in the table. When this occurs, the table is updated to steady state and this value for the stride is used. Figure 2.7 shows an overview of the state machine for this prefetcher; a sketch of the update rule follows below.

Figure 2.7 Stride State Machine
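As a minimal sketch, assuming illustrative names and widths and omitting the surrounding table, the per-entry update rule can be expressed as:

// Hypothetical sketch of one stride-table entry and its Init / Transient /
// Steady state machine. Table storage and indexing by PC are omitted.
module stride_entry_sketch (
    input  wire        clk,
    input  wire        update,     // asserted in MEM with the true address
    input  wire [31:0] addr,       // address actually loaded
    output wire        steady,     // predictions are used only in Steady
    output wire [31:0] next_addr   // predicted next load address
);
    localparam INIT = 2'd0, TRANSIENT = 2'd1, STEADY = 2'd2;
    reg [1:0]  state  = INIT;
    reg [31:0] value  = 32'd0;     // last address seen
    reg [31:0] stride = 32'd0;     // last stride seen

    assign steady    = (state == STEADY);
    assign next_addr = value + stride;

    always @(posedge clk) begin
        if (update) begin
            if (state != INIT && (addr - value) == stride)
                state <= STEADY;    // same stride seen again: lock in
            else
                state <= TRANSIENT; // first stride, or the stride changed
            stride <= addr - value;
            value  <= addr;
        end
    end
endmodule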

2.3.4 Two-Level and Hybrid Two-Level / Stride

In addition to a hybrid S/LO prefetch table, the authors in [14] propose a hybrid prediction table that implements a Two-Level prefetcher and uses the Stride result when the Two-Level prefetcher does not make a prediction. We have already reviewed the behaviour of the Stride table. So, let us review the functionality of a two-level prefetcher.

Two-Level Prefetching

Similar to the previous methods, we index the two-level table by the program counter (PC) of each load instruction. When an entry is updated in the table, the LRU and pattern information are updated. If the address does not already exist in the table at this location, then the least-recently-used address is replaced. As shown in Figure 2.8, the table requires four columns: Tag, LRU, Value History Pattern, and Data Values. The data values in our case are load addresses.

A second table exists, called the Pattern History Table. This table is indexed by the value history pattern and ranks the addresses stored as values in the value history table. During the FETCH stage, if we hit the prefetch table for a given load instruction PC, we index the PHT at the resultant pattern. If there exists a rank greater than a pre-set threshold, then we predict that value from the prefetch table. During the MEM stage, if a value in the table is the target of a load instruction, its rank is increased. The ranks of the other values in the table are decreased such that there is a net-zero change in ranking. As with the other tables, we are not concerned if a misprediction is made, as the result will simply be a full decompression latency seen by the MEM stage and an unused value eventually being evicted from the decompression buffer.

Figure 2.8 Two-Level Prefetch Table and Pattern History Table

2.4 Thesis Motivation

Among all the works mentioned so far relating to cache compression, there exist two common gaps in the research. First, due to the impact of the decompression latency, previous work has not used compression in L1 cache, apart from ZCA compression in [11]. We hope to address this issue by introducing prefetching of the decompressed information to sidestep this decompression latency in our architecture. Second, all the works on compression mentioned above have focussed on using their compression schemes to improve performance of the cache. This work intends to review the benefit of reducing the size of the cache to save power, while implementing prefetching as a means of maintaining performance.

Chapter 3
Cache Compression and Prefetching

In this chapter, we present a new architecture that combines cache compression with a prefetching mechanism to predict which memory addresses might require decompression based on the program counter of the instruction. The design of the compressor and decompressor hardware is discussed, including the selection of the hierarchical carry-lookahead adder and the theory behind it.

3.1 Compression Architecture

The compression architecture discussed in this work is a detailed implementation of the high-level design presented by the authors in [4]. To implement this architecture for L1 data cache, we must consider when data would be compressed and when it would be decompressed in a superscalar processor. In this architecture, data compression takes place when data is written into the cache, that is, on any write operation or a read miss. Decompression takes place on a read hit. These events are summarized in Table 3.1 and represent the major insertion points for this new architecture in a superscalar processor.

Table 3.1 Compression Events

L1 Data Cache Event    Action
Read Hit               Decompress
Read Miss              Compress
Write Hit              Compress
Write Miss             Compress

Decompression

On a Read Hit, we check the encoding bits for the hit cache line. If the line is encoded as compressed, we pass this cache line to the decompression hardware. After a number of cycles equal to the decompression latency, the uncompressed result is available at the output of the decompressor hardware. If the line is not

compressed, the line is read as usual from the cache and is available in a number of cycles equivalent to the access time of the L1 data cache. The additional logic required to check the encoding bits and determine whether the line is compressed is included in the design of the decompressor. The CPU can read the result from the output of the decompressor regardless of compression. If the data is uncompressed, the result will be available significantly faster, as it is merely passing the input data through a single multiplexer.

Compression

On a Read Miss or Write Miss, we check the compressibility of the data to be written into the cache line. Based on the best possible compression scheme that this data fits into, the size of the cache line is determined. Then, we check the size of the cache set at the miss address. If there is room for the new cache line, compressible or not, then it is written to the cache. If there is not enough room, we evict data in the cache at that index in an LRU fashion until there is enough room.

On a Write Hit, we treat compressibility in the same manner as any miss, with one minor change. When determining the space remaining in the set at the write address, we do not consider the space currently occupied by the hit address. This space will be overwritten by the new write data. This is critical, as the new data may consume more space and may not even be compressible. If this is the case, we can expect one or more segments to be evicted from the cache to accommodate the new data.

Because updating the cache, and therefore compression, occurs off the critical execution path, this is not a time-critical task. Therefore, each instance of writing data to the cache goes through the compressor hardware. The compressor hardware itself, as you will see later in this chapter, checks the 64-byte data for compressibility, selects the optimum compression scheme (or no compression scheme), and outputs the encoding bits and data to be written (compressed or not). This means that no additional logic is required to be added to the CPU to accommodate compression.

3.1.1 Power Considerations

When modelling the power consumption for this compression architecture, we can take into consideration the fact that we are reading and writing smaller sets of data from and to the cache. Static power and tag dynamic power remain the same. However, we can represent the data array dynamic energy calculation as:

$$E_{\mathrm{dynamic,\,compressed}} = \frac{\text{size of compressed cache line}}{\text{size of uncompressed cache line}} \times E_{\mathrm{dynamic,\,per\ access}} \qquad (3.1)$$

For example, a 64-byte line compressed to 24 bytes consumes roughly 24/64 = 37.5% of the baseline per-access data array energy.

Relating to the cache events mentioned previously in Table 3.1, we can model power with respect to these events as well. The power impact is shown in Table 3.2.

Table 3.2 Power Events

L1 Data Cache Event    Power Impact
Read Hit               Tag Read, Data Read
Read Miss              Tag Read, Tag Write, Data Write
Write Hit              Tag Read, Tag Write, Data Write
Write Miss             Tag Read, Tag Write, Data Write

3.2 Prefetching Architecture

From the compression architecture described earlier, you can see that we add decompression clock cycles to a read hit if the data is compressed in the cache. These additional cycles are necessary to allow the decompression hardware enough time to decompress the line. The purpose of the prefetching architecture is to avoid having this decompression of L1 data cache lines on the critical path of the processor. To accomplish this, we look for a way to perform decompression in parallel to other stages in the CPU. Consider the classic RISC architecture shown in Figure 3.1 [9].

Figure 3.1 Prefetching Applied to Classic RISC Architecture

In the Instruction Fetch (IF) stage, the program counter (PC) is used to access the next instruction from memory. At this point, it is important to know if the next instruction is a load instruction, if the data to be loaded is currently compressed in L1 data cache, and, most importantly, what the address of this data in memory is. If we have this information, we can then read the compressed data from L1 data cache and decompress it in parallel with the Instruction Decode (ID) and Execution (EXE) stages.

In the Memory Access (MEM) stage, data is read from memory at the address determined during the ID and EXE stages. If we have a buffer containing cache lines that have already been decompressed, we will read from there rather than from the cache.

We actually do not need to know much about the instruction to accomplish this. Similar to [14] and [15], we index our prefetch table using only the PC of the instruction. We do not populate the table every time we generate a single-register result, as done in [14], nor do we populate the table on a cache miss. Rather, in our architecture we add entries to our prefetch table each time we suffer the full decompression latency on a compressed cache hit. This means that each entry represents a load instruction. At a minimum, we store the address of the compressed cache line in the cache. Depending on the prefetch table scheme, we store other information to aid in making a correct prediction of the next compressed address that is read by this load instruction.

Figure 3.2 Prefetch Table Structure

This architecture requires updating the behavior of the CPU in two key areas: the FETCH stage and the MEM stage.

3.2.1 FETCH

After we fetch an instruction, we want to know if we should begin reading from the L1 data cache. We check our prefetch table for an entry at the index of our program counter. If we return an address prediction from the table, then we populate another table called the decompression buffer. The power considerations for this table are shown in Table 3.3.

Table 3.3 Prefetch Table Power Events

Prefetch Table Event   Power Impact
Read Hit               Tag Read, Data Read, Decompression Buffer Tag / Data Write
Read Miss              Tag Read

If we are using one of the Two-Level prefetching schemes, we will require a second table access. This table is referred to as the Pattern History Table (PHT). In this case, if the request hits the prefetch table, a pattern is returned. We then read the PHT at the pattern index, and return a reference to a value that is stored in the prefetch table. This value is the prediction address. The power considerations for the prefetch table change as well, as we only read the table data if the PHT hits over the threshold.

Table 3.4 Two-Level Table Power Events

Two-Level Table Event  Power Impact
Prefetch Read Hit      Prefetch Tag Read
Prefetch Read Miss     Prefetch Tag Read
PHT Read Hit           PHT Read, Prefetch Data Read, Decompression Buffer Tag / Data Write
PHT Read Miss          PHT Read

The decompression buffer contains the complete decompressed 64-byte cache lines. It is implemented as a FIFO cache; a sketch is given below. This buffer should be large enough that data is not being evicted before it is required in the MEM stage. However, as the table gets larger, the power consumption and access time rise. Therefore, we need to determine the best size for this table experimentally.

3.2.2 MEM

In the MEM stage, for a load instruction, we now know the actual address of the data to be read from the cache. At this point in the new architecture, we read the decompression buffer to see if our data exists there, decompressed. We use the data if the PC and the address of the data in the buffer match the instruction that is now in the MEM stage. If the data is there, we can read it as fast as the access time of the buffer allows. The access time and power of the buffer depend on its size.

Table 3.5 Decompression Buffer Power Events

Decompression Buffer Event   Power Impact
Read Hit                     Buffer Tag Read, Buffer Data Read
Read Miss                    Buffer Tag Read

If we are using a stride or two-level prefetcher, we use this opportunity to update the stride and stride state, or the pattern history, of the entry in the prefetch table.
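A minimal sketch of such a FIFO decompression buffer, matching on both the load PC and the resolved address as described above, is given below. The depth and port names are illustrative assumptions; the modelled buffer is sized using CACTI in Chapter 4:

// Hypothetical sketch of the decompression buffer: a small FIFO of
// already-decompressed 64-byte lines. Entries are inserted after FETCH
// predictions and looked up in MEM on (PC, address) pairs.
module decomp_buffer_sketch #(
    parameter DEPTH = 8                      // illustrative power-of-two depth
)(
    input  wire         clk,
    input  wire         ins,                 // insert after a FETCH prediction
    input  wire [31:0]  ins_pc,
    input  wire [31:0]  ins_addr,
    input  wire [511:0] ins_line,
    input  wire [31:0]  pc,                  // lookup in MEM
    input  wire [31:0]  addr,
    output reg          hit,
    output reg  [511:0] line
);
    reg [31:0]  pcs   [0:DEPTH-1];
    reg [31:0]  addrs [0:DEPTH-1];
    reg [511:0] lines [0:DEPTH-1];
    reg [DEPTH-1:0]         vld = {DEPTH{1'b0}};
    reg [$clog2(DEPTH)-1:0] wr  = 0;         // FIFO write pointer

    integer i;
    always @* begin                          // fully associative match
        hit  = 1'b0;
        line = {512{1'b0}};
        for (i = 0; i < DEPTH; i = i + 1)
            if (vld[i] && pcs[i] == pc && addrs[i] == addr) begin
                hit  = 1'b1;                 // PC and address both match
                line = lines[i];
            end
    end

    always @(posedge clk) begin
        if (ins) begin                       // FIFO: oldest entry overwritten
            pcs[wr]   <= ins_pc;
            addrs[wr] <= ins_addr;
            lines[wr] <= ins_line;
            vld[wr]   <= 1'b1;
            wr        <= wr + 1'b1;          // wraps for power-of-two depths
        end
    end
endmodule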

If we must access the cache directly in the MEM stage, this is where we add entries into our prefetch table. However, we only do this if we are on the critical path. For example, we do not update our prefetch table if we are decompressing into the decompression buffer.

3.3 Hardware Design

Because read latency is such an important aspect of cache memory, especially in L1 cache, this compression scheme must be implemented at the architectural level (rather than at the software/compiler level). Therefore, it requires additional hardware to implement compression and decompression. In [4], the authors provide a high-level concept of the compression and decompression schemes. However, no design is presented or evaluated. It is important to verify that the new hardware required for this proposed architecture does not have power requirements that exceed the benefit of the architecture itself. Furthermore, it is important to define the delay requirements for decompression, as this has a direct impact on the performance of the CPU in the proposed architecture. For these reasons, we designed 64-byte compressor and decompressor units in Verilog to confirm the power consumption penalty as well as the hardware delay.

Compressor

The compressor unit contains separate hardware to evaluate the cache line for each type of compression scheme in parallel. This method prioritizes speed over resource usage. Because much of the hardware required to evaluate the different compression schemes is the same (largely based on adders/subtractors), a more resource-optimized approach would be to evaluate each method serially using the same hardware. In the future, it would be interesting to evaluate this approach for compression, as this task does not fall on the critical execution path. In the current design, we evaluate each compression scheme in parallel with the design shown in Figure 3.3.

To perform compression, the 64-byte cache line is divided into 2, 4, or 8-byte segments. The first segment is chosen as the base. Then, this base is subtracted from each of the remaining segments. The result of this subtraction is the array of deltas. A delta is stored as either a 1, 2, or 4-byte value, depending on the compression scheme being used. If all deltas can be stored without overflow, then the compression is valid. A behavioural sketch of this validity check appears below.
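For concreteness, the behavioural sketch below performs the Base 8 Delta 1 validity check: it subtracts the first 8-byte segment from every segment and flags the line compressible if each delta sign-extends from a single byte. This is an illustrative sketch, not the bdi.v module of Appendix A, and the segment ordering within the 512-bit vector is an assumption:

// Hypothetical sketch of the Base 8 Delta 1 validity check: every delta
// must fit in a signed byte. Segment 0 (taken here as line[63:0]) is the
// base; the first delta is therefore always zero, as in the real design.
module b8d1_check_sketch (
    input  wire [511:0] line,    // 64-byte uncompressed cache line
    output wire         valid,   // high if B8D1 compression is valid
    output wire [127:0] deltas   // eight 1-byte deltas (delta 0 is zero)
);
    wire [63:0] base = line[63:0];
    wire [7:0]  fits;
    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : seg
            wire [63:0] d = line[64*i +: 64] - base;
            // d fits in 8 signed bits iff bits [63:7] are pure sign
            // extension (all zeros or all ones).
            assign fits[i] = (d[63:7] == {57{d[7]}});
            assign deltas[8*i +: 8] = d[7:0];
        end
    endgenerate
    assign valid = &fits;
endmodule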

Figure 3.3 Compressor Design

Decompressor

The decompressor unit follows the same design, except the subtraction operation is replaced by simple addition. Unlike the compressor, it is important that we prioritize speed over resource usage in the decompressor, because our intent is to minimize the decompression latency. Figure 3.4 shows the design.

To perform decompression, the compressed cache line is divided into segments depending on the encoding of the data. The first 2, 4, or 8 bytes is the base. The base is carried to the decompressed line as-is. The remaining bytes are divided into 1, 2, or 4-byte deltas. These deltas are added to the base to create the decompressed segments. As a redundancy, the first delta is always zero (representing the delta of the first segment, which is the base).

Figure 3.4 Decompressor Design

The basis of the compressor and decompressor designs used for this project are 64-bit, 32-bit, and 16-bit adders. Compression requires a subtraction operation between 8, 4, and 2-byte blocks within a single cache line, depending on the compression scheme. Decompression works in the opposite manner: addition of 8, 4, and 2-byte bases with 1, 2, and 4-byte deltas restores the data to an uncompressed state. Large adders are discussed in depth in [16] and, as with the overall design approach, provide the opportunity to prioritize

speed over resource allocation. Ultimately, we selected the hierarchical carry-lookahead adder as the basis of the design due to its balance between speed and resource usage.

3.3.1 Hierarchical Carry-Lookahead Adder

The primary goal of this work is to avoid the latency of decompression on the critical execution path by using prefetching to perform decompression in parallel. However, when prefetching fails (i.e. compulsory misses during start-up, or when the predicted load address is incorrect), the processor will see the full penalty of decompression. Therefore, it is important to minimize this delay as much as possible. The delay of the decompressor depends on the design of the adders used in the new hardware.

Simple adders implement a full adder block for each bit and propagate carry bits serially through the circuit. While these circuits use a small number of gates, and therefore consume less power, they are very slow. Each bit requires the previous bits to be evaluated first, causing many gate delays. Alternatively, we can consider a full 64-bit carry-lookahead adder. Because none of the stages execute serially, this is one of the fastest adders we can implement here. However, because each bit requires the same information as all previous bits, the complexity and size of this hardware would become excessive.

Nesbit and Smith describe a hierarchical carry-lookahead adder that divides the carry-lookahead function into 8-bit blocks, which are each evaluated serially by propagating the carry bit through the circuit [15]. This approach is a trade-off between good speed and moderate resource usage. To describe this adder, we must look at the definition of the full adder. The truth table of the full adder is shown in Figure 3.5 and the Karnaugh map in Figure 3.6.

Figure 3.5 Truth Table for Full Adder

Figure 3.6 Karnaugh Map for Full Adder

From the truth table and K-map, one can see that the carry-out bit for a given stage can be determined as:

$$c_{i+1} = x_i y_i + x_i c_i + y_i c_i \qquad (3.2)$$

And the sum bit is the XOR of the three input signals:

$$s_i = \bar{x}_i \bar{y}_i c_i + \bar{x}_i y_i \bar{c}_i + x_i \bar{y}_i \bar{c}_i + x_i y_i c_i = x_i \oplus y_i \oplus c_i \qquad (3.3)$$

Factoring out the carry-in provides:

$$c_{i+1} = x_i y_i + (x_i + y_i) c_i \qquad (3.4)$$

From this equation, two important functions are defined: generate and propagate. The generate function is defined as:

$$g_i = x_i y_i \qquad (3.5)$$

The propagate function is defined as:

$$p_i = x_i + y_i \qquad (3.6)$$

leaving the relationship between the carry-out and these functions as:

$$c_{i+1} = g_i + p_i c_i \qquad (3.7)$$

So, let us look at the carry-out of our first 8-bit block, $c_8$:

$$c_8 = g_7 + p_7 c_7 \qquad (3.8)$$

Expanding this formula provides:

$$c_8 = g_7 + p_7 g_6 + p_7 p_6 g_5 + p_7 p_6 p_5 g_4 + p_7 p_6 p_5 p_4 g_3 + p_7 p_6 p_5 p_4 p_3 g_2 + p_7 p_6 p_5 p_4 p_3 p_2 g_1 + p_7 p_6 p_5 p_4 p_3 p_2 p_1 g_0 + p_7 p_6 p_5 p_4 p_3 p_2 p_1 p_0 c_0 \qquad (3.9)$$

From this expanded view, we can define the generate and propagate signals for the entire block:

$$G_0 = g_7 + p_7 g_6 + p_7 p_6 g_5 + p_7 p_6 p_5 g_4 + p_7 p_6 p_5 p_4 g_3 + p_7 p_6 p_5 p_4 p_3 g_2 + p_7 p_6 p_5 p_4 p_3 p_2 g_1 + p_7 p_6 p_5 p_4 p_3 p_2 p_1 g_0 \qquad (3.10)$$

And,

$$P_0 = p_7 p_6 p_5 p_4 p_3 p_2 p_1 p_0 \qquad (3.11)$$

Which results in,

$$c_8 = G_0 + P_0 c_0 \qquad (3.12)$$

Later stages are calculated in the same way:

$$c_{16} = G_1 + P_1 c_8 = G_1 + P_1 G_0 + P_1 P_0 c_0 \qquad (3.13)$$

In the modules that handle 2-byte bases, 16-bit adder/subtractors are used. In the modules that handle 4-byte bases, 32-bit adder/subtractors are used. Finally, in the modules that handle 8-byte bases, 64-bit adder/subtractors are used. Figure 3.7 shows the generic design of a 16-bit adder built from smaller 8-bit lookahead adders.

Figure 3.7 Adder Design

3.3.2 Implementation in Verilog

The above derivations are the basis of the compressor and decompressor designs in Verilog. Source files for these designs are included in Appendix A. The general structures of the designs are highlighted here to show the modularity of the designs and the significance of the adders.

Compressor

The structure of the compressor design in Verilog, including the test bench used to verify the design, is as follows:

testbench (compressor_testbench.v)
  compressor (compressor.v)
    bdi (bdi.v)
      hadder8 (hadder8.v)
    bdi32 (bdi32.v)
      hadder8 (hadder8.v)
    bdi16 (bdi16.v)
      hadder8 (hadder8.v)

Figure 3.8 HDL Structure of Compressor

The module hadder8 implements the 8-bit adder block from a hierarchical carry-lookahead adder. That is, it outputs the block generate (G_i) and block propagate (P_i) functions rather than the carry-out (c_{i+1}), as a typical ripple-carry adder does.
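To make the preceding derivation concrete, the sketch below shows what such an 8-bit block can look like: it produces the sum bits plus the block generate and propagate functions of equations (3.10) and (3.11). The module and port names are illustrative and do not reflect the actual interface of hadder8.v in Appendix A:

// Hypothetical sketch of an 8-bit carry-lookahead block in the spirit of
// hadder8: instead of a carry-out it exposes the block generate (G) and
// block propagate (P) functions, so a parent module can combine blocks
// hierarchically as in equations (3.12) and (3.13).
module cla8_sketch (
    input  wire [7:0] x,
    input  wire [7:0] y,
    input  wire       cin,   // carry into the block (c0)
    output wire [7:0] s,     // sum bits
    output wire       G,     // block generate, eq (3.10)
    output wire       P      // block propagate, eq (3.11)
);
    wire [7:0] g = x & y;    // bit generate,  eq (3.5)
    wire [7:0] p = x | y;    // bit propagate, eq (3.6)

    // Internal carries via c[i+1] = g[i] + p[i]c[i], eq (3.7).
    wire [8:0] c;
    assign c[0] = cin;
    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : carries
            assign c[i+1] = g[i] | (p[i] & c[i]);
        end
    endgenerate

    assign s = x ^ y ^ c[7:0];   // sum bit, eq (3.3)
    assign P = &p;               // p7 p6 ... p0

    // G equals the block carry-out with c0 forced to zero; unrolling the
    // recurrence with t[0] = 0 reproduces eq (3.10) exactly.
    wire [8:0] t;
    assign t[0] = 1'b0;
    generate
        for (i = 0; i < 8; i = i + 1) begin : blockgen
            assign t[i+1] = g[i] | (p[i] & t[i]);
        end
    endgenerate
    assign G = t[8];
endmodule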

The next module up implements as many of these 8-bit blocks as are necessary to perform the subtraction function. These modules are also responsible for inverting the input to turn hadder8 into a subtractor.

bdi implements a 64-bit adder, so 8 instances of hadder8, for base 8 compression
bdi32 implements a 32-bit adder, so 4 instances of hadder8, for base 4 compression
bdi16 implements a 16-bit adder, so 2 instances of hadder8, for base 2 compression

These modules evaluate all delta sizes in parallel. For example, bdi outputs three valid bits: one for base 8 delta 4, one for base 8 delta 2, and one for base 8 delta 1.

The top module, compressor, is responsible for instantiating blocks of bdi, bdi32, and bdi16 on the cache line to check all three base sizes in parallel:

8 instances of bdi for base 8 compression on a 512-bit cache line
16 instances of bdi32 for base 4 compression on a 512-bit cache line
32 instances of bdi16 for base 2 compression on a 512-bit cache line

Module compressor then takes all the valid bits and determines which compression scheme will be used, if any.

Testbench Strategy

To test the functionality of the compressor, testing was performed using Xilinx ISE WebPACK [17]. The input stimulus to the compressor module is the 512-bit uncompressed cache line. Test points were chosen as the boundary conditions for each of the six base/delta combinations, as well as simple zeros and repeating-values lines. A similar approach was taken to first test the Base-Delta-Immediate model in SimpleScalar. Test cases are shown in Table 3.6, and the compressor output is shown functioning in Figure 3.9.

Table 3.6 Compressor Test Cases

Test Case                   Base (S0=S3=...=Sn)  Delta (S1=S2)                        Expected Result
Zeros                       0 (0x0...0)          0 (0x0...0)                          Zeros pass.
Repeating Values            -1 (0xF...F)         -1 (0xF...F)                         Repeating values pass.
Base 8 Delta 1 Lower Fail   0 (0x0...0)          -129 (0xFFFFFFFFFFFFFF7F)            B8D1 fail. B8D2 pass.
Base 8 Delta 1 Lower Pass   0 (0x0...0)          -128 (0xFFFFFFFFFFFFFF80)            B8D1 pass.
Base 8 Delta 1 Upper Fail   0 (0x0...0)          128 (0x0000000000000080)             B8D1 fail. B8D2 pass.
Base 8 Delta 1 Upper Pass   0 (0x0...0)          127 (0x000000000000007F)             B8D1 pass.
Base 8 Delta 2 Lower Fail   0 (0x0...0)          -32,769 (0xFFFFFFFFFFFF7FFF)         B8D2 fail. B8D4 pass.
Base 8 Delta 2 Lower Pass   0 (0x0...0)          -32,768 (0xFFFFFFFFFFFF8000)         B8D2 pass.
Base 8 Delta 2 Upper Fail   0 (0x0...0)          32,768 (0x0000000000008000)          B8D2 fail. B8D4 pass.
Base 8 Delta 2 Upper Pass   0 (0x0...0)          32,767 (0x0000000000007FFF)          B8D2 pass.
Base 8 Delta 4 Lower Fail   0 (0x0...0)          -2,147,483,649 (0xFFFFFFFF7FFFFFFF)  Not compressible.
Base 8 Delta 4 Lower Pass   0 (0x0...0)          -2,147,483,648 (0xFFFFFFFF80000000)  B8D4 pass.
Base 8 Delta 4 Upper Fail   0 (0x0...0)          2,147,483,648 (0x0000000080000000)   Not compressible.
Base 8 Delta 4 Upper Pass   0 (0x0...0)          2,147,483,647 (0x000000007FFFFFFF)   B8D4 pass.
Base 4 Delta 1 Lower Fail   0 (0x0...0)          -129 (0xFFFFFF7F)                    B4D1 fail. B4D2 pass.
Base 4 Delta 1 Lower Pass   0 (0x0...0)          -128 (0xFFFFFF80)                    B4D1 pass.
Base 4 Delta 1 Upper Fail   0 (0x0...0)          128 (0x00000080)                     B4D1 fail. B4D2 pass.
Base 4 Delta 1 Upper Pass   0 (0x0...0)          127 (0x0000007F)                     B4D1 pass.
Base 4 Delta 2 Lower Fail   0 (0x0...0)          -32,769 (0xFFFF7FFF)                 Not compressible.
Base 4 Delta 2 Lower Pass   0 (0x0...0)          -32,768 (0xFFFF8000)                 B4D2 pass.
Base 4 Delta 2 Upper Fail   0 (0x0...0)          32,768 (0x00008000)                  Not compressible.
Base 4 Delta 2 Upper Pass   0 (0x0...0)          32,767 (0x00007FFF)                  B4D2 pass.
Base 2 Delta 1 Lower Fail   0 (0x0...0)          -129 (0xFF7F)                        Not compressible.
Base 2 Delta 1 Lower Pass   0 (0x0...0)          -128 (0xFF80)                        B2D1 pass.
Base 2 Delta 1 Upper Fail   0 (0x0...0)          128 (0x0080)                         Not compressible.
Base 2 Delta 1 Upper Pass   0 (0x0...0)          127 (0x007F)                         B2D1 pass.
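To illustrate the directed-test strategy, the sketch below drives one boundary-condition vector from Table 3.6 (Base 8 Delta 1 upper pass) into the compressor and displays the resulting encoding. The port list on the instantiated module is an assumption; the real testbench is compressor_testbench.v in Appendix A:

// Hypothetical testbench sketch: apply the B8D1 upper-bound pass vector
// from Table 3.6 and display the chosen encoding. The compressor port
// list (.line, .enc) is an illustrative assumption.
module tb_sketch;
    reg  [511:0] line;
    wire [3:0]   enc;

    compressor dut (.line(line), .enc(enc));   // assumed port names

    initial begin
        line          = 512'd0;                // base segment S0 = 0
        line[127:64]  = 64'h000000000000007F;  // S1 = +127 (B8D1 upper bound)
        line[191:128] = 64'h000000000000007F;  // S2 = +127
        // remaining segments stay zero, matching Table 3.6
        #10 $display("encoding = %h (expect B8D1 pass)", enc);
        $finish;
    end
endmodule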

Figure 3.9 Testbench Waveforms for Compressor in Xilinx ISE

Decompressor

The structure of the decompressor design in Verilog, including the test bench used to verify the design, is as follows:

testbench (decompressor_testbench.v)
  decompressor (decompressor.v)
    hadd (hadd.v)
      hadder8 (hadder8.v)
    hadd32 (hadd32.v)
      hadder8 (hadder8.v)
    hadd16 (hadd16.v)
      hadder8 (hadder8.v)

Figure 3.10 HDL Structure of Decompressor

The hadder8 module is identical to that of the compressor. The key differences between the decompressor and compressor are that the second-level modules (hadd, hadd32, and hadd16) do not convert hadder8 into a subtractor and do not have to evaluate delta overflow. These modules strictly build the 64-bit, 32-bit, and 16-bit hierarchical carry-lookahead adders.

hadd implements a 64-bit adder, so 8 instances of hadder8, for base 8 decompression
hadd32 implements a 32-bit adder, so 4 instances of hadder8, for base 4 decompression
hadd16 implements a 16-bit adder, so 2 instances of hadder8, for base 2 decompression

The top module, decompressor, has much more work to do than that of the compressor. This module must instantiate adders for each compression scheme, not just for each base.

8 instances of hadd for base 8 delta 1 decompression on a 128-bit compressed line
8 instances of hadd for base 8 delta 2 decompression on a 192-bit compressed line
8 instances of hadd for base 8 delta 4 decompression on a 320-bit compressed line
16 instances of hadd32 for base 4 delta 1 decompression on a 160-bit compressed line
16 instances of hadd32 for base 4 delta 2 decompression on a 288-bit compressed line
32 instances of hadd16 for base 2 delta 1 decompression on a 272-bit compressed line

With all these instances, module decompressor attempts to decompress an input cache line using all 8 methods at once, and outputs a 512-bit decompressed cache line for each. Only the line with an asserted valid bit contains the correct data. Module decompressor sets this valid bit based on the input encoding bits.
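The final selection step can be pictured as a simple multiplex on the encoding bits. The C sketch below is illustrative only; the actual encoding values and ordering are defined by the Verilog source in Appendix A:

typedef struct { unsigned char data[64]; } line512;

/* Illustrative: pick the one valid 512-bit result out of the eight
 * parallel decompression outputs, keyed by the stored encoding bits.
 * The scheme order (0: zeros, 1: repeating, 2: B8D1, 3: B8D2, 4: B8D4,
 * 5: B4D1, 6: B4D2, 7: B2D1) is an assumption made for this sketch. */
line512 select_output(unsigned encoding, const line512 results[8], int *valid)
{
    *valid = (encoding < 8);            /* one valid bit per scheme */
    return results[*valid ? encoding : 0];
}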

Chapter 4

Simulation Methodology

In this chapter, we discuss the method for evaluating the performance of the new compression and prefetching architecture. The tools required to perform this analysis are discussed, as well as the environment used to perform testing.

4.1 Methodology

In this section, we describe four key tools used in performing this work: SimPoint, CACTI, SimpleScalar, and Wattch.

SimPoint [18] is used to determine the intervals that can be executed to represent the full execution of a given program. We use SimPoint as a means of reducing the simulation time and the size of outputs from the simulator without sacrificing the behavior of the benchmarks used. The chosen simulation points are tabulated, and the percent error of each is determined based on a comparison of CPI between the weighted simulation points and the full execution of the benchmark.

CACTI [19] is used to generate the static and dynamic power models for the various cache configurations used for this thesis. In addition, we use CACTI to model the prefetch tables and the new decompression buffer that is required for the proposed architecture. Configurations and power results are presented as they are used as inputs into the simulator.

SimpleScalar (specifically a branch called Wattch), and the changes introduced in this work, are used to model the behaviour of the cache compression and prefetching architectures. A summary of the simulation approach is shown in Figure 4.1 and discussed in detail in the following sections.

Figure 4.1 Simulation Flow Diagram

4.1.1 SimPoint

For this thesis, parts of the SPEC CPU 2000 benchmark suite are used. Full runs of these benchmarks can take days even in a simple performance simulator (e.g. sim-fast). Running them in a detailed simulator such as sim-outorder, and especially in the modified version that we have developed, can take much longer. Therefore, it was necessary to identify smaller intervals of these benchmarks that could be executed. SimPoint is a tool that was created to choose simulation intervals that best represent the full program execution. SimPoint does this in four steps: Basic Block Vector (BBV) Analysis, Random Projection, Phase Classification, and Simulation Point Selection [18].

BBV Analysis

The Basic Block Vector (BBV) contains information about the behaviour of the program with respect to basic blocks. A basic block is a section of the program with one entry point and one exit point that executes from start to finish. The BBV itself is an array of elements representing the frequency with which each basic block is entered in a given execution interval (weighted by the number of instructions in that block). For this thesis, BBV information for the SPEC CPU 2000 benchmarks is created using the tool Sim-Fast BBV Tracker. This tool is provided by the creators of SimPoint and is a modified version of SimpleScalar that generates the BBV files during execution of sim-fast.
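The tracker's job can be sketched in a few lines of C (illustrative only; the real Sim-Fast BBV Tracker and the exact .bb file format differ in detail):

#include <stdio.h>
#include <string.h>

#define MAX_BLOCKS (1 << 20)
#define INTERVAL 100000000ULL          /* 100-million-instruction intervals */

static unsigned long long bbv[MAX_BLOCKS]; /* one counter per basic block */
static unsigned long long insts;

/* Called on every basic-block entry with the block's id and size. */
void on_basic_block(unsigned id, unsigned block_insts, FILE *out)
{
    bbv[id] += block_insts;            /* frequency weighted by block size */
    insts += block_insts;
    if (insts >= INTERVAL) {           /* emit one vector per interval */
        fputc('T', out);
        for (unsigned i = 0; i < MAX_BLOCKS; i++)
            if (bbv[i])
                fprintf(out, ":%u:%llu ", i, bbv[i]);
        fputc('\n', out);
        memset(bbv, 0, sizeof bbv);
        insts = 0;
    }
}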

Random Projection, Phase Classification, and Simulation Point Selection

SimPoint analyses the BBV file generated by the previous step and chooses a representative of each phase by finding the interval closest to the centre of the phase. Then, SimPoint determines the weight of that simulation point based on the number of intervals in that phase of the program's execution.

SimPoint Results

For this work, an interval of 100 million instructions is chosen for determining simulation points. The selection of 100-million-instruction intervals is a balance between 1 billion, which generates very large output data, and 10 million, which is too small to run without performing a warmup routine. The authors state that 100 million is an appropriately sized interval to avoid the need to bring the simulations to a warmup state [16]. The maximum number of clusters in the k-means algorithm [20] executed in SimPoint was chosen based on the error produced by the resultant simulation points. Choosing a single cluster would result in the simplest implementation; that is, no weighting or combination of results would be necessary. This method, however, does not yield good results, as most programs contain multiple phases. The authors use the percent error in CPI between the full execution and the weighted simulation points as a means of evaluating the accuracy of the method. Rather than arbitrarily choosing a maximum number of clusters, we use this same method to evaluate the error, as the authors do in [21]. Given a set of simulation points and weights, the weighted CPI is calculated as [22]:

CPI = Weight_1 \cdot CPI_1 + Weight_2 \cdot CPI_2 + \cdots + Weight_n \cdot CPI_n \qquad (4.1)

Using 164.gzip as an example, for M = 3:

Table 4.1 164.gzip CPI Values by Simulation Point

Simulation Point | Weight | CPI

CPI = ( )(0.5884) + ( )(0.5289) + ( )(0.5790) \qquad (4.2)

Comparing this with the CPI result from the full execution, we can calculate the percent error:

\%\,Error = \frac{|CPI_{Full} - CPI_{SimPoint}|}{CPI_{Full}} = 0.37\% \qquad (4.3)

Table 4.2 contains this error calculation for 18 of the SPEC CPU 2000 benchmarks, for a maximum number of clusters (M) from 1 to 3.

Table 4.2 SimPoint Error by Maximum Number of Clusters

Benchmark | M=1 | M=2 | M=3
164.gzip | 0.68% | 0.19% | 0.37%
168.wupwise | 2.69% | 2.79% | 0.45%
171.swim | 49.28% | 8.16% | 0.02%
172.mgrid | 2.43% | 0.57% | 0.10%
173.applu | 7.71% | 5.72% | 0.93%
175.vpr | 2.84% | 1.56% | 3.07%
176.gcc | 3.56% | 6.62% | 2.53%
177.mesa | 0.29% | 0.07% | 0.03%
179.art | 1.65% | 0.09% | 0.05%
181.mcf | 23.20% | 2.12% | 4.39%
183.equake | 1.48% | 0.14% | 0.48%
188.ammp | 4.85% | 0.00% | 1.77%
197.parser | 6.84% | 14.20% | 5.14%
253.perlbmk | 0.15% | 0.04% | 0.29%
255.vortex | 5.69% | 4.44% | 1.48%
256.bzip2 | % | 13.80% | 14.64%
300.twolf | 4.28% | 0.69% | 0.00%
301.apsi | 11.60% | 11.64% | 12.25%
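The entries in Table 4.2 follow directly from Equations (4.1) and (4.3); in C, the calculation is simply (helper names are ours):

#include <math.h>

/* Weighted CPI per Equation (4.1); weights are expected to sum to 1. */
double weighted_cpi(const double *weight, const double *cpi, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += weight[i] * cpi[i];
    return sum;
}

/* Percent error per Equation (4.3). */
double percent_error(double cpi_full, double cpi_simpoint)
{
    return 100.0 * fabs(cpi_full - cpi_simpoint) / cpi_full;
}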

As can be seen from the data, the amount of error is significant in the benchmarks 171.swim, 181.mcf, 197.parser, 256.bzip2, and 301.apsi. The authors of SimPoint evaluated the SPEC CPU 2000 benchmarks using a maximum number of clusters equal to 10 for intervals of 100 million instructions. In their data, the maximum percent error was 5.47%. We therefore evaluated the 18 benchmarks with these same parameters. The results are shown in Table 4.3.

Table 4.3 SimPoint Error

Benchmark | Instructions (Full) | CPI (Full) | CPI (SimPoint) | % Error
(rows for the same 18 benchmarks, 164.gzip through 301.apsi)

The maximum percent error from this method is 5.16%, which is less than the 5.47% in the authors' results, but quite similar. Therefore, for the purposes of this thesis, all benchmarks are executed for a maximum of 10 intervals of 100 million instructions, as generated by SimPoint in the method discussed above. The resulting simulation points and their weights are shown in Table 4.4.

Table 4.4 SimPoint Results (maximum of 10 clusters)

Benchmark | Simulation Point | Weight
(simulation points and their weights for each of the 18 benchmarks, 164.gzip through 301.apsi)

4.1.2 CACTI

CACTI is a cache and memory access time, cycle time, area, leakage power, and dynamic energy modelling tool [23]. For the purposes of this thesis, the access time, leakage power, and dynamic energy calculations performed by CACTI are the focus. For specific cache configurations, the access time, leakage power, and dynamic energy parameters are determined and used as input into the simulator. CACTI 6.5 was built from source and used for this thesis. One modification is made to CACTI to output the dynamic energy (tag, data, and total) for the write operation. The details of this change, and of building and using CACTI, are not included in this report.

CACTI Results

From the output of CACTI, the following lines are particularly relevant for this thesis and are used as input into the simulator:

Access time (ns): ...
Data array: Total dynamic read energy/access (nJ): ...
Data array: Total dynamic write energy/access (nJ): ...
Total leakage read/write power of a bank (mW): ...
Tag array: Total dynamic read energy/access (nJ): ...
Tag array: Total dynamic write energy/access (nJ): ...
Total leakage read/write power of a bank (mW): ...

Figure 4.2 CACTI Output

Table 4.5 shows all the L1 cache configurations used for this thesis. The first configuration in the table represents the baseline scheme with no compression. The next two represent a compressed cache of half the size of the baseline. The tag and data banks for the compressed scheme are modelled in separate runs in CACTI.

Table 4.5 CACTI L1 Cache Configurations and Power Results

Configuration | Data (bytes) | Assoc. | Tag (bits) | Data Read (nJ) | Data Write (nJ) | Data Static (mW) | Tag Read (nJ) | Tag Write (nJ) | Tag Static (mW) | Access Time (ns) @ 3GHz
BASELINE ( )
COMPRESSED DATA
COMPRESSED TAG ( ) ( )

Table 4.6 shows all the L2 cache configurations used for this thesis.

Table 4.6 CACTI L2 Cache Timing

Configuration | Size (bytes) | Assoc. | Tag (bits) | Access Time (ns) @ 3GHz
BASELINE default ( )

CACTI was also used to model the energy consumption of the prefetch tables. The data size is assumed to be 4 bytes per address in this model, to store the target address of the load instruction.

Table 4.7 CACTI Prefetch Table Configurations and Power Results

Configuration | Size (bytes) | Tag (bits) | Data Read (nJ) | Data Write (nJ) | Data Static (mW) | Tag Read (nJ) | Tag Write (nJ) | Tag Static (mW) | Access Time (ns) @ 3GHz
LO 128
LO 1024
LO 2048 ( ) (3.3873) ( )
STRIDE 128
STRIDE 1024
STRIDE 2048
HYBRID S/LO 128
HYBRID S/LO 1024
HYBRID S/LO 2048
2LEVEL 128 (25+1+4)
2LEVEL 1024 (22+1+4)
2LEVEL 2048 (21+1+4)
HYBRID 2L/S 128
HYBRID 2L/S 1024
HYBRID 2L/S 2048

For two-level prefetching, we require a second table called the Pattern History Table (PHT). This table is indexed by the access pattern and stores an integer value for each of the data values stored in the prefetch table.

Table 4.8 CACTI Pattern History Table Power Results

Configuration | Size (bytes) | Tag (bits) | Data Read (nJ) | Data Write (nJ) | Data Static (mW) | Tag Read (nJ) | Tag Write (nJ) | Tag Static (mW) | Access Time (ns) @ 3GHz
PHT 2D4P ( )

Lastly, a decompression buffer is considered, with 64-byte data and 1K sets. In CACTI, this buffer is modeled as an L1 cache.

Table 4.9 CACTI Decompression Buffer Power Results

Configuration | Size (bytes) | Tag (bits) | Data Read (nJ) | Data Write (nJ) | Data Static (mW) | Tag Read (nJ) | Tag Write (nJ) | Tag Static (mW) | Access Time (ns) @ 3GHz
BUFFER 1K ( ) ( )

4.1.3 SimpleScalar

To be able to measure the benefit of implementing cache compression with a prefetching mechanism, we use SimpleScalar to model the performance of the CPU. SimpleScalar is an open-source processor modelling tool that is meant to be built upon for specific applications such as this work. SimpleScalar is written in C. The tool can emulate different instruction sets, including Alpha, ARM, and x86, but most importantly PISA [24]. The binaries for the SPEC CPU 2000 benchmarks used for this work are compiled to PISA.

Wattch is a specific branch of SimpleScalar for analyzing and optimizing power consumption in the architecture of a CPU [5]. Wattch provides us with a mechanism to compare the power consumption of the cache and the new hardware with the overall power consumption of the CPU.

Compression

To confirm the feasibility of this work, we check how many cache lines within the 18 benchmarks are compressible using the Base-Delta-Immediate compression scheme. To do this, we model Base-Delta-Immediate in SimpleScalar. In this compression scheme, cache lines are compressed before they are written to the cache. Cache lines are written when they miss the cache or on a write hit. Therefore, we must add functionality to the simulator where these events occur, as mentioned previously in Table 3.1.

Zeros

The check for zeros compressibility is straightforward. We iterate through all elements of the cache line array and flag zeros compressibility as not possible if any element does not equal zero.

Repeating Values

In the scheme proposed by the authors in [4], repeating 8-byte values are considered. Therefore, we check compressibility for this while checking for the other base 8 schemes. Starting at element zero, we concatenate the values of the next seven bytes to the current byte, then iterate through the cache line array with a stride of 8. At each iteration through the array, we check if the new 8-byte value equals the 8-byte value at element 0. If any element does not equal element zero, we flag repeating-values compressibility as not possible.
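A minimal C sketch of these two checks (illustrative; the simulator's actual loops differ in detail):

#include <stdint.h>
#include <string.h>

/* Zeros check: every byte of the 64-byte line must be zero. */
int is_zeros(const uint8_t line[64])
{
    for (int i = 0; i < 64; i++)
        if (line[i] != 0)
            return 0;
    return 1;
}

/* Repeating-values check: every aligned 8-byte value must equal the first. */
int is_repeating(const uint8_t line[64])
{
    uint64_t first, v;
    memcpy(&first, line, 8);           /* concatenate bytes 0..7 */
    for (int i = 8; i < 64; i += 8) {  /* stride of 8 through the line */
        memcpy(&v, line + i, 8);
        if (v != first)
            return 0;
    }
    return 1;
}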

Base-Delta-Immediate

Separate arrays and separate loops handle the compressibility check for each size of base. Base 8 behaves as described above. Base 4 iterates through the array in strides of 4, and base 2 in strides of 2. To verify the compressibility of a Base-Delta-Immediate scheme, we check that each delta does not overflow its datatype when referenced either from the base or from zero (immediate). If the second option is taken (immediate), then the immediate flag is set for that iteration. Table 4.10 shows the overflow parameters of each delta.

Table 4.10 Delta Datatype and Overflow Information

Delta | Data Type | Floor | Ceiling
1 | signed char | -128 | 127
2 | signed short | -32,768 | 32,767
4 | signed int | -2,147,483,648 | 2,147,483,647

Validation of the Compression Model

To confirm that we have correctly modeled Base-Delta-Immediate compression in SimpleScalar, we write a program to exercise the boundary condition of each of the compression schemes, cross-compile that program to the PISA instruction set, run the program through our model, and verify the results. This program consists of 26 arrays, each containing 64 1-byte elements. Those arrays contain the cache line values that exercise the boundaries of the model. Table 4.11 shows these values.

Table 4.11 Boundary Conditions for Compression

array | description | significant value
char z[64] | Zeros | 0 (0x0...0)
char r[64] | Repeating Values | -1 (0xF...F)
char b8d1lf[64] | Base 8 Delta 1 Lower Fail | -129 (0xFFFFFFFFFFFFFF7F)
char b8d1lp[64] | Base 8 Delta 1 Lower Pass | -128 (0xFFFFFFFFFFFFFF80)
char b8d1uf[64] | Base 8 Delta 1 Upper Fail | 128 (0x0000000000000080)
char b8d1up[64] | Base 8 Delta 1 Upper Pass | 127 (0x000000000000007F)
char b8d2lf[64] | Base 8 Delta 2 Lower Fail | -32,769 (0xFFFFFFFFFFFF7FFF)
char b8d2lp[64] | Base 8 Delta 2 Lower Pass | -32,768 (0xFFFFFFFFFFFF8000)
char b8d2uf[64] | Base 8 Delta 2 Upper Fail | 32,768 (0x0000000000008000)
char b8d2up[64] | Base 8 Delta 2 Upper Pass | 32,767 (0x0000000000007FFF)
char b8d4lf[64] | Base 8 Delta 4 Lower Fail | -2,147,483,649 (0xFFFFFFFF7FFFFFFF)
char b8d4lp[64] | Base 8 Delta 4 Lower Pass | -2,147,483,648 (0xFFFFFFFF80000000)
char b8d4uf[64] | Base 8 Delta 4 Upper Fail | 2,147,483,648 (0x0000000080000000)
char b8d4up[64] | Base 8 Delta 4 Upper Pass | 2,147,483,647 (0x000000007FFFFFFF)
char b4d1lf[64] | Base 4 Delta 1 Lower Fail | -129 (0xFFFFFF7F)
char b4d1lp[64] | Base 4 Delta 1 Lower Pass | -128 (0xFFFFFF80)
char b4d1uf[64] | Base 4 Delta 1 Upper Fail | 128 (0x00000080)
char b4d1up[64] | Base 4 Delta 1 Upper Pass | 127 (0x0000007F)
char b4d2lf[64] | Base 4 Delta 2 Lower Fail | -32,769 (0xFFFF7FFF)
char b4d2lp[64] | Base 4 Delta 2 Lower Pass | -32,768 (0xFFFF8000)
char b4d2uf[64] | Base 4 Delta 2 Upper Fail | 32,768 (0x00008000)
char b4d2up[64] | Base 4 Delta 2 Upper Pass | 32,767 (0x00007FFF)
char b2d1lf[64] | Base 2 Delta 1 Lower Fail | -129 (0xFF7F)
char b2d1lp[64] | Base 2 Delta 1 Lower Pass | -128 (0xFF80)
char b2d1uf[64] | Base 2 Delta 1 Upper Fail | 128 (0x0080)
char b2d1up[64] | Base 2 Delta 1 Upper Pass | 127 (0x007F)

Running our benchmark through our compression model in SimpleScalar, we get the following result, which matches the expected behaviour of the compressor hardware.

sim_num_byte_reads 187 # total number of byte reads
sim_num_zero_blocks 1 # total number of zero block reads
sim_num_repeats_blocks 1 # total number of repeats block reads
sim_num_del81_blocks 2 # total number of base 8 delta 1 reads
sim_num_del41_blocks 2 # total number of base 4 delta 1 reads
sim_num_del82_blocks 4 # total number of base 8 delta 2 reads
sim_num_del21_blocks 2 # total number of base 2 delta 1 reads
sim_num_del42_blocks 4 # total number of base 4 delta 2 reads
sim_num_del84_blocks 4 # total number of base 8 delta 4 reads
sim_num_uncompr_blocks 167 # total number of uncompressed reads

Figure 4.3 Compression Model Verification Results
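The shape of the validation program is easy to sketch in C; only two of the 26 arrays are shown here, and the initialization helper is ours rather than the program's actual code:

#include <string.h>

/* Two of the 26 boundary arrays from Table 4.11. Each line's base words
 * are zero, and words S1 and S2 hold the boundary delta, as in Table 3.6. */
static char b8d1lp[64];   /* Base 8 Delta 1 Lower Pass: delta -128 */
static char b8d1lf[64];   /* Base 8 Delta 1 Lower Fail: delta -129 */

static void set_word8(char *line, int word, long long value)
{
    memcpy(line + 8 * word, &value, 8);   /* host-endian 8-byte store */
}

int main(void)
{
    set_word8(b8d1lp, 1, -128LL);   /* fits a signed char: B8D1 passes */
    set_word8(b8d1lp, 2, -128LL);
    set_word8(b8d1lf, 1, -129LL);   /* overflows delta 1: the B8D2 case */
    set_word8(b8d1lf, 2, -129LL);
    /* ...the remaining 24 arrays are filled the same way... */
    return b8d1lp[8] + b8d1lf[8];   /* touch the lines so they are read */
}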

Compression with Prefetching Model

To implement the compression and prefetching model, we add two key behaviours to the simulator:

(1) Read the prefetch table during fetch of a load instruction and, if we hit the table and return a prediction address, add the address to the decompression buffer along with a ready time equal to the current cycle plus all delays that block that data. Specifically affecting the decompression buffer are the prefetch table access time, the L1 data cache access time, and the decompression latency.

(2) Read the decompression buffer before accessing the L1 data cache to confirm whether the correct address was there. Before running the cache_access() function for the L1 data cache, we check if we have a correct PC and address in the decompression buffer. If we do, then our prefetch function will have correctly predicted the load address. If the PC and address are not correct in the decompression buffer, then we experience the full decompression latency and update our prefetch table information.

Stage Delays

Baseline SimpleScalar and Wattch implement single-cycle pipeline stages. This is not a realistic model for many processors. Therefore, we implement a mechanism to include options for additional delays in each stage. The delays used in this work are based on [25]. To implement this, a new queue is added to store instructions delayed in the pipeline. This queue is monitored at the end of each stage and submits operations each cycle as they become ready.

VCD Output

A critical part of the power analysis in this work is to compare the power consumption of the new hardware to that of the processor and the cache. For this, we use Cadence Genus. To achieve an accurate dynamic power model in Cadence Genus, we need to set the actual input characteristics of the cache. To do this, we generate what is called a Value Change Dump (VCD) stimulus for the hardware, for each of the simulation points run in the simulator. The header information of the compressor VCD file is shown in Figure 4.4.

$date :06:21 EDT $end
$version VCD version 0.1 $end
$timescale 1 ps $end
$scope module compressor $end
$var wire 512 ! x $end
$upscope $end
$enddefinitions $end
#0
$dumpvars
b0 !
$end

Figure 4.4 Compressor VCD Header

Then, following the header, for each instance of compression (each L1 data cache miss or write hit), a timestamped update is written into the VCD file. The timestamp is calculated using the frequency option for the CPU and the number of cycles:

t_{VCD,ps} = \frac{1000 \cdot sim\_cycle}{f_{GHz}} \qquad (4.4)

The timestamp is written to the VCD file, followed by the value of the compressor input as a 512-character ASCII string. Figure 4.5 shows an example of this.

#34632
b... !

Figure 4.5 Compressor VCD ASCII Value

For the decompressor, a carry bit is initialized to zero and the encoding bits are updated every time. The decompressor header must declare different variables, as shown in Figure 4.6. An example of a timestamped update to the decompressor VCD file is shown in Figure 4.7.

$var wire 512 ! x $end
$var wire 1 # carry $end
$var wire 4 $ encoding $end

Figure 4.6 Decompressor VCD Header Variables

#...
b... !
b0111 $

Figure 4.7 Decompressor VCD ASCII Value

SimPoint Implementation

Baseline SimpleScalar and Wattch allow for fast-forwarding through a benchmark and then running a certain block of instructions. This is implemented through the runtime options -fastfwd and -max:inst. The first is implemented as a signed integer and the second as an unsigned integer. This means the deepest interval that can be run for any program is from instruction 2,147,483,648 to 6,442,450,942. Looking back at the results of our SimPoint analysis in Table 4.4, we see that our largest simpoint is 8395, for benchmark 172.mgrid. This means we are required to execute from instruction 839,500,000,001 to 839,600,000,000.

To accommodate this, we implement two new runtime options, -interval and -simpoint. The first is simply a renaming of -max:inst. The second, -simpoint, is the multiple of interval that fastfwd should be set to. By declaring fastfwd as a 64-bit unsigned integer, we can accommodate all of our simpoints.

Technology Scaling

Out of the box, Wattch is based on 180nm technology parameters provided in an early technical report for CACTI. In power.h, Wattch is set up to allow for configuration and scalability of the CMOS feature size, as well as the CPU frequency used for calculations. To accommodate our chosen frequency of 3 GHz and the 90nm CMOS used to create the hardware for this thesis, the following updates were made to power.h:

Macro TECH_POINT10 contains scaling definitions to bring the 180nm parameters down to a 100nm equivalent. Scaling exists for wire capacitance, wire resistance, feature length, feature area, voltage, threshold voltage, sense voltage, and overall power scaling.

Macro FUDGEFACTOR is used to scale results, further beyond that of the chosen tech point, from the CACTI function calculate_time that is built into Wattch. FUDGEFACTOR is given by dividing the defined technology size by the desired value:

FUDGEFACTOR = \frac{TECH_{DEFINED}}{TECH_{DESIRED}} = \frac{100\,nm}{90\,nm} \qquad (4.5)

Macro Mhz is used to define the frequency used throughout the power calculations.
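Two small C helpers summarize the mechanics described above; variable names are illustrative rather than the simulator's actual code:

#include <stdint.h>

/* VCD timestamp per Equation (4.4): picoseconds from cycles and frequency. */
uint64_t vcd_timestamp_ps(uint64_t sim_cycle, double f_ghz)
{
    return (uint64_t)(1000.0 * (double)sim_cycle / f_ghz);
}

/* Fast-forward count: -simpoint is the multiple of -interval to skip.
 * Held in 64 bits, so simpoint 8395 x 100M = 839,500,000,000 fits. */
uint64_t fastfwd_count(uint64_t simpoint, uint64_t interval)
{
    return simpoint * interval;
}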

4.1.4 Environment

Across the 18 benchmarks used, with a maximum SimPoint cluster size of 10, there are 122 total simulation points to be executed per configuration. With 12 configurations, there are 1464 instances of the simulator to be executed per simulation batch. Benchmarking is performed on Lakehead University's 240-core Linux cluster, Wesley [26]. Jobs are queued to the cluster using Torque. The output of a batch on Wesley is 1464 simulation result reports and 2928 Value Change Dump (VCD) files. These 4392 files are moved from Wesley to Lakehead University's CMC server. Executed via scripting in Tcl, the hardware is synthesized and mapped for the compressor in Cadence Genus; then the compressor VCD file for each simulation point is input into Genus, and the associated report file is appended with the dynamic power analysis results. This process is then repeated for the decompressor.

4.2 Synthesis and Static Power Analysis

In this section, the synthesis of this design using Cadence Genus is discussed, as well as the timing and power results and the selection of the 90nm Cadence Generic PDK. Place and route is presented for this design using Cadence Innovus.

Initial Analysis and PDK Selection

To evaluate the speed and power consumption of the hardware, the Verilog design files are synthesized using Cadence Genus Synthesis Solution. Two process design kits were considered for this thesis: the Cadence Generic 90nm PDK and FreePDK 45nm. Although power consumption is the focus of this work, the selection of the PDK was based on the delay of the decompressor, which lies on the critical execution path. Table 4.12 shows the results of this initial analysis.

Table 4.12 Initial Static Power Analysis of Decompressor by PDK

Library | Delay (ps) | Static Power (nW)
Cadence 90nm Generic PDK v3.3 (fast.lib)
Cadence 90nm Generic PDK v3.3 (typical.lib)
Cadence 90nm Generic PDK v3.3 (slow.lib)
Cadence 90nm Generic PDK v3.3 (ss.0v75.lib)
Cadence 90nm Generic PDK v3.3 (ss.0v67.lib)
FreePDK 45nm v1.4 (gscl45nm.lib)

As can be seen from the results, the fast library from the Cadence 90nm GPDK is the fastest. At 655ps, even with any overhead that has not been accounted for, it is reasonable to assume that this hardware can provide decompression within 4 cycles at 3GHz. This assumption carries into the simulator discussed in the following section of this report.

Static Power

Static power analysis in Cadence Genus is straightforward. The Liberty Timing File (.lib) from the Process Design Kit (PDK) defines a parameter, cell_leakage_power, which is static power on a per-cell basis. After synthesizing the design, we can run the gates report to determine how many instances of each cell are used in the synthesized design.

Gate | Instances | Area | Library
AND2X ... fast
...
XNOR2X ... fast
total

Figure 4.8 Genus Gates Report for Compressor (Condensed)

From this, we can validate the Genus static power calculation. Table 4.13 shows this validation for the compressor, and Table 4.14 for the decompressor.

Table 4.13 Compressor Static Power Determination

gate | cell_leakage_power | instances | total static power (nW)
(gates used: AND2X, AND4X, AND4XL, AO21X, AO22X, AOI211XL, AOI21XL, AOI221XL, AOI22XL, AOI2BB1XL, AOI31XL, AOI32XL, AOI33XL, CLKINVX, CLKXOR2X, INVXL, MX2X, MXI2XL, NAND2BXL, NAND2XL, NAND3BXL, NAND3XL, NAND4BXL, NAND4XL, NOR2BXL, NOR2XL, NOR3BXL, NOR3XL, NOR4BXL, NOR4XL, OA21X, OAI211XL, OAI21XL, OAI221XL, OAI22XL, OAI2BB1XL, OAI31XL, OR2X, OR2XL, OR3X, OR4X, OR4XL, XNOR2X)

Total Compressor Static Power: nW

Table 4.14 Decompressor Static Power Determination

gate | cell_leakage_power | instances | total static power (nW)
(gates used: ADDHXL, AND2X, AND4X, AO21X, AO22X, AOI211XL, AOI21XL, AOI221XL, AOI222XL, AOI22XL, AOI2BB1XL, AOI31XL, AOI32XL, CLKINVX, CLKXOR2X, INVXL, MXI2XL, NAND2BX, NAND2BXL, NAND2XL, NAND3BX, NAND3BXL, NAND3XL, NAND4BBXL, NAND4BXL, NAND4XL, NOR2BX, NOR2BXL, NOR2XL, NOR3BXL, NOR3XL, NOR4BXL, NOR4XL, OA21X, OAI211XL, OAI21XL, OAI221XL, OAI22XL, OAI2BB1XL, OR2X, OR4X, TLATXL, XNOR2X, XNOR2XL, XOR2XL)

Total Decompressor Static Power: nW
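The hand validation behind Tables 4.13 and 4.14 is a per-cell summation, sketched below in C (struct layout and names are ours):

/* Total static power = per-cell leakage from the .lib file times the
 * instance count from the Genus gates report, summed over all cell types. */
struct gate_row {
    const char *cell;        /* cell name from the gates report          */
    double      leakage_nw;  /* cell_leakage_power from the .lib         */
    int         instances;   /* instance count from the gates report     */
};

double total_static_power_nw(const struct gate_row *rows, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += rows[i].leakage_nw * rows[i].instances;
    return total;
}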

We can then compare these calculated values to the Genus results for the simulation runs in SimpleScalar, described in detail later in this report. Using 164.gzip as an example, we can observe the output behaviour of Cadence Genus with regard to the static power consumption of the compressor hardware. The results are shown in Table 4.15.

Table 4.15 164.gzip Compressor Static Power Values by Simulation Point

Simulation Point | Weight | P_Static (nW)

Notice from the results that the static power analysis is not affected by changing the input, which is an expected behaviour. Therefore, the following values are considered constant and valid and will be used throughout the remainder of this thesis:

Table 4.16 Static Power for Compressor and Decompressor

Device | P_Static (nW)
Compressor
Decompressor

4.3 Dynamic Power Analysis

Dynamic power consists of three components: switching power, short-circuit power, and glitching power [27]. Genus groups these components into net power and internal power. Dynamic power is generally calculated as [27, 16]:

P_D = f C V_{DD}^2 \qquad (4.6)

Net Power

Net power is the power consumed in a gate when charging the output load voltage from low to high. Therefore, Genus calculates net power as:

P_{Net} = f_{toggle} \, C_L \, V_{DD}^2 \qquad (4.7)

where f_toggle is the toggle rate calculated by Genus and C_L is the sum of the load capacitances connected to the net.

Internal Power

Internal power is the product of frequency and "arc" energy for each input/output arc. Genus calculates internal power as:

P_{Internal} = \alpha_{A \to Y} E_{A \to Y} + \alpha_{B \to Y} E_{B \to Y} + \cdots + \alpha_{n \to Y} E_{n \to Y} \qquad (4.8)

where \alpha_{A \to Y} is the arc activity calculated by Genus between input A and output Y, and E_{A \to Y} is the energy of the arc determined by Genus based on the Liberty Timing File (.lib) for the chosen PDK.

Because dynamic power depends on the input stimulus to the compressor and decompressor modules, the most accurate way of modelling the dynamic power of these units is to use actual cache lines from the chosen benchmarks. Each time the compressor or decompressor must be accessed in the simulator, data is written to a file in Value Change Dump (VCD) format. This data is then input into Cadence Genus. Table 4.17 through Table 4.20 provide an overview of the dynamic power results for the compressed configuration. The weighted dynamic power calculation for the compressor, using 164.gzip as an example, is shown in Table 4.17.

Table 4.17 164.gzip Compressor Dynamic Power Values by Simulation Point

Simulation Point | Weight | P_Internal (nW) | P_Net (nW) | P_Dynamic (nW)

P_{Dynamic} = \sum_i Weight_i \cdot P_{Dynamic,i} = (\ )\,nW \qquad (4.9)

The dynamic power results for the compressor, for each benchmark, are shown in Table 4.18.

Table 4.18 Compressor Dynamic Power Results from Cadence Genus

Benchmark | P_Static (nW) | P_Internal (nW) | P_Net (nW) | P_Dynamic (nW)
164.gzip
168.wupwise
171.swim
172.mgrid
173.applu
175.vpr
176.gcc
177.mesa
179.art
181.mcf
183.equake
188.ammp
197.parser
253.perlbmk
255.vortex
256.bzip2
300.twolf
301.apsi

The weighted dynamic power calculation for the decompressor, using 164.gzip as an example, is shown below.

Table 4.19 164.gzip Decompressor Power Values by Simulation Point

Simulation Point | Weight | P_Internal (nW) | P_Net (nW) | P_Dynamic (nW)

P_{Dynamic} = \sum_i Weight_i \cdot P_{Dynamic,i} = (\ )\,nW \qquad (4.10)

The dynamic power results for the decompressor, for each benchmark, are shown in Table 4.20.

Table 4.20 Decompressor Power Results from Cadence Genus

Benchmark | P_Static (nW) | P_Internal (nW) | P_Net (nW) | P_Dynamic (nW)
164.gzip
168.wupwise
171.swim
172.mgrid
173.applu
175.vpr
176.gcc
177.mesa
179.art
181.mcf
183.equake
188.ammp
197.parser
253.perlbmk
255.vortex
256.bzip2
300.twolf
301.apsi

4.4 Place and Route in Cadence Innovus

To confirm that the complexity of the hardware is not beyond implementation, Cadence Innovus is used to automatically place and route the design to a chip. Figure 4.9 shows the final routing of the compressor in a 1mm by 1mm die.

Figure 4.9 Compressor Routing in Innovus

Chapter 5

Results

In this chapter, we look at the results of the performance simulation. We start by reviewing the performance of the compression architecture, followed by the performance of the prefetch tables. Finally, we look at the overall performance of the new combined architecture.

5.1 Compression

As part of the compressibility check discussed in Chapter 4, we output in SimpleScalar a count of the cache lines that are compressed by each of the schemes. Figure 5.1 shows the percentage of L1 data cache lines compressed by each compression scheme.

Figure 5.1 Percentage of L1 Data Cache Lines Compressed by Each Scheme

As can be seen from the data, and as a verification of the results presented in [4], there is significant opportunity to apply Base-Delta-Immediate compression in the L1 data cache. The best compression is achieved through zeros compression (64 bytes down to 8 bytes), so from Figure 5.1 we would expect a high compression ratio from 176.gcc, because more than 40% of its cache lines are compressible using zeros compression. We also see that each of the compression schemes is well represented within the benchmarks; 300.twolf, for example, exhibits a balanced mix of the Base-Delta-Immediate schemes.

Compression Ratio

We want to know what kind of impact this compression would have on the amount of data we are able to store in the cache. To do this, we can look at the compression ratio of each of the benchmarks. The compression ratio achieved by running each of the benchmarks through the simulator is shown in Figure 5.2. To calculate the compression ratio, we compare the compressed size of the data with the uncompressed size:

compression\ ratio = \frac{size_{uncompressed}}{size_{compressed}} \qquad (5.1)

Figure 5.2 Compression Ratio of L1 Data Cache

As expected from the compression rates shown previously in Figure 5.1, 176.gcc achieves the best compression, mostly due to the 40% zeros compression. This is because zeros compression has the highest compression ratio among the schemes, at a rate of 64/8. If we consider only this 40% zeros compression, and no other compressed cache lines, we would see the following compression ratio:

compression\ ratio = \frac{size_{uncompressed}}{size_{compressed}} = \frac{64}{0.4(8) + 0.6(64)} = \frac{64}{41.6} \approx 1.54 \qquad (5.2)

176.gcc does not achieve much more than this, with a ratio of 1.62. A compression ratio of 1.62 means that, on average, a 64-byte cache line is taking up 40 bytes of space. This is significant because it means that, on average, each cache index holding 128 bytes now has room for 3 cache lines instead of 2.

Slowdown

We know, due to the decompression latency, that we will suffer a performance deterioration when we implement Base-Delta-Immediate compression, especially in L1 cache. To determine the slowdown, we compare the IPC of the compressed scheme versus the baseline scheme for each of the runs. The calculations for speedup and slowdown are shown below.

speedup = \frac{IPC_{compressed}}{IPC_{baseline}} \qquad (5.3)

slowdown = 1 - speedup

Figure 5.3 shows the IPC of each of the benchmarks for the baseline configuration. Figure 5.4 shows the compressed scheme. The resultant speedup is shown in Figure 5.5.

Figure 5.3 IPC of Baseline Scheme

Figure 5.4 IPC of Compressed Scheme

Figure 5.5 Speedup of Compressed Scheme vs Baseline

What is significant here is that 183.equake has the most slowdown due to compression, yet 176.gcc has the highest compression ratio. This is likely because 183.equake experiences more compressed cache hits, and therefore more of the impact of the decompression latency.

Static Power

The primary intent of implementing compression in the L1 data cache is to reduce the size, and therefore the power consumption, of the cache. The amount of power savings here is important because it should outweigh any penalties introduced by the prefetching architecture or by the slowdown in performance. Figure 5.6 shows the static energy consumption of the L1 data cache for the baseline scheme. Figure 5.7 shows the static energy for the compressed scheme, including the compression and decompression hardware energy. Figure 5.8 shows the ratio of compressed to baseline to highlight the reduction.

Figure 5.6 L1 Data Cache Static Energy (Baseline Scheme)

Figure 5.7 L1 Data Cache Static Energy (Compressed Scheme)

Figure 5.8 L1 Data Cache Static Energy Ratio, Compressed vs Baseline

We see a significant static energy reduction in the cache itself due to its decrease in size. Looking at the static power from the CACTI model in Table 4.5, we would expect to see a reduction equal to:

\frac{P_{compressed}}{P_{baseline}} = 0.64 \qquad (5.4)

However, 0.64 is not achieved, due to the CPU slowdown caused by introducing compression and the power overhead of the compression and decompression hardware. In fact, the balance of static power correlates with the percent slowdown of the CPU due to compression: comparing Figure 5.8 with the slowdown in Figure 5.5, you can see that they complement each other in this regard.

Dynamic Power

Because we change the cache performance as discussed above, the switching characteristics will change. In addition, the overall reduction in cache area will impact the configuration of the cache, and therefore the energy required to read and write it. Figure 5.9 shows the dynamic energy consumption of the L1 data cache for the baseline scheme. Figure 5.10 shows the dynamic energy for the compressed scheme, including the energy consumed in the compressor and decompressor. Figure 5.11 shows the ratio of compressed to baseline to highlight the reduction.

Figure 5.9 L1 Data Cache Dynamic Energy (Baseline Scheme)

Figure 5.10 L1 Data Cache Dynamic Energy (Compressed Scheme)

Figure 5.11 L1 Data Cache Dynamic Energy Ratio, Compressed vs Baseline

We cannot compare the dynamic behaviour as we did with the static energy and the slowdown. However, we do know that the compressed data size impacts the dynamic energy consumption of the cache. Therefore, we can determine how much of this energy reduction is due to the reduced data size by looking at the compression

ratio. For example, 176.gcc has a compression ratio of 1.62. This represents an average data size reduction of:

avg\ data\ reduction = 1 - \frac{1}{1.62} \approx 0.38 \qquad (5.5)

The remaining energy reduction, or gains in the CPU, are due to the change in energy per access as well as the overall change in performance of the CPU.

5.2 Prefetching

We first look at the overall performance of all of the prefetching configurations used. To evaluate the performance of the prefetch tables, we consider two key elements. First, we look at the hit percentage for the table during the instruction fetch stage of the processor; that is, what percentage of load instructions successfully acquire a prediction address from the prefetch table based on the program counter only. Second, we look at how accurate those predictions are. By reviewing the state of decompressed lines in the decompression buffer when they are evicted, we can better understand how the prefetch tables affect the performance of the new architecture. In addition, this metric sheds some light on where improvements can be made to this architecture, as we will see in the data to follow.

Hit Percentage

A 128-Set and a 1K-Set table were simulated for each of the prefetch table types (Last Outcome, Stride, Hybrid S/LO, Two-Level, and Hybrid 2L/S). To understand this selection, consider the static energy savings of compression presented in Figure 5.8. On average, we see a savings ratio of 0.27, which represents 0.79mJ in static energy, or 7.12mW in static power, across the executed benchmarks. Reviewing the CACTI results in Table 4.7, the only 2K-Set table that keeps its static power within this range is the Last Outcome table. Therefore, we did not exceed 1K table sizes, as we did not want to consume our static power savings entirely within the prefetch table. For each of the 10 configurations, Figure 5.12 shows the percentage of load instructions that successfully receive a prediction address from the 128-Set prefetch tables. Figure 5.13 shows the percentage of instructions that hit the 1K prefetch tables.

Figure 5.12 Hit Percentage of Load Instructions by Prefetch Table (128 Set)

Figure 5.13 Hit Percentage of Load Instructions by Prefetch Table (1K Set)
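For reference, the lookup these tables perform at instruction fetch can be sketched as a direct-mapped, PC-indexed structure (sizes and field layout here are illustrative, not the simulator's exact structures):

#include <stdint.h>

#define SETS 1024                       /* the 1K-Set configuration */

struct lo_entry { uint32_t tag; uint32_t last_addr; int valid; };
static struct lo_entry table[SETS];

/* Lookup at instruction fetch: hit if the PC's tag matches. */
int lo_predict(uint32_t pc, uint32_t *pred_addr)
{
    struct lo_entry *e = &table[(pc >> 2) % SETS];
    if (e->valid && e->tag == pc) {
        *pred_addr = e->last_addr;      /* Last Outcome: same address again */
        return 1;
    }
    return 0;
}

/* Update at the mem stage with the address actually loaded. */
void lo_update(uint32_t pc, uint32_t addr)
{
    struct lo_entry *e = &table[(pc >> 2) % SETS];
    e->tag = pc;
    e->last_addr = addr;
    e->valid = 1;
}

Stride and two-level tables extend the entry with a stride field or a pattern history index, respectively, but the PC-indexed lookup is the same.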

Figure 5.13 shows that the 1K-Set variation of each table outperforms the 128-Set variant, due to the reduction in conflict misses in the table. We also notice that, overall, the prefetch table hit percentage is quite low, averaging 10% to 15% for 128-Set and 16% to 24% for 1K-Set.

Prediction Accuracy

When a prediction is made, data is decompressed from the cache and then entered into the decompression buffer. There are five possible results for entries in this buffer. If the buffer is not large enough, entries are evicted before they can be used. Used entries can be correct or incorrect. In addition, entries may be tossed out due to cache replacement or branch misprediction. Figure 5.14 shows the results for the 10 prefetch configurations.

Figure 5.14 Prediction Accuracy of 10 Prefetch Table Configurations

Some major factors stand out here. First, we notice that Stride prefetching is easily the most accurate. Second, we see that Last Outcome results in a large number of incorrect predictions. While these mispredictions do not directly impact the performance of the CPU, they do require an additional cache access, which consumes unnecessary energy. In addition, we notice that benchmarks 172.mgrid and 179.art dump many of the decompressed results from the decompression buffer before ever using them. This means that the normal number of load instructions between the instruction fetch stage and the mem stage is larger than our 1K buffer design, which has 16 entries.

Static Power

The static energy consumption of all of the combined prefetch tables is shown in Figure 5.15 for 128-Set and Figure 5.16 for 1K-Set. This data includes the prefetch table, the decompression buffer, and, in the case of two-level prefetching, the pattern history table.

Figure 5.15 Static Energy by Prefetch Table (128 Set)

Figure 5.16 Static Energy by Prefetch Table (1K Set)

Dynamic Power

The dynamic energy consumption of all of the combined prefetch tables is shown in Figure 5.17 for 128-Set and Figure 5.18 for 1K-Set. This data includes the prefetch table, the decompression buffer, and, in the case of two-level prefetching, the pattern history table.

Figure 5.17 Dynamic Energy by Prefetch Table (128 Set)

Figure 5.18 Dynamic Energy by Prefetch Table (1K Set)

5.3 Compression and Prefetching

In this section, we look at the overall results of combining prefetching with cache compression, and compare those results with the compression-only configuration.

Cache Energy vs. Performance

Figure 5.19 shows the slowdown versus the power consumed in the L1 data cache. This figure identifies two important table configurations: Stride (128) and Hybrid Stride/Last Outcome (1K). Stride (128) provides the best speedup-to-DL1-energy relationship. Hybrid Stride/Last Outcome (1K) provides the best overall speedup, which we will see is the most important factor in the data to come.

Figure 5.19 L1 Data Cache Energy vs Performance

CPU Power vs. Performance

Figure 5.20 shows the overall energy savings in the CPU versus the slowdown compared with the baseline configuration. Stride (128) and Hybrid Stride/Last Outcome (128) stand out here as well, because they actually consume less energy than the compressed architecture alone. This is because the performance benefit of the prefetching itself results in a reduction in power.

Figure 5.20 CPU Energy vs. Performance

Speedup due to Prefetching

It is clear from the previous figure that performance plays an important role in the overall energy consumption of the CPU. In Figure 5.21, we compare the speedup of the different prefetching methods for each of the benchmarks for 128-Set tables, and in Figure 5.22 for 1K-Set tables.

Figure 5.21 Speedup Due to Prefetching (128 Set, vs. Compressed Only)

Figure 5.22 Speedup Due to Prefetching (1K Set, vs. Compressed Only)

Energy-Delay Product

To determine which prefetching method stands out as the best, we need to consider both the overall performance of the CPU and the energy consumption. To do this, we use the product of the two metrics:

EDP = E \cdot t \qquad (5.6)

where E is the total energy consumed by the CPU (in joules) and t is the runtime of the program (in seconds). Figure 5.23 shows the energy-delay product, using values normalized to the baseline scheme, for each of the prefetch table configurations.

Figure 5.23 Energy-Delay Product (CPU)

Figure 5.23 shows that all evaluated prefetch tables, in combination with Base-Delta-Immediate (BΔI) compression, outperform compression alone in the L1 data cache. Based on the energy-delay product, the 1K Stride/Last Outcome table has the best overall performance for the SPEC CPU 2000 benchmarks used.


More information

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties

Improved heuristics for the single machine scheduling problem with linear early and quadratic tardy penalties Imroved heuristics for the single machine scheduling roblem with linear early and quadratic tardy enalties Jorge M. S. Valente* LIAAD INESC Porto LA, Faculdade de Economia, Universidade do Porto Postal

More information

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications

OMNI: An Efficient Overlay Multicast. Infrastructure for Real-time Applications OMNI: An Efficient Overlay Multicast Infrastructure for Real-time Alications Suman Banerjee, Christoher Kommareddy, Koushik Kar, Bobby Bhattacharjee, Samir Khuller Abstract We consider an overlay architecture

More information

Layered Switching for Networks on Chip

Layered Switching for Networks on Chip Layered Switching for Networks on Chi Zhonghai Lu zhonghai@kth.se Ming Liu mingliu@kth.se Royal Institute of Technology, Sweden Axel Jantsch axel@kth.se ABSTRACT We resent and evaluate a novel switching

More information

Truth Trees. Truth Tree Fundamentals

Truth Trees. Truth Tree Fundamentals Truth Trees 1 True Tree Fundamentals 2 Testing Grous of Statements for Consistency 3 Testing Arguments in Proositional Logic 4 Proving Invalidity in Predicate Logic Answers to Selected Exercises Truth

More information

CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE

CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE Petra Surynková Charles University in Prague, Faculty of Mathematics and Physics, Sokolovská 83,

More information

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks

Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Journal of Comuting and Information Technology - CIT 8, 2000, 1, 1 12 1 Comlexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks Eunice E. Santos Deartment of Electrical

More information

Modified Bloom filter for high performance hybrid NoSQL systems

Modified Bloom filter for high performance hybrid NoSQL systems odified Bloom filter for high erformance hybrid NoSQL systems A.B.Vavrenyuk, N.P.Vasilyev, V.V.akarov, K.A.atyukhin,..Rovnyagin, A.A.Skitev National Research Nuclear University EPhI (oscow Engineering

More information

A NOVEL GEOMETRIC ALGORITHM FOR FAST WIRE-OPTIMIZED FLOORPLANNING

A NOVEL GEOMETRIC ALGORITHM FOR FAST WIRE-OPTIMIZED FLOORPLANNING A OVEL GEOMETRIC ALGORITHM FOR FAST WIRE-OPTIMIZED FLOORPLAIG Peter G. Sassone, Sung K. Lim School of Electrical and Comuter Engineering Georgia Institute of Technology Atlanta, Georgia 30332, U ABSTRACT

More information

Continuous Visible k Nearest Neighbor Query on Moving Objects

Continuous Visible k Nearest Neighbor Query on Moving Objects Continuous Visible k Nearest Neighbor Query on Moving Objects Yaniu Wang a, Rui Zhang b, Chuanfei Xu a, Jianzhong Qi b, Yu Gu a, Ge Yu a, a Deartment of Comuter Software and Theory, Northeastern University,

More information

OPTIMIZATION OF MAPPING ONTO A FLEXIBLE LOW-POWER ELECTRONIC FABRIC ARCHITECTURE

OPTIMIZATION OF MAPPING ONTO A FLEXIBLE LOW-POWER ELECTRONIC FABRIC ARCHITECTURE OPTIMIZATION OF MAPPING ONTO A FLEXIBLE LOW-POWER ELECTRONIC FABRIC ARCHITECTURE by Mustafa Baz B.S., Middle East Technical University, 004 M.S., University of Pittsburgh, 005 Submitted to the Graduate

More information

GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime

GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime GDP: Using Dataflow Proerties to Accurately Estimate Interference-Free Performance at Runtime Magnus Jahre Deartment of Comuter Science Norwegian University of Science and Technology (NTNU) Email: magnus.jahre@ntnu.no

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Distrib. Comut. 71 (2011) 288 301 Contents lists available at ScienceDirect J. Parallel Distrib. Comut. journal homeage: www.elsevier.com/locate/jdc Quality of security adatation in arallel

More information

Use of Multivariate Statistical Analysis in the Modelling of Chromatographic Processes

Use of Multivariate Statistical Analysis in the Modelling of Chromatographic Processes Use of Multivariate Statistical Analysis in the Modelling of Chromatograhic Processes Simon Edwards-Parton 1, Nigel itchener-hooker 1, Nina hornhill 2, Daniel Bracewell 1, John Lidell 3 Abstract his aer

More information

Block Recycling Schemes and Their Cost-based Optimization in NAND Flash Memory Based Storage System

Block Recycling Schemes and Their Cost-based Optimization in NAND Flash Memory Based Storage System Block Recycling Schemes and Their Cost-based Otimization in NAND Flash Memory Based Storage System Jongmin Lee School of Comuter Science University of Seoul jmlee@uos.ac.kr Sunghoon Kim Center for the

More information

Privacy Preserving Moving KNN Queries

Privacy Preserving Moving KNN Queries Privacy Preserving Moving KNN Queries arxiv:4.76v [cs.db] 4 Ar Tanzima Hashem Lars Kulik Rui Zhang National ICT Australia, Deartment of Comuter Science and Software Engineering University of Melbourne,

More information

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Min Hu, Saad Ali and Mubarak Shah Comuter Vision Lab, University of Central Florida {mhu,sali,shah}@eecs.ucf.edu Abstract Learning tyical

More information

CASCH - a Scheduling Algorithm for "High Level"-Synthesis

CASCH - a Scheduling Algorithm for High Level-Synthesis CASCH a Scheduling Algorithm for "High Level"Synthesis P. Gutberlet H. Krämer W. Rosenstiel Comuter Science Research Center at the University of Karlsruhe (FZI) HaidundNeuStr. 1014, 7500 Karlsruhe, F.R.G.

More information

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1 10 File System 1 We will examine this chater in three subtitles: Mass Storage Systems OERATING SYSTEMS FILE SYSTEM 1 File System Interface File System Imlementation 10.1.1 Mass Storage Structure 3 2 10.1

More information

Grouping of Patches in Progressive Radiosity

Grouping of Patches in Progressive Radiosity Grouing of Patches in Progressive Radiosity Arjan J.F. Kok * Abstract The radiosity method can be imroved by (adatively) grouing small neighboring atches into grous. Comutations normally done for searate

More information

A BICRITERION STEINER TREE PROBLEM ON GRAPH. Mirko VUJO[EVI], Milan STANOJEVI] 1. INTRODUCTION

A BICRITERION STEINER TREE PROBLEM ON GRAPH. Mirko VUJO[EVI], Milan STANOJEVI] 1. INTRODUCTION Yugoslav Journal of Oerations Research (00), umber, 5- A BICRITERIO STEIER TREE PROBLEM O GRAPH Mirko VUJO[EVI], Milan STAOJEVI] Laboratory for Oerational Research, Faculty of Organizational Sciences University

More information

An empirical analysis of loopy belief propagation in three topologies: grids, small-world networks and random graphs

An empirical analysis of loopy belief propagation in three topologies: grids, small-world networks and random graphs An emirical analysis of looy belief roagation in three toologies: grids, small-world networks and random grahs R. Santana, A. Mendiburu and J. A. Lozano Intelligent Systems Grou Deartment of Comuter Science

More information

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, ith GPU imlementations Akihiko Kasagi, Koji Nakano, and Yasuaki Ito Deartment of Information Engineering Hiroshima

More information

Improving Trust Estimates in Planning Domains with Rare Failure Events

Improving Trust Estimates in Planning Domains with Rare Failure Events Imroving Trust Estimates in Planning Domains with Rare Failure Events Colin M. Potts and Kurt D. Krebsbach Det. of Mathematics and Comuter Science Lawrence University Aleton, Wisconsin 54911 USA {colin.m.otts,

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing

A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Grah Processing Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, Matei Rieanu Deartment of Electrical and Comuter Engineering, The University

More information

Object and Native Code Thread Mobility Among Heterogeneous Computers

Object and Native Code Thread Mobility Among Heterogeneous Computers Object and Native Code Thread Mobility Among Heterogeneous Comuters Bjarne Steensgaard Eric Jul Microsoft Research DIKU (Det. of Comuter Science) One Microsoft Way University of Coenhagen Redmond, WA 98052

More information

Texture Mapping with Vector Graphics: A Nested Mipmapping Solution

Texture Mapping with Vector Graphics: A Nested Mipmapping Solution Texture Maing with Vector Grahics: A Nested Mimaing Solution Wei Zhang Yonggao Yang Song Xing Det. of Comuter Science Det. of Comuter Science Det. of Information Systems Prairie View A&M University Prairie

More information

521493S Computer Graphics Exercise 3 (Chapters 6-8)

521493S Computer Graphics Exercise 3 (Chapters 6-8) 521493S Comuter Grahics Exercise 3 (Chaters 6-8) 1 Most grahics systems and APIs use the simle lighting and reflection models that we introduced for olygon rendering Describe the ways in which each of

More information

Directed File Transfer Scheduling

Directed File Transfer Scheduling Directed File Transfer Scheduling Weizhen Mao Deartment of Comuter Science The College of William and Mary Williamsburg, Virginia 387-8795 wm@cs.wm.edu Abstract The file transfer scheduling roblem was

More information

Near-Optimal Routing Lookups with Bounded Worst Case Performance

Near-Optimal Routing Lookups with Bounded Worst Case Performance Near-Otimal Routing Lookus with Bounded Worst Case Performance Pankaj Guta Balaji Prabhakar Stehen Boyd Deartments of Electrical Engineering and Comuter Science Stanford University CA 9430 ankaj@stanfordedu

More information

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans Available online at htt://ijdea.srbiau.ac.ir Int. J. Data Enveloment Analysis (ISSN 2345-458X) Vol.5, No.2, Year 2017 Article ID IJDEA-00422, 12 ages Research Article International Journal of Data Enveloment

More information

An integrated system for virtual scene rendering, stereo reconstruction, and accuracy estimation.

An integrated system for virtual scene rendering, stereo reconstruction, and accuracy estimation. An integrated system for virtual scene rendering, stereo reconstruction, and accuracy estimation. Marichal-Hernández J.G., Pérez Nava F*., osa F., estreo., odríguez-amos J.M. Universidad de La Laguna,

More information

Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide. Version has been retired. This version of the software

Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide. Version has been retired. This version of the software Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide Version 14.12 This version of the software has been retired This is a ublication of Sage Software, Inc. Coyright 2014. Sage Software,

More information

Using Permuted States and Validated Simulation to Analyze Conflict Rates in Optimistic Replication

Using Permuted States and Validated Simulation to Analyze Conflict Rates in Optimistic Replication Using Permuted States and Validated Simulation to Analyze Conflict Rates in Otimistic Relication An-I A. Wang Comuter Science Deartment Florida State University Geoff H. Kuenning Comuter Science Deartment

More information

Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform

Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chi Platform Uzi Vishkin George C. Caragea Bryant Lee Aril 2006 University of Maryland, College Park, MD 20740 UMIACS-TR

More information

Energy consumption model over parallel programs implemented on multicore architectures

Energy consumption model over parallel programs implemented on multicore architectures Energy consumtion model over arallel rograms imlemented on multicore architectures Ricardo Isidro-Ramírez Instituto Politécnico Nacional SEPI-ESCOM M exico, D.F. Amilcar Meneses Viveros Deartamento de

More information

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree To aear in IEEE TKDE Title: Efficient Skyline and To-k Retrieval in Subsaces Keywords: Skyline, To-k, Subsace, B-tree Contact Author: Yufei Tao (taoyf@cse.cuhk.edu.hk) Deartment of Comuter Science and

More information

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level,

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level, [9] J. J. Dongarra, R. Hemel, A. J. G. Hey, and D. W. Walker, \A Proosal for a User-Level, Message Passing Interface in a Distributed-Memory Environment," Tech. Re. TM-3, Oak Ridge National Laboratory,

More information

10. Parallel Methods for Data Sorting

10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting... 1 10.1. Parallelizing Princiles... 10.. Scaling Parallel Comutations... 10.3. Bubble Sort...3 10.3.1. Sequential Algorithm...3

More information

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets An imroved algorithm for Hausdorff Voronoi diagram for non-crossing sets Frank Dehne, Anil Maheshwari and Ryan Taylor May 26, 2006 Abstract We resent an imroved algorithm for building a Hausdorff Voronoi

More information

Record Route IP Traceback: Combating DoS Attacks and the Variants

Record Route IP Traceback: Combating DoS Attacks and the Variants Record Route IP Traceback: Combating DoS Attacks and the Variants Abdullah Yasin Nur, Mehmet Engin Tozal University of Louisiana at Lafayette, Lafayette, LA, US ayasinnur@louisiana.edu, metozal@louisiana.edu

More information

Argo Programming Guide

Argo Programming Guide Argo Programming Guide Evangelia Kasaaki, asmus Bo Sørensen February 9, 2015 Coyright 2014 Technical University of Denmark This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International

More information

Sage Estimating. (formerly Sage Timberline Estimating) Getting Started Guide

Sage Estimating. (formerly Sage Timberline Estimating) Getting Started Guide Sage Estimating (formerly Sage Timberline Estimating) Getting Started Guide This is a ublication of Sage Software, Inc. Document Number 20001S14030111ER 09/2012 2012 Sage Software, Inc. All rights reserved.

More information

SEARCH ENGINE MANAGEMENT

SEARCH ENGINE MANAGEMENT e-issn 2455 1392 Volume 2 Issue 5, May 2016. 254 259 Scientific Journal Imact Factor : 3.468 htt://www.ijcter.com SEARCH ENGINE MANAGEMENT Abhinav Sinha Kalinga Institute of Industrial Technology, Bhubaneswar,

More information

Control plane and data plane. Computing systems now. Glacial process of innovation made worse by standards process. Computing systems once upon a time

Control plane and data plane. Computing systems now. Glacial process of innovation made worse by standards process. Computing systems once upon a time Classical work Architecture A A A Intro to SDN A A Oerating A Secialized Packet A A Oerating Secialized Packet A A A Oerating A Secialized Packet A A Oerating A Secialized Packet Oerating Secialized Packet

More information

Interactive Image Segmentation

Interactive Image Segmentation Interactive Image Segmentation Fahim Mannan (260 266 294) Abstract This reort resents the roject work done based on Boykov and Jolly s interactive grah cuts based N-D image segmentation algorithm([1]).

More information

This version of the software

This version of the software Sage Estimating (SQL) (formerly Sage Timberline Estimating) SQL Server Guide Version 16.11 This is a ublication of Sage Software, Inc. 2015 The Sage Grou lc or its licensors. All rights reserved. Sage,

More information

Optimization of Collective Communication Operations in MPICH

Optimization of Collective Communication Operations in MPICH To be ublished in the International Journal of High Performance Comuting Alications, 5. c Sage Publications. Otimization of Collective Communication Oerations in MPICH Rajeev Thakur Rolf Rabenseifner William

More information

Distributed Estimation from Relative Measurements in Sensor Networks

Distributed Estimation from Relative Measurements in Sensor Networks Distributed Estimation from Relative Measurements in Sensor Networks #Prabir Barooah and João P. Hesanha Abstract We consider the roblem of estimating vectorvalued variables from noisy relative measurements.

More information

Chapter 8: Adaptive Networks

Chapter 8: Adaptive Networks Chater : Adative Networks Introduction (.1) Architecture (.2) Backroagation for Feedforward Networks (.3) Jyh-Shing Roger Jang et al., Neuro-Fuzzy and Soft Comuting: A Comutational Aroach to Learning and

More information

Building Better Nurse Scheduling Algorithms

Building Better Nurse Scheduling Algorithms Building Better Nurse Scheduling Algorithms Annals of Oerations Research, 128, 159-177, 2004. Dr Uwe Aickelin Dr Paul White School of Comuter Science University of the West of England University of Nottingham

More information

Efficient Sequence Generator Mining and its Application in Classification

Efficient Sequence Generator Mining and its Application in Classification Efficient Sequence Generator Mining and its Alication in Classification Chuancong Gao, Jianyong Wang 2, Yukai He 3 and Lizhu Zhou 4 Tsinghua University, Beijing 0084, China {gaocc07, heyk05 3 }@mails.tsinghua.edu.cn,

More information

Multigrain Parallel Delaunay Mesh Generation: Challenges and Opportunities for Multithreaded Architectures

Multigrain Parallel Delaunay Mesh Generation: Challenges and Opportunities for Multithreaded Architectures Multigrain Parallel Delaunay Mesh Generation: Challenges and Oortunities for Multithreaded Architectures Christos D. Antonooulos, Xiaoning Ding, Andrey Chernikov, Fili Blagojevic, Dimitrios S. Nikolooulos,

More information

Information Flow Based Event Distribution Middleware

Information Flow Based Event Distribution Middleware Information Flow Based Event Distribution Middleware Guruduth Banavar 1, Marc Kalan 1, Kelly Shaw 2, Robert E. Strom 1, Daniel C. Sturman 1, and Wei Tao 3 1 IBM T. J. Watson Research Center Hawthorne,

More information