Figure 5.2: (a) Floor plan examples for varying the number of memory controllers and ranks. (b) Example configuration.

The study found that a 16-rank, 4-memory-controller system obtained a significant speedup over a 4-rank, 1-memory-controller system, showing that substantial gains can be obtained through architectural changes in this area. For additional information on the techniques used and the data gathered by this study, the reader is referred to the reference section.

Increase DRAM Row Buffer Entries

DRAM row buffers essentially function as a DRAM cache: if an entry is in the row buffer for a bank, the memory is read from there, reducing the overall access time of main memory. In both 2D and 3D systems, increasing the number of row buffer entries per bank improves performance. However, the benefit is much more visible in a 3D system, where wire delay and other factors don't dwarf DRAM access time. This makes it much more beneficial to add a few more row buffer entries per bank in a 3D stacked memory system.

The same paper that discussed increasing the number of ranks and memory controllers also analyzed the effect of increasing the number of row buffer entries in each bank. It found that increasing the number of row buffer entries to 4 gave a speedup of 1.3. Of course, additional row buffer entries are not free: extra entries increase the complexity of the logic, and each entry requires 4 KB of space. On a 16-rank system with 8 banks per rank, increasing the number of row buffer entries to 4 requires 2 MB of total space. The study discussed these costs and even ran a simulation comparing the increased row buffer performance to an alternate case where the extra storage space was allocated to the L2 cache. It found that, even though the L2 has a faster access time, enlarging the L2 was not as beneficial as increasing the number of row buffer entries in main memory. This was because increasing the size of the L2 cache did not greatly reduce L2 misses, while increasing the number of row buffer entries greatly increased the number of row buffer hits.
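To make the storage arithmetic concrete, and to show the mechanism by which extra row buffer entries turn misses into hits, consider the following minimal sketch in Python. The storage figures come from the study; the hit and miss latencies and the LRU policy are illustrative assumptions, not numbers from the paper.

    # Storage cost of growing each bank's row buffer to 4 entries
    # (configuration from the study: 16 ranks, 8 banks per rank, 4 KB/entry).
    ranks, banks_per_rank, entries, entry_kb = 16, 8, 4, 4
    print(ranks * banks_per_rank * entries * entry_kb // 1024, "MB")  # -> 2 MB

    # Toy row-buffer model: an access hits if its row is already buffered.
    # T_HIT and T_MISS are assumed latencies for illustration only.
    T_HIT, T_MISS = 15, 50  # ns

    def access(row_buffer, row, capacity=4):
        if row in row_buffer:            # row buffer hit: skip the array access
            row_buffer.remove(row)
            row_buffer.append(row)       # refresh LRU position
            return T_HIT
        row_buffer.append(row)           # activate the row in a buffer entry
        if len(row_buffer) > capacity:
            row_buffer.pop(0)            # evict the least recently used row
        return T_MISS

With a single entry per bank, two alternating rows thrash the buffer on every access; with four entries, the same pattern hits after the first touch, which is exactly the effect the study measured.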

Stacked Memory-Aware, Rich TSV-enabled 3D Memory Hierarchy

This section describes a rather unique method of taking advantage of the high bandwidth available in a 3D system. The method is proposed in the paper [citation] and will henceforth be referred to as SMART-3D, as it is in the original paper. While this section covers a specific method and its implementation, the concepts can be applied in other areas, and it shows how redesigning an architecture with 3D-stacked memory in mind can be beneficial.

SMART-3D proposes an L2 cache that supports standard 64-byte read and write operations for the L1 cache, with a 4 KB bus for fill and write-back operations to main memory. A standard planar H-tree layout is used for the read and write operations. Essentially, the size of the L2 cache line varies with the perspective: from the perspective of the L1, the L2 has a 64-byte cache line, while from the perspective of main memory, it has a 4 KB cache line. This abstraction is provided by separating the L2 cache into 64 sub-banks, each with a vertical 64 B bus, allowing 64 simultaneous 64-byte transfers between main memory and the L2. This reduces L2 performance slightly by increasing the hit access time, but that is made up for by the reduced miss rate.

The main benefit of this implementation is that it exploits spatial locality: many of the advantages of a standard 4 KB cache line are realized while avoiding many of the downsides. The most significant upside of SMART-3D compared to a standard 4 KB cache line is that the L2 cache access time is much lower. This translates to noticeable performance improvements for even moderately memory-intensive applications.

Several major design decisions were discussed in the SMART-3D paper. For example, when data is evicted from the L2 cache, it is possible to evict at a 64 B or a 4 KB granularity; that is, should the cache evict the least-recently-used 64-byte line from each sub-bank (local LRU) or the least-recently-used page (global LRU)? Simulations showed that, though the results were application-specific, the miss rates differed by a maximum of 0.033%. It was therefore decided that the additional complexity of a local LRU system would not be cost-effective, and a global LRU system was implemented. Another issue considered is that one L2 eviction might lead to up to 64 L1 evictions, depending on the state of the caches. The paper argues that this does not create a problem if each L2 cache line keeps an inclusion bit to minimize unnecessary eviction traffic sent to the L1.

Overall, this implementation showed speedups of 1.71x for memory-intensive applications and 1.30x for standard applications over the 3D-base system. These are the most general numbers; for performance results in specific cases, the reader is referred to the references section.
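A minimal sketch of SMART-3D's dual-granularity behavior may make the idea easier to see. The class below is hypothetical (the paper defines no such interface); it only captures the two views of the L2: 64 B lines toward the L1, and whole 4 KB pages, filled in one shot over the 64 parallel sub-bank buses, toward main memory, with the global (page-granularity) LRU eviction the paper chose.

    LINE_B, SUBBANKS = 64, 64
    PAGE_B = LINE_B * SUBBANKS           # 4 KB fill/write-back granularity

    class Smart3DL2:                     # hypothetical model, not the real design
        def __init__(self, capacity_pages=512):
            self.capacity = capacity_pages
            self.pages = []              # resident 4 KB pages in LRU order

        def fill_from_memory(self, page):
            # All 64 sub-banks receive their 64 B slice simultaneously over
            # the vertical TSV buses, so a 4 KB fill is one bus transaction.
            self.pages.append(page)
            if len(self.pages) > self.capacity:
                self.pages.pop(0)        # global LRU: evict the whole LRU page

        def read_line(self, addr):
            page = addr // PAGE_B        # the L1 still sees ordinary 64 B lines
            if page in self.pages:
                self.pages.remove(page)
                self.pages.append(page)
                return True              # hit
            self.fill_from_memory(page)  # one wide fill brings in 64 lines
            return False                 # miss

The point of the model is the last branch of read_line: a single miss pulls in 64 adjacent lines at once, which is where the spatial-locality benefit comes from.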

Splitting Cache Across Layers

In addition to the many optimizations already discussed, it is possible to greatly reduce the worst-case wire length in the caches by taking advantage of the 3D layout. One way to do this is to partition the cache between separate silicon layers. Figure 5.3 shows the most general way to do this.

Figure 5.3: (a) Standard 8-bank 2D cache layout. (b) The cache partitioned across layers by bank.

As can be seen, the worst-case wire length is significantly reduced in this system. If x and y are assumed to be equal, the wire length is cut in half (the sketch after Figure 5.4 reproduces this estimate). This translates to noticeable reductions in delay and energy use.

There are also partitioning options at a finer granularity: a cache can be partitioned by wordline or by bitline, as Figure 5.4 shows. In each partitioning option, wire lengths are reduced. These finer-grained options provide a greater benefit than simple bank-stacking, because the wire-reduction benefits appear both at the global routing level and at the individual array level. However, they are also more complex: stacking banks requires no redesign, just simple stacking, while splitting the arrays themselves does.

Figure 5.4: (left) Standard 2D SRAM array. (middle) Split by bitline. (right) Split by wordline.
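To reproduce the factor-of-two claim, a toy wire-length model is enough. The model below assumes the worst-case route is a planar run across the layout and that moving half the banks to a second layer halves that run, with the die-to-die via adding only a negligible vertical hop; these are simplifying assumptions, not figures from the study.

    def worst_case_wire_2d(x, y):
        return x + y                      # corner-to-corner planar route

    def worst_case_wire_stacked(x, y, via_len=0.0):
        # Half the banks move to the second layer, halving the planar
        # distance the worst-case route must cover; the via hop is tiny.
        return (x + y) / 2 + via_len

    print(worst_case_wire_2d(1.0, 1.0))       # 2.0 (normalized units)
    print(worst_case_wire_stacked(1.0, 1.0))  # 1.0 -> halved, as claimed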

Puttaswamy and Loh conducted a study, Implementing Caches in a 3D Technology for High Performance Processors, simulating access latency and energy across several different cache sizes. They implemented array stacking, split by bitline, and found greatly varying benefits depending on the cache size:

Figure 5.5: (left) Table comparing cache size to access energy, showing the savings obtained by partitioning the cache. (right) Table comparing cache size to access latency, again showing savings.

A key point is that, for latency, what matters is not the percent reduction but the cycle reduction. A 15% delay reduction doesn't matter if the number of cycles required for the access doesn't change; for example, at 3 GHz, shortening an access from 1.0 ns to 0.85 ns still costs three cycles either way. For energy reductions, on the other hand, every small improvement helps.

It would be very interesting to delve deeper into optimizing the cache architecture for a 3D system at an even finer granularity. For example, partitioning each bit cell across layers could provide even greater benefits. However, this is limited by the size of the die-to-die vias: currently, each via is larger than an SRAM cell, which makes further optimization impossible without technology improvements.

Optimization Summary

These optimizations have all been shown to significantly improve the performance of a processor, and the possibilities presented by a 3D-stacked processor are nearly endless. So the question is: why is 3D not the standard? The answer is that there are other issues that were not discussed in this section. With any processor, heat must be dealt with; as technology improves and energy density increases, heat becomes a problem, and stacking DRAM cells on top of the processor can only make it worse. Additionally, manufacturing is an issue: through-silicon vias are difficult to manufacture reliably, and significant communication between the processor manufacturer and the DRAM manufacturer would be required. There are certainly challenges facing 3D memory, but the benefits appear great, and as the memory wall worsens, more research may be done in this area.

Thermal Analysis

One of the biggest obstacles a 3-dimensional memory/CPU structure must overcome is power management and heat removal. Traditional 2D layouts have the benefit of being separate modules, each easily cooled with its own heat sink. A 2D CPU places the bulk silicon directly against a heat spreader mounted to a heat sink, allowing direct heat conduction away from the CPU to ambient. The separate memory modules generate less heat in comparison and are easily cooled in a similar fashion at a separate location.

With the CPU and memory integrated in the same stack, however, there is more obstruction to heat removal, as seen in Figure 6.1(a). Here the top die (Layer N) would be mounted against the motherboard using a socket, similar to the 2D design. However, heat is now generated in several different layers, increasing the power density and the heat-removal requirements for the same effective heat sink area. The upper layers must reach an even higher temperature than the layers below in order to establish the gradient required for heat to flow to the heat sink. Assuming the heat sink is maintained at ambient, stacking comparable dies leads to higher overall temperatures for a 3D design than for an equivalent 2D design. This is further compounded by the fact that leakage current is exponentially dependent on temperature: once the temperature is elevated, the processor's static power increases at an exponential rate. [1] Additionally, having multiple layers consuming power within a given footprint and effective heat sink area results in a higher power density for the 3D structure.

Figure 6.1: 3D processor and memory stack, including the heat model. [1]
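The gradient argument above can be made concrete with a one-dimensional stack model like the one in Figure 6.1(b): all heat generated above a given boundary must cross it on the way down to the sink, so temperature accumulates toward the top of the stack. The per-layer thermal resistance and power values below are arbitrary assumptions chosen only to illustrate the shape of the result.

    def layer_temps(layer_powers_w, r_layer_k_per_w, t_sink_c=25.0):
        # layer_powers_w[0] is the die nearest the heat sink.
        temps, t = [], t_sink_c
        for i in range(len(layer_powers_w)):
            # Heat from this layer and every layer above it must cross the
            # resistance below this layer, raising its temperature.
            t += sum(layer_powers_w[i:]) * r_layer_k_per_w
            temps.append(t)
        return temps

    # Four identical 20 W dies over a sink held at ambient:
    print(layer_temps([20.0] * 4, 0.3))   # [49.0, 67.0, 79.0, 85.0] C

Each layer comes out hotter than the one below it, which is the gradient the text describes; note how quickly the top die's temperature climbs even with modest per-layer power.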

Performance Effects

Countless tests and simulations have been performed to evaluate the performance improvements offered by 3D memory structures, but because the heat problem is difficult to simulate, few studies have accurately considered heat's effect on system performance. An analysis completed at the University of California [1] did consider these thermal effects when comparing the performance of 2D and 3D structures. It pointed out that previous studies had underestimated the negative effects of heat on system performance because small-scale processes had not been evaluated. This study simulated a 120 nm process, where leakage power is highly temperature-sensitive. Similar 2D and 3D structures were compared against each other, with the cache and memory designs assumed to be identical. For more detail on the study, see the earlier Performance Comparison section of this survey.

The thermal analysis consisted of a simple one-dimensional model, seen in Figure 6.1(b). This model considers how each layer's generated heat must dissipate through the layers below it in order to reach the heat sink at the bottom of the stack. Consequently, the top layers must reach a temperature greater than that of the lower layers in order to drive the downward heat flow to the sink. Further complicating the issue is the relatively low thermal conductivity of silicon. The simple way to avoid exceeding the heat limit on any layer is to lower the heat generated, that is, the power. Since both structures were assumed to use the same process, and thus the same voltage with negligible differences in capacitance, the only way to reduce the power and heat generated is to lower the operating frequency of the CPU, which may in turn limit performance.

The results of this study can be seen in Figure 6.2. The top graph shows a performance comparison for the memory-intensive application. The positively sloped lines correspond to the right axis, the limiting CPU temperature. The fact that the 3D structure is 5-10 °C hotter at a given CPU frequency is due to the layering effect discussed previously. To compare the 2D and 3D structures fairly, their performance must be evaluated at a single operating temperature, since the heat-removal capability of the sink is assumed to be the same for each. The graph marks a temperature constraint of 100 °C with the dotted horizontal line, and the operating frequency is determined by where each structure's temperature curve crosses that limit: here it is 2.35 GHz for the 3D structure and 2.85 GHz for the 2D structure. With the operating frequency determined, the execution time can be read from the left axis: at these frequencies it is 0.6 ns for the 3D structure and 2.5 ns for the 2D structure. The result is that even though the 3D structure operates at an 18% reduction in frequency to avoid exceeding the temperature limit, it achieves a 76% improvement in execution time.

When considering the application with low memory use, different results are found. Here there is no difference in performance at a given operating frequency: since memory is not a bottleneck, it is not throttling execution time. Comparing both structures under the 100 °C temperature limit, the 3D structure is still limited to 2.35 GHz, with an execution time of 0.41 ns, while the 2D structure can operate at 2.8 GHz with a 0.35 ns execution time. The resulting 17% reduction in operating frequency scaled performance down by 17% as well.
Here the 2D structure outperformed the 3D structure, due solely to the limit placed on operating frequency to reduce power consumption and heat generation. The lower temperature-imposed frequency limit of the 3D structure hindered its performance in this comparison.
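The feedback loop the study wrestles with (frequency sets dynamic power, power sets temperature, and temperature inflates leakage, which feeds back into power) can be sketched as a fixed-point iteration. Every constant below is an assumption for illustration; only the final two cross-check lines use the study's reported numbers.

    import math

    def steady_temp(f_ghz, r_th=0.5, t_amb=25.0):
        t = t_amb
        for _ in range(200):                        # iterate to a fixed point
            p_dyn = 30.0 * f_ghz                    # W, proportional to frequency
            p_leak = 8.0 * math.exp(0.02 * (t - t_amb))   # W, exponential in T
            t = t_amb + r_th * (p_dyn + p_leak)     # 1-D thermal model again
        return t

    def max_freq_under(t_limit_c, step=0.05):
        f = step
        while steady_temp(f + step) <= t_limit_c:   # highest f that stays cool
            f += step
        return f

    print(round(max_freq_under(100.0), 2), "GHz under a 100 C cap")

    # Cross-checking the study's headline numbers:
    print(round(100 * (1 - 2.35 / 2.85)))           # ~18% frequency reduction
    print(round(100 * (1 - 0.6 / 2.5)))             # 76% execution-time gain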

Figure 6.2: Execution time per instruction and maximum chip temperature as a function of operating frequency for 2D and 3D processors. (a) Highly memory-intensive application and (b) non-memory-intensive application.

The conclusion from this study is that the 3D structure offers great improvements in memory bandwidth and speed. These improvements are highly effective at lowering the memory wall and improving system performance for memory-intensive applications. However, the reduced ability to remove heat from the 3D structure, and the resulting lower CPU frequency, reduce system performance for non-memory-intensive applications, because those applications were never limited by memory access in the first place. The final result is that the improved performance of memory-intensive applications comes at the cost of lowered performance in non-memory-intensive ones. With a 3D memory structure, however, the large performance improvement for memory-intensive applications can be viewed as outweighing the slight performance reduction for non-memory-intensive ones, since the memory wall is the obstacle one is attempting to overcome. [1]

Possible Solutions

Hotspots

One of the largest thermal problems in a CPU design is hotspots. Some blocks of a CPU have higher power consumption than others, resulting in higher temperatures in those blocks and hot spots on the die. These hotspots become the limiting factor for system frequency in the resulting design: while the majority of blocks may be well below their thermal limit and could operate at higher speeds, the few blocks that consume large amounts of power generate the hotspots that dictate the overall system frequency. In one example, the largest hotspots are the instruction TLB and the data cache, as seen in Figure 6.3. [2]

Figure 6.3: Thermal hotspots in a 2D layout. [2]

Fortunately, a 3D structure offers several ways to reduce these hotspots, because certain CPU blocks can be stacked above one another to enable optimizations. One example is CAM-based circuitry, which contains a very complex wire structure. These wire networks consume large amounts of power; if the block is split in two and the halves are stacked on top of each other, the wire networks are simplified and reduced in size, greatly reducing the block's latency and power consumption.

Another advantage can be achieved by splitting a block in two such that only one half is active at a given time. An example is a cache split with half of the cache lines on one die/layer and the other half on the next die. The cache's decoder addresses only one cache line at a time, so only one of the two dies consumes dynamic power at any given moment, cutting each die's power consumption in half. This allows one of the large, dense hotspots, the data cache, to be split into two smaller hotspots on separate dies of a multilayer CPU. Additionally, two blocks that are not expected to be in use at the same time, such as an integer adder and a floating-point adder, can be layered on top of each other; stacking one on top of the other lowers the effective power density in that area of the two dies.
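The benefit of pairing mutually exclusive blocks is easy to state numerically. The sketch below compares the peak power seen by a stacked column of two blocks when they can fire in the same cycle versus when, like the integer and floating-point adders, at most one is active at a time; the wattages are assumptions for illustration.

    def peak_column_power(p_a_w, p_b_w, mutually_exclusive):
        # A stacked column is heated by every die along its sink path, so
        # its worst case is the sum unless the blocks never fire together.
        return max(p_a_w, p_b_w) if mutually_exclusive else p_a_w + p_b_w

    print(peak_column_power(3.0, 3.0, False))  # 6.0 W: arbitrary blocks stacked
    print(peak_column_power(3.0, 3.0, True))   # 3.0 W: int/FP adder pairing

The same reasoning applies to the split data cache: because the decoder activates only one die's half per access, the two stacked halves never dissipate their dynamic power simultaneously.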

Thermal Herding

If a 3D structure allows the CPU itself to be split into multiple layers, there are other interesting solutions that improve heat dissipation by zoning the major heat sources. A specific implementation of this is referred to as "thermal herding," the goal being to place the largest heat sources near the heat sink. This method takes advantage of several different optimizations; one in particular exploits the fact that integer calculations may only require the lower 16 bits or fewer of a 64-bit CPU. Since this situation is easily predicted, it is possible to speculate about when the upper portions of a register may be disabled. Structurally, the lower bits of the register are placed on the die closest to the heat sink, and the upper, potentially disabled bits are placed on subsequent layers below the less significant bits, as seen in Figure 6.4.

Figure 6.4: Thermal herding applied to 64-bit registers in a multilayer CPU design. [3]

If the overhead of register-length prediction can be kept low, the thermal profile can be improved by placing the logic most likely to be active closest to the heat sink. Similar arrangements can be created for CPU blocks such as ALUs, bypass networks, schedulers, queues, and caches. By thermally herding CPU blocks in this manner, the study's simulations found a worst-case temperature increase of 12 °C, compared to a 17 °C rise without thermal herding on similar 3D CPUs. Additionally, compared to an equivalent 2D processor, thermal herding reduced power consumption by 20% while simultaneously raising CPU frequency by 40% on average across the applications simulated. [3]
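Width prediction itself is simple enough to sketch. The function below is a hypothetical stand-in for the predictor (the real design predicts widths rather than inspecting computed values); it counts how many 16-bit register slices, and therefore how many dies, a result would actually exercise, with the slice nearest the heat sink holding the low-order bits.

    def active_layers(value, bits_per_layer=16, layers=4):
        # Layer 0 (bits 0-15) sits closest to the heat sink; wider results
        # progressively wake the cooler, more distant layers.
        needed_bits = max(value.bit_length(), 1)
        slices = (needed_bits + bits_per_layer - 1) // bits_per_layer
        return min(slices, layers)

    for v in (42, 70_000, 2**40):
        print(v, "->", active_layers(v), "layer(s) switching")

For the common case of small integer operands, only the layer next to the heat sink switches, which is precisely the herding effect the study exploits.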

While there is much to be gained from these techniques, they most likely will not be implementable in the first generation of 3D memory-CPU structures, because a multilayer CPU is a larger challenge than simply stacking memory on the processor. But once die-stacking technology improves and multilayer CPUs become possible, these additional methods of reducing power and heat will become available as well.

Final Conclusion

Stacking memory on top of the processor to create a 3D system is a promising method for overcoming the ever-increasing memory wall. Even the simplest configurations yield significant results, and a 3D system opens up a myriad of new ways to improve memory performance. Taking advantage of these new possibilities could drastically reduce or even eliminate the memory wall problem, as discussed in the architecture optimizations section of this paper.

As promising as it is, 3D memory is not without its challenges. As with any processor, heat is an issue, but in a 3D system it is compounded. There are a number of methods for overcoming thermal problems, but each has its costs and limitations; the thermal challenges of 3D memory were discussed in the thermal analysis section of this paper. Additionally, manufacturing is a significant barrier to widespread adoption of a 3D-stacked memory system: die-to-die vias are difficult to manufacture and align properly, and a massive level of communication between the processor and memory manufacturers would be required to make the system function properly.

Due to these challenges, 3D memory will likely progress slowly. As the basic technology and die-to-die via interfaces improve, 3D-stacked memory systems may become more widely used. Initially, corporations will probably take advantage of easy gains, such as simple stacking and 3D-base systems. As these systems prove useful, the architectural optimizations that require some redesign may be implemented. 3D memory has great potential; the main question is when this potential will outweigh the associated costs.

References

Intro, Background, and 3D Stacked Memory

1. B. Lee. ECE 570 lecture notes, Oregon State University, Feb. 2012.
2. Y. Li, Y. Liu, and T. Zhang. Exploiting Three-Dimensional (3D) Memory Stacking to Improve Image Data Access Efficiency for Motion Estimation Accelerators. Elsevier J. Signal Processing: Image Communication (Special Issue on Breakthrough Architecture for Image and Video Systems), vol. 25, no. 5, Jun. 2010.
3. EDA360 Insider (blog), 12 Dec.
4. S. Borkar et al. Platform 2015: Intel Processor and Platform Evolution for the Next Decade. Intel White Paper, Mar. 2005.
5. B. Davis. Modern DRAM Architectures.
6. A. Al Maashri, G. Sun, X. Dong, V. Narayanan, and Y. Xie. 3D GPU Architecture Using Cache Stacking: Performance, Cost, Power and Thermal Analysis. Mar.
7. Wikipedia.
8. Wikipedia: Dynamic Random-Access Memory (DRAM).
9. RAM_n.jpg (image).
10. Bridging the Processor-Memory Gap.
11. 3D Stacked Memory Architecture for Multi-core Processors.

Performance

1. G. Loi, B. Agrawal, N. Srivastava, S. Lin, T. Sherwood, and K. Banerjee. A Thermally-Aware Performance Analysis of Vertically Integrated (3-D) Processor-Memory Hierarchy. Proceedings of the 43rd Annual Design Automation Conference, 2006.

Architecture Optimizations

1. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari. Bridging the Processor-Memory Performance Gap with 3D IC Technology. IEEE Design & Test of Computers, 22(6), 2005.
2. G. Loh. 3D-Stacked Memory Architectures for Multi-core Processors. Proceedings of the International Symposium on Computer Architecture, 2008.
3. D. H. Woo, N. H. Seong, D. L. Lewis, and H.-H. S. Lee. An Optimized 3D-Stacked Memory Architecture by Exploiting Excessive, High-Density TSV Bandwidth. Proceedings of the International Symposium on High Performance Computer Architecture, Jan. 2010.
4. K. Puttaswamy and G. Loh. Implementing Caches in a 3D Technology for High Performance Processors. Proceedings of the 2005 International Conference on Computer Design, Oct. 2-5, 2005.
5. Y. Xie, G. Loh, B. Black, and K. Bernstein. Design Space Exploration for 3D Architectures. ACM Journal on Emerging Technologies in Computing Systems, 2006.
6. G. Loh, Y. Xie, and B. Black. Processor Design in 3D Die-Stacking Technologies. IEEE Micro, pp. 31-48, May 2007.

Thermal Analysis

1. G. Loi, B. Agrawal, N. Srivastava, S. Lin, T. Sherwood, and K. Banerjee. A Thermally-Aware Performance Analysis of Vertically Integrated (3-D) Processor-Memory Hierarchy. Proceedings of the 43rd Annual Design Automation Conference, 2006.
