Figure 5.2: (a) Floor plan examples for varying the number of memory controllers and ranks. (b) Example configuration.

The study found that a 16-rank, 4-memory-controller system obtained a significant speedup over a 4-rank, 1-memory-controller system, showing that substantial gains can be obtained through architectural changes in this area. For additional information on the techniques used and the data gathered by this study, the reader is referred to the reference section.

Increase DRAM Row Buffer Entries

DRAM row buffers essentially function as a DRAM cache: if an entry is in the row buffer for a bank, the memory is read from there, reducing the overall access time of main memory. In both 2D and 3D systems, increasing the number of row buffer entries per bank improves performance. However, the benefit is much more visible in a 3D system, where wire delay and other factors don't dwarf DRAM access time. This makes it much more beneficial to add a few more row buffer entries per bank in a 3D stacked memory system.

The same paper that discussed increasing the number of ranks and memory controllers also analyzed the effect of increasing the number of row buffer entries in each bank. It found that increasing the number of row buffer entries to 4 gave a speedup of 1.3. Of course, additional row buffer entries are not free: extra entries increase the complexity of the logic, and each entry requires 4 KB of space. On a 16-rank system with 8 banks per rank, increasing the number of row buffer entries to 4 requires 2 MB of total space. The study discussed these costs and even ran a simulation comparing the increased row buffer performance to an alternate case where the extra storage space was allocated to the L2 cache. It found that, even though the L2 has a faster access time, enlarging the L2 was not as beneficial as increasing the number of row buffer entries in main memory. This was because increasing the size of the L2 cache did not greatly reduce L2 misses, while increasing the number of row buffer entries greatly increased the number of row buffer hits.
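To make the storage arithmetic concrete, and to show the mechanism by which extra row buffer entries turn misses into hits, consider the following minimal sketch in Python. The storage figures come from the study; the hit and miss latencies and the LRU policy are illustrative assumptions, not numbers from the paper.

    # Storage cost of growing each bank's row buffer to 4 entries
    # (configuration from the study: 16 ranks, 8 banks per rank, 4 KB/entry).
    ranks, banks_per_rank, entries, entry_kb = 16, 8, 4, 4
    print(ranks * banks_per_rank * entries * entry_kb // 1024, "MB")  # -> 2 MB

    # Toy row-buffer model: an access hits if its row is already buffered.
    # T_HIT and T_MISS are assumed latencies for illustration only.
    T_HIT, T_MISS = 15, 50  # ns

    def access(row_buffer, row, capacity=4):
        if row in row_buffer:            # row buffer hit: skip the array access
            row_buffer.remove(row)
            row_buffer.append(row)       # refresh LRU position
            return T_HIT
        row_buffer.append(row)           # activate the row in a buffer entry
        if len(row_buffer) > capacity:
            row_buffer.pop(0)            # evict the least recently used row
        return T_MISS

With a single entry per bank, two alternating rows thrash the buffer on every access; with four entries, the same pattern hits after the first touch, which is exactly the effect the study measured.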

Stacked Memory-Aware, Rich TSV-enabled 3D Memory Hierarchy

This section describes a rather unique method of taking advantage of the high bandwidth available in a 3D system. The method is proposed in the paper [citation] and will henceforth be referred to as SMART-3D, as it is in the original paper. While this section covers a specific method and its implementation, the concepts can be applied in other areas, and it shows how redesigning an architecture with 3D-stacked memory in mind can be beneficial.

SMART-3D proposes an L2 cache that supports standard 64-byte read and write operations for the L1 cache, with a 4 KB bus for fill and write-back operations to main memory. A standard planar H-tree layout is used for the read and write operations. Essentially, the size of the L2 cache line varies with the perspective: from the perspective of the L1, the L2 has a 64-byte cache line, while from the perspective of main memory, it has a 4 KB cache line. This abstraction is provided by separating the L2 cache into 64 sub-banks, each with a vertical 64 B bus, allowing 64 simultaneous 64-byte transfers between main memory and the L2. This reduces L2 performance slightly by increasing the hit access time, but that is made up for by the reduced miss rate.

The main benefit of this implementation is that it exploits spatial locality: many of the advantages of a standard 4 KB cache line are realized while avoiding many of the downsides. The most significant upside of SMART-3D compared to a standard 4 KB cache line is that the L2 cache access time is much lower. This translates to noticeable performance improvements for even moderately memory-intensive applications.

Several major design decisions were discussed in the SMART-3D paper. For example, when data is evicted from the L2 cache, it is possible to evict at a 64 B or a 4 KB granularity; that is, should the cache evict the least-recently-used 64-byte line from each sub-bank (local LRU) or the least-recently-used page (global LRU)? Simulations showed that, though the results were application-specific, the miss rates differed by a maximum of 0.033%. It was therefore decided that the additional complexity of a local LRU system would not be cost-effective, and a global LRU system was implemented. Another issue considered is that one L2 eviction might lead to up to 64 L1 evictions, depending on the state of the caches. The paper argues that this does not create a problem if each L2 cache line keeps an inclusion bit to minimize unnecessary eviction traffic sent to the L1.

Overall, this implementation showed speedups of 1.71x for memory-intensive applications and 1.30x for standard applications over the 3D-base system. These are the most general numbers; for performance results in specific cases, the reader is referred to the references section.
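A minimal sketch of SMART-3D's dual-granularity behavior may make the idea easier to see. The class below is hypothetical (the paper defines no such interface); it only captures the two views of the L2: 64 B lines toward the L1, and whole 4 KB pages, filled in one shot over the 64 parallel sub-bank buses, toward main memory, with the global (page-granularity) LRU eviction the paper chose.

    LINE_B, SUBBANKS = 64, 64
    PAGE_B = LINE_B * SUBBANKS           # 4 KB fill/write-back granularity

    class Smart3DL2:                     # hypothetical model, not the real design
        def __init__(self, capacity_pages=512):
            self.capacity = capacity_pages
            self.pages = []              # resident 4 KB pages in LRU order

        def fill_from_memory(self, page):
            # All 64 sub-banks receive their 64 B slice simultaneously over
            # the vertical TSV buses, so a 4 KB fill is one bus transaction.
            self.pages.append(page)
            if len(self.pages) > self.capacity:
                self.pages.pop(0)        # global LRU: evict the whole LRU page

        def read_line(self, addr):
            page = addr // PAGE_B        # the L1 still sees ordinary 64 B lines
            if page in self.pages:
                self.pages.remove(page)
                self.pages.append(page)
                return True              # hit
            self.fill_from_memory(page)  # one wide fill brings in 64 lines
            return False                 # miss

The point of the model is the last branch of read_line: a single miss pulls in 64 adjacent lines at once, which is where the spatial-locality benefit comes from.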

Splitting Cache Across Layers

In addition to the many optimizations already discussed, it is possible to greatly reduce the worst-case wire length in the caches by taking advantage of the 3D layout. One way to do this is to partition the cache between separate silicon layers. Figure 5.3 shows the most general way to do this.

Figure 5.3: (a) Standard 8-bank 2D cache layout. (b) The cache partitioned across layers by bank.

As can be seen, the worst-case wire length is significantly reduced in this system. If x and y are assumed to be equal, the wire length is cut in half (the sketch after Figure 5.4 reproduces this estimate). This translates to noticeable reductions in delay and energy use.

There are also partitioning options at a finer granularity: a cache can be partitioned by wordline or by bitline, as Figure 5.4 shows. In each partitioning option, wire lengths are reduced. These finer-grained options provide a greater benefit than simple bank-stacking, because the wire-reduction benefits appear both at the global routing level and at the individual array level. However, they are also more complex: stacking banks requires no redesign, just simple stacking, while splitting the arrays themselves does.

Figure 5.4: (left) Standard 2D SRAM array. (middle) Split by bitline. (right) Split by wordline.
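To reproduce the factor-of-two claim, a toy wire-length model is enough. The model below assumes the worst-case route is a planar run across the layout and that moving half the banks to a second layer halves that run, with the die-to-die via adding only a negligible vertical hop; these are simplifying assumptions, not figures from the study.

    def worst_case_wire_2d(x, y):
        return x + y                      # corner-to-corner planar route

    def worst_case_wire_stacked(x, y, via_len=0.0):
        # Half the banks move to the second layer, halving the planar
        # distance the worst-case route must cover; the via hop is tiny.
        return (x + y) / 2 + via_len

    print(worst_case_wire_2d(1.0, 1.0))       # 2.0 (normalized units)
    print(worst_case_wire_stacked(1.0, 1.0))  # 1.0 -> halved, as claimed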

Puttaswamy and Loh conducted a study, Implementing Caches in a 3D Technology for High Performance Processors, simulating access latency and energy across several different cache sizes. They implemented array stacking, split by bitline, and found greatly varying benefits depending on the cache size:

Figure 5.5: (left) Table comparing cache size to access energy, showing the savings obtained by partitioning the cache. (right) Table comparing cache size to access latency, again showing savings.

A key point is that, for latency, what matters is not the percent reduction but the cycle reduction. A 15% delay reduction doesn't matter if the number of cycles required for the access doesn't change; for example, at 3 GHz, shortening an access from 1.0 ns to 0.85 ns still costs three cycles either way. For energy reductions, on the other hand, every small improvement helps.

It would be very interesting to delve deeper into optimizing the cache architecture for a 3D system at an even finer granularity. For example, partitioning each bit cell across layers could provide even greater benefits. However, this is limited by the size of the die-to-die vias: currently, each via is larger than an SRAM cell, which makes further optimization impossible without technology improvements.

Optimization Summary

These optimizations have all been shown to significantly improve the performance of a processor, and the possibilities presented by a 3D-stacked processor are nearly endless. So the question is: why is 3D not the standard? The answer is that there are other issues that were not discussed in this section. With any processor, heat must be dealt with; as technology improves and energy density increases, heat becomes a problem, and stacking DRAM cells on top of the processor can only make it worse. Additionally, manufacturing is an issue: through-silicon vias are difficult to manufacture reliably, and significant communication between the processor manufacturer and the DRAM manufacturer would be required. There are certainly challenges facing 3D memory, but the benefits appear great, and as the memory wall worsens, more research may be done in this area.

Thermal Analysis

One of the biggest obstacles a 3-dimensional memory/CPU structure must overcome is power management and heat removal. Traditional 2D layouts have the benefit of being separate modules, each easily cooled with its own heat sink. A 2D CPU places the bulk silicon directly against a heat spreader mounted to a heat sink, allowing direct heat conduction away from the CPU to ambient. The separate memory modules generate less heat in comparison and are easily cooled in a similar fashion at a separate location.

With the CPU and memory integrated in the same stack, however, there is more obstruction to heat removal, as seen in Figure 6.1(a). Here the top die (Layer N) would be mounted against the motherboard using a socket, similar to the 2D design. However, heat is now generated in several different layers, increasing the power density and the heat-removal requirements for the same effective heat sink area. The upper layers must reach an even higher temperature than the layers below in order to establish the gradient required for heat to flow to the heat sink. Assuming the heat sink is maintained at ambient, stacking comparable dies leads to higher overall temperatures for a 3D design than for an equivalent 2D design. This is further compounded by the fact that leakage current is exponentially dependent on temperature: once the temperature is elevated, the processor's static power increases at an exponential rate. [1] Additionally, having multiple layers consuming power within a given footprint and effective heat sink area results in a higher power density for the 3D structure.

Figure 6.1: 3D processor and memory stack, including the heat model. [1]
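The gradient argument above can be made concrete with a one-dimensional stack model like the one in Figure 6.1(b): all heat generated above a given boundary must cross it on the way down to the sink, so temperature accumulates toward the top of the stack. The per-layer thermal resistance and power values below are arbitrary assumptions chosen only to illustrate the shape of the result.

    def layer_temps(layer_powers_w, r_layer_k_per_w, t_sink_c=25.0):
        # layer_powers_w[0] is the die nearest the heat sink.
        temps, t = [], t_sink_c
        for i in range(len(layer_powers_w)):
            # Heat from this layer and every layer above it must cross the
            # resistance below this layer, raising its temperature.
            t += sum(layer_powers_w[i:]) * r_layer_k_per_w
            temps.append(t)
        return temps

    # Four identical 20 W dies over a sink held at ambient:
    print(layer_temps([20.0] * 4, 0.3))   # [49.0, 67.0, 79.0, 85.0] C

Each layer comes out hotter than the one below it, which is the gradient the text describes; note how quickly the top die's temperature climbs even with modest per-layer power.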

Performance Effects

Countless tests and simulations have been performed to evaluate the performance improvements offered by 3D memory structures, but because the heat problem is difficult to simulate, few studies have accurately considered heat's effect on system performance. An analysis completed at the University of California [1] did consider these thermal effects when comparing the performance of 2D and 3D structures. It pointed out that previous studies had underestimated the negative effects of heat on system performance because small-scale processes had not been evaluated. This study simulated a 120 nm process, where leakage power is highly temperature-sensitive. Similar 2D and 3D structures were compared against each other, with the cache and memory designs assumed to be identical. For more detail on the study, see the earlier Performance Comparison section of this survey.

The thermal analysis consisted of a simple one-dimensional model, seen in Figure 6.1(b). This model considers how each layer's generated heat must dissipate through the layers below it in order to reach the heat sink at the bottom of the stack. Consequently, the top layers must reach a temperature greater than that of the lower layers in order to drive the downward heat flow to the sink. Further complicating the issue is the relatively low thermal conductivity of silicon. The simple way to avoid exceeding the heat limit on any layer is to lower the heat generated, that is, the power. Since both structures were assumed to use the same process, and thus the same voltage with negligible differences in capacitance, the only way to reduce the power and heat generated is to lower the operating frequency of the CPU, which may in turn limit performance.

The results of this study can be seen in Figure 6.2. The top graph shows a performance comparison for the memory-intensive application. The positively sloped lines correspond to the right axis, the limiting CPU temperature. The fact that the 3D structure is 5-10 °C hotter at a given CPU frequency is due to the layering effect discussed previously. To compare the 2D and 3D structures fairly, their performance must be evaluated at a single operating temperature, since the heat-removal capability of the sink is assumed to be the same for each. The graph marks a temperature constraint of 100 °C with the dotted horizontal line, and the operating frequency is determined by where each structure's temperature curve crosses that limit: here it is 2.35 GHz for the 3D structure and 2.85 GHz for the 2D structure. With the operating frequency determined, the execution time can be read from the left axis: at these frequencies it is 0.6 ns for the 3D structure and 2.5 ns for the 2D structure. The result is that even though the 3D structure operates at an 18% reduction in frequency to avoid exceeding the temperature limit, it achieves a 76% improvement in execution time.

When considering the application with low memory use, different results are found. Here there is no difference in performance at a given operating frequency: since memory is not a bottleneck, it is not throttling execution time. Comparing both structures under the 100 °C temperature limit, the 3D structure is still limited to 2.35 GHz, with an execution time of 0.41 ns, while the 2D structure can operate at 2.8 GHz with a 0.35 ns execution time. The resulting 17% reduction in operating frequency scaled performance down by 17% as well.
Here the 2D structure outperformed the 3D structure, due solely to the limit placed on operating frequency to reduce power consumption and heat generation. The lower temperature-imposed frequency limit of the 3D structure hindered its performance in this comparison.
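The feedback loop the study wrestles with (frequency sets dynamic power, power sets temperature, and temperature inflates leakage, which feeds back into power) can be sketched as a fixed-point iteration. Every constant below is an assumption for illustration; only the final two cross-check lines use the study's reported numbers.

    import math

    def steady_temp(f_ghz, r_th=0.5, t_amb=25.0):
        t = t_amb
        for _ in range(200):                        # iterate to a fixed point
            p_dyn = 30.0 * f_ghz                    # W, proportional to frequency
            p_leak = 8.0 * math.exp(0.02 * (t - t_amb))   # W, exponential in T
            t = t_amb + r_th * (p_dyn + p_leak)     # 1-D thermal model again
        return t

    def max_freq_under(t_limit_c, step=0.05):
        f = step
        while steady_temp(f + step) <= t_limit_c:   # highest f that stays cool
            f += step
        return f

    print(round(max_freq_under(100.0), 2), "GHz under a 100 C cap")

    # Cross-checking the study's headline numbers:
    print(round(100 * (1 - 2.35 / 2.85)))           # ~18% frequency reduction
    print(round(100 * (1 - 0.6 / 2.5)))             # 76% execution-time gain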

Figure 6.2: Execution time per instruction and maximum chip temperature as a function of operating frequency for 2D and 3D processors. (a) Highly memory-intensive application and (b) non-memory-intensive application.

The conclusion from this study is that the 3D structure offers great improvements in memory bandwidth and speed. These improvements are highly effective at lowering the memory wall and improving system performance for memory-intensive applications. However, the reduced ability to remove heat from the 3D structure, and the resulting lower CPU frequency, reduce system performance for non-memory-intensive applications, because those applications were never limited by memory access in the first place. The final result is that the improved performance of memory-intensive applications comes at the cost of lowered performance in non-memory-intensive ones. With a 3D memory structure, however, the large performance improvement for memory-intensive applications can be viewed as outweighing the slight performance reduction for non-memory-intensive ones, since the memory wall is the obstacle one is attempting to overcome. [1]

Possible Solutions

Hotspots

One of the largest thermal problems in a CPU design is hotspots. Some blocks of a CPU have higher power consumption than others, resulting in higher temperatures in those blocks and hot spots on the die. These hotspots become the limiting factor for system frequency in the resulting design: while the majority of blocks may be well below their thermal limit and could operate at higher speeds, the few blocks that consume large amounts of power generate the hotspots that dictate the overall system frequency. In one example, the largest hotspots are the instruction TLB and the data cache, as seen in Figure 6.3. [2]

Figure 6.3: Thermal hotspots in a 2D layout. [2]

Fortunately, a 3D structure offers several ways to reduce these hotspots, because certain CPU blocks can be stacked above one another to enable optimizations. One example is CAM-based circuitry, which contains a very complex wire structure. These wire networks consume large amounts of power; if the block is split in two and the halves are stacked on top of each other, the wire networks are simplified and reduced in size, greatly reducing the block's latency and power consumption.

Another advantage can be achieved by splitting a block in two such that only one half is active at a given time. An example is a cache split with half of the cache lines on one die/layer and the other half on the next die. The cache's decoder addresses only one cache line at a time, so only one of the two dies consumes dynamic power at any given moment, cutting each die's power consumption in half. This allows one of the large, dense hotspots, the data cache, to be split into two smaller hotspots on separate dies of a multilayer CPU. Additionally, two blocks that are not expected to be in use at the same time, such as an integer adder and a floating-point adder, can be layered on top of each other; stacking one on top of the other lowers the effective power density in that area of the two dies.
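The benefit of pairing mutually exclusive blocks is easy to state numerically. The sketch below compares the peak power seen by a stacked column of two blocks when they can fire in the same cycle versus when, like the integer and floating-point adders, at most one is active at a time; the wattages are assumptions for illustration.

    def peak_column_power(p_a_w, p_b_w, mutually_exclusive):
        # A stacked column is heated by every die along its sink path, so
        # its worst case is the sum unless the blocks never fire together.
        return max(p_a_w, p_b_w) if mutually_exclusive else p_a_w + p_b_w

    print(peak_column_power(3.0, 3.0, False))  # 6.0 W: arbitrary blocks stacked
    print(peak_column_power(3.0, 3.0, True))   # 3.0 W: int/FP adder pairing

The same reasoning applies to the split data cache: because the decoder activates only one die's half per access, the two stacked halves never dissipate their dynamic power simultaneously.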

Thermal Herding

If a 3D structure allows the CPU itself to be split into multiple layers, there are other interesting solutions that improve heat dissipation by zoning the major heat sources. A specific implementation of this is referred to as "thermal herding," the goal being to place the largest heat sources near the heat sink. This method takes advantage of several different optimizations; one in particular exploits the fact that integer calculations may only require the lower 16 bits or fewer of a 64-bit CPU. Since this situation is easily predicted, it is possible to speculate about when the upper portions of a register may be disabled. Structurally, the lower bits of the register are placed on the die closest to the heat sink, and the upper, potentially disabled bits are placed on subsequent layers below the less significant bits, as seen in Figure 6.4.

Figure 6.4: Thermal herding applied to 64-bit registers in a multilayer CPU design. [3]

If the overhead of register-length prediction can be kept low, the thermal profile can be improved by placing the logic most likely to be active closest to the heat sink. Similar arrangements can be created for CPU blocks such as ALUs, bypass networks, schedulers, queues, and caches. By thermally herding CPU blocks in this manner, the study's simulations found a worst-case temperature increase of 12 °C, compared to a 17 °C rise without thermal herding on similar 3D CPUs. Additionally, compared to an equivalent 2D processor, thermal herding reduced power consumption by 20% while simultaneously raising CPU frequency by 40% on average across the applications simulated. [3]
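Width prediction itself is simple enough to sketch. The function below is a hypothetical stand-in for the predictor (the real design predicts widths rather than inspecting computed values); it counts how many 16-bit register slices, and therefore how many dies, a result would actually exercise, with the slice nearest the heat sink holding the low-order bits.

    def active_layers(value, bits_per_layer=16, layers=4):
        # Layer 0 (bits 0-15) sits closest to the heat sink; wider results
        # progressively wake the cooler, more distant layers.
        needed_bits = max(value.bit_length(), 1)
        slices = (needed_bits + bits_per_layer - 1) // bits_per_layer
        return min(slices, layers)

    for v in (42, 70_000, 2**40):
        print(v, "->", active_layers(v), "layer(s) switching")

For the common case of small integer operands, only the layer next to the heat sink switches, which is precisely the herding effect the study exploits.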

While there is much to be gained from these techniques, they most likely will not be implementable in the first generation of 3D memory-CPU structures, because a multilayer CPU is a larger challenge than simply stacking memory on the processor. But once die-stacking technology improves and multilayer CPUs become possible, these additional methods of reducing power and heat will become available as well.

Final Conclusion

Stacking memory on top of the processor to create a 3D system is a promising method for overcoming the ever-increasing memory wall. Even the simplest configurations yield significant results, and a 3D system opens up a myriad of new ways to improve memory performance. Taking advantage of these new possibilities could drastically reduce or even eliminate the memory wall problem, as discussed in the architecture optimizations section of this paper.

As promising as it is, 3D memory is not without its challenges. As with any processor, heat is an issue, but in a 3D system it is compounded. There are a number of methods for overcoming thermal problems, but each has its costs and limitations; the thermal challenges of 3D memory were discussed in the thermal analysis section of this paper. Additionally, manufacturing is a significant barrier to widespread adoption of a 3D-stacked memory system: die-to-die vias are difficult to manufacture and align properly, and a massive level of communication between the processor and memory manufacturers would be required to make the system function properly.

Due to these challenges, 3D memory will likely progress slowly. As the basic technology and die-to-die via interfaces improve, 3D-stacked memory systems may become more widely used. Initially, corporations will probably take advantage of easy gains, such as simple stacking and 3D-base systems. As these systems prove useful, the architectural optimizations that require some redesign may be implemented. 3D memory has great potential; the main question is when this potential will outweigh the associated costs.

References

Intro, Background, and 3D Stacked Memory

1. B. Lee. ECE 570 lecture notes, Oregon State University, Feb. 2012.
2. Y. Li, Y. Liu, and T. Zhang. Exploiting Three-Dimensional (3D) Memory Stacking to Improve Image Data Access Efficiency for Motion Estimation Accelerators. Elsevier J. Signal Processing: Image Communication (Special Issue on Breakthrough Architecture for Image and Video Systems), vol. 25, no. 5, Jun. 2010.
3. EDA360 Insider (blog), 12 Dec.
4. S. Borkar et al. Platform 2015: Intel Processor and Platform Evolution for the Next Decade. Intel White Paper, Mar. 2005.
5. B. Davis. Modern DRAM Architectures.
6. A. Al Maashri, G. Sun, X. Dong, V. Narayanan, and Y. Xie. 3D GPU Architecture Using Cache Stacking: Performance, Cost, Power and Thermal Analysis. Mar.
7. Wikipedia.
8. Wikipedia: Dynamic Random-Access Memory (DRAM).
9. RAM_n.jpg (image).
10. Bridging the Processor-Memory Gap.
11. 3D Stacked Memory Architecture for Multi-core Processors.

Performance

1. G. Loi, B. Agrawal, N. Srivastava, S. Lin, T. Sherwood, and K. Banerjee. A Thermally-Aware Performance Analysis of Vertically Integrated (3-D) Processor-Memory Hierarchy. Proceedings of the 43rd Annual Design Automation Conference, 2006.

Architecture Optimizations

1. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari. Bridging the Processor-Memory Performance Gap with 3D IC Technology. IEEE Design & Test of Computers, 22(6), 2005.
2. G. Loh. 3D-Stacked Memory Architectures for Multi-core Processors. Proceedings of the International Symposium on Computer Architecture, 2008.
3. D. H. Woo, N. H. Seong, D. L. Lewis, and H.-H. S. Lee. An Optimized 3D-Stacked Memory Architecture by Exploiting Excessive, High-Density TSV Bandwidth. Proceedings of the International Symposium on High Performance Computer Architecture, Jan. 2010.
4. K. Puttaswamy and G. Loh. Implementing Caches in a 3D Technology for High Performance Processors. Proceedings of the 2005 International Conference on Computer Design, Oct. 2-5, 2005.
5. Y. Xie, G. Loh, B. Black, and K. Bernstein. Design Space Exploration for 3D Architectures. ACM Journal on Emerging Technologies in Computing Systems, 2006.
6. G. Loh, Y. Xie, and B. Black. Processor Design in 3D Die-Stacking Technologies. IEEE Micro, pp. 31-48, May 2007.

Thermal Analysis

1. G. Loi, B. Agrawal, N. Srivastava, S. Lin, T. Sherwood, and K. Banerjee. A Thermally-Aware Performance Analysis of Vertically Integrated (3-D) Processor-Memory Hierarchy. Proceedings of the 43rd Annual Design Automation Conference, 2006.
