Using a Cache Simulator on Big Data Applications
Liliane Ntaganda, Spelman College, lntagand@scmail.spelman.edu
Hyesoon Kim, Georgia Institute of Technology, hyesoon@cc.gatech.edu

ABSTRACT

From the computer architect's perspective, a Big Data benchmark is first of all a heavily memory-intensive application. Because of its overwhelmingly large amount of data access, however, it is more than an ordinary memory-intensive application, and researchers are therefore trying to design new systems that can handle such applications more efficiently in terms of both energy and performance. This research project requires an efficient cache simulator with which to study and analyze new memory architectures that might improve performance for Big Data applications without sacrificing energy consumption. To benefit fully from such a simulator, various cache features have to be considered: N-way set associativity, an appropriate replacement policy, appropriate write and read policies, and different levels of closeness and accessibility to the microprocessor. By running several simulations on a multicore, multilevel cache simulator, the results show that cache performance improves when cache parameters are varied appropriately, and that increasing overall system performance requires greatly reducing the cache miss rate. This paper presents the results obtained by running application traces on the implemented cache simulator.

Keywords

Big Data Benchmark, Cache Simulator, DineroIV cache simulator, Data Cache, Traces, Cache Hit Rate, Cache Miss Rate, N-way Set Associativity, Write Through policy, Write Allocate policy, LRU (Least Recently Used) replacement policy, SRAM, DRAM, CPU (Central Processing Unit), C

1. OBJECTIVES

The goal of this research is to explore a new memory architecture capable of processing the huge amounts of data generally referred to as Big Data. To this end, the first stage of the research is to understand the important characteristics of the applications from which Big Data benchmarks are collected. In addition, a cache simulator has to be implemented with which new memory architectures that might improve performance for Big Data applications can be studied and analyzed. Understanding how a cache works is necessary to design and implement an accurate and efficient cache simulator. In this project, a CPU cache simulator is written in C++, and various simulations are performed to draw observations and conclusions. Observations from program testing are used to analyze how the cache miss rate can be reduced and to take appropriate measures to significantly improve the simulator's performance.

2. CACHE ARCHITECTURE AND ORGANIZATION

2.1 Overview of Cache

Figure 1: Basic Cache Model

Figure 1 shows a simplified diagram of a system with a cache. In this system, every time the CPU performs a read or a write, the cache may intercept the bus transaction, allowing the cache to decrease the response time of the system [1]. A cache memory is thus a memory that the computer's microprocessor (CPU) can access more quickly than it can access regular RAM. A cache is a small, high-speed memory, usually an SRAM, that contains the most recently accessed pieces of main memory [1]. It is a component that transparently stores data so that future requests for that data can be served faster.
Hence, the greater the number of requests that can be served from the cache, the better the overall system performance. Why is this high-speed memory necessary or beneficial? In today's systems, the time it takes to bring an instruction or a piece of data into the processor is very long compared with the time to execute that instruction [1], so a bottleneck forms at the input to the processor. Cache memory helps by decreasing the time it takes to move information to and from the processor. How can such a small piece of high-speed memory improve system performance? The explanation is the principle of locality of reference [2]: at any given time the processor accesses memory in a small, localized region, and the cache loads this region so that the processor can access it faster.
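The paper itself contains no code, but as an illustration of the locality principle just described, the following short C++ fragment (our own example, not part of the original simulator) contrasts a row-major traversal, which enjoys both temporal and spatial locality, with a column-major traversal of the same data, which strides through memory and typically misses far more often.

```cpp
#include <cstddef>

// Row-major traversal: consecutive addresses are touched one after another
// (spatial locality) and the accumulator is reused on every iteration
// (temporal locality), so most accesses hit in the cache.
double sum_row_major(const double* m, std::size_t rows, std::size_t cols) {
    double sum = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];   // neighbouring elements share cache blocks
    return sum;
}

// Column-major traversal of the same row-major array: each access jumps
// 'cols' elements ahead, so a new cache block is touched almost every time
// and the miss rate is much higher, even though the result is identical.
double sum_column_major(const double* m, std::size_t rows, std::size_t cols) {
    double sum = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];
    return sum;
}
```

Both functions compute the same sum; only the order of memory references differs, and that order is exactly the property a cache simulator measures.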
2.2 Features Implemented in the Cache Simulator

The features implemented in the cache simulator program are as follows:

- Multicore, multilevel cache: the simulator supports multiple levels of cache memory. Users specify the number of levels they want the simulator to have, and each level can contain more than one cache. Each cache on the first level, the level closest to the CPU, is private to its core.
- N-way set associativity: the simulator supports more than one degree of associativity. With an N-way set-associative cache, cache slots are grouped into sets; to locate a cache block, you first find the set for a given address and then search the slots within that set. This scheme has fewer collisions because there are more slots to pick from, even when cache lines map to the same set. (A sketch of this lookup appears after Section 3.3.)
- Write-through policy: on a write hit, the information is written both to the block in the cache and to the block in the lower-level memory [1].
- Write-allocate policy: on a write miss, a block containing the missed data is allocated and loaded from memory into the cache, and the data is written into that cache block [1].
- LRU replacement policy: the least recently used block is selected for eviction from the cache.

3. METHODOLOGY

3.1 Language

The cache simulator is written in the C++ programming language. C++ is built on the C language, with less dependency on library functions for basic tasks such as input and output, and it adds object-oriented features [6].

3.2 Multilevel Cache Simulator

Cache memory is sometimes described in terms of levels of closeness and accessibility to the microprocessor [3]. An L1 (level one) cache is on the same chip as the microprocessor, while an L2 (level two) cache is usually a separate static RAM chip. To maintain the multilevel character of the cache, the program sets rules for how caches on different levels are placed and how they communicate with each other. The program can support any number of cache levels and any number of caches on the first level; the numbers of caches on the other levels are computed by the program according to the following rules:

- The program prompts the user for the number of cache levels and the number of caches on the first level, then computes the number of caches on every other level from this input.
- An equal number of caches is distributed to each corresponding parent cache in the preceding (lower) level [4].
- The number of L1 caches must be divisible by the number of L2 caches with no remainder.
- If the number of caches on the lowest level is odd, the next highest level must contain exactly one cache.

3.3 Program Validation

To verify the correctness of our cache simulator, its output results were compared against those of the DineroIV cache simulator. Dinero is a trace-driven uniprocessor CPU cache simulator for memory reference traces, written by Dr. Jan Edler and Prof. Mark D. Hill of the University of Wisconsin-Madison [4]. Since Dinero is a uni-core simulator [4], the simulations from our program that were compared with the Dinero results used only one cache on the first level, which leads to one cache on each of the higher levels as well.
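To make the set-associative lookup and LRU replacement described in Section 2.2 concrete, the following is a minimal C++ sketch of one cache level. It is our own illustrative reconstruction, not the authors' implementation; in particular, the class name CacheLevel, its interface, and its internal layout are assumptions.

```cpp
#include <cstdint>
#include <vector>

// One level of an N-way set-associative cache with LRU replacement.
// On a miss the block is brought in and the least recently used way is
// evicted, which also covers the write-allocate case described above.
class CacheLevel {
public:
    CacheLevel(std::uint64_t size_bytes, std::uint64_t block_bytes, unsigned ways)
        : block_bytes_(block_bytes),
          num_sets_(size_bytes / (block_bytes * ways)),
          sets_(num_sets_, std::vector<Line>(ways)) {}

    // Returns true on a hit, false on a miss (after allocating the block).
    bool access(std::uint64_t address) {
        const std::uint64_t block = address / block_bytes_;  // strip block offset
        const std::uint64_t set_index = block % num_sets_;   // pick the set
        const std::uint64_t tag = block / num_sets_;         // remaining bits form the tag
        std::vector<Line>& set = sets_[set_index];
        ++now_;

        for (Line& line : set) {
            if (line.valid && line.tag == tag) {             // hit: refresh LRU stamp
                line.last_used = now_;
                ++hits_;
                return true;
            }
        }
        // Miss: the victim is an invalid way if one exists, else the LRU way.
        Line* victim = &set[0];
        for (Line& line : set)
            if (!line.valid || line.last_used < victim->last_used)
                victim = &line;
        victim->valid = true;
        victim->tag = tag;
        victim->last_used = now_;
        ++misses_;
        return false;
    }

    double miss_rate() const {
        const std::uint64_t total = hits_ + misses_;
        return total ? static_cast<double>(misses_) / total : 0.0;
    }

private:
    struct Line {
        bool valid = false;
        std::uint64_t tag = 0;
        std::uint64_t last_used = 0;
    };
    std::uint64_t block_bytes_;
    std::uint64_t num_sets_;
    std::vector<std::vector<Line>> sets_;
    std::uint64_t now_ = 0, hits_ = 0, misses_ = 0;
};
```

A multilevel, multicore configuration can then be built as a collection of such objects, one per cache, wired together according to the placement rules of Section 3.2.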
3.4 Application Traces

Simulation results are determined by the input trace and the cache parameters. A trace is a finite sequence of memory references, usually obtained by the interpretive execution of a program or set of programs. The traces used in our simulation experiments are:

- a C compiler trace that contains both read and write addresses, and
- a trace called trace_4664_2_0.raw that contains only read addresses.

Some test simulations involving both read and write addresses differed slightly from the Dinero results, so we decided to use traces with only read addresses in our further experiments to ensure that the data obtained and the observations made are correct. All of the following simulation results are obtained using traces with only read addresses.

3.5 Cache Performance and System Performance

The execution time of a program is one of the most reliable performance measures [5]:

CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time

where IC is the instruction count, CPI_execution is the number of clock cycles per instruction, and the miss penalty is the extra delay caused by a cache miss. The formula shows that reducing the miss rate reduces CPU time, which improves overall system performance. Various experimental simulations have therefore been conducted to determine what needs to be done to reduce the cache miss rate. Using a larger block size and higher cache associativity has been shown to reduce the miss rate, since the number of cache collisions clearly decreases. Even though the goal of implementing the cache simulator is only to obtain cache miss-rate and hit-rate information, it is good practice to also consider the effect of the cache simulator design on overall system performance.
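Plugging illustrative numbers into the CPU-time formula shows how strongly the miss rate drives execution time. The values below are made-up assumptions chosen only to demonstrate the arithmetic; they are not measurements from the paper.

```cpp
#include <cstdio>

int main() {
    // Assumed workload and machine parameters (illustrative only).
    const double ic            = 1.0e9;   // instruction count
    const double cpi_execution = 1.5;     // base clock cycles per instruction
    const double mem_per_inst  = 0.4;     // memory accesses per instruction
    const double miss_penalty  = 100.0;   // extra cycles per cache miss
    const double clock_cycle   = 1.0e-9;  // seconds per cycle (1 GHz clock)

    const double miss_rates[] = {0.10, 0.05, 0.02};
    for (double miss_rate : miss_rates) {
        // CPU time = IC * (CPI_execution + accesses/inst * miss rate * miss penalty) * clock cycle
        const double cpu_time =
            ic * (cpi_execution + mem_per_inst * miss_rate * miss_penalty) * clock_cycle;
        std::printf("miss rate %.2f -> CPU time %.2f s\n", miss_rate, cpu_time);
    }
    return 0;
}
```

With these assumed numbers, cutting the miss rate from 10% to 2% drops the modelled execution time from 5.5 s to 2.3 s, which is exactly the kind of effect the experiments below aim for.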
Since CPU time and cache miss rate are interrelated, reducing the cache miss rate reduces CPU time, which is beneficial for Big Data applications because of their huge amount of data access and computation.

4. RESULTS

Running sample simulations in both the implemented cache simulator and the DineroIV cache simulator has been shown to produce similar output results; comparing the two confirms that all the cache features are implemented correctly in the program.

Figure 2: Sample simulation output, two-level cache simulator with one cache on each level

Figure 3: DineroIV sample simulation output, two-level cache simulator with one data cache on each level

Figure 2 depicts the output of a simulation run in our implemented cache simulator, configured with two levels and one cache on each level. Figure 3 depicts the output of the same simulation run in the DineroIV cache simulator. Dinero has both an instruction cache and a data cache; in this case the data cache is used since it is the cache that matches our implemented simulator. Both screenshots show that running the simulation on two different cache simulators produces the same result: in both cases the cache miss rates are similar.

To ensure that the comparison is accurate, cache parameters such as cache size, block size, and associativity must be the same in both programs, and implementation choices such as the replacement policy and the allocation policy must match as well. These criteria were maintained in our sample comparisons, and the same traces were used in both simulators. In these sample simulations there is only one cache on each level, solely because DineroIV is a uni-core cache simulator and can only have one cache per level; using only one cache per level in our program was an easy way to compare our results with the DineroIV results. Otherwise, our implemented simulator can have more than one cache on a level.
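The two-level configuration compared against DineroIV above can be expressed compactly by chaining lookups: the second level is consulted only when the first level misses, so each level accumulates its own hit and miss counts. The sketch below is again illustrative, reusing the assumed CacheLevel class from the earlier sketch, with made-up cache parameters and a hard-coded address list standing in for a real trace file.

```cpp
#include <cstdint>
#include <cstdio>

// Assumes the illustrative CacheLevel class sketched after Section 3.3 is visible here.
int main() {
    CacheLevel l1(32 * 1024, 64, 4);    // assumed: 32 KB, 64-byte blocks, 4-way
    CacheLevel l2(256 * 1024, 64, 8);   // assumed: 256 KB, 64-byte blocks, 8-way

    // Stand-in for a read-only address trace such as trace_4664_2_0.raw.
    const std::uint64_t trace[] = {0x1000, 0x1008, 0x2000, 0x1000, 0x9000, 0x2000};
    for (std::uint64_t addr : trace) {
        if (!l1.access(addr)) {   // only L1 misses are forwarded to L2
            l2.access(addr);
        }
    }
    std::printf("L1 miss rate: %.3f\n", l1.miss_rate());
    std::printf("L2 miss rate: %.3f\n", l2.miss_rate());
    return 0;
}
```

Per-level miss rates computed this way are the quantities compared between our simulator and DineroIV in Figures 2 and 3.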
5. DISCUSSION

After running simulations in the two cache simulator programs and confirming that the output results match, it was safe to proceed with further analysis on our implemented cache simulator. By varying cache parameters such as cache size, cache block size, and the degree of associativity, the simulations helped in studying and analyzing various ways to improve cache performance. Using a larger block size and higher associativity has been shown to improve cache performance because both reduce the cache miss rate, and since CPU performance is closely tied to the cache miss rate, a reduced miss rate greatly increases system performance.

Figure 4: Reducing Cache Misses via Larger Block Size (cache associativity is 4 in each case)

Figure 5: Corresponding Graph to the above tables

Figure 6: Reducing Cache Misses via Higher Associativity

6. CONCLUSION

This research project has shown that the performance of a cache simulator depends on a combination of the effectiveness of the algorithm used in the program and the cache design options chosen when implementing the simulator. The more efficiently the cache simulator works, the easier it will be to use it on Big Data applications, which will accelerate the process of understanding how new memory architectures for Big Data applications should be constructed. It is our hope that computer scientists continue exploring memory architectures capable of handling Big Data applications more efficiently.

7. FUTURE PLANS

Future work will first eliminate the malfunctions encountered in the cache simulations; for instance, write addresses had some issues in the cache simulator program, so an efficient write policy must be chosen and implemented correctly. More importantly, to accomplish the goal of the project, traces from Big Data applications will have to be used in the simulations, since so far the simulations have used traces of simple, general applications. Another aspect of interest is verifying and improving multicore cache programmability in the simulator, as well as considering better cache design options, such as knowing which policies combine well in the simulator and which do not. The first step in improving multicore cache programmability will be to find another existing cache simulator that supports more than one cache per level, such as an advanced or improved version of DineroIV; such a simulator would help validate the output of our simulator when more than one cache is used on a level.

8. ACKNOWLEDGMENTS

This research was funded by the Computing Research Association's Distributed Mentoring Program, under the mentorship of Dr. Hyesoon Kim at the Georgia Institute of Technology, College of Computing.
References

[1] A. S. Tanenbaum, Structured Computer Organization, Prentice Hall, 1993.
[2] P. J. Denning, Communication Networks and Computer Systems, Imperial College Press, 2006.
[3] M. A. Ismail, T. Altaf and S. H. Mirza, "A new parallel multilevel cache simulator for multi-core processors," in Electronics, Communications and Photonics Conference (SIECPC), Riyadh.
[4] J. Edler and M. D. Hill, "Dinero IV Trace-Driven Uniprocessor Cache Simulator," [Online].
[5] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, San Francisco: Morgan Kaufmann, 2013.
[6] B. Stroustrup, The C++ Programming Language, Ann Arbor: Addison-Wesley Professional, 2013.