Scheduling the Intel Core i7


Third Year Project Report
University of Manchester
SCHOOL OF COMPUTER SCIENCE

Scheduling the Intel Core i7

Ibrahim Alsuheabani
Degree Programme: BSc Software Engineering
Supervisor: Prof. Alasdair Rawsthorne
May 2010

Abstract

As more cores are added to the CPU chip, the process of scheduling the CPU becomes harder. The Intel Core i7 920 model has four cores and supports hyper-threading, providing eight execution threads in total. Although it shows higher performance compared to CPUs with fewer cores, it is not employed to its full potential. In order to create scheduling algorithms for the Core i7 that can fully utilize it, the Core i7 was tested and the test results were analysed. This paper presents the test results and analysis as well as a scheduling algorithm that was developed.

Project Name: Scheduling the Core i7
Name: Ibrahim Alsuheabani
Date: May 2010
Project Supervisor: Prof. Alasdair Rawsthorne

Acknowledgements

I would like to thank my supervisor, Alasdair Rawsthorne, for his helpful feedback and suggestions on my work and for providing access to the Core i7 desktop. Throughout many phases in the course of this project, his advice was always available. I would also like to thank my family and friends for their support and encouragement.

Table of Contents

1 Introduction
   1.1 Background
   1.2 Project Proposal
   1.3 Aims and Objectives
   1.4 Report Structure
2 Literature Survey
   2.1 CPU Performance
   2.2 Multi-core Processors
   2.3 Multi-core Issues
   2.4 Nehalem Architecture
      2.4.1 Cores and Intel Turbo Boost Technology
      2.4.2 Threads and Intel Hyper-Threading Technology
      2.4.3 Caches and Intel Smart Cache Technology
   2.5 The Linux Scheduler
   2.6 Performance Test Tools
      2.6.1 SPEC CPU2006 Benchmark Suite
      2.6.2 Performance Counters
3 Analysis and Performance Tests
   3.1 SPEC CPU2006 Benchmarks
      3.1.1 Assigning Benchmarks to Cores in SPEC CPU2006
      3.1.2 Benchmarks Runtimes
      3.1.3 Running Multiple Benchmarks Simultaneously
   3.2 Disabling and Enabling Cores
   3.3 Changing the Cores Clock Frequency
   3.4 Accessing the Performance Counters
4 Results
   4.1 Multiple Copies of the C Benchmarks
      4.1.1 Two Copies
      4.1.2 Three and Four Copies
      4.1.3 Five Copies
         Five Copies of 403.gcc Benchmark
         Eight Copies of 403.gcc Benchmark
   The C++ Benchmarks
   The Effects of Disabling Inactive Cores
   Executing Two Processes in One Core
      Two Benchmark Copies Executed in the Same Core
      Cache Access Counters
5 The Scheduler
   The Objective of the Scheduler
   The Scheduler's Structure
   Scheduler Results
   Paging and Physical Address Extension (PAE)
6 Conclusions
   Summary
   Further Performance Tests
   Improving the Scheduler
References
Appendix A
   Descriptions and Runtimes of the Integer Benchmarks

List of Figures

Figure 1: The structure of the Core i7, and the order in which CPU2006 assigns processes to threads
Figure 2: Runtime results for two copies of the C benchmarks
Figure 3: Runtime results for each copy of the two copies test for 456.hmmer
Figure 4: Runtime results for the one, two, three and four copies tests of the C benchmarks
Figure 5: Runtime results for five copies of 403.gcc executed simultaneously
Figure 6: Runtime results for eight copies of 403.gcc executed simultaneously
Figure 7: Runtime results for four copies of 429.mcf before and after the kernel update
Figure 8: Runtime results for eight copies of 429.mcf executed using the scheduler

List of Tables

Table 1: The benchmarks used in the project and the runtime of executing one copy of each
Table 2: Comparing the runtime(s) for two copies of the C benchmarks
Table 3: Comparing the runtime(s) for one, four and five copies of the C benchmarks; the five copies results are divided into the three copies executed in separate cores and the two copies executed in the same core
Table 4: Runtime results for multiple copies of the C++ benchmarks
Table 5: The test results for disabling inactive cores while executing the 403.gcc benchmark
Table 6: Comparison of runtime(s) results between two tests each executing eight copies of 429.mcf
Table 7: CINT2006 benchmarks [9]
Table 8: Comparing the runtime(s) of the benchmarks, the multiple copies being executed simultaneously

Glossary of Practical Terms

Perfmon2: Interface that provides access to the hardware performance counters. In Intel processors this interface accesses the performance monitoring unit (PMU) that monitors CPU events.

pfmon: Command-line hardware performance monitoring tool that uses the perfmon2 interface to access the hardware performance counters.

SPEC CPU2006: A suite of CPU-intensive benchmarks for testing the performance of the CPU.

runspec: A command-line tool in the SPEC CPU2006 suite. It builds the benchmarks, runs them, and prints out their results.

pgrep: Linux command-line utility that searches for the named process and returns the process ID (pid) of every process with that name.

wc: Linux command-line utility that prints the number of lines, words or characters in a file.

kill: Linux command-line utility that sends a specified signal to a specified process.

taskset: Linux command-line utility to set or retrieve the CPU affinity of a running process. CPU affinity is a scheduler property that links a process to a set of CPUs.

struct: A structured type in the C programming language that combines a set of named members of different types into a single object.
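Several of these utilities are typically chained together during the tests. The sketch below is a dry run: the run() wrapper only echoes each command so the sequence is safe to execute, and the process name, CPU list and PID are illustrative, not taken from the report.

```shell
# Dry-run sketch of chaining the glossary utilities: find the PIDs of a named
# benchmark process, count the running copies, pin one to a set of CPUs, and
# stop it.  run() only echoes; remove it to run the commands for real.
run() { echo "$@"; }

run pgrep runspec            # would list one PID per line
run pgrep runspec '|' wc -l  # would count the running copies
run taskset -cp 1-3 1234     # 1234 stands in for a real PID
run kill -TERM 1234
```

Dropping the run() prefix turns each line into the real command, which then needs the target process to exist.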

1 Introduction

1.1 Background

Multi-core CPUs introduced a new dimension to processing applications. A multi-core CPU can simultaneously execute a number of processes equal to its number of cores. This provides an increase in performance for multithreaded applications. These cores are not considered independent processors. Since they are implemented in one chip, they share some of the hardware resources. As a result, the cores may interfere with each other when executing processes. The Intel Core i7 is the new generation of multi-core systems. This CPU has a feature called Hyper-Threading, which allows each core to execute two threads simultaneously. With Hyper-Threading, the CPU has double the number of execution threads as available cores. It also implements an inclusive third-level cache (L3) that is shared across all cores. To create a scheduler for the CPU, it should be tested first. Testing the CPU gives a clear idea of how its performance varies with different tests. Then, after studying the test results, scheduling algorithms can be created to maximize the performance of the Core i7. CPUs are tested using benchmarks that simulate real user applications. The Standard Performance Evaluation Corporation (SPEC) created CPU2006, which is an industry-standard benchmark suite. This suite provides benchmarks that stress the system's CPU, producing comparative performance measurements. Major computer corporations test their hardware using the CPU2006 benchmark suite and submit the results to SPEC. Some of these results are available on the SPEC website. The Completely Fair Scheduler is a scheduler introduced into the Linux kernel [1]. This scheduler provides fairness between tasks running in a single core. However, the issue is that it is not multi-core aware. A new scheduler domain indicating multi-core features has been added to the domain hierarchy of the Linux process scheduler [2]. This scheduler domain can identify cores that are sharing resources. For example, in the Intel Core i7 with Hyper-Threading enabled, a core is considered two logical cores sharing resources. With this information, the scheduler by default assigns tasks first to cores that are not sharing resources, thus maximising resource utilization and minimising contention. The issue is that some applications do better when their tasks are executed on cores that share resources. Another aspect to consider during the scheduling process in the Core i7 is that all cores share the L3 cache.

1.2 Project Proposal

The proposal was to test the performance of the Core i7 using the SPEC CPU2006 benchmark suite as the performance test tool and Linux as the operating system, and then to design a scheduler to enhance the performance based on the test results. Due to the new technologies introduced in this processor, old scheduling algorithms face some issues when used with it. The Core i7 model used in this project is quad-core. It also provides hyper-threading, thus eight logical processors in total. The processor is tested by executing multiple benchmarks in different threads in the CPU. Changing the number of benchmarks executed, and also changing the threads used for execution, generates a database of test results. These results are then studied to see how the CPU performs in different circumstances. The design of the scheduler is based on the information gathered from the test results.

1.3 Aims and Objectives

The first objective of the project is testing the performance of the Core i7. As mentioned above, the test is done using SPEC CPU2006. For this project, three performance test methodologies were suggested. The first was running multiple copies of a benchmark simultaneously, increasing this number by one each time, and then comparing the results of the tests to see how the Core i7 performed. During this test method some tests also disable unneeded cores to examine the difference between inactive and disabled cores. The second methodology was testing how two copies of a benchmark would perform if executed in one core by using hyper-threading. The last test methodology was to perform random tests to understand how the Core i7 employs its L3 cache. The second objective was to design scheduling algorithms to maximize the throughput of the Core i7.

1.4 Report Structure

The idea of multi-core processing and the description of the Core i7's structure and technologies will be presented in chapter 2.
Also, the description of the performance test tools used in the project will be given there. Chapter 3 discusses the analysis and the performance tests that were done. The test results are presented in chapter 4. In chapter 5, a scheduler designed to solve an issue the Core i7 faced is explained, while chapter 6 concludes the report.

2 Literature Survey

At the beginning of this chapter, the idea of CPU performance will be described. After that, multi-core processors and some of their issues will be presented. Then, the structure of the Intel Core i7 processor will be shown, and the technologies that it provides will be discussed. Then, the scheduler used in the Linux system will be explained. Finally, the performance test tools will be presented.

2.1 CPU Performance

The performance of the CPU is often associated with its clock rate (MHz). The clock rate is the number of cycles per second of the CPU clock. However, having a higher clock rate does not always mean higher performance, because there are other factors that affect CPU performance. Two of these factors are the cache size and the size of the random access memory (RAM) available to the CPU. Bus speeds and the type and order of the instructions of the executed applications also have an impact on CPU performance. Measuring CPU performance is done using benchmarks. They are test applications that use the hardware resources of the CPU and measure the time it takes the CPU to execute them. By scaling these runtimes against reference runtimes, the performance of the CPU can be measured.

2.2 Multi-core Processors

A multi-core processor uses more than one independent processor (core) in one chip. Multi-core processors have the ability to perform multiprocessing, which is the execution of multiple processes concurrently in the system. Each core in the multi-core processor operates independently of the other cores. Cores in a multi-core processor are coupled, and this coupling varies from system to system. In some systems, the cores may share a random access memory (RAM). They may also share caches. In some systems the cores communicate using message passing. For a single piece of software, the performance gain a multi-core processor provides depends on the software's implementation. The software should have operations or tasks that can be executed simultaneously to benefit from the multi-core system. These parts of the software are executed simultaneously in separate cores in the multi-core system, providing a gain in performance. That is what is called parallel processing of the software. Parallelization of software has proven to be hard to implement, since the cores executing the parallel parts of the software may interfere with each other.

11 For executing multiple applications, the multi-core processor can start the execution on its cores simultaneously running more than one application at the same time. 2.3 Multi-core Issues The first issue multi-core processors face is the Operating System (OS) support. Some Operating Systems consider each virtual processor or core as a separate CPU handling the system as a multiprocessor platform. As a result, the OS does not consider that these cores share hardware resources inside the chip. For that reason, the OS may not be successful in fully utilizing the multi-core processor to its potentials. Also, as there is more than one core they may interfere when running tasks simultaneously causing delays and slowing down the performance. Thus, the challenge is to understand the multi-core architecture and create a scheduler to avoid as much interference as possible between the cores while maintaining a better performance. 2.4 Nehalem Architecture Intel Core i7 is based on the Nehalem microarchitecture. Nehalem is the codename Intel gave to their new multi-core microarchitecture. Intel introduced new technologies in Nehalem. A third level cache was also added in this microarchitecture. Making it a three level cache hierarchy processor. Core i7 has many modules with different specifications. The model used in this project is Core i This processor has four cores each core is multi-threaded providing in total eight threads [3] Cores and Intel Turbo Boost Technology Core i7 920 is quad-cored with a maximum frequency of MHz and a minimum of MHz for the cores. Intel Turbo Boost Technology allows active cores to run faster than the base operating frequency. The turbo boost technology is activated when the Operating System (OS) requests a higher processor performance. The maximum frequency of the turbo boost depends on the number of active cores. 
The amount of time a core spends in a turbo boost state depends on the estimated current, temperature and power consumption of the processor. If the processor is operating within limits of these factors and additional performance is needed, the processor frequency will constantly increase by MHz on short periods until it reaches the limit determined by the number of active cores [4]. On the other hand, the processor reduces frequency by MHz when temperature, current or power exceed factory limits. 11

2.4.2 Threads and Intel Hyper-Threading Technology

Hyper-threading enables simultaneous multi-threading for each core in the processor. The Core i7 920 has eight threads with this technology enabled. If the operating system supports hyper-threading and has it enabled, it will treat each thread as a separate logical processor. Two logical processors on one physical core share the same execution unit. Hyper-threading provides higher performance when used with multi-threaded applications. Since threads can execute processes simultaneously, it also offers the advantage of reducing latency and making full use of the clock cycle. For instance, if one thread in a core is inactive, doing I/O or waiting for a result, the other thread will execute, making full use of the clock cycle [3].

2.4.3 Caches and Intel Smart Cache Technology

Nehalem has four first-level (L1) caches, four second-level (L2) caches and one third-level (L3) or last-level cache. Each core has its own L1 and L2 caches, and the two threads in the same core share them. The L3 cache is shared between all the cores. Intel Smart Cache is provided in L3. The size of the L3 share allocated to each core can be dynamically altered using smart caching. Therefore, if a core has minimal cache requirements, another core can dynamically increase its share of the cache, reducing cache misses [5]. Nehalem also enhances Intel Smart Cache by allowing the L3 cache to increase performance and reduce traffic to the processor cores. Some processor architectures use L3 to save only data not stored in the other caches. As a result, if a data request misses in L3, all the other caches must still be searched in case they contain the requested data. This method increases the latency between the cores. However, in the Nehalem microarchitecture, a miss in its shared L3 cache guarantees the data needed is not in the processor, eliminating unnecessary searches and reducing latency [6].
The following are the cache sizes for the Core i7: the L1 cache is 32KB for instructions and 32KB for data; the L2 cache is 256KB for both data and instructions; L3 is a fully shared 8MB cache [3].

2.5 The Linux Scheduler

Since the operating system in this project was Linux, it is important to understand the strategy by which its scheduler assigns processes to the CPU. In the current version of Linux, the scheduler used is the Completely Fair Scheduler (CFS). The main concept in CFS is to preserve fairness in assigning CPU time to processes. The CFS maintains the amount of CPU time given to a task in what is called the virtual runtime [1]. Tasks are arranged in the CFS using a time-ordered red-black tree rather than a queue. The red-black tree is a balanced tree providing efficient and fast operations such as deleting and inserting tasks. Tasks with the lowest virtual runtime have the highest need for the CPU. These tasks are stored on the left side of the red-black tree, and the tasks with the lowest need for the CPU are stored on the right side. The CFS chooses the left-most task in the red-black tree to be executed next. This provides fairness, since the left-most task has the highest need for CPU time. Nodes in the red-black tree shift from the right to the left by one, providing balance to the tree and fairness between tasks. If a task has used its available CPU time and is still not finished, its execution time is added to its virtual runtime and the task is inserted again into the red-black tree.

2.6 Performance Test Tools

To measure the performance of the Intel Core i7, benchmarks have to be used. Benchmarks simulate real user applications. They are developed to pressure the computer hardware so that real performance is measured. For this project, the SPEC CPU2006 benchmark suite was used. Also, a useful measurement method is using performance counters or event counters. These count processor events like CPU cycles and cache accesses. For accessing and printing the event counters, perfmon2 was used [7].

2.6.1 SPEC CPU2006 Benchmark Suite

The Standard Performance Evaluation Corporation (SPEC) maintains a standardized set of relevant benchmarks applied to the newest generation of high-performance computers [8]. One of its benchmark suites is CPU2006, which provides performance measurements for the CPU. CPU2006 contains two benchmark suites: CINT2006 for measuring integer performance, and CFP2006 for measuring floating point performance. In this project, the CINT2006 benchmark suite was used as the measurement tool. CINT2006 has twelve benchmarks, all written in C or C++ [9].
To run a test in this benchmark suite, one basically chooses the benchmark and the number of copies of it to be executed simultaneously. The result of the execution is printed, showing the runtime of each copy on the system. SPEC uses a reference machine to normalize the performance measures. Each benchmark was run on this reference machine to give a reference runtime for that benchmark. The runtime of the benchmark on the user's system is then compared to the reference runtime already provided. This comparison produces a ratio that can be thought of as a score for the CPU: the higher the ratio, the better the performance. For example, for the benchmark 403.gcc the runtime on the tested system was 388s and the reference runtime is 8031s, so the ratio was about 20.7.

2.6.2 Performance Counters

Intel supplies in their processors a performance monitoring unit (PMU). This unit provides the event counters. These events include cache accesses, CPU cycles and cache misses. This is a useful tool if there is any delay or interference between processes: by using the counters, it is possible to find where the problem is. The Linux command pfmon was used to print these counters by accessing them through the perfmon2 interface.
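The score computation just described is a single division; as a quick check of the 403.gcc example, using the two runtimes given in the text:

```shell
# SPEC-style ratio: reference runtime divided by measured runtime
# (403.gcc: 8031 s reference, 388 s measured on the test machine).
ref=8031
measured=388
awk -v r="$ref" -v t="$measured" 'BEGIN { printf "%.1f\n", r / t }'
```

This prints 20.7, the score discussed above.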

3 Analysis and Performance Tests

This chapter will discuss how the CPU performance tests were done using the benchmarks. Also, the results for some of the tests will be presented. In addition, results for running different numbers of copies of the benchmarks in different core orders will be shown.

3.1 SPEC CPU2006 Benchmarks

CPU2006 was used in this project as a command-line program for Linux. The command for it is runspec. The test specifications, like which benchmark to execute and how many copies of it, are written inside a configuration file such as tmp1.cfg. The maximum number of processes CPU2006 can run is the available number of logical processors; any extra processes will be terminated. The command to run this test is:

-> runspec -c tmp1.cfg

The -c (or --config) option specifies the configuration file, which is written after it. After running the command, CPU2006 gets the chosen benchmark and runs the number of copies specified by the configuration file. Some benchmarks have more than one workload. For example, 403.gcc has 9 workloads.

Core 1: CPU0, CPU7
Core 2: CPU1, CPU6
Core 3: CPU2, CPU5
Core 4: CPU3, CPU4
(processes are assigned in the order CPU0, CPU1, ..., CPU7)

Figure 1: The structure of the Core i7, and the order in which CPU2006 assigns processes to threads

3.1.1 Assigning Benchmarks to Cores in SPEC CPU2006

From experience with Linux on the Core i7, if hyper-threading is enabled, Linux uses the first thread (CPU0) for system operations. If multiple benchmarks are executed simultaneously, it assigns the processes to threads in the order shown in Figure 1 above. This method avoids hyper-threading if four or fewer processes are executed. If the number is more than four, it starts using hyper-threading, with the fifth
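The thread-to-core pairing in Figure 1 can be written down as a small lookup. The pairing below is a reading of the figure (each physical core hosts two logical CPUs), not something taken from Intel documentation:

```shell
# core_of maps a logical CPU number to its physical core, following Figure 1:
# core 1 = {CPU0, CPU7}, core 2 = {CPU1, CPU6},
# core 3 = {CPU2, CPU5}, core 4 = {CPU3, CPU4}.
core_of() {
  case "$1" in
    0|7) echo 1 ;;
    1|6) echo 2 ;;
    2|5) echo 3 ;;
    3|4) echo 4 ;;
    *)   echo "unknown CPU $1" >&2; return 1 ;;
  esac
}
core_of 3   # fourth benchmark process: core 4
core_of 4   # fifth process: also core 4, the first to share via hyper-threading
```

With CPU2006 filling CPU0 to CPU7 in order, the first four processes land on distinct cores and the fifth is the first to share a core.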

process being executed in the second thread of the last core (CPU4). If there is a sixth process, by the given order, it will be executed using CPU5. Using two threads in the same core to run processes produces some latency because the resources in the core are shared.

3.1.2 Benchmarks Runtimes

At first, the time it takes to run one copy of each benchmark should be recorded and used as a base result. This base result of the benchmark is compared to the runtime of a number of copies of the benchmark run concurrently. Using this comparison, it is possible to see the effect of multiprocessing on the benchmark. It also shows if there is any interference or delay between the concurrently executed copies. The base results used in this project are presented in Table 1.

Table 1: The benchmarks used in the project and the runtime of executing one copy of each

BENCHMARK      RUNTIME(s)   Ratio
401.bzip2
403.gcc
429.mcf
445.gobmk
456.hmmer
458.sjeng
464.h264ref
471.omnetpp
473.astar

3.1.3 Running Multiple Benchmarks Simultaneously

Testing the performance of the quad-core processor in this project was done by executing copies of a benchmark in more than one core concurrently. That showed whether these copies interfere. Also tested was the difference between executing two copies of a benchmark in two threads in separate cores, and executing them in two threads in the same core. As the number of copies executed concurrently increases, their runtime will increase as a result of sharing the CPU resources. In this project, the relation between the runtime and the number of copies executed has been studied. The results showed a dramatic increase in the runtime when executing five copies or more, because one of the cores will execute two copies using both of its threads. CPU2006 can only execute a maximum of eight copies of a benchmark simultaneously, because that is the number of available logical processors.
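The baseline comparison described above follows a simple model, which the sketch below states explicitly. The 398 s base runtime for 403.gcc is inferred from the 796 s two-copy sequential figure quoted in chapter 4, so treat it as illustrative:

```shell
# Sequentially, the n-th copy of a benchmark finishes at n * base seconds;
# interference-free simultaneous copies would all finish at the base runtime.
base=398                                  # inferred 403.gcc single-copy time
seq_finish() { echo $(( $1 * base )); }   # $1 = copy index (1, 2, ...)
seq_finish 1   # first sequential copy
seq_finish 2   # second sequential copy
```

Any gap between a simultaneous copy's runtime and the base runtime is then the interference the tests set out to measure.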

When a large number of copies consuming a considerable amount of time needed to be executed, the Linux at command was used. This command simply takes another command and a given time as arguments, then runs that command at the given time. Because of the inconvenience large performance tests could cause, they were assigned to run at night. The at command syntax is:

-> at -f [filename] -t [time]

The file contains the commands that need to run at the given time.

3.2 Disabling and Enabling Cores

Disabling some threads in the CPU is another method of testing its performance. In the Linux OS, for each thread there is a file that shows whether it is enabled or disabled. The file location is:

/sys/devices/system/cpu/cpuX/online (X: the thread number)

The first thread is always enabled and cannot be disabled, since it is used for system operations. If 1 is written in a thread's online file, the thread is enabled, and if 0 is written, it is disabled. To enable or disable a thread, the online file is edited as follows:

-> echo [0 or 1] >> /sys/devices/system/cpu/cpuX/online (X: the thread number)

3.3 Changing the Cores Clock Frequency

Changing the CPU frequency is useful when testing the processor. Although Turbo Boost Technology can alter the frequency and increase it when needed, the initial frequency that a core starts with can be changed. Each thread has its own frequency. It can be manually changed using the command:

-> cpufreq-selector -c [thread number] -f [frequency]

Also, regarding CPU frequency there is the frequency governor. As the name implies, it governs the CPU frequency. There are five governors, each with its benefits and drawbacks. For example, a frequency governor called performance sets the CPU frequency to the highest. This governor is suitable for periods of intense workloads, offering very high speed; a disadvantage, however, is that it has no power-saving benefit. The other governors are ondemand, userspace, conservative and powersave. The command for choosing the governor is:

-> cpufreq-selector -c [thread number] -g [governor]
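A minimal sketch wrapping the two operations above. The sysfs root is parameterized so the function can be exercised against a scratch directory; on a real machine it defaults to /sys/devices/system/cpu and requires root, and the cpufreq-selector lines are only echoed here rather than executed:

```shell
# Toggle a logical CPU via its sysfs "online" file (section 3.2).
SYSFS_ROOT=${SYSFS_ROOT:-/sys/devices/system/cpu}
set_cpu_online() {  # $1 = thread number (not 0), $2 = 1 to enable, 0 to disable
  echo "$2" > "$SYSFS_ROOT/cpu$1/online"
}

# Dry-run builders for the frequency commands (section 3.3): they print the
# command line instead of running it, since cpufreq-selector needs root.
freq_cmd()     { echo "cpufreq-selector -c $1 -f $2"; }
governor_cmd() { echo "cpufreq-selector -c $1 -g $2"; }

freq_cmd 2 1600000
governor_cmd 2 performance
```

Pointing SYSFS_ROOT at a temporary directory makes set_cpu_online safe to try before using it on the real sysfs tree.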

3.4 Accessing the Performance Counters

In Intel processors, events can be monitored using programmable counters. These counters report how many times an event occurred during a specific process or in the system as a whole. In the project, some event counters were used for the executed benchmarks. The syntax for pfmon, which uses the perfmon2 interface to print event counters for a given command, is:

-> pfmon -e[event1,event2,...] command

This runs the command given in the argument and starts monitoring the processor events. After the command finishes running, the counters for the events are printed out. The command tested during the project was the runspec command that runs the benchmarks. Accessing these counters provides very useful information for the performance analysis.
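As a sketch, the invocation can be assembled from an event list and the command under test. This is a dry run: the string is only printed, since pfmon and the perfmon2 kernel interface are present only on suitably patched systems, and EVENT1/EVENT2 are placeholders, not verified PMU event names:

```shell
# Build the pfmon command line from section 3.4 without executing it.
pfmon_cmd() {  # $1 = comma-separated event list; remaining args = the command
  events=$1
  shift
  echo "pfmon -e$events $*"
}
pfmon_cmd EVENT1,EVENT2 runspec -c tmp1.cfg
```

On the project machine, the printed line would be run as-is to monitor a benchmark execution.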

4 Results

In this chapter, all the performance analysis results will be presented. First, the runtime results for all the benchmarks when two, four or five copies are executed simultaneously will be shown. Then, the results for the 403.gcc benchmark when using the hyper-threading technology are analysed. Next, test results for the C++ benchmarks will be presented. Finally, results for executing two copies in the same core, using performance counters, will be shown.

4.1 Multiple Copies of the C Benchmarks

4.1.1 Two Copies

In this section, the results of the tests will be presented and studied, starting with the two copies test. Figure 2 presents the C benchmarks' runtime results for executing two copies of each, comparing simultaneous and sequential execution of the benchmarks. When running two copies simultaneously, the runtime presented as the result is the runtime of the slower copy. In simultaneous execution of two copies, there is only a few seconds' difference between their runtimes. On the other hand, for sequential execution the first copy will have the same runtime as executing one copy alone. The second copy will have double that runtime, since it has to wait for the first copy to finish before it can be executed. The runtime presented for sequential execution in Figure 2 is the runtime of the second copy. All the benchmarks benefit from running two copies simultaneously. That is expected, because the copies were executed simultaneously in two different cores, each core with its own execution unit.

Figure 2: Runtime results for two copies of the C benchmarks

Table 2: Comparing the runtime(s) for two copies of the C benchmarks

BENCHMARK      One copy   Two copies simultaneously   Two copies sequentially
429.mcf
401.bzip2
403.gcc
445.gobmk
456.hmmer
458.sjeng
464.h264ref

As shown in Table 2, executing two copies of benchmark 403.gcc simultaneously took 655s, while executing them sequentially would take 796s. This shows that 403.gcc benefited from running two copies simultaneously, since it decreased the runtime by 17% compared to running them sequentially. There is some interference between the simultaneously executed copies, since the runtime of each copy is 65% more than the runtime of one copy executed alone. The benchmark 456.hmmer had a runtime of 1071s for executing one copy. Executing two copies of 456.hmmer simultaneously took 1876s. This is a 75% increase in runtime for each copy in the simultaneous execution. This high increase in runtime shows that there is a major interference between the two copies. Although the runtime of simultaneous execution here is less than the runtime of the last copy in sequential execution, it shouldn't be considered the best performance in general. As can be seen in Figure 3, the runtime of the first copy in sequential execution is 57% less than the runtime of simultaneous execution. Also, if power consumption is considered, sequential execution is a better solution in this case. While sequential execution consumes 2142s of CPU time in only one core, simultaneous execution consumes 1876s in two cores. Therefore, sequential execution of two copies of 456.hmmer consumes less power.

Figure 3: Runtime results for each copy of the two copies test for 456.hmmer
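The percentages quoted above can be rechecked directly from the runtimes given in the text (the report's 17% appears to be the truncated value of 17.7%):

```shell
# Per-copy slowdown for 456.hmmer: two simultaneous copies (1876 s each)
# versus one copy alone (1071 s).
awk 'BEGIN { printf "%.1f\n", 100 * (1876 - 1071) / 1071 }'

# Runtime saved by simultaneous execution of 403.gcc: 655 s versus the
# 796 s sequential total.
awk 'BEGIN { printf "%.1f\n", 100 * (796 - 655) / 796 }'
```

The first line prints 75.2 and the second 17.7, matching the 75% and 17% figures in the discussion.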

4.1.2 Three and Four Copies

When running three or four copies simultaneously, each copy is executed in a separate core. The runtime results for executing three and four copies of a benchmark simultaneously are shown in Figure 4. All the copies executed simultaneously for each benchmark have the same runtime, since they run in separate cores. From Figure 4, it is clear that none of the benchmarks had a three-copy simultaneous runtime equal to the runtime of executing one copy. In an ideal situation where copies of a benchmark do not interfere, the runtime of one copy of the benchmark executed alone should equal the runtime of four copies executed simultaneously. This means that at some point during the execution the benchmark copies interfere, causing the latency in the runtime. In this part of the performance test, sequential execution was not considered, since it would present a very high runtime that is not comparable with the simultaneous execution. The test results showed that benchmark 429.mcf had a 5% increase in the runtime of three simultaneous copies compared to the runtime of two copies. 401.bzip2 and 403.gcc also had a minor increase of 2%. On the other hand, the benchmarks 445.gobmk, 456.hmmer, 458.sjeng and 464.h264ref had the same runtime for three and two copies executed simultaneously.

Figure 4: Runtime results for the one, two, three and four copies tests of the C benchmarks

In the simultaneous execution of four copies, the results were almost the same as for the three copies tests. The runtime of four simultaneous copies of 429.mcf and 403.gcc was 4% more than the runtime of three copies, while the increase was only 2% for 401.bzip2. Again, the benchmarks 445.gobmk, 456.hmmer, 458.sjeng and 464.h264ref showed no difference between the runtimes of three and four simultaneous copies.

The tests above showed that executing two copies of a benchmark on two cores of the CPU gives lower per-copy performance than executing one copy alone, whereas executing three or four copies performs the same as executing two. From these results it was concluded that the performance of the CPU varies mainly between single-core and multi-core use, rather than with the number of cores in use, provided the copies in the multiple-copies tests have enough physical memory.

4.1.3 Five Copies

When executing five copies simultaneously on the quad-core processor, one of the cores will use hyper-threading. The two processes executed in this core share the core's execution resources, which slows down the performance of these two processes.

Table 3: Comparing the runtime(s) for one, four and five copies of the C benchmarks; the five copies results are divided into the three copies executed in separate cores and the two copies executed in the same core

                                              Five copies simultaneously
BENCHMARK     One copy   Four copies      Three copies in    Two copies in
                         simultaneously   separate cores     the same core
429.mcf
401.bzip2
403.gcc
445.gobmk
456.hmmer
458.sjeng
464.h264ref

As can be seen in Table 3 above, when executing five copies simultaneously, the three copies executed in separate cores had the same runtime as in the four copies test. The other two copies, executed in one core using hyper-threading, had an increase in runtime that varied between the benchmarks.
429.mcf and 464.h264ref had the highest increase in the runtime of the two copies executed using hyper-threading: 83% compared to the runtime of the copies executed in separate cores. The interference between the two copies here was very high. In this case, executing them sequentially would be considered if

power consumption has a higher priority than performance. For the rest of the benchmarks, the runtime of the two copies executed using hyper-threading was around 50% higher than that of the other three copies. Therefore, none of the benchmarks used for this test benefited from having two copies executed in one core sharing its resources.

4.1.4 Five Copies of 403.gcc Benchmark

Figure 5 below shows the runtime results for five copies of 403.gcc executed simultaneously. The first three copies were executed in separate cores; their runtimes were the same as those for four copies of 403.gcc executed simultaneously. The last two copies were executed in the same core. Since the core is multithreaded, each copy was executed in a separate thread. These two copies showed a significant increase in runtime: 56% compared to the copies executed in separate cores. This increase is the result of sharing the core's L1 and L2 caches and its execution units. The same difference in runtime also appears when executing six and seven copies. When executing eight copies, all of the cores use hyper-threading, so all the runtimes increase.

Figure 5: Runtime results for five copies of 403.gcc executed simultaneously

Figure 6: Runtime results for eight copies of 403.gcc executed simultaneously

4.1.5 Eight Copies of 403.gcc Benchmark

To test hyper-threading in all the cores, eight copies of 403.gcc were executed simultaneously. Figure 6 above shows that the runtimes of all the copies are in the same range. Since all cores are executing two copies each using hyper-threading, the runtime increased. The eight copies' runtimes also increased by around 7% compared to the two copies run in the same core in the five copies test shown in Figure 5. The reason for the increase is the execution of more copies, which implies more accesses to the L3 cache.
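The "two copies in the same core" configuration above can be set up by hand by pinning both processes to the pair of logical CPUs that share one physical core. Which logical CPUs are hyper-threading siblings is machine-specific, so the sketch below reads it from Linux's sysfs topology files rather than hard-coding a layout; the helper names are assumptions, and `taskset` is the util-linux tool.

```shell
# Sketch (assumes Linux sysfs and util-linux taskset): run two commands
# confined to the two hyper-threads of one physical core.
siblings_of() {
    # e.g. prints "0,4" on a typical Core i7 920 layout for cpu0
    cat "/sys/devices/system/cpu/cpu$1/topology/thread_siblings_list"
}

run_pair_same_core() {
    pair=$(siblings_of 0)        # both logical CPUs of physical core 0
    taskset -c "$pair" "$@" &    # first copy, confined to that core
    taskset -c "$pair" "$@" &    # second copy shares the same core
    wait
}
```

Comparing the elapsed time of `run_pair_same_core` against two unpinned copies reproduces the same-core slowdown measured here without needing five running copies.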

4.2 The C++ Benchmarks

There are two C++ benchmarks in the SPEC CPU2006 suite. Table 4 presents the test results for these benchmarks. In the two copies test, the increase in runtime compared to a one-copy runtime was 14% for 471.omnetpp and 8% for 473.astar respectively; for the same test on the C benchmarks, the average increase was 70%. The C and C++ benchmarks also showed different results in the three copies test. While the C benchmarks showed the same runtimes for the three and two copies tests, the C++ benchmarks had a 41% (471.omnetpp) and 63% (473.astar) increase in runtime when executing three copies compared to two. As with the C benchmarks, the C++ benchmarks had the same results for both the three and four copies tests. In the five copies test, the C++ benchmark copies executed in separate cores had the same runtime as in the four copies test, while the two copies executed in the same core had a 42% increase in runtime compared to the copies executed in separate cores.

Table 4: Runtime results for multiple copies of the C++ benchmarks

                                                     Five copies
Benchmark     1 copy  2 copies  3 copies  4 copies   Separate cores  Same core
                                                     (3 copies)      (2 copies)
471.omnetpp
473.astar

4.3 The Effects of Disabling Inactive Cores

In this test, cores that are not in use are disabled one by one to test the change in runtime. Each core has two execution threads, and one thread in a core can be disabled while the other is still active. The benchmark 403.gcc was chosen for this test; it has nine workloads that are executed sequentially. Executing two copies of 403.gcc with four or eight threads enabled had a runtime of 655s. However, as can be seen in Table 5, when the execution was done with two, three, five, six or seven threads enabled, the runtime was approximately 416s. Similarly, when executing three copies with six or seven threads enabled, the CPU performed better.
In the four copies execution, the runtime only got lower when six threads were enabled.

Table 5: The test results for disabling inactive cores while executing the 403.gcc benchmark

Number of         Runtime(s) for each number of execution threads enabled
403.gcc copies
2 copies
3 copies
4 copies
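The per-thread enabling and disabling used in this test is exposed through Linux's CPU hotplug interface in sysfs. Writing to the `online` files requires root, so the sketch below only reads the current state and shows the write as a root-only helper; the function names are assumptions.

```shell
# Sketch (assumes Linux CPU hotplug sysfs): query and, as root, change
# which execution threads (logical CPUs) are enabled.
list_online() {
    # range list of currently enabled logical CPUs, e.g. "0-7"
    cat /sys/devices/system/cpu/online
}

set_thread() {
    # set_thread N 0|1 -- disable or enable logical CPU N.
    # Requires root; cpu0 usually cannot be taken offline.
    echo "$2" > "/sys/devices/system/cpu/cpu$1/online"
}
```

Running the benchmark copies after `set_thread 7 0`, `set_thread 6 0`, and so on reproduces the thread counts in Table 5.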

4.4 Executing Two Processes in One Core

4.4.1 Two Benchmark Copies Executed in the Same Core

The main interference and delay happens when running two processes in the same core. Because they share that core's resources, the runtime of these processes differs from the runtime of processes executed in separate cores. In the five copies test, two of the five copies were executed in the same core, and the results showed that none of the benchmarks benefited from the execution of these two copies. In this section, the same idea is tested, but with only two copies being executed. The test was done after a kernel update that improved the performance of the CPU, so the results produced by this test are not comparable with the previous test results.

The benchmark 429.mcf was chosen for this test because it has only one workload, which can easily be assigned to a specific core. After the update, executing two copies of 429.mcf simultaneously in different cores had a runtime of 353s. Executing two copies in the same core showed an excessive increase in runtime: 663s. Executing the two copies sequentially would therefore produce a better result.

4.4.2 Cache Access Counters

The cache tested here is the L3 cache. The Perfmon2 interface provides access to the performance counter LLC_REFERENCES, which counts how many times the last-level cache (L3) was referenced by the execution thread. Another performance counter used is UNHALTED_CPU_CYCLES. The process executed in this test was a C program provided by my supervisor; it creates a number of structs and keeps accessing them a finite number of times. The Linux command time measured the CPU time of the program. Executing two copies of the program in the same core had a 47% increase in runtime compared to executing them in separate cores.
In addition, the performance counter UNHALTED_CPU_CYCLES showed that execution in the same core used 52% more CPU cycles than execution in two cores. The L3 cache access counter LLC_REFERENCES also increased, by 37%. The large increase in LLC_REFERENCES therefore contributes to the increased runtime of the two copies of the program.
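Perfmon2 has since been superseded in mainline Linux by the perf_events interface, so the same two measurements can be taken today with the perf tool, using its generic event names (`cycles` for UNHALTED_CPU_CYCLES, `LLC-loads` approximating LLC_REFERENCES). The wrapper below is a sketch of that substitute, not the tooling used in the project.

```shell
# Sketch (perf_events, the successor of the Perfmon2 interface used in
# the project): count cycles and last-level cache references.
count_llc() {
    command -v perf >/dev/null 2>&1 || { echo "perf not installed"; return 0; }
    # cycles ~ UNHALTED_CPU_CYCLES, LLC-loads ~ LLC_REFERENCES
    perf stat -e cycles,LLC-loads -- "$@" 2>&1
}
```

For example, `count_llc ./cache_walker` (where `cache_walker` stands in for the supervisor's struct-accessing C program) prints both counts after the run.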

5 The Scheduler

In this chapter, the scheduling algorithm designed in this project is presented. First, the scheduler's objective and method are discussed. Then its structure and the results produced when using it are shown. Finally, features provided by the CPU and operating system that could also achieve the scheduler's objective are explained.

5.1 The Objective of the Scheduler

As a first step towards scheduling, a simple issue was chosen for the scheduler to solve. The issue was with the 429.mcf benchmark: after an update to the Linux kernel, the benchmark experienced problems. When more than three copies of the benchmark were executed, the runtime increased dramatically. As shown in Figure 7, the lowest runtime among four copies executed simultaneously after the update was 100s more than the runtimes before the update, and the other three runtimes after the update were more than double the runtimes before the update.

Figure 7: Runtime results for four copies of 429.mcf before and after the kernel update

Figure 8: Runtime results for eight copies of 429.mcf executed using the scheduler

The reason for the increase is that there is not enough RAM for all the copies; as a result, the copies access memory at the same time, causing the excessive delay. The first approach tried was executing the copies in different periods. Although this allowed one copy to finish in a normal time, it didn't solve the issue for the other copies. The scheduler therefore adopted a more efficient method: run a maximum of three copies at a time, dividing the copies into groups of three or fewer. As soon as one group finishes executing, the scheduler executes the next group. The scheduler also takes into account that the maximum number of benchmark copies CPU2006 can execute concurrently on the Core i7 920 is eight. Runtime results for running eight copies using this scheduler can be seen in Figure 8 above.

5.2 The Scheduler's Structure

The scheduler is a Linux shell script: a series of commands written in a plain text file. The first task the scheduler has to do is count the number of 429.mcf copies the user has executed. That is done using the command:

-> pgrep mcf_base.none | wc -l

pgrep returns all the process IDs (pids) of the 429.mcf benchmark copies being executed, each pid on a separate line; these pids are saved in local variables. The output of pgrep is piped to wc -l, which counts the number of lines and therefore gives the number of 429.mcf copies in execution.

After counting the copies and saving their pids, the next step for the scheduler is to stop the execution of some benchmark copies temporarily. That is done with the kill command, sending the STOP signal to a given process using its pid:

-> kill -STOP pid

The STOP signal simply suspends the execution of the process. Stopping all copies except three allows those three to run without the delay. If the number of executed processes is three or fewer, the scheduler does not alter the execution of the processes.

Next, the scheduler waits in a while loop until the copies in execution finish. Inside the loop, the scheduler counts the number of copies that are either executing or waiting; consequently, it can detect when a group finishes execution. When the first three copies finish running, the scheduler continues the execution of the next group of copies by sending each stopped process the CONT signal, again using the kill command. Another issue here is that two copies could be assigned to execute in the same core, using both of its threads.
Since the maximum number of benchmark copies running concurrently is three, it is possible to run each one in a separate core. The scheduler uses the taskset command to assign copies to logical processors:

-> taskset -cp CPU-number pid

This command assigns the process with the given pid to be executed on the logical processor numbered CPU-number. Up to this point, the scheduler can schedule up to six copies of the benchmark executed simultaneously. If the number of processes is seven or eight, the scheduler simply repeats the steps: it waits for the processes in execution to finish, then continues the execution of the remaining copies and assigns them to separate cores.
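Putting these commands together, the scheduler's control flow can be sketched as a small POSIX shell function. This is a hypothetical reconstruction, not the original script: `batch_run` and its arguments are invented names, the original acted specifically on mcf_base.none, and it polls with a sleep loop where the original's while loop is not specified in detail.

```shell
# Sketch: stop all but BATCH copies of the processes named NAME, pin each
# running copy to its own logical CPU, and resume the next group whenever
# the current group finishes.  Usage: batch_run NAME [BATCH]
batch_run() {
    name=$1
    batch=${2:-3}

    # pgrep NAME | wc -l idea: one pid per line
    pids=$(pgrep -x "$name") || return 0
    set -- $pids
    [ "$#" -le "$batch" ] && return 0     # three or fewer: leave untouched

    # Park every copy beyond the first group.
    i=0
    for pid in "$@"; do
        i=$((i + 1))
        [ "$i" -gt "$batch" ] && kill -STOP "$pid" 2>/dev/null
    done

    # Run the copies group by group.
    while [ "$#" -gt 0 ]; do
        n=$batch
        [ "$#" -lt "$n" ] && n=$#
        cpu=0 group="" j=0
        for pid in "$@"; do
            j=$((j + 1))
            [ "$j" -gt "$n" ] && break
            kill -CONT "$pid" 2>/dev/null               # no-op for group one
            taskset -cp "$cpu" "$pid" >/dev/null 2>&1 || true  # one copy per core
            cpu=$((cpu + 1))
            group="$group $pid"
        done
        shift "$n"
        # The while-loop wait from the description: poll until the whole
        # group has exited, then start the next group.
        for pid in $group; do
            while kill -0 "$pid" 2>/dev/null; do sleep 1; done
        done
    done
}
```

With eight copies of the benchmark running, `batch_run mcf_base.none 3` would let three run pinned to separate cores, then three more, then the final two, matching the grouping described above.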

5.3 Scheduler Results

Table 6 shows the runtimes of 429.mcf copies from two tests. Test 1 was done without the scheduler and suffered a massive delay in the runtimes, the lowest being 8203s. The scheduler was used in Test 2 to avoid the issue, and the results showed the benefit of using it: comparing the highest runtimes in the two tests, there is an 89% decrease in runtime when using the scheduler.

Table 6: Comparison of runtime(s) results between two tests, each executing eight copies of 429.mcf

Copy number    Test 1    Test 2 (scheduler)
1
2
3
4
5
6
7
8

5.4 Paging and Physical Address Extension (PAE)

Increasing the memory space available to the CPU was another way to solve the issue 429.mcf faced. The problem was that the copies needed a large amount of random access memory (RAM), and when running concurrently they interfered with each other's memory accesses. Although the machine had 6GB of RAM, the processor could only use 4GB because of the number of physical address bits. As a result, a feature called physical address extension (PAE) was used. PAE extends the physical address from 32 bits to 36 bits, which increases the maximum physical memory size from 4GB to 64GB; with it, all 6GB of RAM can be accessed by the processor.

Another feature, swapping, was also used. Linux divides physical RAM into blocks called pages. Swapping copies a page from RAM to a preconfigured space on the hard disk called the swap space, freeing that page's space in RAM. The size of the RAM plus the swap space is the total size of the virtual memory. After increasing the memory size using these two features, up to seven copies of 429.mcf could be executed simultaneously without any problem; however, the problem still existed when executing eight copies.
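The capacity figures follow directly from the address widths; a one-line shell check confirms the arithmetic:

```shell
# A 32-bit physical address reaches 2^(32-30) = 4 GB of memory;
# PAE's 36-bit physical address reaches 2^(36-30) = 64 GB.
echo "$((1 << (32 - 30)))GB"   # prints 4GB
echo "$((1 << (36 - 30)))GB"   # prints 64GB
```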

6 Conclusions

This conclusion summarises what I have done and learnt during the period of working on this project, and presents some recommendations for further performance tests and scheduler improvements.

6.1 Summary

In section 1.3, two objectives were listed. The first objective, testing and analysing the performance of the CPU, had three methodologies. The first was executing different numbers of copies of the benchmarks and comparing the results, and testing how the CPU performs when inactive cores are disabled; this was done, and its results and analysis are presented in sections 4.1, 4.2 and 4.3. The second was examining the difference between executing two copies in separate cores and executing them in the same core; this test was done, but due to the lack of comparable results, the analysis is not complete. The third was performing random tests to inspect how the CPU employs its L3 cache; this was performed in association with the second methodology, and the results are presented in section 4.4.

The second objective was to design a scheduler to maximise the performance of the CPU. With the available test results, a scheduler was created for a specific benchmark; its structure is described in chapter 5.

6.2 Further Performance Tests

The tests done in this project executed multiple copies of the same benchmark simultaneously. This could be taken further by executing different benchmarks simultaneously, examining how they interact, and identifying processes that benefit from being executed simultaneously in the same core using hyper-threading. In addition, more tests using the benchmarks and the performance counters would give more understanding of how the CPU uses the L3 cache.

6.3 Improving the Scheduler

In creating the scheduling algorithm for the 429.mcf benchmark, I learned how to alter a running process.
Now it is known how to obtain a specific process ID, stop the process, continue it, and assign it to a specific processor. Given more time, this knowledge, combined with further tests, could be used to create a general scheduler that increases the performance of the Core i7.


Bias Scheduling in Heterogeneous Multi-core Architectures

Bias Scheduling in Heterogeneous Multi-core Architectures Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that

More information

CS3350B Computer Architecture CPU Performance and Profiling

CS3350B Computer Architecture CPU Performance and Profiling CS3350B Computer Architecture CPU Performance and Profiling Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada

More information

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01. Hyperthreading ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf Hyperthreading is a design that makes everybody concerned believe that they are actually using

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 06: Multithreaded Processors

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 06: Multithreaded Processors Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors Lesson 06: Multithreaded Processors Objective To learn meaning of thread To understand multithreaded processors,

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 21 Main Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Why not increase page size

More information

A Case Study in Optimizing GNU Radio s ATSC Flowgraph

A Case Study in Optimizing GNU Radio s ATSC Flowgraph A Case Study in Optimizing GNU Radio s ATSC Flowgraph Presented by Greg Scallon and Kirby Cartwright GNU Radio Conference 2017 Thursday, September 14 th 10am ATSC FLOWGRAPH LOADING 3% 99% 76% 36% 10% 33%

More information

Two hours - online. The exam will be taken on line. This paper version is made available as a backup

Two hours - online. The exam will be taken on line. This paper version is made available as a backup COMP 25212 Two hours - online The exam will be taken on line. This paper version is made available as a backup UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE System Architecture Date: Monday 21st

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Performance Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Defining Performance (1) Which airplane has the best performance? Boeing 777 Boeing

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Power Measurements using performance counters

Power Measurements using performance counters Power Measurements using performance counters CSL862: Low-Power Computing By Suman A M (2015SIY7524) Android Power Consumption in Android Power Consumption in Smartphones are powered from batteries which

More information

ECE 172 Digital Systems. Chapter 15 Turbo Boost Technology. Herbert G. Mayer, PSU Status 8/13/2018

ECE 172 Digital Systems. Chapter 15 Turbo Boost Technology. Herbert G. Mayer, PSU Status 8/13/2018 ECE 172 Digital Systems Chapter 15 Turbo Boost Technology Herbert G. Mayer, PSU Status 8/13/2018 1 Syllabus l Introduction l Speedup Parameters l Definitions l Turbo Boost l Turbo Boost, Actual Performance

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER Hardware Sizing Using Amazon EC2 A QlikView Scalability Center Technical White Paper June 2013 qlikview.com Table of Contents Executive Summary 3 A Challenge

More information

Power Control in Virtualized Data Centers

Power Control in Virtualized Data Centers Power Control in Virtualized Data Centers Jie Liu Microsoft Research liuj@microsoft.com Joint work with Aman Kansal and Suman Nath (MSR) Interns: Arka Bhattacharya, Harold Lim, Sriram Govindan, Alan Raytman

More information

Sandbox Based Optimal Offset Estimation [DPC2]

Sandbox Based Optimal Offset Estimation [DPC2] Sandbox Based Optimal Offset Estimation [DPC2] Nathan T. Brown and Resit Sendag Department of Electrical, Computer, and Biomedical Engineering Outline Motivation Background/Related Work Sequential Offset

More information

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4 Motivation Threads Chapter 4 Most modern applications are multithreaded Threads run within application Multiple tasks with the application can be implemented by separate Update display Fetch data Spell

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor*

Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Tyler Viswanath Krishnamurthy, and Hridesh Laboratory for Software Design Department of Computer Science Iowa State University

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 10 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Chapter 6: CPU Scheduling Basic Concepts

More information

CS2506 Quick Revision

CS2506 Quick Revision CS2506 Quick Revision OS Structure / Layer Kernel Structure Enter Kernel / Trap Instruction Classification of OS Process Definition Process Context Operations Process Management Child Process Thread Process

More information

ò mm_struct represents an address space in kernel ò task represents a thread in the kernel ò A task points to 0 or 1 mm_structs

ò mm_struct represents an address space in kernel ò task represents a thread in the kernel ò A task points to 0 or 1 mm_structs Last time We went through the high-level theory of scheduling algorithms Scheduling Today: View into how Linux makes its scheduling decisions Don Porter CSE 306 Lecture goals Understand low-level building

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems

Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems Petar Radojković Vladimir Čakarević Javier Verdú Alex Pajuelo Francisco J. Cazorla Mario Nemirovsky Mateo Valero

More information

Scheduling. Don Porter CSE 306

Scheduling. Don Porter CSE 306 Scheduling Don Porter CSE 306 Last time ò We went through the high-level theory of scheduling algorithms ò Today: View into how Linux makes its scheduling decisions Lecture goals ò Understand low-level

More information

Response Time and Throughput

Response Time and Throughput Response Time and Throughput Response time How long it takes to do a task Throughput Total work done per unit time e.g., tasks/transactions/ per hour How are response time and throughput affected by Replacing

More information

Power Management for Embedded Systems

Power Management for Embedded Systems Power Management for Embedded Systems Minsoo Ryu Hanyang University Why Power Management? Battery-operated devices Smartphones, digital cameras, and laptops use batteries Power savings and battery run

More information

POWER MANAGEMENT AND ENERGY EFFICIENCY

POWER MANAGEMENT AND ENERGY EFFICIENCY POWER MANAGEMENT AND ENERGY EFFICIENCY * Adopted Power Management for Embedded Systems, Minsoo Ryu 2017 Operating Systems Design Euiseong Seo (euiseong@skku.edu) Need for Power Management Power consumption

More information

OneCore Storage Performance Tuning

OneCore Storage Performance Tuning OneCore Storage Performance Tuning Overview To improve Emulex adapter performance while using the OneCore Storage Linux drivers in a multi-core CPU environment, multiple performance tuning features can

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Pharmacy college.. Assist.Prof. Dr. Abdullah A. Abdullah

Pharmacy college.. Assist.Prof. Dr. Abdullah A. Abdullah The kinds of memory:- 1. RAM(Random Access Memory):- The main memory in the computer, it s the location where data and programs are stored (temporally). RAM is volatile means that the data is only there

More information

Software within building physics and ground heat storage. HEAT3 version 7. A PC-program for heat transfer in three dimensions Update manual

Software within building physics and ground heat storage. HEAT3 version 7. A PC-program for heat transfer in three dimensions Update manual Software within building physics and ground heat storage HEAT3 version 7 A PC-program for heat transfer in three dimensions Update manual June 15, 2015 BLOCON www.buildingphysics.com Contents 1. WHAT S

More information

5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction

5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction 5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner Topic 1: Introduction These slides are mostly taken verbatim, or with minor changes, from

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User?

Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User? Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User? Andrew Murray Villanova University 800 Lancaster Avenue, Villanova, PA, 19085 United States of America ABSTRACT In the mid-1990

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information

DEMYSTIFYING INTEL IVY BRIDGE MICROARCHITECTURE

DEMYSTIFYING INTEL IVY BRIDGE MICROARCHITECTURE DEMYSTIFYING INTEL IVY BRIDGE MICROARCHITECTURE Roger Luis Uy College of Computer Studies, De La Salle University Abstract: Tick-Tock is a model introduced by Intel Corporation in 2006 to show the improvement

More information

Position Paper: OpenMP scheduling on ARM big.little architecture

Position Paper: OpenMP scheduling on ARM big.little architecture Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM

More information

Computer Architecture. Introduction

Computer Architecture. Introduction to Computer Architecture 1 Computer Architecture What is Computer Architecture From Wikipedia, the free encyclopedia In computer engineering, computer architecture is a set of rules and methods that describe

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1 Why? High-Performance Multicores for Real-Time Systems

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2019 Lecture 8 Scheduling Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ POSIX: Portable Operating

More information

Device-Functionality Progression

Device-Functionality Progression Chapter 12: I/O Systems I/O Hardware I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Incredible variety of I/O devices Common concepts Port

More information

Chapter 12: I/O Systems. I/O Hardware

Chapter 12: I/O Systems. I/O Hardware Chapter 12: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations I/O Hardware Incredible variety of I/O devices Common concepts Port

More information

Multiprocessor Support

Multiprocessor Support CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the

More information

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models Jason Andrews Agenda System Performance Analysis IP Configuration System Creation Methodology: Create,

More information

V. Primary & Secondary Memory!

V. Primary & Secondary Memory! V. Primary & Secondary Memory! Computer Architecture and Operating Systems & Operating Systems: 725G84 Ahmed Rezine 1 Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information