Scheduling the Intel Core i7


Third Year Project Report
University of Manchester
SCHOOL OF COMPUTER SCIENCE

Scheduling the Intel Core i7

Ibrahim Alsuheabani
Degree Programme: BSc Software Engineering
Supervisor: Prof. Alasdair Rawsthorne
May 2010

Abstract

As more cores are added to the CPU chip, the process of scheduling the CPU becomes harder. The Intel Core i7 920 model has four cores and supports hyper-threading, providing eight execution threads in total. Although it shows higher performance compared to CPUs with fewer cores, it is not employed to its full potential. In order to create scheduling algorithms for the Core i7 that can fully utilize it, the Core i7 was tested and the test results were analysed. This paper presents the test results and analysis as well as a scheduling algorithm that was developed.

Project Name: Scheduling the Core i7
Name: Ibrahim Alsuheabani
Date: May 2010
Project Supervisor: Prof. Alasdair Rawsthorne

Acknowledgements

I would like to thank my supervisor, Alasdair Rawsthorne, for his helpful feedback and suggestions on my work and for providing access to the Core i7 desktop. Throughout many phases in the course of this project, his advice was always available. I would also like to thank my family and friends for their support and encouragement.

Table of Contents

1 Introduction
   1.1 Background
   1.2 Project Proposal
   1.3 Aims and Objectives
   1.4 Report Structure
2 Literature Survey
   2.1 CPU Performance
   2.2 Multi-core Processors
   2.3 Multi-core Issues
   2.4 Nehalem Architecture
      2.4.1 Cores and Intel Turbo Boost Technology
      2.4.2 Threads and Intel Hyper-Threading Technology
      2.4.3 Caches and Intel Smart Cache Technology
   2.5 The Linux Scheduler
   2.6 Performance Test Tools
      2.6.1 SPEC CPU2006 Benchmark Suite
      2.6.2 Performance Counters
3 Analysis and Performance Tests
   3.1 SPEC CPU2006 Benchmarks
      3.1.1 Assigning Benchmarks to Cores in SPEC CPU2006
      3.1.2 Benchmarks Runtimes
      3.1.3 Running Multiple Benchmarks Simultaneously
   3.2 Disabling and Enabling Cores
   3.3 Changing the Cores Clock Frequency
   3.4 Accessing the Performance Counters
4 Results
   4.1 Multiple Copies of the C Benchmarks
      4.1.1 Two Copies
      4.1.2 Three and Four Copies
      4.1.3 Five Copies
         Five Copies of 403.gcc Benchmark
         Eight Copies of 403.gcc Benchmark
   The C++ Benchmarks
   The Effects of Disabling Inactive Cores
   Executing Two Processes in One Core
      Two Benchmark Copies Executed in the Same Core
      Cache Access Counters
5 The Scheduler
   The Objective of the Scheduler
   The Scheduler's Structure
   Scheduler Results
   Paging and Physical Address Extension (PAE)
6 Conclusions
   Summary
   Further Performance Tests
   Improving the Scheduler
References
Appendix A
   Descriptions and Runtimes of the Integer Benchmarks

List of Figures

Figure 1: The structure of the Core i7, and the order in which CPU2006 assigns processes to threads
Figure 2: Runtime results for two copies of the C benchmarks
Figure 3: Runtime results for each copy of the two copies test for 456.hmmer
Figure 4: Runtime results for the one, two, three and four copies tests of the C benchmarks
Figure 5: Runtime results for five copies of 403.gcc executed simultaneously
Figure 6: Runtime results for eight copies of 403.gcc executed simultaneously
Figure 7: Runtime results for four copies of 429.mcf before and after the kernel update
Figure 8: Runtime results for eight copies of 429.mcf executed using the scheduler

List of Tables

Table 1: The benchmarks used in the project and the runtime of executing one copy of each
Table 2: Comparing the runtime(s) for two copies of the C benchmarks
Table 3: Comparing the runtime(s) for one, four and five copies of the C benchmarks; the five copies results are divided into the three copies executed in separate cores and the two copies executed in the same core
Table 4: Runtime results for multiple copies of the C++ benchmarks
Table 5: The test results for disabling inactive cores while executing the 403.gcc benchmark
Table 6: Comparison of runtime(s) results between two tests each executing eight copies of 429.mcf
Table 7: CINT2006 benchmarks [9]
Table 8: Comparing the runtime(s) of the benchmarks, the multiple copies being executed simultaneously

Glossary of Practical Terms

Perfmon2: Interface that provides access to the hardware performance counters. In Intel processors this interface accesses the performance monitoring unit (PMU) that monitors CPU events.

pfmon: Command-line hardware performance monitoring tool that uses the perfmon2 interface to access the hardware performance counters.

SPEC CPU2006: A suite of CPU-intensive benchmarks for testing the performance of the CPU.

runspec: A command-line tool in the SPEC CPU2006 suite. It builds the benchmarks, runs them, and prints out their results.

pgrep: Linux command-line utility that searches for the named process and returns the process ID (pid) of every process with that name.

wc: Linux command-line utility that prints the number of lines, words or characters in a file.

kill: Linux command-line utility that sends a specified signal to a specified process.

taskset: Linux command-line utility to set or retrieve the CPU affinity of a running process. CPU affinity is a scheduler property that links a process to a set of CPUs.

struct: A structured type in the C programming language that combines a set of named members of different types into a single object.
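Several of these utilities are typically chained together during the tests. The sketch below is a dry run: the run() wrapper only echoes each command so the sequence is safe to execute, and the process name, CPU list and PID are illustrative, not taken from the report.

```shell
# Dry-run sketch of chaining the glossary utilities: find the PIDs of a named
# benchmark process, count the running copies, pin one to a set of CPUs, and
# stop it.  run() only echoes; remove it to run the commands for real.
run() { echo "$@"; }

run pgrep runspec            # would list one PID per line
run pgrep runspec '|' wc -l  # would count the running copies
run taskset -cp 1-3 1234     # 1234 stands in for a real PID
run kill -TERM 1234
```

Dropping the run() prefix turns each line into the real command, which then needs the target process to exist.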

1 Introduction

1.1 Background

Multi-core CPUs introduced a new dimension to processing applications. A multi-core CPU can simultaneously execute a number of processes equal to its number of cores. This provides an increase in performance for multithreaded applications. These cores are not considered independent processors. Since they are implemented in one chip, they share some of the hardware resources. As a result, the cores may interfere with each other when executing processes. The Intel Core i7 is the new generation of multi-core systems. This CPU has a feature called Hyper-Threading, which allows each core to execute two threads simultaneously. With Hyper-Threading, the CPU has double the number of execution threads as available cores. It also implements an inclusive third-level cache (L3) that is shared across all cores. To create a scheduler for the CPU, it should be tested first. Testing the CPU gives a clear idea of how its performance varies with different tests. Then, after studying the test results, scheduling algorithms can be created to maximize the performance of the Core i7. CPUs are tested using benchmarks that simulate real user applications. The Standard Performance Evaluation Corporation (SPEC) created CPU2006, which is an industry-standard benchmark suite. This suite provides benchmarks that stress the system's CPU, producing comparative performance measurements. Major computer corporations test their hardware using the CPU2006 benchmark suite and submit the results to SPEC. Some of these results are available on the SPEC website. The Completely Fair Scheduler is a scheduler introduced into the Linux kernel [1]. This scheduler provides fairness between tasks running in a single core. However, the issue is that it is not multi-core aware. A new scheduler domain indicating multi-core features has been added to the domain hierarchy of the Linux process scheduler [2]. This scheduler domain can identify cores that are sharing resources. For example, in the Intel Core i7 with Hyper-Threading enabled, a core is considered two logical cores sharing resources. With this information, the scheduler by default assigns tasks first to cores that are not sharing resources, thus maximising resource utilization and minimising contention. The issue is that some applications do better when their tasks are executed on cores that share resources. Another aspect to consider during the scheduling process in the Core i7 is that all cores share the L3 cache.

1.2 Project Proposal

The proposal was to test the performance of the Core i7 using the SPEC CPU2006 benchmark suite as the performance test tool and Linux as the operating system, and then to design a scheduler to enhance the performance based on the test results. Due to the new technologies introduced in this processor, old scheduling algorithms face some issues when used with it. The Core i7 model used in this project is quad-core. It also provides hyper-threading, thus eight logical processors in total. The processor is tested by executing multiple benchmarks in different threads in the CPU. Changing the number of benchmarks executed, and also changing the threads used for execution, generates a database of test results. These results are then studied to see how the CPU performs in different circumstances. The design of the scheduler is based on the information gathered from the test results.

1.3 Aims and Objectives

The first objective of the project is testing the performance of the Core i7. As mentioned above, the test is done using SPEC CPU2006. For this project, three performance test methodologies were suggested. The first was running multiple copies of a benchmark simultaneously, increasing this number by one each time, and then comparing the results of the tests to see how the Core i7 performed. During this test method some tests also disable unneeded cores to examine the difference between inactive and disabled cores. The second methodology was testing how two copies of a benchmark would perform if executed in one core by using hyper-threading. The last test methodology was to perform random tests to understand how the Core i7 employs its L3 cache. The second objective was to design scheduling algorithms to maximize the throughput of the Core i7.

1.4 Report Structure

The idea of multi-core processing and the description of the Core i7's structure and technologies will be presented in chapter 2.
Also, the description of the performance test tools used in the project will be given there. Chapter 3 discusses the analysis and the performance tests that were done. The test results are presented in chapter 4. In chapter 5, a scheduler designed to solve an issue the Core i7 faced is explained, while chapter 6 concludes the report.

2 Literature Survey

At the beginning of this chapter, the idea of CPU performance will be described. After that, multi-core processors and some of their issues will be presented. Then, the structure of the Intel Core i7 processor will be shown, and the technologies that it provides will be discussed. Then, the scheduler used in the Linux system will be explained. Finally, the performance test tools will be presented.

2.1 CPU Performance

The performance of the CPU is often associated with its clock rate (MHz). The clock rate is the number of cycles per second of the CPU clock. However, having a higher clock rate does not always mean higher performance, because there are other factors that affect CPU performance. Two of these factors are the cache size and the size of the random access memory (RAM) available to the CPU. Bus speeds and the type and order of the instructions of the executed applications also have an impact on CPU performance. Measuring CPU performance is done using benchmarks. They are test applications that use the hardware resources of the CPU and measure the time it takes the CPU to execute them. By scaling these runtimes against reference runtimes, the performance of the CPU can be measured.

2.2 Multi-core Processors

A multi-core processor uses more than one independent processor (core) in one chip. Multi-core processors have the ability to perform multiprocessing, which is the execution of multiple processes concurrently in the system. Each core in the multi-core processor operates independently of the other cores. Cores in a multi-core processor are coupled, and this coupling varies from system to system. In some systems, the cores may share a random access memory (RAM). They may also share caches. In some systems the cores communicate using message passing. For a single piece of software, the performance gain a multi-core processor provides depends on the software's implementation. The software should have operations or tasks that can be executed simultaneously to benefit from the multi-core system. These parts of the software are executed simultaneously in separate cores in the multi-core system, providing a gain in performance. That is what is called parallel processing of the software. Parallelization of software has proven to be hard to implement, since the cores executing the parallel parts of the software may interfere with each other.

11 For executing multiple applications, the multi-core processor can start the execution on its cores simultaneously running more than one application at the same time. 2.3 Multi-core Issues The first issue multi-core processors face is the Operating System (OS) support. Some Operating Systems consider each virtual processor or core as a separate CPU handling the system as a multiprocessor platform. As a result, the OS does not consider that these cores share hardware resources inside the chip. For that reason, the OS may not be successful in fully utilizing the multi-core processor to its potentials. Also, as there is more than one core they may interfere when running tasks simultaneously causing delays and slowing down the performance. Thus, the challenge is to understand the multi-core architecture and create a scheduler to avoid as much interference as possible between the cores while maintaining a better performance. 2.4 Nehalem Architecture Intel Core i7 is based on the Nehalem microarchitecture. Nehalem is the codename Intel gave to their new multi-core microarchitecture. Intel introduced new technologies in Nehalem. A third level cache was also added in this microarchitecture. Making it a three level cache hierarchy processor. Core i7 has many modules with different specifications. The model used in this project is Core i This processor has four cores each core is multi-threaded providing in total eight threads [3] Cores and Intel Turbo Boost Technology Core i7 920 is quad-cored with a maximum frequency of MHz and a minimum of MHz for the cores. Intel Turbo Boost Technology allows active cores to run faster than the base operating frequency. The turbo boost technology is activated when the Operating System (OS) requests a higher processor performance. The maximum frequency of the turbo boost depends on the number of active cores. 
The amount of time a core spends in a turbo boost state depends on the estimated current, temperature and power consumption of the processor. If the processor is operating within limits of these factors and additional performance is needed, the processor frequency will constantly increase by MHz on short periods until it reaches the limit determined by the number of active cores [4]. On the other hand, the processor reduces frequency by MHz when temperature, current or power exceed factory limits. 11

2.4.2 Threads and Intel Hyper-Threading Technology

Hyper-threading enables simultaneous multi-threading for each core in the processor. The Core i7 920 has eight threads with this technology enabled. If the operating system supports hyper-threading and has it enabled, it will treat each thread as a separate logical processor. Two logical processors on one physical core share the same execution unit. Hyper-threading provides higher performance when used with multi-threaded applications. Since threads can execute processes simultaneously, it also offers the advantage of reducing latency and making full use of the clock cycle. For instance, if one thread in a core is inactive, doing I/O or waiting for a result, the other thread will execute, making full use of the clock cycle [3].

2.4.3 Caches and Intel Smart Cache Technology

Nehalem has four first-level (L1) caches, four second-level (L2) caches and one third-level (L3) or last-level cache. Each core has its own L1 and L2 caches, and the two threads in the same core share them. The L3 cache is shared between all the cores. Intel Smart Cache is provided in L3. The size of the L3 share allocated to each core can be dynamically altered using smart caching. Therefore, if a core has minimal cache requirements, another core can dynamically increase its share of the cache, reducing cache misses [5]. Nehalem also enhances Intel Smart Cache by allowing the L3 cache to increase performance and reduce traffic to the processor cores. Some processor architectures use L3 to save only data not stored in the other caches. As a result, if a data request misses in L3, all the other caches must still be searched in case they contain the requested data. This method increases the latency between the cores. However, in the Nehalem microarchitecture, a miss in its shared L3 cache guarantees the data needed is not in the processor, eliminating unnecessary searches and reducing latency [6].
The following are the cache sizes for the Core i7: the L1 cache is 32KB for instructions and 32KB for data; the L2 cache is 256KB for both data and instructions; L3 is a fully shared 8MB cache [3].

2.5 The Linux Scheduler

Since the operating system in this project was Linux, it is important to understand the strategy by which its scheduler assigns processes to the CPU. In the current version of Linux, the scheduler used is the Completely Fair Scheduler (CFS). The main concept in CFS is to preserve fairness in assigning CPU time to processes. The CFS maintains the amount of CPU time given to a task in what is called the virtual runtime [1]. Tasks are arranged in the CFS using a time-ordered red-black tree rather than a queue. The red-black tree is a balanced tree providing efficient and fast operations such as deleting and inserting tasks. Tasks with the lowest virtual runtime have the highest need for the CPU. These tasks are stored on the left side of the red-black tree, and the tasks with the lowest need for the CPU are stored on the right side. The CFS chooses the left-most task in the red-black tree to be executed next. This provides fairness, since the left-most task has the highest need for CPU time. Nodes in the red-black tree shift from the right to the left by one, providing balance to the tree and fairness between tasks. If a task has used its available CPU time and is still not finished, its execution time is added to its virtual runtime and the task is inserted again into the red-black tree.

2.6 Performance Test Tools

To measure the performance of the Intel Core i7, benchmarks have to be used. Benchmarks simulate real user applications. They are developed to pressure the computer hardware so that real performance is measured. For this project, the SPEC CPU2006 benchmark suite was used. Also, a useful measurement method is using performance counters or event counters. These count processor events like CPU cycles and cache accesses. For accessing and printing the event counters, perfmon2 was used [7].

2.6.1 SPEC CPU2006 Benchmark Suite

The Standard Performance Evaluation Corporation (SPEC) maintains a standardized set of relevant benchmarks applied to the newest generation of high-performance computers [8]. One of its benchmark suites is CPU2006, which provides performance measurements for the CPU. CPU2006 contains two benchmark suites: CINT2006 for measuring integer performance, and CFP2006 for measuring floating point performance. In this project, the CINT2006 benchmark suite was used as the measurement tool. CINT2006 has twelve benchmarks, all written in C or C++ [9].
To run a test in this benchmark suite, one basically chooses the benchmark and the number of copies of it to be executed simultaneously. The result of the execution is printed, showing the runtime of each copy on the system. SPEC uses a reference machine to normalize the performance measures. Each benchmark was run on this reference machine to give a reference runtime for that benchmark. The runtime of the benchmark on the user's system is then compared to the reference runtime already provided. This comparison produces a ratio that can be thought of as a score for the CPU: the higher the ratio, the better the performance. For example, for the benchmark 403.gcc the runtime on the tested system was 388s and the reference runtime is 8031s, so the ratio was about 20.7.

2.6.2 Performance Counters

Intel supplies in their processors a performance monitoring unit (PMU). This unit provides the event counters. These events include cache accesses, CPU cycles and cache misses. This is a useful tool if there is any delay or interference between processes: by using the counters, it is possible to find where the problem is. The Linux command pfmon was used to print these counters by accessing them through the perfmon2 interface.
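The score computation just described is a single division; as a quick check of the 403.gcc example, using the two runtimes given in the text:

```shell
# SPEC-style ratio: reference runtime divided by measured runtime
# (403.gcc: 8031 s reference, 388 s measured on the test machine).
ref=8031
measured=388
awk -v r="$ref" -v t="$measured" 'BEGIN { printf "%.1f\n", r / t }'
```

This prints 20.7, the score discussed above.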

3 Analysis and Performance Tests

This chapter will discuss how the CPU performance tests were done using the benchmarks. Also, the results for some of the tests will be presented. In addition, results for running different numbers of copies of the benchmarks in different core orders will be shown.

3.1 SPEC CPU2006 Benchmarks

CPU2006 was used in this project as a command-line program for Linux. The command for it is runspec. The test specifications, like which benchmark to execute and how many copies of it, are written inside a configuration file such as tmp1.cfg. The maximum number of processes CPU2006 can run is the available number of logical processors; any extra processes will be terminated. The command to run this test is:

-> runspec -c tmp1.cfg

The -c (or --config) option specifies the configuration file, which is written after it. After running the command, CPU2006 gets the chosen benchmark and runs the number of copies specified by the configuration file. Some benchmarks have more than one workload. For example, 403.gcc has 9 workloads.

Core 1: CPU0, CPU7
Core 2: CPU1, CPU6
Core 3: CPU2, CPU5
Core 4: CPU3, CPU4
(processes are assigned in the order CPU0, CPU1, ..., CPU7)

Figure 1: The structure of the Core i7, and the order in which CPU2006 assigns processes to threads

3.1.1 Assigning Benchmarks to Cores in SPEC CPU2006

From experience with Linux on the Core i7, if hyper-threading is enabled, Linux uses the first thread (CPU0) for system operations. If multiple benchmarks are executed simultaneously, it assigns the processes to threads in the order shown in Figure 1 above. This method avoids hyper-threading if four or fewer processes are executed. If the number is more than four, it starts using hyper-threading, with the fifth
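The thread-to-core pairing in Figure 1 can be written down as a small lookup. The pairing below is a reading of the figure (each physical core hosts two logical CPUs), not something taken from Intel documentation:

```shell
# core_of maps a logical CPU number to its physical core, following Figure 1:
# core 1 = {CPU0, CPU7}, core 2 = {CPU1, CPU6},
# core 3 = {CPU2, CPU5}, core 4 = {CPU3, CPU4}.
core_of() {
  case "$1" in
    0|7) echo 1 ;;
    1|6) echo 2 ;;
    2|5) echo 3 ;;
    3|4) echo 4 ;;
    *)   echo "unknown CPU $1" >&2; return 1 ;;
  esac
}
core_of 3   # fourth benchmark process: core 4
core_of 4   # fifth process: also core 4, the first to share via hyper-threading
```

With CPU2006 filling CPU0 to CPU7 in order, the first four processes land on distinct cores and the fifth is the first to share a core.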

process being executed in the second thread of the last core (CPU4). If there is a sixth process, by the given order, it will be executed using CPU5. Using two threads in the same core to run processes produces some latency because the resources in the core are shared.

3.1.2 Benchmarks Runtimes

At first, the time it takes to run one copy of each benchmark should be recorded and used as a base result. This base result of the benchmark is compared to the runtime of a number of copies of the benchmark run concurrently. Using this comparison, it is possible to see the effect of multiprocessing on the benchmark. It also shows if there is any interference or delay between the concurrently executed copies. The base results used in this project are presented in Table 1.

Table 1: The benchmarks used in the project and the runtime of executing one copy of each

BENCHMARK      RUNTIME(s)   Ratio
401.bzip2
403.gcc
429.mcf
445.gobmk
456.hmmer
458.sjeng
464.h264ref
471.omnetpp
473.astar

3.1.3 Running Multiple Benchmarks Simultaneously

Testing the performance of the quad-core processor in this project was done by executing copies of a benchmark in more than one core concurrently. That showed whether these copies interfere. Also tested was the difference between executing two copies of a benchmark in two threads in separate cores, and executing them in two threads in the same core. As the number of copies executed concurrently increases, their runtime will increase as a result of sharing the CPU resources. In this project, the relation between the runtime and the number of copies executed has been studied. The results showed a dramatic increase in the runtime when executing five copies or more, because one of the cores will execute two copies using both of its threads. CPU2006 can only execute a maximum of eight copies of a benchmark simultaneously, because that is the number of available logical processors.
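The baseline comparison described above follows a simple model, which the sketch below states explicitly. The 398 s base runtime for 403.gcc is inferred from the 796 s two-copy sequential figure quoted in chapter 4, so treat it as illustrative:

```shell
# Sequentially, the n-th copy of a benchmark finishes at n * base seconds;
# interference-free simultaneous copies would all finish at the base runtime.
base=398                                  # inferred 403.gcc single-copy time
seq_finish() { echo $(( $1 * base )); }   # $1 = copy index (1, 2, ...)
seq_finish 1   # first sequential copy
seq_finish 2   # second sequential copy
```

Any gap between a simultaneous copy's runtime and the base runtime is then the interference the tests set out to measure.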

When a large number of copies consuming a considerable amount of time needed to be executed, the Linux at command was used. This command simply takes another command and a given time as arguments, then runs that command at the given time. Because of the inconvenience large performance tests could cause, they were assigned to run at night. The at command syntax is:

-> at -f [filename] -t [time]

The file contains the commands that need to run at the given time.

3.2 Disabling and Enabling Cores

Disabling some threads in the CPU is another method of testing its performance. In the Linux OS, for each thread there is a file that shows whether it is enabled or disabled. The file location is:

/sys/devices/system/cpu/cpuX/online (X: the thread number)

The first thread is always enabled and cannot be disabled, since it is used for system operations. If 1 is written in a thread's online file, the thread is enabled, and if 0 is written, it is disabled. To enable or disable a thread, the online file is edited as follows:

-> echo [0 or 1] >> /sys/devices/system/cpu/cpuX/online (X: the thread number)

3.3 Changing the Cores Clock Frequency

Changing the CPU frequency is useful when testing the processor. Although Turbo Boost Technology can alter the frequency and increase it when needed, the initial frequency that a core starts with can be changed. Each thread has its own frequency. It can be manually changed using the command:

-> cpufreq-selector -c [thread number] -f [frequency]

Also, regarding CPU frequency there is the frequency governor. As the name implies, it governs the CPU frequency. There are five governors, each with its benefits and drawbacks. For example, a frequency governor called performance sets the CPU frequency to the highest. This governor is suitable for periods of intense workloads, offering very high speed; a disadvantage, however, is that it has no power-saving benefit. The other governors are ondemand, userspace, conservative and powersave. The command for choosing the governor is:

-> cpufreq-selector -c [thread number] -g [governor]
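A minimal sketch wrapping the two operations above. The sysfs root is parameterized so the function can be exercised against a scratch directory; on a real machine it defaults to /sys/devices/system/cpu and requires root, and the cpufreq-selector lines are only echoed here rather than executed:

```shell
# Toggle a logical CPU via its sysfs "online" file (section 3.2).
SYSFS_ROOT=${SYSFS_ROOT:-/sys/devices/system/cpu}
set_cpu_online() {  # $1 = thread number (not 0), $2 = 1 to enable, 0 to disable
  echo "$2" > "$SYSFS_ROOT/cpu$1/online"
}

# Dry-run builders for the frequency commands (section 3.3): they print the
# command line instead of running it, since cpufreq-selector needs root.
freq_cmd()     { echo "cpufreq-selector -c $1 -f $2"; }
governor_cmd() { echo "cpufreq-selector -c $1 -g $2"; }

freq_cmd 2 1600000
governor_cmd 2 performance
```

Pointing SYSFS_ROOT at a temporary directory makes set_cpu_online safe to try before using it on the real sysfs tree.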

3.4 Accessing the Performance Counters

In Intel processors, events can be monitored using programmable counters. These counters report how many times an event occurred during a specific process or in the system as a whole. In the project, some event counters were used for the executed benchmarks. The syntax for pfmon, which uses the perfmon2 interface to print event counters for a given command, is:

-> pfmon -e[event1,event2,...] command

This runs the command given in the argument and starts monitoring the processor events. After the command finishes running, the counters for the events are printed out. The command tested during the project was the runspec command that runs the benchmarks. Accessing these counters provides very useful information for the performance analysis.
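As a sketch, the invocation can be assembled from an event list and the command under test. This is a dry run: the string is only printed, since pfmon and the perfmon2 kernel interface are present only on suitably patched systems, and EVENT1/EVENT2 are placeholders, not verified PMU event names:

```shell
# Build the pfmon command line from section 3.4 without executing it.
pfmon_cmd() {  # $1 = comma-separated event list; remaining args = the command
  events=$1
  shift
  echo "pfmon -e$events $*"
}
pfmon_cmd EVENT1,EVENT2 runspec -c tmp1.cfg
```

On the project machine, the printed line would be run as-is to monitor a benchmark execution.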

4 Results

In this chapter, all the performance analysis results will be presented. First, the runtime results for all the benchmarks when two, four or five copies are executed simultaneously will be shown. Then, the results for the 403.gcc benchmark when using the hyper-threading technology are analysed. Next, test results for the C++ benchmarks will be presented. Finally, results for executing two copies in the same core, using performance counters, will be shown.

4.1 Multiple Copies of the C Benchmarks

4.1.1 Two Copies

In this section, the results of the tests will be presented and studied, starting with the two copies test. Figure 2 presents the C benchmarks' runtime results for executing two copies of each, comparing simultaneous and sequential execution of the benchmarks. When running two copies simultaneously, the runtime presented as the result is the runtime of the slower copy. In simultaneous execution of two copies, there is only a few seconds' difference between their runtimes. On the other hand, for sequential execution the first copy will have the same runtime as executing one copy alone. The second copy will have double that runtime, since it has to wait for the first copy to finish before it can be executed. The runtime presented for sequential execution in Figure 2 is the runtime of the second copy. All the benchmarks benefit from running two copies simultaneously. That is expected, because the copies were executed simultaneously in two different cores, each core with its own execution unit.

Figure 2: Runtime results for two copies of the C benchmarks

Table 2: Comparing the runtime(s) for two copies of the C benchmarks

BENCHMARK      One copy   Two copies simultaneously   Two copies sequentially
429.mcf
401.bzip2
403.gcc
445.gobmk
456.hmmer
458.sjeng
464.h264ref

As shown in Table 2, executing two copies of benchmark 403.gcc simultaneously took 655s, while executing them sequentially would take 796s. This shows that 403.gcc benefited from running two copies simultaneously, since it decreased the runtime by 17% compared to running them sequentially. There is some interference between the simultaneously executed copies, since the runtime of each copy is 65% more than the runtime of one copy executed alone. The benchmark 456.hmmer had a runtime of 1071s for executing one copy. Executing two copies of 456.hmmer simultaneously took 1876s. This is a 75% increase in runtime for each copy in the simultaneous execution. This high increase in runtime shows that there is a major interference between the two copies. Although the runtime of simultaneous execution here is less than the runtime of the last copy in sequential execution, it shouldn't be considered the best performance in general. As can be seen in Figure 3, the runtime of the first copy in sequential execution is 57% less than the runtime of simultaneous execution. Also, if power consumption is considered, sequential execution is a better solution in this case. While sequential execution consumes 2142s of CPU time in only one core, simultaneous execution consumes 1876s in two cores. Therefore, sequential execution of two copies of 456.hmmer consumes less power.

Figure 3: Runtime results for each copy of the two copies test for 456.hmmer
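The percentages quoted above can be rechecked directly from the runtimes given in the text (the report's 17% appears to be the truncated value of 17.7%):

```shell
# Per-copy slowdown for 456.hmmer: two simultaneous copies (1876 s each)
# versus one copy alone (1071 s).
awk 'BEGIN { printf "%.1f\n", 100 * (1876 - 1071) / 1071 }'

# Runtime saved by simultaneous execution of 403.gcc: 655 s versus the
# 796 s sequential total.
awk 'BEGIN { printf "%.1f\n", 100 * (796 - 655) / 796 }'
```

The first line prints 75.2 and the second 17.7, matching the 75% and 17% figures in the discussion.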

4.1.2 Three and Four Copies

When running three or four copies simultaneously, each copy is executed in a separate core. The runtime results for executing three and four copies of a benchmark simultaneously are shown in Figure 4. All the copies executed simultaneously for each benchmark have the same runtime, since they run in separate cores. From Figure 4, it is clear that none of the benchmarks had a three-copy simultaneous runtime equal to the runtime of executing one copy. In an ideal situation where copies of a benchmark do not interfere, the runtime of one copy of the benchmark executed alone should equal the runtime of four copies executed simultaneously. This means that at some point during the execution the benchmark copies interfere, causing the latency in the runtime. In this part of the performance test, sequential execution was not considered, since it would present a very high runtime that is not comparable with the simultaneous execution. The test results showed that benchmark 429.mcf had a 5% increase in the runtime of three simultaneous copies compared to the runtime of two copies. 401.bzip2 and 403.gcc also had a minor increase of 2%. On the other hand, the benchmarks 445.gobmk, 456.hmmer, 458.sjeng and 464.h264ref had the same runtime for three and two copies executed simultaneously.

Figure 4: Runtime results for the one, two, three and four copies tests of the C benchmarks

In the simultaneous execution of four copies, the results were almost the same as for the three copies tests. The runtime of four simultaneous copies of 429.mcf and 403.gcc was 4% more than the runtime of three copies, while the increase was only 2% for 401.bzip2. Again, the benchmarks 445.gobmk, 456.hmmer, 458.sjeng and 464.h264ref showed no difference between the runtimes of three and four simultaneous copies.

The tests above showed that executing two copies of a benchmark on two cores of the CPU gives lower per-copy performance than executing one copy alone, whereas executing three or four copies performs the same as executing two. From these results it was concluded that the performance of the CPU varies mainly between single-core and multi-core use, rather than with the number of cores in use, provided the copies in the multiple-copies tests have enough physical memory.

4.1.3 Five Copies

When executing five copies simultaneously on the quad-core processor, one of the cores will use hyper-threading. The two processes executed in this core share the core's execution resources, which slows down the performance of these two processes.

Table 3: Comparing the runtime(s) for one, four and five copies of the C benchmarks; the five copies results are divided into the three copies executed in separate cores and the two copies executed in the same core

                                              Five copies simultaneously
BENCHMARK     One copy   Four copies      Three copies in    Two copies in
                         simultaneously   separate cores     the same core
429.mcf
401.bzip2
403.gcc
445.gobmk
456.hmmer
458.sjeng
464.h264ref

As can be seen in Table 3 above, when executing five copies simultaneously, the three copies executed in separate cores had the same runtime as in the four copies test. The other two copies, executed in one core using hyper-threading, had an increase in runtime that varied between the benchmarks.
429.mcf and 464.h264ref had the highest increase in the runtime of the two copies executed using hyper-threading: 83% compared to the runtime of the copies executed in separate cores. The interference between the two copies here was very high. In this case, executing them sequentially would be considered if

power consumption has a higher priority than performance. For the rest of the benchmarks, the runtime of the two copies executed using hyper-threading was around 50% higher than that of the other three copies. Therefore, none of the benchmarks used for this test benefited from having two copies executed in one core sharing its resources.

4.1.4 Five Copies of 403.gcc Benchmark

Figure 5 below shows the runtime results for five copies of 403.gcc executed simultaneously. The first three copies were executed in separate cores; their runtimes were the same as those for four copies of 403.gcc executed simultaneously. The last two copies were executed in the same core. Since the core is multithreaded, each copy was executed in a separate thread. These two copies showed a significant increase in runtime: 56% compared to the copies executed in separate cores. This increase is the result of sharing the core's L1 and L2 caches and its execution units. The same difference in runtime also appears when executing six and seven copies. When executing eight copies, all of the cores use hyper-threading, so all the runtimes increase.

Figure 5: Runtime results for five copies of 403.gcc executed simultaneously

Figure 6: Runtime results for eight copies of 403.gcc executed simultaneously

4.1.5 Eight Copies of 403.gcc Benchmark

To test hyper-threading in all the cores, eight copies of 403.gcc were executed simultaneously. Figure 6 above shows that the runtimes of all the copies are in the same range. Since all cores are executing two copies each using hyper-threading, the runtime increased. The eight copies' runtimes also increased by around 7% compared to the two copies run in the same core in the five copies test shown in Figure 5. The reason for the increase is the execution of more copies, which implies more accesses to the L3 cache.
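The "two copies in the same core" configuration above can be set up by hand by pinning both processes to the pair of logical CPUs that share one physical core. Which logical CPUs are hyper-threading siblings is machine-specific, so the sketch below reads it from Linux's sysfs topology files rather than hard-coding a layout; the helper names are assumptions, and `taskset` is the util-linux tool.

```shell
# Sketch (assumes Linux sysfs and util-linux taskset): run two commands
# confined to the two hyper-threads of one physical core.
siblings_of() {
    # e.g. prints "0,4" on a typical Core i7 920 layout for cpu0
    cat "/sys/devices/system/cpu/cpu$1/topology/thread_siblings_list"
}

run_pair_same_core() {
    pair=$(siblings_of 0)        # both logical CPUs of physical core 0
    taskset -c "$pair" "$@" &    # first copy, confined to that core
    taskset -c "$pair" "$@" &    # second copy shares the same core
    wait
}
```

Comparing the elapsed time of `run_pair_same_core` against two unpinned copies reproduces the same-core slowdown measured here without needing five running copies.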

4.2 The C++ Benchmarks

There are two C++ benchmarks in the SPEC CPU2006 suite. Table 4 presents the test results for these benchmarks. In the two copies test, the increase in runtime compared to a one-copy runtime was 14% for 471.omnetpp and 8% for 473.astar respectively; for the same test on the C benchmarks, the average increase was 70%. The C and C++ benchmarks also showed different results in the three copies test. While the C benchmarks showed the same runtimes for the three and two copies tests, the C++ benchmarks had a 41% (471.omnetpp) and 63% (473.astar) increase in runtime when executing three copies compared to two. As with the C benchmarks, the C++ benchmarks had the same results for both the three and four copies tests. In the five copies test, the C++ benchmark copies executed in separate cores had the same runtime as in the four copies test, while the two copies executed in the same core had a 42% increase in runtime compared to the copies executed in separate cores.

Table 4: Runtime results for multiple copies of the C++ benchmarks

                                                     Five copies
Benchmark     1 copy  2 copies  3 copies  4 copies   Separate cores  Same core
                                                     (3 copies)      (2 copies)
471.omnetpp
473.astar

4.3 The Effects of Disabling Inactive Cores

In this test, cores that are not in use are disabled one by one to test the change in runtime. Each core has two execution threads, and one thread in a core can be disabled while the other is still active. The benchmark 403.gcc was chosen for this test; it has nine workloads that are executed sequentially. Executing two copies of 403.gcc with four or eight threads enabled had a runtime of 655s. However, as can be seen in Table 5, when the execution was done with two, three, five, six or seven threads enabled, the runtime was approximately 416s. Similarly, when executing three copies with six or seven threads enabled, the CPU performed better.
In the four copies execution, the runtime only got lower when six threads were enabled.

Table 5: The test results for disabling inactive cores while executing the 403.gcc benchmark

Number of         Runtime(s) for each number of execution threads enabled
403.gcc copies
2 copies
3 copies
4 copies
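The per-thread enabling and disabling used in this test is exposed through Linux's CPU hotplug interface in sysfs. Writing to the `online` files requires root, so the sketch below only reads the current state and shows the write as a root-only helper; the function names are assumptions.

```shell
# Sketch (assumes Linux CPU hotplug sysfs): query and, as root, change
# which execution threads (logical CPUs) are enabled.
list_online() {
    # range list of currently enabled logical CPUs, e.g. "0-7"
    cat /sys/devices/system/cpu/online
}

set_thread() {
    # set_thread N 0|1 -- disable or enable logical CPU N.
    # Requires root; cpu0 usually cannot be taken offline.
    echo "$2" > "/sys/devices/system/cpu/cpu$1/online"
}
```

Running the benchmark copies after `set_thread 7 0`, `set_thread 6 0`, and so on reproduces the thread counts in Table 5.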

4.4 Executing Two Processes in One Core

4.4.1 Two Benchmark Copies Executed in the Same Core

The main interference and delay happens when running two processes in the same core. Because they share that core's resources, the runtime of these processes differs from the runtime of processes executed in separate cores. In the five copies test, two of the five copies were executed in the same core, and the results showed that none of the benchmarks benefited from the execution of these two copies. In this section, the same idea is tested, but with only two copies being executed. The test was done after a kernel update that improved the performance of the CPU, so the results produced by this test are not comparable with the previous test results.

The benchmark 429.mcf was chosen for this test because it has only one workload, which can easily be assigned to a specific core. After the update, executing two copies of 429.mcf simultaneously in different cores had a runtime of 353s. Executing two copies in the same core showed an excessive increase in runtime: 663s. Executing the two copies sequentially would therefore produce a better result.

4.4.2 Cache Access Counters

The cache tested here is the L3 cache. The Perfmon2 interface provides access to the performance counter LLC_REFERENCES, which counts how many times the last-level cache (L3) was referenced by the execution thread. Another performance counter used is UNHALTED_CPU_CYCLES. The process executed in this test was a C program provided by my supervisor; it creates a number of structs and keeps accessing them a finite number of times. The Linux command time measured the CPU time of the program. Executing two copies of the program in the same core had a 47% increase in runtime compared to executing them in separate cores.
In addition, the performance counter UNHALTED_CPU_CYCLES showed that execution in the same core used 52% more CPU cycles than execution in two cores. The L3 cache access counter LLC_REFERENCES also increased, by 37%. The large increase in LLC_REFERENCES therefore contributes to the increased runtime of the two copies of the program.
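Perfmon2 has since been superseded in mainline Linux by the perf_events interface, so the same two measurements can be taken today with the perf tool, using its generic event names (`cycles` for UNHALTED_CPU_CYCLES, `LLC-loads` approximating LLC_REFERENCES). The wrapper below is a sketch of that substitute, not the tooling used in the project.

```shell
# Sketch (perf_events, the successor of the Perfmon2 interface used in
# the project): count cycles and last-level cache references.
count_llc() {
    command -v perf >/dev/null 2>&1 || { echo "perf not installed"; return 0; }
    # cycles ~ UNHALTED_CPU_CYCLES, LLC-loads ~ LLC_REFERENCES
    perf stat -e cycles,LLC-loads -- "$@" 2>&1
}
```

For example, `count_llc ./cache_walker` (where `cache_walker` stands in for the supervisor's struct-accessing C program) prints both counts after the run.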

5 The Scheduler

In this chapter, the scheduling algorithm designed in this project is presented. First, the scheduler's objective and method are discussed. Then its structure and the results produced when using it are shown. Finally, features provided by the CPU and operating system that could also achieve the scheduler's objective are explained.

5.1 The Objective of the Scheduler

As a first step towards scheduling, a simple issue was chosen for the scheduler to solve. The issue was with the 429.mcf benchmark: after an update to the Linux kernel, the benchmark experienced problems. When more than three copies of the benchmark were executed, the runtime increased dramatically. As shown in Figure 7, the lowest runtime among four copies executed simultaneously after the update was 100s more than the runtimes before the update, and the other three runtimes after the update were more than double the runtimes before the update.

Figure 7: Runtime results for four copies of 429.mcf before and after the kernel update

Figure 8: Runtime results for eight copies of 429.mcf executed using the scheduler

The reason for the increase is that there is not enough RAM for all the copies; as a result, the copies access memory at the same time, causing the excessive delay. The first approach tried was executing the copies in different periods. Although this allowed one copy to finish in a normal time, it didn't solve the issue for the other copies. The scheduler therefore adopted a more efficient method: run a maximum of three copies at a time, dividing the copies into groups of three or fewer. As soon as one group finishes executing, the scheduler executes the next group. The scheduler also takes into account that the maximum number of benchmark copies CPU2006 can execute concurrently on the Core i7 920 is eight. Runtime results for running eight copies using this scheduler can be seen in Figure 8 above.

5.2 The Scheduler's Structure

The scheduler is a Linux shell script: a series of commands written in a plain text file. The first task the scheduler has to do is count the number of 429.mcf copies the user has executed. That is done using the command:

-> pgrep mcf_base.none | wc -l

pgrep returns all the process IDs (pids) of the 429.mcf benchmark copies being executed, each pid on a separate line; these pids are saved in local variables. The output of pgrep is piped to wc -l, which counts the number of lines and therefore gives the number of 429.mcf copies in execution.

After counting the copies and saving their pids, the next step for the scheduler is to stop the execution of some benchmark copies temporarily. That is done with the kill command, sending the STOP signal to a given process using its pid:

-> kill -STOP pid

The STOP signal simply suspends the execution of the process. Stopping all copies except three allows those three to run without the delay. If the number of executed processes is three or fewer, the scheduler does not alter the execution of the processes.

Next, the scheduler waits in a while loop until the copies in execution finish. Inside the loop, the scheduler counts the number of copies that are either executing or waiting; consequently, it can detect when a group finishes execution. When the first three copies finish running, the scheduler continues the execution of the next group of copies by sending each stopped process the CONT signal, again using the kill command. Another issue here is that two copies could be assigned to execute in the same core, using both of its threads.
Since the maximum number of benchmark copies running concurrently is three, it is possible to run each one in a separate core. The scheduler uses the taskset command to assign copies to logical processors:

-> taskset -cp CPU-number pid

This command assigns the process with the given pid to be executed on the logical processor numbered CPU-number. Up to this point, the scheduler can schedule up to six copies of the benchmark executed simultaneously. If the number of processes is seven or eight, the scheduler simply repeats the steps: it waits for the processes in execution to finish, then continues the execution of the remaining copies and assigns them to separate cores.
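Putting these commands together, the scheduler's control flow can be sketched as a small POSIX shell function. This is a hypothetical reconstruction, not the original script: `batch_run` and its arguments are invented names, the original acted specifically on mcf_base.none, and it polls with a sleep loop where the original's while loop is not specified in detail.

```shell
# Sketch: stop all but BATCH copies of the processes named NAME, pin each
# running copy to its own logical CPU, and resume the next group whenever
# the current group finishes.  Usage: batch_run NAME [BATCH]
batch_run() {
    name=$1
    batch=${2:-3}

    # pgrep NAME | wc -l idea: one pid per line
    pids=$(pgrep -x "$name") || return 0
    set -- $pids
    [ "$#" -le "$batch" ] && return 0     # three or fewer: leave untouched

    # Park every copy beyond the first group.
    i=0
    for pid in "$@"; do
        i=$((i + 1))
        [ "$i" -gt "$batch" ] && kill -STOP "$pid" 2>/dev/null
    done

    # Run the copies group by group.
    while [ "$#" -gt 0 ]; do
        n=$batch
        [ "$#" -lt "$n" ] && n=$#
        cpu=0 group="" j=0
        for pid in "$@"; do
            j=$((j + 1))
            [ "$j" -gt "$n" ] && break
            kill -CONT "$pid" 2>/dev/null               # no-op for group one
            taskset -cp "$cpu" "$pid" >/dev/null 2>&1 || true  # one copy per core
            cpu=$((cpu + 1))
            group="$group $pid"
        done
        shift "$n"
        # The while-loop wait from the description: poll until the whole
        # group has exited, then start the next group.
        for pid in $group; do
            while kill -0 "$pid" 2>/dev/null; do sleep 1; done
        done
    done
}
```

With eight copies of the benchmark running, `batch_run mcf_base.none 3` would let three run pinned to separate cores, then three more, then the final two, matching the grouping described above.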

5.3 Scheduler Results

Table 6 shows the runtimes of 429.mcf copies from two tests. Test 1 was done without the scheduler and suffered a massive delay in the runtimes, the lowest being 8203s. The scheduler was used in Test 2 to avoid the issue, and the results showed the benefit of using it: comparing the highest runtimes in the two tests, there is an 89% decrease in runtime when using the scheduler.

Table 6: Comparison of runtime(s) results between two tests, each executing eight copies of 429.mcf

Copy number    Test 1    Test 2 (scheduler)
1
2
3
4
5
6
7
8

5.4 Paging and Physical Address Extension (PAE)

Increasing the memory space available to the CPU was another way to solve the issue 429.mcf faced. The problem was that the copies needed a large amount of random access memory (RAM), and when running concurrently they interfered with each other's memory accesses. Although the machine had 6GB of RAM, the processor could only use 4GB because of the number of physical address bits. As a result, a feature called physical address extension (PAE) was used. PAE extends the physical address from 32 bits to 36 bits, which increases the maximum physical memory size from 4GB to 64GB; with it, all 6GB of RAM can be accessed by the processor.

Another feature, swapping, was also used. Linux divides physical RAM into blocks called pages. Swapping copies a page from RAM to a preconfigured space on the hard disk called the swap space, freeing that page's space in RAM. The size of the RAM plus the swap space is the total size of the virtual memory. After increasing the memory size using these two features, up to seven copies of 429.mcf could be executed simultaneously without any problem; however, the problem still existed when executing eight copies.
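The capacity figures follow directly from the address widths; a one-line shell check confirms the arithmetic:

```shell
# A 32-bit physical address reaches 2^(32-30) = 4 GB of memory;
# PAE's 36-bit physical address reaches 2^(36-30) = 64 GB.
echo "$((1 << (32 - 30)))GB"   # prints 4GB
echo "$((1 << (36 - 30)))GB"   # prints 64GB
```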

6 Conclusions

This conclusion summarises what I have done and learnt during the period of working on this project, and presents some recommendations for further performance tests and scheduler improvements.

6.1 Summary

In section 1.3, two objectives were listed. The first objective, testing and analysing the performance of the CPU, had three methodologies. The first was executing different numbers of copies of the benchmarks and comparing the results, and testing how the CPU performs when inactive cores are disabled; this was done, and its results and analysis are presented in sections 4.1, 4.2 and 4.3. The second was examining the difference between executing two copies in separate cores and executing them in the same core; this test was done, but due to the lack of comparable results, the analysis is not complete. The third was performing random tests to inspect how the CPU employs its L3 cache; this was performed in association with the second methodology, and the results are presented in section 4.4.

The second objective was to design a scheduler to maximise the performance of the CPU. With the available test results, a scheduler was created for a specific benchmark; its structure is described in chapter 5.

6.2 Further Performance Tests

The tests done in this project executed multiple copies of the same benchmark simultaneously. This could be taken further by executing different benchmarks simultaneously, examining how they interact, and identifying processes that benefit from being executed simultaneously in the same core using hyper-threading. In addition, more tests using the benchmarks and the performance counters would give more understanding of how the CPU uses the L3 cache.

6.3 Improving the Scheduler

In creating the scheduling algorithm for the 429.mcf benchmark, I learned how to alter a running process.
Now it is known how to obtain a specific process ID, stop the process, continue it, and assign it to a specific processor. Given more time, this knowledge, combined with further tests, could be used to create a general scheduler that increases the performance of the Core i7.


Bias Scheduling in Heterogeneous Multi-core Architectures

Bias Scheduling in Heterogeneous Multi-core Architectures Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that

More information

CS3350B Computer Architecture CPU Performance and Profiling

CS3350B Computer Architecture CPU Performance and Profiling CS3350B Computer Architecture CPU Performance and Profiling Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada

More information

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01. Hyperthreading ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf Hyperthreading is a design that makes everybody concerned believe that they are actually using

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 06: Multithreaded Processors

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 06: Multithreaded Processors Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors Lesson 06: Multithreaded Processors Objective To learn meaning of thread To understand multithreaded processors,

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 21 Main Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Why not increase page size

More information

A Case Study in Optimizing GNU Radio s ATSC Flowgraph

A Case Study in Optimizing GNU Radio s ATSC Flowgraph A Case Study in Optimizing GNU Radio s ATSC Flowgraph Presented by Greg Scallon and Kirby Cartwright GNU Radio Conference 2017 Thursday, September 14 th 10am ATSC FLOWGRAPH LOADING 3% 99% 76% 36% 10% 33%

More information

Two hours - online. The exam will be taken on line. This paper version is made available as a backup

Two hours - online. The exam will be taken on line. This paper version is made available as a backup COMP 25212 Two hours - online The exam will be taken on line. This paper version is made available as a backup UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE System Architecture Date: Monday 21st

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Performance Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Defining Performance (1) Which airplane has the best performance? Boeing 777 Boeing

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Power Measurements using performance counters

Power Measurements using performance counters Power Measurements using performance counters CSL862: Low-Power Computing By Suman A M (2015SIY7524) Android Power Consumption in Android Power Consumption in Smartphones are powered from batteries which

More information

ECE 172 Digital Systems. Chapter 15 Turbo Boost Technology. Herbert G. Mayer, PSU Status 8/13/2018

ECE 172 Digital Systems. Chapter 15 Turbo Boost Technology. Herbert G. Mayer, PSU Status 8/13/2018 ECE 172 Digital Systems Chapter 15 Turbo Boost Technology Herbert G. Mayer, PSU Status 8/13/2018 1 Syllabus l Introduction l Speedup Parameters l Definitions l Turbo Boost l Turbo Boost, Actual Performance

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER Hardware Sizing Using Amazon EC2 A QlikView Scalability Center Technical White Paper June 2013 qlikview.com Table of Contents Executive Summary 3 A Challenge

More information

Power Control in Virtualized Data Centers

Power Control in Virtualized Data Centers Power Control in Virtualized Data Centers Jie Liu Microsoft Research liuj@microsoft.com Joint work with Aman Kansal and Suman Nath (MSR) Interns: Arka Bhattacharya, Harold Lim, Sriram Govindan, Alan Raytman

More information

Sandbox Based Optimal Offset Estimation [DPC2]

Sandbox Based Optimal Offset Estimation [DPC2] Sandbox Based Optimal Offset Estimation [DPC2] Nathan T. Brown and Resit Sendag Department of Electrical, Computer, and Biomedical Engineering Outline Motivation Background/Related Work Sequential Offset

More information

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4 Motivation Threads Chapter 4 Most modern applications are multithreaded Threads run within application Multiple tasks with the application can be implemented by separate Update display Fetch data Spell

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor*

Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Tyler Viswanath Krishnamurthy, and Hridesh Laboratory for Software Design Department of Computer Science Iowa State University

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 10 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Chapter 6: CPU Scheduling Basic Concepts

More information

CS2506 Quick Revision

CS2506 Quick Revision CS2506 Quick Revision OS Structure / Layer Kernel Structure Enter Kernel / Trap Instruction Classification of OS Process Definition Process Context Operations Process Management Child Process Thread Process

More information

ò mm_struct represents an address space in kernel ò task represents a thread in the kernel ò A task points to 0 or 1 mm_structs

ò mm_struct represents an address space in kernel ò task represents a thread in the kernel ò A task points to 0 or 1 mm_structs Last time We went through the high-level theory of scheduling algorithms Scheduling Today: View into how Linux makes its scheduling decisions Don Porter CSE 306 Lecture goals Understand low-level building

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems

Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems Thread to Strand Binding of Parallel Network Applications in Massive Multi-Threaded Systems Petar Radojković Vladimir Čakarević Javier Verdú Alex Pajuelo Francisco J. Cazorla Mario Nemirovsky Mateo Valero

More information

Scheduling. Don Porter CSE 306

Scheduling. Don Porter CSE 306 Scheduling Don Porter CSE 306 Last time ò We went through the high-level theory of scheduling algorithms ò Today: View into how Linux makes its scheduling decisions Lecture goals ò Understand low-level

More information

Response Time and Throughput

Response Time and Throughput Response Time and Throughput Response time How long it takes to do a task Throughput Total work done per unit time e.g., tasks/transactions/ per hour How are response time and throughput affected by Replacing

More information

Power Management for Embedded Systems

Power Management for Embedded Systems Power Management for Embedded Systems Minsoo Ryu Hanyang University Why Power Management? Battery-operated devices Smartphones, digital cameras, and laptops use batteries Power savings and battery run

More information

POWER MANAGEMENT AND ENERGY EFFICIENCY

POWER MANAGEMENT AND ENERGY EFFICIENCY POWER MANAGEMENT AND ENERGY EFFICIENCY * Adopted Power Management for Embedded Systems, Minsoo Ryu 2017 Operating Systems Design Euiseong Seo (euiseong@skku.edu) Need for Power Management Power consumption

More information

OneCore Storage Performance Tuning

OneCore Storage Performance Tuning OneCore Storage Performance Tuning Overview To improve Emulex adapter performance while using the OneCore Storage Linux drivers in a multi-core CPU environment, multiple performance tuning features can

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Pharmacy college.. Assist.Prof. Dr. Abdullah A. Abdullah

Pharmacy college.. Assist.Prof. Dr. Abdullah A. Abdullah The kinds of memory:- 1. RAM(Random Access Memory):- The main memory in the computer, it s the location where data and programs are stored (temporally). RAM is volatile means that the data is only there

More information

Software within building physics and ground heat storage. HEAT3 version 7. A PC-program for heat transfer in three dimensions Update manual

Software within building physics and ground heat storage. HEAT3 version 7. A PC-program for heat transfer in three dimensions Update manual Software within building physics and ground heat storage HEAT3 version 7 A PC-program for heat transfer in three dimensions Update manual June 15, 2015 BLOCON www.buildingphysics.com Contents 1. WHAT S

More information

5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction

5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner. Topic 1: Introduction 5DV118 Computer Organization and Architecture Umeå University Department of Computing Science Stephen J. Hegner Topic 1: Introduction These slides are mostly taken verbatim, or with minor changes, from

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User?

Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User? Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User? Andrew Murray Villanova University 800 Lancaster Avenue, Villanova, PA, 19085 United States of America ABSTRACT In the mid-1990

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information

DEMYSTIFYING INTEL IVY BRIDGE MICROARCHITECTURE

DEMYSTIFYING INTEL IVY BRIDGE MICROARCHITECTURE DEMYSTIFYING INTEL IVY BRIDGE MICROARCHITECTURE Roger Luis Uy College of Computer Studies, De La Salle University Abstract: Tick-Tock is a model introduced by Intel Corporation in 2006 to show the improvement

More information

Position Paper: OpenMP scheduling on ARM big.little architecture

Position Paper: OpenMP scheduling on ARM big.little architecture Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM

More information

Computer Architecture. Introduction

Computer Architecture. Introduction to Computer Architecture 1 Computer Architecture What is Computer Architecture From Wikipedia, the free encyclopedia In computer engineering, computer architecture is a set of rules and methods that describe

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1 Why? High-Performance Multicores for Real-Time Systems

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2019 Lecture 8 Scheduling Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ POSIX: Portable Operating

More information

Device-Functionality Progression

Device-Functionality Progression Chapter 12: I/O Systems I/O Hardware I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Incredible variety of I/O devices Common concepts Port

More information

Chapter 12: I/O Systems. I/O Hardware

Chapter 12: I/O Systems. I/O Hardware Chapter 12: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations I/O Hardware Incredible variety of I/O devices Common concepts Port

More information

Multiprocessor Support

Multiprocessor Support CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the

More information

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models Jason Andrews Agenda System Performance Analysis IP Configuration System Creation Methodology: Create,

More information

V. Primary & Secondary Memory!

V. Primary & Secondary Memory! V. Primary & Secondary Memory! Computer Architecture and Operating Systems & Operating Systems: 725G84 Ahmed Rezine 1 Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information