Paulo Henrique Schmidt Castellani. Understanding Linux's CPU Resource Management

Size: px

Start display at page:

Download "Paulo Henrique Schmidt Castellani. Understanding Linux's CPU Resource Management"

Cory Davis
6 years ago
Views:

1 Paulo Henrique Schmidt Castellani Understanding Linux's CPU Resource Management Dissertação apresentada ao Programa de Pós- Graduação em Computação da Universidade Federal Fluminense, como requisito parcial para obtenção do Grau de Mestre. Área de Concentração: Redes e Sistemas Distribuídos e Paralelos. Orientador: Prof. Dr. Eugene Francis Vinod Rebello Niterói 2013

2 PAULO HENRIQUE SCHMIDT CASTELLANI UNDERSTANDING LINUX'S CPU RESOURCE MANAGEMENT Dissertação apresentada ao Programa de Pós- Graduação em Computação da Universidade Federal Fluminense, como requisito parcial para obtenção do Grau de Mestre. Área de Concentração: Redes e Sistemas Distribuídos e Paralelos. Aprovada em Dezembro de BANCA EXAMINADORA Prof. Dr. Eugene Francis Vinod Rebello - Orientador UFF Prof. Dr. Alberto Ferreira De Souza UFES Prof. Dr. Maria Cristina Silva Boeres UFF Niterói 2013

3 Resumo Sistemas computacionais estão repletos de capacidades e funcionalidades com o propósito de prover as necessidades de diversos tipos de aplicações. Com o propósito de maximizar a utilização desses dispositivos a gerência dos recursos providos, dado certo perfil de utilização, sempre foi um aspecto fundamental quando lidando com dispositivos computacionais. Visando solucionar esse problema autonomamente os sistemas operacionais foram desenvolvidos para gerenciar a execução das variadas aplicações dados recursos limitados. Entretanto o aumento do nível de complexidade do hardware, como canais de memória, design de cache, tipos de processadores e conjuntos de instrução, assim como a diversificação dos tipos de aplicações está inviabilizando o encontro da solução ótima para todos as combinações em tempo de execução. Portanto recentemente os sistemas operacionais vem buscando minimizar o custo dos seus escalonadores, provendo mecanismos de gerência configuráveis que no lugar de buscar uma solução ótima buscam oferecer mecanismos para que um agente em um nível mais alto possa definir o modo de operação mais adequado. Motivado pela possibilidade de se ter aplicações concorrentes executando sob políticas de escalonamento diferentes, este trabalho examina os mecanismos e ferramentas de gerenciamento disponibilizadas em versões mais recentes do sistema operacional Linux (particularmente na versão 3.6). Diferenças no comportamento do sistema sob diversos senários de utilização foram avaliados com especial atenção ao agrupamento de processos e, dada o atual interesse no modelo de computação nas nuvens, os efeitos de do uso de máquinas virtuais. O objetivo é compilar informação suficiente para um modelo de escalonamento de modo que desenvolvedores de um sistema de gerencia em mais alto nível para ambientes de larga escala, compostos por diversos sistemas independentes, como datacenters, provedores de nuvens, clusters ou mesmo servidores individuais, possam se beneficiar de políticas de escalonamento customizadas para cada grupo ou classe de aplicações. Palavras Chave: Linux, Kernel do Linux, Políticas de Escalonamento, CFS, Virtualização, KVM, Benchmarks.

4 Abstract Computing systems are packed with capacity and functionality in order to be able to address the different needs of a wide range of applications. In order to maximize the utilization of computational devices, the management of the resources they provide, under different usage profiles, has always been of great concern. With the aim of solving this problem autonomously, operating systems were developed to manage the execution of multiple applications and allow them to share efficiently the limited resources. However, due to growing hardware capacity and complexity, like increasing core counts, differing memory and cache designs and enhanced instruction sets, as well as the diversification in the kinds of applications being used, finding the optimal runtime solution for all possible combinations of execution scenarios is becoming unfeasible. Therefore, modern operating systems are seeking to reduce their scheduling overheads and provide configurable management interfaces so that a higher level system management agents can take on the responsibility to define dynamically appropriate operating profiles. Motivated by the possibility to have concurrent applications executing under different scheduling policies, this work examines the management tools and mechanisms made available by more recent versions of the Linux operating system (in particular kernel version 3.6). Differences in system behavior under diverse usage scenarios and configurations have been evaluated with special attention to process grouping and, given the current interest in cloud computing, the effects of virtual machines. The objective is to compile sufficient information so that a scheduling model for designing higher level system managers for large scale environments made up of several independent systems, like data centers, cloud providers, clusters or individual servers, could benefit from custom scheduling policies for each group or class of applications. Key words: Linux, Linux Kernel, Scheduling Policies, CFS, Virtualization, KVM, Benchmarking.

5 Index of Ilustrations Illustration 1: O(1) Scheduler Data Structure [35]...15 Illustration 2: Scheduling Classes [35]...16 Illustration 3: CFS Data Structure [35]...17 Illustration 4: CFS Red and Black Tree [35]...18 Illustration 5: CFS Runtime weight vs. nice level...19 Illustration 6: Composite scheduler runtime tunning results Illustration 7: SciMark Dense LU Matrix Factorization Illustration 8: 7-Zip Compression Illustration 9: CGROUP contention via cpuset and bandwidth at 50% Illustration 10: VP8 libvpx Encoding Illustration 11: CacheBench Write Illustration 12: Graphics Magick Resizing...120

6 Index of Tables Table 1: Kernel scheduler runtime parameters...28 Table 2: Average process time on Host...47 Table 3: Total group time on Host...49 Table 4: Average CS on Host...51 Table 5: Average process time on Host with 33% higher load...53 Table 6: Total group time on Host with 33% higher load...54 Table 7: Average CS on Host with 33% higher load...55 Table 8: Average process time on Host with HZ100 timer...57 Table 9: Total group time on Host with HZ100 timer...58 Table 10: Average CS on Host with HZ100 timer...59 Table 11: Total group time on Host with debug enabled...60 Table 12: Average process time on Host with debug enabled...61 Table 13: Average CS on Host with debug enabled...61 Table 14: Average process time on Host with background...62 Table 15: Total group time on Host with background...65 Table 16: Average CS on Host with background...66 Table 17: Average process time on Host with background without groups...68 Table 18: Total group time on Host with background without groups...69 Table 19: Average CS on Host with background without groups...70 Table 20: Average process time on Host with background also grouped...72 Table 21: Average process time % relative to host...72 Table 22: Total group time on Host with background also grouped...73 Table 23: Total group time % relative to host...73 Table 24: Average CS on Host with background also grouped...74 Table 25: Average process time on Host with six background processes instead of three...75 Table 26: Total group time on Host with six background processes instead of three...76 Table 27: Average CS on Host with six background processes instead of three...77 Table 28: Average process time on Host with background and HZ100 timer...78 Table 29: Total group time on Host with background and HZ100 timer...79 Table 30: Average CS on Host with background and HZ100 timer...80 Table 31: Average process time on a VM...81 Table 32: Total group time on a VM...82 Table 33: Average CS on a VM...83 Table 34: Average process time on a VM with HZ100 timer...84

7 Table 35: Total group time on a VM with HZ100 timer...85 Table 36: Average CS on a VM with HZ100 timer...85 Table 37: Average process time on a VM with background on the same VM...87 Table 38: Total group time on a VM with background on the same VM...88 Table 39: Average CS on a VM with background on the same VM...89 Table 40: Average process time on a VM with background on the Host...90 Table 41: Total group time on a VM with background on the Host...91 Table 42: Average CS on a VM with background on the Host...92 Table 43: Average process time on a VM with background on another VM...93 Table 44: Total group time on a VM with background on another VM...94 Table 45: Average CS on a VM with background on another VM...94 Table 46: VM with background on another VM relative to host with background without groups...95 Table 47: Average process time on a VM with background also grouped but on the Host...96 Table 48: Total group time on a VM with background also grouped but on the Host...97 Table 49: Average CS on a VM with background also grouped but on the Host...97 Table 50: Average process time on a XVM...99 Table 51: Total group time on a XVM Table 52: Average CS on a XVM Table 53: Average process time on a XVM with HZ100 timer Table 54: Total group time on a XVM with HZ100 timer Table 55: Average CS on a XVM with HZ100 timer Table 56: Average process time on a XVM with background on the same XVM Table 57: Total group time on a XVM with background on the same XVM Table 58: Average CS on a XVM with background on the same XVM Table 59: Average process time on a XVM with background on the Host Table 60: Total group time on a XVM with background on the Host Table 61: Average CS on a XVM with Background on the Host Table 62: Average process time on a XVM with background on another XVM Table 63: Total group time on a XVM with background on another XVM Table 64: Average CS on a XVM with Background on another XVM Table 65: XVM with background on another XVM relative to the same scenario with VMs Table 66: Average process time on a XVM with background also grouped but on the Host Table 67: Total group time on a XVM with background also grouped but on the Host Table 68: Average CS on a XVM with background also grouped but on the Host...113

8 Index Resumo... 3 Abstract Introduction Context Motivation Objectives Contribution Structure of the dissertation The Evolution of the Linux Scheduler Tools and Mechanisms Hardware Information Kernel Compilation time options Scheduling Timers CPU frequency governors Virtual Machines QEMU-KVM Options Guest Kernel Summary Practical Validation Relevant metrics for the model Methodology Organization of the experiments System Setup and Configuration Experiments Presentation of Results Bare metal Host Host with higher load HZ Debug Background Background without groups Background also grouped Background with saturation... 74

9 5.2.9 Background HZ Virtual Machines VM Reference VM HZ Background within itself Background on Host Background on another VM Background on Host grouped Extended Virtual Machines Plain XVM XVM HZ Background within itself Background on Host Background on another XVM Background on Host grouped Standard Benchmarking Scheduler runtime tunables CPUSET and CPU Bandwidth CGROUPs Summary Conclusion Future Work Bibliography

10 10 1 Introduction The administration of computational system resource usage was crucial at the onset of computing when hardware resources were scarce, as an example, the computer spreadsheet program VisiCalc from 1979 was able to be executed on an Apple II which was based on an 8 bit MOS Technology 6502 microprocessor running at 1 MHz with 32KB (Apple II ranged from 4KB to 48KB, VisiCalc required 32KB) of RAM. After some time, with large amounts of work already laid out, development costs outweighed hardware costs where adding more functionality was more important than optimization, that is, purchasing some amount more of hardware resources was cheaper than optimizing the software so that it would demand less resources by the same amount. For instance, the last version of OpenOffice from Oracle, when development was dropped at version 3.3 in 2011, required 512MB of RAM on a Macintosh (it did not specify a minimum processor requirement, but it did specify OSX 10.4 which in turn required at least a 300MHz PowerPC G3). This seemed justifiable given the additional functions and the more refined interfaces, except that there was a reversal in this trend with the new LibreOffice (4.2) from the Document Foundation, which demands only a Pentiumcompatible PC (Pentium III, Athlon or more-recent system recommended) and 256 Mb RAM (512 Mb RAM recommended) albeit having many new functions and a similar interface. Until recently, the mentality was that the tools used for management and development, from hardware control units to operating systems in general and from compilers to runtime overlays and frameworks, could do by themselves a good enough of a job, so that what would have been spent on manually optimizing systems could actually be better spent on something else. Also, average utilization in servers was low so the spare capacity could be leveraged. That, however, proved to be a fallacy, adding more hardware resources to solve problems incurs additional support and maintenance costs, additional volume software licensing fees, when applicable, and environmental concerns like the proper recycling of electronics, the availability of raw materials, and energy consumption and dissipation. Also, adding more hardware resources does not scale infinitely, as is the case for some classes of workloads or for personal workstations in general and, thanks to consolidation, globalization of demand and cloud

11 11 strategies, average server usage can be made high. Therefore, it is the intent of this work to help administrators and developers to better understand the tools at their disposal, in particular those made available by Linux based operating systems so that they can then build on top of them, either individually or in scale as part of distributed systems. For this purpose, a normal Gentoo Linux system based on the mainline Linux kernel, at the time the experiments were started in this work, version 3.6, will be analyzed, with particular attention being paid to the scheduler. 1.1 Context The spreading of global demand (geographically, due to better connectivity) and of dynamic services (either due to custom, user generated or interactive content), along with the enhanced set of skills needed to operate the infrastructure are nowadays changing the environment for how computer services must be provided. The spread of global demand implies an oscillation of resource usage required for servicing those different active populations in given time-zones, each with differences in size and behavior, during meaningful time-frames. Dynamic content implies different resource usage, even if the number of users, their geographic distribution and available content offerings were constant. These changes are resulting in a sharp increase in adoption of the cloud computing model where computational resources are offered as a service. That may be surprising to some as there are several severe disadvantages, notably, how resource utilization is measured in relation to how effectively it is used, that is, the manager of the cloud is both the one measuring and charging for usage and the one responsible for ensuring how efficiently the system is used by the applications. There are legal issues like applicable legal jurisdictions to abide to, ownership, privacy and physical control of the data, specially given the presence of vendor lock-ins. On the other hand, clouds have benefits like the possibility of being backed by a service level agreement (SLA), not needing to worry about something that is not an end business, but a means to achieving the enterprise goals; And being able to leverage many resources readily available from the provider instead of having to build a standards compliant infrastructure, notably for situations where those resources will not be needed continuously. These clouds however are not abstract, they are made up of several machines working together and that are each in turn managed by individual operating systems.

12 Motivation The increasing complexity of operating systems, in particular Linux, has led to the usage of so many sub-systems (schedulers, governors, frameworks) each with many options (like scheduling parameters) that, combined also with the increasing sophistication of applications and usage scenarios, it is extremely difficult at runtime to discover the optimal way in which to manage the hardware for all possibilities. Therefore, there is a need to properly understand what systems can actually do, the options they offer and how they behave so that models can be devised to guide system administrators and system or low level software developers in order for them to achieve their respective objectives in an efficient manner. For instance, the migration from the O(1) Linux scheduler to CFS included a trade off between the removal of heuristics from the O(1) scheduler and the addition of several configuration parameters, which in practice implies delegating some control capabilities to a higher level manager, who due to its scope needs able to use this additional information to make required decisions regarding parameter settings and tasks distribution. In the end, the main motivation for this work is efficiency, that is, how to better make use of available resources. The main benefit being a reduced need to acquire new infrastructure, since fewer resources will be needed to meet demand, as the existing infrastructure in place will be able to meet an increased demand. More efficient systems can also speedup application execution, leading to better man-hour throughput and a faster time to market. Lowered energy consumption is important for the environment, reducing energy costs and helping meet regulations that are increasingly demanding more ecological friendly certifications. Remember that in larger systems, one small increase in a subsystem when taken into account at full scale will have a very meaningful impact. For instance, a power consumption economy of one watt in one computer consuming 100 watts may not seem much, but if there are of those, typical for a small to medium sized data center, then the economy can be significant. 1.3 Objectives The purpose of this dissertation is to further understand the Linux process management system, the diversity of system components, their configurations and how their interaction impacts the process scheduler's behavior, when taking into account KVM based virtual machines. Such an understanding will help lead to a better utilization of systems, or nodes of larger systems, as

13 13 well as aid in the development a model that could be used by a higher level process manager, human or middleware, to further increase total utilization of large distributed systems. 1.4 Contribution This dissertation provides an analysis of the current Linux infrastructure, in particular the kernel within the context of CPU process management. This will be useful for teaching administrators how to improve the systems they manage, as well as for developers in order to achieve higher performing solutions. A test tool has been developed for sensitive system activity measurements, which is useful for anyone interested in validating the impact of changes to the systems on their applications, be it administrators, managers or even application developers. Experimental results for different configuration scenarios and their detailed analysis are presented. Finally, a summarized base model aggregating what was learnt is described and that could be further expanded and implemented in future work. 1.5 Structure of the dissertation This dissertation is structured as follows. Chapter The Evolution of the Linux Scheduler describes the historic evolution of the Linux kernel process scheduler, focusing on the current scheduler, CFS. Chapter Tools and Mechanisms describes the current tools and mechanisms available in a Linux system, specially those used in this work. Chapter Practical Validation explains the reasoning for as well as the employed methodologies for the practical validation. Chapter Experiments discloses and discusses the experimental results. Chapter Conclusion concludes the text with some points of interest for future work, followed by the bibliography and the annexes, where the source code for the programs developed during this work and details about the test system are disclosed.

14 14 2 The Evolution of the Linux Scheduler Short term process scheduling is the problem of choosing which process ready to run on the next available processor and for how long each process will hold the CPU. A process either can yield its time slice or be preempted back into the run queue. The Linux CPU process scheduler [11] [29] [35] started as a simple linked list of ready to run processes, with a pointer to the currently running process and a time stamp of the last time the scheduler function ran. The complexity was O(n), where n is the number of ready to run tasks. It did not scale well in regards to multiprocessor systems as well since the list had to be locked for every interaction, so each processor had to wait for its own turn. Time slices were a function of the priority of the process to be run, also refereed to as nice level, and the timer period (1/CONFIG_HZ, also called jiffy). The next mainline scheduler [2] is described in Illustration 1, and consists of runqueues, each with 140 first in first out (FIFO) lists, 100 for real time and 40 for nice levels ( ) [56]. In order to choose the next process to run, the lists had to be iterated by priority order, until a ready process is found. Given that there is a fixed number of lists then the algorithmic complexity is constant, O(1), from where it was named. The interactions were actually done through a priority array used as a binary mask where if queue number Z had ready to run processes then its bit on position Z was set to one.

15 Illustration 1: O(1) Scheduler Data Structure [35] Each processor had two run queues, one for active processes and one for expired processes, the ones that had already expended their allocated

15 15 Illustration 1: O(1) Scheduler Data Structure [35] Each processor had two run queues, one for active processes and one for expired processes, the ones that had already expended their allocated time slice. Once a run queue was empty both had their pointers swapped and the process resumed, the time between those swaps being called an epoch. It also had dynamic task prioritization, the effective priority being the user defined static priority plus or less a dynamic one given by interactivity. In order to calculate interactivity, utilization records for each process were kept and a heuristic decided which from the relation between I/O wait times and CPU times, I/O bound processes receiving higher priorities, mainly due to response time since user interactions are I/O operations. There were also heuristics which tried to prevent starvation by defining interactive processes and swapping them from the expired queue back into the active queue. There was also a cross balancing function called by each processor at 200ms intervals (every millisecond if idle) for the purpose of keeping the workload reasonably balanced across all processors. Time slices were a function of the static priority and the jiffy, but calculated at insertion time on the list and through linear rules, one for positive priority (nice) levels and one for negative levels, calibrated around nice 0 in order to define nice +19 processes time slices as one jiffy [56]. Currently, since kernel version , scheduling is actually done through three

16 16 schedulers. The real time process scheduler (rt_sched_class) defined in rt.c (source/kernel/sched/), which is still using the FIFO queue data structure from the O(1) scheduler, the idle process scheduler (idle_sched_class) in idletask.c and the fair scheduler (fair_sched_class) in fair.c, both based on weighted fair queuing strategies using a red and black tree data structure. As a side note, those files were renamed when the sched folder was created, originally they had sched_ ahead of their names. The whole process scheduler structure is shown in Illustration 2. Once a sched_class function is called, first the highest corresponding class, rt_sched_class, function is called, if not applicable or ready then repeat for the next one, fair_sched_class and lastly idle_sched_class. Illustration 2: Scheduling Classes [35] Real time scheduling, the rt_sched_class, is responsible for FIFO and Round Robin class of processes. Fair scheduling is responsible for the fair classes, batch, default (some times refereed to as SCHED_OTHER) and idle policies. The difference between batch and the default policy classes are that batch does not preempt any non idle class processes upon wake up and batch does not update update_rq_clock(rq) nor update_curr(cfs_rq) inside yield_task_fair, effectively preventing voluntary yielding if still ready to run. The idletask scheduler is a special case needed in order to manage the per CPU idle task. Not to be confused with processes scheduled under the SCHED_IDLE policy, idle tasks are processes created when booting each CPU with the purpose of simplifying scheduling. They allow the assumption that there will always be at least one ready to run task on each CPU to hold true. This scheduler structure ensures that, for a given CPU, idle tasks will only be

17 17 executed if there is nothing else in a ready to run state. Of note, in development since 2009 and intended to be mainlined within the kernel version 3.14 is a new scheduler class, SCHED_DEADLINE [54], implementing an Earliest Deadline First (EDF) algorithm for deadline bound processes. This class is intended for time sensitive control applications. It also is based upon a RB tree structure but ordered by deadlines instead of virtual runtimes as in the fair scheduler. The algorithmic complexity of CFS is O(1) for selecting the next process to be run, since there is a dedicated pointer to it, and O(log(n)) for inserting processes in the tree. It's data structure structure is as shown in Illustration 3. Illustration 3: CFS Data Structure [35] The idea behind fair scheduling is that the less CPU time a process (task) has had the more it should receive in order to be fairly balanced, therefore processes are kept ordered by their virtual runtime, as shown in Illustration 4. It also accounts for tasks in a waiting state so

18 18 they receive a comparable amount of time once they are ready. The time a task has spent being executed is multiplied by a weight (Illustration 5) to account for nice levels. The time-slice is dynamic [41] [51] defined through the parameter sched_latency_ns or by the product between sched_min_granularity_ns and nr_running (the number of running tasks in the queue), whichever is greater. By default, the base latency is 6ms and the base granularity 0.75ms, hence once there are enough tasks in the run queue the base time slice is 6.75ms and so forth. Also, preemption is not immediate, it is delayed by sched_wakeup_granularity with a base default value of 1ms. Real actual default values are the base values multiplied by the value returned from the sched_tunablescaling rule. Illustration 4: CFS Red and Black Tree [35] Points of note are that this time-slice is calculated when the task is set to run and not when it is inserted in the tree like it was in the O(1) queue. Also, that the decision is now based upon values in a nanoseconds resolution thanks to the high resolution timer. And, that there is only one structure for all priorities (nice levels). Finally, there are also scheduler domains [24] [55] defining the CPU span of balancing actions. Each domain has its own CPU group (CGROUP) set and scheduling configurations, manually defined CGROUPs are island domains. Default domains are, first for hyper threaded processor levels (SMT), second for physical processor levels (SMP) and third for NUMA node

19 19 levels. All domains must contain at least two lower domains or CPUs, meaning all CPUs are connected by at most a O(log(n)) distance, where n is the number of CPUs. Physically only one CPU goes into a higher domain in order to limit the tree size and since all domains to which a CPU belongs must be locally available. Since kernel version , the so called patch that does wonders introduced automatic process grouping [19], where processes with the same session identification are treated as a single scheduling entity (shared virtual runtime). Process grouping can be managed through the use of the setsid system call or disabled or enabled through sched_autogroup_enabled at /proc/sys/kernel/. It improved the group scheduling mechanism introduced originally in kernel version by removing heuristics and by no longer being TTY based, allowing users or system administrators to set how groups are managed. Illustration 5: CFS Runtime weight vs. nice level

20 20 3 Tools and Mechanisms In the previous chapter, the Linux scheduler was described from the point of view of its historic evolution. However, the scheduler in itself is but one system software component and, ultimately, software is abstract with the physical process actually being realized on hardware. Therefore, there is the need to pay special attention to how the hardware is controlled by the system and to the available methods of operation for the tools to be used in the next chapter where experimental results from several different base control configurations, called scenarios, will be evaluated. Also, all hardware information in this chapter was gathered from the machine where such tests were executed, hence it also serves for contextualization purposes. 3.1 Hardware Information Given the multitude of tools available for collecting system information [38], such as processor identification or present hardware, an effort was made in order to use only information exposed directly from the kernel, mostly through the /proc pseudo file-system interface [39]. The purpose of such approach is that this information is likely to be easily accessible to future automated tools or services, and does not depend on installed third party software. The following is the output of the command line instruction cat /proc/cpuinfo in full: processor : 0 apicid : 0 vendor_id : GenuineIntel initial apicid : 0 cpu family : 6 fpu : yes model : 23 fpu_exception : yes model name : Intel(R) Core(TM)2 Quad CPU 2.83GHz cpuid level : 13 stepping : 10 wp : yes microcode : 0xa0b cpu MHz : cache size : 6144 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority bogomips : clflush size : 64 cache_alignment : 64

21 21 address sizes : 36 bits physical, 48 bits virtual vendor_id : GenuineIntel power management: cpu family : 6 model : 23 processor : 1 model name : Intel(R) Core(TM)2 Quad CPU 2.83GHz vendor_id : GenuineIntel stepping : 10 cpu family : 6 microcode : 0xa0b model : 23 cpu MHz : model name : Intel(R) Core(TM)2 Quad CPU 2.83GHz cache size : 6144 KB stepping : 10 physical id : 0 microcode : 0xa0b siblings : 4 cpu MHz : core id : 2 cache size : 6144 KB cpu cores : 4 physical id : 0 siblings : 4 apicid : 2 initial apicid : 2 core id : 1 fpu : yes cpu cores : 4 apicid : 1 fpu_exception : yes cpuid level : 13 initial apicid : 1 wp : yes fpu : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca fpu_exception : yes cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf cpuid level : 13 pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm wp : yes sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority flags bogomips : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca : cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall clflush size : 64 nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual bogomips : power management: clflush size : 64 cache_alignment : 64 processor : 3 address sizes : 36 bits physical, 48 bits virtual vendor_id : GenuineIntel power management: cpu family : 6 model : 23 processor : 2 model name : Intel(R) Core(TM)2 Quad CPU 2.83GHz

22 22 stepping : 10 fpu_exception : yes microcode : 0xa0b cpuid level : 13 cpu MHz : wp : yes cache size : 6144 KB physical id : 0 siblings : 4 core id : 3 cpu cores : 4 apicid : 3 initial apicid : 3 fpu : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority bogomips : clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: Since the above is from a symmetric multi processor (SMP), four core, single socket machine, then there are four repeated entries differing only by the identification fields, processor, core id, apicid and initial apicid. The processor entry provides each processor with an identifying number, here from 0 to 3. Vendor ID is a string with the manufacturer identification, here Intel. CPU family is a number that identifies the type of processor in the system, here as well as for any x86-based system, place the number in front of "86" to determine the full value, like 386, 486, 586, or 686. Model identifies, in a more refined level, the architectural revision of the instruction set, here 23 stands for the Penryn architecture family, encompassing the models Yorkfield, Wolfdale, Penryn and Harpertown. Model name is a string with the specific unique model identification provided by the manufacturer. Stepping provides the silicon revision number referring to a change to the mask used to manufacture the chip. Reasons for such changes include to improve bin split (frequency), to fix errata (ie. bugs), to improve yield, or to solve an electrical issue. Microcode is the hexadecimal number identifying the microcode revision in use, while the microcode is a part of the processor design in itself, translating machine code for the specific electronic implementation, it often is not fully read only in order to provide some level of flexibility for fixing errata without the need to resort to a new stepping. CPU MHz provides the nominal operating frequency in megahertz to the thousandths decimal place.

23 23 Cache size provides the amount of the highest level cache memory pool the processor has direct access to, here it is 6MB which is at the L2 level. The physical ID entry provides each socket with an identification number, here there is only one socket hence all are 0. Siblings provides the amount of processors visible in a socket, including logical processors like for when hyper threading is in use. Core ID provides each core in a socket with an identification number, here the same as the processor entry since there is only one socket, does not count logical processors. CPU cores provides the amount of real cores, also referred to as physical cores, in a socket. APICID provides the ID of the Advanced Programmable Interrupt Controller (APIC) that is used by each individual CPU. FPU and FPU exception are flags which indicate or not the presence of a floating point unity and if such unity handles their exceptions, if not exceptions are sent via the interrupt controller. CPUID level is the maximum option, EAX register value, that can be used when calling the CPUID assembly instruction on x86. The WP flag on the CR0 x86 control register determines whether the CPU can write to pages marked read-only, the WP entry is a binary checking if such write protection works. The flags entry is a list of strings describing the processor feature bits, they are defined in cpufeature.h header file. BOGOMIPS is a measurement made at boot time, to calibrate busy loops (timing parameter loops_per_jiffy ) in the kernel. CLFLUSH size is a number for the granularity of the CFLUSH assembly instruction used to invalidate a block of data from the cache. Cache alignment is the size of a cache line, the basis for data alignment. Physical address size is the actual amount of bytes a processor has to address main memory. Virtual address size is the amount of bits a can be used to map a pointer to address memory in virtual space to physical space. Each process has its own translation tables so a pointer in one process with identical value to a pointer in another process does not normally point to the same physical memory. This however does not encompass all available information from the system, for instance, cache hierarchy information is not available, actually a casual reader would not even be able to identify from the above the total amount of cache available. From the datasheet [34] [47] it is known that there is a total of 12MB of L2 cache in this processor model. However, not only the datasheet may not be readily available but a programmer would not be realistically able

24 to use it as an information source for his program. That being the case then the following is a list of sources from the /sys pseudo file-system concerning the cache: 24 cat /sys/devices/system/cpu/cpu0/cache/index0/level 1 cat /sys/devices/system/cpu/cpu0/cache/index0/type Data cat /sys/devices/system/cpu/cpu0/cache/index1/level 1 cat /sys/devices/system/cpu/cpu0/cache/index1/type Instruction cat /sys/devices/system/cpu/cpu0/cache/index2/level 2 cat /sys/devices/system/cpu/cpu0/cache/index2/type Unified cat /sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map , cat /sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map , cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map , cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list 0-1 cat /sys/devices/system/cpu/cpu0/cache/index0/ways_of_associativity 8 cat /sys/devices/system/cpu/cpu0/cache/index1/ways_of_associativity 8 cat /sys/devices/system/cpu/cpu0/cache/index2/ways_of_associativity 24 cat /sys/devices/system/cpu/cpu0/cache/index2/coherency_line_size 64 cat /sys/devices/system/cpu/cpu0/cache/index1/coherency_line_size 64 cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size 64 Type entries define if the cache is for data, instructions or both. Shared_cpu_map entries are binary masks, printed in hexadecimal above, mapping which processors have direct access to the cache in question, shared_cpu_list provides the same information but in a list with processor numbers. Ways of associativity are how many different cache lines each main memory address can be mapped to. Coherency line size is the granularity in which the cache is is able to read from and write to in relation to a higher level, the main memory for the highest level cache. What can be seen from the above is that there are two levels of cache, each processor has one non shared L1 data cache and one non shared L1 instruction cache. Also, each pair, numbered 0-1 and 2-3, shares a unified data and instruction L2 cache of 6MB each for a total of 12MB in the package. Memory access is realized through 64bits interfaces. There is also another relevant point of concern when dealing with virtual machines,

25 25 what is seen from within them is not an identical copy of the real host hardware. For instance, there is no cache sharing exposed, such cache is differently sized from the host and also some processor flags differ. While, even among compatible systems [3] [4] [5] [6] [67], the instruction set may not really be the same nor even behave in the same manner, since there are plenty of extension variations across models and manufacturers, either a programmer or the compiler already has to account for that. What is not obvious that special care must also be considered across virtual machines even in relation to the same host. Later in this chapter virtual machines will be discussed in greater detail, however to illustrate the case in point, differences between the host and a KVM based VM feature set, using the -cpu host directive, are going to be explored: Host cat /proc/cpuinfo exert: KVM based VM cat /proc/cpuinfo exert: cache size : 6144 KB cache size : 4096 KB flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dts tpr_shadow vnmi flexpriority flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc arch_perfmon rep_good nopl pni ssse3 cx16 sse4_1 xsave hypervisor lahf_lm Flags present on the host but not inside the VM are: DTS, there are two of them, the first one to be parsed stands for debug store buffer, the second one for digital thermal sensor; ACPI, advanced configuration and power interface via model specific registers (MSR); TM, thermal monitoring version one technology; PBE, pending break enable; PEBS, precise event based sampling; BTS, branch trace store; NOPL, NOP are instructions that do nothing, the one ending with L takes a long sized byte as an operand, for alignment and in order to limit jump instructions given the shared front end; APERFMPERF, APERF and MPERF MSR registers that can provide feedback on processor frequency; DTES64, 64 bit debug store; MONITOR, MONITOR and MWAIT SSE3 instructions for optimizing multi-threaded applications; DS_CPL, extensions to the debug store feature to allow for branch message storage qualified by current privilege level (CPL); VMX, Intel VMX hardware virtualization technology; SMX, safe mode trusted execution technology (TXT); EST, enhanced speed step technology; TM2, thermal monitoring version two technology which is also able to change the multiplier; XPTR, send task priority messages; PDCM, performance capabilities MSR registers; TPR_SHADOW, VM

26 26 execution control which is maintained in a page of memory addressed by the virtual APIC address; VNMI, Intel virtual non maskble interrupt support; FLEXPRIORITY, Intel FlexPriority technology for optimizing virtual interrupt handling. Those flags are for virtual machine management by a hypervisor, debug support, power management and certain optimizations, none of which can reliably be handled inside a VM. The only flag present inside the KVM based VM but not on the host is the HYPERVISOR flag which tells that the processor with such flag is a virtualized processor. Of note, storage devices can be found on /sys/block with related information tough they are beyond the point of this text. Likewise, the compiler capabilities also should be taken into account, for instance, what gcc can see in the system can be exposed by the command echo gcc -dm -E - -march=native from its known options gcc -Q --help=target -march=native at the target architecture. 3.2 Kernel Before getting into the kernel settings explicitly keep in mind that it is possible to help the kernel by advising how the program will behave before actually tacking action, from the process point of view. Some system calls for that are madvise, posix_fadvise and readahead, like posix_fadvise (fd, 0, 0, POSIX_FADV_RANDOM); for instance. In this example the kernel is advised that for the file under file descriptor fd, the access is going to be in a random non sequential order so it will read only a minimal amount of data for each read operation in contrast to POSIX_FADV_SEQUENTIAL where it would instead double the readahead window, both zeroes are integer parameters for offset and range, that the advice applies to, when those integers are both zeroes then the range is set to be the whole file. Another example is the usage of Kernel Same page Merging (KSM) through madvise(addr, length, MADV_MERGEABLE) and madvise(addr, length, MADV_UNMERGEABLE) if enabled. For a more detailed explanation concerning KSM refer to section Compilation time options This subsection discuss some kernel options that have to be set at compilation time. The first to be discussed is CONFIG_HZ_# where # is one of some predefined frequencies for the target

27 27 architecture, usually 1000 for desktops and 100 for servers by default on most Linux distributions. It defines the frequency at which timer interrupts will be raised per processor. In general, lower frequencies favor cache sensitive and non interactive processes where interrupting them just creates overheads. Meanwhile, higher frequencies favors latency sensitive applications, particularly important for user interactions, and also increases the precision of said timers, which may be important for time management. Another important point is that the higher the frequency the more all CPUs will have to be woken, which is particularly troublesome if idle power states savings are significant. Another set of options are the ones dealing with the default scheduler domains. CONFIG_SCHED_SMP is the base configuration directive that enables domains, without which the kernel will only use a single processor even if the machine has many. CONFIG_SCHED_SMT enables the hyper threading domain, CONFIG_SCHED_MC enables symmetric multiprocessing (SMP) optimizations and CONFIG_NUMA enables the domain for non uniform memory access (NUMA). All within Processor Type and Features level at the kernel configuration. The test system in chapter 4 is compiled only with CONFIG_SCHED_MC since this is the only one that applies and does reduce the overhead as for each CPU the domain mapping for all domains it belongs to has to be present locally in order to avoid excessive messaging. Compiling the kernel with CONFIG_SCHED_DEBUG exposes the scheduler debug interface, including several additional runtime adjustable parameters in /proc/sys/kernel/, listed below are those from the kernel version 3.6:

28 28 /proc/sys/kernel/sched_autogroup_enabled /proc/sys/kernel/sched_cfs_bandwidth_slice_us /proc/sys/kernel/sched_child_runs_first /proc/sys/kernel/sched_latency_ns /proc/sys/kernel/sched_migration_cost /proc/sys/kernel/sched_min_granularity_ns /proc/sys/kernel/sched_nr_migrate /proc/sys/kernel/sched_rt_period_us /proc/sys/kernel/sched_rt_runtime_us /proc/sys/kernel/sched_shares_window /proc/sys/kernel/sched_time_avg /proc/sys/kernel/sched_tunable_scaling /proc/sys/kernel/sched_wakeup_granularity_ns Defines whether processes with the same session identification will or will not be treated as a single scheduling entity. Defines the interval in which slices will be balanced across CPU silos in a bandwidth restricted CGROUP set. Defines who will be set for execution next after a fork, the child or the parent. Preemption latency for CPU-bound tasks, a period in which each task runs once. Defines the cost of task migrations across CPUs. If the real runtime of the task is smaller than this value then the scheduler assumes that the task is still in the cache tries and avoid moving the task. Defines the minimum time after which a task becomes eligible to be preempted. Defines he maximum number of tasks to handle during load balancing since this is done with interrupt requests disabled. See next entry. Defines a maximum CPU time that can be used by real time tasks until when they have to wait for sched_rt_period_us in order to be scheduled again. Equivalent to 100% CPU bandwidth. Defines the exponential sliding window over which load is averaged for shares distribution. Defines the period over which real time tasks time consumption is averaged. Defines how parameters scale across the number of CPUs for all kind of CPU add or remove events as well as default values. NONE means *1, the default logarithmic LOG is *(1+ilog2(nCPU)) and LINEAR is *ncpu. Determines the delay of tasks being waken up when preempting the current task. Table 1: Kernel scheduler runtime parameters As part of the /proc pseudo file system, the parameters above can be handled as files. Therefore, reconfiguring the scheduler as well as verifying its current configuration is as simple as writing to and reading from those files. The normal parameters are (without CONFIG_SCHED_DEBUG):

29 29 /proc/sys/kernel/sched_autogroup_enabled as seen in Table 1, and /proc/sys/kernel/sched_child_runs_first. CONFIG_CFS_BANDWIDTH allows for deterministic scheduling through real CPU time reservation, usually set for real time usage scenarios where there is the need to assure a fixed amount of available CPU time for task groups under CGROUPs. Bandwidth limitation [13] [65] works by setting up a quota and a period for the group. If both are set equally then the group is limited what would amount to one CPU worth of time, if not then the group is limited to what would be the equivalent amount of one CPU by the same proportion. Such limitation can be used as a reservation if set as exclusive. For instance, if the period is twice that of the quota then the group will get half of one CPU, while if the quota is twice that of the period then the group will get two CPUs worth of time. This directive also enables the parameter /proc/sys/kernel/sched_cfs_bandwidth_slice_us. If CONFIG_RT_GROUP_SCHED is not enabled then all real time processes will be limited to 0.95s of each second of CPU utilization in order to avoid a CPU lock-in. Also reserves the usage of real time policies for root only unless there is a CGROUP configured for them with proper user permissions. All allocated time not used is yielded. If enabled it is responsible for: /proc/sys/kernel/sched_rt_period_us /proc/sys/kernel/sched_rt_runtime_us Only those options in this subsection already raise concerns related to what their impact on system behavior is, which is fundamental for a model of the system, that will be explored in section 5.5, except the timer option which will be first detailed on section Scheduling Policies are attributed per process id (PID). Children will inherit their parents policies by default. Policies can be changed at runtime through the chrt command line program or the system call sched_setscheduler [40] by anyone with proper permissions. Real-time, SCHED_FIFO and SCHED_RR (round robin), policies are only root permissible, if CONFIG_RT_GROUP_SCHED is set, unless a CGROUP is created for them with proper user permissions. Also, children will be grouped together with their parents if

30 30 sched_autogroup_enabled is set (default). This is such that a heavily parallel process will not hinder a more serialized one, like running parallel make while watching a movie. If a task is supposed to only be executable if no other processes are running then there is SCHED_IDLE, they are always preempted by non idle policy process, not by the per processor idle task. SCHED_BATCH is the same as SCHED_OTHER except it does not update timing statistics when yielding and does not preempt non idle tasks upon wake up, the purpose of SCHED_BATCH is to enable tasks that do not care about latency to otherwise receive a fair time share like they would have under the default settings. These policies are manageable at runtime through chrt, which has three usage parameters. Two based on a given pid, consulting information about the process chrt -p pid, and changing a running process policies on the fly chrt -p$ # pid where $ is the new policy and # the new priority. The third possibility is launching a new process with arbitrary policy and priority chrt -$ # <command>. Valid priorities can be queried with chrt -m, real time policies have one for each queue from 1 to 99 and the non real time policies do not rely on queues since they use only the red and black tree, hence their only available priority is 0. The CPU scheduler does not handle I/O requests, delegating processes requiring them to each block device own scheduler. The currently in use policy (in brackets) along with available ones can be queried with cat /sys/block/sda/queue/scheduler, replacing sda by any desired block device. The default I/O scheduler for all block devices can be defined at kernel compilation time. At boot time through the --elevator= kernel parameter. Or at runtime by writing the to the scheduler file, like echo deadline > /sys/block/sda/queue/scheduler, note that all that is supposed to be written is the desired one, not the whole list as is returned when reading from it. Incremental scheduler statistics [57] can be exposed if CONFIG_SCHEDSTATS is enabled at the time of the kernel compilation, exposing both global statistics at /proc/schedstat and per process ID statistics at /proc/<pid>/schedstat. The changes to process behavior for each process policy will be evaluated on all scenarios from section 5.2 to section 5.4.

31 Timers There are two schemes through which timing can be measured in Linux [36], the standard timer API and the high resolution timer (HRT) API. The standard API operates in the jiffies time scale, period determined by the HZ constant (1000HZ equals a 0.001ms jiffie), and the HRT API [22] in nanoseconds, but the real accuracy is implementation dependent. Usually the standard timer API is just a fallback for when no HRT is available, however some system calls still use it by default like getrusage which is commonly employed for when user time and system time are measured independently. HRTs must be enabled in the kernel through CONFIG_HIGH_RES_TIMERS and there must be at least one hardware clock source compatible. Available options can be listed from /sys/devices/system/clocksource/clocksource0/available_clocksource. For the test system host this returns tsc hpet acpi_pm and for the test system virtual machine the return is kvm-clock tsc hpet acpi_pm, those clock sources are ordered by priority. Time Stamp Counter (TSC) [16] in x86 is a 64 bit register which is accessible through the assembly DRTSC instruction [33], which is why it is a very low overhead time keeping mechanism, that is, the cost is the number of cycles RDTSC takes plus some possible serialization in an out of order pipeline. Unfortunately it is not very accurate, notably because the tick time may differ inside a multicore CPU and is different for multisocket systems. Depending on the implementation they may be affected by power saving mechanisms. A TSC count must be calibrated for real time by determining how much it counts in a given time interval in relation to a more reliable clock source, tough such measurement has a limited precision given the register size limitation, hence continuous readings will drift over time as a consequence. High Precision Event timer (HPET) [30] [32] is a hardware clock source accessible through a memory mapped I/O window discoverable via the advanced configuration and power interface (ACPI), enabled in the kernel at compilation time through CONFIG_HPET. There are some points of concern with the specification, however. It relies on equal register comparison to raise an interrupt and a non maskable interrupt or system management interrupt may interfere with it resulting in a wait for a wraparound of the HPET counter until the match can happen. There is also the cost of accessing it since it is not internal to each processor, tough the exact

32 32 implementation may vary. ACPI_PM is a simple clock signal from an ACPI compatible motherboard with fixed frequencies. It is the most expensive [63] of the three to account for, in response time, and is vulnerable to motherboard or chipset defects which are not rare. The source code actually reads the signal three times just to account for that, at source file /drivers/clocksource/acpi_pm.c function acpi_pm_read_verified. KVM-CLOCK is the handle for a pass through which gives a KVM guest direct access to the host's main clock source. It has to be enabled at the hypervisor, which is the default setting, and must be supported by the guest kernel. The kernel configuration options that enable it are CONFIG_KVM_CLOCK and CONFIG_PARAVIRT_CLOCK, both are needed. This is important since clock sources inside virtual machines are notoriously unreliable. There is a watchdog kernel infrastructure responsible for sorting, verifying the reliability and updating references, in order to account for clock skews, for all of the clock sources, it can be enabled by CONFIG_CLOCKSOURCE_WATCHDOG. The downside is that when doing time measurements through comparisons between checkpoints, if the watchdog updates the timer in between checkpoints the end result may even drift into the past, as if the ending point measurement happened before starting point measurement. Note that some calls like clock_gettime have monotonic options like CLOCK_MONOTONIC to prevents this. 3.4 CPU frequency governors CPUs do not operate statically, their operating characteristics changes based on some power management related criteria. There are several techniques for power management at the processor level, some companies even name their implementation to distinguish themselves like AMD Cool'n'Quiet, Intel SpeedStep and IBM EnergyScale for instance. Those techniques basic premise is that the processor does not need to operate at full capacity the whole time, leading to features regarding the management of different operating levels, including even the ability to turn on and off pieces of the whole CPU in some cases. From the Linux operating system point of view the mechanism intended to handle such management was called frequency governor [10]. There were four strategies. The performance and conservative strategies consisted only into permanently setting the

33 33 processor at the maximum operating level and the minimum one respectively. The userspace strategy consisted into exposing an interface for user level management, like by some daemon or user interface switch. The ondemand strategy consisted into attempting to automatically balance the operating mode based upon current and average processor usage. Nowadays the performance and conservative strategies remain as they were. But the given diversification of the implementations [72], with fine grained hardware level management and even rising frequencies above the nominal in exchange of disabling other processors in a multiprocessor packaged CPU as a strategy [46], new driver specific approaches are being developed, like the Transmeta scaling driver or the Intel P state driver [9]. Also, the granularity of choices for frequency targets has increased from a few power levels to an almost continuous scale, therefore the ondemand strategy was changed for the Linux kernel The main case addressed by this change is the one where while under the minimum frequency the processor would be saturated leading to a change to the maximum frequency, but when operating at the maximum frequency the load was such that the result would be a small load average. In turn, the small load average prompts a change back to the minimum, resulting in a cycle where the frequency kept jumping between the minimum and the maximum. The new ondemand [37] strategy simplified the governor by assuming a continuous configuration range instead of fixed operating points. The target frequency being set as the measured absolute load, as a percentage, times the maximum frequency divided by one hundred, no longer taking into account the average load. 3.5 Virtual Machines Setting up a virtual machine test profile proved to be non trivial. There are not only the aspects pertaining the system in the VM but also the hypervisor settings and the host settings. All of which are closely related and interdependent. Also very important is the timing information reliability needed to acquire valid test results and which ultimately lead to the use of KVM for this work, besides already being an integral part of the kernel which is the main object of study. Therefore there is the need to properly describe those aspects, hence the following two subsections will elaborate on the hypervisor and the guest including the settings used for the experiments in chapter 4.

34 QEMU-KVM Options Fist, QEMU is an emulator, it allows for the execution of binaries for architectures other than that of the host machine. Also, it is the front-end for KVM which is the kernel interface. There are many options that can be passed at launch time for QEMU-KVM all with very important effects, to be describe in sequence here. Take the following, used across all the normal virtual machines in chapter 4, as reference: qemu-kvm -cpu host -smp 4,cores=4,sockets=1 -net nic,model=virtio,macaddr=00:00:00:00:00:00 -net tap,ifname=tap0,script=no,downscript=no,vhost=on -drive file=gentookvm0.img,if=virtio,cache=writeback,aio=native -virtfs local,path=virtfs,security_model=none,mount_tag=virtfs -m vga std -name gentoo0 -boot c & The cpu option is probably the most meaningful along with smp, with it, one can choose the virtual cpu that the guest will have at its disposal. That is, if for instance the guest will run under a 486 family hardware while the host actual hardware is from a 686 family. The more generic the target architecture is, the easier it will be to migrate the guest optimized for it, up from not passing the option at all when the most simple available native target will be used. On the other hand, the more specific the setting is, up to the host parameter, when the hypervisor will try to use the most of the native features, the more the guest can be optimized towards the actual host hardware functionality in use. As an example, below are the valid cpu options the test system hypervisor can provide to a guest, architectures between brackets are native, others depend upon some kind of emulated features: qemu-kvm -cpu? x86 Penryn x86 [coreduo] x86 Opteron_G4 x86 Conroe x86 [kvm32] x86 Opteron_G3 x86 [n270] x86 [qemu32] x86 Opteron_G2 x86 [athlon] x86 [kvm64] x86 Opteron_G1 x86 [pentium3] x86 [core2duo] x86 SandyBridge x86 [pentium2] x86 [phenom] x86 Westmere x86 [pentium] x86 [qemu64] x86 Nehalem x86 [486]

35 35 An interesting point here is that Penryn is not seen as a native target albeit being the actual host processor architecture family. That however does not seem to have a more meaningful impact since the differences in advertised functionality with the host parameter, as explored previously in section 3.1, are restricted to cache sizes, debug support, power management and some specific optimization functions like the MONITOR and MWAIT instructions. The smp option controls how many processors and the processor layout as the guest will be able to see, irrespective of what is in the host. Detailed description from the help command output: -smp n[,maxcpus=cpus][,cores=cores][,threads=threads] [,sockets=sockets] set the number of CPUs to 'n' [default=1] maxcpus= maximum number of total cpus, including offline CPUs for hotplug, etc cores= number of CPU cores on one socket threads= number of threads on one CPU core sockets= number of discrete sockets in the system Since the above can be set at the hypervisor level instead of a copy from the host, then the question of what will happen in cases of over provisioning, when the VM has more virtual resources at it's disposal than the host can provide, arises. This has the potential to be particularly important for cases where there are VM migrations taking place and the host machines are heterogeneous. Over provisioned virtual machines will be explored on section 5.4. The first net option tells what sort of network device to provide, the simplest parameter being user meaning that the VM will access the network as a normal program would. Unfortunately that results in high overhead. The virtio parameter is a para-virtual pass-trough for I/O [68], for it to work a bridge is needed at the host so as to bind to it, usually through a tap device from the tun/tap infrastructure [66]. The second net option is about the specifics of the device, in this example tap0, tap1, etc, one for each VM. Of note here is the vhost=on parameter instructing KVM to use the device /dev/vhost-net from the host, enabled by CONFIG_VHOST_NET when compiling the kernel. If needed, the device permissions can be changed through udev rules, like KERNEL=="vhost-net",GROUP="kvm" placed on a file at /etc/udev/rules.d/. The sucessor

36 36 of devfs, udev [44] is a system intended to provide a userspace solution for a dynamic /dev directory, with persistent device naming. Following is the system host network layout, under Gentoo standards, from /etc/conf.d/net : bridge_br0="eth0 tap0 tap1" brctl_br0="setfd 0 sethello 0 stp off" rc_net_br0_need="net.eth0 net.tap0 net.tap1" tunctl_tap0="-u paulo" config_tap0="null" tuntap_tap1="tap" config_br0=" netmask brd tunctl_tap1="-u paulo" " config_tap1="null" routes_br0="default via " config_eth0="null" tuntap_tap0="tap" In the above: bridge_br0 defines the bridge identified as br0 and it's bindings; brctl_br0 is a bridge control program directive for the bridge br0 to brctl from the bridge-utils package where setfd is the forward delay, sethello is the hello time and stp sets the spanning three protocol on or off; config_br0 defines the IP (layer 3) settings for the bridge br0 ; routes_br0 is defining the gateway for br0, literally instructing the kernel to route default network requests for br0 through ; tuntap_tap# defines the type of the interface, tap was used for the bridge in this case since it is the layer 2 interface; tunctl_tap calls the tunctl program, -u was needed here in order to give user paulo access permissions; all other config directives are null since those interfaces do not need layer 3 parameters. The standard option pointing to what will be emulated as a hard disk device by the guest is hda. It is recommended to use the device option however, as was used here, in order be able to use more fine tuned parameters, like defining cache and asynchronous I/O policies, as well as for the virtio interface. The virtfs [1] option is an interesting one, there is a full discussion about it on [65], in sort it is a para-virtual infrastructure which allows for native sharing of a mount point and removes much of the block device system overhead.

37 37 The m option defines how much main memory, in megabytes, the VM will have at startup. With Virtio balloon driver [61] enabled on the guest kernel that can also be changed at runtime. It can be tough of as a real balloon placed inside a pool, the pool size is the one defined by m, the actual memory available is the water, the balloon is placed inside the water empty, but can be inflated and deflated. Inflating the balloon pushes water out of the pool back to the host, deflating it allows the water to flow back. It cant be deflated more than inflated, nor inflated more than the pool size. The vga option is for accessibility, there is the option for a conventional virtual network computing (VNC) interface, but a SDL console (called with std) is much more convenient when on the host, except that as a native interface it does not allow for actions requiring awareness outside of the virtuial machine guest like copy and paste from and to the host. The soundhw option is what will be provided as a sound device for the guest, it is entirely optional and can be omitted, specially for servers. Lastly, name and boot are administrative options for, respectively, identifying the VM and specifying what the guest will use as a boot device. There is also what can be seen as a hidden feature, kernel same page merging (KSM), that is used automatically if present. As a kernel feature, it is added to the host by compiling it straight into the kernel or loading it as a module. If present, /sys/kernel/mm/ksm/run will be available, but set to 0 (off) as default, hence it is important to either enable it with a script at boot time ( /etc/local.d or rc.local ) or remember to manually setting it to 1 before starting the virtual machines (a normal echo 1 will do). Other files at /sys/kernel/mm/ksm are for monitoring purposes, except: pages_to_scan, which is a configuration option for scan depth in number of pages before sleep; sleep_millisecs specifying how long in milliseconds KSM will sleep in between scan cycles when it will look for pages to be merged; and merge_across_nodes, binary allowing page merging across different NUMA nodes, default 1 (allowed). Also, not only hypervisors can benefit from KSM, since it is a generic tool, normal processes can through the use of the madvise system call: int madvise(addr, length, MADV_MERGEABLE), there are daemons crafted for automatic KSM management like ksmtuned on RedHat based distributions for instance.

38 Guest Kernel Guest operating systems can can be made aware about the fact that they are inside a hypervisor layer and provided means to cooperate with the host in its task of managing actual hardware resources through a para-virtual infrastructure. Host side most of such infrastructure is provided by the virtual machine manager (VMM), the VMM used in this work was QEMU-KVM, just setting KVM support in kernel for the host processor will expose the needed resources like the VT extensions, CONFIG_KVM_INTEL in particular for the test system. On the guest side there are mainly two groups, one for the system in the configuration field Processor type and features --->Paravirtualized guest support with three options: CONFIG_KVM_GUEST, for various optimizations like asynchronous page fault handling; CONFIG_KVM_CLOCK, so that the kvm-clock can be recognized; CONFIG_PARAVIRT, so that the guest already generates hypervisor calls instead of native operations, tough there is a fallback so that this setting can be enabled even on the host without breaking it; There is also the sub-option CONFIG_PARAVIRT_SPINLOCKS enabling a guest VCPU to yield instead of waiting in an active lock [7]. The other group is for the virtual devices, under Device Drivers --->Virtio drivers --->, three options, PCI driver for virtio devices for PCI pass-through and hot-plugging, Virtio balloon driver for dynamic management of the guest main memory, and Platform bus driver for memory mapped virtio devices. A notable difference here is the use of /dev/vd<az> instead of /dev/sd<a-z> when virtio block devices are in use, care is needed when setting the boot loader with proper options. 3.6 Summary In this chapter system tools and methods of operation were explored, including settings and system information for the test system and the experiments to be carried out in the following chapter. Points risen to be investigated are all of the different scheduling policies, the impact of virtual machines including the effects of over provisioning, the timer resolution, the runtime scheduler parameters and layout, like cache amount versus sharing, effects on process distribution.

39 39 4 Practical Validation From what was exposed in the previous chapters, it has become clear that different system configuration options, mostly concerning the Linux kernel or the scheduler, are now available. This however places an onus on the administrator or application developers of having to understand the influence of these settings on the behavior of the system in order to achieve their respective objectives in an efficient manner. Unfortunately, this is a very complex problem, not only does the system itself have such a variety of different options, but there are also plenty of different hardware profiles, like different architectures and deployment scales, upon which they can be used and different usage profiles, like servers, HPC or embedded systems, that may be pursued, each with its own diversity of distinct applications. Applications, by the way, that may change their characteristics dynamically like different HPC loads due to variations in input data or server loads due to alterations in user demands. Therefore, system behavior must be understood so that a model can be built to guide administrators and users towards the aforementioned understanding. This might be extended to produce middleware that would be able to perform the proper adjustments automatically in those configurations in order to account for such changes. The way through which such models are built is by identifying patterns from measurements. In light of this, this chapter describes how the practical aspects of this work were planned and carried out. 4.1 Relevant metrics for the model In order to gather the desired behavioral patterns of the scheduler, fine grained measurements of a known constant amount of work are needed, specially given that scheduler actions happen within a very small time frame, for instance, the clock tick can be as fast as one each millisecond. Also, since managing the workload is exactly the scheduler's function, where the amount of programs to be managed is a primary parameter, being able to set the level of parallelism through the number of processes in execution by the benchmark is of critical importance (taking automatic process grouping into account). In this regard, a very important metric is how often each parallel program is being preempted, and for that matter, the preemption must be as cheap as possible from the benchmark point of view.

40 40 For completion, changes to the different CFS scheduler runtime parameters (table 1) and different resource usage restrictions (CGROUPS) are also carried out. Those are conducted with standard benchmarks, of both serial and parallel nature, so that the reader can easy to contextualize the results. 4.2 Methodology In order to meet the desired requirements, an experimental setup was developed, the full sources (C and BASH) are provided in Annex A, and is designed as follows: First, scripts were developed to automate the procedure as much as possible and help ensure consistency across all repetitions. The scripts receive from the command line, the load size for background processes or the scheduling policy to be set for the test processes and the number of processes in each case. Second, a host program was developed to initialize the System V IPC [62] mechanisms of shared memory and semaphores, and to parse the results, writing them to an output file. The semaphore is used in order to ensure, once being triggered, that this host program will sleep until every process has finished and written its results to the respective addresses in the shared memory vector. Since what really is at stake is the system behavior itself, care is needed to minimize, as much as possible, the interference with the system, by avoiding multiple control flows, shell pipes and the system block interface. The latter is particularly important since the block interface even has its own independent I/O scheduler running in parallel. The third component is a client program to actually run a test as described above, gather the measurements like timing and involuntary context switches, from system calls and write them to the shared memory vector. The fourth set of components are programs to simulate background workloads, by default it has a setsid system call to avoid automatics grouping. For the experiments presented in this chapter, these programs are modified versions of the benchmark code where their load can be changed and which do not carry out inter process communications (IPC) nor measurements. The ability to change the load is an important feature as the background processes need to be sufficiently large enough so that while they all start before the test processes, all of them only finish after all such processes from the test in question have completed. Except when explicitly stated, all data presented here is not raw measurement data but some form of aggregated result, e.g. average, total, deviation or relative change, given the sheer volume of collected data.

41 41 While such setup allows for changes to be made to the benchmark code executed as clients or as background, for the purposes described above, a loop of floating point addition was used. This way it can be very easily adjusted for different loads simply by changing the size of the loop, the load is well known, uniform and simple, that is, will fit in the cache, does not need I/O, has only one control flow (the loop conditional) and uses only one processing unit or core per each instance [12]. For the analysis of the CFS scheduler runtime parameters and the CGROUPS resource usage restriction the Openbenchmarking.org suite was chosen. Openbenchmarking.org is an interface for benchmark customization allowing the user to configure the particular benchmarks to be carried out as well as the presentation of the results, reason for it being chosen since each parameter has to be tested with a number of different values. Openbenchmarking.org has a preconfigured catalog of open source benchmarks, from this catalog the following were chosen (all CPU bound): CacheBench [48], it performs repeated access to data on varying vector lengths, evaluating write, read and read/modify/write throughput in megabytes per second of the processor cache; SciMark, it is a scientific and engineering benchmark consisting of of five computational kernels: FFT, Gauss-Seidel relaxation, Sparse matrix-multiply, Monte Carlo integration, and dense LU factorization. SciMark return values are in mega FLOPS, one for each kernel and the composite score, it was originally developed in java but it has a C implementation, the latter is the one used; VP8 libvpx encoding, is is the reference encoder implementation of the VP8 video CODEC developed by On2 Technologies and incorporated into the WebM format; Graphics Magick, it is a varied image manipulation test relying on the Graphics Magick program and the libpng library. The Graphics Magick benchmark measures iterations per minute for each of five image filters, HWB color space, blur, local adaptive thresholding, resizing and sharpen; Himeno, it has several implementations [49], the one used was the standard himenobmtxpa.c written by Ryutaro Himeno, the benchmark is a kernel from a linear solver of pressure Poisson equations from an incompressible Navier-Stokes solver, employing a point-jacobi method, the return value is in mega floating point operations per second (MFLOPS); 7-Zip, it is is a file archiver, the evaluated algorithm is an implementation of the LZMA compression and the results are in MIPS (integer instructions); Timed Linux kernel compilation, it is a simple time measurement of how long in seconds it takes for the test system to compile a specific version of the Linux kernel (3.1 vanilla); C-Ray, it is a ray tracing

42 42 floating point CPU intensive benchmark, it will shoot eight rays per pixel and generate an 1600 x 1200 pixels final image, it uses a pthreads implementation; LZMA compression, it is an alternative implementation of the LZMA compression algorithm; FLAC audio encoding is a timed measurement in seconds of how much it takes to convert a WAV audio file into the FLAC lossless audio format from Xiph. As a side note, examples of other commonly used standard benchmark suites are the Standard Performance Evaluation Cooperative (SPEC) [12], the Scalable HeterOgeneous Computing Benchmark Suite (SHOC) [25] [60], the High Performance Linpack (HPL) [50] [60]. SPEC is a performance standardization body with several different suites; CPU for compute intensive workloads, including integer and floating point benchmarks [31]; ViewPerf for measuring the performance of professional graphics related editing; WPC for evaluating system as a whole, with workloads stressing CPU, graphics, I/O, and memory bandwidth; CAPC a collection of workloads for Autodesk 3ds Max, an application for 3d modeling; MPI for floating point intensive parallel computations; OMP for shared memory parallel computations; JBB for java based servers; jenterprise for whole system performance on top of Java EE; SFS for server throughput and response times; SIP for measuring SIP operations in a server providing for VoIP deployments; VIRT for end-to-end performance of virtualized platforms. The purpose of the SHOC suite is to evaluate the system wide performance and stability of heterogeneous systems, including the use of graphics processing units (GPUs) [25]. The individual benchmarks have three versions, serial, embarrassingly parallel and true parallel. The difference between the latter two being that the former does not use communications between devices or nodes. It has three groups of benchmarks, level zero for simple tests seeking to find the theoretical maximum throughput of the system, level two for evaluating the performance of basic parallel algorithms and stability tests seeking to find if the device is operating correctly. High Performance Linpack is a solver for random dense linear systems on distributed memory parallel computational systems [50], measuring time to completion and accuracy of the results, it relies on the message passing interface (MPI), basic linear algebra subprograms (BLAS) or vector signal image processing library (VSIPL). It is not really a suite but a complex benchmark.

43 Organization of the experiments There are several distinct scenarios to evaluate, each described in subsections 5.2, 5.3 and 5.4 for individual configuration options, spanning across three environments: One for the physical machine, also known as bare metal; virtual machines (VM) and; virtual machines with over provisioning (XVM). Each scenario is represented by an experiment that was repeated five times. Each experiment consists of seven tests. Each one of the seven tests with a different amount of processes, from one to four then six, eight and twelve processes, due to the machine where they were executed having four processors. Each test was executed separately but the processes within each test were executed together, subject to the respective scheduling policies. The reason for choosing seven different sizes of groups of processes is the following. Without executing any background load, a single process is executed to obtain a base reference, from which the best case can be inferred. Tests with four to twelve processes are for capturing the impact of data contention since four concurrent processes is the point where the system becomes saturated by the test processes themselves and so system overheads begin to become apparent. With background present, the execution of one to three concurrent processes is also important since with three background processes one additional process already saturates the processor. As the test system relied upon an Intel Q9550 processor with four true cores (operating at 3410MHz), three background processes were used in parallel, unless explicitly stated, since processes under the idle scheduling policy would otherwise not be able to execute. Adjusting, for instance, to a similar eight core system, 1 to 4 would be 1 to 7, then 6, 8 and 12 would be 8, 12 and 16 processes respectively. 4.3 System Setup and Configuration The Intel Q9550 processor was used since it was easily available, has a non shared 12MB L2 cache (half shared only within processor pairs 0-1 and 2-3, but not between them) with only one memory interface (Uniform Memory Access), so it is possible to directly compare cache sharing effects, has a full set of hardware virtualization support as needed by KVM. Has a deterministic operating frequency and relies normally on standard current system energy management instead of needing a per system driver. It is operating at 3410MHz due to the Front Side Bus (FSB) being increased, as a consequence of the RAM speed, and multiplier changes, not manual overclocking which could have prevented dynamic voltage and frequency scaling (DVFS)

44 44 implementations from working properly. The test machine RAM was a dual channel G.Skill F CL6-2GBXH set operating under the nominal frequency (1600 MHz) and the default Intel Extreme Memory Profile (XMP) latencies. XMP is a standard originally develop for Intel chipsets, similar to the Enhanced Performance Profile (PPE) standard originally develop for NVIDIA chipsets. Their purpose is to extend the automatic configuration of the memory latencies, through the Serial Presence Detect (SPD) standard, beyond the JDEC standard values. As such, manufacturer specific settings can be automatically used by the machine in a manner that is transparent to the user. The base configuration was CONFIG_HZ_1000, meaning one timer interrupt per millisecond, and without cpusets and debug information. Wall time measurements were made in nanoseconds thanks to the timer infrastructure, described in the previous chapter, and its passthrough for KVM which was used to provide the virtual environments for the VM and XVM.

45 45 5 Experiments This chapter presents the results for the experiments carried out, as described in the previous chapter, and their analysis. Starting with how they are presented, section 5.1, then the tests develop through this work, sections 5.2, 5.3 and 5.4, and finally the standard tests, section Presentation of Results In each of the following subsections, there are three tables each divided into another two to five parts presenting: First, the average time for each process within the test; Second, the total time of the benchmark, i.e. time to execute all of the test processes in an experiment, and; Third, the average number of involuntary context switches for each process of the test. The subdivisions of each table are for the values of the data in question averaged from five executions, the deviation of the results over the five executions, the percentage change relative to another relevant scenario, except for the base reference, how much time groups with more processes took in relation to one process (scaling). Those scenarios used as reference of comparison by others, six in total, had their experiments repeated two more times, for a total of seven. Those were used only for an additional entry called trimmed average that stands for the average of five from the seven experiments chosen by disregarding the lowest and highest values. All references are comparing to those entries, not the first averages. Note that wall time is the difference between the process start and end times, acquired by a pair of gettimeofday standard system calls before and after the test main loop. Strictly for conventional benchmarking purposes the clock_gettime system call would have been better, particularly for consistency with the CLOCK_MONOTONIC directive, but since the point of interest is the system behavior then the clock corrections and the timer watchdog work are not to be disregarded, hence the use of gettimeofday. The average time is the arithmetic mean of the wall time of processes in each test, that is, their sum divided by group size. Total time is defined as the difference between the finishing time of the last process and the start time of the first process in each test. All timing measurements are in nanoseconds. Context Switches are events when the program is preempted for whatever reason. What happens between the control flow being removed from the process and being returned is not directly measurable, but

46 46 will indirectly be part of the time measurements since they delay the execution of the process in question. In all cases, the context switches accounted for were measured for involuntary preemption, meaning those preemptions happen due to factors beyond the control or will of the process itself. For simplicity, the word involuntary is left implied in the tables. Their average number is also the arithmetic mean, as before. While, ideally, all tests within an experiment would yield the same results, not only is this unlikely due to non uniform system activity, down to hardware interrupts, but the timing information that the system has available is not perfect [58]. In order to account for that, experiments were repeated five times, each being a sample since they cannot be considered to be a complete population of all results possible, from where standard deviation is typically calculated. Instead, the standard deviation is being calculated here for a sample of the population, which is as follows. First take the mean, then get the square of the difference between each sample and the mean, and sum all of them. Next, divide this by the number of samples less one and finally take the square root. In this case, there were five runs so there are five averages and total times, for each test, which is the sample size. In order to visualize directly the overhead of increasing the number of processes, there is an entry called process scaling where the value from tests with group sizes two to four are compared directly to the results from those with group size one (equal values returning 100%). For group sizes from six to twelve, a correction factor was added to adjust for the fact that there are only four real processors, therefore the tests with group sizes of six, eight and twelve were compared respectively to one and a half times, two times and three times the result from the tests with group size one. 5.2 Bare metal In this section, the scenarios are without any kind of virtual machines being used, for a total of nine scenarios, in order to properly assert the effects of different system configurations, background profiles and process grouping. Then in section 5.3, normal virtual machines (VM) will be analyzed and in section 5.4 virtual machines with an over provisioning of virtual processors (XVM).

47 Host In this scenario the experiments are without background workloads or any change to the base configuration of the kernel and scheduler. It is the only one without percentage comparisons, i.e. is missing sub table (c), since the results presented here will be used as the first baseline. Because this scenario is used as a reference for other scenarios, then it has the trimmed entry, sub table (e), on both the average process time and total time tables, which can be compared directly to the main averages, sub table (a). Batch 2,942,406 2,942,509 2,943,480 3,110,878 4,455,865 5,845,068 8,781,444 Default 2,942,238 2,942,182 2,944,627 3,119,530 4,329,436 5,861,335 8,791,690 Fifo 2,941,860 2,942,191 2,943,117 3,051,480 3,558,556 3,697,134 5,095,238 Idle 2,942,744 2,942,864 2,943,305 2,995,306 4,398,724 5,799,226 8,772,944 Roundrobin 2,942,060 2,942,214 2,942,931 3,054,481 3,657,870 4,387,022 6,392,188 (a) Average time Batch , ,951 43,563 99,850 Default , ,577 16,660 31,524 20,800 Fifo , , , ,117 Idle ,819 6,895 42,043 32,015 Roundrobin , , , ,190 (b) Average time deviation Batch % % % % % 99.32% 99.48% Default % % % % 98.10% 99.61% 99.60% Fifo % % % % 80.64% 62.84% 57.73% Idle % % % % 99.65% 98.53% 99.37% Roundrobin % % % % 82.89% 74.56% 72.42% (d) Average time scaling Batch 2,942,433 2,942,635 2,943,369 2,993,087 4,370,674 5,852,465 8,820,542 Default 2,942,270 2,942,092 2,943,320 2,994,938 4,334,861 5,861,011 8,795,201 Fifo 2,941,780 2,942,355 2,943,080 3,046,249 3,571,019 3,778,236 5,160,394 Idle 2,942,967 2,943,180 2,943,380 2,991,302 4,399,668 5,823,032 8,792,956 Roundrobin 2,942,230 2,942,135 2,942,950 3,054,655 3,607,968 4,395,169 6,545,284 (e) Average time trimmed Table 2: Average process time on Host Comparing the trimmed measurements to the averages has a special meaning when following spikes in the deviation values, Table 2(b), those in the range of tenths of milliseconds with saturation Table 2(a), when the tests have the potential to use all processors for themselves (four processes), or in the microsecond range without, in the case of the default policy with

48 48 three and four test processes, as well as for the cases with the batch policy running with four, six and twelve test processes. Those spikes are a consequence from the fact that the operating system in itself also has tasks that have to be executed at some point, like a kernel system management task or a daemon service, interfering with the user processes, specially when there is no spare computational capacity available, hence process executions affected by them are actually valid results and must not be disregarded. As for the apparently abnormal increase in the average deviations for real time processes, FIFO and round robin, this is a consequence of how the processes are ordered for execution, with some processes finishing soon while other processes wait, until the time of the last process to be executed matches the total time, those that finish early skew the average downwards. Round robin being very similar to FIFO due to the small time scale limiting the effects of circling in the process queue. While, CFS scheduled processes (batch, default and idle) have average execution times, Table 2(a), near the total time since their execution times are balanced. Process scaling proportion, Table 2 (d), is quite straightforward, without saturation the difference is minimal, in the order of hundredths of a percentage point. At saturation point system overhead start showing in the order of single digit percentage points. Above the saturation point, when the correcting factor is being applied, real time processes start displaying their non preemptive nature where averages fall significantly from what would be the expected scaled value had they balanced their workload, notably FIFO to nearly half the value. Normal CFS based processes are close to one percentage point from the expected values. Comparing Table 3 to Table 2, that is, total times to average times, where on Table 2 it was identified that the measured averages had suffered from spikes, again the cases default with three and four test processes as well as batch with four and six test processes, show that something else from the system interfered, leading to a higher reading.

49 49 Batch 2,942,406 2,942,872 2,944,686 3,523,260 5,067,454 5,986,760 8,907,858 Default 2,942,238 2,942,590 2,947,216 3,567,090 4,629,090 5,988,398 8,913,614 Fifo 2,941,860 2,942,580 2,945,766 3,143,336 5,995,700 6,157,162 9,301,490 Idle 2,942,744 2,943,158 2,943,930 3,074,028 4,566,000 5,990,240 8,908,470 Roundrobin 2,942,060 2,942,522 2,943,508 3,060,958 5,761,694 6,395,304 9,478,998 (a) Total time Batch , , ,169 25,747 17,140 Default ,035 1,056,529 93,791 25,664 23,028 Fifo , ,018 21,485 16,496 74,008 Idle ,476 62,979 21,789 26,555 Roundrobin ,072 1,028, , ,541 (b) Total time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Batch 2,942,433 2,943,067 2,944,437 3,082,627 4,711,803 5,976,937 8,898,723 Default 2,942,270 2,942,353 2,944,167 3,087,407 4,679,187 5,984,747 8,897,960 Fifo 2,941,780 2,942,567 2,943,730 3,047,897 5,987,733 6,166,160 9,284,957 Idle 2,942,967 2,943,320 2,943,953 3,061,193 4,548,183 5,986,593 8,899,310 Roundrobin 2,942,230 2,942,470 2,943,467 3,067,727 5,334,910 6,449,700 9,391,030 (e) Total time trimmed Table 3: Total group time on Host Still looking at Table 3, from a process scaling point of view, the general cases without saturation have very similar results, which is expected since there are four processing cores present so there is always at least one processor available for system processes and therefore, whatever system activity wakes up can be executed without interfering with the test itself. If eight cores were available, the results for groups with one to seven processes would have been similar. When comparing the saturation point, taking into account the interferences that occurred, the same single digit percent point overhead, around five points, can be observed. Note that, while batch and default spikes resulted in an increase in total time of over ten percentage points than what would have been expected under normal circumstances, it can be seen that FIFO with six processes also suffered some interference where the resulting total time was almost thirty six percentage points greater than the expected value, despite that not being

50 50 obvious from the deviation numbers, meaning that the deviation alone, while a good indicator, is not enough to determine if a spike occurred or not. When looking into expected values for CFS cases under saturation a new pattern is observed. The difference in percentage points from the expected value decreases when the number of processes in the group increases, from where it was at saturation point. This may seem to be an unexpected result at first, but actually the more time it takes for all processes to be executed, the more the system overhead will be diluted on average by comparison. That is, a one millisecond overhead on six processes is more meaningful than the same overhead on twelve processes. This is a case where utilizing a very sensitive experiment is important, too much dilution could have made such system overhead to be indistinguishable from the process execution time itself. Real time cases display three distinct differences at this point; first, they take longer than CFS cases from the saturation point onwards, a side effect of the different scheduler structure; second, they do not show the decreasing pattern mentioned above, due to the cost of the different scheduler not scaling in the same manner; and lastly, given that the process average execution time is smaller than total execution time for the experiment, means that they are not balanced. Of note, the deviation results from Table 3 should indicate the total uncertainty here originating, as mentioned, from system processes, context switching and timer imperfection. As the measurements show, even for unsaturated groups in this best case scenario, the deviation is always in the order of hundreds of nanoseconds, meaning a precision in this order of hundreds or lower of nanoseconds is not feasible. Likewise, once the system gets saturated, the confidence limit (value plus or less the deviation) is in the order of tens of microseconds. Involuntary context switching numbers from Table 4 reveals three patterns: One for CFS batch and default process classes, where there is a constant increase up to the saturation point and then an order of increase above the saturation point, both as expected due to preemptions from wake up and timer ticks, the latter is a consequence of the longer total execution time; Another for CFS idle class process where, while similar until the saturation point, when above the saturation point there is a dramatic increase in the measurements, over twice than the readings for the previous cases up to almost an increase by two orders of magnitude from the

51 51 saturation point; And finally, the pattern for real time processes where there are almost none up to saturation point, a constant increase for round robin processes, and a very small increase of one or two events in total on average for FIFO class processes. Both result from how the processes are arranged for execution by the scheduler as round robin class processes circle around themselves and FIFO class processes are not preempted under most circumstances. Exceptions that can preempt a FIFO class process are non maskable interrupts (NMIs), like some hardware interrupts, and system management interrupts (SMIs). An obvious suspect here are timer interrupts since they are set for each millisecond and each process takes nearly three milliseconds, this theory will be further investigated when changing the timer in section In summary, points already useful for building the system model are that: groups of processes under a real time policy demand more time in total to finish; while average time of processes in groups of processes under a CFS policy is indeed close to the total time for each group, and; that the timer resolution reliability is at best half a microsecond for when there is spare processing capacity or above the order of tens of microseconds otherwise. Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 4: Average CS on Host Host with higher load This section's experiment was intended to validate scalability of the previous results, that is, if a fixed increase in the workload of the test processes would yield in a proportional increase in the execution times. For all other scenarios the load is always a constant in order to keep a direct

52 52 comparison possible. The previous experiment was repeated, under the same conditions, with the load being increased by 33%, a value that is not proportional to any group size, used in tests, hence the execution times are expected to be a third higher. Starting with Table 5(c) for comparison, all classes below the saturation point show a difference from the expected value of less than one tenth of a percentage point. From the saturation point above all CFS scheduled processes sustain around a one percentage point margin of difference from the expected value. For real time class processes, while one event might not have meant much due to the possible interferences, more likely to happen now since each process takes longer, every single measurement above the saturation point show a difference of above one percentage point, around three points for round robin and up to almost nine percentage points for FIFO scheduled processes. From the deviation results, Table 5 (b), real time processes deviation can be seen as difficult to predict since whereas an increase in deviation is expected given the process execution ordering, what was observed was that while worst case deviations are indeed higher, there were instances, like FIFO with eight processes or round robin with six processes, in which the measured value decreased in relation to the baseline. Meaning that it is difficult to differentiate between the impact of real time scheduling order and the impact of interferences, also that the measurements for their averages is less trustworthy than that of CFS scheduled processes since even CFS worst cases are better than the best cases for real time scheduled processes.

53 53 Batch 3,912,802 3,913,333 3,914,947 3,974,027 5,792,268 7,798,396 11,712,485 Default 3,913,028 3,912,991 3,914,819 3,984,060 5,777,892 7,803,365 11,777,035 Fifo 3,912,454 3,912,697 3,913,912 4,063,431 4,851,817 5,312,498 7,290,653 Idle 3,912,940 3,914,071 3,914,501 3,964,987 5,946,568 7,789,295 11,774,889 Roundrobin 3,912,834 3,912,830 3,913,761 4,096,187 4,886,893 5,997,302 8,528,878 (a) Average time Batch ,028 39,738 56,852 49,214 Default ,180 18,825 53,198 47, ,974 Fifo , , , ,407 Idle , ,754 29, ,321 Roundrobin ,550 72, , ,894 (b) Average time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (c) % relative to host Batch % % % % 98.69% 99.65% 99.78% Default % % % % 98.44% 99.71% % Fifo % % % % 82.67% 67.89% 62.11% Idle % % % % % 99.53% % Roundrobin % % % % 83.26% 76.64% 72.66% (d) Average time scaling Table 5: Average process time on Host with 33% higher load When observing the deviation in the measurements, Table 5(b), the same pattern can be observed, the overall values are slightly higher, but this can be easily explained through two factors. One is the accumulated normal uncertainty from the measurements, found previously to be in the order of hundreds of nanoseconds below the saturation point and tens of microseconds at and above the saturation point, accumulated along the increased execution time. And second, the higher chance an interference has to occur tough such influence is more diluted, the peak values are lower but they happen more often. Concerning average time scaling, Table 5 (d), again when below the saturation point the increase is below one tenth of a percentage point and from the saturation point above in the order of single digit percentage points for CFS scheduled processes. Likewise, real time scheduled processes have shown the effects of processes finishing earlier skewing the average

54 54 downwards. Batch 3,912,802 3,913,934 3,917,142 4,049,150 6,116,018 7,925,880 11,809,320 Default 3,913,028 3,913,272 3,916,334 4,071,708 6,102,722 7,907,672 12,436,238 Fifo 3,912,454 3,913,042 3,914,340 4,366,004 7,993,854 8,202,614 12,963,994 Idle 3,912,940 3,914,536 3,915,420 4,029,332 6,712,702 7,934,544 12,421,738 Roundrobin 3,912,834 3,913,226 3,914,184 4,101,862 7,048,008 9,017,082 12,566,566 (a) Total time Batch 640 1,438 2,618 9,855 47,290 35,057 9,807 Default ,996 41,627 72,734 15,930 1,394,625 Fifo ,019 13,416 15,344 1,386,766 Idle ,895 1,385,683 10,118 1,351,709 Roundrobin , , , ,059 (b) Total time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (c) % relative to host Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 6: Total group time on Host with 33% higher load Moving to Table 6, for CFS scheduled processes, looking at deviation measurements, Table 6 (b), like in the previous scenario, again when below the saturation point an increase in the order of microseconds means something interfered. When at or above the saturation point, except for round robin, in the order of hundreds of microseconds, also, the deviation reached the order of milliseconds in several instances, notably with twelve test processes when the experiments took between twelve and thirteen milliseconds. Those cases where the deviation is in the millisecond range show a remarkably similar value in this aspect, not only between themselves but also by being close to a 33% increase from the reference where the peaks (Table 3 (b), batch and default with four test processes and batch and round robin with six) were near

55 55 one millisecond. This is a significant result since a 33% increase in the load not only means that the likelihood of interferences to happen is higher, but also that those interferences are somehow related to the execution time itself, else the results would only have shown either more of the same one microsecond spikes or some two microsecond spikes instead. As for total time scaling, Table 6 (d), the results are very similar to those from the reference so there is no new conclusion. Batch Default Fifo Idle , , Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 7: Average CS on Host with 33% higher load Involuntary context switches in Table 7 are not compared directly to the reference in percentage points due to the numbers being small and hence the percent variation of even one context switch would be misleadingly magnified. That being said, the previous scenario patter repeats itself plus the expected 33% increase above the saturation point for CFS scheduled processes. Also, the increase for FIFO scheduled processes from the reference is very small, close to one preemption on average per experiment, but not zero, meaning whatever is preempting it is constantly doing so even if rarely. Of note is that the context switches themselves are not expensive, not even the thirteen hundred involuntary context switches on average for idle policy processes were enough to offset the cost of the less efficient FIFO policy on total time, even tough FIFO policy processes only suffered less than five involuntary context switches on average. Fortunately this aspect can be directly compared from the measurements (Table 6 and Table 7) both when there was no

56 56 discernible interference, with eight test processes, and in a peak interference case, twelve test processes. Context switches being cheap however is only in light of the system itself as the test program was designed to detect, however it is to be expected that this is not the truth for all possible user programs, therefore when profiling them for the model it is important to verify if the impacts of context switches will remain low HZ100 The purpose of this experiment is to reveal the impact of a different timer resolution, and scheduling interrupts by consequence, on the overall system behavior. Note that while changing the resolution implies changing normal timer granularity (like POSIX timers), it does not affect timers which do not rely on normal timer interrupts like hardware based timers TSC, HPET or ACPI. The expectation of this experiment is for measured times and the measured number of involuntary context switches to be lower, while however the deviation values might be higher since the number of preemptions will be fewer, but once they do occur their impact will be greater, tough the exact impact is unclear as context switches were found to be cheap.

57 57 Batch 2,939,460 2,939,720 2,943,969 3,022,523 4,340,549 5,846,378 8,757,627 Default 3,409,192 2,940,208 2,939,651 3,108,193 4,305,485 5,853,259 8,750,084 Fifo 2,939,468 2,944,192 2,940,387 3,041,256 3,520,096 3,929,955 5,216,663 Idle 2,939,558 3,174,394 2,940,627 2,987,506 4,373,337 5,830,031 8,664,699 Roundrobin 2,939,322 2,943,192 2,939,823 3,053,368 3,615,394 4,229,895 6,469,698 (a) Average time Batch ,914 26,379 66,591 48,753 42,224 Default 1,051,655 1, ,931 62,553 26,807 29,703 Fifo 140 3,508 1,854 6, , , ,735 Idle ,784 1,567 4,633 37,913 49, ,402 Roundrobin 396 4,229 1,719 24,691 61, , ,054 (b) Average time deviation Batch 99.90% 99.90% % % 99.31% 99.90% 99.29% Default % 99.94% 99.88% % 99.32% 99.87% 99.49% Fifo 99.92% % 99.91% 99.84% 98.57% % % Idle 99.88% % 99.91% 99.87% 99.40% % 98.54% Roundrobin 99.90% % 99.89% 99.96% % 96.24% 98.85% (c) % relative to host Batch % % % % 98.44% 99.45% 99.31% Default % 86.24% 86.23% 91.17% 84.19% 85.85% 85.55% Fifo % % % % 79.84% 66.85% 59.16% Idle % % % % 99.18% 99.17% 98.25% Roundrobin % % % % 82.00% 71.95% 73.37% (d) Average time scaling Table 8: Average process time on Host with HZ100 timer From the measurements in Table 8 the most obvious numbers that come to attention are those for default with one test process, idle with two and default with four. From what was observed in the previous sections this is a clear indication of the one millisecond interference happening, now it can be seen clearly on averages as this is now a known value and they match each other when accounting for the dispersing effect caused by the average, can be seen in full when group size is one since the average of one is it's own value. Unfortunately this means that all scaling, Table 8 (d), values for default policy processes in this section will be affected since the baseline is affected. When comparing to the host, Table 2 (c), it can be observed that the expected overall gains are around a tenth of one percentage point for up to the saturation point and between half and one and a half percentage point when above. Scaling, item (d), closely resembles the host

58 58 reference scenario, except for the default policy as explained. Batch 2,939,460 2,939,872 2,953,258 3,150,690 4,749,188 5,986,554 8,907,734 Default 3,409,192 2,941,112 2,940,420 3,550,420 4,708,740 6,021,460 8,932,370 Fifo 2,939,468 2,947,334 2,943,400 3,202,352 5,974,848 6,346,816 9,229,986 Idle 2,939,558 3,409,700 2,943,590 3,073,804 4,596,712 5,968,818 8,894,182 Roundrobin 2,939,322 2,945,308 2,941,488 3,139,546 5,211,370 6,830,320 9,434,070 (a) Total time Batch ,888 61, ,310 23,008 16,484 Default 1,051,655 3, , ,430 48,671 24,182 Fifo 140 4,593 5, ,552 8, ,328 26,065 Idle 333 1,051,516 4,486 19,826 96,359 24,574 21,109 Roundrobin 396 5,591 4, ,778 94,434 71, ,470 (b) Total time deviation Batch 99.90% 99.89% % % % % % Default % 99.96% 99.87% % % % % Fifo 99.92% % 99.99% % 99.78% % 99.41% Idle 99.88% % 99.99% % % 99.70% 99.94% Roundrobin 99.90% % 99.93% % 97.68% % % (c) % relative to host Batch % % % % % % % Default % 86.27% 86.25% % 92.08% 88.31% 87.34% Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 9: Total group time on Host with HZ100 timer From Table 9 (c), it can be seen that most measurements when below the saturation point show a gain, but only in the order of a tenth of a percentage point. However, when above the saturation point this trend reverses even going as far as two and five percentage point increases. This pattern when above the saturation point can be explained through sub-optimal balancing given that context switches are cheap. Deviation measurements reveal that spikes are more frequent, below the saturation point there are six instances in the order of microseconds, one in the order or tens of microseconds and two in the order of milliseconds, against only three in the order of microseconds and none above for the host reference scenario. On the saturation point and above, the pattern is similar for both CFS and real time policy processes.

59 59 Scaling, Table 9 (d) also is similar to the host reference scenario, the additional cost of saturating the system is around four to eight percentage points for most cases, again decreasing for greater groups, this time also for real time policy processes. Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 10: Average CS on Host with HZ100 timer Looking into Table 10, as expected, involuntary context switches for non real time processes are lower when above the saturation point, notably for idle policy processes where the difference is four fold, albeit that not being enough to make more than a one percentage point difference in total time from the host, even accounting for a decrease (six test processs). Real time policy processes involuntary context switches are largely unaffected by the change, but this is not really surprising due to their non preemptive nature. Of note, this decrease in involuntary context switches have the potential of being meaningful if their cost is not as insignificant as is here, a model has to account for this when costly switches are found when profiling Debug This was a simple test intended to show the overhead of enabling scheduler debug options when compiling the kernel, needed for some runtime switches like the CFS parameters, the only really meaningful results here are for total time relative to host and involuntary context switches, but for the sake of consistency all results are presented keeping the same format. The averages in Table 12 look indistinguishable from the original settings. Same patterns and nearly

60 60 the same values given the uncertainties, as expected. Total time measurements as shown in Table 11 are even closer, relative changes being in the order of hundredths of a percentage point below the saturation point and in the order of tenths of a percentage point above, except on a few spikes. Even involuntary context switches (Table 13) are similar. If there is anything of note in this scenario is that the effects of enabling debug on performance are nearly non existent given the uncertainties. An administrator or system designer only needs to decide between the security of not exposing those interfaces and the benefits of being able to manage said interfaces if any. Batch 2,942,520 2,942,592 2,946,210 3,085,862 4,699,358 5,975,076 8,906,458 Default 2,941,918 2,942,714 2,945,134 3,093,828 4,606,912 5,964,672 8,897,618 Fifo 2,942,054 2,942,512 2,943,110 3,080,690 5,995,914 6,164,382 9,272,658 Idle 2,941,974 2,942,794 2,944,156 3,061,022 5,001,986 5,980,490 9,371,992 Roundrobin 2,941,738 2,942,344 2,943,504 3,159,924 5,219,850 6,689,070 9,427,602 (a) Total time Batch ,649 33, ,351 22,169 15,528 Default ,017 26,579 96,074 8,151 16,885 Fifo ,902 11,290 33,769 43,798 Idle , ,214 27,366 1,024,310 Roundrobin , , , ,678 (b) Total time deviation Batch % 99.98% % % 99.74% 99.97% % Default 99.99% % % % 98.46% 99.66% % Fifo % % 99.98% % % 99.97% 99.87% Idle 99.97% 99.98% % 99.99% % 99.90% % Roundrobin 99.98% % % % 97.84% % % (c) % relative to host Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 11: Total group time on Host with debug enabled

61 61 Batch 2,942,520 2,942,226 2,943,949 2,997,759 4,344,411 5,874,186 8,778,875 Default 2,941,918 2,942,493 2,943,593 3,004,792 4,344,102 5,874,126 8,786,054 Fifo 2,942,054 2,942,206 2,942,706 3,061,403 3,618,161 3,784,021 4,663,008 Idle 2,941,974 2,942,596 2,943,228 2,996,335 4,471,102 5,863,404 8,780,466 Roundrobin 2,941,738 2,942,109 2,942,884 3,057,131 3,640,173 4,332,506 6,209,997 (a) Average time Batch ,164 7,072 20,293 15,025 58,051 Default ,839 47,069 21,891 72,313 Fifo , , , ,563 Idle , ,553 26,780 90,882 Roundrobin ,816 99, , ,772 (b) Average time deviation Batch % 99.99% % % 99.40% % 99.53% Default 99.99% % % % % % 99.90% Fifo % 99.99% 99.99% % % % 90.36% Idle 99.97% 99.98% 99.99% % % % 99.86% Roundrobin 99.98% % % % % 98.57% 94.88% (c) % relative to host Batch % 99.99% % % 98.43% 99.82% 99.45% Default % % % % 98.44% 99.83% 99.55% Fifo % % % % 81.99% 64.31% 52.83% Idle % % % % % 99.65% 99.48% Roundrobin % % % % 82.49% 73.64% 70.37% (d) Average time scaling Table 12: Average process time on Host with debug enabled Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 13: Average CS on Host with debug enabled

62 Background Background is important in order to figure out how the scheduling policies interact. Here each of the additional three background processes have distinct session identifications while each experiment group has one. This is where the policies show their real purpose of prioritization, real time processes being largely unaffected while idle processes are the most affected. Of note, idle policy processes only finished at all because the system was intentionally not saturated by the background. Batch 3,022,344 5,882,688 8,720,627 11,615,334 17,650,789 20,974,979 30,972,301 Default 3,016,258 5,880,947 8,728,579 11,773,467 17,049,411 23,476,040 34,430,481 Fifo 2,942,518 2,942,647 2,942,927 3,020,389 3,409,138 3,834,917 5,080,976 Idle 3,042,780 5,840,759 8,539,301 11,689,453 17,473,223 22,473,161 30,968,903 Roundrobin 2,942,198 2,942,720 2,943,305 3,024,359 3,607,058 4,219,518 6,318,394 (a) Average time Batch 63,323 34, , ,375 56,780 2,813,613 3,983,958 Default 52, ,067 58, ,138 1,172, ,001 2,057,978 Fifo ,205 55, , ,017 Idle 55, , , , ,792 1,233,152 4,087,730 Roundrobin ,313 29, , , ,019 (b) Average time deviation Batch % % % % % % % Default % % % % % % % Fifo % % 99.99% 99.15% 95.47% % 98.46% Idle % % % % % % % Roundrobin % % % 99.01% 99.97% 96.00% 96.53% (c) % relative to host Batch % % % % % % % Default % % % % % % % Fifo % % % % 77.24% 65.16% 57.56% Idle % % % % % % % Roundrobin % % % % 81.73% 71.71% 71.58% (d) Average time scaling Batch 3,031,463 5,862,607 8,856,101 11,633,908 17,619,603 22,543,409 31,755,056 Default 3,017,803 5,926,097 8,737,252 11,848,076 17,518,931 23,526,596 35,317,158 Fifo 2,942,437 2,942,472 2,942,610 3,009,121 3,418,817 3,896,997 4,930,317 Idle 3,055,413 5,849,000 8,696,372 11,685,251 17,432,306 23,099,405 31,118,316 Roundrobin 2,942,353 2,942,798 2,942,811 3,025,408 3,549,255 4,149,254 6,402,659 (e) Average time trimmed Table 14: Average process time on Host with background

63 63 For the real time scheduler, as in Table 14, all results are very similar to the baseline, almost as if there was no background present. Idle on the other hand, which was expected to have the highest time measurements, actually matched batch and was even better than the default in some cases in absolute, but not when taking into account the deviation. This is a side effect of grouping, since there was one processor not in use by the background and the group's processes were under the same session identification, then CFS when balancing each different session identification processes equally divided by the processes in each group one quarter of the total capacity, which matched what was available. That the idle policy was close to both batch and default means it was actually a good balancing job under those circumstances. This is even true for the deviation in the measurements were the spikes were of similar magnitudes, specially given the already identified interferences in the range of hundreds of microseconds or one microsecond and the longer total execution time. Measurements in relation to the baseline case (Table 14 to Table 2 (c)), further reinforce those findings. Real time policy processes averages up to the saturation point (in the aspect that the experiment would saturate the system by itself, all experiments with background actually saturate the system from group size one already) were close to one percentage point different from the host (Table 2) above the saturation point most measurements point to decreases for the averages of up to almost five percentage points. However given the overall increased baseline deviation carried by those measurements they can still be considered to be similar. As an example, take the case where the difference was the highest, FIFO with six processes, given the deviation the host background measurements can be said to be in the range from nanoseconds to ns, while with background from ns to ns, and those ranges overlap, meaning that while looking only to the averages they may seem different, in truth, due to the uncertainty, one can not really differentiate between both. When examining the CFS policy processes averages, in relation to the host for up to the saturation point, there is a linear increase, as expected, since they are being treated as if there was only one processor available. Above the saturation point, even the host baseline scenario starts to demand more processing capacity than available. And as a consequence, both increase proportionally keeping the four hundred percentage points measured range given the higher

64 64 uncertainties. Scaling measurements, Table 14 (d), are similar. Real time policy processes behave as if there was no background. CFS policy processes scale up to a measured four hundred percentage points given that there is no correction factor up to group size four and then the much higher uncertainties start pushing the values away instead, notably for cases idle and batch with group sizes eight and twelve with up to four milliseconds of deviation. From the measurements of total time (Table 15) compared to the host reference, Table 3 (c), it can be observed that while on average real time scheduled processes seemed close to the host reference, here, except for when there is only one process which is when the system effectively gets saturated, real time policy processes take actually close to ten percentage points longer to finish, with some spikes around twenty percentage points like when there were four test process on the reference. As such, it is no surprise that the scaling, Table 15 (d), is much worse than in the host reference, tough even there (Table 3) there were spikes with six test processes for both round robin and FIFO around thirty to thirty five percentage points and here, in the same case, around thirty five to thirty eight percentage points.

65 65 Batch 3,022,344 5,929,074 8,813,664 11,719,152 17,680,780 21,542,252 31,656,688 Default 3,016,258 5,911,844 8,818,418 11,824,262 17,227,872 23,553,640 34,621,868 Fifo 2,942,518 3,383,100 3,355,578 3,778,412 6,095,766 6,671,460 9,495,392 Idle 3,042,780 5,871,578 8,751,134 11,767,128 17,604,774 22,922,058 31,881,788 Roundrobin 2,942,198 3,206,336 3,524,608 3,768,978 5,988,190 6,819,020 9,697,210 (a) Total time Batch 63,323 24, ,456 88,617 43,926 2,157,444 3,395,172 Default 52, ,001 43,568 96, , ,724 1,718,909 Fifo , , ,474 98, , ,052 Idle 55,059 95, , , ,927 1,020,660 3,256,396 Roundrobin , , ,932 1,092, , ,434 (b) Total time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (c) % relative to host Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Batch 3,031,463 5,919,014 8,877,467 11,717,817 17,655,920 22,739,840 32,381,373 Default 3,017,803 5,944,837 8,834,090 11,859,833 17,555,810 23,579,607 35,362,830 Fifo 2,942,437 3,274,293 3,204,080 3,939,327 6,145,747 6,629,243 9,460,403 Idle 3,055,413 5,898,840 8,832,663 11,752,170 17,569,330 23,424,587 31,969,843 Roundrobin 2,942,353 3,195,170 3,342,723 3,721,540 5,589,370 6,819,637 9,704,933 Table 15: Total group time on Host with background When looking at the deviation in the measurements, Table 15 (b), for the real time scheduled processes the expected result of an effective saturation in the system yielding higher uncertainties is observed. However spikes are still in the order around hundreds of microseconds or one millisecond. Either in between the end of a process and the beginning of another in the same group or in one of the involuntary context switches the overhead of managing all active processes manifests itself resulting in a less than ideal replacement for proper resource allocation, that is, simply setting the policy to one of the real time ones in an already saturated system is to be expected to be about ten percentage points worse than moving

66 66 then to a dedicated system, plus the advantages of being able to use CFS given it's higher efficiency and balancing proprieties. For CFS scheduled processes, the total time measurements mirror those of the averages given the balancing and the de facto available capacity of one processor, meaning that both balancing strategies, one between the group and the background and another between the group itself are working as intended and with very low overhead given that in no case an increase of over three hundred percentage points is observed beyond the host reference. Batch , Default Fifo Idle , Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 16: Average CS on Host with background Recording that an important point wen evaluating Table 16 is the use of the setsid call by each of the three background processes so they were not grouped, while that was not the case for the test processes. Real time policy processes, as expected, remain largely unaffected even with group sizes of two or more. For CFS scheduled processes, one test process is similar to four at the host reference (Table 4), two test processes is similar to eight at the host reference since that is already a demand for double that of the available processing capacity, similarly between three test processes here and twelve at the host reference. Above that point, batch and default policies show a very similar behavior with a continuous increase of around one hundred to one hundred fifty involuntary context switches each step up to around the one thousand mark, the same as idle policy groups, besides a spike on the deviation for batch on group size eight. Idle policy groups from size two above already jump to the one thousand mark and remain there thorough signaling to a limit.

67 67 In summary, idle policy can be used for profiling in order to find the worst case when modeling if there is any spare processing capacity. Below that point, batch and default balancing will behave similarly on the throughput since their only difference is a wake up penalty for batch and from all that was observed so far such penalty in effect only would be detrimental to interactive processes. Also CFS balancing capabilities are very effective and efficient. Real time policies can be an alternative to having to turn on new processing capacity with an up to ten percent penalty on average given that the properties of their process ordering schemes are acceptable for the case under evaluation. Following subsections will concern themselves with the aspects of grouping and lastly the timer granularity Background without groups Tests done with process grouping were completely disabled through sched_autogroup_enabled at /proc/sys/kernel/. In all other aspects, the previous scenario was repeated. As such each individual process in each group scheduled under either batch or default CFS based policies is expected to be balanced equally with each background process while processes in groups under the idle and real time schedulers remain the same.

68 68 Batch 3,032,662 3,713,878 4,327,169 5,260,960 6,616,962 8,005,812 10,835,762 Default 3,030,264 3,778,059 4,432,747 5,112,395 6,637,075 7,952,422 10,794,561 Fifo 2,942,830 2,942,689 2,944,646 3,008,023 3,384,534 3,946,462 5,302,266 Idle 3,167,352 6,114,746 8,035,389 10,804,677 15,638,352 20,987,713 31,669,121 Roundrobin 2,943,040 2,943,087 2,944,320 3,036,395 3,588,067 4,185,116 6,254,252 (a) Average time Batch 70, , , ,098 66, , ,334 Default 54, ,621 99, ,075 39, , ,535 Fifo ,634 43, , , ,745 Idle 1,974 8, , ,601 2,440,346 1,761,931 2,459,139 Roundrobin ,289 27,127 72, , ,053 (b) Average time deviation Batch % 63.35% 48.86% 45.22% 37.55% 35.51% 34.12% Default % 63.75% 50.73% 43.15% 37.89% 33.80% 30.56% Fifo % % % 99.96% 99.00% % % Idle % % 92.40% 92.46% 89.71% 90.86% % Roundrobin % % % % % % 97.68% (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % 76.67% 67.05% 60.06% Idle % % % % % % % Roundrobin % % % % 81.28% 71.10% 70.84% (d) Average time scaling Table 17: Average process time on Host with background without groups From Table 17, as expected, idle and real time policies display the same behavior as with groups enabled. Also as expected, batch and default policies allowed each of their processes to compete on equal terms throughout, decreasing the proportional effect of the background as can be seen when in the comparison to the previous background scenario, Table 14 and Table 17 (c), notably with three test processes, where each gets close to half the time of the reference, as the three processes in the group balance equally with the three in the background, getting an aggregated half of the total processing capacity, which is equivalent to two processors, that is, twice of what was available on the reference. Also, deviations seen in Table 17 (b) are much smaller then in the previous scenario (Table 14) notably for bigger groups, a side effect of the lower total time diminishing the likelihood of spikes to occur.

69 69 Batch 3,032,662 3,928,514 4,624,682 5,502,826 6,857,306 8,962,440 11,296,086 Default 3,030,264 4,047,150 4,646,726 5,431,448 6,903,858 8,456,192 11,271,554 Fifo 2,942,830 3,070,692 3,466,546 3,934,086 5,994,154 6,166,202 9,269,320 Idle 3,167,352 6,116,602 9,043,640 11,954,808 17,760,868 23,380,916 35,207,478 Roundrobin 2,943,040 3,159,268 3,332,978 3,828,784 5,312,668 6,809,830 9,568,562 (a) Total time Batch 70, , ,900 84,588 58,757 1,380,574 97,839 Default 54, , , , , ,899 69,670 Fifo , , ,463 2,597 47,091 29,766 Idle 1,974 8,843 10,692 42,671 60, ,635 1,497,104 Roundrobin , , , , , ,551 (b) Total time deviation Batch % 66.37% 52.09% 46.96% 38.84% 39.41% 34.88% Default % 68.08% 52.60% 45.80% 39.33% 35.86% 31.87% Fifo % 93.78% % 99.87% 97.53% 93.02% 97.98% Idle % % % % % 99.81% % Roundrobin % 98.88% 99.71% % 95.05% 99.86% 98.59% (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 18: Total group time on Host with background without groups Measurements of total time (Table 18) are near 300 microseconds (in the order or 1/40) from the averages regarding batch and default policy groups, this reinforces the finding that the balancing is efficient. Real time policies are close to the reference by around two percentage points in most cases. Idle groups total time measurements however drifted away from the averages in this scenario, while remaining close to the reference but taking up to almost four percentage points longer to finish (excluding the spike with twelve test processes), when comparing Table 17 and Table 18 (c). Deviations can not explain this since those measurements are from the same data, meaning that when balancing between idle processes individually instead of in a group the system was able to fit in more of the background between each process than previously.

70 70 Looking at Table 19, as expected for real time scheduled processes, they almost do not suffer involuntary context switches. Batch and default policies are close to the host reference (Table 4) adding three in number of test processes of Table 19 to account for the background so one and three test processes can be directly compared to four and six in Table 4, while two Table 19 would have been like five in Table 4 for instance. Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 19: Average CS on Host with background without groups The most interesting measurements in Table 19 however are those for the idle policy. Except for two test processes, their behavioral pattern is the same as in Table 16, but with a lower peak value of around five hundred involuntary context switches. A conclusion drawn from this is that balancing twice, once for the group and then in between the group has a higher overhead in terms of context switches while total times were similar. Keeping in mind that this is due to the cheap nature of context switches for the processes used in the tests and for the system itself, were this aspect to be more costly those results would have been worse. Also, this reinforces that the observed divergence between total time and process averages is indeed in between processes instead of due to the amount of involuntary context switches Background also grouped This section's scenario is without the setsid calls on the background processes. This way the background group as a whole will compete with the test group for batch and default policies. The expected, is for each to receive half of the processing capacity. Real time and idle policies

71 71 should remain unchanged from the previous scenario. Surprisingly, Table 20 closely resembles Table 17, when the grouping was completely disabled. As a special case, the same data from both this scenario and the previous, with two groups and no groups respectively, will be reevaluated in relation to the host reference without background in tables 21 and 23. Then, from tables 20, 21, 22 and 23, it can be observed that removing the setsid call from the background processes did not work as intended, but had an effect equivalent to the removal of process grouping altogether, except for the idle policy. For idle scheduled processes there is a difference between Table 21 and Table 23 in that while total time measurements were similar, the process averages from group size three and above show that with two groups idle policy processes are finishing about twenty percentage points faster than without groups. Looking closer to the deviation measurements, item (b) from both Table 20 and Table 22, it can be seen that while total time measurements were also consistent to a maximum of two hundred microseconds, average process time measurements suffered a substantial increase up to thirty three hundred microseconds. Also, from the same tables, there was a gap between their averages and their total execution times like in the previous scenario (without groups). This implies that what was observed to have happened in between processes without groups enabled, when both are grouped happened during preemptions.

72 72 Batch 3,089,966 3,738,545 4,555,497 5,132,150 6,573,594 7,927,449 10,769,671 Default 3,111,948 3,878,900 4,415,645 5,144,362 6,576,008 8,053,811 10,797,010 Fifo 2,942,112 2,942,474 2,942,561 3,015,114 3,582,954 3,865,109 4,946,740 Idle 3,169,314 6,016,222 7,522,457 9,798,929 14,911,020 20,553,299 30,114,321 Roundrobin 2,942,324 2,943,668 2,942,555 3,027,983 3,541,132 4,139,907 5,889,995 (a) Average time Batch 54, , , ,527 88, ,162 65,457 Default 25, , , , , , ,333 Fifo , , , ,750 Idle 14,492 53, ,523 1,326,284 1,740,714 2,210,517 3,320,017 Roundrobin 153 1, , ,919 85, ,664 (b) Average time deviation Batch % 63.77% 51.44% 44.11% 37.31% 35.17% 33.91% Default % 65.45% 50.54% 43.42% 37.54% 34.23% 30.57% Fifo 99.99% % % % % 99.18% % Idle % % 86.50% 83.86% 85.54% 88.98% 96.77% Roundrobin % % 99.99% % 99.77% 99.77% 91.99% (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % 81.19% 65.69% 56.05% Idle % % % % % % % Roundrobin % % % % 80.23% 70.35% 66.73% (d) Average time scaling Table 20: Average process time on Host with background also grouped Batch % % % % % % % Default % % % % % % % Fifo % % % 98.75% 94.78% % % Idle % % % % % % % Roundrobin % % % 99.40% 99.45% 95.22% 95.55% No groups Batch % % % % % % % Default % % % % % % % Fifo % % 99.98% 98.98% % % 95.86% Idle % % % % % % % Roundrobin % % 99.99% 99.13% 98.15% 94.19% 89.99% Two groups Table 21: Average process time % relative to host

73 73 Batch 3,089,966 3,904,750 4,934,660 5,408,158 6,900,506 8,394,068 11,299,802 Default 3,111,948 4,032,302 4,586,854 5,365,914 6,973,312 8,520,152 11,325,922 Fifo 2,942,112 3,058,232 3,121,418 3,875,004 6,009,870 6,164,544 9,257,058 Idle 3,169,314 6,121,144 8,994,056 11,929,020 17,658,118 23,542,466 34,870,794 Roundrobin 2,942,324 3,164,512 3,083,228 3,816,158 5,380,028 6,767,222 9,713,922 (a) Total time Batch 54, , , , ,878 60,205 41,336 Default 25, , , ,542 84,907 53,854 71,676 Fifo , , ,568 33,775 13,415 35,111 Idle 14,492 13, ,350 96, ,464 69, ,480 Roundrobin , , , , ,980 87,597 (b) Total time deviation Batch % 65.97% 55.59% 46.15% 39.08% 36.91% 34.90% Default % 67.83% 51.92% 45.24% 39.72% 36.13% 32.03% Fifo 99.99% 93.40% 97.42% 98.37% 97.79% 92.99% 97.85% Idle % % % % % % % Roundrobin % 99.04% 92.24% % 96.25% 99.23% % (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 22: Total group time on Host with background also grouped Batch % % % % % % % Default % % % % % % % Fifo % % % % % % 99.83% Idle % % % % % % % Roundrobin % % % % 99.58% % % No groups Batch % % % % % % % Default % % % % % % % Fifo % % % % % 99.97% 99.70% Idle % % % % % % % Roundrobin % % % % % % % Two groups Table 23: Total group time % relative to host

74 74 Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 24: Average CS on Host with background also grouped Given the uncertainties both Table 19 and Table 24 can be said to be equivalent reinforcing the findings that the removal of the setsid calls did not have the intended effect. Also, for idle policy processes, given that the amount of preemptions remained equivalent, it can be said that said preemptions were more costly instead of there being more of them Background with saturation All other tests were executed without fully utilizing the processing capacity by the background processes, in this scenario the number of background processes in the experiment was doubled to six process in order for them to saturate the system. The expected is for the execution time to be 1.75 times higher for CFS policy processes relative to the reference background scenario due to the change from one fourth of the time allocation to one seventh, while real time policy processes remain unaffected. Of note, idle tests would never finish while all background processes were still running so they are not present in this scenario.

75 Batch 5,475,410 10,145,492 14,958,593 20,596,476 30,324,891 39,597,405 59,708,150 Default 5,012,272 10,233,773 14,869,352 20,133,998 30,156,113 39,569,903 60,158,604 Fifo 2,943,112 3,179,099 2,943,030 3,012,333 3,438,352 3,828,511 4,878,061 Roundrobin 2,942,680 2,942,841 2,942,853 3,028,832 3,531,866 4,135,384 6,065,484 (a) Average time Batch 548, , , , , , ,892 Default 269, , , , , , ,660 Fifo 1, , , , , ,807 Roundrobin ,054 78, , ,020 (b) Average time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % 98.24% 98.94% Roundrobin % % % % 99.51% 99.67% 94.73% (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % 77.88% 65.04% 55.25% Roundrobin % % % % 80.01% 70.27% 68.71% (d) Average time scaling Table 25: Average process time on Host with six background processes instead of three 75 As expected, from Table 25 (c) it can be seen that real time scheduled processes remained unaffected. Batch scheduled processes were around the expected 175% mark, given the deviation in the measurements, even in spikes like with one test process, equivalent to ten percent of the measured value. Strictly speaking, some readings were not in the 175% range like with three and twelve test processes where the deviations were quite small in relation to other group sizes, but one was below the expected and the other above. Default policy processes however were consistently below the expected, closer to 170% instead of 175%, even the closest case, six test processes, where the deviation was 1.38% of the measured value, did not reach the expectation. Scaling measurements for default policy processes increased linearly as expected while unfortunately batch with one test process, the reference for scaling suffered a spike, dragging the other values lower than they would have been normally, tough the linear aspect of the increase can still be seen.

76 76 Batch 5,475,410 10,258,290 15,152,810 21,368,172 31,309,848 40,010,884 60,359,100 Default 5,012,272 10,320,432 15,081,740 20,421,992 30,556,024 40,060,544 60,807,874 Fifo 2,943,112 3,573,148 3,288,890 3,727,698 5,981,122 6,521,522 9,462,372 Roundrobin 2,942,680 3,117,902 3,413,846 3,580,178 5,579,146 6,893,192 9,801,940 (a) Total time Batch 548, , ,569 1,914,410 2,062, , ,312 Default 269, , , , , , ,215 Fifo 1, ,825 94, ,981 28, , ,056 Roundrobin , , , , , ,434 (b) Total time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % 94.63% 97.32% 98.38% % Roundrobin % 97.58% % 96.20% 99.82% % % (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 26: Total group time on Host with six background processes instead of three Total time measurements (Table 26) show the same picture from the averages overall. Batch scheduled processes were found to have suffered more from spikes, notably with one, four and six test processes, while default measurements were more consistent. Also, for total time measurements, default scheduled processes were close to the expected 175% of the reference, but still slightly lower (from one to nine percentage points lower). Once more CFS proved to be reliable in balancing given process averages were close to total group time.

77 77 Batch , , , Default , , , Fifo Roundrobin (a) Average CS Batch Default Fifo Roundrobin (b) Average CS deviation Table 27: Average CS on Host with six background processes instead of three When observing the involuntary context switches measurements in Table 27 it can be seen that indeed real time scheduled processes were unaffected. Batch and default scheduled processes have shown very similar measurements, as a result it can be inferred that the differences observed regarding the expected difference from the reference are due to what happens after the switches instead Background HZ100 For the last scenario on the host, background testing with 100Hz timer resolution. From the previous HZ100 test the main difference is expected to be from involuntary context switches and in order to investigate if with background they have a more meaningful impact.

78 78 Batch 3,083,882 5,775,066 8,635,349 11,518,796 15,787,377 22,397,051 32,086,501 Default 3,054,020 5,764,996 8,653,186 11,450,901 15,564,118 22,563,898 31,984,378 Fifo 2,938,968 2,939,334 2,941,377 3,017,899 3,459,253 3,676,537 4,860,414 Idle 3,065,134 5,607,973 8,466,212 11,334,659 15,298,476 21,557,192 30,254,348 Roundrobin 2,939,302 2,941,142 2,940,657 3,018,861 3,397,205 3,942,966 6,046,477 (a) Average time Batch 17, , , ,879 2,155,445 1,215,922 1,590,668 Default 39, , , ,032 1,597,186 1,374,238 3,733,829 Fifo ,711 24, , , ,775 Idle 9,534 81, , ,947 1,352,559 1,640,080 1,177,296 Roundrobin 765 3,398 3,160 23, ,636 86, ,268 (b) Average time deviation Batch % 98.51% 97.51% 99.01% 89.60% 99.35% % Default % 97.28% 99.04% 96.65% 88.84% 95.91% 90.56% Fifo 99.88% 99.89% 99.96% % % 94.34% 98.58% Idle % 95.88% 97.35% 97.00% 87.76% 93.32% 97.22% Roundrobin 99.90% 99.94% 99.93% 99.78% 95.72% 95.03% 94.44% (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % 78.47% 62.55% 55.13% Idle % % % % % % % Roundrobin % % % % 77.05% 67.07% 68.57% (d) Average time scaling Table 28: Average process time on Host with background and HZ100 timer From Table 28 (b) it can be seen that there were spikes at nearly all CFS cases except with group size one. From item (c) it would seem clear that there was an overall decrease in relation to the reference, but given the increased deviation in the readings that has to be taken cautiously, tough as the margins of error go both ways and the values are ether lower or very close to the 100% mark then indeed the gains can be confirmed.

79 79 Batch 3,083,882 5,839,272 8,741,942 11,682,088 16,267,268 22,815,352 32,808,194 Default 3,054,020 5,854,942 8,795,858 11,648,812 16,186,192 22,881,474 33,287,684 Fifo 2,938,968 3,178,220 3,380,974 3,607,096 6,166,044 6,732,456 9,840,248 Idle 3,065,134 5,832,912 8,709,036 11,544,548 15,894,902 22,044,470 31,929,458 Roundrobin 2,939,302 3,168,774 3,455,052 3,703,524 5,846,030 6,973,134 10,009,600 (a) Total time Batch 17, , ,753 89,632 1,736, ,769 1,284,800 Default 39, , , ,880 1,247,794 1,018,318 3,426,300 Fifo , , , , , ,200 Idle 9,534 59,033 93, ,410 1,020,649 1,316,364 1,668,081 Roundrobin , , , , , ,681 (b) Total time deviation Batch % 98.65% 98.47% 99.70% 92.13% % % Default % 98.49% 99.57% 98.22% 92.20% 97.04% 94.13% Fifo 99.88% 97.07% % 91.57% % % % Idle % 98.88% 98.60% 98.23% 90.47% 94.11% 99.87% Roundrobin 99.90% 99.17% % 99.52% % % % (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 29: Total group time on Host with background and HZ100 timer Total time measurements in Table 29 have consistently lower deviations, this is as expected since interferences are expected to be less frequent, but more costly once they happen, therefore impacting the process averages more seriously instead of the total time. Also, relative to the background reference scenario, the gains for CFS scheduled processes are grater than they were without background (Table 9), consistently close to one or two percentage points up to with four test processes, from three to nine percentage points with six to twelve test processes and never taking two percentage points or more longer.

80 80 Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 30: Average CS on Host with background and HZ100 timer Indeed as expected, involuntary context switching numbers are significantly lower than the reference capping at a three times lower threshold of three hundred involuntary context switches on average instead of from nine hundred to one thousand. 5.3 Virtual Machines Same tests than in the host, but within a normal virtual machine using the parameters discussed back at the virtual machines section 3.5. Overall the results are expected to mirror those of the host plus an overhead. The expected interesting points of investigation are the combined effects of both operating systems working at the same time, one on top of the other, and how process grouping is affected given the virtual machines themselves can be seen as a container for the processes inside them VM Reference This is the baseline scenario for virtual machines like the host reference was in the previous section. Here the main point of interest is the overhead in relation to the host and to confirm the behavior of the policies. From Table 31, processes are taking more time as a consequence from the additional virtual machine overhead, of around four percentage points, and behaving the same for all policies up to group size four, the saturation point. Above the saturation point, the same pattern remains under all policies but FIFO, where the expected increase on the average does not take

81 place, they are balanced instead, so much so that for group size twelve the average time was just 60,23% that of the host. From Table 33, it can be seen that the amount of involuntary context switches (from the VM point of view) remained like they were in the host reference scenario, for all policies. It can be concluded that such balancing is taking place beyond the virtual machine either at the hypervisor or due to the host scheduler. Batch 3,040,794 3,054,350 3,050,950 3,086,964 4,426,809 6,124,750 9,248,018 Default 3,029,448 3,045,313 3,070,009 3,098,023 4,411,369 6,115,787 9,228,269 Fifo 3,040,496 3,069,566 3,082,002 3,129,257 3,148,147 3,362,652 3,108,191 Idle 3,031,732 3,031,410 3,076,541 3,103,165 4,554,427 6,106,691 9,179,029 Roundrobin 3,047,592 3,066,228 3,347,943 3,599,054 4,115,190 4,778,000 6,143,578 (a) Average time Batch 12,516 31,521 24,962 9, ,592 21, ,906 Default 9,305 49,095 15,488 9, ,759 47,750 38,504 Fifo 21,851 36,730 12,951 83,582 5, ,380 89,203 Idle 10,341 24,393 42,958 19,154 36,598 37,802 22,988 Roundrobin 13,665 49, , , , , ,245 (b) Average time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % 88.16% 89.00% 60.23% Idle % % % % % % % Roundrobin % % % % % % 93.86% (c) % relative to host Batch % % % % 97.05% % % Default % % % % 97.08% % % Fifo % % % % 69.03% 55.30% 34.08% Idle % 99.99% % % % % % Roundrobin % % % % 90.02% 78.39% 67.20% (d) Average time scaling Batch 3,036,110 3,062,702 3,066,663 3,092,056 4,462,615 6,137,243 9,314,136 Default 3,028,333 3,052,973 3,078,993 3,104,261 4,487,626 6,147,423 9,246,428 Fifo 3,049,073 3,067,870 3,087,048 3,137,627 3,151,684 3,311,996 3,149,092 Idle 3,037,430 3,041,285 3,088,913 3,112,652 4,575,242 6,127,245 9,189,691 Roundrobin 3,050,853 3,072,810 3,086,309 3,376,704 4,065,818 4,691,267 6,028,902 (e) Average time trimmed Table 31: Average process time on a VM 81 When looking at Table 32 it can be observed that the overhead is indeed around four percentage points. Of note, spikes are were much more common for real time policies, FIFO

82 82 with four eight and twelve test processes, round robin with three, four, eight and twelve, batch with six and twelve and none under idle and default policies. Total time scaling was also consistent given the spikes. Batch 3,040,794 3,059,164 3,068,444 3,132,598 5,381,570 6,232,732 10,229,252 Default 3,029,448 3,048,722 3,081,944 3,141,528 4,838,374 6,214,564 9,324,948 Fifo 3,040,496 3,074,528 3,094,332 4,362,934 6,252,622 8,449,308 11,215,524 Idle 3,031,732 3,038,302 3,094,606 3,141,052 4,790,064 6,215,522 9,314,888 Roundrobin 3,047,592 3,072,900 3,978,420 4,645,444 5,597,234 7,272,876 10,719,908 (a) Total time Batch 12,516 31,239 28,757 30, ,592 31,688 1,294,642 Default 9,305 48,274 13,245 12, ,403 31,560 30,051 Fifo 21,851 38,587 19,265 1,574,098 15,998 1,156,315 1,379,590 Idle 10,341 25,044 58,173 30,358 69,090 23,880 17,948 Roundrobin 13,665 52,908 1,190,993 1,602, , , ,895 (b) Total time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (c) % relative to host Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Batch 3,036,110 3,067,763 3,087,733 3,152,487 5,006,093 6,252,907 9,449,803 Default 3,028,333 3,057,450 3,089,997 3,148,630 4,826,887 6,230,800 9,342,060 Fifo 3,049,073 3,071,840 3,102,417 4,191,063 6,259,783 8,853,397 11,409,643 Idle 3,037,430 3,052,220 3,101,413 3,149,517 4,805,783 6,228,680 9,327,643 Roundrobin 3,050,853 3,076,207 3,163,370 3,777,807 5,408,217 7,186,983 10,327,193 (e) Total time trimmed Table 32: Total group time on a VM

83 83 Batch Default Fifo Idle , Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 33: Average CS on a VM VM HZ100 Comparing HZ1K and HZ100 within the virtual machine repeats what was seen at the bare system with two additional points of interest. First, the difference was greater than in the host, from around one percentage point to around four percentage points relative to the respective references, due to the virtual environment overhead. Second, the FIFO balancing effect in the average time remained as observed in the previous scenario.

84 84 Batch 2,955,442 2,955,629 2,964,281 2,985,079 4,266,868 5,892,702 8,787,543 Default 2,956,184 2,957,350 2,958,216 3,000,225 4,271,560 5,923,026 8,887,420 Fifo 2,956,332 2,955,905 2,958,289 3,023,419 3,026,399 3,083,003 3,128,935 Idle 2,955,510 2,957,244 2,970,757 2,991,526 4,307,327 5,866,577 8,768,654 Roundrobin 2,955,876 2,954,649 2,962,293 3,176,887 3,795,729 4,199,523 6,004,724 (a) Average time Batch 2,587 2,694 13,403 5,428 82,630 18,262 35,555 Default 1,705 1, ,109 33,246 26,058 13,739 Fifo 837 2,255 3,447 54,789 4,094 7,087 85,661 Idle 1,929 2,750 32,108 9,777 49,415 44,283 23,375 Roundrobin 1,845 1,964 2, , , , ,155 (b) Average time deviation Batch 97.34% 96.50% 96.66% 96.54% 95.61% 96.02% 94.35% Default 97.62% 96.87% 96.08% 96.65% 95.19% 96.35% 96.12% Fifo 96.96% 96.35% 95.83% 96.36% 96.02% 93.09% 99.36% Idle 97.30% 97.24% 96.17% 96.11% 94.14% 95.75% 95.42% Roundrobin 96.89% 96.15% 95.98% 94.08% 93.36% 89.52% 99.60% (c) % relative to VM Batch % % % % 96.25% 99.69% 99.11% Default % % % % 96.33% % % Fifo % 99.99% % % 68.25% 52.14% 35.28% Idle % % % % 97.16% 99.25% 98.90% Roundrobin % 99.96% % % 85.61% 71.04% 67.72% (d) Average time scaling Table 34: Average process time on a VM with HZ100 timer

85 85 Batch 2,955,442 2,961,176 2,981,570 3,053,768 4,793,000 6,029,812 9,037,550 Default 2,956,184 2,962,312 2,965,006 3,072,576 4,841,576 6,027,432 9,010,528 Fifo 2,956,332 2,956,738 3,552,502 4,237,562 6,012,774 6,212,130 9,794,826 Idle 2,955,510 2,962,334 2,991,484 3,041,382 4,690,864 6,029,078 8,982,902 Roundrobin 2,955,876 2,955,694 2,967,334 3,401,554 5,440,838 6,859,860 10,102,008 (a) Total time Batch 2,587 5,522 20,542 17, ,674 28,739 24,233 Default 1,705 3,111 3,830 36, ,076 23,540 26,513 Fifo 837 2,615 1,319,907 1,535,949 15,885 39,804 1,042,763 Idle 1,929 7,214 57,001 19,745 65,535 41,561 25,063 Roundrobin 1,845 2,401 5, , , , ,548 (b) Total time deviation Batch 97.34% 96.53% 96.56% 96.87% 95.74% 96.43% 95.64% Default 97.62% 96.89% 95.95% 97.58% % 96.74% 96.45% Fifo 96.96% 96.25% % % 96.05% 70.17% 85.85% Idle 97.30% 97.06% 96.46% 96.57% 97.61% 96.80% 96.30% Roundrobin 96.89% 96.08% 93.80% 90.04% % 95.45% 97.82% (c) % relative to VM Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % 99.99% % % % % % (d) Total time scaling Table 35: Total group time on a VM with HZ100 timer Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 36: Average CS on a VM with HZ100 timer

86 Background within itself Background processes inside the same virtual machine where the tests are run. Standing out are in Table 38, the deviations in real time policies total time measurements. They were expected to increase, but not to the point of being in the same order of some measurements themselves, notably under saturation. Given those results, it has to be stressed that the usage of real time policies inside virtual machines with background must be done with great care since their timing is not reliable and even more, taking into account that there are not any deadline guarantees from within a VM given that the VM itself can be preempted. Those real time policies however maintained their effect in this scenario as can be seen from their total times, including the average time balancing for FIFO policy processes. As a special mention, round robin policy with twelve processes on both Table 37 and Table 38 displayed a significantly smaller deviation than with other process amounts. The average and total time for this seem out of line, but this is only due to the deviation being huge for the other cases, the ranges overlap. Overall real time policies were not consistent, for instance, the worst case was round robin with six processes, where the total time normal average was five milliseconds different (from thirteen to eighteen) when trimmed.

87 87 Batch 3,137,926 6,230,311 9,297,386 12,313,161 18,524,091 25,005,786 36,778,790 Default 3,108,750 6,211,307 9,321,021 12,468,284 18,370,087 24,661,277 37,087,647 Fifo 3,137,284 3,118,981 3,102,949 3,076,639 3,206,182 3,134,609 3,206,391 Idle 3,109,016 6,285,195 9,351,868 12,514,784 18,528,289 24,536,446 36,982,427 Roundrobin 3,109,020 5,852,238 8,693,119 11,289,374 10,856,151 13,439,010 4,413,607 (a) Average time Batch 21,999 41, , , , , ,392 Default 40,064 99, , ,717 79, , ,955 Fifo 22,050 42,412 33,686 26, ,534 51, ,337 Idle 35,045 54,751 37,263 70, , , ,256 Roundrobin 33,374 80, , ,241 6,751,301 8,781, ,593 (b) Average time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % 93.78% 80.44% 65.03% Idle % % % % % % % Roundrobin % % % % % % 68.93% (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % 99.42% 98.91% 98.07% 68.13% 49.96% 34.07% Idle % % % % % % % Roundrobin % % % % % % 47.32% (d) Average time scaling Batch 3,151,773 6,254,062 9,356,397 12,404,097 18,668,576 25,225,062 36,930,232 Default 3,136,013 6,258,440 9,393,382 12,512,229 18,421,499 24,788,312 37,287,932 Fifo 3,150,863 3,137,840 3,118,152 3,094,477 3,140,116 3,166,711 3,200,340 Idle 3,129,777 6,299,223 9,375,349 12,553,208 18,648,376 24,674,437 37,233,358 Roundrobin 3,124,990 5,889,070 8,643,121 11,176,432 15,343,138 11,546,650 4,571,543 (e) Average time trimmed Table 37: Average process time on a VM with background on the same VM

88 88 Batch 3,137,926 6,249,064 9,321,302 12,352,350 18,578,040 25,078,344 36,868,024 Default 3,108,750 6,218,190 9,345,320 12,500,998 18,403,720 24,707,372 37,165,768 Fifo 3,137,284 5,788,216 9,313,212 8,457,992 11,398,592 11,189,766 16,004,802 Idle 3,109,016 6,299,334 9,375,300 12,557,342 18,567,692 24,580,504 37,073,222 Roundrobin 3,109,020 6,176,924 9,384,444 12,412,768 13,704,522 17,856,874 10,585,004 (a) Total time Batch 21,999 45, , , , , ,401 Default 40, , , ,065 78, , ,848 Fifo 22,050 1,086, ,262 3,551,094 6,607,390 7,643,192 11,858,247 Idle 35,045 59,937 41,883 78, , , ,773 Roundrobin 33,374 66,360 96, ,764 6,813,719 9,484, ,781 (b) Total time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Batch 3,151,773 6,273,847 9,383,350 12,444,460 18,723,093 25,305,520 37,021,967 Default 3,136,013 6,268,537 9,420,173 12,553,897 18,454,440 24,840,820 37,372,230 Fifo 3,150,863 6,277,190 9,360,120 9,106,570 10,698,287 8,201,250 10,430,327 Idle 3,129,777 6,307,650 9,398,643 12,590,690 18,690,303 24,722,120 37,331,133 Roundrobin 3,124,990 6,217,990 9,430,697 12,343,297 18,676,370 15,448,287 10,776,680 (e) Total time trimmed Table 38: Total group time on a VM with background on the same VM

89 89 Batch , , Default , , Fifo Idle , , , , , , Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 39: Average CS on a VM with background on the same VM

90 Background on Host Tests executed within the virtual machine while backgrounds were executed at the host. This is important in order to verify how both schedulers work interact with each other. Batch 3,192,966 6,355,526 9,463,665 12,544,853 18,028,416 25,135,128 37,714,294 Default 3,193,398 6,374,968 9,463,731 12,558,793 18,149,052 25,241,791 37,717,022 Fifo 3,206,864 4,458,942 8,666,118 9,816,894 9,919,013 11,098,143 10,314,251 Idle 3,187,470 6,360,283 9,534,575 12,557,529 17,792,235 25,008,406 37,479,267 Roundrobin 3,204,648 6,360,952 9,302,529 10,889,994 14,741,327 19,863,005 26,694,053 (a) Average time Batch 8,169 26,699 40,566 79, , , ,694 Default 16,114 53,264 38,723 70, ,879 37, ,628 Fifo 21,533 1,763,462 1,828,032 2,587, ,014 2,062,747 1,611,383 Idle 11,032 47,262 85,199 50, , , ,642 Roundrobin 5,003 52, , , ,966 2,421,031 1,254,224 (b) Average time deviation Batch % % % % 96.57% 99.64% % Default % % % % 98.52% % % Fifo % % % % % % % Idle % % % % 95.41% % % Roundrobin % % % 97.44% 96.08% % % (c) % relative to VM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Average time scaling Table 40: Average process time on a VM with background on the Host

91 91 Batch 3,192,966 6,373,778 9,503,536 12,694,936 18,987,902 25,891,794 38,292,484 Default 3,193,398 6,414,792 9,497,388 12,619,588 19,158,042 25,659,176 38,187,230 Fifo 3,206,864 6,375,770 9,550,478 12,715,952 19,145,016 25,660,320 38,286,328 Idle 3,187,470 6,382,660 9,602,022 12,615,116 19,064,086 25,374,710 38,225,930 Roundrobin 3,204,648 6,401,864 9,594,102 12,754,304 19,016,906 25,803,068 38,214,562 (a) Total time Batch 8,169 39,828 55, , ,815 1,022, ,921 Default 16,114 63,476 49,840 82,357 77, , ,379 Fifo 21,533 47, ,562 85, , , ,923 Idle 11,032 53,850 76,761 67, , , ,637 Roundrobin 5,003 48, , , , , ,136 (b) Total time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (c) % relative to VM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 41: Total group time on a VM with background on the Host Obviously, the policies are not being forwarded to the host which is the one that matters, since while on average (Table 40) it would seem that the policies are being enforced, that is only from the VM point of view, all total times (Table 41) for all cases are very similar given the deviation. All cases have greater total times than if the background was on the VM itself. Involuntary context switches (Table 42) are very similar to the VM reference (Table 33), reinforcing the point that those are only from the VM point of view and do not include the VM being preempted.

92 92 Batch Default Fifo Idle , , Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 42: Average CS on a VM with background on the Host Background on another VM Similar to previous, but with the background on another VM in order to verify the interference of all schedulers involved.

93 93 Batch 3,058,134 3,241,899 4,723,049 6,089,438 8,417,155 12,150,583 18,163,755 Default 3,074,388 3,201,615 4,386,639 6,190,279 8,377,145 12,144,116 18,182,869 Fifo 3,049,016 3,276,581 4,347,877 4,645,963 4,925,865 5,543,839 5,114,561 Idle 3,061,518 3,451,726 4,425,741 6,089,354 8,499,576 12,238,816 18,024,332 Roundrobin 3,088,344 3,588,994 4,558,896 5,759,045 7,252,055 8,246,574 11,542,647 (a) Average time Batch 24, , ,409 54, , , ,833 Default 35, , , , ,087 62, ,725 Fifo 8, , ,144 1,046, , , ,527 Idle 25, , ,526 37, , , ,427 Roundrobin 53, , , , , ,356 1,109,302 (b) Average time deviation Batch 97.03% 51.84% 50.48% 49.09% 45.09% 48.17% 49.18% Default 98.03% 51.16% 46.70% 49.47% 45.47% 48.99% 48.76% Fifo 96.77% % % % % % % Idle 97.82% 54.80% 47.21% 48.51% 45.58% 49.60% 48.41% Roundrobin 98.83% 60.94% 52.75% 51.53% 47.27% 71.42% % (c) % relative to VM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % 90.91% 55.91% Idle % % % % % % % Roundrobin % % % % % % % (d) Average time scaling Table 43: Average process time on a VM with background on another VM Confirming with Table 44, the pattern is the same as when the background was on host (section 5.3.4) and the policy information was not being forwarded, but the values are very different. The closer scenario was on the host with background when grouping was disabled (section 5.2.6) but with a greater overhead. A special table was created for this scenario in particular, Table 46, comparing both cases where it can be observed that idle policy is not being penalized, real time policies are not being privileged and batch and default policies overheads increase from around one percent up to sixty five percentage points with twelve processes, excluding the case with two test processes where both policies registered a decrease of almost twenty percentage points in the execution time.

94 94 Batch 3,058,134 3,281,518 4,997,324 6,165,116 9,367,292 12,402,106 18,550,398 Default 3,074,388 3,245,272 4,651,758 6,280,400 9,315,490 12,360,752 18,529,826 Fifo 3,049,016 3,320,286 4,688,624 7,107,146 9,468,814 12,575,984 18,656,338 Idle 3,061,518 3,730,182 4,736,190 6,184,782 9,279,158 12,491,046 18,488,932 Roundrobin 3,088,344 3,822,628 4,727,962 6,291,954 9,453,812 12,983,220 18,809,070 (a) Total time Batch 24, , ,506 41,136 42,063 72, ,085 Default 35, ,793 86, , ,674 58,480 80,641 Fifo 8, , , , , , ,330 Idle 25,062 1,013,872 76,764 35,591 89, ,050 60,412 Roundrobin 53,908 1,294, , , , , ,154 (b) Total time deviation Batch 97.03% 52.30% 53.26% 49.54% 50.03% 49.01% 50.11% Default 98.03% 51.77% 49.38% 50.03% 50.48% 49.76% 49.58% Fifo 96.77% 52.89% 50.09% 78.04% 88.51% % % Idle 97.82% 59.14% 50.39% 49.12% 49.65% 50.53% 49.53% Roundrobin 98.83% 61.48% 50.13% 50.97% 50.62% 84.04% % (c) % relative to VM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 44: Total group time on a VM with background on another VM Batch Default Fifo Idle , Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 45: Average CS on a VM with background on another VM

95 95 Batch % 87.29% % % % % % Default % 84.74% 98.96% % % % % Fifo % % % % % % 96.46% Idle 96.66% 56.45% 55.08% 56.36% 54.35% 58.31% 56.91% Roundrobin % % % % % % % Average process time % relative to host Batch % 83.53% % % % % % Default % 80.19% % % % % % Fifo % % % % % % % Idle 96.66% 60.98% 52.37% 51.73% 52.24% 53.42% 52.51% Roundrobin % % % % % % % Total time % relative to host Table 46: VM with background on another VM relative to host with background without groups Background on Host grouped In order to verify the grouping assumption from the background in a different VM scenario (section 5.3.5) the experiment was repeated with the background processes being executed on the host grouped, that is, not calling setsid. The pattern is repeated, which confirms the assumption, but the execution time measurements are higher, which is contrary to the expected given the lack of overhead from the other virtual machine not present here.

96 96 Batch 3,130,392 3,932,879 5,806,283 6,708,848 9,346,411 13,199,497 19,675,134 Default 3,129,078 4,037,795 5,442,051 6,780,318 9,477,060 13,301,053 19,770,714 Fifo 3,132,314 3,892,030 5,747,157 5,181,201 5,876,873 6,177,723 6,397,121 Idle 3,127,014 4,204,278 5,585,665 6,882,202 9,325,812 13,112,235 19,473,571 Roundrobin 3,146,906 4,041,402 5,649,233 6,463,936 7,740,949 8,436,212 12,456,381 (a) Average time Batch 30, , , , ,844 72, ,769 Default 57, , , , , , ,432 Fifo 39, , , , , , ,064 Idle 23, , , , , , ,491 Roundrobin 32, , , , , , ,370 (b) Average time deviation Batch 99.32% 62.89% 62.06% 54.09% 50.06% 52.33% 53.28% Default 99.78% 64.52% 57.93% 54.19% 51.45% 53.66% 53.02% Fifo 99.41% % % % % % % Idle 99.91% 66.74% 59.58% 54.82% 50.01% 53.14% 52.30% Roundrobin % 68.63% 65.36% 57.84% 50.45% 73.06% % (c) % relative to VM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % 98.61% 68.08% Idle % % % % % % % Roundrobin % % % % % % % (d) Average time scaling Table 47: Average process time on a VM with background also grouped but on the Host The most interesting result here is that while grouping is much better than not grouping for the test processes, as expected, it is not only not the same as another on another VM, but is consistently slower, meaning the background processes at the host got some advantage or that both being on virtual machines leveled the playing field more evenly, depending on the point of view. From the increased deviation the conclusion is that running the background at the host is interfering more with the VM itself and by proxy with its internal processes. Again, involuntary context switches seem to be unaffected since they are from the VM point of view.

97 97 Batch 3,130,392 4,419,308 5,979,320 6,865,278 10,328,050 13,526,220 20,114,060 Default 3,129,078 4,329,050 5,710,210 6,915,314 10,536,448 13,601,110 20,257,094 Fifo 3,132,314 5,397,740 6,002,108 9,195,834 11,818,660 13,544,508 21,200,978 Idle 3,127,014 4,526,236 5,810,128 7,031,998 9,912,216 13,392,634 20,019,530 Roundrobin 3,146,906 4,473,506 6,086,336 7,402,998 10,603,332 14,003,482 20,917,672 (a) Total time Batch 30, , , , , , ,397 Default 57, , , , , , ,763 Fifo 39, , , , ,266 1,259,012 1,358,256 Idle 23, , , , , , ,045 Roundrobin 32, , , , , , ,947 (b) Total time deviation Batch 99.32% 70.44% 63.72% 55.17% 55.16% 53.45% 54.33% Default 99.78% 69.06% 60.62% 55.09% 57.09% 54.75% 54.20% Fifo 99.41% 85.99% 64.12% % % % % Idle 99.91% 71.76% 61.82% 55.85% 53.03% 54.17% 53.63% Roundrobin % 71.94% 64.54% 59.98% 56.77% 90.65% % (c) % relative to VM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 48: Total group time on a VM with background also grouped but on the Host Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 49: Average CS on a VM with background also grouped but on the Host

98 Extended Virtual Machines It is possible to change what the VM can perceive as the real system, notably the amount of processors. Extended virtual machines (XVM) are henceforth virtual machines with a fake number of processors higher than what is available at the host, for the following tests eight processors instead of four, leaving everything else the same as previously. The goal is to see how those fake processors are managed observing from the experiments point of view. This is particularly important for heterogeneous environments that use live migrations Plain XVM Even tough the XVM believes to have eight processors the measurements resemble those with four, as can be seen from real time behavior in Table 50. Comparing involuntary context switch measurements between Table 52 and Table 33 may seem at first as an indication that preemptions are less frequent, however those are only from the XVM point of view. Since the host also has to manage the additional virtual processors the end result is that the final impact on the execution time for the tests is greater as can be seen from Table 51 (c) showing an overhead like that of the normal VM of around four percentage up to the saturation point (four processes, when all real hardware resources are being utilized by the tests processes) and around ten percentage points above the saturation point. One point of note is, in Table 50, that FIFO scheduled processes are not balanced like they were within a normal VM. Also, both real time policies show a discrepancy from the host, as can be seen from Table 50 (d) when above the saturation point, due to being executed as if there were eight processors so the host balances those extra processes up to eight at a time instead of letting them wait for the four which would otherwise be holding the processor exclusively.

99 99 Batch 3,029,848 3,062,155 3,068,409 3,107,451 5,078,783 6,451,517 9,120,549 Default 3,036,332 3,061,543 3,049,978 3,111,282 4,982,318 6,453,599 9,172,772 Fifo 3,040,834 3,049,620 3,066,588 3,093,610 3,476,134 3,982,282 5,065,242 Idle 3,040,358 3,084,336 3,085,841 3,254,315 5,004,468 6,442,291 9,206,027 Roundrobin 3,052,160 3,071,091 3,078,063 3,270,372 4,318,856 4,911,487 5,793,564 (a) Average time Batch 10,002 33,927 21,505 24, ,810 41, ,675 Default 16,345 25,792 26,558 18,193 41,511 30,766 92,332 Fifo 20,514 13,505 16,269 49, , , ,903 Idle 14,939 28,781 12, ,057 39,780 28, ,941 Roundrobin 17,034 19,096 24, , , , ,419 (b) Average time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % 97.34% % 98.16% Idle % % % % % % % Roundrobin % % % % % % 88.52% (c) % relative to host Batch % % % % % % % Default % % % % % % % Fifo % % % % 76.21% 65.48% 55.52% Idle % % % % % % % Roundrobin % % % % 94.33% 80.46% 63.27% (d) Average time scaling Batch 3,028,517 3,063,503 3,078,540 3,122,954 5,005,108 6,458,925 9,176,948 Default 3,027,423 3,066,248 3,066,743 3,124,025 4,995,398 6,451,987 9,120,021 Fifo 3,041,627 3,052,788 3,076,377 3,100,005 3,429,717 4,067,328 5,056,079 Idle 3,033,197 3,084,323 3,093,508 3,154,538 5,002,779 6,422,739 9,216,542 Roundrobin 3,051,480 3,079,812 3,091,230 3,163,393 4,356,438 4,811,777 5,762,767 (e) Average time trimmed Table 50: Average process time on a XVM

100 100 Batch 3,029,848 3,067,146 3,083,960 3,143,120 5,577,728 6,634,628 9,957,776 Default 3,036,332 3,067,238 3,062,198 3,157,538 5,089,452 6,603,136 10,002,638 Fifo 3,040,834 3,053,338 3,077,216 4,924,560 6,607,932 7,997,710 10,527,994 Idle 3,040,358 3,094,592 3,101,900 3,664,018 5,126,246 6,602,326 9,948,702 Roundrobin 3,052,160 3,075,416 3,094,190 3,453,338 5,724,416 7,243,358 10,940,062 (a) Total time Batch 10,002 32,964 24,130 29,018 1,066,310 59, ,409 Default 16,345 28,296 27,154 26,059 31,237 40, ,861 Fifo 20,514 14,898 17,368 1,612, ,839 1,079, ,095 Idle 14,939 29,037 15,429 1,084,856 30,103 36,125 36,421 Roundrobin 17,034 17,679 21, , , , ,004 (b) Total time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (c) % relative to host Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Batch 3,028,517 3,069,277 3,088,527 3,159,347 5,097,077 6,629,883 9,991,667 Default 3,027,423 3,072,107 3,080,340 3,161,000 5,098,123 6,605,687 10,004,083 Fifo 3,041,627 3,060,097 3,087,793 5,109,800 6,562,300 7,886,593 10,361,143 Idle 3,033,197 3,092,923 3,111,073 3,218,723 5,133,257 6,609,153 9,950,733 Roundrobin 3,051,480 3,082,587 3,107,670 3,182,633 5,693,697 7,214,487 10,680,150 (e) Total time trimmed Table 51: Total group time on a XVM

101 101 Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 52: Average CS on a XVM

102 XVM HZ100 Yet again, comparing both HZ100 and HZ1K yield the same general pattern, this time with gains from two percentage points increasing with the number of test processes up to around five percentage points. Again due to the XVM believing that there are more processors available the number of involuntary context switches is lower in from the XVM point of view, which compounds the effect of the slower timer in decreasing the number of said involuntary context switches. Batch 2,982,104 3,003,492 3,001,038 3,036,892 4,725,396 6,249,820 8,792,828 Default 2,988,148 2,995,459 3,002,140 3,041,262 4,737,480 6,253,635 8,808,282 Fifo 2,989,426 3,008,418 3,012,502 3,007,593 3,326,018 3,749,007 4,109,678 Idle 2,988,712 3,008,936 3,033,268 3,174,408 4,703,871 6,285,766 8,902,708 Roundrobin 2,994,964 2,999,353 3,002,527 3,154,379 3,876,597 4,786,168 6,672,402 (a) Average time Batch 38,752 59,247 58,843 73, , , ,773 Default 48,755 60,015 58,222 76, , , ,547 Fifo 45,969 70,919 66,217 36, ,310 1,009, ,795 Idle 45,905 73, , , , , ,185 Roundrobin 50,239 61,622 54, , , , ,109 (b) Average time deviation Batch 98.47% 98.04% 97.48% 97.24% 94.41% 96.76% 95.81% Default 98.70% 97.69% 97.89% 97.35% 94.84% 96.93% 96.58% Fifo 98.28% 98.55% 97.92% 97.02% 96.98% 92.17% 81.28% Idle 98.53% 97.56% 98.05% % 94.03% 97.87% 96.59% Roundrobin 98.15% 97.39% 97.13% 99.72% 88.99% 99.47% % (c) % relative to XVM Batch % % % % % % 98.28% Default % % % % % % 98.26% Fifo % % % % 74.17% 62.70% 45.82% Idle % % % % % % 99.29% Roundrobin % % % % 86.29% 79.90% 74.26% (d) Average time scaling Table 53: Average process time on a XVM with HZ100 timer

103 103 Batch 2,982,104 3,013,640 3,009,542 3,084,824 4,875,928 6,413,210 9,689,490 Default 2,988,148 3,001,604 3,012,804 3,082,328 4,879,022 6,438,106 9,670,458 Fifo 2,989,426 3,010,604 3,023,022 4,817,776 6,343,874 7,005,830 10,630,348 Idle 2,988,712 3,017,476 3,068,172 3,591,824 4,873,748 6,864,290 9,706,994 Roundrobin 2,994,964 3,001,482 3,023,068 3,437,944 5,411,328 6,746,690 10,278,026 (a) Total time Batch 38,752 58,222 58,343 64, , , ,306 Default 48,755 61,054 57,072 78, , , ,548 Fifo 45,969 70,578 69,035 1,651, ,824 1,146,003 1,031,357 Idle 45,905 77, ,900 1,119, ,779 1,087, ,110 Roundrobin 50,239 61,832 68, ,905 84, , ,095 (b) Total time deviation Batch 98.47% 98.19% 97.44% 97.64% 95.66% 96.73% 96.98% Default 98.70% 97.71% 97.81% 97.51% 95.70% 97.46% 96.67% Fifo 98.28% 98.38% 97.90% 94.29% 96.67% 88.83% % Idle 98.53% 97.56% 98.62% % 94.94% % 97.55% Roundrobin 98.15% 97.37% 97.28% % 95.04% 93.52% 96.23% (c) % relative to XVM Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 54: Total group time on a XVM with HZ100 timer Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 55: Average CS on a XVM with HZ100 timer

104 Background within itself Since the XVM believes to have more processors while the number of background processes is the same the background processes share is scheduled as 3/8 of total CPU power instead of 3/4, easing their impact over tested processes when the host scheduler balances between all virtual CPUs. Batch 3,124,204 4,184,763 4,997,023 5,754,433 7,390,347 9,823,864 14,914,524 Default 3,110,396 4,260,602 5,003,838 5,794,106 7,462,243 9,895,420 14,973,849 Fifo 3,110,606 4,219,646 5,051,622 5,816,189 5,325,795 5,951,552 5,976,906 Idle 3,095,914 4,242,512 5,017,152 5,715,339 7,411,174 9,807,320 14,797,720 Roundrobin 3,137,610 4,196,442 5,041,569 5,692,581 6,322,667 6,907,302 6,979,712 (a) Average time Batch 41,650 41,933 49,967 54,466 86, , ,973 Default 20,440 47,723 58,219 81, ,416 70, ,415 Fifo 35,444 71,580 26,578 35, , , ,815 Idle 44,739 50,894 85,352 49,140 60, , ,033 Roundrobin 42,523 77, , , , , ,557 (b) Average time deviation Batch % 71.38% 56.42% 49.46% 41.94% 43.58% 46.97% Default % 71.90% 57.27% 48.90% 42.60% 42.06% 42.40% Fifo % % % % % % % Idle % 72.53% 57.69% 48.91% 42.51% 42.46% 47.55% Roundrobin % % % % % % % (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % 95.67% 64.05% Idle % % % % % % % Roundrobin % % % % % % 74.15% (d) Average time scaling Batch 3,139,100 4,200,985 5,032,411 5,771,960 7,411,827 9,831,272 14,848,307 Default 3,119,413 4,234,272 5,026,827 5,744,765 7,449,716 9,886,480 14,970,557 Fifo 3,126,500 4,218,375 5,035,198 5,795,749 5,230,524 5,948,119 5,992,854 Idle 3,127,170 4,253,883 5,032,743 5,715,300 7,409,159 9,827,029 14,758,144 Roundrobin 3,154,457 4,221,737 5,058,268 5,712,811 6,357,338 6,812,099 6,757,674 (e) Average time trimmed Table 56: Average process time on a XVM with background on the same XVM

105 105 Batch 3,124,204 4,239,070 5,072,178 5,906,736 8,424,714 10,626,048 15,915,906 Default 3,110,396 4,293,238 5,081,222 5,899,464 8,246,532 10,715,680 15,918,098 Fifo 3,110,606 4,247,058 5,127,650 5,928,742 10,071,848 11,674,372 17,381,662 Idle 3,095,914 4,291,308 5,109,798 5,871,040 8,291,982 10,686,796 15,851,584 Roundrobin 3,137,610 4,232,472 5,349,372 7,477,214 9,531,212 12,325,172 18,016,830 (a) Total time Batch 41,650 38,369 32,035 71, , ,658 93,653 Default 20,440 66,041 54,475 58, ,700 77, ,614 Fifo 35,444 77,936 41,760 35, , , ,056 Idle 44,739 53,904 85,706 44, , , ,041 Roundrobin 42,523 63, , , , , ,503 (b) Total time deviation Batch % 71.62% 57.14% 50.41% 47.72% 46.73% 49.15% Default % 72.22% 57.52% 49.74% 46.97% 45.44% 45.01% Fifo % % % % % % % Idle % 72.75% 57.85% 49.96% 47.20% 45.62% 49.58% Roundrobin % % % % % % % (c) % relative to host with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Batch 3,139,100 4,253,270 5,088,147 5,933,850 8,323,507 10,638,873 15,864,097 Default 3,119,413 4,260,807 5,109,180 5,881,907 8,240,550 10,700,717 15,883,797 Fifo 3,126,500 4,256,043 5,114,613 5,931,247 10,232,470 11,665,910 17,380,200 Idle 3,127,170 4,310,117 5,143,073 5,873,890 8,250,007 10,675,003 15,830,210 Roundrobin 3,154,457 4,266,120 5,126,737 7,357,610 9,791,080 12,030,187 18,034,600 (e) Total time trimmed Table 57: Total group time on a XVM with background on the same XVM

106 106 Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 58: Average CS on a XVM with background on the same XVM

107 Background on Host This scenario is another example proving that the policies are not forwarded to the host, like on the normal VM scenario in section 5.3.4, Table 59 the averages shows that within the XVM the policies do have an effect, but when looking to total times in Table 60 it becomes clear that they are not taken into account by the host with all cases yielding similar results. Batch 3,181,544 6,170,055 9,479,688 12,729,700 19,083,668 24,906,573 36,603,957 Default 3,164,712 6,314,840 9,532,008 12,695,884 19,113,149 24,959,331 36,262,956 Fifo 3,194,210 5,708,481 8,735,388 10,841,989 19,189,223 18,029,916 16,527,382 Idle 3,194,320 6,342,593 9,490,443 12,727,270 19,119,159 25,120,046 36,403,604 Roundrobin 3,182,850 6,360,535 9,290,282 12,386,651 17,624,147 22,192,508 28,667,778 (a) Average time Batch 22, ,014 28, ,928 70, , ,924 Default 23,736 30,960 81, , , , ,863 Fifo 21,483 1,415,288 1,870,088 2,523,996 90,887 4,560,399 2,990,106 Idle 18,433 50,663 50,596 52,777 83,020 52,788 95,734 Roundrobin 23,859 82, , ,920 1,971,720 2,008,666 2,217,425 (b) Average time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (c) % relative to XVM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Average time scaling Table 59: Average process time on a XVM with background on the Host

108 108 Batch 3,181,544 6,188,954 9,506,614 12,792,274 19,217,000 25,399,258 38,034,486 Default 3,164,712 6,325,918 9,583,240 12,768,304 19,344,536 25,403,768 38,028,556 Fifo 3,194,210 6,435,544 9,645,108 12,829,090 19,366,238 25,427,354 38,469,460 Idle 3,194,320 6,363,046 9,535,978 12,787,168 19,304,860 25,366,606 38,178,588 Roundrobin 3,182,850 6,388,132 9,567,956 12,835,244 19,252,386 25,582,756 38,450,796 (a) Total time Batch 22, ,846 15, ,639 69, , ,979 Default 23,736 32, , , , , ,280 Fifo 21,483 61,565 84, , , , ,132 Idle 18,433 61,917 59,253 80, , ,925 95,339 Roundrobin 23,859 77,848 77, , , , ,858 (b) Total time deviation Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (c) % relative to XVM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 60: Total group time on a XVM with background on the Host Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 61: Average CS on a XVM with Background on the Host

109 Background on another XVM While the XMV with background on host scenario (section 5.4.4) closely resembled the same scenario with VMs (section 5.3.4), when the experiments were executed within an XVM while the background was on another XVM the results diverged from the same scenario with Vms. Table 65 was created as a special case to show those differences, where below the saturation point measurements are as expected given the additional XVM overhead, but with four test processes or more the experiments take less time to finish, the greater the amount of test process the less time took in relation to the VM scenario, down to twenty two percentage points less time. This can only be explained combining the effects of the host balancing the virtual processors and the de facto grouping of background processes within the second XVM. Batch 3,066,976 3,520,654 4,577,573 5,780,170 7,661,743 9,242,860 13,313,390 Default 3,107,286 3,354,250 4,647,121 5,731,068 7,602,763 9,194,867 13,342,369 Fifo 3,108,772 3,276,079 4,723,280 4,544,861 4,976,002 5,985,852 7,297,909 Idle 3,077,146 3,292,275 4,576,747 5,701,121 7,515,418 9,230,955 13,263,153 Roundrobin 3,085,296 3,862,901 4,576,471 5,574,178 6,875,633 8,331,503 9,236,517 (a) Average time Batch 33, ,313 97, , ,204 40, ,452 Default 22, ,829 28, , , , ,947 Fifo 43, , , , , , ,595 Idle 18, , ,271 98, ,233 53, ,791 Roundrobin 24,033 1,132,960 89, , , , ,161 (b) Average time deviation Batch 97.70% 83.81% 90.96% % % 94.01% 89.66% Default 99.61% 79.22% 92.45% 99.76% % 93.00% 89.12% Fifo 99.43% 77.66% 93.81% 78.42% 95.13% % % Idle 98.40% 77.39% 90.94% 99.75% % 93.93% 89.87% Roundrobin 97.81% 91.50% 90.48% 97.57% % % % (c) % relative to XVM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % 96.27% 78.25% Idle % % % % % % % Roundrobin % % % % % % 99.79% (d) Average time scaling Table 62: Average process time on a XVM with background on another XVM Comparing to the reference XVM background scenario, Table 64 and Table 65 (c), the

110 110 gains are around ten percentage points, this is the portion of the gains related to the grouping of background processes. The additional gains come from the host scheduler taking into account five idle CPUs within the second XVM instead of one and then balancing across all sixteen virtual CPUs so when more than three virtual CPUs at the XVM with the test processes were needed, as were in use by the background, the host took that into account penalizing the XVM with the background processes. Batch 3,066,976 3,583,096 4,871,092 5,951,630 7,801,856 9,457,048 14,461,684 Default 3,107,286 3,519,526 4,916,572 5,872,026 8,130,114 9,421,682 14,823,576 Fifo 3,108,772 4,023,372 5,282,398 7,254,462 10,287,464 11,588,444 15,960,508 Idle 3,077,146 3,376,258 4,730,712 5,813,722 7,658,392 9,464,334 14,570,444 Roundrobin 3,085,296 3,977,334 4,828,438 6,206,944 8,809,784 11,431,146 16,175,662 (a) Total time Batch 33, , , , ,356 38, ,522 Default 22, , , ,157 1,016,326 93,309 1,184,520 Fifo 43,990 1,210,508 1,076, ,884 1,302, ,407 1,076,294 Idle 18, , ,948 48, ,545 54, ,876 Roundrobin 24,033 1,234, , , , , ,144 (b) Total time deviation Batch 97.70% 84.24% 95.73% % 93.73% 88.89% 91.16% Default 99.61% 82.60% 96.23% 99.83% 98.66% 88.05% 93.33% Fifo 99.43% 94.53% % % % 99.34% 91.83% Idle 98.40% 78.33% 91.98% 98.98% 92.83% 88.66% 92.04% Roundrobin 97.81% 93.23% 94.18% 84.36% 89.98% 95.02% 89.69% (c) % relative to XVM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 63: Total group time on a XVM with background on another XVM

111 111 Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 64: Average CS on a XVM with Background on another XVM Batch % % 96.92% 94.92% 91.03% 76.07% 73.30% Default % % % 92.58% 90.76% 75.71% 73.38% Fifo % 99.98% % 97.82% % % % Idle % 95.38% % 93.62% 88.42% 75.42% 73.58% Roundrobin 99.90% % % 96.79% 94.81% % 80.02% Average process time % relative to VM Batch % % 97.47% 96.54% 83.29% 76.25% 77.96% Default % % % 93.50% 87.28% 76.22% 80.00% Fifo % % % % % 92.15% 85.55% Idle % 90.51% 99.88% 94.00% 82.53% 75.77% 78.81% Roundrobin 99.90% % % 98.65% 93.19% 88.05% 86.00% Total time % relative to VM Table 65: XVM with background on another XVM relative to the same scenario with VMs Background on Host grouped This scenario is intended to verify how grouping on the host compares when taking into account the balancing of excess virtual processors. The results are unique, overall resembling those from the XVM background reference (section 5.3.3), particularly the pattern where the real time policies show a lower average but greater total time, but there are some differences, except for one test process where all measurements for both the averages and the total time are within less than one percentage point difference from the reference.

112 Those measurements seem very erratic without a clear relation to the increase in execution time for the test processes and the test processes amount, even displaying some decreases as can be seen from both Table 66 and Table 67 (c). Even the XVM with background on host not grouped, section 5.4.4, was consistent with all experiments of a given group being similar and with a proportional increase in execution time in relation to the amount of test processes. The conclusion is that the host scheduler does not have proper safeguards when attempting to balance both the virtual excess processors and the groups at the same time. Batch 3,133,342 4,294,056 5,743,057 6,784,834 8,931,931 10,340,574 14,630,637 Default 3,133,694 4,298,563 5,739,724 6,557,128 8,706,633 10,338,023 14,802,677 Fifo 3,123,180 4,106,074 5,459,447 6,205,171 6,750,037 7,464,589 7,843,961 Idle 3,106,576 4,284,550 5,876,082 6,733,301 8,709,327 10,320,894 14,791,692 Roundrobin 3,133,780 4,277,373 5,769,419 6,500,875 8,281,168 9,025,415 10,291,738 (a) Average time Batch 17, , , , ,094 82, ,723 Default 50, , , ,720 38,591 99, ,742 Fifo 13, , , ,637 1,296, , ,591 Idle 50, , , ,578 64, , ,322 Roundrobin 53, , , , , , ,598 (b) Average time deviation Batch 99.82% % % % % % 98.53% Default % % % % % % 98.88% Fifo 99.89% 97.34% % % % % % Idle 99.34% % % % % % % Roundrobin 99.34% % % % % % % (c) % relative to XVM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % 83.72% Idle % % % % % % % Roundrobin % % % % % % % (d) Average time scaling Table 66: Average process time on a XVM with background also grouped but on the Host 112

113 113 Batch 3,133,342 4,698,186 5,960,516 6,914,770 9,508,080 10,987,446 16,485,220 Default 3,133,694 4,671,300 5,865,442 6,710,062 8,874,164 10,594,308 16,023,866 Fifo 3,123,180 4,508,486 6,219,498 7,066,410 11,084,808 13,774,410 19,411,900 Idle 3,106,576 4,584,814 6,054,720 6,897,436 8,901,666 10,990,980 15,713,974 Roundrobin 3,133,780 4,612,432 6,303,404 7,634,640 9,915,940 12,983,074 18,346,560 (a) Total time Batch 17, , , ,549 1,041, ,856 1,178,868 Default 50, , , ,470 56,532 97, ,663 Fifo 13, , , ,336 1,328, ,676 1,134,574 Idle 50, , , ,162 34,554 1,042,733 92,048 Roundrobin 53, , , , , ,727 1,119,144 (b) Total time deviation Batch 99.82% % % % % % % Default % % % % % 99.01% % Fifo 99.89% % % % % % % Idle 99.34% % % % % % 99.27% Roundrobin 99.34% % % % % % % (c) % relative to XVM with background Batch % % % % % % % Default % % % % % % % Fifo % % % % % % % Idle % % % % % % % Roundrobin % % % % % % % (d) Total time scaling Table 67: Total group time on a XVM with background also grouped but on the Host Batch Default Fifo Idle Roundrobin (a) Average CS Batch Default Fifo Idle Roundrobin (b) Average CS deviation Table 68: Average CS on a XVM with background also grouped but on the Host

114 5.5 Standard Benchmarking A standardized approach is needed for real world usage scenarios, specially for evaluating resource limitation approaches using CGROUPs.

114 Standard Benchmarking A standardized approach is needed for real world usage scenarios, specially for evaluating resource limitation approaches using CGROUPs. The following illustrations are summaries, 0-3 meaning all 4 processors were used. 1-2 and 2-3 with cpuset limiting the use to processors 1 and 2 for the former and 2 and 3 for the latter. bw denotes bandwidth limitation for what would amount to two processors using 100 milliseconds time-slices and the / quota/period ratio Scheduler runtime tunables The tests in this section intend to find out the impact of changing the runtime configurable parameters of the CFS scheduler, the experiments were carried out under the default policy. The CFS parameters listed on each line of the tables for individual experiments are described back at chapter 2 and section Illustration 6: Composite scheduler runtime tunning results Default values for Illustration 6 are defined from a constant multiplied by the rule specified at sched_tunable_scaling. For instance, default latency in the test system, which has

115 four processors, is the constant six multiplied by the default logarithmic scaling of 1+ilog2(4) resulting in 18ms. Likewise, for the test system the resulting default minimum granularity is 2.

115 115 four processors, is the constant six multiplied by the default logarithmic scaling of 1+ilog2(4) resulting in 18ms. Likewise, for the test system the resulting default minimum granularity is 2.25ms and default wakeup granularity is 3ms. Most cases are very similar, but some do show differences, albeit small. Illustration 7: SciMark Dense LU Matrix Factorization One of those cases is for dense LU Matrix Factorization under SciMark implementation version 2.0 as can be seen on Illustration 7. The best result for this case was observed under the default settings and the worst with latency of 12ms, to a total 28 points difference between best and worst cases. Wakeup granularity changes yielded a total decrease of 13 points for the worst case with the highest granularity. Changes to minimum granularity to both lower and higher values yielded lower results, further increasing the setting yielded a better result but not back to

116 116 the default level. Illustration 8: 7-Zip Compression The changes for 7-Zip compression experiments are shown in Illustration 8. The best result was with 6ms latency with a 259 points gain, the application performance changed inversely to increases in latency for all testing conditions. Similarly, the application performance changed proportionally to increases in wakeup granularity for all testing conditions. Granularity changes were detrimental when both increasing and decreasing the setting from the default setting. The conclusion is that the actual differences are very application dependent, demanding proper profiling before they can be changed reliably.

117 CPUSET and CPU Bandwidth CGROUPs As noted previously on sections 3.1 and 4.3, the test system 0-1 and 2-3 processor pairs share their L2 caches among themselves, but not between the pairs. As such, in a division where one processor of each pair is used the cache will be available in its totality, but without sharing. While in a division where both processors of a pair are used only half of the total cache will be available, but it will be shared. Even if not usual on the test system since most experiments have shown very little to no variation (Illustration 9), differences were observed and were meaningful in some cases, like libvpx VP8 encoding as can be seen on Illustration 10. Those cases with meaningful differences happen when distributing loads around processors without taking into account the system topology sets is not optimal. Of note, this is not a NUMA nor a HT system, if there were different memory access nodes or processor subdivisions for different processor groups then they would also have to be taken into account. Instead of physically limiting processes to only certain CPUs it is possible to logically limit them through the use of bandwidth contention. It works by setting up a quota and a period for the group. If both match then the group gets what would amount to one CPU worth of time, if not then the group gets what would be the equivalent amount of one CPU by the same proportion. For instance, if the period is twice of the quota then the group will get half of one CPU, while if the quota is twice that of the period then the group will get two CPUs worth of time. The time slots used for balancing in this scenario are defined by /proc/sys/kernel/sched_cfs_bandwidth_slice_us, default 5ms, in this section's graphs that is the value after bw on the illustrations. As for the measurements, libvpx VP8 encoding could be explained by sub optimal cache sharing, tough not the worst case, hence an in between result. Cache usage from cachebench only was affected by bandwidth granularity with the most extreme setting tested, Illustration 11.

118 118 Illustration 9: CGROUP contention via cpuset and bandwidth at 50% Illustration 10: VP8 libvpx Encoding

119 119 Illustration 11: CacheBench Write Most Graphics Magick measurements on Illustration 12 resemble those of libvpx VP8. Their pattern is like what would be expected from imperfect scaling, however, while cache sharing doesn't impact the result, logical limiting instead of physical limiting does. An explanation would be actual physical processor pipeline usage, since logical limiting actually allows for the use of all physical CPUs as long as their total usage over time meets the quota/period ratio. Libvpx VP8 results clearly favor cache sharing against process migrations, as can be seen from Illustration 10 where the worst results are those from bandwidth limits in contrast to what happened with Graphics Magick where those were the best results. Cachebecnh write (Illustration 11) is an interesting case since, albeit the differences between the results being small, the best result was observed with the smallest bandwidth slice, decreasing constantly down to the lowest (worst) result with the biggest slice. This is counter intuitive since the bigger the slice the smaller is the overhead of switching and CacheBench is a single threaded benchmark. However, the CacheBench results for read (Illustration 9 line 1) and

120 read/modify/write (line 3) are somewhat different, with the read/modify/write test displaying the opposite behavior, as would have been originally expected.

120 120 read/modify/write (line 3) are somewhat different, with the read/modify/write test displaying the opposite behavior, as would have been originally expected. Hence, the execution costs of the more frequent switches are more meaningful to the read/modify/write test than the costs in cache bandwidth, where the opposite is true for the pure write test. Illustration 12: Graphics Magick Resizing In summary, choosing how to limit CPU utilization is application dependent as was the case with choosing the scheduler runtime parameters, proper decision making must be done in conjunct with application profiling. 5.6 Summary The Real Time scheduling policies are less efficient than the CFS based policies in all scenarios, but they do posses the capability to grant exclusive CPU utilization from the point of view of the scheduler that is able to recognize them (schedulers beyond a virtual machine are not able to for instance). A useful usage scenario would be momentarily favoring privileged processes over

Install Cisco ISE on a Linux KVM

Install Cisco ISE on a Linux KVM KVM Hypervisor Support, page 1 Obtain the Cisco ISE Evaluation Software, page 3 Install Cisco ISE on KVM, page 4 KVM Hypervisor Support Cisco ISE supports KVM hypervisor on Red Hat Enterprise Linux (RHEL)