Extending the user interface of irqbalance


Masaryk University
Faculty of Informatics

Extending the user interface of irqbalance

Bachelor's Thesis

Veronika Kabátová

Brno, Fall 2016



Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Veronika Kabátová

Advisor: RNDr. Adam Rambousek, Ph.D.

Acknowledgement

Firstly, I would like to thank my advisor RNDr. Adam Rambousek, Ph.D. for his advice and guidance during the preparation and writing of this thesis, and my external advisor Petr Holášek, who was not only willing to advise me about technical and implementation details of irqbalance, but also made an in-depth review of everything relating to this thesis. My further thanks go to my family and friends, who were very accommodating of my needs and supported me during my work.

Abstract

In this thesis, we focus on the description of hardware interrupts and their distribution and balancing on Linux machines, performed by the irqbalance daemon. irqbalance does not allow changes to distribution settings during runtime, and we would like to change that. We make a thorough analysis of how the daemon works and develop a usable, user-friendly interface for communication with it. The user interface allows users to view the current distribution status and make changes to the settings.

Keywords

processor, interrupt, distribution, balancing, text user interface, irqbalance, Linux kernel

Contents

1 Introduction
2 Overview of processor architecture development
3 Hardware interrupts
4 Overview of interrupt distribution in Linux kernel
  4.1 Prerequisites
  4.2 Interrupt distribution
5 irqbalance analysis
  5.1 Building processor tree
  5.2 Rebuilding interrupt database
  5.3 Parsing /proc/interrupts and /proc/stat
  5.4 Putting it all together
6 Development of user interface
  6.1 Identifying the needs
  6.2 Changes needed in irqbalance
  6.3 Identifying tools for user interface development
  6.4 Development process
7 Conclusion and future plans
A Source code
B Screenshots of the user interface

1 Introduction

Computer architects have always been trying to make computers better. When hardware hit its limits in utilizing the chip space for both power efficiency and higher frequency, many other techniques were created to improve the performance further, and the development moved towards multiple processors on the same chip. As the demands for performance increased, it became clear that software needs to be written to be efficient as well as optimized for the given hardware, especially for time-critical applications. A trivial example where a different algorithmic approach produces better results on the same setup is sorting, but software architects try to build their applications not only using faster algorithms, but also taking advantage of the memory and processor architecture. This results in an application that is optimized for the hardware, but the operating system and the fact that other applications might need to run on the same machine are often forgotten. The operating system includes its own optimizations that might collide with the programmer's ideas and goals, which may produce even worse performance than the non-optimized version, and the applications running alongside may interfere with the expected runtime and planned performance as well. Because of these aspects, it is useful for system administrators or experienced users to be able to adjust some of the features the operating system provides, especially when they are running an optimized application on their servers. Modifying the internal functionality of the operating system is often not possible, but in our case irqbalance, the daemon responsible for interrupt distribution, is already running in user space. irqbalance balances interrupts between processors on the machine while trying to keep cache misses minimal. Having only one processor responsible for all interrupts can harm the performance of the whole machine: the other processors are waiting for the chosen processor to process network or storage interrupts for their processes. Assigning an interrupt to be handled by a random processor has its issues as well, especially regarding the cached data needed to process the interrupt.

However, in some cases (like real-time systems), manually pinning the interrupts (or choosing which processor should not handle interrupts because a special application should be running there) yields better results than basic interrupt distribution. Setting irqbalance options by the user is possible only by restarting the daemon. The goal of this thesis is to change the situation by creating an extension to the daemon that allows users to check a summary of the current run of the daemon and modify its settings to fit their needs better, without the need to restart. Besides the Introduction and Conclusion chapters, this thesis is structured into five additional chapters. In the first part of Chapter 2, we briefly explain techniques used for improving processor performance over the years up to the current state, including the differences between memory architectures. Chapter 3 gives an overview of hardware interrupts and the interrupt handling routine and describes currently used types of interrupts. Chapter 4 mentions obstacles for interrupt distribution, portrays tools that helped with overcoming them and gives a brief history of interrupt distribution in the Linux kernel. Chapter 5 focuses on the description and analysis of irqbalance's current functionality and methods. Finally, Chapter 6 analyzes the goals to achieve by creating an additional user interface, the development process itself and plans for future development.

2 Overview of processor architecture development

One of the main goals of hardware architects is to make CPUs more efficient. Since the late 1970s, when the emergence of microprocessors started, until 2003, the performance of uniprocessors grew steadily, on average by 35% per year. Between 1986 and 2003, the improvement was more than 50% per year. After this period, the growth of performance slowed down. Until then, all processors used pipelining (overlapping the execution of multiple instructions) to improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP), since parts of the instructions can be evaluated in parallel. However, exploiting ILP effectively is not possible anymore: the improvement is too small to justify using an inappropriate amount of the limited resources on the chip. Additional improvements were achieved by speculation (guessing which branch will be taken depending on previously taken branches in the process, return address predictors and value predictors), but the power costs of extreme speculation are too high. The effort has moved towards multithreading (thread-level parallelism or TLP), which allows multiple threads to share the functional units of a single processor, duplicating only their private state, such as registers and the program counter. Most processor stalls occur because of cache misses and waiting for memory access, so being able to quickly switch to the execution of another thread creates additional improvements in performance: while one thread is waiting for data, another one is running and the processor stays active. Today, the idea of multithreading is widely used in multicore processors, where more cores are situated on the same chip. Each of the cores can execute one or more threads at the same time, improving the performance. These processors exploit TLP through two approaches: using a set of tightly coupled threads collaborating on the same task (called parallel programming), or a set of independent threads, possibly originating from different processes (called request-level parallelism). To take full advantage of a multicore processor, we need at least as many threads as the number of cores of the processor.

These processors are called symmetric multiprocessors (shared memory multiprocessors or centralized shared memory multiprocessors), SMP, because the cores share a single centralized memory that all of them have equal access to. This type of memory architecture is also called UMA (uniform memory access) to emphasize that all cores have uniform latency to memory. [1]

Figure 2.1: Example of UMA memory architecture with four processors [1]

In addition, one computer can contain multiple multicore processors (multiprocessor computers). These computers use a different memory architecture design called distributed shared memory (DSM).

Each processor is associated with a memory region and possibly with an input and output (I/O) interface. This type of architecture is also called NUMA (non-uniform memory access), because the data access time depends on the data's location in memory. Resources are typically split into NUMA domains. Each NUMA domain may contain zero or more processors, zero or more bytes of memory and zero or more I/O hubs. UMA can be considered a special case of NUMA where there is only one NUMA domain. This consideration is most useful when an operating system that is designed for NUMA needs to run on a UMA computer. Most NUMA systems are cache coherent (ccNUMA), which means that a specific CPU takes into account the state of the caches of the other CPUs. In case the NUMA system is not cache coherent, either software needs to manage the caches to ensure coherency, or each CPU has its own memory that no other CPU can access (the computer behaves like a set of separate computers connected by a network). [2]

Figure 2.2: Example of NUMA memory architecture with eight processors [1]

Most systems with more than one CPU socket use distributed memory, because centralized memory would not be able to support the bandwidth demands without incurring excessively long latency.

Distributing the memory both increases bandwidth and reduces the latency to local memory, but communication between the processors becomes more complex and an interconnection network is needed as a communication medium. [1] What is important to think about when scheduling threads is where the data the thread uses are allocated. If the thread is started on one node, suspended and started again on another node, the memory access time to the data the thread uses can significantly increase. Therefore, schedulers take into account processor affinity. Processor affinity refers to the persistence of associating a thread (or process) with a particular processor instance. Using a system API, or by modifying an operating system data structure, a specific core (or set of cores) can be associated with a chosen thread or process. Processor affinity ensures that memory allocations remain local to the threads that need them, but it can harm performance by restricting the scheduler's options. If the thread is stalled and waiting for the chosen core, sooner access to computing resources on another core can compensate for slower memory access. [3]
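
As a concrete illustration of such a system API, the sketch below (not taken from irqbalance; the core number is an arbitrary example) uses the Linux sched_setaffinity() call to pin the calling process to a single core:

    /*
     * Minimal sketch: restrict the calling process to CPU 2 using the
     * Linux CPU affinity API. The core number is an example value.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);            /* start with an empty CPU set */
        CPU_SET(2, &set);          /* allow only CPU 2 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("process is now restricted to CPU 2\n");
        return 0;
    }

The complementary sched_getaffinity() call can be used to read the current mask back.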

3 Hardware interrupts

An interrupt is a digital signal to the processor indicating an event that needs immediate attention. It alerts the processor and requires interruption of the code the processor is currently executing. The processor responds by suspending its current activity, saving its state and branching elsewhere into memory, where the interrupt handler is located. After the handler completes, the processor resumes its previous activity. [4] There are three types of interrupts:

Exception: Generated internally by the processor; includes conditions such as Page Fault, Divide-by-zero Error, Breakpoint or Overflow.

Software interrupt: Typically used for system calls; on x86 generated by the INT instruction.

Hardware interrupt or IRQ (Interrupt Request): Generated externally by the chipset, signaled by changing the voltage on the #INTR (or equivalent) pin; common examples include pressing a key on a keyboard, a completed read from a hard disk or network packet buffer processing, but also asynchronous events such as data arrival from an external network. [5]

There are two types of IRQs in common use today. The first type is IRQ lines (or pin-based IRQs). These are routed on the chipset, with wires or lines running from devices to an IRQ controller. The IRQ controller serializes the requests and sends them to the processor. The second type is message based interrupts, which are signaled by writing a value to a memory location reserved for information about the interrupting device and the interrupt itself. The device is assigned a location to which it writes (either by its firmware or by kernel software). The IRQ is generated using a protocol specific to the device's bus. [6] When an event happens, the device signals a PIC (programmable interrupt controller) to cause an interrupt. The PIC accepts the interrupt requests and feeds them to the processor in order. Without a PIC, devices would need to be polled to see if any event happened. The PIC decides whether the processor needs to be immediately notified about the IRQ, i.e. whether the IRQ is a non-maskable interrupt (NMI).

An NMI is the highest priority interrupt, usually indicating a critical hardware error. Unlike a software interrupt or a regular hardware interrupt, an NMI cannot be interrupted by any other interrupt (since there is no other interrupt that can have a higher priority than an NMI). [7] In the other case, the PIC translates the IRQ number into a vector and sends it to the CPU. Each interrupt has a priority, which can be set either by the hardware design or programmed into the PIC. [8] If more than one interrupt is pending, the highest priority interrupt is sent first, and if a lower priority interrupt is being handled, it can be interrupted by a higher priority one. Every time the CPU finishes execution of one machine instruction, it checks whether the PIC has signaled an interrupt. If that is the case, the CPU saves the state information of the current process (including the instruction counter, the program status word and in some cases register contents) on the stack, maps the received interrupt vector to the address of the interrupt handler using an interrupt vector table (a data structure containing interrupt vectors and associated interrupt handlers) [8] and executes the chosen handler. After the handler completes, the processor returns to the execution of the interrupted process by using the IRET instruction (interrupt return), which tells the processor to load the saved information from the stack. A special type of interrupt is the system management interrupt (SMI). These interrupts use a special signaling line directly into the CPU and cannot be disabled.¹ [9] When an SMI is received, the CPU enters system management mode (SMM). SMM is provided by system firmware, often the BIOS. SMM is usually used for legacy hardware emulation, safety functions (such as shutdown on high CPU temperature) and power control, and by design, the operating system cannot override or disable it.

1. Interrupt handling can be disabled by clearing associated flags or masking interrupts that we wish to be ignored. In this case, the CPU ignores PIC signals and does not execute any handler for these interrupts. Since this thesis concentrates on the distribution of interrupts between CPUs and on extending the user interface of the daemon responsible for the distribution, and it is not possible to handle and distribute a disabled interrupt, we do not elaborate on this topic.

4 Overview of interrupt distribution in Linux kernel

4.1 Prerequisites

In general, it is expensive to move processes across processors. Each process has its data allocated in the given processor's registers or cache, and moving the process requires moving all its data as well. However, interrupt handling routines do not require much data, so spreading them across multiple processors is possible without a huge overhead. [4] This generalization does not work for all possible interrupts. For example, network interface cards (NICs) usually need information about connections when processing incoming packets, so by moving interrupts from NICs elsewhere we hit the same issue with cached data as when moving the process itself. Because of this, the solution often was manual pinning of all the interrupts from one NIC to one processor. On the other hand, newer NICs with MSI-X (an extension to message signaled interrupts (MSI); MSI is a method of signaling interrupts which has been used with PCI (Peripheral Component Interconnect) devices since version 2.2 and with the PCI Express bus; this method is mutually exclusive with using pins for interrupt signaling [10]) have multiple queues for packets, where packets are hashed into queues based on the IP addresses and ports partaking in the communication [11] (in case of network communication with virtual machines, Intel VMDQ (Virtual Machine Device Queues) technology is often used to ensure scalability; VMDQ optimizes virtual machine traffic processing by putting different virtual machines' communication into different queues [12]). This way, packets from the same communication go into the same queue, and different queues can be assigned to different processors, which removes the need to manually pin all the interrupts to a single processor in order to avoid cache misses and provides a possibility to even out the load of interrupts between processors.

In the Linux kernel, the ability to choose a queue for packet transmission was introduced in version 2.6.23.¹ NICs are the most common example of hardware that uses multiple queues for interrupts (more interrupt vectors), but in general, every piece of hardware that uses MSI-X has this ability (plain MSI does not support this feature [13]). This means that with modern hardware and kernel, the prerequisites for interrupt distribution are fulfilled.

4.2 Interrupt distribution

How the interrupts are distributed depends on APIC (Advanced Programmable Interrupt Controller)² settings and abilities. The APIC can run in logical or physical mode. In logical mode, interrupts are distributed using a round-robin algorithm. This means that processors³ take turns in which one handles an incoming interrupt. A group of processors taking care of interrupt handling can be specified, so not all processors need to partake. In case of the NICs mentioned in the previous section, every interrupt is handled by a different processor, so all the information and connection objects need to be loaded every time an interrupt occurs. This helps with distributing the load across processors, but increases cache misses. With physical destination mode, all interrupts are sent to one targeted processor [14]; with the Linux kernel, this is CPU 0. Having a single processor responsible for handling all interrupts creates the opposite situation: while cache misses for relevant data will dramatically decrease (how much depends on the processor's registers, cache size and running processes), the rest of the processors are waiting for CPU 0 to handle interrupts created by their processes. By default, the APIC operates in physical mode.

1. First version stashed with commit f25f4e44808f0f6c9875d94ef1c41ef86c288eb2 (2.6.23), rewritten with d95b39c a c9b1026bd6bbed62. Patches with support for multiple devices followed.
2. We use APIC here as a generalization for Local APIC (LAPIC) and IO APIC as well. Implementation details and differences between them are not a topic of this thesis, a very high-level understanding of APIC's functionality is sufficient.
3. To be consistent with the terminology used in irqbalance, where CPU represents a core and processor package (or just package) describes a physical processor (which may contain more cores), from now on we will be using the terms in this meaning unless stated otherwise.

To balance out both extremes, SMP affinity was introduced in the Linux kernel. SMP affinity allows specifying which processors should handle interrupts from given interrupt sources. By default, all processors are allowed to handle all interrupts. Setting a different affinity for each interrupt vector is possible by writing into the /proc/irq/irq_id/smp_affinity files. These files contain a bit mask (in hexadecimal) representing all processors in the system, and by modifying the default bit mask, one can restrict the group of processors allowed to handle the chosen interrupt (a bit mask specifying no processor is not allowed). [15] Using this ability requires manual setup by the system administrator. While specifying processors for handling chosen interrupts is useful for performance (or energy consumption) tuning, doing it manually every time the load changes is time consuming and prone to human errors. With a later Linux kernel version, automatic interrupt balancing was introduced. Fully compatible with the SMP affinity interface (and possible to be overridden by manual setup), the first version was mostly focused on cache affinity (handling interrupts in a way that allows minimizing cache misses) and moving interrupts to idle processors. While all interrupts from the same source were assigned to the same cache domain, more processors can share higher-level caches, and interrupt vectors were periodically randomly reassigned to processors in the given cache domain. However, this worked well only on systems with a lighter interrupt load. With heavy load, multiple frequently generated interrupts were often moved to the same processor, while other processors were loaded very lightly. A performance observation described in a commit message⁶ by Nitin A. Kamble mentions that with heavy load, this situation occurs with a probability of 50% on machines with two processors and approximately 80% on machines with four processors. A relevant issue mentioned was that balancing should take into consideration the physical packages of processors.

5. Commit hash cf6f7853b1b75eaa20524a968e0cb8e12e6168f6, available in kernel history git repository
6. Commit hash 08f16f8ff05d9de37ea187a3bde79806c64b82e2, available in kernel history git repository

This way, the chance of assigning more frequently generated interrupt vectors to a single core on multicore processors is minimized. The new version of balancing (proposed in the same commit) addressed all the issues. The first change made was putting frequently generated interrupt vectors on different processors and not moving them often, minimizing cache misses and avoiding clogging a single processor with multiple frequently occurring interrupts. This version took the physical processor package into consideration when distributing the load between the logical processors that belong to the package, and changed the time interval between redistributions depending on changes of the interrupt load. Over the next years, only minor changes and fixes were done.
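
As an illustration of the SMP affinity interface described in this chapter (this is not irqbalance code; the interrupt number and the mask are arbitrary example values), restricting one interrupt to CPUs 0 and 2 amounts to writing the hexadecimal mask 0x5 into the corresponding smp_affinity file:

    /*
     * Illustrative sketch: restrict a chosen IRQ to CPUs 0 and 2 by
     * writing a hexadecimal bit mask into its smp_affinity file.
     * The IRQ number 24 and the mask 0x5 are example values.
     */
    #include <stdio.h>

    int main(void)
    {
        const int irq = 24;            /* hypothetical interrupt number */
        const unsigned int mask = 0x5; /* bits 0 and 2 -> CPUs 0 and 2 */
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f) {
            perror(path);              /* usually requires root privileges */
            return 1;
        }
        fprintf(f, "%x\n", mask);
        fclose(f);
        return 0;
    }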

5 irqbalance analysis

Side by side with the Linux kernel interrupt balancing development (although only minor changes were made to it), the idea to move interrupt balancing from kernel to user space occurred. This effort resulted in a user space utility called irqbalance, which has been available since 2003 [16]. While the initial version used a similar strategy as the balancing in the kernel (and this strategy has not changed over the years), irqbalance was superior to kernel balancing not only with its ability to include user-defined policies, but with increased performance when working with network-related interrupts and NUMA-aware machines as well. This finally resulted in the removal of interrupt balancing from the Linux kernel. Five years after the first version of irqbalance was developed, balancing was dropped from the kernel and, as of now, there are no plans to include it again. Let's start with identifying the larger sections of steps irqbalance makes.

5.1 Building processor tree

Firstly, irqbalance builds the object tree representing the processor hierarchy of the given machine. Highest in the hierarchy are NUMA nodes, which consist of processor packages, processor packages consist of cache domains and cache domains consist of the processors themselves. There is no issue if the machine is not designed for NUMA: the kernel is able to fake one large NUMA node that consists of all processors in case of UMA machines (the sizes of the nodes can be configured as well). All object types use the same structure topo_obj, defined as:

    struct topo_obj {
        uint64_t load;
        uint64_t last_load;
        uint64_t irq_count;
        enum obj_type_e obj_type;
        int number;
        int powersave_mode;
        cpumask_t mask;
        GList *interrupts;
        struct topo_obj *parent;
        GList *children;
        GList **obj_type_list;
    };

1. Commit hash 8b8e8c1bf7275eca859fe551dfa484134eaf013b
2. Changelog for Linux kernel, kernel/v2.6/changelog b8ca80e192b10eecc01fc44a af86f73b

In case of objects higher in the tree (NUMA nodes, processor packages and cache domains), the load fields are the sum of the loads across child objects. Data about the configured NUMA nodes in the system are available under the /sys/devices/system/node path. Information associated with each node is available in a nodeN directory, 0 <= N < number of nodes. The number of nodes is determined by the number of nodeN directories, and each topo_obj gets the number field assigned from this directory name. Since NUMA nodes are the highest entries in the hierarchy, there is no parent structure for them (the pointer is NULL). obj_type_list points to a linked list of all NUMA node objects. The last field populated during parsing of a nodeN directory is mask, which is a bit mask stating which processors belong to the given NUMA node and which do not. A hexadecimal mask is available in the nodeN/cpumap file and is parsed into the bit mask. After creating the objects representing the top of the hierarchy, irqbalance builds the rest of the tree from the bottom up. Before gathering data and creating structures representing all processors, irqbalance checks for processors banned from partaking in interrupt distribution. By default, isolated processors (processors ignored by the scheduler for user space tasks (kernel threads may still get scheduled on isolated processors) [17], usually used for one dedicated task) and nohz_full processors (this option disables timer ticks used for scheduling on idle processors and processors with only one runnable task; used for real-time applications) are banned from interrupt distribution. The list of isolated processors is available in the /sys/devices/system/cpu/isolated file and the list of nohz_full processors in the /sys/devices/system/cpu/nohz_full file. This default setting of processors banned from interrupt distribution can be overridden by setting the IRQBALANCE_BANNED_CPUS environment variable.
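
The masks mentioned in this section (the nodeN/cpumap files as well as the IRQBALANCE_BANNED_CPUS variable) are hexadecimal strings. A much simplified sketch of turning such a string into a set of processor numbers could look as follows; unlike the real parser, it ignores the comma-separated 32-bit groups used on larger systems and handles only masks that fit into one unsigned long long:

    /*
     * Simplified sketch: decode a hexadecimal CPU mask string into the
     * set of CPU numbers it covers. The example mask is arbitrary.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *mask_str = "a5";   /* example: CPUs 0, 2, 5 and 7 */
        unsigned long long mask = strtoull(mask_str, NULL, 16);

        for (int cpu = 0; cpu < 64; cpu++)
            if (mask & (1ULL << cpu))
                printf("CPU %d is set in the mask\n", cpu);

        return 0;
    }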

After getting the bit mask of banned processors, irqbalance parses all /sys/devices/system/cpu/cpuN/ directories, 0 <= N < number of processors. If the given processor is offline (the file online exists in its directory and contains 0), it skips the processor. Same as with NUMA nodes, the number field in the structure is parsed from the directory name. Logically, the mask field covers only the current processor. To put together the whole tree, we need to get information about the package and cache domains as well. The package number is read from the /sys/devices/system/cpu/cpuN/topology/physical_package_id file and a bit mask of the other processors in the same package is parsed from the /sys/devices/system/cpu/cpuN/topology/core_siblings file. The last piece of information needed is the cache level on which the balancing should be done. This value can be configured and defaults to the L2 cache. While a higher cache level allows greater flexibility in interrupt distribution, on some systems the highest cache level is shared between all processors; if we used it, instead of distributing interrupts between a smaller group of closer processors, we would be back to distribution between all processors. Processors from the package that share the same cache are found in the cache/indexMAX/shared_cpu_map file in the directory belonging to the given processor, MAX being the maximum cache level. Subsequently, banned processors get cleaned out of both cache and package masks. Now the object tree can finally be built recursively; topo_obj structures for cache domains and packages are created as needed. Pointers to children and parent structures are assigned according to the retrieved information. Finally, the created structures get linked to NUMA node structures based on the number of the node to which the processor belongs (the /sys/devices/system/cpu/cpuX/nodeY directory, where X is the number of the given processor and Y the number of the corresponding NUMA node).

5.2 Rebuilding interrupt database

Firstly, irqbalance retrieves a list of currently present interrupts. This is done by parsing the /proc/interrupts file, which holds information about how many interrupts from which source were handled by which processor.

Each line (besides the header line) contains the interrupt's ID (either numeric or a string; interrupts with string identification are internal to the system and not associated with devices, therefore they are ignored by the balancer), the number of handled interrupts by each processor and information about the interrupt (like its cause or the responsible drivers). Each interrupt is represented by an irq_info structure:

    struct irq_info {
        int irq;
        int class;
        int type;
        int level;
        int flags;
        struct topo_obj *numa_node;
        cpumask_t cpumask;
        cpumask_t affinity_hint;
        int hint_policy;
        uint64_t irq_count;
        uint64_t last_irq_count;
        uint64_t load;
        int moved;
        struct topo_obj *assigned_obj;
        unsigned int warned;
        char *name;
    };

The irq field is the identification number of the interrupt and the last string on the line is taken as the name of the interrupt. Together with the interrupt class and type, these fields hold information about the interrupt that is used for debugging output. After the interrupt structures are allocated, the devices responsible for the interrupts are identified to determine the interrupt type. This is done by parsing the /sys/bus/pci/devices/device_id directories (excluding the directory of interrupt 0, which is permanently assigned to the system timer). If the device directory contains an msi_irqs directory, the device can be matched to one or more irq_info structures based on the interrupt numbers in that directory. As mentioned in Chapter 4.1, all PCI devices since version 2.2 are capable of using MSI, and since PCI version 3.0, MSI-X allows more interrupt vectors per device.
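
To illustrate the device matching step just described, a rough sketch (the PCI address is an arbitrary example, and this is not the actual irqbalance code) can list the msi_irqs directory of one device to see which interrupt numbers belong to it:

    /*
     * Sketch: list the msi_irqs directory of one PCI device. Each entry
     * name is one interrupt number used by the device.
     */
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        const char *dir_path =
            "/sys/bus/pci/devices/0000:00:1f.6/msi_irqs";   /* example device */
        DIR *dir = opendir(dir_path);
        struct dirent *entry;

        if (!dir) {
            perror(dir_path);   /* device may not exist or may use legacy IRQs */
            return 1;
        }
        while ((entry = readdir(dir)) != NULL) {
            if (entry->d_name[0] == '.')
                continue;
            printf("device uses IRQ %s\n", entry->d_name);
        }
        closedir(dir);
        return 0;
    }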

The type of interrupts matched this way is either IRQ_TYPE_MSI or IRQ_TYPE_MSIX. In case of legacy devices without MSI support, the irq file in the device's directory is checked for the interrupt number and the interrupt's type is IRQ_TYPE_LEGACY (there is no issue with a non-PCI device being the source of the interrupt, only the classification of the interrupt is affected). Now, irqbalance checks whether a user-defined policy about interrupt handling was presented: an interrupt can be banned from balancing and moving around processors completely (--banirq or -i options), or a policy script may be specified with the --policyscript (or -l) option. The script will be executed for each interrupt and may specify:

whether to ban the interrupt from balancing completely (same as the --banirq option)

the level on which the balancing should be done (whether irqbalance is allowed to move the interrupt to a different package or cache domain)

overriding the interrupt affinity hinting done by the kernel (disabled by default)

overriding which NUMA node the device interrupt is local to (often this information is not specified in the system and all devices are considered equidistant from all NUMA nodes; this option allows manual setting of the closest NUMA node or, vice versa, overriding the system-specified value and setting the device to be equidistant from all nodes if that is desirable)

Most fields in the irq_info structure are assigned only right before the balancing takes place.

5.3 Parsing /proc/interrupts and /proc/stat

When parsing the /proc/interrupts file, we are again working only with interrupts with numbers as identification. There are two main reasons for parsing /proc/interrupts: confirming correct interrupt types and checking the current number of processed interrupts. As described in section 5.2, interrupt types are determined by parsing directories under the /sys/bus/pci/devices path. However, if a too old Linux kernel is run, it may happen that a device is supposed to use MSI(-X) for interrupt signaling, but this does not happen in reality. Checking whether the information retrieved from the devices and /proc/interrupts match reveals this issue and produces a warning about improper interrupt classification.
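
The per-CPU counting described next can be sketched roughly as follows; this is an illustration only, not irqbalance's actual parser, and it ignores some corner cases of the /proc/interrupts format:

    /*
     * Sketch: read /proc/interrupts, skip the header line, and for
     * every interrupt with a numeric identifier sum the per-CPU counts.
     * String-identified interrupts (NMI, LOC, ...) are skipped.
     */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/interrupts", "r");
        char line[4096];

        if (!f)
            return 1;
        if (!fgets(line, sizeof(line), f)) {      /* header with CPU names */
            fclose(f);
            return 1;
        }

        while (fgets(line, sizeof(line), f)) {
            char *p = line;
            while (isspace((unsigned char)*p))
                p++;
            if (!isdigit((unsigned char)*p))      /* skip NMI, LOC, ... */
                continue;

            int irq = atoi(p);
            unsigned long long total = 0;

            p = strchr(p, ':');                   /* counts follow the colon */
            if (!p)
                continue;
            p++;
            /* sum numbers until the textual description begins */
            for (;;) {
                char *end;
                unsigned long long count = strtoull(p, &end, 10);
                if (end == p)
                    break;
                total += count;
                p = end;
            }
            printf("IRQ %d: %llu interrupts handled in total\n", irq, total);
        }
        fclose(f);
        return 0;
    }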

Counting the number of handled interrupts across all processors reveals whether a processor was hotplugged or removed after the topology scan; if this is the case, a rescan takes place. Otherwise, the irq_count and last_irq_count fields of the irq_info structure are populated: the new count is assigned to the irq_count field and the previous count is put into last_irq_count. If the current count is smaller than the previous one, a rescan needs to take place as well: the source of the interrupt was probably unplugged and plugged in again, and the differences in counts could cause problems with overflows. Parsing the /proc/stat file is required to retrieve load statistics for the processors. For each processor, if it is not banned, the time it spent handling both hardware and software interrupts is obtained. The load and last_load fields of the topo_obj structure representing the chosen processor are populated and the new load is propagated all the way up to the top of the processor tree. Computing the load of the corresponding processor tree branch takes place next. This means that the whole load of interrupts assigned to a single NUMA node gets divided evenly between its children processor packages, then the load of one package is divided between its cache domains and finally, the load of one cache domain is divided between the processors belonging to the chosen cache domain.

5.4 Putting it all together

Similarly to the in-kernel balancing described in Chapter 4, irqbalance periodically analyzes the amount of work the interrupts require on the given system. The default sleep interval is set to 10 seconds, but it is configurable with the -t or --interval=<time> options at irqbalance start. The algorithm used in irqbalance consists of simple steps:

1. Building the current processor tree and identifying interrupts to balance
2. Evaluating overloaded processors
3. Ordering the assigned interrupts from most frequently generated to least
4. Rebalancing interrupts
5. Writing new SMP affinity values to the corresponding files
6. Waiting for a period of time and repeating the algorithm

Scanning the processor topology before every rebalancing is needed to include newly hotplugged processors and remove the ones that went offline. The steps taken to build the processor tree are described in section 5.1. Immediately after building the processor tree, a list of interrupts to balance is created (described in detail in section 5.2). After these steps, a cleanup from previous cycles is performed: if any interrupt was assigned to be handled by any processor, the link between them is removed and the interrupt is marked for rebalancing. Statistics about the load of processors are created next (retrieving the load is described in section 5.3). These statistics include a search for the minimal load, computation of the average and total load across processors, computation of the standard deviation from the average load and counting how many processors are underloaded or overloaded depending on the computed average load. In case of low load (more than N processors have a load at least one standard deviation below the average load and no processors are overloaded; N is specified with the --powerthresh option), processors can be enabled for powersaving mode: no interrupts will be assigned to those processors. If the load gets higher again, all processors in powersave mode will be enabled for interrupt distribution again. If the current load of any processor is higher than the minimal load, the interrupts assigned to it are sorted according to the required workload. Some of these interrupts are then marked for migration to other processors: if the interrupt is banned from balancing, it cannot be moved from the given processor; interrupts with a tiny load (load 1) are not worth migrating; and there is no point in moving the interrupt if it is the only interrupt assigned to the given processor.

For interrupts not belonging to these categories, it is checked whether the migration would not simply swap the imbalance between the current processor and the one with the minimal load; if it did, the overloaded processor would become the one with the minimal load and the balancing would never finish. Interrupts that pass the check (sorted from the ones with the highest workload to the lowest) are then marked for redistribution (the minimal load is adjusted depending on which interrupts are chosen). Finally, the placement of these chosen interrupts is calculated. Since irqbalance tries not to move an interrupt from its home NUMA node, processors belonging to this node are checked first. Processors that are neither banned nor in powersave mode are eligible for interrupt distribution. If there are no eligible processors, the node with the lowest load is found and assigned to handle the interrupt. From the top to the bottom of the processor hierarchy (finishing when the set level of balancing for the chosen interrupt is reached), the most suitable objects (objects with the lowest load, or with fewer interrupt vectors assigned in case of the same load) are recursively identified as destination objects for the migration. For each of the migrated interrupts, the new value of SMP affinity is saved into the /proc/irq/irq_id/smp_affinity file. With this step, a single cycle of interrupt rebalancing is finished. The whole algorithm is repeated after sleeping for the selected amount of time.
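
To make the statistics step of the algorithm more concrete, the following simplified sketch computes the average load, the standard deviation and a rough over/underloaded classification for a handful of made-up per-CPU load values; the thresholds and the exact bookkeeping in irqbalance differ from this illustration:

    /*
     * Illustrative load statistics: average, standard deviation and a
     * simple classification. Loads are arbitrary example values.
     * Compile with -lm for sqrt().
     */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t load[] = { 120, 80, 400, 90 };   /* hypothetical per-CPU loads */
        int n = sizeof(load) / sizeof(load[0]);
        double total = 0.0, variance = 0.0;

        for (int i = 0; i < n; i++)
            total += (double)load[i];
        double avg = total / n;

        for (int i = 0; i < n; i++)
            variance += ((double)load[i] - avg) * ((double)load[i] - avg);
        double stddev = sqrt(variance / n);

        for (int i = 0; i < n; i++) {
            if ((double)load[i] > avg + stddev)
                printf("CPU %d is overloaded (load %llu)\n",
                       i, (unsigned long long)load[i]);
            else if ((double)load[i] < avg - stddev)
                printf("CPU %d is underloaded (load %llu)\n",
                       i, (unsigned long long)load[i]);
        }
        return 0;
    }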

6 Development of user interface

6.1 Identifying the needs

Right now, there is no other way to view which interrupts are assigned to which objects in the processor topology tree than the debugging output. This output is enabled with the --debug option at the start of the daemon, and prints information about the available interrupt sources (whether they are banned from distribution or not), the processors in packages (again, information on whether they are banned from interrupt handling) and the interrupt assignment (the aforementioned assignment tree). If the debug option was not enabled on start, the daemon needs to be restarted whenever the administrator needs to check this information, which is a situation we would like to avoid. While the available interrupt sources can be checked by reading the /proc/interrupts file, checking the other information is more complicated: since processor and interrupt banning can be set by command-line options on daemon start, checking the output of ps axu | grep irqbalance to look at the options is one of the ways to get this information (in case of policy script usage, the script needs to be checked as well). Retrieving the information about interrupt source assignment takes even more work: the corresponding processor mask needs to be parsed from the /proc/irq/irq_id/smp_affinity file, and this needs to be done for every interrupt we are interested in. As mentioned in Chapter 5, options for balancing can be set only at daemon startup: whether it is the sleep interval or interrupt and processor banning, currently there is no possibility to change the settings during runtime. A restart of the daemon is needed to change previous settings (moreover, an environment variable setup is needed in order to ban processors from interrupt handling if the default values are not suitable for the specific case). Altogether, these facts make it clear that not only is the interface of irqbalance not very user-friendly, but it needs to be restarted with every setting change as well.

The goal of this thesis is to change the current situation by creating a tool capable of communicating with irqbalance, through which users will be able to check and set up interrupt and processor banning, the sleep interval and the assignment tree.

6.2 Changes needed in irqbalance

Currently, irqbalance does not have a suitable interface for communication with other tools. irqbalance is written to be compatible with older systems and various distributions built upon the Linux kernel, and adding the ability to communicate with a simple user interface should not break this compatibility and portability. Therefore, using an already existing way of inter-process communication (IPC, the transfer of data among processes) is preferable. Linux systems support a number of IPC mechanisms:

Signals: Signals are used to signal events to other processes; they can be generated by interrupts or error conditions. However, no additional data can be sent using signals.

Pipes and named pipes: Pipes permit unidirectional communication from one process to a second one. Data sent to pipes follow the FIFO (First In, First Out) principle: the data are read in the same order they were written. Pipes can be used for input and output redirection as well as synchronization: if no data are available for reading, the reading process is blocked until the data are present.

Shared memory: Shared memory allows communication by writing to and reading from a specified memory location (all processes see the memory as theirs). Using this technique avoids the need to copy the same data, but the access to the memory is not synchronized: we need to establish a protocol the processes follow to prevent race conditions (situations where the output of operations depends on timing; in this case it happens, for example, if more processes try to allocate the same memory space).

Mapped memory: Mapped memory follows the same principle as shared memory, with one difference: the communication happens via a shared file. Processes can read from and write to the file with ordinary memory access. File operations are handled by the Linux kernel, but synchronization in order to prevent simultaneous writing to the file is required.

Unix sockets: Sockets in general allow bidirectional communication between processes, also across different machines. Unix sockets (or local sockets) are used for communication between processes on the same machine and their address is specified with a file name. A client-server architecture model is used with sockets: the server listens for incoming connections from clients and accepts them. [18]

As is recognizable from the previous section, we need to be able to pass new settings to irqbalance as well as request and retrieve data from it. Therefore, using signals is not applicable, since no additional data can be sent. While using two pipes (one for each direction) would possibly work, we need to be careful about their blocking aspect: we definitely do not want to stop the execution of irqbalance or the user interface program only to wait for data that may never arrive (for example, if the user interface is not running and irqbalance is waiting for requests). While there might be workarounds for this issue, it is easier to find a more generic IPC mechanism which does not create additional problems to solve, if possible (especially in this early stage of development). Both shared and mapped memory require the implementation of synchronization in both irqbalance and the user interface. While there are synchronization primitives already implemented and ready to use (for example, mutexes (MUTual EXclusion locks; only the one thread or program which holds the lock may access the chosen data) and semaphores (counters which increase and decrease based on actions performed by threads or processes) [18]), this still creates an additional problem to think about during future development and increases the complexity of the code. The last type of the mentioned IPC mechanisms is the one that is most suitable for the given use case: Unix sockets already allow bidirectional communication and there is no need to implement synchronization, only to specify the communication style (connection, or stream, communication guarantees delivery of all data in the order they were sent; datagram communication does not) and an already existing network protocol.

The irqbalance daemon will be treated as a server, listening on a local socket and processing requests from the user interface as they arrive. The user interface will connect to this socket as a client when required, requesting data from and sending settings to the daemon.

Figure 6.1: Stream socket lifetimes; server side on the left, client side on the right

Therefore, the only changes needed in irqbalance to ensure proper communication are to create a socket and implement a handler. This handler will validate each request and process it. For a data request, the handler will gather the requested data from internal structures and send them back to the user interface via the socket connection; if the request comes with settings changes, the handler will apply them.
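
A minimal sketch of the client side of this communication is shown below; the socket path is a placeholder chosen for the example (the real path used by irqbalance-ui may differ), while the "stats" request is one of the values introduced later in this chapter:

    /*
     * Sketch: connect to a Unix stream socket, send a request string
     * and print whatever the server answers.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    int main(void)
    {
        const char *socket_path = "/run/irqbalance-example.sock"; /* hypothetical */
        struct sockaddr_un addr;
        char reply[4096];
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return 1;

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, socket_path, sizeof(addr.sun_path) - 1);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            close(fd);
            return 1;
        }

        if (write(fd, "stats", strlen("stats")) < 0)
            perror("write");

        ssize_t len = read(fd, reply, sizeof(reply) - 1);
        if (len > 0) {
            reply[len] = '\0';
            printf("received: %s\n", reply);
        }
        close(fd);
        return 0;
    }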

6.3 Identifying tools for user interface development

The user interface should be able to communicate with irqbalance and display the received data. In the previous section, we determined sockets to be the most suitable IPC mechanism for this use case. Since irqbalance does not use a graphical user interface (GUI) in any way and we are trying to add as few additional requirements as possible and ensure compatibility with most systems, a textual user interface in the form of a terminal application is definitely preferred. By far the most widely used textual user interface library is ncurses. This library is compatible with the System V curses library, with the addition of three extra libraries for panels, menus and forms and minor extensions, and provides an API (Application Programming Interface) over raw terminal codes and control sequences. At first glance, the user interface does not require any of the extensions provided by ncurses, so the underlying curses library may be sufficient. Besides basic data display and parsing input from the user, the interface should periodically request data about interrupts and processors from the irqbalance daemon to prevent showing stale data. This data update should happen independently of waiting for input from the user, to allow the user to watch the changes. For this to happen, we would need to work with separate threads or callback functions for each action. Luckily, irqbalance already uses GLib, which provides a large set of utility functions for data structures and a main loop implementation. The main loop allows the application to react to events created by various sources (such as file descriptors) and timeouts. Timeout events allow us to work with callback functions, and all the checks about when to call them are already implemented by GLib. To allow event handling in more threads, each event source is associated with a context. GLib allows assigning priorities to event sources in a similar way as GNU nice: a negative number means higher priority (with 0 being the default), and it is even possible to set up functions to be called whenever no higher priority events need to be handled (so called idle functions). [19]
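
A minimal sketch of this main loop pattern, reduced to a single periodic callback, is shown below (the five-second interval and the callback body are placeholder examples):

    /*
     * Sketch: a callback registered with g_timeout_add_seconds() fires
     * periodically while the GLib main loop keeps running.
     * Compile with: gcc demo.c $(pkg-config --cflags --libs glib-2.0)
     */
    #include <glib.h>
    #include <stdio.h>

    static gboolean refresh_data(gpointer user_data)
    {
        (void)user_data;
        /* in the real interface, this would request fresh data from irqbalance */
        printf("periodic refresh tick\n");
        return TRUE;   /* TRUE keeps the timeout source installed */
    }

    int main(void)
    {
        GMainLoop *loop = g_main_loop_new(NULL, FALSE);

        g_timeout_add_seconds(5, refresh_data, NULL);
        g_main_loop_run(loop);

        g_main_loop_unref(loop);
        return 0;
    }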

6.4 Development process

Firstly, we need to outline how the interface should look and behave in detail. We need three basic screens: one for the processor and interrupt assignment tree, one for listing and setting up processor banning and the sleep interval, and the last one for listing and setting up interrupt banning. Special keys will be assigned for the user both for moving between the screens and for introducing new values for the setups. Applicable help about these keys should be shown all the time as well.

Figure 6.2: Diagram of movement between screens in the user interface

To move between screens, function keys present a good option, since they definitely will not be used in any kind of input and the chance that they would interfere with possible future work and enhancements is minimal. However, because of the cases when the user interface would be run in a terminal in graphical mode, we need to check which keys can be caught by the terminal instead of the user interface process. For example, in Gnome terminal, F1 displays the terminal help and the F10 key rolls down the menu, but XTerm does not display any function key catching behavior. A safe assumption is to use neither the lowest nor the highest numbered function keys, and we will settle for F3 for the assignment tree, F4 for the processor and sleep interval settings and F5 for interrupt banning.

On the assignment tree screen, no user input is expected, so any other key presses should be ignored. For the sleep interval and processor settings screen, we will use the S (as sleep) key for the sleep interval settings and C (as CPU) for the processor ban setup. On the interrupt banning screen, I (as interrupt) will be used to signal incoming user input. Now, we need to take care of filtering valid key presses from the input. Only numbers are permitted as the sleep interval value (pressing Enter confirms and Escape discards the value). If the inserted value after pressing Enter contains non-numeric symbols, it should be discarded. Processor and interrupt banning inputs can use the same control mechanism: the user is allowed to move up and down the processor (or interrupt) list, pressing Enter to change the ban setting for the chosen processor (or interrupt), Escape to discard the settings and S to save them. Unrecognized key presses should be ignored. The curses library overrides the default getch() function to return int instead of the char type, and it can be used to detect not only basic characters but special keys (such as function keys, arrows or backspace) as well, using constants defined in the curses.h file. A getch() call takes only one value from the buffer, but special keys are recorded as a sequence of more values which need to be parsed; the curses library already implements the solution. Therefore, a simple getch() call is sufficient for us to get any user input needed. As we have outlined the desired functionality of the interface and the possible interactions with it, we need to check which data are required. For the assignment tree, we need to know the unbanned processors and interrupts. For the sleep interval setup and processor banning we require the current sleep interval and all the processors in the system (both banned and unbanned), and for interrupt banning, data about all interrupts (again, both banned and unbanned ones) are needed. These are the data we need to request from irqbalance. New values of either the sleep interval, banned processors (while processor banning is configurable by setting the environment variable IRQBALANCE_BANNED_CPUS, it is not possible to change the environment of a running process and the setenv() function only changes the environment of the calling process and its children; changing processor ban values by environment variable setup is impossible during runtime) or banned interrupts should be sent back to the daemon if the user inserted valid settings.

There is no issue with sending an integer (the sleep interval) through a socket (technically, this would have been an issue when passing integers through sockets to a different machine: the size of various integer types as well as the endianness (the order used to interpret sequential bytes as a value) depend on the architecture, but since both the user interface and the daemon are going to run on the same machine, we may safely ignore this fact), but processors and interrupts are represented by structures in irqbalance (struct topo_obj for processor representation is presented in section 5.1 and struct irq_info is introduced in section 5.2). Passing structures through sockets is not advised because of unknown bit padding and alignment, which may cause problems not only with the data transfer itself, but with assigning the data to a new structure in the receiving program as well (dependent also on compilation settings, architecture and so on). To send structures safely, we first need to serialize them into an unambiguous byte sequence, and parse the byte sequence back into structures on the receiving end. There are several libraries available for data serialization in C (the C language itself has no native support for serializing structures), with the most popular being Binn and Google Protocol Buffers. Both libraries are actively developed, so using a fixed version would be needed to prevent incompatibilities. This would either pressure the user to have a specific library version installed (which may, for example, collide with programs that require the most recent versions), or we would need to include the library in our source code (the same way as GLib is distributed with irqbalance). However, we do not need all the data from the irqbalance structures, and using a generic library is a huge overhead, especially in case of Protocol Buffers (we would need to create a .proto file with the specification and compile it first with protoc and the protobuf compiler before using it). Creating simple serialization functions for the attributes we need is much easier. Therefore, we will be passing a string containing tokens and values of the requested attributes (not all structure attributes are needed).

Tokens serve as markers for the parsing functions, used both for validating the format of the received data and for pointing to the next attribute's value. Having cleared up how data are retrieved from irqbalance, we should also make clear which values are valid to be sent to the daemon, and what their meaning is:

stats: Retrieve the assignment tree of processors and interrupts

setup: Retrieve the values of the current sleep interval, banned interrupts and banned processors

settings sleep S: Set a new value of the sleep interval, S >= 1

settings cpus cpu_number1 cpu_number2 ...: Ban the chosen processors from interrupt handling. Old values of banned processors are forgotten

settings ban irqs irq1 irq2 ...: Ban the chosen interrupts from being balanced. Old values of banned interrupts are forgotten

Now, to be able to refresh the data periodically, we will be using the aforementioned GLib main loop. The controls are read from the user, so we also need to periodically check for user input. This would be an issue, since the normal getch() call blocks the program execution until input is provided, but the curses library offers the nodelay() option, which makes getch() a non-blocking call. If no input is ready, getch() returns ERR as a value. With the nodelay() option, we can add both the data refresh and the user input check functions as callbacks for the GLib main loop. With all the possible issues solved, only the straightforward implementation of data deserialization, serialization and display is left. The final user interface then needs to be included in the build of irqbalance itself, which requires modifying its existing configure and Makefile build and installation configuration. irqbalance is using GNU Autotools, so we do not need to include all the possible configurations depending on the system and distribution. A simple check for the presence of the curses library and the user interface build commands are sufficient. The user interface command will be irqbalance-ui.
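
Putting the pieces of this section together, the following sketch shows the non-blocking input check registered as a periodic GLib callback; the polling interval and the empty screen-switching logic are placeholder examples, not the final irqbalance-ui code:

    /*
     * Sketch: nodelay() makes getch() return ERR immediately when no
     * key is pending, so the input check can run as a GLib callback.
     * Compile with ncurses and glib-2.0 (pkg-config).
     */
    #include <glib.h>
    #include <ncurses.h>

    static gboolean check_user_input(gpointer user_data)
    {
        (void)user_data;
        int ch = getch();

        if (ch == ERR)
            return TRUE;          /* no key pressed, try again next time */

        switch (ch) {
        case KEY_F(3):            /* assignment tree screen */
        case KEY_F(4):            /* sleep interval and processor settings */
        case KEY_F(5):            /* interrupt banning screen */
            /* switch the displayed screen here */
            break;
        default:
            break;                /* unrecognized keys are ignored */
        }
        return TRUE;
    }

    int main(void)
    {
        GMainLoop *loop = g_main_loop_new(NULL, FALSE);

        initscr();
        cbreak();
        noecho();
        keypad(stdscr, TRUE);     /* report function keys as single values */
        nodelay(stdscr, TRUE);    /* make getch() non-blocking */

        g_timeout_add(100, check_user_input, NULL);   /* poll every 100 ms */
        g_main_loop_run(loop);

        endwin();
        g_main_loop_unref(loop);
        return 0;
    }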

7 Conclusion and future plans

The goals of this thesis were to describe interrupt distribution by irqbalance and to create a user interface for this daemon. The user interface was supposed to allow users to check the current status and modify settings during runtime, without the need of restarting the daemon. We fulfilled the first goal by doing a thorough analysis of every part of irqbalance (available in Chapter 5). The second goal (creation of a user interface) consisted of modifications of the irqbalance daemon and development of the user interface itself. Modifications of the daemon allowed us to communicate with irqbalance via sockets. This ability is used not only for communication with the user interface, but serves as an interface for building scripts on top as well. Experienced users may be able to write automated scripts, for example for banning and unbanning processors from interrupt handling based on the current load on the system. The user interface itself (available in the form of the irqbalance-ui command) makes setup changes and the assignment tree view very user-friendly and easy. Currently, irqbalance-ui is available in the upstream GitHub repository. We are looking forward to including irqbalance-ui in any Linux distribution repositories that are using irqbalance. How long that would take depends on the specific process of the chosen Linux distribution: how often the upstream code of a project is fetched, how elaborate the process of modifying and testing the package is, and whether major changes to upstream code are allowed into the package without version bumps are all key questions and should be left to the package maintainers themselves. We are planning to maintain this part of irqbalance and are considering possible future enhancements. Some of these improvements are to make the user interface prettier (use Unicode characters instead of ASCII for the assignment tree if the terminal settings allow that, make the user experience better by using a curated, eye-pleasing color scheme), others are focused on providing more information about processors (displaying load and powersave mode) and interrupts (showing the device name and the Linux kernel driver names responsible for the interrupt).

A Source code

The first appendix to this thesis contains the source code of irqbalance after the inclusion of the irqbalance-ui extension. The directory irqbalance contains the whole source code, as it is available in the GitHub repository mentioned in the final chapter, Conclusion and future plans. To distinguish the work done for this thesis from other people's contributions to irqbalance, the directory diff contains the additions to the previous version of irqbalance, as would be output by the diff -u command.

B Screenshots of the user interface

Figure B.1: Assignment tree screen, displayed in Gnome-terminal

Figure B.2: Example of the screen with the sleep interval and processor list

Figure B.3: Setting a new sleep interval value

Figure B.4: Banned processors are displayed in a different color

Figure B.5: Interrupt list; different interrupt classes are displayed in different colors
