Virtualization in Multicore Real-Time Embedded Systems for Improvement of Interrupt Latency

Ivan Pavić, MSc
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia

Hrvoje Džapo, PhD
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia

Abstract

This paper investigates the use of virtualization in multicore real-time embedded systems that combine a general-purpose operating system and a real-time operating system running on separate processor cores. The research focuses on the ARM processor architecture because of its widespread use in numerous application domains. The paper describes the concepts and practical considerations of using virtualization to achieve better interrupt latency both in Linux, a typical general-purpose operating system used in applications that require real-time response, and in a real-time operating system running on separate cores. In the proposed approach we demonstrate how a virtualization mechanism can decouple the cores responsible for real-time and non-real-time tasks in a multiprocessor system with real-time requirements. The Xen Hypervisor is used as the virtual machine monitor, with both the default credit scheduler and the new experimental null scheduler. The aim of the research was to study the response time characteristics of the system and to assess the practical usability of this approach in applications with hard real-time requirements.

I. INTRODUCTION

In recent years, the processing power of embedded systems has grown noticeably while power consumption has stayed low. As a result, the preconditions for using virtualization in low-power embedded systems are now met in many cases, which was not so until recently.
Although general-purpose operating systems (GPOS), such as Linux, have become a popular choice for easier and faster development, they cannot guarantee, out of the box, the deterministic low-latency performance that hard real-time constraints demand. In this paper we investigate an approach that provides a deterministic real-time response for a hard real-time subsystem by using virtualization to separate it from the GPOS. The whole system is divided into two basic parts: the GPOS and a subsystem with hard real-time requirements. The latter can be either a bare-metal application or a real-time operating system (RTOS), such as FreeRTOS. In such a configuration Linux can be used for noncritical soft real-time tasks that require high data throughput or feature-rich functionality, while the RTOS handles critical tasks with hard real-time requirements [1].

We developed such a system to test the validity and characteristics of the proposed approach, using the Odroid-XU4 hardware platform based on the Exynos 5422 SoC and the Xen Hypervisor as the virtual machine monitor. The system was tested experimentally to determine whether the proposed configuration can achieve a real-time, low-latency response, and to compare the predictability of its response with that of Linux alone. Both the default and the experimental null Xen schedulers were used in the tests [2].

The motivation for testing interrupt latency comes from the performance requirements of hard real-time and safety-related embedded systems. Standards require temporal and spatial independence of software with different levels of criticality [3]. It is also worth noting that many Linux extensions, such as Xenomai, RTAI and similar projects, enable real-time operation in Linux and improve interrupt latency, but this paper focuses on improving latency through virtualization.
The main contribution of this paper is an interrupt latency benchmark of the Xen Hypervisor with different schedulers. The results are compared with the interrupt latency of Linux and with other similar Xen benchmarks [4], [5].

II. RELATED WORK

The need for software partitioning in systems with different criticality levels has led industry to develop many hypervisors, such as PikeOS, XtratuM [6] and Jailhouse [7]. The design of these hypervisors was driven by the hard real-time requirements of industrial embedded systems. Although there is a theoretical basis for determining spatial and temporal independence in such systems [8], these properties have rarely been tested on different platforms in the context of hard real-time systems. There are tests and comparisons of different embedded hypervisors [9], but they focus on properties such as CPU overhead, memory bandwidth and lock synchronization. The authors of [7] emphasize the duration of virtual machine exits during interrupt handling and propose a methodology for measuring interrupt latency similar to the one proposed in this paper.

III. SYSTEM OVERVIEW

A. Xen Hypervisor

Xen is a bare-metal hypervisor used mostly in server applications. The Xen implementation uses paravirtualization as well as hardware virtualization extensions [10]. Xen creates a virtualization layer between the hardware and the guest operating systems. In the Xen context, guest operating systems are called domains [11]. The first domain is typically Linux, which contains a set of libraries for controlling the Xen system. Xen was originally developed for Intel processors, but it has since been ported to the ARMv7-A and ARMv8-A processor architectures [12]. The portable layer of Xen takes advantage of the ARM Generic Timer, the ARM Generic Interrupt Controller (GIC) and the ARM Memory Management Unit. Xen can therefore be easily ported to any SoC that implements these features, such as the Exynos 5422 or the Xilinx UltraScale+ MPSoC. It cannot be easily ported to platforms without a GIC interrupt controller (e.g. Raspberry Pi).

MIPRO 2018/SSE

IV. PROPOSED SYSTEM ARCHITECTURE

With the Exynos 5422 or a similar heterogeneous multicore SoC and the Xen Hypervisor, virtual CPUs (vcpus) can be statically assigned to physical CPUs (pcpus). Virtualization is often used with more virtual CPUs than physical CPUs, which makes virtual CPU scheduling necessary. In embedded applications, by contrast, it makes sense to have fewer virtual CPUs than physical CPUs, because in most cases the tasks of an embedded system are known in advance. The configuration of the proposed system is shown in Fig. 1. In all software configurations mentioned in this paper the Cortex-A7 cores run at 1.4 GHz and the Cortex-A15 cores run at 2.0 GHz.

Fig. 1. Configuration of proposed system

The software of the overall proposed solution is divided into three separate layers: the Xen Hypervisor running on top of the Exynos 5422, Linux running as domain 0 on four Cortex-A7 cores, and FreeRTOS running as domain 1 on one Cortex-A15 core. Linux uses the Cortex-A7 cores because it is the first operating system to boot after Xen, and the Xen boot command line parameter dom0_vcpu_pin pins it to those cores.

Xen typically uses paravirtualized interfaces for device drivers in a guest domain. This means that only domain 0 has the right to access the hardware directly, while the other guest domains access the hardware through the split device driver interface. This approach can degrade the latency in domain 1, because such indirect hardware access can be expensive in terms of time. To overcome this problem, a different mechanism, device passthrough, is used. With device passthrough in Xen, an external interrupt can be injected into the guest domain. The virtualization extensions of the GIC interrupt controller keep the interrupt injection overhead of the hypervisor minimal.

To ensure that vcpus are never migrated from their assigned pcpus, vcpu pinning is used. Virtual machine scheduler overhead can be minimized through the Xen Toolstack available in domain 0, which allows vcpus to be pinned to predefined pcpus.

V. REFERENCE SYSTEM AND TESTBENCH

A. Reference system configuration

Fig. 2 shows a reference configuration that uses only a single instance of the Linux operating system (without a hypervisor) on top of the Exynos 5422. The latencies measured for the system based on our proposed virtualization approach, shown in Fig. 1, are compared with this reference configuration. The kernel module exti_kern_irq.ko is used for detecting and reacting to the interrupt. This reference configuration was chosen as the most common application scenario when Linux alone is used on a multicore SoC. The kernel module implementation is deliberately simple: it only registers an interrupt service routine for the required interrupt number. Part of the implementation is shown in Listing 1.

Fig. 2. Reference system configuration

    /* Interrupt service routine: mirror the input pin level on the output pin. */
    static irqreturn_t exti_irq_handler(int irq, void *dev_id)
    {
        gpio_set_value(gpio_out, gpio_get_value(gpio_in));
        return IRQ_HANDLED;
    }

    static int __init exti_irq_init(void)
    {
        /* ... */
        /* Trigger on both edges of the external interrupt line. */
        result = request_irq(irq_no, exti_irq_handler,
                             IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING,
                             "exti_irq_handler", NULL);
        /* ... */
    }

Listing 1. exti_kern_irq.ko module implementation

B. Testing interrupt latency

Interrupt latency is one of the most important parameters in real-time embedded systems. Among other factors, the response time of the system depends greatly on it. In the context of this test case, the interrupt latency is defined as the difference between the time instant at which the external interrupt source is asserted (e.g. the logical voltage level changes and triggers the GPIO interrupt) and the time instant at which the reaction to the triggering event is finished. The test configuration is shown in Fig. 3.

Fig. 3. Test configuration

In the test configuration an STM32F4 microcontroller generates a PWM signal that is fed to the EXTI input of the Exynos 5422 system. The Exynos 5422 system responds to the external interrupt trigger by changing the logical level on the GPA2.4 pin. The time difference between these two signals is the time delay $t_d$, which is taken as the latency of the interrupt reaction and corresponds to the interrupt latency definition above. A similar measuring method is used in [7]. The time delay is measured precisely by the STM32F4 microcontroller as the time difference between the falling edge of the input PWM signal and the falling edge of the response signal from the Exynos 5422.

The measured interrupt latency can be considered a random variable whose statistics depend on the software configuration of the Exynos 5422 system. The aim is to minimize the standard deviation of the interrupt latency across software configurations, so that the system response is as deterministic as possible in terms of interrupt latency. An example of the input and output signals measured on an oscilloscope is shown in Fig. 4.

Fig. 4. Time delay $t_d$ between input and output signal shown on oscilloscope

The time delay $t_d$ can be described by equation (1):

$$t_d = t_{irq} + t_{user} \qquad (1)$$

where $t_{irq}$ is the part of the time delay that the hardware needs to process the interrupt source and call the interrupt service routine, and $t_{user}$ is the time in which the reaction to the interrupt is executed (in this case, changing the level of the output test pin GPA2.4). Equation (1) does not take OS overhead into account, so it must be extended by an additional term when an OS is part of the system configuration:

$$t_d = t_{irq} + t_{os} + t_{user} \qquad (2)$$

where $t_{os}$ is the time delay introduced by the internal mechanisms of the OS. In our proposed system architecture, equation (2) must be extended further to take into account the influence of the hypervisor layer on the interrupt latency:

$$t_d = t_{irq} + t_{hyp} + t_{os} + t_{user} \qquad (3)$$

where $t_{hyp}$ is the time delay introduced by the hypervisor.

The Generic Interrupt Controller (GIC) architecture supports interrupt injection through a special set of registers (list registers), which are maintained by the hypervisor [13]. In the Xen hypervisor this mechanism is implemented in arch/arm/gic.c and arch/arm/vgic.c. After an IRQ is trapped, it is emulated by modifying the list registers (see the function do_trap_irq in arch/arm/traps.c) and injected into the guest. The variation of $t_{hyp}$ is the most significant contributor to the variation of the interrupt latency in equation (3), because it depends on the states and utilization of all virtual machines and physical CPUs. The interrupt latency and its deviation also directly affect the scheduling latency in the FreeRTOS virtual machine, because scheduling depends on the arrival of the system tick interrupt. In this case the virtual timer, which is part of the ARM Generic Timer architecture [14], was used to generate the system tick interrupt in FreeRTOS.

The tests were performed for the two system configurations shown in Fig. 1 and Fig. 2.
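The edge-to-edge measurement of $t_d$ described above can be illustrated with a short sketch. It is not the authors' STM32F4 firmware, which the paper does not show. It assumes a free-running 32-bit capture counter, and the names elapsed_ticks and ticks_to_us as well as the timer_hz parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Ticks elapsed between the capture of the stimulus edge and the capture
 * of the response edge. Assumption: both values come from the same
 * free-running 32-bit counter. Unsigned subtraction is modulo 2^32, so a
 * single counter wraparound between the two captures is handled.
 */
uint32_t elapsed_ticks(uint32_t stimulus_capture, uint32_t response_capture)
{
    return response_capture - stimulus_capture;
}

/* Convert a tick count to microseconds for a counter clocked at timer_hz. */
double ticks_to_us(uint32_t ticks, uint32_t timer_hz)
{
    assert(timer_hz != 0);
    return (double)ticks * 1e6 / (double)timer_hz;
}
```

With a hypothetical 84 MHz counter clock, for example, 84 elapsed ticks correspond to a delay of 1 µs.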
To simulate a workload in the Linux operating system, the stress application was used. Stress is a simple Linux application that lets the user spawn workers which consume processor time and memory according to the given parameters. CPU and memory allocation operations in Linux affect the interrupt latency of Linux. When virtualization is used, they also affect the interrupt latency in the FreeRTOS virtual machine for the configuration shown in Fig. 1. Therefore, to test both system configurations under different load conditions, we ran three test scenarios: no intensive CPU or memory utilization; intensive memory utilization (malloc and free workers, stress -m $N --vm-bytes 64M); and intensive CPU utilization (sqrt workers, stress -c $N), where N is the number of cores available to Linux.

Intensive memory utilization as defined here differs from the traditional definition, which implies a high load on the CPU-to-RAM data path and a high rate of cache misses. Our interest is instead in how locking mechanisms in memory allocation can cause interference between the Linux and FreeRTOS operating systems. From now on we therefore refer to this kind of utilization as lock-intensive.

Because the results of these tests can vary widely with architecture, platform and hardware implementation, their usability is limited, although some qualitative properties of Xen interrupt processing can still be established. To address this limitation, we ran an additional interrupt latency test on bare-metal FreeRTOS and profiled the internal interrupt processing of Xen. From these measurements the ratio of $t_{irq}$, $t_{hyp}$, $t_{os}$ and $t_{user}$ was determined, which enables comparison with similar measurements in other papers.

Before presenting the benchmark of the various software configurations, we present a baseline scenario that provides an estimate of the time $t_{irq}$. The time $t_{irq}$ is estimated with an application that runs on bare-metal FreeRTOS and uses the ARM Generic Timer to determine the latency of the interrupt response. A similar procedure is used in the TBM application [15] by the Xen developers [5].
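A timer-based estimate of this kind depends on converting Generic Timer counter differences into time. The following is a minimal sketch of that conversion, not the authors' or the TBM implementation. The function names are hypothetical, and the deadline-based formulation (counter value on handler entry minus the value at which the timer was programmed to fire) is an assumption about how such a measurement can be taken.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Convert counter ticks to nanoseconds; cntfrq_hz is the counter frequency. */
uint64_t counter_ticks_to_ns(uint64_t ticks, uint64_t cntfrq_hz)
{
    assert(cntfrq_hz != 0);
    /* Split into whole seconds and remainder so that the multiplication
     * by 1e9 does not overflow for large tick counts. */
    uint64_t secs = ticks / cntfrq_hz;
    uint64_t rem = ticks % cntfrq_hz;
    return secs * NSEC_PER_SEC + (rem * NSEC_PER_SEC) / cntfrq_hz;
}

/*
 * Latency of one timer interrupt (assumed formulation): the counter value
 * read on entry to the handler minus the counter value at which the timer
 * was programmed to fire.
 */
uint64_t timer_irq_latency_ns(uint64_t deadline_ticks, uint64_t entry_ticks,
                              uint64_t cntfrq_hz)
{
    assert(entry_ticks >= deadline_ticks);
    return counter_ticks_to_ns(entry_ticks - deadline_ticks, cntfrq_hz);
}
```

At the 24 MHz counter frequency of this platform one tick is about 41.7 ns, which bounds the resolution of any latency figure obtained this way.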
We configure the physical timer to deliver an interrupt after 1 ms and, in the interrupt handler, read the difference between subsequent interrupt arrivals (by reading the CNTPCT register). Taking the timer frequency (24 MHz) into account, we derive the mean and standard deviation of the interrupt latency, which are shown in Table I. The results are also shown in Fig. 5.

TABLE I. INTERRUPT LATENCY ON BARE-METAL FREERTOS

Minimum (ns)    Average (ns)    Maximum (ns)    Standard deviation (ns)

Fig. 5. Interrupt latency on bare-metal FreeRTOS

VI. RESULTS

The following software configurations for the systems shown in Fig. 1 and Fig. 2 were tested experimentally:

1) Linux without the Xen Hypervisor (configuration shown in Fig. 2)
2) FreeRTOS in a Xen 4.10 virtual machine, with Linux in another Xen virtual machine, using the credit virtual machine scheduler (configuration shown in Fig. 1)
3) FreeRTOS in a Xen 4.10 virtual machine, with Linux in another Xen virtual machine, using the null virtual machine scheduler (configuration shown in Fig. 1)

The first configuration is the reference test case because it resembles the most common and practical configuration in use on the Exynos 5422 and similar SoCs. The second and third configurations represent the proposed system software architecture, in which virtualization is used. They differ mainly in the Xen virtual machine scheduler. The second test case uses the default credit scheduler, which allows more vcpus than pcpus. The third uses the experimental semi-static null scheduler [2], which removes unnecessary scheduling overhead.

The results obtained by the test procedure explained in Section V are shown in Tables II-IV and Figs. 6-8. For each test a series of interrupt latency samples was acquired.
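The summary figures reported in the tables (minimum, average, maximum and a measure of spread) can be computed from a sample series in one pass. The sketch below is an illustration only, not the authors' analysis code. The names latency_stats and summarize_latency are hypothetical, and using the population variance (dividing by n) is an assumption, since the paper does not say which estimator was used.

```c
#include <assert.h>
#include <stddef.h>

struct latency_stats {
    double min;       /* smallest sample */
    double avg;       /* arithmetic mean */
    double max;       /* largest sample */
    double variance;  /* population variance (assumption: divide by n) */
};

/* Summarize n latency samples (in microseconds) with Welford's algorithm. */
struct latency_stats summarize_latency(const double *samples_us, size_t n)
{
    assert(samples_us != NULL && n > 0);
    struct latency_stats s = { samples_us[0], 0.0, samples_us[0], 0.0 };
    double mean = 0.0; /* running mean */
    double m2 = 0.0;   /* running sum of squared deviations from the mean */

    for (size_t i = 0; i < n; i++) {
        double x = samples_us[i];
        if (x < s.min) s.min = x;
        if (x > s.max) s.max = x;
        double delta = x - mean;
        mean += delta / (double)(i + 1);
        m2 += delta * (x - mean);
    }
    s.avg = mean;
    s.variance = m2 / (double)n;
    return s;
}
```

The standard deviation is the square root of the returned variance, for instance via sqrt() from <math.h>. The running update avoids the loss of precision that the naive sum-of-squares formula can suffer when the spread is small compared with the mean, as it is for the null scheduler results.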
Every sample in a series is the time difference $t_d$ between the falling edges of the input and output signals, as described in Section V.B, and corresponds to the interrupt latency of the system configuration under test. Each series was analyzed by calculating the average, maximum and standard deviation of the observed interrupt latency. Each of Tables II-IV summarizes the three measurements obtained under the different load scenarios described in Section V. The sign - in the first row of a table means that there was no intensive CPU or lock-intensive utilization. Each table gives the average, maximum and standard deviation of the interrupt latency in microseconds. Figs. 6-8 show the time series of the acquired interrupt latency samples under lock-intensive utilization, as it causes the most significant deviation.

The results of the first software configuration are shown in Table II and in Fig. 6. They show how interrupt latency in Linux becomes nondeterministic under lock-intensive utilization. This behaviour is not acceptable in hard real-time and low-latency systems.

TABLE II. INTERRUPT LATENCY IN OPERATING SYSTEM LINUX

Utilization       Average (µs)    Maximum (µs)    Standard deviation (µs)
-                 7.63            23.57           0.56
CPU               7.33            69.60           5.18
Lock-intensive    9.99            157.29          7.24

Fig. 6. Interrupt latency in operating system Linux with lock-intensive utilization

The results of the second software configuration, with Xen and the default credit scheduler, are shown in Table III and Fig. 7. In this test case the standard deviation of the interrupt latency under the different utilization conditions is significantly smaller than in the first, reference test case.

TABLE III. INTERRUPT LATENCY IN FREERTOS (XEN - CREDIT SCHEDULER)

Utilization       Average (µs)    Maximum (µs)    Standard deviation (µs)
-                 8.68            23.38           0.39
CPU               8.77            22.95           0.38
Lock-intensive    9.20            62.38           1.37

Fig. 7. Interrupt latency in FreeRTOS (Xen - credit scheduler) with lock-intensive utilization in Linux

The results of the third software configuration, with Xen and the experimental null scheduler, are shown in Table IV and Fig. 8. The interrupt latency deviation in this software configuration is significantly smaller than in the other two cases.

TABLE IV. INTERRUPT LATENCY IN FREERTOS (XEN - NULL SCHEDULER)

Utilization       Average (µs)    Maximum (µs)    Standard deviation (µs)
-                 8.66            11.95           0.11
CPU               8.76            14.02           0.10
Lock-intensive    9.17            13.90           1.07

Fig. 8. Interrupt latency in FreeRTOS (Xen - null scheduler) with lock-intensive utilization in Linux

Finally, we present the results of profiling every part of the interrupt processing in the second and third configurations for the average case. The results are shown in Table V.

TABLE V. AVERAGE CASE INTERRUPT LATENCY $t_d$ FOR FREERTOS VIRTUAL MACHINE

$t_{irq}$ (µs)    $t_{hyp}$ (µs)    $t_{os} + t_{user}$ (µs)
0.6               6.8               1.0

The time $t_{irq}$ is the delay caused by hardware latency. The time $t_{hyp}$ is caused by Xen. To determine its value, the execution time of the function do_trap_irq in arch/arm/traps.c was measured. The times $t_{os}$ and $t_{user}$ depend on the implementation of the FreeRTOS port and of the user application.
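As a consistency check, the profiled components of Table V can be inserted into equation (3). This worked sum is an illustration added here and uses only values already reported in the tables:

$$t_d \approx t_{irq} + t_{hyp} + (t_{os} + t_{user}) = 0.6 + 6.8 + 1.0 = 8.4\ \mu\text{s}$$

The sum is close to the average latencies measured without load for the FreeRTOS virtual machine, 8.68 µs with the credit scheduler (Table III) and 8.66 µs with the null scheduler (Table IV). The remaining difference of about 0.3 µs is not attributed to any component by the profiling data.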
Our FreeRTOS port for Xen on ARM (see footnote 1) lets the user register an interrupt handler in an application for the external interrupt or for any other interrupt. This differs from the approach in [7], where the authors embedded the response to the external interrupt in assembly. That is not practical, but it eliminates the overhead created by the guest operating system and the application.

Footnote 1: FreeRTOS port repository:

VII. DISCUSSION

The experimental measurements of system behaviour under the different system configurations and load tests make it clear that the minimum interrupt latency deviation for the real-time subsystem (i.e. the virtual machine running FreeRTOS) is achieved in the third case, in which the experimental Xen null virtual machine scheduler was used. The null scheduler reduces virtual machine scheduler overhead and provides the most deterministic behaviour of the RTOS cores, even under high CPU and lock-intensive load on the cores assigned to Linux.

Fig. 7 shows periodic high values of interrupt latency caused by the Xen credit scheduler. The periodic interrupt latency deviations observed with the null scheduler in Fig. 8 are caused by lock-intensive utilization. With the null scheduler these deviations are almost absent when there is no high memory utilization, as shown in Fig. 9. The reason is that Xen manages the memory, so interrupts are delayed by critical sections that cannot be preempted. This problem could be resolved if memory could be statically assigned among the virtual machines.

Fig. 9. Interrupt latency in FreeRTOS (Xen - null scheduler) without lock-intensive utilization in Linux

The average-case interrupt latency profiling shows that the virtualization layer is responsible for a significant part of the total latency (about 70%). However, the out-of-the-box performance of Xen with a bare-metal application in a virtual machine shows a lower interrupt latency deviation than Linux, as the results elaborated above demonstrate. The Xen interrupt latency results in [4] and [5] are significantly smaller, but those measurements were made on a different platform (Xilinx UltraScale+ MPSoC) with a different processor architecture and profile (ARMv8, Cortex-A53), although the GIC architecture is the same (GICv2). It is unlikely that the difference in results is caused by a difference in methodology. In [5] the ARM Generic Timer was used to determine interrupt latency (with the TBM application [15]). Our results obtained with the methodology described in Section V.B are consistent with the results we obtain using the ARM Generic Timer as proposed in [5].

VIII. CONCLUSION

Although general-purpose operating systems such as Linux have gained significant popularity as a platform of choice in many embedded applications, their nondeterministic latency means they are not considered an optimal choice for applications with hard real-time requirements. This paper investigated the possibility of using the Xen Hypervisor to provide an elegant solution that brings acceptable real-time performance to a multicore SoC without modifying the Linux kernel. The results showed that, although the hypervisor layer introduces additional latency into the system response time, it can increase the predictability and reduce the variability of the interrupt response time of a real-time subsystem. The proposed approach achieved the best results with static assignment of vcpus to pcpus and the experimental Xen null scheduler. Lock-intensive utilization has a much greater impact on the predictability of the system response time than CPU load.

The research led to the conclusion that even better predictability of the system response time could be achieved by static assignment of memory resources, because the hypervisor can significantly increase the statistical variation in response time while performing memory allocation operations. Such an approach could be viable in embedded systems because task and memory assignments can in most cases be determined in advance. The goal of future research is to investigate hypervisor schedulers in the context of temporal and spatial isolation in mixed-criticality systems.

ACKNOWLEDGEMENT

This work has been supported by the European Regional Development Fund under the project System for increased driving safety in public urban rail traffic (SafeTRAM).

REFERENCES

[1] G. Heiser, "The role of virtualization in embedded systems," Proceedings of the 1st workshop on Isolation and integration in embedded systems, April
[2] "Introduce the null semi-static scheduler," Xen-devel Mailing List, April [Online]. Available:
[3] "Functional safety of electrical/electronic/programmable electronic safety-related-system," International Electrotechnical Commission, Geneva, CH, Standard,
[4] "Xen on ARM IRQ latency and scheduler overhead," Xen-devel Mailing List, February [Online]. Available:
[5] "Xen on ARM interrupt latency," Xen Project Blog, March [Online]. Available:
[6] S. Trujillo, A. Crespo, and A. Alonso, "Multipartes: Multicore virtualization for mixed-criticality systems," in 2013 Euromicro Conference on Digital System Design, Sept 2013, pp
[7] R. Ramsauer, J. Kiszka, D. Lohmann, and W. Mauerer, "Look Mum, no VM Exits! (Almost)," ArXiv e-prints, May
[8] A. Crespo, I. Ripoll, and M. Masmano, "Partitioned embedded architecture based on hypervisor: The xtratum approach," in 2010 European Dependable Computing Conference, April 2010, pp
[9] A. Patel, M. Daftedar, M. Shalan, and M. W. El-Kharashi, "Embedded hypervisor xvisor: A comparative analysis," in rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, March 2015, pp
[10] D. Chisnall, The Definitive Guide to the Xen Hypervisor. Pearson Education, Inc,
[11] P. Barham, B. Dragovic, K. Fraser, S. Hand, A. H. Tim Harris, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," Proceedings of the nineteenth ACM symposium on Operating systems principles,
[12] J. Roach, "Porting operating systems to run in xen virtual machines," Ground Vehicle Systems Engineering and Technology Symposium,
[13] ARM Generic Interrupt Controller, Advanced RISC Machines (ARM), [Online]. Available:
[14] ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition, Advanced RISC Machines (ARM), [Online]. Available:
[15] TBM application, GitHub, March [Online]. Available:
