Efficiency and memory footprint of Xilkernel for the Microblaze soft processor

Dariusz Caban, Institute of Informatics, Gliwice, Poland - June 18, 2014

The use of a real-time multitasking kernel simplifies the design of embedded software, but the kernel consumes part of the system's resources. This paper describes the results of research carried out to determine the overheads incurred by Xilkernel, a real-time multitasking kernel developed by Xilinx.

Introduction

An embedded system should react to events in its environment within a predetermined period of time. When timing requirements are modest, the system software checks the state of the environment cyclically, in an infinite loop, and takes the appropriate action once an event is detected. The processor can also be informed about events by interrupt request signals, which reduces the probability of missing them. An interrupt service routine (ISR) should complete as quickly as possible, so the ISR often only sets an event flag and the event itself is serviced in the loop.

If each event is tested once per pass of the loop, then in the worst case the servicing of an event starts only after all the remaining events have been serviced. When interrupts are used, the execution times of the ISRs must be added to this delay. Testing an event several times per pass reduces the time it waits to be serviced.

With such a structure, meeting more demanding timing requirements can be difficult or impossible. The solution is then to use a real-time multitasking kernel, also called a real-time operating system (RTOS). The system software is divided into tasks of varying importance. At any moment each task is in one of several states, and only tasks that are ready to run can be executed. A task can become ready to run as a result of an event. The kernel starts this task immediately, provided that its priority is higher than that of the currently running task [1, 2].
The kernel itself, however, requires part of the system's program and data memory, as well as CPU time. This paper presents the results of research aimed at determining the efficiency and memory footprint of Xilkernel, a real-time multitasking kernel developed by Xilinx [3].

Xilkernel

Xilkernel is a kernel for the following embedded processors: Microblaze, PowerPC 405 and PowerPC 440. It is integrated with the Xilinx Platform Studio (XPS) framework and is a free software library supplied with the Xilinx Embedded Development Kit (EDK). The library functions are written mainly in C. Apart from task management, the kernel provides typical services that enable tasks to, among other things:

- use semaphores, mutexes, message queues and shared memory,
- use software timers,
- dynamically allocate and free memory buffers,
- preempt themselves.

The user must configure the kernel appropriately. The configuration selects, among other things, the scheduling algorithm (priority-driven or round-robin), the sizes of the ready and wait queues, and the system services that the application will use. The XPS framework then generates the Xilkernel library containing the object code of the selected service functions, which is linked with the object code of the application into a single executable file. Xilkernel provides a POSIX interface to most of the library functions [3].

The measurement system

The measurement system was implemented in a Virtex-5 FPGA on the Xilinx ML505 Evaluation Platform [4]. It was designed and configured with XPS ver. 10.1. The structure of the system is depicted in Fig. 1.

Figure 1. Structure of the system in an FPGA circuit

The Microblaze is a 32-bit embedded RISC soft processor core, optimized for implementation in Xilinx FPGAs. A computer system based on the Microblaze has a Harvard architecture, with separate address spaces for instruction and data memory. The processor communicates with these memories through separate buses, the ILMB (Instruction Local Memory Bus) and the DLMB (Data Local Memory Bus), respectively [5]. The measurement system uses version 7.10d of the processor, and instructions and data are stored in a dual-port BRAM (XPS BRAM).

Input-output devices are connected to the processor through the PLB (Processor Local Bus). There are two timer/counters (XPS Timer): one is the system timer, the other was used to measure the time of operations performed by Xilkernel. During the measurements, pulses of a fixed frequency were counted. The UART device (XPS UART) was used to transmit the measurement results to a computer. Five digital input-output devices (XPS GPIO) were added so that the system had eight interrupt sources.
The interrupt controller (XPS INTC) was necessary because the Microblaze processor supports only one external interrupt source.

Microblaze instruction execution is pipelined. The pipeline can be configured with three or five stages, to minimize hardware cost or to maximize performance, respectively. For most instructions, each stage takes one clock cycle to complete. The processor in the measurement system was configured with a five-stage pipeline. The processor operated at 100 MHz, and the pulses counted during the time measurements had a frequency of 125 MHz (8 ns resolution).

The efficiency of Xilkernel v4.00.a was investigated. The kernel was configured for priority-driven preemptive scheduling only (it also supports round-robin scheduling). The following parameters were measured: interrupt latency, task latency and the execution times of the most important Xilkernel services used during normal system operation [6].

Interrupt latency

The response time of the system to events depends, among other things, on the interrupt latency. This term refers to the time that elapses from the appearance of an interrupt request signal to the start of the corresponding interrupt service routine. It depends in turn on how interrupts are handled.

The Microblaze processor supports only one external interrupt source. When an interrupt occurs, the instruction in the execution stage completes and the instruction in the decode stage is replaced by a branch to address 0x00000010 (for most instructions each stage takes one clock cycle). The return address is saved and further interrupts are disabled. A jump instruction to the system ISR must be stored at program memory addresses 0x00000010-0x00000013. The source code of this ISR is generated by XPS. The system ISR calls the user-specified ISR; if there are multiple interrupt sources, it must first determine which source raised the request. If a multitasking kernel is not used, the interrupted program is resumed when the system ISR completes [7].

Interrupt handling under Xilkernel supervision

Interrupt handling is slightly different when the system software runs under the supervision of Xilkernel. The system ISR saves the context of the interrupted task and then calls the user-specified ISR. The user-specified ISR can release a semaphore or send a message to a queue; only then does a task waiting for that semaphore or message become ready to run. Task scheduling, restoring the context of the selected task and re-enabling interrupts are carried out at the end of the system ISR [3].

To measure interrupt latencies, the timer/counter was programmed to work in Generate mode. This mode is useful for generating repetitive interrupt requests at a specified interval [8]. At the start of the user-specified timer ISR, the content of the counter register was read.
The counter counted up from an initial non-zero value, so the result of the measurement was equal to the difference between the value read and the initial value. Latencies were measured for the highest-priority[1] and lowest-priority interrupt requests, occurring while the processor was executing the program normally. The results are presented in Table 1. These values are minimal, since no other request was being serviced when the measured requests occurred. The longest instruction executed took 3 clock cycles to complete (the processor did not execute floating-point arithmetic instructions, which require 4-30 cycles).

Interrupt priority   System software without Xilkernel   System software with Xilkernel
highest              0.87–0.88 µs                        1.39–1.41 µs
lowest               1.54–1.56 µs                        1.96–1.98 µs

Table 1. Interrupt latencies

Task latency

If servicing an event is time-consuming, the most urgent actions should be performed in the interrupt service routine and the rest in a task [2]. The task can wait for a semaphore to be released or for a message in a queue. Releasing the semaphore or sending the message from the user-specified ISR only makes the waiting task ready to run; task scheduling is performed at the end of the system ISR. Consequently, it was possible to measure the task latency, which is the sum of the task scheduling and context restoring times. The results are presented in Table 2. The application consisted of only two tasks: the measurement task and the system idle task (with the lowest priority, always ready to run).

Task activation             Task latency
by releasing a semaphore    9.97–10.16 µs
by sending a message        14.01–14.35 µs

Table 2. Task latencies

Execution time of Xilkernel service functions

The execution times of the kernel services that can be used during interrupt service were measured, and the results are presented in Table 3. There are two modes of message passing, basic and enhanced; the user chooses the mode when configuring the kernel. In basic mode, Xilkernel allocates and frees the space for messages. For allocation it uses the bufmalloc() service function, which allocates a memory block from a pool. The execution time of this function is short and predictable. When enhanced mode is chosen, the user must allocate a memory block with the malloc() function. This function is typically slow and its execution time is unpredictable [2]. Enhanced mode should therefore not be used if an ISR can also send messages. The allocation times given in the table were obtained when there was only 1 free block in every pool.

Kernel service                                              Execution time
releasing a semaphore                                       3.2–3.39 µs
sending a 4-byte message (basic mode)                       7.22–7.54 µs
sending a 16-byte message (basic mode)                      7.19–7.51 µs
memory block allocation from an identified pool             1.41 µs
memory block allocation from any pool (there are 2 pools)   1.49–1.73 µs

Table 3. Execution time of kernel services used during interrupt service

Task preemption

Calling some of the kernel's services may or may not result in task preemption.
Preemption occurs when, for example, a task tries to take a semaphore that is already taken, or releases a semaphore for which a higher-priority task is waiting. The execution times of these services were measured for both cases, and the results are given in Table 4. Block diagrams of the tasks in the measurement applications are presented in Figures 2 and 3.

Kernel service                               Execution time a)   Execution time b)
taking a semaphore                           1.36 µs             9.31–9.44 µs
releasing a semaphore                        1.35 µs             13.08–13.74 µs
taking a mutex                               1.53 µs             9.41–9.54 µs
releasing a mutex                            1.34 µs             13.37–13.88 µs
sending a 4-byte message (basic mode)        5.31–5.5 µs         21.37–22.07 µs
sending a 16-byte message (basic mode)       5.42–5.62 µs        21.32–22.02 µs
reading a 4-byte message (basic mode)        5.55–5.74 µs        9.55–9.69 µs
reading a 16-byte message (basic mode)       5.53–5.72 µs        9.55–9.69 µs
sending a 4-byte message (enhanced mode)     6.26–6.46 µs        9.55–9.69 µs
sending a 16-byte message (enhanced mode)    6.26–6.46 µs        9.55–9.69 µs
reading a 4-byte message (enhanced mode)     5.54–5.74 µs        22.1–22.94 µs
reading a 16-byte message (enhanced mode)    5.54–5.74 µs        22.1–22.8 µs

a) without task preemption
b) with task preemption

Table 4. Execution time of kernel services that can cause task preemption

Figure 2. Block diagram of the task used to measure the execution time of kernel services when they do not cause task preemption

Figure 3. Block diagrams of the tasks used to measure the execution time of semaphore services when they cause task preemption

The memory block allocation time does not depend on whether the service function is called from an ISR or from a task (see Table 3). Freeing a block takes 3.18 µs when the pool is identified. Freeing a block from any pool, when there are 2 pools, takes 3.18–3.51 µs.

A task can yield the processor to the next task that is ready to run. This takes 10.76–11.08 µs.

Memory footprint of Xilkernel

The size of the Xilkernel code depends mainly on which services the application uses. If the application uses semaphores only, the kernel's code occupies about 12 KB. If all services are used, the code size is about 20.5 KB. The size of the required RAM depends on many factors, including the number of priority levels, the lengths of the queues of ready and blocked tasks, the number of semaphores and software timers, and the lengths of the message queues. With the default values of the configuration parameters, the required RAM is:

- about 46.5 KB, if the application uses semaphores only,
- about 59 KB, if the application uses all of the kernel's services and there are 10 blocks of shared memory, 1 KB each.

In addition, an individual stack is assigned to each task. The default stack size is about 1 KB.

Summary

The use of a multitasking kernel simplifies the design of system software. The software is divided into tasks responsible for individual portions of work. Each task is assigned a priority that determines its importance, and the kernel ensures that more important tasks are performed first. The kernel, however, requires part of the system's memory and consumes some CPU time. The research presented in this paper was carried out to determine the efficiency and memory footprint of Xilkernel. Knowledge of the kernel's efficiency should help a prospective developer of an embedded system to estimate whether the system can meet its timing requirements.

References

1. Kalinsky D.: Context switch. Embedded Systems Programming, February 2001.
2. Simon D.E.: An Embedded Software Primer. Addison-Wesley, 1999.
3. Xilkernel (v4.00.a). www.xilinx.com.
4. Xilinx ML505 Evaluation Platform Documentation. www.xilinx.com.
5. Microblaze Processor Reference Guide. www.xilinx.com.
6. Lamie W., Carbone J.: Measure your RTOS's real-time performance. Embedded Systems Programming, May 2007.
7. Glover P., MacMahon S., Man Shakya D.: Using and Creating Interrupt-Based Systems. Application Note XAPP778, www.xilinx.com.
8. LogiCORE IP XPS Timer/Counter. www.xilinx.com.

[1] If Xilkernel is used, this is the second most important interrupt request; the most important is the request from the system timer.