Hardware Support for Histogram-based Performance Analysis of Embedded Systems



2017 IEEE 20th International Symposium on Real-Time Distributed Computing

Thomas Ballenthin
HBM GmbH, Darmstadt, Germany

Boris Dreyer and Christian Hochberger
Fachgebiet Rechnersysteme, Technische Universität Darmstadt, Darmstadt, Germany

Simon Wegener
AbsInt Angewandte Informatik GmbH, Saarbrücken, Germany

Abstract

In the past, timing analysis of embedded systems has focused mainly on the Worst-Case Execution Time (WCET). This was (and still is) important for giving guarantees when the system is used in safety-critical environments. Today, two reasons call for a slightly changed perspective. Firstly, the complex and often unpredictable internal structure of modern system-on-chip architectures prohibits the calculation of realistic upper bounds for the WCET. Secondly, even if we can compute a realistic value for the WCET, the developer still does not know how the code under scrutiny behaves in general and whether it is useful or necessary to spend time on optimising this code. In this contribution, we present a new method and hardware architecture to collect Execution Time Profiles (ETPs), which give much more insight into the execution time behaviour of modern system-on-chip architectures than was previously available.

I. INTRODUCTION

System-on-Chip (SoC) architectures are the predominant way to implement embedded systems. They offer high performance and are flexible through their software programmability. They are often used to control a technical context like the engine or the brake system in a car. For safety-critical systems, safety standards (e.g. ISO and DO-178B/C) require that guarantees are given for the upper bounds of the execution time of specific parts of the software. In the past, this Worst-Case Execution Time (WCET) could be analysed statically [1]. Unfortunately, today's SoCs are highly complex systems with components working in parallel, often in an unpredictable way.
Examples of this unpredictable behaviour are random cache replacement strategies or complex bus arbiters. Although such elements can be taken into account during WCET estimation by assuming maximal effects, the resulting times are far from realistic. Nowotsch et al. [2], for example, show the effect of assuming maximal bus contention in a worst-case analysis compared to a more sophisticated approach. Additionally, the WCET does not help programmers to identify sources of high variance.

In this article, we present a new method and HW architecture that allows us to capture detailed Execution Time Profiles (ETPs) in the form of histograms. Our method is based on a new approach to processing trace data of SoCs, which we have previously published [3]. The great benefit of this approach is that it is non-intrusive and runs online (while the target system executes). Thus, it can aggregate statistics over arbitrarily long test cycles. The execution time profiles can be used to understand the runtime behaviour of complex SoCs which cannot be analysed statically. The histogram bin distribution gives an overview of the execution time and enables us to make statistical statements about execution time probabilities.

This work was funded within the project CONIRAS by the German Federal Ministry for Education and Research with the funding ID 1IS1329. The responsibility for the content remains with the authors.

The following section discusses related work. Section III describes the environment and tools that have been used to carry out the experiments. The mechanism to detect the currently executed function is reviewed in Section IV. It is followed by our main contribution in Section V, where we explain how the detailed execution profile is captured in hardware. Section VI discusses different options to store the histogram in an FPGA. It is followed by an explanation of the computation of execution profiles in Section VII.
In Section VIII, we apply our tool and method to the well-known debie1 timing analysis benchmark. Finally, a conclusion is given.

II. RELATED WORK

Some research has been carried out to improve the predictability of processor architectures, either by design or by configuration (e.g. Cullmann et al. [4], Schoeberl et al. [5]). We, however, focus on a COTS architecture (ARM Cortex-A9) which has many unpredictable features.

ETPs can be used in many different ways. Many papers about probabilistic schedulability analysis [6], [7], [8] require ETPs, for example to compute the response time distribution of each task in the system. Abeni et al. [9] and Cazorla et al. [10] have studied how to schedule various tasks while maintaining a given quality-of-service level. This is done using Reservation-Based-Scheduling algorithms which utilise ETPs [11]. With the help of ETPs, the probability that a timing constraint can be met is ascertained. The major challenge in these approaches is to obtain reasonable Execution Time Profiles for any given program code. Santinelli et al. [12] and Kaczynski et al. [13] use ETPs to estimate the execution times of programs and tasks.

A measurement-based approach to create ETPs is presented by Hansen et al. [14]. For this, the trace data of the processor was recorded and then evaluated. The recording time of the trace data was limited to 12 minutes. Furthermore, the program code was annotated with markers to assign the trace data, which alters the execution time. We, in contrast, process the trace data of the processor in real time and are therefore not limited in observation time. Also, we do not have to instrument the program under test, so we do not alter the program's timing behaviour and thus create authentic histograms.

In order to process the high-speed trace data in real time, we generate histograms in hardware. The use of hardware implementations to speed up histogram calculations is not new. It is a common feature to accelerate image processing [15] or face recognition. Alsuwailem and Alshebeili [16] presented an approach for computing histogram statistics and histogram equalisation in parallel to speed up image enhancement. Sanny et al. [17] developed a histogram implementation for image processing with a frame rate of 3 on a Virtex-7 FPGA. Stekas and v. d. Heuvel [18] use Local Binary Patterns Histograms (LBPH) to extract features from test face images, implemented on a Zynq-7030 SoC.

In our architecture, the trace extraction unit generates the events to be processed as histograms. These events are emitted at high frequency and require a fast histogram computation. We designed and implemented our own histogram module because none of the histogram implementations above meets our real-time demands, e.g. processing one event per cycle at an FPGA clock above 200 MHz.

III. TIMING ANALYSIS FRAMEWORK

In previous work [3], we developed a non-intrusive measurement-based timing analysis framework. Our framework works on the object code level and is split into three phases: an offline pre-processing phase, the continuous online aggregation phase and an offline post-processing phase.
For this work, we added a histogram-based statistics module to the FPGA and replaced the hybrid WCET estimation backend with an ETP computation backend. The workflow of our method is shown in Figure 1. We assume a static schedule where each task runs on one (predefined) core of a multicore system. Each core uses its own trace extraction and continuous aggregation modules. Hence it suffices to describe the workflow for a single core. Moreover, we assume that the analysed software consists of non-recursive functions only (loops are allowed, though).

A. Pre-Processing

First, the binary reader disassembles a fully linked binary executable into its individual instructions. Architecture-specific patterns decide whether an instruction is a call, branch, return or just an ordinary instruction. This knowledge is used to form the basic blocks of the control flow graph (CFG). Then, the control flow between the basic blocks is reconstructed. In most cases, this is done completely automatically. However, if a target of a call or branch cannot be statically resolved, the user needs to write annotations to guide the control flow reconstruction. This can happen, for example, if the program contains calls via function pointer arrays.

The embedded trace unit (ETU) of modern ARM processors (like the Xilinx Zynq XC7Z020 featuring a dual-core ARM Cortex-A9) is not fully compatible with the CFG model. The ETU emits a waypoint event for each non-linear change of control flow, for example for interrupts and hardware exceptions, but also for normal calls and branches. So-called waypoint instructions always generate waypoint events. Amongst others, instructions that possibly modify the program counter are waypoint instructions. This is enough to fully reconstruct the control flow, but it is less fine-grained than the CFG (see Figure 2). Therefore, after CFG reconstruction, the waypoint graph (WPG) is computed.
To do so, a pattern matcher checks for each instruction whether it is a waypoint instruction. Afterwards, the edges of the WPG are computed. For each waypoint instruction found, the algorithm follows the edges in the CFG to find reachable waypoints. This gives the direction of a waypoint edge and its target.

Now, the analysis hardware is powered on and its Virtex-7 FPGA is configured with a bitstream that contains the trace extraction and the continuous aggregation unit. This configuration is not application-specific. Therefore, it only needs to be created once and can be used to create ETPs of distinct applications. Then, the WPG is used to create an application-specific meta-configuration for the trace extraction module as well as for the continuous online aggregation module. A unique ID is assigned to each edge in the WPG and the lookup tables in the function automata cluster are instantiated. After the creation, the modules are meta-configured. In contrast to the creation and configuration of a Virtex-7 FPGA with a large bitstream, the creation and configuration of a small application-specific meta-configuration happens within a few seconds.

B. Trace Extraction and Continuous Aggregation

During the program's execution, the ETU continuously emits raw trace data. This stream of data is fed into the trace extraction module. There, the raw data is decoded and compiled into an event stream. An event is generated for each traversal of a waypoint and consists of an ID and a timestamp. A special ID is used if the waypoint does not belong to the WPG computed during the pre-processing phase. This happens, for example, in case of an interrupt. The resulting waypoint event stream is then fed into the continuous aggregation module.

The continuous aggregation module handles the recording of histograms. It consists of the function automata cluster and the histogram module.
The function automata cluster generates a function event stream based on the waypoint event stream (see Section IV). The histogram module updates the histograms by processing either the function event stream or the raw waypoint event stream (see Section V).
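The waypoint graph construction from the pre-processing phase can be illustrated in software. The following is a minimal sketch, not the authors' tool: the function name `waypoint_graph`, the dictionary-based CFG encoding, the use of the successor index as the edge "direction", and the choice to start edge IDs at 1 are all assumptions made for illustration.

```python
from collections import deque


def waypoint_graph(cfg, waypoints):
    """Return {(source, direction, target): edge_id} for a CFG.

    cfg maps each instruction to the ordered list of its successors;
    waypoints is the set of waypoint instructions (hypothetical encoding).
    """
    edges = set()
    for src in waypoints:
        # Each successor of a waypoint is one direction (e.g. not taken / taken).
        for direction, first in enumerate(cfg.get(src, [])):
            seen, queue = set(), deque([first])
            while queue:
                node = queue.popleft()
                if node in seen:
                    continue
                seen.add(node)
                if node in waypoints:
                    # Stop at the first waypoint reached on this path.
                    edges.add((src, direction, node))
                else:
                    queue.extend(cfg.get(node, []))
    # Unique IDs start at 1; keeping 0 free for "unknown" is an assumption.
    return {edge: i for i, edge in enumerate(sorted(edges), start=1)}


# Toy CFG: two branches (blt, beq) and a return, with ordinary instructions between.
cfg = {
    "blt": ["add", "b3"], "add": ["cmp"], "cmp": ["beq"],
    "b3": ["ret"], "beq": ["blt", "ret"], "ret": [],
}
wpg = waypoint_graph(cfg, {"blt", "beq", "ret"})
```

In this toy example the ordinary instructions `add`, `cmp` and `b3` disappear from the graph, which mirrors the statement that the WPG is less fine-grained than the CFG.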

Fig. 1. Workflow of our approach. It is split into three phases: offline pre-processing, continuous online aggregation and offline post-processing.

Fig. 2. CFG with highlighted waypoint instructions.

Fig. 3. Exemplary comparator tree for one function having one entry and three exits. Unused comparators (shown in grey) return false.

C. Post-Processing

After the program has finished (or the test engineer has collected enough data), the post-processing phase is started by downloading the histograms from the FPGA's memory. Subsequently, ETPs are calculated from the measured histograms, either directly from the function histograms or derived from the individual waypoint edge histograms (see Section VII).

IV. FUNCTION AUTOMATA CLUSTER

The function automata cluster models the mapping from the waypoint edge events to function events. For each function in the WPG, it contains a set of comparator trees and a finite state machine (FSM).
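A software model can make the behaviour of one per-function automaton concrete. The sketch below assumes the three states (Out, In, Unknown) and the three inputs (enter, exit, exception) that this section defines for the FSM. The class name `FunctionAutomaton` is hypothetical, and so is the accounting rule that the cycles of the exiting edge are counted while the cycles of the entering edge are not; the hardware may attribute them differently.

```python
class FunctionAutomaton:
    """Software model of one function's FSM with its cycle counter."""

    def __init__(self):
        # Tracing may start in the middle of a run, so nothing is known yet.
        self.state = "Unknown"
        self.counter = 0

    def step(self, signal, cycles):
        """Process one waypoint edge event.

        signal is "enter", "exit", "exception" or None (edge inside or
        outside the function); cycles is the time since the last event.
        Returns the function's execution time when an In -> Out
        transition occurs, otherwise None.
        """
        result = None
        if signal == "exception":
            self.state, self.counter = "Unknown", 0
        elif signal == "enter":
            self.state, self.counter = "In", 0
        elif signal == "exit":
            if self.state == "In":
                # Assumption: the exiting edge was executed inside the function.
                result = self.counter + cycles
            self.state, self.counter = "Out", 0
        elif self.state == "In":
            self.counter += cycles
        return result


fa = FunctionAutomaton()
events = [("exit", 5), ("enter", 3), (None, 10), (None, 7), ("exit", 4)]
times = [fa.step(signal, cycles) for signal, cycles in events]
```

The first exit produces no function event because the automaton was still in Unknown; only the complete enter-to-exit run is reported.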
The comparator trees (Figure 3) translate waypoint edge events into inputs for the FSM, namely enter (the function has been entered), exit (the function has been exited), and exception (knowledge about the function has been lost). The compare values (= edge IDs) of the comparator trees are part of the configuration that is loaded before the online aggregation phase is started.

The FSM (Figure 4) is used to measure the execution time of one function. It consists of three states, namely Out (the function is not being executed), In (the function is being executed), and Unknown (it is not known whether the function is being executed or not). If the function is not executed, the FSM is in state Out and the execution cycle counter is zero. Once the function is executed, the state changes to In and the counter accumulates the execution times of the executed waypoint edge events. As soon as the machine changes its state from In to Out, the counter value is taken as the function's execution time and a function event is emitted.

It is possible that a trace analysis starts after the program execution has been started. Consequently, there is a lack of function execution information at the beginning of the analysis. Therefore, the initial state of the FSM is Unknown.

Fig. 4. Finite state machine that measures the execution time of one function. Dashed edges are traversed for exception events. As long as the automaton is in state In, the counter accumulates the execution time of executed waypoint edge events.

V. HISTOGRAM MODULE

The Histogram Module can be connected either to the function event stream generated by the function automata cluster, calculating histograms at the function level, or directly to the waypoint edge event stream generated by the instruction reconstruction unit. For the sake of simplicity, we call both function events and waypoint edge events simply events. Each event consists of an identifier (event_id) that identifies either a function or a waypoint edge, and the elapsed cycles (event_cycles) since the last event.

The collected statistical data for each event_id is stored as a bin distribution, also known as a histogram. Each histogram consists of $k = 2^n$, $n \in \mathbb{N}$, bins. Each bin has an identifier $j$, a corresponding lower bound $x^l_j \in \mathbb{N}$ and an upper bound $x^u_j \in \mathbb{N}$. The bins are defined as disjoint left-open intervals $]x^l_j, x^u_j]$. Therefore, the lower bin boundary $x^l_j$ is not part of the interval, and the first lower bound is set to zero, $x^l_0 = 0$, since all execution times are strictly positive. The upper bin boundary $x^u_j$ is part of the interval and is simultaneously the lower bound $x^l_{j+1}$ of bin $j+1$; thus $\forall j \in \{0, \dots, k-2\}: x^u_j = x^l_{j+1}$ applies. The last upper bin bound is set to infinity: $x^u_{k-1} = \infty$.

A. Linear Bin Distribution

Using a linear bin distribution, the upper bin boundaries $x^u_j$ are linearly distributed, starting from zero. The range of a single bin is given as $step \in \mathbb{N}$. This directly gives the cycle count up to which an event is sorted into the individual bins: $step \cdot (k-1)$.
All other events, i.e. those with an associated cycle count greater than this threshold, are put in the last (accumulative) bin. Table I shows an example of a linear distribution with $k = 8$ bins and $step = 4$.

TABLE I. LINEAR BIN DISTRIBUTION WITH PARAMETERS k = 8 AND step = 4.
bin j                   0   1   2   3   4   5   6   7
upper bin bound x^u_j   4   8   12  16  20  24  28  ∞

B. Central Bin Distribution

The central bin distribution has been developed on the assumption that the expected run times are clustered around a reference time. To obtain a more detailed view of the measured times, the bins are distributed around this reference time. When an event_id is handled for the first time, the corresponding execution time event_cycles is stored for further calculation as the reference value event_cycles_ref. All future events with the same event_id use the same event_cycles_ref to determine the bin distribution.

The distribution of the bins around the reference value event_cycles_ref can be arranged flexibly by several parameters. These can be modified to meet the varying requirements of different program code. The overall number of bins is determined by the parameter $k$. These bins are spread around the event_cycles_ref value, where the parameter $k_l$ sets the bin number in which event_cycles_ref would be sorted and thus the number of lower bins. event_cycles_ref is the upper bound of the corresponding bin, $x^u_{k_l - 1} = event\_cycles\_ref$, and it holds that $k_l \le k$.

The compressing factor $f_{comp}$ sets the range $r$ of the histogram. As seen in Eq. 1, the range depends on the event_cycles_ref value. It limits the overall span of the histogram, which may be truncated to give a more detailed view of the relevant area. Furthermore, it reduces the number of potentially empty bins. Every value outside the range is put in either the smallest or the largest bin. Dividing the range $r$ by the number of bins $k$ gives the size of a single bin, $step$, which is also the span between two boundaries (Eq. 2).
The upper boundary $x^u_j$ of an individual bin $j$ is calculated relative to the reference value event_cycles_ref, as shown in Eq. 3.

$r := event\_cycles\_ref \cdot f_{comp}$    (1)

$step := \frac{r}{k}$    (2)

$x^u_j := \begin{cases} event\_cycles\_ref - (k_l - j - 1) \cdot step, & 0 \le j \le k-2 \\ \infty, & j = k-1 \end{cases}$    (3)

A special case occurs when $event\_cycles\_ref < k - 1$. This results in $step = 0$. Whenever this happens, a linear bin distribution is used for all events with the affected event_id. The procedure is shown as a flow chart in Figure 5.

An example of the calculation of bin boundaries is given in Figure 6. The number of bins is set to $k = 8$, with $k_l = 6$ lower bins. Processing an event with $event\_cycles\_ref = 60$ and using a compressing factor of $f_{comp} = 0.8$ leads to a range of $r = 48$. It is evident that bin zero covers all events with event_cycles smaller than or equal to 30 and bin seven covers all events with event_cycles larger than 67 cycles. The remaining bins, one to six, provide a detailed view of the recorded events.

VI. FAST HISTOGRAM STORAGE

After the current bin for a single event has been calculated, the corresponding histogram needs to be updated.
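The two bin distributions of Section V can be sketched in a few lines. This is an illustrative model under stated assumptions, not the hardware implementation: the function names are hypothetical, the range $r$ is rounded to an integer, $step$ uses integer division, and the fallback for $step = 0$ is taken to be a linear distribution of width 1. The central boundaries follow one reading of Eq. 3, in which the reference value is the upper bound of bin $k_l - 1$.

```python
import math


def linear_upper_bounds(k, step):
    """Upper bounds x^u_j for a linear distribution; the last bin is open-ended."""
    return [step * (j + 1) for j in range(k - 1)] + [math.inf]


def central_upper_bounds(k, k_l, f_comp, ref):
    """Upper bounds x^u_j for bins arranged around the reference value ref."""
    r = round(ref * f_comp)  # Eq. 1; rounding to an integer is an assumption
    step = r // k            # Eq. 2; integer division is an assumption
    if step == 0:
        # Special case: fall back to a linear distribution (width 1 assumed).
        return linear_upper_bounds(k, 1)
    # Eq. 3 as read here: ref is the upper bound of bin k_l - 1.
    return [ref - (k_l - j - 1) * step for j in range(k - 1)] + [math.inf]


def find_bin(bounds, cycles):
    """Index of the left-open interval ]x^l_j, x^u_j] that contains cycles."""
    for j, upper in enumerate(bounds):
        if cycles <= upper:
            return j


linear = linear_upper_bounds(8, 4)
central = central_upper_bounds(8, 6, 0.8, 60)
```

With the example parameters from the text ($k = 8$, $k_l = 6$, $f_{comp} = 0.8$, reference value 60), the sketch yields a step of 6, places the reference value at the upper edge of bin 5 and lets bin 0 collect everything up to 30 cycles.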

Fig. 5. Flow chart for determining the bin using the central bin distribution.

Fig. 6. Example bin distribution with $k = 8$ bins, $k_l = 6$ lower bins, a compressing factor of $f_{comp} = 0.8$ and $event\_cycles\_ref = 60$.

The histogram update happens within the storage module, which provides two storage targets. All histogram data is stored either within the internal memory blocks (BRAM) or in external memory.

A. Storage Using BRAM

Using BRAM simplifies the architecture, because a word can be written to the memory in every clock cycle. Each bin is stored at a specific address. Whenever a bin needs to be updated, its old value needs to be read, incremented by one and stored to the same address. This is achieved using dual-port memory, which allows simultaneous read and write access. We used a pipelined architecture to improve the performance of the storage module. Whenever the same bin is incremented several times in a row, the access is buffered and the accumulated value is written to the BRAM only once. This reduces the number of memory accesses.

If the program to be analysed is so complex that it produces a large number of event_ids, the memory requirement exceeds the available BRAM resources. In this case, external memory can be used.

B. Storage Using External Memory

To handle the access delays caused by the use of external memory, the Memory Buffer module (Figure 7) provides a caching and buffer infrastructure. Different Memory Master modules, for either RLDRAM or AXI memory, can be used.
Depending on the functionality provided by the Memory Master, the Memory Adapter module uses either sequential or parallel read and write accesses. The Histogram Cache is implemented in BRAM and temporarily stores several histograms. We anticipate a clustering of event_ids according to the principle of locality.

The FIFO Cache-Map stores the received event_ids, and a new and unique cache_id is assigned to every entry. Whenever a new event_id is processed, the FIFO is checked first. A memory access needs to be performed only if the FIFO does not already contain the event_id. The event_ids and the corresponding cache_ids are directly mapped. Thus, the Histogram Cache needs to hold as many histograms as the FIFO Cache-Map has entries.

If the external memory is ready to be accessed, the Memory Adapter takes the first entry from the FIFO Cache-Map. This entry contains the pair of event_id and cache_id. The cached histogram and the externally stored histogram are read simultaneously and the corresponding bins are summed up. Afterwards, the result is written back to the external memory.

The efficiency of the FIFO Cache-Map depends on the executed program. The performance improves when the same set of event_ids is aggregated continuously. If an overflow of the FIFO occurs, its size needs to be increased.

VII. COMPUTING EXECUTION TIME PROFILES

ETPs can easily be computed from histograms by normalising each value from raw counts to its corresponding percentage in the histogram. This gives the probability for an execution to have an execution time corresponding to a given bin in the histogram (i.e. the probability mass function). When interested in worst-case behaviour, other representations of ETPs might be of interest, in particular the complementary cumulative representation [19]. There, for each bin, we calculate the probability that an execution has an execution time greater than the given bin. This is done by computing $1 - \sum_{i \le j} bin_i$, where $bin_i$ denotes the normalised value of bin $i$.
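Both representations can be derived from raw bin counts in a few lines. The sketch below is a plain software illustration of the normalisation and of the complementary cumulative form; the function names `etp_pmf` and `etp_ccdf` and the list-based histogram are assumptions, and the special treatment of the open-ended last bin is not modelled.

```python
def etp_pmf(counts):
    """Normalise raw bin counts to probabilities (probability mass function)."""
    total = sum(counts)
    return [c / total for c in counts]


def etp_ccdf(counts):
    """Probability that an execution takes longer than the upper bound of bin j."""
    total = sum(counts)
    ccdf, seen = [], 0
    for c in counts:
        seen += c
        # Counting the remaining events avoids accumulating rounding errors.
        ccdf.append((total - seen) / total)
    return ccdf


histogram = [2, 6, 2]  # toy counts for three bins
pmf = etp_pmf(histogram)
ccdf = etp_ccdf(histogram)
```

For the toy histogram, 80 % of the executions exceed the first bin and none exceed the last one, which is the shape a deadline-overrun argument reads off directly.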
Figures 8 and 9 show the ETPs in complementary cumulative representation for the measured histograms. Note that events from the last bin of a histogram, i.e. the bin that spans until infinity, are not properly represented in the ETPs, as we assumed during conversion that these events have a value of $x^u_{k-2} + 1$.

The aforementioned ETPs directly correspond to the measured histograms based on the function event stream. We also measured histograms based on the waypoint edge event stream. This gives ETPs for individual edges in the WPG. Although these ETPs might be interesting in themselves, a more natural (but also more coarse-grained) view on performance asks for the computation of ETPs for whole functions based on the ETPs of the individual edges of a function.

Fig. 7. Histogram caching and storing module.

Two operations are needed to carry out this computation: (a) choice and (b) convolution. Choice is used to calculate the least upper bound of two paths in the WPG. Depending on the context, we either maximise or minimise with the choice operation. Convolutions are used to construct paths through the WPG out of individual edges. If a function contains loops, the application of choice and convolution needs to be iterated to generate all possible paths through the loop.

Depending on the purpose of the analysis, different convolutions can be used to construct paths. Under the assumption that the execution times of edges are independent of each other, one can use the Gaussian convolution. However, in the presence of caches and other hardware features which take the history of instructions into account, this assumption is rather simplistic. For worst-case or best-case assessments, one can use the supremal convolution or the infimal convolution, respectively. However, both might heavily overestimate (or underestimate) the real probabilities for a given execution time. Convolutions that correctly model the features that influence the execution time depend on the hardware on which a program is executed. They might be arbitrarily complex and are not in the scope of this publication. In Section VIII-D, we compare the ETPs derived with the help of different convolutions with an ETP that has been measured.

VIII.
USE CASE: ANALYSIS OF THE DEBIE1 BENCHMARK

A. Settings

The target (COTS) SoC in our prototype is a Xilinx Zynq XC7Z020 featuring a dual-core ARM Cortex-A9 running at 667 MHz. The FPGA part of this SoC was only used to route the trace data to the timing analysis platform, which utilises a Xilinx Virtex-7 FPGA. The memory subsystem of the SoC consists of separate L1 instruction and data caches, each storing 32 kilobytes, 512 kilobytes of shared L2 cache and 1 gigabyte of DDR main memory. The debie1 benchmark was running on one core. On the second core, a custom benchmark in a FreeRTOS instance was running to generate interference on the shared L2 cache and the shared interconnects. This program consisted of multiple threads which communicated over a shared buffer. The debie1 benchmark was compiled with the C++ compiler provided with the Xilinx SDK 2016.1, GNU C/C++ (prerelease), with the flags -mcpu=cortex-a9, -mfpu=vfpv3, -mfloat-abi=hard, -g3 and -O.

B. The Benchmark

The debie1 benchmark [20], [21] is based on the on-board software of the DEBIE-1 satellite instrument for measuring impacts of small space debris or micro-meteoroids. It defines six analysis problem sets, each derived from the original real-time requirements of the satellite instrument. For example, one problem considers the required deadline of the Interrupt Service Routine (ISR) TM_InterruptService.

For our evaluation, we measured the execution times of the four tasks and two ISRs of the debie1 benchmark. Table II shows the number of observed function events for each of them. It also shows the reference value used for the centered bin distributions (Figure 9) as well as the minimal and maximal observed execution times of the tasks and ISRs. The aforementioned ISR TM_InterruptService, for example, was called during the execution of the benchmark with a maximal observed execution time of 239 cycles.

C.
Analysis Based on Function Events

Figures 8 and 9 depict the results of our measurements of the tasks and ISRs of the debie1 benchmark. Histograms are shown on the left (in red) and ETPs are shown on the right (in blue). Figure 8 shows the histograms and ETPs when using a linear bin distribution with 128 bins, each of width 8. Figure 9 shows the histograms and ETPs when using a centered bin distribution.

Fig. 8. Histograms and ETPs of the four tasks and two ISRs of the debie1 benchmark (panels: HandleAcquisition, HandleHealthMonitoring, HandleHitTrigger, HandleTelecommand, TC_InterruptService, TM_InterruptService). Configuration: linear bin distribution, k = 128 bins, step = 8.

TABLE II. NUMBER OF FUNCTION EVENTS AND REFERENCE VALUES AS WELL AS MINIMAL AND MAXIMAL OBSERVED EXECUTION TIMES OF THE TASKS AND ISRS OF THE DEBIE1 BENCHMARK.
Columns: Name, #Events, Reference, Min., Max.
Rows: HandleAcquisition, HandleHealthMonitoring, HandleHitTrigger, HandleTelecommand, TC_InterruptService, TM_InterruptService

By construction, the histograms using a linear bin distribution can only track events well if their execution time is less than 1017 cycles. All events with a higher execution time are accumulated in the last bin.
This works well for the ISRs TC_InterruptService and TM_InterruptService, which have maximal observed execution times below this threshold. For the task HandleTelecommand, only a few events exceed this threshold, and thus the histogram represents the execution time distribution of this task rather well. However, since the real maximal observed execution time cannot be tracked by the histogram, the resulting ETP is cut off at 1017 cycles. This also happens for the tasks HandleAcquisition, HandleHealthMonitoring and HandleHitTrigger. For the last two tasks, we can in fact infer no meaningful ETPs with a linear bin distribution, as almost all events have associated execution times above the threshold.

For HandleHealthMonitoring, this problem can be solved with the centered bin distribution. The histogram nicely shows how the execution time of most runs is distributed. For the other tasks and ISRs, the reference values do not fit the distributions well, and therefore most events are put either in the first or in the last bin. The centered bin distribution is thus very sensitive to the choice of reference value.

Overall, we obtain good results for those tasks and ISRs that have low execution times (with the linear bin distribution) and for HandleHealthMonitoring, where we have a good reference value. Our method performs particularly badly for HandleAcquisition and HandleHitTrigger, which have a wide spread between the minimal and maximal observed execution times and low reference values (see Table II).
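The cut-off of the linear distribution follows directly from the bin definition. As a worked check, assuming the bound $step \cdot (k - 1)$ from Section V-A and the configuration of Figure 8 ($k = 128$, $step = 8$):

```latex
x^u_{k-2} = step \cdot (k - 1) = 8 \cdot 127 = 1016
```

Events of up to 1016 cycles are therefore resolved into individual bins of width 8, and anything longer falls into the last, accumulative bin.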

[Fig. 9, panels: histogram and ETP for each of HandleAcquisition, HandleHealthMonitoring, HandleHitTrigger, HandleTelecommand, TC_InterruptService and TM_InterruptService; histogram axes labelled with bin upper bounds in cycles.]

Fig. 9. Histograms and ETPs of the four tasks and two ISRs of the debie1 benchmark. Configuration: centered bin distribution, k = 128 bins, k_l = 1, f_comp = 1.

D. Analysis Based on Waypoint Edge Events

Another point of interest for us was the comparison of ETPs derived from function measurements with the ETPs derived from edge measurements (see Figure 10). Here, we had a closer look at TM_InterruptService. We took the ETPs of its individual edges and combined them with three different convolutions: (a) the Gaussian convolution, (b) the infimal convolution and (c) the supremal convolution. For (a) and (c), we used maximisation as the choice operation. For (b), we used minimisation. In our example, the infimal and supremal convolutions were of lesser use, as they heavily under- and overestimated the probabilities of overrunning a given deadline. The Gaussian convolution proved to be too simplistic: it contains path combinations that could never be observed in the measurements. Consequently, it predicts that the execution time of this ISR is below 548 cycles in % of all runs, whereas the measured ETP predicts that the execution time is below 24 cycles in % of all runs.

E. Resource Usage

The Histogram Module has been implemented on a Xilinx Virtex-7 XC7V585T FPGA. As shown in Table III, the characteristics of the events depend on whether they are edge or function events. The former have a shorter average cycle count, which results in a higher rate of new events to be processed. The overall number of unique event IDs is rather low in both cases, which makes it possible to store the results solely in BRAM. In doing so, we miss the original objective of a clock period shorter than 5 ns by.2 ns (equates to MHz). The resulting hardware consumption can be seen in Table IV. The delay can be reduced further by using the memory buffer implementation with external memory. In this case, we achieve an overall clock period of 4.89 ns (equates to 204.5 MHz). The limiting factor is the memory buffer and its FIFO size, which should not exceed 32 entries. If more entries are needed, the FIFO needs to be multiplexed. We interpolated the resulting hardware requirements for a FIFO size of 128 (see Table V).
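The remark in Section D that a convolution of edge ETPs contains path combinations never seen in the measurements can be made concrete with a small example. The sketch below uses the plain discrete convolution of two distributions under an independence assumption; whether this coincides exactly with the Gaussian convolution used in the paper is an assumption, and the infimal and supremal variants as well as the choice operation are not modelled. The names `convolve` and `exceedance` and the representation of an ETP as a dictionary from cycles to probability are hypothetical.

```python
from collections import defaultdict


def convolve(etp_a, etp_b):
    """Combine the ETPs of two consecutive edges, assuming independence.

    Each ETP is a dict mapping an execution time in cycles to its
    probability. Every pair of times is combined, whether or not that
    pair can actually occur on a real path.
    """
    result = defaultdict(float)
    for cycles_a, prob_a in etp_a.items():
        for cycles_b, prob_b in etp_b.items():
            result[cycles_a + cycles_b] += prob_a * prob_b
    return dict(result)


def exceedance(etp, deadline):
    """Probability that the execution time exceeds the deadline."""
    return sum(prob for cycles, prob in etp.items() if cycles > deadline)


# Two edges that each take 10 or 20 cycles with equal probability.
edge1 = {10: 0.5, 20: 0.5}
edge2 = {10: 0.5, 20: 0.5}
combined = convolve(edge1, edge2)
print(combined)                  # {20: 0.25, 30: 0.5, 40: 0.25}
print(exceedance(combined, 30))  # 0.25
```

If the two edges were perfectly correlated in the real program (both fast or both slow), a total of 30 cycles would never be measured, yet the convolution assigns it a probability of 0.5. A function-level ETP measured directly would not contain this combination.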

[Fig. 10, panels: Gaussian Convolution, Infimal Convolution, Supremal Convolution.]

Fig. 10. ETPs of TM_InterruptService, computed from the edge histograms with (a) Gaussian convolution, (b) infimal convolution or (c) supremal convolution.

TABLE III. Processed events and consequential requirements for FIFO sizes when using external memory. Columns: Unique IDs, Avg. Cycles, FIFO Avg. Size, FIFO Max. Size. Rows: Function Events, Edge Events.

TABLE IV. Resource usage of the implementation using BRAM memory. Columns: LUTs, Regs, BRAM, Delay (ns). Rows: Bin Calculation, BRAM Storage, Total.

TABLE V. Resource usage of the implementation using external memory, with a memory buffer FIFO size of 128. Columns: LUTs, Regs, BRAM, Delay (ns). Rows: Bin Calculation, Memory Buffer, Total.

IX. CONCLUSION

In this contribution, we have shown that, beyond conventional WCET analysis, advanced tools and hardware architectures are capable of capturing ETPs of modern SoCs that cannot be analysed by other means. The challenge was to make this hardware-based analysis fast enough to run in parallel even with fast SoCs. This feature enables the developer to go through rather long and complex test cycles with the software. The gathered ETPs help us to identify those software parts with a high variance. They also enable us to make statistical predictions of upper bounds of the execution time with certain confidence intervals. One important point to consider is the choice of the bin distribution. The linear bin distribution works well when an upper bound of the execution time is known (or can easily be estimated). The quality of the results obtained with the centered bin distribution strongly depends on the chosen reference value. Finding good heuristics for choosing the right bin distribution remains future work. In the future, we want to provide the user with a more abstract presentation of the results. In particular, this could mean identifying certain types of statistical distributions and computing the relevant parameters of such distributions.
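As a rough illustration of how a collected histogram could support a statement such as "the execution time stays below a bound in a given fraction of runs", the following sketch reads an empirical bound off the bin counts. It is an assumption-laden example rather than the authors' procedure: the function name, the argument layout and the convention that the final count is an overflow bin without an upper bound are all hypothetical, and no confidence interval is computed.

```python
def bound_for_coverage(upper_bounds, counts, coverage):
    """Smallest bin upper bound that covers the requested fraction of runs.

    upper_bounds[i] is the inclusive upper bound of bin i. counts has one
    extra trailing entry for the overflow bin. Returns None when the
    requested fraction can only be reached inside the overflow bin, where
    the histogram carries no information about the actual times.
    """
    total = sum(counts)
    if total == 0:
        return None
    seen = 0
    for bound, count in zip(upper_bounds, counts):
        seen += count
        if seen >= coverage * total:
            return bound
    return None


upper_bounds = [110, 120, 130]
counts = [70, 20, 9, 1]  # last entry: overflow bin
print(bound_for_coverage(upper_bounds, counts, 0.5))    # 110
print(bound_for_coverage(upper_bounds, counts, 0.95))   # 130
print(bound_for_coverage(upper_bounds, counts, 0.999))  # None
```

The `None` result mirrors the limitation noted for the linear bin distribution: once events land in the overflow bin, no upper bound can be read off for them.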
ACKNOWLEDGMENT

The authors would like to thank Alexander Lange and Alexander Weiss of Accemic for providing the waypoint edge event stream.
