COdesign and power Management in PLatformbased design space EXploration. Final report on run-time management


FP7-ICT-2009-4 (247999) COMPLEX
COdesign and power Management in PLatformbased design space EXploration

Project Duration: 2009-12-01 to 2013-03-31
Type: IP
WP no.: WP3
Deliverable no.: D3.5.3
Lead participant: IMEC
Prepared by: Chantal Ykman-Couvreur (IMEC), Kai Hylla, Sven Rosinger (OFFIS), Gianluca Palermo (POLIMI), Alberto Rosti, Sara Bocchio (ST-I)
Issued by: IMEC
Document Number/Rev.: COMPLEX/IMEC/R/D3.5.3/1.0
Classification: COMPLEX

Project co-funded by the European Commission within the Seventh Framework Programme.

Copyright 2012 OFFIS e.v., STMicroelectronics srl., STMicroelectronics Beijing R&D Inc, Thales Communications SA, GMV Aerospace and Defence SA, SNPS Belgium NV, EDALab srl, Magillem Design Services SAS, Politecnico di Milano, Universidad de Cantabria, Politecnico di Torino, Interuniversitair Micro-Electronica Centrum vzw, European Electronic Chips & Systems design Initiative.

This document may be copied freely for use in the public domain. Sections of it may be copied provided that acknowledgement is given of this original work. No responsibility is assumed by COMPLEX or its members for any application or design, nor for any infringements of patents or rights of others which may result from the use of this document.

History of Changes

ED.: IMEC. Reason for changes: First release of public report.

Table of Contents

Executive summary
Abbreviations and glossary
Introduction
RRM framework
Application terminology and assumptions
Run-time decisions
GRM implementation
Interface between GRM and application
Interface between GRM and user
Interface between GRM and platform
GRM software task
Format of GRM input files
High-level platform specification
IP core type specification
Application specification
Interaction flow between GRM and CM
Communication with GRM and CM
Dedicated switching points
Initialization of the application
Actual execution of the application
Power management of individual IP cores
HW accelerators and black-box IP cores
Model of computation
Register Interface
API documentation
API implementation
ReISC core
Virtual Platform Power Monitoring APIs
Application Level Power Driver APIs
Power Manager Registers
RCCU Registers
ADC Registers
RRM for audio-driven video surveillance domain
Overview
Experimental Results
Binary size of GRM implementation
Performance overhead and energy gain
RRM for ultra-low power platforms
Overview of the Methodology
Tool-Flow
Experimental Results
References

1 Executive summary

This deliverable is the third report of Task 3.5, dealing with Run-time Resource Management (RRM). This task is coordinated by IMEC and also involves POLIMI and OFFIS. It started at month M7 and ends at month M30.

The goals of Task 3.5 are to develop a lightweight architecture for RRM in tightly constrained systems and sample run-time resource managers for both COMPLEX use cases 1 and 2. In addition to the RRM architecture, Task 3.5 also develops services and optimization heuristics to be supported by the RRM for alleviating the burden of the application programmer.

Figure 1: COMPLEX design flow

This RRM corresponds with the pre-optimized power controller (m) in the COMPLEX design flow illustrated in Figure 1. For more details, see the COMPLEX Description of Work [1].

The goals of the first deliverable D3.5.1 [3] were to provide a preliminary vision of a generic and structured architecture for the RRM and to introduce both sample RRMs for COMPLEX use cases 1 and 2 respectively.

The goals of the second deliverable D3.5.2 [2] were to provide an updated vision of this RRM architecture and to present the status of the RRM implementation for both COMPLEX use cases.

The goals of the public version of this third deliverable D3.5.3 are to describe the entire work performed in Task T3.5, including the description of experiments performed to analyze the efficiency of the RRM implementation for both COMPLEX use cases. To that end, the writing policy is as follows:
- Sections already provided in Deliverables D3.5.1 and D3.5.2 are explicitly mentioned below.
- Also for the sake of clarity, to give an overall picture of the RRM developed in COMPLEX, and to make this deliverable standalone, all RRM features resulting from work performed in other tasks are unified in this deliverable. Links to these tasks are also explicitly mentioned in the corresponding sections of this deliverable.
- Two versions of this deliverable are available: this one, being confidential, and the other one, being public, where confidential parts are removed. These confidential parts are also explicitly mentioned in the corresponding sections of this deliverable.

The content of this deliverable is organized as follows:
- Section 2 (from D3.5.1 and D3.5.2) defines the abbreviations and some relevant terms used in the deliverable.
- Section 3 (from D3.5.1) overviews the challenges to be fulfilled by the RRM in future embedded computing.
- Section 4 (updated from D3.5.1 and D3.5.2) updates the RRM framework developed in Task T3.5. This RRM follows a distributed and hierarchical approach: it consists of both Central Manager (CM) and Global Resource Manager (GRM) at the platform level, and of Local Resource Managers (LRMs) at the Intellectual Property (IP) core level.
- Section 5 (from D3.5.1 and D3.5.2) describes the GRM implementation. It also describes the GRM interfaces with the application, the user, and the platform.
- Section 6 (from D3.5.2) describes the format of the input files needed for both GRM databases: the high-level platform specification and the available application configurations.
- Section 7 (from D3.5.2) describes the interaction flow between the GRM and the CM, which allows the platform resources and the application functionalities to be managed in parallel.
- Section 8 (updated from D3.5.2) characterizes the Application Programming Interfaces (APIs) for power management of individual IP cores, as required by the GRM.
- Section 9 (updated from D3.5.2) presents the demonstrator used to instantiate the RRM in the COMPLEX use case 2. This demonstrator is taken from the audio-driven video surveillance domain. This section also describes the experiments performed to analyze the efficiency of the RRM for this demonstrator.
- Section 10 (new) describes the RRM framework, with some initial results, adopted for the power management of an ultra-low power platform as used in the COMPLEX use case 1.

2 Abbreviations and glossary

The table below lists the abbreviations used in the deliverable, with their definitions.

ADC     Analog Digital Converter
ADVT    Advanced Timer
API     Application Programming Interface
CM      Central Manager
DMA     Direct Memory Access
DPM     Dynamic Power Management
DSE     Design Space Exploration
DSU     Debug Support Unit
DTD     Document Type Definition
DVFS    Dynamic Voltage and Frequency Scaling
GPIO    General Purpose Input Output
GPT     General Purpose Timer
GRM     Global Resource Manager
I2C     Inter Integrated Circuit
IP      Intellectual Property
ISS     Instruction Set Simulator
ITIM    Internal Timer
HW      Hardware
LLVM    Low-Level Virtual Machine
LRM     Local Resource Manager
QoE     Quality of user Experience
QoS     Quality of Service
RCCU    Reset Clock Control Unit
ReISC   Reduced energy Instruction Set Computer
RRM     Run-time Resource Management
RTC     Real-Time Clock
SPI     Serial Peripheral Interface
SW      Software
UART    Universal Asynchronous Receiver Transmitter
USART   Universal Synchronous Asynchronous Receiver Transmitter
USB     Universal Serial Bus
WD      Watch Dog
WWD     Window Watch Dog

Some relevant terms used in the deliverable are briefly described in the following. More information can be found in Section 3.

The application functionality is specified at different granularity levels: (1) the application is organized into application modes, each one specifying a different subset of functionalities; (2) each application mode consists of communicating jobs, each one mapped entirely on one IP core; (3) each job can consist of communicating tasks, all of them running on the same IP core.

An application configuration specifies an application mapping on the platform. It is mainly characterized by:
- its Quality of Service (QoS) required by the user,
- its application mode,
- the job implementation,
- the assignment of its jobs on the IP cores of the platform,
- its average execution time and energy consumption,
- its user benefit/value.

3 Introduction

To address the challenges introduced by future embedded computing, a generic and structured architecture for RRM of embedded multi-core platforms is refined in Task T3.5. This RRM needs to provide the following features.

First, the RRM has to support a variety of applications: mobile communications, networking, automotive and avionic applications, multimedia in the automobile and Internet interfaced with many embedded control systems. These applications may run concurrently, and may start and stop at any time. Each application may have multiple configurations, with different constraints imposed by the external world or the user (deadlines and quality requirements, such as audio and video quality, output accuracy), different usages of various types of platform resources (processing elements, memories and communication bandwidth) and different costs (performance, power consumption).

Second, this RRM should support a holistic view of platform resources. This is needed for global resource allocation decisions optimizing a utility function (also called Quality of user Experience (QoE)), given the available platform resources. This QoE allows a trade-off, negotiated with the user, between diverse QoS requirements and costs. E.g., in the COMPLEX use case 2, this QoE should enable careful management of the energy stored in the battery.

Third, this RRM should transparently optimize the platform resource usage and the application mapping on the platform. This is needed to facilitate the application development and manage the QoS requirements without rewriting the application.

Next, this RRM should dynamically adapt to changing context. This is needed to achieve a high efficiency in a changing environment. QoS requirements and platform resources must be scaled dynamically (e.g., by adjusting the clock frequencies and voltages, or by switching off some functions) in order to control the energy/power consumption and the heat dissipation of the platform.

Finally, this RRM should allow different heuristics (e.g., for platform resource allocation and task scheduling), since a single heuristic cannot be expected to fit all application domains and optimization goals.

Also, software development productivity is of paramount importance. To address this challenge, and to facilitate the RRM implementation, a generic and structured architecture for the RRM is required. It should be valid for any design flow, for any target platform, and for any application domain. The development of an RRM architecture is the first goal of Task T3.5.

Nevertheless, since the RRM is intended for embedded platforms, only a lightweight implementation is acceptable. To address this challenge:
- This RRM should interface with design-time exploration to alleviate its run-time decision making. This is the goal of the RRM interface with the tool developed in Task T3.4.
- The RRM implementation should be instantiated from the RRM architecture, based on the target platform and the application domain. The development of sample run-time resource managers for both COMPLEX use cases 1 and 2 is the second goal of Task 3.5, in collaboration with Task 4.1.


4 RRM framework

The outcome of Task T3.5 is an RRM framework, intended for embedded heterogeneous MP-SoC platforms. It is based on a two-level optimization flow outlined in Figure 2. At design time, a set of Pareto-optimal application configurations is derived by an automated design space exploration, based on the tool MOST, and developed in Task 3.4. The RRM then dynamically switches between these predefined configurations in order to continuously maximize the QoS of the application, while meeting the platform constraints.

Figure 2: Two-level optimization flow

The target platform for our RRM framework is an embedded heterogeneous MP-SoC platform. This platform consists of multiple IP cores, and these IP cores can be of different types (e.g., HW accelerator, FPGA, multi-CPUs). No task migration between different IP cores is considered in COMPLEX. Hence, the management of the communication infrastructure is not considered in COMPLEX.

Our RRM architecture follows a distributed and hierarchical approach, illustrated in Figure 3. On the one hand, both CM and GRM are loaded on the host processor of the platform. They are software tasks, with the same priority, specified in C, and running in parallel. They are used to adapt both platform and applications at run time and to find global and optimal trade-offs in application mapping based on a given optimization goal. On the other hand, each IP core can execute its own resource management without any restriction, through an LRM. Such an LRM encapsulates the local policies and mechanisms used to initiate, monitor and control computation on its IP core.

Figure 3: Distributed and hierarchical RRM approach


Figure 4: Communication with GRM and CM

As illustrated in Figure 4, the GRM manages the platform resources, whereas the CM, in parallel with the GRM, manages the application functionalities:
- The platform consists of multiple IP cores, and these IP cores can be of different types, but from the GRM viewpoint, these different types are managed similarly through APIs. The GRM selects the application configurations and reconfigures the IP cores (i.e., either switches them on/off or performs DVFS) accordingly. This allows the GRM to be a generic entity, unaware of the application functionalities, and hence reusable for other embedded platforms.
- The CM informs the GRM about actions to be done, creates the threads of the application on the slave IP cores, and performs some pre-processing before thread execution in parallel with the GRM.

A detailed description of the communication with the GRM and the CM is given in Section 7.1.

4.1 Application terminology and assumptions

The target applications have to conform to the following terminology and assumptions, to enable the run-time management strategy based on the GRM/LRMs.
- Ideally, for any application, all functionalities should be accessible at any time. However, based on the user requirements, the available platform resources, the limited energy/power budget of the platform, and the target platform autonomy, it may not be possible to integrate all these functionalities on the platform at the same time. Hence the application developer has to organize the application into application modes, each one specifying a different subset of functionalities.
- The application, within a selected mode, consists of jobs communicating with each other, where:
  o In view of system robustness, and to keep dynamic power management lightweight, each job is mapped entirely on one IP core. Nevertheless, a job can consist of multiple tasks communicating with each other, but all of them have to run on the same IP core.
  o Whereas the functional specification of a job is fixed, there may be several specific algorithms or implementations for a given job. Also a job implementation can take several forms (fixed logic, configurable logic, SW) and offer different characteristics.
- To conform to the hierarchical approach of the RRM, jobs and communication between them are managed at the platform level by the GRM, whereas tasks and communication between them are managed at the IP core level by the LRM.
- An application configuration specifies an application mapping on the platform. It is mainly characterized by: its QoS required by the user, its application mode, the job implementation, the assignment of its jobs on the IP cores of the platform, its average execution time and energy consumption, and its user benefit/value. The available application configurations are provided at design time, structured and stored in a GRM database to enable fast exploration during run-time decisions. The user benefit/value is the value returned by the utility function (see below) applied to the application configuration. It is computed at the initialization of the application (see Section 5.2).
- The QoS requirements and the optimization goal are defined through the Quality of user Experience (QoE) manager. This goal is translated into an abstract mathematical function, called the utility function. Examples of utility functions are: performance of the application, energy/power consumption of the platform, battery life, revenue if the user has to pay for the application, QoS of the application, or a weighted combination of them.

In the COMPLEX use cases:
- The considered IP cores are either processors or custom HW blocks.
- Only one application is considered, but the application consists of several application modes.
- There is no task migration between different IP cores.
- Each job consists of only one task.

4.2 Run-time decisions

Figure 5: Run-time decisions

In our RRM framework, the run-time decisions are illustrated in Figure 5. The selection of application configurations and the task mapping and scheduling are coarse-grain run-time decisions: they have an impact on the usage of the platform resources, and hence require dynamic reconfiguration. In contrast, fine-grain DVFS does not require reconfiguration, is cheaper, and can be performed more frequently. The optimal selection of application configurations is the focus of the RRM implementation in the COMPLEX use case 2, whereas the fine-grain DVFS is the focus of the RRM implementation in the COMPLEX use case 1.


5 GRM implementation

The GRM should be a middleware providing a bridge between the application, the user, and the platform. As mentioned in Section 4, the GRM focuses on the management of the platform resources, leaving the CM in charge of the management of the application functionalities. Also in COMPLEX, no task migration between different IP cores is considered, and each job consists of only one task. Taking these considerations into account, the architecture of the GRM implementation developed in COMPLEX is as illustrated in Figure 6.

Figure 6: Architecture of the GRM implementation

To provide a bridge between the application, the user, and the platform, generic services are supported by the GRM. These services are classified into managers to structure the interface between the GRM and the application, the user, and the platform, respectively.

5.1 Interface between GRM and application

The interface with the application is provided by the application manager. This manager provides the following main services: GRM_ConfigureApplication(), GRM_SelectApplicationConfiguration(), and GRM_ReconfigurePlatform().

GRM_ConfigureApplication() loads the available application configurations derived by the design-time exploration. The input required for this service is an XML file characterizing these application configurations (see Section 6.3).

GRM_SelectApplicationConfiguration() selects an application configuration with the maximum user value, while meeting the platform constraints and taking the available platform resources into account. This function is frequently executed at run time, so a lightweight implementation is mandatory. It relies on a QoS-aware optimization heuristic, such as the one implemented for the COMPLEX use case 2, which provides the user with the maximum QoS according to the energy budget and the battery duration of the platform. The pseudo-code of this heuristic is as follows:

    GRM_SelectApplicationConfiguration() {
        elapsed_time = host_clock();
        remaining_time = battery_duration(platform) - elapsed_time;
        GRM_EstimateElapsedEnergy();
        remaining_energy = energy_budget(platform) - elapsed_energy(platform);
        remaining_frames = remaining_time / audio_frame_proc;
        max_energy_per_frame = remaining_energy / remaining_frames;
        /* See Figure 8 */
        for each appl_config in sorted_pareto_set {
            if (energy_per_frame(appl_config) >= max_energy_per_frame)
                break;
        }
    }

where it is assumed that:
- The application starts at time 0.
- audio_frame_proc is a QoS required by the user. It corresponds to the maximum allowed execution time to process one audio frame.
- energy_budget(platform) is a constraint initially provided in the high-level platform specification input file (see Section 6.1). Nevertheless, whenever the battery is recharged, this energy budget has to be updated accordingly.

Due to the translation of the application configuration space into a two-dimension design space [user value, cost] and the preprocessing performed by GRM_SortApplicationConfigurations() (see Section 5.2), the complexity of GRM_SelectApplicationConfiguration() is only O(n), where n is the size of the sorted Pareto set.

GRM_ReconfigurePlatform() requests each IP core of the platform to switch to the power mode specified in the newly selected application configuration. Such a request is performed by the service GRM_SwitchToPowerMode() (see Section 5.3).

5.2 Interface between GRM and user

The interface with the user (or external entity accessing the application specification) is provided by the QoE manager. The QoE is a subjective measure of the application value from the user perspective. It is influenced by the user's terminal device (e.g., low- or high-definition TV), environment (e.g., in the car or at home), and expectations, and by the nature of the content and its importance (e.g., a simple yes/no message or an orchestral concert). Changes in user preferences may involve (re)negotiation between user and QoE manager. Indeed, the platform resources may not be sufficient to provide the desired QoS to the application. The user needs a simple way to communicate with the QoE manager in order to control and customize the QoS of the application.
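The selection heuristic of Section 5.1 can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not the COMPLEX implementation: the type pareto_point_t, the function select_config(), and the choice to pass clock and energy readings as plain arguments are hypothetical. The sketch assumes the set is already sorted in ascending order of user value (and therefore of cost, since it is a Pareto front) and returns the last configuration visited before the loop breaks.

```c
#include <stddef.h>

/* One entry of the sorted Pareto set (hypothetical layout). */
typedef struct {
    double user_value;        /* value returned by the utility function */
    double energy_per_frame;  /* cost, in joules per audio frame */
} pareto_point_t;

/*
 * Returns the index of the configuration with the highest user value whose
 * energy per frame stays below the per-frame budget, or -1 if none fits.
 * Times are in seconds and energies in joules (assumed units).
 */
int select_config(const pareto_point_t *set, size_t n,
                  double battery_duration, double elapsed_time,
                  double energy_budget, double elapsed_energy,
                  double audio_frame_proc)
{
    double remaining_time = battery_duration - elapsed_time;
    double remaining_energy = energy_budget - elapsed_energy;

    /* Nothing can be scheduled once time or energy has run out. */
    if (remaining_time <= 0.0 || remaining_energy <= 0.0 ||
        audio_frame_proc <= 0.0)
        return -1;

    double remaining_frames = remaining_time / audio_frame_proc;
    double max_energy_per_frame = remaining_energy / remaining_frames;

    int selected = -1;
    for (size_t i = 0; i < n; i++) {
        if (set[i].energy_per_frame >= max_energy_per_frame)
            break;            /* every later point costs even more */
        selected = (int)i;    /* best affordable point so far */
    }
    return selected;
}
```

A single linear scan suffices because the preprocessing has already ordered the points, which is what keeps the run-time cost at O(n).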

The QoE manager provides the following main services: GRM_DeriveUserValues() and GRM_SortApplicationConfigurations().

The negotiation between user and QoE manager involves the selection of the utility function and of the optimization heuristic:
- The utility function models the user benefit for the application in an abstract and mathematical way. It allows a trade-off between diverse QoS requirements and costs. Examples of utility functions are: performance of the application, energy/power consumption of the platform, battery life, revenue if the user has to pay for the application, fair sharing of platform resources, and weighted combination of them. Once selected, the utility function is applied to each application configuration, to derive its user value. This is performed by the service GRM_DeriveUserValues(). This utility function is then optimized by the GRM in the service GRM_SelectApplicationConfiguration() of the application manager.
- The selection of the optimization heuristic allows fitting the current application domain and optimization goal.

In the COMPLEX use case 2, the optimization goal is to maximize the QoS of the application, whereas the platform constraints are the energy budget and the battery duration of the platform. To that end, the utility function models the QoS of an application configuration as a weighted sum of its audio and image frequency and resolution and of the amount of application functionalities provided by its application mode. Its pseudo-code is currently as follows:

    user_value(appl_config) = appl_mode_id * appl_mode_id + audio_frequency + image_resolution + image_frequency

GRM_SortApplicationConfigurations() is developed as follows:
  o The input for this service is the set of Pareto-optimal application configurations in the multi-dimension space, as the one illustrated in Figure 7 for the COMPLEX use case 2. This set is derived by a design-time exploration, such as MOST in Task T3.4.
  o This service keeps the Pareto-optimal application configurations in the two-dimension design space [user value, cost] and sorts them in ascending order according to the user value. In the COMPLEX use case 2, the considered cost is the energy consumption per audio frame, and the considered two-dimension design space is illustrated in Figure 8.
  o This service is a preprocessing step to speed up the run-time execution of GRM_SelectApplicationConfiguration(). Moreover, the sorting is performed through the efficient standard C function qsort().
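The two QoE manager services described above can be sketched in C as follows. This is a hedged illustration, not the project's code: the struct layout and the names derive_user_values() and sort_pareto_2d() are assumptions. The utility is the one given in the pseudo-code above; the filtering step keeps a configuration only if no other configuration offers at least the same user value at a lower or equal cost.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one application configuration. */
typedef struct {
    int    appl_mode_id;
    double audio_frequency;
    double image_resolution;
    double image_frequency;
    double energy_per_frame;  /* cost dimension */
    double user_value;        /* filled in by derive_user_values() */
} appl_config_t;

/* Apply the utility function of use case 2 to every configuration. */
void derive_user_values(appl_config_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i].user_value = (double)c[i].appl_mode_id * c[i].appl_mode_id
                        + c[i].audio_frequency + c[i].image_resolution
                        + c[i].image_frequency;
}

/* Ascending user value; for equal values, the more expensive one first. */
static int cmp_config(const void *a, const void *b)
{
    const appl_config_t *x = a;
    const appl_config_t *y = b;
    if (x->user_value != y->user_value)
        return x->user_value < y->user_value ? -1 : 1;
    if (x->energy_per_frame != y->energy_per_frame)
        return x->energy_per_frame > y->energy_per_frame ? -1 : 1;
    return 0;
}

/*
 * Sort with qsort(), then keep only the points that are Pareto-optimal in
 * the [user value, cost] plane. Returns the number of points kept; they
 * end up at the front of the array, in ascending order of user value.
 */
size_t sort_pareto_2d(appl_config_t *c, size_t n)
{
    qsort(c, n, sizeof *c, cmp_config);

    /* Scan from the highest value down: a point survives only if it is
       strictly cheaper than every point of higher or equal value. */
    size_t kept = n;              /* survivors are packed at the back */
    double best_cost = INFINITY;
    for (size_t i = n; i-- > 0; ) {
        if (c[i].energy_per_frame < best_cost) {
            best_cost = c[i].energy_per_frame;
            c[--kept] = c[i];
        }
    }
    size_t m = n - kept;
    memmove(c, c + kept, m * sizeof *c);
    return m;
}
```

After this preprocessing, both user value and cost increase along the array, which is the property the selection loop of Section 5.1 relies on when it stops at the first configuration that exceeds the per-frame energy budget.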

Figure 7: Multi-dimension space of application configurations

Figure 8: Two-dimension design space of application configurations

5.3 Interface between GRM and platform

The interface with the platform is provided by the platform manager and the IP core manager. On the one hand, the platform manager provides GRM services related to platform configuration and resource monitoring. On the other hand, the IP core manager conforms to the practices of each IP core separately and mainly provides GRM services to set the power mode of an IP core.

The platform manager provides the following main services: GRM_ConfigurePlatform() and GRM_EstimateElapsedEnergy(). The IP core manager provides the following main services: GRM_ConfigureIPcoreType() and GRM_SwitchToPowerMode().

GRM_ConfigurePlatform() loads the high-level platform specification, the platform constraints (e.g., battery duration, energy budget), and the power mode table of each IP core. A power mode of an IP core is characterized by: its supply voltage and clock frequency, its average dynamic and leakage power consumption, and its available power mode transitions. A power mode transition also specifies its switching time and power consumption. Inputs required for this service are a textual file for platform specification and an XML file for each power mode table (see Sections 6.1 and 6.2).

GRM_EstimateElapsedEnergy() estimates the energy consumed by the platform from the application start up to this function call. Currently no sensor is used; instead, a very simple energy model is applied:

    elapsed_energy(platform) = Σ_(IP core) elapsed_energy(ip_core)

The pseudo-code of this function is as follows:

    GRM_EstimateElapsedEnergy() {
        en = 0;
        cur_time = host_clock();
        for each IP core {
            en += elapsed_energy(ip_core);
            pm = current_power_mode(ip_core);
            en += (avg_leakage(pm) + avg_dyn_power(pm)) * (cur_time - switching_time(ip_core));
        }
    }

where en denotes the estimated elapsed energy of the platform, switching_time(ip_core) denotes host_clock() at the last switching point (see Section 7.2) of the IP core, and elapsed_energy(ip_core) denotes the estimated energy consumption of the IP core from the system start up to the last switching point. The latter is updated during the run of the application whenever a switch to a new power mode is performed on the IP core. This update is computed as follows:

    cur_pm = current_power_mode(ip_core);
    new_pm = new_power_mode(ip_core);
    tm = power_mode_transition(cur_pm, new_pm);
    cur_time = host_clock();
    elapsed_energy(ip_core) += (avg_leakage(cur_pm) + avg_dyn_power(cur_pm)) * (cur_time - switching_time(ip_core));
    elapsed_energy(ip_core) += switching_power(tm) * switching_time(tm);

The pseudo-code of GRM_SwitchToPowerMode() is as follows:

    GRM_SwitchToPowerMode(IP core) {
        cur_pm = current_power_mode(ip_core);
        new_pm = new_power_mode(ip_core);
        tm = power_mode_transition(cur_pm, new_pm);
        PerformDVFS(IP core, new_pm);
        Update elapsed_energy(ip_core);
    }

where PerformDVFS() is implemented in conformity with the IP core practice and makes use of the corresponding API. If the IP core is an HW block or a black-box IP core, this API is implemented through the function lrm_request_mode() (see Section 8.1). If the IP core is a SW platform core, this API is coordinated with the platform provider. The current implementation status is described in Section 8.2 for the ReISC DSP core of the platform used in COMPLEX use cases 1 and 2.

5.4 GRM software task

As mentioned in Section 4, the GRM is a SW task running in parallel with the CM on the host processor. This SW task is implemented through GRM_Execute(), in conformity with the interaction flow between the GRM and CM described in Section 7. Pseudo-code of this function is as follows:

    GRM_Execute() {
        GRM_ConfigurePlatform("platform.dat");
        GRM_ConfigureApplication("application.xml");
        GRM_DeriveUserValues(UTILITY);
        GRM_SortApplicationConfigurations();
        while (1) {
            sem_wait(grm_action); /* GRM is woken up */
            /* Select an application configuration */
            if (grm_action == SIG_selection) {
                GRM_SelectApplicationConfiguration();
            }
            /* Reconfigure the platform */
            else if (grm_action == SIG_recrequest) {
                GRM_ReconfigurePlatform();
            }
        }
    }

At the initialization of the application, thus without run-time overhead, the GRM has to execute the following services only once: GRM_ConfigurePlatform(), GRM_ConfigureApplication(), GRM_DeriveUserValues(), and GRM_SortApplicationConfigurations(). During the run of the application, however, the GRM has to frequently execute the following services: GRM_SelectApplicationConfiguration() and GRM_ReconfigurePlatform(). So a lightweight implementation is mandatory. Experiments on the GRM overhead and feasibility are reported in Section 9.2.
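The energy bookkeeping of Section 5.3 can be illustrated with the following C sketch. It is an assumption-laden illustration rather than the actual GRM code: the structs, the function names core_switch_mode() and estimate_elapsed_energy(), the use of seconds, watts and joules, and the choice to pass the host clock reading as an argument are all hypothetical. The sketch also assumes that the stored switching time is reset to the current clock value at each switch, which the pseudo-code implies but does not show.

```c
#include <stddef.h>

typedef struct {
    double avg_dyn_power;   /* W */
    double avg_leakage;     /* W */
} power_mode_t;

typedef struct {
    double switching_time;  /* s, duration of the transition */
    double switching_power; /* W, drawn during the transition */
} pm_transition_t;

typedef struct {
    const power_mode_t *modes;  /* power mode table of this core */
    int    current_mode;        /* index into modes */
    double last_switch_time;    /* host clock at the last switching point */
    double elapsed_energy;      /* J consumed up to that switching point */
} ip_core_t;

/* Close the accounting interval of the current mode, add the cost of the
   transition, and start a new interval in new_mode. */
void core_switch_mode(ip_core_t *core, int new_mode,
                      const pm_transition_t *tm, double now)
{
    const power_mode_t *pm = &core->modes[core->current_mode];
    core->elapsed_energy += (pm->avg_leakage + pm->avg_dyn_power)
                          * (now - core->last_switch_time);
    core->elapsed_energy += tm->switching_power * tm->switching_time;
    core->current_mode = new_mode;
    core->last_switch_time = now;
}

/* Sum, over all cores, the energy booked at the last switching point plus
   the energy spent in the current mode since then. */
double estimate_elapsed_energy(const ip_core_t *cores, size_t n, double now)
{
    double en = 0.0;
    for (size_t i = 0; i < n; i++) {
        const power_mode_t *pm = &cores[i].modes[cores[i].current_mode];
        en += cores[i].elapsed_energy;
        en += (pm->avg_leakage + pm->avg_dyn_power)
            * (now - cores[i].last_switch_time);
    }
    return en;
}
```

Because each core only stores one running total and one timestamp, the estimate costs a single pass over the cores, in line with the lightweight implementation the GRM requires.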

6 Format of GRM input files

As mentioned in Section 5, the GRM needs three types of input file:

• A textual file describing the high-level platform specification.
• One XML file for each IP core type of the platform. Such an XML file characterizes the power modes and power mode transitions available on the IP core. Together with the textual file, these input files are needed to execute the service GRM_ConfigurePlatform().
• An XML file characterizing the available application configurations. It is needed to execute the service GRM_ConfigureApplication().

The updated format of these three types of input file is summarized in the following subsections. A first update is the introduction of the unit field: the GRM needs to combine measures consistently, and these measures are specified in different input files and derived from different independent external tools. A second update is the use of the same XML file format for each IP core type. A third update is the use of XML file formats with similar constructs for both the IP core type and the application specification.

6.1 High-level platform specification

The textual file for platform specification is as follows:

    # PLATFORM
    Platform platform_stm
    Number_of_IP_cores 6
    Number_of_IP_core_types 2
    Energy_budget joules
    Battery_duration 24 hours
    # IPCORES
    IP_core ipcore_0 REISC HOST
    IP_core ipcore_1 REISC SLAVE
    IP_core ipcore_2 REISC SLAVE
    IP_core ipcore_3 REISC SLAVE
    IP_core ipcore_4 REISC SLAVE
    IP_core ipcore_5 HW SLAVE
    # IPCORE TYPES
    IP_core_type REISC REISC.xml
    IP_core_type HW HW.xml

6.2 IP core type specification

A unified XML format is used for any IP core type. The XML file characterizes the power modes and power mode transitions available on any IP core type. It is illustrated below, where two power modes and the corresponding power mode transitions are specified:

    <Power_mode_table>
      <Power_mode>
        <parameters>
          <parameter name="ID" value="0" unit="no" />
          <parameter name="clock_frequency" value="0" unit="MHz" />
          <parameter name="supply_voltage" value="0.0" unit="volts" />
          <parameter name="avg_dyn_power" value="0.0" unit="milli_watts" />
          <parameter name="avg_leakage" value="5.0" unit="milli_watts" />
        </parameters>
        <Power_mode_transitions>
          <pm_trans>
            <parameter name="pm_id" value="1" unit="no" />
            <parameter name="switching_time" value="2.0" unit="milli_sec" />
            <parameter name="switching_power" value="" unit="milli_watts" />
          </pm_trans>
        </Power_mode_transitions>
      </Power_mode>
      <Power_mode>
        <parameters>
          <parameter name="ID" value="1" unit="no" />
          <parameter name="clock_frequency" value="300" unit="MHz" />
          <parameter name="supply_voltage" value="1.2" unit="volts" />
          <parameter name="avg_dyn_power" value="" unit="milli_watts" />
          <parameter name="avg_leakage" value="" unit="milli_watts" />
        </parameters>
        <Power_mode_transitions>
          <pm_trans>
            <parameter name="pm_id" value="0" unit="no" />
            <parameter name="switching_time" value="0.1" unit="milli_sec" />
            <parameter name="switching_power" value="20.0" unit="milli_watts" />
          </pm_trans>
        </Power_mode_transitions>
      </Power_mode>
    </Power_mode_table>

6.3 Application specification

The format of the XML file characterizing the available application configurations is illustrated below for one application configuration in the COMPLEX use case 2:

    <point>
      <parameters>
        <parameter name="appl_mode" value="0" unit="no" />
        <parameter name="audio_frequency" value="128" unit="kbits_per_sec" />
        <parameter name="image_resolution" value="101376" unit="pixels_per_image" />
        <parameter name="image_rate" value="10" unit="frames_per_sec" />
      </parameters>
      <scheduling>
        <sched name="task_id" value="0" name="ipcore_id" value="1" name="power_mode_id" value="1"/>
        <sched name="task_id" value="1" name="ipcore_id" value="2" name="power_mode_id" value="1"/>
      </scheduling>
      <system_metrics>
        <system_metric name="execution_time" value="7.56" unit="milli_sec" />
        <system_metric name="energy_consumption" value="11.89" unit="milli_joule" />
      </system_metrics>
    </point>
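The unit field exists so that the GRM can combine measures coming from different files. A minimal C sketch of that idea is given below. The function names grm_to_si() and grm_transition_energy() and the conversion table are assumptions made for illustration; only unit strings that appear in the examples of this section are handled, and the sketch does not check that a unit has the expected physical dimension.

```c
#include <string.h>

/* Convert a (value, unit) pair read from an input file to SI base units.
   Returns 0 on success and -1 if the unit string is not recognised.
   Hypothetical helper: the table only covers units used in Section 6. */
static int grm_to_si(double value, const char *unit, double *out)
{
    static const struct { const char *unit; double factor; } table[] = {
        { "milli_watts", 1e-3   },  /* -> watts   */
        { "milli_sec",   1e-3   },  /* -> seconds */
        { "milli_joule", 1e-3   },  /* -> joules  */
        { "joules",      1.0    },
        { "volts",       1.0    },
        { "MHz",         1e6    },  /* -> hertz   */
        { "hours",       3600.0 },  /* -> seconds */
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i) {
        if (strcmp(unit, table[i].unit) == 0) {
            *out = value * table[i].factor;
            return 0;
        }
    }
    return -1;
}

/* Energy overhead in joules of one power mode transition, computed as
   switching_time * switching_power after both are converted to SI. */
static int grm_transition_energy(double duration, const char *duration_unit,
                                 double power, const char *power_unit,
                                 double *joules)
{
    double t_s, p_w;
    if (grm_to_si(duration, duration_unit, &t_s) != 0 ||
        grm_to_si(power, power_unit, &p_w) != 0)
        return -1;
    *joules = t_s * p_w;
    return 0;
}
```

With the transition from power mode 1 to power mode 0 in Section 6.2 (0.1 milli_sec at 20.0 milli_watts), grm_transition_energy() yields about 2 microjoules, which can then be compared directly with an energy_consumption metric given in milli_joule.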

7 Interaction flow between GRM and CM

This section describes the interaction flow between the GRM and the CM, which allows the platform resources and the application functionalities to be managed in parallel, as announced in Section 4. One issue is to optimize the division of work between the GRM and the CM so that both are as efficient and reactive as possible. A correct synchronization is required between the GRM and the CM to guarantee that the IP cores are completely reconfigured before the thread executions are launched.

7.1 Communication with GRM and CM

The following legend is used in the figures of Section 7:

Figure 9: Communications with GRM and CM

Communications with the GRM and the CM are performed either through signals or through shared variables. As illustrated in Figure 9, two shared variables are used:

• new_ac is shared between the GRM and the CM. It is written by the GRM and read by the CM. It stores the ID of the application configuration selected by GRM_SelectApplicationConfiguration(). Whenever the CM wants to update the application configuration, it copies new_ac into current_ac.
• current_ac is shared between the CM and the slave threads. It is written by the CM and read by the slave threads. It stores the ID of the application configuration currently executed on the platform. It is through current_ac that the CM communicates the application configuration to the slave threads.

Three types of signal communication are also used:

• Communication between the GRM and the CM, whose interrupt signals are defined as follows:
  o As soon as GRM_ReconfigurePlatform() is completed for the last selected application configuration, the GRM starts waiting for sig_selection.
  o Regularly, the CM wants to update the current application configuration. To that end, it sends sig_selection to the GRM to request the execution of GRM_SelectApplicationConfiguration().
  o ack_selection is sent by the GRM to the CM to indicate that the variable new_ac has been updated. The CM can read new_ac at any time to be aware of the newly selected application mode. As soon as the CM agrees to perform the reconfiguration, it copies new_ac into the variable current_ac. Before each new processing, the active slave threads read current_ac to be aware of the options to be executed.
  o As soon as the CM agrees to switch to the newly selected application configuration, it sends sig_request to the GRM to request the execution of GRM_ReconfigurePlatform().
  o As soon as sig_terminating is received, sig_reconfig(ipcore) is sent by the CM to the GRM to indicate that the IP core can be reconfigured.
  o IPC_ready(ipcore) is sent by the GRM to the CM to indicate that the IP core is ready to execute the thread.
• Communication between the GRM and the IP cores of the platform, whose interrupt signals are defined as follows:
  o sig_switch_pm(power_mode) is sent by the GRM to the IP core to request the switching to the given power mode.
  o ack_switch_pm is sent by the IP core to the GRM to indicate that the reconfiguration is completed with the new power mode.
• Communication between the CM and the slave threads, whose interrupt signals are defined as follows:
  o sig_wakeup is sent by the CM either to create the thread and start its execution, or to reactivate the thread after a new power mode switching.
  o sig_terminating is sent by the slave thread to the CM as soon as a switching point (see Section 7.2) is met. This signal indicates that the thread has completed its current processing, is in a stable state, and can deal with hardware adaptations. This signal also implies that the thread has read current_ac and is aware whether it has to continue its execution or to enter a sleep mode before executing a new configuration.

Detailed interaction flows, both at the initialization of the application and during its actual execution, are given in Sections 7.3 and 7.4. As illustrated in Figure 10 and Figure 12, a timeout must also be used for robustness in case of faulty IP cores or slave threads, in which case some special action must be taken by the CM. E.g., see wait(sig_terminating and ack_switch_pm for 50ms).
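The timeout-guarded wait mentioned above, wait(sig_terminating and ack_switch_pm for 50ms), can be sketched as a simple decision function in C. The names cm_wait_state, cm_wait_result and cm_wait_poll() are hypothetical, and the sketch deliberately leaves out how the signals are delivered and what special action is taken on a timeout, since neither is specified in this section.

```c
#include <stdbool.h>

/* Outcome of one evaluation of the guarded wait (hypothetical names). */
typedef enum { CM_WAIT_PENDING, CM_WAIT_DONE, CM_WAIT_TIMEOUT } cm_wait_result;

/* Signals received so far and time spent waiting. */
typedef struct {
    bool got_sig_terminating;  /* slave thread reached a switching point */
    bool got_ack_switch_pm;    /* IP core completed the power mode switch */
    unsigned elapsed_ms;       /* time elapsed since the wait started */
} cm_wait_state;

/* Evaluate the wait condition once. Completion is checked before the
   deadline, so a signal pair arriving exactly at the deadline is accepted. */
static cm_wait_result cm_wait_poll(const cm_wait_state *w, unsigned timeout_ms)
{
    if (w->got_sig_terminating && w->got_ack_switch_pm)
        return CM_WAIT_DONE;
    if (w->elapsed_ms >= timeout_ms)
        return CM_WAIT_TIMEOUT;  /* faulty IP core or slave thread suspected */
    return CM_WAIT_PENDING;
}
```

A caller would re-evaluate cm_wait_poll() with a timeout of 50 whenever a signal arrives or a timer tick elapses, and hand over to its error handling when CM_WAIT_TIMEOUT is returned.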

7.2 Dedicated switching points

Figure 10: Application configuration switching

Activation of a new application configuration must be seamless to avoid damage and to maintain the real-time behaviour and the data integrity of the application. Obviously, this cannot be done at any time: the threads should be in a stable state (e.g., not yet started, having completed some processing, or being in a waiting state). This means that all threads will probably not switch at the same time. Hence the only robust solution is to reconfigure the IP cores and the threads one by one, as soon as a thread is ready for it, until the complete platform and application are reconfigured.

As a consequence, the main issue in application configuration switching is to find appropriate moments for performing such a switching. To that end, dedicated switching points must be specified by the application developer inside the source code of the job. A smooth transition from the current application configuration to the new one is illustrated in Figure 10 and performed as follows, where each thread performs the same steps:

1. Through the signal ack_selection, the GRM indicates to the CM that an application configuration switching is required.
2. Whenever a thread reaches a dedicated switching point (i.e., after some reaction time), the thread checks whether a switching is requested by reading the shared variable current_ac.
3. If a switching is requested:
   a. The thread sends a signal sig_terminating to the CM.
   b. The thread enters a sleep mode until reception of the signal sig_wakeup, which requests the activation of the thread after reconfiguration of the IP core. The IP core reconfiguration and the thread activation are not immediate. They need some freeze time, due to DVFS for example. During this freeze time, the CM can perform some pre-processing before the next thread activations (e.g., prepare a list of next actions for the GRM, keep sending signals sig_reconfig).
   c. The thread starts its execution according to the newly selected configuration.

Hence the switching mechanism takes place at two levels:

• At the platform level: all IP cores, managed by the GRM, are reconfigured according to the newly selected power modes.
• At the application level: all jobs, managed by the CM, switch safely to the newly selected configuration.

In the COMPLEX use case 2, where the application consists of three for-loop jobs (i.e., alarm processing, audio activity detection, and video image processing), the switching points are set between two successive iterations of each job.
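Steps 2 and 3 above, as seen from one job, can be sketched in C as follows. This is an illustrative single-threaded model under stated assumptions: the type job_thread and the functions job_switching_point(), job_wakeup() and job_iterate() are hypothetical, current_ac is passed in as a plain argument instead of being read from shared memory, and signal delivery is reduced to return values.

```c
#include <stdbool.h>

/* Hypothetical per-job bookkeeping. */
typedef struct {
    int running_ac;       /* configuration the job is currently executing */
    bool asleep;          /* true between sig_terminating and sig_wakeup */
    unsigned iterations;  /* completed loop iterations */
} job_thread;

/* Dedicated switching point, placed between two loop iterations.
   Returns true if sig_terminating has to be sent to the CM, in which
   case the job goes to sleep until job_wakeup() is called. */
static bool job_switching_point(job_thread *t, int current_ac)
{
    if (t->running_ac == current_ac)
        return false;  /* no switching requested, keep running */
    t->asleep = true;
    return true;
}

/* Reception of sig_wakeup: the IP core has been reconfigured and the job
   resumes according to the newly selected configuration. */
static void job_wakeup(job_thread *t, int current_ac)
{
    t->running_ac = current_ac;
    t->asleep = false;
}

/* One loop iteration of the job; a sleeping job makes no progress. */
static void job_iterate(job_thread *t)
{
    if (!t->asleep)
        t->iterations++;
}
```

Because the check is only made at the switching point, an iteration that has already started always completes under the old configuration, which is the stable-state requirement stated above.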

7.3 Initialization of the application

Figure 11: Initialization of the application

The initialization of the application is illustrated in Figure 11, where it is assumed that the currently selected application configuration consists in executing Job 1 and Job 2.

7.4 Actual execution of the application

Figure 12: Actual execution of the application

The actual execution of the application is illustrated in Figure 12, where it is assumed that Job 1 has to switch to a new configuration, Job 2 has to stop its execution, and Job 3 has to start its execution.

8 Power management of individual IP cores

Individual IP cores of the platform are controlled by the GRM through an LRM (see Section 4). Each IP core provides a set of power modes that can be activated by the GRM. These modes (typically a combination of supply voltage and clock frequency) allow the core to operate at different performance levels or to be deactivated completely. During the characterisation phase, a power mode table (see Section 6.2) is generated. This XML file contains all information about the available power modes and the allowed transitions between them. It also contains information about the switching overhead in terms of delay and power.

In both COMPLEX use cases 1 and 2, the platform consists of HW accelerators and SW cores (e.g., ReISC DSP cores). The APIs and power mode tables for HW accelerators and for black-box IP cores are provided in Section 8.1. Those of SW platform cores must be provided by the platform provider. The current implementation status is described in Section 8.2 for the ReISC DSP core of the platform used in COMPLEX use cases 1 and 2.

8.1 HW accelerators and black-box IP cores

For HW modules as well as black-box IP cores of the virtual prototype, information about available power modes is handled by the non-functional model. This model can be accessed by the GRM through a TLM2-based interface, which is implemented as a register interface accessible through a TLM2 socket. This is shown in Figure 13.

Figure 13: TLM-based LRM interface for HW accelerator modules
[Figure components: TLM2 communication interface; register interface (Desired, Recent, Status); communication adapter; functional model (augmented behaviour); non-functional model (Vdd, Vth, clock tree, leakage, etc.); observer (calculates power and timing).]

Using the register interface, power modes can be requested and the actual state of the power mode management can be obtained. The registers are accessed through the TLM generic_payload pattern. If one of the registers is read or written, the interface adapter communicates directly with the non-functional model, using methods of a generic base class. All non-functional models are derived from that class, so a generic approach for accessing the models is available and only one type of interface adapter must be provided. In order to keep the interface as simple as possible, the interface simply calls the appropriate getter and setter methods for each register. The register file has the structure shown in Table 1.

8.1.1 Model of computation

Power mode switching can only be performed if the particular module has completed its computation, i.e., is idle. Whenever a power mode is requested, it is checked whether the mode is a valid one (i.e., the mode ID is known) and whether there exists a valid transition from the current mode to the requested one. If the mode is not known or if the transition is not possible, the status register is set accordingly (see the description of the Status register). If the requested mode is valid and a transition is possible but the module is currently active, the requested mode is accepted but pending. When the module completes its computation and becomes idle, a pending power mode switch is performed. That is, the new supply voltage and clock frequency are applied and the observer is informed about the overhead in terms of power and timing caused by the mode switching. As long as the requested mode is pending, the request can be revoked by requesting the currently active power mode. The same is true for unknown states or impossible transitions. Figure 14 shows the flow for requesting and revoking power mode switching.

Figure 14: Power mode request/revoke flow

For modelling individual power modes, an approach similar to the power state machine for black-box IP cores (see COMPLEX deliverable D2.3.2 [4]) is used. Each power mode is represented by a state of the state machine. A state, i.e., a power mode, is enriched with attributes such as supply voltage and clock frequency. A power mode switch is done by executing a state transition. Such a transition is also enriched with attributes, in this case attributes describing the overhead of the transition: a timing overhead (delay) and a power overhead. When switching to a state with a lower supply voltage, no overhead is given, since the module can be used immediately. Functionality is retained if internal capacitances have a higher voltage level. The correct voltage level is automatically reached during operation. Figure 15 shows such an annotated power mode machine.
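The request/revoke behaviour described in this section can be summarised by a small C model. It is a sketch under stated assumptions, not the non-functional model of the COMPLEX library: the names pm_model, pm_status, pm_request() and pm_on_idle() are hypothetical, the number of modes is fixed to two, and the overhead reported to the observer is omitted.

```c
#include <stdbool.h>

#define PM_COUNT 2  /* number of known power modes in this sketch */

/* Status values mirroring OK / pending / invalid transition / invalid mode. */
typedef enum { PM_OK, PM_PENDING, PM_INVALID_TRANS, PM_INVALID_MODE } pm_status;

typedef struct {
    unsigned current;                  /* currently active power mode */
    unsigned desired;                  /* last requested power mode */
    bool busy;                         /* module is computing, not idle */
    pm_status status;                  /* result of the last request */
    bool allowed[PM_COUNT][PM_COUNT];  /* allowed[from][to] transitions */
} pm_model;

/* Request a power mode. Requesting the currently active mode always
   succeeds, which is how a pending or rejected request is revoked. */
static pm_status pm_request(pm_model *m, unsigned mode)
{
    m->desired = mode;
    if (mode >= PM_COUNT)
        m->status = PM_INVALID_MODE;
    else if (mode == m->current)
        m->status = PM_OK;
    else if (!m->allowed[m->current][mode])
        m->status = PM_INVALID_TRANS;
    else if (m->busy)
        m->status = PM_PENDING;   /* accepted, applied once idle */
    else {
        m->current = mode;        /* module idle: switch immediately */
        m->status = PM_OK;
    }
    return m->status;
}

/* The module completed its computation: apply a pending switch, if any. */
static void pm_on_idle(pm_model *m)
{
    m->busy = false;
    if (m->status == PM_PENDING) {
        m->current = m->desired;
        m->status = PM_OK;
    }
}
```

In this model a request made while the module is busy returns PM_PENDING, and a following request for the active mode returns PM_OK and leaves the mode unchanged when the module later becomes idle.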


Figure 15: Example power mode machine

Transitions have guards assigned. These guards are responsible for consuming the input word, i.e., the power mode switch requests. The power mode switch is triggered by the simulation logic shown in Figure 14. If a switch should be performed, an event (which is equal to the desired power mode) is fired by the simulation logic and consumed by the appropriate guard.

Register Interface

Table 1: LRM register interface

    Register   Bit 31     Bits 30:2   Bits 1:0
    Desired    Desired (bits 31:0)
    Recent     Recent (bits 31:0)
    Status     V          Reserved    S

The following sections describe the functionality of each register.

Register Desired

    31:0  Desired  Contains the ID of the desired power mode, as requested by the GRM. The register is r/w.

Register Recent

    31:0  Recent   Contains the ID of the recent power mode. The register is read-only. It is only valid if the valid bit of the status register is set.

Register Status

    31    Valid     If set, the content of the register file is valid. If any of the registers of the interface is written, this bit is set immediately. That is, this bit can be read in the next cycle and will then contain a valid value. This bit is read-only.
    30:2  Reserved  Reserved bits. The content is not defined.
    1:0   Status    Determines the current status of the LRM. These bits are read-only. The content of these bits is only valid if the valid bit of this register is set. The following values are possible:
                    00: OK; the desired power mode is accepted and active.
                    01: Pending; the desired power mode is accepted, but not activated yet. It will become active as soon as possible.
                    10: Invalid transition; the desired power mode is known, but it is not possible to switch from the current power mode to the desired one.
                    11: Invalid mode; the desired power mode is not known.

API documentation

A generic API for accessing the registers is provided. This API can be used by all software cores (i.e., the software running on them) to access the LRM interface of a particular HW module. All methods of the API return a response type, conveying whether the call was successful or not. The definition of this response is shown in the listing below.

    // The return type of all API calls
    typedef enum {
        // No error occurred. LRM settings are correct and no power mode
        // switch is pending.
        LRM_STATUS_OK = 0x0,
        // Power mode request accepted. The mode is switched as soon as a
        // power mode switch is possible.
        LRM_STATUS_PENDING = 0x1,
        // The requested mode is known, but the transition from the actual
        // mode to the requested one is not allowed.
        LRM_INVALID_TRANS = 0x2,
        // The requested power mode is unknown.
        LRM_INVALID_MODE = 0x3
    } lrm_response;

The ID of a power mode is simply an integer number:

    // The C/C++ type of the power mode ID.
    // The ID of a power mode equals its ID given in the power mode table.
    typedef unsigned int power_mode;

The GRM can request a HW or IP module to switch to a certain power mode. This is done using the method shown below. If the requested power mode is not known, or the transition is not possible, an error is returned. If mode and transition are valid, the mode request is acknowledged and the module will switch to the requested mode as soon as possible.

    // Request a certain power mode.
    //   lrm_addr: The base address of the LRM register interface.
    //   pm:       The ID of the requested power mode.
    // Returns typically 0x1, 0x2, or 0x3. It might happen that the mode
    // switch is performed immediately. In this case 0x0 is returned.
    lrm_response lrm_request_mode( volatile void* lrm_addr, power_mode pm );

The current and the requested power mode can be obtained from the interface using the following two methods:

    // Gets the content of the current power mode register.
    //   lrm_addr: The base address of the LRM register interface.
    //   pm:       The ID of the module's current power mode.
    // Returns: should always be 0x0. Later some more error codes might be added.
    lrm_response lrm_get_current_mode( volatile void* lrm_addr, power_mode* pm );

    // Gets the content of the requested power mode register.
    // If not equal to the content of the current power mode register, a
    // power mode switch is pending and is performed as soon as possible.
    // If equal, no switch is pending.
    //   lrm_addr: The base address of the LRM register interface.
    //   pm:       The ID of the module's requested power mode.
    // Returns: should always be 0x0. Later some more error codes might be added.
    lrm_response lrm_get_requested_mode( volatile void* lrm_addr, power_mode* pm );

The content of the status register is available using the following method:

    // Gets the status of the LRM.
    //   lrm_addr: The base address of the LRM register interface.
    lrm_response lrm_get_status( volatile void* lrm_addr );

In the COMPLEX use case 2, a HW accelerator is used to implement an FFT. Its power table is the one given in Section

API implementation

Two versions of the LRM API have been implemented. The first one implements the API as free C functions, which could for instance be used by code running in an ISS. These plain C functions perform the register accesses directly via the volatile pointer that is passed as first argument. The pointer is interpreted to point to a struct type that reproduces the layout of the LRM register interface. Listing 1 shows the declaration of this structure.

    /*! 32bit register 'status' of the LRM register interface with 3 sub fields */
    union lrm_status_register_type {
        struct {
            unsigned int status:2;
            unsigned int reserved:29;
            unsigned int valid:1;
        } fields;
        unsigned int all;
    };

    /*! layout of the LRM register interface */
    typedef struct {
        unsigned int desired;                   // 32bit register 'desired' mode
        unsigned int recent;                    // 32bit register 'recent' mode
        union lrm_status_register_type status;  // 32bit register 'status'
    } lrm_register_if_type;

Listing 1: C implementation of the LRM register interface

As an example of a C-style implementation of an LRM API function, Listing 2 shows the definition of the lrm_request_mode function. Note that, after writing to the desired register, it waits for the valid bit to be set by the power mode model before returning the LRM status. The helper function lrm_status_to_response, which is not shown, simply converts the value of the status field into an enumerator of the lrm_response enumeration. This conversion step was added in order to avoid a direct dependency between the possible values of the status field, which might change in the future, and the integer values of the response enumerators.

    lrm_response lrm_request_mode(volatile void *lrm_addr, power_mode pm)
    {
        volatile lrm_register_if_type *lrm_reg_p =
            (volatile lrm_register_if_type*)lrm_addr;
        union lrm_status_register_type status;

        /* write the request, then poll until the model marks the status valid */
        lrm_reg_p->desired = pm;
        do {
            status.all = lrm_reg_p->status.all;
        } while (!status.fields.valid);
        return lrm_status_to_response(status.fields.status);
    }

Listing 2: C implementation of the lrm_request_mode API function

A second implementation of the API consists of namespace-encapsulated C++ functions that construct a TLM2 generic payload and pass it to the socket of the currently active initiator module. The address used in the TLM transaction is derived from the given lrm_addr pointer. The corresponding initiator module is obtained from the SystemC simulation kernel using a helper function called lrm_initiator. Listing 3 shows the TLM version of the lrm_request_mode API function. It must be noted that the actual TLM implementation of the API makes some assumptions on the initiator module. That is, it is assumed that the initiator module was derived from the corresponding wrapper base class from the COMPLEX library and provides the interface methods synchronize and adjust_global_cycle_count, which are used to control the consumption of execution time in the initiator module, as well as tlm_read and tlm_write methods that perform the actual construction of a tlm_generic_payload object and its transportation over the initiator module's socket.

    namespace cplx {
    namespace vp_tlm {

    /*! helper function returning the actual initiator module as obtained
        from the SystemC kernel */
    inline tlm_initiator_wrapper_base * lrm_initiator()
    {
        sc_core::sc_process_handle hndl = sc_get_current_process_handle();
        tlm_initiator_wrapper_base *initiator =
            dynamic_cast<tlm_initiator_wrapper_base*>(hndl.get_parent_object());
        sc_assert(initiator);
        return initiator;
    }

    lrm_response lrm_request_mode_tlm(volatile void* lrm_addr, power_mode pm)
    {
        volatile lrm_register_if_type* lrm_reg_p =
            (volatile lrm_register_if_type*)lrm_addr;
        unsigned int buscycles = 0;
        tlm_initiator_wrapper_base *initiator = lrm_initiator();

        // synchronize with initiator
        // (let cpu time that elapsed before this transaction pass):
        initiator->synchronize();

        initiator->simple_tlm_write<unsigned int>(
            sc_dt::uint64((unsigned long)&(lrm_reg_p->desired)), &pm, buscycles);

        lrm_status_register_type status;
        do {
            initiator->simple_tlm_read<unsigned int>(
                sc_dt::uint64((unsigned long)&(lrm_reg_p->status.all)),
                &status.all, buscycles);
        } while (!status.fields.valid); // \todo maybe add some timeout or delay?

        // notify initiator on consumed bus cycles for this communication:
        initiator->adjust_global_cycle_count(buscycles);

        return lrm_status_to_response(status.fields.status);
    }

    } // namespace vp_tlm
    } // namespace cplx

Listing 3: TLM version of the lrm_request_mode API function

Both implementations of the LRM API have been compiled into the COMPLEX library libcomplex-osci.a.
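Two points of the listings above lend themselves to a short illustration: the helper lrm_status_to_response, which is not shown in this report, and the open todo about a timeout in the polling loop. The C sketch below is one possible shape for both, not the code of the COMPLEX library. The helper body, the function lrm_poll_status(), its max_polls parameter and the two mask macros are assumptions; the status register is decoded with explicit masks (valid in bit 31, status in bits 1:0, as in Table 1) instead of bit-fields.

```c
#include <stdint.h>

/* Response codes as documented for the LRM API. */
typedef enum {
    LRM_STATUS_OK      = 0x0,
    LRM_STATUS_PENDING = 0x1,
    LRM_INVALID_TRANS  = 0x2,
    LRM_INVALID_MODE   = 0x3
} lrm_response;

#define LRM_STATUS_VALID_BIT  (UINT32_C(1) << 31)  /* bit 31: Valid */
#define LRM_STATUS_FIELD_MASK UINT32_C(0x3)        /* bits 1:0: Status */

/* Assumed shape of the helper: an explicit mapping, so that the register
   encoding and the enumerator values can change independently. */
static lrm_response lrm_status_to_response(unsigned status)
{
    switch (status & 0x3u) {
    case 0x0: return LRM_STATUS_OK;
    case 0x1: return LRM_STATUS_PENDING;
    case 0x2: return LRM_INVALID_TRANS;
    default:  return LRM_INVALID_MODE;
    }
}

/* Bounded variant of the polling loop (hypothetical). Reads the status
   register at most max_polls times. Returns 0 and stores the response if
   the valid bit was seen, -1 if the bound was reached first. */
static int lrm_poll_status(volatile const uint32_t *status_reg,
                           unsigned max_polls, lrm_response *out)
{
    for (unsigned i = 0; i < max_polls; ++i) {
        uint32_t raw = *status_reg;
        if (raw & LRM_STATUS_VALID_BIT) {
            *out = lrm_status_to_response((unsigned)(raw & LRM_STATUS_FIELD_MASK));
            return 0;
        }
    }
    return -1;
}
```

A bound on the number of reads keeps a caller from blocking forever if the power mode model never sets the valid bit; how the caller reacts to the -1 result is left open here.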

8.2 ReISC core

The ReISC SoC is a system on chip, taped out by STMicroelectronics at the end of 2009 in a 90 nm technology. It is the first system on chip of a new family of ultra-low power products. It encompasses the proprietary ReISC 3 core (Reduced energy Instruction Set Computer), providing hardware support for 8/16/20/32 data sizes, variable 16 bit-based instruction length and secure data. ReISC 3 is a micro-controller core targeted at ultra-low power applications. It operates at up to 50 MHz and contains embedded memories (1 Mbytes Flash memory and 32 Kbytes SRAM) and an extensive range of enhanced I/Os and peripherals. The ReISC SoC contains one 12-bit ADC, three general purpose 16-bit timers plus one internal timer, as well as standard and advanced communication interfaces: one I2C, two GPIOs, two SPIs, one USART, and one USB.

A comprehensive set of power-saving modes, internal to the ReISC SoC platform, allows the design of low-power applications. The platform can apply different power reduction techniques such as clock gating and power gating; it can also select among four clock sources. The architecture is hierarchically organized in power islands that can be switched off under the control of the Power Manager unit. Finer control of the power consumption can also be obtained through the RCCU, which allows setting the enable status of the peripherals and enabling or disabling their clocks. Moreover, the power status of a peripheral also depends on the status defined in its registers. The organization in power islands is summarized here and shown in Figure 16:

• An ALWAYS ON power island includes the ReISC core, the RCCU, the Power Manager, the Timers, and all the other components that are kept always enabled.
• A FUNCTIONAL STATE power island (with retention flip-flops) contains the other peripherals that can be switched on/off, e.g., the SPI and the GPIOs.
• An ANALOG power island includes the ADC and the clock sources.

Figure 16: ReISC SoC architecture showing power islands
[Figure legend: Always ON; Functional State (RUN, SNOOZE, SLEEP; FF retention implemented); Analog. Notes in the figure: no connection between I2 and T3 inside the XBAR, so fetch from peripherals is not allowed; no connection between I1 and T2 inside the XBAR, so moving data with DMA from flash is not allowed.]

The overall power consumption within the ReISC SoC can be controlled by two peripherals, the Power Manager and the Reset and Clock Control Unit (RCCU), by setting their registers (see Power Manager Registers and RCCU Registers).

• The Power Manager controls and monitors the overall power consumption at the SoC level. It allows putting the processor in deep sleep mode or in snooze mode. The Power Manager provides the possibility to power down the RAM, the Flash memory and the ADC.
• The Reset and Clock Control Unit provides the ability to enable and select one of the four clock sources available and to enable/disable the clock of the peripherals.


Moreover, the power state of a peripheral depends on its status registers. For instance, the power consumption of an ADC depends on whether it is enabled and sampling. An excerpt of the registers that control an ADC is shown under ADC Registers.

The functional simulation of an application, such as the application in the COMPLEX use case 1, is made on a virtual platform simulation framework of the ReISC SoC, shown in Figure 17. It consists of an ISS of the ReISC 3 processor, which communicates with the hardware models of the peripherals through a bus model. A SystemC wrapper implements the interface between the instruction-set simulator (ISS) and the rest of the system, i.e., the peripherals, which are mostly modelled in SystemC. Only the components that are closely linked to the ISS or to the memory have been left under the direct control of the ISS. In Figure 17 the SystemC peripherals are shown in orange, while the parts in yellow are modelled in C within the ISS.

Figure 17: Architecture of the ReISC SoC virtual platform

The power consumption status is determined by the registers that control the peripherals. To perform power profiling of an application running on the ReISC processor, it is necessary to provide two sets of APIs:

• Virtual Platform Power Monitoring APIs: enhance the virtual platform with the capability of monitoring power consumption.
• Application Level Power Driver APIs: provide the possibility to control the power consumption from the application.

8.2.1 Virtual Platform Power Monitoring APIs

The APIs that dynamically trace the power consumption of an application running on the ReISC SoC are developed as state machines that monitor the evolution of the system components. These APIs are written in SystemC (they belong to the orange domain in Figure 17), have access to the registers of the peripherals, and are added as an extension of the Virtual Platform just for the purpose of monitoring the power consumption. For instance, to compute the power profile of the ADC, it is necessary to observe the events from the Power Manager, the RCCU and the ADC. A SystemC model that implements the FSM tracing the power state transitions is shown in the following code. Since the Virtual Platform has access to the registers that control the peripherals, the evolution of the power consumption can be monitored by functions such as the following.

    #include <systemc.h>

    // Power states of the ADC (declaration assumed; names as used in the FSM).
    enum power_state_adc { IDLE, SAMPLE, OFF, NOCLOCK };

    SC_MODULE(POWER_FSM_ADC)
    {
        sc_in<bool> clock;
        sc_in<bool> PWRMNG_CMD_HM_12;   // Power Manager: ADC power-down bit
        sc_in<bool> RCCU_PERIPHCKEN_8;  // RCCU: adc0 clock enable bit
        sc_in<bool> ADC0_CR2_0;         // ADC0_CR2 bit 0
        sc_signal<power_state_adc> next_state;
        sc_signal<power_state_adc> current_state;

        void getnextst();
        void setstate();

        SC_CTOR(POWER_FSM_ADC)
        {
            current_state = IDLE;
            SC_METHOD(getnextst);
            dont_initialize();
            // Sensitivity list assumed: the original list is not recoverable.
            sensitive << current_state << PWRMNG_CMD_HM_12
                      << RCCU_PERIPHCKEN_8 << ADC0_CR2_0;
            SC_METHOD(setstate);
            dont_initialize();
            sensitive << clock.pos();
        }
    };

    // Next-state logic: power down has priority over clock gating,
    // which has priority over the ADC sampling bit.
    void POWER_FSM_ADC::getnextst()
    {
        switch (current_state.read()) {
        case IDLE:
            if (PWRMNG_CMD_HM_12 == 1) next_state = OFF;
            else if (RCCU_PERIPHCKEN_8 == 0) next_state = NOCLOCK;
            else if (ADC0_CR2_0 == 1) next_state = SAMPLE;
            break;
        case SAMPLE:
            if (PWRMNG_CMD_HM_12 == 1) next_state = OFF;
            else if (RCCU_PERIPHCKEN_8 == 0) next_state = NOCLOCK;
            else if (ADC0_CR2_0 == 0) next_state = IDLE;
            break;
        case OFF:
            if (PWRMNG_CMD_HM_12 == 0) {
                next_state = IDLE;
                if (RCCU_PERIPHCKEN_8 == 0) next_state = NOCLOCK;
                else if (ADC0_CR2_0 == 1) next_state = SAMPLE;
            }
            break;
        case NOCLOCK:
            if (PWRMNG_CMD_HM_12 == 1) next_state = OFF;
            else if (RCCU_PERIPHCKEN_8 == 1) {
                next_state = IDLE;
                if (ADC0_CR2_0 == 1) next_state = SAMPLE;
            }
            break;
        } // end switch
    } // end getnextst

    // State register: updated on the rising clock edge and traced.
    void POWER_FSM_ADC::setstate()
    {
        current_state = next_state;
        trace_power_state(current_state);
    }

The transitions among the power states are registered with their corresponding timestamps, so that it is possible to compute the power profile of the system components. A state machine similar to the one described in this example is needed for each component whose power consumption needs to be profiled. It is therefore possible to have a set of concurrent power state machines that monitor all the components of the SoC. The power profile is obtained by integrating over time the power consumed in the power states.

Application Level Power Driver APIs

On the application side, it is necessary to provide a set of functions that control the registers governing the peripherals. Such functions provide power driver APIs written in C and running on the ISS (belonging to the yellow domain in Figure 17). These functions have access to the registers that control the peripherals, and they work under the control of the operating system. An example of a driver for the ADC is shown in the following code excerpt.

    void ADC_powerControl(int targetstate)
    {
      if (targetstate == ADCSAMPLE) {
        // ADC on, continuous conversion on, CLOCK enable, POWER on
        ADC0->ADC_CR2 |= 0x1;                                   // ue on
        ADC0->ADC_CR2 |= 0x2;                                   // cont on
        RCCU0->RCCU_PERIPHCKEN |= EXT_RCCU_PERIPHCKEN_ADC0;     // CLOCK ENABLE
        PWRMNG0->PWRMNG_PD_HM &= ~PWR_MNG_ADCOKINV33;           // POWER ON
      }
      if (targetstate == ADCIDLE) {
        // ADC on, continuous conversion off, CLOCK enable, POWER on
        ADC0->ADC_CR2 |= 0x1;                                   // ue on
        ADC0->ADC_CR2 &= ~0x2;                                  // cont off
        RCCU0->RCCU_PERIPHCKEN |= EXT_RCCU_PERIPHCKEN_ADC0;     // CLOCK ENABLE
        PWRMNG0->PWRMNG_PD_HM &= ~PWR_MNG_ADCOKINV33;           // POWER ON
      }
      if (targetstate == ADCOFF) {
        // CLOCK disable, POWER off
        ADC0->ADC_CR2 &= ~0x1;                                  // ue off
        ADC0->ADC_CR2 &= ~0x2;                                  // cont off
        PWRMNG0->PWRMNG_PD_HM |= PWR_MNG_ADCOKINV33;            // POWER OFF
        RCCU0->RCCU_PERIPHCKEN &= ~EXT_RCCU_PERIPHCKEN_ADC0;    // CLOCK DISABLE
      }
      if (targetstate == ADCNOCLOCK) {
        // CLOCK disable, POWER on
        RCCU0->RCCU_PERIPHCKEN &= ~EXT_RCCU_PERIPHCKEN_ADC0;    // CLOCK DISABLE
        ADC0->ADC_CR2 &= ~0x2;                                  // cont off
        ADC0->ADC_CR2 &= ~0x1;                                  // ue off
        PWRMNG0->PWRMNG_PD_HM &= ~PWR_MNG_ADCOKINV33;           // POWER ON
      }
    }

Application Level Power Driver APIs: an example of how they are used follows.

    // ADC POWER PLATFORM TEST STARTING
    vtaskdelay(10);
    ADC_powerControl(ADCIDLE);
    vtaskdelay(10);
    ADC_powerControl(ADCSAMPLE);
    vtaskdelay(10);
    ADC_powerControl(ADCNOCLOCK);
    vtaskdelay(10);
    ADC_powerControl(ADCOFF);
    vtaskdelay(10);
    // UARTGPIO POWER PLATFORM TEST STARTING
    vtaskdelay(10);
    UARTGPIO_powerControl(UARTGPIOIDLE);

Power Manager Registers

    Register name    Address    Function
    PWRMNG_STATE     0xFE800
    PWRMNG_PD        0xFE804
    PWRMNG_CMD       0xFE808
    PWRMNG_FLHPDT    0xFE80C
    PWRMNG_PD_HM     0xFE810

PWRMNG_STATE

Field layout: A

    Name  Bits  Rights  Reset  Description
    A     3:0   R       0x0    Power manager functional state
                               0x0 -> rst
                               0x1 -> rst2
                               0x2 -> run
                               0x3 -> clk_gate
                               0x4 -> retention
                               0x5 -> snooze
                               0x6 -> waiting flash and psw wake-up time
                               0x7 -> wakeup
                               0x8 -> no retention
                               0x9 -> sleep

PWRMNG_PD_HM

Field layout (MSB to LSB): H G F E D C B A

    Name  Bits  Rights  Reset  Description
    A     1:0   RW      0x0    D2 ram pd_mode (msb), pd (lsb)
    B     3:2   RW      0x0    D1 ram pd_mode (msb), pd (lsb)
    C     5:4   RW      0x0    I2 ram pd_mode (msb), pd (lsb)
    D     7:6   RW      0x0    I1 ram pd_mode (msb), pd (lsb)
    E     9:8   RW      0x0    usb ram pd_mode (msb), pd (lsb)
    F     10    RW      0x0    Flash power down (Stop)
    G     11    RW      0x0    Flash deep power down (DeepPD)
    H     12    RW      0x0    Adc_okinV33 power down
                               0 -> adc functional
                               1 -> adc power down

RCCU Registers

    Register name      Address    Function
    RCCU_CKEN          0xFEC00
    RCCU_CKSEL         0xFEC04
    RCCU_CKRDY         0xFEC08
    RCCU_CKRDYIE       0xFEC0C
    RCCU_CKRDYF        0xFEC10
    RCCU_PLLDIV        0xFEC14
    RCCU_RCHSTRIM      0xFEC18
    RCCU_PERIPHRST     0xFEC1C
    RCCU_PERIPHCKEN    0xFEC20

RCCU_PERIPHCKEN

Field layout (MSB to LSB): R Q P O N M L K J I H G F E D C B A

    Name  Bit  Rights  Reset  Description
    A     0    RW      0x0    wwdg0
    B     1    RW      0x0    spi0
    C     2    RW      0x0    spi1
    D     3    RW      0x0    sci0
    E     4    RW      0x0    i2c0
    F     5    RW      0x0    gptim0
    G     6    RW      0x0    gptim1
    H     7    RW      0x0    gptim2
    I     8    RW      0x0    adc0
    J     9    RW      0x0    usb0
    K     10   RW      0x0    iwdg0
    L     11   RW      0x0    dma0
    M     12   RW      0x0    gpio0
    N     13   RW      0x0    gpio1
    O     14   RW      0x0    gptim3
    P     15   RW      0x0    rtc0
    Q     16   RW      0x0    evctl0
    R     17   RW      0x1    dsu

For each field: 0 -> clock disable; 1 -> clock enable.

ADC Registers

    Register name    Address    Function
    ADC0_SR          0xFDC00
    ADC0_CR1         0xFDC04
    ADC0_CR2         0xFDC08

ADC0_CR

Field layout (MSB to LSB): E D C B A

    Name  Bits  Rights  Reset  Description
    A     0     RW      0x0    ADCON
                               0 -> OFF
                               1 -> ON
    B     1     RW      0x0    CONT (Continuous conversion)
                               0 -> single conversion mode
                               1 -> continuous conversion mode
    C     2     RW      0x0    DMA mode
                               0 -> disable
                               1 -> enable
    D     3     RW      0x0    EXT_TRIG (Conversion on external event)
                               0 -> disable
                               1 -> enable
    E     6:4   RW      0x0    EXT_SEL (External Event Select)
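As an illustration of how the register tables relate to the power state machine and to the power profile, the following minimal C sketch decodes the ADC power state from three register values and integrates the power over a timestamped state trace. It is a sketch under stated assumptions, not the project's implementation: the function names, the trace structure, and the per-state power values passed by the caller are hypothetical. Only the bit positions (bit 12 of PWRMNG_PD_HM, bit 8 of RCCU_PERIPHCKEN, bit 0 of the ADC control register) and the priority of the conditions are taken from the tables and from getnextst().

```c
#include <stddef.h>
#include <stdint.h>

/* Bit positions as listed in the register tables of this section. */
#define PD_HM_ADC_PD_BIT    12u /* PWRMNG_PD_HM field H: 1 -> adc power down */
#define PERIPHCKEN_ADC0_BIT  8u /* RCCU_PERIPHCKEN field I: 1 -> clock enable */
#define ADC_ADCON_BIT        0u /* ADC control register field A: 1 -> ON */

typedef enum { ADC_IDLE, ADC_SAMPLE, ADC_OFF, ADC_NOCLOCK, ADC_STATE_COUNT } adc_state_t;

/* One entry of the power trace: the state entered at time t_s (seconds).
   Hypothetical structure, used here only for illustration. */
typedef struct {
    double t_s;
    adc_state_t state;
} power_event_t;

/* Steady-state decode with the same priority as the FSM conditions:
   power-down first, then clock gating, then the ADC on/off bit. */
static adc_state_t adc_state_from_regs(uint32_t pd_hm, uint32_t periphcken,
                                       uint32_t adc_cr)
{
    if ((pd_hm >> PD_HM_ADC_PD_BIT) & 1u)
        return ADC_OFF;
    if (((periphcken >> PERIPHCKEN_ADC0_BIT) & 1u) == 0u)
        return ADC_NOCLOCK;
    return ((adc_cr >> ADC_ADCON_BIT) & 1u) ? ADC_SAMPLE : ADC_IDLE;
}

/* Energy in joules: sum over the trace of (power of the state) times
   (time spent in that state). The last state lasts until t_end_s.
   power_w holds one assumed power value (watts) per state. */
static double trace_energy_j(const power_event_t *ev, size_t n, double t_end_s,
                             const double power_w[ADC_STATE_COUNT])
{
    double energy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double t_next = (i + 1 < n) ? ev[i + 1].t_s : t_end_s;
        energy += power_w[ev[i].state] * (t_next - ev[i].t_s);
    }
    return energy;
}
```

With one such trace per component, the system-level profile would simply be the sum of the per-component energies over the same time window.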

9 RRM for audio-driven video surveillance domain

9.1 Overview

Figure 18: Audio-driven surveillance application

Figure 19: Country border protection


The application, represented in Figure 18, is from the audio-driven video surveillance domain for surveillance of critical areas. In particular, assume the situation illustrated in Figure 19. A territory hundreds of kilometers long is controlled by a set of fixed cameras. Whenever some border fence is damaged, the territory becomes vulnerable, and the current surveillance system needs to be reinforced. To that end, a short-term solution is needed: it must be deployable instantly, and it must not require a complex and costly infrastructure.

Figure 20: Platform architecture

On the platform side, the hardware architecture is an embedded MP-SoC platform with one core acting as host processor and controlling different HW/SW Processing Units (PUs). The programming model is component-based, which means that each PU can be seen as a coprocessor. Additional HW accelerators can be included to perform the computationally intensive tasks. To enable dynamic power management, different power islands are taken into account: the host processor, the image acquisition pipeline, and each PU. The overall target platform, represented in Figure 20, consists of three slave SW cores and one host SW processor. The platform constraints are the battery duration (e.g., 24 hours) and the energy budget in the battery.

In the application, three jobs are considered: audio activity detection on the first SW core for environment noise detection, video processing on the second SW core for motion detection and image selection, and alarm on the third SW core. Two application modes are allowed: either audio with alarm, or audio with video.

QoS options are characterized at two levels in the application configurations. These options influence the number of operations executed by the jobs. Hence they make it possible to apply dynamic voltage and frequency scaling (DVFS) to each power island in order to control the power consumption of the platform.

The first QoS level is related to the amount of application functionalities provided by the application configuration. This QoS level should get the highest priority. The second level is related to the QoS of the inputs and outputs of the application:
- Sampling frequency of audio inputs: two sampling frequencies can be configured, 16 bits at 8 or 16 kHz, corresponding to a bit rate of 128 or 256 kbit/s.
- Image resolution and rate: this depends on the camera packaged with the system. However, in order to reduce the power consumption, image resolution and rate can be reduced. For video surveillance systems, it is usually not necessary to store high-definition images. An image resolution of 352 x 288 pixels (CIF format) or 704 x 576 pixels (4CIF format) at a rate of 16 or 25 images per second is considered.

Since this application is embedded, autonomy is a critical requirement, and controlling and reducing the energy consumption is crucial. The optimization goal is to maximize the QoS of the application, whereas the platform constraints are the energy budget and the battery duration of the platform. To that end, the utility function models the QoS of an application configuration as a weighted sum of its audio and image frequency and resolution and of the amount of application functionalities provided by its application mode.

9.2 Experimental Results

In order to perform initial experiments on the overhead and feasibility of the presented run-time management approach, both the GRM and the CM have been integrated in a POSIX implementation of the audio-driven video surveillance application. First, this implementation has been deployed and tested on an X86-based platform running at 800 MHz. This section analyzes the obtained results.
Second, due to the current unavailability of the ST-I platform (including the host, the four ReISC DSP cores, and the Free RTOS), and since the RRM framework needs a host and an OS, the RRM framework will be evaluated on an ARM-based TI OMAP 4460 embedded platform running at 700 MHz. The obtained results will be reported in Deliverable D4.2.2, entitled Final report on evaluation of design tools.

Binary size of GRM implementation

As explained in Section 0, the GRM is implemented in C and compiled into a library libgrm.a, which is then linked to the application. The current binary size of libgrm.a is 107 KB on the X86-based platform, without taking the GRM databases into account. Two GRM databases are required to store the high-level specification of the platform, the IP core types, and the application configurations:
- The current binary size of the platform database is:
    * ipcore_nr + 54 * ipcore_type_nr + 20 * ipcore_type_nr * power_mode_nr * (1 + power_mode_nr) bytes,
  where ipcore_nr denotes the number of IP cores in the platform, ipcore_type_nr denotes the number of IP core types, and power_mode_nr denotes the maximum number of power modes per IP core type. E.g., in our demonstrator, where the platform consists of 4 SW cores with at most 5 available power modes, the binary size of the platform database is 1506 bytes.
- Similarly, the current binary size of the application configuration database is:
    ( * job_nr) * appl_config_nr bytes,
  where job_nr denotes the number of jobs in the application and appl_config_nr denotes the number of application configurations. E.g., in our demonstrator, where the application consists of three jobs (i.e., alarm processing, audio activity detection, and video image processing), the binary size of an application configuration is 88 bytes.

Performance overhead and energy gain

Figure 21: Energy-per-frame evolution with and without GRM

Figure 21 illustrates the energy-per-frame evolution of our demonstrator for two platform constraints (different energy budgets, same battery duration) with and without the GRM. Thanks to an optimized adaptive selection of application configurations, our GRM optimizes the QoS of the application while keeping the platform battery alive during its whole required duration. In contrast, this cannot be ensured without such an RRM framework. Without the GRM, only one application configuration can be activated, from the start of the application. With the GRM, several configurations may be successively activated during the run of the application:

in this demonstrator, among the 16 available application configurations, 4 (resp. 9) are activated to satisfy the platform constraint 1 (resp. 2).

Figure 22: Performance of GRM initialization

Figure 22 illustrates the CPU processing of the GRM services executed at initialization on the X86-based platform. Both GRM_ConfigurePlatform() and GRM_ConfigureApplication() require more processing because they parse the high-level platform specification and the available application configurations. Nevertheless, these services are executed only once, without any run-time overhead.


Figure 23: Performance of GRM run-time services

Figure 23 illustrates the CPU processing of the GRM services executed at run time on the X86-based platform. The execution time of GRM_SelectApplicationConfiguration() includes that of GRM_EstimateElapsedEnergy(). The execution time of GRM_ReconfigurePlatform() includes all waiting times for IP cores to become ready and for IP core reconfiguration. Nevertheless, the average execution time on the X86-based platform is still below 0.5 ms.

Figure 24: GRM CPU processing overhead

Finally, a global analysis of the GRM CPU processing overhead compared to the application processing shows an overhead of only 1.16% on the X86-based platform running at 800 MHz (see Figure 24). Observe that this overhead is only 0.6% on a TI OMAP 4460 embedded platform running at 700 MHz. This shows that the overhead of the proposed run-time energy management mechanism is negligible and that it has no significant impact on the application.

In conclusion, the experiments performed so far indicate that the proposed combination of design-time exploration of application configurations with run-time optimization can improve the overall QoE of the system.
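The adaptive selection of application configurations discussed in this chapter can be pictured with the following minimal C sketch. It is not the GRM implementation: the structure, the function names, the normalization constants, and the weights are assumptions made for illustration. It models the utility as a weighted sum of the QoS terms named in Section 9.1 and, among the configurations whose assumed average power can be sustained by the remaining energy over the remaining battery duration, picks the one with the highest utility.

```c
#include <stddef.h>

/* Hypothetical description of one application configuration. */
typedef struct {
    double functionality; /* amount of functionality, normalized to 0..1 */
    double audio_khz;     /* audio sampling frequency: 8 or 16 */
    double image_pixels;  /* 352*288 (CIF) or 704*576 (4CIF) */
    double image_fps;     /* 16 or 25 images per second */
    double power_w;       /* assumed average platform power in watts */
} appl_config_t;

/* Weighted sum of the QoS terms; each term is normalized by its maximum
   value so that the (assumed) weights are comparable. */
static double config_utility(const appl_config_t *c, const double w[4])
{
    return w[0] * c->functionality
         + w[1] * (c->audio_khz / 16.0)
         + w[2] * (c->image_pixels / (704.0 * 576.0))
         + w[3] * (c->image_fps / 25.0);
}

/* Returns the index of the highest-utility configuration that the
   remaining energy can sustain for the remaining time, or -1 if none. */
static int select_config(const appl_config_t *cfg, size_t n, const double w[4],
                         double remaining_energy_j, double remaining_time_s)
{
    int best = -1;
    double best_utility = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (cfg[i].power_w * remaining_time_s > remaining_energy_j)
            continue; /* would drain the battery before the deadline */
        double u = config_utility(&cfg[i], w);
        if (best < 0 || u > best_utility) {
            best = (int)i;
            best_utility = u;
        }
    }
    return best;
}
```

Calling select_config() periodically with updated energy and time figures would let several configurations be activated in succession during a run, which is the behaviour reported for Figure 21. Giving the functionality term the largest weight reflects the statement that the first QoS level should get the highest priority.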

10 RRM for ultra-low power platforms

This section presents the RRM implementation in the COMPLEX use case 1. Since the target platform consists of a single core, the RRM instantiation is very simple. It focuses on a new heuristic for fine-grain DVFS in the set of required run-time decisions introduced in Section 4.2. This heuristic consists of a methodology and a tool-chain developed to optimize the energy consumption associated with software execution on a tiny embedded system. The optimization combines the SW estimation process (developed in Task 3.2 Embedded Software Optimization) with design space exploration methodologies (developed in Task 3.4 Design Space Exploration) in order to exploit fine-grain DVFS. The proposed approach operates at compile time, with the granularity of a single C function, by augmenting the source code with calls that drive the voltage and frequency scaling of the core at run time.

The design-time part of the methodology uses the concepts developed in Task 3.2 (for details see Deliverable D3.2.2 [5]) for software estimation and in Task 3.4 (for details see Deliverable D3.4.2 [2]) for exploration. Regarding the SW estimation part, the methodology developed in T3.2 has been modified to express the energy costs of the basic entities of the LLVM intermediate representation in terms of effective capacitance, rather than as average current absorption per clock cycle. This choice allows energy figures to be accumulated independently of the actual clock frequency and core supply voltage, which is crucial for exploring the different voltage/frequency operating modes.

Overview of the Methodology

The methodology presented here uses, as its knobs (parameters), the power modes of the target platform at predetermined positions in the code.
In fact, due to the single-core version of the platform and to the application characteristics, no different mappings or application reconfigurations are needed. Thus, the instantiation of the global view of the RRM manages only the power states of the platform, so that it acts as a Power Manager.

The methodology assumes that the target processor provides a set of operating modes (voltage and frequency pairs), referred to in the following as explicit modes. Obviously, at any point in time, the target processor can run in only a single operating mode. In this approach, the function is the smallest granularity considered for the analysis. This means that if a function is assigned to an operating mode OM1=<V1, F1>, the processor runs at the frequency F1 with the supply voltage V1 while executing it. Moreover, when a function is assigned to a specific explicit mode, the tool-chain augments the C source code by inserting RRM calls (that wrap the platform-specific library function calls) which switch to the selected mode on entering the function and back to the previous mode on exiting.

In addition to the explicit modes, the proposed methodology adds two implicit modes (described in the following) that the application developer can use to guide the run-time manager without explicitly forcing a particular frequency, deriving it instead from the execution context.
- Force. When the mode of a function is set to force a specific explicit mode, all its callees are executed in the same operating condition as the caller, regardless of their own explicit assignments. The force mode is especially useful as a means to classify a function as so important that its execution should not undergo any change of operating condition.
- Inherit. The inherit mode has, in a sense, a dual meaning. It specifies that the mode of a function is not explicitly set, but is rather inherited from its caller. Thanks to this mode, small functions that do not constitute a critical portion of the task on their own can be left free to operate under the control of their callers.

These special modes provide a simple yet powerful means to have a function executed in different modes depending on its context, thus making the approach more flexible. The timing diagram of Figure 25 shows the effects of three different mode assignments on the processor operating conditions. In the figure, the labels X1 and X2 indicate explicit mode assignments, F1 and F2 force modes, and I the inherit mode.

Figure 25: Mode Assignment Effects

Apart from the obvious effects of explicit assignments shown in Figure 25-(a), it is interesting to observe the behaviour resulting from forcing and inheriting modes. To this purpose, we focus on function f1(). When f2() forces mode 1 and calls f1(), Figure 25-(b) shows that the explicit mode assignment of f1() is ignored. On the other hand, when called directly from f3(), f1() is executed in mode 2, as specified by its explicit assignment. Furthermore, the diagram in Figure 25-(c) shows that the operating mode in which f1() is executed is always that of its caller.

Though the behaviour in these last two cases is the same, it is obtained in two dual ways: in the first case the caller imposes its mode on all its callees, while in the second it is the callee that delegates the decision on the operating mode to its caller.

With the meaning of the different operating modes (both explicit and implicit) explained, the optimization goal can be defined clearly. The goal of the design-time optimization is to find an assignment of a mode to each function that minimizes the overall energy of a program run, under a constraint on the maximum allowed time for the task. The goal of the run-time part of the methodology, as already explained, is to deploy at run time the operating modes defined at design time.

Considering N possible processor operating modes and F functions, the exact solution of the problem requires examining N^F assignments. Given this exponential complexity, the problem soon becomes intractable. For 3 modes and 20 functions, for example, the number of assignments is 3^20, close to 3.5 billion, which makes a heuristic design space exploration approach necessary.

Tool-Flow

This section describes the estimation and optimization flow adopted. More details about the estimation methodology and tool and about the exploration tool can be found in D3.2.2 Final report on software and hardware optimization and D3.4.3 Final report on design space exploration, respectively.

The implemented estimation flow is based on the LLVM compiler infrastructure, upon which the toolset SWAT has been developed. A simplified view of the portion of the flow strictly related to the estimation process necessary for the target problem is outlined in Figure 26. The inputs are the set of C source files containing the code of the task being considered, a model of the target CPU (cpu.lib), and an assignment of modes to functions (task.modes).
By performing a sequence of transformations, the flow produces the time and energy estimates T and E. Note that, in this context, the mode assignment file is considered constant; this constraint is removed later, in the exploration part.

Figure 26: Simplified SWAT estimation flow

The transformations performed by the tools of the flow have been grouped into four phases, indicated by numbered black boxes, and are detailed in the following.
1) Front-End. This phase compiles each source file into architecture-independent LLVM assembly code, which is then used to build a model (*.bbmodel) of each basic block consisting of the list of op-codes, the functions called, the size, the execution time in clock cycles, and the effective capacitance. The data for timing and energy characterization is in the target CPU library (cpu.lib), which is the result of the processor characterization.
2) Instrumentation. Instrumentation is performed by first enriching each basic block of the LLVM code with all the relevant figures in the form of a special comment (meta-instrumentation), then by translating the comments into actual calls to tracing functions, based on expansion rules collected in an instrumentation library. The output of this phase is a new, instrumented, LLVM assembly file (*.i.ll).
3) Back-End. The back-end of the SWAT flow performs two main operations. First, it translates all the instrumented LLVM files into host assembly code, which is then assembled and linked into an executable program. Second, it runs the executable and collects the execution trace (bbtrace), consisting of a list of the identifiers of the basic blocks that have been executed.
4) Post-Processing. The post-processing phase analyzes the execution trace and combines the dynamic information with the static cost models, accounting for the operating modes specified in the allocation file (task.modes). This produces the total time and energy of the specific run of the task.
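The post-processing phase can be pictured with the following minimal C sketch. It is not the SWAT post-processor: the data structures and names are hypothetical, it handles explicit mode assignments only (no force or inherit, no mode-switching cost), and it assumes the usual dynamic-energy relation E = Ceff * V^2 per executed basic block, which is one plausible reading of the effective-capacitance characterization described earlier. Time is accumulated as clock cycles divided by the frequency of the assigned mode.

```c
#include <stddef.h>

/* Operating mode: supply voltage and clock frequency. */
typedef struct {
    double volt;
    double freq_hz;
} op_mode_t;

/* Static cost model of one basic block (hypothetical layout). */
typedef struct {
    int func;      /* index of the function the block belongs to */
    double cycles; /* execution time in clock cycles */
    double ceff_f; /* effective capacitance in farads */
} bb_model_t;

typedef struct {
    double time_s;
    double energy_j;
} estimate_t;

/* Walks the execution trace (a list of basic-block identifiers) and
   accumulates time and energy under the given function-to-mode assignment.
   Assumption: energy per block = ceff * V^2, time = cycles / f. */
static estimate_t post_process(const int *trace, size_t trace_len,
                               const bb_model_t *bb, const int *func_mode,
                               const op_mode_t *modes)
{
    estimate_t r = { 0.0, 0.0 };
    for (size_t i = 0; i < trace_len; i++) {
        const bb_model_t *b = &bb[trace[i]];
        const op_mode_t *m = &modes[func_mode[b->func]];
        r.time_s += b->cycles / m->freq_hz;
        r.energy_j += b->ceff_f * m->volt * m->volt;
    }
    return r;
}
```

Because the trace and the block models do not depend on the assignment, the same arrays can be re-evaluated with a different func_mode array without recompiling or rerunning the task.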

This estimation flow is then combined with the optimization engine MOST, which performs design space exploration over the possible mode assignments. It is worth noting that steps 1 to 3 of the estimation flow need not be repeated for each assignment. They are, in fact, performed only once, with the goal of producing an execution trace and a set of cost models. Steps 1 to 3 (and in particular step 3, which involves task execution) are much more time-consuming than the post-processing phase alone. The proposed tool-chain is thus efficient enough to enable design space exploration with simulation-in-the-loop.

Figure 27: SWAT/MOST optimization flow

The optimization flow, sketched in Figure 27, is built around the design exploration engine MOST.
5) Design space exploration. The tool requires a configuration file (task.dse) specifying the parameters and the values each parameter can assume. In our case the parameters are the modes of each function, and the values are integers in the range between 1 and N corresponding to the target processor modes. Based on this configuration, the DSE engine generates a specific mode assignment (task.modes), which is fed as input, together with the execution trace and the basic block models, to the SWAT post-processor. The execution time and energy estimated by SWAT are used by MOST to select a new, potentially better, assignment. This loop is repeated until a sub-optimal assignment (task.opt.modes) is found.
6) Code augmentation. Using a set of predefined macros and a simple code generator, this tool adds to the original source code, at the beginning and at the end of each function, the code that performs mode switching. These inserted calls are the lightweight instantiation of the RRM.

To clarify the code augmentation step, a simple example of how it has been implemented is presented in the following.

The code generation is the last step of the optimization flow. Its goal is to augment the original source code with calls to suitable, user-definable APIs that change the operating mode of the processor on entry to and/or on exit from a function. This process requires adding two macros, one at the beginning and one at the exit of each function that is considered in the exploration process. Considering, for example, a function foo(), and assuming that the function has a single exit point, the only task left to the programmer is to add these two macros to the function definition. The expansion of the two macros generates new code that depends in turn on other macros whose names are built from the function name passed as argument, which in our example would be VFS_FMODE_foo.

Since functions can not only be assigned explicit modes, but can also be defined as forcing or inheriting the operating mode of caller/callee, it is necessary to implement a sort of mode stack in which to save the mode of the current function before entering one of its callees and from which to restore this mode on exit. Rather than using a separate stack, our mechanism is based on four entities, namely:
- A global variable vfs_cm storing the current mode. The variable is static in a support library that needs to be compiled along with the application.
- A global variable vfs_fm indicating whether the current mode is being forced or is explicit/inherited. This variable is also static in the support library.
- A variable vfs_sm, local to each function, holding the saved mode, i.e. the current mode upon function entry.
- A macro VFS_SET_MODE(m) that is platform-specific and is expanded to a call to the suitable function exposed by the target API and responsible for changing the operating mode. Such a function will usually write the relevant CPU registers.
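The four entities listed above can be put together as in the following minimal C sketch. It is an illustration under stated assumptions, not the macros generated by the tool-chain: the macro names VFS_ENTER and VFS_EXIT, the per-function VFS_FKIND_* macros, the kind encoding, and the extra local variable vfs_sf (saved force flag) are hypothetical, and VFS_SET_MODE is replaced by a stub that only records the mode. Only vfs_cm, vfs_fm, vfs_sm, VFS_SET_MODE(m), and the VFS_FMODE_ naming scheme come from the description above.

```c
/* Kinds of mode assignment (hypothetical encoding). */
#define VFS_KIND_EXPLICIT 0
#define VFS_KIND_FORCE    1
#define VFS_KIND_INHERIT  2

static int vfs_cm = 1;       /* current operating mode */
static int vfs_fm = 0;       /* nonzero while some caller forces its mode */
static int vfs_switches = 0; /* number of mode switches, illustration only */

/* Platform-specific in the report; here a stub that records the mode. */
#define VFS_SET_MODE(m) do { vfs_cm = (m); vfs_switches++; } while (0)

/* Entry macro (hypothetical name): saves the current mode in the local
   vfs_sm, then switches unless a caller is forcing or the function
   inherits. The conditions on VFS_FKIND_* are compile-time constants,
   so the untaken branch is dead code. */
#define VFS_ENTER(f)                                              \
    int vfs_sm = vfs_cm;                                          \
    int vfs_sf = vfs_fm;                                          \
    if (!vfs_fm && VFS_FKIND_##f != VFS_KIND_INHERIT) {           \
        if (VFS_FMODE_##f != vfs_cm) VFS_SET_MODE(VFS_FMODE_##f); \
        if (VFS_FKIND_##f == VFS_KIND_FORCE) vfs_fm = 1;          \
    }

/* Exit macro (hypothetical name): restores force flag and saved mode. */
#define VFS_EXIT(f)                                  \
    do {                                             \
        vfs_fm = vfs_sf;                             \
        if (vfs_cm != vfs_sm) VFS_SET_MODE(vfs_sm);  \
    } while (0)

/* Example assignment: f1 explicit mode 2, f2 forces mode 1,
   f3 explicit mode 3, f4 inherits (its mode value is unused). */
#define VFS_FKIND_f1 VFS_KIND_EXPLICIT
#define VFS_FMODE_f1 2
#define VFS_FKIND_f2 VFS_KIND_FORCE
#define VFS_FMODE_f2 1
#define VFS_FKIND_f3 VFS_KIND_EXPLICIT
#define VFS_FMODE_f3 3
#define VFS_FKIND_f4 VFS_KIND_INHERIT
#define VFS_FMODE_f4 0

static int mode_seen_in_f1; /* mode in effect while f1() runs */
static int mode_seen_in_f4; /* mode in effect while f4() runs */

static void f1(void) { VFS_ENTER(f1) mode_seen_in_f1 = vfs_cm; VFS_EXIT(f1); }
static void f4(void) { VFS_ENTER(f4) mode_seen_in_f4 = vfs_cm; VFS_EXIT(f4); }
static void f2(void) { VFS_ENTER(f2) f1(); VFS_EXIT(f2); }
static void f3(void) { VFS_ENTER(f3) f1(); f4(); VFS_EXIT(f3); }
```

In this sketch, f1() runs in mode 1 when called from the forcing f2() and in its own mode 2 when called from f3(), while the inheriting f4() runs in the mode of f3(). The saved values live in each function's activation frame, so no separate mode stack is needed.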

Thanks to the local variable added to each function, a separate stack for modes is not necessary, as it is distributed across the activation frames of the functions themselves. Using these variables and exploiting the macros, the code of the original function is transformed into that shown hereafter. It is worth noting that, since macros are expanded before compile time, only one of the branches of the conditional constructs in the preamble will actually be compiled, the other being dead code. The mechanism does introduce some overhead, but the macro-based approach keeps it to a minimum. Although the methodology has so far been tested only on the power states of the software processor, it can be applied, with simple extensions, to the state of peripherals in different voltage and frequency islands.

Experimental Results

The experimental results presented hereafter refer to the STMicroelectronics ultra-low-power ReISC core. Due to the public nature of this document, the energy and timing figures appearing in the graphs of this section have been scaled by a constant factor in order not to disclose proprietary information. The ReISC core considered in this work provides dynamic voltage and frequency scaling capabilities over three different modes.

In order to validate the approach on a large number of differently structured tasks, synthetic code has been used. To this purpose, a parametric tool for code generation has been developed. It can generate random programs based on the parameters summarized in the following table, along with the ranges used to generate the specific tasks for which results are reported.

Let us start by considering a simple example with only three functions. In this case there are 5^3=125 possible assignments. Since the exploration engine does not perform an exhaustive analysis, far fewer assignments have been generated, as shown in the plot of Figure 28.

Figure 28: Energy and Time for a 3-functions task

As can be noted, a reduction of the execution time corresponds to an increase in the energy consumption. In this example the execution time constraint was set to 1.65us. The solution found, highlighted in the plot, is characterized by an execution time of 1.626us and an energy consumption of 412nJ. This corresponds to an average power consumption of 253uW, as shown in the plot of Figure 29.

Figure 29: Average Power for a 3-functions task



Finally, the results obtained by applying the proposed optimization methodology to a set of 33 randomly generated tasks are reported in Figure 30. The plot compares the energy consumption of the optimized tasks (black bars) with that obtained by keeping the system either in the highest voltage/frequency mode (white bars) or in its deepest low-power mode. It must be noted that the optimized tasks (and of course the tasks run in full active mode) do respect their deadlines, while the tasks run in the lowest power mode do not. The energy gains obtained by the mode allocation technique with respect to the full active mode of the processor are shown in Figure 31, where a maximum energy saving of 29.4% can be observed. The average gain for the test cases considered is 20.1%.

Figure 30: Absolute energy consumption comparison

Figure 31: Energy consumption gain w.r.t. full voltage and frequency mode
