COdesign and power Management in PLatform-based design space EXploration. Preliminary report on run-time management


FP7-ICT (247999) COMPLEX
COdesign and power Management in PLatform-based design space EXploration

Project Duration:
Type: IP
WP no.: WP3
Deliverable no.: D3.5.1
Lead participant: IMEC
Prepared by: Chantal Ykman-Couvreur (IMEC), Sven Rosinger, Kai Hylla (OFFIS), Gianluca Palermo (POLIMI)
Issued by: IMEC
Document Number/Rev.: COMPLEX/IMEC/R/D3.5.1/1.0
Classification: COMPLEX
Submission Date:
Due Date:

Project co-funded by the European Commission within the Seventh Framework Programme ( ).

Copyright 2010 OFFIS e.V., STMicroelectronics srl., STMicroelectronics Beijing R&D Inc, Thales Communications SA, GMV Aerospace and Defence SA, SNPS Belgium NV, EDALab srl, Magillem Design Services SAS, Politecnico di Milano, Universidad de Cantabria, Politecnico di Torino, Interuniversitair Micro-Electronica Centrum vzw, European Electronic Chips & Systems design Initiative.

This document may be copied freely for use in the public domain. Sections of it may be copied provided that acknowledgement is given of this original work. No responsibility is assumed by COMPLEX or its members for any application or design, nor for any infringements of patents or rights of others which may result from the use of this document.

History of Changes

ED.     REV.    DATE    PAGES   REASON FOR CHANGES
IMEC                            Initial version

Table of Contents

1  Executive summary
2  Abbreviations and glossary
3  Introduction
4  RRM architecture
   4.1  Interface between GRM and application
   4.2  Interface between GRM and user
   4.3  Interface between GRM and platform
5  GRM databases
   5.1  Platform information database
   5.2  Application and job information databases
6  Interaction between applications and GRM
7  GRM run-time optimization heuristics
   7.1  QoS-aware run-time application scheduling
        Previous work
        Summary of our application scheduling
8  Power management of custom HW blocks
   8.1  TLM2-based interface to the LRM
        Register Desired
        Register Recent
        Register Status
9  Sample RRM in COMPLEX use case 1
10 Sample RRM in COMPLEX use case 2
   Application and application modes
   QoS options
11 GRM efficiency analysis
References

1 Executive summary

This deliverable is the first report of Task 3.5, dealing with Run-Time Resource Management (RRM). The task is coordinated by IMEC and also involves POLIMI and OFFIS. It started at month M7 and will end at month M30. The goals of Task 3.5 are to develop a lightweight architecture for RRM in tightly constrained systems, and sample run-time resource managers for both COMPLEX use cases 1 and 2. In addition to the RRM architecture, Task 3.5 will also develop services and optimization heuristics to be supported by the RRM, to alleviate the burden of the application programmer.

Figure 1: COMPLEX design flow (design entry, estimation and model generation, simulation, and exploration and optimization)

This RRM corresponds to the pre-optimized power controller (m) in the COMPLEX design flow illustrated in Figure 1. For more details, see the COMPLEX Description of Work [1].

The goal of this first deliverable is to provide a preliminary vision of the work to be performed in Task 3.5. The content of this deliverable is organized as follows:

- Section 2 defines the abbreviations used in the deliverable.
- Section 3 overviews the challenges to be fulfilled by the RRM in future embedded computing.
- Section 4 gives a preliminary description of the RRM architecture being developed in Task T3.5 and of the RRM interfaces with the applications, the user, and the platform. This RRM follows a distributed and hierarchical approach: it consists of a Global Resource Manager (GRM) at the platform level, communicating with Local Resource Managers (LRMs) at the IP core level. To alleviate the run-time decision making, a design-time exploration is performed. The resulting relevant information is stored in GRM databases, characterized in Section 5.
- Section 6 describes the list of commands that allow the GRM to interact with the applications.
- Section 7 describes the run-time optimization heuristic that will be developed and integrated in the GRM. This heuristic selects an application configuration in order to maximize its Quality of Service (QoS) while meeting its deadline and respecting the energy budget of the platform and the target platform autonomy.
- Section 8 describes the power controller of HW blocks to be used as LRM.
- Sections 9 and 10 introduce the sample RRMs for COMPLEX use cases 1 and 2, respectively.
- Finally, Section 11 presents how the GRM efficiency will be analyzed.

2 Abbreviations and glossary

The table below lists the abbreviations used in the deliverable, together with their definitions.

DPM     Dynamic Power Management
DTD     Document Type Definition
DVFS    Dynamic Voltage and Frequency Scaling
GRM     Global Resource Manager
HW      Hardware
IP      Intellectual Property
LRM     Local Resource Manager
QoE     Quality of user Experience
QoS     Quality of Service
RRM     Run-time Resource Management

Some relevant terms used in the deliverable are also briefly described in the following. The application functionality is specified at different granularity levels: (1) the application is organized into application modes, each one specifying a different subset of functionalities; (2) each application mode consists of communicating jobs, each one mapped entirely on one IP core; (3) each job can consist of communicating tasks, all of them running on the same IP core. For more details, see Section 4.

3 Introduction

To address the challenges introduced by future embedded computing, a generic and structured architecture for RRM of embedded multi-core platforms is refined in Task T3.5. This RRM needs to provide the following features:

- First, the RRM has to support a variety of applications: mobile communications, networking, automotive and avionic applications, multimedia in the automobile, and Internet interfaced with many embedded control systems. These applications may run concurrently, and start and stop at any time. Each application may have multiple configurations, with different constraints imposed by the external world or the user (deadlines and quality requirements, such as audio and video quality, or output accuracy), different usages of various types of platform resources (processing elements, memories, and communication bandwidth), and different costs (performance, power consumption).
- Second, the RRM should support a holistic view of the platform resources. This is needed for global resource allocation decisions optimizing a utility function (also called Quality of user Experience, QoE), given the available platform resources. This QoE allows a trade-off, negotiated with the user, between diverse QoS requirements and costs. E.g., in the COMPLEX use case 2, this QoE should enable careful management of the energy stored in the battery.
- Third, the RRM should transparently optimize the platform resource usage and the application mapping on the platform. This is needed to facilitate the application development and to manage the QoS requirements without rewriting the application.
- Next, the RRM should dynamically adapt to a changing context. This is needed to achieve high efficiency under a changing environment. QoS requirements and platform resources must be scaled dynamically (e.g., by adjusting the clock frequencies and voltages, or by switching off some functions) in order to control the energy/power consumption and the heat dissipation of the platform.
- Finally, the RRM should allow different heuristics (e.g., for platform resource allocation and task scheduling), since a single heuristic cannot be expected to fit all application domains and optimization goals.

Also, the software development productivity is of paramount importance. To address this challenge, and to facilitate the RRM implementation, a generic and structured architecture for the RRM is required. It should be valid for any design flow, for any target platform, and for any application domain. The development of an RRM architecture is the first goal of Task T3.5.

Nevertheless, since the RRM is intended for embedded platforms, only a lightweight implementation is acceptable. To address this challenge:

- The RRM should interface with design-time exploration to alleviate its run-time decision making. This is the goal of the RRM interface with the tool developed in Task T3.4.
- The RRM implementation should be instantiated from the RRM architecture, based on the target platform and the application domain.

The development of sample run-time resource managers for both COMPLEX use cases 1 and 2 is the second goal of Task 3.5, in collaboration with Task 4.1.

4 RRM architecture

Figure 2: Distributed and hierarchical RRM approach

Figure 3: GRM architecture

The RRM architecture follows a distributed and hierarchical approach (see Figure 2):

- On the one hand, the GRM is loaded on the host processor of the platform. It is a software task, specified in C, running in parallel with the application. It is a middleware providing a bridge between the application, the user, and the platform. It conforms to the practices of the LRM in each Intellectual Property (IP) core (e.g., ASIC, FPGA, multi-CPU). It is used to adapt both platform and applications at run time and to find global and optimal trade-offs in application mapping based on a given optimization goal. A detailed view of the GRM is depicted in Figure 3.
- On the other hand, each IP core can execute its own resource management without any restriction, through an LRM. Such an LRM encapsulates the local policies and mechanisms used to initiate, monitor, and control computation on its IP core.

The following terminology and assumptions are used. Ideally, for any application, all functionalities should be accessible at any time. However, based on the user requirements, the available platform resources, the limited energy/power budget of the platform, and the target platform autonomy, it may not be possible to integrate all these functionalities on the platform at the same time. Hence the application developer has to organize the application into application modes, each one specifying a different subset of functionalities. The application, within a selected mode, consists of jobs communicating with each other through inter-job channels, where:

- One job is mapped entirely on one IP core. Whereas the functional specification of a job is fixed, there may be several specific algorithms or implementations for a given job. Also, a job implementation can take several forms (fixed logic, configurable logic, software) and offer different characteristics. These job configurations with associated meta-data (e.g., QoS, platform resource usage, costs) are provided at design time, and structured and stored in the Job information database of the GRM to enable fast exploration during run-time decisions.
- A job can consist of multiple tasks communicating with each other, all of them running on the same IP core.

To conform to the hierarchical approach of the RRM, jobs and the communication between them are managed at the platform level by the GRM, whereas tasks and the communication between them are managed at the IP core level by the LRM. In COMPLEX, the considered IP cores are either processors or custom HW blocks.

In contrast to the collaboration between the GRM and the IP cores, the GRM collaboration with the application and the user is visible to the application developer and is performed as follows:

- The QoS requirements and the optimization goal are defined through the QoE manager. This goal is translated into an abstract and mathematical function, called utility function (e.g., performance, power consumption, battery life, QoS, or a weighted combination of them).
- The GRM manages and optimizes the application mapping (i.e., job configuration selection, job scheduling, allocation, and binding), taking into account the possible job configurations, the available platform resources, the QoS requirements, the application constraints, and the utility function.

To provide a bridge between the application, the user, and the platform, generic services should be supported by the GRM. Among them, we distinguish between services called by the GRM, which are automated, and services called by the application, which are controlled by the application developer. These latter services relate to job execution (start, stop, resume, kill, synchronize, wait, switching point), message exchanges, event recognition and handling, timer interrupts, and shared memory access. In Task T3.5, we focus on the services called by the GRM, mainly on platform resource management and on QoE management. These services are classified into managers to structure the interface between the GRM and the application, the user, and the platform, respectively.

4.1 Interface between GRM and application

The interface with the application is provided by three managers: the application, job, and inter-job channel managers. Their goal is to enable a holistic view of the platform resources, a dynamic adaptation to changing context, and a transparent optimization of platform resource usage.

The application manager is mainly responsible for loading the application configurations and adequately setting up the platform at application initialization. It provides the following main services:

- ConfigureApplication loads the application configurations (e.g., the application modes, the jobs, the inter-job channels, and the constraints such as deadlines) into the Application information database of the GRM (see Section 5.2). These configurations are provided at design time in an XML file. This service also asks both the job and inter-job channel managers to load the job and inter-job channel configurations. The pseudo code of ConfigureApplication is as follows:

  ConfigureApplication(appl_filename) {
    ParseApplication(appl_filename);
    for each job {
      ConfigureJob(job_filename);
    }
    ConfigureInterJobChannels(interjobchannel_filename);
  }

  Listing 1: Pseudo code of ConfigureApplication

- SetPlatform sets up the parameters of the needed platform components at application initialization.
- GenerateException detects any malfunctioning due to failing running jobs and generates an exception to the application.

Figure 4: Job configuration switching mechanism

The job manager is mainly responsible: (1) for loading the available configurations of all application jobs; (2) for selecting an optimal configuration for each active job of the applications, based on the utility function and the optimization heuristic provided by the QoE manager in coordination with the user; and (3) for communicating its run-time decisions to the concerned IP cores. The job manager provides the following main services:

- ConfigureJob loads into the Job information database of the GRM (see Section 5.2) all available configurations of the job on the platform. These configurations are provided at design time in an XML file. This Job information database enables fast exploration in SelectJobConfigurations.
- SelectJobConfigurations is called whenever the environment is changing (e.g., when the user changes his requirements, when the battery level becomes too low, or when some slack time becomes available). This service selects at run time application modes and optimal configurations for each job within the selected application modes, taking into account the available platform resources, the QoS requirements, and the application constraints, and optimizing the utility function through the selected optimization heuristic. Different optimization heuristics can be enabled and the best-suited one is selected by the user (see Section 7).
- Environment changes may give rise to a reselection of application modes and job configurations. One main issue is JobConfigurationSwitching, which must be seamless. A smooth transition from the current configuration to the new one is provided as follows (see Figure 4):
  o On the one hand, the GRM can signal the concerned application at any time when a job configuration switching is requested.
  o On the other hand, whenever the application reaches a job switching point, identified by a JobSwitchingPoint (JSP) call inside its code, the application checks whether a switching is requested. If yes, the application enters an interrupted state and transfers all relevant state information to the GRM.
  o The GRM communicates the newly selected job configuration and the related received state information to the concerned IP core.
  o The communication of run-time decisions to the concerned IP cores is performed in collaboration with the services SendExecuteRequest and SendTerminateRequest in the IP core manager.

The inter-job channel manager is responsible for loading the inter-job channel configurations, through the service ConfigureInterJobChannels. Also, once the GRM has selected the right configuration for each active job, it has to determine the communication resources with respect to the inter-job channels of the application. This is performed through the service SelectRoutingPaths.

4.2 Interface between GRM and user

The interface with the user (or any external entity accessing the application specification) is provided by the QoE manager. The QoE is a subjective measure of the application value from the user perspective. It is influenced by the user terminal device (e.g., low- or high-definition TV), his environment (e.g., in the car or at home), his expectations, and the nature of the content and its importance (e.g., a simple yes/no message or an orchestral concert).

Changes in user preferences may involve (re)negotiation between the user and the QoE manager. Indeed, the platform resources may not be sufficient to provide the desired QoS to the application. The user needs a simple way to communicate with the QoE manager in order to control and customize the QoS of his application. To help the negotiation, in collaboration with ResourceMonitoring in the platform manager, the services ProvideJobInfo and ProvidePlatformInfo provide information about active jobs and used platform resources.

The negotiation between the user and the QoE manager involves the selection of the utility function and of the optimization heuristic:

- The utility function models in an abstract and mathematical way the user benefit for the application. It allows a trade-off between diverse QoS requirements and costs. Examples of utility functions are: performance of the application, energy/power consumption of the platform, battery life, revenue if the user has to pay for the application, fair sharing of platform resources, and weighted combinations of them. Once selected, the utility function is applied to each application and job configuration to derive its user value. This is performed by the service DeriveUserValues. The utility function is then optimized by the GRM in the service SelectJobConfigurations of the job manager.
- The selection of the optimization heuristic makes it possible to fit the current application domain and optimization goal.
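To make the role of the utility function more concrete, the following C++ sketch shows how a service such as DeriveUserValues could apply a weighted utility function to a set of job configurations. The JobConfig structure, the weights, and the linear form of the utility are assumptions introduced for illustration only; the actual GRM services are specified in C, and their data are kept in the GRM databases of Section 5.

  #include <vector>

  // Illustrative sketch only: a weighted utility function applied to job
  // configurations, as DeriveUserValues could do. Field names, weights, and
  // the linear form are assumptions, not part of the COMPLEX specification.
  struct JobConfig {
      double qos;         // QoS offered by this configuration
      double energy;      // average energy consumption
      double user_value;  // derived user value, filled in below
  };

  // Apply the negotiated utility function to every configuration.
  void DeriveUserValues(std::vector<JobConfig>& configs,
                        double qos_weight, double energy_weight)
  {
      for (JobConfig& c : configs) {
          // Higher QoS increases the user value; energy consumption decreases it.
          c.user_value = qos_weight * c.qos - energy_weight * c.energy;
      }
  }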

4.3 Interface between GRM and platform

The interface with the platform is provided by three managers: the platform, IP core, and routing path managers.

The platform manager provides the following main services:

- ConfigurePlatform loads the platform configuration into the Platform information database of the GRM (see Section 5).
- ResourceMonitoring consists of run-time profiling to monitor the platform resource usage, the execution time, the energy/power consumption, the battery level, and the slack time. Run-time profiling is also needed to detect any platform overload, with respect to resource usage, die temperature, and cooling capacity.

The IP core manager is responsible for the interface between the GRM and the IP cores. It supports services related to requests about settings and job execution on the corresponding IP core, resulting from run-time decisions. The main provided services are SendExecuteRequest and SendTerminateRequest, where the communication with the IP cores has to conform to their respective practices. If the power mode of the IP core is selected at the job level, the pseudo code of SendExecuteRequest is as follows:

  SendExecuteRequest(job_config) {
    ip_core = job_config->ip_core;
    job_binary = job_config->job_binary;
    power_mode = job_config->power_mode;
    ip_core->loadbinary(job_binary); // optionally
    ip_core->switchtopowermode(power_mode);
  }

  Listing 2: Pseudo code of SendExecuteRequest

where all data related to ip_core and job_config are accessed through the GRM databases (see Section 5). Calls to IP core services are blocking calls and wait for confirmation of successful execution. If the IP core is an HW block, the functionality of switchtopowermode is characterized in Task 3.3. In COMPLEX, the interface between the IP core manager and the custom HW blocks of the platform is coordinated with Task T2.4. The interface between the IP core manager and the processors is coordinated with the platform provider. For the COMPLEX use cases 1 and 2, it is provided in Task T4.1.

The routing path manager is responsible for atomically establishing sets of routing paths, globally optimizing the usage of the communication infrastructure, and enabling dynamic bandwidth allocation. The functionality of this manager is based on the decisions made by the service SelectRoutingPaths in the inter-job channel manager.

5 GRM databases

5.1 Platform information database

The platform configuration loaded into the Platform information database of the GRM consists of the following information:

- A system-level description of the platform, provided by the platform provider: number and type of IP cores, memory architecture, communication infrastructure and bandwidth.
- The energy budget of the battery and the target platform autonomy.
- For each IP core of the platform:
  o The power mode table, as illustrated in Table 1. The column Power mode transitions consists of an arbitrary number of power mode transitions, each one being characterized by the new power mode ID, the switching time, and the switching power consumption.
  o The available low-power states (e.g., idle, sleep, deep-sleep, doze) and their characteristics.
  o Services such as loadbinary (if the IP core is a processor) and switchtopowermode.

Power mode ID | Supply voltage | Clock frequency | Average dynamic power | Average leakage | Power mode transitions

Table 1: Power mode table per IP core

The power mode table per IP core, as shown in Table 1, will be automatically generated by the characterization tools of the COMPLEX flow for the custom HW blocks, or will be provided by the platform provider based on data sheets for the processors.
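As an illustration of what the GRM could hold internally for each IP core, the following C++ sketch mirrors the columns of Table 1. The struct layout, field names, and units are assumptions made for this sketch only; the exchange format actually proposed in COMPLEX is the XML DTD introduced below.

  #include <string>
  #include <vector>

  // Illustrative in-memory representation of one power mode table (Table 1),
  // as it could be stored in the Platform information database. Assumed layout.
  struct PowerModeTransition {
      int    target_pm_id;      // ID of the new power mode
      double switching_time_s;  // switching time, in seconds
      double switching_power_w; // switching power consumption, in watts
  };

  struct PowerMode {
      int         pm_id;
      std::string name;
      double      supply_voltage_v;
      double      clock_frequency_hz;
      double      avg_dynamic_power_w;
      double      avg_leakage_w;
      std::vector<PowerModeTransition> transitions; // column "Power mode transitions"
  };

  struct PowerModeTable {
      std::string            design_entity; // IP core this table belongs to
      std::vector<PowerMode> modes;
  };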

Since the information in the power mode tables must be read and loaded into the Platform information database of the GRM, the following XML document type definition (DTD) is proposed:

  <?xml version="1.0" encoding="utf-8"?>
  <!-- DTD for power mode table and power mode transitions -->
  <!-- List of all existing power modes -->
  <!ELEMENT Power_mode_table (Power_mode)*>
  <!-- A power mode is defined by its state characteristics and all possible transitions to other power modes -->
  <!ELEMENT Power_mode (Name,Comment,Supply_voltage,Clock_frequency,Average_dynamic_power,Average_leakage,Power_mode_transition*)>
  <!-- A power mode transition is defined as follows -->
  <!ELEMENT Power_mode_transition (Name,Comment,Switching_time,Switching_power_consumption)>
  <!-- Power mode description -->
  <!ELEMENT Name (#PCDATA)>
  <!ELEMENT Comment (#PCDATA)>
  <!ELEMENT Supply_voltage (#PCDATA)>
  <!ELEMENT Clock_frequency (#PCDATA)>
  <!ELEMENT Average_dynamic_power (#PCDATA)>
  <!ELEMENT Average_leakage (#PCDATA)>
  <!ELEMENT Switching_time (#PCDATA)>
  <!ELEMENT Switching_power_consumption (#PCDATA)>
  <!-- Power_mode and Power_mode_transition get a unique ID -->
  <!ATTLIST Power_mode_table design_entity CDATA #REQUIRED>
  <!ATTLIST Power_mode pm_id ID #REQUIRED>
  <!ATTLIST Power_mode_transition ref_to_pm_id IDREF #REQUIRED>
  <!ATTLIST Supply_voltage unit (V|mV) "V">
  <!ATTLIST Clock_frequency unit (Hz|KHz|MHz|GHz) "Hz">
  <!ATTLIST Average_dynamic_power unit (W|mW|uW|nW|pW|fW) "W">
  <!ATTLIST Average_leakage unit (W|mW|uW|nW|pW|fW) "W">
  <!ATTLIST Switching_time unit (S|ms|us|ns|ps|fs) "S">
  <!ATTLIST Switching_power_consumption unit (W|mW|uW|nW|pW|fW) "W">

  Listing 3: DTD of the power mode table

The DTD of Listing 3 is organized as follows:

- A top-level Power_mode_table element is defined, containing an arbitrary number of Power_modes.
- Each power mode is then defined by a name, a comment, a supply voltage, a clock frequency, an average dynamic power dissipation, and an average leakage, as well as an arbitrary number of power mode transitions.
- Each power mode transition within a power mode is linked to a target power mode (ref_to_pm_id) by an IDREF attribute. This guarantees the referential integrity of the XML file.
- Further, all possible units for voltage, clock frequency, time, and power consumption are included.

A simple example XML file based on the DTD, describing two power modes and the corresponding power mode transitions, is shown below:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE Power_mode_table SYSTEM "powermodetable.dtd">
  <!-- generated power mode table and power mode transitions -->
  <Power_mode_table design_entity="asic_dct">
    <Power_mode pm_id="1">
      <Name>Active</Name>
      <Comment>Regular operation at nominal voltage</Comment>
      <Supply_voltage unit="V">1.2</Supply_voltage>
      <Clock_frequency unit="MHz">300</Clock_frequency>
      <Average_dynamic_power unit="mW">120</Average_dynamic_power>
      <Average_leakage unit="mW">100</Average_leakage>
      <Power_mode_transition ref_to_pm_id="2">
        <Name>Power down</Name>
        <Comment>Component is put into power-gated power mode</Comment>
        <Switching_time unit="ms">0.1</Switching_time>
        <Switching_power_consumption unit="mW">20</Switching_power_consumption>
      </Power_mode_transition>
    </Power_mode>
    <Power_mode pm_id="2">
      <Name>Power gated</Name>
      <Comment>Power gating is applied</Comment>
      <Supply_voltage unit="V">0.0</Supply_voltage>
      <Clock_frequency unit="MHz">0</Clock_frequency>
      <Average_dynamic_power unit="mW">0</Average_dynamic_power>
      <Average_leakage unit="mW">5</Average_leakage>
      <Power_mode_transition ref_to_pm_id="1">
        <Name>Wake up</Name>
        <Comment>Component is put into active power mode</Comment>
        <Switching_time unit="ms">2</Switching_time>
        <Switching_power_consumption unit="mW">200</Switching_power_consumption>
      </Power_mode_transition>
    </Power_mode>
  </Power_mode_table>

  Listing 4: Example XML file of a power mode table

5.2 Application and job information databases

As mentioned in Section 4, ideally, for any application, all functionalities should be accessible at any time. However, based on the user requirements, the available platform resources, the limited energy/power budget of the platform, and the target platform autonomy, it may not be possible to integrate all these functionalities on the platform at the same time. Hence the application developer has to organize the application into application modes, each one specifying a different subset of functionalities. The application, within a selected mode, consists of jobs communicating with each other through inter-job channels, as follows:

- One job is mapped entirely on one IP core.
- Whereas the functional specification of a job is fixed, there may be several specific algorithms or implementations for a given job. Also, a job implementation can take several forms (fixed logic, configurable logic, software) and offer different characteristics.
- A job can consist of multiple tasks communicating with each other.

Hence, in the GRM databases, any application is characterized hierarchically as follows:

- At the top level, an application is characterized by:
  o Its deadline.
  o Its priority.
  o The set of possible QoS, and the minimum required QoS.
  o The set of possible application modes.
  o The set of available application configurations. These application configurations can be generated by a design-time exploration, such as the one developed in Task T3.4.
- An application mode is characterized by:
  o Its set of jobs.
  o Its set of inter-job channels.
- An application configuration specifies an application mapping on the platform. It is characterized by:
  o Its QoS.
  o Its application mode.
  o One configuration for each job and each inter-job channel characterizing the application mode.
  o Its average execution time and energy consumption, its user value, and the deadline of each job.
- A job configuration specifies a job mapping on an IP core. It is characterized by:
  o The IP core where the job should be executed.
  o The job binary and its location in the memory hierarchy of the platform, if the IP core is a processor and if the job binary has to be loaded at run time (through the service loadbinary of the IP core).
  o Its QoS.

  o Its average execution time and energy consumption, and a (supply voltage, clock frequency) assignment to each task of the job.

All application configurations are loaded in the Application information database, whereas all job configurations are loaded in the Job information database.
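The hierarchy above could be reflected in the GRM databases with data structures along the following lines. This C++ sketch is illustrative only: the field names and types are assumptions derived from the characterization above, and the GRM itself is specified in C.

  #include <string>
  #include <utility>
  #include <vector>

  // Illustrative sketch of the Application and Job information databases.
  struct JobConfig {
      int         ip_core_id;   // IP core where the job should be executed
      std::string job_binary;   // binary location, if the IP core is a processor
      double      qos;
      double      avg_exec_time;
      double      avg_energy;
      std::vector<std::pair<double, double>> task_vf; // per-task (supply voltage, clock frequency)
  };

  struct ApplicationMode {
      std::vector<int> job_ids;
      std::vector<int> inter_job_channel_ids;
  };

  struct ApplicationConfig {
      double qos;
      int    mode_id;                    // application mode it maps
      std::vector<int> job_config_ids;   // one configuration per job of the mode
      double avg_exec_time;
      double avg_energy;
      double user_value;
      std::vector<double> job_deadlines; // one deadline per job
  };

  struct Application {
      double deadline;
      int    priority;
      double min_qos;
      std::vector<ApplicationMode>   modes;
      std::vector<ApplicationConfig> configs;     // Application information database
      std::vector<JobConfig>         job_configs; // Job information database
  };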

6 Interaction between applications and GRM

As indicated in Section 3, the RRM has to support a variety of applications. These applications may run concurrently, and start and stop at any time. Each application may have multiple configurations, with different constraints imposed by the external world or the user, different usages of various types of platform resources, and different costs.

Both the GRM and the control of the applications are performed at the platform level. However, it is not possible to have several masters at the same time. In our framework, the master has the following responsibilities:

- Get control of the applications.
- Through commands, question the GRM about its run-time decisions.
- Map the applications based on the GRM decisions.
- Take care that no data are lost and that the applications remain in a stable state during reconfiguration.

When only one application runs on the platform, such as in COMPLEX use cases 1 and 2, both the GRM and the control of the application run on the host processor of the platform. To avoid communication overhead, the execution of commands sent to the GRM is integrated in the application flow. The master is called the central manager, as illustrated in Figure 7.

Among the commands sent to the GRM, we distinguish between: (1) commands sent once at the initialization of the system, having no impact on the run-time overhead of the GRM; and (2) commands frequently sent at run time, allowing interaction with the applications.

The current commands sent once at the initialization of the system are:

- Load the platform, for which the platform configuration file needs to be specified. This command is executed through the service ConfigurePlatform in the platform manager.
- Load the application, for which the application configuration file needs to be specified. This command is executed through the service ConfigureApplication in the application manager.
- Select the options. The execution of this command implies the following actions: (1) select the utility function and the optimization heuristic to be used by the service SelectJobConfigurations in the job manager; (2) run the service DeriveUserValues in the QoE manager to compute the user value of each application and job configuration according to the selected utility function.

The current commands frequently sent at run time are:

- Execute applications. The execution of this command implies the following actions:
  o Run the service SelectJobConfigurations in the job manager to select application modes and optimal configurations for each job within the selected application modes. The selection is performed according to the selected optimization heuristic and utility function.
  o Run the service JobConfigurationSwitching to check whether a job configuration switching is required.
  o Whenever a job configuration switching is required (see Section 4.1), the GRM performs the following actions: inform the concerned application; send a request to the concerned IP core to stop the execution of the current job configuration (by running the service SendTerminateRequest in the IP core manager); and send a request to the newly concerned IP core about the new setting and job configuration (by running the service SendExecuteRequest in the IP core manager).
  o Update the estimation of the energy/power consumption of the platform, according to the selected application configurations.
- Terminate an application. The execution of this command implies the following actions:
  o Send a request to all IP cores executing a job of the application to terminate its execution (by running the service SendTerminateRequest in the IP core manager).
  o Execute again the command Execute applications.
  o Update the estimation of the energy/power consumption of the platform.
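The command flow above can be summarized by the following C++ sketch of the central manager. The GRM interface, the option values, the file names, and the helper hooks are assumptions introduced for illustration; only the command and service names come from Sections 4 and 6.

  #include <string>

  // Assumed, minimal view of the GRM as seen by the central manager; the method
  // names follow the commands and services of Sections 4 and 6, the signatures
  // are illustrative.
  struct GRM {
      virtual void ConfigurePlatform(const std::string& platform_file) = 0;
      virtual void ConfigureApplication(const std::string& appl_file) = 0;
      virtual void SelectOptions(int utility_function, int heuristic) = 0; // also runs DeriveUserValues
      virtual void ExecuteApplications() = 0; // SelectJobConfigurations + switching, if required
      virtual void TerminateApplication() = 0;
      virtual ~GRM() {}
  };

  // Placeholder hooks, to be provided by the integration with the application.
  static bool application_is_running()      { return false; }
  static void wait_for_environment_change() {}

  void central_manager(GRM& grm)
  {
      // Commands sent once at the initialization of the system.
      grm.ConfigurePlatform("platform.xml");       // file name is an assumption
      grm.ConfigureApplication("application.xml"); // file name is an assumption
      grm.SelectOptions(/*utility_function=*/0, /*heuristic=*/0);

      // Commands frequently sent at run time, whenever the environment changes.
      while (application_is_running()) {
          grm.ExecuteApplications();
          wait_for_environment_change();
      }
      grm.TerminateApplication();
  }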

7 GRM run-time optimization heuristics

As explained in Section 4.2, the negotiation between the user and the QoE manager involves the selection of the utility function and of the optimization heuristic. On the one hand, the utility function models in an abstract and mathematical way the optimization goal. On the other hand, a single optimization heuristic cannot be expected to fit all application domains and optimization goals. Hence, the RRM should allow different heuristics. In COMPLEX, we focus on optimization heuristics to be supported by the service SelectJobConfigurations in the job manager.

A first power-aware heuristic, adapted from [3] and described in [4], is already available. It has the following characteristics:

- Applications are dynamically selected and activated by the user.
- The goal is to select one job configuration for each job of the active applications, according to the available platform resources, making sure that one job implementation runs on exactly one IP core, in order to minimize the total power consumption of the platform, while satisfying the job deadlines.

A new QoS-aware heuristic, to be adapted from [5], will be developed and implemented in order to fit the needs of the COMPLEX use case 2. It will have the following characteristics:

- Only one application is mapped on the platform. However, the application mode is dynamically (re)selected.
- The goal will be to select one application configuration, according to the available platform resources (including the energy budget of the platform), also making sure that one job implementation runs on exactly one IP core, in order to maximize the user satisfaction (i.e., the QoS required by the user, measured by the utility function), while satisfying the deadlines and the target platform autonomy.

A preliminary description of this new heuristic is given in the following subsections.

7.1 QoS-aware run-time application scheduling

As mentioned previously, our heuristic can be used when only one application is mapped on the platform. It selects one application configuration, according to the available platform resources (including the energy budget of the platform), making sure that one job implementation runs on exactly one IP core, in order to maximize the QoS required by the user, while satisfying the deadlines and the target platform autonomy. To that end, the QoS must be modelled by the utility function in the GRM (see Section 4.2).

Our heuristic presents the following advantages: (1) it already deals at design time with the complex dynamic behaviour of the application, to reduce the run-time computation efforts; (2) it uses average rather than worst-case execution estimations; (3) run-time decisions are made fast and can be frequently (re)evaluated based on the actual load of the application.

As mentioned in Section 4: (1) the application is organized into application modes; (2) the application, within a selected mode, consists of communicating jobs; and (3) for each job, there may be several specific configurations. Hence many application schedulings can already be explored at design time. The most promising ones are represented in a multi-dimension Pareto set that describes the optimal energy-performance-QoS compromises. This Pareto set is then used by the GRM to optimally select one application scheduling at run time, whenever the environment is changing. The more points in the Pareto set, the better the results of the GRM selection, but at the cost of greater run-time computation complexity and overhead.

With our approach, in any application scheduling, both the supply voltage and the clock frequency of the different IP cores are dynamically adapted, taking the work load variations of the jobs into account: when the work load is low (resp. high), the supply voltage and clock frequency are lowered (resp. raised).

The remainder of this section is organized as follows. The first subsection explains how our heuristic differs from [5]. The second subsection summarizes each step of our heuristic and emphasizes the new extensions and the practical issues.

Previous work

Scheduling an application that consists of communicating jobs on a multi-core platform has to: (1) decide the order in which those jobs are executed; (2) determine on which IP core a job must be executed; (3) determine the supply voltage and the clock frequency for each IP core if DVFS is allowed; and (4) determine when and how to put an IP core in a low-power state. To minimize the energy/power consumption at the system level, two techniques, DVFS and Dynamic Power Management (DPM), can impact the application scheduling:

- Whenever an IP core enters an idle state, DPM either shuts it down completely or swaps it into some low-power state. Since the switching between the normal and the low-power state requires extra time and energy, it can increase the IP core response time undesirably. Hence the key issues of DPM are to decide whether to switch to a low-power state and, if yes, to which one.
- In a CMOS circuit, reducing the IP core speed only, without reducing the supply voltage, does not reduce the energy consumption. DVFS decreases the supply voltage, while running the IP core at the maximum speed this voltage allows.

Whenever both DVFS and DPM are available on the platform, DVFS should be exploited first, since the shutdown techniques appear to be inferior to DVFS when the latter is applied to the entire platform. Nevertheless, the run-time scheduling overhead in terms of code size and energy consumption may be excessive. Hence the most promising approaches combine design-time and run-time scheduling. They have the following advantage: the run-time computation complexity is minimized, which reduces the energy and performance penalty so that faster reaction times can be achieved.

Our scheduling approach presents the following features:

- The concept of application scenario is used, which allows: (1) dealing with complex behaviour at design time already; (2) using average rather than worst-case execution estimations, which are too pessimistic.
- Discrete DVFS is enabled on the IP cores. Supply voltages and clock frequencies are selected from an ordered Pareto set with a limited number of points (after the filtering stage of the design-time scheduling).
- A design-time scheduling is combined with a low-complexity run-time scheduling integrated in the application, as follows:
  o Our design-time scheduling is a design space exploration. For each relevant scenario of the application, it generates a multi-dimension Pareto set of schedulings instead of the two-dimension Pareto curve (energy versus performance) of [5]. Each scheduling is characterized by an optimal energy-performance-QoS compromise and by different (supply voltage, clock frequency) assignments to each code segment of the application (either to each job, or to each task of each job of the application).
  o In contrast to slack-stealing DVFS, the voltage is computed at design time based on the average execution time of each scenario segment.
  o Our run-time scheduling only needs to select a scheduling from the Pareto set.

Besides, our heuristic differs from [5] as follows:

- The approach of [5] was targeted at single-core platforms and aimed at minimizing the energy consumption. In contrast to [5], our approach is targeted at heterogeneous multi-core platforms. It aims at optimizing the QoS of the application under the energy budget of the platform and the target platform autonomy.
- In contrast to [5], supply voltage and clock frequency can be assigned at two granularity levels: either on a job-by-job basis, or on a task-by-task basis. A finer assignment allows a larger reduction of the energy consumption, but at the price of a larger run-time computation overhead.
- Whenever some slack time is available, DPM is used, which is not the case in [5].

Summary of our application scheduling

Our approach consists of the following steps.

Job and task extraction

As mentioned in Section 4, an application consists of communicating jobs, and jobs can consist of communicating tasks. Also, one job has to be mapped entirely on one IP core. Job extraction is derived from the HW/SW partitioning tool developed in WP2. Each job corresponds to a sequential thread of control. In our approach, task extraction is performed for each job as follows:

First, the application is simulated on a host machine to profile the code and generate execution times of the relevant code segments inside each job. Next, each job is partitioned into tasks so that the average execution times of the tasks are of the same order of magnitude. Each resulting task is a basic scheduling unit in our design-time scheduling.

Scenario characterization

Each job exhibits a different dynamic behaviour, dependent on the application mode, the required QoS, and the input streams. Its behaviour yields different execution paths, due to different iteration numbers in loop statements and different branch executions in conditional statements. Hence the execution time and the energy consumption needed to perform the job can vary significantly. Any such job execution path characterizes one scenario of the job. For the design-time scheduling, each scenario is also characterized by the assignment of jobs to IP cores (either processors or custom HW blocks), and by the average execution time and energy consumption of each task of each job on the target IP core for any (supply voltage, clock frequency) combination considered by DVFS.

Design-time scheduling

The design-time scheduling is applied to each possible scenario at the job/task granularity level. In the following description of our heuristic, the task granularity level is assumed. The design-time scheduling takes as input: (1) the task graph restricted to the scenario; (2) the set of (supply voltage, clock frequency) combinations considered by DVFS; (3) the average execution time and energy consumption of each task, as derived in the previous step; (4) the average time to complete a (supply voltage, clock frequency) switch on each IP core of the platform.

Our design-time scheduling is an exploration that generates a set of optimal energy-performance-QoS compromise schedulings for each scenario. Each explored scheduling is also characterized by a (supply voltage, clock frequency) assignment to each task of the scenario. Only schedulings that are better than the other ones in at least one dimension of the design space (i.e., currently the energy consumption, the execution time, and the user value of the scenario) are retained. They are called Pareto points. The resulting set of Pareto points is called the Pareto set. Since the design-time scheduling is performed off-line, as much computation effort as necessary can be spent, provided that it gives a better scheduling result and reduces the computation efforts of the run-time scheduling in the later step. The design-time exploration is performed by Task T3.4.

Next, the huge overall Pareto set is filtered, and points being too close to each other are eliminated. This is required to achieve a low-complexity run-time scheduling. Finally, the Pareto set is sorted according first to the energy consumption, then to the execution time, and finally to the user value.
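As an illustration of the filtering and sorting just described, the following C++ sketch builds a Pareto set from a list of explored schedulings and sorts it by energy, then execution time, then user value. The data structure and the dominance test are assumptions for this sketch; the actual exploration is performed by the tool of Task T3.4, and the elimination of points that are too close to each other is omitted.

  #include <algorithm>
  #include <vector>

  // One explored scheduling: an (energy, time, user value) compromise. Assumed layout.
  struct SchedulingPoint {
      double energy;     // average energy consumption of the scenario
      double exec_time;  // average execution time of the scenario
      double user_value; // value under the selected utility function
  };

  // p dominates q if p is no worse in every dimension and strictly better in one.
  static bool dominates(const SchedulingPoint& p, const SchedulingPoint& q)
  {
      bool no_worse = p.energy <= q.energy && p.exec_time <= q.exec_time &&
                      p.user_value >= q.user_value;
      bool better   = p.energy <  q.energy || p.exec_time <  q.exec_time ||
                      p.user_value >  q.user_value;
      return no_worse && better;
  }

  // Keep only the non-dominated points and sort them as described above.
  std::vector<SchedulingPoint> build_pareto_set(const std::vector<SchedulingPoint>& explored)
  {
      std::vector<SchedulingPoint> pareto;
      for (const SchedulingPoint& p : explored) {
          bool dominated = false;
          for (const SchedulingPoint& q : explored)
              if (dominates(q, p)) { dominated = true; break; }
          if (!dominated) pareto.push_back(p);
      }
      std::sort(pareto.begin(), pareto.end(),
                [](const SchedulingPoint& a, const SchedulingPoint& b) {
                    if (a.energy    != b.energy)    return a.energy    < b.energy;
                    if (a.exec_time != b.exec_time) return a.exec_time < b.exec_time;
                    return a.user_value > b.user_value;
                });
      return pareto;
  }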

Run-time scheduling

Only at run time is the system-level information about the active scenario complete. Given the Pareto set of this scenario and the constraints, such as the application deadline and QoS requirements, the energy budget of the platform, and the target platform autonomy, the run-time scheduling has to select one point in order to maximize the QoS while meeting these constraints. One important advantage of this run-time scheduling is its flexibility: unforeseen demands for more execution time by any task can be accommodated by stealing time from another task, based on the available Pareto sets. This step must be done fast, to allow a frequent (re)evaluation of the run-time decisions or the handling of more tasks in a single shot. Both result in still more energy saving.

In contrast to our design-time scheduling, our run-time scheduling works at the job granularity level. The details inside a job, like the task execution times and energy consumptions, remain invisible to the run-time selection, and this reduces its complexity significantly.

The integration of the run-time scheduling in the application, together with an efficient implementation of the Pareto set storage and access, is implemented in C for performance and portability reasons. It is a library function called in the application specification, and its functionality is as follows:

- At the initialization of the application, only the essential features of all Pareto sets, generated by the design-time scheduling, are stored. For each point of a Pareto set, these features are: the scenario execution time and energy consumption, and the (supply voltage, clock frequency) assignment to each task of the scenario.
- Our run-time scheduling is called whenever the environment is changing (e.g., when the user changes his requirements, when the battery level becomes too low, or when some slack time becomes available). Whenever our run-time scheduling is called, it has to:
  o Predict the scenario that will be activated, and select the Pareto set accordingly.
  o On the active Pareto set, select one scheduling in such a way that the QoS of the application is maximized while respecting the application deadline, the energy budget of the platform, and the target platform autonomy. This selection is realized by applying a fast greedy heuristic.
- Whenever a task function is called, it accesses the data structures storing the scheduling features of the scenario it belongs to (previously predicted by the run-time scheduling), and the appropriate IP core function is called to set the right supply voltage and clock frequency.
- The IP core is swapped into a low-power state whenever the job is completely executed and some slack time is available. Which low-power state is used is decided by our run-time scheduling.

Setting the IP cores is performed through APIs. For processors, these APIs are provided by the platform provider. For HW blocks, these APIs interface with the power controller implemented in the HW block, described in the following section.
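To illustrate the run-time selection step, the following C++ sketch picks, on the active Pareto set, the point with the highest user value that still meets the deadline and fits the remaining energy budget. It is a simplified stand-in for the fast greedy heuristic mentioned above (the actual library is implemented in C); the structure and names are assumptions.

  #include <cstddef>
  #include <vector>

  // One point of the active Pareto set, as stored at application initialization.
  // Assumed layout; the per-task (supply voltage, clock frequency) assignment is omitted.
  struct ParetoPoint {
      double energy;
      double exec_time;
      double user_value;
  };

  // Return the index of the selected point, or -1 if no point is feasible.
  int select_runtime_point(const std::vector<ParetoPoint>& pareto,
                           double deadline, double energy_budget)
  {
      int best = -1;
      for (std::size_t i = 0; i < pareto.size(); ++i) {
          const ParetoPoint& p = pareto[i];
          if (p.exec_time > deadline || p.energy > energy_budget)
              continue; // infeasible under the current constraints
          if (best < 0 || p.user_value > pareto[best].user_value)
              best = static_cast<int>(i);
      }
      return best;
  }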

8 Power management of custom HW blocks

During the synthesis of any HW job, an RTL description is generated, which also contains a power controller, being the LRM for the custom HW block: the HW block implements the job and its tasks in individual power modes, whose characteristics are summarized in a power mode table (see Table 1) and provided to the GRM through the Platform information database. The power mode selection is done by the GRM and communicated to the power controller. This selection takes into account:

- The number of clock cycles and the switched capacitance required to execute the HW job.
- The average dynamic power and the average leakage of each possible power mode.
- The switching time and the switching power consumption between two power modes.

The power controller generation, including the power mode table, is performed by Task T2.4. Using this information, the GRM is able to control the power modes of the particular HW block at run time. Communication between the GRM and the LRM is performed through a TLM2-based interface.

8.1 TLM2-based interface to the LRM

The TLM2-based interface is implemented as a register interface that is accessible using a TLM2 socket, as shown in Figure 5.

Figure 5: TLM-based LRM interface (register file accessed through a TLM2 communication interface and a communication adapter, connected to the BAC++ functional model, the non-functional model (Vdd, Vth, clock-tree, leakage, etc.), and the observer calculating power and timing)

The registers are accessed through the TLM generic_payload pattern. If one of the registers is read or written, the interface adapter communicates directly with the non-functional model, using methods of a generic base class. All non-functional models are derived from that class, so a generic approach for accessing the models is available and only one type of interface adapter must be provided.

In order to keep the interface as simple as possible, the interface simply calls the appropriate getter and setter methods for each register. The register file has the structure shown in Table 2.

Register   Bits 31..0
Desired    Desired power mode ID
Recent     Recent power mode ID
Status     V(alid) (bit 31) | Reserved (bits 30:2) | S(tatus) (bits 1:0)

Table 2: LRM register interface

The following sections describe the functionality of each register.

Register Desired

31:0   Desired   Contains the ID of the desired power mode, as requested by the GRM. The register is read/write.

Register Recent

31:0   Recent    Contains the ID of the recent power mode. The register is read-only. It is only valid if the valid bit of the Status register is set.

Register Status

31     Valid     If set, the content of the register file is valid. If any of the registers of the interface is written, this bit is set immediately. That is, this bit can be read in the next cycle and will then contain a valid value. This bit is read-only.
30:2   Reserved  Reserved bits. The content is not defined.
1:0    Status    Determines the current status of the LRM. These bits are read-only. The content of these bits is only valid if the valid bit of this register is set. The following values are possible:
       00: OK; the desired power mode is accepted and active.
       01: Pending; the desired power mode is accepted, but not activated yet. It will become active as soon as possible.
       10: Invalid transition; the desired power mode is known, but it is not possible to switch from the current power mode to the desired one.
       11: Invalid mode; the desired power mode is not known.
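For illustration, the following SystemC/TLM-2.0 sketch shows how the GRM side could request a power mode switch through this register interface using the generic payload. The register offsets and the helper function are assumptions (the document does not define an address map); only the register semantics follow Table 2 and the descriptions above.

  #include <systemc>
  #include <tlm.h>
  #include <cstdint>

  // Assumed register offsets; the actual address map is not specified here.
  static const std::uint64_t REG_DESIRED = 0x0;
  static const std::uint64_t REG_STATUS  = 0x8;

  // Request a power mode switch from the LRM and check the Status register.
  // 'lrm' is any target implementing TLM-2.0 blocking transport.
  bool request_power_mode(tlm::tlm_blocking_transport_if<>& lrm, std::uint32_t desired_pm_id)
  {
      tlm::tlm_generic_payload trans;
      sc_core::sc_time delay = sc_core::SC_ZERO_TIME;

      // Write the desired power mode ID into the Desired register (r/w).
      trans.set_command(tlm::TLM_WRITE_COMMAND);
      trans.set_address(REG_DESIRED);
      trans.set_data_ptr(reinterpret_cast<unsigned char*>(&desired_pm_id));
      trans.set_data_length(4);
      trans.set_streaming_width(4);
      trans.set_byte_enable_ptr(nullptr);
      trans.set_response_status(tlm::TLM_INCOMPLETE_RESPONSE);
      lrm.b_transport(trans, delay);
      if (trans.get_response_status() != tlm::TLM_OK_RESPONSE)
          return false;

      // Read the Status register: bit 31 is the valid bit, bits 1:0 the status.
      std::uint32_t status = 0;
      trans.set_command(tlm::TLM_READ_COMMAND);
      trans.set_address(REG_STATUS);
      trans.set_data_ptr(reinterpret_cast<unsigned char*>(&status));
      trans.set_data_length(4);
      trans.set_response_status(tlm::TLM_INCOMPLETE_RESPONSE);
      lrm.b_transport(trans, delay);

      // Accept the request if it is reported as OK (00) or Pending (01).
      return (status & 0x80000000u) != 0 && (status & 0x3u) <= 1;
  }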

9 Sample RRM in COMPLEX use case 1

The application to be used in the COMPLEX use case 1 comes from the health care domain. It is a virtual machine oriented to data processing in body sensor networks. A node of this application is able to perform some computations based on data collected from sensors. The parameters of these computations (called features), such as sampling rate, window, shift of data set, etc., can be tuned, and the features can be activated or deactivated depending on the application demands. Parameter tuning and activation/deactivation of features can be done at run time. One prominent application of this virtual machine will be to detect body movements, for example to monitor the health state of an elderly patient.

Figure 6: Architecture of the embedded distributed system use case

The implementation of the RRM in the COMPLEX use case 1 is mainly devoted to HW platform modes. Nevertheless, the application may have scenarios where some application parameters can also be adapted to the environment conditions. In particular, this RRM should configure:

- The platform, by managing:
  o Processor status: power modes and frequency.
  o Peripherals status: power modes and clock gating.
- The application, by managing:
  o Memory usage: memory accesses could be redirected to selected memories (to be analyzed once the application is ready to be deployed).
  o Application parameters: this will be related to the sampling rate of the sensors and to the transmission latency of the acquired information.

Moreover, this RRM should configure the platform and the application adequately, depending on the application scenario, the battery status, and the network status. Regarding the application scenario, some examples could be:

- No anomaly is detected in the monitoring of the health state of an elderly patient (low sampling rate and low transmission rate).

- An anomaly has been detected in the monitoring of the health state of an elderly patient (the sampling rate should be increased, requiring limited transmission latency).
- A monitoring request is sent to the target node, forcing a specific QoS on the application (e.g., a predetermined sampling rate or fast transmission).

Despite a use case structure different from the COMPLEX use case 2, both sample RRMs will behave in a similar way, integrated at the application level and selecting operating points depending on the system (application, platform, and environment) conditions. The main difference between both sample RRMs is that the one implemented in the COMPLEX use case 1 will probably be lighter.

10 Sample RRM in COMPLEX use case 2

Figure 7: Platform architecture in the COMPLEX use case 2

In the COMPLEX use case 2, the application will be mapped: (1) on an abstract virtual platform programmed in SystemC and using the VP tool provided by Synopsys; (2) on a TLM virtual platform provided by STMicroelectronics. More details on this use case can be found in the COMPLEX deliverable D4.1.1 [2].

Application and application modes

The application considered in the COMPLEX use case 2 is from the audio-surveillance domain. It consists of the following stages and options:

- Stage 1: Audio activity detection
- Stage 2: Multi-channel processing, with two options:
  o Option 2.1: Localisation
  o Option 2.2: Signal enhancement
- Stage 3: Feature extraction, with two options:
  o Option 3.1: Feature extraction
  o Option 3.2: Compression of extracted features


More information

Last Time. Making correct concurrent programs. Maintaining invariants Avoiding deadlocks

Last Time. Making correct concurrent programs. Maintaining invariants Avoiding deadlocks Last Time Making correct concurrent programs Maintaining invariants Avoiding deadlocks Today Power management Hardware capabilities Software management strategies Power and Energy Review Energy is power

More information

System Level Design with IBM PowerPC Models

System Level Design with IBM PowerPC Models September 2005 System Level Design with IBM PowerPC Models A view of system level design SLE-m3 The System-Level Challenges Verification escapes cost design success There is a 45% chance of committing

More information

Hardware Design and Simulation for Verification

Hardware Design and Simulation for Verification Hardware Design and Simulation for Verification by N. Bombieri, F. Fummi, and G. Pravadelli Universit`a di Verona, Italy (in M. Bernardo and A. Cimatti Eds., Formal Methods for Hardware Verification, Lecture

More information

Hardware-Software Codesign. 1. Introduction

Hardware-Software Codesign. 1. Introduction Hardware-Software Codesign 1. Introduction Lothar Thiele 1-1 Contents What is an Embedded System? Levels of Abstraction in Electronic System Design Typical Design Flow of Hardware-Software Systems 1-2

More information

Frequency and Voltage Scaling Design. Ruixing Yang

Frequency and Voltage Scaling Design. Ruixing Yang Frequency and Voltage Scaling Design Ruixing Yang 04.12.2008 Outline Dynamic Power and Energy Voltage Scaling Approaches Dynamic Voltage and Frequency Scaling (DVFS) CPU subsystem issues Adaptive Voltages

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

Hardware/Software Codesign of Schedulers for Real Time Systems

Hardware/Software Codesign of Schedulers for Real Time Systems Hardware/Software Codesign of Schedulers for Real Time Systems Jorge Ortiz Committee David Andrews, Chair Douglas Niehaus Perry Alexander Presentation Outline Background Prior work in hybrid co-design

More information

Who Ate My Battery? Why Free and Open Source Systems Are Solving the Problem of Excessive Energy Consumption

Who Ate My Battery? Why Free and Open Source Systems Are Solving the Problem of Excessive Energy Consumption Who Ate My Battery? Why Free and Open Source Systems Are Solving the Problem of Excessive Energy Consumption Jeremy Bennett, Embecosm Kerstin Eder, Computer Science, University of Bristol Why? Ericsson

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

European Component Oriented Architecture (ECOA ) Collaboration Programme: Architecture Specification Part 2: Definitions

European Component Oriented Architecture (ECOA ) Collaboration Programme: Architecture Specification Part 2: Definitions European Component Oriented Architecture (ECOA ) Collaboration Programme: Part 2: Definitions BAE Ref No: IAWG-ECOA-TR-012 Dassault Ref No: DGT 144487-D Issue: 4 Prepared by BAE Systems (Operations) Limited

More information

Leakage Mitigation Techniques in Smartphone SoCs

Leakage Mitigation Techniques in Smartphone SoCs Leakage Mitigation Techniques in Smartphone SoCs 1 John Redmond 1 Broadcom International Symposium on Low Power Electronics and Design Smartphone Use Cases Power Device Convergence Diverse Use Cases Camera

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

A Predictable RTOS. Mantis Cheng Department of Computer Science University of Victoria

A Predictable RTOS. Mantis Cheng Department of Computer Science University of Victoria A Predictable RTOS Mantis Cheng Department of Computer Science University of Victoria Outline I. Analysis of Timeliness Requirements II. Analysis of IO Requirements III. Time in Scheduling IV. IO in Scheduling

More information

Low Power System-on-Chip Design Chapters 3-4

Low Power System-on-Chip Design Chapters 3-4 1 Low Power System-on-Chip Design Chapters 3-4 Tomasz Patyk 2 Chapter 3: Multi-Voltage Design Challenges in Multi-Voltage Designs Voltage Scaling Interfaces Timing Issues in Multi-Voltage Designs Power

More information

WORKFLOW ENGINE FOR CLOUDS

WORKFLOW ENGINE FOR CLOUDS WORKFLOW ENGINE FOR CLOUDS By SURAJ PANDEY, DILEBAN KARUNAMOORTHY, and RAJKUMAR BUYYA Prepared by: Dr. Faramarz Safi Islamic Azad University, Najafabad Branch, Esfahan, Iran. Task Computing Task computing

More information

A Process Model suitable for defining and programming MpSoCs

A Process Model suitable for defining and programming MpSoCs A Process Model suitable for defining and programming MpSoCs MpSoC-Workshop at Rheinfels, 29-30.6.2010 F. Mayer-Lindenberg, TU Hamburg-Harburg 1. Motivation 2. The Process Model 3. Mapping to MpSoC 4.

More information

An Information Model for High-Integrity Real Time Systems

An Information Model for High-Integrity Real Time Systems An Information Model for High-Integrity Real Time Systems Alek Radjenovic, Richard Paige, Philippa Conmy, Malcolm Wallace, and John McDermid High-Integrity Systems Group, Department of Computer Science,

More information

D.A.S.T. Defragmentation And Scheduling of Tasks University of Twente Computer Science

D.A.S.T. Defragmentation And Scheduling of Tasks University of Twente Computer Science D.A.S.T. Defragmentation And Scheduling of Tasks University of Twente Computer Science Frank Vlaardingerbroek Joost van der Linden Stefan ten Heggeler Ruud Groen 14th November 2003 Abstract When mapping

More information

Hardware-Software Codesign

Hardware-Software Codesign Hardware-Software Codesign 8. Performance Estimation Lothar Thiele 8-1 System Design specification system synthesis estimation -compilation intellectual prop. code instruction set HW-synthesis intellectual

More information

AUTOBEST: A United AUTOSAR-OS And ARINC 653 Kernel. Alexander Züpke, Marc Bommert, Daniel Lohmann

AUTOBEST: A United AUTOSAR-OS And ARINC 653 Kernel. Alexander Züpke, Marc Bommert, Daniel Lohmann AUTOBEST: A United AUTOSAR-OS And ARINC 653 Kernel Alexander Züpke, Marc Bommert, Daniel Lohmann alexander.zuepke@hs-rm.de, marc.bommert@hs-rm.de, lohmann@cs.fau.de Motivation Automotive and Avionic industry

More information

Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm

Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm Alessandro Biondi and Marco Di Natale Scuola Superiore Sant Anna, Pisa, Italy Introduction The introduction of

More information

ECE 486/586. Computer Architecture. Lecture # 2

ECE 486/586. Computer Architecture. Lecture # 2 ECE 486/586 Computer Architecture Lecture # 2 Spring 2015 Portland State University Recap of Last Lecture Old view of computer architecture: Instruction Set Architecture (ISA) design Real computer architecture:

More information

Chapter 1. Computer Abstractions and Technology. Lesson 2: Understanding Performance

Chapter 1. Computer Abstractions and Technology. Lesson 2: Understanding Performance Chapter 1 Computer Abstractions and Technology Lesson 2: Understanding Performance Indeed, the cost-performance ratio of the product will depend most heavily on the implementer, just as ease of use depends

More information

Hardware/Software Codesign

Hardware/Software Codesign Hardware/Software Codesign 3. Partitioning Marco Platzner Lothar Thiele by the authors 1 Overview A Model for System Synthesis The Partitioning Problem General Partitioning Methods HW/SW-Partitioning Methods

More information

High performance, power-efficient DSPs based on the TI C64x

High performance, power-efficient DSPs based on the TI C64x High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research

More information

ESE Back End 2.0. D. Gajski, S. Abdi. (with contributions from H. Cho, D. Shin, A. Gerstlauer)

ESE Back End 2.0. D. Gajski, S. Abdi. (with contributions from H. Cho, D. Shin, A. Gerstlauer) ESE Back End 2.0 D. Gajski, S. Abdi (with contributions from H. Cho, D. Shin, A. Gerstlauer) Center for Embedded Computer Systems University of California, Irvine http://www.cecs.uci.edu 1 Technology advantages

More information

RTL Power Estimation and Optimization

RTL Power Estimation and Optimization Power Modeling Issues RTL Power Estimation and Optimization Model granularity Model parameters Model semantics Model storage Model construction Politecnico di Torino Dip. di Automatica e Informatica RTL

More information

Multi processor systems with configurable hardware acceleration

Multi processor systems with configurable hardware acceleration Multi processor systems with configurable hardware acceleration Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline Motivations

More information

OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions

OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions 04/15/14 1 Introduction: Low Power Technology Process Hardware Architecture Software Multi VTH Low-power circuits Parallelism

More information

EE382V: System-on-a-Chip (SoC) Design

EE382V: System-on-a-Chip (SoC) Design EE382V: System-on-a-Chip (SoC) Design Lecture 8 HW/SW Co-Design Sources: Prof. Margarida Jacome, UT Austin Andreas Gerstlauer Electrical and Computer Engineering University of Texas at Austin gerstl@ece.utexas.edu

More information

Long Term Trends for Embedded System Design

Long Term Trends for Embedded System Design Long Term Trends for Embedded System Design Ahmed Amine JERRAYA Laboratoire TIMA, 46 Avenue Félix Viallet, 38031 Grenoble CEDEX, France Email: Ahmed.Jerraya@imag.fr Abstract. An embedded system is an application

More information

Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study

Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study William Fornaciari Politecnico di Milano, DEI Milano (Italy) fornacia@elet.polimi.it Donatella Sciuto Politecnico

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information

Design of embedded mixed-criticality CONTRol systems under consideration of EXtra-functional properties. Final Publishable Summary Report

Design of embedded mixed-criticality CONTRol systems under consideration of EXtra-functional properties. Final Publishable Summary Report FP7-ICT-2013-10 (611146) CONTREX Design of embedded mixed-criticality CONTRol systems under consideration of EXtra-functional properties Project Duration 2013-10-01 2016-09-30 Type IP WP no. Deliverable

More information

Introduction to MLM. SoC FPGA. Embedded HW/SW Systems

Introduction to MLM. SoC FPGA. Embedded HW/SW Systems Introduction to MLM Embedded HW/SW Systems SoC FPGA European SystemC User s Group Meeting Barcelona September 18, 2007 rocco.le_moigne@cofluentdesign.com Agenda Methodology overview Modeling & simulation

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

Design methodology for multi processor systems design on regular platforms

Design methodology for multi processor systems design on regular platforms Design methodology for multi processor systems design on regular platforms Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline

More information

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Ayse K. Coskun Electrical and Computer Engineering Department Boston University http://people.bu.edu/acoskun

More information

Reduce Your System Power Consumption with Altera FPGAs Altera Corporation Public

Reduce Your System Power Consumption with Altera FPGAs Altera Corporation Public Reduce Your System Power Consumption with Altera FPGAs Agenda Benefits of lower power in systems Stratix III power technology Cyclone III power Quartus II power optimization and estimation tools Summary

More information

EC EMBEDDED AND REAL TIME SYSTEMS

EC EMBEDDED AND REAL TIME SYSTEMS EC6703 - EMBEDDED AND REAL TIME SYSTEMS Unit I -I INTRODUCTION TO EMBEDDED COMPUTING Part-A (2 Marks) 1. What is an embedded system? An embedded system employs a combination of hardware & software (a computational

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

EE382V: System-on-a-Chip (SoC) Design

EE382V: System-on-a-Chip (SoC) Design EE382V: System-on-a-Chip (SoC) Design Lecture 10 Task Partitioning Sources: Prof. Margarida Jacome, UT Austin Prof. Lothar Thiele, ETH Zürich Andreas Gerstlauer Electrical and Computer Engineering University

More information

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution

More information

Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany

Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany 2013 The MathWorks, Inc. 1 Agenda Model-Based Design of embedded Systems Software Implementation

More information

Embedded Systems. Information. TDDD93 Large-Scale Distributed Systems and Networks

Embedded Systems. Information. TDDD93 Large-Scale Distributed Systems and Networks TDDD93 Fö Embedded Systems - TDDD93 Fö Embedded Systems - 2 Information TDDD93 Large-Scale Distributed Systems and Networks Lectures on Lecture notes: available from the course page, latest 24 hours before

More information

Quality-of-Service Modeling and Analysis of Dependable Aplication Models

Quality-of-Service Modeling and Analysis of Dependable Aplication Models Quality-of-Service Modeling and Analysis of Dependable Aplication Models András Balogh András Pataricza BUTE-DMIS-FTSRG http://www.decos.at/ 2 Outline Introduction Target application domains Application

More information

Embedded System Design and Modeling EE382V, Fall 2008

Embedded System Design and Modeling EE382V, Fall 2008 Embedded System Design and Modeling EE382V, Fall 2008 Lecture Notes 4 System Design Flow and Design Methodology Dates: Sep 16&18, 2008 Scribe: Mahesh Prabhu SpecC: Import Directive: This is different from

More information

IPv6-based Beyond-3G Networking

IPv6-based Beyond-3G Networking IPv6-based Beyond-3G Networking Motorola Labs Abstract This paper highlights the technical issues in IPv6-based Beyond-3G networking as a means to enable a seamless mobile Internet beyond simply wireless

More information

Energy Aware Scheduling in Cloud Datacenter

Energy Aware Scheduling in Cloud Datacenter Energy Aware Scheduling in Cloud Datacenter Jemal H. Abawajy, PhD, DSc., SMIEEE Director, Distributed Computing and Security Research Deakin University, Australia Introduction Cloud computing is the delivery

More information

ReconOS: An RTOS Supporting Hardware and Software Threads

ReconOS: An RTOS Supporting Hardware and Software Threads ReconOS: An RTOS Supporting Hardware and Software Threads Enno Lübbers and Marco Platzner Computer Engineering Group University of Paderborn marco.platzner@computer.org Overview the ReconOS project programming

More information

Advanced FPGA Design Methodologies with Xilinx Vivado

Advanced FPGA Design Methodologies with Xilinx Vivado Advanced FPGA Design Methodologies with Xilinx Vivado Alexander Jäger Computer Architecture Group Heidelberg University, Germany Abstract With shrinking feature sizes in the ASIC manufacturing technology,

More information

LECTURE 1. Introduction

LECTURE 1. Introduction LECTURE 1 Introduction CLASSES OF COMPUTERS When we think of a computer, most of us might first think of our laptop or maybe one of the desktop machines frequently used in the Majors Lab. Computers, however,

More information

Energy Estimation Based on Hierarchical Bus Models for Power-Aware Smart Cards

Energy Estimation Based on Hierarchical Bus Models for Power-Aware Smart Cards Energy Estimation Based on Hierarchical Bus Models for Power-Aware Smart Cards U. Neffe, K. Rothbart, Ch. Steger, R. Weiss Graz University of Technology Inffeldgasse 16/1 8010 Graz, AUSTRIA {neffe, rothbart,

More information

CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review

CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, 2003 Review 1 Overview 1.1 The definition, objectives and evolution of operating system An operating system exploits and manages

More information

Lecture 7: Introduction to Co-synthesis Algorithms

Lecture 7: Introduction to Co-synthesis Algorithms Design & Co-design of Embedded Systems Lecture 7: Introduction to Co-synthesis Algorithms Sharif University of Technology Computer Engineering Dept. Winter-Spring 2008 Mehdi Modarressi Topics for today

More information

Power Measurement Using Performance Counters

Power Measurement Using Performance Counters Power Measurement Using Performance Counters October 2016 1 Introduction CPU s are based on complementary metal oxide semiconductor technology (CMOS). CMOS technology theoretically only dissipates power

More information

Towards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics

Towards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics Towards ParadisEO-MO-GPU: a Framework for GPU-based Local Search Metaheuristics N. Melab, T-V. Luong, K. Boufaras and E-G. Talbi Dolphin Project INRIA Lille Nord Europe - LIFL/CNRS UMR 8022 - Université

More information

Transaction level modeling of SoC with SystemC 2.0

Transaction level modeling of SoC with SystemC 2.0 Transaction level modeling of SoC with SystemC 2.0 Sudeep Pasricha Design Flow and Reuse/CR&D STMicroelectronics Ltd Plot No. 2 & 3, Sector 16A Noida 201301 (U.P) India Abstract System architects working

More information

An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling

An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling Keigo Mizotani, Yusuke Hatori, Yusuke Kumura, Masayoshi Takasu, Hiroyuki Chishiro, and Nobuyuki Yamasaki Graduate

More information

Information Processing. Peter Marwedel Informatik 12 Univ. Dortmund Germany

Information Processing. Peter Marwedel Informatik 12 Univ. Dortmund Germany Information Processing Peter Marwedel Informatik 12 Univ. Dortmund Germany Embedded System Hardware Embedded system hardware is frequently used in a loop ( hardware in a loop ): actuators - 2 - Processing

More information

Abstraction Layers for Hardware Design

Abstraction Layers for Hardware Design SYSTEMC Slide -1 - Abstraction Layers for Hardware Design TRANSACTION-LEVEL MODELS (TLM) TLMs have a common feature: they implement communication among processes via function calls! Slide -2 - Abstraction

More information

An FPGA Architecture Supporting Dynamically-Controlled Power Gating

An FPGA Architecture Supporting Dynamically-Controlled Power Gating An FPGA Architecture Supporting Dynamically-Controlled Power Gating Altera Corporation March 16 th, 2012 Assem Bsoul and Steve Wilton {absoul, stevew}@ece.ubc.ca System-on-Chip Research Group Department

More information

A Design Methodology for the Exploitation of High Level Communication Synthesis

A Design Methodology for the Exploitation of High Level Communication Synthesis A Design Methodology for the Exploitation of High Level Communication Synthesis Francesco Bruschi, Politecnico di Milano, Italy Massimo Bombana, CEFRIEL, Italy Abstract In this paper we analyse some methodological

More information

Low Power System Design

Low Power System Design Low Power System Design Module 18-1 (1.5 hours): Case study: System-Level Power Estimation and Reduction Jan. 2007 Naehyuck Chang EECS/CSE Seoul National University Contents In-house tools for low-power

More information

NoC Round Table / ESA Sep Asynchronous Three Dimensional Networks on. on Chip. Abbas Sheibanyrad

NoC Round Table / ESA Sep Asynchronous Three Dimensional Networks on. on Chip. Abbas Sheibanyrad NoC Round Table / ESA Sep. 2009 Asynchronous Three Dimensional Networks on on Chip Frédéric ric PétrotP Outline Three Dimensional Integration Clock Distribution and GALS Paradigm Contribution of the Third

More information

Blackfin Optimizations for Performance and Power Consumption

Blackfin Optimizations for Performance and Power Consumption The World Leader in High Performance Signal Processing Solutions Blackfin Optimizations for Performance and Power Consumption Presented by: Merril Weiner Senior DSP Engineer About This Module This module

More information

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency MediaTek CorePilot 2.0 Heterogeneous Computing Technology Delivering extreme compute performance with maximum power efficiency In July 2013, MediaTek delivered the industry s first mobile system on a chip

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

AN4749 Application note

AN4749 Application note Application note Managing low-power consumption on STM32F7 Series microcontrollers Introduction The STM32F7 Series microcontrollers embed a smart architecture taking advantage of the ST s ART- accelerator

More information

MaRTE-OS: Minimal Real-Time Operating System for Embedded Applications

MaRTE-OS: Minimal Real-Time Operating System for Embedded Applications MaRTE-OS: Minimal Real-Time Operating System for Embedded Applications FOSDEM 2009 Ada Developer Room Miguel Telleria de Esteban Daniel Sangorrin Universidad de Cantabria Computadores y Tiempo Real http://www.ctr.unican.es

More information

Energy consumption in embedded systems; abstractions for software models, programming languages and verification methods

Energy consumption in embedded systems; abstractions for software models, programming languages and verification methods Energy consumption in embedded systems; abstractions for software models, programming languages and verification methods Florence Maraninchi orcid.org/0000-0003-0783-9178 thanks to M. Moy, L. Mounier,

More information

COMPLEX embedded systems with multiple processing

COMPLEX embedded systems with multiple processing IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 12, NO. 8, AUGUST 2004 793 Scheduling and Mapping in an Incremental Design Methodology for Distributed Real-Time Embedded Systems

More information

DIOGENE (Digital I/O GENerator Engine) Project Requirements

DIOGENE (Digital I/O GENerator Engine) Project Requirements SCO-DIOGENE-0-- 1 of 13 DIOGENE (Digital I/O GENerator Engine) Project Requirements Document : SCO-DIOGENE-0-.doc Revision : SCO-DIOGENE-0-- 2 of 13 APPROVAL Name Signature Date Prepared by Sergio Cigoli

More information

Dynamic Resource Allocation for Priority Processing

Dynamic Resource Allocation for Priority Processing Dynamic Resource Allocation for Priority Processing Master Project Martijn van den Heuvel m.m.h.p.v.d.heuvel@student.tue.nl Systems Architecture and Networking (SAN) Department of Mathematics and Computer

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information

COMPLEX EMBEDDED SYSTEMS

COMPLEX EMBEDDED SYSTEMS COMPLEX EMBEDDED SYSTEMS Embedded System Design and Architectures Summer Semester 2012 System and Software Engineering Prof. Dr.-Ing. Armin Zimmermann Contents System Design Phases Architecture of Embedded

More information

The Application of SystemC to the Design and Implementation of a High Data Rate Satellite Transceiver

The Application of SystemC to the Design and Implementation of a High Data Rate Satellite Transceiver The Application of SystemC to the Design and Implementation of a High Data Rate Satellite Transceiver The MITRE Corporation Approved for public release. Distribution unlimited. Case #07-0782 Contract No.

More information

Ten Reasons to Optimize a Processor

Ten Reasons to Optimize a Processor By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor

More information

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems Abstract Reconfigurable hardware can be used to build a multitasking system where tasks are assigned to HW resources at run-time

More information

SpecC Methodology for High-Level Modeling

SpecC Methodology for High-Level Modeling EDP 2002 9 th IEEE/DATC Electronic Design Processes Workshop SpecC Methodology for High-Level Modeling Rainer Dömer Daniel D. Gajski Andreas Gerstlauer Center for Embedded Computer Systems Universitiy

More information

EEM870 Embedded System and Experiment Lecture 4: SoC Design Flow and Tools

EEM870 Embedded System and Experiment Lecture 4: SoC Design Flow and Tools EEM870 Embedded System and Experiment Lecture 4: SoC Design Flow and Tools Wen-Yen Lin, Ph.D. Department of Electrical Engineering Chang Gung University Email: wylin@mail.cgu.edu.tw March 2013 Agenda Introduction

More information

Hardware-Software Codesign

Hardware-Software Codesign Hardware-Software Codesign 4. System Partitioning Lothar Thiele 4-1 System Design specification system synthesis estimation SW-compilation intellectual prop. code instruction set HW-synthesis intellectual

More information

Design of embedded mixed-criticality CONTRol systems under consideration of EXtra-functional properties

Design of embedded mixed-criticality CONTRol systems under consideration of EXtra-functional properties FP7-ICT-2013-10 (611146) CONTREX Design of embedded mixed-criticality CONTRol systems under consideration of EXtra-functional properties Project Duration 2013-10-01 2016-09-30 Type IP WP no. Deliverable

More information

High Data Rate Fully Flexible SDR Modem

High Data Rate Fully Flexible SDR Modem High Data Rate Fully Flexible SDR Modem Advanced configurable architecture & development methodology KASPERSKI F., PIERRELEE O., DOTTO F., SARLOTTE M. THALES Communication 160 bd de Valmy, 92704 Colombes,

More information

Contemporary Design. Traditional Hardware Design. Traditional Hardware Design. HDL Based Hardware Design User Inputs. Requirements.

Contemporary Design. Traditional Hardware Design. Traditional Hardware Design. HDL Based Hardware Design User Inputs. Requirements. Contemporary Design We have been talking about design process Let s now take next steps into examining in some detail Increasing complexities of contemporary systems Demand the use of increasingly powerful

More information