CAP-OS: Operating System for Runtime Scheduling, Task Mapping and Resource Management on Reconfigurable Multiprocessor Architectures


Diana Göhringer 1, Michael Hübner 2, Etienne Nguepi Zeutebouo 1, Jürgen Becker 2
1 Fraunhofer IOSB, Germany
2 ITIV, Karlsruhe Institute of Technology (KIT), Germany
1 {dgoehringer, zeutebouo}@fom.fgan.de, 2 {michael.huebner, becker}@kit.edu

Abstract

Operating systems traditionally handle the task scheduling of one or more application instances on a processor-like hardware architecture. Novel runtime adaptive hardware exploits dynamic reconfiguration on FPGAs, where hardware blocks are generated, started and terminated. This is similar to software tasks in well-established operating system approaches. The hardware counterparts to the software tasks have to be transferred to the reconfigurable hardware via a configuration access port. This port enables the allocation of hardware blocks on the FPGA. Current reconfigurable hardware, e.g. the Xilinx Virtex-5, provides two internal configuration access ports (ICAPs), of which only one can be accessed at any point in time. In a multiprocessor system on an FPGA, for example, multiple instances may try to access these ports simultaneously. To prevent conflicts, the access to these ports as well as the hardware resource management needs to be controlled by a special purpose operating system running on an embedded processor. This special purpose operating system, called CAP-OS (Configuration Access Port-Operating System), which is presented in this paper, provides the clients of the configuration port with priority-based access scheduling, hardware task mapping and resource management.

Keywords- Operating System, MPSoC, Reconfigurable Computing, FPGA, Scheduling, Task Mapping

I. INTRODUCTION

Scheduling of tasks within a given time frame and with respect to a required deadline due to real-time aspects is well known in computer science from operating systems (OSs), especially from real-time operating systems (RTOSs). Scheduling strategies of conventional OSs vary between pre-emptive and non-pre-emptive scheduling and are further classified, e.g., into earliest deadline first or rate monotonic algorithms (see [1] for detailed descriptions). The classical scheduling and task mapping process of software-based systems with a traditional OS has its counterpart in novel runtime reconfigurable hardware systems. Within these systems, tasks can be represented, in addition to the traditional software representation, as a physical hardware realization, e.g. on an FPGA. This means that a further degree of freedom for task mapping on hardware resources is available to the OS layer. For example, whereas a task in a traditional software-based system is mapped and executed on a resource as a software thread, the hardware reconfigurable variant of such a system also allows running this task as a hardware block realized with logic resources on an FPGA. This difference and the new degree of freedom in task representation require a novel concept for hardware task scheduling and mapping. In order to handle this process, the consequences of, e.g., data dependencies, priority and real-time aspects have to be analyzed in detail and formalized in a feasible algorithm for an efficient special purpose OS. Furthermore, the underlying hardware resources, including the internal configuration access port (ICAP), have to be characterized in terms of timing, determinism, behavior in termination cases etc. These results also have to be accounted for in the special purpose OS approach by a cost function.
The described investigation and its results can be exploited efficiently in the runtime adaptive Multiprocessor System-on-Chip (RAMPSoC) approach described in [2]. In this approach several processors, co-processors and hardware accelerators are available for concurrent task realization on an FPGA. The approach presented in this paper schedules the tasks of a control dataflow graph (CDG) and maps these tasks either to hardware or to software on reconfigurable multicore hardware on the FPGA. The algorithm therefore considers data dependencies, physical constraints from the configuration interface and the reconfigurable resources and, additionally, the ability of the RAMPSoC approach to process data in parallel.

The paper is organized as follows: Related work is presented in Section II. Section III briefly describes the RAMPSoC approach and its features. In Section IV the concept and the features of CAP-OS (Configuration Access Port-Operating System) are described. Section V presents how CAP-OS is integrated into the RAMPSoC hardware architecture. A case study and first results are presented in Section VI. Finally, the paper closes with conclusions and an outlook in Section VII.

II. RELATED WORK

Scheduling for hardware reconfigurable architectures has been reported in different publications. The publications discussed in this paper are only a subset of the numerous approaches developed in academic and industrial environments. However, the papers referenced in this related work section reflect the significant aspects with respect to the presented approach and allow an

objective comparison of the benefits achieved by the proposed special purpose OS named CAP-OS.

Dittmann et al. [3] describe a scheduling approach for a single processor and several accelerators, which can be configured at runtime. The solution provides pre-emptive reconfiguration, which is important if a task with a higher priority has to displace the configuration process of a lower-priority task. The scheduling strategy is based on a deadline monotonic (DM) algorithm with some extensions related to the fact that a hardware/software reconfigurable system is targeted. The approach has some restrictions, because only homogeneously shaped reconfigurable areas are supported. Consequently, only a fixed (non-variable) time frame for the reconfiguration of the hardware is considered in the algorithms. A further restriction is that data dependencies between the tasks are not considered within the scheduling algorithm. Furthermore, the approach requires drivers supporting the physical reconfiguration of the FPGA. This could certainly be a standard ICAP driver with the related IP cores.

Ullmann et al. [4] also target, similar to the previously described approach, a single processor solution with reconfigurable accelerators of homogeneous shape and size. The scheduling is priority-based and non-pre-emptive, because this approach was developed for automotive applications, where pre-emption of certain tasks is not allowed. The runtime system reported in the paper includes the hardware drivers for the configuration access port. It also includes features such as context load and save, which allow the resumption of tasks in hardware or software.

ReconOS [5] uses the eCos real-time operating system as its basis. Here too, the target hardware architecture is a single processor with reconfigurable accelerators loosely connected to it.
In comparison to the previously described approach, the authors use a fixed priority scheduling approach. For synchronization purposes, a communication method for the software and hardware threads over the eCos RTOS was developed. An interesting aspect is that a task graph with dependent and independent tasks is used as input description for the scheduler.

The approaches reported above show that a novel OS approach has to be introduced for a reconfigurable multiprocessor System-on-Chip like RAMPSoC. One simple example of this necessity, amongst others, is the fact that the reconfigurable regions are no longer homogeneous in their footprint, and therefore the configuration times vary between the different tasks which may have to be allocated to the hardware. This and other parameters have to be handled with the novel approach of the CAP-OS.

III. THE RAMPSOC APPROACH

CAP-OS is used for runtime scheduling, task mapping and resource management on a RAMPSoC [2]. Fig. 1 shows an example of a RAMPSoC architecture at one point in time. As can be seen, the RAMPSoC is a heterogeneous multiprocessor System-on-Chip (MPSoC), consisting of a number of different processors connected over a communication infrastructure, which is a switch-based Network-on-Chip (NoC) in this example. The processors can be extended with one or several hardware accelerators. Furthermore, a Finite State Machine (FSM) together with a hardware function can be used instead of a processor, if desired.

Figure 1. Example of a RAMPSoC architecture at one point in time

Dynamic and partial reconfiguration is used to adapt the hardware architecture of the RAMPSoC at runtime. The following runtime adaptations are supported by the RAMPSoC:
• Number and characteristics of processors
• Communication infrastructure (e.g. size, bandwidth, topology)
• Number and functionality of hardware accelerators
• Software for the processors

This way, a good trade-off between performance, power consumption and area requirements can be achieved through runtime adaptation of the hardware architecture with respect to the needs of the applications. More details about the hardware architecture of the RAMPSoC and its benefits can be found in [2].

For efficient programming of such a flexible hardware architecture, an easy-to-use toolflow is required, which guides the user in partitioning the application at design time. It also generates the partial bitstreams for the hardware modules. An overview of this toolflow can be found in [6]. These partial bitstreams together with the task graphs of the applications are required by the CAP-OS, which is presented in detail in the next section. The CAP-OS is responsible for the runtime scheduling of the configurations of the different tasks, for allocating the tasks to the processing elements and for resource management. Furthermore, the CAP-OS needs to respond to runtime demands of the application, such as one or several processors needing different accelerators.

IV. CONCEPT OF THE CAP-OS

For an adaptive MPSoC like RAMPSoC, a flexible RTOS is required, which schedules the reconfiguration of the tasks and their runtime allocation to a specific processing element. Furthermore, this RTOS has to assure that the

different applications meet their real-time requirements and that the utilization of the hardware resources, and therefore the power consumption, is kept low. Fig. 2 shows how CAP-OS manages the underlying RAMPSoC hardware architecture to fulfill the real-time requirements of the user applications. CAP-OS further hides the complexity of the underlying dynamic RAMPSoC architecture from the user.

Figure 2. CAP-OS embedded in the several abstraction layers of the system approach (layers from top to bottom: applications, with task graphs from the user and bitstreams for the tasks; CAP-OS, with runtime scheduling, resource allocation and configuration management; Xilkernel, with thread scheduling and hardware drivers; RAMPSoC hardware architecture; FPGA hardware architecture)

Resource allocation at runtime is done by partial and dynamic reconfiguration using the ICAP. Therefore, the scheduling algorithm has to consider the time required for reconfiguring a module, which depends on the data throughput of the ICAP interface and on the size of the module. This time is not negligible: the amount of data for a hardware module can be very small, but it can also be several hundred kilobytes.

For each task, two different implementation options exist. A task can either be executed in software on a processor or in hardware as a hardware accelerator. For both the software and the hardware implementation, different choices can exist, varying in size, performance and reconfiguration time. The scheduling algorithm has to choose the appropriate implementation type to fulfill the real-time constraints. Moreover, the presented scheduling approach tries to reuse existing resources, which were already configured onto the chip at a previous point in time, with the goal of reducing the overall reconfiguration overhead.
Furthermore, the scheduling algorithm has to support pre-emptive reconfiguration, because a request for the reconfiguration of a task with higher priority can occur while another task is being reconfigured. As only one ICAP is available, the reconfiguration of the previous task has to be terminated and the new task needs to be reconfigured. After this, the reconfiguration of the interrupted task has to restart, because a continuation of the terminated reconfiguration is not supported by the FPGA vendor.

This scheduling approach can handle both independent and dependent tasks. A group of interrelated tasks is called a task graph (TG). Each TG must fulfill the following requirements:
• The TG is a directed acyclic graph (DAG)
• Each task runs on processors/hardware accelerators
• Each task has an identity (ID)
• Each task has the following information:
o Neighborhood relation (predecessor/successor)
o Algorithm type or hardware constraints (Algo-ID)
o Execution time, reconfiguration time
o Communication costs
• The TG has a global deadline (D)
• The TG has either hard or soft real-time constraints, which are inherited by the tasks belonging to the TG

For the configuration of a task the following two rules apply:
• It can be terminated
• It is only feasible after all predecessor tasks are completely reconfigured

Fig. 3 shows an example of such a TG including the global deadline, the interrelation and the communication costs.

Figure 3. Example task graph with global deadline, interrelation and communication costs (T_i: task; D: global deadline; K_ij: communication costs between task T_i and task T_j; costs shown: K12, K13, K24, K35, K36, K45, K56)

Within the CAP-OS, each task within a TG has a life cycle as shown in Fig. 4.

Figure 4. Life cycle states of a task (Not_Ready, Ready, Config, Exec, Exit)

Table 1 describes in detail each of the states which are traversed by a task during its life cycle.

Table 1. Description of the life cycle states of a task
Task State | Description
Not_Ready | This task is not ready for reconfiguration, because its predecessors are not completely reconfigured.
Ready | This task is ready for reconfiguration and competes with the other Ready tasks for the access to the ICAP. Only tasks without predecessors, or whose predecessors have already been reconfigured, can enter this state.
Config | The task is under configuration via the ICAP onto the RAMPSoC. If a task with higher priority becomes Ready, the reconfiguration process is terminated, the task returns into the Ready state and waits for a new possibility to access the ICAP.
Exec | After successful configuration the task starts execution and enters this state. An execution cannot be interrupted.
Exit | After the execution the task enters this state. The allocated processing element is now free for the next task.

It is important to note that if the configuration of a task is interrupted, the task returns into the Ready state, the

configuration is lost and has to start all over again. As already mentioned in the previous section, the multiprocessor model used for the scheduling is a heterogeneous runtime adaptive MPSoC that uses a message passing communication scheme. The runtime scheduling algorithm is only performed for tasks which are in the state Ready. The novel runtime scheduling approach is described in detail in the next subsection.

A. The Novel Runtime Scheduling Approach

The novel runtime scheduling algorithm is divided into two main steps. First, a static scheduling algorithm is used to roughly assign priorities to the tasks of each TG using the information given by the TG description. For this, the list scheduling algorithm is used, because it is a priority-based static scheduling algorithm which respects resource constraints. The available resources are the single ICAP and the maximum number of possible processors, which depends on the size of the chosen FPGA. First, conservative estimates for the ASAP (As Soon As Possible) and the ALAP (As Late As Possible) start time of each task of a TG consisting of m tasks are calculated using the formulas:

$$\mathrm{ASAP}(T_i) = \sum_{T_j \in \mathrm{pre}(T_i)} \bigl( t_{rec}(T_j) + t_{exe}(T_j) \bigr) \qquad (1)$$

$$\mathrm{ALAP}(T_i) = D - \sum_{T_j \in \mathrm{succ}(T_i)} \bigl( t_{rec}(T_j) + t_{exe}(T_j) \bigr) \qquad (2)$$

$$\mu(T_i) = \mathrm{ALAP}(T_i) - \mathrm{ASAP}(T_i) \qquad (3)$$

pre(T_i): Predecessors of task T_i
succ(T_i): Successors of task T_i
t_rec(T_j): Reconfiguration time of task T_j
t_exe(T_j): Execution time of task T_j
D: Global deadline of the task graph
µ(T_i): Mobility of task T_i

Based on the ASAP and ALAP start time of each task, a priority can be assigned to each task in the TG using either the urgency or the mobility of the task. The urgency depends on the maximum number of successors of a task. The mobility of a task (see Formula (3)) is the difference between its ALAP and ASAP start time and favors the tasks along the critical path. The TG in Fig. 5, for example, has a critical path that ends in T6. Because of this, the mobility is used here to assign the priorities to the tasks.
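The static step can be illustrated with a minimal Python sketch. It rests on assumptions that are not confirmed by the text: the operator in Formulas (1) and (2) is read as a sum, and pre(T) and succ(T) are read as all transitive predecessors and successors. The function names `reachable` and `mobilities` and the diamond-shaped example graph are hypothetical.

```python
from collections import defaultdict


def reachable(start, edges):
    """Return all nodes reachable from `start` by following `edges` (start excluded)."""
    seen, stack = set(), list(edges[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges[node])
    return seen


def mobilities(tasks, deps, deadline):
    """tasks: {name: (t_rec, t_exe)}, deps: [(predecessor, successor)] edges of the DAG.

    Returns {name: (asap, alap, mobility)} under the assumed reading of the formulas.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for a, b in deps:
        succ[a].append(b)
        pred[b].append(a)
    cost = {name: rec + exe for name, (rec, exe) in tasks.items()}
    table = {}
    for name in tasks:
        asap = sum(cost[p] for p in reachable(name, pred))             # Formula (1)
        alap = deadline - sum(cost[s] for s in reachable(name, succ))  # Formula (2)
        table[name] = (asap, alap, alap - asap)                        # Formula (3)
    return table


# Hypothetical diamond-shaped task graph; all times in ms.
tasks = {"A": (5, 3), "B": (2, 4), "C": (2, 10), "D": (2, 1)}
deps = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
table = mobilities(tasks, deps, deadline=40)
```

Sorting the Ready tasks by the third entry of each tuple in ascending order would then give the static priority order, since a smaller mobility means a higher priority.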
The smaller the mobility, the higher is the priority of the task. At runtime, only the Ready tasks are scheduled for configuration according to their priorities, which have been calculated with the list scheduling algorithm.

Fig. 5 shows such a TG, which is processed by the CAP-OS to schedule the reconfiguration of the different tasks. In the current time step shown in Fig. 5, one task has already been reconfigured, and two further tasks are therefore now in the Ready state. Normally, the task with the highest priority will be reconfigured first. If there are two or more Ready tasks and the difference between the mobilities of the two tasks with the highest priority is smaller than the reconfiguration time of the task with the lower priority (see Formula (4)), a dynamic cost function K(T_i) (Formula (5)) is used to reassign the priorities of these two tasks.

Figure 5. Task graph to illustrate the functionality of the scheduling (the legend distinguishes tasks in state Exec, Ready and Not_Ready and marks the current scheduling step)

K(T_i) considers the ratio between the mobilities of the two tasks, K_1(T_i, T_j) (Formula (6)), and the ratio between the numbers of successors of the two tasks, K_2(T_i, T_j) (Formula (7)). K(T_i) is computed using Formulas (5) to (7), and it is only computed for the two tasks that currently have the highest priority to be scheduled.

T_i gets the highest priority if:

$$\mu(T_j) - \mu(T_i) > RT(T_j), \quad \mu(T_i) < \mu(T_j) \qquad (4)$$

RT(T_j): Reconfiguration time of task T_j

Otherwise the decision is made using K:
K(T_i) > K(T_j): T_i gets the highest priority
K(T_i) ≤ K(T_j): T_j gets the highest priority

$$K(T_i) = \omega_1 \cdot K_1(T_i, T_j) + \omega_2 \cdot K_2(T_i, T_j) \qquad (5)$$

ω_1, ω_2: Weighting factors

$$K_1(T_i, T_j) = \begin{cases} \mu(T_j) / \mu(T_i), & \mu(T_i) < \mu(T_j) \wedge \mu(T_i) \neq 0 \\ 0, & \text{else} \end{cases} \qquad (6)$$

µ(T_i): Mobility of task T_i

$$K_2(T_i, T_j) = \begin{cases} N(T_i) / N(T_j), & N(T_i) > N(T_j) \wedge N(T_j) \neq 0 \\ 0, & \text{else} \end{cases} \qquad (7)$$

N(T_i): Number of successors of task T_i

K_1 gets a greater weight in the cost function than K_2, because for real-time applications the execution time is the most important factor.
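The dynamic reassignment between the two highest-priority Ready tasks can be sketched in Python as follows. This is an illustrative reading of Formulas (4) to (7), not the original implementation (CAP-OS itself is written in C). The class `ReadyTask` and the function names are hypothetical, the assignment of the task indices in the ratios is an assumption, and the weights default to the values 0.6 and 0.4 given in the text.

```python
from dataclasses import dataclass


@dataclass
class ReadyTask:
    name: str
    mobility: float    # mu(T), from the static list scheduling
    rec_time: float    # RT(T), reconfiguration time
    successors: int    # N(T), number of successor tasks


def k1(ti, tj):
    """Mobility ratio, Formula (6)."""
    if ti.mobility < tj.mobility and ti.mobility != 0:
        return tj.mobility / ti.mobility
    return 0.0


def k2(ti, tj):
    """Successor ratio, Formula (7)."""
    if ti.successors > tj.successors and tj.successors != 0:
        return ti.successors / tj.successors
    return 0.0


def cost(ti, tj, w1=0.6, w2=0.4):
    """Dynamic cost function K(T_i), Formula (5)."""
    return w1 * k1(ti, tj) + w2 * k2(ti, tj)


def configure_first(a, b, w1=0.6, w2=0.4):
    """Pick which of the two highest-priority Ready tasks is configured first."""
    ti, tj = sorted((a, b), key=lambda t: t.mobility)  # ti has the smaller mobility
    if ti.mobility < tj.mobility and tj.mobility - ti.mobility > tj.rec_time:
        return ti                                      # Formula (4): static priority is kept
    if cost(ti, tj, w1, w2) > cost(tj, ti, w1, w2):
        return ti
    return tj                                          # K(T_i) <= K(T_j)
```

For example, a task with mobility 10 and one successor loses against a task with mobility 12 and three successors when both need 5 time units to reconfigure: the mobility gap of 2 is smaller than the reconfiguration time, so the cost function decides, and the larger successor ratio outweighs the mobility ratio.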
The default values were therefore set to 0.6 for ω_1 and 0.4 for ω_2. These weights can be modified by the user depending on the requirements of the application.

Additionally, multiple TGs can be scheduled at runtime. If some of these TGs have hard real-time and others only soft real-time requirements, then all tasks of the TGs with soft real-time constraints will be delayed. They will be reconfigured after the tasks with hard real-time constraints, even though they might have a higher priority according to the list scheduling algorithm. This is important to assure that the hard real-time TGs meet their constraints.

Finally, an additional feature is supported by CAP-OS. This feature allows increasing the clock frequency of a processing element at runtime by reconfiguring the corresponding digital clock manager (DCM). This reconfiguration is faster than reconfiguring a new hardware module and it is used to shorten the execution time of a task. It is assumed here that the execution time is closely related to the clock frequency. This DCM

reconfiguration is used if a task cannot complete within its ALAP time or if another task urgently requires the same processor. The single steps of the scheduling algorithm can therefore be summarized as follows:
(1) Calculate the ASAP and ALAP start time for each task in the task graph
(2) Calculate the mobility of each task and assign the priorities using a list scheduling algorithm
(3) Select the Ready tasks and schedule them dynamically:
a. delay tasks with soft real-time constraints
b. reassign priorities using the cost function, if necessary
c. reconfigure the DCM, if necessary
d. terminate the current reconfiguration, if a task with a higher priority occurs

This results in a pre-emptive scheduling approach, which allows the termination of a configuration. Furthermore, it uses a combination of static list scheduling and a novel dynamic scheduling approach. It considers resource constraints, such as a single ICAP or the maximal number of possible processors. Moreover, the clock frequency of processing elements can be increased at runtime if necessary, and the reconfiguration times as well as the communication costs between tasks are considered.

B. Resource Allocation of the CAP-OS

After the scheduling, the CAP-OS tries to allocate a resource for the Ready task with the highest priority. For the resource allocation, the decision is made as shown in Fig. 6.

Figure 6. Decision tree for resource allocation

First, the CAP-OS analyzes whether a processor is present and available on the reconfigurable hardware. If no processor is present, a new one is configured and allocated for the new task. If processors are present in the system, it searches for one which is not blocked by another task. If all existing processors are blocked, it is checked whether one of them will finish its execution soon.
This is important, because the reconfiguration and allocation take a certain amount of time. If an existing processor finishes in a shorter amount of time than the reconfiguration time of a new processor, the reuse of this existing processor is preferred. This also has the benefit of reducing the area utilization and therefore the overall power consumption. If none of the existing processors will finish soon, it is analyzed whether the maximal number of processors is reached or whether there is still space to reconfigure a new processor. If there is space on the reconfigurable hardware, a new processor is reconfigured and allocated for the new task. If not, the new task has to wait until one of the processors becomes available.

C. Configuration Management

After the Ready task with the highest priority has been successfully assigned to a processor, this task is handed over to the configuration management. The configuration management is responsible for handling the configuration of the tasks via the ICAP. It is also responsible for pre-empting a current configuration, if another task with higher priority needs to be reconfigured. As mentioned before, a terminated configuration has to restart from the beginning, because Xilinx FPGAs do not support the continuation of a terminated configuration so far. Therefore, the configuration management of the CAP-OS distinguishes between two types of configurations, as shown in Table 2.

Table 2. Configuration types
Configuration Type | Features | Elements
Soft | Interruptible until 80% of the bitstream is reconfigured | Software, hardware accelerators
Hard | Not interruptible | Processors, DCM

The term soft means an interruptible and hard means a non-interruptible configuration. Soft configuration types are, e.g., the configuration of software tasks or of hardware accelerators for existing processors. As soon as 80% of the corresponding bitstream of a soft configuration type is configured, this element changes to a hard configuration type.
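The allocation decision of Section IV.B and the soft/hard distinction of Table 2 can be condensed into two small Python functions. This is only a sketch under assumptions: the function names, the string return values and the representation of a processor by its remaining execution time (0 meaning free) are hypothetical and do not come from the CAP-OS sources.

```python
def allocate(remaining, max_processors, t_rec_processor):
    """Decide how to serve the highest-priority Ready task.

    remaining: remaining execution time of each configured processor (0 = free).
    Returns "configure_new", "use_existing", "wait_and_reuse" or "wait".
    """
    if not remaining:
        return "configure_new"              # no processor present yet
    if any(r == 0 for r in remaining):
        return "use_existing"               # a configured processor is not blocked
    if min(remaining) < t_rec_processor:
        return "wait_and_reuse"             # one finishes sooner than a reconfiguration
    if len(remaining) < max_processors:
        return "configure_new"              # still space on the FPGA
    return "wait"                           # everything blocked and no space left


def is_interruptible(kind, fraction_configured):
    """Soft/hard rule; kind is "software", "accelerator", "processor" or "dcm"."""
    if kind in ("processor", "dcm"):
        return False                        # hard configuration types
    return fraction_configured < 0.8        # soft until 80% of the bitstream is written
```

With a processor reconfiguration time of 5 ms and at most four processors, `allocate([3, 9], 4, 5)` prefers waiting for the processor that is free in 3 ms, while `allocate([8, 9], 4, 5)` configures a new processor.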
The reason for this threshold is to prevent the termination of a nearly finished configuration, because the already configured data would be lost. Other examples of hard configuration types are the configuration of the DCMs and of the processors, because the configuration of a DCM is urgent and fast and a processor is far less task specific than an accelerator.

D. Communication Establishment between Tasks

After successfully configuring a task, the CAP-OS tries to establish a communication with this task and to transfer information about the IDs of the communication partners to it. Fig. 7 illustrates the steps required to successfully establish a communication between the different tasks at runtime.

Figure 7. Runtime communication establishment steps between different tasks (1, 2: Sync; 3: Task Info; 4: Task ID; 5: End)

The five runtime communication establishment steps required after a task has been mapped onto a processor are:
(1) CAP-OS sends a sync word to the processor.
(2) The processor responds with the same sync word to ensure a correct communication.
(3) CAP-OS sends the task info (Task ID, number of predecessor/successor tasks and their IDs) to the processor. This task info is required by the task to find its communication partners at runtime.
(4) The processor sends its Task ID to all other processors and checks each of its communication links for the Task IDs of its communication partners. It has to send its Task ID to all other processors, because a predecessor and a successor may be mapped onto the same processor. An example of such a case is given in Section VI.
(5) After execution, the processor informs CAP-OS that it is now free for a new task.

V. INTEGRATION OF CAP-OS ON RAMPSOC

CAP-OS is integrated into a RAMPSoC by implementing it in software on one of the microprocessors. On the selected microprocessor, a state-of-the-art RTOS with multithreading capabilities is implemented. On top of this RTOS, the CAP-OS is implemented using different threads for the different functionalities. As shown in Fig. 8, this microprocessor is directly connected with the Xilinx ICAP primitive and with an external memory, in which the partial bitstreams of the tasks are stored.

Figure 8. Integration of the CAP-OS on the RAMPSoC

In this example the microprocessor is connected with the other processors over a switch-based NoC, but a Point-to-Point connection with each of the other partners or a connection over a different NoC is also supported. Several possible choices for an on-chip microprocessor exist. As the processor running the CAP-OS, the IBM PowerPC 405 (PPC405) [7] was chosen. It is available on Xilinx Virtex-4FX FPGAs as a hard core IP.
The main reasons for choosing the PPC405 are the support of high frequencies up to 450 MHz and its availability on the Virtex-4FX100 FPGA of the target FPGA board from Alpha-Data [8]. High frequencies are important to execute the CAP-OS fast and to support the real-time requirements. Other possible microprocessors would be soft core IPs, such as Xilinx MicroBlaze or Leon SPARC, but they lack the support of such high frequencies. After selecting the processor, an appropriate RTOS was chosen. The demands for the RTOS are:
• support of the PPC405
• well-tested multithreading capabilities
• small memory footprint

Several different RTOSs exist, but for the reasons above, the Xilkernel [9] from Xilinx was selected. The CAP-OS is programmed in C and its functionalities are implemented in several different threads, which are executed in Xilkernel using multithreading. For scheduling the different threads, Xilkernel offers two policies: round robin or priority-based scheduling. Priority-based scheduling was chosen in order to execute the different threads according to their priorities. Furthermore, the PowerPC is directly connected to the ICAP primitive and to an external memory (DDR2 SDRAM), in which the bitstreams are stored. The CAP-OS and Xilkernel are executed using on-chip memory for maximum performance. In the following subsection the implementation of the different threads is described in detail.

A. Implementation of the CAP-OS

The CAP-OS is programmed using six threads, as shown in Table 3.

Table 3. Realized threads of the CAP-OS
Thread | Priority | Description
Test_main | 0 | Initial thread. Launches the other five threads.
Init_proc | 1 | Generates a list containing all possible processors and their attributes. Executes only once.
Task_graph | 2 | Initialization of the tasks and generation of the task graphs. Calculation of the ALAP and ASAP start time and the mobility of each task. Matching of tasks with equal requirements (HW constraints, same algorithm).
Schedule | 3 | Scheduling of the Ready tasks and processor allocation.
Configure | 3 | Configuration management for the scheduled and allocated task and communication establishment between the newly configured task and its neighbors.
Contr_Exit_Task | 3 | Controls the executing tasks. If a task finishes execution, the occupied processing element is freed.

A lower priority number means a higher priority. Test_main is the startup thread and has a fixed priority. The priorities of the other five threads can change at runtime depending on the demands of the applications. The three threads with priority level 3 (Schedule, Configure and Contr_Exit_Task) compete against each other after the first three threads with higher priority have finished executing. While the other threads only execute once in the beginning,

these three competing threads execute until the last task finishes executing.

VI. CASE STUDY AND RESULTS

The correct functionality of the CAP-OS was evaluated by implementing a RAMPSoC system on the target Alpha-Data FPGA board. The CAP-OS was implemented using one of the available PPC405s and the Xilkernel RTOS. The maximum number of reconfigurable processors was set to four, to be below the number of tasks within our evaluation task graphs. As the target Virtex-4FX100 FPGA is quite big, a higher number of processors could be used, if necessary. For the reconfigurable processors the Xilinx MicroBlaze (µBlaze) [10] was chosen, due to its small area footprint and its good compatibility with the PPC405. As shown in Fig. 9, Fast Simplex Links (FSLs) [11] are used for communication between the processors. They offer a FIFO-based unidirectional communication, and for the limited number of processors a NoC is not required. The PPC405 can be connected via FSL to 32 partners, while each µBlaze can be connected to 16 partners.

Figure 9. Implemented RAMPSoC system (static region: PPC405 running CAP-OS and Xilkernel, XPS-ICAP, PCI, external DDR2 SDRAM; dynamic reconfigurable region: µBlaze0 to µBlaze3; connections: FSL Point-to-Point connections between the µBlazes, FSL Point-to-Point connections between the PPC405 and the µBlazes, PLB bus, communication between user and CAP-OS over RS232)

Additionally, the XPS-ICAP IP core from Xilinx together with an external DDR2 SDRAM is connected via the PLB bus to the PPC405. The user communicates with the CAP-OS via RS232. For the test, dynamic and partial reconfiguration was not used, because the scope was to verify the CAP-OS and not the ICAP primitive. Instead of sending the partial bitstreams to the ICAP core, a counter within the Configure thread was used to simulate the reconfiguration times of the different tasks. A reconfiguration time of 5 ms was assumed for reconfiguring a whole processor, and 2 ms for reconfiguring a software task onto an existing processor.
These reconfiguration times are worst case scenarios. Software is assumed to be transferred via the ICAP core to the BRAMs of the corresponding processor like it was shown for eample in [12]. The reconfiguration times could also be reduced b using an ICAP with a direct DMA-access to eternal memor, such as presented in [13]. At sstem startup, it is assumed that onl the static part is present and the other processors are reconfigured ondemand. Phsicall the sstem, as shown in Fig. 9, was present from the beginning and after the simulation of the reconfiguration time is finished the corresponding processor is activated. To verif the functionalit of and the implemented sstem the two TGs of Fig. 10 are used. TG1 has hard real-time constraints. This could be e.g. an image processing application, which receives the images from a camera and has to present the results to the user via a monitor in real-time. Therefore the global deadline (D 1 ) of TG1 is 40ms using a camera with a frame rate of 25 Hz. If this deadline is missed, frames will be lost. TG2 is a soft real-time application, whose global deadline (D 2 ) can be missed, without causing problems. D 2 is set to 50 ms here. Task Graph 1: Hard Real-Time K 12 K 25 K 13 K 35 K 14 K 45 D 1 Task Graph 2: Soft Real-Time Task Description Algo-ID,, T6, T8 Same hardware requirements: e.g. receive/send data via PCI 0, T7 Same algorithm: e.g. same image processing filter 1 Different algorithm: e.g. different image processing filter 2 Different algorithm: e.g. different image processing filter 3 Figure 10. Two task graphs for the evaluation: D 1 = 40ms, D 2 = 50ms To measure the timing overhead, the was eecuted on the FPGA using TG1. To test, if the correctl reuses eisting resources, the two tasks and were set to have the same algorithm (same Algo-ID) as shown in Fig. 10. During the eecution on the FPGA the number of clock ccles, required per call b each thread, were measured. 
The results for the timing overhead introduced by CAP-OS are shown in Table 4.

Table 4. Timing overhead of CAP-OS for processing TG1

Thread             Average number of clock cycles per call
Init_proc          2118
Task_graph         9022
Schedule           650
Contr_Exit_Task    227

The clock cycles of the Configure thread depend on the size of the bitstream and on the speed of the ICAP primitive. Therefore, they are not given here. Test_main only launches the other five threads; it does not itself produce timing overhead and is therefore also not listed. Init_proc depends on the number of processors (here four) and Task_graph depends on the TG (here TG1 with five tasks), so these numbers are just an example for the given TG. The number of clock cycles required by the Schedule thread depends on the complexity of the scheduling; e.g., it increases slightly if the cost function needs to be evaluated for two tasks. The cycle count of Contr_Exit_Task is very stable.
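To relate the cycle counts of Table 4 to the deadlines, they can be converted into time once the clock frequency of the processor running CAP-OS is known. The text does not state this frequency, so the 100 MHz used below is purely an assumed placeholder, and the helper names are hypothetical.

```python
"""Sketch: converting the per-call cycle counts of Table 4 into time."""

# Average clock cycles per call, taken from Table 4.
CYCLES_PER_CALL = {
    "Init_proc": 2118,
    "Task_graph": 9022,
    "Schedule": 650,
    "Contr_Exit_Task": 227,
}

ASSUMED_CLOCK_MHZ = 100.0  # assumption: the clock frequency is not given in the text


def cycles_to_us(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count to microseconds (cycles / MHz = microseconds)."""
    return cycles / clock_mhz


def share_of_deadline(cycles: int, clock_mhz: float, deadline_ms: float) -> float:
    """Return the fraction of a deadline consumed by one call of a thread."""
    return cycles_to_us(cycles, clock_mhz) / (deadline_ms * 1000.0)


if __name__ == "__main__":
    for thread, cycles in CYCLES_PER_CALL.items():
        us = cycles_to_us(cycles, ASSUMED_CLOCK_MHZ)
        share = share_of_deadline(cycles, ASSUMED_CLOCK_MHZ, 40.0)
        print(f"{thread}: {us:.2f} us per call, {share:.4%} of D1")
```

With a different clock frequency only ASSUMED_CLOCK_MHZ needs to change; the cycle counts themselves are independent of it.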

With this example it can be shown that CAP-OS worked correctly and assigned the tasks of TG1 without violating the global deadline. The resource reuse also worked correctly: the two tasks with the same algorithm were allocated onto the same processor, so the reconfiguration time could be saved.

Finally, a case study using image processing tasks within the task graphs TG1 and TG2 was done. The execution times of the individual tasks were measured on a single µBlaze. Fig. 11 shows the calculated results of CAP-OS and compares them against those calculated using the scheduling approach of Dittmann et al. [3]. The scheduling overhead is not included in these results, because the overhead of the approach of Dittmann et al. was not known. Here, it was assumed that the approach of Dittmann et al. can also differentiate between a SW and a HW reconfiguration, and can therefore reuse existing processors, which is not the case in [3].

VII. CONCLUSIONS AND OUTLOOK

In this paper the concept and the features of a special-purpose OS called CAP-OS were presented. CAP-OS is responsible for scheduling, resource allocation and reconfiguration, and for managing access to the configuration access port. CAP-OS has been integrated into the RAMPSoC approach to handle the runtime organization of the adaptive RAMPSoC hardware architecture. CAP-OS was implemented using six threads on the Xilkernel RTOS running on a PPC405. The correct functionality and the timing overheads of CAP-OS were measured on the FPGA using an exemplary TG. The benefits of CAP-OS were shown using a case study with two TGs, comparing the results against the scheduling approach of Dittmann et al. [3]. Future work will be the extension of CAP-OS to support the reconfiguration of the communication infrastructure. Furthermore, it will be extended to handle not only the demands of the user, but also the reconfiguration demands of the other processors within the RAMPSoC.
These demands are mainly the reconfiguration of the accelerators, if at runtime, for example, a different accelerator is required depending on the currently processed data. Furthermore, CAP-OS will be further evaluated and will also be tested using real dynamic and partial reconfiguration. Additional extensions of CAP-OS will be support for merging several bitstreams and for bitstream relocation. Bitstream relocation is important to reduce the amount of external memory required for storing each bitstream for each possible location.

[Figure 11. Theoretical results of CAP-OS and Dittmann et al. [3]. Task schedules on processors P0 to P3 over time (ms), in panels a) and b); b) is labelled "Dittmann et al. [3]" and "misses deadline". Legend: reconfiguration time of a task; execution time of a task; C: reconfiguration time for a DCM (the DCM of P1 was reconfigured for 150 MHz). Results table with columns Method, Solution, Execution Time, Results; the entry for Dittmann et al. [3] reads "- Real-Time (TG1 < 40 ms, TG2 > 50 ms), + Resources".]

REFERENCES

[1] J. Blazewicz, K.H. Ecker, E. Pesch, G. Schmidt, J. Weglarz: Scheduling Computer and Manufacturing Processes; Berlin (Springer) 2001, ISBN
[2] D. Göhringer, M. Hübner, V. Schatz, J. Becker: Runtime Adaptive Multi-Processor System-on-Chip: RAMPSoC; In Proc. of RAW 2008 at the IPDPS 2008, April
[3] F. Dittmann, S. Frank: Hard Real-Time Reconfiguration Port Scheduling; In Proc. of DATE 2007, p , April
[4] M. Ullmann, M. Hübner, B. Grimm, J. Becker: On-Demand FPGA Run-Time System for Dynamical Reconfiguration with Adaptive Priorities; In Proc. of FPL 2004, pp , August
[5] E. Lübbers, M. Platzner: ReconOS: An RTOS supporting Hard- and Software Threads; In Proc. of FPL 2007, August
[6] D. Göhringer, M. Hübner, T. Perschke, J. Becker: New Dimensions for Multiprocessor Architectures: On Demand Heterogeneity, Infrastructure and Performance through Reconfigurability: The RAMPSoC Approach; In Proc. of FPL 2008, pp , Sept
[7] PowerPC Reference Guide; UG011 (v.1.2), Jan. 19, Available at
[8] Alpha Data:
[9] Xilkernel v3_00_a; EDK 9.1i, December 12, Available at
[10] MicroBlaze Reference Guide, Embedded Development Kit, EDK 9.2i, UG081 (v8.1). Available at
[11] Fast Simplex Link (FSL) Bus (v2.11a); DS449, June 25, Available at
[12] O. Sander, L. Braun, M. Huebner, J. Becker: Data Reallocation by Exploiting FPGA Configuration Mechanisms; In Proc. of ARC 2008, Springer Volume 4943/2008, March
[13] C. Claus, B. Zhang, W. Stechele, L. Braun, M. Hübner, J. Becker: A multi-platform controller allowing for maximum dynamic partial reconfiguration throughput; In Proc. of FPL 2008, Sept

Year: 2010
Author(s): Göhringer, D.; Hübner, M.; Zeutebouo, E.N.; Becker, J.
Title: CAP-OS: Operating system for runtime scheduling, task mapping and resource management on reconfigurable multiprocessor architectures
DOI: /IPDPSW
© IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Details: Institute of Electrical and Electronics Engineers -IEEE-; IEEE Computer Society: IEEE International Symposium on Parallel & Distributed Processing Workshops and Phd Forum, IPDPSW Vol.1 : Atlanta, Georgia, USA, April 2010 Piscataway/NJ: IEEE, 2010 ISBN: ISBN: ISBN: pp

Fast dynamic and partial reconfiguration Data Path

Fast dynamic and partial reconfiguration Data Path with low Michael Hübner 1, Diana Göhringer 2, Juanjo Noguera 3, Jürgen Becker 1 1 Karlsruhe Institute of Technology (KIT), Germany 2 Fraunhofer IOSB,