Instruction Set Architecture Extensions for a Dynamic Task Scheduling Unit


Oliver Arnold, Benedikt Noethen, and Gerhard Fettweis
Vodafone Chair Mobile Communications Systems
Dresden University of Technology (TU Dresden), Dresden, Germany
{oliver.arnold, benedikt.noethen, fettweis}@ifn.et.tu-dresden.de

Abstract: In this paper a heterogeneous Multiprocessor System-on-Chip (MPSoC) is controlled by a dynamic task scheduling unit called CoreManager. The instruction set architecture of this unit is extended to improve performance for dynamic data dependency checking, task scheduling, processing element (PE) allocation and data transfer management. In order to analyze and compare different implementations and trade-offs, a tool flow was developed. Area and timing results are provided as well. A significant performance improvement can be shown for all parts of the CoreManager.

Keywords: dynamic task scheduling; heterogeneous MPSoC; instruction set extension

I. INTRODUCTION

One promising approach to ensure high parallelism in future embedded systems is task level parallelism, where multiple instructions are bundled into a task. An example is the CellSs programming model [13] running on the Cell Broadband Engine [5]. It integrates up to eight processing cores and one PowerPC, which is used for task scheduling.

There are two concepts for scheduling tasks to the available processing elements (PEs): static and dynamic scheduling. Static task schedulers are used especially for embedded systems, where power consumption and performance are crucial. In this case the set of applications running on the hardware is limited and the overall application requirements and tasks are known. The characterization of the applications is the main challenge of this approach. This holds in particular for mobile handsets, where more and more applications and standards have to be supported and a full characterization of the applications is infeasible, especially if several applications run in parallel.
In such scenarios, dynamic task scheduling is better suited since complete characterization is not required. A dedicated core is in charge of scheduling the tasks to the available PEs. The schedule is built at runtime after a task data dependency analysis. The dynamic task scheduler can be implemented in hardware as an accelerator or in software running on a general purpose core. The hardware implementation is characterized by a very short scheduling time of less than 100 cycles for one task [6]. It has a deterministic scheduling time but is neither configurable nor extensible. Only one application at a time can be executed, and priorities cannot be specified. In [7] a software approach of the scheduler was presented and the performance impact was analyzed. It has been shown that the scheduler needs on average more than 1000 cycles due to the limited performance of the core. Checking the data dependencies at runtime is the most time consuming part of the software approach [7].

Instruction set extensions of standard processors have been applied in many areas [1], e.g., in the field of security [2] and network applications [3]. In contrast to these works we analyze the critical sections of a software based dynamic task scheduler in more detail and define new instruction extensions to increase the overall scheduler performance. An ASIP approach has been presented in [12]. It provides OS support on MPSoCs. Central hardware units in charge of scheduling are shown in [6] and [14].

The remainder of the paper is organized as follows: In Section II, the hardware system, the programming model and the tool flow are presented. In Section III the CoreManager's instruction set architecture extensions are introduced. Section IV presents benchmarks and experimental results.

This work was supported by the German Federal Ministry of Education and Research (BMBF) as part of the CoolBaseStations project under grant 13N

II. SYSTEM MODEL

A. Hardware

A heterogeneous Multiprocessor System-on-Chip (MPSoC) is shown in Fig. 1. It consists of several blocks connected by a Network-on-Chip (NoC). For this purpose each block has a dedicated router (R). A router is connected to its neighbors by point-to-point data links. The routers are responsible for packet scheduling and arbitration. XY routing is applied. Further details about the integrated NoC can be found in [8].

Several types of blocks can be distinguished. Three global memory ports are available (MEM0, MEM1 and MEM2). They allow a connection to the off-chip SDRAMs. The application processor (APP) hosts the operating system and executes the sequential part of an application. The data plane of the MPSoC consists of eleven processing elements (PEs): four digital signal processors (DSP), five general purpose (GP) cores and two application specific instruction set processors (ASIP). The CoreManager (CM) controls the data plane of the MPSoC. It is responsible for dynamic data dependency checking, task scheduling, PE allocation and data transfer management. Furthermore, it is responsible for the power management of the platform: it determines the point in time at which each PE is powered on and the frequency of each PE.

A more detailed view of each block is shown in Fig. 2. In Fig. 2a) a PE is connected to a data and an instruction memory. Furthermore, the CoreManager Spin-Off (CM_SO) is integrated. It contains a task FIFO. Thus, up to four tasks can be scheduled on a PE. The CM_SO is responsible for IN and OUT data transfers. Data transfers and task execution can be performed simultaneously, which makes explicit prefetching of data possible. Nevertheless, the CoreManager is responsible for the configuration of the CM_SO. E.g., the CoreManager determines the mapping of data in the local memories. In this approach a PE operates solely on its local memory, so no cache misses occur. Thus, task execution time is deterministic, leading to better predictability on system level. Prefetching of data is possible for the next two tasks, but must be explicitly annotated by the CoreManager.

The application processor is formed by a Tensilica 570t as shown in Fig. 2b). It has 2-way set associative instruction and data caches, each 16 Kbyte in size. In the system model it is placed next to an off-chip memory interface for fast data access.

In Fig. 2c) the CoreManager and its subcomponents are shown. Similar to the PEs, the CoreManager works solely on local on-chip memories. Instruction and data memory size is 32 Kbyte each. The CoreManager Transfer Unit (CM_TU) is available for data transfers between the CoreManager's local memories and any other address in the system. Timers and FIFO memories are available as well. The DebugUnit can be used for online and offline debugging. E.g., it traces the internal states and the dynamic decisions of the CoreManager.

Initialization of the platform is as follows: in a first step the application processor is booted from global memory.
After the boot process the application processor copies the CoreManager binary to the local memory of the CoreManager. The CoreManager can boot itself as soon as a trigger is set by the application processor. PEs are dynamically booted by the CoreManager. For this purpose boot code is available for each PE type.

Figure 1. System Model (blocks: MEM0, MEM1, MEM2, APP, CM, PE_DSP0 to PE_DSP3, PE_GP0 to PE_GP4, PE_ASIP0, PE_ASIP1)

Figure 2. Selected platform components: a) PE subsystem, b) Application processor subsystem, c) CoreManager subsystem

B. Programming Model

A task based programming model is used for the development of a parallel application [6]. It is independent of the underlying hardware. Thus, applications are fully portable as long as each task can be executed on at least one PE. A task is a collection of instructions. For each task, input and output data arrays are specified at runtime. E.g., in a software defined radio system the data locations of a task are specified after the header is processed. No static data analysis is possible for these kinds of applications.

A simple example is shown in Fig. 3. It is executed on the application processor (APP). The header is evaluated and the task description (task type and data arrays) is transferred to the CoreManager. In this example two task descriptions are transferred, either tasktype0 and tasktype1 or tasktype0 and tasktype2. In the next step the CoreManager checks data dependencies between the tasks. If a data dependency is present, the task is delayed until its predecessor tasks are finished. As soon as all dependencies are resolved, a task can be scheduled on a suitable PE. For this reason preferred and possible PE types are annotated for each task type. An as soon as possible (ASAP) list based scheduling approach is used. Local memory of the PE must be allocated as well.
Within this step, data locality is increased by using the on-chip local memories as explicit memory buffers. The necessary information is available within the CoreManager after the data dependency checking stage. The CoreManager configures the CM_SO of the selected PE, which carries out the following steps: If the PE is not ready, it is booted. Simultaneously, the necessary instructions and data of the task are fetched. Concurrently to the task execution, data can be fetched for the next task. After a task is finished, its output data is transferred to its destination. As soon as a task is finished, data dependencies can be resolved by the CoreManager.
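The PE selection step described above (preferred and possible PE types annotated per task type, bitwise PE annotation) can be illustrated with a small C sketch. This is not the CoreManager's implementation: the function names `lowest_set_bit` and `pick_pe`, the 16-bit mask width and the "lowest index first" tie-break are assumptions made only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: index of the lowest set bit, or -1 if the mask is empty. */
static int lowest_set_bit(uint16_t mask) {
    for (int i = 0; i < 16; i++) {
        if (mask & (1u << i)) {
            return i;
        }
    }
    return -1;
}

/* Sketch of a bitwise PE allocation (assumed policy, not taken from the paper):
 * bit i of each mask stands for PE i.
 *   preferred : PEs the task type should run on if possible
 *   possible  : PEs the task type is able to run on
 *   free_pes  : PEs that can currently accept a task
 * Returns the chosen PE index, or -1 if the task has to wait. */
int pick_pe(uint16_t preferred, uint16_t possible, uint16_t free_pes) {
    int pe = lowest_set_bit((uint16_t)(preferred & free_pes));
    if (pe >= 0) {
        return pe;              /* a preferred PE is available */
    }
    return lowest_set_bit((uint16_t)(possible & free_pes));
}
```

With PEs 1 and 2 free (`free_pes = 0x0006`), a task that prefers PE 0 but may also run on PE 2 is placed on PE 2, while a task whose preferred and possible PEs are all busy stays in the queue.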

    task( tasktype0, IN( ptr0, size0 ), IN( ptr1, size1 ), OUT( ptr2, size2 ) );
    if ( header == 0x143 )
        task( tasktype1, IN( ptr2, size2/2 ), IN( ptr0, size0 ), OUT( ptr3, size3 ) );
    else
        task( tasktype2, IN( ptr3, size3 ), IN( ptr2, size2 ), OUT( ptr4, size4 ) );

Figure 3. Example of the task programming model

C. Tool Flow

The tool flow is shown in Fig. 4. A C/C++ application is developed and compiled for the Tensilica 570t processor. It contains task calls as shown in Fig. 3. The CoreManager and PE specifications define the hardware capabilities of the CoreManager and all PE types respectively. The integration, placement and connections of all cores are specified in the platform specification. RTL code can be generated by the Tensilica Xtensa Processor Generator (XPG) [9]. Suitable compilers are generated as well. Thus, CoreManager and PE binaries can be generated. The CoreManager's source code is adapted to the available hardware configuration. The CoreManager binary as well as the PE binaries are linked into the application processor binary. A cycle accurate simulation is available with the Tensilica XTSC simulation environment.

For post processing and further analysis the TaskVisualizer and the DebugVisualizer are available. The TaskVisualizer is taken from [11]. Its frontend is adapted to the system used in this work. The DebugVisualizer is newly developed and allows a deep insight into the dynamic behavior of the CoreManager. For this purpose debug messages are written cyclically to a region of main memory occupying 32 MByte. Each debug message has the following format: { <time stamp>, <debug opcode>, <data> }. As soon as the CoreManager writes the debug opcode and the data to a 32-bit register in the DebugUnit, the time stamp is attached. Afterwards, the whole debug message is written to main memory. The DebugVisualizer analyzes these messages and checks the correct behavior of the CoreManager. Furthermore, visualization of all states of each task is possible.
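The cyclic debug message storage described above can be modeled as a ring buffer. The following C sketch is only an illustration: the field widths (32-bit time stamp, 16-bit opcode, 16-bit data) and all type and function names are assumptions, since the text only states that opcode and data are written together to one 32-bit DebugUnit register and that a time stamp is attached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One debug message: { time stamp, debug opcode, data }.
 * The split of the 32-bit register into two 16-bit halves is assumed. */
typedef struct {
    uint32_t timestamp;  /* attached when the register is written */
    uint16_t opcode;     /* assumed: upper half of the 32-bit register */
    uint16_t data;       /* assumed: lower half of the 32-bit register */
} debug_msg_t;

typedef struct {
    debug_msg_t *buf;    /* backing store, e.g. a region of main memory */
    size_t capacity;     /* number of messages that fit into the region */
    size_t next;         /* slot used by the next write */
    size_t total;        /* number of messages written so far */
} debug_ring_t;

void debug_ring_init(debug_ring_t *r, debug_msg_t *buf, size_t capacity) {
    assert(capacity > 0);
    r->buf = buf;
    r->capacity = capacity;
    r->next = 0;
    r->total = 0;
}

/* Model of one register write: the time stamp is attached and the message
 * is stored; once the region is full the oldest entry is overwritten. */
void debug_ring_write(debug_ring_t *r, uint32_t now, uint32_t reg) {
    debug_msg_t m = { now, (uint16_t)(reg >> 16), (uint16_t)(reg & 0xFFFFu) };
    r->buf[r->next] = m;
    r->next = (r->next + 1) % r->capacity;
    r->total++;
}
```

A post-processing tool in the style of the DebugVisualizer would read the region starting at `next` (the oldest surviving entry) once `total` exceeds `capacity`.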
Figure 4. Tool flow (legend: user specification, Tensilica, task tools)

III. COREMANAGER INSTRUCTION SET EXTENSIONS

In this section the instruction set extension of the dynamic task scheduling unit, called CoreManager, is described. For this purpose the CoreManager execution is profiled. Each part of the CoreManager is regarded and analyzed. The most time consuming parts are accelerated. The Tensilica tool chain is used to implement very long instruction words (VLIW) as well as single instruction multiple data (SIMD) operations [9].

For comparison a basic LX4 core is used as a reference implementation. This core is reasonably configured with functional units. E.g., a full adder and a multiplier are available. A plain-C version of the CoreManager software is running on it. It is taken from [7]. Further analysis of the runtime performance and scalability on an ARM926 can be found there.

In the first step VLIW is used to group instructions for parallel execution. This step is compiler assisted. Furthermore, new instructions can be specified to improve system performance. Examples are SIMD operations, which allow a parallel execution of one instruction on multiple data words. These new instructions are specified in a Verilog-like language. If two types of implementations satisfy all requirements, the most generic one is used.
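To make the idea of fused instructions concrete, the following C sketch gives plain software reference models for a few of the operations listed in Table I. It is a sketch under assumptions: the `ref_` function names, the 32-bit operand width, the result 32 for a zero input to the leading-zero count, and the reading of "negate" as bitwise complement are not stated in the paper. On the extended core each function would correspond to a single instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 32-bit word; returns 32 for 0 (assumed convention). */
uint32_t ref_lz(uint32_t x) {
    uint32_t n = 0;
    if (x == 0) {
        return 32;
    }
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* ADD3: adds three integers. */
uint32_t ref_add3(uint32_t a, uint32_t b, uint32_t c) { return a + b + c; }

/* AND_LZ / OR_LZ / XOR_LZ: logic operation followed by count leading zeros. */
uint32_t ref_and_lz(uint32_t a, uint32_t b) { return ref_lz(a & b); }
uint32_t ref_or_lz(uint32_t a, uint32_t b)  { return ref_lz(a | b); }
uint32_t ref_xor_lz(uint32_t a, uint32_t b) { return ref_lz(a ^ b); }

/* NEG_AND_LZ: negate first argument (assumed: bitwise complement), AND, count leading zeros. */
uint32_t ref_neg_and_lz(uint32_t a, uint32_t b) { return ref_lz(~a & b); }

/* SHIFT_1_XOR: res = (1 << in0) ^ in1 */
uint32_t ref_shift_1_xor(uint32_t in0, uint32_t in1) {
    assert(in0 < 32);            /* larger shifts are undefined in C */
    return (1u << in0) ^ in1;
}

/* MASK_SHIFT_AND: res = (~(in0 << in1)) & in2 */
uint32_t ref_mask_shift_and(uint32_t in0, uint32_t in1, uint32_t in2) {
    assert(in1 < 32);
    return (~(in0 << in1)) & in2;
}
```

Such reference models are useful for checking a custom instruction against its plain-C counterpart in simulation; for example, `ref_mask_shift_and(1, 4, 0xFF)` clears bit 4 of `0xFF`.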

TABLE I. NEWLY INTRODUCED INSTRUCTIONS

Instruction | Explanation
ADD3 | Adds three integers
LZ | Count leading zeros
XOR_LZ | 1. XOR, 2. count leading zeros
AND_LZ | 1. AND, 2. count leading zeros
OR_LZ | 1. OR, 2. count leading zeros
NEG_AND_LZ | 1. Negate first argument, 2. AND, 3. count leading zeros
LoadDepCheck | 1. Loads one data transfer of 64 bit, 2. increments data pointer by 8
DepCheck_SIMD_1 | Dependency checking with one and one state64
DepCheck_SIMD_LD2 | Dependency checking with two states and one
DepCheck_SIMD_LD4 | Dependency checking with two states and two
GetDepCheckResults | Returns the last depcheck results; dependencies are marked with a dedicated bit for each transfer comparison
GetPE | Performs a PE allocation for 16 possible and 16 preferred PEs. PE annotation is bitwise.
GetPePos | Returns an available taskpos on a PE. Increases data locality in case a successor task is executed on the same PE.
RemoveTransfers | Removes unnecessary transfers
ADD_SHIFT_LEFT | 1. ADD, 2. shift left by n bits
ADD_SHIFT_RIGHT | 1. ADD, 2. shift right by n bits
SHIFT_1_XOR | res = (1<<in0) ^ in1
SHIFT_LEFT_OR | res = (res<<in0) | in1
SHIFT_RIGHT_OR | res = (res>>in0) | in1
SHIFT_LEFT_XOR | res = (res<<in0) ^ in1
SHIFT_RIGHT_XOR | res = (res>>in0) ^ in1
MASK_SHIFT_AND | res = (~(in0<<in1)) & in2

In Table I an overview of all newly introduced instructions is presented. Load and store instructions are not shown. Several internal states and bit registers are available. The asm_ prefix of each instruction name is omitted.

In Fig. 5 the evolution of the dependency checking instruction is presented. In the first line the C version is shown (1). Two memory regions are compared. The first region is defined by the pointer p0 and the size s0, the second region by p1 and s1 respectively. Two subtractions, two compares and one OR operation are necessary. These instructions can be merged into one asm_depcheck instruction (2).
Afterwards, the load of the arguments can be accelerated by applying a 64-bit data bus and 64-bit registers (3). Thus, the number of memory loads is halved. In the next step SIMD can be applied. Instead of one compare of two transfers, four compares are done in parallel (4). Therefore, four transfers have to be loaded. By applying explicit load instructions and dedicated internal states, data loads can be reduced. Thus, data locality is increased and the number of register loads decreases (5). Furthermore, the depchecksimd_ld4 instruction is able to detect false dependencies in the case of read-read transfers.

IV. RESULTS

A. Benchmarks

A Global System for Mobile Communication (GSM) physical layer implementation is used to evaluate the performance of the CoreManager. The GSM benchmark consists of a receiving and a transmitting part. For each signal processing step a dedicated task type is available. These are, e.g., channel encoding/decoding, interleaving, ciphering, burst formatting, modulation and demodulation. Additionally, an additive white Gaussian noise (AWGN) channel is integrated. Channel coding is done by applying a convolutional encoder. Gaussian minimum shift keying (GMSK) is used for modulation. Ciphering uses the A5/1 algorithm. Key generation is not regarded.

The most time consuming part of the CoreManager is the runtime data dependency checking. Therefore, a synthetic benchmark was implemented which solely exercises the CoreManager's initialization and data dependency checking stage. The task window size, which determines the maximum number of tasks in the system, as well as the number of input and output data transfers can be varied.
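The dependency check whose evolution is described for Fig. 5 can be modeled in plain C as follows. This is a sketch under assumptions: the comparison operators of the C version are reconstructed from the description (two subtractions, two compares, one OR), the `transfer_t` type and function names are hypothetical, and the 2x2 pairing and bit order of the four parallel comparisons are chosen for illustration only.

```c
#include <stdint.h>

/* One data transfer: start address and size in bytes (hypothetical type). */
typedef struct {
    uint32_t ptr;
    uint32_t size;
} transfer_t;

/* Scalar check: returns 1 if [a.ptr, a.ptr + a.size) overlaps
 * [b.ptr, b.ptr + b.size). The unsigned subtraction wraps around, so each
 * compare tests "start of one region lies inside the other region". */
int depcheck(transfer_t a, transfer_t b) {
    return ((uint32_t)(a.ptr - b.ptr) < b.size) |
           ((uint32_t)(b.ptr - a.ptr) < a.size);
}

/* Software model of four comparisons at once (assumed pairing): two new
 * transfers n0, n1 are checked against two stored transfers s0, s1.
 * One result bit per comparison:
 *   bit 0: n0 vs s0, bit 1: n0 vs s1, bit 2: n1 vs s0, bit 3: n1 vs s1 */
uint32_t depcheck_2x2(transfer_t n0, transfer_t n1, transfer_t s0, transfer_t s1) {
    uint32_t bits = 0;
    bits |= (uint32_t)depcheck(n0, s0) << 0;
    bits |= (uint32_t)depcheck(n0, s1) << 1;
    bits |= (uint32_t)depcheck(n1, s0) << 2;
    bits |= (uint32_t)depcheck(n1, s1) << 3;
    return bits;
}
```

In software the four checks run sequentially; the point of the SIMD instruction is that the same four results are produced by one operation, with the result bits read back afterwards.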


(1) C version: ((unsigned)(p0 - p1) < s1) || ((unsigned)(p1 - p0) < s0)
(2) + Merge instructions: asm_depcheck(p0, s0, p1, s1)
(3) + 64 bit Regs (R_X = {pX, sX}): asm_depcheck(R_0, R_1)
(4) + SIMD (4 comparisons): asm_depchecksimd4(R_0, R_1, R_2, R_3)
(5) + Explicit load instructions: asm_depchecksimd_LD4(R_2, R_3)

Figure 5. Dependency checking instruction evolution

B. Performance

In Fig. 6 the processing time of the CoreManager is shown on component level. The GSM benchmark is executed. Odd bars represent the Plain-C version; even bars belong to the VLIW+SIMD execution. For each execution the minimum, average and maximum processing time are given. A decrease in processing time can be observed as soon as VLIW+SIMD are applied. The most time consuming part for this benchmark is the dynamic data dependency checking. Furthermore, it can be seen that some components have a fixed execution time. These are, e.g., the PE allocation and task scheduling.

In Fig. 7 to Fig. 9 the scalability of the data dependency stage is analyzed by varying the number of data transfers and the number of tasks already in the task queue within the CoreManager. Results are shown for the Plain-C, VLIW, SIMD and VLIW+SIMD versions of the CoreManager. In Fig. 7 and Fig. 8 it is assumed that three and 15 tasks respectively are already in the task queue. The number of data transfers is changed for all tasks. The processing time is greatly reduced as soon as SIMD is applied. A minor improvement can be observed for the VLIW versions. In Fig. 9 the number of data transfers is set to a fixed value of four. The number of tasks in the queue is varied between one and 15. As in the example above, a major reduction in processing time can be observed in the SIMD version. Overall a reduction of up to 97 % is achieved.

Figure 6. CoreManager component comparison between plain-C and SIMD+VLIW implementation

Figure 7. Processing time of the initialization and dynamic data dependency checking stage for 3 tasks in the task queue

Figure 8. Processing time of the initialization and dynamic data dependency checking stage for 15 tasks in the task queue

Figure 9. Processing time of the initialization and dynamic data dependency checking stage for 4 transfers per task

TABLE II. AREA AND TIMING COMPARISON

 | Plain-C | VLIW | SIMD | VLIW+SIMD
Area (mm2) | 0.140 | 0.180 | 0.231 | 0.277
f (MHz) | 333 | 333 | 333 | 333

C. Area and Timing

In Table II area and frequency are shown for the CoreManager. All cores have been synthesized with Synopsys Design Compiler for a 65 nm low power process from TSMC using worst case conditions. Only logic area is evaluated. For timing correctness the interfaces to the local memories are integrated. The local memories themselves are not included in the area. For a fair comparison synthesis was done for a target frequency of 333 MHz. An overall area increase of 98 % can be observed, from 0.140 mm2 for the Plain-C version to 0.277 mm2 for the VLIW+SIMD version.

V. CONCLUSIONS AND OUTLOOK

In this paper a central scheduling unit, called CoreManager, was improved with a newly introduced instruction set architecture extension. It allows a faster processing of the dynamic data dependency checking, task scheduling, PE allocation and data transfer management. VLIW as well as SIMD is applied. The obtained results show an improvement for the dynamic data dependency checking stage of up to 97 %. Furthermore, all other stages are accelerated as well.

Future work aims at implementing a silicon prototype of the CoreManager in a heterogeneous MPSoC, including several types of processing elements as well as IO interfaces. Further optimizations of the architecture and algorithms will be investigated, especially with regard to performance, area and power consumption.
REFERENCES

[1] A. Wang, E. Killian, D. Maydan, and C. Rowen, "Hardware/software instruction set configurability for system-on-chip processors," Design Automation Conference, Proceedings.
[2] N. R. Potlapally, S. Ravi, A. Raghunathan, R. B. Lee, and N. K. Jha, "Configuration and Extension of Embedded Processors to Optimize IPSec Protocol Execution," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 15, no. 5, May.
[3] A. Chormoviti, N. Vassiliadis, G. Theodoridis, and S. Nikolaidis, "Enhancing Embedded Processors with Specific Instruction Set Extensions for Network Applications," Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, IDAACS, IEEE, 5-7 Sept.
[4] K. Asanovic et al., "The landscape of parallel computing research: a view from Berkeley," Electrical Engineering and Computer Sciences, University of California, Berkeley, Long Beach, CA, USA, Tech. Rep., Dec.
[5] C. R. Johns and D. A. Brokenshire, "Introduction to the Cell Broadband Engine Architecture," IBM Journal of Research and Development, vol. 51, no. 5, Sept.
[6] T. Limberg, M. Winter, M. Bimberg, R. Klemm et al., "A Heterogeneous MPSoC with Hardware Supported Dynamic Task Scheduling for Software Defined Radio," DAC/ISSCC Student Design Contest.
[7] O. Arnold and G. Fettweis, "On the Impact of Dynamic Task Scheduling in Heterogeneous MPSoCs," Embedded Computer Systems (SAMOS), 2011 International Conference on, pp. 17-24, July.
[8] M. Winter and G. Fettweis, "Guaranteed Service Virtual Channel Allocation in NoCs for Run-time Task Scheduling," in Proceedings of the Design Automation and Test in Europe (DATE'11), Grenoble, France, March.
[9] March
[10] March
[11] O. Arnold and G. Fettweis, "Power Aware Heterogeneous MPSoC with Dynamic Task Scheduling and Increased Data Locality for Multiple Applications," Embedded Computer Systems (SAMOS), 2010 International Conference on, July.
[12] J. Castrillon, D. Zhang, T. Kempf, B. Vanthournout, R. Leupers, and G. Ascheid, "Task Management in MPSoCs: An ASIP Approach," International Conference on Computer-Aided Design.
[13] P. Bellens, J. M. Perez, R. M. Badia, and J. Labarta, "CellSs: a Programming Model for the Cell BE Architecture," in SC '06, Proceedings of the Supercomputing conference.
[14] J. Lee, V. J. Mooney III, A. Daleby, K. Ingström, T. Klevin, and L. Lindh, "A comparison of the RTU hardware RTOS with a hardware/software RTOS," in ASP-DAC '03, Proceedings of the Asia and South Pacific Design Automation Conference, 2003.
