Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip


Mythili.R (1), Mugilan.D (2)
(1) PG Student, Department of Electronics and Communication, K S Rangasamy College of Technology, TN, India
(2) Assistant Professor, Department of Electronics and Communication, K S Rangasamy College of Technology, TN, India

Abstract - Multiprocessor system-on-chip (MPSoC) is an attractive solution for the increasing complexity and size of embedded applications. An MPSoC is an integrated circuit containing multiple instruction-set processors on a single chip that implements most of the functionality of a complex electronic system. While embedded systems have become increasingly complex, the increase in memory access speed has failed to keep up with processor speed. This makes memory access latency a major issue in scheduling embedded applications on embedded systems. Scheduling the tasks of an embedded application on the processors and partitioning the available scratch-pad memory (SPM) budget among those processors are two critical issues in complex embedded systems. This research focuses mainly on task scheduling and SPM partitioning to reduce the execution time of embedded applications. Equally partitioning the SPM among the processors reduces the computation time. To further reduce the computation time of these applications, the available SPM can be divided between the processors in any ratio. Pipelined scheduling allows tasks of different embedded application instances to be scheduled at each stage of the pipeline.

Keywords - Memory partitioning, multiprocessor system-on-chip, scratchpad memory, task scheduling.

I. INTRODUCTION

An MPSoC consists of multiple heterogeneous processing elements, an SPM memory hierarchy, and input/output components, which are linked together by an on-chip interconnect structure. MPSoC models use a memory hierarchy with slow off-chip memory and fast on-chip scratchpad memories. A larger SPM results in less computation time, since off-chip access is more expensive in terms of clock cycles than access to the fast on-chip SPM.

Cache memory in the processor is replaced by SPM. SPM has been employed as a partial or entire replacement for cache memory due to its better energy efficiency. An SPM consists only of decoding circuits, data arrays, and output units. Unlike a cache, an SPM does not require tag comparison. Due to its simplified architecture, SPM is more energy and area efficient than cache. The computation time of a program on a processor depends on how much SPM is allocated to that processor.

Execution time predictability is a critical issue for real-time embedded applications. Data caches are therefore not suitable, since it is hard to model their exact behaviour and to predict the execution time of programs. To alleviate such problems, many modern MPSoC systems use scratchpad memories, which contribute to better timing predictability.

Cellular phones, portable media players, and gaming consoles are examples of complex embedded applications consisting of multiple concurrent real-time tasks. Usually, tasks are scheduled first and the SPM budget is then partitioned among the processors. Such a decoupled technique may prevent better schedules in terms of minimizing the computation time of the whole application. Integrating those two steps improves performance.

II. METHODOLOGY

The embedded application is given to the MPSoC, which consists of multiple processors. The application is then divided into a number of tasks. These tasks are scheduled, and the memory is partitioned among the processors. Finally, the execution time is determined.

ISSN: Page 1058

Fig.1. Block Diagram of the Project. (Blocks: Embedded Application; MPSoC; Tasks; TDG; Task Scheduling & Memory Partitioning; Assigning the scheduled tasks with allocated memory to each processor; Execution Prediction.)

III. TASKS & TDG

Embedded applications usually consist of computation blocks, which are treated as tasks. An application program is divided into tasks, which are the various processes in the application. Whenever a program is executed, the operating system creates a new task for it. The task is like an envelope for the program. The state information of a task is represented by task states such as idle, running, ready, and blocked.

There are usually dependences between tasks that should be respected in the schedule. The problem formulation is based on a task dependence graph (TDG). A TDG is a directed acyclic graph with weighted edges in which each vertex represents a task in the embedded application. An edge from task Ti to task Tj in the TDG represents a scheduling order that must be enforced because Tj needs data to be transferred from Ti after Ti has executed. The weight of this edge is the communication cost. A processor cannot start executing a task until all the necessary data communication has been performed.

Each task can be mapped to any of the available processors. Since the processors in this architectural model can be heterogeneous, the execution time of each task depends on the processor to which the task is mapped as well as on the SPM allocated to that processor. Accessing a data variable from an SPM is usually on the order of 100 times faster than accessing it from off-chip memory.

Consider the example task graph shown below with six tasks, T1, T2, T3, T4, T5, and T6. Task T4 depends on tasks T1, T2, and T3, and task T6 depends on tasks T4 and T5.

Fig.2. An example TDG.

Whenever there is an edge between two tasks Ti and Tj, a communication cost must be accounted for, provided that the two tasks are allocated to different processors. Tasks T1, T2, T3, and T5 are ready to be scheduled in our example. Task T5 will not be scheduled at this point, based on its ALAP value. Thus, tasks T1 and T2 are first mapped to the two available processors. The scheduling algorithm then maps T3 to the processor that executed T2, as that processor becomes free first, since the computation time of T2 is less than that of T1. In a similar fashion, the scheduling algorithm assigns tasks T4 and T6 to the processor that executed T1, whereas task T5 is mapped to the other processor.

From the task schedule, it can be seen that task T4 can only start after the processor executing T2 has finished executing task T3. The issue now is to reduce the dead time between tasks T1 and T4 imposed by the computation time of tasks T2 and T3. To minimize this dead time, techniques usually allocate more SPM budget to the processor executing T2 and T3, to reduce the computation time of those tasks.

IV. TASK SCHEDULING & MEMORY PARTITIONING

Four approaches can be implemented to solve the task scheduling and memory allocation problem on MPSoC systems, namely:
- Decoupled task scheduling and memory partitioning assuming equally partitioned SPM among all available processors, TSMP EQUAL;
- Decoupled task scheduling and memory partitioning with SPM partitioned among different processors in any ratio, TSMP ANY;
- Integrated task scheduling and memory partitioning heuristic, TSMP INTEG;
- Integrated heuristic with pipelining, TSMP PIPE.

Unlike current approaches that treat task scheduling and memory partitioning as two separate problems, these two problems can be solved in an integrated fashion. An effective heuristic was developed for the task scheduling/memory partitioning problem for a multiprocessor system-on-chip where a single application uses the MPSoC at a time. The two steps are performed in an integrated fashion, where the private on-chip memory budget allocated to a processor is decided as tasks are mapped to that processor. The computation time of a task depends on the processor to which it is mapped, as well as on the SPM available for that task. Therefore, task scheduling should take into consideration the varying computation time of a task based on the processor and the SPM budget.

An embedded application is usually executed many times for a stream of input data on an MPSoC. Such multiple executions make embedded applications amenable to pipelined implementation. Pipeline scheduling benefits from allowing tasks of different embedded application instances to be scheduled at each stage of the pipeline. The objective is to decrease the pipeline stage time interval because, once the pipeline is full, one instance of the application completes every pipeline stage. The maximum number of stages is equal to the number of processors in the MPSoC system.

A. Decoupled TSMP using Cache Memory

At first, the schedule is produced assuming that no scratch-pad memories are available. Tasks T1, T2, T3, and T5 are ready to be scheduled in the example. Task T5 will not be scheduled at this point, based on its ALAP value. Thus, tasks T1 and T2 are first mapped to the two available processors. The scheduling algorithm maps T3 to the processor that executed T2, as it becomes free first, since the computation time of T2 is less than that of T1. In a similar fashion, the scheduling algorithm assigns tasks T4 and T6 to the processor that executed T1, whereas task T5 is mapped to the other processor.

Fig.3. Schedule Based on no SPM

B. Decoupled TSMP on Equal Partitioned SPM

Fig.4. Schedule on Equal Partitioned SPM

Fig.4 shows the schedule that results from partitioning the available SPM equally between the two processors. With such a criterion, the available SPM budget is divided equally between the two processors regardless of which tasks are mapped to which processor. The idle time can be reduced. Equally partitioned SPM reduces the computation time of the whole application.

C. Decoupled TSMP on Non equal Partitioned SPM

Fig.5. Schedule Based on Non equal Partitioned SPM

To further reduce the computation time of this application, the available SPM can be divided between the two processors in any ratio. From the task schedule, we can see that task T4 can only start after the processor executing T2 has finished executing task T3. The issue now is to reduce the dead time between tasks T1 and T4 imposed by the computation time of tasks T2 and T3. To minimize this dead time, techniques usually allocate more SPM budget to the processor executing T2 and T3, to reduce the computation time of those tasks.

D. Integrated TSMP

The problem with the previous schedule is that it allocated T3 to the same processor that is scheduled to execute T2. This choice is the reason for the dead time in the schedule, as T2 cannot benefit much from more SPM, which is clear from the Min, Avg, and Max values. A good heuristic should take these values into consideration: a better choice is to schedule T3 on the other processor, with all the available SPM allocated to that processor. The result is a schedule with the minimal end time.
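The interaction between the schedule and the SPM split described above can be illustrated with a small sketch. It is not the heuristic of this paper: it runs a plain greedy list scheduler on the example TDG for two processors and sweeps the SPM ratio by brute force. The priority order, the timing table, the linear timing model, and the communication cost are all hypothetical values chosen only for illustration.

```python
# Example TDG from the text: T4 needs T1, T2, T3; T6 needs T4 and T5.
PREDS = {"T1": [], "T2": [], "T3": [], "T5": [],
         "T4": ["T1", "T2", "T3"], "T6": ["T4", "T5"]}
# Hypothetical priority order (a stand-in for an ALAP-based ranking).
ORDER = ["T1", "T2", "T3", "T5", "T4", "T6"]
# Hypothetical timing model: (time with no SPM, time saved with the full SPM budget).
TIMING = {"T1": (8, 2), "T2": (4, 1), "T3": (6, 4),
          "T4": (5, 2), "T5": (3, 1), "T6": (4, 2)}
COMM = 1  # hypothetical cost of every edge that crosses processors


def comp_time(task, share):
    """Computation time of a task when its processor holds `share` of the SPM."""
    base, gain = TIMING[task]
    return base - gain * share


def schedule(shares):
    """Greedy list scheduling; shares[p] is the SPM fraction given to processor p."""
    free = [0.0] * len(shares)   # time at which each processor becomes idle
    placed = {}                  # task -> (processor, start, finish)
    for task in ORDER:
        best = None
        for p, share in enumerate(shares):
            # Data from a predecessor on another processor arrives COMM later.
            ready = max((placed[q][2] + (COMM if placed[q][0] != p else 0)
                         for q in PREDS[task]), default=0.0)
            start = max(free[p], ready)
            finish = start + comp_time(task, share)
            if best is None or finish < best[2]:
                best = (p, start, finish)
        placed[task] = best
        free[best[0]] = best[2]
    makespan = max(finish for _, _, finish in placed.values())
    return makespan, placed


# Equal partition versus the best ratio found by sweeping in steps of 10%.
equal_makespan, equal_plan = schedule([0.5, 0.5])
candidates = [(i / 10, 1 - i / 10) for i in range(11)]
best_shares = min(candidates, key=lambda s: schedule(list(s))[0])
best_makespan, best_plan = schedule(list(best_shares))

print("equal split makespan:", equal_makespan)
print("best split:", best_shares, "makespan:", best_makespan)
```

Because the sweep includes the 50/50 split, the best ratio it reports can never be worse than the equal partition under this model. How much better it is depends entirely on the assumed timing table.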


Fig.6. Schedule Based on Integrated Approach

E. Integrated TSMP with Pipelining

Pipeline scheduling allows tasks of different embedded application instances to be scheduled at each stage of the pipeline. Such a schedule does not necessarily decrease the computation time of one instance of the embedded application; rather, it decreases the time between the start times of two consecutive iterations of the task graph. Here, pipelining is implemented by storing the result of the previous task into memory while the current task is executing. This further reduces the computation time.

V. RESULTS AND DISCUSSION

The task dependence graph shown in Fig.2 is considered, and the implementation was done using the ModelSim software. The tasks are taken to be interpolation, Sum of Absolute Differences (SAD), Multiply and Accumulation (MAC), addition, subtraction, and multiplication from the MPEG4 encoder block.

A. Simulation Result for Decoupled TSMP using Cache Memory

The execution time obtained for the decoupled TSMP approach using cache memory is 800 ns.

B. Simulation Result for Decoupled TSMP on Equal Partitioned SPM

The decoupled TSMP approach using equally partitioned SPM needs 700 ns for execution.

C. Simulation Result for Decoupled TSMP on Non equal Partitioned SPM

The execution time obtained for the decoupled TSMP approach using non equal partitioned SPM is 600 ns.

D. Simulation Result for Integrated TSMP

The execution time obtained for the integrated TSMP approach using SPM is 500 ns.
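As a quick check of what the reported execution times mean relative to one another, the short calculation below expresses each result as a percentage reduction against the cache-based schedule. The four times are the ones reported above; the function and variable names are only illustrative.

```python
# Execution times reported in the simulation results above, in nanoseconds.
reported_ns = {
    "Decoupled TSMP using cache memory": 800,
    "Decoupled TSMP on equal partitioned SPM": 700,
    "Decoupled TSMP on non equal partitioned SPM": 600,
    "Integrated TSMP": 500,
}
baseline = reported_ns["Decoupled TSMP using cache memory"]


def reduction_percent(time_ns, baseline_ns):
    """Percentage by which time_ns is lower than baseline_ns."""
    return 100.0 * (baseline_ns - time_ns) / baseline_ns


reductions = {name: reduction_percent(t, baseline)
              for name, t in reported_ns.items()}
for name, pct in reductions.items():
    print(f"{name}: {reported_ns[name]} ns, {pct:.1f}% below the cache-based schedule")
```

Under these figures, each step from cache-based to equal, non equal, and integrated partitioning removes a further 100 ns, that is, 12.5% of the cache-based execution time.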


E. Simulation Result for Integrated TSMP with Pipelining

The integrated TSMP with pipelining approach needs 500 ns for executing the given tasks.

F. Comparison Result

The results obtained for the various approaches are compared in Fig.7. The comparison is done with 37k memory allocation for five different approaches.

Fig.7. Comparison between Various Approaches (frequency for approaches T0 to T4)
T0 -- Decoupled TSMP using Cache Memory
T1 -- Decoupled TSMP on Equal Partitioned SPM
T2 -- Decoupled TSMP on Non equal Partitioned SPM
T3 -- Integrated TSMP
T4 -- Integrated TSMP with Pipelining

The frequency values increase from one approach to the next, and hence the execution time is reduced in the implemented concept.

VI. CONCLUSION

An effective heuristic was presented that integrates task scheduling and memory partitioning of embedded applications on multiprocessor systems-on-chip with scratchpad memory. Compared to the widely used decoupled approach, this integrated approach significantly improved the results, since the appropriate partitioning of SPM space among the processors depends on the tasks scheduled on each of those processors, and vice versa. The execution time of the tasks scheduled on the processors is thus reduced using the various approaches: equally partitioned SPM, non equal partitioned SPM, the integrated approach, and the integrated approach with pipelining. Simulation results were obtained using the ModelSim software, and the frequency values were obtained using the Xilinx software.

AUTHORS PROFILE

Mythili.R received her B.E degree from Anna University, Coimbatore, India. She is currently pursuing her M.E degree at Anna University, Chennai, India. Her research area includes optimization of MPSoC and low power VLSI circuits.

Mugilan.D received his B.E degree from Erode Sengunthar Engineering College, Erode, India, in 2007, and his M.E degree from Kongu Engineering College, Erode, India, in 2009. He worked as an Assistant Professor at Maharaja Engineering College, Avinashi, India. Since 2010 he has been working as an Assistant Professor at K.S.Rangasamy College of Technology, Tamilnadu, India. His research is in the area of embedded systems and digital image processing. He is a life member of ISTE.
