Embedded Systems. For other titles published in this series, go to


Embedded Systems

Series Editors

Nikil D. Dutt, Department of Computer Science, Donald Bren School of Information and Computer Sciences, University of California, Irvine, Zot Code 3435, Irvine, CA, USA
Peter Marwedel, Informatik 12, TU Dortmund, Otto-Hahn-Str. 16, Dortmund, Germany
Grant Martin, Tensilica Inc., Scott Blvd., Santa Clara, CA 95054, USA

For other titles published in this series, go to

Akash Kumar, Henk Corporaal, Bart Mesman, Yajun Ha

Multimedia Multiprocessor Systems
Analysis, Design and Management

Dr. Akash Kumar
Eindhoven University of Technology, Eindhoven, Netherlands
and National University of Singapore, Electrical and Computer Engineering, Engineering Drive, Singapore, Singapore
eleak@nus.edu.sg

Prof. Dr. Henk Corporaal
Eindhoven University of Technology, Electrical Engineering, Den Dolech AZ Eindhoven, Netherlands
h.corporaal@tue.nl

Dr. Bart Mesman
Eindhoven University of Technology, Electrical Engineering, Den Dolech AZ Eindhoven, Netherlands
b.mesman@tue.nl

Asst. Prof. Yajun Ha
National University of Singapore, Electrical and Computer Engineering, Engineering Drive, Singapore, Singapore
elehy@nus.edu.sg

ISBN e-isbn DOI /
Springer Dordrecht Heidelberg London New York
Library of Congress Control Number:
Springer Science+Business Media B.V

No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Cover design: VTEX, Vilnius
Printed on acid-free paper
Springer is part of Springer Science+Business Media (

Preface

Preface and Outline

Computing systems have been around for a relatively short period; it is only since the invention of microprocessor systems in the early seventies that processors became affordable to everybody. Although this period is short, it is hard to imagine what current life would be like without computing systems. They have penetrated virtually all aspects of life. Most visible are PCs, smartphones and gaming consoles. Almost every household has several of them, counting up to billions worldwide. However, we are even more dependent on so-called embedded systems; these are computing systems which are usually not visible, but which largely determine the functionality of the surrounding system or equipment. They control factories, take care of security (using e.g. smart cameras), control your car and its engine (actually your car likely contains tens of embedded systems), calculate your travel route, take care of internet traffic, control your coffee machine, etc.; they make the surrounding system behave more intelligently. As a consequence we are dependent on them.

One of the reasons computing systems are everywhere is that they are cheap, and they are cheap because of the ongoing integration and miniaturization. While we needed a large room to install the earliest computing systems, now a processor can be integrated in a few square millimeters or less, while giving substantial performance in terms of operations per second.

Talking about performance, performance demands are constantly increasing. This holds especially for multi-media type systems, that is, systems which process streams of data, such as audio, all kinds of sensing data, video and graphics. To give an example, the next generation of smartphones is estimated to require about one Tera operations per second for its video, vision, graphics, GPS, audio and speech recognition capabilities, and to make matters worse, this performance has to be delivered in a small package and should leave your batteries operational for many days. To deliver this performance for such a small energy budget, computing systems will contain many processors. Some of these processors will be general purpose programmable; others need to be more tuned to the application domain in order to reach the required computational efficiency (in operations per second per Watt). Future multi-media systems will therefore be heterogeneous multi-core.

Multi-media will not be the sole domain of advanced smartphones and the like. Since cameras are getting very cheap, we predict that many future smart systems will be enhanced by adding vision capabilities like surveillance, recognition, virtual reality, visual control, and so on.

Multi-media systems can be dedicated to one specific application; however, an ongoing trend is to map multiple applications to the same system. This sharing of resources by several applications makes the system cheaper and more versatile, but substantially adds to its design complexity. Although integration is still following Moore's law, making computing systems cheaper, at least when counting chip area, their design becomes extremely complex, and therefore very costly. This is caused not only by the increased functionality and performance demands of the running applications, but also by other, non-functional, requirements like energy cost and real-time demands. Making systems functionally correct is already quite challenging, but making this functionality operate at the right speed, obeying severe throughput and latency requirements, can become a nightmare and may lead to many large debugging and redesign cycles. Therefore a more systematic design method is urgently needed, one which avoids large debugging cycles while taking real-time demands into account as an integral part of the design process.

Aim of This Book

In this book we focus on (streaming) multi-media systems, and in particular on the real-time aspects of these systems. These systems run multiple applications and are realized using multiple processing cores, some of them specialized for a specific target application domain. The goal of this book is to make you familiar with techniques to model and analyze these systems, both hardware and software, while taking timing requirements into account.
The models will be taken from the data flow modeling domain; in particular we will teach you how to use SDF (Synchronous Data Flow) models to specify and analyze your system. Although SDF is restricted in its expressive power, it has very good analysis power, and appropriate tools exist to perform this analysis. For example, timing properties and deadlock avoidance can be easily verified. It also allows for the calculation of appropriate buffer sizes for the inter-core communication buffers. Based on this specification and analysis, this book teaches you how to synthesize and implement the specified multi-processor system using FPGA technology. This synthesis is correct by construction, and therefore it avoids many debugging iterations. The FPGA implementation can also act as a quick prototype for a final silicon realization.

Mapping multiple applications to such a system requires a run-time manager. This manager is responsible for admission control, i.e., deciding whether a new application can be added to the other, already running applications, such that every application still meets its timing requirements. Once an application is admitted, the run-time manager is also responsible for controlling the resources and enforcing the right time budgets for all applications. This book will show several techniques for performing these management tasks.

You may not be satisfied by just running one set of applications. The set of running applications may regularly change, leading to multiple use cases. This adds new dimensions to the design process, especially when mapping to FPGAs; some of them will be treated at the end of this book. For example, how do you share the resources of an FPGA between multiple use cases so as to reduce the number of FPGA reconfigurations? Another dimension discussed is the estimation of the amount of FPGA resources needed by a set of use cases prior to the synthesis.

In short, you will learn how to map multiple applications, possibly divided into multiple use cases, to a multi-processor system, and you will be able to quickly realize such a system on an FPGA. All the theory discussed in this book is supported by a complete design flow called MAMPS, which stands for Multiple Applications Multi-Processor Synthesis. The flow is fully implemented and demonstrated by several examples.

Within the book we make several restrictions. A major one is that we mainly deal with soft real-time constraints. The techniques used in this book do not give hard real-time guarantees (unless indicated otherwise). The reason is that giving hard real-time guarantees may result in severe overestimation of the required resources, and therefore may give a huge performance and cost penalty. This does not mean that we do not emphasize research in hard real-time systems. On the contrary, it has been one of our major research themes for many years, and it is the focus of many of our current projects. The reader is referred to the book website (see below) for further information on our other projects.
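
As an illustration of the kind of analysis that SDF permits, the sketch below solves the balance equations of a small SDF graph to obtain its repetition vector, i.e. how often each actor must fire in one iteration so that every buffer returns to its initial fill level. A graph for which no such vector exists is inconsistent: it would either deadlock or need unbounded buffers. This is a minimal sketch and not part of the MAMPS tooling; the function name repetition_vector, the edge tuple layout (source, destination, tokens produced, tokens consumed) and the example graphs are assumptions made for this illustration. It assumes a connected graph and Python 3.9 or later (for math.lcm).

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src] * produced == q[dst] * consumed.

    actors: list of actor names (hypothetical layout chosen for this sketch).
    edges:  list of (src, dst, produced, consumed) tuples.
    Returns a dict mapping each actor to its smallest positive integer
    firing count, or None if the graph is inconsistent.
    """
    # Build an undirected view: moving along an edge scales the firing rate.
    neighbours = {a: [] for a in actors}
    for src, dst, produced, consumed in edges:
        neighbours[src].append((dst, Fraction(produced, consumed)))
        neighbours[dst].append((src, Fraction(consumed, produced)))

    # Fix the first actor's rate to 1 and propagate through the graph.
    rate = {actors[0]: Fraction(1)}
    stack = [actors[0]]
    while stack:
        a = stack.pop()
        for b, ratio in neighbours[a]:
            expected = rate[a] * ratio
            if b not in rate:
                rate[b] = expected
                stack.append(b)
            elif rate[b] != expected:
                return None  # two paths disagree: inconsistent graph

    if len(rate) != len(actors):
        raise ValueError("graph is not connected")

    # Scale the rational rates to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * scale) for a, r in rate.items()}


# Example: a0 produces 2 tokens per firing, a1 consumes 1, and so on.
consistent = repetition_vector(
    ["a0", "a1", "a2"],
    [("a0", "a1", 2, 1), ("a1", "a2", 1, 2), ("a2", "a0", 1, 1)],
)
inconsistent = repetition_vector(
    ["a", "b"],
    [("a", "b", 2, 1), ("b", "a", 1, 1)],
)
```

A consistent repetition vector is necessary but not sufficient for deadlock freedom: the initial tokens on every cycle of the graph must also allow one complete iteration to execute.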

Audience

This book covers a complete design trajectory for the design of multi-media systems running multiple applications; it includes both theory and practice. It is meant for all people interested in designing multi-media and other real-time multi-processor systems. It helps them to think and reason about timing requirements and offers them various modeling, analysis, design and management techniques needed to realize these complex systems first time right. The book is also meant for system-level architects who want to quickly make high-level estimates and system trade-offs based on solid modeling.

The book is also suitable for use within a postgraduate course. For this purpose we have included extensive introductory chapters on trends and challenges in multi-media systems, and on the theory behind application modeling and scheduling. In particular, data flow techniques are treated in some depth. In such a course the available tools will help students to get familiar with future design flows, and to bring the theory into practice.

Accompanying Material

Besides the printed copy, there is an accompanying book website at ele.tue.nl/~akash/mmsbook. This website contains further information about designing real-time systems and various links to other related research. In addition, accompanying slides can be found on the website. The slides can be used in a course, provided the copyright information is retained.

As mentioned, most of the presented techniques and algorithms are integrated in the MAMPS design flow. The corresponding tooling, and its documentation, can be found at The tools can be used online. The site contains a couple of tested examples to try out the tools. For collaborating partners, tools can be made available on request for development.

This book is based on research and development being performed at the TU/e, the Eindhoven University of Technology, within the PreMaDoNA project of the Electronic Systems group. PreMaDoNA stands for predictable matching of demands on networked architectures. Other results from this project and its follow-up projects can also be found by following the links on the book website.

Organization of This Book

This book is divided into seven chapters. The first two are introductory. The first one describes trends in multimedia systems; the second one goes into the theory behind data flow modeling and scheduling, and introduces the necessary notation. Chapter 3 describes our new iterative analysis method. Chapter 4 treats how to perform resource management. Chapter 5 describes the MAMPS design flow, which allows for quick realization of a system on an FPGA. Chapter 6 extends the system to support multiple use cases. Finally, Chap. 7 gives several conclusions and outlines the open problems that are not solved in this book.

Although the best way is to read all chapters in the presented order, some readers may find it convenient to skip parts on first reading. Chapters 3, 4 and 5 do not have strong interdependences. Therefore, after reading the first two chapters, readers can continue with any of Chaps. 3, 4 or 5. Chapter 6 depends on Chap. 5, so readers interested in mapping multiple use cases should first read Chap. 5.

Henk Corporaal

Contents

1 Trends and Challenges in Multimedia Systems
    Trends in Multimedia Systems Applications
    Trends in Multimedia Systems Design
    Key Challenges in Multimedia Systems Design
    Design Flow
    Book Overview

2 Application Modeling and Scheduling
    Application Model and Specification
    Introduction to SDF Graphs
    Comparison of Dataflow Models
    Performance Modeling
    Scheduling Techniques for Dataflow Graphs
    Analyzing Application Performance on Hardware
    Composability
    Static vs Dynamic Ordering
    Conclusions

3 Probabilistic Performance Prediction
    Basic Probabilistic Analysis
    Iterative Analysis
    Experiments
    Suggested Readings
    Conclusions

4 Resource Management
    Off-line Derivation of Properties
    On-line Resource Manager
    Achieving Predictability Through Suspension
    Experiments
    Suggested Readings
    Conclusions

5 Multiprocessor System Design and Synthesis
    Performance Evaluation Framework
    MAMPS Flow Overview
    Tool Implementation
    Experiments and Results
    Suggested Readings
    Conclusions

6 Multiple Use-cases System Design
    Merging Multiple Use-cases
    Use-case Partitioning
    Estimating Area: Does It Fit?
    Experiments and Results
    Suggested Readings
    Conclusions

7 Conclusions and Open Problems
    Conclusions
    Open Problems

About the Authors
Glossary
References
Index

List of Figures

Fig. 1.1 Comparison of world's first video console with one of the most modern consoles. (a) Odyssey, released in 1972, an example from first generation video game console (Odyssey 1972). (b) Sony PlayStation3, released in 2006, an example from the seventh generation video game console (PS3 2006)
Fig. 1.2 Increasing processor speed and reducing memory cost (Adee 2008)
Fig. 1.3 Comparison of speedup obtained by combining r smaller cores into a bigger core in homogeneous and heterogeneous systems (Hill and Marty 2008)
Fig. 1.4 The intrinsic computational efficiency of silicon as compared to the efficiency of microprocessors
Fig. 1.5 Platform-based design approach system platform stack
Fig. 1.6 Application performance as obtained with full virtualization in comparison to simulation
Fig. 1.7 Complete design flow starting from applications specifications and ending with a working hardware prototype on an FPGA
Fig. 2.1 Example of an SDF Graph
Fig. 2.2 SDF Graph after modeling auto-concurrency of 1 for the actor a
Fig. 2.3 SDF Graph after modeling buffer-size of 2 on the edge from actor a2 to a
Fig. 2.4 Comparison of different models of computation (Stuijk 2007)
Fig. 2.5 SDF Graph and the multi-processor architecture on which it is mapped
Fig. 2.6 Steady-state is achieved after two executions of a0 and one of a
Fig. 2.7 Example of a system with 3 different applications mapped on a 3-processor platform
Fig. 2.8 Graph with clockwise schedule (static) gives MCM of 11 cycles. The critical cycle is shown in bold
Fig. 2.9 Graph with anti-clockwise schedule (static) gives MCM of 10 cycles. The critical cycle is shown in bold. Here two iterations are carried out in one steady-state iteration
Fig. Deadlock situation when a new job, C arrives in the system. A cycle a1, b1, b2, c2, c3, a3, a1 is created without any token in it
Fig. Modeling worst case waiting time for application A in Fig.
Fig. SDF graphs of H263 encoder and decoder
Fig. Two applications running on same platform and sharing resources
Fig. Static-order schedule of applications in Fig. executing concurrently
Fig. Schedule of applications in Fig. executing concurrently when B has priority
Fig. 3.1 Comparison of various techniques for performance evaluation
Fig. 3.2 Two application SDFGs A and B
Fig. 3.3 Probability distribution of the time another actor has to wait when actor a is mapped on the resource
Fig. 3.4 SDFGs A and B with response times
Fig. 3.5 Different states an actor cycles through
Fig. 3.6 Probability distribution of the waiting time added by actor a to other actor when actor a is mapped on the resource with explicit waiting time probability
Fig. 3.7 SDF application graphs A and B updated after applying iterative analysis technique
Fig. 3.8 Iterative probability method. Waiting times and throughput are updated until needed
Fig. 3.9 Probability distribution of waiting time another actor has to wait when actor a is mapped on the resource with explicit waiting time probability for the conservative iterative analysis
Fig. Comparison of periods computed using different analysis techniques as compared to the simulation result (all 10 applications running concurrently). All periods are normalized to the original period
Fig. Inaccuracy in application periods obtained through simulation and different analysis techniques
Fig. Probability distribution of the time other actors have to wait for actor a2 of application F. a2 is mapped on processor 2 with a utilization of The overall waiting time measured is 12.13, while the predicted time is The conservative prediction for the same case is
Fig. Probability distribution of the time other actors have to wait for actor a5 of application G. a5 is mapped on processor 5 with a utilization of The overall waiting time measured is 4.49, while the predicted time is The conservative prediction for the same case is
Fig. Waiting time of actors of different applications mapped on Processor 2. The utilization of this processor
Fig. Waiting time of actors of different applications mapped on Processor 5. The utilization of this processor
Fig. Comparison of periods computed using iterative analysis techniques as compared to simulation results (all 10 applications running concurrently)
Fig. Change in period computed using iterative analysis with increase in the number of iterations for application A
Fig. Change in period computed using iterative analysis with increase in the number of iterations for application C
Fig. Comparison of periods with variable execution time for all applications. A new conservative technique is applied; the conservation mechanism is used only for the last iteration after applying the base iterative analysis for 10 iterations
Fig. Comparison of application periods when multiple actors of one application are mapped on one processor
Fig. Comparison of performance observed in simulation as compared to the prediction made using iterative analysis for real applications in a mobile phone
Fig. SDF model of Sobel algorithm for one pixel, and JPEG encoder for one macroblock
Fig. Architecture of the generated hardware to support Sobel and JPEG encoder
Fig. 4.1 Off-line application(s) partitioning, and computation of application(s) properties. Three applications photo taking, bluetooth and music playing, are shown above. The partitioning and property derivation is done for all of them, as shown for photo taking application, for example
Fig. 4.2 The properties of H263 decoder application computed off-line
Fig. 4.3 Boundary specification for non-buffer critical applications
Fig. 4.4 Boundary specification for buffer-critical applications or constrained by input/output rate
Fig. 4.5 On-line predictor for multiple application(s) performance
Fig. 4.6 Two applications running on same platform and sharing resources
Fig. 4.7 Schedule of applications in Fig. 4.6 running together. The desired throughput is 450 cycles per iteration
Fig. 4.8 Interaction diagram between user interface, resource manager, and applications in the system-setup
Fig. 4.9 Resource manager achieves the specified quality without interfering at the actor level
Fig. SDF graph of JPEG decoder modeled from description in (Hoes 2004)
Fig. Progress of H263 and JPEG when they run on the same platform in isolation and concurrently
Fig. With a resource manager, the progress of applications is closer to desired performance
Fig. Increasing granularity of control makes the progress of applications smoother
Fig. The time wheel showing the ratio of time spent in different states
Fig. Performance of applications H263 and JPEG with static weights for different time wheels. Both applications are disabled in the spare time, i.e. combination c0 is being used
Fig. Performance of applications H263 and JPEG with time wheel of 10 million time units with the other two approaches
Fig. 5.1 Ideal design flow for multiprocessor systems
Fig. 5.2 MAMPS design flow
Fig. 5.3 Snippet of H263 application specification
Fig. 5.4 SDF graph for H263 decoder application
Fig. 5.5 The interface for specifying functional description of SDF-actors
Fig. 5.6 Example of specifying functional behaviour in C
Fig. 5.7 Hardware topology of the generated design for H
Fig. 5.8 Architecture with Resource Manager
Fig. 5.9 An overview of the design flow to analyze the application graph and map it on the hardware
Fig. Xilinx Evaluation Kit ML605 with Virtex 6 LX240T (Xilinx 2010)
Fig. (Colour online) Layout of the Virtex-6 FPGA with 100 Microblazes highlighted in colour
Fig. Effect of varying initial tokens on JPEG throughput
Fig. 6.1 An example showing how the combined hardware for different use-cases is computed. The corresponding communication matrix is also shown for each hardware design
Fig. 6.2 The overall flow for analyzing multiple use-cases. Notice how the hardware flow executes only once while the software flow is repeated for all the use-cases
Fig. 6.3 Putting applications, use-cases and feasible partitions in perspective
Fig. 6.4 Increase in the number of LUTs and FPGA Slices used as the number of FSLs in design is increased
Fig. 6.5 Increase in the number of LUTs and FPGA Slices used as the number of Microblaze processors is increased

List of Tables

Table 2.1 The time which the scheduling activities assignment, ordering, and timing are performed is shown for four classes of schedulers. The scheduling activities are listed on top and the strategies on the left (Lee and Ha 1989)
Table 2.2 Table showing the deadlock condition in Fig.
Table 2.3 Estimating performance: iteration-count for each application in 3,000,000 time units
Table 2.4 Properties of scheduling strategies
Table 3.1 Probabilities of different queues with a
Table 3.2 Comparison of the time actors actually spend in different stages assumed in the model vs the time predicted
Table 3.3 Measured inaccuracy for period in % as compared with simulation results for iterative analysis. Both the average and maximum are shown
Table 3.4 The period of concurrently executing Sobel and JPEG encoder applications as measured or analyzed
Table 3.5 The number of clock cycles consumed on a Microblaze processor during various stages, and the percentage of error (both average and maximum) and the complexity
Table 4.1 Table showing how predictability can be achieved using budget enforcement. Note how the throughput changes by varying the ratio of time in different combinations
Table 4.2 Load (in proportion to total available cycles) on processing nodes due to each application
Table 4.3 Iteration count of applications and utilization of processors for different sampling periods for 100M cycles
Table 4.4 Time weights statically computed using linear programming to achieve desired performance
Table 4.5 Summary of related work (heterogeneous property is not applicable for uniprocessor schedulers)
Table 5.1 Comparison of various methods to achieve performance estimates
Table 5.2 Comparison of throughput for different applications obtained on FPGA with simulation
Table 5.3 Number of iterations of the two applications obtained by varying initial number of tokens i.e. buffer-size, in 100 million cycles
Table 5.4 Time spent on DSE of JPEG-H263 combination
Table 5.5 Comparison of various approaches for providing performance estimates
Table 6.1 Resource utilization for different components in the design
Table 6.2 Performance evaluation of heuristics used for use-case reduction and partitioning

MULTI-PROCESSOR SYSTEM-LEVEL SYNTHESIS FOR MULTIPLE APPLICATIONS ON PLATFORM FPGA

Akash Kumar, Shakith Fernando, Yajun Ha, Bart Mesman and Henk Corporaal
Eindhoven University of Technology, Eindhoven,