Universiteit van Amsterdam

Embedded systems are ubiquitous!

Andy D. Pimentel, IPA Lentedagen, 22 April, 2010

[Figure: the design productivity gap, 1981-2009. Complexity (logic transistors per chip) grows at 58%/yr compound; productivity (transistors per staff-month) grows at 21%/yr compound. Source: SEMATECH]
To close this productivity gap we require new, disruptive design methods & tools

The design of modern embedded systems becomes increasingly complex
  Need to support multiple applications and standards
    Just look at your mobile phone
Market pull: design better products faster
  High design productivity required
Design quality
  Real time, low cost, low power, flexible, no bugs
Multi-dimensional design space with many tradeoffs:
  Cost (silicon area, design time), performance, power consumption, flexibility, dependability, time-to-market, etc.

System complexity: trend towards heterogeneous Multi-Processor Systems on Chip (MP-SoCs), integrating
  Dedicated hardware blocks
  Embedded processor cores
  Reconfigurable components
  Network on Chip (NoC)
Now: up to 10s of processors
100s of on-chip processors are foreseen in a few years
Processors are the logic gates of the future!

A lot of challenging design steps!
  Decomposing applications for mapping onto an MP-SoC
  Hardware/software partitioning of applications
  Modeling and simulating MP-SoC architecture(s)
    At various levels of abstraction
  Efficient (and early!) exploration of design options
    Architecture trade-offs
    Different mappings and HW/SW partitionings
  System synthesis/implementation and mapping application(s) onto the system
Different tools/tool-flows are usually needed
  Interoperability problems!

Background
  MP-SoC design
The Daedalus design-flow
  Automatic parallelization of streaming applications
  System-level modeling and simulation for DSE
  System-level synthesis in a plug-and-play fashion
  How all tools fit together
A JPEG case study
Conclusions

[Figure: from streaming application(s) to a NoC-based MP-SoC]
  Steps: system-level design space exploration, programming/mapping, prototyping
  Streaming application(s): processes P1-P5 connected by FIFO1-FIFO7
  NoC-based MP-SoC: PPC, DSP and µp processors with memories (MEM), FIFOs and switches

[Figure: the Daedalus design flow]
  Sequential application
  Automatic parallelization
  Common XML interface: platform specification, mapping specification, parallel application specification
  System-level design space exploration, using high-level models (explore, modify, select instances)
  Library of IP cores
  System-level synthesis, using RTL-level models
  Multi-processor System on Chip (synthesizable VHDL and C/C++ code for processors)

[Figure: the role of the KPNgen tool]
  Sequential application specification (the loop nest listed on the following slide): EASY to specify, DIFFICULT to map
  Parallel application specification (processes Source, P1-P4, S1 and Sink connected by FIFO1-FIFO7): DIFFICULT to specify, EASY to map
  Target platform: PPC 1, PPC 2, µb 1 and µb 2, each with a memory (MEM) and a communication controller (CC)

Sequential Application Specification

    for j = 1:1:N,
        [x(j)] = Source1( );
    end
    for i = 1:1:K,
        [y(i)] = Source2( );
    end
    for j = 1:1:N,
        for i = 1:1:K,
            [y(i), x(j)] = F( y(i), x(j) );
        end
    end
    for i = 1:1:K,
        [Out(i)] = Sink( y(i) );
    end

Affine Nested Loop programs (C/C++)
  Example: the sequential application specification listed on the previous slide
KPNgen
  Transformations, dependency analysis, and linearization
Parallel application instances: Kahn Process Networks
  System-level synthesis
  System-level simulation & DSE
  Functional verification & analysis
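To make the idea of turning the loop nest into a Kahn Process Network concrete, here is a minimal Python sketch. It is hand-written for illustration and is not what KPNgen generates: the process split, the FIFO names, and the bodies of `source1`, `source2` and `f` are all assumptions. Each process runs in its own thread and communicates only through unbounded FIFOs with blocking reads, which is the defining property of a KPN. The sketch assumes N >= 1.

```python
import threading
from queue import Queue

# Hypothetical stand-ins for Source1, Source2 and F from the slide.
def source1(j): return j + 1
def source2(i): return 10 * (i + 1)
def f(y, x): return y + x, x + 1

def sequential(n, k):
    """Direct transcription of the sequential loop nest; returns the values given to Sink."""
    x = [source1(j) for j in range(n)]
    y = [source2(i) for i in range(k)]
    for j in range(n):
        for i in range(k):
            y[i], x[j] = f(y[i], x[j])
    return y

def kpn(n, k):
    """Same computation as four processes connected by FIFOs (requires n >= 1)."""
    fifo_x, fifo_y, fifo_loop, fifo_out = Queue(), Queue(), Queue(), Queue()
    result = []

    def p_source1():
        for j in range(n):
            fifo_x.put(source1(j))

    def p_source2():
        for i in range(k):
            fifo_y.put(source2(i))

    def p_f():
        for j in range(n):
            x = fifo_x.get()                      # blocking read
            for i in range(k):
                # y comes from Source2 in the first sweep, afterwards from the self-loop.
                y = (fifo_y if j == 0 else fifo_loop).get()
                y, x = f(y, x)
                # The last sweep forwards y to the sink, earlier sweeps recycle it.
                (fifo_out if j == n - 1 else fifo_loop).put(y)

    def p_sink():
        for _ in range(k):
            result.append(fifo_out.get())

    threads = [threading.Thread(target=p) for p in (p_source1, p_source2, p_f, p_sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Because every process reads its inputs in a fixed order, the network produces the same output as the sequential program regardless of how the threads are scheduled.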

[Figure: the Daedalus design flow, with the parallelization tool named]
  Sequential application
  KPNgen (automatic parallelization)
  Common XML interface: platform specification, mapping specification, parallel application specification
  System-level design space exploration, using high-level models (explore, modify, select instances)
  Library of IP cores
  System-level synthesis, using RTL-level models
  Multi-processor System on Chip (synthesizable VHDL and C/C++ code for processors)

Application model
  Description of functional behavior of an application
  Independent from architecture, HW/SW partitioning and timing characteristics
  Generates application events representing the workload imposed on the architecture
Architecture model
  Parameterized timing behavior of architecture components
  Models timing consequences of application events
Explicit mapping of application and architecture models
  Trace-driven co-simulation
  Easy reuse of both application and architecture models!
[Figure: the application model passes traces of application events to the architecture model]

[Figure: the three modeling layers]
  Application model (Kahn Process Network): Process A, Process B, Process C, Process D
  Mapping layer (mapping, scheduling and event refinement)
  Cycle-approximate architecture model: Processor 1, Processor 2, Processor 3, shared memory

[Figure: the three modeling layers, with a latency table attached to Processor 1]
  Application model (Kahn Process Network): Process A, Process B, Process C, Process D
  Mapping layer (mapping, scheduling and event refinement)
  Cycle-approximate architecture model: Processor 1, Processor 2, Processor 3, shared memory

  Op.   Cycles
  X     750
  Y     150
  Z     1500
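The trace-driven idea can be illustrated with a deliberately simplified Python sketch. The operation latencies X, Y and Z are taken from the slide's table (their pairing with the cycle counts is reconstructed from the transcript); the event traces and the process-to-processor mapping are invented for illustration. The sketch only accumulates computation cycles per processor and ignores communication, synchronization and scheduling, all of which a real simulator would model.

```python
# Cycles per operation; one shared table is assumed for all processors.
LATENCY = {"X": 750, "Y": 150, "Z": 1500}

# Hypothetical event traces emitted by the application model, one per process.
TRACES = {
    "A": ["X", "Y"],
    "B": ["Z"],
    "C": ["Y", "Y", "X"],
    "D": ["X"],
}

# Hypothetical mapping of processes onto processors.
MAPPING = {"A": "Processor 1", "B": "Processor 1", "C": "Processor 2", "D": "Processor 3"}

def busy_cycles(traces, mapping, latency):
    """Sum the latency of every application event on the processor it is mapped to."""
    busy = {}
    for process, events in traces.items():
        processor = mapping[process]
        busy[processor] = busy.get(processor, 0) + sum(latency[e] for e in events)
    return busy

cycles = busy_cycles(TRACES, MAPPING, LATENCY)
bottleneck = max(cycles, key=cycles.get)  # the most heavily loaded processor
```

Changing only `MAPPING` or `LATENCY` re-evaluates a different design point without touching the traces, which is the reuse property the slides emphasize.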

[Figure: the modeling layers with an abstract (RT)OS model]
  Application model (Kahn Process Network): Process A, Process B, Process C, Process D
  Abstract (RT)OS model
  Cycle-approximate architecture model: Processor 1, Processor 2, Processor 3, shared memory

Targets efficient evaluation of different
  Application-to-architecture mappings
  Hardware/software partitionings
  MP-SoC architectures
    Different type and number of processing cores, interconnects (NoCs), scheduling policies, etc.
Provides approximations/insight on
  Cycle times, system utilization, bottlenecks/resource contention
Low modeling effort and high simulation speed
  Modeling in a matter of hours/days
  Typically, a full system-level MP-SoC simulation takes less than 1 second on an average laptop

[Figure: the design space exploration loop]
  GA-based multi-objective optimization produces individuals (i.e. candidate platforms and mappings)
  Sesame system-level simulation, fed with the application model and platform components, evaluates them
  Performance, power and cost of individuals are returned to the optimizer
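In a multi-objective search there is usually no single best individual, only a set of non-dominated trade-offs. The Python sketch below shows just that selection criterion (Pareto dominance) on made-up data; it is not a genetic algorithm and not Sesame's implementation. Each individual is a hypothetical tuple (cycles, power, cost), and all three objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(individuals):
    """Keep the individuals that no other individual dominates, in input order."""
    return [p for p in individuals if not any(dominates(q, p) for q in individuals)]

# Hypothetical simulation results: (cycles, power, cost) per candidate platform/mapping.
candidates = [(100, 5, 3), (120, 4, 3), (130, 6, 4), (100, 5, 2)]
front = pareto_front(candidates)
```

Here `(130, 6, 4)` and `(100, 5, 3)` are discarded because another candidate is at least as good in every objective; the two survivors trade execution time against power.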

[Figure: the Daedalus design flow, with the parallelization and DSE tools named]
  Sequential application
  KPNgen (automatic parallelization)
  Common XML interface: platform specification, mapping specification, parallel application specification
  Sesame (DSE), using high-level models (explore, modify, select instances)
  Library of IP cores
  System-level synthesis, using RTL-level models
  Multi-processor System on Chip (synthesizable VHDL and C/C++ code for processors)

[Figure: system-level synthesis with ESPAM]
System-level specification
  Platform spec in XML
  Mapping spec in XML
  KPN in XML (derived from the application by KPNgen)
ESPAM, using the library of IP cores
RTL-level specification
  Platform topology description
  IP cores in VHDL
  C/C++ code for processors
  Auxiliary files
Xilinx Platform Studio (XPS) tool
Gate-level specification
  Program code for Processor 1, Processor 2 and Processor 3 on a VirtexII-Pro FPGA

Library of parameterized components:
  Processing components: PowerPC (PPC), MicroBlaze (µb), or dedicated HW IP blocks
  Memory components:
    Program/Data Memory (MEM): random access
    Communication Memory (CM): FIFO access
  Communication components:
    Point-to-point network
    Crossbar switch
    Shared bus with Round-Robin, Fixed Priority, or TDMA arbitration
  Communication Controller (CC): interface between processing, memory, and communication components
[Figure: processors PPC 1 to PPC n and µb 1 to µb m, each with its own MEM, attached through CC and CM blocks to a communication component]
Many alternative platforms can be easily constructed by instantiating different type/number of components and setting their parameters
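As an illustration of two of the bus arbitration policies named above, the following Python sketch models a single arbitration decision. It is a behavioral toy under assumed conventions (requesters are numbered from 0, index 0 has the highest fixed priority), not the library's actual arbiter, which the slides describe as VHDL IP.

```python
def fixed_priority_grant(requests):
    """Grant the lowest-numbered requester; returns None if nobody requests the bus."""
    for index, requesting in enumerate(requests):
        if requesting:
            return index
    return None

def round_robin_grant(requests, last_granted):
    """Grant the first requester found after the one served last, wrapping around."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None

# Requesters 0 and 2 both want the bus; requester 0 was served last.
requests = [True, False, True]
rr_winner = round_robin_grant(requests, last_granted=0)
fp_winner = fixed_priority_grant(requests)
```

With fixed priority, requester 0 wins every time it asks; round-robin rotates the starting point so that requester 2 is served next, which avoids starvation at the cost of less predictable latency for any single master.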

Takes relatively short amount of time: a multiprocessor system with 8 processors

                      KPN Derivation   System-level to RTL Conversion   Physical Implementation
KPNgen                00:00:22         --                               --
ESPAM tool            --               00:00:24                         --
XPS tool              --               --                               02:09:00
Manual Manipulation   00:30:00         00:10:00                         --
Total Time            02:49:46

Simple exploration of the performance of alternative MP-SoCs is feasible even at implementation level in several hours
The accuracy is 100%
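The total in the table can be checked by adding the individual entries. The short Python sketch below does that with the figures from the slide; the dictionary keys are just descriptive labels chosen for this example.

```python
from datetime import timedelta

def hms(text):
    """Parse an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

# Durations as listed on the slide.
tool_steps = {
    "KPNgen (KPN derivation)": "00:00:22",
    "ESPAM (system-level to RTL)": "00:00:24",
    "XPS (physical implementation)": "02:09:00",
}
manual_steps = {
    "manual (KPN derivation)": "00:30:00",
    "manual (system-level to RTL)": "00:10:00",
}

tool_time = sum((hms(t) for t in tool_steps.values()), timedelta())
manual_time = sum((hms(t) for t in manual_steps.values()), timedelta())
total_time = tool_time + manual_time
```

The sum reproduces the 02:49:46 reported on the slide, of which 40 minutes are manual work and the rest is tool run time dominated by XPS.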

Model refinement techniques
Design space pruning techniques
System-level power models
Multi-application system design
  Introducing the notion of workload scenarios
Adaptive and dynamic systems
  Applications, mappings and architectures
DSE support framework
  Experimental support + analysis support (visualization)
Relaxing input constraints of tools

Background
  MP-SoC design
The Daedalus design-flow
  Automatic parallelization of (streaming) applications
  System-level modeling and simulation for DSE
  System-level synthesis in a plug-and-play fashion
  How all tools fit together
A JPEG case study
Conclusions

Image processing solutions for customers that build
  Medical appliances
    Very high resolution images
  Industrial process monitoring
    Very high frame rate
Chess B.V. deployed Daedalus
  Still image JPEG compression system
  Very fast evaluation (exploration and implementation) of alternative systems (MP-SoCs)
  Trade-off between cost, design time, performance, etc.

[Figure: the JPEG encoder KPN. The image is split into tiles and each JPEG encoder instance handles one tile; inside an encoder, Vin feeds parallel DCT and Q stages (DCT1/Q1 to DCT8/Q8), followed by VLE and Vout, which writes the .jpg output]
  Tile = 128 MacroBlocks
  MacroBlock = 2 Yblocks + 1 Ublock + 1 Vblock
  Yblock = 64 pixels, Ublock = 64 pixels, Vblock = 64 pixels
  Output per tile: packet of bytes (compressed byte sequence for the tile)
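The block sizes on this slide determine how much raw data one tile carries. The Python sketch below works that out; the block counts and sizes come from the slide, while the assumption of one byte per sample (8-bit data) is mine and is not stated in the source.

```python
SAMPLES_PER_BLOCK = 64          # each Y, U or V block holds 64 pixels
BLOCKS_PER_MACROBLOCK = 2 + 1 + 1  # 2 Y blocks + 1 U block + 1 V block
MACROBLOCKS_PER_TILE = 128

BYTES_PER_SAMPLE = 1            # assumption: 8-bit samples

samples_per_macroblock = BLOCKS_PER_MACROBLOCK * SAMPLES_PER_BLOCK
samples_per_tile = MACROBLOCKS_PER_TILE * samples_per_macroblock
raw_tile_kib = samples_per_tile * BYTES_PER_SAMPLE / 1024
```

Under the one-byte assumption a raw tile is 32 KiB, which gives a sense of scale next to the 288KB on-chip memory limit quoted for the FPGA in the case study.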

MP-SoCs consist of MicroBlaze softcores and/or dedicated HW components
  Point-to-point connections
IP component library contains
  High-level HW component model for all tasks (Sesame)
  RTL HW model only for DCT task (ESPAM)
MP-SoC implementations on FPGA are constrained by the on-chip memory (288KB)

Single JPEG encoder DSE:

Architecture instances for a single-tile JPEG encoder:
  2 MicroBlaze processors (50KB)
    Vin,DCT (16KB); Q,VLE,Vout (32KB); 2KB connecting memory
  1 MicroBlaze, 1 HW DCT (36KB)
    Vin,Q,VLE,Vout (32KB); HW DCT; 4KB connecting memory
  6 MicroBlaze processors (120KB)
    Vin (8KB); four DCT,Q processors (4x16KB); VLE,Vout (32KB); 4x2KB + 4x2KB connecting memory
  4 MicroBlazes, 2 HW DCT (68KB)
    Vin (8KB); two HW DCTs; two Q processors (2x8KB); VLE,Vout (32KB); 6x2KB connecting memory

Multi JPEG encoder MP-SoCs:

JPEG case study, homogeneous systems (32 tiles):
[Figure: homogeneous MP-SoC configurations and their speed-ups; values labelled in the chart include 7.4x, 8.4x, 9.7x and 10.3x]

JPEG case study, heterogeneous systems (32 tiles):
[Figure: heterogeneous MP-SoC configurations with HW DCT components and their speed-ups; values labelled in the chart include 3.8x, 15.2x, 15.9x, 17.7x and 19.7x]

We performed the DSE study ( 5% error) and the implementation of 25 MP-SoC JPEG encoder variations on an FPGA in only 5 days!
Combining data and task parallelism: 24 cores, 19.7x speed-up, 288KB memory
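A quick way to read the last figure is as parallel efficiency, that is, speed-up divided by the number of cores. The Python lines below compute it from the two numbers on the slide; the assumption that the 19.7x speed-up is measured against a single-core baseline is mine.

```python
cores = 24
speed_up = 19.7  # assumed to be relative to a single-core system

# Parallel efficiency: fraction of the ideal linear speed-up that is achieved.
efficiency = speed_up / cores
```

Under that assumption the system reaches roughly 82% of ideal linear scaling. The heterogeneous design includes hardware DCT components, so the per-core figure is only a rough indicator rather than a like-for-like comparison.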

Daedalus: historical figure from Greek mythology
  Means "cunning worker"
  He was an innovator in many arts
  Daedalus was the father of Icarus
Analogy:
  It's new, disruptive technology
  But there are still limitations
  Don't fall into the sea!

Merits of the Daedalus design-flow:
  Automated parallelization of media/streaming applications into parallel specifications (KPNs)
  Automated synthesis of MP-SoC platforms at system level, in a plug-and-play fashion
  Automated mapping of parallel application specifications onto MP-SoC platform
  Steering by means of efficient system-level design space exploration
All of this in a matter of hours

Mark Thompson, Cagkan Erbas, Simon Polstra, Toktam Taghavi, Peter van Stralen, Stanley Jaddoe, Joseph Coffland, Berry van Halderen, the OOTI@TU/e 2006 trainees
Ed Deprettere, Bart Kienhuis, Todor Stefanov, Hristo Nikolov, Paul Lieverse, Sven Verdoolaege, Kai Huang, Ji Gu, Wei Zhong, Ying Tao

For more information: http://daedalus.liacs.nl/ or email: a.d.pimentel@uva.nl