Universiteit van Amsterdam

Embedded systems are ubiquitous!

Andy D. Pimentel, IPA Lentedagen, 22 April, 2010

[Figure: The design productivity gap, 1981-2009. Complexity (logic transistors per chip) grows at a compound rate of 58% per year, while design productivity (transistors per staff-month) grows at a compound rate of 21% per year. To close this gap we require new, disruptive design methods and tools. Source: SEMATECH]

The design of modern embedded systems becomes increasingly complex
- Need to support multiple applications and standards
  - Just look at your mobile phone
- Market pull: design better products faster
  - High design productivity required
- Design quality
  - Real time, low cost, low power, flexible, no bugs
- Multi-dimensional design space with many trade-offs:
  - Cost (silicon area, design time), performance, power consumption, flexibility, dependability, time-to-market, etc.

System complexity: trend towards heterogeneous Multi-Processor Systems on Chip (MP-SoCs), integrating
- Dedicated hardware blocks
- Embedded processor cores
- Reconfigurable components
- Network on Chip (NoC)
Now: up to 10s of processors
- 100s of on-chip processors are foreseen in a few years
Processors are the logic gates of the future!

A lot of challenging design steps!
- Decomposing applications for mapping onto an MP-SoC
  - Hardware/software partitioning of applications
- Modeling and simulating MP-SoC architecture(s)
  - At various levels of abstraction
- Efficient (and early!) exploration of design options
  - Architecture trade-offs
  - Different mappings and HW/SW partitionings
- System synthesis/implementation and mapping application(s) onto the system
Different tools/tool-flows are usually needed
- Interoperability problems!


- Background
  - MP-SoC design
- The Daedalus design-flow
  - Automatic parallelization of streaming applications
  - System-level modeling and simulation for DSE
  - System-level synthesis in a plug-and-play fashion
  - How all tools fit together
- A JPEG case study
- Conclusions

[Figure: System-level design space exploration, programming/mapping, and prototyping. Streaming application(s), drawn as processes P1-P5 connected by FIFO1-FIFO7, are mapped onto a NoC-based MP-SoC built from processors (PPC, DSP, µp), memories (MEM), FIFOs and switches.]

[Figure: The Daedalus design flow. A sequential application is turned into a parallel application specification by automatic parallelization. Together with a platform specification and a mapping specification, exchanged through a common XML interface, it feeds system-level design space exploration on high-level models (explore, modify, select instances). Using a library of IP cores, system-level synthesis then produces RTL-level models: a multi-processor system on chip as synthesizable VHDL plus C/C++ code for the processors.]

[Figure: A sequential application specification is EASY to specify but DIFFICULT to map onto an MP-SoC. A parallel application specification (Source, P1-P4 and Sink connected by FIFO1-FIFO7) is DIFFICULT to specify but EASY to map. The KPNgen tool converts the former into the latter, which is then programmed onto a platform with PowerPC (PPC) and MicroBlaze (µb) processors, communication controllers (CC) and memories (MEM).]

Sequential Application Specification

    for j = 1:1:N,
      [x(j)] = Source1( );
    end
    for i = 1:1:K,
      [y(i)] = Source2( );
    end
    for j = 1:1:N,
      for i = 1:1:K,
        [y(i), x(j)] = F( y(i), x(j) );
      end
    end
    for i = 1:1:K,
      [Out(i)] = Sink( y(i) );
    end

Affine Nested Loop programs (C/C++)
- Input: a sequential application specification written as an affine nested loop program
- KPNgen: transformations, dependency analysis, and linearization
- Output: parallel application instances (Kahn Process Networks), used for
  - System-level synthesis
  - System-level simulation & DSE
  - Functional verification & analysis
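To make the target model concrete, here is a minimal sketch of a Kahn Process Network in Python. It is not KPNgen output: the three processes, the squaring operation, the `None` end-of-stream marker and the FIFO depth of 2 are all assumptions made for illustration. Each process runs in its own thread and communicates only through FIFOs with blocking reads, which is what makes the result independent of thread scheduling.

```python
import queue
import threading

END = None  # hypothetical end-of-stream marker, not part of the KPN model itself


def producer(out_fifo, n):
    """Source process: writes n tokens, then the end marker."""
    for j in range(n):
        out_fifo.put(j)  # blocks while the bounded FIFO is full
    out_fifo.put(END)


def worker(in_fifo, out_fifo):
    """Transform process: blocking read, compute, write."""
    while True:
        token = in_fifo.get()  # blocks until a token is available
        if token is END:
            out_fifo.put(END)
            return
        out_fifo.put(token * token)


def consumer(in_fifo, results):
    """Sink process: collects tokens in arrival order."""
    while True:
        token = in_fifo.get()
        if token is END:
            return
        results.append(token)


def run_network(n):
    """Build the three-process network, run it, and return the sink's output."""
    fifo1 = queue.Queue(maxsize=2)
    fifo2 = queue.Queue(maxsize=2)
    results = []
    threads = [
        threading.Thread(target=producer, args=(fifo1, n)),
        threading.Thread(target=worker, args=(fifo1, fifo2)),
        threading.Thread(target=consumer, args=(fifo2, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_network(5))  # [0, 1, 4, 9, 16]
```

Kahn's original model uses unbounded FIFOs; the bounded queues with blocking writes used here are a common implementation choice and do not change the output of this particular network.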

[Figure: The Daedalus design flow, with KPNgen filling in the automatic parallelization step that turns the sequential application into the parallel application specification.]

- Application model
  - Description of functional behavior of an application
  - Independent from architecture, HW/SW partitioning and timing characteristics
  - Generates application events representing the workload imposed on the architecture
- Architecture model
  - Parameterized timing behavior of architecture components
  - Models timing consequences of application events
- Explicit mapping of application and architecture models
  - Trace-driven co-simulation
  - Easy reuse of both application and architecture models!
[Figure: the application model sends traces of application events to the architecture model.]

[Figure: The three modeling layers. Application model (Kahn Process Network with Processes A, B, C and D); mapping layer (mapping, scheduling and event refinement); cycle-approximate architecture model (Processors 1, 2 and 3 and a shared memory).]

[Figure: The three modeling layers as before: application model (Kahn Process Network with Processes A, B, C and D), mapping layer (mapping, scheduling and event refinement), and cycle-approximate architecture model (Processors 1, 2 and 3 and a shared memory). Processor 1 is annotated with a latency table:]

    Op.   Cycles
    X     150
    Y     750
    Z     1500
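The idea of trace-driven co-simulation with per-processor latency tables can be sketched in a few lines of Python. This is an illustration under assumptions, not Sesame's implementation: the processor names `P1` and `P2`, the trace contents and the `P2` latencies are made up, and only the `P1` cycle counts (150, 750, 1500) come from the slide. The sketch also ignores communication, scheduling and contention, which the real mapping layer and architecture model account for.

```python
from collections import defaultdict

# Cycles per operation for each processor.
# P1 values are taken from the slide; P2 values are hypothetical.
LATENCY = {
    "P1": {"X": 150, "Y": 750, "Z": 1500},
    "P2": {"X": 300, "Y": 400, "Z": 900},
}


def application_trace():
    """Application model: yields (process, operation) events with no timing."""
    for _ in range(2):
        yield ("A", "X")
        yield ("B", "Y")
    yield ("A", "Z")


def simulate(trace, mapping, latency):
    """Architecture model: charge each event to the processor its process is mapped on."""
    busy_cycles = defaultdict(int)
    for process, operation in trace:
        processor = mapping[process]
        busy_cycles[processor] += latency[processor][operation]
    return dict(busy_cycles)


if __name__ == "__main__":
    # Same application model, two different mappings.
    print(simulate(application_trace(), {"A": "P1", "B": "P1"}, LATENCY))
    print(simulate(application_trace(), {"A": "P1", "B": "P2"}, LATENCY))
```

Because the trace carries no timing, the same application model can be replayed against any mapping or latency table, which is the reuse property the slides emphasize.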

[Figure: The three modeling layers, with the middle layer shown as an abstract (RT)OS model between the application model (Kahn Process Network with Processes A, B, C and D) and the cycle-approximate architecture model (Processors 1, 2 and 3 and a shared memory).]

Targets efficient evaluation of different
- Application-to-architecture mappings
- Hardware/software partitionings
- MP-SoC architectures
  - Different type and number of processing cores, interconnects (NoCs), scheduling policies, etc.
Provides approximations/insight on
- Cycle times, system utilization, bottlenecks/resource contention
Low modeling effort and high simulation speed
- Modeling in a matter of hours/days
- Typically, a full system-level MP-SoC simulation takes less than 1 second on an average laptop

[Figure: GA-based multi-objective optimization loop. The optimizer generates individuals (i.e. candidate platforms and mappings) from the application model and the platform components; Sesame system-level simulation returns the performance, power and cost of each individual.]
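Multi-objective optimization over performance, power and cost does not yield a single best design but a set of non-dominated trade-offs. The sketch below shows the Pareto dominance test that multi-objective genetic algorithms typically build on. It is not the optimizer used with Sesame: the objective tuples are made-up numbers, and all three objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]


if __name__ == "__main__":
    # Hypothetical (cycles, power, cost) results for four candidate designs.
    candidates = [(100, 5, 3), (120, 4, 3), (130, 6, 4), (90, 7, 5)]
    print(pareto_front(candidates))  # [(100, 5, 3), (120, 4, 3), (90, 7, 5)]
```

Here `(130, 6, 4)` is dropped because `(100, 5, 3)` is better in all three objectives; the other three each win on at least one objective and therefore remain as trade-off options.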

[Figure: The Daedalus design flow, with Sesame (DSE) filling in the system-level design space exploration step and KPNgen the automatic parallelization step.]

[Figure: The ESPAM tool flow.
- System-level specification: KPNgen converts the application into a KPN in XML; this, a platform spec in XML and a mapping spec in XML are the inputs to ESPAM, which also uses the library of IP cores.
- RTL-level specification (ESPAM output): platform topology description, IP cores in VHDL, C/C++ code for processors, and auxiliary files.
- Gate-level specification: the Xilinx Platform Studio (XPS) tool produces the implementation and the program code for Processors 1, 2 and 3 on a Virtex-II Pro FPGA.]

Library of parameterized components:
- Processing components: PowerPC (PPC), MicroBlaze (µb), or dedicated HW IP blocks
- Memory components:
  - Program/Data Memory (MEM): random access
  - Communication Memory (CM): FIFO access
- Communication components:
  - Point-to-point network
  - Crossbar switch
  - Shared bus with Round-Robin, Fixed Priority, or TDMA arbitration
- Communication Controller (CC): interface between processing, memory, and communication components
Many alternative platforms can be easily constructed by instantiating different types/numbers of components and setting their parameters
[Figure: processors PPC 1 to PPC n and µb 1 to µb m, each with its own MEM, attached through CCs and CMs to a communication component.]
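How quickly alternatives multiply when components are instantiated from such a library can be shown with a small enumeration. This sketch is purely illustrative: the `Platform` class and `enumerate_platforms` function are hypothetical and do not reflect the XML platform specification that the tools actually use. It only varies processor types and the interconnect, ignoring memories, arbitration policies and other parameters.

```python
from dataclasses import dataclass
from itertools import combinations_with_replacement

# Component choices named on the slide (dedicated HW IP blocks left out for brevity).
PROCESSOR_TYPES = ("PPC", "MicroBlaze")
INTERCONNECTS = ("point-to-point", "crossbar", "shared-bus")


@dataclass(frozen=True)
class Platform:
    """Hypothetical, simplified platform description."""
    processors: tuple
    interconnect: str


def enumerate_platforms(max_processors):
    """List every processor mix (order ignored) combined with every interconnect."""
    platforms = []
    for count in range(1, max_processors + 1):
        for mix in combinations_with_replacement(PROCESSOR_TYPES, count):
            for interconnect in INTERCONNECTS:
                platforms.append(Platform(mix, interconnect))
    return platforms


if __name__ == "__main__":
    print(len(enumerate_platforms(2)))  # 15
    print(len(enumerate_platforms(8)))
```

Even this reduced parameter set grows quickly with the processor count, which is why the flow relies on fast system-level simulation to prune candidates before synthesis.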

Takes relatively short amount of time: a multiprocessor system with 8 processors

                          KPN Derivation   System-level to RTL Conversion   Physical Implementation
    KPNgen                00:00:22         --                               --
    ESPAM tool            --               00:00:24                         --
    XPS tool              --               --                               02:09:00
    Manual Manipulation   00:30:00         00:10:00                         --
    Total Time            02:49:46

- Simple exploration of the performance of alternative MP-SoCs is feasible even at implementation level in several hours
- The accuracy is 100%
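The total in the table can be checked by adding up the individual entries. The short calculation below uses only the durations listed on the slide; the dictionary keys are labels chosen for this example.

```python
from datetime import timedelta

# Step durations copied from the table.
steps = {
    "KPNgen": timedelta(seconds=22),
    "ESPAM": timedelta(seconds=24),
    "XPS": timedelta(hours=2, minutes=9),
    "manual (KPN derivation)": timedelta(minutes=30),
    "manual (RTL conversion)": timedelta(minutes=10),
}

total = sum(steps.values(), timedelta())
automated = steps["KPNgen"] + steps["ESPAM"] + steps["XPS"]

if __name__ == "__main__":
    print(total)      # 2:49:46
    print(automated)  # 2:09:46
```

The two system-level tools together take 46 seconds, so the overall time is dominated by physical implementation in XPS and by the 40 minutes of manual work.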

- Model refinement techniques
- Design space pruning techniques
- System-level power models
- Multi-application system design
  - Introducing the notion of workload scenarios
- Adaptive and dynamic systems
  - Applications, mappings and architectures
- DSE support framework
  - Experimental support + analysis support (visualization)
- Relaxing input constraints of tools


- Image processing solutions for customers that build
  - Medical appliances: very high resolution images
  - Industrial process monitoring: very high frame rate
- Chess B.V. deployed Daedalus
  - Still image JPEG compression system
  - Very fast evaluation (exploration and implementation) of alternative systems (MP-SoCs)
  - Trade-off between cost, design time, performance, etc.

[Figure: KPN of the JPEG encoder. The image is divided into tiles. A tile flows through Vin, parallel DCT and Q stages (DCT1/Q1 to DCT8/Q8), VLE and Vout, which emits a packet of bytes: the compressed byte sequence for the tile (.jpg).]
- Tile = 128 MacroBlocks
- MacroBlock = 2 Y blocks + 1 U block + 1 V block
- Y block = 64 pixels, U block = 64 pixels, V block = 64 pixels
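The block sizes above fix the amount of data per tile. The following calculation uses only the figures on the slide; the constant and function names are chosen for this example, and the 32-tile case corresponds to the image size used in the case study.

```python
PIXELS_PER_BLOCK = 64           # each Y, U or V block
BLOCKS_PER_MACROBLOCK = 2 + 1 + 1  # 2 Y blocks + 1 U block + 1 V block
MACROBLOCKS_PER_TILE = 128


def samples_per_tile():
    """Number of pixel samples that one tile pushes through the DCT and Q stages."""
    return MACROBLOCKS_PER_TILE * BLOCKS_PER_MACROBLOCK * PIXELS_PER_BLOCK


def samples_per_image(tiles):
    """Total samples for an image made of the given number of tiles."""
    return tiles * samples_per_tile()


if __name__ == "__main__":
    print(samples_per_tile())     # 32768
    print(samples_per_image(32))  # 1048576
```

Each tile therefore carries 512 blocks of 64 samples, which is the unit of work distributed over the parallel DCT and Q processes.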

- MP-SoCs consist of MicroBlaze softcores and/or dedicated HW components
  - Point-to-point connections
- IP component library contains
  - High-level HW component model for all tasks (Sesame)
  - RTL HW model only for DCT task (ESPAM)
- MP-SoC implementations on FPGA are constrained by the on-chip memory (288KB)

Single JPEG encoder DSE:

Architecture instances for a single-tile JPEG encoder:
- 2 MicroBlaze processors (50KB)
- 1 MicroBlaze, 1 HW DCT (36KB)
- 6 MicroBlaze processors (120KB)
- 4 MicroBlazes, 2 HW DCT (68KB)
[Figure: for each instance, the mapping of the tasks Vin, DCT, Q, VLE and Vout onto processing components, with per-component memory sizes.]

Multi JPEG encoder MP-SoCs:

JPEG case study, homogeneous systems (32 tiles):
[Figure: speed-ups of the homogeneous MP-SoC variants. Labeled speed-ups include 7.4x, 8.4x, 9.7x and 10.3x.]

JPEG case study, heterogeneous systems (32 tiles):
[Figure: speed-ups of the heterogeneous MP-SoC variants with hardware DCT components. Labeled speed-ups include 3.8x, 15.2x, 15.9x, 17.7x and 19.7x.]

- We performed the DSE study (5% error) and the implementation of 25 MP-SoC JPEG encoder variations on an FPGA in only 5 days!
- Combining data and task parallelism: 24 cores, 19.7x speed-up, 288KB memory

Daedalus: historical figure from Greek mythology
- Means "cunning worker"
- He was an innovator in many arts
- Daedalus was the father of Icarus
Analogy:
- It's new, disruptive technology
- But there are still limitations
  - Don't fall into the sea!

Merits of the Daedalus design-flow:
- Automated parallelization of media/streaming applications into parallel specifications (KPNs)
- Automated synthesis of MP-SoC platforms at system level, in a plug-and-play fashion
- Automated mapping of parallel application specifications onto MP-SoC platform
- Steering by means of efficient system-level design space exploration
- All of this in a matter of hours

Mark Thompson, Cagkan Erbas, Simon Polstra, Toktam Taghavi, Peter van Stralen, Stanley Jaddoe, Joseph Coffland, Berry van Halderen, The OOTI@TU/e 2006 trainees, Ed Deprettere, Bart Kienhuis, Todor Stefanov, Hristo Nikolov, Paul Lieverse, Sven Verdoolaege, Kai Huang, Ji Gu, Wei Zhong, Ying Tao

For more information: http://daedalus.liacs.nl/ or email: a.d.pimentel@uva.nl