Embedded System Design


Modeling, Synthesis, Verification
Daniel D. Gajski, Samar Abdi, Andreas Gerstlauer, Gunar Schirner
9/29/2011

Outline
- System design trends
- Model-based synthesis
- Transaction level model generation
- Application to platform mapping
- Platform generation
- Cycle-accurate model generation

Traditional System Design
[Figure: design flow with stages Platform, HW Dev., Board, SW Dev., Board + BSP, App. Dev., Prototype]
- Hardware-first approach
  - Platform is defined by architect or based on legacy
  - Designers develop and verify RTL model of platform
  - Slow, error-prone process
- SW development starts after HW is finalized
  - Debugging is complicated on the board due to limited observability
  - HW errors found during SW development are difficult to rectify
- Application is ported after system SW is finalized

Virtual Platform based System Design
[Figure: design flow with stages Platform, Platform Modeling, Virtual Platform (VP), SW Dev., HW Dev., Board + BSP, App. Dev., Prototype]
- Virtual platform (VP) is a fast model of the HW platform
  - Typically an instruction set simulator or C/C++ model of the processor
  - Peripherals are modeled as remotely callable functions
  - Executes several orders of magnitude faster than RTL
- SW and HW development are concurrent
  - VP serves as the golden model for both SW and HW development
  - SW development can start earlier
  - HW designers can use SW as a realistic test bench for RTL

Model-based System Design
[Figure: model-based flow; blocks: Application Developer, C/MoC, Platform, SW Decisions, HW Decisions, TLM Gen., TLM, SW Gen., HW Gen., ASIC/FPGA Tools, Board + BSP + App, Prototype]
- Model-based design gives control to application developers
  - Application is captured as high-level C/C++/UML specification
  - Transaction level model (TLM) is used to verify and evaluate the design
- System synthesis
  - The best platform for a given application can be synthesized automatically
  - For legacy platforms, application mapping can be generated automatically
  - Cycle-accurate SW/HW can be generated from TLM for implementation

Model Based Synthesis
[Figure: synthesis flow; blocks: Specification, Component Models, Component Library, System Synthesis, TLMs, Cycle-accurate SW/HW Synthesis, CAM]
- Synthesis of cycle-accurate model (CAM) from specification
- Process may be divided into several steps
- Specification is defined as application model and design constraints
- Several intermediate models, such as TLMs, may be used
- Platform component models are needed for TLM generation

System Synthesis Inputs and Output
Inputs
- Application Model
  - Purely functional model
  - Specified in a given model of computation (Stateflow, dataflow, CSP, MP)
- Component Models
  - Data models of configurability and metrics
  - Functional models of component services
  - Examples: HW IP models (Processor, Peripheral, Bus), SW IP models (RTOS, Drivers)
- Constraints
  - Bounds on metrics (performance, area, power, reliability, security)
  - Optimization goal as a cost function of metrics
Output
- TLM of application mapped to HW/SW platform

Three Models with Respect to OSI (Ref. Chapter 3)
[Figure: Specification Model, Transaction Level Model and Cycle Accurate Model placed against the layer stack; labels: Spec, TLMs, Address lines, Data lines, Control lines]
Layers shown: 7. Application; 6. Presentation; 5. Session; 4. Transport; 3. Network; 2b. Link + Stream; 2a. Media Access Ctrl; 2a. Protocol; 1. Physical


Synthesis Case 1: Fixed Platform and Mapping
- Initial platform and mapping are given
- Optimization tools may modify spec under given constraints

Tool support for Synthesis Case 1
- GUI for application specification
- GUI for platform specification
- GUI for application to platform mapping
- TLM generation tool
- TLM-based metric estimation tools
- Constraint-based spec optimization tools

Input: Application Model
[Figure: processes P1, P2, P3, P4 with channels C1, C2 and variable v1]
Application model consists of
- Processes for computation (e.g. P1, P2, P3, P4)
- Channels for communication (e.g. C1 between P1 and P3)
- Variables for storage (e.g. v1)

Application Model Objects
- Processes
  - Symbolic representation of computation
  - Contain C/C++ code imported from reference
- Process ports
  - Symbolic representation of communication services required by processes
  - Provide object orientation by allowing processes to connect to different channels
- Channels
  - Symbolic representation of inter-process communication
  - Implement communication services such as blocking, non-blocking, handshake, FIFO, etc.
  - Encapsulation for communication functions
- Variables
  - Symbolic representation of data storage

Input: Platform Architecture
[Figure: CPU1 (OS1) and Mem on Bus1; HW and CPU2 (OS2) on Bus2; Arbiter and TX (transducer) blocks]
Platform consists of
- Hardware: PEs (e.g. CPU1, HW), Buses (e.g. Bus1), Memories (e.g. Mem), Interfaces (e.g. Transducer)
- Software: Operating systems (e.g. OS1) on SW PEs

Platform Objects
- Processing element (PE)
  - Symbolic representation of computation resources
  - Different types such as SW processors, HW IPs, etc.
- Bus
  - Symbolic representation of communication media
  - Types include shared, point-to-point, link, crossbar, etc.
- Memory
  - Symbolic representation of physical storage
  - May contain shared variables or SW program/data
- Transducer
  - For protocol conversion and store-forward routing
  - Necessary for PEs with different bus protocols
- Operating system (OS)
  - Software platform for individual PEs
  - Needed for scheduling multiple processes on a PE

Input: Mapping
[Figure: application objects placed on the platform; labels: CPU1 (OS, P1, P2), Mem (v1), HW IP (P3), CPU2 (OS, P4), Bus1, Bus2, Arbiter, TX, C1, C2]
- Processes -> PEs
- Channels -> Routes
- Variables -> Memories

Mapping Rules
- Processes to PEs
  - Each process in the application must be mapped to a PE
  - Multiple processes may be mapped to a SW PE with OS support
  - Example: P1, P2 -> CPU1
- Channels to Routes
  - All channels between processes mapped to different PEs are mapped to routes in the platform
  - A route consists of bus segments and interfaces
  - The channel on each bus segment is assigned a unique address
- Variables to Memories
  - Variables accessed by processes mapped to different PEs are mapped to shared memories
  - All variables are assigned an address range depending on size
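
The mapping rules above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical `Mapping` container and `check_mapping` helper (neither comes from the slides). The endpoints of C2 and both routes are assumptions made for illustration, and the variables-to-memories rule is left out to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Mapping:
    """Hypothetical container for an application-to-platform mapping."""
    process_to_pe: dict                                     # process -> PE
    channel_ends: dict                                      # channel -> (process, process)
    channel_to_route: dict = field(default_factory=dict)    # channel -> bus segments
    os_pes: set = field(default_factory=set)                # PEs that run an OS


def check_mapping(processes, m):
    """Return a list of rule violations (an empty list means the mapping is valid)."""
    errors = []
    # Rule 1: every process must be mapped to a PE.
    for p in processes:
        if p not in m.process_to_pe:
            errors.append(f"process {p} is not mapped to a PE")
    # Rule 2: several processes may share a PE only with OS support.
    per_pe = {}
    for p, pe in m.process_to_pe.items():
        per_pe.setdefault(pe, []).append(p)
    for pe, procs in per_pe.items():
        if len(procs) > 1 and pe not in m.os_pes:
            errors.append(f"{pe} hosts {sorted(procs)} without an OS")
    # Rule 3: a channel between processes on different PEs needs a route.
    for ch, (src, dst) in m.channel_ends.items():
        pe_src = m.process_to_pe.get(src)
        pe_dst = m.process_to_pe.get(dst)
        if pe_src is None or pe_dst is None:
            continue  # already reported by rule 1
        if pe_src != pe_dst and not m.channel_to_route.get(ch):
            errors.append(f"channel {ch} crosses PEs but has no route")
    return errors


processes = ["P1", "P2", "P3", "P4"]
good = Mapping(
    process_to_pe={"P1": "CPU1", "P2": "CPU1", "P3": "HW", "P4": "CPU2"},
    channel_ends={"C1": ("P1", "P3"), "C2": ("P2", "P4")},       # C2 endpoints assumed
    channel_to_route={"C1": ["Bus1"], "C2": ["Bus1", "Bus2"]},   # routes assumed
    os_pes={"CPU1", "CPU2"},
)
print(check_mapping(processes, good))  # []
```

Returning a list of violations rather than raising on the first one lets a mapping GUI report every broken rule at once.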

Computation Timing Estimation
[Figure: a process CDFG (basic blocks BB1, BB2, BB3 with If branches) and a processor model feed Timing Estimation, which produces a timed process with wait(t1), wait(t2), wait(t3) annotations on the basic blocks]
- Stochastic memory delay model
- DFG scheduling to compute basic block delay [DATE 08]
- RTOS model added for PEs with multiple processes

Stochastic Memory Delay Model
Assumption
- Cache and branch prediction hit rates are available in the data model
Delay Estimation
- Operation access overhead = N_op * ((1.0 - HR_i) * (CD + L_mem))
- Data access overhead = N_ld * ((1.0 - HR_d) * (CD + L_mem))
- Branch prediction miss penalty = MP_rate * Penalty

Example (LLVM bytecode):
  1: a = $i - 1
  2: t1 = a + 2
  3: t2 = $n * $m
  4: t3 = t1 - t2
  5: load b
  6: t4 = b / 10
  7: jmp

Memory/Branch Model
- Cache: D-Mapped, 16K
- Icache: 97.79%, Dcache: 69.96%
- Delay: 1, Memory Delay: 8
- BrPredict Policy: Taken, Penalty: [value lost] %

Mem./Br. Delay Calculation
- Mem. Overhead = 4.1
- Branch Delay = 1.2
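
The memory overhead in the example can be recomputed from the two access-overhead formulas. This is a small sketch under stated assumptions: the function names are hypothetical, and the counts N_op = 7 and N_ld = 1 are read off the seven-line listing (which contains one load) rather than stated on the slide. The branch misprediction rate and penalty did not survive in the source, so `branch_overhead` is shown without example values.

```python
def memory_overhead(n_op, n_ld, hr_i, hr_d, cache_delay, mem_latency):
    """Expected extra cycles per basic block caused by cache misses."""
    miss_cost = cache_delay + mem_latency              # CD + L_mem
    op_overhead = n_op * (1.0 - hr_i) * miss_cost      # instruction fetch misses
    data_overhead = n_ld * (1.0 - hr_d) * miss_cost    # data load misses
    return op_overhead + data_overhead


def branch_overhead(mispredict_rate, penalty):
    """Expected branch misprediction cycles: MP_rate * Penalty."""
    return mispredict_rate * penalty


# Hit rates and delays are the slide's values; n_op and n_ld are assumed
# from the listing (7 operations, 1 load).
mem = memory_overhead(n_op=7, n_ld=1, hr_i=0.9779, hr_d=0.6996,
                      cache_delay=1, mem_latency=8)
print(round(mem, 1))  # 4.1
```

With these inputs the instruction side contributes about 1.39 cycles and the data side about 2.70 cycles, which rounds to the 4.1 shown in the example.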


Processor Timing Estimation
Assumptions
- In-order, single-issue processor
- Optimistic during scheduling (100% cache hit)

Example (LLVM bytecode with the estimated delay annotated as line 8):
  1: a = $i - 1
  2: t1 = a + 2
  3: t2 = $n * $m
  4: t3 = t1 - t2
  5: load b
  6: t4 = b / 10
  7: jmp
  8: wait 47*CT

Processor Data Model
- Operations: Add uses stages IF, ID, EX (int-alu, IntAdd); Sub uses stages IF, ID, EX (int-alu, IntSub)
- Datapath: Int-ALU, Qty: 1; IntAdd Lat: 1; IntSub Lat: 1

Pipeline Scheduling
- Operation delay = 42
- Total BB delay = Op. + Mem. + Br. = 47.3 cycles

Communication Timing Estimation
[Figure: PE1, PE2, PE3, PE4 connected through Bus1, Bus2, Bus3 and transducers Tx1, Tx2, Tx3; processes p1 and p2]

Untimed Bus1:
  Write( ) {
    Get_Bus( );
    Transfer( );
    Release_Bus( );
  }

Timed Bus1, produced by the Estimation Engine from Application + Platform and the Protocol Model:
  Write( ) {
    Get_Bus( ); wait (t1);
    Transfer( ); wait (t2);
    Release_Bus( ); wait (t3);
  }

- Protocol model is used to estimate synchronization, arbitration and transfer
- Timing is annotated in the bus channel
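
The basic-block total on the processor timing slide is the sum of the operation delay from pipeline scheduling and the memory and branch overheads from the stochastic model. A minimal sketch of that arithmetic follows; the function name is hypothetical, and truncating 47.3 to 47 whole cycles is an assumption inferred from the `wait 47*CT` annotation.

```python
import math


def basic_block_wait(op_delay, mem_overhead, branch_delay):
    """Return (total delay in cycles, whole cycles to annotate as wait)."""
    total = op_delay + mem_overhead + branch_delay   # Op. + Mem. + Br.
    return total, math.floor(total)                  # truncation is assumed


# Values taken from the slides: 42 (operations), 4.1 (memory), 1.2 (branch).
total, wait_cycles = basic_block_wait(42, 4.1, 1.2)
print(f"{total:.1f} cycles -> wait({wait_cycles} * CT)")
# 47.3 cycles -> wait(47 * CT)
```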

Output: SystemC Timed TLM
[Figure: generated TLM with CPU1 (OS, P1, P2), Mem, TX, Bus1, Bus2, HW IP (P3), CPU2 (OS, P4)]

TLM Generation Technique
- Application code -> sc_thread
- Processing element -> sc_module
- OS model -> sc_module
- Bus -> sc_channel
- Memory -> array inside sc_module
- Interface -> FIFO channel + sc_process


Application to Platform Mapping (Chapter 4: System Synthesis)
- Mapping is derived from Application and Platform
- Optimization loop is driven by estimation results and constraints

Application Example: GSM Encoder
[Figure: Encoder with blocks LP_Analysis, Open Loop pitch search, Closed-loop pitch search, Codebook Search and Update; loop labels: 2x per frame, 2 sub frames]
- Compresses raw speech data frame-by-frame
- Over 10K lines of C code in specification
- 5 top-level functions: LP, OP, CL, CB, UP
- Contains if-then-else and loop control flow

Profiling
Given input MoC, profile application for:
- Computation
  - Number of operations (size)
  - Operation types per data type and frequency of use
  - Concurrency between modules and dependency
- Communication
  - Volume and frequency of communication between modules
  - Timing dependency
  - Latency requirements
- Storage
  - Instruction size
  - Variable size
[Figure: profiling flow; blocks: Application, Instrumentation, Instr. Appl., Simulation, Basic Block Counters, Static Analysis, Profiled App.]

Profiled Statistics
[Figure: operation counts per function: LP_Analysis 377.0 MOp, Open Loop 337.1 MOp, Closed-loop 478.7 MOp, Codebook 646.5 MOp, Update 43.6 MOp; operation mix: (*, int) 46.20%, (+, int) 33.50%, with (/, int) and others at 4.10% and 9.10%]
Profiling helps select the appropriate components for implementation
- All fixed-point ops: no need for processors with floating-point units
- Large number of multiplications: a processor with HW multiplier is ideal
- CB is most computationally intensive: ideal for custom HW mapping

Application Graph
[Figure: encoder profile (LP_Analysis 377.0 MOp, Open Loop 337.1 MOp, Closed-loop 478.7 MOp, Codebook 646.5 MOp, Update 43.6 MOp; edge values 8,802, 272, 79,544, 315,568, 69,112 and 44) next to the abstracted graph with nodes LP, OP, CL, CB, UP]
- Profile information is abstracted into a simplified graphical representation for synthesis algorithms
- Node tags = number of operations in the process (#ops) in millions
- Edge tags = kilobytes transferred between the processes
- Control dependencies are excluded for simplicity

Platform Connectivity Graph
[Figure: platform with CPU, Mem, HW IP and DSP on Bus1 and Bus2 (with Arbiter and Interface), next to its connectivity graph with nodes CPU 150, HW 200, DSP 100]
- Platform architecture is abstracted into a connectivity graph showing the possibility of inter-PE communication
- Node label = PE name
- Node tag = estimated computation speed of the PE in millions of operations per second (Mops)
- Edge implies a communication path between PEs
- No edge between HW IP and DSP due to missing DMA on Bus1

Load Balancing (LB) Algorithm
- Greedy heuristic to map processes to least busy PEs
- PE load = total time the PE will be busy executing the mapped processes
  - Defined as the sum of #ops(p) / Speed(PE), for all p mapped to PE
  - Does not account for any communication time!
- Feasibility list defined for each process
  - The set of PEs to which a process can be mapped such that the process communication requirements are not violated
  - Let processes p and q share an edge in the application graph
  - Let q be already mapped to PE'
  - PE is in feasible(p) if there exists an edge (PE, PE') in the platform graph
- Basic idea
  - Map processes to the least loaded feasible PE in decreasing order of #ops

LB Algorithm in Action
[Figure: (a) application graph with LP, OP, CL, UP, CB (edge tag 8.8 visible); (b) platform connectivity graph with mapping: CL and LP on CPU (150); CB and UP on HW (200); OP on DSP (100)]
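
The LB heuristic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, "least busy" is read as the PE's load before the new process is added, ties go to the faster PE, and a process may always share a PE with its neighbour. Operation counts and PE speeds are the slide values. The application edge list is assumed, because the figure's edges did not survive extraction; with this particular list the sketch yields the same mapping as the slide.

```python
def load_balance(ops, app_edges, speed, pe_edges):
    """Greedy LB mapping; returns (mapping, load) or raises if no PE is feasible."""
    neighbors = {p: set() for p in ops}
    for a, b in app_edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    linked = {frozenset(e) for e in pe_edges}

    def connected(pe1, pe2):
        return pe1 == pe2 or frozenset((pe1, pe2)) in linked

    mapping = {}
    load = {pe: 0.0 for pe in speed}
    # Visit processes in decreasing order of #ops.
    for p in sorted(ops, key=ops.get, reverse=True):
        mapped_nbr_pes = [mapping[q] for q in neighbors[p] if q in mapping]
        feasible = [pe for pe in speed
                    if all(connected(pe, other) for other in mapped_nbr_pes)]
        if not feasible:
            raise ValueError(f"no feasible PE for {p}")
        # Least busy PE first; ties go to the faster PE (assumed tie-break).
        best = min(feasible, key=lambda pe: (load[pe], -speed[pe]))
        mapping[p] = best
        load[best] += ops[p] / speed[best]
    return mapping, load


ops = {"LP": 377.0, "OP": 337.1, "CL": 478.7, "CB": 646.5, "UP": 43.6}  # MOps
speed = {"CPU": 150, "HW": 200, "DSP": 100}                             # Mops
pe_edges = [("CPU", "HW"), ("CPU", "DSP")]        # no HW-DSP edge (no DMA)
app_edges = [("LP", "OP"), ("LP", "CB"), ("OP", "CL"),
             ("CL", "CB"), ("CB", "UP")]          # assumed edge list

mapping, load = load_balance(ops, app_edges, speed, pe_edges)
print(mapping)
# {'CB': 'HW', 'CL': 'CPU', 'LP': 'CPU', 'OP': 'DSP', 'UP': 'HW'}
```

The `ValueError` branch corresponds to the case where the feasibility list is empty, which is the drawback the LPT algorithm is meant to remove.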

Longest Processing Time (LPT) Algorithm
- Drawbacks of LB
  - Does not account for communication
  - May not terminate with a mapping if the feasibility list is empty
- LPT accounts for communication and always produces a mapping
- Fully connected platform
  - DMA is added to allow HW peripherals to communicate with processes and memories
  - No need to evaluate feasibility of mapping
- Cost of mapping a process to a PE is defined
  - Includes both communication and computation time
- Basic idea
  - Map processes to the least cost PE in decreasing order of #ops

LPT Cost Function
- SystemEndTime = max(T(PE), for all PEs)
- E(p, PE): estimated time for running p on PE
  - Mops(p) / Speed(PE) + time to send data to mapped processes
- C(p, PE) = T(PE) + E(p, PE)
- The PE with the lowest execution time may not have the lowest cost
  - E(p, PE3) > E(p, PE2), but C(p, PE3) < C(p, PE2)
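
A small sketch of the LPT cost function may help make the last point concrete. It is written under assumptions: T(PE) is taken to be the time at which a PE finishes the processes already mapped to it, transfer time is data volume divided by the link rate between the two PEs, and all names and numbers in the toy example (processes A, B, C; PE1, PE2) are hypothetical rather than taken from the slides.

```python
def lpt_map(ops, traffic, speed, bandwidth):
    """Greedy LPT-style mapping; returns (mapping, end_time per PE)."""
    mapping = {}
    end_time = {pe: 0.0 for pe in speed}   # T(PE), assumed to be the PE's finish time

    def exec_estimate(p, pe):
        """E(p, PE): compute time plus transfers to already mapped processes."""
        t = ops[p] / speed[pe]
        for (a, b), kbytes in traffic.items():
            if p in (a, b):
                q = b if a == p else a
                if q in mapping and mapping[q] != pe:
                    t += kbytes / bandwidth[frozenset((pe, mapping[q]))]
        return t

    # Visit processes in decreasing order of #ops.
    for p in sorted(ops, key=ops.get, reverse=True):
        cost = {pe: end_time[pe] + exec_estimate(p, pe) for pe in speed}  # C(p, PE)
        best = min(cost, key=cost.get)
        end_time[best] = cost[best]
        mapping[p] = best
    return mapping, end_time


# Toy data, entirely hypothetical.
ops = {"A": 400, "B": 300, "C": 100}                  # millions of operations
speed = {"PE1": 200, "PE2": 100}                      # Mops
traffic = {("A", "B"): 100}                           # kilobytes between A and B
bandwidth = {frozenset(("PE1", "PE2")): 100}          # kilobytes per second

mapping, end_time = lpt_map(ops, traffic, speed, bandwidth)
print(mapping)                  # {'A': 'PE1', 'B': 'PE1', 'C': 'PE2'}
print(max(end_time.values()))   # 3.5  (SystemEndTime)
```

Process C runs faster on PE1 (0.5 s versus 1.0 s), yet it is mapped to PE2 because PE1 is already busy until 3.5 s, so its cost there is 4.0 against 1.0 on PE2.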

New Connectivity Graph of Updated Platform
[Figure: platform with CPU, Mem, HW, DMA and DSP on Bus1 and Bus2 (with Arbiter and Interface); connectivity graph with node tags CPU 150, HW 200, DSP 100 and edge labels 300, 200 and 100; the HW-DSP edge is possible due to DMA]
- Platform architecture is abstracted into a connectivity graph
- Node label = PE name, node tag = estimated PE speed in Mops
- Edge = connectivity between PEs
- Edge label = effective transaction speed between PEs in kilobytes per second (Kbps); added for the LPT algorithm

LPT Algorithm in Action
[Figure: (a) application graph with LP, OP, CL, UP, CB; (b) platform connectivity graph with mapping: CL and UP on CPU; CB on HW; LP on DSP (100)]

Platform Generation

Component Database
[Table: columns PE Type, Cost, Speed, Capacity (Speed * 6 sec); rows CPU, DSP, HW; cell values lost]
Timing constraint: application must complete in < 6 seconds.
- Database of processing elements used for component selection
- Characterized by type, cost and speed
  - Cost includes IP licensing, development, manufacturing, etc.
- PE computation capacity = PE speed (in Mops) * timing constraint
  - Indicates the number of operations (in millions) that may be mapped to a PE while still meeting the timing constraint

Platform Generation Algorithm
- Greedy heuristic to minimize cost and meet the timing constraint
- PE slack
  - Capacity remaining on the PE
  - Slack(PE) = Capacity(PE) - sum(#ops(p), for all p mapped to PE)
- Closeness factor of a process p to a PE
  - Total communication data between p and all processes mapped to PE
  - C(p, PE) = sum(Edge-weight(p, q) in app. graph), for all q mapped to PE
- Basic idea
  - Iterate over processes (p) in decreasing order of #ops
  - Update slacks of allocated PEs
  - Map p to the closest PE with Slack(PE) >= #ops(p)
  - Allocate the least cost PE with Capacity >= #ops(p) if mapping fails
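
The greedy loop above can be sketched as follows. Everything outside the slide text is an assumption: the function name, the data layout, and in particular every cost and speed in `database` and every weight in `traffic` are placeholders, since the table values were lost in the source. Only the operation counts and the 6-second constraint come from the slides, so the output is not meant to reproduce the slide's figure.

```python
def generate_platform(ops, traffic, database, deadline_s):
    """Greedy platform generation; returns (allocated PE instances, mapping)."""
    capacity = {t: d["speed"] * deadline_s for t, d in database.items()}
    instances = []   # each entry: {"type": ..., "slack": ..., "procs": [...]}
    mapping = {}     # process -> PE type of the instance it was mapped to

    def closeness(p, inst):
        """Total data exchanged between p and the processes already on inst."""
        return sum(w for (a, b), w in traffic.items()
                   if (a == p and b in inst["procs"])
                   or (b == p and a in inst["procs"]))

    # Visit processes in decreasing order of #ops.
    for p in sorted(ops, key=ops.get, reverse=True):
        fits = [i for i in instances if i["slack"] >= ops[p]]
        if fits:
            target = max(fits, key=lambda i: closeness(p, i))   # closest PE
        else:
            types = [t for t in database if capacity[t] >= ops[p]]
            if not types:
                raise ValueError(f"no PE type can host {p}")
            cheapest = min(types, key=lambda t: database[t]["cost"])
            target = {"type": cheapest, "slack": capacity[cheapest], "procs": []}
            instances.append(target)
        target["procs"].append(p)
        target["slack"] -= ops[p]          # update slack of the chosen PE
        mapping[p] = target["type"]
    return instances, mapping


ops = {"LP": 377.0, "OP": 337.1, "CL": 478.7, "CB": 646.5, "UP": 43.6}  # MOps
database = {                       # placeholder costs and speeds (Mops)
    "CPU": {"cost": 10, "speed": 100},
    "DSP": {"cost": 5, "speed": 60},
    "HW": {"cost": 20, "speed": 200},
}
traffic = {("LP", "OP"): 9, ("OP", "CL"): 1,
           ("CL", "CB"): 80, ("CB", "UP"): 69}        # placeholder weights

instances, mapping = generate_platform(ops, traffic, database, deadline_s=6)
print(mapping)
# {'CB': 'HW', 'CL': 'HW', 'LP': 'CPU', 'OP': 'DSP', 'UP': 'HW'}
```

A new PE is allocated only when no already allocated PE has enough slack, which is what keeps the platform cost down in this heuristic.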

Platform Generation Algorithm in Action
[Figure: (a) application graph with LP, OP, CL, UP, CB (edge tag 8.8 visible); (b) generated platform with mapping: LP on CPU (100); CB, CL and UP on HW (200); OP on DSP (50)]

CAM Generation
[Figure: CAM generation flow; blocks: SystemC TLM, SW/RTOS Library, SW Synthesis (C or Binary), Interface Synthesis (IF RTL), Bus Library, RTL IP Library (HW RTL), Pin/Cycle Accurate Model (PCAM) Generator, C/Verilog CAM, CA Sim. Tools, FPGA Tools, Prototype]

Cycle-Accurate Software Synthesis (Chapter 5)
[Figure: on CPU1 and CPU2, Compile and RTOS/Driver Synthesis turn the processes, OS and HAL into Program EXE + RTOS + HAL; TX, Bus1, Bus2 and HW IP are unchanged]
- Processes -> Compiled App.
- OS model -> Real OS
- HAL model -> Real HAL

Cycle-Accurate Hardware Synthesis (Chapter 6)
[Figure: platform with CPU1, Mem, TX, Bus1, Bus2, CPU2; process P3 (in C) is turned into HW IP (RTL) by cycle-accurate synthesis]
- Process -> Synthesizable RTL
- High-level synthesis for custom HW
- Replacement for HW IP

Cycle-Accurate Interface Synthesis (Chapter 7)
[Figure: platform with CPU1, IC, Mem, Arbiter, TX, HW IP, CPU2 after interface synthesis]
- Sync. model -> Interrupts
- Bus channel -> Arbiter + Signals
- Interface model -> RTL
- Channel access -> PE interface

Cycle-Accurate Model
[Figure: CPU1 and CPU2 each running Program EXE + RTOS + HAL, with Arbiter, IC, Mem, Interface and HW IP]
- PCAM is downloaded automatically for fast prototyping with FPGAs or simulated using validation tools

Summary
- Emergence of model-based system design
  - Virtual platforms replace prototypes for early SW development
  - Increasing adoption of TLMs for SW/HW design
- Challenges for synthesis of large system designs
  - Manual model development is time-consuming and error-prone
  - Different platforms are needed for different application domains
  - Mapping an application to a multi-core platform is complicated
- Well-defined model semantics are needed at TLM and cycle-accurate levels
  - Enables automatic TLM generation
  - System synthesis becomes possible
- Future of system synthesis
  - Based on formalized system-level models such as TLM
  - Automatic mapping of application to platform
  - Automatic generation of application-specific platforms
