Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17,



2 Outline
- Introduction + Motivation
- Design requirements for many-accelerator SoCs
- Design problems
- Objective
- Proposed co-design methodology
- Virtual prototyping framework
- VP simulation speedup
- Co-design formalization
- Case study: H.264 decoding server
- Conclusions

3 Introduction
Need for performance and energy improvements: many-accelerator SoCs
- Massive parallelization
- Pipelining
- Energy-efficient hardware configuration

6 Introduction
HW/SW co-design for many-accelerator systems
- Different architectural configurations are required in each HW core.
- Example: surveillance server, where each video stream is handled by its own accelerator group with its own requirements:
  - VIDEO 1: Group 1, Requirements 1
  - VIDEO 2: Group 2, Requirements 2
  - VIDEO N: Group N, Requirements N

7 Introduction
Typical co-design: common configuration
- Area violation
- Suboptimal design
[Figure: designs plotted against the maximum allowed area]

8 Introduction
Using different configurations:
- Constraints met
- Design optimization
[Figure: designs plotted against the maximum allowed area]

9 Design problems
Problem 1: Exponentially increased design space size:
$\left( \prod_{j=1}^{A} \prod_{i=1}^{P_{a_j}} V_{p_i^{a_j}} \right)^{N}$
- N: number of accelerator groups
- A: number of accelerators per group
- P_a: number of architectural parameters of accelerator a
- V_p: number of values of architectural parameter p
An increased number of evaluations is needed.
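
The growth described above can be illustrated with a short Python sketch. It assumes that each group is configured independently, so the number of configurations of one group (the product of the value counts of every parameter of every accelerator) is raised to the power N. The function name and the example numbers are hypothetical and are not taken from the slides.

```python
from math import prod

def design_space_size(values_per_param, num_groups):
    """Count the configurations of a many-accelerator system.

    values_per_param: one list per accelerator; each entry is the number of
                      candidate values of one architectural parameter.
    num_groups:       N, the number of independently configured groups.
    """
    # Configurations of a single group: product over accelerators and parameters.
    per_group = prod(prod(counts) for counts in values_per_param)
    # Groups are configured independently, so the exponent is N.
    return per_group ** num_groups

# Hypothetical example: 4 accelerators, each with two parameters
# that can take 3 and 2 values respectively.
per_accelerator = [[3, 2]] * 4
common_config = design_space_size(per_accelerator, 1)   # one shared configuration
per_group_config = design_space_size(per_accelerator, 8)  # 8 groups configured separately
print(common_config, per_group_config)
```

Under these assumed numbers, moving from one shared configuration to eight independently configured groups raises the count from 1296 to 1296 to the eighth power, which is why exhaustive evaluation quickly becomes impractical.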

10 Design problems
Problem 2: Slow system evaluation:
- Accurate evaluation requires slow simulation.
- Increased number of components.
- Non-productive simulation phases: phases outside the evaluation scope.
A large number of slow simulations leads to highly increased design time.

11 Goal of the proposed framework
VP framework for co-design of many-accelerator systems.
- Supports the use of different hardware core configurations, leading to optimal designs.
- Reduces simulation time by avoiding non-productive simulation phases, so the increased design time can be alleviated.

23 Proposed VP framework
SystemC/TLM-based Virtual Platform
- Instances 1, 2, ..., N, each consisting of:
  - Software part: RAM, ROM (dataset), local bus
  - Sync module, bridge, and profiler (partial metrics)
  - Hardware part: accelerators, with the area metric exposed through SystemC ports
- System (global) bus connecting the instances
- Input: configuration. Output: global metrics (area, power, throughput).
Simulation Control Server:
- Create config. file
- Simulation start
- Wait for metrics, for each instance
- Write global metrics:
  - Power: sum of all HW
  - Area: sum of all HW
  - Throughput: min. from all HW
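
The last step of the Simulation Control Server (combining partial metrics into global ones) can be sketched as follows. This is a minimal illustration of the sum/sum/min rule stated on the slide; the `PartialMetrics` type, its field names, and the example values are assumptions made for the sketch, not part of the framework.

```python
from dataclasses import dataclass

@dataclass
class PartialMetrics:
    """Metrics published by the profiler of one instance (hypothetical layout)."""
    area: float
    power: float
    throughput: float

def global_metrics(partials):
    """Combine per-instance metrics as on the slide:
    power and area are summed, throughput is the minimum over all instances."""
    return PartialMetrics(
        area=sum(p.area for p in partials),
        power=sum(p.power for p in partials),
        throughput=min(p.throughput for p in partials),
    )

# Hypothetical partial metrics for two instances.
partials = [PartialMetrics(0.5, 10.0, 15.0), PartialMetrics(0.25, 20.0, 14.0)]
print(global_metrics(partials))
```

Taking the minimum reflects that the slowest instance bounds the throughput of the whole system.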

35 Proposed VP framework
Communication among components:
Software:
1. Send input data.
2. Poll while Ready = '0'.
3. Get output data.
4. More accelerators to manipulate? YES: repeat from step 1. NO: send # cycles to the profiler.
Accelerator:
1. Get input.
2. Process.
3. Set delay.
4. Produce output; Ready = 1.
5. Send output.
Profiler:
1. Get cycles.
2. Combine with HW metrics.
3. Publish metrics.
4. Synchronize.
Synchronization Module:
- Synchronization point reached for current instance.
- No full sync: wait while the remaining instances are synchronized.
- Full sync: continue execution.
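
The software/accelerator handshake above can be mimicked with a small, self-contained Python model. It is only a sketch: `ToyAccelerator`, `run_software`, and the cycle counting by polling are hypothetical stand-ins for the SystemC/TLM components, and the delay values are invented.

```python
class ToyAccelerator:
    """Toy model of one accelerator with a Ready flag and a fixed delay."""

    def __init__(self, kernel, delay_cycles):
        if delay_cycles < 1:
            raise ValueError("delay_cycles must be at least 1")
        self.kernel = kernel
        self.delay_cycles = delay_cycles
        self.ready = 0
        self.output = None
        self._input = None
        self._remaining = 0

    def send_input(self, data):
        # Get input, start processing, and set the delay; Ready stays 0.
        self._input = data
        self.ready = 0
        self._remaining = self.delay_cycles

    def tick(self):
        # Advance one cycle; produce the output and raise Ready when the delay expires.
        if self._remaining > 0:
            self._remaining -= 1
            if self._remaining == 0:
                self.output = self.kernel(self._input)
                self.ready = 1

def run_software(accelerators, inputs):
    """Drive each accelerator in turn and return (outputs, total cycles)."""
    cycles = 0
    outputs = []
    for acc, data in zip(accelerators, inputs):
        acc.send_input(data)          # Send input data
        while acc.ready == 0:         # Poll while Ready = '0'
            acc.tick()
            cycles += 1
        outputs.append(acc.output)    # Get output data
    return outputs, cycles            # The cycle count would go to the profiler

accs = [ToyAccelerator(lambda x: x * 2, 3), ToyAccelerator(lambda x: x + 1, 5)]
print(run_software(accs, [10, 20]))
```

In this toy model the cycle count returned by the software loop plays the role of the "# cycles" value that the profiler combines with the hardware metrics.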

36 VP Simulation speedup
Process-based Reconfigurable SystemC Module (PRM): separates the virtual platform into two O/S processes:
- Static VP process: CPUs, memories, auxiliary peripherals. Constant during exploration of the design space.
- Hardware process: all hardware accelerators for a specific group. Changes during design space exploration.

37 VP Simulation speedup
Process-based Reconfigurable SystemC Module (PRM), cont'd:
Instead of restarting the whole simulation, the designer:
1. Pauses the simulation.
2. Restarts only the hardware process.
3. Continues the simulation from the point where it was paused.
Non-productive simulation phases are not repeated, which speeds up exploration.
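
A back-of-the-envelope cost model shows where the saving comes from. It assumes that, without PRM, every evaluation pays the startup, initialization, and warm-up phases again, while with PRM they are paid once and each evaluation only adds a hardware-process restart. The function names and all durations below are hypothetical and are not measurements from the slides.

```python
def full_restart_time(n_evals, startup_s, init_s, warmup_s, eval_s):
    """Total time when the whole simulation is restarted for every evaluation."""
    return n_evals * (startup_s + init_s + warmup_s + eval_s)

def prm_time(n_evals, startup_s, init_s, warmup_s, eval_s, hw_restart_s):
    """Total time when only the hardware process is restarted per evaluation."""
    non_productive = startup_s + init_s + warmup_s   # paid once
    return non_productive + n_evals * (hw_restart_s + eval_s)

# Hypothetical durations in seconds for 200 evaluations.
baseline = full_restart_time(200, 10, 20, 30, 240)
with_prm = prm_time(200, 10, 20, 30, 240, 5)
print(baseline, with_prm, 1 - with_prm / baseline)
```

The relative saving depends entirely on how large the non-productive phases are compared with the evaluated phase, so these numbers should not be read as the deck's reported result.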

44 VP Simulation speedup
PRM structure:
- Virtual Platform (O/S process 1): the remaining platform plus the PRM wrapper.
  - PRM wrapper ports: Input 1 to Input 4, Output 1 to Output 3.
  - Data & timing forwarder.
  - Synchronization: on a pause request, check Pause Request && Sync == 1 against the synchronization signal. YES: pause simulation, and later continue simulation.
  - Initializer (reset signal): Reset = 1: run without propagating delay. Reset = 0: normal execution.
- Shared memory: links the data & timing forwarder with the shared memory forwarder.
- Hardware process (O/S process 2): shared memory forwarder and SystemC accelerators 1, 2, ..., K. Each accelerator has input ports, output ports, and metrics ports, and contains the computationally intensive kernel and the accelerator characterization (delay/area).
- User interface (O/S process 3): connection to interface; request pause, check pause, continue simulation. Event: new HW configuration.
- Design time vs. runtime: at design time, C or SystemC behavioural code goes through a High-Level Synthesis tool, and the extracted delay and area metrics are used at runtime.

45 Proposed VP framework + PRM
- Virtual Platform (Unix process 1): Instances 1, 2, ..., N, each with RAM, ROM (dataset), local bus, sync, bridge, profiler (partial metrics), and a PRM wrapper with PRM control and area metric (SystemC ports); all instances share the system (global) bus.
- PRM HW process, instance 1 (Unix process 2): HW 1, HW 2, HW 3, HW 4
- PRM HW process, instance 2 (Unix process 3): HW 1, HW 2, HW 3, HW 4
- PRM HW process, instance N (Unix process N+1): HW 1, HW 2, HW 3, HW 4
- Input: configuration. Output: global metrics (area, power, throughput).
Simulation Control Server:
- Create config. file
- (Re)start all HW processes
- Simulation start / continue
- Wait for metrics, for each instance
- Pause all PRMs
- Write global metrics:
  - Power: sum of all HW
  - Area: sum of all HW
  - Throughput: min. from all HW

46 Co-design formalization
Global area constraint ($A_{max}$):
- Typical: $N \cdot A \le A_{max}$, where N is the number of instances and A is the accelerators area for each instance.
- Proposed: $\sum_{i=1}^{N} A_i \le A_{max}$, where N is the number of instances and $A_i$ is the accelerators area for instance i.

47 Co-design formalization
Throughput constraint ($T_{min}$) per instance:
- Typical: $T \ge T_{min}$, where T is the throughput of each instance. All instances have the same throughput, so T is the system throughput.
- Proposed: $\forall i \in \{1, \dots, N\}: T_i \ge T_{min}$, or equivalently $\min_{i \in \{1, \dots, N\}} T_i \ge T_{min}$, where N is the number of instances and $T_i$ is the throughput for instance i.
The proposed approach exploits the slacks induced by $\min_{i \in \{1, \dots, N\}} T_i$.

48 Co-design formalization
Optimization objective:
- Let $T_i$ be the throughput of instance $i \in \{1, \dots, N\}$.
- The instance with the minimum throughput defines the system throughput: $T_{system} = \min_{i \in \{1, \dots, N\}} T_i$
- Co-design objective: maximization of $T_{system}$, i.e. $\max T_{system} = \max \min_{i \in \{1, \dots, N\}} T_i$
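
The formalization (summed area bounded by A_max, every instance at or above T_min, system throughput equal to the slowest instance) translates directly into a feasibility check. The sketch below uses the constraint values from the case study slide (5.5 mm^2 and 13.6 fps); the function names and the per-instance numbers are hypothetical.

```python
def system_throughput(throughputs):
    """T_system: the slowest instance defines the system throughput."""
    return min(throughputs)

def is_feasible(areas, throughputs, a_max, t_min):
    """Proposed constraints: sum of A_i <= A_max and min of T_i >= T_min."""
    return sum(areas) <= a_max and system_throughput(throughputs) >= t_min

# Hypothetical per-instance results for 8 instances.
areas = [0.5, 0.75, 0.5, 0.75, 0.5, 0.75, 0.5, 0.75]
throughputs = [14.0, 16.0, 14.5, 15.0, 14.0, 17.0, 15.5, 14.0]
print(is_feasible(areas, throughputs, a_max=5.5, t_min=13.6),
      system_throughput(throughputs))
```

Because only the minimum throughput matters, instances that are faster than the slowest one carry slack that could be traded for area.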

49 Co-design formalization
Configuration representation:
- For a single instance $i \in \{1, \dots, N\}$: $V_i = (p_1^i, p_2^i, \dots, p_K^i)$
  - K: number of architectural parameters (common to all instances)
  - $p_j^i$: value of parameter $j \in \{1, 2, \dots, K\}$ for instance i
- At least two instances are differently configured: $\exists\, i_1, i_2 \in \{1, 2, \dots, N\}: i_1 \ne i_2 \wedge V_{i_1} \ne V_{i_2}$
- For the overall system: $\left( (p_1^1, p_2^1, \dots, p_K^1), (p_1^2, p_2^2, \dots, p_K^2), \dots, (p_1^N, p_2^N, \dots, p_K^N) \right)$
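
One possible encoding of this representation is a tuple of N per-instance tuples, each holding K parameter values. The sketch below checks the "at least two instances differ" condition; the function name and the example parameter values are hypothetical.

```python
def is_heterogeneous(system_config):
    """Return True if at least two instances are configured differently.

    system_config: tuple of N per-instance tuples (p_1, ..., p_K).
    """
    lengths = {len(v) for v in system_config}
    if len(lengths) != 1:
        raise ValueError("all instances must have the same K parameters")
    # Two distinct per-instance tuples exist exactly when the set has more than one element.
    return len(set(system_config)) > 1

# Hypothetical configurations with N = 3 instances and K = 2 parameters.
common = ((4, 2), (4, 2), (4, 2))
mixed = ((4, 2), (8, 1), (4, 2))
print(is_heterogeneous(common), is_heterogeneous(mixed))
```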

50 Case study: H.264 decoding server
Video decoding for a surveillance server
- 8 instances
- 4 accelerators per instance:
  - Inverse cosine transformation
  - Motion compensation: 1 luma, 2 chroma
[Figure: surveillance server; each instance has I/O, an H264 decoder with HW 1 to HW 4, and motion detection]

51 Case study: H.264 decoding server
Constraints:
- $A_{max} = 5.5\,\mathrm{mm}^2$
- $T_{min} = 13.6$ frames per second
Exploration:
- 200 random evaluations
- Target exploration time: up to 20 hours
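
A random exploration under these constraints could look like the sketch below. It draws 200 per-instance configurations at random, discards those that violate A_max or T_min, and keeps the feasible one with the highest minimum throughput. The `toy_evaluate` cost model is a made-up stand-in for a virtual-platform simulation, and the parameter values, names, and seed are all assumptions.

```python
import random

def random_exploration(evaluate, param_values, n_instances, a_max, t_min,
                       n_evals=200, seed=0):
    """Randomly sample system configurations and keep the best feasible one.

    evaluate:     maps a system configuration to (areas, throughputs) per instance.
    param_values: one sequence of candidate values per architectural parameter.
    Returns (system throughput, total area, configuration) or None.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_evals):
        config = tuple(
            tuple(rng.choice(values) for values in param_values)
            for _ in range(n_instances)
        )
        areas, throughputs = evaluate(config)
        if sum(areas) > a_max or min(throughputs) < t_min:
            continue  # violates the area or throughput constraint
        t_system = min(throughputs)
        if best is None or t_system > best[0]:
            best = (t_system, sum(areas), config)
    return best

def toy_evaluate(config):
    """Hypothetical cost model: one parameter u per instance (e.g. an unroll factor)."""
    areas = [0.1 * u for (u,) in config]
    throughputs = [10.0 + 2.0 * u for (u,) in config]
    return areas, throughputs

best = random_exploration(toy_evaluate, [(1, 2, 4, 8)], n_instances=8,
                          a_max=5.5, t_min=13.6)
print(best)
```

In the real flow each call to `evaluate` would be a simulation run, which is why the number of evaluations is bounded by the available exploration time.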

52 Case study: H.264 decoding server
Proposed vs. Software-only
Using a solution with the minimum possible throughput.
[Figure: throughput (fps) of the proposed and software-only solutions, with T_min marked]

53 Case study: H.264 decoding server
Proposed vs. Overdesign
Overdesign: using the same configuration for all instances.
[Figure: area (mm^2) of Proposed, Overdesign (max. throughput), and Overdesign (5 accelerators), with A_max = 5.5 marked]
[Figure: throughput (fps) of Proposed and Overdesign (5 accelerators), with T_min marked]

54 Case study: H.264 decoding server
Simulation speedup (using PRM), obtained by bypassing:
- VP startup (memory allocation, etc.)
- Target software initialization
- Warm-up phase: the dataset is processed without obtaining metrics, in order to minimize cache misses.

55 Conclusions
VP co-design framework for many-accelerator systems
- Groups the HW accelerators.
- Each group uses a different configuration, leading to optimal designs.
- H.264 use case: 1.58x less area, similar throughput.
Use of the Process-based Reconfigurable SystemC Module
- Simulation speedup by bypassing non-productive simulation phases.
- The time saved can be invested in DSE quality.
- H.264 use case: 40% less simulation time.

56 Thank you. Questions?
