Design methodology for multi processor systems design on regular platforms

1 Design methodology for multi processor systems design on regular platforms. Ph.D. in Electronics, Computer Science and Telecommunications. Ph.D. student: Davide Rossi. Ph.D. tutor: Prof. Roberto Guerrieri.

2 Outline. Motivations; approach overview; software layers (programming model, compilation flow, SystemC simulator); architecture (cluster architecture, tile architecture); preliminary implementation results.

3 Motivations. Embedded application requirements: time to market, energy efficiency, performance, flexibility, programmability. SoC trends: multi-core, hierarchical memory architecture, efficient connectivity, on-demand accelerator engines (source: ITRS 2009).

4 Main Goals. Multi/many-processor: spatial computation vs. sequential computation; thread-level parallelism; homogeneous architecture. Distributed ASIC acceleration: heterogeneous set of accelerators; instruction-level parallelism; data-level parallelism. Regularity: at the architectural level, a tile-based approach; accelerators implemented on a metal-programmable technology (customization with 9/36 masks, CMOS65LP). Homogeneous programming model and tool-chain: high-level application exploration and partitioning; kernel extraction; development of hardware accelerators for the most critical kernel portions.

5 Programming model. Based on the OpenCL 1.0 specification. Sequential code is executed on a host; parallel kernels executed on the multi-core device are declared with the kernel qualifier (__kernel). Data-parallel programming model: one kernel can run as an NDRange space of simultaneous work-groups; a maximum of 8 work-items batched into a work-group run on the cluster. Task-parallel programming model: there is no implicit declaration of an index space; data structures are implemented by the device. Tasks and kernels are synchronized through events and barriers. (Figure: example of an NDRange index space.)
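
The NDRange decomposition described on this slide can be sketched in a few lines of Python. This is only an illustration of the OpenCL 1.0 indexing rule (global ID = group ID x work-group size + local ID), not the tool-chain's actual code; the function name ndrange and the 8 x 4 example grid are assumptions made for the sketch, while the limit of 8 work-items per work-group comes from the slide.

```python
from itertools import product

MAX_WORK_GROUP_SIZE = 8  # from the slide: at most 8 work-items per work-group


def ndrange(global_size, local_size):
    """Yield (group_id, local_id, global_id) for each work-item of a 2-D NDRange.

    Hypothetical helper: it only enumerates indices, it does not run a kernel.
    """
    gx, gy = global_size
    lx, ly = local_size
    if lx * ly > MAX_WORK_GROUP_SIZE:
        raise ValueError("a work-group may hold at most 8 work-items")
    if gx % lx or gy % ly:
        # OpenCL 1.0 requires the global size to be a multiple of the local size
        raise ValueError("global size must be a multiple of local size")
    for group_y, group_x in product(range(gy // ly), range(gx // lx)):
        for local_y, local_x in product(range(ly), range(lx)):
            global_id = (group_x * lx + local_x, group_y * ly + local_y)
            yield (group_x, group_y), (local_x, local_y), global_id


# An 8 x 4 index space split into work-groups of 4 x 2 work-items
items = list(ndrange(global_size=(8, 4), local_size=(4, 2)))
```

With these sizes the index space holds 32 work-items in 4 work-groups, and every point of the 8 x 4 grid is visited exactly once.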

6 Parallel Memory Sharing. Private memory: private to each work-item. Local memory: per work-group, shared by the threads of the same work-group, used for inter-work-item communication. Global memory: per application, shared by all work-groups, used for inter-kernel communication. Data can be allocated in memory using the private, local and global address space qualifiers. (Figure: work-items with private memory, work-groups with shared memory, and NDRange kernels 1, 2, ... executed sequentially in time over global memory.)
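
As a rough illustration of the three address spaces, the sketch below simulates a sum computed by one kernel and consumed by a second one. It is sequential Python, not OpenCL, and the names run_sum_kernel and run_total_kernel are hypothetical; it only shows which data each scope can see.

```python
def run_sum_kernel(global_mem, local_size):
    """First kernel: each work-group writes one partial sum to global memory."""
    if len(global_mem) % local_size:
        raise ValueError("input length must be a multiple of the work-group size")
    num_groups = len(global_mem) // local_size
    partial_sums = [0] * num_groups       # global memory: visible to all work-groups
    for group_id in range(num_groups):
        local_mem = [0] * local_size      # local memory: one buffer per work-group
        for local_id in range(local_size):
            # private memory: a value that only this work-item sees
            private_value = global_mem[group_id * local_size + local_id]
            local_mem[local_id] = private_value
        # on a real device a barrier would separate the writes from the reads
        partial_sums[group_id] = sum(local_mem)
    return partial_sums


def run_total_kernel(partial_sums):
    """Second kernel: reads what the first kernel left in global memory."""
    return sum(partial_sums)


partials = run_sum_kernel(list(range(16)), local_size=8)
total = run_total_kernel(partials)
```

The two kernels exchange data only through the list standing in for global memory, which mirrors the inter-kernel communication role the slide assigns to that address space.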

7 Tool-chain flow overview. 1) High-level partitioning: evaluation of task-level parallelism using OpenCL; profiling without hardware acceleration using the TLM simulator. 2) Kernel extraction and implementation: extraction of data-level and instruction-level parallelism; thread-level profiling using the Griffy tools; application-level profiling using the SystemC simulator plus the Griffy emulation library.

8 Compilation flow. The OpenCL input file is processed and split into HOST and DEVICE C code. HOST: contains the sequential parts of an application; handles off-chip memory allocation; configures, calls and synchronizes parallel and hardware-accelerated kernels. DEVICE: executes threads in parallel; each thread handles the transfers of its own data chunks (DMA LIB); each thread configures and launches hardware-accelerated functions (PGA LIB). The HOST and DEVICE C code are then compiled with the processor-specific compiler and linked with the runtime libraries.

9 Accelerators Design Flow. Accelerator description: single-assignment DFG description with simplified C syntax. The compilation flow automatically extracts information about routing-only operations and pipeline stages. Outputs of the flow are: cycle-accurate emulation libraries for integration with the standalone cycle-accurate simulator; functional emulation libraries for integration with the SystemC TLM simulator; RTL implementation of the pipelined accelerators; GDS-II macro and all FE and BE views for integration in Synopsys and Cadence place & route tools.

10 SystemC Simulator. Integrates: N computational tile instances (processor ISS, DMA, memories, buffers, registers, accelerators, local interconnect); 1 IO tile instance (processor, memories, registers); shared and device memories; interconnect TLM models. Memory transfers are transaction accurate: ISS read/write requests are converted into packets flowing through routers, and each time a packet crosses a router a delay unit is introduced. Target configuration files contain: memory map information; processor configuration information; links to shared libraries containing the accelerator emulation functions.
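
The per-router delay rule can be sketched as follows. The Packet class, the route function and the router names are all hypothetical and written in Python rather than SystemC; the sketch only assumes what the slide states, namely that each router crossed adds one delay unit.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """A read/write request travelling through the interconnect (hypothetical model)."""
    source: str
    target: str
    delay_units: int = 0


def route(packet, routers_on_path):
    """Add one delay unit for every router the packet crosses."""
    for _router in routers_on_path:
        packet.delay_units += 1
    return packet


# A request from a computational tile to shared memory crossing three routers
request = route(Packet("tile0", "shared_mem"), ["r0", "r1", "r2"])
```

Under this assumption the latency of a transfer depends only on the number of hops, so request.delay_units is 3 for the path above.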

11 Cluster Architecture. The NoC connects the computational resources with a multi-bank shared memory. The bus connects the shared memory with the IO. The cluster manager is responsible for: synchronization of work-items within a work-group; NoC configuration.

12 CT Architecture (1). GP processor: private data and program TCM (16K each); dedicated interfaces for subsystem management. Stream channel: MFU (DMA transfers); EFU (access to hardware accelerators); EXT (access to external memory space); operation is independent from the CPU; separate read and write physical channels; multiple logical channels; 2-D addressing patterns; features AUDIO and VIDEO addressing modes.

13 CT Architecture (2). Synchronization mechanism: local registers; local configuration bus. Hardware accelerators: features 4 buffers of 1024 x 32 bits. Address generators: 2D addressing patterns (step/stride); circular addressing. Each accelerated operation is triggered by the processor.
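
The two addressing modes of the address generators can be sketched as Python generators. The slide does not define the parameters, so the meaning used here is an assumption: step is the distance between consecutive elements of a row and stride is the distance between the starts of consecutive rows. The function names are hypothetical.

```python
def addresses_2d(base, step, count, stride, rows):
    """Yield word addresses for `rows` rows of `count` elements each.

    Assumed semantics: `step` separates elements within a row,
    `stride` separates the first elements of consecutive rows.
    """
    for row in range(rows):
        for col in range(count):
            yield base + row * stride + col * step


def addresses_circular(base, length, start, count, step=1):
    """Yield `count` addresses that wrap around a window of `length` words."""
    for i in range(count):
        yield base + (start + i * step) % length


# A 3 x 2 block read from a buffer whose rows are 8 words apart
block = list(addresses_2d(base=0, step=1, count=3, stride=8, rows=2))

# Six accesses to a 4-word circular window starting at offset 2
ring = list(addresses_circular(base=100, length=4, start=2, count=6))
```

Here block is [0, 1, 2, 8, 9, 10] and ring is [102, 103, 100, 101, 102, 103]. For one of the 1024-word buffers, length would be at most 1024.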

14 CT synthesis results. Technology: CMOS65LP, HVT/SVT, 1.2 V. CT area: 1.25 mm². Max frequency (worst case, 125 °C, 1.1 V): 250 MHz. SVT/HVT ratio: 0.11.

15 Collaborations. The PhD is in collaboration with STMicroelectronics. Collaborations in 3 European projects: MORPHEUS (FP6), MODERN (ENIAC), THERMINATOR (FP7).

16 Publications. N. Voros et al., Dynamic System Reconfiguration in Heterogeneous Platforms, Chapter 5: The DREAM Digital Signal Processor, Chapter 8: The MORPHEUS Data Communication and Storage Infrastructure, Springer; D. Rossi et al., A Heterogeneous Digital Signal Processor Implementation for Dynamically Reconfigurable Computing, CICC (Custom Integrated Circuits Conference); D. Rossi et al., A Multi-Core Signal Processor for Heterogeneous Reconfigurable Computing, International Symposium on System-on-Chip, Proceedings; F. Campi et al., RTL-to-Layout Implementation of an Embedded Coarse Grained Architecture for Dynamically Reconfigurable Computing in Systems-on-Chip, Proceedings; D. Rossi et al., A Heterogeneous Digital Signal Processor for Dynamically Reconfigurable Computing, IEEE Journal of Solid-State Circuits (JSSC), 2010.

17 Thank you for your attention
