Multithreaded Coprocessor Interface for Dual-Core Multimedia SoC

1 Multithreaded Coprocessor Interface for Dual-Core Multimedia SoC
Student: Chih-Hung Cho
Advisor: Prof. Chih-Wei Liu
VLSI Signal Processing Group, DEE, NCTU

2 Outline
- Introduction
- Multithreaded Coprocessor Interface
- Simulation & Implementation Results

3 Introduction
- Dual-core/multi-core SoCs are a possible solution for modern mobile multimedia systems
- Task divergence in most embedded systems drives heterogeneous computing platforms
  - Control-oriented tasks vs. computation-intensive tasks
  - RISC + DSP: run each task on the processor that suits it [1]
- Example: TI OMAP
  - RISC: ARM9
  - DSP: TI C'5x
- Figure: TI OMAP processor block diagram (ARM926; TI C'5x DSP; shared memory controller/DMA; 2D graphics accelerator; timer, interrupt controller, RTC; frame buffer/internal SRAM)

[1] J. Shandle, "The give and take of DSP processors," IEEE Signal Processing Mag., vol. 17, March.

4 Low DSP Utilization Problem (1/2)
- The gap between peak and delivered performance is increasing
  - Instruction latency & data dependency
  - Limits to ILP: practical designs seem limited to 3~6-issue
- DSP utilization is below 50% [2]
  - IPC (inter-processor communication) latency
  - Memory latency
  - Pipeline latency
- Performance beyond single-thread ILP
  - Explicitly parallel (DLP or TLP)
  - Exploiting TLP could be more cost-effective than exploiting ILP
  - IMT (fine-grained) vs. BMT (coarse-grained)
- Figure: instruction issue per cycle under IMT and BMT

[2] D. W. Wall, "Limits of instruction-level parallelism," in Proc. Int. Conf. ASPLOS-IV, April 1991.

5 Solutions to Pipeline Latency
- Forwarding path
  - Overheads on area, power, and even the critical path
  - Non-causal path existence
- Software optimization
  - Overhead on code size
- Hardware multithreading
  - Explicit TLP exploitation
  - Multiple thread contexts
  - Hardware-supported thread switch mechanism
- Figure: execution order of threads 1 to 5 over clock cycles, showing the pipeline latency between r4 = r2 + r1 (W) and the data-dependent r5 = r4 + 1 (R)
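The latency-hiding argument on this slide can be checked with a small calculation. The sketch below is a simplified model, not the authors' design: it assumes a 5-stage pipeline with no forwarding, strict round-robin interleaving (a thread issues once every `n_threads` cycles), register read in stage 1 (ID) and write in stage 4 (WB). The function name `stall_cycles` and all parameters are hypothetical.

```python
def stall_cycles(n_threads, write_stage=4, read_stage=1, same_cycle_ok=False):
    """Stall cycles for two back-to-back dependent instructions of one thread.

    Assumed model: the producer issued at cycle 0 writes its result at cycle
    `write_stage`; a consumer issued at cycle d reads at cycle d + read_stage.
    Without forwarding the read must come after the write (or in the same
    cycle if the register file writes before it reads).
    """
    min_gap = write_stage - read_stage + (0 if same_cycle_ok else 1)
    natural_gap = n_threads  # round-robin IMT: one issue slot per thread per round
    return max(0, min_gap - natural_gap)


for n in (1, 2, 4):
    print(n, "thread(s):", stall_cycles(n), "stall cycle(s)")
```

Under these assumptions a single thread stalls three cycles per dependent pair, while four interleaved threads need no stall and no forwarding path.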

6 Low DSP Utilization Problem (2/2)
- Inter-processor communication (IPC)
  - Becomes more complicated in the multi-core/multithreaded computing model
- Figure: processes P1, P2, P3 exchanging In/Out data through shared memory, in two configurations (one labeled "Conventional")
- Enea's OSEck (RTOS) provides full support for StarCore's SC1000 family of DSP cores

7 Outline
- Introduction
- Multithreaded Coprocessor Interface
- Implementation & Simulation Results

8 DSP Core
- 4 threads, IMT
- MIPS-compatible ISA
- 4 program counters
- Register files (one per thread)
- 5-stage pipeline: IF, ID, EXE, MEM, WB
- Figure: pipeline datapath with per-thread PCs, thread select, IM, IR, GPRs, X/Y, and DM
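A minimal sketch of how a thread-select stage with one program counter per thread might behave. The slide does not state the selection policy; strict round-robin is assumed here, as is a 4-byte instruction word (typical for a 32-bit MIPS-style ISA). The class name `ImtFetch` and the start addresses are hypothetical.

```python
class ImtFetch:
    """Fetch stage of an interleaved-multithreading core (assumed round-robin)."""

    def __init__(self, start_pcs, instr_bytes=4):
        self.pcs = list(start_pcs)      # one program counter per hardware thread
        self.instr_bytes = instr_bytes  # assumed fixed instruction width
        self.next_thread = 0

    def fetch(self):
        """Return (thread id, fetch address) for this cycle and advance state."""
        tid = self.next_thread
        pc = self.pcs[tid]
        self.pcs[tid] = pc + self.instr_bytes
        self.next_thread = (tid + 1) % len(self.pcs)
        return tid, pc


fetch_unit = ImtFetch([0x000, 0x100, 0x200, 0x300])
trace = [fetch_unit.fetch() for _ in range(5)]
print(trace)
```

Each thread only re-enters the pipeline after the other three have issued, which is what spaces dependent instructions of the same thread apart.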

9 Dual-Core Software Architecture
- Multiple tasks tend to use the DSP concurrently
- DSP task management is required

10 Simulation Model: Dataflow Process Network
- Multimedia applications can be described as FIFO-communicated processes
- Figure: processes connected by FIFO channels
- For simplicity, we assume each process has only a single input and a single output (SISO)
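The SISO process-network model can be sketched in a few lines. This is an illustrative toy, not the simulation model used in the talk: FIFOs are unbounded `deque` objects, a process fires whenever its input holds a token, and the names `Process` and `run_network` are hypothetical.

```python
from collections import deque


class Process:
    """A SISO process: exactly one input FIFO and one output FIFO."""

    def __init__(self, name, func, inp, out):
        self.name = name
        self.func = func
        self.inp = inp
        self.out = out

    def ready(self):
        # A process may fire only when a token waits on its input channel.
        return len(self.inp) > 0

    def fire(self):
        # Consume one token, compute, and pass the result downstream.
        self.out.append(self.func(self.inp.popleft()))


def run_network(processes):
    """Keep firing ready processes until none has input left."""
    progress = True
    while progress:
        progress = False
        for p in processes:
            if p.ready():
                p.fire()
                progress = True


# Two-process chain: source FIFO -> double -> add_one -> sink FIFO.
src, mid, sink = deque([1, 2, 3]), deque(), deque()
network = [
    Process("double", lambda x: 2 * x, src, mid),
    Process("add_one", lambda x: x + 1, mid, sink),
]
run_network(network)
print(list(sink))  # [3, 5, 7]
```

Because every process touches only its own input and output FIFO, the same description can be mapped onto threads or cores without changing the processes themselves.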

11 JPEG Encoding Example
- 4 processes: CST, DCT, Q, VLC
  - RGB-to-YUV color space transform (CST)
  - Discrete cosine transform (DCT)
  - Quantization (Q)
  - Zero run-length & variable-length coding (VLC)
- Processes are mapped onto a single (multithreaded) DSP
- Each process is assigned a unique priority
- FIFO channels (except those for I/O processes) are implemented in DSP local memory
- A process notifies its descendant that the data are ready when it completes its computations
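As an illustration of the first stage in this chain, the color space transform can be sketched per pixel. The slides do not give the coefficients; the sketch assumes the full-range conversion commonly used with JPEG/JFIF files, and the function name `rgb_to_ycbcr` is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit YCbCr (assumed JFIF coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def to_byte(v):
        # Round to the nearest integer and clamp into the 8-bit range.
        return max(0, min(255, round(v)))

    return to_byte(y), to_byte(cb), to_byte(cr)


print(rgb_to_ycbcr(255, 255, 255))  # white
print(rgb_to_ycbcr(0, 0, 0))        # black
```

A fixed-point version would be the natural choice on a DSP; floating point is used here only to keep the sketch short.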

12 Simulation Platform (Concept)
- Figure: an MPU (ARM926) and a DSP (ARM926), each with its own VIC, ROM, and RAM, linked by MPU2DSP and DSP2MPU channels and sharing memory through a memory controller
- Addresses shown in the diagram: 0x0000_0000, 0x0400_0000, 0x4000_0000, 0x4200_0000, 0x4600_0000, 0x1000_

13 Task Management on MPU
- IPC overhead

14 Task Management on DSP
- IPC overhead, but...

15 Outline
- Introduction
- Multithreaded Coprocessor Interface with Hardware Queues
- Implementation & Simulation Results

16 Experiment Framework
- Prototyping on ARM Versatile
  - Multithreaded DSP core on Xilinx Virtex II-6000
  - Host processor: AMBA
- Target application
  - JPEG encoding
  - 320x240 Lena image
- Figure: JPEG encoding flow (RGB color image; 8x8 blocks of R, G, B; RGB to YCbCr; DCT on Y, Cb, Cr; DCT coefficients; quantization; variable-length coding)
- Cases
  - Case 1: Software process management on host
  - Case 2: Software process management on DSP
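For a sense of the workload, the number of 8x8 blocks in the test image follows directly from its dimensions. The sketch assumes no chroma subsampling (the slides do not say whether any is used), so each of the three color components yields the same block count; `block_grid` is a hypothetical helper.

```python
def block_grid(width, height, block=8):
    """Return (columns, rows, total) of block x block tiles covering an image."""
    if width % block or height % block:
        raise ValueError("image size must be a multiple of the block size")
    cols = width // block
    rows = height // block
    return cols, rows, cols * rows


cols, rows, total = block_grid(320, 240)
print(cols, rows, total)  # 40 30 1200
print(3 * total)          # 3600 blocks across three components (assumed 4:4:4)
```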

17 Case I & Case II
- Figure (first diagram): ARM handles the In, CST, DCT, Q, VLC, Out flow through the interface; threads #0 to #7 each have an In page, an Out page, a PC, and an Initiate control in front of the computation kernel
- Figure (second diagram): ARM connects through the interface to a supervisor thread; threads #1 to #7 each have an In page, an Out page, a PC, and an Initiate control in front of the computation kernel, with the In, CST, DCT, Q, VLC, Out flow

18 Simulation Results
- Performance comparison
- DSP utilization evaluation
  - DSP idle time comparison
- Improvement?
  - Second DSP
- Q&A
- Figure: "Execution Time on JPEG Encoding" for Case I, Case II, and Case III (axes: number of images, clock cycles)
- Figure: "DSP Idle Time" for Case I, Case II, and Case III (axis: number of images)
