Multithreaded Coprocessor Interface for Dual-Core Multimedia SoC

Student: Chih-Hung Cho
Advisor: Prof. Chih-Wei Liu
VLSI Signal Processing Group, DEE, NCTU

Outline
- Introduction
- Multithreaded Coprocessor Interface
- Simulation & Implementation Results

Introduction
- Dual-core/multi-core SoCs are a possible solution for modern mobile multimedia systems
- Task divergence in most embedded systems drives heterogeneous computing platforms
  - Control-oriented tasks vs. computation-intensive tasks
  - RISC + DSP: run each task on the processor that suits it [1]
- Example: TI OMAP
  - RISC: ARM9
  - DSP: TI C'5x

[Figure: TI OMAP processor block diagram with ARM926, TI C'5x DSP, Shared Memory Controller/DMA, 2D Graphics Accelerator, Timer, Interrupt Controller, RTC, and Frame Buffer/Internal SRAM.]

[1] J. Shandle, "The give and take of DSP processors," IEEE Signal Processing Mag., vol. 17, pp. 43-51, March 2000.

Low DSP Utilization Problem (1/2)
- The gap between peak and delivered performance is increasing
  - Instruction latency and data dependency
  - Practical ILP appears to be limited to roughly 3- to 6-issue designs
- DSP utilization is below 50% [2]
  - IPC (inter-processor communication) latency
  - Memory latency
  - Pipeline latency
- Performance beyond single-thread ILP
  - Explicit parallelism (DLP or TLP)
  - Exploiting TLP can be more cost-effective than exploiting ILP
  - IMT (fine-grained) vs. BMT (coarse-grained)

[Figure: instruction issue per cycle for IMT and BMT.]

[2] D. W. Wall, "Limits of instruction-level parallelism," in Proc. Int. Conf. ASPLOS-IV, pp. 176-188, April 1991.

Solutions to Pipeline Latency
- Forwarding path
  - Overheads on area, power, and even critical path
  - Non-causal path existence
- Software optimization
  - Overhead on code size
- Hardware multithreading
  - Explicit TLP exploration
  - Multiple thread contexts
  - Hardware-supported thread switch mechanism

[Figure: execution order vs. clock cycle for threads 1 to 5, showing the pipeline latency between the data-dependent instructions r4 = r2 + r1 (W) and r5 = r4 + 1 (R).]
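The latency-hiding effect of hardware multithreading can be illustrated with a small issue-slot model. This is only a sketch under stated assumptions, not the design from the slides: threads are assumed to take fixed round-robin issue slots, every instruction is assumed to depend on the previous instruction of the same thread, and `min_distance` (the minimum number of cycles between a producer and its consumer when there is no forwarding path) is a hypothetical parameter.

```python
def issue_cycles(num_threads: int, instrs_per_thread: int, min_distance: int) -> int:
    """Cycles needed to issue all instructions under fixed round-robin slots.

    Assumptions (not from the slides): one issue slot per cycle, slots are
    assigned to threads in strict rotation, and each instruction depends on
    the previous instruction of its own thread.
    """
    last_issue = [None] * num_threads          # cycle of each thread's last issue
    remaining = [instrs_per_thread] * num_threads
    total = num_threads * instrs_per_thread
    issued = 0
    cycle = 0
    thread = 0
    while issued < total:
        ready = (
            last_issue[thread] is None
            or cycle - last_issue[thread] >= min_distance
        )
        if remaining[thread] > 0 and ready:
            last_issue[thread] = cycle         # slot is used
            remaining[thread] -= 1
            issued += 1
        # otherwise the slot is a pipeline bubble
        cycle += 1
        thread = (thread + 1) % num_threads
    return cycle


if __name__ == "__main__":
    for threads in (1, 2, 4):
        cycles = issue_cycles(threads, 4, min_distance=4)
        print(threads, "thread(s):", threads * 4, "instructions in", cycles, "cycles")
```

In this model, once the number of interleaved threads reaches `min_distance`, dependent instructions of the same thread are naturally spaced far enough apart and no bubbles are inserted.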

Low DSP Utilization Problem (2/2)
- Inter-processor communication (IPC)
  - Becomes more complicated in the multi-core/multithreaded computing model
- Enea's OSEck (RTOS) provides full support for StarCore's SC1000 family of DSP cores

[Figure: processes P1, P2, P3 with In and Out, communicating through shared memory; one of the two panels is labeled "Conventional".]

Outline
- Introduction
- Multithreaded Coprocessor Interface
- Implementation & Simulation Results

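DSP Core
- 4 threads, IMT
- MIPS-compatible ISA
- 4 program counters
- 32 32-bit registers per thread
- 5-stage pipeline: IF, ID, EXE, MEM, WB

[Figure: DSP pipeline datapath with per-thread program counters (PC), PC + 1 increment, thread select, instruction memory (IM), instruction register (IR), per-thread general-purpose registers (GPR), and data memory (DM).]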
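The per-thread state listed above (one program counter and 32 registers for each of 4 threads) can be sketched as a small data structure. The selection policy here is an assumption: the slides only say the core is IMT, so strict round-robin selection, the `ThreadSelect` class, and the PC increment of 1 per fetch are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

NUM_THREADS = 4   # from the slide: 4 threads
NUM_REGS = 32     # from the slide: 32 32-bit registers per thread


@dataclass
class ThreadContext:
    """Architectural state kept separately for each hardware thread."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * NUM_REGS)


class ThreadSelect:
    """Hypothetical round-robin thread selector feeding the fetch stage."""

    def __init__(self, start_pcs):
        self.contexts = [ThreadContext(pc=pc) for pc in start_pcs]
        self.current = 0

    def fetch(self):
        """Return (thread id, fetch address) and advance that thread's PC."""
        tid = self.current
        ctx = self.contexts[tid]
        addr = ctx.pc
        ctx.pc += 1                                   # assumed word-addressed PC + 1
        self.current = (tid + 1) % len(self.contexts)  # rotate to the next thread
        return tid, addr


if __name__ == "__main__":
    select = ThreadSelect([0, 100, 200, 300])
    for _ in range(6):
        print(select.fetch())
```

Because each thread has its own context, switching threads every cycle requires no save or restore of registers.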

Dual-Core Software Architecture
- Multiple tasks tend to use the DSP concurrently
- DSP task management is required

Simulation Model: Dataflow Process Network
- Multimedia applications can be described as FIFO-communicated processes
- For simplicity, we assume each process has only a single input and a single output (SISO)

[Figure: dataflow process network; legend: FIFO channel, process.]
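A SISO dataflow process network can be sketched with one input FIFO per process. This is a minimal illustration, assuming a simple firing rule (a process may fire whenever its input FIFO holds a token); the `Process` class, `connect`, `run`, and the toy stage functions are hypothetical names, not taken from the slides.

```python
from collections import deque


class Process:
    """A single-input, single-output process with an input FIFO."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.inp = deque()   # input FIFO channel
        self.out = None      # output FIFO channel, set by connect()

    def can_fire(self):
        return len(self.inp) > 0

    def fire(self):
        token = self.inp.popleft()
        self.out.append(self.func(token))


def connect(processes, sink):
    """Chain processes so each output FIFO is the next process's input FIFO."""
    for upstream, downstream in zip(processes, processes[1:]):
        upstream.out = downstream.inp
    processes[-1].out = sink


def run(processes):
    """Fire any ready process until no tokens are left in the network."""
    while True:
        ready = [p for p in processes if p.can_fire()]
        if not ready:
            break
        ready[0].fire()


if __name__ == "__main__":
    sink = deque()
    chain = [Process("double", lambda x: 2 * x), Process("inc", lambda x: x + 1)]
    connect(chain, sink)
    chain[0].inp.extend([1, 2, 3])
    run(chain)
    print(list(sink))
```

Since every channel is a FIFO and every process is SISO, the order of tokens at the sink matches the order at the source regardless of which ready process fires first.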

JPEG Encoding Example
- 4 processes: CST, DCT, Q, VLC
  - RGB-to-YUV color space transform (CST)
  - Discrete cosine transform (DCT)
  - Quantization (Q)
  - Zero run-length and variable-length coding (VLC)
- Processes are mapped onto a single (multithreaded) DSP
- Each process is assigned a unique priority
- FIFO channels (except those for I/O processes) are implemented in DSP local memory
- When a process completes its computation, it notifies its successor that the data are ready
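The combination of unique priorities and data-ready notification can be sketched as a small scheduler over the four stages. The priority order is an assumption made for illustration (downstream stages get higher priority so that blocks drain through the pipeline); the slides do not state which process has which priority, and `schedule` is a hypothetical function.

```python
from collections import deque

STAGES = ["CST", "DCT", "Q", "VLC"]
# Hypothetical priorities: a larger number means higher priority.
PRIORITY = {"CST": 1, "DCT": 2, "Q": 3, "VLC": 4}


def schedule(num_blocks):
    """Run blocks through the stages, always picking the highest-priority ready stage."""
    fifo = {stage: deque() for stage in STAGES}
    fifo["CST"].extend(range(num_blocks))    # input blocks wait at the first stage
    trace, done = [], []
    while True:
        ready = [stage for stage in STAGES if fifo[stage]]
        if not ready:
            break
        stage = max(ready, key=PRIORITY.get)
        block = fifo[stage].popleft()
        trace.append((stage, block))
        nxt = STAGES.index(stage) + 1
        if nxt < len(STAGES):
            fifo[STAGES[nxt]].append(block)  # notify the successor: data ready
        else:
            done.append(block)               # last stage emits the result
    return trace, done


if __name__ == "__main__":
    trace, done = schedule(2)
    for step in trace:
        print(step)
```

With this priority order each block passes through all four stages before the next block enters CST; a different priority assignment would change the interleaving but not the set of completed blocks.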

Simulation Platform (Concept)

[Figure: simulation platform block diagram. An MPU (ARM926) and a DSP (ARM926), each with a VIC, ROM, and RAM, are connected through MPU2DSP and DSP2MPU blocks and a memory controller to shared memory. Address labels shown in the figure: 0x0000_0000, 0x0400_0000, 0x1000_0000, 0x4000_0000, 0x4200_0000, 0x4600_0000.]

Task Management on MPU
- IPC overhead

Task Management on DSP
- IPC overhead, but...

Outline
- Introduction
- Multithreaded Coprocessor Interface with Hardware Queues
- Implementation & Simulation Results

Experiment Framework
- Prototyping on ARM Versatile
  - Multithreaded DSP core on Xilinx Virtex II-6000 (@ 35 MHz)
  - Host processor: ARM926 @ 210 MHz
  - AMBA AHB @ 35 MHz
- Target application: JPEG encoding
  - 320x240 Lena image
- 3 cases
  - Case I: software process management on the host
  - Case II: software process management on the DSP

[Figure: JPEG encoding flow: RGB color image, 8x8 blocks (R, G, B), RGB to YCbCr, DCT (Y, Cb, Cr), DCT coefficients, quantization, variable-length coding, output bitstream (010011...).]
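The workload size implied by these numbers can be worked out with a few lines of arithmetic. The image size, block size, and clock frequencies come from the slide; the assumption that all three color planes are kept at full resolution (no chroma subsampling) is not stated in the slides and is labeled in the code.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, so partial edge blocks are counted."""
    return (a + b - 1) // b


def blocks_per_plane(width: int, height: int, block: int = 8) -> int:
    """Number of block x block tiles needed to cover one image plane."""
    return ceil_div(width, block) * ceil_div(height, block)


if __name__ == "__main__":
    per_plane = blocks_per_plane(320, 240)   # 320x240 image, 8x8 blocks
    planes = 3                               # assumption: no chroma subsampling
    print("blocks per plane:", per_plane)
    print("blocks per image:", per_plane * planes)
    # Host and DSP clock ratio from the slide: 210 MHz vs. 35 MHz
    print("host cycles per DSP cycle:", 210 // 35)
```

Each 8x8 block is one token flowing through the processing stages, so the block count gives a rough measure of how many inter-process notifications one image generates.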

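Case I & Case II

[Figure, Case I: the process network (In, CST, DCT, Q, VLC, Out) is managed on the ARM; through the interface, each DSP thread (Thread #0 to Thread #7) has an In page, an Out page, and a PC Initiate entry in front of the computation kernel.]

[Figure, Case II: the ARM connects through the interface to a supervisor thread on the DSP, which manages the process network (In, CST, DCT, Q, VLC, Out); Thread #1 to Thread #7 each have an In page, an Out page, and a PC Initiate entry in front of the computation kernel.]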

Simulation Results
- Performance comparison
  - Q&A
- DSP utilization evaluation
  - DSP idle time comparison
  - Improvement?

[Figure: "DSP Execution Time on JPEG Encoding"; y-axis: seconds (0 to 0.8); x-axis: number of images (1 to 5); series: Case I, Case II, Case III.]

[Figure: "DSP Idle Time"; y-axis: clock cycles (0 to 12,000,000); x-axis: number of images (1 to 5); series: Case I, Case II, Case III.]