Multithreaded Coprocessor Interface for Dual-Core Multimedia SoC

Student: Chih-Hung Cho
Advisor: Prof. Chih-Wei Liu
VLSI Signal Processing Group, DEE, NCTU

Outline
- Introduction
- Multithreaded Coprocessor Interface
- Simulation & Implementation Results

Introduction
- Dual-core/multi-core SoCs are a possible solution for modern mobile multimedia systems
- Task divergence in most embedded systems drives heterogeneous computing platforms
  - Control-oriented tasks vs. computation-intensive tasks
  - RISC + DSP: run each task on the processor that suits it [1]
- Example: TI OMAP (RISC: ARM9; DSP: TI C'5x)
[Figure: TI OMAP processor block diagram with ARM926, TI C'5x DSP, shared memory controller/DMA, 2D graphics accelerator, timer, interrupt controller, RTC, and frame buffer/internal SRAM]
[1] J. Shandle, "The give and take of DSP processors," IEEE Signal Processing Mag., vol. 17, pp. 43-51, March 2000.

Low DSP Utilization Problem (1/2)
- The gap between peak and delivered performance is increasing
  - Instruction latency and data dependency
  - ILP appears limited to about 3- to 6-issue for practical designs
- DSP utilization is below 50% [2]
  - IPC (inter-processor communication) latency
  - Memory latency
  - Pipeline latency
- Performance beyond single-thread ILP
  - Explicit parallelism (DLP or TLP)
  - Exploiting TLP could be more cost-effective than exploiting ILP
  - IMT (fine-grained) vs. BMT (coarse-grained)
[Figure: thread interleaving over cycles, IMT vs. BMT]
[2] D. W. Wall, "Limits of instruction-level parallelism," in Proc. Int. Conf. ASPLOS-IV, pp. 176-188, April 1991.

Solutions to Pipeline Latency
- Forwarding paths
  - Overheads in area, power, and even the critical path
  - Non-causal path existence
- Software optimization
  - Overhead in code size
- Hardware multithreading
  - Explicit TLP exploration
  - Multiple thread contexts
  - Hardware-supported thread switch mechanism
[Figure: execution order over clock cycles for thread1 to thread5, showing the pipeline latency caused by the data dependency between r4 = r2 + r1 (W) and r5 = r4 + 1 (R)]
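The latency-hiding effect of hardware multithreading can be illustrated with a small issue-slot model. This is only a sketch under stated assumptions, not the authors' simulator: the function name `issue_utilization`, the parameter `dep_distance` (the minimum number of cycles between two dependent instructions of the same thread when no forwarding path exists), and the worst-case assumption that every instruction depends on its predecessor are all hypothetical.

```python
def issue_utilization(num_threads, dep_distance, cycles=1000):
    """Fraction of cycles in which an instruction is issued.

    Assumptions (not from the slides): strict round-robin thread select,
    and every instruction depends on the previous instruction of its own
    thread, so a thread may issue again only dep_distance cycles later.
    """
    next_ready = [0] * num_threads  # earliest cycle each thread may issue
    issued = 0
    for cycle in range(cycles):
        tid = cycle % num_threads   # thread selected in this cycle
        if cycle >= next_ready[tid]:
            issued += 1
            next_ready[tid] = cycle + dep_distance
        # otherwise the issue slot is wasted (pipeline bubble)
    return issued / cycles


for n in (1, 2, 4):
    print(n, "thread(s):", issue_utilization(n, dep_distance=4))
```

In this model a single thread with a dependency distance of 4 fills one slot in four, while four interleaved threads fill every slot, which is the motivation for interleaved multithreading over adding forwarding paths.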

Low DSP Utilization Problem (2/2)
- Inter-processor communication (IPC) becomes more complicated in a multi-core/multithreaded computing model
[Figure: two panels, one labeled "Conventional", each showing processes P1, P2, and P3 exchanging In/Out data through shared memory]
- Enea's OSEck (RTOS) provides full support for StarCore's SC1000 family of DSP cores

Outline
- Introduction
- Multithreaded Coprocessor Interface
- Implementation & Simulation Results

DSP Core
- 4 threads, IMT
- MIPS-compatible ISA
- 4 program counters
- 32 32-bit registers for each thread
- 5-stage pipeline: IF, ID, EXE, MEM, WB
[Figure: pipeline datapath with per-thread PCs, PC increment (+1), thread select, IM, IR, per-thread GPRs, X and Y, and DM]
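The per-thread state listed above can be sketched as a small data structure. This is an illustrative model only: the class names `ThreadContext` and `ImtCore`, the round-robin thread-select policy, and the word-addressed PC step of 1 are assumptions rather than details confirmed by the slides.

```python
from dataclasses import dataclass, field

NUM_THREADS = 4       # from the slide: 4 hardware threads
NUM_REGS = 32         # from the slide: 32 32-bit registers per thread
MASK32 = 0xFFFFFFFF
PC_STEP = 1           # assumption: word-addressed PC


@dataclass
class ThreadContext:
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * NUM_REGS)


class ImtCore:
    """Hypothetical model: one context per thread, round-robin thread select."""

    def __init__(self):
        self.contexts = [ThreadContext() for _ in range(NUM_THREADS)]
        self.selected = 0

    def fetch(self):
        """Return (thread id, fetch address), then advance that PC and the selector."""
        tid = self.selected
        ctx = self.contexts[tid]
        addr = ctx.pc
        ctx.pc = (ctx.pc + PC_STEP) & MASK32
        self.selected = (tid + 1) % NUM_THREADS
        return tid, addr

    def write_reg(self, tid, index, value):
        """Write a 32-bit value; register 0 stays zero (MIPS convention)."""
        if index != 0:
            self.contexts[tid].regs[index] = value & MASK32


core = ImtCore()
print([core.fetch() for _ in range(5)])
```

Because each thread owns its PC and registers, switching threads every cycle needs no save or restore step; only the selector changes.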

Dual-Core Software Architecture
- Multiple tasks tend to use the DSP concurrently
- DSP task management is required

Simulation Model: Dataflow Process Network
- Multimedia applications can be described as FIFO-communicated processes
- For simplicity, we assume each process has only a single input and a single output (SISO)
[Figure: process network; legend: FIFO channel, process]
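A dataflow process network of SISO processes can be sketched in a few lines. The names `Process` and `run_network`, the firing rule (one token in, one token out), and the use of `collections.deque` as the FIFO channel are illustrative assumptions, not the simulation model used in the work.

```python
from collections import deque


class Process:
    """A SISO process: one input FIFO, one output FIFO, one function."""

    def __init__(self, name, func, inp, out):
        self.name = name
        self.func = func
        self.inp = inp
        self.out = out

    def ready(self):
        # Assumed firing rule: a process may fire when its input FIFO holds a token.
        return len(self.inp) > 0

    def fire(self):
        token = self.inp.popleft()
        self.out.append(self.func(token))


def run_network(processes):
    """Fire ready processes repeatedly until no process can fire."""
    fired = True
    while fired:
        fired = False
        for p in processes:
            if p.ready():
                p.fire()
                fired = True


# Two chained processes connected by FIFO channels.
source, mid, sink = deque([1, 2, 3]), deque(), deque()
network = [
    Process("double", lambda x: 2 * x, source, mid),
    Process("add_one", lambda x: x + 1, mid, sink),
]
run_network(network)
print(list(sink))  # [3, 5, 7]
```

The processes share no state other than the FIFOs, which is what allows each one to be mapped onto its own hardware thread.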

JPEG Encoding Example
- 4 processes: CST, DCT, Q, VLC
  - RGB-to-YUV color space transform (CST)
  - Discrete cosine transform (DCT)
  - Quantization (Q)
  - Zero run-length & variable-length coding (VLC)
- Processes are mapped onto a single (multithreaded) DSP
- Each process is assigned a unique priority
- FIFO channels (except those for I/O processes) are implemented in DSP local memory
- When a process completes its computation, it notifies its successor that the data are ready
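The combination of unique priorities and completion notifications can be illustrated with a toy dispatcher. The priority order below (downstream stages first) and the function name `dispatch_order` are assumptions chosen for illustration; the slides do not state which process receives which priority.

```python
STAGES = ["CST", "DCT", "Q", "VLC"]

# Assumption: a lower number means a higher priority, and downstream
# stages are favored so that blocks drain out of the pipeline first.
PRIORITY = {"CST": 3, "DCT": 2, "Q": 1, "VLC": 0}


def dispatch_order(num_blocks):
    """Return the order in which stages run for num_blocks input blocks."""
    pending = {stage: 0 for stage in STAGES}  # blocks waiting at each stage
    pending["CST"] = num_blocks
    trace = []
    while any(pending.values()):
        # Pick the highest-priority stage that has data ready.
        stage = min((s for s in STAGES if pending[s] > 0), key=PRIORITY.get)
        pending[stage] -= 1
        trace.append(stage)
        nxt = STAGES.index(stage) + 1
        if nxt < len(STAGES):
            pending[STAGES[nxt]] += 1  # notify the successor: data are ready
    return trace


print(dispatch_order(2))
```

With this priority order each block travels through all four stages before the next block enters, which keeps the intermediate FIFOs short; a different priority assignment would trade FIFO depth for a different execution order.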

Simulation Platform (Concept)
[Figure: simulation platform block diagram. MPU (ARM926) and DSP (ARM926), each with a VIC, ROM, and RAM; MPU2DSP and DSP2MPU blocks; memory controller; shared memory. Address labels in the figure: 0x0000_0000, 0x0400_0000, 0x1000_0000, 0x4000_0000, 0x4200_0000, 0x4600_0000]

Task Management on MPU
- IPC overhead

Task Management on DSP
- IPC overhead, but.

Outline
- Introduction
- Multithreaded Coprocessor Interface with Hardware Queues
- Implementation & Simulation Results

Experiment Framework
- Prototyping on ARM Versatile
  - Multithreaded DSP core on Xilinx Virtex II-6000 (@35MHz)
  - Host processor: ARM926 @210MHz
  - AMBA AHB @35MHz
- Target application
  - JPEG encoding
  - 320*240 Lena image
- 3 cases
  - Case 1: Software process management on host
  - Case 2: Software process management on DSP
[Figure: JPEG encoding flow: RGB color image, 8x8 blocks (R, G, B), RGB to YCbCr, DCT (DCT coefficients for Y, Cb, Cr), quantization, variable-length coding, output bitstream 010011...]
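As a rough sense of the workload, the number of 8x8 blocks in the 320*240 test image can be worked out directly. The helper name `blocks_per_component` is hypothetical, and the total assumes that all three components are kept at full resolution, which the slides do not specify.

```python
BLOCK = 8  # JPEG operates on 8x8 blocks


def blocks_per_component(width, height, block=BLOCK):
    """Number of block x block tiles needed to cover one image component."""
    across = -(-width // block)   # ceiling division, counts partial edge blocks
    down = -(-height // block)
    return across * down


per_component = blocks_per_component(320, 240)
# Assumption: Y, Cb, and Cr are all full resolution (no chroma subsampling).
total = 3 * per_component
print(per_component, total)  # 1200 3600
```

Each of these blocks passes through the DCT, quantization, and coding stages once per image.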

Case I & Case II
[Figure: two panels. First panel: ARM interface with In page, Out page, and PC/Initiate entries #0 to #7 for threads #0 to #7, running the computation kernels (CST, DCT, Q, VLC). Second panel: ARM interface to a supervisor thread, with In page, Out page, and PC/Initiate entries #1 to #7 for threads #1 to #7, running the computation kernels (CST, DCT, Q, VLC)]

Simulation Results
- Performance comparison
- DSP utilization evaluation
  - DSP idle time comparison
- Improvement?
[Figure: DSP execution time on JPEG encoding, in seconds (axis 0 to 0.8), vs. number of images (1 to 5), for Case I, Case II, and Case III]
[Figure: DSP idle time, in clock cycles (axis 0 to 12,000,000), vs. number of images (1 to 5), for Case I, Case II, and Case III]
Q&A