Venezia: a Scalable Multicore Subsystem for Multimedia Applications

Similar documents
A 1-GHz Configurable Processor Core MeP-h1

Functional modeling style for efficient SW code generation of video codec applications

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

Comparing Memory Systems for Chip Multiprocessors

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

Multicore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF.

Parallel Processing SIMD, Vector and GPU s cont.

Development of Low Power ISDB-T One-Segment Decoder by Mobile Multi-Media Engine SoC (S1G)

Computer Architecture

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

Multithreaded Processors. Department of Electrical Engineering Stanford University

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015

SA-1500: A 300 MHz RISC CPU with Attached Media Processor*

PowerPC 740 and 750

Advanced Parallel Programming I

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Register File Organization

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

Modern CPU Architectures

XT Node Architecture

Development of Low Power and High Performance Application Processor (T6G) for Multimedia Mobile Applications

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Mapping the AVS Video Decoder on a Heterogeneous Dual-Core SIMD Processor. NikolaosBellas, IoannisKatsavounidis, Maria Koziri, Dimitris Zacharis

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

H.264/AVC und MPEG-4 SVC - die nächsten Generationen der Videokompression

Keywords and Review Questions

Lecture 11 Cache. Peng Liu.

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner

SH-X3 Flexible SuperH Multi-core for High-performance and Low-power Embedded Systems

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

Chapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers

Embedded Systems. 7. System Components

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

Arquitecturas y Modelos de. Multicore

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

1. PowerPC 970MP Overview

Power 7. Dan Christiani Kyle Wieschowski

Portland State University ECE 588/688. Cray-1 and Cray T3E

Multimedia in Mobile Phones. Architectures and Trends Lund

Caches (Writing) P & H Chapter 5.2 3, 5.5. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University

ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2

Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India

Chapter 1 Computer System Overview

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

PACE: Power-Aware Computing Engines

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

H.264 Decoding. University of Central Florida

Cache Memories. From Bryant and O Hallaron, Computer Systems. A Programmer s Perspective. Chapter 6.

The University of Texas at Austin

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

The Nios II Family of Configurable Soft-core Processors

2D/3D Graphics Accelerator for Mobile Multimedia Applications. Ramchan Woo, Sohn, Seong-Jun Song, Young-Don

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

EC 513 Computer Architecture

Performance Tuning on the Blackfin Processor

SE-292 High Performance Computing. Memory Hierarchy. R. Govindarajan

Advanced Memory Organizations

Mapping C code on MPSoC for Nomadic Embedded Systems

Communications and Computer Engineering II: Lecturer : Tsuyoshi Isshiki

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Lecture 24: Virtual Memory, Multiprocessors

Page 1. Multilevel Memories (Improving performance using a little cash )

Overview of SOC Architecture design

ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1

! Readings! ! Room-level, on-chip! vs.!

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Computer Systems Architecture Spring 2016

All About the Cell Processor

Itanium 2 Processor Microarchitecture Overview

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

Caches (Writing) P & H Chapter 5.2 3, 5.5. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University

Benchmarking H.264 Hardware/Software Solutions

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

Memory Hierarchy. 2/18/2016 CS 152 Sec6on 5 Colin Schmidt

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

Jim Keller. Digital Equipment Corp. Hudson MA

04 - DSP Architecture and Microarchitecture

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

Show Me the $... Performance And Caches

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Lecture 1: Introduction

Caches. Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University. See P&H 5.1, 5.2 (except writes)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

Combining Arm & RISC-V in Heterogeneous Designs

0;L$+LJK3HUIRUPDQFH ;3URFHVVRU:LWK,QWHJUDWHG'*UDSKLFV

Simultaneous Multithreading (SMT)

CPS104 Computer Organization and Programming Lecture 20: Superscalar processors, Multiprocessors. Robert Wagner

Caches. See P&H 5.1, 5.2 (except writes) Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University

Transcription:

Venezia: a Scalable Multicore Subsystem for Multimedia Applications Takashi Miyamori Toshiba Corporation Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 2

Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 3 Background Today's mobile multimedia devices support many audio and video CODECs. H.264, G-4, VC-1, AC-3, MP3, WMA etc. The size of image processing is increasing rapidly. QCIF, CIF, QVGA, VGA, 720p, 1080i, 1080p etc. Current SoC (T5V) VCORE (Video/JPEG) CPU VMCB VHMCB VDCT VHDCT DMAC VMEF VLZP VHLZP HW Solution: New designs are required for new CODECs. 90nm CMOS (16Mbit edram x2) H.264 decode @CIF 30fps, G-4 encode/decode @VGA 30fps 4

Our Approach: Scalable Multicore Processor Performance Scalability 720p VGA Codec FW Binary Binary Compatibility Compatibility L2$ L2$ Venezia QVGA L2$ 1 4 8 Number of s 5 MPSoC Architecture Trade-offs for Multimedia Applications, MPSoC 07 Homogeneous vs. Heterogeneous Homogeneous Heterogeneous Architecture Mx MDx MDA MDHx M M M M M D D D M D A M D H H Programmab ility / Scalability Perf./Cost or Perf./Power Examples Very good Good Fair Fair Good Very Good Core 2 (M 2, M 4 ), Xbox 360 CPU (M 3 ), MPCore(M 4 ), Niagara(M 8 ) Cell (MD 8 ), SB3000(MD4), Philips Cake/Wasabi Uniphier M: Main CPU D: DSP/Media Processor A: Accelerator H: Hardware Engine Most of current SoCs MPSoC 07 6 6

Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 7 Venezia Processor Architecture Headquarters MeP Core MeP Core Media Processing Engines (s) MeP Core MeP Core VLIW Cop. VLIW Cop. Venezia Architecture Multi RISC Cores & Multi s Cache-Based System I$ D$ I$ D$ I$ D$ I$ D$ Venezia L2 Cache Media Processing Engine 3-way VLIW Processor SIMD Instruction Support Small Size about 1.3mm2@65nm MeP: Media embedded Processor 8

(Media Processing Engine) MeP Core I$ + Ctrl. Dec. Reg. ALU D$ + Ctrl. IVC2 Dec. Cop. Reg. ALU ALU Mul. MeP (Media embedded Processor) Core 32-bit RISC Processor 5 Pipeline Stages 3 Low-power Mechanisms IVC2(Coprocessor) Extension of MeP Core 64-bit SIMD Operations 3-way VLIW (Core + Cop.A + Cop.B) Acc. Acc. Core Pipe Cop. PipeA Cop. PipeB 9 Instruction Formats 16b/32b Instruction Mode (Core Mode) core16 core32 cop.a/b32 64b Instruction Mode (VLIW Mode) core16 cop.a20 cop.b28 core32 cop.b28 Loop Control, Address Calcu., Load, and Store cop.a28 The instruction mode is switched by the special subroutine call instruction (bsrv). Data Calculation cop.b28 10

IVC2 64-bit SIMD Datapath IVC2 ALU/Shift/ Compare/ Pack/Unpack General-purpose Registers 64bitsx32 5R3W 8 Bits Bits x 8 16 16Bits x 4 32 32Bits x 2 64 64 Bits Bits Add/Subtract/ Shift Shift X1 Stage X2 Stage X3 Stage ALU Accumulator + + + ALU Multiplier X X X Accumulator + + + 8 Bits Bits x 8 16 16Bits x 4 32 32 Bits Bits x 2 32 32 Bits Bits x 8 64 64 Bits Bits x 4 Pipe 0 Pipe 1 11 Cache Memory System Headquarters L1 I$ L1 D$ 64b 512b Buffer L1 I$ 256b Datapath L1 D$ 64b 512b Buffer Interconnect L2 Cache L1 Inst./Data Caches 2-way Set Assoc. 8/16KB 64B Line Size L2 Cache 64/128/256/512KB 4-way Set Assoc. 256B Line Size Prefetch Functions L1 I$ Auto Prefetch L1 D$ Prefetch Inst. L2 Interconnect Buffer L2 $ Buffer L2 $ Prefetch Main Mem. L2 $ 12

Comparing Memory Systems for Chip Multiprocessors (Stanford Univ., ISCA 07) CC: Cache-Coherent Model 32KB 2-way assoc. cache STR: Streaming Model 24KB local memory and 8KB 2-way assoc. cache 512KB L2 cache Both models perform and scale equally well. Non-allocate store to cache can reduce memory traffic. Ex. Prepare for Store (PFS) instruction of MIPS32 G-2: Traffic due to write miss was reduced 56%. Streaming programming model, such as blocking and locality-aware scheduling, is efficient for cache model as well as streaming model. 13 L2 Cache to 64b Arbiter 512b Buffer 512b Buffer 512b Buffer Interconnect 256b Datapath 8-entry Queue L2 Cache Tag SRAM Throughput: 512b / 2 cycles Latency: 10 cycles Tag Check Refill Data SRAM to Main Bus 512b Write Back 1/2 CPU Freq. 14

Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 15 Software Hierarchy of Venezia V-Kernel: simple and light operating system kernel Media FW V-Kernel V-Kernel Library Memory Management V-Kernel Base V-Thread Execution Framework HW Memory Headquarters Venezia Subsystem 16

Multigrain Parallelism Exploits Instruction / Data Level Parallelism Multicore Architecture Exploits and V-Thread Level Parallelism Coarse Granularity V-Thread VLIW SIMD Application level e.g. Audio Decode, Video Decode Function level e.g. MC, IQ/IT Instruction level Data level Multi-task programming Multi-thread programming programming Headquarters & s Within Fine 17 V-Thread Execution Model Scalability and Compatibility by V-Thread Parallel Execution by s Abstraction of Computing Resources Media FW Headquarters V-Thread V-Thread Scheduler V-Thread V-Thread V-Thread 18

Granularity of V-Threads in H.264 Decode V-Thread 1 MC (L) V-Thread 2 IP/IQT (L) DBF (L) Video Signal Processor V-Thread 0 MVP BS Macro Blocks V-Thread 3 MC (C) V-Thread 4 IP/IQT (C) DBF(C) Video Signal Processor VGA: 1000 V-Threads / Frame 720p: 3000 V-Threads / Frame MVP: Motion Vector Prediction BS: Boundary Strength MC(L): Motion Compensation (and Weighted Prediction) for Luma MC(C): Motion Compensation (and Weighted Prediction) for Chroma IP/IQT(L): Intra Prediction (and Inverse Quantization, Inverse Transform) for Luma IP/IQT(C): Intra Prediction (and Inverse Quantization, Inverse Transform) for Chroma DBF(L): De-Blocking Filter for Luma DBF(C): De-Blocking Filter for Chroma EoM : The end of the macroblock process 19 Spatial Dependency of V-Threads MB Data Dependency V-Thread 00 V-Thread 01 V-Thread 02 V-Thread 10 V-Thread 11 V-Thread 12 20

V-Kernel Shared Memory Model open_private_memory() open_protected_memory() PRIVATE PUBLIC PROTECTED L1 Cache Write Invalidate close_private_memory() allocate_public _memory() INVALID close_protected_memory() free_public _memory() L1 Cache Invalidate Cache coherency among s is maintained by software. L1 $ read L1 $ write L2 Direct read L2 Direct write PUBLIC NG NG OK OK PRIVATE OK OK NG NG PROTECTED OK NG (OK) NG 21 V-Kernel and V-Thread Execution Framework Media FW runs on V-Kernel User API and V-Thread Execution Framework API V-Kernel User API Media FW V-Thread Execution Frame API Syntax & V-Thread Dispatch V-Kernel Library Memory Management V-Kernel Base V-Thread Execution Framework V-Thread Execution e.g. Signal Processing HW Memory Headquarters Venezia Subsystem 22

Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 23 VeneziaEX: Evaluation Chip of Venezia Architecture L2$ SRAM 4Mbit D$ Bus L2$ I/F Controller PLL I$ Technology Die Size Frequency Supply Voltage 65nm CMOS, 8LM 5.06mm x 5.06 mm 333MHz (, L2$ Logic) 166MHz (L2$ SRAM, Bus I/F) 2.5V (I/O) 1.2V (Core) 1.2V/0.95V/0V (SVC Output) 5R3W RegFile L1 Cache 8KB (Instruction), 8KB (Data) 2-way, FIFO, 64B Line L2 Cache 512KB (unified), 4-way, LRU, 256B Line 24

Performance Scalability H.264 720p Decode Scalability Bottlenecks I-Picture: VLD P-Picture: L2 Cache Miss Penalty Frame Rate [fps] 4.1x Number of s 25 Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 26

Summary Venezia: Scalable Multicore Subsystem for Multimedia Applications RISC Cores (Headquarters), s, and L2 Cache is 3-way VLIW Processor with SIMD Instructions Cache Based Memory System to Realize Performance Scalability with SW Binary Compatibility V-Kernel and V-Thread Execution Framework Focus on A/V CODECs Video is Divided into a Large Number of Small V-Threads Cache Coherency Is Maintained by Software Performance Evaluation Results x4.1 Performance Improvement by 6 s 27 28