Venezia: a Scalable Multicore Subsystem for Multimedia Applications

Venezia: a Scalable Multicore Subsystem for Multimedia Applications Takashi Miyamori Toshiba Corporation Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 2

Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 3 Background Today's mobile multimedia devices support many audio and video CODECs. H.264, G-4, VC-1, AC-3, MP3, WMA etc. The size of image processing is increasing rapidly. QCIF, CIF, QVGA, VGA, 720p, 1080i, 1080p etc. Current SoC (T5V) VCORE (Video/JPEG) CPU VMCB VHMCB VDCT VHDCT DMAC VMEF VLZP VHLZP HW Solution: New designs are required for new CODECs. 90nm CMOS (16Mbit edram x2) H.264 decode @CIF 30fps, G-4 encode/decode @VGA 30fps 4

Our Approach: Scalable Multicore Processor Performance Scalability 720p VGA Codec FW Binary Binary Compatibility Compatibility L2$ L2$ Venezia QVGA L2$ 1 4 8 Number of s 5 MPSoC Architecture Trade-offs for Multimedia Applications, MPSoC 07 Homogeneous vs. Heterogeneous Homogeneous Heterogeneous Architecture Mx MDx MDA MDHx M M M M M D D D M D A M D H H Programmab ility / Scalability Perf./Cost or Perf./Power Examples Very good Good Fair Fair Good Very Good Core 2 (M 2, M 4 ), Xbox 360 CPU (M 3 ), MPCore(M 4 ), Niagara(M 8 ) Cell (MD 8 ), SB3000(MD4), Philips Cake/Wasabi Uniphier M: Main CPU D: DSP/Media Processor A: Accelerator H: Hardware Engine Most of current SoCs MPSoC 07 6 6

Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 7 Venezia Processor Architecture Headquarters MeP Core MeP Core Media Processing Engines (s) MeP Core MeP Core VLIW Cop. VLIW Cop. Venezia Architecture Multi RISC Cores & Multi s Cache-Based System I$ D$ I$ D$ I$ D$ I$ D$ Venezia L2 Cache Media Processing Engine 3-way VLIW Processor SIMD Instruction Support Small Size about 1.3mm2@65nm MeP: Media embedded Processor 8

(Media Processing Engine) MeP Core I$ + Ctrl. Dec. Reg. ALU D$ + Ctrl. IVC2 Dec. Cop. Reg. ALU ALU Mul. MeP (Media embedded Processor) Core 32-bit RISC Processor 5 Pipeline Stages 3 Low-power Mechanisms IVC2(Coprocessor) Extension of MeP Core 64-bit SIMD Operations 3-way VLIW (Core + Cop.A + Cop.B) Acc. Acc. Core Pipe Cop. PipeA Cop. PipeB 9 Instruction Formats 16b/32b Instruction Mode (Core Mode) core16 core32 cop.a/b32 64b Instruction Mode (VLIW Mode) core16 cop.a20 cop.b28 core32 cop.b28 Loop Control, Address Calcu., Load, and Store cop.a28 The instruction mode is switched by the special subroutine call instruction (bsrv). Data Calculation cop.b28 10

IVC2 64-bit SIMD Datapath IVC2 ALU/Shift/ Compare/ Pack/Unpack General-purpose Registers 64bitsx32 5R3W 8 Bits Bits x 8 16 16Bits x 4 32 32Bits x 2 64 64 Bits Bits Add/Subtract/ Shift Shift X1 Stage X2 Stage X3 Stage ALU Accumulator + + + ALU Multiplier X X X Accumulator + + + 8 Bits Bits x 8 16 16Bits x 4 32 32 Bits Bits x 2 32 32 Bits Bits x 8 64 64 Bits Bits x 4 Pipe 0 Pipe 1 11 Cache Memory System Headquarters L1 I$ L1 D$ 64b 512b Buffer L1 I$ 256b Datapath L1 D$ 64b 512b Buffer Interconnect L2 Cache L1 Inst./Data Caches 2-way Set Assoc. 8/16KB 64B Line Size L2 Cache 64/128/256/512KB 4-way Set Assoc. 256B Line Size Prefetch Functions L1 I$ Auto Prefetch L1 D$ Prefetch Inst. L2 Interconnect Buffer L2 $ Buffer L2 $ Prefetch Main Mem. L2 $ 12

Comparing Memory Systems for Chip Multiprocessors (Stanford Univ., ISCA 07) CC: Cache-Coherent Model 32KB 2-way assoc. cache STR: Streaming Model 24KB local memory and 8KB 2-way assoc. cache 512KB L2 cache Both models perform and scale equally well. Non-allocate store to cache can reduce memory traffic. Ex. Prepare for Store (PFS) instruction of MIPS32 G-2: Traffic due to write miss was reduced 56%. Streaming programming model, such as blocking and locality-aware scheduling, is efficient for cache model as well as streaming model. 13 L2 Cache to 64b Arbiter 512b Buffer 512b Buffer 512b Buffer Interconnect 256b Datapath 8-entry Queue L2 Cache Tag SRAM Throughput: 512b / 2 cycles Latency: 10 cycles Tag Check Refill Data SRAM to Main Bus 512b Write Back 1/2 CPU Freq. 14

Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 15 Software Hierarchy of Venezia V-Kernel: simple and light operating system kernel Media FW V-Kernel V-Kernel Library Memory Management V-Kernel Base V-Thread Execution Framework HW Memory Headquarters Venezia Subsystem 16

Multigrain Parallelism Exploits Instruction / Data Level Parallelism Multicore Architecture Exploits and V-Thread Level Parallelism Coarse Granularity V-Thread VLIW SIMD Application level e.g. Audio Decode, Video Decode Function level e.g. MC, IQ/IT Instruction level Data level Multi-task programming Multi-thread programming programming Headquarters & s Within Fine 17 V-Thread Execution Model Scalability and Compatibility by V-Thread Parallel Execution by s Abstraction of Computing Resources Media FW Headquarters V-Thread V-Thread Scheduler V-Thread V-Thread V-Thread 18

Granularity of V-Threads in H.264 Decode V-Thread 1 MC (L) V-Thread 2 IP/IQT (L) DBF (L) Video Signal Processor V-Thread 0 MVP BS Macro Blocks V-Thread 3 MC (C) V-Thread 4 IP/IQT (C) DBF(C) Video Signal Processor VGA: 1000 V-Threads / Frame 720p: 3000 V-Threads / Frame MVP: Motion Vector Prediction BS: Boundary Strength MC(L): Motion Compensation (and Weighted Prediction) for Luma MC(C): Motion Compensation (and Weighted Prediction) for Chroma IP/IQT(L): Intra Prediction (and Inverse Quantization, Inverse Transform) for Luma IP/IQT(C): Intra Prediction (and Inverse Quantization, Inverse Transform) for Chroma DBF(L): De-Blocking Filter for Luma DBF(C): De-Blocking Filter for Chroma EoM : The end of the macroblock process 19 Spatial Dependency of V-Threads MB Data Dependency V-Thread 00 V-Thread 01 V-Thread 02 V-Thread 10 V-Thread 11 V-Thread 12 20

V-Kernel Shared Memory Model open_private_memory() open_protected_memory() PRIVATE PUBLIC PROTECTED L1 Cache Write Invalidate close_private_memory() allocate_public _memory() INVALID close_protected_memory() free_public _memory() L1 Cache Invalidate Cache coherency among s is maintained by software. L1 $ read L1 $ write L2 Direct read L2 Direct write PUBLIC NG NG OK OK PRIVATE OK OK NG NG PROTECTED OK NG (OK) NG 21 V-Kernel and V-Thread Execution Framework Media FW runs on V-Kernel User API and V-Thread Execution Framework API V-Kernel User API Media FW V-Thread Execution Frame API Syntax & V-Thread Dispatch V-Kernel Library Memory Management V-Kernel Base V-Thread Execution Framework V-Thread Execution e.g. Signal Processing HW Memory Headquarters Venezia Subsystem 22

Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 23 VeneziaEX: Evaluation Chip of Venezia Architecture L2$ SRAM 4Mbit D$ Bus L2$ I/F Controller PLL I$ Technology Die Size Frequency Supply Voltage 65nm CMOS, 8LM 5.06mm x 5.06 mm 333MHz (, L2$ Logic) 166MHz (L2$ SRAM, Bus I/F) 2.5V (I/O) 1.2V (Core) 1.2V/0.95V/0V (SVC Output) 5R3W RegFile L1 Cache 8KB (Instruction), 8KB (Data) 2-way, FIFO, 64B Line L2 Cache 512KB (unified), 4-way, LRU, 256B Line 24

Performance Scalability H.264 720p Decode Scalability Bottlenecks I-Picture: VLD P-Picture: L2 Cache Miss Penalty Frame Rate [fps] 4.1x Number of s 25 Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and Performance Results Summary 26

Summary Venezia: Scalable Multicore Subsystem for Multimedia Applications RISC Cores (Headquarters), s, and L2 Cache is 3-way VLIW Processor with SIMD Instructions Cache Based Memory System to Realize Performance Scalability with SW Binary Compatibility V-Kernel and V-Thread Execution Framework Focus on A/V CODECs Video is Divided into a Large Number of Small V-Threads Cache Coherency Is Maintained by Software Performance Evaluation Results x4.1 Performance Improvement by 6 s 27 28