Multicore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF.

Size: px

Start display at page:

Download "Multicore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF."

Erika Phillips
5 years ago
Views:

General Director, SOC Center National Taiwan University DSP/IC

1 Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems Liang-Gee Chen Distinguished Professor General Director, SOC Center National Taiwan University DSP/IC Design Lab, GIEE, NTU 1 Multicore SoC is coming Source: 2007 ISSCC and IDF 2

2 Multithread Multiprocessor SM Core Source: nvidia 3 Typical Architecture of MPSOC Source: ITRS 4

a chip doubling every 18-24 Months Same design

3 The Perspective for MPSoC Source: ITRS, IEK/ITRI Driving Forces for MPSOC Technology Push: Semiconductor technology Moore s law is still working Number of transistors on a chip doubling every Months Same design with Smaller area Lower cost Same area with more functionality 6

4 Demand Pull Driving Forces for MPSOC RMS (Recognition, Mining and Synthesis) for Server MMM (Mobile( MultiMedia) ) for end products 7 7 Mobile multimedia platform Mobile device becomes the personal multimedia platform 8

5 Multimedia Signal Processing Signals: Image, Speech, Audio, Graphic Representations: Bitstream, Frame, Pixel Applications: 2D/3D Modeling, Animation, Content Based Indexing and Retrieval, Storage and Transmission Key Technology: Video Coding and Graphics Ref: Prof. Pirsch Keynote on ICME Evaluation of Video Coding Standards 10

functionalities 11 11 Evolution of Coding

6 Services with Scalable Functions From compression efficiency to specific functionalities Evolution of Coding Tools New adaptive prediction modes, function-oriented tools 12 12

7 Parallel Processing Architectural Perspective for Multimedia Processing Requirement: Processing of independent segments Pipelining, systolic processing Task level parallelism Instruction level parallelism Data level parallelism Stream Processor Requirement: Processing of dependent segments Scalable architecture Multi-thread and cache Design Space in Various Levels 14 14

8 H.264 Encoding as Example 15 P-frame scheme Data Dependency Analysis Referenced frames are all different B-frame scheme The B-frames may have the same referenced P- frames The next P-frame also has the same reference frame 16 16

9 Frame-Parallel Encoding Scheme Encode frames of same reference frames in parallel The MBs of same location are encoded simultaneously Achieve frame-level data reuse For IBBP scheme, 66% system memory bandwidth is saved I P ME Algorithm Predicted moving window (PMW) search Adaptive window size 18

10 Concept of Data Sharing Data sharing in two directions within 16 4 candidates 19 Level C and Level D Schemes Level C Horizontal data reuse N Search Region 0 SR H -1 N Search Region 1 N ( SRV + N 1) EaLevelC 1+ N N SRB = ( SRH + N 1)( SRV + N 1) Level D Fully data reuse SR N V N N SR V + N-1 SR H -1 Search Region 0 CB 0 CB 1 Ea LevelD =1 SRB = H V SR V -1 ( SR + W 1)( SR 1) CB 0 CB 1 N Search Region 1 W+SR H

11 2D Bandwidth Sharing ME Architecture Horizontal DS (16x) Vertical DS (4x) Data Sharing (DS) within 16 4 candidates 21 Horizontal Data Sharing AD AD AD AD AD AD AD AD AD AD AD AD AD AD AD AD Sub Tree BW reduction: (12.1%) ACC 22

12 Vertical Data Sharing BW reduction: 16x4 19 (29.7%) 23 Main Concepts of the FME Stage Thorough parallelization of each block with full pipeline and high utilization Sequential process of VBS The least common factor of VBS is 4x4. The SATD involves 4x4 Hadamard transform. 1. Design a 4x4-block processing unit (PU) 2. Apply the folding technique for larger blocks to iteratively utilize the PUs 24

13 4x4-Block Processing Unit (PU) 4x throughput but similar gate count compared with the prior sequential transform design 25 Parallel Configuration of PUs Luma MC SRAM Cur. MB SRAM MV Cost Gen. Ref. Cost Gen. Mode Cost Gen. Search Area SRAMs (Shared with IME) 9x4 MV Costs 4x4-Block PU 4x4-Block PU 4x4-Block PU Lagrangian Mode Decision Router Interpolation 4x4-Block PU 4x4-Block PU 4x4-Block PU 4 Ref. Costs 9x4 Candidate Costs Best Inter Mode Info. Buffer 4x4-Block PU 4x4-Block PU 4x4-Block PU Inter Mode Decision Results to IP Shared interpolation saves 764K logic gates. Nine PUs of Adjacent 3x3 Candidates for Every Ref. Frame 26

14 4x4 Intra Prediction (I4MB) 27 16x16 Intra Prediction (I16MB) 28

15 Reconfigurable Computing Four processing elements Generate four predictors in one cycle Normal configuration I4MB modes except hori. and vert. Bypass configuration I4MB/I16MB hori. and vert. modes Cascading configuration I4MB and I16MB DC modes Recursive configuration I16MB plane prediction mode Neighboring Rec. Pels, Parameters, PE Connections One PE PE Conn. 29 Interleaved I4MB/I16MB Schedule I4MB Block 0 9 Modes I4MB Block 1 9 Modes I4MB Block 15 9 Modes I16MB MB Mode 0 I16MB MB Mode 1 I16MB MB Mode 2 I16MB MB Mode 3 Bubbles Bubbles Bubbles I4MB Block 0 9 Modes I4MB Block 1 9 Modes I4MB Block 15 9 Modes Bubbles Eliminated I16MB Block 0 Mode 0-3 I16MB Block 1 Mode 0-3 I16MB Block 14 Mode 0-3 I16MB Block 15 Mode

16 System Architecture for Video Encoding 31 Chip Micrograph DWT Level 1 RDO DWT Level 2 Main Controller Arithmetic Encoder FIFO BSF Block Coder Source: Huang, NTU, ISSCC

17 Chip Features Technology Supply Voltage Core Area Logic Gates SRAM Encoding Tools Operating Frequency Power Consumption 0.18 μm CMOS 1P6M 1.8V mm K (2-input NAND gate) 34.72KB Baseline Profile Compression 81MHz for D1 108MHz for HDTV720p 581mW for D1 785mW for HDTV720p 33 Introduction Low power stream video processor Data parallel optimazation Outline Multi-core stream processor SoC for Graphic and computer vision Streaming data model Conclusion 34

GPUs evolution Evolution of the PC hardware graphics pipeline: 1995-1998: Texture mapping and z-buffer 1998: Multitexturing 1999-2000: Transform and lighting 2001:

18 GPUs evolution Evolution of the PC hardware graphics pipeline: : Texture mapping and z-buffer 1998: Multitexturing : Transform and lighting 2001: Programmable vertex shader : Programmable pixel shader : Shader model 3.0 and 64-bit color support 2007: Stream computing 35 Traditional GPU pipeline 36

19 Performance evolution Power Vertex Rate GFLOPS Area Memory Bandwidth 37 Contemporary GPUs evolution Area Graphic Specific Computing Pixel Fill Rate ASIC Memory Bandwidth Programability Power Vertex Rate GFLOPS Stream Processing Stream ASIC computing mixed Is coming With (NV9 Programmability ATI-R700) Memory Bandwidth (GB/s) Vertex Rate (M vertices/s) Pixel Rate (M pixels/s) GFLOPS (G flops) Power (mw) Area (M Transistors) GL1.1 DX6 DX7 DX8 DX9 GL1.4 GL1.5 DX10 GL2.0 38

20 Stream Programming Model Stream program organizes data as streams and computation as kernel Stream element (record) 8-bit pixel User defined data structure (MB, RGB_pixel ) Record Record Record Kernel Define computation from input streams to output streams Only access local memory space!! Stream Processing Model Stream Buffer Bridge Stream Processing Model Stream Input Registers/Cache Input Stream Data Reference Data Buffer Texture Cache Temporary Registers Kernel Execution Uit Constant Registers Reference Data Buffer Stream Output Registers/Cache Bridge Output Stream Data Stream Buffer 40

Unified Stream Kernel Fragment Operation Bridge Texture External Memory Vertex Stream Tile Stream Texture

21 Multi-core System Architecture Dual-Kernels for graphic pipeline efficiency On Chip System Geometry Stream Fragment Stream Primitive Fetching Bridge Unified Stream Kernel Primitive Setup Bridge ATDF Unified Stream Kernel Fragment Operation Bridge Texture External Memory Vertex Stream Tile Stream Texture Stream Color/Z Stream RISC CPU (GLES Application and Driver) 42 Pipeline Task Sharing by Reconfigurable Stream Core 43

22 Unified Stream Core 44 Work Load Analysis 45

23 Task level scalable for multi-core system 46 Reconfiguration for instruction path 47

24 Reconfiguration for instruction path 48 Average 1.6 times speed up Measurement result Utilization rate(%) Proposed ATS Non-schduling Frame Accepted as Configurable Memory Array for Multimedia Multithreading Stream Processing Architecture, in Proc. IEEE International Symposium on Circuits and Systems (ISCAS 08), May

25 Stream architecture for Mobile Multimedia Duo heterogeneous cores Due homogenous stream processor cores Multi-Clock domain power optimization 50 Die photo and specification Accepted as A 26mW 6.4GFLOPS Multi-Core Stream Processor for Mobile Multimedia Applications, in Digest of Technical Papers of Symposium on VLSI Circuits,

Stream Architecture for Computer Vision 2D Stream processor Computer vision function architecture Tile base stream processing Parallel mask conditional operation Avoid a lot of branch instructions

26 Stream Architecture for Computer Vision 2D Stream processor Computer vision function architecture Tile base stream processing Parallel mask conditional operation Avoid a lot of branch instructions Inst MEM Decode Controller Flag Mask ALU Flag Data MEM Mask Tile Memory Manager Tile Reg Data Reg Data MEM Kernal FIFO TileID Kernal Store TileID FIFO Function Inst MEM Decode Controlle r Tile Hand ler Hist Reg Loop Control Memor y Ctrl Tile Memory Manager Matrix Reg SIMD PE Tree Reg ALU Share d Reg SP Reg Stream Interface External Bus Stream Architecture for Computer Vision Flag Mask Data MEM Flag Mask Data MEM Tile Memory Manager Inst MEM Tile Reg TileID Kernal Kernal FIFO Decode Controller ALU Data Reg TileID Function Store FIFO SP Reg Stream Interface External Bus 53 53

Stream Architecture for Computer Vision Flag Mask Data MEM Flag Mask Data MEM Tile Memory Manager Inst MEM Decode Controller SP Reg ALU Tile Reg Data Reg Kernal FIFO TileID Kernal Store TileID FIFO

27 Stream Architecture for Computer Vision Flag Mask Data MEM Flag Mask Data MEM Tile Memory Manager Inst MEM Decode Controller SP Reg ALU Tile Reg Data Reg Kernal FIFO TileID Kernal Store TileID FIFO Function Inst MEM Hist Reg Matrix Reg Tree Reg Share d Reg Stream Interface External Bus Decode Controlle r Tile Hand ler Loop Control Memor y Ctrl Tile Memory Manager SIMD PE ALU Scalable Visual Processor 55

combined Change integrated PE array number Source: Chin, NTU, ISSCC 2008 56

28 ivisual (Intelligent Visual Processor) Highest level abstraction Image-in, answerout Scalable architecture for higher specification Can be easily combined Change integrated PE array number Source: Chin, NTU, ISSCC Video Stream Processor Data parallel Instruction parallel Source: ISSCC

29 Parallel Processing Architectural Perspective for Multimedia Processing Requirement: Processing of independent segments Pipelining, systolic processing Task level parallelism Instruction level parallelism Data level parallelism Stream Processor Requirement: Processing of dependent segments Scalable architecture Multi-thread and cache Architectural Perspectives of MPSOC Stream base PE Core with: Data parallel Instruction parallel Reconfigurable Scalable Generic CPU with -Record preparing 59

30 Conclusion Semiconductor Technology and Mobile MultiMedia applications are continuously the push-pull driving forces for MPSoC. End Products with Battery power for MMM requires novel multi-core architectures. Stream Processor could play important role to provide reconfigurability and scalability. 60 Thank you! 61

Computer Architecture

Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,