1TOPS/W Software Programmable Media Processor. David Moloney, CTO, Movidius 19 August 2011

1 1TOPS/W Software Programmable Media Processor David Moloney, CTO, Movidius 19 August 2011

2 Movidius Background
- Started in 2005 looking at mobile gaming acceleration
- Decided on multicore design to allow software derivatives and meet OPS/W/$ target
- Existing processors poor cost/performance match for target workloads
- Developed SHAVE vector processor with HW support for sparse data-structures (Matrix-Vector)
- Expanded ISA to support C-compiler
- Talked to mobile phone customers in 2007
- Turned out their real problem was video
- Back to the drawing-board!
- Initial 65nm silicon & all IP on founder & angel funding
- Allowed us to close A-round in October 2008!
- 65nm Myriad MM SoC in mass-production
- Next generation 28nm SoC H1/2012 with 10x Perf/W

3 Mobile Video Processing Workload 20/Apr/2011

4 Movidius SHAVE Processor
Streaming Hybrid Architecture Vector Engine
- Hybrid of RISC, DSP, VLIW & GPU architectural features
- 128-bit vector arithmetic: 8/16/32-bit INT & fp16/fp32
- Unique proprietary architecture
- Tailored to streaming workloads and architected for outstanding OPS/mW/$ performance
- Excellent graphics and matrix mathematics support
- HW texture unit for good graphics performance
- Predicated execution to eliminate branches
- Compiler-friendly architecture
- HW support for compressed data-structures (e.g. matrices)
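The "128-bit vector arithmetic" item above implies a fixed number of lanes per vector for each element type. The short sketch below only does that division; the names are illustrative, and whether every lane is issued each cycle on SHAVE is not stated in the slides.

```python
# Lanes per 128-bit vector for the element types named on the slide.
# This is plain arithmetic (vector width / element width), not a
# statement about SHAVE issue rates, which the slides do not give.

VECTOR_BITS = 128

ELEMENT_BITS = {
    "int8": 8,
    "int16": 16,
    "int32": 32,
    "fp16": 16,
    "fp32": 32,
}


def lanes_per_vector(element_bits, vector_bits=VECTOR_BITS):
    """Return how many elements of the given width fit in one vector."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("element width must divide the vector width")
    return vector_bits // element_bits


LANES = {name: lanes_per_vector(bits) for name, bits in ELEMENT_BITS.items()}

if __name__ == "__main__":
    for name, lanes in LANES.items():
        print(f"{name}: {lanes} lanes per {VECTOR_BITS}-bit vector")
```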

5 SHAVE Instruction-Set
RISC-style
- Instruction predication
- Extensive integer ISA
- Excellent C-compiler support
DSP-style
- Zero overhead looping
- Modulo addressing
- Transparent DMA modes
- FFT, Viterbi, and other DSP operation support
- Parallel comparisons
VLIW-style
- Parallel functional units controlled by VLIW instr.
- 8/16/32-bit x 1-4 SIMD INT
GPU-style
- Streaming operations
- Floating-point operations (fp16/fp32 IEEE-compliant)
- Texture-Management Unit and L1 Cache
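Of the DSP-style features listed above, modulo addressing is the easiest to illustrate in software. The sketch below is a generic FIR filter over a circular delay line, where the index wrap that the `%` operator performs is what DSP hardware typically does for free in its address generator. It is not SHAVE code; the function name and structure are assumptions made for illustration.

```python
def fir_circular(samples, taps):
    """FIR filter using a circular (modulo-addressed) delay line.

    delay[head] always holds the newest sample; older samples sit at
    (head - k) % n, so no data is ever shifted in memory.
    """
    n = len(taps)
    delay = [0.0] * n
    head = 0
    out = []
    for x in samples:
        delay[head] = x
        acc = 0.0
        for k in range(n):
            # Modulo addressing: wrap the read index around the buffer.
            acc += taps[k] * delay[(head - k) % n]
        out.append(acc)
        # Advance the write pointer, wrapping at the end of the buffer.
        head = (head + 1) % n
    return out


if __name__ == "__main__":
    # An impulse input reproduces the tap values in order.
    print(fir_circular([1.0, 0.0, 0.0, 0.0], [3.0, 2.0, 1.0]))
```

On hardware with modulo addressing and zero-overhead looping, neither the wrap nor the inner loop's branch costs extra instructions, which is why the two features are usually listed together.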

6 SHAVE ISA Richness

7 65nm Myriad SoC: SHAVE Processor
[Block diagram. Variable-length instruction feeding the functional units PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, TMU and DMA at 180MHz. Register files: IRF 32x32, SRF 32x32, VRF 32x128. Per SHAVE: 128kB SRAM tile, TMU with 1kB L1 cache, IDC, DCU. Shared: SHAVE bus, 128-bit AXI, 128kB 2-way L2, DDR2 controller, 16/64MB SDRAM die. Bandwidth labels on the diagram: 2.9, 12.2, 8.6, 17.3, 5.8, 5.8 and 1.5GB/Sec.]

8 Myriad Silicon Platform
[Block diagram. SW-controlled I/O multiplexing: MEBI, SEBI, SDIO, SPI x3, I2C x3, I2S x3, FLSH, LCD, Cam, USB2 OTG, JTAG, UART, TS, GPS. Eight SHAVE blocks (labelled SVE), each with a 128kB memory tile and a TMU with L1. Also shown: TIM, RISC, NAL, L2 cache, main bus, bridge, DDR interface to a stacked 16/64MB SDRAM die, Movidius IP legend. Width labels on the diagram: 32, 64, 128. Headline figure: 50GFLOPS/W (IEEE 754 SP).]

9 Myriad GOPS/Watt (Total)
[Bar chart: Myriad GOPS/W (total) and GOPS/W (arith) for int8, int16, int32, fp16 and fp32, broken down by functional unit (PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, TMU, DMA). Callout: 1004 GOPS/W. Other surviving value labels: 708, 560, 263.]

10 Myriad GOPS/Watt (Arithmetic)
[Bar chart: Myriad GOPS/W (arith) for int8, int16, int32, fp16 and fp32, with contributions from the arithmetic units IAU, SAU and VAU among the functional units PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, TMU, DMA. Callout: 181 GOPS/W.]

11 Myriad 65nm CMOS LP Die
[Die photograph: SHAVE processors, RISC sub-system and analog blocks on the Myriad die, with a stacked 16MB SDRAM die.]
[Comparison table. Columns: Author, Year, FLOPS/core, Cores, GFLOPS, W, GFLOPS/W. Rows: Myriad (Movidius), KAIST, Intel, Adapteva. Numeric values are not recoverable.]

12 Technology - Platform Approach
[Layer diagram. Labels: 3D Capture, Video Edit, Applications, Software Modules, Silicon Platform, Products, 3D Video, Anaglyph-3D, Foundation Technology.]

13 Myriad Example Applications
[Diagram: allocation of SHAVE 1 to SHAVE 8 across a series of example applications. Application names are not recoverable.]

14 MoviSim ISS Architecture
[Block diagram. Labels: Runtime, OpenCL, SABRE Debugger, MoviSim; SHAVE ISS (x3), LEON/ARM ISS, Heterogeneous Core ISS, DRAM; Simulation Engine with XML Architecture Description, XML Parser, Task Allocator, Memory Allocator, Messaging, Instrumentation and Model API; Thread 0, Thread 1, ..., Thread n-1, Thread m-1, Thread m.]

15 Fragrak 28nm Platform
[Block diagram. SW-controlled I/O multiplexing: MEBI, SEBI, SDIO, SPI, I2C, I2S, FLSH, MIPI DSI, MIPI CSI, LCD, USB2 OTG, JTAG, UART, TS, GPS. SHAVE processors with 128kB and 256kB memory labels. Also shown: TIM, RISC, ICB (x4), NAL, XCB, 512kB L2, main bus, bridge, LP DDR3 interface to a stacked 256/512MB SDRAM die, Movidius IP legend. A GFLOPS/W (IEEE 754 SP) label appears without a recoverable value.]

16 Any questions? The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/ ) under grant agreement n (PEPPHER Project,

17 Abstract The rationale and architecture behind a new software programmable multimedia coprocessor for mobile devices are outlined. The focus of the architecture is on power-efficient operation, allowing functions which are traditionally implemented in fixed-function hardware to be implemented competitively in software. For instance, the sustained single-precision IEEE 754 rate is 50GFLOPS/W, allowing existing applications to be ported with great ease. The device supports 8, 16, 32 and some 64-bit integer operations as well as fp16 (OpenEXR) and fp32 arithmetic, and is capable of an aggregate 1 TOPS/W maximum 8-bit equivalent operations in a low-cost plastic BGA package with integrated 16 or 64MB SDRAM. New architectural features such as support for random-accessible sparse data-structures are implemented for the first time, improving memory utilization and bandwidth efficiency. Power efficiency is paramount and the device contains a total of 11 power islands, 8 of which are dedicated one-per-processor to the integrated SHAVE processors, allowing very fine-grained power control. Comparisons to previous work based on 65nm silicon and applications are shown to illustrate the power of the device.
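The abstract names both fp16 (the OpenEXR "half" format, which shares its 1-5-10 bit layout with IEEE 754 binary16) and fp32. As a rough illustration of what is traded between the two, the sketch below round-trips a value through each format using Python's standard `struct` module. The helper names are hypothetical and nothing here is specific to the Myriad device.

```python
import struct


def to_fp16_and_back(x):
    """Round x to IEEE 754 binary16 (2 bytes) and convert back to float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32_and_back(x):
    """Round x to IEEE 754 binary32 (4 bytes) and convert back to float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


if __name__ == "__main__":
    x = 0.1
    print("fp16 bytes:", struct.calcsize("<e"), "error:", abs(to_fp16_and_back(x) - x))
    print("fp32 bytes:", struct.calcsize("<f"), "error:", abs(to_fp32_and_back(x) - x))
```

Half the storage per element means twice as many elements per vector and per memory transfer, at the cost of a larger rounding error, which is usually acceptable for pixel and texture data.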

18 Myriad GOPS/Watt (Total/Arithmetic)
[Combined bar chart: Myriad GOPS/W (total) and GOPS/W (arith) for int8, int16, int32, fp16 and fp32, on a 0 to 1200 GOPS/W axis, broken down by unit (PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, TMU, DMA, RISC). Surviving GOPS/W (total) labels: 1004, 708, 560. Other value labels (263, 99, 91, 45 and the per-unit arithmetic figures) cannot be assigned to a series with confidence.]

19 SHAVE Processor 28nm (Fragrak)
[Block diagram. Variable-length instruction feeding the functional units PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, TMU and DMA at 800MHz. Register files: IRF 32x32, SRF 32x32, VRF 32x128. Per SHAVE: 256kB SRAM, TMU with L1 cache, IDC, DCU. Shared: Intra-Cluster Bus (ICB), Xtra-Cluster Bus (XCB), 128-bit AXI bus, 512kB 2-way L2 cache, LP DDR3 controller, SDRAM die. Bandwidth labels on the diagram: 16x800MHz = 25.6GB/Sec; 8x800MHz = 12.8GB/Sec; 16x800MHz 12.8GB/Sec; 4x17x800MHz 54.4GB/Sec; 4x12x800MHz 38.4GB/Sec; 16x12x800MHz 76.8GB/Sec.]
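Several of the bandwidth labels on this slide are written as a product of small integers and the 800MHz clock. A minimal sketch of that arithmetic follows, assuming the integer factors are bytes per transfer and a port or unit count (the slide does not say which is which). It reproduces the 54.4GB/Sec and 38.4GB/Sec labels; not every label in the extracted text follows the same simple product, so an unstated factor or extraction damage is possible for the others.

```python
def bandwidth_gb_per_s(clock_mhz, *factors):
    """Peak bandwidth in GB/s as clock (MHz) times integer width/port factors.

    The meaning of each factor (bytes per transfer, number of ports,
    number of units) is an assumption; the slide only shows the products.
    """
    product = clock_mhz
    for factor in factors:
        product *= factor
    # MHz * bytes gives MB/s; divide by 1000 for GB/s (decimal units).
    return product / 1000


if __name__ == "__main__":
    print(bandwidth_gb_per_s(800, 4, 17))  # label on slide: 54.4GB/Sec
    print(bandwidth_gb_per_s(800, 4, 12))  # label on slide: 38.4GB/Sec
```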

20 BW Hierarchy
Level                                      Myriad 65nm    Fragrak 28nm
Registers (V/S/IRF), SRAM, L1 cache, ICB   547GB/Sec      4864GB/Sec
  ratio to next level                      190:1          42:1
L2 cache, XCB                              2.88GB/Sec     115GB/Sec
  ratio to next level                      2:1            18:1
SDRAM                                      1.44GB/Sec     6.4GB/Sec
Bottom-Line - Very High Sustainable Performance
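The ratios quoted on this slide can be checked from the bandwidth figures themselves. The sketch below assumes the figures pair up as on-chip 547 and 4864GB/Sec, L2 2.88 and 115GB/Sec, and SDRAM 1.44 and 6.4GB/Sec for Myriad and Fragrak respectively; that pairing is inferred from the fact that it reproduces the 190:1, 2:1, 42:1 and 18:1 ratios, and the dictionary keys are illustrative names.

```python
# Bandwidth per level in GB/s, paired as described in the lead-in
# (an inferred pairing, chosen because it reproduces the slide's ratios).
BANDWIDTH_GB_S = {
    "Myriad 65nm": {"on_chip": 547.0, "l2": 2.88, "sdram": 1.44},
    "Fragrak 28nm": {"on_chip": 4864.0, "l2": 115.0, "sdram": 6.4},
}


def level_ratios(levels):
    """Return (on-chip : L2, L2 : SDRAM) ratios rounded to whole numbers."""
    return (
        round(levels["on_chip"] / levels["l2"]),
        round(levels["l2"] / levels["sdram"]),
    )


if __name__ == "__main__":
    for platform, levels in BANDWIDTH_GB_S.items():
        upper, lower = level_ratios(levels)
        print(f"{platform}: {upper}:1 on-chip to L2, {lower}:1 L2 to SDRAM")
```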

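21 BW Hierarchy (Detail)
[Two tables, one for Myriad 65nm and one for Fragrak 28nm. Myriad columns: VRF, SRF, IRF, LSU, IDC, L1, ISB, L2, SDRAM. Fragrak columns: VRF, SRF, IRF, LSU, IDC, L1, ICB, L2, XCB, SDRAM. Rows in both: Clk, Bytes, Ports, BW, #SHAVES, Total BW. Numeric values are not recoverable.]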

22 LSU HW Sparse-Data Support
[Datapath diagram. Labels: IRF bitmap field bm7 to bm0; f[7] to f[0], f[7:0]; addr[1:0]; fen[2:0]; word_cnt; addr_gen; pw_config; instr_f[7:0]; IRF[base_addr]; RAM_addr[31:0]; RAM_wr; RAM_rd; bru_hold. Width labels: 6, 32.]

23 Sparse Data-Structure Example
[Diagram: a scaling matrix (entries sx, sy, sz, 1.0 and 0.0) and x, y and z element vectors stored as a bitmap description (bmp) followed by packed data in 64-bit RAM words (x0 x1, x2 x3, y0 y1, y2 y3, z0 z1, z2 z3), addressed relative to a base address, with a pointer to the next structure and its bitmap.]
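The idea behind a bitmap-described sparse structure can be sketched in software: one bit per element records whether a value is stored, only the stored values are packed into memory, and random access counts the set bits below the requested position to find the packed index. The class below is a generic illustration under those assumptions; its names and layout are hypothetical and it does not reproduce the SHAVE LSU's actual bitmap encoding or address generation.

```python
from dataclasses import dataclass


@dataclass
class BitmapSparse:
    """Row-major matrix storing only non-zero entries plus a presence bitmap."""

    rows: int
    cols: int
    bitmap: int    # bit (r * cols + c) is 1 if element (r, c) is stored
    values: list   # stored elements, in increasing bit order

    @classmethod
    def from_dense(cls, dense):
        rows, cols = len(dense), len(dense[0])
        bitmap = 0
        values = []
        for r in range(rows):
            for c in range(cols):
                if dense[r][c] != 0.0:
                    bitmap |= 1 << (r * cols + c)
                    values.append(dense[r][c])
        return cls(rows, cols, bitmap, values)

    def get(self, r, c):
        """Random access: popcount of lower bits gives the packed index."""
        pos = r * self.cols + c
        if not (self.bitmap >> pos) & 1:
            return 0.0
        lower_bits = self.bitmap & ((1 << pos) - 1)
        return self.values[bin(lower_bits).count("1")]


if __name__ == "__main__":
    # A 4x4 scaling matrix: 4 stored values instead of 16.
    scale = [
        [2.0, 0.0, 0.0, 0.0],
        [0.0, 3.0, 0.0, 0.0],
        [0.0, 0.0, 4.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    m = BitmapSparse.from_dense(scale)
    print(hex(m.bitmap), m.values, m.get(2, 2), m.get(0, 1))
```

For the scaling matrix, a 16-bit bitmap plus four values replaces sixteen stored elements, which is the kind of memory and bandwidth saving the abstract attributes to sparse data-structure support.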

24 References
1) H-E. Kim, J-S. Yoon, K-D. Hwang, Y-J. Kim, J-S. Park, L-S. Kim, "A 275mW heterogeneous multimedia processor for IC-stacking on Si-interposer", Proc. ISSCC
2) S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote and N. Borkar, "An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS", Proc. ISSCC 2007, pp.5-7
3) A. Olofsson, R. Trogan, O. Raikhman, "A 25 GFLOPS/Watt Software Programmable Floating Point Accelerator", HPEC 2010, Sep
4) C.Y. Park, N.I. Cho, "A fast algorithm for the conversion of DCT coefficients to H.264 transform coefficients", ICIP 2005 Proceedings
