Green Multicore. David Moloney, CTO, Movidius 24 November 2011

Size: px

Start display at page:

Download "Green Multicore. David Moloney, CTO, Movidius 24 November 2011"

Hubert Ryan
5 years ago
Views:

1 Green Multicore David Moloney, CTO, Movidius 24 November 2011

2 Overview Fabless semiconductor company founded in 2005 VC backed (completing C-round 12:00) Focus on computational imaging and video Uniquely positioned for this market with a software-programmable media processor with state-of-the-art GOPS/W performance Enables SW derivatives of the base silicon platform Current 65nm product in mass-production and expected to ship 1-3M qty in 2012 Next gen 28nm product in design will deliver the power of a desktop GPU in a 8x8mm 350mW

3 Myriad of Applications Consumer Electronics Robotics Mobile phones Automotive Video/DSC Cameras Computational Cameras Medical HPC Camera Modules Wireless Cameras Aerospace

4 Technology - Platform Approach 3D Capture Video Edit Applications Software Modules Silicon Platform Products 3D Video Anaglyph-3D Foundation Technology 4

5 Mobile Video Processing Workload 5

6 GPU FLOPS/W Trend 7 GPU GFLOPS/W Historical Trend GPU GFLOPS/W 1.4x per Year G 100 GT 120 GT 130 GT GTS GT 210 GT 220 GT GTS GTX GTX GTX GTX GTX GTX GTX 295 GT 420 GT 430 GT 430 GT 440 GT GTS GTS GTX GTX GTX GTX GTX GTX SE

7 Movidius SHAVE Processor Unique proprietary architecture Tailored to streaming workloads and architected for outstanding OPS/mW/$ performance Streaming Hybrid Architecture Vector Engine Hybrid of RISC, DSP, VLIW & GPU architectural features 128-bit vector arithmetic: 8/16/32-bit INT & fp16/fp32 Excellent Graphics and matrix mathematics support HW texture unit for good graphics performance Predicated execution to eliminate branches Compiler-friendly architecture HW support for compressed data-structures (ex. matrices)

Myriad Silicon Platform SW Controlled I/O Multiplexing

FLSH LCD LCD Cam LCD USB2 OTG SPI I 2 C x3 SPI I 2 S x3

128kB 128kB 128kB TIM RISC TMU L1 TMU L1 TMU L1 TMU L1

SVE5 SVE6 SVE7 NAL Stacked 16/64MB SDRAM die 32 DDR L2

8 Myriad Silicon Platform SW Controlled I/O Multiplexing MEBI SEBI SDIO SDIO SDIO x3 x3 SPI SPI SPI x3 x3 x3 FLSH LCD LCD Cam LCD USB2 OTG SPI I 2 C x3 SPI I 2 S x3 JTAG UART UART TS 64 GPS 128kB 128kB 128kB 128kB 128kB 128kB 128kB 128kB TIM RISC TMU L1 TMU L1 TMU L1 TMU L1 TMU L1 TMU L1 TMU L1 TMU L1 SVE0 SVE1 SVE2 SVE3 SVE4 SVE5 SVE6 SVE7 NAL Stacked 16/64MB SDRAM die 32 DDR L2 Cache Main Bus 50GFLOPS/W (IEEE 754 SP) 128 Bridge Movidius IP

65nm Myriad SoC SHAVE Variable-Length Instruction PEU BRU LSU0 LSU1

9GB/Sec SHAVE Bus SHAVE Processor TMU 1kB cache L1 PEU IRF 32x32 IAU

6GB/Sec VRF 32x128 VAU Decoded instrs 128-bit AXI 128 kb 128kB 2-way

9 65nm Myriad SoC SHAVE Variable-Length Instruction PEU BRU LSU0 LSU1 VAU IAU SAU CMU 180MHz 128kB Per SHAVE 128kB SRAM Tile 2.9GB/Sec SHAVE Bus SHAVE Processor TMU 1kB cache L1 PEU IRF 32x32 IAU BRU CMU SRF 32x32 SAU LSU0 LSU1 IDC 12.2GB/Sec Myriad DCU 8.6GB/Sec VRF 32x128 VAU Decoded instrs 128-bit AXI 128 kb 128kB 2-way L2 17.3GB/Sec DDR2 Cont. 16/64 MB SDRAM 16/64MB SDRAM Die 5.8GB/Sec 5.8GB/Sec 1.5GB/Sec

20 0 GOPS/W (arith) 32 4 2 1 8 4 16 2 8 int8 int16 int32

10 Myriad GOPS/Watt (Arithmetic) Myriad 181 GOPS/W PEU BRU LSU0 LSU1 VAU SAU IAU CMU GOPS/W (arith) int8 int16 int32 fp16 fp I A U S A U VA U O P / W a r i t h

Myriad 65nm CMOS LP Die RISC sub-system SHAVE Analog SHAVE 16MB SDRAM DIE Myriad DIE 16MB Stacked 1 2 3 4 5 6 SDRAM 7 8 9 1 0 1 1 1 2 1 3 1 4 1 5 SHAVE SHAVE SHAVE SHAVE SHAVE SHAVE

11 Myriad 65nm CMOS LP Die RISC sub-system SHAVE Analog SHAVE 16MB SDRAM DIE Myriad DIE 16MB Stacked SDRAM SHAVE SHAVE SHAVE SHAVE SHAVE SHAVE Author Year FLOPS/core Cores GFLOPS W GFLOPS/W Myriad Movidius (1 KAIST (2 Intel (4 Adapteva

12 Now I ve got a Green Compute Platform What can I do with it?

13 MA1135-3D Converter Box Application HDMI in HDMI out image stripes 20/Apr/

Myriad Example Applications SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6

SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE

14 Myriad Example Applications SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 14

SHAVE5 SHAVE6 SHAVE7 Movidius Profiler Movidius Assembly Optimizer Code transformations Loop Unrolling Inline assembler

15 Application Development App Partition Compile Profile Optimize Power X86 C-code Visual Studio Intel Parallel Studio Refactor design Data-layout Use of DMA to handle streams Movidius C- compiler -LLVM SHAVE0 SHAVE1 SHAVE2 SHAVE3 SHAVE4 SHAVE5 SHAVE6 SHAVE7 Movidius Profiler Movidius Assembly Optimizer Code transformations Loop Unrolling Inline assembler Data-layout DRAM access DMA for streams Run quickly and switch off to minimize leakage Optimise clockrates for each SHAVE Power-off domains

Lightfield Requirements Replaces glass with SW CUDA

computationally expensive Interpolation key kernel

16 Lightfield Requirements Replaces glass with SW CUDA implementation of Giorgiev (Adobe) LF algorithm Very computationally expensive Interpolation key kernel Geforce GT120 at 130 GFLOPs and 50W (2.6GFLOPs/W) eforce_100_series GPU completes refocusing in 30ms (33.3fps) 4fps on Myriad 65nm Lytro Raytrix

17 Performance Roadmap (Nvidia)

Fragrak 28nm Platform SW Controlled I/O Multiplexing MEBI SEBI SDIO SDIO SDIO x3 x3 SPI SPI

UART TS 64 GPS 128kB 128kB SHAVE 128kB SHAVE 0 256kB 1SHAVE SHAVE 0 04 128kB 128kB SHAVE 128kB

128kB SHAVE 128kB SHAVE 0 256kB 1SHAVE SHAVE 0 16 TIM RISC ICB ICB ICB ICB NAL XCB Stacked

18 Fragrak 28nm Platform SW Controlled I/O Multiplexing MEBI SEBI SDIO SDIO SDIO x3 x3 SPI SPI SPI x3 x3 x3 FLSH MIPI LCD DSI x MIPI LCD CSI 2x USB2 OTG I SPI 2 C x3 I SPI 2 S x3 JTAG UART UART TS 64 GPS 128kB 128kB SHAVE 128kB SHAVE 0 256kB 1SHAVE SHAVE kB 128kB SHAVE 128kB SHAVE 0 256kB 1SHAVE SHAVE kB 128kB SHAVE 128kB SHAVE 0 256kB 1SHAVE SHAVE kB 128kB SHAVE 128kB SHAVE 0 256kB 1SHAVE SHAVE 0 16 TIM RISC ICB ICB ICB ICB NAL XCB Stacked 256/512MB SDRAM die 64 DDR3 LP L2 512kB Main Bus GFLOPS/W (IEEE 754 SP) Brid ge Movidius IP 18

19 GFLOPS/W in Context Movidius 28nm GPU rate of increase 1.4x per Year 7 Years to hit 50GFLOPS/W! Movidius 65nm GeForce G 100 Tesla C870 GeForce GT 120 GeForce GT 130 GeForce GT 140 GeForce GTS 150 Fermi GT 420 GeForce GT 210 GeForce GT 220 Tesla C1060 GeForce GT 240 GeForce GTS 250 Fermi GTX 465 GeForce GTX 260 Fermi GTS 450 GeForce GTX 260 Tesla C2050/C2070 GeForce GTX 260 Fermi GT 430 GeForce GTX 275 Tesla M2050 Tesla M2070/M2070Q GeForce GTX 280 GeForce GTX 285 GeForce GTX 295 Fermi GT 440 GeForce GT 420 Fermi GTX 460 SE GeForce GT 430 Fermi GTX 470 GeForce GT 430 GeForce GT 440 GeForce GT 440 Fermi GTX 480 GeForce GTS 450 Fermi GT 430 GeForce GTS 450 GeForce GTX 460 SE Fermi GTS 450 GeForce GTX 460 Fermi GTX 460 GeForce GTX 460 Fermi GTX 460 GeForce GTX 465 Fermi GT 440 GeForce GTX 470 GeForce GTX 480 Myriad Myriad2

20 Myriad of Cameras 1 Platform Standard camera All optical focusing: bulky lenses & autofocus for close-ups Wide aperture good for low-light but limits depth-of-field Scale and cost due to established manufacturing processes Lightfield camera (Plenoptic = Lightfield) Post-capture refocusing in software (Lytro) Computationally expensive (GPU-based = cloud Decouples aperture from Depth of Field (DoF) Array Camera (Stereo is a 2x1 special case) Uses array of MxN completely focused cameras Composite & interpolate array of low-res cameras (Levoy) Individual camera control allows: HDR capture, faulttolerance, slow-motion, power-saving etc.

21 Movidius Computational Imaging Applications Conventional Cameras Software Modules Silicon Platform Products 3D Stereo Cameras Foundation Technology Lightfield Cameras Tiny 8x8mm Myriad BGA Array Cameras

22 Summary Movidius 65nm silicon platform Ground-breaking functionality in SW Enabled by ground-breaking GFLOPS/W Compact form-factor In mass-production today 10x better GFLOPS/W than GPU Next generation 28nm SoC 9x perf/watt available in x better GFLOPS/W than GPU 22

23 Any questions? The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/ ) under grant agreement n (PEPPHER Project,

24 References 1) H-E. Kim, J-S. Yoon, K-D. Hwang, Y-J. Kim, J-S. Park, L-S. Kim, "A 275mw heterogeneous Multimedia processor for ic-stacking on Si-interposer" Proc. ISSCC ) S.Vangal, J.Howard, G.Ruhl, S.Dighe, H.Wilson, J.Tschanz, D.Finan, P.Iyer,A. Singh, T.Jacob, S.Jain, S.Venkataraman, Y.Hoskote and N.Borkar, "An 80-Tile 1.28TFLOPS Networkon-Chip in 65nm CMOS", Proc. ISSCC 2007, pp.5-7 3) A. Olofsson, R. Trogan, O. Raikhman, A 25 GFLOPS/Watt Software Programmable Floating Point Accelerator, HPEC 2010, Sep ) C.Y. Park, N.I. Cho, "A fast algorithm for the conversion of DCT coefficients to H.264 transform coefficients", ICIP 2005 Proceedings, pp

1TOPS/W Software Programmable Media Processor. David Moloney, CTO, Movidius 19 August 2011

1TOPS/W Software Programmable Media Processor David Moloney, CTO, Movidius 19 August 2011 Movidius Background Started in 2005 looking at mobile gaming acceleration Decided on multicore design to allow