Green Multicore. David Moloney, CTO, Movidius 24 November 2011

Size: px
Start display at page:

Download "Green Multicore. David Moloney, CTO, Movidius 24 November 2011"

Transcription

1 Green Multicore David Moloney, CTO, Movidius 24 November 2011

2 Overview Fabless semiconductor company founded in 2005 VC backed (completing C-round 12:00) Focus on computational imaging and video Uniquely positioned for this market with a software-programmable media processor with state-of-the-art GOPS/W performance Enables SW derivatives of the base silicon platform Current 65nm product in mass-production and expected to ship 1-3M qty in 2012 Next gen 28nm product in design will deliver the power of a desktop GPU in a 8x8mm 350mW

3 Myriad of Applications Consumer Electronics Robotics Mobile phones Automotive Video/DSC Cameras Computational Cameras Medical HPC Camera Modules Wireless Cameras Aerospace

4 Technology - Platform Approach 3D Capture Video Edit Applications Software Modules Silicon Platform Products 3D Video Anaglyph-3D Foundation Technology 4

5 Mobile Video Processing Workload 5

6 GPU FLOPS/W Trend 7 GPU GFLOPS/W Historical Trend GPU GFLOPS/W 1.4x per Year G 100 GT 120 GT 130 GT GTS GT 210 GT 220 GT GTS GTX GTX GTX GTX GTX GTX GTX 295 GT 420 GT 430 GT 430 GT 440 GT GTS GTS GTX GTX GTX GTX GTX GTX SE

7 Movidius SHAVE Processor Unique proprietary architecture Tailored to streaming workloads and architected for outstanding OPS/mW/$ performance Streaming Hybrid Architecture Vector Engine Hybrid of RISC, DSP, VLIW & GPU architectural features 128-bit vector arithmetic: 8/16/32-bit INT & fp16/fp32 Excellent Graphics and matrix mathematics support HW texture unit for good graphics performance Predicated execution to eliminate branches Compiler-friendly architecture HW support for compressed data-structures (ex. matrices)

8 Myriad Silicon Platform SW Controlled I/O Multiplexing MEBI SEBI SDIO SDIO SDIO x3 x3 SPI SPI SPI x3 x3 x3 FLSH LCD LCD Cam LCD USB2 OTG SPI I 2 C x3 SPI I 2 S x3 JTAG UART UART TS 64 GPS 128kB 128kB 128kB 128kB 128kB 128kB 128kB 128kB TIM RISC TMU L1 TMU L1 TMU L1 TMU L1 TMU L1 TMU L1 TMU L1 TMU L1 SVE0 SVE1 SVE2 SVE3 SVE4 SVE5 SVE6 SVE7 NAL Stacked 16/64MB SDRAM die 32 DDR L2 Cache Main Bus 50GFLOPS/W (IEEE 754 SP) 128 Bridge Movidius IP

9 65nm Myriad SoC SHAVE Variable-Length Instruction PEU BRU LSU0 LSU1 VAU IAU SAU CMU 180MHz 128kB Per SHAVE 128kB SRAM Tile 2.9GB/Sec SHAVE Bus SHAVE Processor TMU 1kB cache L1 PEU IRF 32x32 IAU BRU CMU SRF 32x32 SAU LSU0 LSU1 IDC 12.2GB/Sec Myriad DCU 8.6GB/Sec VRF 32x128 VAU Decoded instrs 128-bit AXI 128 kb 128kB 2-way L2 17.3GB/Sec DDR2 Cont. 16/64 MB SDRAM 16/64MB SDRAM Die 5.8GB/Sec 5.8GB/Sec 1.5GB/Sec

10 Myriad GOPS/Watt (Arithmetic) Myriad 181 GOPS/W PEU BRU LSU0 LSU1 VAU SAU IAU CMU GOPS/W (arith) int8 int16 int32 fp16 fp I A U S A U VA U O P / W a r i t h

11 Myriad 65nm CMOS LP Die RISC sub-system SHAVE Analog SHAVE 16MB SDRAM DIE Myriad DIE 16MB Stacked SDRAM SHAVE SHAVE SHAVE SHAVE SHAVE SHAVE Author Year FLOPS/core Cores GFLOPS W GFLOPS/W Myriad Movidius (1 KAIST (2 Intel (4 Adapteva

12 Now I ve got a Green Compute Platform What can I do with it?

13 MA1135-3D Converter Box Application HDMI in HDMI out image stripes 20/Apr/

14 Myriad Example Applications SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 SHAVE 1 SHAVE 2 SHAVE 3 SHAVE 4 SHAVE 5 SHAVE 6 SHAVE 7 SHAVE 8 14

15 Application Development App Partition Compile Profile Optimize Power X86 C-code Visual Studio Intel Parallel Studio Refactor design Data-layout Use of DMA to handle streams Movidius C- compiler -LLVM SHAVE0 SHAVE1 SHAVE2 SHAVE3 SHAVE4 SHAVE5 SHAVE6 SHAVE7 Movidius Profiler Movidius Assembly Optimizer Code transformations Loop Unrolling Inline assembler Data-layout DRAM access DMA for streams Run quickly and switch off to minimize leakage Optimise clockrates for each SHAVE Power-off domains

16 Lightfield Requirements Replaces glass with SW CUDA implementation of Giorgiev (Adobe) LF algorithm Very computationally expensive Interpolation key kernel Geforce GT120 at 130 GFLOPs and 50W (2.6GFLOPs/W) eforce_100_series GPU completes refocusing in 30ms (33.3fps) 4fps on Myriad 65nm Lytro Raytrix

17 Performance Roadmap (Nvidia)

18 Fragrak 28nm Platform SW Controlled I/O Multiplexing MEBI SEBI SDIO SDIO SDIO x3 x3 SPI SPI SPI x3 x3 x3 FLSH MIPI LCD DSI x MIPI LCD CSI 2x USB2 OTG I SPI 2 C x3 I SPI 2 S x3 JTAG UART UART TS 64 GPS 128kB 128kB SHAVE 128kB SHAVE 0 256kB 1SHAVE SHAVE kB 128kB SHAVE 128kB SHAVE 0 256kB 1SHAVE SHAVE kB 128kB SHAVE 128kB SHAVE 0 256kB 1SHAVE SHAVE kB 128kB SHAVE 128kB SHAVE 0 256kB 1SHAVE SHAVE 0 16 TIM RISC ICB ICB ICB ICB NAL XCB Stacked 256/512MB SDRAM die 64 DDR3 LP L2 512kB Main Bus GFLOPS/W (IEEE 754 SP) Brid ge Movidius IP 18

19 GFLOPS/W in Context Movidius 28nm GPU rate of increase 1.4x per Year 7 Years to hit 50GFLOPS/W! Movidius 65nm GeForce G 100 Tesla C870 GeForce GT 120 GeForce GT 130 GeForce GT 140 GeForce GTS 150 Fermi GT 420 GeForce GT 210 GeForce GT 220 Tesla C1060 GeForce GT 240 GeForce GTS 250 Fermi GTX 465 GeForce GTX 260 Fermi GTS 450 GeForce GTX 260 Tesla C2050/C2070 GeForce GTX 260 Fermi GT 430 GeForce GTX 275 Tesla M2050 Tesla M2070/M2070Q GeForce GTX 280 GeForce GTX 285 GeForce GTX 295 Fermi GT 440 GeForce GT 420 Fermi GTX 460 SE GeForce GT 430 Fermi GTX 470 GeForce GT 430 GeForce GT 440 GeForce GT 440 Fermi GTX 480 GeForce GTS 450 Fermi GT 430 GeForce GTS 450 GeForce GTX 460 SE Fermi GTS 450 GeForce GTX 460 Fermi GTX 460 GeForce GTX 460 Fermi GTX 460 GeForce GTX 465 Fermi GT 440 GeForce GTX 470 GeForce GTX 480 Myriad Myriad2

20 Myriad of Cameras 1 Platform Standard camera All optical focusing: bulky lenses & autofocus for close-ups Wide aperture good for low-light but limits depth-of-field Scale and cost due to established manufacturing processes Lightfield camera (Plenoptic = Lightfield) Post-capture refocusing in software (Lytro) Computationally expensive (GPU-based = cloud Decouples aperture from Depth of Field (DoF) Array Camera (Stereo is a 2x1 special case) Uses array of MxN completely focused cameras Composite & interpolate array of low-res cameras (Levoy) Individual camera control allows: HDR capture, faulttolerance, slow-motion, power-saving etc.

21 Movidius Computational Imaging Applications Conventional Cameras Software Modules Silicon Platform Products 3D Stereo Cameras Foundation Technology Lightfield Cameras Tiny 8x8mm Myriad BGA Array Cameras

22 Summary Movidius 65nm silicon platform Ground-breaking functionality in SW Enabled by ground-breaking GFLOPS/W Compact form-factor In mass-production today 10x better GFLOPS/W than GPU Next generation 28nm SoC 9x perf/watt available in x better GFLOPS/W than GPU 22

23 Any questions? The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/ ) under grant agreement n (PEPPHER Project,

24 References 1) H-E. Kim, J-S. Yoon, K-D. Hwang, Y-J. Kim, J-S. Park, L-S. Kim, "A 275mw heterogeneous Multimedia processor for ic-stacking on Si-interposer" Proc. ISSCC ) S.Vangal, J.Howard, G.Ruhl, S.Dighe, H.Wilson, J.Tschanz, D.Finan, P.Iyer,A. Singh, T.Jacob, S.Jain, S.Venkataraman, Y.Hoskote and N.Borkar, "An 80-Tile 1.28TFLOPS Networkon-Chip in 65nm CMOS", Proc. ISSCC 2007, pp.5-7 3) A. Olofsson, R. Trogan, O. Raikhman, A 25 GFLOPS/Watt Software Programmable Floating Point Accelerator, HPEC 2010, Sep ) C.Y. Park, N.I. Cho, "A fast algorithm for the conversion of DCT coefficients to H.264 transform coefficients", ICIP 2005 Proceedings, pp

1TOPS/W Software Programmable Media Processor. David Moloney, CTO, Movidius 19 August 2011

1TOPS/W Software Programmable Media Processor. David Moloney, CTO, Movidius 19 August 2011 1TOPS/W Software Programmable Media Processor David Moloney, CTO, Movidius 19 August 2011 Movidius Background Started in 2005 looking at mobile gaming acceleration Decided on multicore design to allow

More information

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013 A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company

More information

The World Leader in High Performance Signal Processing Solutions. DSP Processors

The World Leader in High Performance Signal Processing Solutions. DSP Processors The World Leader in High Performance Signal Processing Solutions DSP Processors NDA required until November 11, 2008 Analog Devices Processors Broad Choice of DSPs Blackfin Media Enabled, 16/32- bit fixed

More information

Development of Low Power and High Performance Application Processor (T6G) for Multimedia Mobile Applications

Development of Low Power and High Performance Application Processor (T6G) for Multimedia Mobile Applications Session 8D-2 Development of Low Power and High Performance Application Processor (T6G) for Multimedia Mobile Applications Yoshiyuki Kitasho, Yu Kikuchi, Takayoshi Shimazawa, Yasuo Ohara, Masafumi Takahashi,

More information

An Alternative to GPU Acceleration For Mobile Platforms

An Alternative to GPU Acceleration For Mobile Platforms Inventing the Future of Computing An Alternative to GPU Acceleration For Mobile Platforms Andreas Olofsson andreas@adapteva.com 50 th DAC June 5th, Austin, TX Adapteva Achieves 3 World Firsts 1. First

More information

INTEGRATING COMPUTER VISION SENSOR INNOVATIONS INTO MOBILE DEVICES. Eli Savransky Principal Architect - CTO Office Mobile BU NVIDIA corp.

INTEGRATING COMPUTER VISION SENSOR INNOVATIONS INTO MOBILE DEVICES. Eli Savransky Principal Architect - CTO Office Mobile BU NVIDIA corp. INTEGRATING COMPUTER VISION SENSOR INNOVATIONS INTO MOBILE DEVICES Eli Savransky Principal Architect - CTO Office Mobile BU NVIDIA corp. Computer Vision in Mobile Tegra K1 It s time! AGENDA Use cases categories

More information

There s STILL plenty of room at the bottom! Andreas Olofsson

There s STILL plenty of room at the bottom! Andreas Olofsson There s STILL plenty of room at the bottom! Andreas Olofsson 1 Richard Feynman s Lecture (1959) There's Plenty of Room at the Bottom An Invitation to Enter a New Field of Physics Why cannot we write the

More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Designing with NXP i.mx8m SoC

Designing with NXP i.mx8m SoC Designing with NXP i.mx8m SoC Course Description Designing with NXP i.mx8m SoC is a 3 days deep dive training to the latest NXP application processor family. The first part of the course starts by overviewing

More information

THE LEADER IN VISUAL COMPUTING

THE LEADER IN VISUAL COMPUTING MOBILE EMBEDDED THE LEADER IN VISUAL COMPUTING 2 TAKING OUR VISION TO REALITY HPC DESIGN and VISUALIZATION AUTO GAMING 3 BEST DEVELOPER EXPERIENCE Tools for Fast Development Debug and Performance Tuning

More information

Silicon Motion s Graphics Display SoCs

Silicon Motion s Graphics Display SoCs WHITE PAPER Silicon Motion s Graphics Display SoCs Enable 4K High Definition and Low Power Power and bandwidth: the twin challenges of implementing a solution for bridging any computer to any high-definition

More information

Embedded Computing without Compromise. Evolution of the Rugged GPGPU Computer Session: SIL7127 Dan Mor PLM -Aitech Systems GTC Israel 2017

Embedded Computing without Compromise. Evolution of the Rugged GPGPU Computer Session: SIL7127 Dan Mor PLM -Aitech Systems GTC Israel 2017 Evolution of the Rugged GPGPU Computer Session: SIL7127 Dan Mor PLM - Systems GTC Israel 2017 Agenda Current GPGPU systems NVIDIA Jetson TX1 and TX2 evaluation Conclusions New Products 2 GPGPU Product

More information

Will Everything Start To Look Like An SoC?

Will Everything Start To Look Like An SoC? Will Everything Start To Look Like An SoC? Vikas Gautam, Synopsys Verification Futures Conference 2013 Bangalore, India March 2013 Synopsys 2012 1 SystemVerilog Inherits the Earth e erm SV urm AVM 1.0/2.0/3.0

More information

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to CELL B.E. and GPU Programming. Agenda Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU

More information

The Challenges of System Design. Raising Performance and Reducing Power Consumption

The Challenges of System Design. Raising Performance and Reducing Power Consumption The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 18

ECE 571 Advanced Microprocessor-Based Design Lecture 18 ECE 571 Advanced Microprocessor-Based Design Lecture 18 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 11 November 2014 Homework #4 comments Project/HW Reminder 1 Stuff from Last

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Ultra Low Power GPUs for Wearables

Ultra Low Power GPUs for Wearables Ultra Low Power GPUs for Wearables Georgios Keramidas January 2015 The Company Who we are? Think Silicon is a privately held company founded in 2007. What we do? Development of low power GPU IP semiconductor

More information

Parallella: A $99 Open Hardware Parallel Computing Platform

Parallella: A $99 Open Hardware Parallel Computing Platform Inventing the Future of Computing Parallella: A $99 Open Hardware Parallel Computing Platform Andreas Olofsson andreas@adapteva.com IPDPS May 22th, Cambridge, MA Adapteva Achieves 3 World Firsts 1. First

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

DevKit8000 Evaluation Kit

DevKit8000 Evaluation Kit DevKit8000 Evaluation Kit TI OMAP3530 Processor based on 600MHz ARM Cortex-A8 core Memory supporting 256MByte DDR SDRAM and 256MByte NAND Flash UART, USB Host/OTG, Ethernet, Camera, Audio, SD, Keyboard,

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

Techniques for Optimizing Performance and Energy Consumption: Results of a Case Study on an ARM9 Platform

Techniques for Optimizing Performance and Energy Consumption: Results of a Case Study on an ARM9 Platform Techniques for Optimizing Performance and Energy Consumption: Results of a Case Study on an ARM9 Platform BL Standard IC s, PL Microcontrollers October 2007 Outline LPC3180 Description What makes this

More information

Timothy Lanfear, NVIDIA HPC

Timothy Lanfear, NVIDIA HPC GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision

More information

Will Everything Start To Look Like An SoC?

Will Everything Start To Look Like An SoC? Will Everything Start To Look Like An SoC? Janick Bergeron, Synopsys Verification Futures Conference 2012 France, Germany, UK November 2012 Synopsys 2012 1 SystemVerilog Inherits the Earth e erm SV urm

More information

A new Computer Vision Processor Chip Design for automotive ADAS CNN applications in 22nm FDSOI SOI Symposium Santa Clara, Apr.

A new Computer Vision Processor Chip Design for automotive ADAS CNN applications in 22nm FDSOI SOI Symposium Santa Clara, Apr. Dr. Jens Benndorf MD, COO Dream Chip A new Computer Vision Processor Chip Design for automotive ADAS CNN applications in 22nm FDSOI SOI Symposium Santa Clara, Apr. 13th, 2017 DCT Company Profile Dream

More information

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 20 ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi

More information

A Scalable Processor Architecture for the Next Generation of Low Power Supercomputer. PRACE Workshop, October 2010 Andreas Olofsson

A Scalable Processor Architecture for the Next Generation of Low Power Supercomputer. PRACE Workshop, October 2010 Andreas Olofsson A Scalable Processor Architecture for the Next Generation of Low Power Supercomputer PRACE Workshop, October 2010 Andreas Olofsson Company Introduction Company founded in 2008 with mission to produce programmable

More information

Yafit Snir Arindam Guha Cadence Design Systems, Inc. Accelerating System level Verification of SOC Designs with MIPI Interfaces

Yafit Snir Arindam Guha Cadence Design Systems, Inc. Accelerating System level Verification of SOC Designs with MIPI Interfaces Yafit Snir Arindam Guha, Inc. Accelerating System level Verification of SOC Designs with MIPI Interfaces Agenda Overview: MIPI Verification approaches and challenges Acceleration methodology overview and

More information

Adding C Programmability to Data Path Design

Adding C Programmability to Data Path Design Adding C Programmability to Data Path Design Gert Goossens Sr. Director R&D, Synopsys May 6, 2015 1 Smart Products Drive SoC Developments Feature-Rich Multi-Sensing Multi-Output Wirelessly Connected Always-On

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Multimedia in Mobile Phones. Architectures and Trends Lund

Multimedia in Mobile Phones. Architectures and Trends Lund Multimedia in Mobile Phones Architectures and Trends Lund 091124 Presentation Henrik Ohlsson Contact: henrik.h.ohlsson@stericsson.com Working with multimedia hardware (graphics and displays) at ST- Ericsson

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Mathematical computations with GPUs

Mathematical computations with GPUs Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing

More information

AT-501 Cortex-A5 System On Module Product Brief

AT-501 Cortex-A5 System On Module Product Brief AT-501 Cortex-A5 System On Module Product Brief 1. Scope The following document provides a brief description of the AT-501 System on Module (SOM) its features and ordering options. For more details please

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

How to build a Megacore microprocessor. by Andreas Olofsson (MULTIPROG WORKSHOP 2017)

How to build a Megacore microprocessor. by Andreas Olofsson (MULTIPROG WORKSHOP 2017) How to build a Megacore microprocessor by Andreas Olofsson (MULTIPROG WORKSHOP 2017) 1 Disclaimers 2 This presentation summarizes work done by Adapteva from 2008-2016. Statements and opinions are my own

More information

ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2

ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2 ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2 9.2 A 80/20MHz 160mW Multimedia Processor integrated with Embedded DRAM MPEG-4 Accelerator and 3D Rendering Engine for Mobile Applications

More information

2D/3D Graphics Accelerator for Mobile Multimedia Applications. Ramchan Woo, Sohn, Seong-Jun Song, Young-Don

2D/3D Graphics Accelerator for Mobile Multimedia Applications. Ramchan Woo, Sohn, Seong-Jun Song, Young-Don RAMP-IV: A Low-Power and High-Performance 2D/3D Graphics Accelerator for Mobile Multimedia Applications Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae,, and Hoi-Jun Yoo oratory Dept. of EECS,

More information

ID 730L: Getting Started with Multimedia Programming on Linux on SH7724

ID 730L: Getting Started with Multimedia Programming on Linux on SH7724 ID 730L: Getting Started with Multimedia Programming on Linux on SH7724 Global Edge Ian Carvalho Architect 14 October 2010 Version 1.0 Mr. Ian Carvalho System Architect, Global Edge Software Ltd. Responsible

More information

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

More information

NVIDIA s Compute Unified Device Architecture (CUDA)

NVIDIA s Compute Unified Device Architecture (CUDA) NVIDIA s Compute Unified Device Architecture (CUDA) Mike Bailey mjb@cs.oregonstate.edu Reaching the Promised Land NVIDIA GPUs CUDA Knights Corner Speed Intel CPUs General Programmability 1 History of GPU

More information

NVIDIA s Compute Unified Device Architecture (CUDA)

NVIDIA s Compute Unified Device Architecture (CUDA) NVIDIA s Compute Unified Device Architecture (CUDA) Mike Bailey mjb@cs.oregonstate.edu Reaching the Promised Land NVIDIA GPUs CUDA Knights Corner Speed Intel CPUs General Programmability History of GPU

More information

Elaborazione dati real-time su architetture embedded many-core e FPGA

Elaborazione dati real-time su architetture embedded many-core e FPGA Elaborazione dati real-time su architetture embedded many-core e FPGA DAVIDE ROSSI A L E S S A N D R O C A P O T O N D I G I U S E P P E T A G L I A V I N I A N D R E A M A R O N G I U C I R I - I C T

More information

TEGRA K1 AND THE AUTOMOTIVE INDUSTRY. Gernot Ziegler, Timo Stich

TEGRA K1 AND THE AUTOMOTIVE INDUSTRY. Gernot Ziegler, Timo Stich TEGRA K1 AND THE AUTOMOTIVE INDUSTRY Gernot Ziegler, Timo Stich Previously: Tegra in Automotive Infotainment / Navigation Digital Instrument Cluster Passenger Entertainment TEGRA K1 with Kepler GPU GPU:

More information

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications Ju-Ho Sohn, Jeong-Ho Woo, Min-Wuk Lee, Hye-Jung Kim, Ramchan Woo, Hoi-Jun Yoo Semiconductor System

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015 MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015 Video Codecs 70% of internet traffic will be video in 2018 [CISCO] Video

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

Massively Parallel Processor Breadboarding (MPPB)

Massively Parallel Processor Breadboarding (MPPB) Massively Parallel Processor Breadboarding (MPPB) 28 August 2012 Final Presentation TRP study 21986 Gerard Rauwerda CTO, Recore Systems Gerard.Rauwerda@RecoreSystems.com Recore Systems BV P.O. Box 77,

More information

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Interconnect Challenges in a Many Core Compute Environment Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Agenda Microprocessor general trends Implications Tradeoffs Summary

More information

Challenges of mixed-width vector code generation and static scheduling in LLVM (for VLIW Architectures)

Challenges of mixed-width vector code generation and static scheduling in LLVM (for VLIW Architectures) Challenges of mixed-width vector code generation and static scheduling in LLVM (for VLIW Architectures) *Erkan Diken, **Pierre-Andre Saulais, ***Martin J. O Riordan (*) Eindhoven University of Technology,

More information

Technology Trends Presentation For Power Symposium

Technology Trends Presentation For Power Symposium Technology Trends Presentation For Power Symposium 2006 8-23-06 Darryl Solie, Distinguished Engineer, Chief System Architect IBM Systems & Technology Group From Ingenuity to Impact Copyright IBM Corporation

More information

Low-Power Processor Solutions for Always-on Devices

Low-Power Processor Solutions for Always-on Devices Low-Power Processor Solutions for Always-on Devices Pieter van der Wolf MPSoC 2014 July 7 11, 2014 2014 Synopsys, Inc. All rights reserved. 1 Always-on Mobile Devices Mobile devices on the move Mobile

More information

NVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas

NVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

Selecting the right Tesla/GTX GPU from a Drunken Baker's Dozen

Selecting the right Tesla/GTX GPU from a Drunken Baker's Dozen Selecting the right Tesla/GTX GPU from a Drunken Baker's Dozen GPU Computing Applications Here's what Nvidia says its Tesla K20(X) card excels at doing - Seismic processing, CFD, CAE, Financial computing,

More information

Software Defined Modem A commercial platform for wireless handsets

Software Defined Modem A commercial platform for wireless handsets Software Defined Modem A commercial platform for wireless handsets Charles F Sturman VP Marketing June 22 nd ~ 24 th Brussels charles.stuman@cognovo.com www.cognovo.com Agenda SDM Separating hardware from

More information

Multicore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF.

Multicore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems Liang-Gee Chen Distinguished Professor General Director, SOC Center National Taiwan University DSP/IC Design Lab, GIEE, NTU 1

More information

Antonio R. Miele Marco D. Santambrogio

Antonio R. Miele Marco D. Santambrogio Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First

More information

Design and Technology Trends

Design and Technology Trends Lecture 1 Design and Technology Trends R. Saleh Dept. of ECE University of British Columbia res@ece.ubc.ca 1 Recently Designed Chips Itanium chip (Intel), 2B tx, 700mm 2, 8 layer 65nm CMOS (4 processors)

More information

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

IoT Project Proposals

IoT Project Proposals IoT Project Proposals 1 Submit before 31 st March Best 5 proposals will be given Intel Galileo Gen 2 microcontroller boards each 2 Advisory Board will evaluate and select the best project proposals Dr.

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Doug Burger Director, Hardware, Devices, & Experiences MSR NExT November 15, 2015 The Cloud is a Growing Disruptor for HPC Moore s

More information

Spring 2009 Prof. Hyesoon Kim

Spring 2009 Prof. Hyesoon Kim Spring 2009 Prof. Hyesoon Kim Application Geometry Rasterizer CPU Each stage cane be also pipelined The slowest of the pipeline stage determines the rendering speed. Frames per second (fps) Executes on

More information

Real-Time Rendering Architectures

Real-Time Rendering Architectures Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand

More information

3D systems-on-chip. A clever partitioning of circuits to improve area, cost, power and performance. The 3D technology landscape

3D systems-on-chip. A clever partitioning of circuits to improve area, cost, power and performance. The 3D technology landscape Edition April 2017 Semiconductor technology & processing 3D systems-on-chip A clever partitioning of circuits to improve area, cost, power and performance. In recent years, the technology of 3D integration

More information

3D Graphics in Future Mobile Devices. Steve Steele, ARM

3D Graphics in Future Mobile Devices. Steve Steele, ARM 3D Graphics in Future Mobile Devices Steve Steele, ARM Market Trends Mobile Computing Market Growth Volume in millions Mobile Computing Market Trends 1600 Smart Mobile Device Shipments (Smartphones and

More information

Embedded HW/SW Co-Development

Embedded HW/SW Co-Development Embedded HW/SW Co-Development It May be Driven by the Hardware Stupid! Frank Schirrmeister EDPS 2013 Monterey April 18th SPMI USB 2.0 SLIMbus RFFE LPDDR 2 LPDDR 3 emmc 4.5 UFS SD 3.0 SD 4.0 UFS Bare Metal

More information

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Computer Organization and Design, 5th Edition: The Hardware/Software Interface Computer Organization and Design, 5th Edition: The Hardware/Software Interface 1 Computer Abstractions and Technology 1.1 Introduction 1.2 Eight Great Ideas in Computer Architecture 1.3 Below Your Program

More information

To hear the audio, please be sure to dial in: ID#

To hear the audio, please be sure to dial in: ID# Introduction to the HPP-Heterogeneous Processing Platform A combination of Multi-core, GPUs, FPGAs and Many-core accelerators To hear the audio, please be sure to dial in: 1-866-440-4486 ID# 4503739 Yassine

More information

High Performance Memory Opportunities in 2.5D Network Flow Processors

High Performance Memory Opportunities in 2.5D Network Flow Processors High Performance Memory Opportunities in 2.5D Network Flow Processors Jay Seaton, VP Silicon Operations, Netronome Larry Zu, PhD, President, Sarcina Technology LLC August 6, 2013 2013 Netronome 1 Netronome

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim Application Geometry Rasterizer CPU Each stage cane be also pipelined The slowest of the pipeline stage determines the rendering speed. Frames per second (fps) Executes on

More information

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017 VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips 2017 1 TESLA V100 21B transistors 815 mm 2 80 SM 5120 CUDA Cores 640 Tensor Cores 16 GB HBM2 900 GB/s HBM2 300 GB/s NVLink *full GV100

More information

. SMARC 2.0 Compliant

. SMARC 2.0 Compliant MSC SM2S-IMX8 NXP i.mx8 ARM Cortex -A72/A53 Description The new MSC SM2S-IMX8 module offers a quantum leap in terms of computing and graphics performance. It integrates the currently most powerful i.mx8

More information

HotChips An innovative HD video and digital image processor for low-cost digital entertainment products. Deepu Talla.

HotChips An innovative HD video and digital image processor for low-cost digital entertainment products. Deepu Talla. HotChips 2007 An innovative HD video and digital image processor for low-cost digital entertainment products Deepu Talla Texas Instruments 1 Salient features of the SoC HD video encode and decode using

More information

SiFive Freedom SoCs: Industry s First Open-Source RISC-V Chips

SiFive Freedom SoCs: Industry s First Open-Source RISC-V Chips SiFive Freedom SoCs: Industry s First Open-Source RISC-V Chips Yunsup Lee Co-Founder and CTO High Upfront Cost Has Killed Innovation Our industry needs a fundamental change Total SoC Development Cost Design

More information

MYC-C7Z010/20 CPU Module

MYC-C7Z010/20 CPU Module MYC-C7Z010/20 CPU Module - 667MHz Xilinx XC7Z010/20 Dual-core ARM Cortex-A9 Processor with Xilinx 7-series FPGA logic - 1GB DDR3 SDRAM (2 x 512MB, 32-bit), 4GB emmc, 32MB QSPI Flash - On-board Gigabit

More information

MYC-C437X CPU Module

MYC-C437X CPU Module MYC-C437X CPU Module - Up to 1GHz TI AM437x Series ARM Cortex-A9 Processors - 512MB DDR3 SDRAM, 4GB emmc Flash, 32KB EEPROM - Gigabit Ethernet PHY - Power Management IC - Two 0.8mm pitch 100-pin Board-to-Board

More information

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors Brent Bohnenstiehl and Bevan Baas Department of Electrical and Computer Engineering University of California, Davis {bvbohnen,

More information

Mapping applications into MPSoC

Mapping applications into MPSoC Mapping applications into MPSoC concurrency & communication Jos van Eijndhoven jos@vectorfabrics.com March 12, 2011 MPSoC mapping: exploiting concurrency 2 March 12, 2012 Computation on general purpose

More information

Cannon Mountain Dr Longmont, CO LS6410 Hardware Design Perspective

Cannon Mountain Dr Longmont, CO LS6410 Hardware Design Perspective LS6410 Hardware Design Perspective 1. S3C6410 Introduction The S3C6410X is a 16/32-bit RISC microprocessor, which is designed to provide a cost-effective, lowpower capabilities, high performance Application

More information

Fast Stereoscopic Rendering on Mobile Ray Tracing GPU for Virtual Reality Applications

Fast Stereoscopic Rendering on Mobile Ray Tracing GPU for Virtual Reality Applications Fast Stereoscopic Rendering on Mobile Ray Tracing GPU for Virtual Reality Applications SAMSUNG Advanced Institute of Technology Won-Jong Lee, Seok Joong Hwang, Youngsam Shin, Jeong-Joon Yoo, Soojung Ryu

More information

SoC FPGAs. Your User-Customizable System on Chip Altera Corporation Public

SoC FPGAs. Your User-Customizable System on Chip Altera Corporation Public SoC FPGAs Your User-Customizable System on Chip Embedded Developers Needs Low High Increase system performance Reduce system power Reduce board size Reduce system cost 2 Providing the Best of Both Worlds

More information

Software Driven Verification at SoC Level. Perspec System Verifier Overview

Software Driven Verification at SoC Level. Perspec System Verifier Overview Software Driven Verification at SoC Level Perspec System Verifier Overview June 2015 IP to SoC hardware/software integration and verification flows Cadence methodology and focus Applications (Basic to

More information

Multi-Core SoCs for ADAS and Image Recognition Applications

Multi-Core SoCs for ADAS and Image Recognition Applications Multi-Core SoCs for ADAS and Image Recognition Applications Takashi Miyamori, Senior Manager Embedded Core Technology Development Department Center for Semiconductor Research & Development Storage Device

More information

Adhocracy Innovation with Imaging technology. Socionext Inc. Hiroyuki Komori July 5th, 2017

Adhocracy Innovation with Imaging technology. Socionext Inc. Hiroyuki Komori July 5th, 2017 Adhocracy Innovation with Imaging technology Socionext Inc. Hiroyuki Komori July 5th, 2017 Video Steaming in the world Video uploading : 65,000 clip/min, 300 hour/min on YouTube Video viewing : billions

More information

The Mont-Blanc approach towards Exascale

The Mont-Blanc approach towards Exascale http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits

More information

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email

More information