1TOPS/W Software Programmable Media Processor. David Moloney, CTO, Movidius 19 August 2011
2 Movidius Background
- Started in 2005 looking at mobile gaming acceleration
- Decided on a multicore design to allow software derivatives and meet the OPS/W/$ target
- Existing processors were a poor cost/performance match for the target workloads
- Developed the SHAVE vector processor with HW support for sparse data-structures (matrix-vector)
- Expanded the ISA to support a C compiler
- Talked to mobile phone customers in 2007; it turned out their real problem was video, so it was back to the drawing board
- Initial 65nm silicon and all IP were developed on founder and angel funding, which allowed us to close the A-round in October 2008
- 65nm Myriad MM SoC in mass production
- Next-generation 28nm SoC in H1/2012 with 10x Perf/W
3 Mobile Video Processing Workload 20/Apr/2011
4 Movidius SHAVE Processor
- Streaming Hybrid Architecture Vector Engine
- Hybrid of RISC, DSP, VLIW & GPU architectural features
- 128-bit vector arithmetic: 8/16/32-bit INT & fp16/fp32
- Unique proprietary architecture
- Tailored to streaming workloads and architected for outstanding OPS/mW/$ performance
- Excellent graphics and matrix mathematics support
- HW texture unit for good graphics performance
- Predicated execution to eliminate branches
- Compiler-friendly architecture
- HW support for compressed data-structures (e.g. matrices)
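As a small worked example of the 128-bit vector width quoted above, the sketch below computes how many elements of each listed data type fit in one vector register. The names `VECTOR_BITS`, `ELEMENT_BITS` and `lanes_per_vector` are illustrative and not taken from the slides; whether every functional unit processes all lanes in a single cycle is not stated in the source and is not assumed here.

```python
# Width of one SHAVE vector register, as quoted on the slide.
VECTOR_BITS = 128

# Element widths for the data types listed on the slide.
ELEMENT_BITS = {"int8": 8, "int16": 16, "int32": 32, "fp16": 16, "fp32": 32}


def lanes_per_vector(dtype):
    """Number of elements of the given type that fit in one 128-bit vector."""
    return VECTOR_BITS // ELEMENT_BITS[dtype]


# Hypothetical helper table: data type -> lanes per vector.
lane_table = {dtype: lanes_per_vector(dtype) for dtype in ELEMENT_BITS}

for dtype, lanes in lane_table.items():
    print(f"{dtype}: {lanes} lanes per 128-bit vector")
```

Narrower types pack more elements per register, which is why 8-bit throughput figures are higher than fp32 figures on the same hardware.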
5 SHAVE Instruction-Set
- RISC-style: instruction predication; extensive integer ISA; excellent C-compiler support
- DSP-style: zero-overhead looping; modulo addressing; transparent DMA modes; FFT, Viterbi, and other DSP operation support; parallel comparisons
- VLIW-style: parallel functional units controlled by VLIW instructions; 8/16/32-bit x 1-4 SIMD INT
- GPU-style: streaming operations; floating-point operations (fp16/fp32, IEEE-compliant); Texture-Management Unit and L1 cache
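To illustrate what modulo addressing buys a DSP-style workload, here is a minimal software sketch of a circular delay line feeding an FIR filter. It is a generic illustration of the technique, not SHAVE code: the class `CircularBuffer` and the function `fir` are hypothetical names, and on hardware with modulo addressing the wrap-around computed here with `%` would be done by the address generator at no instruction cost.

```python
class CircularBuffer:
    """Fixed-size delay line whose index wraps around (modulo addressing)."""

    def __init__(self, size):
        self.data = [0.0] * size
        self.head = 0  # next slot to overwrite

    def push(self, sample):
        self.data[self.head] = sample
        # Wrap the write pointer instead of shifting the whole buffer.
        self.head = (self.head + 1) % len(self.data)

    def tap(self, k):
        """Return the sample pushed k steps ago (k = 0 is the newest)."""
        return self.data[(self.head - 1 - k) % len(self.data)]


def fir(samples, coeffs):
    """Direct-form FIR filter using the circular delay line."""
    buf = CircularBuffer(len(coeffs))
    out = []
    for x in samples:
        buf.push(x)
        out.append(sum(c * buf.tap(k) for k, c in enumerate(coeffs)))
    return out


# Two-tap moving average over a short ramp.
result = fir([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
print(result)
```

The buffer never moves data; only the index wraps, which is the property that hardware modulo addressing provides for free.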
6 SHAVE ISA Richness
7 65nm Myriad SoC: SHAVE processor (block diagram). Functional units driven by a variable-length instruction: PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, TMU, DMA. Register files: IRF 32x32, SRF 32x32, VRF 32x128. Other labels: 180MHz clock; 128kB SRAM tile per SHAVE; SHAVE bus; TMU with 1kB L1 cache; IDC; DCU; decoded instructions; 128-bit AXI; 128kB 2-way L2; DDR2 controller; 16/64MB SDRAM die. Bandwidth labels on the diagram: 2.9GB/Sec, 12.2GB/Sec, 8.6GB/Sec, 17.3GB/Sec, 5.8GB/Sec, 5.8GB/Sec, 1.5GB/Sec.
8 Myriad Silicon Platform (block diagram). SW-controlled I/O multiplexing over the peripherals: MEBI, SEBI, SDIO, SPI, Flash, LCD, Cam, USB2 OTG, I2C, I2S, JTAG, UART, TS, GPS. Eight SHAVE blocks, each with a 128kB memory and a TMU L1; RISC, TIM and NAL blocks; L2 cache; main bus; bridge; DDR interface to a stacked 16/64MB SDRAM die. 50GFLOPS/W (IEEE 754 SP). Movidius IP.
9 Myriad GOPS/Watt (Total) (chart). GOPS/W (total) and GOPS/W (arith) by data type (int8, int16, int32, fp16, fp32); total counts PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, TMU and DMA, arithmetic counts IAU, SAU and VAU. Labelled value: 1004 GOPS/W.
10 Myriad GOPS/Watt (Arithmetic) (chart). GOPS/W (arith) by data type (int8, int16, int32, fp16, fp32), shown against the full unit list (PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, TMU, DMA) with the arithmetic units IAU, SAU and VAU. Labelled value: 181.
11 Myriad 65nm CMOS LP Die (die photo and comparison table). Die photo labels: SHAVE processors, RISC sub-system, analog, Myriad die, 16MB stacked SDRAM die. Table columns: Author, Year, FLOPS/core, Cores, GFLOPS, W, GFLOPS/W. Row labels as extracted: Myriad Movidius, (1 KAIST, (2 Intel, (4 Adapteva. Table values not recoverable.
12 Technology - Platform Approach (diagram). Labels: 3D Capture, Video Edit, Applications, Software Modules, Silicon Platform, Products, 3D Video, Anaglyph-3D, Foundation Technology.
13 Myriad Example Applications (figure). Several example applications, each mapped across SHAVE 1 to SHAVE 8; the per-application labels are not recoverable.
14 MoviSim ISS Architecture (block diagram). Labels: Runtime, OpenCL, SABRE Debugger, MoviSim, SHAVE ISS (multiple instances), LEON/ARM ISS, Heterogeneous Core ISS, DRAM, Simulation Engine, XML Architecture Description, XML Parser, Task Allocator, Memory Allocator, Messaging, Instrumentation, Model API, Thread 0, Thread 1, Thread n-1, Thread m-1, Thread m.
15 Fragrak 28nm Platform (block diagram). SW-controlled I/O multiplexing over the peripherals: MEBI, SEBI, SDIO, SPI, Flash, MIPI DSI (LCD), MIPI CSI, USB2 OTG, I2C, I2S, JTAG, UART, TS, GPS. SHAVE processors with 128kB/256kB memories grouped on ICBs and joined by an XCB; RISC, TIM and NAL blocks; 512kB L2; main bus; bridge; LP DDR3 interface to a stacked 256/512MB SDRAM die. The GFLOPS/W (IEEE 754 SP) value is not recoverable. Movidius IP.
16 Any questions? The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/ ) under grant agreement n (PEPPHER Project,
17 Abstract The rationale and architecture behind a new software-programmable multimedia coprocessor for mobile devices are outlined. The focus of the architecture is on power-efficient operation, allowing functions which are traditionally implemented in fixed-function hardware to be implemented competitively in software. For instance, the sustained single-precision IEEE 754 rate is 50GFLOPS/W, allowing existing applications to be ported with great ease. The device supports 8, 16, 32 and some 64-bit integer operations as well as fp16 (OpenEXR) and fp32 arithmetic, and is capable of an aggregate maximum of 1 TOPS/W in 8-bit equivalent operations, in a low-cost plastic BGA package with integrated 16 or 64MB SDRAM. New architectural features, such as support for random-accessible sparse data-structures, are implemented for the first time, improving memory utilization and bandwidth efficiency. Power efficiency is paramount, and the device contains a total of 11 power islands, 8 of which are dedicated one each to the integrated SHAVE processors, allowing very fine-grained power control. Comparisons to previous work based on 65nm silicon, together with applications, are shown to illustrate the power of the device.
18 Myriad GOPS/Watt (Total/Arithmetic) (chart). GOPS/W (total) and GOPS/W (arith) by data type (int8, int16, int32, fp16, fp32) and by unit (PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, TMU, DMA, RISC). Legible values: 91, 45, 99, 263; their assignment to units is not recoverable.
19 SHAVE Processor 28nm (Fragrak) (block diagram). Functional units driven by a variable-length instruction: PEU, BRU, LSU0, LSU1, VAU, IAU, SAU, CMU, TMU, DMA. Register files: IRF 32x32, SRF 32x32, VRF 32x128. Other labels: 800MHz clock; 256kB SRAM per SHAVE; TMU with L1 cache; IDC; DCU; decoded instructions; Intra-Cluster Bus (ICB); Xtra-Cluster Bus (XCB); 128-bit AXI bus; 512kB 2-way L2 cache; LP DDR3 controller; SDRAM die. Bandwidth labels as extracted: 16x800MHz = 25.6GB/Sec; 8x800MHz = 12.8GB/Sec; 16x800MHz 25.6GB/Sec; 16x800MHz 12.8GB/Sec; 4x17x800MHz 54.4GB/Sec; 4x12x800MHz 38.4GB/Sec; 16x12x800MHz 76.8GB/Sec; 16x800MHz 25.6GB/Sec; 8x800MHz 12.8GB/Sec.
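The bandwidth labels on this slide have the form "factors x clock". As a rough sketch, assuming the factors are per-cycle byte and port counts that simply multiply with the clock rate, the helper below reproduces the two labels that check out under that reading (4x17x800MHz and 4x12x800MHz). The function name `bandwidth_gb_per_s` is illustrative. The other labels may carry factors lost in extraction (a double-data-rate multiplier, for example), so they are not modelled here.

```python
def bandwidth_gb_per_s(clock_mhz, *factors):
    """Bandwidth in GB/s, assuming each factor is a per-cycle byte or port count."""
    per_cycle = 1
    for factor in factors:
        per_cycle *= factor
    # bytes/cycle * Mcycles/s = MB/s; divide by 1000 for GB/s.
    return per_cycle * clock_mhz / 1000


# The two slide labels that are consistent with this reading.
bw_4x17 = bandwidth_gb_per_s(800, 4, 17)
bw_4x12 = bandwidth_gb_per_s(800, 4, 12)

print(f"4x17x800MHz -> {bw_4x17} GB/s")
print(f"4x12x800MHz -> {bw_4x12} GB/s")
```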
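20 BW Hierarchy
- Top level (SRAM, L1 cache, registers V/S/IRF, ICB): Myriad 65nm 547GB/Sec; Fragrak 28nm 4864GB/Sec
- Ratio of top level to next level: Myriad 190:1; Fragrak 42:1
- Middle level (XCB, L2 cache): Myriad 2.88GB/Sec; Fragrak 115GB/Sec
- Ratio of middle level to SDRAM: Myriad 2:1; Fragrak 18:1
- SDRAM: Myriad 1.44GB/Sec; Fragrak 6.4GB/Sec
- Bottom line: very high sustainable performance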
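The ratios on this slide can be recomputed from the bandwidth figures next to them. The sketch below does so, assuming the three figures per device are the top, middle and SDRAM levels in that order; the dictionary `HIERARCHY_GB_PER_S` and the function `level_ratios` are illustrative names.

```python
# Bandwidth per hierarchy level in GB/s, top to bottom, as read from the slide.
HIERARCHY_GB_PER_S = {
    "Myriad 65nm": [547.0, 2.88, 1.44],
    "Fragrak 28nm": [4864.0, 115.0, 6.4],
}


def level_ratios(levels):
    """Ratio of each level's bandwidth to the level below it, rounded to an integer."""
    return [round(upper / lower) for upper, lower in zip(levels, levels[1:])]


ratios = {name: level_ratios(levels) for name, levels in HIERARCHY_GB_PER_S.items()}

for name, r in ratios.items():
    print(name, " -> ".join(f"{x}:1" for x in r))
```

The rounded results match the 190:1 and 2:1 figures for Myriad and the 42:1 and 18:1 figures for Fragrak.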
21 BW Hierarchy (Detail) (tables; values not recoverable). Myriad 65nm columns: VRF, SRF, IRF, LSU, IDC, L1, ISB, L2, SDRAM. Fragrak 28nm columns: VRF, SRF, IRF, LSU, IDC, L1, ICB, L2, XCB, SDRAM. Rows in both tables: Clk, Bytes, Ports, BW, #SHAVES, Total BW.
22 LSU HW Sparse-Data Support (block diagram). An IRF bitmap field (bm7 bm6 bm5 bm4 bm3 bm2 bm1 bm) is shown alongside fields f[7] to f[0]. Signal labels: f[7:0], addr[1:0], fen[2:0], word_cnt, addr_gen, pw_config, instr_f[7:0], IRF[base_addr], RAM_addr[31:0], RAM_wr, RAM_rd, bru_hold.
23 Sparse Data-Structure Example (diagram). A bitmap describes which entries of a structure are stored as data in 64-bit RAM words, addressed from a base address. The example structure contains a scaling matrix with entries sx, sy, sz and 1.0 (remaining entries 0.0), an x element vector (x0 to x3), a y element vector (y0 to y3), a z element vector (z0 to z3), and a pointer to the next structure together with that structure's bitmap (bmp).
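The idea of a bitmap-described sparse structure that remains randomly accessible can be modelled in a few lines of software. The sketch below is a generic illustration under stated assumptions, not a description of the LSU hardware: the 16-bit bitmap, the bit ordering (bit i describes element i), and the names `compress` and `read_element` are all hypothetical. It stores only the non-zero entries of a 4x4 scaling matrix and finds any element by counting the set bits below its position.

```python
def compress(values):
    """Pack the non-zero entries and build a bitmap with bit i set if values[i] != 0."""
    bitmap = 0
    packed = []
    for i, value in enumerate(values):
        if value != 0.0:
            bitmap |= 1 << i
            packed.append(value)
    return bitmap, packed


def read_element(bitmap, packed, i):
    """Random access: return element i without decompressing the whole structure."""
    if not (bitmap >> i) & 1:
        return 0.0  # element was not stored, so it is an implicit zero
    # Index into the packed data = number of stored elements before position i.
    stored_before = bin(bitmap & ((1 << i) - 1)).count("1")
    return packed[stored_before]


# 4x4 scaling matrix in row-major order; sx, sy, sz values are example numbers.
sx, sy, sz = 2.0, 3.0, 4.0
scaling = [
    sx, 0.0, 0.0, 0.0,
    0.0, sy, 0.0, 0.0,
    0.0, 0.0, sz, 0.0,
    0.0, 0.0, 0.0, 1.0,
]

bitmap, packed = compress(scaling)
print(f"bitmap = {bitmap:#06x}, stored values = {packed}")
print("element 10 =", read_element(bitmap, packed, 10))
```

Only 4 of the 16 matrix entries are stored in this example, which is the memory and bandwidth saving the abstract attributes to sparse data-structure support.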
24 References
1) H-E. Kim, J-S. Yoon, K-D. Hwang, Y-J. Kim, J-S. Park, L-S. Kim, "A 275mW Heterogeneous Multimedia Processor for IC-Stacking on Si-Interposer", Proc. ISSCC
2) S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote and N. Borkar, "An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS", Proc. ISSCC 2007, pp.5-7
3) A. Olofsson, R. Trogan, O. Raikhman, "A 25 GFLOPS/Watt Software Programmable Floating Point Accelerator", HPEC 2010, Sep
4) C.Y. Park, N.I. Cho, "A fast algorithm for the conversion of DCT coefficients to H.264 transform coefficients", ICIP 2005 Proceedings, pp
Green Multicore. David Moloney, CTO, Movidius 24 November 2011
Green Multicore David Moloney, CTO, Movidius 24 November 2011 Overview Fabless semiconductor company founded in 2005 VC backed (completing C-round today @ 12:00) Focus on computational imaging and video
More informationA Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013
A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company
More informationThe World Leader in High Performance Signal Processing Solutions. DSP Processors
The World Leader in High Performance Signal Processing Solutions DSP Processors NDA required until November 11, 2008 Analog Devices Processors Broad Choice of DSPs Blackfin Media Enabled, 16/32- bit fixed
More informationMultimedia in Mobile Phones. Architectures and Trends Lund
Multimedia in Mobile Phones Architectures and Trends Lund 091124 Presentation Henrik Ohlsson Contact: henrik.h.ohlsson@stericsson.com Working with multimedia hardware (graphics and displays) at ST- Ericsson
More informationHow to build a Megacore microprocessor. by Andreas Olofsson (MULTIPROG WORKSHOP 2017)
How to build a Megacore microprocessor by Andreas Olofsson (MULTIPROG WORKSHOP 2017) 1 Disclaimers 2 This presentation summarizes work done by Adapteva from 2008-2016. Statements and opinions are my own
More informationAdministrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.
Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs
More informationStorage I/O Summary. Lecture 16: Multimedia and DSP Architectures
Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal
More informationVector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks
Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Christos Kozyrakis Stanford University David Patterson U.C. Berkeley http://csl.stanford.edu/~christos Motivation Ideal processor
More information2D/3D Graphics Accelerator for Mobile Multimedia Applications. Ramchan Woo, Sohn, Seong-Jun Song, Young-Don
RAMP-IV: A Low-Power and High-Performance 2D/3D Graphics Accelerator for Mobile Multimedia Applications Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae,, and Hoi-Jun Yoo oratory Dept. of EECS,
More informationComputer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email
More informationTile Processor (TILEPro64)
Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth
More informationA 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications
A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications Ju-Ho Sohn, Jeong-Ho Woo, Min-Wuk Lee, Hye-Jung Kim, Ramchan Woo, Hoi-Jun Yoo Semiconductor System
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationA framework for optimizing OpenVX Applications on Embedded Many Core Accelerators
A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators Giuseppe Tagliavini, DEI University of Bologna Germain Haugou, IIS ETHZ Andrea Marongiu, DEI University of Bologna & IIS
More informationAdvance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts
Computer Architectures Advance CPU Design Tien-Fu Chen National Chung Cheng Univ. Adv CPU-0 MMX technology! Basic concepts " small native data types " compute-intensive operations " a lot of inherent parallelism
More informationAn Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki
An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in
More informationCannon Mountain Dr Longmont, CO LS6410 Hardware Design Perspective
LS6410 Hardware Design Perspective 1. S3C6410 Introduction The S3C6410X is a 16/32-bit RISC microprocessor, which is designed to provide a cost-effective, lowpower capabilities, high performance Application
More informationGeneral Purpose Signal Processors
General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:
More informationHandout 3. HSAIL and A SIMT GPU Simulator
Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants
More informationBuilding supercomputers from embedded technologies
http://www.montblanc-project.eu Building supercomputers from embedded technologies Alex Ramirez Barcelona Supercomputing Center Technical Coordinator This project and the research leading to these results
More informationUMBC. Rubini and Corbet, Linux Device Drivers, 2nd Edition, O Reilly. Systems Design and Programming
Systems Design and Programming Instructor: Professor Jim Plusquellic Text: Barry B. Brey, The Intel Microprocessors, 8086/8088, 80186/80188, 80286, 80386, 80486, Pentium and Pentium Pro Processor Architecture,
More informationCombining Arm & RISC-V in Heterogeneous Designs
Combining Arm & RISC-V in Heterogeneous Designs Gajinder Panesar, CTO, UltraSoC gajinder.panesar@ultrasoc.com RISC-V Summit 3 5 December 2018 Santa Clara, USA Problem statement Deterministic multi-core
More informationAn Alternative to GPU Acceleration For Mobile Platforms
Inventing the Future of Computing An Alternative to GPU Acceleration For Mobile Platforms Andreas Olofsson andreas@adapteva.com 50 th DAC June 5th, Austin, TX Adapteva Achieves 3 World Firsts 1. First
More informationSoC Platforms and CPU Cores
SoC Platforms and CPU Cores COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University
More informationAn Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection
An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationECE 471 Embedded Systems Lecture 2
ECE 471 Embedded Systems Lecture 2 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 3 September 2015 Announcements HW#1 will be posted today, due next Thursday. I will send out
More informationThe University of Texas at Austin
EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin
More informationParallella: A $99 Open Hardware Parallel Computing Platform
Inventing the Future of Computing Parallella: A $99 Open Hardware Parallel Computing Platform Andreas Olofsson andreas@adapteva.com IPDPS May 22th, Cambridge, MA Adapteva Achieves 3 World Firsts 1. First
More informationCOMP 635: Seminar on Heterogeneous Processors. Lecture 7: ClearSpeed CSX600 Processor.
COMP 635: Seminar on Heterogeneous Processors Lecture 7: ClearSpeed CSX600 Processor www.cs.rice.edu/~vsarkar/comp635 Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu October
More informationThere s STILL plenty of room at the bottom! Andreas Olofsson
There s STILL plenty of room at the bottom! Andreas Olofsson 1 Richard Feynman s Lecture (1959) There's Plenty of Room at the Bottom An Invitation to Enter a New Field of Physics Why cannot we write the
More informationHotChips An innovative HD video and digital image processor for low-cost digital entertainment products. Deepu Talla.
HotChips 2007 An innovative HD video and digital image processor for low-cost digital entertainment products Deepu Talla Texas Instruments 1 Salient features of the SoC HD video encode and decode using
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Course Outline & Marks Distribution Hardware Before mid Memory After mid Linux
More informationDevelopment of Low Power and High Performance Application Processor (T6G) for Multimedia Mobile Applications
Session 8D-2 Development of Low Power and High Performance Application Processor (T6G) for Multimedia Mobile Applications Yoshiyuki Kitasho, Yu Kikuchi, Takayoshi Shimazawa, Yasuo Ohara, Masafumi Takahashi,
More informationXbox 360 Architecture. Lennard Streat Samuel Echefu
Xbox 360 Architecture Lennard Streat Samuel Echefu Overview Introduction Hardware Overview CPU Architecture GPU Architecture Comparison Against Competing Technologies Implications of Technology Introduction
More informationCopyright 2016 Xilinx
Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building
More informationARM Processors for Embedded Applications
ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or
More informationThis Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital
More informationProduct Technical Brief S3C2416 May 2008
Product Technical Brief S3C2416 May 2008 Overview SAMSUNG's S3C2416 is a 32/16-bit RISC cost-effective, low power, high performance micro-processor solution for general applications including the GPS Navigation
More informationKeyStone C665x Multicore SoC
KeyStone Multicore SoC Architecture KeyStone C6655/57: Device Features C66x C6655: One C66x DSP Core at 1.0 or 1.25 GHz C6657: Two C66x DSP Cores at 0.85, 1.0, or 1.25 GHz Fixed and Floating Point Operations
More informationNear Memory Computing Spectral and Sparse Accelerators
Near Memory Computing Spectral and Sparse Accelerators Franz Franchetti ECE, Carnegie Mellon University www.ece.cmu.edu/~franzf Co-Founder, SpiralGen www.spiralgen.com The work was sponsored by Defense
More informationOriginal PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy
Competitors using generic parts Performance benefits to be had for custom design Original PlayStation: no vector processing or floating point support Geometry issues Photorealism at the core of design
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationKiloCore: A 32 nm 1000-Processor Array
KiloCore: A 32 nm 1000-Processor Array Brent Bohnenstiehl, Aaron Stillmaker, Jon Pimentel, Timothy Andreas, Bin Liu, Anh Tran, Emmanuel Adeagbo, Bevan Baas University of California, Davis VLSI Computation
More informationIMAGINE: Signal and Image Processing Using Streams
IMAGINE: Signal and Image Processing Using Streams Brucek Khailany William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles Concurrent VLSI Architecture
More informationThe Challenges of System Design. Raising Performance and Reducing Power Consumption
The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software
More informationECE 471 Embedded Systems Lecture 3
ECE 471 Embedded Systems Lecture 3 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 10 September 2018 Announcements New classroom: Stevens 365 HW#1 was posted, due Friday Reminder:
More informationThe Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006
The Use Of Virtual Platforms In MP-SoC Design Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006 1 MPSoC Is MP SoC design happening? Why? Consumer Electronics Complexity Cost of ASIC Increased SW Content
More informationUltra Low Power GPUs for Wearables
Ultra Low Power GPUs for Wearables Georgios Keramidas January 2015 The Company Who we are? Think Silicon is a privately held company founded in 2007. What we do? Development of low power GPU IP semiconductor
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 22 Title: and Extended
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationIntroduction to Embedded System Processor Architectures
Introduction to Embedded System Processor Architectures Contents crafted by Professor Jari Nurmi Tampere University of Technology Department of Computer Systems Motivation Why Processor Design? Embedded
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationA 400Gbps Multi-Core Network Processor
A 400Gbps Multi-Core Network Processor James Markevitch, Srinivasa Malladi Cisco Systems August 22, 2017 Legal THE INFORMATION HEREIN IS PROVIDED ON AN AS IS BASIS, WITHOUT ANY WARRANTIES OR REPRESENTATIONS,
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
More informationBlink: 3D Display Multiplexing for Virtualized Applications
: 3D Display Multiplexing for Virtualized Applications January 20, 2006 : 3D Display Multiplexing for Virtualized Applications Motivation Sprites and Tiles Lessons Learned GL in, GL out Communication Protocol
More informationSoftware Defined Modem A commercial platform for wireless handsets
Software Defined Modem A commercial platform for wireless handsets Charles F Sturman VP Marketing June 22 nd ~ 24 th Brussels charles.stuman@cognovo.com www.cognovo.com Agenda SDM Separating hardware from
More informationThis Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits
More informationTechniques for Optimizing Performance and Energy Consumption: Results of a Case Study on an ARM9 Platform
Techniques for Optimizing Performance and Energy Consumption: Results of a Case Study on an ARM9 Platform BL Standard IC s, PL Microcontrollers October 2007 Outline LPC3180 Description What makes this
More informationThis Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital
More informationJim Keller. Digital Equipment Corp. Hudson MA
Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership
More informationECE 571 Advanced Microprocessor-Based Design Lecture 20
ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The
More informationSA-1500: A 300 MHz RISC CPU with Attached Media Processor*
and Bridges Division SA-1500: A 300 MHz RISC CPU with Attached Media Processor* Prashant P. Gandhi, Ph.D. and Bridges Division Computing Enhancement Group Intel Corporation Santa Clara, CA 95052 Prashant.Gandhi@intel.com
More informationECE 471 Embedded Systems Lecture 2
ECE 471 Embedded Systems Lecture 2 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 4 September 2014 Announcements HW#1 will be posted tomorrow (Friday), due next Thursday Working
More informationA Scalable Processor Architecture for the Next Generation of Low Power Supercomputer. PRACE Workshop, October 2010 Andreas Olofsson
A Scalable Processor Architecture for the Next Generation of Low Power Supercomputer PRACE Workshop, October 2010 Andreas Olofsson Company Introduction Company founded in 2008 with mission to produce programmable
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More informationDigital Signal Processor Core Technology
The World Leader in High Performance Signal Processing Solutions Digital Signal Processor Core Technology Abhijit Giri Satya Simha November 4th 2009 Outline Introduction to SHARC DSP ADSP21469 ADSP2146x
More informationAdding C Programmability to Data Path Design
Adding C Programmability to Data Path Design Gert Goossens Sr. Director R&D, Synopsys May 6, 2015 1 Smart Products Drive SoC Developments Feature-Rich Multi-Sensing Multi-Output Wirelessly Connected Always-On
More informationVector Engine Processor of SX-Aurora TSUBASA
Vector Engine Processor of SX-Aurora TSUBASA Shintaro Momose, Ph.D., NEC Deutschland GmbH 9 th October, 2018 WSSP 1 NEC Corporation 2018 Contents 1) Introduction 2) VE Processor Architecture 3) Performance
More informationUnit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture
More informationTHE NVIDIA DEEP LEARNING ACCELERATOR
THE NVIDIA DEEP LEARNING ACCELERATOR INTRODUCTION NVDLA NVIDIA Deep Learning Accelerator Developed as part of Xavier NVIDIA s SOC for autonomous driving applications Optimized for Convolutional Neural
More informationIt's not about the core, it s about the system
It's not about the core, it s about the system Gajinder Panesar, CTO, UltraSoC gajinder.panesar@ultrasoc.com RISC-V Workshop 18 19 July 2018 Chennai, India Overview Architecture overview Example Scenarios
More informationMarkets Demanding More Performance
The Tile Processor Architecture: Embedded Multicore for Networking and Digital Multimedia Tilera Corporation August 20 th 2007 Hotchips 2007 Markets Demanding More Performance Networking market - Demand
More informationDesign and Optimization of Geometry Acceleration for Portable 3D Graphics
M.S. Thesis Design and Optimization of Geometry Acceleration for Portable 3D Graphics Ju-ho Sohn 2002.12.20 oratory Department of Electrical Engineering and Computer Science Korea Advanced Institute of
More informationAge nda. Intel PXA27x Processor Family: An Applications Processor for Phone and PDA applications
Intel PXA27x Processor Family: An Applications Processor for Phone and PDA applications N.C. Paver PhD Architect Intel Corporation Hot Chips 16 August 2004 Age nda Overview of the Intel PXA27X processor
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
Parallel Programming on Larrabee. Tim Foley Intel Corp
Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This
Media Instructions, Coprocessors, and Hardware Accelerators. Overview
Media Instructions, Coprocessors, and Hardware Accelerators Steven P. Smith SoC Design EE382V Fall 2009 EE382 System-on-Chip Design Coprocessors, etc. SPS-1 University of Texas at Austin Overview SoCs
EE382V: System-on-a-Chip (SoC) Design
EE382V: System-on-a-Chip (SoC) Design Lecture 10 Task Partitioning Sources: Prof. Margarida Jacome, UT Austin Prof. Lothar Thiele, ETH Zürich Andreas Gerstlauer Electrical and Computer Engineering University
THE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research
THE PATH TO EXASCALE COMPUTING Bill Dally Chief Scientist and Senior Vice President of Research The Goal: Sustained ExaFLOPs on problems of interest 2 Exascale Challenges Energy efficiency Programmability
By: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 Erik Lindholm, John Nickolls,
CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Recall
CSCI 402: Computer Architectures Memory Hierarchy (2) Fengguang Song Department of Computer & Information Science IUPUI Recall What is memory hierarchy? Where each level is located? Each level's speed,
Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
Tesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
Design Space Exploration for Memory Subsystems of VLIW Architectures
E University of Paderborn Dr.-Ing. Mario Porrmann Design Space Exploration for Memory Subsystems of VLIW Architectures Thorsten Jungeblut 1, Gregor Sievers, Mario Porrmann 1, Ulrich Rückert 2 1 System
"On the Capability and Achievable Performance of FPGAs for HPC Applications"
"On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies
Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp
Interconnect Challenges in a Many Core Compute Environment Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Agenda Microprocessor general trends Implications Tradeoffs Summary
Configurable Processors for SOC Design. Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc.
Configurable Processors for SOC Design Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc. Why Listen to This Presentation? Understand how SOC design techniques, now nearly 20 years old, are
Product Technical Brief S3C2413 Rev 2.2, Apr. 2006
Product Technical Brief S3C2413 Rev 2.2, Apr. 2006 Overview SAMSUNG's S3C2413 is a derivative product of S3C2410A. S3C2413 is designed to provide hand-held devices and general applications with cost-effective, low-power, and
Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP
Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Presenter: Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Outline Overview of M.I.T. Raw processor
CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today's Contents GPU Cluster and its network topology The Roofline performance
Challenges of mixed-width vector code generation and static scheduling in LLVM (for VLIW Architectures)
Challenges of mixed-width vector code generation and static scheduling in LLVM (for VLIW Architectures) *Erkan Diken, **Pierre-Andre Saulais, ***Martin J. O'Riordan (*) Eindhoven University of Technology,
Multicore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF.
Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems Liang-Gee Chen Distinguished Professor General Director, SOC Center National Taiwan University DSP/IC Design Lab, GIEE, NTU 1
Lecture 14: Memory Hierarchy. James C. Hoe Department of ECE Carnegie Mellon University
18-447 Lecture 14: Memory Hierarchy James C. Hoe Department of ECE Carnegie Mellon University 18-447-S18-L14-S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping understand memory system