High Performance Compute Platform Based on multi-core DSP for Seismic Modeling and Imaging
1 High Performance Compute Platform Based on Multi-core DSP for Seismic Modeling and Imaging
Presenter: Murtaza Ali, Texas Instruments
Contributors: Murtaza Ali, Eric Stotzer, Xiaohui Li (Texas Instruments); William Symes, Jan Odegard (Rice University)
2 Outline
- Introduction to TI multi-core DSP
- Brief review of IWAVE-based seismic signal modeling
- Details and challenges of implementation
- Results and conclusions
3 A New Paradigm in High Performance Computing
- Industry-best floating-point performance: 16 Gflops/W
- Standard programming model: supports MPI and OpenMP
- Wide range of applications, from embedded systems to server blades
- Full ecosystem support
  - Off-the-shelf PCIe and ATCA cards
  - O/S and application software
- Supported by a full set of development tools and the Code Composer Studio IDE
4 Shannon (TMS320C6678) Block Diagram
- Multi-core KeyStone SoC
- Fixed/floating-point CorePac ([clock value missing] GHz)
  - 8 x C66x CorePac
  - 0.5 MB L2 per core, 4.0 MB shared L2
  - 320 GMAC, 160 GFLOP, 60 GDFLOPS
  - 10 W
- Navigator: hardware queue manager with DMA
- Multicore Shared Memory Controller: low-latency, high-bandwidth memory access
- Network Coprocessor
  - IPv4/IPv6 network interface solution
  - IPsec, SRTP, encryption fully offloaded
- HyperLink: 50 Gbaud expansion port, transparent to software
[Block diagram: 8 x C66x DSP CorePac, each with L1 and L2; memory subsystem with 64-bit DDR3 and the Multicore Shared Memory Controller (MSMC) with 4 MB shared memory; Multicore Navigator; TeraNet; HyperLink 50; network coprocessors (crypto, packet accelerator, GbE switch, SGMII); system elements (power management, debug, SysMon, EDMA); peripherals and I/O (SRIO x4, TSIP 2x, PCIe x2, I2C, SPI, EMIF 16, UART)]
5 C66x Core Architecture
- 8-issue VLIW architecture: can issue 8 instructions per cycle
- 2 data paths, 4 units per data path: L, S, D, M
- 64 registers (32-bit), 32 per data path
  - Can be arranged as dual (64-bit) or quad (128-bit) registers
- Cross connect available
- Single Instruction Multiple Data (SIMD) available: dual or quad multiplies
6 TI DSP SW Resources
- Multicore Software Development Kit
  - Peripheral drivers
  - Demos for quick start
- OpenMP: alpha version released, example code available
- Linear algebra library (BLAS, LAPACK)
  - Working with UT Austin to port libflame (LAPACK equivalent) to Shannon
- Optimized libraries: DSPLIB (math functions), ImageLib
- Medical Imaging SW Toolkit: ultrasound, optical coherence, 3D rendering
7 Shannon PCIe Development Cards
- 512 Gflops, 50 W: available now
- 1 Tflop, 120 W: available 1Q12
8 Seismic Modeling (focus of our current study)
- Typical iteration in the forward sweep (essential part of modeling): wave equation update, source addition, boundary condition
- Reverse time migration (RTM), typical iteration in the backward sweep (essential part of imaging): wave equation update, receiver addition, boundary condition; imaging after iterations complete
- IWAVE: a framework to enable efficient and scalable finite-difference simulation on regular grids
  - Includes seismic modeling and imaging
  - Implements different wave equation updates
  - Used for modeling and imaging
  - Open source from Rice University
9 Inside Wave Update
- Based on the velocity-stress PDE
- First-order hyperbolic system
- 10th-order finite difference method
[Diagram: data flow of one wave update. Pressure step: the differentials dvx/dx, dvy/dy, dvz/dz of the velocities vx, vy, vz are combined linearly and used to update px, py, pz (blocks labeled epx/mpx, epy/mpy, epz/mpz). Velocity step: the differentials dpx/dx, dpy/dy, dpz/dz are used to update vx, vy, vz (blocks labeled evx/mvx, evy/mvy, evz/mvz). Additional labels: lax, lay, laz.]
10 Kernel Implementations
- Identified four kernels to optimize to the core instruction architecture
  - Differential in x-direction (first dimension)
  - Differential in y- or z-direction (orthogonal dimension)
  - Update in x-direction
  - Update in y- or z-direction
- Optimization trade-offs at kernel level
  - Compute resources
  - Memory access (load/store): load/store friendly
  - Cache friendly (first dimension)
Compiler software-pipelining feedback for one kernel:
  ;*  .L units               0     0
  ;*  .S units               0     0
  ;*  .D units               8*    8*
  ;*  .M units               5     7
  ;*  .X cross paths         3     2
  ;*  .T address paths       8*    8*
  ;*  ...
  ;*  Searching for software pipeline schedule at ...
  ;*     ii = 8  Schedule found with 4 iterations in parallel
11 Kernel Results
- Kernels take between 1 and 3 cycles per cell
- Summing up the kernel numbers shows a capability of over 200 M cells/sec on an 8-core DSP running at 1 GHz
- Initial benchmarks were carried out with all data kept in DDR3 memory
- OpenMP used to parallelize across cores; assignment is based on the z direction
- Need a better data movement strategy over DDR3
- Analyze performance bottlenecks
[Diagram: OpenMP threads running on each core, Core #0 through Core #7]
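The z-direction assignment and the kernel budget can be sketched in a few lines of Python. This is a hypothetical illustration: `z_slabs` mimics the contiguous split an OpenMP static schedule would typically produce, but the slides do not state which schedule was used, and `cycles_per_cell` only restates the slide's 200 M cells/sec figure as a per-cell cycle budget.

```python
def z_slabs(nz, n_cores=8):
    """Split nz planes into contiguous z-slabs, one per core.

    Slab sizes differ by at most one plane; each slab is a (start, stop) pair.
    """
    base, extra = divmod(nz, n_cores)
    slabs = []
    start = 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs


def cycles_per_cell(n_cores, clock_hz, cells_per_sec):
    """Aggregate core cycles available for each updated cell."""
    return n_cores * clock_hz / cells_per_sec


if __name__ == "__main__":
    for core, (lo, hi) in enumerate(z_slabs(100)):
        print(f"Core #{core}: z planes {lo}..{hi - 1}")
    # 8 cores at 1 GHz delivering 200 M cells/sec leaves 40 core cycles per cell.
    print(cycles_per_cell(8, 1e9, 200e6))
```

Under these numbers, the full set of differential and update kernels for all fields has to fit within roughly 40 core cycles per cell for the 200 M cells/sec figure to hold.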
12 Data Movement Strategy
- C66 architecture allows 3-D data movement using DMA
  - Allows defining strides in two directions
  - Some limitations exist on the sizes of strides, which limits the shape
  - May limit sub-domain definition; a tall sub-domain will be most useful
- DMAs can be linked
  - Multiple data transfers can be initiated
  - Transfers continue without core intervention
- Compute can be overlapped with data movement
  - Needs double buffering
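The double-buffering idea can be sketched in Python with a single worker thread standing in for the DMA engine. This is a conceptual sketch under stated assumptions, not DSP code: `fetch` and `process_double_buffered` are hypothetical names, and on the device the transfer would be a linked DMA into on-chip memory rather than a list copy.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(source, start, size):
    """Stand-in for a DMA transfer: copy one block out of 'slow' memory."""
    return list(source[start:start + size])


def process_double_buffered(source, block_size, compute):
    """Run compute() on block b while block b + 1 is fetched in the background."""
    results = []
    n_blocks = (len(source) + block_size - 1) // block_size
    if n_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, source, 0, block_size)
        for b in range(n_blocks):
            current = pending.result()  # wait until block b has arrived
            if b + 1 < n_blocks:
                # Start the next transfer before computing on the current block.
                pending = dma.submit(fetch, source, (b + 1) * block_size, block_size)
            results.extend(compute(current))
    return results


if __name__ == "__main__":
    data = list(range(10))
    print(process_double_buffered(data, 4, lambda block: [2 * x for x in block]))
```

Only two blocks are alive at any time, the one being computed and the one in flight, which is what bounds the on-chip buffer requirement.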
13 3-D Differential Calculation Strategy
- Kernel operates on 4 lines simultaneously
- Operate on a 4 x 4 x nx data set as the core computation strategy
  - Determine x-differentials on the set of 16 lines
  - Add y-differentials on a horizontal plane of 4 x nx, four times
  - Add z-differentials on a vertical plane of 4 x nx, four times
[Diagram: total data set needed, with the x-differential, y-differential, and z-differential regions marked]
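The three-pass accumulation over a 4 x 4 x nx tile can be sketched as follows. To stay short, this hypothetical example uses a 2-point central difference (`R = 1`) instead of the 10th-order stencil, indexes fields as `field[z][y][x]`, and treats "horizontal plane" as fixed z and "vertical plane" as fixed y; all of these are assumptions made for illustration.

```python
R = 1  # stencil half-width; the 10th-order scheme on the slides needs a wider halo


def central(ahead, behind, h):
    """2-point central difference."""
    return (ahead - behind) / (2.0 * h)


def tile_divergence(vx, vy, vz, z0, y0, h=1.0, tile=4):
    """Accumulate dvx/dx + dvy/dy + dvz/dz on a tile x tile x nx block.

    The caller must leave R halo lines around the tile in y and z.
    """
    nx = len(vx[0][0])
    div = [[[0.0] * nx for _ in range(tile)] for _ in range(tile)]
    # Pass 1: x-differentials on all tile * tile lines (16 lines for tile = 4).
    for dz in range(tile):
        for dy in range(tile):
            line = vx[z0 + dz][y0 + dy]
            out = div[dz][dy]
            for x in range(R, nx - R):
                out[x] = central(line[x + R], line[x - R], h)
    # Pass 2: add y-differentials, one horizontal plane (fixed z) at a time.
    for dz in range(tile):
        plane = vy[z0 + dz]
        for dy in range(tile):
            y = y0 + dy
            out = div[dz][dy]
            for x in range(R, nx - R):
                out[x] += central(plane[y + R][x], plane[y - R][x], h)
    # Pass 3: add z-differentials, one vertical plane (fixed y) at a time.
    for dy in range(tile):
        y = y0 + dy
        for dz in range(tile):
            z = z0 + dz
            out = div[dz][dy]
            for x in range(R, nx - R):
                out[x] += central(vz[z + R][y][x], vz[z - R][y][x], h)
    return div


if __name__ == "__main__":
    n, nx = 6, 8
    vx = [[[2.0 * x for x in range(nx)] for _ in range(n)] for _ in range(n)]
    vy = [[[3.0 * y for _ in range(nx)] for y in range(n)] for _ in range(n)]
    vz = [[[5.0 * z for _ in range(nx)] for _ in range(n)] for z in range(n)]
    print(tile_divergence(vx, vy, vz, z0=1, y0=1)[0][0])
```

The x pass reads each line contiguously, while the y and z passes reuse lines that are already resident for the tile; only the halo lines around the tile are extra traffic.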
14 Example of Data Movement
[Diagram: memory hierarchy from CPU through L1 (16K SRAM / 16K cache), L2 (384K SRAM / 128K cache), and MSMC SRAM (shared by all cores) to DDR]
15 Results
- After implementing DMA data movement, performance went from 45 to 59 M cells/sec on a single 8-core C6678 multi-core DSP
- Performance is limited by data transfers over DDR3
  - Performance only went up to 63 M cells/sec when all computes are disabled
  - Theoretical DDR3 bandwidth-limited performance is 120 M cells/sec with 1330 MHz DDR3
  - Currently we are operating at about 50% of DDR3 bandwidth
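The figures on this slide can be cross-checked with a short calculation. The sketch below assumes that "1330 MHz" is the DDR3 transfer rate and that the interface is the 64-bit DDR3 port shown on the block diagram slide; the bytes-per-cell value is derived from those assumptions and is not stated in the slides.

```python
def ddr3_peak_bytes_per_sec(transfers_per_sec, bus_bits=64):
    """Peak DDR3 bandwidth for a given transfer rate and bus width."""
    return transfers_per_sec * bus_bits // 8


before_dma = 45e6         # cells/sec before DMA data movement (from the slide)
measured = 59e6           # cells/sec with DMA data movement (from the slide)
no_compute = 63e6         # cells/sec with all computes disabled (from the slide)
bandwidth_bound = 120e6   # cells/sec theoretical DDR3 limit (from the slide)

dma_speedup = measured / before_dma          # gain from DMA data movement
utilization = measured / bandwidth_bound     # fraction of the DDR3-limited rate
transfer_ceiling = measured / no_compute     # fraction of the transfer-only rate

peak = ddr3_peak_bytes_per_sec(1_330_000_000)    # assumed 64-bit bus
implied_bytes_per_cell = peak / bandwidth_bound  # derived, not from the slides

if __name__ == "__main__":
    print(f"DMA speedup: {dma_speedup:.2f}x")
    print(f"Fraction of DDR3-limited rate: {utilization:.0%}")
    print(f"Fraction of transfer-only rate: {transfer_ceiling:.0%}")
    print(f"Implied DDR3 traffic: {implied_bytes_per_cell:.1f} bytes per cell")
```

The measured rate is about 94% of the transfer-only rate, so almost all of the compute is already hidden behind data movement, and the remaining headroom is in DDR3 utilization.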
16 Future Activity
- Continued performance analysis
  - Current measurements done with a DDR3 clock rate of 1330 MHz
  - Device capable of handling 1600 MHz -> 20% improvement
  - Further optimize parameters for maximum data transfer utilization
- Extend analysis to a PCI board with multiple DSPs
  - MPI-based message passing
  - Side region data exchange
- Integrate with the IWAVE framework
  - Framework can run on the host, with the main computes handled by DSP board(s)
- Add more complicated wave equation updates
  - Elastic modeling