High Performance Compute Platform Based on multi-core DSP for Seismic Modeling and Imaging

Presenter: Murtaza Ali, Texas Instruments
Contributors: Murtaza Ali, Eric Stotzer, Xiaohui Li (Texas Instruments); William Symes, Jan Odegard (Rice University)

Outline
- Introduction to TI multi-core DSP
- Brief review of IWAVE-based seismic signal modeling
- Details and challenges of implementation
- Results and conclusions

A New Paradigm in High Performance Computing
- Industry-best floating-point performance: 16 Gflops/W
- Standard programming model: supports MPI and OpenMP
- Wide range of applications, from embedded systems to server blades
- Full ecosystem support
  - Off-the-shelf PCIe and ATCA cards
  - O/S and application software
- Supported by a full set of development tools and the Code Composer Studio IDE

Shannon (TMS320C6678) Block Diagram
- Multi-core KeyStone SoC
- Fixed/floating-point CorePac
  - 8 CorePacs @ 1.25 GHz
  - 0.5 MB L2 per core, 4.0 MB shared L2
  - 320 G MAC, 160 G FLOP, 60 G DFLOPS
  - 10 W
- Navigator: hardware queue manager with DMA
- Multicore Shared Memory Controller: low-latency, high-bandwidth memory access
- Network Coprocessor
  - IPv4/IPv6 network interface solution
  - IPSec, SRTP, encryption fully offloaded
- HyperLink: 50 GBaud expansion port, transparent to software
[Block diagram: 8 x CorePac (C66x DSP, each with L1 and L2) connected through TeraNet. Memory subsystem: 64-bit DDR3, Multicore Shared Memory Controller (MSMC), 4 MB shared memory. Multicore Navigator. HyperLink 50. System elements: power management, debug, SysMon, EDMA. Network coprocessors: crypto, packet accelerator, GbE switch, SGMII. IP interfaces, peripherals and I/O: SRIO x4, TSIP x2, PCIe x2, I2C, SPI, EMIF16, UART.]
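The chip-level peak figures on this slide can be divided down to per-core, per-cycle rates as a quick consistency check. The sketch below uses only numbers quoted in the deck; the helper name per_core_per_cycle is made up for illustration.

```python
# Figures quoted on the block-diagram slide.
CORES = 8
CLOCK_GHZ = 1.25
PEAK_GMACS = 320.0    # fixed-point multiply-accumulates, giga per second
PEAK_GFLOPS = 160.0   # single-precision floating point
PEAK_GDFLOPS = 60.0   # double-precision floating point
POWER_W = 10.0


def per_core_per_cycle(chip_giga_ops, cores=CORES, clock_ghz=CLOCK_GHZ):
    """Divide a chip-level giga-ops/sec figure down to one core and one cycle."""
    return chip_giga_ops / (cores * clock_ghz)


macs_per_cycle = per_core_per_cycle(PEAK_GMACS)      # 32 MACs per core per cycle
flops_per_cycle = per_core_per_cycle(PEAK_GFLOPS)    # 16 flops per core per cycle
dflops_per_cycle = per_core_per_cycle(PEAK_GDFLOPS)  # 6 double-precision flops
gflops_per_watt = PEAK_GFLOPS / POWER_W              # 16 Gflops/W

print(macs_per_cycle, flops_per_cycle, dflops_per_cycle, gflops_per_watt)
```

The last value reproduces the 16 Gflops/W efficiency figure quoted earlier in the deck.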

C66x Core Architecture
- 8-issue VLIW architecture
  - Can issue 8 instructions per cycle
  - 2 data paths
  - 4 units per data path: L, S, D, M
- 64 registers (32-bit)
  - 32 per data path
  - Can be arranged as dual (64-bit) or quad (128-bit) registers
  - Cross connect available
- Single Instruction Multiple Data (SIMD) available
  - Dual or quad multiplies

TI DSP SW Resources
- Multicore Software Development Kit
  - Peripheral drivers
  - Demos for quick start
- OpenMP: alpha version released, example code available
- Linear algebra library (BLAS, LAPACK)
  - Working with UT Austin to port libflame (LAPACK equivalent) to Shannon
- Optimized libraries: DSPLIB (math functions), ImageLib
- Medical Imaging SW Toolkit: ultrasound, optical coherence, 3D rendering

Shannon PCIe Development Cards
- 512 Gflops, 50 W: available now
- 1 Teraflop, 120 W: available 1Q12

Seismic Modeling
- Seismic modeling (focus of our current study)
  - Typical iteration in the forward sweep (the essential part of modeling): wave equation update, source addition, boundary condition
- Reverse time migration (RTM)
  - Typical iteration in the backward sweep (the essential part of imaging): wave equation update, receiver addition, boundary condition
  - Imaging after iterations complete
- IWAVE: a framework to enable efficient and scalable finite-difference simulation on regular grids
  - Includes seismic modeling and imaging
  - Implements different wave equation updates
  - Used for modeling and imaging
  - Open source from Rice University

Inside Wave Update
- Based on the velocity-stress PDE
- First-order hyperbolic system
- 10th-order finite difference method
[Diagram: the velocity derivatives dv_x/dx, dv_y/dy, dv_z/dz feed a linear combination (labels lax, lay, laz) that updates the pressures p_x, p_y, p_z, with arrays epx/mpx, epy/mpy, epz/mpz. The pressure derivatives dp_x/dx, dp_y/dy, dp_z/dz update the velocities v_x, v_y, v_z, with arrays evx/mvx, evy/mvy, evz/mvz.]
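To make the 10th-order finite difference method named on this slide concrete, here is a minimal sketch of a first-derivative stencil on a staggered grid. Placing the derivative halfway between samples is an assumption (the usual choice for velocity-stress schemes, but the slide does not show the grid layout), and the function name staggered_dx is hypothetical. The weights are the standard 10th-order staggered-grid coefficients, not values taken from IWAVE.

```python
from fractions import Fraction

# Standard 10th-order staggered-grid weights for a first derivative.
# Weight m applies to the pair of samples (m + 1/2) grid steps either
# side of the evaluation point.
COEFFS = [
    Fraction(19845, 16384),
    Fraction(-735, 8192),
    Fraction(567, 40960),
    Fraction(-405, 229376),
    Fraction(35, 294912),
]


def staggered_dx(f, i, h):
    """Approximate df/dx halfway between samples f[i] and f[i + 1].

    Needs f[i - 4] .. f[i + 5], i.e. ten samples per output value.
    """
    total = 0
    for m, c in enumerate(COEFFS):
        total += c * (f[i + 1 + m] - f[i - m])
    return total / h


# Example: f(x) = x**3 sampled at x = 0, 1, ..., 11 with spacing h = 1.
samples = [Fraction(k) ** 3 for k in range(12)]
print(staggered_dx(samples, 5, 1))  # derivative at x = 5.5
```

Each output point reads ten inputs along the differencing direction, which is why the memory access pattern of the derivative kernels matters as much as their arithmetic.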

Kernel Implementations
- Identified four kernels to optimize to the core instruction architecture
  - Differential in x-direction (first dimension)
  - Differential in y- or z-direction (orthogonal dimension)
  - Update in x-direction
  - Update in y- or z-direction
- Optimization trade-off at kernel level: compute resource vs. memory access (load/store)
[Figure labels: load/store friendly; cache friendly (first dimension).]
Compiler software-pipelining report for one kernel:
  ;*      .L units                  0        0
  ;*      .S units                  0        0
  ;*      .D units                  8*       8*
  ;*      .M units                  5        7
  ;*      .X cross paths            3        2
  ;*      .T address paths          8*       8*
  ;*      ..
  ;*
  ;*      Searching for software pipeline schedule at ...
  ;*         ii = 8  Schedule found with 4 iterations in parallel

Kernel Results
- Kernels take between 1 and 3 cycles per cell
- Summing the kernel numbers shows a capability of over 200 M cells/sec on an 8-core DSP running at 1 GHz
- Initial benchmarks carried out with all data kept in DDR3 memory
  - OpenMP used to parallelize across cores
  - Assignment is based on the z direction
- Need a better data movement strategy over DDR3
  - Analyze performance bottlenecks
[Figure: OpenMP threads running on each core, Core #0 through Core #7.]
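The z-direction assignment and the throughput figure can be sketched as follows. This is an illustration only: the deck does not show the actual OpenMP code, and the contiguous-slab split below simply mimics what a static schedule over the z loop would produce. The names z_slabs and cells_per_second are hypothetical.

```python
def z_slabs(nz, ncores=8):
    """Split nz grid planes into contiguous [start, stop) ranges, one per core."""
    base, extra = divmod(nz, ncores)
    slabs = []
    start = 0
    for core in range(ncores):
        # The first `extra` cores take one additional plane each.
        stop = start + base + (1 if core < extra else 0)
        slabs.append((start, stop))
        start = stop
    return slabs


def cells_per_second(cores, clock_hz, cycles_per_cell):
    """Aggregate throughput if every core spends cycles_per_cell per grid cell."""
    return cores * clock_hz / cycles_per_cell


# 200 M cells/sec on 8 cores at 1 GHz corresponds to a budget of
# 40 core-cycles per cell, summed over all kernels applied to that cell.
cycle_budget = 8 * 1e9 / 200e6

print(z_slabs(100))
print(cycle_budget, cells_per_second(8, 1e9, cycle_budget))
```

With 100 planes, the first four cores receive 13 planes and the remaining four receive 12, so the load imbalance is at most one plane.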

Data Movement Strategy
- C66 architecture allows 3-D data movement using DMA
  - Allows defining strides in two directions
  - Some limitations exist on stride sizes, which limit the shape
    - May limit sub-domain definition
    - A tall sub-domain will be most useful
- DMAs can be linked
  - Multiple data transfers can be initiated
  - Transfers continue without core intervention
- Compute can be overlapped with data movement
  - Needs double buffering
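The two ideas on this slide, a 3-D transfer described by two strides and double buffering, can be modeled in a few lines. This is a software model under stated assumptions, not the DSP's DMA API: the parameter names (a_count, b_count, c_count, b_stride, c_stride) and both function names are hypothetical, and the fetch that would run asynchronously on hardware is executed sequentially here.

```python
def dma_copy_3d(src, src_offset, a_count, b_count, c_count, b_stride, c_stride):
    """Gather c_count planes of b_count lines of a_count contiguous elements.

    b_stride is the distance between line starts, c_stride the distance
    between plane starts, both measured in elements of the flat source.
    """
    dst = []
    for c in range(c_count):
        for b in range(b_count):
            start = src_offset + c * c_stride + b * b_stride
            dst.extend(src[start:start + a_count])
    return dst


def run_double_buffered(tiles, fetch, compute):
    """Ping-pong pattern: tile k + 1 is requested before tile k is computed."""
    results = []
    if not tiles:
        return results
    buffers = [fetch(tiles[0]), None]  # prime the first buffer
    for k in range(len(tiles)):
        if k + 1 < len(tiles):
            # On hardware this would be a DMA started without waiting for it.
            buffers[(k + 1) % 2] = fetch(tiles[k + 1])
        results.append(compute(buffers[k % 2]))
    return results


# A flat 8 x 4 x 3 grid stored x-fastest: index = x + nx * (y + ny * z).
nx, ny, nz = 8, 4, 3
grid = list(range(nx * ny * nz))

# Sub-domain x in [2, 6), y in [1, 3), z in [0, 2).
block = dma_copy_3d(grid, src_offset=2 + nx * 1, a_count=4, b_count=2,
                    c_count=2, b_stride=nx, c_stride=nx * ny)
print(block)
```

Only two buffers are ever live at once, which is what allows the working set to stay in on-chip memory while the full grid remains in DDR3.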

3-D Differential Calculation Strategy
- Kernel operates on 4 lines simultaneously
- Operate on a 4 x 4 x nx data set as the core computation strategy
  - Determine x-differentials on the set of 16 lines
  - Add y-differentials on a horizontal plane of 4 x nx, four times
  - Add z-differentials on a vertical plane of 4 x nx, four times
[Figure labels: total data set needed; x-differential; y-differential; z-differential.]

Example of Data Movement
[Figure: memory hierarchy. CPU; L1 (16K SRAM / 16K cache); L2 (384K SRAM / 128K cache); MSMC SRAM (shared by all cores); DDR.]

Results
- After implementing DMA data movement, performance went from 45 to 59 M cells/sec on a single 8-core C6678 multi-core DSP
- Performance is limited by data transfers over DDR3
  - Performance only went up to 63 M cells/sec when all computation is disabled
  - Theoretical DDR3-bandwidth-limited performance is 120 M cells/sec @ 1330 MHz DDR3
  - Currently we are operating at about 50% of DDR3 bandwidth
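The arithmetic behind these figures can be reproduced as a short worked calculation. Two points are assumptions rather than statements from the deck: that "1330 MHz" is the transfer rate of the 64-bit DDR3 interface shown in the block diagram, and the resulting bytes-per-cell figure, which is derived from the 120 M cells/sec bound rather than quoted by the authors.

```python
DDR3_TRANSFERS_PER_S = 1330e6  # assumption: 1330 MHz read as transfers per second
BUS_BYTES = 8                  # 64-bit DDR3 interface
BOUND_CELLS_PER_S = 120e6      # bandwidth-limited bound from the slide

peak_bytes_per_s = DDR3_TRANSFERS_PER_S * BUS_BYTES  # about 10.6 GB/s

# Memory traffic per cell implied by the bound (derived, not quoted).
bytes_per_cell = peak_bytes_per_s / BOUND_CELLS_PER_S

utilization_with_compute = 59e6 / BOUND_CELLS_PER_S  # measured, compute enabled
utilization_dma_only = 63e6 / BOUND_CELLS_PER_S      # measured, compute disabled
speedup_from_dma = 59 / 45                           # gain from DMA data movement

# Headroom from the 1600 MHz rate mentioned under future activity.
clock_gain = 1600 / 1330

print(round(bytes_per_cell, 1), round(utilization_with_compute, 2),
      round(utilization_dma_only, 3), round(speedup_from_dma, 2),
      round(clock_gain, 2))
```

Because the DMA-only run reaches just 63 M cells/sec, the gap to the 120 M cells/sec bound lies in the transfer pattern itself rather than in the compute kernels.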

Future Activity
- Continued performance analysis
  - Current measurements done with a DDR3 clock rate of 1330 MHz
  - Device is capable of handling 1600 MHz -> 20% improvement
  - Optimize parameters further for maximum data transfer utilization
- Extend analysis to a PCI board with multiple DSPs
  - MPI-based message passing
  - Side region data exchange
- Integrate with the IWAVE framework
  - Framework can run on the host, with the main computation handled by DSP board(s)
- Add more complicated wave equation updates
  - Elastic modeling