High Performance Compute Platform Based on multi-core DSP for Seismic Modeling and Imaging

Presenter: Murtaza Ali, Texas Instruments
Contributors: Murtaza Ali, Eric Stotzer, Xiaohui Li (Texas Instruments); William Symes, Jan Odegard (Rice University)

Outline
- Introduction to TI multi-core DSP
- Brief review of IWAVE-based seismic signal modeling
- Details and challenges of implementation
- Results and conclusions

A New Paradigm in High Performance Computing
- Industry-best floating point performance: 16 Gflops/W
- Standard programming model: supports MPI and OpenMP
- Wide range of applications, from embedded systems to server blades
- Full ecosystem support
  - Off-the-shelf PCIe and ATCA cards
  - O/S and application software
- Supported by a full set of development tools and the Code Composer Studio IDE

Shannon (TMS320C6678) Block Diagram
- Multi-core KeyStone SoC
- Fixed/floating-point CorePac
  - 8 CorePacs @ 1.25 GHz
  - 0.5 MB L2 per core, 4.0 MB shared L2
  - 320G MAC, 160G FLOP, 60G DFLOPS
  - 10 W
- Navigator
  - Hardware queue manager with DMA
- Multicore Shared Memory Controller (MSMC)
  - Low latency, high bandwidth memory access
- Network Coprocessor
  - IPv4/IPv6 network interface solution
  - IPSec, SRTP, encryption fully offloaded
- HyperLink
  - 50 Gbaud expansion port
  - Transparent to software
[Block diagram: 8 x C66x DSP CorePac (each with L1 and L2), Multicore Navigator, TeraNet, memory subsystem (MSMC, 4 MB shared memory, 64-bit DDR3), HyperLink 50, network coprocessors (crypto, packet accelerator, GbE switch, SGMII), system elements (power management, debug, SysMon, EDMA), and peripherals and I/O (SRIO x4, PCIe x2, TSIP x2, I2C, SPI, UART, EMIF16)]
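The headline figures on this slide can be cross-checked against each other with a few lines of arithmetic. The sketch below uses only numbers quoted on the slides (8 cores, 1.25 GHz, 160G FLOP, 320G MAC, 10 W); the per-core, per-cycle rates are derived values, not figures stated in the presentation.

```python
# Figures quoted on the slide.
CORES = 8
CLOCK_HZ = 1.25e9
PEAK_FLOPS = 160e9      # floating-point operations per second
PEAK_MACS = 320e9       # multiply-accumulates per second
POWER_W = 10.0


def per_core_per_cycle(peak_rate, cores=CORES, clock_hz=CLOCK_HZ):
    """Operations one core must complete each cycle to reach peak_rate."""
    return peak_rate / (cores * clock_hz)


# Derived values (not stated on the slide).
flops_per_core_cycle = per_core_per_cycle(PEAK_FLOPS)
macs_per_core_cycle = per_core_per_cycle(PEAK_MACS)
gflops_per_watt = PEAK_FLOPS / 1e9 / POWER_W

print(flops_per_core_cycle, macs_per_core_cycle, gflops_per_watt)
```

The derived efficiency matches the 16 Gflops/W quoted earlier in the deck.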

C66x Core Architecture
- 8-issue VLIW architecture
  - Can issue 8 instructions per cycle
  - 2 data paths
  - 4 units per data path: L, S, D, M
- 64 registers (32-bit)
  - 32 per data path
  - Can be arranged in dual (64-bit) or quad (128-bit) registers
- Cross connect available
- Single Instruction Multiple Data (SIMD) available
  - Dual or quad multiplies

TI DSP SW Resources
- Multicore Software Development Kit
  - Peripheral drivers
  - Demos for quick start
- OpenMP
  - Alpha version released, example code available
- Linear Algebra Library (BLAS, LAPACK)
  - Working with UT Austin to port libflame (LAPACK equivalent) to Shannon
- Optimized libraries
  - DSPLIB (math functions), ImageLib
- Medical Imaging SW Toolkit
  - Ultrasound, Optical Coherence, 3D Rendering

Shannon PCIe Development Cards
- 512 Gflops, 50 W: available now
- 1 Tera-flop, 120 W: available 1Q12

Seismic Modeling
- Focus of our current study
- Typical iteration in forward sweep (essential part of modeling)
  - Wave equation update
  - Source addition
  - Boundary condition
- Reverse time migration (RTM): typical iteration in backward sweep (essential part of imaging)
  - Wave equation update
  - Receiver addition
  - Boundary condition
  - Imaging after iterations complete
- IWAVE: a framework to enable efficient and scalable finite difference simulation on regular grids
  - Includes seismic modeling and imaging
  - Implements different wave equation updates
  - Used for modeling and imaging
  - Open source from Rice University
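The three steps of a forward-sweep iteration (wave equation update, source addition, boundary condition) can be sketched as a time loop. This is a minimal illustration only: it assumes a 1-D staggered grid, a 2nd-order stencil, constant material parameters and a zero-pressure boundary, none of which come from the slides, and it does not use the IWAVE API. The function and parameter names are hypothetical.

```python
def forward_sweep(n_cells, n_steps, src_index, source,
                  dt=0.5, dx=1.0, kappa=1.0, buoyancy=1.0):
    """Run n_steps forward-sweep iterations and return the pressure field."""
    p = [0.0] * n_cells            # pressure at cell centres
    v = [0.0] * (n_cells - 1)      # velocity at cell faces (staggered)
    for step in range(n_steps):
        # 1. Wave equation update: velocity from pressure gradient,
        #    then pressure from velocity divergence.
        for i in range(n_cells - 1):
            v[i] -= dt * buoyancy * (p[i + 1] - p[i]) / dx
        for i in range(1, n_cells - 1):
            p[i] -= dt * kappa * (v[i] - v[i - 1]) / dx
        # 2. Source addition at a single grid point.
        p[src_index] += dt * source(step * dt)
        # 3. Boundary condition: zero pressure at both ends (assumption).
        p[0] = 0.0
        p[-1] = 0.0
    return p


# Usage: a constant source in the middle of a 101-cell line.
pressure = forward_sweep(101, 2, 50, lambda t: 1.0)
```

A backward sweep for RTM would follow the same loop shape with receiver data injected in step 2 and the imaging step applied once the iterations complete.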

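Inside Wave Update
- Based on velocity-stress PDE
- First-order hyperbolic system
- 10th-order finite difference method
[Diagram, pressure update: dv_x/dx, dv_y/dy and dv_z/dz are computed from v_x, v_y and v_z along x, y and z, then linearly combined to update p_x (with epx, mpx), p_y (with epy, mpy) and p_z (with epz, mpz)]
[Diagram, velocity update: dp_x/dx, dp_y/dy and dp_z/dz are computed from p_x, p_y and p_z along x, y and z to update v_x (with evx, mvx), v_y (with evy, mvy) and v_z (with evz, mvz); the labels lax, lay and laz also appear in the diagram]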
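To make "10th-order finite difference" concrete, the sketch below derives the stencil weights for a first derivative of a given order by matching Taylor-series terms. It assumes a staggered-grid stencil (derivative evaluated midway between samples); the slides do not state the stencil layout or the coefficient values used in the kernels, so treat this as one plausible construction. All names are hypothetical.

```python
from fractions import Fraction


def staggered_fd_coefficients(half_width):
    """Weights c_1..c_M (M = half_width) such that
    f'(x) ~ (1/h) * sum_k c_k * (f(x + (k - 1/2) h) - f(x - (k - 1/2) h))
    is accurate to order 2M. They solve sum_k c_k (2k-1)^(2j-1) = [j == 1]."""
    m = half_width
    rows = [[Fraction(2 * k - 1) ** (2 * j - 1) for k in range(1, m + 1)]
            + [Fraction(1 if j == 1 else 0)]
            for j in range(1, m + 1)]
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(m):
        pivot = next(r for r in range(col, m) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [x / pivot_value for x in rows[col]]
        for r in range(m):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[k][m] for k in range(m)]


def staggered_derivative(f, x, h, coeffs):
    """Apply the stencil to a callable f at point x with grid spacing h."""
    total = 0
    for k, c in enumerate(coeffs, start=1):
        offset = Fraction(2 * k - 1, 2) * h
        total += c * (f(x + offset) - f(x - offset))
    return total / h


# Usage: a 10th-order stencil uses 5 weights on each side of the evaluation point.
tenth_order = staggered_fd_coefficients(5)
```

With five weights per side, each derivative reads ten samples per output cell, which is why the load/store units matter so much in the kernels that follow.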

Kernel Implementations
- Identified four kernels to optimize to the core instruction architecture
  - Differential in x-direction (first dimension)
  - Differential in y- or z-direction (orthogonal dimension)
  - Update in x-direction
  - Update in y- or z-direction
- Optimization trade-offs at kernel level
  - Compute resources
  - Memory access (load/store)
  - Load/store friendly
  - Cache friendly (first dimension)
- Compiler software pipeline information:
    ;*      .L units                 0        0
    ;*      .S units                 0        0
    ;*      .D units                 8*       8*
    ;*      .M units                 5        7
    ;*      .X cross paths           3        2
    ;*      .T address paths         8*       8*
    ;*      ..
    ;*
    ;*      Searching for software pipeline schedule at...
    ;*         ii = 8  Schedule found with 4 iterations in parallel

Kernel Results
- Kernels take between 1 and 3 cycles per cell
- Summing the kernel numbers shows a capability of over 200 M cells/sec on an 8-core DSP running at 1 GHz
- Initial benchmarks carried out with all data kept in DDR3 memory
  - OpenMP used to parallelize across cores
  - Assignment is based on z-direction
- Need a better data movement strategy over DDR3
  - Analyze performance bottlenecks
[Figure: OpenMP threads running on each core, Core #0 to Core #7]
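The z-direction assignment across cores can be pictured as splitting the z-planes into one contiguous slab per core. The sketch below shows one way to do that split, similar in spirit to a static loop schedule over the outermost z loop. The exact schedule used in the benchmarks is not given on the slide, so the balancing rule here is an assumption and the function name is hypothetical.

```python
def static_z_partition(nz, n_cores=8):
    """Split nz planes into n_cores contiguous [start, stop) slabs.

    Slab sizes differ by at most one plane; the leftover planes go to the
    lowest-numbered cores (an assumption, not taken from the slides)."""
    base, extra = divmod(nz, n_cores)
    slabs = []
    start = 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs


# Usage: 100 z-planes over the 8 cores of one device.
for core, (z_begin, z_end) in enumerate(static_z_partition(100)):
    print(f"Core #{core}: z in [{z_begin}, {z_end})")
```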

Data Movement Strategy
- C66x architecture allows 3-D data movement using DMA
  - Allows defining strides in two directions
  - Some limitations exist on stride sizes, limiting the shape
    - May limit sub-domain definition
    - A tall sub-domain will be most useful
- DMAs can be linked
  - Multiple data transfers can be initiated
  - Continued without core intervention
- Compute can be overlapped with data movement
  - Needs double buffering
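A 3-D transfer with strides in two directions amounts to copying a contiguous run of elements, repeated over two nested strided loops. The sketch below models that addressing pattern on a flat list standing in for DDR memory. The parameter names are hypothetical and only loosely follow the count-and-stride style of DMA descriptors; this is not the TI EDMA API.

```python
def strided_3d_gather(memory, base, a_count, b_count, b_stride, c_count, c_stride):
    """Gather c_count x b_count runs of a_count contiguous elements.

    Run (c, b) starts at base + c * c_stride + b * b_stride. The result is
    packed contiguously, as a sub-domain would be in on-chip memory."""
    out = []
    for c in range(c_count):
        for b in range(b_count):
            start = base + c * c_stride + b * b_stride
            out.extend(memory[start:start + a_count])
    return out


# Usage: a 6 x 5 x 4 volume stored x-fastest; each cell holds its own address.
nx, ny, nz = 6, 5, 4
volume = list(range(nx * ny * nz))
# Sub-domain x in [1, 4), y in [2, 4), z in [1, 3).
base = 1 + 2 * nx + 1 * nx * ny
sub_domain = strided_3d_gather(volume, base, a_count=3,
                               b_count=2, b_stride=nx,
                               c_count=2, c_stride=nx * ny)
```

In this picture a limit on stride size translates into a limit on `nx` or `nx * ny`, which constrains the sub-domain shapes that one transfer can describe.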

3-D Differential Calculation Strategy
- Kernel operates on 4 lines simultaneously
- Operate on a 4 x 4 x nx data set as the core computation strategy
  - Determine x-differentials on the set of 16 lines
  - Add y-differentials on a horizontal plane of 4 x nx, four times
  - Add z-differentials on a vertical plane of 4 x nx, four times
[Figure: total data set needed, with the x-, y- and z-differential regions]
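The three passes over a 4 x 4 x nx block can be sketched as follows. To keep the example short it assumes 2nd-order central differences with unit grid spacing and one ghost cell per side, whereas the study uses a 10th-order stencil with a wider halo. The array layout and all names are hypothetical.

```python
HALO = 1    # ghost cells per side for a 2nd-order stencil (assumption)
BLOCK = 4   # block extent in y and z


def make_field(func, nx):
    """Build a (BLOCK+2*HALO) x (BLOCK+2*HALO) x (nx+2*HALO) field indexed [z][y][x]."""
    n = BLOCK + 2 * HALO
    return [[[func(z - HALO, y - HALO, x - HALO) for x in range(nx + 2 * HALO)]
             for y in range(n)] for z in range(n)]


def block_divergence(vx, vy, vz, nx):
    """Return dvx/dx + dvy/dy + dvz/dz on the interior 4 x 4 x nx block."""
    out = [[[0.0] * nx for _ in range(BLOCK)] for _ in range(BLOCK)]
    # Pass 1: x-differentials on the 16 lines of the block.
    for z in range(BLOCK):
        for y in range(BLOCK):
            line, row = vx[z + HALO][y + HALO], out[z][y]
            for x in range(nx):
                row[x] = 0.5 * (line[x + HALO + 1] - line[x + HALO - 1])
    # Pass 2: add y-differentials, one 4 x nx plane of fixed z at a time (four planes).
    for z in range(BLOCK):
        plane = vy[z + HALO]
        for y in range(BLOCK):
            above, below, row = plane[y + HALO + 1], plane[y + HALO - 1], out[z][y]
            for x in range(nx):
                row[x] += 0.5 * (above[x + HALO] - below[x + HALO])
    # Pass 3: add z-differentials, one 4 x nx plane of fixed y at a time (four planes).
    for y in range(BLOCK):
        for z in range(BLOCK):
            front, back = vz[z + HALO + 1][y + HALO], vz[z + HALO - 1][y + HALO]
            row = out[z][y]
            for x in range(nx):
                row[x] += 0.5 * (front[x + HALO] - back[x + HALO])
    return out


# Usage: linear fields whose divergence is 2 + 3 + 5 everywhere.
nx = 6
div = block_divergence(make_field(lambda z, y, x: 2.0 * x, nx),
                       make_field(lambda z, y, x: 3.0 * y, nx),
                       make_field(lambda z, y, x: 5.0 * z, nx), nx)
```

The x pass reads each line contiguously, while the y and z passes combine whole lines from neighbouring rows and planes, which is the distinction between the first-dimension and orthogonal-dimension kernels.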

Example of Data Movement
[Figure: data movement through the memory hierarchy]
- CPU
- L1 (16K SRAM / 16K cache)
- L2 (384K SRAM / 128K cache)
- MSMC SRAM (shared by all cores)
- DDR
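Moving blocks from DDR into on-chip memory while the core computes on the previous block requires two buffers used in alternation. The sketch below shows that ping-pong pattern in sequential form: the slice copy stands in for starting a DMA transfer, and on real hardware the transfer would run concurrently and be waited on before its buffer is used. Function names and the event log are hypothetical.

```python
def process_with_double_buffering(ddr, block_size, compute):
    """Process ddr block by block, prefetching block i+1 while computing on block i."""
    n_blocks = (len(ddr) + block_size - 1) // block_size
    buffers = [None, None]     # ping and pong buffers in fast on-chip memory
    results = []
    log = []                   # order of events, for illustration only

    def start_transfer(block, slot):
        # Stand-in for kicking off a DMA from DDR into buffer `slot`.
        begin = block * block_size
        buffers[slot] = ddr[begin:begin + block_size]
        log.append(("dma", block, slot))

    if n_blocks:
        start_transfer(0, 0)                       # prime the first buffer
    for block in range(n_blocks):
        slot = block % 2
        if block + 1 < n_blocks:
            start_transfer(block + 1, 1 - slot)    # prefetch into the other buffer
        results.extend(compute(buffers[slot]))     # compute on the current buffer
        log.append(("compute", block, slot))
    return results, log


# Usage: double every value of a 10-element array in blocks of 4.
doubled, events = process_with_double_buffering(
    list(range(10)), 4, lambda buf: [2 * v for v in buf])
```

Each buffer must fit in the SRAM portion of the memory level it lives in, so the block size is bounded by half of the available on-chip SRAM.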

Results
- After implementing DMA data movement, performance went from 45 to 59 M cells/sec on a single 8-core C6678 multi-core DSP
- Performance is limited by data transfers over DDR3
  - Performance only went up to 63 M cells/sec when all computation was disabled
  - Theoretical DDR3-bandwidth-limited performance is 120 M cells/sec @ 1330 MHz DDR3
  - Currently we are operating at about 50% of DDR3 bandwidth
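The "about 50%" figure follows directly from the measured and theoretical rates. The sketch below reproduces it from the numbers on the slide. The last two values rest on an assumption that is not stated in the presentation: that 1330 MHz is the DDR3 transfer rate on the 64-bit interface, giving 8 bytes per transfer.

```python
# Rates quoted on the slide, in millions of cells per second.
BEFORE_DMA = 45.0
WITH_DMA = 59.0
TRANSFER_ONLY = 63.0      # all computation disabled
DDR3_LIMIT = 120.0        # theoretical, bandwidth limited

dma_speedup = WITH_DMA / BEFORE_DMA            # gain from DMA data movement
utilisation = WITH_DMA / DDR3_LIMIT            # fraction of the DDR3 limit reached
transfer_ceiling = TRANSFER_ONLY / DDR3_LIMIT  # best case with compute removed

# Assumption: 1330 M transfers/s on a 64-bit (8-byte) DDR3 interface.
peak_bytes_per_sec = 1330e6 * 8
implied_bytes_per_cell = peak_bytes_per_sec / (DDR3_LIMIT * 1e6)

print(f"speedup {dma_speedup:.2f}x, utilisation {utilisation:.0%}, "
      f"transfer-only ceiling {transfer_ceiling:.1%}")
```

Because removing all computation lifts the rate only from 59 to 63 M cells/sec, the remaining gap to 120 M cells/sec lies in the data transfers rather than in the kernels.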

Future Activity
- Continued performance analysis
  - Current measurements done with a DDR3 clock rate of 1330 MHz
  - Device capable of handling 1600 MHz -> 20% improvement
  - Optimize parameters further for maximum data transfer utilization
- Extend analysis to multi-DSP PCIe boards
  - MPI-based message passing
  - Side region data exchange
- Integrate with IWAVE framework
  - Framework can run on host with main computation handled by DSP board(s)
- Add more complicated wave equation updates
  - Elastic modeling