High Performance Embedded Computing


1 Design is a strategic asset High Performance Embedded Computing Arnon Friedmann Texas Instruments 1

2 Overview
- What is embedded?
- How did we get here? Shannon DSP; brief history of TI DSP for HPC
- What makes a DSP?
- Where are we now: benchmarks; Hawking DSP, Brown Dwarf, Moonshot; it's about the software
- Where are we headed
- Summary

3 Embedded Markets DVR / NVR & smart camera Wireless and Networking Mission critical systems Medical imaging Video and audio infrastructure High-performance and cloud computing Portable mobile radio Industrial imaging Home AVR and automotive audio Analytics Test and Measurement Industrial control media processing computing radar & communications industrial electronics 3


4 DSP Roadmap
Status legend: Concept, Development, Sampling, Production. Tiers: DSP Low, DSP Mid, DSP High.
- C6678/4/2/1: 1x/2x/4x/8x C66x, 1.25GHz, 8MB of L2; PCIe, GigE, SRIO; 24x24mm
- C6657: 1x/2x C66x, 1.25 GHz, 3MB L2; PCIe, USB, GigE, SRIO; 21x21mm
- OMAP L138: 1x ARM A9, 456 MHz; 1x C674x, 456 MHz; EMAC, USB2, TDM; 13mm², 16x16mm
- C6748: 1x C674x, 456 MHz; EMAC, USB2, McASP; 13mm², 16x16mm
- 66AK2H12/06: 2x/4x ARM A15, 1.4 GHz; 4x/8x C66x, 1.2GHz 1.4GHz, up to 8MB L2; PCIe, USB, GigE, SRIO; 40x40mm
- AM5K2H04: 4x ARM A15, 1.4 GHz, up to 8MB L2; PCIe, USB, GigE, SRIO
- Next DSP Low: multicore ARM and DSP devices; industrial, audio and communications
- Next Mid-range: multicore ARM and DSP; industrial control and communications
- Next High End Multicore: high performance multicore ARM + DSP; large L2, 2x DDR4; high speed serial I/O


5 Technology for Video Security Analytics from the Core to the Edge
TI's DSP & vision solutions:
- All DM81xx DVR solutions with embedded analytics capabilities
- Analytics at the edge with DMVAx & Advanced Analytics IP Camera
- Additional processing with C665x Multicore
[Diagram labels: Analog Camera, Coax cable; DM36x, DM385, DMVA2 (Smart Analytics IP Camera); DMVA3; DM33x ISP+ARM (Main Stream IP Camera); 3G/Edge; DM812x, DM644x, C665x, TVP5154, TVP5158, DM64x, DM6467, DM810x, DM814x, DM816x, C667x Multicore]
DVR: Digital Video Recorder. NVR: Network Video Recorder. DVS: Digital Video Server.


6 Unleashing TI multicore DSPs @ SC'11
- Innovative new DSP core
- Most powerful multicore DSPs
- Lowest power per MHz/GMAC/GFLOP


7 Evolution of the C66x
Most Power Efficient Scientific Computing Engine in the Industry!
- C64x core (fixed point), C64xx: industry's lowest power fixed-point DSP core; industry's highest performance DSP core; current base for multi-core product line
- C67x core (floating point), C67xx: industry's lowest power floating-point DSP core; high precision and wide dynamic range
- C66x DSP core, NEW multicore DSP: highest performance fixed and floating point DSP; 40 GMACs / 20 GFLOPs per core; 320 GMACs / 160 GFLOPs total; >10 GFlops/Watt
- Software: TI optimizing compiler (GCC support, C/C++); scientific computing libraries; multicore tools and Code Composer Studio IDE; easy and flexible programming

8 Shannon (TMS320C6678) Block Diagram
Multi-Core KeyStone SoC
- Fixed/Floating CorePac GHz: 8 x CorePac; 0.5MB L2/core, 4.0 MB shared L2; 320G MAC, 160G FLOP, 60G DFLOPS; 10W
- Navigator: hardware queue manager with DMA
- Multicore Shared Memory Controller (MSMC): low latency, high bandwidth memory access
- Network Coprocessor: IPv4/IPv6 network interface solution; IPSec, SRTP, encryption fully offloaded
- HyperLink: 50G baud expansion port, transparent to software
[Block diagram: 8 x C66x DSP CorePacs, each with L1 and L2; memory subsystem with MSMC, 4MB shared memory and 64-bit DDR3; Multicore Navigator; TeraNet; HyperLink 50; system elements (power management, debug, SysMon, EDMA); network coprocessors (crypto, packet accelerator, GbE switch, SGMII); peripherals and I/O (SRIO x4, TSIP 2x, PCIe x2, I2C, SPI, EMIF 16, UART)]

9 Telecom ATCA Blade 1T DFLOPS / blade 240W (board power) 256GB / s memory bandwidth 20GB memory 100Gbit/s interconnect total bandwidth Dual 10Gbit/s Ethernet uplink 20 devices, 8 cores each 50Gbit/s links pairing devices 9

10 Quad/Octal-Shannon PCIe Cards 512 Gflops 50 W ~1 Teraflop 110 W 16 GByte DDR3

11 High Level Comparison
- TI Quad Shannon PCIe: ~50 W (2011); ~12.8 Gflops/W SP; ~3.2 Gflops/W DP
- Nvidia Kepler: ~250 W (2012); dominates acceleration today; powers #2 supercomputer (Titan); ~12 Gflops/W SP; ~4 Gflops/W DP
- Intel Xeon Phi (MIC): ~250 W (2012); unveiled at SC'12; powers #1 supercomputer (Tianhe-2); ~8 Gflops/W SP; ~4 Gflops/W DP

12 QCDSP: A Teraflop Scale Massively Parallel Supercomputer
Researchers at Brookhaven develop DSP-based system in the mid-to-late '90s.
Abstract: We discuss the work of the QCDSP collaboration to build an inexpensive Teraflop scale massively parallel computer suitable for computations in Quantum Chromodynamics (QCD). The computer is a collection of nodes connected in a four dimensional toroidal grid with nearest neighbor bit serial communications. A node is composed of a Texas Instruments Digital Signal Processor (DSP), memory, and a custom made communications and memory controller chip. An 8192 node computer with a peak speed of 0.4 Teraflops is being constructed at Columbia University for a cost of $1.8 Million. A 12,288-node machine with a peak speed of 0.6 Teraflops is being constructed for the RIKEN Brookhaven Research Center. Other computers have been built including a 50 Gigaflop version for Florida State University.
Keywords: parallel, supercomputer, digital signal processor, QCD
Introduction: The atoms and nuclei of everyday matter are now known to be made up of still tinier particles known as quarks and leptons.


13 QCDSP: A Teraflop Scale Massively Parallel Supercomputer
Researchers at Brookhaven develop DSP-based system in the mid-to-late '90s. TI then forgot all about this...
Abstract: We discuss the work of the QCDSP collaboration to build an inexpensive Teraflop scale massively parallel computer suitable for computations in Quantum Chromodynamics (QCD). The computer is a collection of nodes connected in a four dimensional toroidal grid with nearest neighbor bit serial communications. A node is composed of a Texas Instruments Digital Signal Processor (DSP), memory, and a custom made communications and memory controller chip. An 8192 node computer with a peak speed of 0.4 Teraflops is being constructed at Columbia University for a cost of $1.8 Million. A 12,288-node machine with a peak speed of 0.6 Teraflops is being constructed for the RIKEN Brookhaven Research Center. Other computers have been built including a 50 Gigaflop version for Florida State University.
Keywords: parallel, supercomputer, digital signal processor, QCD
Introduction: The atoms and nuclei of everyday matter are now known to be made up of still tinier particles known as quarks and leptons.


15 Linpack Results from KTH Data from study performed at KTH Supercomputing Center LINPACK running on C6678 achieves 25.6 GFlops, ~2.1 GFlops/W Single precision performance ~4x better, ~8 GFlops/W 15

16 Comparison: Algorithm Level
Testing algorithm: range-azimuth algorithm, FFT size 4096
- GPU benchmark: Nvidia Tesla C1060 (Bisceglie 10, ref[2]); core clock = 1.296GHz; processor cores = 240; memory = 800MHz
- FPGA: Xilinx VIRTEX-5 (Pfitzner 11, ref[3])
- DSP: > 20x in power/performance
[Table: comparison of DSP, GPU and FPGA in ns/pixel; only the value 53.3 is recoverable]

17 Running on OpenMP today 17

18 Video Analytics: Comparison between KeyStone I and x86 processors
[Chart: Watts consumed per channel and cost (USD) per channel for QVGA video. Processors compared: i, Xeon E5620, Dual E5645, Xeon X5675, single Shannon, Quad Shannon, Octo Shannon]

19 FINALLY SOME DETAILS... 19


20 C66x Core Overview
[Block diagram labels: C66x DSP with fetch, dispatch and execute stages; two sets of L, M, S and D units, each with its own registers (register file A and register file B); L1P SRAM/cache 32KB with L1P prefetch; L1D SRAM/cache 32KB with L1D prefetch; L2 SRAM/cache 1MB; prefetch; DMA; embedded debug; interrupt controller; emulation; power management]

21 C6X: High Performance at low power
- The C6x architecture is designed to provide the highest performance DSP processing
- Superscalar DSP is capable of executing 8 instructions per clock cycle
- The VLIW engine works in concert with compiler technology to provide superscalar performance without the power overhead of general purpose superscalar CPUs
[Diagram: a GPP with instruction scheduler, instruction dispatch, reservation stations, re-order buffers, register allocation and ALUs, compared with C6X, where the instruction scheduler sits in the C6X compiler and the C6X VLIW engine holds only instruction dispatch and ALUs]

22 C6X VLIW Power Optimization
Traditional exposed-pipeline VLIW machines have some drawbacks with respect to power:
- Instruction RAM usage is high because there is no instruction scheduler: if an ALU (or other functional unit) is not used, a NOP must be issued to it
- Loops must be unrolled by the compiler
Pure VLIW machine code for low-IPC code (6 instructions encoded in 24 instruction words):
    NOP ADD NOP NOP MPY NOP
    NOP NOP SUB NOP NOP NOP
    MPY NOP NOP NOP NOP NOP
    LD  NOP MPY NOP NOP NOP

23 C6X VLIW Instruction Dispatch
To reduce instruction fetch power and code size, a simple instruction dispatch unit is introduced.
[Diagram: in a pure VLIW machine, instruction RAM feeds the decoders and ALUs directly; in the C6X VLIW, an instruction dispatch unit sits between instruction RAM and the decoders]

24 C6X VLIW Instruction Dispatch (2)
- Execution unit and parallelism are encoded in the machine code
- A simplified dispatch unit unpacks the machine code
C6X machine code:
    ADD MPY SUB MPY LD MPY
After instruction dispatch in the C6X VLIW core:
    NOP ADD NOP NOP MPY NOP
    NOP NOP SUB NOP NOP NOP
    MPY NOP NOP NOP NOP NOP
    LD  NOP MPY NOP NOP NOP

25 C6X VLIW Loop Unrolling
- Traditional exposed-pipeline VLIW machines have additional instruction overhead: loops are unrolled by the compiler
- A loop that really only has 4 unique instructions can easily need many more instructions after unrolling
- The C64x+ generation introduced a loop construct which unrolls the loop within the CPU: code size reduction for loops; power savings in the CPU instruction pipeline

26 C6X VLIW Loop Unrolling (2)
A while loop in C is depicted below. The resulting assembly code is shown for traditional VLIW and C6X VLIW.
C pseudo-source:
    while (A1--) A8 += *A0++;
Traditional VLIW:
           B LOOP
           LDW *A0++, A7
           B LOOP
           LDW *A0++, A7
           B LOOP
           LDW *A0++, A7
           B LOOP
           LDW *A0++, A7
           B LOOP
           LDW *A0++, A7
           B LOOP
    LOOP:  LDW *A0++, A7
     [A1]  SUB A1,1,A1
           ADD A7,A8,A8
     [A1]  B LOOP
C6X VLIW with software pipelined loop unroller:
    MVC A1, ILC
    SPLOOP 1
    LDW *A0++,A7
    NOP 4
    ADD A7,A8,A8
    SPKERNEL
- 20% overall dynamic power reduction
- Same performance using less than ½ the instructions

27 C6X VLIW plus SIMD
Strategy in evolving from the C674x core to the C66x: increase datapath width, leave overhead the same.
- C66x widened most 32-bit instructions into 64-bit SIMD versions of the same instructions
- Instruction decode overhead is the same, but processing power goes up by 2x
- Overall energy consumption is lower for a given benchmark
- When only 32 bits of the unit are required, clock-gating eliminates the dynamic power of the unused 32 bits

28 WHERE ARE WE NOW 28

29 KeyStone Innovation
5 generations of multicore
- Lowers development effort
- Speeds time to market
- Leverages TI's investment
- Optimal software reuse
[Roadmap labels: KeyStone II, 28nm; KeyStone III, 20nm; ARM A15; 10G Networking; 64 bit ARM v8; C66x+; 40G Networking; multicore cache coherency; KeyStone, 40nm; ARM A8; C66x fixed and floating point, FPi, VSPi; Janus, 130nm; Faraday, 65nm; Network and Security AccelerationPacs; C64x+; Wireless Accelerators; 6 core DSP. Status legend: Concept, Development, Sampling, Production]

30 K2H Platform: 66AK2H12/06 Functional Diagram
C66x fixed or floating point DSP
- 4x/8x C66x DSP cores up to 1.4GHz
- 2x/4x ARM Cortex-A15
- 1MB of local L2 cache RAM per C66x DSP core
- 4MB shared across all ARM cores
Large on-chip and off-chip memory
- Multicore Shared Memory Controller provides low latency and high bandwidth memory access
- 6MB shared L2 on-chip
- 2 x 72-bit DDR3 (with ECC), 16 GB total addressable, DIMM support (4 ranks total)
KeyStone multicore architecture and acceleration
- Multicore Navigator, TeraNet, HyperLink
- 1GbE network coprocessor (IPv4/IPv6)
- Crypto engine (IPSec, SRTP)
Peripherals
- 4 port 1G Layer 2 Ethernet switch
- 2x PCIe, 1x4 SRIO 2.1, EMIF16, USB 3.0, UART x2, SPI, I2C
15-25W depending upon DSP cores, speed, temp and other factors
28 nm, 40mm x 40mm package
[Block diagram: 4 x ARM A15 with 4MB; 8 x C66x, each with 1MB; MSMC 6MB; 64/72b DDR3 x2; Multicore Navigator; TeraNet; system elements (power manager, debug, SysMon, EDMA); network AccelerationPacs (packet accelerator, security accelerator, 5 port 1GbE switch, 4 x 1GbE); EMIF and I/O (16b EMIF, UART x2, SPI x3, I2C x3); high speed SERDES (PCIe, SRIO 4x, 2 x USB3, HyperLink x2)]


31 HP Moonshot: KeyStone II software-defined server
The essential foundation for the new style of IT
- 45 hot-plug cartridges: compute, storage, or combination; x86, ARM, or accelerator
- Single-server = 45 servers per chassis
- Quad-server = 180 servers per chassis (future capability)
- Dual low-latency switches: HP Moonshot-45G Switch Module (180 x 1Gb downlinks)

32 TI KeyStone II HPC System ATCA-Based Rapid IO Switching 8 TFlop/Blade 100 GByte/Blade Up to 14 blades/chassis 32

33 SOFTWARE AND TOOLS

34 Multicore Software Vision Multicore ARM Same User Experience as x86 devices Multicore ARM + Multicore DSP Multicore DSP Make easy native DSP experience Mainline SMP Linux Standard Linux Tools Distribs. MPI Augment with TI differentiation Leverage standard accelerator models OpenCL OpenMP Accel Extensive Tool box for advanced programmers OpenMP Libraries Navigator run-time RTOS & Drivers IPC And more Development Environment: GDB, etc. Apps Debug Environment for ARM & DSP Eclipse Embedded Development & Debug Environment Instrumentation and Trace leveraging embedded hardware capability 34

35 Fast, Effective, Open Tools
- Optimized C compiler; multicore parallel programming models
- Productive IDE: Code Composer Studio, Eclipse based, hosts TI and 3rd party tools for easy debug; advanced analysis and visualization speeds SW development
- Efficient Multicore Software Development Kit: available on both DSP and ARM with free source code; HLOS/RTOS, optimized libraries, algorithms and drivers, multicore runtime, protocol stack and application demos

36 Discrete to Integrated
Getting from here (x86 + DSP) to here (Hawking, 28nm)
Parallel computing strategies with TI DSPs:
1. Get started quickly with optimized libraries and a simple host/DSP interface: TI libraries (BLAS, DSPLIB, FFT), 3rd party libraries (VSIPL), user libraries (custom functions)
2. Offload code simply with directive-based programming: OpenMP Accelerator Model
3. Create optimized functions using standard programming, vector programming and OpenMP programming: user libraries, TI tools

37 There's No Single Answer
Different parallelism models:
o Task parallel or data parallel
o Coarse or fine grained parallelism
o Large or small data sets
Variety of system baselines:
o Current multicore systems
o Variety of methods of expressing and managing parallelism
o Already or easily partitioned

38 Hierarchy of Multicore Engagement Options
Options, in slide order, along an axis running from "increasing abstraction and productivity" to "increasing control and performance":
- Multicore libraries
- MPI
- OpenMP (Accelerator Model)
- OpenCL for accelerators
- OpenMP (homogeneous)
- Explicit IPC
Development approach: engage at the most abstract level, then incrementally optimize to achieve the required performance or power efficiency.

39 Cooperative Parallel Programming (brief history of expression APIs/languages) MPI Communication APIs Node 0 Node 1 Node N 39

40 Cooperative Parallel Programming (brief history of expression APIs/languages) MPI Communication APIs OpenMP Threads OpenMP Threads OpenMP Threads CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU Node 0 Node 1 Node N 40

41 Cooperative Parallel Programming (brief history of expression APIs/languages) MPI Communication APIs OpenMP Threads OpenMP Threads OpenMP Threads CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CUDA/OpenCL CUDA/OpenCL CUDA/OpenCL GPU GPU GPU Node 0 Node 1 Node N 41

42 Cooperative Parallel Programming On KeyStone II as an example MPI Communication APIs OpenMP Threads OpenMP Threads OpenMP Threads CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU OpenCL OpenCL OpenCL DSP DSP DSP Node 0 Node 1 Node N 42

43 Cooperative Parallel Programming On KeyStone II as an alternative example MPI Communication APIs CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU OpenCL OpenCL OpenCL Node 0 Node 1 Node N 43

44 Cooperative Parallel Programming On KeyStone II as an alternative example MPI Communication APIs CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU OpenCL OpenCL OpenCL OpenMP OpenMP OpenMP Node 0 Node 1 Node N 44

45 Cooperative Parallel Programming On KeyStone II as an alternative example MPI Communication APIs OpenMP Accel OpenMP Accel OpenMP Accel CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU Node 0 Node 1 Node N 45


46 OpenMP Accelerator model: Target Construct
Pragma-based model to dispatch computation from host to accelerator (K2H ARMs to DSPs)

    void foo(int *in1, int *in2, int *out1, int count)
    {
        /* Array sections are written [start:length] in OpenMP 4.0,
           so [0:count] maps all count elements. Each map clause
           carries one map type. */
        #pragma omp target map(to: in1[0:count], in2[0:count], count) \
                           map(from: out1[0:count])
        {
            #pragma omp parallel shared(in1, in2, out1)
            {
                int i;
                #pragma omp for
                for (i = 0; i < count; i++)
                    out1[i] = in1[i] + in2[i];
            }
        }
    }

Extends OpenMP by adding:
- A target construct to indicate regions to be dispatched
- A map clause to indicate data transfer between host and accelerator; this does not have to be a copy (e.g. shared memory)
- Clauses to indicate that variables/functions reside on host/device/both
Target regions can contain OpenMP constructs.
TI co-chairs the OpenMP accelerator model sub-committee and played a significant role in the spec definition.

47 EPCC Micro benchmark data
[Charts: parallel-for overheads and barrier construct overheads, in cycles, compared across OpenMP runtime versions]
- OpenMP Runtime 2.0 significantly reduces (2.5x) the overhead of OpenMP constructs such as parallel for and barrier
- This makes it feasible to use OpenMP for parallel regions with smaller granularity, i.e. fewer cycles
- Optimized OpenMP runtime built on OpenEM and libgomp (GCC OpenMP library)
- Does not require BIOS/IPC/XDC; however, the runtime will co-exist with BIOS etc. if present in the user application

48 WHERE ARE WE GOING 48

49 High Performance Compute moving to mainstream Oil and Gas Exploration Bioscience Big data mining Weather forecast Financial trading Electronics design automation Defense 49

50 HPC System and Architecture Evolution More computation capacity More memory and memory BW More networking and IO capability Less power consumption Compute Data Movement Connectivity Performance/W Heterogeneous processing High level of parallelism Reducing memory and IO bottleneck Efficient networking Higher IO bandwidth Increasing power efficiency High Performance Power Efficiency Real time Scalability Safety Reliability 50

51 TI Continues to Invest in DSP DSP Leadership Innovation Next Gen DSP 1995 C6000 C66xx 12.8GFLOPS/w 32 bit C5000 Fixed and/or Floating point 1982 C1x First 16bit Commercial DSP 16 bit Fixed point Ultra low power


52 High Performance Memory Interfaces
Hybrid Memory Cube (HMC)
- High BW serialized interface
- Large DRAM memory space
- Suitable for networking and applications that are latency tolerant
- Lower mW/Gbps
High Bandwidth Memory interface (HBM)
- Interposer/TSV stacks memory into the SoC package
- Wide interface to SoC cores
- Suitable for core-centric access that requires large BW and low latency
- Higher mW/Gbps

53 High BW IO and network on chip Multicore Navigator enables zero copy and common multicore programming model Modular, scalable networking solution Other IO (PCIe, JESD204B SRIO, USB ) Hyperlink Security Accelerator Hyperlink enables 50Gbps throughput with minimum latency and SW overhead Ethernet Switch Multicore Navigator PKT DMA Packet Accelerator Teranet enables high throughput non-blocking network on chip Teranet 53

54 Holistic Power Optimization: From board to transistor level
Levels: board level, device level, transistor level
- Memory integration (e.g. HMC, HBM)
- Low voltage operation
- DVFS
- Retention Bias
- In-package voltage regulation
- Interposer/TSV
- Signal transport
- On-die Serdes
- FinFET: significant leakage current reduction with lower Vdd
[Diagram: ASIC 1 and ASIC 2 on a Si interposer]


55 Overcoming Integration Complexities
- Industry standard ecosystem: in-house and 3rd party IP (interface IP, core IP and soft IP)
- Static power management: power domains on processor cores, accelerators, and I/O
- Dynamic power management: IO voltages, core logic AVS and DVFS domains, SRAM supplies
- System management: reset, clocking, DFT, interrupts, interconnect fabric
- Board-level feature integration: asynchronous clocking, scalable clocks, fixed frequency clocks; A/D & D/A converters, RF integration, voltage regulation

56 Design is a strategic asset Thank you Texas Instruments 56
