High Performance Embedded Computing


1 High Performance Embedded Computing
Arnon Friedmann, Texas Instruments
Design is a strategic asset

2 Overview
- What is embedded?
- How did we get here?
- Shannon DSP
- Brief history of TI DSP for HPC
- What makes a DSP?
- Where are we now
- Benchmarks
- Hawking DSP, Brown Dwarf, Moonshot
- It's about the software
- Where are we headed
- Summary

3 Embedded Markets
- DVR / NVR and smart camera
- Wireless and networking
- Mission critical systems
- Medical imaging
- Video and audio infrastructure
- High-performance and cloud computing
- Portable mobile radio
- Industrial imaging
- Home AVR and automotive audio
- Analytics
- Test and measurement
- Industrial control
Market segments: media processing, computing, radar & communications, industrial electronics

4 DSP Roadmap
Device status legend: Concept, Development, Sampling, Production
Tiers: DSP Low, DSP Mid, DSP High
- C6678/4/2/1: 1x/2x/4x/8x C66x at 1.25 GHz; 8 MB of L2; PCIe, GigE, SRIO; 24x24 mm
- C6657: 1x/2x C66x at 1.25 GHz; 3 MB L2; PCIe, USB, GigE, SRIO; 21x21 mm
- OMAP-L138: 1x ARM9 at 456 MHz, 1x C674x at 456 MHz; EMAC, USB2, TDM; 13x13 mm or 16x16 mm
- C6748: 1x C674x at 456 MHz; EMAC, USB2, McASP; 13x13 mm or 16x16 mm
- 66AK2H12/06: 2x/4x ARM A15 at 1.4 GHz, 4x/8x C66x at 1.2 GHz/1.4 GHz; up to 8 MB L2; PCIe, USB, GigE, SRIO; 40x40 mm
- AM5K2H04: 4x ARM A15 at 1.4 GHz; up to 8 MB L2; PCIe, USB, GigE, SRIO
- Next DSP Low: multicore ARM and DSP devices; industrial, audio and communications
- Next Mid-range: multicore ARM and DSP; industrial control and communications
- Next High End Multicore: high-performance multicore ARM + DSP; large L2, 2x DDR4; high-speed serial I/O

5 Technology for Video Security: Analytics from the Core to the Edge
TI's DSP & vision solutions:
- All DM81xx DVR solutions with embedded analytics capabilities
- Analytics at the edge with DMVAx & Advanced Analytics IP Camera
- Additional processing with C665x and C667x multicore DSPs
Camera types shown in the diagram: analog camera (coax cable), main stream IP camera (ISP+ARM), smart analytics IP camera, advanced analytics IP camera; 3G/Edge link
Devices shown in the diagram: DM33x, DM36x, DM385, DMVA2, DMVA3, DM64x, DM644x, DM6467, DM810x, DM812x, DM814x, DM816x, TVP5154, TVP5158, C665x, C667x
DVR: Digital Video Recorder; NVR: Network Video Recorder; DVS: Digital Video Server

6 Unleashing TI multicore (SC'11)
- Innovative new DSP core
- Most powerful multicore DSPs
- Lowest power per MHz/GMAC/GFLOP

7 Evolution of the C66x: Most Power Efficient Scientific Computing Engine in the Industry!
- C64x core (fixed point): industry's lowest power fixed-point DSP core
- C64xx: industry's highest performance DSP core; current base for the multi-core product line
- C67x core (floating point): industry's lowest power floating-point DSP core
- C67xx: high precision and wide dynamic range
- C66x DSP core (NEW multicore DSP): highest performance fixed and floating point DSP; 40 GMACs / 20 GFLOPs per core; 320 GMACs / 160 GFLOPs total; >10 GFlops/Watt
- Software: TI Optimizing Compiler (GCC support, C/C++); scientific computing libraries; multicore tools and Code Composer Studio IDE; easy and flexible programming

8 Shannon (TMS320C6678) Block Diagram
Multi-core KeyStone SoC:
- Fixed/Floating CorePac: 8 x C66x DSP, 0.5 MB L2 per core, 4.0 MB shared L2
- 320 GMAC, 160 GFLOP, 60 G DFLOPS; 10 W
- Navigator: hardware queue manager with DMA
- Multicore Shared Memory Controller (MSMC): low latency, high bandwidth memory access
- Network Coprocessor: IPv4/IPv6 network interface solution; IPSec, SRTP and encryption fully offloaded
- HyperLink: 50 Gbaud expansion port, transparent to software
Block diagram contents: 8 x CorePac (C66x DSP with L1 and L2); memory subsystem (64-bit DDR3, MSMC, 4 MB shared memory); Multicore Navigator; TeraNet; HyperLink 50; network coprocessors (crypto, packet accelerator, GbE switch, SGMII); system elements (power management, debug, SysMon, EDMA); peripherals & I/O (SRIO x4, TSIP 2x, PCIe x2, I2C, SPI, EMIF16, UART)
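
The headline throughput figures can be cross-checked with simple arithmetic. The sketch below assumes a 1.25 GHz core clock (the C6678 figure on the roadmap slide) and per-core issue rates of 32 MACs and 16 single-precision flops per cycle. Those two rates are not stated on the slides; they are inferred here from the quoted 40 GMACs and 20 GFLOPs per core. All function and variable names are illustrative.

```c
#include <assert.h>
#include <stdio.h>

/* Peak rate in giga-operations per second for a multicore device.
 * ops_per_cycle is per core; clock_ghz is the core clock in GHz. */
double peak_gops(int cores, double clock_ghz, int ops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && ops_per_cycle > 0);
    return cores * clock_ghz * ops_per_cycle;
}

/* Print device-level peaks using the slide figures plus the
 * assumed per-cycle issue rates described above. */
void print_c6678_peaks(void)
{
    const int cores = 8;               /* 8 x CorePac (slide) */
    const double clock_ghz = 1.25;     /* roadmap slide */
    const int macs_per_cycle = 32;     /* assumption: 40 GMACs / 1.25 GHz */
    const int sp_flops_per_cycle = 16; /* assumption: 20 GFLOPs / 1.25 GHz */
    const double watts = 10.0;         /* slide */

    double gmacs = peak_gops(cores, clock_ghz, macs_per_cycle);
    double gflops = peak_gops(cores, clock_ghz, sp_flops_per_cycle);
    printf("peak: %.0f GMAC, %.0f GFLOP SP, %.1f GFLOP/W\n",
           gmacs, gflops, gflops / watts);
}
```

Under these assumptions the totals come out at 320 GMAC and 160 GFLOP, which matches the slide, and 16 GFLOP per watt at the quoted 10 W.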

9 Telecom ATCA Blade
- 1 TFLOPS double precision per blade
- 240 W (board power)
- 256 GB/s memory bandwidth
- 20 GB memory
- 100 Gbit/s total interconnect bandwidth
- Dual 10 Gbit/s Ethernet uplink
- 20 devices, 8 cores each
- 50 Gbit/s links pairing devices
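
To see what the blade totals imply for each device, the figures can be divided evenly across the 20 devices. This is a rough sketch: the even split is an assumption, the struct and function names are made up for illustration, and board power covers more than the DSPs, so the per-device wattage is an upper bound.

```c
#include <assert.h>
#include <stdio.h>

/* Blade-level totals as quoted on the slide. */
struct blade {
    int devices;
    int cores_per_device;
    double dp_gflops;   /* double-precision GFLOPS per blade */
    double watts;       /* board power */
    double mem_bw_gbs;  /* memory bandwidth, GB/s */
    double mem_gb;      /* memory capacity, GB */
};

/* The same quantities divided evenly per device (assumption). */
struct device_share {
    int total_cores;
    double dp_gflops;
    double watts;
    double mem_bw_gbs;
    double mem_gb;
};

struct device_share per_device(struct blade b)
{
    struct device_share s;
    assert(b.devices > 0);
    s.total_cores = b.devices * b.cores_per_device;
    s.dp_gflops = b.dp_gflops / b.devices;
    s.watts = b.watts / b.devices;
    s.mem_bw_gbs = b.mem_bw_gbs / b.devices;
    s.mem_gb = b.mem_gb / b.devices;
    return s;
}

/* Slide figures: 20 devices x 8 cores, 1 TFLOPS DP, 240 W, 256 GB/s, 20 GB. */
const struct blade atca_blade = { 20, 8, 1000.0, 240.0, 256.0, 20.0 };

void print_device_share(void)
{
    struct device_share s = per_device(atca_blade);
    printf("%d cores; per device: %.0f GFLOPS DP, %.0f W, %.1f GB/s, %.0f GB\n",
           s.total_cores, s.dp_gflops, s.watts, s.mem_bw_gbs, s.mem_gb);
}
```

The split gives 160 cores in total and, per device, 50 GFLOPS double precision, at most 12 W, 12.8 GB/s and 1 GB. The 12.8 GB/s figure is what a single 64-bit DDR3-1600 interface delivers (1600 MT/s times 8 bytes); that is an observation, not something the slide states.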

10 Quad/Octal-Shannon PCIe Cards
- Quad: 512 Gflops, 50 W
- Octal: ~1 Teraflop, 110 W
- 16 GByte DDR3

11 High Level Comparison
- TI Quad Shannon PCIe: ~50 W (2011); ~12.8 Gflops/W SP; ~3.2 Gflops/W DP
- Nvidia Kepler: ~250 W (2012); dominates acceleration today; powers the #2 supercomputer (Titan); ~12 Gflops/W SP; ~4 Gflops/W DP
- Intel Xeon Phi (MIC): ~250 W (2012); unveiled at SC'12; powers the #1 supercomputer (Tianhe-2); ~8 Gflops/W SP; ~4 Gflops/W DP

12 QCDSP: A Teraflop Scale Massively Parallel Supercomputer
Researchers at Brookhaven developed a DSP-based system in the mid-to-late 90s.
Abstract: We discuss the work of the QCDSP collaboration to build an inexpensive Teraflop scale massively parallel computer suitable for computations in Quantum Chromodynamics (QCD). The computer is a collection of nodes connected in a four dimensional toroidal grid with nearest neighbor bit serial communications. A node is composed of a Texas Instruments Digital Signal Processor (DSP), memory, and a custom made communications and memory controller chip. An 8192 node computer with a peak speed of 0.4 Teraflops is being constructed at Columbia University for a cost of $1.8 Million. A 12,288-node machine with a peak speed of 0.6 Teraflops is being constructed for the RIKEN Brookhaven Research Center. Other computers have been built including a 50 Gigaflop version for Florida State University.
Keywords: parallel, supercomputer, digital signal processor, QCD
Introduction: The atoms and nuclei of everyday matter are now known to be made up of still tinier particles known as quarks and leptons.

13 QCDSP: A Teraflop Scale Massively Parallel Supercomputer (previous slide repeated, with one addition)
TI then forgot all about this...


15 Linpack Results from KTH
- Data from a study performed at KTH Supercomputing Center
- LINPACK running on C6678 achieves 25.6 GFlops, ~2.1 GFlops/W
- Single precision performance is ~4x better, ~8 GFlops/W

16 Comparison: Algorithm Level
- Testing algorithm: range-azimuth algorithm, FFT size 4096
- GPU benchmark: Nvidia Tesla C1060 (Bisceglie '10, ref [2]); core clock = 1.296 GHz; processor cores = 240; memory = 800 MHz
- FPGA: Xilinx Virtex-5 (Pfitzner '11, ref [3])
- DSP is > 20x better in power/performance
Chart: ns/pixel for DSP, GPU and FPGA (53.3 is the only value legible)

17 Running on OpenMP today

18 Video Analytics: Comparison between KeyStone I and x86 processors
Chart: watts consumed per channel and cost (USD) per channel for QVGA video
Processors compared: i, Xeon E5620, Dual E5645, Xeon X5675, single Shannon, Quad Shannon, Octo Shannon

19 FINALLY SOME DETAILS...

20 C66x Core Overview
Block diagram contents:
- C66x DSP: fetch, dispatch, execute; two datapaths, each with L, M, S and D functional units and a register file (A and B)
- L1P SRAM/cache, 32 KB
- L1D SRAM/cache, 32 KB
- L2 SRAM/cache, 1 MB
- Prefetch units on L1P, L1D and L2
- DMA
- Embedded debug, emulation, interrupt controller, power management

21 C6X: High Performance at Low Power
- The C6x architecture is designed to provide the highest performance DSP processing
- Superscalar DSP is capable of executing 8 instructions per clock cycle
- The VLIW engine works in concert with compiler technology to provide superscalar performance without the power overhead of general-purpose superscalar CPUs
Diagram: in a general-purpose processor (GPP), instruction scheduling, dispatch, reservation stations, re-order buffers and register allocation sit in hardware around the ALUs; in the C6X, the compiler does the instruction scheduling and the VLIW engine only dispatches instructions to the ALUs

22 C6X VLIW Power Optimization
Traditional exposed-pipeline VLIW machines have some drawbacks with respect to power. Instruction RAM usage is high because:
- There is no instruction scheduler: if an ALU (or other functional unit) is not used, a NOP must be issued to it
- Loops must be unrolled by the compiler
Pure VLIW machine code for low-IPC code (6 instructions encoded in 24 instruction words):
    NOP  ADD  NOP  NOP  MPY  NOP
    NOP  NOP  SUB  NOP  NOP  NOP
    MPY  NOP  NOP  NOP  NOP  NOP
    LD   NOP  MPY  NOP  NOP  NOP

23 C6X VLIW Instruction Dispatch
To reduce instruction fetch power and code size, a simple instruction dispatch unit is introduced.
Diagram:
- Pure VLIW machine: instruction RAM feeds the decoders and ALUs directly
- C6X VLIW: instruction RAM feeds an instruction dispatch unit, which feeds the decoders and ALUs

24 C6X VLIW Instruction Dispatch (2)
- Execution unit and parallelism are encoded in the machine code
- A simplified dispatch unit unpacks the machine code
C6X machine code (6 instruction words):
    ADD  MPY  SUB  MPY  LD  MPY
After instruction dispatch in the C6X VLIW core:
    NOP  ADD  NOP  NOP  MPY  NOP
    NOP  NOP  SUB  NOP  NOP  NOP
    MPY  NOP  NOP  NOP  NOP  NOP
    LD   NOP  MPY  NOP  NOP  NOP
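
The unpacking step can be modelled in a few lines of C. This is a toy sketch, not the actual C6x instruction encoding: each compact instruction carries a target unit index and a flag saying whether the next instruction issues in the same cycle (the real ISA uses a parallel bit for this purpose). The six-unit width matches the slide's diagram, and the struct layout, the unit numbering and the names `insn`, `dispatch` and `slide_code` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define UNITS 6  /* issue slots per cycle in the slide's example */

/* One compact instruction: mnemonic, target unit, and a flag that is
 * nonzero when the NEXT instruction executes in the same cycle. */
struct insn {
    const char *op;
    int unit;
    int p;
};

/* The slide's six-instruction example, with units chosen to reproduce
 * the expanded pattern shown on the slide. */
const struct insn slide_code[6] = {
    { "ADD", 1, 1 }, { "MPY", 4, 0 },   /* cycle 0 */
    { "SUB", 2, 0 },                    /* cycle 1 */
    { "MPY", 0, 0 },                    /* cycle 2 */
    { "LD",  0, 1 }, { "MPY", 2, 0 }    /* cycle 3 */
};

/* Expand compact code into one row of UNITS slots per cycle, filling
 * unused slots with "NOP". Returns the number of rows written, or -1
 * if max_rows is too small. */
int dispatch(const struct insn *code, int n,
             const char *rows[][UNITS], int max_rows)
{
    int row = 0;
    int open = 0;  /* nonzero while the current row holds instructions */

    for (int i = 0; i < n; i++) {
        assert(code[i].unit >= 0 && code[i].unit < UNITS);
        if (!open) {
            if (row == max_rows)
                return -1;
            for (int u = 0; u < UNITS; u++)
                rows[row][u] = "NOP";
            open = 1;
        }
        rows[row][code[i].unit] = code[i].op;
        if (!code[i].p) {  /* end of this cycle's packet */
            row++;
            open = 0;
        }
    }
    return open ? row + 1 : row;
}

/* Print the expanded rows, one cycle per line. */
void print_rows(const char *rows[][UNITS], int nrows)
{
    for (int r = 0; r < nrows; r++) {
        for (int u = 0; u < UNITS; u++)
            printf("%-4s", rows[r][u]);
        printf("\n");
    }
}
```

Calling `dispatch(slide_code, 6, rows, 4)` on a `const char *rows[4][UNITS]` array fills four rows of six slots, so 6 stored instruction words stand in for the 24 that a pure VLIW encoding would need.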

25 C6X VLIW Loop Unrolling
- Traditional exposed-pipeline VLIW machines have additional instruction overhead
- Loops are unrolled by the compiler
- A loop that really has only 4 unique instructions can easily need far more instructions after unrolling
- The C64x+ generation introduced a loop construct that unrolls the loop within the CPU
- Code size reduction for loops
- Power savings in the CPU instruction pipeline

26 C6X VLIW Loop Unrolling (2)
A while loop in C is depicted below, with the resulting assembly code for a traditional VLIW and for the C6X VLIW.
C pseudo-source:
    while (A1--) A8 += *A0++;
Traditional VLIW:
            B     LOOP
            LDW   *A0++, A7
            B     LOOP
            LDW   *A0++, A7
            B     LOOP
            LDW   *A0++, A7
            B     LOOP
            LDW   *A0++, A7
            B     LOOP
            LDW   *A0++, A7
            B     LOOP
    LOOP:   LDW   *A0++, A7
    [A1]    SUB   A1,1,A1
            ADD   A7,A8,A8
    [A1]    B     LOOP
C6X VLIW with software pipelined loop unroller:
            MVC   A1, ILC
            SPLOOP 1
            LDW   *A0++,A7
            NOP   4
            ADD   A7,A8,A8
            SPKERNEL
Result: 20% overall dynamic power reduction; same performance using less than half the instructions.
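
For readers less used to the assembly, the same trade-off can be shown at the C level. The sketch below writes the slide's loop as a function and then unrolls it by four by hand, which is roughly what a compiler does for a traditional VLIW: fewer loop branches per element, at the cost of more instructions in memory. The function names and the unroll factor are illustrative, and real compilers choose the factor themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The slide's loop, `while (A1--) A8 += *A0++;`, as a C function. */
int32_t sum_rolled(const int32_t *a0, size_t a1)
{
    int32_t a8 = 0;
    assert(a0 != NULL || a1 == 0);
    while (a1--)
        a8 += *a0++;
    return a8;
}

/* Same result, unrolled by 4 by hand. The body is four times larger,
 * which is the code-size cost that an in-CPU loop buffer avoids. */
int32_t sum_unrolled4(const int32_t *a0, size_t a1)
{
    int32_t a8 = 0;
    size_t i = 0;
    assert(a0 != NULL || a1 == 0);
    for (; i + 4 <= a1; i += 4) {
        a8 += a0[i];
        a8 += a0[i + 1];
        a8 += a0[i + 2];
        a8 += a0[i + 3];
    }
    for (; i < a1; i++)  /* leftover elements when a1 is not a multiple of 4 */
        a8 += a0[i];
    return a8;
}
```

Both functions return the same sum; only the amount of code differs, which is the point the slide makes about the software pipelined loop construct.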

27 C6X VLIW plus SIMD
Strategy in evolving from the C674x core to the C66x: increase datapath width, leave overhead the same.
- C66x extended most 32-bit instructions into 64-bit SIMD versions of the same instructions
- Instruction decode overhead is the same, but processing power goes up by 2x
- Overall energy consumption is lower for a given benchmark
- When only 32 bits of the unit are required, clock gating eliminates the dynamic power of the unused 32 bits
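
The idea of a 64-bit SIMD version of a 32-bit instruction can be emulated in portable C. The sketch below treats one 64-bit word as two independent 32-bit lanes and counts how many operations are issued; it is not C66x intrinsic code, and `pack2`, `add2` and `add_arrays_2way` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two 32-bit lanes into one 64-bit word (hi lane in the upper half). */
uint64_t pack2(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Lane-wise add: each 32-bit lane wraps on its own, so no carry
 * crosses from the low lane into the high lane. */
uint64_t add2(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a + (uint32_t)b;
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32);
    return pack2(hi, lo);
}

/* Add two arrays two elements at a time. Returns the number of 2-lane
 * operations issued, which is n / 2 rather than n. */
size_t add_arrays_2way(const uint32_t *a, const uint32_t *b,
                       uint32_t *out, size_t n)
{
    size_t issued = 0;
    assert(n % 2 == 0);  /* sketch handles even lengths only */
    for (size_t i = 0; i < n; i += 2) {
        uint64_t r = add2(pack2(a[i + 1], a[i]), pack2(b[i + 1], b[i]));
        out[i] = (uint32_t)r;
        out[i + 1] = (uint32_t)(r >> 32);
        issued++;
    }
    return issued;
}
```

Halving the number of issued operations for the same amount of data is the "same decode overhead, 2x processing" effect described on the slide.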

28 WHERE ARE WE NOW

29 KeyStone Innovation
5 generations of multicore:
- Janus (130 nm): 6-core DSP
- Faraday (65 nm): C64x+, wireless accelerators
- KeyStone (40 nm): ARM A8; C66x fixed and floating point, FPi, VSPi; network and security AccelerationPacs
- KeyStone II (28 nm): ARM A15, 10G networking
- KeyStone III (20 nm): 64-bit ARM v8, C66x+, 40G networking
- Multicore cache coherency (KeyStone II/III generations)
Benefits: lowers development effort; speeds time to market; leverages TI's investment; optimal software reuse
Device status legend: Concept, Development, Sampling, Production

30 K2H Platform: 66AK2H12/06 Functional Diagram
Processing cores:
- C66x fixed- or floating-point DSP: 4x/8x C66x DSP cores at up to 1.4 GHz
- 2x/4x ARM Cortex-A15
- 1 MB of local L2 cache RAM per C66x DSP core
- 4 MB L2 shared across all ARM cores
Large on-chip and off-chip memory:
- Multicore Shared Memory Controller provides low-latency, high-bandwidth memory access
- 6 MB shared L2 on-chip
- 2 x 72-bit DDR3 (with ECC), 16 GB total addressable, DIMM support (4 ranks total)
KeyStone multicore architecture and acceleration:
- Multicore Navigator, TeraNet, HyperLink
- 1GbE network coprocessor (IPv4/IPv6)
- Crypto engine (IPSec, SRTP)
Peripherals:
- 4-port 1G Layer 2 Ethernet switch
- 2x PCIe, 1x4 SRIO 2.1, EMIF16, USB 3.0, UART x2, SPI, I2C
Power and package:
- 15-25 W depending upon DSP cores, speed, temperature and other factors
- 28 nm, 40 mm x 40 mm package
Block diagram contents: 4 x ARM A15 with 4 MB L2; 8 x C66x with 1 MB L2 each; MSMC with 6 MB; 2 x 64/72-bit DDR3; Multicore Navigator; TeraNet; high-speed SERDES (PCIe x2, USB3, HyperLink x2, SRIO x4, 1GbE x4); EMIF and I/O (16-bit EMIF, UART x2, SPI x3, I2C x3); system elements (power manager, debug, SysMon, EDMA); network AccelerationPacs (packet accelerator, security accelerator, 5-port 1GbE switch)

31 HP Moonshot: KeyStone II software-defined server
The essential foundation for the new style of IT
- 45 hot-plug cartridges: compute, storage, or combination; x86, ARM, or accelerator
- Single-server cartridges = 45 servers per chassis
- Quad-server cartridges = 180 servers per chassis (future capability)
- Dual low-latency switches: HP Moonshot-45G Switch Module (180 x 1Gb downlinks)

32 TI KeyStone II HPC System
- ATCA-based
- RapidIO switching
- 8 TFlop/blade
- 100 GByte/blade
- Up to 14 blades/chassis

33 SOFTWARE AND TOOLS

34 Multicore Software Vision
- Multicore ARM: same user experience as x86 devices; mainline SMP Linux, standard Linux tools and distributions, MPI
- Multicore ARM + multicore DSP: augment with TI differentiation; leverage standard accelerator models (OpenCL, OpenMP Accel)
- Multicore DSP: make the native DSP experience easy; extensive toolbox for advanced programmers (OpenMP, libraries, Navigator run-time, RTOS & drivers, IPC, and more)
- Development environment: GDB etc. for apps; debug environment for ARM & DSP; Eclipse embedded development & debug environment; instrumentation and trace leveraging embedded hardware capability

35 Fast, Effective, Open Tools
- Optimized C compiler; multicore parallel programming models
- Productive IDE: Code Composer Studio, Eclipse based, hosts TI and 3rd-party tools for easy debug; advanced analysis and visualization speed software development
- Efficient Multicore Software Development Kit: available on both DSP and ARM with free source code; HLOS/RTOS, optimized libraries, algorithms and drivers, multicore runtime, protocol stack and application demos

36 Discrete to Integrated
Getting from here (x86 + DSP) to here (Hawking, 28 nm)
Parallel computing strategies with TI DSPs:
1. Get started quickly with optimized libraries and a simple host/DSP interface: TI libraries (BLAS, DSPLIB, FFT), 3rd-party libraries (VSIPL), user libraries (custom functions)
2. Offload code simply with directive-based programming: OpenMP Accelerator Model
3. Create optimized functions using standard programming, vector programming and OpenMP programming: user libraries, TI tools

37 There's No Single Answer
Different parallelism models:
o Task parallel or data parallel
o Coarse- or fine-grained parallelism
o Large or small data sets
Variety of system baselines:
o Current multicore systems
o Variety of methods of expressing and managing parallelism
o Already or easily partitioned

38 Hierarchy of Multicore Engagement Options
Options, ordered from increasing abstraction and productivity to increasing control and performance:
- Multicore libraries
- MPI
- OpenMP (Accelerator Model)
- OpenCL for accelerators
- OpenMP (homogeneous)
- Explicit IPC
Development approach: engage at the most abstract level, then incrementally optimize to achieve the required performance or power efficiency.
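
As a small illustration of the homogeneous OpenMP option, the sketch below parallelises a dot product with a single work-sharing pragma. It is a generic OpenMP example rather than TI-specific code, and the function name `dot` is arbitrary. Built without OpenMP support the pragma is ignored and the loop runs serially, which fits the approach of starting at an abstract level and optimizing only where needed.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with an OpenMP work-sharing loop. Each thread keeps a
 * private partial sum; the reduction clause combines them at the end. */
double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    long i;

    assert(x != NULL && y != NULL);
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

If this level does not deliver the required performance, the same loop body can later be moved behind an accelerator directive or rewritten against explicit IPC without changing its callers.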

39 Cooperative Parallel Programming (brief history of expression APIs/languages)
Diagram: MPI communication APIs connect Node 0, Node 1, ..., Node N

40 Cooperative Parallel Programming (brief history of expression APIs/languages)
Diagram: MPI communication APIs between nodes; OpenMP threads across the CPUs within each node

41 Cooperative Parallel Programming (brief history of expression APIs/languages)
Diagram: MPI communication APIs between nodes; OpenMP threads across the CPUs within each node; CUDA/OpenCL for the GPU in each node

42 Cooperative Parallel Programming: on KeyStone II as an example
Diagram: MPI communication APIs between nodes; OpenMP threads across the CPUs within each node; OpenCL for the DSP in each node

43 Cooperative Parallel Programming: on KeyStone II as an alternative example
Diagram: MPI communication APIs between the CPUs; OpenCL in each node

44 Cooperative Parallel Programming: on KeyStone II as an alternative example
Diagram: MPI communication APIs between the CPUs; OpenCL in each node, with OpenMP beneath it

45 Cooperative Parallel Programming: on KeyStone II as an alternative example
Diagram: MPI communication APIs between nodes; OpenMP Accel in each node, above the CPUs

46 OpenMP Accelerator Model: Target Construct
Pragma-based model to dispatch computation from host to accelerator (K2H ARMs to DSPs):

    void foo(int *in1, int *in2, int *out1, int count)
    {
        /* Send the inputs to the accelerator and bring the result back */
        #pragma omp target map(to: in1[0:count], in2[0:count], count) \
                           map(from: out1[0:count])
        {
            #pragma omp parallel shared(in1, in2, out1)
            {
                int i;
                #pragma omp for
                for (i = 0; i < count; i++)
                    out1[i] = in1[i] + in2[i];
            }
        }
    }

Extends OpenMP by adding:
- A target construct to indicate regions to be dispatched
- A map clause to indicate data transfer between host and accelerator; this does not have to be a copy (e.g. shared memory)
- Clauses to indicate that variables/functions reside on host, device, or both
Target regions can contain OpenMP constructs.
TI co-chairs the OpenMP accelerator model sub-committee and played a significant role in the spec definition.
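
A complete host-side program makes it easier to try the target construct. The sketch below is a minimal example under a few assumptions: it uses OpenMP 4.0 array sections in the form [lower-bound:length], so `count` elements are written as `[0:count]`; the names `vec_add` and `count_mismatches` are made up here; and on a toolchain without accelerator support the region simply runs on the host (or the pragmas are ignored), so the check still passes.

```c
#include <assert.h>
#include <stdlib.h>

/* Element-wise add, dispatched to the accelerator when one is available. */
void vec_add(const int *in1, const int *in2, int *out1, int count)
{
    assert(count > 0);
    #pragma omp target map(to: in1[0:count], in2[0:count]) \
                       map(from: out1[0:count])
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < count; i++)
            out1[i] = in1[i] + in2[i];
    }
}

/* Fill two inputs, run vec_add, and count wrong results.
 * Returns 0 when every element is correct, -1 on allocation failure. */
int count_mismatches(int count)
{
    int *a = malloc((size_t)count * sizeof *a);
    int *b = malloc((size_t)count * sizeof *b);
    int *c = malloc((size_t)count * sizeof *c);
    int bad = 0;

    if (!a || !b || !c) {
        free(a);
        free(b);
        free(c);
        return -1;
    }
    for (int i = 0; i < count; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }
    vec_add(a, b, c, count);
    for (int i = 0; i < count; i++)
        if (c[i] != 3 * i)
            bad++;
    free(a);
    free(b);
    free(c);
    return bad;
}
```

Verifying on the host after the region ends is deliberate: the `map(from: ...)` clause is what guarantees the output buffer is visible to the host again at that point.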

47 EPCC Micro Benchmark Data
Charts: parallel-for overheads and barrier construct overheads, in cycles, comparing OpenMP runtime versions
- OpenMP Runtime 2.0 significantly reduces (2.5x) the overhead of OpenMP constructs such as parallel for and barrier
- This makes it feasible to use OpenMP for parallel regions with smaller granularity, i.e. fewer cycles
- The optimized OpenMP runtime is built on OpenEM and libgomp (the GCC OpenMP library)
- It does not require BIOS/IPC/XDC; however, the runtime will co-exist with BIOS etc. if present in the user application

48 WHERE ARE WE GOING

49 High Performance Compute Moving to Mainstream
- Oil and gas exploration
- Bioscience
- Big data mining
- Weather forecast
- Financial trading
- Electronics design automation
- Defense

50 HPC System and Architecture Evolution
- Compute: more computation capacity; heterogeneous processing, high level of parallelism
- Data movement: more memory and memory bandwidth; reducing the memory and I/O bottleneck
- Connectivity: more networking and I/O capability; efficient networking, higher I/O bandwidth
- Performance/W: less power consumption; increasing power efficiency
High performance, power efficiency, real time, scalability, safety, reliability

51 TI Continues to Invest in DSP
DSP leadership and innovation, a timeline from 1982 through 1995 to the next-generation DSP:
- C1x (1982): first 16-bit commercial DSP
- C5000: 16-bit fixed point, ultra low power
- C6000: 32-bit, fixed and/or floating point
- C66xx: 12.8 GFLOPS/W
- Next Gen DSP

52 High Performance Memory Interfaces
Hybrid Memory Cube (HMC):
- High-bandwidth serialized interface
- Large DRAM memory space
- Suitable for networking and applications that are latency tolerant
- Lower mW/Gbps
High Bandwidth Memory (HBM) interface:
- Interposer/TSV stacks memory into the SoC package
- Wide interface to SoC cores
- Suitable for core-centric access that requires large bandwidth and low latency
- Higher mW/Gbps

53 High BW I/O and Network on Chip
- Multicore Navigator enables zero copy and a common multicore programming model
- Modular, scalable networking solution
- HyperLink enables 50 Gbps throughput with minimum latency and software overhead
- TeraNet enables a high-throughput, non-blocking network on chip
Diagram blocks: TeraNet, Multicore Navigator, packet DMA, packet accelerator, security accelerator, Ethernet switch, HyperLink, other I/O (PCIe, JESD204B, SRIO, USB)

54 Holistic Power Optimization: From Board to Transistor Level
- Board level: memory integration (e.g. HMC, HBM); in-package voltage regulation; interposer/TSV
- Device level: low-voltage operation; DVFS; retention; bias; signal transport; on-die SerDes
- Transistor level: FinFET, significant leakage current reduction with lower Vdd
Diagram: ASIC 1 and ASIC 2 on a silicon interposer

55 Overcoming Integration Complexities
- Industry-standard ecosystem: in-house and 3rd-party IP (interface IP, core IP and soft IP)
- Static power management: power domains on processor cores, accelerators, and I/O
- Dynamic power management: I/O voltages, core logic AVS and DVFS domains, SRAM supplies
- System management: reset, clocking, DFT, interrupts, interconnect fabric
- Board-level feature integration: asynchronous clocking, scalable clocks, fixed-frequency clocks; A/D & D/A converters, RF integration, voltage regulation

56 Thank you
Texas Instruments
Design is a strategic asset


More information

Zynq-7000 All Programmable SoC Product Overview

Zynq-7000 All Programmable SoC Product Overview Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform

More information

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA

More information

Markets Demanding More Performance

Markets Demanding More Performance The Tile Processor Architecture: Embedded Multicore for Networking and Digital Multimedia Tilera Corporation August 20 th 2007 Hotchips 2007 Markets Demanding More Performance Networking market - Demand

More information

C66x KeyStone Training HyperLink

C66x KeyStone Training HyperLink C66x KeyStone Training HyperLink 1. HyperLink Overview 2. Address Translation 3. Configuration 4. Example and Demo Agenda 1. HyperLink Overview 2. Address Translation 3. Configuration 4. Example and Demo

More information

Using OpenMP to Program. Systems

Using OpenMP to Program. Systems Using OpenMP to Program Embedded Heterogeneous Systems Eric Stotzer, PhD Senior Member Technical Staff Software Development Organization, Compiler Team Texas Instruments February 16, 2012 Presented at

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio

More information

Intel Enterprise Processors Technology

Intel Enterprise Processors Technology Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology

More information

Octopus: A Multi-core implementation

Octopus: A Multi-core implementation Octopus: A Multi-core implementation Kalpesh Sheth HPEC 2007, MIT, Lincoln Lab Export of this products is subject to U.S. export controls. Licenses may be required. This material provides up-to-date general

More information

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink Robert Kaye 1 Agenda Once upon a time ARM designed systems Compute trends Bringing it all together with CoreLink 400

More information

Altera SDK for OpenCL

Altera SDK for OpenCL Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group

More information

The World Leader in High Performance Signal Processing Solutions. DSP Processors

The World Leader in High Performance Signal Processing Solutions. DSP Processors The World Leader in High Performance Signal Processing Solutions DSP Processors NDA required until November 11, 2008 Analog Devices Processors Broad Choice of DSPs Blackfin Media Enabled, 16/32- bit fixed

More information

KeyStone II. CorePac Overview

KeyStone II. CorePac Overview KeyStone II ARM Cortex A15 CorePac Overview ARM A15 CorePac in KeyStone II Standard ARM Cortex A15 MPCore processor Cortex A15 MPCore version r2p2 Quad core, dual core, and single core variants 4096kB

More information

Building blocks for 64-bit Systems Development of System IP in ARM

Building blocks for 64-bit Systems Development of System IP in ARM Building blocks for 64-bit Systems Development of System IP in ARM Research seminar @ University of York January 2015 Stuart Kenny stuart.kenny@arm.com 1 2 64-bit Mobile Devices The Mobile Consumer Expects

More information

Software Driven Verification at SoC Level. Perspec System Verifier Overview

Software Driven Verification at SoC Level. Perspec System Verifier Overview Software Driven Verification at SoC Level Perspec System Verifier Overview June 2015 IP to SoC hardware/software integration and verification flows Cadence methodology and focus Applications (Basic to

More information

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable

More information

Embedded Systems: Hardware Components (part I) Todor Stefanov

Embedded Systems: Hardware Components (part I) Todor Stefanov Embedded Systems: Hardware Components (part I) Todor Stefanov Leiden Embedded Research Center Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded System

More information

Pedraforca: a First ARM + GPU Cluster for HPC

Pedraforca: a First ARM + GPU Cluster for HPC www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez We ve hit the power wall ALL computers are limited by power consumption Energy-efficient approaches Multi-core Fujitsu

More information

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)

More information

CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker

CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker CUDA on ARM Update Developing Accelerated Applications on ARM Bas Aarts and Donald Becker CUDA on ARM: a forward-looking development platform for high performance, energy efficient hybrid computing It

More information

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Lesson 6 Intel Galileo and Edison Prototype Development Platforms. Chapter-8 L06: "Internet of Things ", Raj Kamal, Publs.: McGraw-Hill Education

Lesson 6 Intel Galileo and Edison Prototype Development Platforms. Chapter-8 L06: Internet of Things , Raj Kamal, Publs.: McGraw-Hill Education Lesson 6 Intel Galileo and Edison Prototype Development Platforms 1 Intel Galileo Gen 2 Boards Based on the Intel Pentium architecture Includes features of single threaded, single core and 400 MHz constant

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

S2C K7 Prodigy Logic Module Series

S2C K7 Prodigy Logic Module Series S2C K7 Prodigy Logic Module Series Low-Cost Fifth Generation Rapid FPGA-based Prototyping Hardware The S2C K7 Prodigy Logic Module is equipped with one Xilinx Kintex-7 XC7K410T or XC7K325T FPGA device

More information

High performance, power-efficient DSPs based on the TI C64x

High performance, power-efficient DSPs based on the TI C64x High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research

More information

Big Data mais basse consommation L apport des processeurs manycore

Big Data mais basse consommation L apport des processeurs manycore Big Data mais basse consommation L apport des processeurs manycore Laurent Julliard - Kalray Le potentiel et les défis du Big Data Séminaire ASPROM 2 et 3 Juillet 2013 Presentation Outline Kalray : the

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM. Join the Conversation #OpenPOWERSummit

Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM. Join the Conversation #OpenPOWERSummit Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM Join the Conversation #OpenPOWERSummit Moral of the Story OpenPOWER is the best platform to

More information

All About the Cell Processor

All About the Cell Processor All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

The Many Dimensions of SDR Hardware

The Many Dimensions of SDR Hardware The Many Dimensions of SDR Hardware Plotting a Course for the Hardware Behind the Software Sept 2017 John Orlando Epiq Solutions LO RFIC Epiq Solutions in a Nutshell Schaumburg, IL EST 2009 N. Virginia

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

High Performance Memory in FPGAs

High Performance Memory in FPGAs High Performance Memory in FPGAs Industry Trends and Customer Challenges Packet Processing & Transport > 400G OTN Software Defined Networks Video Over IP Network Function Virtualization Wireless LTE Advanced

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

SoC Platforms and CPU Cores

SoC Platforms and CPU Cores SoC Platforms and CPU Cores COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University

More information

Elaborazione dati real-time su architetture embedded many-core e FPGA

Elaborazione dati real-time su architetture embedded many-core e FPGA Elaborazione dati real-time su architetture embedded many-core e FPGA DAVIDE ROSSI A L E S S A N D R O C A P O T O N D I G I U S E P P E T A G L I A V I N I A N D R E A M A R O N G I U C I R I - I C T

More information

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD THE PROGRAMMER S GUIDE TO THE APU GALAXY Phil Rogers, Corporate Fellow AMD THE OPPORTUNITY WE ARE SEIZING Make the unprecedented processing capability of the APU as accessible to programmers as the CPU

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming

More information

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

SoC FPGAs. Your User-Customizable System on Chip Altera Corporation Public

SoC FPGAs. Your User-Customizable System on Chip Altera Corporation Public SoC FPGAs Your User-Customizable System on Chip Embedded Developers Needs Low High Increase system performance Reduce system power Reduce board size Reduce system cost 2 Providing the Best of Both Worlds

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

Classification of Semiconductor LSI

Classification of Semiconductor LSI Classification of Semiconductor LSI 1. Logic LSI: ASIC: Application Specific LSI (you have to develop. HIGH COST!) For only mass production. ASSP: Application Specific Standard Product (you can buy. Low

More information

Introducing the AM57x Sitara Processors from Texas Instruments

Introducing the AM57x Sitara Processors from Texas Instruments Introducing the AM57x Sitara Processors from Texas Instruments ARM Cortex-A15 solutions for automation, HMI, vision, analytics, and other industrial and high-performance applications. Embedded Processing

More information

Building heterogeneous networks of green base stations on TI s KeyStone II architecture

Building heterogeneous networks of green base stations on TI s KeyStone II architecture WHITE PAPER Introduction Zhihong Lin, Strategic Marketing Manager, Wireless Base Station Infrastructure Texas Instruments Fueled by the ability to connect to any device anytime and anywhere, both interpersonal

More information

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Maximizing heterogeneous system performance with ARM interconnect and CCIX Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable

More information

Introduction to Xeon Phi. Bill Barth January 11, 2013

Introduction to Xeon Phi. Bill Barth January 11, 2013 Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014 Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline

More information

Massively Parallel Processor Breadboarding (MPPB)

Massively Parallel Processor Breadboarding (MPPB) Massively Parallel Processor Breadboarding (MPPB) 28 August 2012 Final Presentation TRP study 21986 Gerard Rauwerda CTO, Recore Systems Gerard.Rauwerda@RecoreSystems.com Recore Systems BV P.O. Box 77,

More information

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006 The Use Of Virtual Platforms In MP-SoC Design Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006 1 MPSoC Is MP SoC design happening? Why? Consumer Electronics Complexity Cost of ASIC Increased SW Content

More information

SiFive Freedom SoCs: Industry s First Open-Source RISC-V Chips

SiFive Freedom SoCs: Industry s First Open-Source RISC-V Chips SiFive Freedom SoCs: Industry s First Open-Source RISC-V Chips Yunsup Lee Co-Founder and CTO High Upfront Cost Has Killed Innovation Our industry needs a fundamental change Total SoC Development Cost Design

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Each Milliwatt Matters

Each Milliwatt Matters Each Milliwatt Matters Ultra High Efficiency Application Processors Govind Wathan Product Manager, CPG ARM Tech Symposia China 2015 November 2015 Ultra High Efficiency Processors Used in Diverse Markets

More information

OCP Engineering Workshop - Telco

OCP Engineering Workshop - Telco OCP Engineering Workshop - Telco Low Latency Mobile Edge Computing Trevor Hiatt Product Management, IDT IDT Company Overview Founded 1980 Workforce Approximately 1,800 employees Headquarters San Jose,

More information

HotChips An innovative HD video and digital image processor for low-cost digital entertainment products. Deepu Talla.

HotChips An innovative HD video and digital image processor for low-cost digital entertainment products. Deepu Talla. HotChips 2007 An innovative HD video and digital image processor for low-cost digital entertainment products Deepu Talla Texas Instruments 1 Salient features of the SoC HD video encode and decode using

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

Introduction to GPU computing

Introduction to GPU computing Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU

More information

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &

More information