Kalray MPPA Massively Parallel Processor Array

Size: px
Start display at page:

Download "Kalray MPPA Massively Parallel Processor Array"

Transcription

1 Kalray MPPA Massively Parallel Processor Array Engineering a Manycore Processor Platform for Mission-Critical Applications Benoît Dupont de Dinechin, CTO

2 Kalray Solutions Target Two Strategic Markets Cloud and Data Center acceleration Kalray is the ideal co-processor to off-load x86 or accelerate real-time or compute intensive functions Domains of application: Networking, Storage, Big Data analytics, Cybersecurity, Deep learning High Performance Embedded Computing Embedded applications require higher performance without increasing power consumption and more integration of functions including those constrained by real-time Domains of application: aerospace, automotive, transport, energy 2016 Kalray SA All Rights Reserved MCSoC

3 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC

4 Evolution of Sensor Fusion for Autonomous Driving (figures from M. Fausten keynote presentation at ESWEEK 2015) 2016 Kalray SA All Rights Reserved MCSoC

5 Number Crunching in Sensor Fusion Units (figures from M. Fausten keynote presentation at ESWEEK 2015) 2016 Kalray SA All Rights Reserved MCSoC

6 Mission-Critical Computing Dependability requirements Detect, isolate or mitigate errors that occur in digital circuits Transient electrical problems, soft errors, physical damage Timing requirements Time constraints associated with information manipulation Execution time determinism, predictability, composability Certification of worst-case response time by static analysis 2016 Kalray SA All Rights Reserved MCSoC

7 Execution Timing on a Multicore Processor Intra-core nondeterministic timing Speculative execution, branch prediction Cache content Inter-core interferences Concurrent accesses to caches, busses, and peripheral devices Cache coherence invalidations Main issue Growing gap between average execution time and static WCET 2016 Kalray SA All Rights Reserved MCSoC

8 Classic Multicore Memory Hierarchy Challenge: managing interference between cores 2016 Kalray SA All Rights Reserved MCSoC

9 Embedded Multicore Memory Hierarchy Challenge: programmability of DMA and private memories 2016 Kalray SA All Rights Reserved MCSoC

10 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC

11 MPPA -256 Manycore Architecture Highlights DSP type of processing Energy efficiency Timing predictability Software programmability CPU ease of programming C/C++ GNU environment 32-bit/64-bit, little-endian Rich operating system (Linux with dynamic loading&linking) Integrated many-core processor 32 management cores on chip 256 application cores on chip High-performance I/O Scalable parallel computing MPPA processors can be tiled together through NoC extension Run-time support for pooling the external DDR memory resources Courtesy Altera 2016 Kalray SA All Rights Reserved MCSoC

12 MPPA -256 Bostan Processor Architecture Manycore Processor Compute Cluster VLIW Core 16 compute clusters 2 I/O clusters each with quad-core CPUs, DDR3, 4 Ethernet 10G and 8 PCIe Gen3 Data and control networks-on-chip Distributed memory architecture 634 GFLOPS SP for 600Mhz 16 user cores + 1 system core NoC Tx and Rx interfaces Debug & Support Unit (DSU) 2 MB multi-banked shared memory 77GB/s Shared Memory BW 16 cores SMP System 32-bit or 64-bit addresses 5-issue VLIW architecture MMU + I&D cache (8KB+8KB) 32-bit/64-bit IEEE FMA FPU Tightly coupled crypto co-processor 2.4 GFLOPS SP per 2016 Kalray SA All Rights Reserved MCSoC

13 MPPA -256 Bostan Processor VLIW cores / 18 address spaces / 400 MHz 800 MHz Physical characteristics TSMC CMOS 28HP 100µW/MHz per core+ L1 caches 2W to 3W leakage Processor interfaces 2x DDR3 Memory interfaces 2x PCIe Gen3 8-lane interface 8x 1G/10G Ethernet or 2x 40G Ethernet interfaces SPI/I2C/UART interfaces Universal Static Memory Controller (NAND/NOR/SRAM) GPIOs with Direct NoC Access NoC extension through Interlaken interface (NoCX) 2016 Kalray SA All Rights Reserved MCSoC

14 MPPA Processor Co-Design for Avionics U. Saarland / AbsIntGMBH recommendations on VLIW core and cache micro-architecture design AbsInt provides the ait static timing analysis tool used to certify flight control systems at Airbus AbsInt ait tool also targets the Kalray VLIW cores Architecture design with a focus on timing predictability Core level: micro-architecture Fully timing compositional core LRU caches and write buffer Cache bypass memory opcodes Cluster level: multi-banked shared memory Core-private buses for memory bank access Address interleaving or blocking across banks Processor level: NoC/DDR guaranteed services NoC minimum bandwidth & maximum latency DDR controller configurations and address mapping 2016 Kalray SA All Rights Reserved MCSoC

15 MPPA -256 Bostan VLIW Core Data Path 5-issue, polycyclic property Unified 10 read, 5 write 64x32-bit register file 64-bit data in register pairs Shared ACC read port Unified MAU and FPU LSU and MAU also execute single-cycle ALU instructions 2016 Kalray SA All Rights Reserved MCSoC

16 MPPA Bostan VLIW Core Energy Efficiency Kalray Bostan pj per Instruction on Matrix Multiply 117 ARM Cortex A7 (Little)* 225 GPU** 497 ARM Cortex A15 (Big)* Kalray Bostan at 141 pj per bundle with 3.14 instructions average per bundle *: An Instruction Level Energy Characterization of ARM Processors FORTH-ICS/TR-450, March 2015 **: Source: Measured on Matrix Multiply EPI Test + Source Bill Dally, To ExaScale and Beyond - NVidia 2016 Kalray SA All Rights Reserved MCSoC

17 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC

18 MPPA -256 Bostan Network-on-Chip (NoC) Dual 2D-torus NoC Direct, wormhole switching D-NoC: high-bandwidth RDMA C-NoC: low-latency mailboxes 4B/cycle per link direction per NoC Nx10Gb/s NoC extensions for connection to FPGA or other MPPA Predictability Data NoC is configured by selecting routes and injection parameters Routing ensure deadlock-free traffic Injection parameters are the (σ,ρ) or (burst, rate) of network calculus 2016 Kalray SA All Rights Reserved MCSoC

19 Wormhole Switching Illustrated A packet is composed of several flits The header flits governs the route The payload flits follow the header in a pipeline fashion When the required channel is busy, the hop flow control blocks the trailing flits and they stay in flit buffers along the established route 9 B A 2016 Kalray SA All Rights Reserved MCSoC

20 Wormhole Switching Network Issues Complex to implement May betruefor input queueing & output matching (e.g. islip) The MPPA NoC routers only include demultiplexers, output queues and RR arbiters Prone to deadlocking In thisexample, the redflow cannotuse R3 R2 becausethe blueflow isusingit Likewise, the blue flow needs R1 R4 heldby the redflow Deadlock requires full queues 2016 Kalray SA All Rights Reserved MCSoC

21 Deadlock-Free Routing Wormhole Switching Deadlock results from circuits of agents and resources connected by a wait-for relation [Dally & Seitz 1998] Wormhole switching: agents are packets; resources are link buffers Links between routers and links internal to routers( turns ) Solutions whenthe network topologyisa 2D mesh Dimension order(x-y on 2D meshes) Turnmodel [Glass & Ni 1994] (not the sameas turnprohibition ) Odd-Even[Chiu 2000] (better path diversity than turn model) Hamiltonian Odd-Even[Bahrebar& Stroobandt 2015] (multicasting) Strategy for the MPPA NoC management Isolate a 2D mesh in topology and apply deadlock-free routing 2016 Kalray SA All Rights Reserved MCSoC

22 Hamiltonian Odd-Even Prohibited Turns Even rows East-South turn prohibited North-West turn prohibited Odd rows North-East turn prohibited West-South turn prohibited Odd Odd Even Even Odd Odd Even Even 2016 Kalray SA All Rights Reserved MCSoC

23 Hamiltonian Odd-Even on the MPPA NoC Routing between compute clusters Routes generated assuming a 6x6 mesh Impossible routes are discarded Routing between I/O and compute cluster Always possible, thanks to the 4 NoC nodes per I/O cluster I/O Ethernet 0 I/O DDR 1 I/O DDR 0 I/O Ethernet Kalray SA All Rights Reserved MCSoC

24 MPPA NoC Path Selection Problem For each flow, select a single path among those proposed by adaptive routing so as to maximize utility If multi-path allowed, reduces to a multi-commodity flow problem with a direct solution Adding the single path constraints makes the problem NP-hard Practical instances are solved with mixed integer programming Utility function is the max-min fairness of flow rates Maximize the lowest flow rate, then the next lowest flow rate etc. Numerical results for all pairs of flows between 8 clusters (55 flows) 2016 Kalray SA All Rights Reserved MCSoC Flow Route Bandwidth F48 R F49 R1 0 R R3 0 R4 0 F50 R F51 R R2 0

25 Principle of the MPPA NoC Guaranted Services Data NoC packet injection implements a (σ,ρ)regulation No more than σ+ρ(t-s) flits are injected for any interval [s,t] Cumulative flit count L ~ρ σ Application of Deterministic Network Calculus prevents NoC congestion and provides bounds on end-to-end delays time Network Calculus application requires feed-forward flows Deadlock-free wormhole switching implies feed-forward flows 2016 Kalray SA All Rights Reserved MCSoC

26 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC

27 Complete AccessCore SDK Standard C/C++ Programming Environment C - Low Level/ Lib Programming DSP Style Simulators & Profilers, Debuggers & System Trace Operating Systems & Device Drivers AccessLib Optimized Libraries C - POSIX-Level Programming CPU Style OpenCL Programming GPU Style OpenDataPlane Open API for networking 2016 Kalray SA All Rights Reserved MCSoC

28 MPPA Distributed Memory Architecture Challenge Approaches to communication Load/Store to unified memory Remote DMA to unified memory Remote DMA to private memories Send/receive to private memories Approaches to synchronization Locks, semaphores, conditional variables Barriers, collectives (reduce/scan/all2all) Remote atomic operations Remote atomic queues Classic HPC clusters vs MPPA The working memory of HPC clusters is the collection of identical node memories The working memory of the MPPA is the collection of SMEMs backed by the DDRs 2016 Kalray SA All Rights Reserved MCSoC

29 MPPA Distributed Memory Architecture + Runtime MPPA local memory benefits Local memory accesses are more energyefficient than a Last-Level Cache (LLC) Local memory access interferences can be managed for time-critical applications MPPA DSM for implicit DDR access Escape code size and data size restrictions Mostly used for application porting Some implicit global memory accesses cannot be converted to remote DMA MPPA Asynchronous Operations Using of local memories and sharing DDR memories through remote DMA over NoC Remote atomic operations and queues Co-exist with the DSM or run stand-alone 2016 Kalray SA All Rights Reserved MCSoC

30 MPPA Asynchronous Operations Ordering 2016 Kalray SA All Rights Reserved MCSoC

31 MPPA Software Stack Overview Language Libraries GNU C, C++, Fortran POSIX threads + OpenMP Rich OS RTOS RPC, Asynchronous Hypervisor (Exokernel) Operations, DSM Low-level programming interfaces MPPA platform OpenCL C OpenCL runtime 2016 Kalray SA All Rights Reserved MCSoC

32 SCADE Code Generation for the MPPA Safety-critical control-command applications Model-based programming using SCADE Suite from Esterel Technologies Complemented with static timing analysis of binary code (ait from AbsInt) Motivations for multicore and manycore execution Distribute the compute load across cores and reduce memory interferences Effective implementation of multi-rate harmonic applications Envision use of fast Model Predictive Control (MPC) techniques 2016 Kalray SA All Rights Reserved MCSoC

33 emcos for MPPA Processors esolemcosis the world s first many-core RTOS for embedded use Runson the MPPA clusters on top of Low-Levelprogramming 2016 Kalray SA All Rights Reserved MCSoC

34 Automotive Software Development Kit Vision Number-crunching applications Use OpenCL hosted on an I/O cluster quad-core running Linux Partition the compute clusters using OpenCL subdevices Enable OpenCL Data Parallel and Task Parallel models Inside Task Parallel model, support Posix programming C/C++/Pthread/FILE instead of OpenCL-C MPPA Asynchronous operations instead of OpenCL async_copy Control-command applications SCADE Suite code generation for compute clusters Exchange with I/O application part using RDMA through NoC Automotive RTOS & middleware OSEK/VDX RTOS and support of AUTOSAR deployment environment 2016 Kalray SA All Rights Reserved MCSoC

35 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC

36 MPPA Coolidge Processors Compute cluster as CPU Direct access to DDR memory Optional L1 cache coherence Full 64-bit OS support Functional safety Power domains Robust partitioning NoC error protection Security architecture Trust provisioning Secure boot chain Authenticated debug Courtesy Rambus Courtesy Inside Secure 2016 Kalray SA All Rights Reserved MCSoC

37 MPPA Coolidge Main Evolutions TSMC 16FFC technology node 50% maximum frequency increase compared to 28nm HPC 40% dynamic power reduction at same frequency compared to 28nm HPC 3 rd generation Kalray VLIW core 1GHz typical core frequency 64-bit 5-issue VLIW core architecture 128-bit load/store unit(16 bytes /cycle) 16-bit, 32-bit and 64-bit IEEE 754 FPU 16KB, 4-way associative L1 cache s 4 privilege levels like in ARMv8 Co-processing Transcendental function acceleration for deep learning and scientific computing Block cipher and hashing acceleration Pixel pre-processing for computer vision Compute cluster 4MB memory capacity, 128-bit busses Local memory / L2 cache partitioning Interleaved or locked memory map RM core as L2 cache controller Unified NoC Dual virtual channel, 128-bit flits RDMA, load/store and control 2D grid topology External memory One DDR4 channel per 4 compute clusters 1 channel on MPPA channels on MPPA -256 High-speed peripherals Ethernet 25Gbps PCIe Gen4 Interlaken 25Gbps 2016 Kalray SA All Rights Reserved MCSoC

38 MPPA Processors on Deep Learning Application Kalray KANN tool interprets Berkeley Caffefiles of trained networks and generates code Network architecture Parameters (filter coefficients) 250 Neural Network (Googlenet - FP32) Focus on low-latency neural network inference Convolutional layer neurons are spread across compute clusters NoC multicasting to copy parameters from DDR to SMEM in compute clusters Fps All data transfers and coordination with MPPA asynchronous operations 0 Kalray** Bostan Nvidia* Tegra Kalray** Coolidge 2016 Kalray SA All Rights Reserved MCSoC

39 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC

40 Lessons from Engineering Manycore Processors Target applications that benefit from manycore processors High performance computing (GPU, MIC) High performance networking (Cavium, Mellanox/Tilera) High integrity embedded computing (robotics, aerospace, defense) Achieving near-data computing (using local memory) is key Required for energy efficiency and timing predictability Local memory is naturally exploited by KPN models of computation Must also support transparent and effective access to global memory Programming models must be adapted Explicit memory hierarchy and relaxed memory consistency OpenCL a first step but local memory exploitation is too restrictive 2016 Kalray SA All Rights Reserved MCSoC

41 Multiple applications Cost efficiency Sensing Networking Running in PARALLEL & in REAL TIME Computing Safety Security 2016 Kalray SA All Rights Reserved MCSoC

42 Thank you KALRAY S.A. Paris - France 86 rue de Paris, Orsay France KALRAY S.A. Grenoble - France 445 rue Lavoisier, Montbonnot France KALRAY INC. Los Altos -USA 4962 El Camino Real Los Altos, CA USA Tel: +33 (0) info@kalray.eu Tel: +33 (0) info@kalray.eu Tel: +1 (650) info@kalrayinc.com MPPA, ACCESSCORE and the Kalray logo are trademarks or registered trademarks of Kalray in various countries. All trademarks, service marks, and trade names are the marks of the respective owner(s), and any unauthorized use thereof is strictly prohibited. All terms and prices are indicatives and subject to any modification without notice Kalray SA All Rights Reserved MCSoC

43 MPPA -256 Bostan Backup Slides 2016 Kalray SA All Rights Reserved MCSoC

44 Kalray VLIW Architecture Compared to HP Labs Lx The Lx architecture begat the STMicroelectronics ST200 VLIW family Optimize use of the data memory bandwidth Widening to 64-bit, no alignment restrictions Enable large immediate values in instruction stream All memory accesses may bypass the L1 data cache & write buffer Eliminate DLX ISA features and restrictions Instructions with 3 or 4 source operands, 1 or 2 target operands No aliasing between registers and special resources (LR, zero) Memory addressing modes similar to those of PowerPC Effective floating-point support with Fused Multiply Add Rework if-conversion support Remove Boolean registers and SELECT instructions Use CMOV and conditional load/store instructions Support hardware looping 2016 Kalray SA All Rights Reserved MCSoC

45 MPPA2-256 Bostan Compute Cluster 20 bus masters 16 application cores 1 management core NoCTxand Rx interfaces Debug support unit (DSU) 16-banked shared memory 2MB extensible to 4MB No bus interferences between cores RR arbitration between bus masters Interleaved or blocked address map 2016 Kalray SA All Rights Reserved MCSoC

46 MPPA2-256 Bostan Ethernet Support Ethernet as the main highperformance / low-latency IO Integration of Ethernet Rx and Tx to the D-NoC architecture Per Ethernet Rx port 8 classification tables for hardware dispatch Round-robin or classifiedcluster & core allocation Per Ethernet Tx port 64 independent Tx FIFOs Weighted round-robin between Tx FIFOs Flow control between clusters and Tx FIFOs 2016 Kalray SA All Rights Reserved MCSoC

Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013

Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013 Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013 The End of Dennard MOSFET Scaling Theory 2013 Kalray SA All Rights Reserved MPSoC

More information

At the heart of a new generation of data center infrastructures and appliances. Sept 2017

At the heart of a new generation of data center infrastructures and appliances. Sept 2017 / At the heart of a new generation of data center infrastructures and appliances Sept 2017 VIRTUALIZED DATACENTER: THE BLENDER EFFECT FOR STORAGE I/O OPERATIONS 10, 000 VMs MIOPs in random Aggregate switch

More information

Big Data mais basse consommation L apport des processeurs manycore

Big Data mais basse consommation L apport des processeurs manycore Big Data mais basse consommation L apport des processeurs manycore Laurent Julliard - Kalray Le potentiel et les défis du Big Data Séminaire ASPROM 2 et 3 Juillet 2013 Presentation Outline Kalray : the

More information

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013 A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company

More information

Tile Processor (TILEPro64)

Tile Processor (TILEPro64) Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth

More information

Porting Linux to a New Architecture

Porting Linux to a New Architecture Embedded Linux Conference 2014 Porting Linux to a New Architecture Marta Rybczyńska May 1 st, 2014 Different Types of Porting New board New processor from existing family New architecture 2010-2014 Kalray

More information

Parallel Code Generation of Synchronous Programs for a Many-core Architecture

Parallel Code Generation of Synchronous Programs for a Many-core Architecture Parallel Code Generation of Synchronous Programs for a Many-core Architecture Amaury Graillat Supervisors: Reviewers: Pascal Raymond (Verimag), Matthieu Moy (LIP) Benoît Dupont de Dinechin (Kalray) Jan

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Maximizing heterogeneous system performance with ARM interconnect and CCIX Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable

More information

Hercules ARM Cortex -R4 System Architecture. Processor Overview

Hercules ARM Cortex -R4 System Architecture. Processor Overview Hercules ARM Cortex -R4 System Architecture Processor Overview What is Hercules? TI s 32-bit ARM Cortex -R4/R5 MCU family for Industrial, Automotive, and Transportation Safety Hardware Safety Features

More information

Messaging Overview. Introduction. Gen-Z Messaging

Messaging Overview. Introduction. Gen-Z Messaging Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional

More information

Each Milliwatt Matters

Each Milliwatt Matters Each Milliwatt Matters Ultra High Efficiency Application Processors Govind Wathan Product Manager, CPG ARM Tech Symposia China 2015 November 2015 Ultra High Efficiency Processors Used in Diverse Markets

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.

More information

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems Exploring System Coherency and Maximizing Performance of Mobile Memory Systems Shanghai: William Orme, Strategic Marketing Manager of SSG Beijing & Shenzhen: Mayank Sharma, Product Manager of SSG ARM Tech

More information

TDT Appendix E Interconnection Networks

TDT Appendix E Interconnection Networks TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages

More information

Developing deterministic networking technology for railway applications using TTEthernet software-based end systems

Developing deterministic networking technology for railway applications using TTEthernet software-based end systems Developing deterministic networking technology for railway applications using TTEthernet software-based end systems Project n 100021 Astrit Ademaj, TTTech Computertechnik AG Outline GENESYS requirements

More information

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014 Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline

More information

Communication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018

Communication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018 Communication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018 Agenda Motivation Introduction of Safety Components Introduction to ARMv8

More information

ARMv8-A Software Development

ARMv8-A Software Development ARMv8-A Software Development Course Description ARMv8-A software development is a 4 days ARM official course. The course goes into great depth and provides all necessary know-how to develop software for

More information

Porting Linux to a New Architecture

Porting Linux to a New Architecture Embedded Linux Conference Europe 2014 Porting Linux to a New Architecture Marta Rybczyńska October 15, 2014 Different Types of Porting New board New processor from existing family New architecture 2 New

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

Design and Analysis of Time-Critical Systems Timing Predictability and Analyzability + Case Studies: PTARM and Kalray MPPA-256

Design and Analysis of Time-Critical Systems Timing Predictability and Analyzability + Case Studies: PTARM and Kalray MPPA-256 Design and Analysis of Time-Critical Systems Timing Predictability and Analyzability + Case Studies: PTARM and Kalray MPPA-256 Jan Reineke @ saarland university computer science ACACES Summer School 2017

More information

Chapter 5. Introduction ARM Cortex series

Chapter 5. Introduction ARM Cortex series Chapter 5 Introduction ARM Cortex series 5.1 ARM Cortex series variants 5.2 ARM Cortex A series 5.3 ARM Cortex R series 5.4 ARM Cortex M series 5.5 Comparison of Cortex M series with 8/16 bit MCUs 51 5.1

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Copyright 2016 Xilinx

Copyright 2016 Xilinx Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building

More information

The Next Steps in the Evolution of ARM Cortex-M

The Next Steps in the Evolution of ARM Cortex-M The Next Steps in the Evolution of ARM Cortex-M Joseph Yiu Senior Embedded Technology Manager CPU Group ARM Tech Symposia China 2015 November 2015 Trust & Device Integrity from Sensor to Server 2 ARM 2015

More information

Vector Engine Processor of SX-Aurora TSUBASA

Vector Engine Processor of SX-Aurora TSUBASA Vector Engine Processor of SX-Aurora TSUBASA Shintaro Momose, Ph.D., NEC Deutschland GmbH 9 th October, 2018 WSSP 1 NEC Corporation 2018 Contents 1) Introduction 2) VE Processor Architecture 3) Performance

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

PowerPC 740 and 750

PowerPC 740 and 750 368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013 NetSpeed ORION: A New Approach to Design On-chip Interconnects August 26 th, 2013 INTERCONNECTS BECOMING INCREASINGLY IMPORTANT Growing number of IP cores Average SoCs today have 100+ IPs Mixing and matching

More information

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017 VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips 2017 1 TESLA V100 21B transistors 815 mm 2 80 SM 5120 CUDA Cores 640 Tensor Cores 16 GB HBM2 900 GB/s HBM2 300 GB/s NVLink *full GV100

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU Thomas Moscibroda Microsoft Research Onur Mutlu CMU CPU+L1 CPU+L1 CPU+L1 CPU+L1 Multi-core Chip Cache -Bank Cache -Bank Cache -Bank Cache -Bank CPU+L1 CPU+L1 CPU+L1 CPU+L1 Accelerator, etc Cache -Bank

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

SoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik

SoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik SoC Design Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik Chapter 5 On-Chip Communication Outline 1. Introduction 2. Shared media 3. Switched media 4. Network on

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford

Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable

More information

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies Alvin R. Lebeck CPS 220 Admin Homework #5 Due Dec 3 Projects Final (yes it will be cumulative) CPS 220 2 1 Review: Terms Network characterized

More information

A unified multicore programming model

A unified multicore programming model A unified multicore programming model Simplifying multicore migration By Sven Brehmer Abstract There are a number of different multicore architectures and programming models available, making it challenging

More information

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink Robert Kaye 1 Agenda Once upon a time ARM designed systems Compute trends Bringing it all together with CoreLink 400

More information

New ARMv8-R technology for real-time control in safetyrelated

New ARMv8-R technology for real-time control in safetyrelated New ARMv8-R technology for real-time control in safetyrelated applications James Scobie Product manager ARM Technical Symposium China: Automotive, Industrial & Functional Safety October 31 st 2016 November

More information

The Next Steps in the Evolution of Embedded Processors

The Next Steps in the Evolution of Embedded Processors The Next Steps in the Evolution of Embedded Processors Terry Kim Staff FAE, ARM Korea ARM Tech Forum Singapore July 12 th 2017 Cortex-M Processors Serving Connected Applications Energy grid Automotive

More information

Negotiating the Maze Getting the most out of memory systems today and tomorrow. Robert Kaye

Negotiating the Maze Getting the most out of memory systems today and tomorrow. Robert Kaye Negotiating the Maze Getting the most out of memory systems today and tomorrow Robert Kaye 1 System on Chip Memory Systems Systems use external memory Large address space Low cost-per-bit Large interface

More information

POWER7: IBM's Next Generation Server Processor

POWER7: IBM's Next Generation Server Processor POWER7: IBM's Next Generation Server Processor Acknowledgment: This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002 Outline

More information

Implementing Flexible Interconnect Topologies for Machine Learning Acceleration

Implementing Flexible Interconnect Topologies for Machine Learning Acceleration Implementing Flexible Interconnect for Machine Learning Acceleration A R M T E C H S Y M P O S I A O C T 2 0 1 8 WILLIAM TSENG Mem Controller 20 mm Mem Controller Machine Learning / AI SoC New Challenges

More information

Beyond One Core. Multi-Cores. Evolution in Frequency Has Stalled. Source: ISCA 2009 Keynote of Kathy Yelick

Beyond One Core. Multi-Cores. Evolution in Frequency Has Stalled. Source: ISCA 2009 Keynote of Kathy Yelick Beyond One Multi-s Evolution in Frequency Has Stalled Source: ISCA 2009 Keynote of Kathy Yelick 1 Multi-s Network or Bus Memory Private and/or shared caches Bus Network (on chip = NoC) Example: ARM Cortex

More information

Quick Reference Card. Timing and Stack Verifiers Supported Platforms. SCADE Suite 6.3

Quick Reference Card. Timing and Stack Verifiers Supported Platforms. SCADE Suite 6.3 Timing and Stack Verifiers Supported Platforms SCADE Suite 6.3 About Timing and Stack Verifiers supported platforms Timing and Stack Verifiers Supported Platforms SCADE Suite Timing Verifier and SCADE

More information

RapidIO.org Update. Mar RapidIO.org 1

RapidIO.org Update. Mar RapidIO.org 1 RapidIO.org Update rickoco@rapidio.org Mar 2015 2015 RapidIO.org 1 Outline RapidIO Overview & Markets Data Center & HPC Communications Infrastructure Industrial Automation Military & Aerospace RapidIO.org

More information

Timing analysis and timing predictability

Timing analysis and timing predictability Timing analysis and timing predictability Architectural Dependences Reinhard Wilhelm Saarland University, Saarbrücken, Germany ArtistDesign Summer School in China 2010 What does the execution time depends

More information

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small

More information

Architecture and Programming Models of the Kalray MPPA Single Chip Manycore Processor. Benoît Dupont de Dinechin MCC 2013, Halmstad, Sweden

Architecture and Programming Models of the Kalray MPPA Single Chip Manycore Processor. Benoît Dupont de Dinechin MCC 2013, Halmstad, Sweden Architecture and Programming Models of the Kalray MPPA Single Chip Manycore Processor Benoît Dupont de Dinechin MCC 2013, Halmstad, Sweden Outline Kalray MPPA Products MPPA MANYCORE Architecture MPPA ACCESSLIB

More information

Real-Time Mixed-Criticality Wormhole Networks

Real-Time Mixed-Criticality Wormhole Networks eal-time Mixed-Criticality Wormhole Networks Leandro Soares Indrusiak eal-time Systems Group Department of Computer Science University of York United Kingdom eal-time Systems Group 1 Outline Wormhole Networks

More information

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018 Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks

More information

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems 1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna

More information

ARM CORTEX-R52. Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture.

ARM CORTEX-R52. Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture. ARM CORTEX-R52 Course Family: ARMv8-R Cortex-R CPU Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture. Duration: 4 days Prerequisites and related

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Altera SDK for OpenCL

Altera SDK for OpenCL Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group

More information

NoCo: ILP-based Worst-Case Contention Estimation for Mesh Real-Time Manycores

NoCo: ILP-based Worst-Case Contention Estimation for Mesh Real-Time Manycores www.bsc.es NoCo: ILP-based Worst-Case Contention Estimation for Mesh Real-Time Manycores Jordi Cardona 1,2, Carles Hernandez 1, Enrico Mezzetti 1, Jaume Abella 1 and Francisco J.Cazorla 1,3 1 Barcelona

More information

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,

More information

Design and Analysis of Time-Critical Systems Introduction

Design and Analysis of Time-Critical Systems Introduction Design and Analysis of Time-Critical Systems Introduction Jan Reineke @ saarland university ACACES Summer School 2017 Fiuggi, Italy computer science Structure of this Course 2. How are they implemented?

More information

Network Calculus: A Comparison

Network Calculus: A Comparison Time-Division Multiplexing vs Network Calculus: A Comparison Wolfgang Puffitsch, Rasmus Bo Sørensen, Martin Schoeberl RTNS 15, Lille, France Motivation Modern multiprocessors use networks-on-chip Congestion

More information

All About the Cell Processor

All About the Cell Processor All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,

More information

The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016

The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016 The Path to Embedded Vision & AI using a Low Power Vision DSP Yair Siegel, Director of Segment Marketing Hotchips August 2016 Presentation Outline Introduction The Need for Embedded Vision & AI Vision

More information

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation

More information

Netronome NFP: Theory of Operation

Netronome NFP: Theory of Operation WHITE PAPER Netronome NFP: Theory of Operation TO ACHIEVE PERFORMANCE GOALS, A MULTI-CORE PROCESSOR NEEDS AN EFFICIENT DATA MOVEMENT ARCHITECTURE. CONTENTS 1. INTRODUCTION...1 2. ARCHITECTURE OVERVIEW...2

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &

More information

3D WiNoC Architectures

3D WiNoC Architectures Interconnect Enhances Architecture: Evolution of Wireless NoC from Planar to 3D 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan Sep 18th, 2014 Hiroki Matsutani, "3D WiNoC Architectures",

More information

Basic Low Level Concepts

Basic Low Level Concepts Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock

More information

Contents of this presentation: Some words about the ARM company

Contents of this presentation: Some words about the ARM company The architecture of the ARM cores Contents of this presentation: Some words about the ARM company The ARM's Core Families and their benefits Explanation of the ARM architecture Architecture details, features

More information

Mastering The Behavior of Multi-Core Systems to Match Avionics Requirements

Mastering The Behavior of Multi-Core Systems to Match Avionics Requirements www.thalesgroup.com Mastering The Behavior of Multi-Core Systems to Match Avionics Requirements Hicham AGROU, Marc GATTI, Pascal SAINRAT, Patrice TOILLON {hicham.agrou,marc-j.gatti, patrice.toillon}@fr.thalesgroup.com

More information

What Software Requires from Multicore for Certification - overview for the domains avionics, medical in comparison to automotive,...

What Software Requires from Multicore for Certification - overview for the domains avionics, medical in comparison to automotive,... What Software Requires from Multicore for Certification - overview for the domains avionics, medical in comparison to automotive,... Matthias Pruksch Motivation Collision Avoidance System [1] Mobile Diagnosis

More information

Intel Enterprise Processors Technology

Intel Enterprise Processors Technology Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

POWER7: IBM's Next Generation Server Processor

POWER7: IBM's Next Generation Server Processor Hot Chips 21 POWER7: IBM's Next Generation Server Processor Ronald Kalla Balaram Sinharoy POWER7 Chief Engineer POWER7 Chief Core Architect Acknowledgment: This material is based upon work supported by

More information

The University of Texas at Austin

The University of Texas at Austin EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin

More information

«Real Time Embedded systems» Multi Masters Systems

«Real Time Embedded systems» Multi Masters Systems «Real Time Embedded systems» Multi Masters Systems rene.beuchat@epfl.ch LAP/ISIM/IC/EPFL Chargé de cours rene.beuchat@hesge.ch LSN/hepia Prof. HES 1 Multi Master on Chip On a System On Chip, Master can

More information

Building blocks for 64-bit Systems Development of System IP in ARM

Building blocks for 64-bit Systems Development of System IP in ARM Building blocks for 64-bit Systems Development of System IP in ARM Research seminar @ University of York January 2015 Stuart Kenny stuart.kenny@arm.com 1 2 64-bit Mobile Devices The Mobile Consumer Expects

More information

Trustzone Security IP for IoT

Trustzone Security IP for IoT Trustzone Security IP for IoT Udi Maor CryptoCell-7xx product manager Systems & Software Group ARM Tech Forum Singapore July 12 th 2017 Why is getting security right for IoT so important? When our everyday

More information

A Developer's Guide to Security on Cortex-M based MCUs

A Developer's Guide to Security on Cortex-M based MCUs A Developer's Guide to Security on Cortex-M based MCUs 2018 Arm Limited Nazir S Arm Tech Symposia India Agenda Why do we need security? Types of attacks and security assessments Introduction to TrustZone

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Hybrid Memory Cube (HMC)

Hybrid Memory Cube (HMC) 23 Hybrid Memory Cube (HMC) J. Thomas Pawlowski, Fellow Chief Technologist, Architecture Development Group, Micron jpawlowski@micron.com 2011 Micron Technology, I nc. All rights reserved. Products are

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD Leveraging OpenSPARC ESA Round Table 2006 on Next Generation Microprocessors for Space Applications G.Furano, L.Messina TEC- OpenSPARC T1 The T1 is a new-from-the-ground-up SPARC microprocessor implementation

More information

THE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research

THE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research THE PATH TO EXASCALE COMPUTING Bill Dally Chief Scientist and Senior Vice President of Research The Goal: Sustained ExaFLOPs on problems of interest 2 Exascale Challenges Energy efficiency Programmability

More information

EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière

EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MIC workshop Guillaume Colin de Verdière MARCH 17 TH, 2015 MIC Workshop PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France March 17th, 2015 Overview Context

More information

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects Overview of Interconnects Myrinet and Quadrics Leading Modern Interconnects Presentation Outline General Concepts of Interconnects Myrinet Latest Products Quadrics Latest Release Our Research Interconnects

More information

SoC FPGAs. Your User-Customizable System on Chip Altera Corporation Public

SoC FPGAs. Your User-Customizable System on Chip Altera Corporation Public SoC FPGAs Your User-Customizable System on Chip Embedded Developers Needs Low High Increase system performance Reduce system power Reduce board size Reduce system cost 2 Providing the Best of Both Worlds

More information