Kalray MPPA Massively Parallel Processor Array
|
|
- Randell Booth
- 6 years ago
- Views:
Transcription
1 Kalray MPPA Massively Parallel Processor Array Engineering a Manycore Processor Platform for Mission-Critical Applications Benoît Dupont de Dinechin, CTO
2 Kalray Solutions Target Two Strategic Markets Cloud and Data Center acceleration Kalray is the ideal co-processor to off-load x86 or accelerate real-time or compute intensive functions Domains of application: Networking, Storage, Big Data analytics, Cybersecurity, Deep learning High Performance Embedded Computing Embedded applications require higher performance without increasing power consumption and more integration of functions including those constrained by real-time Domains of application: aerospace, automotive, transport, energy 2016 Kalray SA All Rights Reserved MCSoC
3 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC
4 Evolution of Sensor Fusion for Autonomous Driving (figures from M. Fausten keynote presentation at ESWEEK 2015) 2016 Kalray SA All Rights Reserved MCSoC
5 Number Crunching in Sensor Fusion Units (figures from M. Fausten keynote presentation at ESWEEK 2015) 2016 Kalray SA All Rights Reserved MCSoC
6 Mission-Critical Computing Dependability requirements Detect, isolate or mitigate errors that occur in digital circuits Transient electrical problems, soft errors, physical damage Timing requirements Time constraints associated with information manipulation Execution time determinism, predictability, composability Certification of worst-case response time by static analysis 2016 Kalray SA All Rights Reserved MCSoC
7 Execution Timing on a Multicore Processor Intra-core nondeterministic timing Speculative execution, branch prediction Cache content Inter-core interferences Concurrent accesses to caches, busses, and peripheral devices Cache coherence invalidations Main issue Growing gap between average execution time and static WCET 2016 Kalray SA All Rights Reserved MCSoC
8 Classic Multicore Memory Hierarchy Challenge: managing interference between cores 2016 Kalray SA All Rights Reserved MCSoC
9 Embedded Multicore Memory Hierarchy Challenge: programmability of DMA and private memories 2016 Kalray SA All Rights Reserved MCSoC
10 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC
11 MPPA -256 Manycore Architecture Highlights DSP type of processing Energy efficiency Timing predictability Software programmability CPU ease of programming C/C++ GNU environment 32-bit/64-bit, little-endian Rich operating system (Linux with dynamic loading&linking) Integrated many-core processor 32 management cores on chip 256 application cores on chip High-performance I/O Scalable parallel computing MPPA processors can be tiled together through NoC extension Run-time support for pooling the external DDR memory resources Courtesy Altera 2016 Kalray SA All Rights Reserved MCSoC
12 MPPA -256 Bostan Processor Architecture Manycore Processor Compute Cluster VLIW Core 16 compute clusters 2 I/O clusters each with quad-core CPUs, DDR3, 4 Ethernet 10G and 8 PCIe Gen3 Data and control networks-on-chip Distributed memory architecture 634 GFLOPS SP for 600Mhz 16 user cores + 1 system core NoC Tx and Rx interfaces Debug & Support Unit (DSU) 2 MB multi-banked shared memory 77GB/s Shared Memory BW 16 cores SMP System 32-bit or 64-bit addresses 5-issue VLIW architecture MMU + I&D cache (8KB+8KB) 32-bit/64-bit IEEE FMA FPU Tightly coupled crypto co-processor 2.4 GFLOPS SP per 2016 Kalray SA All Rights Reserved MCSoC
13 MPPA -256 Bostan Processor VLIW cores / 18 address spaces / 400 MHz 800 MHz Physical characteristics TSMC CMOS 28HP 100µW/MHz per core+ L1 caches 2W to 3W leakage Processor interfaces 2x DDR3 Memory interfaces 2x PCIe Gen3 8-lane interface 8x 1G/10G Ethernet or 2x 40G Ethernet interfaces SPI/I2C/UART interfaces Universal Static Memory Controller (NAND/NOR/SRAM) GPIOs with Direct NoC Access NoC extension through Interlaken interface (NoCX) 2016 Kalray SA All Rights Reserved MCSoC
14 MPPA Processor Co-Design for Avionics U. Saarland / AbsIntGMBH recommendations on VLIW core and cache micro-architecture design AbsInt provides the ait static timing analysis tool used to certify flight control systems at Airbus AbsInt ait tool also targets the Kalray VLIW cores Architecture design with a focus on timing predictability Core level: micro-architecture Fully timing compositional core LRU caches and write buffer Cache bypass memory opcodes Cluster level: multi-banked shared memory Core-private buses for memory bank access Address interleaving or blocking across banks Processor level: NoC/DDR guaranteed services NoC minimum bandwidth & maximum latency DDR controller configurations and address mapping 2016 Kalray SA All Rights Reserved MCSoC
15 MPPA -256 Bostan VLIW Core Data Path 5-issue, polycyclic property Unified 10 read, 5 write 64x32-bit register file 64-bit data in register pairs Shared ACC read port Unified MAU and FPU LSU and MAU also execute single-cycle ALU instructions 2016 Kalray SA All Rights Reserved MCSoC
16 MPPA Bostan VLIW Core Energy Efficiency Kalray Bostan pj per Instruction on Matrix Multiply 117 ARM Cortex A7 (Little)* 225 GPU** 497 ARM Cortex A15 (Big)* Kalray Bostan at 141 pj per bundle with 3.14 instructions average per bundle *: An Instruction Level Energy Characterization of ARM Processors FORTH-ICS/TR-450, March 2015 **: Source: Measured on Matrix Multiply EPI Test + Source Bill Dally, To ExaScale and Beyond - NVidia 2016 Kalray SA All Rights Reserved MCSoC
17 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC
18 MPPA -256 Bostan Network-on-Chip (NoC) Dual 2D-torus NoC Direct, wormhole switching D-NoC: high-bandwidth RDMA C-NoC: low-latency mailboxes 4B/cycle per link direction per NoC Nx10Gb/s NoC extensions for connection to FPGA or other MPPA Predictability Data NoC is configured by selecting routes and injection parameters Routing ensure deadlock-free traffic Injection parameters are the (σ,ρ) or (burst, rate) of network calculus 2016 Kalray SA All Rights Reserved MCSoC
19 Wormhole Switching Illustrated A packet is composed of several flits The header flits governs the route The payload flits follow the header in a pipeline fashion When the required channel is busy, the hop flow control blocks the trailing flits and they stay in flit buffers along the established route 9 B A 2016 Kalray SA All Rights Reserved MCSoC
20 Wormhole Switching Network Issues Complex to implement May betruefor input queueing & output matching (e.g. islip) The MPPA NoC routers only include demultiplexers, output queues and RR arbiters Prone to deadlocking In thisexample, the redflow cannotuse R3 R2 becausethe blueflow isusingit Likewise, the blue flow needs R1 R4 heldby the redflow Deadlock requires full queues 2016 Kalray SA All Rights Reserved MCSoC
21 Deadlock-Free Routing Wormhole Switching Deadlock results from circuits of agents and resources connected by a wait-for relation [Dally & Seitz 1998] Wormhole switching: agents are packets; resources are link buffers Links between routers and links internal to routers( turns ) Solutions whenthe network topologyisa 2D mesh Dimension order(x-y on 2D meshes) Turnmodel [Glass & Ni 1994] (not the sameas turnprohibition ) Odd-Even[Chiu 2000] (better path diversity than turn model) Hamiltonian Odd-Even[Bahrebar& Stroobandt 2015] (multicasting) Strategy for the MPPA NoC management Isolate a 2D mesh in topology and apply deadlock-free routing 2016 Kalray SA All Rights Reserved MCSoC
22 Hamiltonian Odd-Even Prohibited Turns Even rows East-South turn prohibited North-West turn prohibited Odd rows North-East turn prohibited West-South turn prohibited Odd Odd Even Even Odd Odd Even Even 2016 Kalray SA All Rights Reserved MCSoC
23 Hamiltonian Odd-Even on the MPPA NoC Routing between compute clusters Routes generated assuming a 6x6 mesh Impossible routes are discarded Routing between I/O and compute cluster Always possible, thanks to the 4 NoC nodes per I/O cluster I/O Ethernet 0 I/O DDR 1 I/O DDR 0 I/O Ethernet Kalray SA All Rights Reserved MCSoC
24 MPPA NoC Path Selection Problem For each flow, select a single path among those proposed by adaptive routing so as to maximize utility If multi-path allowed, reduces to a multi-commodity flow problem with a direct solution Adding the single path constraints makes the problem NP-hard Practical instances are solved with mixed integer programming Utility function is the max-min fairness of flow rates Maximize the lowest flow rate, then the next lowest flow rate etc. Numerical results for all pairs of flows between 8 clusters (55 flows) 2016 Kalray SA All Rights Reserved MCSoC Flow Route Bandwidth F48 R F49 R1 0 R R3 0 R4 0 F50 R F51 R R2 0
25 Principle of the MPPA NoC Guaranted Services Data NoC packet injection implements a (σ,ρ)regulation No more than σ+ρ(t-s) flits are injected for any interval [s,t] Cumulative flit count L ~ρ σ Application of Deterministic Network Calculus prevents NoC congestion and provides bounds on end-to-end delays time Network Calculus application requires feed-forward flows Deadlock-free wormhole switching implies feed-forward flows 2016 Kalray SA All Rights Reserved MCSoC
26 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC
27 Complete AccessCore SDK Standard C/C++ Programming Environment C - Low Level/ Lib Programming DSP Style Simulators & Profilers, Debuggers & System Trace Operating Systems & Device Drivers AccessLib Optimized Libraries C - POSIX-Level Programming CPU Style OpenCL Programming GPU Style OpenDataPlane Open API for networking 2016 Kalray SA All Rights Reserved MCSoC
28 MPPA Distributed Memory Architecture Challenge Approaches to communication Load/Store to unified memory Remote DMA to unified memory Remote DMA to private memories Send/receive to private memories Approaches to synchronization Locks, semaphores, conditional variables Barriers, collectives (reduce/scan/all2all) Remote atomic operations Remote atomic queues Classic HPC clusters vs MPPA The working memory of HPC clusters is the collection of identical node memories The working memory of the MPPA is the collection of SMEMs backed by the DDRs 2016 Kalray SA All Rights Reserved MCSoC
29 MPPA Distributed Memory Architecture + Runtime MPPA local memory benefits Local memory accesses are more energyefficient than a Last-Level Cache (LLC) Local memory access interferences can be managed for time-critical applications MPPA DSM for implicit DDR access Escape code size and data size restrictions Mostly used for application porting Some implicit global memory accesses cannot be converted to remote DMA MPPA Asynchronous Operations Using of local memories and sharing DDR memories through remote DMA over NoC Remote atomic operations and queues Co-exist with the DSM or run stand-alone 2016 Kalray SA All Rights Reserved MCSoC
30 MPPA Asynchronous Operations Ordering 2016 Kalray SA All Rights Reserved MCSoC
31 MPPA Software Stack Overview Language Libraries GNU C, C++, Fortran POSIX threads + OpenMP Rich OS RTOS RPC, Asynchronous Hypervisor (Exokernel) Operations, DSM Low-level programming interfaces MPPA platform OpenCL C OpenCL runtime 2016 Kalray SA All Rights Reserved MCSoC
32 SCADE Code Generation for the MPPA Safety-critical control-command applications Model-based programming using SCADE Suite from Esterel Technologies Complemented with static timing analysis of binary code (ait from AbsInt) Motivations for multicore and manycore execution Distribute the compute load across cores and reduce memory interferences Effective implementation of multi-rate harmonic applications Envision use of fast Model Predictive Control (MPC) techniques 2016 Kalray SA All Rights Reserved MCSoC
33 emcos for MPPA Processors esolemcosis the world s first many-core RTOS for embedded use Runson the MPPA clusters on top of Low-Levelprogramming 2016 Kalray SA All Rights Reserved MCSoC
34 Automotive Software Development Kit Vision Number-crunching applications Use OpenCL hosted on an I/O cluster quad-core running Linux Partition the compute clusters using OpenCL subdevices Enable OpenCL Data Parallel and Task Parallel models Inside Task Parallel model, support Posix programming C/C++/Pthread/FILE instead of OpenCL-C MPPA Asynchronous operations instead of OpenCL async_copy Control-command applications SCADE Suite code generation for compute clusters Exchange with I/O application part using RDMA through NoC Automotive RTOS & middleware OSEK/VDX RTOS and support of AUTOSAR deployment environment 2016 Kalray SA All Rights Reserved MCSoC
35 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC
36 MPPA Coolidge Processors Compute cluster as CPU Direct access to DDR memory Optional L1 cache coherence Full 64-bit OS support Functional safety Power domains Robust partitioning NoC error protection Security architecture Trust provisioning Secure boot chain Authenticated debug Courtesy Rambus Courtesy Inside Secure 2016 Kalray SA All Rights Reserved MCSoC
37 MPPA Coolidge Main Evolutions TSMC 16FFC technology node 50% maximum frequency increase compared to 28nm HPC 40% dynamic power reduction at same frequency compared to 28nm HPC 3 rd generation Kalray VLIW core 1GHz typical core frequency 64-bit 5-issue VLIW core architecture 128-bit load/store unit(16 bytes /cycle) 16-bit, 32-bit and 64-bit IEEE 754 FPU 16KB, 4-way associative L1 cache s 4 privilege levels like in ARMv8 Co-processing Transcendental function acceleration for deep learning and scientific computing Block cipher and hashing acceleration Pixel pre-processing for computer vision Compute cluster 4MB memory capacity, 128-bit busses Local memory / L2 cache partitioning Interleaved or locked memory map RM core as L2 cache controller Unified NoC Dual virtual channel, 128-bit flits RDMA, load/store and control 2D grid topology External memory One DDR4 channel per 4 compute clusters 1 channel on MPPA channels on MPPA -256 High-speed peripherals Ethernet 25Gbps PCIe Gen4 Interlaken 25Gbps 2016 Kalray SA All Rights Reserved MCSoC
38 MPPA Processors on Deep Learning Application Kalray KANN tool interprets Berkeley Caffefiles of trained networks and generates code Network architecture Parameters (filter coefficients) 250 Neural Network (Googlenet - FP32) Focus on low-latency neural network inference Convolutional layer neurons are spread across compute clusters NoC multicasting to copy parameters from DDR to SMEM in compute clusters Fps All data transfers and coordination with MPPA asynchronous operations 0 Kalray** Bostan Nvidia* Tegra Kalray** Coolidge 2016 Kalray SA All Rights Reserved MCSoC
39 Outline Mission-Critical Applications MPPA -256 Bostan Processor MPPA NoC Management MPPA Software Environment MPPA Coolidge Processor Conclusions 2016 Kalray SA All Rights Reserved MCSoC
40 Lessons from Engineering Manycore Processors Target applications that benefit from manycore processors High performance computing (GPU, MIC) High performance networking (Cavium, Mellanox/Tilera) High integrity embedded computing (robotics, aerospace, defense) Achieving near-data computing (using local memory) is key Required for energy efficiency and timing predictability Local memory is naturally exploited by KPN models of computation Must also support transparent and effective access to global memory Programming models must be adapted Explicit memory hierarchy and relaxed memory consistency OpenCL a first step but local memory exploitation is too restrictive 2016 Kalray SA All Rights Reserved MCSoC
41 Multiple applications Cost efficiency Sensing Networking Running in PARALLEL & in REAL TIME Computing Safety Security 2016 Kalray SA All Rights Reserved MCSoC
42 Thank you KALRAY S.A. Paris - France 86 rue de Paris, Orsay France KALRAY S.A. Grenoble - France 445 rue Lavoisier, Montbonnot France KALRAY INC. Los Altos -USA 4962 El Camino Real Los Altos, CA USA Tel: +33 (0) info@kalray.eu Tel: +33 (0) info@kalray.eu Tel: +1 (650) info@kalrayinc.com MPPA, ACCESSCORE and the Kalray logo are trademarks or registered trademarks of Kalray in various countries. All trademarks, service marks, and trade names are the marks of the respective owner(s), and any unauthorized use thereof is strictly prohibited. All terms and prices are indicatives and subject to any modification without notice Kalray SA All Rights Reserved MCSoC
43 MPPA -256 Bostan Backup Slides 2016 Kalray SA All Rights Reserved MCSoC
44 Kalray VLIW Architecture Compared to HP Labs Lx The Lx architecture begat the STMicroelectronics ST200 VLIW family Optimize use of the data memory bandwidth Widening to 64-bit, no alignment restrictions Enable large immediate values in instruction stream All memory accesses may bypass the L1 data cache & write buffer Eliminate DLX ISA features and restrictions Instructions with 3 or 4 source operands, 1 or 2 target operands No aliasing between registers and special resources (LR, zero) Memory addressing modes similar to those of PowerPC Effective floating-point support with Fused Multiply Add Rework if-conversion support Remove Boolean registers and SELECT instructions Use CMOV and conditional load/store instructions Support hardware looping 2016 Kalray SA All Rights Reserved MCSoC
45 MPPA2-256 Bostan Compute Cluster 20 bus masters 16 application cores 1 management core NoCTxand Rx interfaces Debug support unit (DSU) 16-banked shared memory 2MB extensible to 4MB No bus interferences between cores RR arbitration between bus masters Interleaved or blocked address map 2016 Kalray SA All Rights Reserved MCSoC
46 MPPA2-256 Bostan Ethernet Support Ethernet as the main highperformance / low-latency IO Integration of Ethernet Rx and Tx to the D-NoC architecture Per Ethernet Rx port 8 classification tables for hardware dispatch Round-robin or classifiedcluster & core allocation Per Ethernet Tx port 64 independent Tx FIFOs Weighted round-robin between Tx FIFOs Flow control between clusters and Tx FIFOs 2016 Kalray SA All Rights Reserved MCSoC
Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013
Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013 The End of Dennard MOSFET Scaling Theory 2013 Kalray SA All Rights Reserved MPSoC
More informationAt the heart of a new generation of data center infrastructures and appliances. Sept 2017
/ At the heart of a new generation of data center infrastructures and appliances Sept 2017 VIRTUALIZED DATACENTER: THE BLENDER EFFECT FOR STORAGE I/O OPERATIONS 10, 000 VMs MIOPs in random Aggregate switch
More informationBig Data mais basse consommation L apport des processeurs manycore
Big Data mais basse consommation L apport des processeurs manycore Laurent Julliard - Kalray Le potentiel et les défis du Big Data Séminaire ASPROM 2 et 3 Juillet 2013 Presentation Outline Kalray : the
More informationA Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013
A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company
More informationTile Processor (TILEPro64)
Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth
More informationPorting Linux to a New Architecture
Embedded Linux Conference 2014 Porting Linux to a New Architecture Marta Rybczyńska May 1 st, 2014 Different Types of Porting New board New processor from existing family New architecture 2010-2014 Kalray
More informationParallel Code Generation of Synchronous Programs for a Many-core Architecture
Parallel Code Generation of Synchronous Programs for a Many-core Architecture Amaury Graillat Supervisors: Reviewers: Pascal Raymond (Verimag), Matthieu Moy (LIP) Benoît Dupont de Dinechin (Kalray) Jan
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationARM Processors for Embedded Applications
ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or
More informationMaximizing heterogeneous system performance with ARM interconnect and CCIX
Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable
More informationHercules ARM Cortex -R4 System Architecture. Processor Overview
Hercules ARM Cortex -R4 System Architecture Processor Overview What is Hercules? TI s 32-bit ARM Cortex -R4/R5 MCU family for Industrial, Automotive, and Transportation Safety Hardware Safety Features
More informationMessaging Overview. Introduction. Gen-Z Messaging
Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional
More informationEach Milliwatt Matters
Each Milliwatt Matters Ultra High Efficiency Application Processors Govind Wathan Product Manager, CPG ARM Tech Symposia China 2015 November 2015 Ultra High Efficiency Processors Used in Diverse Markets
More information4. Networks. in parallel computers. Advances in Computer Architecture
4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationExploring System Coherency and Maximizing Performance of Mobile Memory Systems
Exploring System Coherency and Maximizing Performance of Mobile Memory Systems Shanghai: William Orme, Strategic Marketing Manager of SSG Beijing & Shenzhen: Mayank Sharma, Product Manager of SSG ARM Tech
More informationTDT Appendix E Interconnection Networks
TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages
More informationDeveloping deterministic networking technology for railway applications using TTEthernet software-based end systems
Developing deterministic networking technology for railway applications using TTEthernet software-based end systems Project n 100021 Astrit Ademaj, TTTech Computertechnik AG Outline GENESYS requirements
More informationProfiling and Debugging OpenCL Applications with ARM Development Tools. October 2014
Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline
More informationCommunication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018
Communication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018 Agenda Motivation Introduction of Safety Components Introduction to ARMv8
More informationARMv8-A Software Development
ARMv8-A Software Development Course Description ARMv8-A software development is a 4 days ARM official course. The course goes into great depth and provides all necessary know-how to develop software for
More informationPorting Linux to a New Architecture
Embedded Linux Conference Europe 2014 Porting Linux to a New Architecture Marta Rybczyńska October 15, 2014 Different Types of Porting New board New processor from existing family New architecture 2 New
More informationThe S6000 Family of Processors
The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which
More informationDesign and Analysis of Time-Critical Systems Timing Predictability and Analyzability + Case Studies: PTARM and Kalray MPPA-256
Design and Analysis of Time-Critical Systems Timing Predictability and Analyzability + Case Studies: PTARM and Kalray MPPA-256 Jan Reineke @ saarland university computer science ACACES Summer School 2017
More informationChapter 5. Introduction ARM Cortex series
Chapter 5 Introduction ARM Cortex series 5.1 ARM Cortex series variants 5.2 ARM Cortex A series 5.3 ARM Cortex R series 5.4 ARM Cortex M series 5.5 Comparison of Cortex M series with 8/16 bit MCUs 51 5.1
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More informationCopyright 2016 Xilinx
Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building
More informationThe Next Steps in the Evolution of ARM Cortex-M
The Next Steps in the Evolution of ARM Cortex-M Joseph Yiu Senior Embedded Technology Manager CPU Group ARM Tech Symposia China 2015 November 2015 Trust & Device Integrity from Sensor to Server 2 ARM 2015
More informationVector Engine Processor of SX-Aurora TSUBASA
Vector Engine Processor of SX-Aurora TSUBASA Shintaro Momose, Ph.D., NEC Deutschland GmbH 9 th October, 2018 WSSP 1 NEC Corporation 2018 Contents 1) Introduction 2) VE Processor Architecture 3) Performance
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationPowerPC 740 and 750
368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationNetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013
NetSpeed ORION: A New Approach to Design On-chip Interconnects August 26 th, 2013 INTERCONNECTS BECOMING INCREASINGLY IMPORTANT Growing number of IP cores Average SoCs today have 100+ IPs Mixing and matching
More informationVOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017
VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips 2017 1 TESLA V100 21B transistors 815 mm 2 80 SM 5120 CUDA Cores 640 Tensor Cores 16 GB HBM2 900 GB/s HBM2 300 GB/s NVLink *full GV100
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationThomas Moscibroda Microsoft Research. Onur Mutlu CMU
Thomas Moscibroda Microsoft Research Onur Mutlu CMU CPU+L1 CPU+L1 CPU+L1 CPU+L1 Multi-core Chip Cache -Bank Cache -Bank Cache -Bank Cache -Bank CPU+L1 CPU+L1 CPU+L1 CPU+L1 Accelerator, etc Cache -Bank
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationSoC Design. Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik
SoC Design Prof. Dr. Christophe Bobda Institut für Informatik Lehrstuhl für Technische Informatik Chapter 5 On-Chip Communication Outline 1. Introduction 2. Shared media 3. Switched media 4. Network on
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationEfficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford
Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable
More informationNetworks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin
Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies Alvin R. Lebeck CPS 220 Admin Homework #5 Due Dec 3 Projects Final (yes it will be cumulative) CPS 220 2 1 Review: Terms Network characterized
More informationA unified multicore programming model
A unified multicore programming model Simplifying multicore migration By Sven Brehmer Abstract There are a number of different multicore architectures and programming models available, making it challenging
More informationBuilding High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye
Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink Robert Kaye 1 Agenda Once upon a time ARM designed systems Compute trends Bringing it all together with CoreLink 400
More informationNew ARMv8-R technology for real-time control in safetyrelated
New ARMv8-R technology for real-time control in safetyrelated applications James Scobie Product manager ARM Technical Symposium China: Automotive, Industrial & Functional Safety October 31 st 2016 November
More informationThe Next Steps in the Evolution of Embedded Processors
The Next Steps in the Evolution of Embedded Processors Terry Kim Staff FAE, ARM Korea ARM Tech Forum Singapore July 12 th 2017 Cortex-M Processors Serving Connected Applications Energy grid Automotive
More informationNegotiating the Maze Getting the most out of memory systems today and tomorrow. Robert Kaye
Negotiating the Maze Getting the most out of memory systems today and tomorrow Robert Kaye 1 System on Chip Memory Systems Systems use external memory Large address space Low cost-per-bit Large interface
More informationPOWER7: IBM's Next Generation Server Processor
POWER7: IBM's Next Generation Server Processor Acknowledgment: This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002 Outline
More informationImplementing Flexible Interconnect Topologies for Machine Learning Acceleration
Implementing Flexible Interconnect for Machine Learning Acceleration A R M T E C H S Y M P O S I A O C T 2 0 1 8 WILLIAM TSENG Mem Controller 20 mm Mem Controller Machine Learning / AI SoC New Challenges
More informationBeyond One Core. Multi-Cores. Evolution in Frequency Has Stalled. Source: ISCA 2009 Keynote of Kathy Yelick
Beyond One Multi-s Evolution in Frequency Has Stalled Source: ISCA 2009 Keynote of Kathy Yelick 1 Multi-s Network or Bus Memory Private and/or shared caches Bus Network (on chip = NoC) Example: ARM Cortex
More informationQuick Reference Card. Timing and Stack Verifiers Supported Platforms. SCADE Suite 6.3
Timing and Stack Verifiers Supported Platforms SCADE Suite 6.3 About Timing and Stack Verifiers supported platforms Timing and Stack Verifiers Supported Platforms SCADE Suite Timing Verifier and SCADE
More informationRapidIO.org Update. Mar RapidIO.org 1
RapidIO.org Update rickoco@rapidio.org Mar 2015 2015 RapidIO.org 1 Outline RapidIO Overview & Markets Data Center & HPC Communications Infrastructure Industrial Automation Military & Aerospace RapidIO.org
More informationTiming analysis and timing predictability
Timing analysis and timing predictability Architectural Dependences Reinhard Wilhelm Saarland University, Saarbrücken, Germany ArtistDesign Summer School in China 2010 What does the execution time depends
More informationWhite paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation
White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationAccelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing
Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product
More informationHigher Level Programming Abstractions for FPGAs using OpenCL
Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*
More informationEnergy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction
More informationBlue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft
Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small
More informationArchitecture and Programming Models of the Kalray MPPA Single Chip Manycore Processor. Benoît Dupont de Dinechin MCC 2013, Halmstad, Sweden
Architecture and Programming Models of the Kalray MPPA Single Chip Manycore Processor Benoît Dupont de Dinechin MCC 2013, Halmstad, Sweden Outline Kalray MPPA Products MPPA MANYCORE Architecture MPPA ACCESSLIB
More informationReal-Time Mixed-Criticality Wormhole Networks
eal-time Mixed-Criticality Wormhole Networks Leandro Soares Indrusiak eal-time Systems Group Department of Computer Science University of York United Kingdom eal-time Systems Group 1 Outline Wormhole Networks
More informationNvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018
Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks
More informationSwizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems
1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna
More informationARM CORTEX-R52. Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture.
ARM CORTEX-R52 Course Family: ARMv8-R Cortex-R CPU Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture. Duration: 4 days Prerequisites and related
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationAltera SDK for OpenCL
Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group
More informationNoCo: ILP-based Worst-Case Contention Estimation for Mesh Real-Time Manycores
www.bsc.es NoCo: ILP-based Worst-Case Contention Estimation for Mesh Real-Time Manycores Jordi Cardona 1,2, Carles Hernandez 1, Enrico Mezzetti 1, Jaume Abella 1 and Francisco J.Cazorla 1,3 1 Barcelona
More informationAn Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection
An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,
More informationDesign and Analysis of Time-Critical Systems Introduction
Design and Analysis of Time-Critical Systems Introduction Jan Reineke @ saarland university ACACES Summer School 2017 Fiuggi, Italy computer science Structure of this Course 2. How are they implemented?
More informationNetwork Calculus: A Comparison
Time-Division Multiplexing vs Network Calculus: A Comparison Wolfgang Puffitsch, Rasmus Bo Sørensen, Martin Schoeberl RTNS 15, Lille, France Motivation Modern multiprocessors use networks-on-chip Congestion
More informationAll About the Cell Processor
All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,
More informationThe Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016
The Path to Embedded Vision & AI using a Low Power Vision DSP Yair Siegel, Director of Segment Marketing Hotchips August 2016 Presentation Outline Introduction The Need for Embedded Vision & AI Vision
More informationAchieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation
Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation
More informationNetronome NFP: Theory of Operation
WHITE PAPER Netronome NFP: Theory of Operation TO ACHIEVE PERFORMANCE GOALS, A MULTI-CORE PROCESSOR NEEDS AN EFFICIENT DATA MOVEMENT ARCHITECTURE. CONTENTS 1. INTRODUCTION...1 2. ARCHITECTURE OVERVIEW...2
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationAn Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki
An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &
More information3D WiNoC Architectures
Interconnect Enhances Architecture: Evolution of Wireless NoC from Planar to 3D 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan Sep 18th, 2014 Hiroki Matsutani, "3D WiNoC Architectures",
More informationBasic Low Level Concepts
Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock
More informationContents of this presentation: Some words about the ARM company
The architecture of the ARM cores Contents of this presentation: Some words about the ARM company The ARM's Core Families and their benefits Explanation of the ARM architecture Architecture details, features
More informationMastering The Behavior of Multi-Core Systems to Match Avionics Requirements
www.thalesgroup.com Mastering The Behavior of Multi-Core Systems to Match Avionics Requirements Hicham AGROU, Marc GATTI, Pascal SAINRAT, Patrice TOILLON {hicham.agrou,marc-j.gatti, patrice.toillon}@fr.thalesgroup.com
More informationWhat Software Requires from Multicore for Certification - overview for the domains avionics, medical in comparison to automotive,...
What Software Requires from Multicore for Certification - overview for the domains avionics, medical in comparison to automotive,... Matthias Pruksch Motivation Collision Avoidance System [1] Mobile Diagnosis
More informationIntel Enterprise Processors Technology
Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationPOWER7: IBM's Next Generation Server Processor
Hot Chips 21 POWER7: IBM's Next Generation Server Processor Ronald Kalla Balaram Sinharoy POWER7 Chief Engineer POWER7 Chief Core Architect Acknowledgment: This material is based upon work supported by
More informationThe University of Texas at Austin
EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin
More information«Real Time Embedded systems» Multi Masters Systems
«Real Time Embedded systems» Multi Masters Systems rene.beuchat@epfl.ch LAP/ISIM/IC/EPFL Chargé de cours rene.beuchat@hesge.ch LSN/hepia Prof. HES 1 Multi Master on Chip On a System On Chip, Master can
More informationBuilding blocks for 64-bit Systems Development of System IP in ARM
Building blocks for 64-bit Systems Development of System IP in ARM Research seminar @ University of York January 2015 Stuart Kenny stuart.kenny@arm.com 1 2 64-bit Mobile Devices The Mobile Consumer Expects
More informationTrustzone Security IP for IoT
Trustzone Security IP for IoT Udi Maor CryptoCell-7xx product manager Systems & Software Group ARM Tech Forum Singapore July 12 th 2017 Why is getting security right for IoT so important? When our everyday
More informationA Developer's Guide to Security on Cortex-M based MCUs
A Developer's Guide to Security on Cortex-M based MCUs 2018 Arm Limited Nazir S Arm Tech Symposia India Agenda Why do we need security? Types of attacks and security assessments Introduction to TrustZone
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationMultilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823
More informationHybrid Memory Cube (HMC)
23 Hybrid Memory Cube (HMC) J. Thomas Pawlowski, Fellow Chief Technologist, Architecture Development Group, Micron jpawlowski@micron.com 2011 Micron Technology, I nc. All rights reserved. Products are
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction
More informationLeveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD
Leveraging OpenSPARC ESA Round Table 2006 on Next Generation Microprocessors for Space Applications G.Furano, L.Messina TEC- OpenSPARC T1 The T1 is a new-from-the-ground-up SPARC microprocessor implementation
More informationTHE PATH TO EXASCALE COMPUTING. Bill Dally Chief Scientist and Senior Vice President of Research
THE PATH TO EXASCALE COMPUTING Bill Dally Chief Scientist and Senior Vice President of Research The Goal: Sustained ExaFLOPs on problems of interest 2 Exascale Challenges Energy efficiency Programmability
More informationEXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière
EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MIC workshop Guillaume Colin de Verdière MARCH 17 TH, 2015 MIC Workshop PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France March 17th, 2015 Overview Context
More information1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects
Overview of Interconnects Myrinet and Quadrics Leading Modern Interconnects Presentation Outline General Concepts of Interconnects Myrinet Latest Products Quadrics Latest Release Our Research Interconnects
More informationSoC FPGAs. Your User-Customizable System on Chip Altera Corporation Public
SoC FPGAs Your User-Customizable System on Chip Embedded Developers Needs Low High Increase system performance Reduce system power Reduce board size Reduce system cost 2 Providing the Best of Both Worlds
More information