Kalray MPPA Massively Parallel Processor Array

Size: px

Start display at page:

Download "Kalray MPPA Massively Parallel Processor Array"

Randell Booth
6 years ago
Views:

1 Kalray MPPA Massively Parallel Processor Array Engineering a Manycore Processor Platform for Mission-Critical Applications Benoît Dupont de Dinechin, CTO

Kalray Solutions Target Two Strategic Markets Cloud and Data Center acceleration Kalray is the ideal co-processor to off-load x86 or accelerate

Performance Embedded Computing Embedded applications require higher performance without increasing power consumption and more integration of

2 Kalray Solutions Target Two Strategic Markets Cloud and Data Center acceleration Kalray is the ideal co-processor to off-load x86 or accelerate real-time or compute intensive functions Domains of application: Networking, Storage, Big Data analytics, Cybersecurity, Deep learning High Performance Embedded Computing Embedded applications require higher performance without increasing power consumption and more integration of functions including those constrained by real-time Domains of application: aerospace, automotive, transport, energy 2016 Kalray SA All Rights Reserved MCSoC

Number Crunching in Sensor Fusion Units (figures from M.

Mission-Critical Computing Dependability requirements Detect, isolate or mitigate errors that occur in digital circuits Transient electrical problems, soft errors, physical damage Timing requirements

6 Mission-Critical Computing Dependability requirements Detect, isolate or mitigate errors that occur in digital circuits Transient electrical problems, soft errors, physical damage Timing requirements Time constraints associated with information manipulation Execution time determinism, predictability, composability Certification of worst-case response time by static analysis 2016 Kalray SA All Rights Reserved MCSoC

caches, busses, and peripheral devices Cache coherence invalidations Main issue Growing gap

7 Execution Timing on a Multicore Processor Intra-core nondeterministic timing Speculative execution, branch prediction Cache content Inter-core interferences Concurrent accesses to caches, busses, and peripheral devices Cache coherence invalidations Main issue Growing gap between average execution time and static WCET 2016 Kalray SA All Rights Reserved MCSoC

MPPA -256 Manycore Architecture Highlights DSP type of processing Energy efficiency Timing predictability Software programmability CPU ease of programming C/C++ GNU environment 32-bit/64-bit,

11 MPPA -256 Manycore Architecture Highlights DSP type of processing Energy efficiency Timing predictability Software programmability CPU ease of programming C/C++ GNU environment 32-bit/64-bit, little-endian Rich operating system (Linux with dynamic loading&linking) Integrated many-core processor 32 management cores on chip 256 application cores on chip High-performance I/O Scalable parallel computing MPPA processors can be tiled together through NoC extension Run-time support for pooling the external DDR memory resources Courtesy Altera 2016 Kalray SA All Rights Reserved MCSoC

MPPA -256 Bostan Processor Architecture Manycore Processor Compute Cluster VLIW Core 16 compute clusters 2 I/O

Distributed memory architecture 634 GFLOPS SP for 25W @ 600Mhz 16 user cores + 1 system core NoC Tx and Rx

System 32-bit or 64-bit addresses 5-issue VLIW architecture MMU + I&D cache (8KB+8KB) 32-bit/64-bit IEEE

12 MPPA -256 Bostan Processor Architecture Manycore Processor Compute Cluster VLIW Core 16 compute clusters 2 I/O clusters each with quad-core CPUs, DDR3, 4 Ethernet 10G and 8 PCIe Gen3 Data and control networks-on-chip Distributed memory architecture 634 GFLOPS SP for 600Mhz 16 user cores + 1 system core NoC Tx and Rx interfaces Debug & Support Unit (DSU) 2 MB multi-banked shared memory 77GB/s Shared Memory BW 16 cores SMP System 32-bit or 64-bit addresses 5-issue VLIW architecture MMU + I&D cache (8KB+8KB) 32-bit/64-bit IEEE FMA FPU Tightly coupled crypto co-processor 2.4 GFLOPS SP per 2016 Kalray SA All Rights Reserved MCSoC

MPPA -256 Bostan Processor 256 + 32 VLIW cores / 18 address spaces / 400 MHz 800 MHz Physical characteristics TSMC CMOS 28HP 100µW/MHz per core+ L1 caches 2W to 3W leakage Processor interfaces 2x

13 MPPA -256 Bostan Processor VLIW cores / 18 address spaces / 400 MHz 800 MHz Physical characteristics TSMC CMOS 28HP 100µW/MHz per core+ L1 caches 2W to 3W leakage Processor interfaces 2x DDR3 Memory interfaces 2x PCIe Gen3 8-lane interface 8x 1G/10G Ethernet or 2x 40G Ethernet interfaces SPI/I2C/UART interfaces Universal Static Memory Controller (NAND/NOR/SRAM) GPIOs with Direct NoC Access NoC extension through Interlaken interface (NoCX) 2016 Kalray SA All Rights Reserved MCSoC

MPPA Processor Co-Design for Avionics U.

analysis tool used to certify flight control systems at Airbus AbsInt ait tool also targets the Kalray VLIW cores Architecture

buffer Cache bypass memory opcodes Cluster level: multi-banked shared memory Core-private buses for memory bank access Address

14 MPPA Processor Co-Design for Avionics U. Saarland / AbsIntGMBH recommendations on VLIW core and cache micro-architecture design AbsInt provides the ait static timing analysis tool used to certify flight control systems at Airbus AbsInt ait tool also targets the Kalray VLIW cores Architecture design with a focus on timing predictability Core level: micro-architecture Fully timing compositional core LRU caches and write buffer Cache bypass memory opcodes Cluster level: multi-banked shared memory Core-private buses for memory bank access Address interleaving or blocking across banks Processor level: NoC/DDR guaranteed services NoC minimum bandwidth & maximum latency DDR controller configurations and address mapping 2016 Kalray SA All Rights Reserved MCSoC

15 MPPA -256 Bostan VLIW Core Data Path 5-issue, polycyclic property Unified 10 read, 5 write 64x32-bit register file 64-bit data in register pairs Shared ACC read port Unified MAU and FPU LSU and MAU also execute single-cycle ALU instructions 2016 Kalray SA All Rights Reserved MCSoC

16 MPPA Bostan VLIW Core Energy Efficiency Kalray Bostan pj per Instruction on Matrix Multiply 117 ARM Cortex A7 (Little)* 225 GPU** 497 ARM Cortex A15 (Big)* Kalray Bostan at 141 pj per bundle with 3.14 instructions average per bundle *: An Instruction Level Energy Characterization of ARM Processors FORTH-ICS/TR-450, March 2015 **: Source: Measured on Matrix Multiply EPI Test + Source Bill Dally, To ExaScale and Beyond - NVidia 2016 Kalray SA All Rights Reserved MCSoC

MPPA -256 Bostan Network-on-Chip (NoC) Dual 2D-torus NoC Direct, wormhole switching D-NoC: high-bandwidth RDMA C-NoC: low-latency mailboxes 4B/cycle per link direction per NoC Nx10Gb/s NoC extensions

18 MPPA -256 Bostan Network-on-Chip (NoC) Dual 2D-torus NoC Direct, wormhole switching D-NoC: high-bandwidth RDMA C-NoC: low-latency mailboxes 4B/cycle per link direction per NoC Nx10Gb/s NoC extensions for connection to FPGA or other MPPA Predictability Data NoC is configured by selecting routes and injection parameters Routing ensure deadlock-free traffic Injection parameters are the (σ,ρ) or (burst, rate) of network calculus 2016 Kalray SA All Rights Reserved MCSoC

19 Wormhole Switching Illustrated A packet is composed of several flits The header flits governs the route The payload flits follow the header in a pipeline fashion When the required channel is busy, the hop flow control blocks the trailing flits and they stay in flit buffers along the established route 9 B A 2016 Kalray SA All Rights Reserved MCSoC

20 Wormhole Switching Network Issues Complex to implement May betruefor input queueing & output matching (e.g. islip) The MPPA NoC routers only include demultiplexers, output queues and RR arbiters Prone to deadlocking In thisexample, the redflow cannotuse R3 R2 becausethe blueflow isusingit Likewise, the blue flow needs R1 R4 heldby the redflow Deadlock requires full queues 2016 Kalray SA All Rights Reserved MCSoC

21 Deadlock-Free Routing Wormhole Switching Deadlock results from circuits of agents and resources connected by a wait-for relation [Dally & Seitz 1998] Wormhole switching: agents are packets; resources are link buffers Links between routers and links internal to routers( turns ) Solutions whenthe network topologyisa 2D mesh Dimension order(x-y on 2D meshes) Turnmodel [Glass & Ni 1994] (not the sameas turnprohibition ) Odd-Even[Chiu 2000] (better path diversity than turn model) Hamiltonian Odd-Even[Bahrebar& Stroobandt 2015] (multicasting) Strategy for the MPPA NoC management Isolate a 2D mesh in topology and apply deadlock-free routing 2016 Kalray SA All Rights Reserved MCSoC

22 Hamiltonian Odd-Even Prohibited Turns Even rows East-South turn prohibited North-West turn prohibited Odd rows North-East turn prohibited West-South turn prohibited Odd Odd Even Even Odd Odd Even Even 2016 Kalray SA All Rights Reserved MCSoC

23 Hamiltonian Odd-Even on the MPPA NoC Routing between compute clusters Routes generated assuming a 6x6 mesh Impossible routes are discarded Routing between I/O and compute cluster Always possible, thanks to the 4 NoC nodes per I/O cluster I/O Ethernet 0 I/O DDR 1 I/O DDR 0 I/O Ethernet Kalray SA All Rights Reserved MCSoC

24 MPPA NoC Path Selection Problem For each flow, select a single path among those proposed by adaptive routing so as to maximize utility If multi-path allowed, reduces to a multi-commodity flow problem with a direct solution Adding the single path constraints makes the problem NP-hard Practical instances are solved with mixed integer programming Utility function is the max-min fairness of flow rates Maximize the lowest flow rate, then the next lowest flow rate etc. Numerical results for all pairs of flows between 8 clusters (55 flows) 2016 Kalray SA All Rights Reserved MCSoC Flow Route Bandwidth F48 R F49 R1 0 R R3 0 R4 0 F50 R F51 R R2 0

25 Principle of the MPPA NoC Guaranted Services Data NoC packet injection implements a (σ,ρ)regulation No more than σ+ρ(t-s) flits are injected for any interval [s,t] Cumulative flit count L ~ρ σ Application of Deterministic Network Calculus prevents NoC congestion and provides bounds on end-to-end delays time Network Calculus application requires feed-forward flows Deadlock-free wormhole switching implies feed-forward flows 2016 Kalray SA All Rights Reserved MCSoC

Complete AccessCore SDK Standard C/C++ Programming

Simulators & Profilers, Debuggers & System Trace Operating

POSIX-Level Programming CPU Style OpenCL Programming GPU

27 Complete AccessCore SDK Standard C/C++ Programming Environment C - Low Level/ Lib Programming DSP Style Simulators & Profilers, Debuggers & System Trace Operating Systems & Device Drivers AccessLib Optimized Libraries C - POSIX-Level Programming CPU Style OpenCL Programming GPU Style OpenDataPlane Open API for networking 2016 Kalray SA All Rights Reserved MCSoC

MPPA Distributed Memory Architecture Challenge Approaches to communication Load/Store to unified memory Remote DMA to unified memory Remote DMA to private memories Send/receive to private memories

28 MPPA Distributed Memory Architecture Challenge Approaches to communication Load/Store to unified memory Remote DMA to unified memory Remote DMA to private memories Send/receive to private memories Approaches to synchronization Locks, semaphores, conditional variables Barriers, collectives (reduce/scan/all2all) Remote atomic operations Remote atomic queues Classic HPC clusters vs MPPA The working memory of HPC clusters is the collection of identical node memories The working memory of the MPPA is the collection of SMEMs backed by the DDRs 2016 Kalray SA All Rights Reserved MCSoC

MPPA Distributed Memory Architecture + Runtime MPPA local memory benefits Local memory accesses are more energyefficient than a Last-Level Cache (LLC) Local memory access interferences can be managed

29 MPPA Distributed Memory Architecture + Runtime MPPA local memory benefits Local memory accesses are more energyefficient than a Last-Level Cache (LLC) Local memory access interferences can be managed for time-critical applications MPPA DSM for implicit DDR access Escape code size and data size restrictions Mostly used for application porting Some implicit global memory accesses cannot be converted to remote DMA MPPA Asynchronous Operations Using of local memories and sharing DDR memories through remote DMA over NoC Remote atomic operations and queues Co-exist with the DSM or run stand-alone 2016 Kalray SA All Rights Reserved MCSoC

MPPA Asynchronous Operations Ordering 2016

31 MPPA Software Stack Overview Language Libraries GNU C, C++, Fortran POSIX threads + OpenMP Rich OS RTOS RPC, Asynchronous Hypervisor (Exokernel) Operations, DSM Low-level programming interfaces MPPA platform OpenCL C OpenCL runtime 2016 Kalray SA All Rights Reserved MCSoC

manycore execution Distribute the compute load across cores and reduce memory interferences Effective implementation of

32 SCADE Code Generation for the MPPA Safety-critical control-command applications Model-based programming using SCADE Suite from Esterel Technologies Complemented with static timing analysis of binary code (ait from AbsInt) Motivations for multicore and manycore execution Distribute the compute load across cores and reduce memory interferences Effective implementation of multi-rate harmonic applications Envision use of fast Model Predictive Control (MPC) techniques 2016 Kalray SA All Rights Reserved MCSoC

emcos for MPPA Processors esolemcosis the world

34 Automotive Software Development Kit Vision Number-crunching applications Use OpenCL hosted on an I/O cluster quad-core running Linux Partition the compute clusters using OpenCL subdevices Enable OpenCL Data Parallel and Task Parallel models Inside Task Parallel model, support Posix programming C/C++/Pthread/FILE instead of OpenCL-C MPPA Asynchronous operations instead of OpenCL async_copy Control-command applications SCADE Suite code generation for compute clusters Exchange with I/O application part using RDMA through NoC Automotive RTOS & middleware OSEK/VDX RTOS and support of AUTOSAR deployment environment 2016 Kalray SA All Rights Reserved MCSoC

NoC error protection Security architecture Trust provisioning Secure boot chain

36 MPPA Coolidge Processors Compute cluster as CPU Direct access to DDR memory Optional L1 cache coherence Full 64-bit OS support Functional safety Power domains Robust partitioning NoC error protection Security architecture Trust provisioning Secure boot chain Authenticated debug Courtesy Rambus Courtesy Inside Secure 2016 Kalray SA All Rights Reserved MCSoC

37 MPPA Coolidge Main Evolutions TSMC 16FFC technology node 50% maximum frequency increase compared to 28nm HPC 40% dynamic power reduction at same frequency compared to 28nm HPC 3 rd generation Kalray VLIW core 1GHz typical core frequency 64-bit 5-issue VLIW core architecture 128-bit load/store unit(16 bytes /cycle) 16-bit, 32-bit and 64-bit IEEE 754 FPU 16KB, 4-way associative L1 cache s 4 privilege levels like in ARMv8 Co-processing Transcendental function acceleration for deep learning and scientific computing Block cipher and hashing acceleration Pixel pre-processing for computer vision Compute cluster 4MB memory capacity, 128-bit busses Local memory / L2 cache partitioning Interleaved or locked memory map RM core as L2 cache controller Unified NoC Dual virtual channel, 128-bit flits RDMA, load/store and control 2D grid topology External memory One DDR4 channel per 4 compute clusters 1 channel on MPPA channels on MPPA -256 High-speed peripherals Ethernet 25Gbps PCIe Gen4 Interlaken 25Gbps 2016 Kalray SA All Rights Reserved MCSoC

MPPA Processors on Deep Learning Application Kalray KANN tool interprets Berkeley Caffefiles of trained networks and generates code Network architecture Parameters (filter coefficients) 250 Neural

38 MPPA Processors on Deep Learning Application Kalray KANN tool interprets Berkeley Caffefiles of trained networks and generates code Network architecture Parameters (filter coefficients) 250 Neural Network (Googlenet - FP32) Focus on low-latency neural network inference Convolutional layer neurons are spread across compute clusters NoC multicasting to copy parameters from DDR to SMEM in compute clusters Fps All data transfers and coordination with MPPA asynchronous operations 0 Kalray** Bostan Nvidia* Tegra Kalray** Coolidge 2016 Kalray SA All Rights Reserved MCSoC

40 Lessons from Engineering Manycore Processors Target applications that benefit from manycore processors High performance computing (GPU, MIC) High performance networking (Cavium, Mellanox/Tilera) High integrity embedded computing (robotics, aerospace, defense) Achieving near-data computing (using local memory) is key Required for energy efficiency and timing predictability Local memory is naturally exploited by KPN models of computation Must also support transparent and effective access to global memory Programming models must be adapted Explicit memory hierarchy and relaxed memory consistency OpenCL a first step but local memory exploitation is too restrictive 2016 Kalray SA All Rights Reserved MCSoC

Thank you KALRAY S.A. Paris - France 86 rue de Paris, 91 400 Orsay France KALRAY S.A. Grenoble - France 445 rue Lavoisier, 38 330 Montbonnot France KALRAY INC.

eu Tel: +33 (0)4 76 18 09 18 email: info@kalray.eu Tel: +1 (650) 469 3729 email: info@kalrayinc.

All trademarks, service marks, and trade names are the marks of the respective owner(s), and any unauthorized use thereof is

42 Thank you KALRAY S.A. Paris - France 86 rue de Paris, Orsay France KALRAY S.A. Grenoble - France 445 rue Lavoisier, Montbonnot France KALRAY INC. Los Altos -USA 4962 El Camino Real Los Altos, CA USA Tel: +33 (0) info@kalray.eu Tel: +33 (0) info@kalray.eu Tel: +1 (650) info@kalrayinc.com MPPA, ACCESSCORE and the Kalray logo are trademarks or registered trademarks of Kalray in various countries. All trademarks, service marks, and trade names are the marks of the respective owner(s), and any unauthorized use thereof is strictly prohibited. All terms and prices are indicatives and subject to any modification without notice Kalray SA All Rights Reserved MCSoC

44 Kalray VLIW Architecture Compared to HP Labs Lx The Lx architecture begat the STMicroelectronics ST200 VLIW family Optimize use of the data memory bandwidth Widening to 64-bit, no alignment restrictions Enable large immediate values in instruction stream All memory accesses may bypass the L1 data cache & write buffer Eliminate DLX ISA features and restrictions Instructions with 3 or 4 source operands, 1 or 2 target operands No aliasing between registers and special resources (LR, zero) Memory addressing modes similar to those of PowerPC Effective floating-point support with Fused Multiply Add Rework if-conversion support Remove Boolean registers and SELECT instructions Use CMOV and conditional load/store instructions Support hardware looping 2016 Kalray SA All Rights Reserved MCSoC

extensible to 4MB No bus interferences between cores RR arbitration between bus

45 MPPA2-256 Bostan Compute Cluster 20 bus masters 16 application cores 1 management core NoCTxand Rx interfaces Debug support unit (DSU) 16-banked shared memory 2MB extensible to 4MB No bus interferences between cores RR arbitration between bus masters Interleaved or blocked address map 2016 Kalray SA All Rights Reserved MCSoC

dispatch Round-robin or classifiedcluster & core allocation Per Ethernet Tx port 64 independent Tx FIFOs

46 MPPA2-256 Bostan Ethernet Support Ethernet as the main highperformance / low-latency IO Integration of Ethernet Rx and Tx to the D-NoC architecture Per Ethernet Rx port 8 classification tables for hardware dispatch Round-robin or classifiedcluster & core allocation Per Ethernet Tx port 64 independent Tx FIFOs Weighted round-robin between Tx FIFOs Flow control between clusters and Tx FIFOs 2016 Kalray SA All Rights Reserved MCSoC

Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013

Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013 The End of Dennard MOSFET Scaling Theory 2013 Kalray SA All Rights Reserved MPSoC