Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013

The End of Dennard MOSFET Scaling Theory
2013 Kalray SA All Rights Reserved MPSoC 2013

Manycore Challenges on Next Technology Nodes
  Dark Silicon Projection (Esmaeilzadeh et al., CACM 2013)
    At 8nm, over 50% of the chip will be dark and cannot be utilized
    Based on Device x Core x Multicore models
    Multicore model assumes x86 CPU or GPU architecture
  Dally on Future Challenges of Large-Scale Computing (ISC 2013)
    Exascale computing requires 1000x improvement in energy efficiency
    By 2020: technology => 2.2x, circuit design => 3x, architecture => 4x
    Power goes into moving data around: communication dominates power
  Not considered above
    Manycore platforms based on low-power CPUs and distributed memory
    SoC nodes integrating high-speed networking and parallel processing
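As a back-of-the-envelope reading of the Dally figures above, the three factors can be multiplied out. This assumes they are independent, compound multiplicatively, and share the baseline of the 1000x target, none of which the slide states:

```latex
\[
2.2 \times 3 \times 4 = 26.4,
\qquad
\frac{1000}{26.4} \approx 37.9
\]
```

Under those assumptions, technology, circuit design and architecture together give about 26x, so on these figures a factor of roughly 38x would have to come from elsewhere.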

First MPPA-256 Chips with CMOS 28nm TSMC
  High processing performance: 700 GOPS, 230 GFLOPS SP
  Low power consumption: 5W
  High execution predictability
  High-level programming models
  Available since November 2012
  PCIe Gen3, Ethernet 10G, NoCX

MPPA-256 Processor I/O Interfaces
  DDR3 memory interfaces
  PCIe Gen3 interface
  1G/10G/40G Ethernet interfaces
  SPI/I2C/UART interfaces
  Universal Static Memory Controller (NAND/NOR/SRAM)
  GPIOs with Direct NoC Access (DNA) mode
  NoC extension through Interlaken interface (NoC Express)

MPPA-256 Processor Hierarchical Architecture
  VLIW Core: Instruction Level Parallelism
  Compute Cluster: Thread Level Parallelism
  Manycore Processor: Process Level Parallelism

MPPA-256 Clustered Memory Architecture
  [Figure labels: Eth0, PCI0, PCI1 and Eth1 I/O subsystems]
  20 memory address spaces
    16 compute clusters
    4 I/O subsystems with direct access to external DDR memory
  Dual Network-on-Chip (NoC)
    Data NoC & Control NoC
    Full duplex links, 4B/cycle
    2D torus topology + extension links
    Unicast and multicast transfers
  Data NoC QoS
    Flow control and routing at source
    Guaranteed services by application of network calculus
    Oblivious synchronization
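To make the 2D torus topology concrete, here is a minimal C sketch of neighbor lookup and hop counting with wrap-around links. It assumes the 16 compute clusters are laid out as a 4 x 4 grid numbered row by row, and it ignores the I/O subsystems and the extension links; the names `torus_neighbor` and `torus_hops` are hypothetical and not part of any Kalray API.

```c
#include <assert.h>

/* Assumption: 16 compute clusters arranged as a 4 x 4 grid, row-major ids. */
enum { TORUS_DIM = 4, NUM_CLUSTERS = TORUS_DIM * TORUS_DIM };

typedef enum { DIR_NORTH, DIR_EAST, DIR_SOUTH, DIR_WEST } direction_t;

/* Cluster reached in one hop; links wrap around at the grid edges. */
int torus_neighbor(int cluster, direction_t dir)
{
    assert(cluster >= 0 && cluster < NUM_CLUSTERS);
    int row = cluster / TORUS_DIM;
    int col = cluster % TORUS_DIM;
    switch (dir) {
    case DIR_NORTH: row = (row + TORUS_DIM - 1) % TORUS_DIM; break;
    case DIR_SOUTH: row = (row + 1) % TORUS_DIM; break;
    case DIR_WEST:  col = (col + TORUS_DIM - 1) % TORUS_DIM; break;
    case DIR_EAST:  col = (col + 1) % TORUS_DIM; break;
    }
    return row * TORUS_DIM + col;
}

/* Distance along one ring, taking the shorter way around. */
static int ring_distance(int a, int b)
{
    int d = a > b ? a - b : b - a;
    return d > TORUS_DIM - d ? TORUS_DIM - d : d;
}

/* Minimal hop count between two clusters on the torus. */
int torus_hops(int src, int dst)
{
    assert(src >= 0 && src < NUM_CLUSTERS);
    assert(dst >= 0 && dst < NUM_CLUSTERS);
    return ring_distance(src / TORUS_DIM, dst / TORUS_DIM)
         + ring_distance(src % TORUS_DIM, dst % TORUS_DIM);
}
```

With this numbering, opposite corners of the grid (clusters 0 and 15) are two hops apart thanks to the wrap-around links, where a plain mesh would need six.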

MPPA Technology Compared to GPU & CPU
  MPPA-256: 20pJ/Instruction
    Optimized for Gflops/Watt
    Clustered Architecture with Explicit NoC Interfaces
  GPU: 200pJ/Instruction
  CPU: 2000pJ/Instruction
  Source: Bill Dally, To ExaScale and Beyond - NVidia
  From 10x to 100x better energy efficiency vs. CPU, GPU or FPGA

MPPA Architecture Compared to other Manycores
  NVIDIA, ATI, ARM generalize the GPU architecture into GP-GPU
    Streaming multiprocessors that share a cache and DDR memory
    Each streaming multiprocessor operates multi-threaded cores in SIMT
    CUDA or OpenCL data parallel kernel programming models
  Cavium, Tilera TILE-Gx, Intel MIC support shared coherent memory
    Thread-based parallel programming (POSIX threads, OpenMP)
    Non-uniform memory access (NUMA) times, challenging cache design
  Kalray MPPA extends the supercomputer clustered architecture
    Clustered memory architecture scales to > 1M cores (BlueGene/Q)
    Low energy per operation, high execution predictability
    Stand-alone configurations, low-latency processing

Kalray Software Development Kit
  MPPA ACCESSCORE, MPPA ACCESSLIB
  Timeline labels: Today, Q4 2013
  Standard C/C++ Programming Environment
  Simulators, Profilers, Debuggers & System Trace
  Operating Systems & Device Drivers
  Dataflow Programming (FPGA Style)
  POSIX-Level Programming (DSP Style)
  Streaming Programming (GPU Style)

Dataflow Programming Environment
  Computation blocks and communication graph written in plain C
  Supports task parallelism and data parallelism
  Cyclostatic data production & consumption
  Dynamic dataflow extensions
  Automatic mapping on MPPA memory, computing, & communication resources
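Cyclostatic production and consumption means that each port of a computation block moves a number of tokens that cycles through a fixed pattern from one firing to the next. The following C sketch illustrates the idea on a single producer/consumer channel. It is not the Kalray dataflow language or runtime: the `csdf_port_t` type, the function names and the simple one-firing-per-round schedule are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One end of a channel: token rates repeat with period `phases`. */
typedef struct {
    const int *rates;   /* tokens moved per firing, one entry per phase */
    size_t     phases;  /* length of the cyclic rate pattern */
    size_t     firing;  /* number of firings performed so far */
} csdf_port_t;

/* Token count for the port's next firing (does not advance the phase). */
int csdf_current_rate(const csdf_port_t *port)
{
    assert(port != NULL && port->phases > 0);
    return port->rates[port->firing % port->phases];
}

/* Tokens moved over one complete cycle of the rate pattern. */
int csdf_cycle_total(const csdf_port_t *port)
{
    assert(port != NULL && port->phases > 0);
    int total = 0;
    for (size_t i = 0; i < port->phases; i++) {
        total += port->rates[i];
    }
    return total;
}

/* Run `rounds` rounds: the producer fires once per round, then the consumer
 * fires once if the channel holds enough tokens for its current phase.
 * Returns the peak number of tokens buffered in the channel. */
int csdf_peak_tokens(csdf_port_t *producer, csdf_port_t *consumer, int rounds)
{
    int tokens = 0;
    int peak = 0;
    for (int r = 0; r < rounds; r++) {
        tokens += csdf_current_rate(producer);
        producer->firing++;
        if (tokens > peak) {
            peak = tokens;
        }
        int need = csdf_current_rate(consumer);
        if (tokens >= need) {
            tokens -= need;
            consumer->firing++;
        }
    }
    return peak;
}
```

For a producer with pattern {2, 0, 1} and a consumer with pattern {1, 2}, both ports move 3 tokens per cycle, and six rounds of this schedule never buffer more than 2 tokens. Because the rates are known statically, a mapping tool can bound buffer sizes ahead of time in this way.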

POSIX-Level Programming Environment
  POSIX-like process management
    Spawn 16 processes from the I/O subsystem
    Process execution on the 16 clusters starts with main(argc, argv) and environment
  Inter Process Communication (IPC)
    POSIX file descriptor operations on NoC Connectors
    Extension to the PCIe interface with the PCI Connectors
    Rich communication and synchronization
  Multi-threading inside clusters
    Standard GCC/G++ OpenMP support
      #pragma for thread-level parallelism
      Compiler automatically creates threads
    POSIX threads interface
      Explicit thread-level parallelism
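The OpenMP route to thread-level parallelism inside a cluster can be sketched with standard OpenMP directives only. The example below uses no Kalray-specific call. The thread count of 16 is an assumption chosen to match the 16 user cores per cluster, and `sum_of_squares` is a hypothetical workload.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one thread per user core of a compute cluster. */
enum { CLUSTER_THREADS = 16 };

/* Sum of squares of v[0..n-1]. The pragma asks an OpenMP compiler to split
 * the loop iterations across threads and to combine the per-thread partial
 * sums; a compiler without OpenMP enabled ignores the pragma and runs the
 * loop serially with the same result. */
long sum_of_squares(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    long total = 0;
    #pragma omp parallel for reduction(+:total) num_threads(CLUSTER_THREADS)
    for (long i = 0; i < (long)n; i++) {
        total += (long)v[i] * v[i];
    }
    return total;
}
```

With GCC, the pragma takes effect when compiling with `-fopenmp`. The compiler then creates and joins the threads itself, whereas the POSIX threads interface leaves thread creation and work splitting to the programmer.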

Programming Environments Highlights
  Accommodate cluster memory size
    Automatic partition, place & route of dataflow programs
    GCC OpenMP support for the 16 user cores of a cluster
  Various programming models, from Embedded to HPC
    Cyclostatic dataflow, from the KPN family of programming models
    Communication by Sampling (CbS) for the time-triggered architecture
    Bulk Synchronous Parallel (BSP) model based on Oxford BSPlib
    Lightweight implementation of MPI (Message Passing Interface)
  OpenCL with task parallel model and distributed shared memory
    Leverage the MMU on each core to paginate DDR memory on chip

MPPA ACCESSLIB optimized building blocks
  Provide Kalray core optimized building blocks for each scope
    MPPA Core: register file & cache
    MPPA Cluster: shared memory
    MPPA Partition: distributed memory
  Delivered as C libraries
    Dataflow programming
    POSIX-level programming
  Numerical computing
    FFT, filtering and convolution
    BLAS-level primitives
    Matrix factorizations
    libm extensions
  Video and image processing
    H264, HEVC encode / decode
    Computer vision
  [Roadmap table, Q3 2013 to Q2 2014, by scope (Core, Cluster, Partition): DSPLIB, H264 codec, HEVC codec, DSPLIB, Matmul, FFT, BLAS, OpenCV]

Target Application Areas
  INTENSIVE COMPUTING: Finance, Numerical Simulation, Geophysics, Life sciences
  IMAGE & VIDEO: Broadcast, Medical Imaging, Digital Cinema, Augmented reality, Vision
  EMBEDDED SYSTEMS: Signal Processing, Aerospace/Defence, Transport, Industrial Automation, Video Protection
  TELECOM / NETWORKING: Packet Switching, Network Optimisation, Security Services, Software Defined Radio, Software Defined Network

Computational Finance Application
  Option pricing by Monte Carlo method
  Optimized pseudo random generator
  Parallel Map / Reduce scheme across multiple MPPA processors
  Optimized mathematical primitives for Kalray core
  Power efficiency 5x better than recent GPU
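The map/reduce structure of Monte Carlo option pricing can be sketched in plain C as follows. This is an illustration under stated assumptions, not Kalray's implementation: the price follows a simple up/down multiplicative walk so that no math-library call is needed, the generator is a textbook xorshift32 standing in for the optimized generator mentioned above, and `option_t`, `mc_map` and `mc_reduce` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in generator (Marsaglia xorshift32). The state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Assumed model: at each step the price is multiplied by `up` with
 * probability `p_up`, otherwise by `down`. */
typedef struct {
    double s0, strike, up, down, p_up, discount;
    int    steps;
} option_t;

/* Map: one worker simulates `paths` price paths from its own seed and
 * returns the sum of the call payoffs max(S - strike, 0). */
double mc_map(const option_t *opt, uint32_t seed, int paths)
{
    assert(opt != NULL && seed != 0 && paths > 0);
    uint32_t state = seed;
    double payoff_sum = 0.0;
    for (int p = 0; p < paths; p++) {
        double s = opt->s0;
        for (int t = 0; t < opt->steps; t++) {
            double u = xorshift32(&state) / 4294967296.0; /* in (0, 1) */
            s *= (u < opt->p_up) ? opt->up : opt->down;
        }
        payoff_sum += (s > opt->strike) ? s - opt->strike : 0.0;
    }
    return payoff_sum;
}

/* Reduce: combine the workers' partial sums into a discounted mean payoff. */
double mc_reduce(const option_t *opt, const double *partial,
                 int workers, int paths_per_worker)
{
    assert(opt != NULL && partial != NULL);
    assert(workers > 0 && paths_per_worker > 0);
    double total = 0.0;
    for (int w = 0; w < workers; w++) {
        total += partial[w];
    }
    return opt->discount * total / ((double)workers * paths_per_worker);
}
```

Each worker (a core, a cluster or a whole processor) would call `mc_map` with a distinct seed, and only one number per worker has to travel to the reducer. That small communication volume is what makes the scheme easy to spread across several processors.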

Audio Processing Application
  Increase performance and reduce audio system cost
  Multi-channel processing
    256 VLIW cores ~ 500 low-end DSPs
  Channel routing and control
    High performance NoC + 32 integrated DMAs
  System integration
    Up to 8 x Ethernet 1GbE
  Low-latency audio processing
    500µs latency from input to output samples
  Cost effective
    Equivalent to complex multi DSPs + FPGAs system

Video Broadcasting Example
  High definition H264 encoder on one MPPA-256
  System integration, lower power and cost
  Heterogeneous implementation
  Flexibility & scalability
  H264 encoder running on MPPA-256 at less than 6W

Video Protection Example
  Improved content analysis
    High resolution camera / low false detection rate
    Robust algorithms: high performance computing of MPPA
    Real-time detection
  Simpler infrastructure
    Compute power at the source
    System integration: Ethernet input / decode / content analysis / encode
    Multi-camera system on single chip
  [Figure: 10GbE input to the MPPA, with several decode and content analysis stages]

Augmented Reality Example
  Assisted operation & maintenance
    ARMAR (Augmented Reality for Maintenance and Repair)
  Assisted conformity control

Signal Processing Example
  Radar applications: STAP, beamforming
  Sonar, echography
  Software Defined Radio (SDR)
  Dedicated libraries (FFT, FTFR, ...)
  Well suited to massively parallel architectures
  Alternative to embedded DSP + FPGA platforms

High-Speed VPN Gateway Example
  Evaluation for the implementation of a 20 to 40 Gb/s VPN gateway
    IP packet processing
    AES cryptography
  Exploit key features of the MPPA architecture
    2 x 40 Gb/s Ethernet interfaces (or 8 x 10 Gb/s)
    PCIe Gen 3 for integration
    Optimized instructions for efficient cryptography
    NoC extension interface for multi-chip solutions

KALRAY, a global solution
  Powerful, low power and programmable processors
  C/C++ based Software Development Kit (SDK) for massively parallel programming
  Development platform
    Reference Design Board
  BOARDS
    Reference Design board
    Application specific boards
    Multi-MPPA or Single-MPPA boards

Kalray Offices
  Headquarters, Paris area
    86 rue de Paris, 91 400 Orsay, France
    Tel: +33 (0)1 69 29 08 16
    email: info@kalray.eu
  Grenoble office
    445 rue Lavoisier, 38 330 Montbonnot Saint Martin, France
    Tel: +33 (0)4 76 18 09 18
    email: info@kalray.eu
  Japan office
    CVML, 3-22-1, Toranomon, Minato-ku, Tokyo 105-0001, Japan
    Tel: 080-4660-2122
    email: ksugiyama@kalray.eu
All trademarks, service marks, and trade names are the marks of the respective owner(s), and any unauthorized use thereof is strictly prohibited. All terms and prices are indicative and subject to modification without notice.