Architecture and Programming Models of the Kalray MPPA Single Chip Manycore Processor. Benoît Dupont de Dinechin MCC 2013, Halmstad, Sweden

Size: px

Start display at page:

Download "Architecture and Programming Models of the Kalray MPPA Single Chip Manycore Processor. Benoît Dupont de Dinechin MCC 2013, Halmstad, Sweden"

Jane Lawson
5 years ago
Views:

1 Architecture and Programming Models of the Kalray MPPA Single Chip Manycore Processor Benoît Dupont de Dinechin MCC 2013, Halmstad, Sweden

3 KALRAY MPPA, a global solution High performance, low power and programmable massively parallel processors C/C++ based Software Development Kit (SDK) for massively parallel programing Development platform Ready to develop BOARDS Reference design boards Application specific boards Single or Multi-MPPA boards 2013 Kalray SA All Rights Reserved MCC13 November

MPPA -256 Processor with 28nm CMOS TSMC 256 Processing Engine cores + 32 Resource Management cores Processing performance 700 GOPS 230 GFLOPS (400MHz) Power efficiency 5W to 15W (10W typical) Timing

4 MPPA -256 Processor with 28nm CMOS TSMC 256 Processing Engine cores + 32 Resource Management cores Processing performance 700 GOPS 230 GFLOPS (400MHz) Power efficiency 5W to 15W (10W typical) Timing predictability DDR3, PCI Gen3, Ethernet 10G Architecture scalability Processor tiling through NoC extensions In production High level programming Advanced debugging and tracing 2013 Kalray SA All Rights Reserved MCC13 November

MPPA DEVELOPER Workstation Develop, optimize and evaluate applications Incremental transition from CPU code to MPPA code Ready to develop configuration (no specific set-up) PCIe board MPPA -256

5 MPPA DEVELOPER Workstation Develop, optimize and evaluate applications Incremental transition from CPU code to MPPA code Ready to develop configuration (no specific set-up) PCIe board MPPA -256 Processor PCIe board for debug/probe Intel core I7 CPU 3.6GHz, Linux OS MPPA ACCESSCORE SDK installed Compatible with Multi MPPA board Additional services: Extranet access Support Team access Getting started training SDK maintenance 2013 Kalray SA All Rights Reserved MCC13 November

6 MPPA -256 PCIe Application Board AB01 Connect to the 4 I/O subsystems 2 PCIe GEN3 x8 interfaces through a x16 PCIe switch 2 DDR3 interfaces 4 Ethernet interfaces (2x10G + 2x1G) 1 NoCX interface NOR flash, GPIOs, leds, buttons, extensions & debug connectors 2013 Kalray SA All Rights Reserved MCC13 November

7 MPPA -256 EMB01 Board EMB01 is the perfect continuation of MPPA DEVELOPER as the application can run on both supports. EMB01 enables to prototype application previously develop on MPPA DEVELOPER 2013 Kalray SA All Rights Reserved MCC13 November

8 MPPA Board Products Available by Q Multi-MPPA MPPA -Based Server Power efficient hardware acceleration Single-MPPA Embedded systems USB Desktop Acceleration Accessory Development & embedded applications 2013 Kalray SA All Rights Reserved MCC13 November

9 MPPA Processor Roadmap MPPA Low Power 12W V1 V1.1 V1.2 V2 MPPA -256 Low Power 5W / 2.6W MPPA -64 Very Low Power 1.8W / 0.6W Idle 75mW 28nm 16nm 1 st generation 25 GFLOPS/W 400MHz 2 nd generation 50 GFLOPS/W 700MHz 3 rd generation 75 GFLOPS/W 1GHz 2013 Kalray SA All Rights Reserved MCC13 November

10 TSMC CMOS Technology Node Roadmap 28HP 20SOC 16FF Speed Area Power Transistor HKMG* HKMG* FINFET Process 2P/2E* 3P/3E + 3D * HKMG : High K Metal Gate * 2P/2E : double patterning/double exposure 2013 Kalray SA All Rights Reserved MCC13 November

The End of Dennard MOSFET Scaling Theory Since 2005 (90nm), frequency stagnates

13 Manycore Challenges on Next Technology Nodes Dark Silicon Projection (Esmaeilzadeh et al. CACM 2013) At 8nm, over 50% of the chip will be dark and cannot be utilized Based on Device x Core x Multicore models Multicore model assumes x86 CPU or GPU architecture Dally on Future Challenges of Large-Scale Computing (ISC 2013) Exascale computing requires1000x improvement in energy efficiency By 2020: technology => 2.2x, circuit design => 3x, architecture => 4x Power goes into moving data around communication dominates power Not considered above: principles exploited by the MPPA Manycore platforms based on low-power CPUs and distributed memory SoC nodes integrating high-speed networking and parallel computing 2013 Kalray SA All Rights Reserved MCC13 November

Supercomputer Clustered Memory Architecture IBM BlueGene series, Cray XT series Compute nodes with multiple cores and shared memory I/O nodes with high-speed

14 Supercomputer Clustered Memory Architecture IBM BlueGene series, Cray XT series Compute nodes with multiple cores and shared memory I/O nodes with high-speed devices and a Linux operating system Specialized networks between the compute nodes and the I/O nodes 2013 Kalray SA All Rights Reserved MCC13 November

15 MPPA -256 Processor Hierarchical Architecture VLIW Core Compute Cluster Manycore Processor Instruction Level Parallelism Thread Level Parallelism Process Level Parallelism 2013 Kalray SA All Rights Reserved MCC13 November

MPPA -256 VLIW Core Architecture 5-issue VLIW architecture Predictability & energy efficiency 32-bit/64-bit IEEE 754 FPU MMU for rich OS support Data processing code Byte alignment for all memory

16 MPPA -256 VLIW Core Architecture 5-issue VLIW architecture Predictability & energy efficiency 32-bit/64-bit IEEE 754 FPU MMU for rich OS support Data processing code Byte alignment for all memory accesses Standard & effective FPU with FMA Configurable bitwise logic unit Hardware looping System & control code MMU single memory port no function unit clustering Execution predictability Fully timing compositional core LRU caches, low miss penalty Energy and area efficiency 7-stage instruction pipeline, 400MHz Idle modes and wake-up on interrupt 2013 Kalray SA All Rights Reserved MCC13 November

MPPA -256 Compute Cluster 16 PE cores + 1 RM core NoC Tx and Rx interfaces Debug Support Unit (DSU) 2 MB of shared memory Multi-banked parallel memory 16 banks with independent arbitrer 38,4GB/s of

17 MPPA -256 Compute Cluster 16 PE cores + 1 RM core NoC Tx and Rx interfaces Debug Support Unit (DSU) 2 MB of shared memory Multi-banked parallel memory 16 banks with independent arbitrer 38,4GB/s of Reliability ECC in the shared memory Parity check in the caches Faulty cores can be switched off Predictability Multi-banked address mapping either interleaved (64B) or blocked (128KB) Low power Memory banks with low power mode Voltage scaling 2013 Kalray SA All Rights Reserved MCC13 November

MPPA -256 Clustered Memory Architecture Explicitly addressed NoC with AFDX-like guaranteed services Eth0 I/O Subsystem PCI0 I/O Subsystem PCI1 I/O Subsystem Eth1 I/O Subsystem 20 memory address

18 MPPA -256 Clustered Memory Architecture Explicitly addressed NoC with AFDX-like guaranteed services Eth0 I/O Subsystem PCI0 I/O Subsystem PCI1 I/O Subsystem Eth1 I/O Subsystem 20 memory address spaces 16 compute clusters 4 I/O subsystems with direct access to external DDR3 memory Dual Network-on-Chip (NoC) Data NoC & Control NoC Full duplex links, 4B/cycle 2D torus topology + extension links Unicast and multicast transfers Data NoC QoS Flow control and routing at source Guaranteed services by application of network calculus Oblivious synchronization 2013 Kalray SA All Rights Reserved MCC13 November

19 MPPA -256 Data NoC Guaranteed Services Source traffic regulation using (σ, ρ) A packet flow obeys (σ, ρ) if for any time interval τ the number of packets is not greater than σ + ρτ The initial (σ, ρ) is set at the Tx Data NoC interface Packets σ+ρ t σ σ Time 2013 Kalray SA All Rights Reserved MCC13 November

Controller (NAND/NOR/SRAM) GPIOs with Direct NoC Access (DNA) mode NoC extension

20 MPPA -256 Processor I/O Interfaces DDR3 Memory interfaces PCIe Gen3 interface 1G/10G/40G Ethernet interfaces SPI/I2C/UART interfaces Universal Static Memory Controller (NAND/NOR/SRAM) GPIOs with Direct NoC Access (DNA) mode NoC extension through Interlaken interface (NoC Express) 2013 Kalray SA All Rights Reserved MCC13 November

8V pins Indirect through low-cost FPGA Input directed to a Tx packet shaper on I/O subsystem 1 Sequential Data NoC Tx configuration Data

21 MPPA -256 Direct NoC Access (DNA) AB01 board 4CGX board MPPA256 NoC connection to GPIO Stimuli generator Data sourcing HSMC interface Full-duplex bus on 8/16/24 bits + notification + ready + full bits Maximum MHz Direct to the GPIO 1.8V pins Indirect through low-cost FPGA Input directed to a Tx packet shaper on I/O subsystem 1 Sequential Data NoC Tx configuration Data processing Standard data NoC Rx configuration Application flow control to GPIO Input data decounting Communication by sampling 2013 Kalray SA All Rights Reserved MCC13 November

MPPA -256 Sample Use of NoC Extensions (NoCX) AB01 board MPPA256 HSMC interface 4SGX board IO DDR-lite Mapping of IO DDR-lite on FPGA Altera 4SGX530

5MHz Single NoC plug 4K video through HDMI emitters Interlaken configured in Rx & Tx 1.

22 MPPA -256 Sample Use of NoC Extensions (NoCX) AB01 board MPPA256 HSMC interface 4SGX board IO DDR-lite Mapping of IO DDR-lite on FPGA Altera 4SGX530 development board Interlaken sub-system (x3 lanes up to 2.5-Gbit/sec) 300MHz DDR3 x1 RM + x1 DMA Kbyte 62.5MHz Single NoC plug 4K video through HDMI emitters Interlaken configured in Rx & Tx 1.6-Gbit/sec effective data NoC bandwidth reached Limiting factor is the FPGA device internal frequency Effective = 62.5MHz * 32-bit * 80% Output of uncompressed 1080p 60-frame/sec 2013 Kalray SA All Rights Reserved MCC13 November

23 Measuring Power and Energy Consumption k1-power tool for power measurement from command line $ k1-power -h -ofile : output file -gformat : plot in FORMAT -s : standalone mode $ k1-power oplot_double gpdf \ --./matmult_double_host_hw \ Time : s Power (avg): W Energy : J $ display plot_double.pdf libk1power.so shared library, control measure from user code int k1_measure_callback_function(int (*cb_function) (float time, double power)); int k1_measure_start(const char *output_filename); int k1_measure_stop(k1_measure_t *measures); typedef struct { float time; float energy; double power; } k1_measure_t; 2013 Kalray SA All Rights Reserved MCC13 November

Kalray Software Development Kit MPPA ACCESSCORE MPPA ACCESSLIB Today Q4 2013 Standard

Debuggers & System Trace POSIX-Level Programming DSP Style Operating Systems & Device

25 Kalray Software Development Kit MPPA ACCESSCORE MPPA ACCESSLIB Today Q Standard C/C++ Programming Environment Dataflow Programming FPGA Style Simulators, Profilers, Debuggers & System Trace POSIX-Level Programming DSP Style Operating Systems & Device Drivers Streaming Programming GPU Style 2013 Kalray SA All Rights Reserved MCC13 November

7 (GNU Compiler Collection), with C89, C99, C++ and Fortran GNU Binary utilities 2011 (assembler, linker, objdump,

26 Standard C/C++ Programming Environment Eclipse Based IDE and Linux-style command line Full GNU C/C++ development tools for the Kalray VLIW core GCC 4.7 (GNU Compiler Collection), with C89, C99, C++ and Fortran GNU Binary utilities 2011 (assembler, linker, objdump, objcopy, etc.) GDB (GNU Debugger) version 7.3, with multi-threaded code debugging Standard C libraries: uclibc, Newlib + optimized libm functions 2013 Kalray SA All Rights Reserved MCC13 November

Simulators, Debuggers & System Trace Platform simulators Cycle-accurate, 400KHz per core Software trace visualization tool Performance view using standard Linux

acquisition Each cluster DSU is able to generate 200Mb/s of trace 8 simultaneous observable trace flows out of 20 sources System trace display Based on Linux

27 Simulators, Debuggers & System Trace Platform simulators Cycle-accurate, 400KHz per core Software trace visualization tool Performance view using standard Linux kcachegrind & wireshark Platform debuggers GDB-based, follow all the cores Debug routines of each core activate the tap controller for JTAG output System trace acquisition Each cluster DSU is able to generate 200Mb/s of trace 8 simultaneous observable trace flows out of 20 sources System trace display Based on Linux Trace Toolkit NG (low latency, on demand activation) System trace viewer (customized view per programming model) 2013 Kalray SA All Rights Reserved MCC13 November

NodeOS for compute clusters Operating Systems & Device Drivers Common services for POSIX-Level and OpenCL programming The RM executes system code and manages the NoC interfaces The PEs execute user

28 NodeOS for compute clusters Operating Systems & Device Drivers Common services for POSIX-Level and OpenCL programming The RM executes system code and manages the NoC interfaces The PEs execute user threads, and forward system calls to the RM RTEMS executive on I/O Subsystems Implements the POSIX threads interface on single core Supports console, SPI, I2C, RAM file system TCP/IP stack adapted from FreeBSD release 6.0 Linux kernel on I/O Subsystems Kalray core booting Linux demonstrated at ELCE 2013, Edinburgh Planned release of Linux kernel + MPPA I/O drivers in 2014Q Kalray SA All Rights Reserved MCC13 November

dataflow extensions Language called Sigma-C Automatic mapping on MPPA memory,

29 Dataflow Programming Environment Computation blocks and communication graph written in C Cyclostatic data production & consumption Firing thresholds of Karp & Miller Dynamic dataflow extensions Language called Sigma-C Automatic mapping on MPPA memory, computing, & communication resources 2013 Kalray SA All Rights Reserved MCC13 November

30 Dataflow Models of Computation Dataflow Process Networks (DPN) [Lee & Parks 1995] Kahn Process Network with functional actors (no persistent state) and sequential firing rules (pre-defined order using only blocking reads) Static Dataflow (SDF) [Lee & Messerschmitt 1987] Agents producing and consuming a constant number of tokens Single-rate SDF is also known as Homogenous SDF (HSDF) Synchronous Dataflow (SDF) [Benveniste et al. 1994] Clocks are associated with tokens carried by the channels Cyclo-Static Dataflow (CSDF) [Lauwereins 1994] A cyclic state machine unconditionally advances at each firing Known number of tokens produced and consumed for each state Computational Process Networks (CPN) [Karp & Miller 1966] SDF extended with firing thresholds 2013 Kalray SA All Rights Reserved MCC13 November

31 agent Inverter() { } Sigma-C Agent Example interface { in<unsigned char> input; /*< input byte stream */ out<unsigned char> output; /*< output byte stream */ } spec{input; output}; void invert (void) exchange (input pel_in, output pel_out) { pel_out = pel_in; } void start () { invert(); } agent keyword followed by the name of the agent interface section for input/ output channels standard C code within the agent start function is an infinite loop state machine specification for data production & consumption exchange keyword flags direct operations on input / output channels 2013 Kalray SA All Rights Reserved MCC13 November

POSIX-like process management POSIX-Level Programming Environment Spawn 16 processes from the I/O subsystem Process execution on the 16 clusters start with main(argc, argv) and environment Inter

32 POSIX-like process management POSIX-Level Programming Environment Spawn 16 processes from the I/O subsystem Process execution on the 16 clusters start with main(argc, argv) and environment Inter Process Communication (IPC) POSIX file descriptor operations on NoC Connectors Inspired by supercomputer communication and synchronization primitives Multi-threading inside clusters Standard GCC/G++ OpenMP support #pragma for thread-level parallelism Compiler automatically creates threads POSIX threads interface Explicit thread-level parallelism 2013 Kalray SA All Rights Reserved MCC13 November

33 POSIX IPC with NoC Connectors Build on the pipe & filters software component model Processes are the atomic software components NoC objects operated through file descriptors are the connectors Connector Purpose Tx:Rx Endpoints Resources Sync Half synchronization barrier N:1, N:M (multicast) CNoC Portal Remote memory window N:1, N:M (multicast) DNoC Sampler Remote circular buffer 1:1, 1:M (multicast) DNoC RQueue Remote atomic enqueue N:1 DNoC+CNoC Channel Zero-copy rendez-vous 1:1 DNoC+CNoC Synchronous operations: open(), close(), ioctl(), read(), write(), pwrite() Asynchronous I/O operations on Portal, Sampler, RQueue Based on aio_read(), aio_error(), aio_return() NoC Tx DMA engine activated by aio_write() 2013 Kalray SA All Rights Reserved MCC13 November

34 POSIX-Level & Distributed Computing Split-phase barrier (master-slave implementation) Arrival phase by using the Sync connector in N:1 mode Departure phase by using the Sync connector in 1:M mode Active message server Use RQueue connector with asynchronous call-back on aio_read() Call-back function executes on the RM and must rearm aio_read() Split-C put() and get() remote memory operations Remote put() mapped to Portal connector operations Remote get() implemented by active messages that operate a Portal Rendez-vous Channel connector with synchronous read() and write() Loosely Time Triggered Sampler connector supports 1:M Communication by Sampling (CbS) 2013 Kalray SA All Rights Reserved MCC13 November

Standard OpenCL has two programming models Data parallel, with one work item per processing element (core) Task parallel, with one work item per compute unit (multiprocessor) In native kernels, may

35 Standard OpenCL has two programming models Data parallel, with one work item per processing element (core) Task parallel, with one work item per compute unit (multiprocessor) In native kernels, may use standard C/C++ static compiler (GCC) MPPA support of OpenCL OpenCL Programming Environment Task parallel model: one kernel per compute cluster Native kernel mode: clenqueuenativekernel() Standard task parallel mode: clenqueuetask() Emulate global memory with Distributed Shared Memory (DSM) Use the MMU on each core, assume no false sharing Use the MMU on each core, resolve false sharing like Treadmarks Data parallel model once LLVM is targeted to the Kalray VLIW core Currently use LLVM to generate C99, which is compiled with GCC 2013 Kalray SA All Rights Reserved MCC13 November

MPPA ACCESSLIB optimized application building blocks Application building blocks optimized at different scopes MPPA Core register file & cache MPPA Cluster shared memory MPPA Partition distributed

37 MPPA ACCESSLIB optimized application building blocks Application building blocks optimized at different scopes MPPA Core register file & cache MPPA Cluster shared memory MPPA Partition distributed memory Delivered as C libraries Dataflow programming POSIX-level programming Numerical and signal processing FFT, Filtering and convolution BLAS-level primitives VSIPL (Vector Signal Image) libm extensions with metalibm Video and image processing H264, HEVC encode / decode OpenCV Computer vision Scope Core Q3 Q4 Q1 Q2 DSPLIB H264 codec HEVC Codec Cluster DSPLIB VSIPL Partition Matmul, FFT PBLAS OpenCV 2013 Kalray SA All Rights Reserved MCC13 November

39 The MetaLibM Project Led by INRIA AriC with LIP6, LIRMM, CERN, Kalray If you are computing accurately enough with floating-point, you are probably computing too accurately Are you able to express what your code is supposed to compute? If so, MetaLibM generates the code just right for the application First objective: address the implementation of libm on a target Elementary functions such as exponential or trigonometric High-speed, correctly rounded, target-specific implementations Second objective: open the set of functions and contexts Mathematical expressions such as sin(omega*t+phi) Code specified as an arbitrary expression + interval + range Multiple Input Multiple Output including vectorized variants Filters specified by transfer function or frequency response Solutions to differential equations 2013 Kalray SA All Rights Reserved MCC13 November

40 Prototype MetaLibM Code Generator for Kalray Generate C source code with calls to Kalray FPU intrinsic functions Intrinsics for FMA with extended precision and fixed rounding Intrinsics for scaled conversions between integer and floating-point Balance polynomial evaluation for instruction-level parallelism Let the C compiler do the register allocation and instruction scheduling Sample use case: variant of IEEE sincospio2f Sincospio2f computes sinf+cosf of argument scaled by π/2 Here argument is in [0,1] and only 16 bit of accuracy is required Clock Cycles Speedup Newlib Metalibm scalar 86 3 Metalibm paired Kalray SA All Rights Reserved MCC13 November

42 Finance Customer Application Kernel time-to-completion (s) energy-to-completion (J) ,36 13,86 X10 better 1,44 0, , ,21 X21 better 86,27 19,17 GPU (C2075) X86 (Core MPPA V1 (4 x MPPA-256) MPPA V1.2 (4 x MPPA-256) time-to-completion (s) energy-to-completion (J) Kalray SA All Rights Reserved MCC13 November

43 Black-Scholes Benchmark ,17 4xMPPAv1 Gexp/s 171,77 4xMPPAv1.2 1 hour for the code to run on MPPA Single cluster, single core Need to replace the GNU Scientific Library RNG Less than 6 LOC to be changed 2 hours to parallelize Single cluster, 16 cores Same effort needed on x Kalray SA All Rights Reserved MCC13 November

44 HEVC Software Encoder Real time HEVC 1080p encoder on 1 MPPA-256 HEVC Encode on MPPA: Resolution # cores Power SD 48 (1/5 x MPPA-256) 1080p (HD) 30fps 256 (1x MPPA-256) 4K (UHD) 30fps 1024 (4x MPPA-256) <6W 7W 30W Other solutions (results found on the web) Platform Resolution Frame Rate CPU Load Intel I7 3.33GHz 4-Core (8 thread) Intel Xeon 2.66GHz 8-Core (16 thread) 720x480 30fps 45% 1280x720 30fps 65% 1920x1080 (HD) 30fps 130% Source: ACE Thought 2013 Kalray SA All Rights Reserved MCC13 November

integration Up to 8 x Ethernet 1GbE Low Latency audio processing 500µs latency from input to output samples

45 Audio Processing Application Increase performances and reduce total system cost Multi Channel processing 256 VLIW cores ~ 500 Low End DSPs Channel routing and control High performance NoC + 32 integrated DMAs System integration Up to 8 x Ethernet 1GbE Low Latency audio processing 500µs latency from input to output samples Cost effective Equivalent to complex multi DSPs + FPGAs system 2013 Kalray SA All Rights Reserved MCC13 November

47 MPPA Unique Blend of Computing Technologies Manycore Architecture Design VLIW core architecture design and implementation (STMicro, HP Labs) Floating-point unit design and validation (INRIA Aric / INSA Lyon) Time-predictable architecture principles (VERIMAG, U. Saarland) Network-on-Chip with service guarantees (CEA LETI) Tools and run-time support Cyclostatic Dataflow language and compiler (CEA LIST, UPMC LIP6) GCC compiler suite with OpenMP (INRIA Parkas, ARMINES) VLIW code optimizations (CEA DAM, STMicro, INRIA Compsys) Platform simulators, para-virtualization (TIMA laboratory) MetaLibM code generator for Kalray VLIW cores (INRIA Aric) OpenCL framework (Technological University Tampere) Implementation CMOS 28nm TSMC technology node, verification & validation flows 2013 Kalray SA All Rights Reserved MCC13 November

48 MPPA Platform Target Markets MPPA -256 processors in large-scale systems Kalray 80W for a 4xMPPA-256, 8xDDR3, 8x10Gb/s Ethernet Geophysics for oil & gas Video broadcasting & transcoding Homomorphic cryptography Biometric database acceleration MPPA -64 processors for cyber-physical computing VLIW cores, multi-banked clustered memory, NoC guaranteed services Software Defined Radio (SDR) Advanced Driver Assistance Systems (ADAS) Next steps (2014) Improve the micro-architecture for the MPPA -256 processor v1.2 Extend the Cyclostatic Dataflow progamming model for latency control Release a self-hosted OpenCL implementation for real-time processing 2013 Kalray SA All Rights Reserved MCC13 November

Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013

Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013 The End of Dennard MOSFET Scaling Theory 2013 Kalray SA All Rights Reserved MPSoC