Architecture and Programming Models of the Kalray MPPA Single Chip Manycore Processor. Benoît Dupont de Dinechin MCC 2013, Halmstad, Sweden



2 Outline
Kalray MPPA Products
MPPA MANYCORE Architecture
MPPA ACCESSLIB Toolchain
MetaLibM Function Generator
Some Application Examples
Conclusions
2013 Kalray SA All Rights Reserved MCC13 November

3 KALRAY MPPA, a global solution
High performance, low power and programmable massively parallel processors
C/C++ based Software Development Kit (SDK) for massively parallel programming
Development platform
  Ready to develop
Boards
  Reference design boards
  Application specific boards
  Single or Multi-MPPA boards

4 MPPA-256
Processor with 28nm CMOS TSMC
  256 Processing Engine cores + 32 Resource Management cores
Processing performance
  700 GOPS, 230 GFLOPS (400MHz)
Power efficiency
  5W to 15W (10W typical)
Timing predictability
DDR3, PCI Gen3, Ethernet 10G
Architecture scalability
  Processor tiling through NoC extensions
In production
High level programming
Advanced debugging and tracing

5 MPPA DEVELOPER Workstation
Develop, optimize and evaluate applications
Incremental transition from CPU code to MPPA code
Ready to develop configuration (no specific set-up)
  PCIe board with MPPA-256 Processor
  PCIe board for debug/probe
  Intel Core i7 CPU 3.6GHz, Linux OS
  MPPA ACCESSCORE SDK installed
  Compatible with Multi MPPA board
Additional services
  Extranet access
  Support Team access
  Getting started training
  SDK maintenance

6 MPPA-256 PCIe Application Board AB01
Connect to the 4 I/O subsystems
  2 PCIe GEN3 x8 interfaces through a x16 PCIe switch
  2 DDR3 interfaces
  4 Ethernet interfaces (2x10G + 2x1G)
  1 NoCX interface
NOR flash, GPIOs, LEDs, buttons, extensions & debug connectors

7 MPPA-256 EMB01 Board
EMB01 is the perfect continuation of MPPA DEVELOPER, as the same application can run on both platforms.
EMB01 makes it possible to prototype applications previously developed on MPPA DEVELOPER.

8 MPPA Board Products
Available by Q
Multi-MPPA
  MPPA-Based Server
  Power efficient hardware acceleration
Single-MPPA
  Embedded systems
  USB Desktop Acceleration Accessory
  Development & embedded applications

9 MPPA Processor Roadmap
[Roadmap chart: versions V1, V1.1, V1.2, V2; process nodes 28nm and 16nm]
Processors
  MPPA: Low Power, 12W
  MPPA-256: Low Power, 5W / 2.6W
  MPPA-64: Very Low Power, 1.8W / 0.6W
  Idle: 75mW
Generations
  1st generation: 25 GFLOPS/W, 400MHz
  2nd generation: 50 GFLOPS/W, 700MHz
  3rd generation: 75 GFLOPS/W, 1GHz

10 TSMC CMOS Technology Node Roadmap
[Table comparing the 28HP, 20SOC and 16FF nodes on Speed, Area and Power]
Transistor: HKMG*, HKMG*, FINFET
Process: 2P/2E*, 3P/3E + 3D
* HKMG: High-K Metal Gate
* 2P/2E: double patterning/double exposure


12 The End of Dennard MOSFET Scaling Theory
Since 2005 (90nm), frequency stagnates and power per area increases

13 Manycore Challenges on Next Technology Nodes
Dark Silicon Projection (Esmaeilzadeh et al. CACM 2013)
  At 8nm, over 50% of the chip will be dark and cannot be utilized
  Based on Device x Core x Multicore models
  Multicore model assumes x86 CPU or GPU architecture
Dally on Future Challenges of Large-Scale Computing (ISC 2013)
  Exascale computing requires 1000x improvement in energy efficiency
  By 2020: technology => 2.2x, circuit design => 3x, architecture => 4x
  Power goes into moving data around: communication dominates power
Not considered above: principles exploited by the MPPA
  Manycore platforms based on low-power CPUs and distributed memory
  SoC nodes integrating high-speed networking and parallel computing

14 Supercomputer Clustered Memory Architecture
IBM BlueGene series, Cray XT series
  Compute nodes with multiple cores and shared memory
  I/O nodes with high-speed devices and a Linux operating system
  Specialized networks between the compute nodes and the I/O nodes

15 MPPA-256 Processor Hierarchical Architecture
VLIW Core: Instruction Level Parallelism
Compute Cluster: Thread Level Parallelism
Manycore Processor: Process Level Parallelism

16 MPPA-256 VLIW Core Architecture
5-issue VLIW architecture
  Predictability & energy efficiency
  32-bit/64-bit IEEE 754 FPU
  MMU for rich OS support
Data processing code
  Byte alignment for all memory accesses
  Standard & effective FPU with FMA
  Configurable bitwise logic unit
  Hardware looping
System & control code
  MMU, single memory port, no function unit clustering
Execution predictability
  Fully timing compositional core
  LRU caches, low miss penalty
Energy and area efficiency
  7-stage instruction pipeline, 400MHz
  Idle modes and wake-up on interrupt

17 MPPA-256 Compute Cluster
16 PE cores + 1 RM core
NoC Tx and Rx interfaces
Debug Support Unit (DSU)
2 MB of shared memory
Multi-banked parallel memory
  16 banks with independent arbiter
  38.4 GB/s of bandwidth
Reliability
  ECC in the shared memory
  Parity check in the caches
  Faulty cores can be switched off
Predictability
  Multi-banked address mapping, either interleaved (64B) or blocked (128KB)
Low power
  Memory banks with low power mode
  Voltage scaling
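The two address mappings named on this slide can be pictured with a small C sketch. It only uses the figures given above (2 MB, 16 banks, 64-byte interleaving, 128 KB blocks); the function names and the exact decoding are assumptions for illustration, since the slide does not describe how the hardware selects a bank.

```c
#include <stdint.h>

/* Figures taken from the slide: 2 MB of shared memory split into 16 banks. */
enum {
    SMEM_BYTES       = 2 * 1024 * 1024,
    SMEM_BANKS       = 16,
    INTERLEAVE_BYTES = 64,                      /* interleaved granule */
    BLOCK_BYTES      = SMEM_BYTES / SMEM_BANKS  /* 128 KB per bank */
};

/* Hypothetical interleaved mapping: consecutive 64-byte lines rotate
 * across the 16 banks, spreading a linear access stream over all banks. */
unsigned bank_interleaved(uint32_t offset)
{
    return (offset / INTERLEAVE_BYTES) % SMEM_BANKS;
}

/* Hypothetical blocked mapping: each bank owns one contiguous 128 KB
 * range, so a core working in its own range stays on its own bank. */
unsigned bank_blocked(uint32_t offset)
{
    return (offset / BLOCK_BYTES) % SMEM_BANKS;
}
```

Under these assumptions, offsets 0 and 64 fall in banks 0 and 1 with the interleaved mapping but both stay in bank 0 with the blocked mapping, which is why the blocked form is the one listed under predictability.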

18 MPPA-256 Clustered Memory Architecture
Explicitly addressed NoC with AFDX-like guaranteed services
[Diagram labels: Eth0 I/O Subsystem, PCI0 I/O Subsystem, PCI1 I/O Subsystem, Eth1 I/O Subsystem]
20 memory address spaces
  16 compute clusters
  4 I/O subsystems with direct access to external DDR3 memory
Dual Network-on-Chip (NoC)
  Data NoC & Control NoC
  Full duplex links, 4B/cycle
  2D torus topology + extension links
  Unicast and multicast transfers
Data NoC QoS
  Flow control and routing at source
  Guaranteed services by application of network calculus
  Oblivious synchronization

19 MPPA-256 Data NoC Guaranteed Services
Source traffic regulation using (σ, ρ)
  A packet flow obeys (σ, ρ) if, for any time interval τ, the number of packets is not greater than σ + ρτ
  The initial (σ, ρ) is set at the Tx Data NoC interface
[Plot: packets versus time, bounded by the line σ + ρt with initial burst σ]
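A common software model of a (σ, ρ) constraint is a token bucket: the bucket holds at most σ credits and refills at ρ credits per cycle, so no interval of length τ can see more than σ + ρτ packets. The sketch below is such a model, not the hardware regulator of the Tx Data NoC interface; the type and function names are hypothetical, and ρ is written as a fraction rho_num / rho_den to stay in integer arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical (sigma, rho) regulator modelled as a token bucket. */
typedef struct {
    uint64_t sigma;       /* burst allowance, in packets */
    uint64_t rho_num;     /* sustained rate = rho_num / rho_den packets per cycle */
    uint64_t rho_den;
    uint64_t credit;      /* available credit, scaled by rho_den */
    uint64_t last_cycle;  /* cycle of the previous call */
} sr_regulator;

void sr_init(sr_regulator *r, uint64_t sigma, uint64_t rho_num, uint64_t rho_den)
{
    r->sigma = sigma;
    r->rho_num = rho_num;
    r->rho_den = rho_den;
    r->credit = sigma * rho_den;  /* the bucket starts full: a burst of sigma is allowed */
    r->last_cycle = 0;
}

/* Returns true if one packet may be injected at cycle `now`.
 * `now` must not decrease between calls. */
bool sr_try_send(sr_regulator *r, uint64_t now)
{
    uint64_t cap = r->sigma * r->rho_den;

    /* Refill at rate rho since the last call, saturating at sigma. */
    r->credit += (now - r->last_cycle) * r->rho_num;
    if (r->credit > cap)
        r->credit = cap;
    r->last_cycle = now;

    if (r->credit < r->rho_den)   /* less than one packet of credit left */
        return false;
    r->credit -= r->rho_den;      /* spend one packet of credit */
    return true;
}
```

With σ = 2 and ρ = 1/4, two packets pass back to back at cycle 0, the third is refused, and one more packet is admitted every four cycles afterwards.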

20 MPPA-256 Processor I/O Interfaces
DDR3 Memory interfaces
PCIe Gen3 interface
1G/10G/40G Ethernet interfaces
SPI/I2C/UART interfaces
Universal Static Memory Controller (NAND/NOR/SRAM)
GPIOs with Direct NoC Access (DNA) mode
NoC extension through Interlaken interface (NoC Express)

21 MPPA-256 Direct NoC Access (DNA)
[Diagram: AB01 board (MPPA256) connected through the HSMC interface to a 4CGX board (stimuli generator)]
NoC connection to GPIO
  Full-duplex bus on 8/16/24 bits + notification + ready + full bits
  Maximum MHz
  Direct to the GPIO 1.8V pins
  Indirect through low-cost FPGA
Data sourcing
  Input directed to a Tx packet shaper on I/O subsystem 1
  Sequential Data NoC Tx configuration
Data processing
  Standard data NoC Rx configuration
Application flow control to GPIO
  Input data decounting
  Communication by sampling

22 MPPA-256 Sample Use of NoC Extensions (NoCX)
[Diagram: AB01 board (MPPA256) connected through the HSMC interface to a 4SGX board (IO DDR-lite)]
Mapping of IO DDR-lite on FPGA
  Altera 4SGX530 development board
  Interlaken sub-system (x3 lanes up to 2.5-Gbit/sec)
  300MHz DDR3
  x1 RM + x1 DMA Kbyte 62.5MHz
  Single NoC plug
4K video through HDMI emitters
  Interlaken configured in Rx & Tx
  1.6-Gbit/sec effective data NoC bandwidth reached
  Limiting factor is the FPGA device internal frequency
  Effective = 62.5MHz * 32-bit * 80% = 1.6 Gbit/sec
  Output of uncompressed 1080p 60-frame/sec

23 Measuring Power and Energy Consumption
k1-power tool for power measurement from command line
  $ k1-power -h
    -ofile : output file
    -gformat : plot in FORMAT
    -s : standalone mode
  $ k1-power -oplot_double -gpdf \
      -- ./matmult_double_host_hw
    Time : s
    Power (avg): W
    Energy : J
  $ display plot_double.pdf
libk1power.so shared library, control measure from user code
  int k1_measure_callback_function(int (*cb_function)(float time, double power));
  int k1_measure_start(const char *output_filename);
  int k1_measure_stop(k1_measure_t *measures);
  typedef struct {
    float time;
    float energy;
    double power;
  } k1_measure_t;
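The library interface above suggests bracketing a region of user code between k1_measure_start() and k1_measure_stop(). The sketch below shows that pattern. The declarations are copied from the slide; everything else is an assumption: the helper measure_region(), the demo workload, the convention that 0 means success, and the HAVE_LIBK1POWER macro, which guards stand-in stubs so the sketch also builds on a host without the Kalray SDK.

```c
#include <stddef.h>

/* Declarations as printed on the slide; the real ones come with libk1power.so. */
typedef struct {
    float time;
    float energy;
    double power;
} k1_measure_t;
int k1_measure_start(const char *output_filename);
int k1_measure_stop(k1_measure_t *measures);

#ifndef HAVE_LIBK1POWER
/* Stand-in stubs, NOT the real library: they measure nothing and only
 * let the sketch link on a machine without the SDK. */
int k1_measure_start(const char *output_filename)
{
    (void)output_filename;
    return 0;
}
int k1_measure_stop(k1_measure_t *measures)
{
    measures->time = 0.0f;
    measures->energy = 0.0f;
    measures->power = 0.0;
    return 0;
}
#endif

/* Hypothetical helper: run `work` once between start and stop.
 * Assumes both library calls return 0 on success. */
int measure_region(void (*work)(void), const char *trace_file, k1_measure_t *out)
{
    if (work == NULL || out == NULL)
        return -1;
    if (k1_measure_start(trace_file) != 0)
        return -1;
    work();
    return k1_measure_stop(out) == 0 ? 0 : -1;
}

/* Hypothetical workload used to exercise the helper. */
void demo_work(void)
{
    volatile double acc = 0.0;
    for (int i = 0; i < 100000; i++)
        acc += i * 0.5;
}
```

On a system with the SDK, the stubs would be compiled out and the program linked against libk1power.so; the units of the returned fields are presumably those printed by the command-line tool (s, J, W).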


25 Kalray Software Development Kit
MPPA ACCESSCORE, MPPA ACCESSLIB
Today Q
Standard C/C++ Programming Environment
Simulators, Profilers, Debuggers & System Trace
Operating Systems & Device Drivers
Dataflow Programming (FPGA Style)
POSIX-Level Programming (DSP Style)
Streaming Programming (GPU Style)

26 Standard C/C++ Programming Environment
Eclipse Based IDE and Linux-style command line
Full GNU C/C++ development tools for the Kalray VLIW core
  GCC 4.7 (GNU Compiler Collection), with C89, C99, C++ and Fortran
  GNU Binary utilities 2011 (assembler, linker, objdump, objcopy, etc.)
  GDB (GNU Debugger) version 7.3, with multi-threaded code debugging
  Standard C libraries: uClibc, Newlib + optimized libm functions

27 Simulators, Debuggers & System Trace
Platform simulators
  Cycle-accurate, 400KHz per core
  Software trace visualization tool
  Performance view using standard Linux KCachegrind & Wireshark
Platform debuggers
  GDB-based, follow all the cores
  Debug routines of each core activate the TAP controller for JTAG output
System trace acquisition
  Each cluster DSU is able to generate 200Mb/s of trace
  8 simultaneous observable trace flows out of 20 sources
System trace display
  Based on Linux Trace Toolkit NG (low latency, on demand activation)
  System trace viewer (customized view per programming model)

28 Operating Systems & Device Drivers
NodeOS for compute clusters
  Common services for POSIX-Level and OpenCL programming
  The RM executes system code and manages the NoC interfaces
  The PEs execute user threads, and forward system calls to the RM
RTEMS executive on I/O Subsystems
  Implements the POSIX threads interface on single core
  Supports console, SPI, I2C, RAM file system
  TCP/IP stack adapted from FreeBSD release 6.0
Linux kernel on I/O Subsystems
  Kalray core booting Linux demonstrated at ELCE 2013, Edinburgh
  Planned release of Linux kernel + MPPA I/O drivers in 2014Q

29 Dataflow Programming Environment
Computation blocks and communication graph written in C
Cyclostatic data production & consumption
Firing thresholds of Karp & Miller
Dynamic dataflow extensions
Language called Sigma-C
Automatic mapping on MPPA memory, computing, & communication resources

30 Dataflow Models of Computation
Dataflow Process Networks (DPN) [Lee & Parks 1995]
  Kahn Process Network with functional actors (no persistent state) and sequential firing rules (pre-defined order using only blocking reads)
Static Dataflow (SDF) [Lee & Messerschmitt 1987]
  Agents producing and consuming a constant number of tokens
  Single-rate SDF is also known as Homogeneous SDF (HSDF)
Synchronous Dataflow (SDF) [Benveniste et al. 1994]
  Clocks are associated with tokens carried by the channels
Cyclo-Static Dataflow (CSDF) [Lauwereins 1994]
  A cyclic state machine unconditionally advances at each firing
  Known number of tokens produced and consumed for each state
Computational Process Networks (CPN) [Karp & Miller 1966]
  SDF extended with firing thresholds
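The cyclo-static model above can be made concrete with a few lines of C: an actor carries a cyclic list of phases, each with fixed token counts, and every firing advances to the next phase. This is a minimal sketch of the model only, with hypothetical names (csdf_phase, csdf_actor, csdf_fire); it is not how Sigma-C agents are compiled, and it tracks token counts rather than token values.

```c
#include <stddef.h>

/* One phase of the cyclic state machine: fixed token counts. */
typedef struct {
    unsigned consume;  /* tokens read from the input channel */
    unsigned produce;  /* tokens written to the output channel */
} csdf_phase;

/* A single-input, single-output cyclo-static actor. */
typedef struct {
    const csdf_phase *phases;
    size_t n_phases;
    size_t current;    /* index of the phase used by the next firing */
} csdf_actor;

/* Fire the actor once if the input channel holds enough tokens.
 * On firing, token counts are updated and the state machine advances
 * unconditionally to the next phase. Returns 1 if fired, 0 otherwise. */
int csdf_fire(csdf_actor *a, unsigned *in_tokens, unsigned *out_tokens)
{
    const csdf_phase *p = &a->phases[a->current];

    if (*in_tokens < p->consume)
        return 0;
    *in_tokens -= p->consume;
    *out_tokens += p->produce;
    a->current = (a->current + 1) % a->n_phases;
    return 1;
}
```

Because the counts per phase are known in advance, a tool can compute buffer sizes and a static schedule before execution; an SDF actor is the special case with a single phase.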

31 Sigma-C Agent Example
  agent Inverter()
  {
    interface {
      in<unsigned char> input;   /*< input byte stream */
      out<unsigned char> output; /*< output byte stream */
      spec{input; output};
    }
    void invert (void) exchange (input pel_in, output pel_out) {
      pel_out = pel_in;
    }
    void start () {
      invert();
    }
  }
agent keyword followed by the name of the agent
interface section for input/output channels
standard C code within the agent
start function is an infinite loop
state machine specification for data production & consumption
exchange keyword flags direct operations on input/output channels

32 POSIX-Level Programming Environment
POSIX-like process management
  Spawn 16 processes from the I/O subsystem
  Process execution on the 16 clusters starts with main(argc, argv) and environment
Inter Process Communication (IPC)
  POSIX file descriptor operations on NoC Connectors
  Inspired by supercomputer communication and synchronization primitives
Multi-threading inside clusters
  Standard GCC/G++ OpenMP support
    #pragma for thread-level parallelism
    Compiler automatically creates threads
  POSIX threads interface
    Explicit thread-level parallelism
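The OpenMP route mentioned above amounts to annotating an ordinary C loop and letting the compiler create the threads. The sketch below is plain OpenMP C, not MPPA-specific code; the function name cluster_dot and the choice of 16 threads (one per PE core of a cluster) are assumptions for illustration.

```c
#include <stddef.h>

enum { PE_PER_CLUSTER = 16 };  /* PE cores per compute cluster, per the slides */

/* Dot product parallelized with a single pragma. Built with OpenMP
 * enabled (-fopenmp with GCC), the iterations are shared among up to 16
 * threads and the partial sums are combined by the reduction clause.
 * Built without OpenMP, the pragma is ignored and the loop runs serially. */
double cluster_dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum) num_threads(PE_PER_CLUSTER)
    for (long i = 0; i < (long)n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

The POSIX threads interface listed on the same slide would express the same computation with explicit thread creation, per-thread partial sums and a join.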

33 POSIX IPC with NoC Connectors
Built on the pipe & filters software component model
  Processes are the atomic software components
  NoC objects operated through file descriptors are the connectors

Connector  Purpose                       Tx:Rx Endpoints       Resources
Sync       Half synchronization barrier  N:1, N:M (multicast)  CNoC
Portal     Remote memory window          N:1, N:M (multicast)  DNoC
Sampler    Remote circular buffer        1:1, 1:M (multicast)  DNoC
RQueue     Remote atomic enqueue         N:1                   DNoC+CNoC
Channel    Zero-copy rendez-vous         1:1                   DNoC+CNoC

Synchronous operations: open(), close(), ioctl(), read(), write(), pwrite()
Asynchronous I/O operations on Portal, Sampler, RQueue
  Based on aio_read(), aio_error(), aio_return()
  NoC Tx DMA engine activated by aio_write()

34 POSIX-Level & Distributed Computing
Split-phase barrier (master-slave implementation)
  Arrival phase by using the Sync connector in N:1 mode
  Departure phase by using the Sync connector in 1:M mode
Active message server
  Use RQueue connector with asynchronous call-back on aio_read()
  Call-back function executes on the RM and must rearm aio_read()
Split-C put() and get() remote memory operations
  Remote put() mapped to Portal connector operations
  Remote get() implemented by active messages that operate a Portal
Rendez-vous
  Channel connector with synchronous read() and write()
Loosely Time Triggered
  Sampler connector supports 1:M Communication by Sampling (CbS)

35 OpenCL Programming Environment
Standard OpenCL has two programming models
  Data parallel, with one work item per processing element (core)
  Task parallel, with one work item per compute unit (multiprocessor)
  In native kernels, may use standard C/C++ static compiler (GCC)
MPPA support of OpenCL
  Task parallel model: one kernel per compute cluster
    Native kernel mode: clEnqueueNativeKernel()
    Standard task parallel mode: clEnqueueTask()
  Emulate global memory with Distributed Shared Memory (DSM)
    Use the MMU on each core, assume no false sharing
    Use the MMU on each core, resolve false sharing like TreadMarks
  Data parallel model once LLVM is targeted to the Kalray VLIW core
    Currently use LLVM to generate C99, which is compiled with GCC

36 OpenCL Platform Mapping to MPPA Architecture
[Diagram: OpenCL platform model mapped onto the MPPA architecture; surviving labels: DDR, Host, Host Memory]

37 MPPA ACCESSLIB optimized application building blocks
Application building blocks optimized at different scopes
  MPPA Core: register file & cache
  MPPA Cluster: shared memory
  MPPA Partition: distributed memory
Delivered as C libraries
  Dataflow programming
  POSIX-level programming
Numerical and signal processing
  FFT, filtering and convolution
  BLAS-level primitives
  VSIPL (Vector Signal Image)
  libm extensions with metalibm
Video and image processing
  H264, HEVC encode / decode
  OpenCV computer vision
[Roadmap table by scope over quarters Q3, Q4, Q1, Q2]
  Core: DSPLIB, H264 codec, HEVC codec
  Cluster: DSPLIB, VSIPL
  Partition: Matmul, FFT, PBLAS, OpenCV


39 The MetaLibM Project
Led by INRIA AriC with LIP6, LIRMM, CERN, Kalray
  If you are computing accurately enough with floating-point, you are probably computing too accurately
  Are you able to express what your code is supposed to compute?
  If so, MetaLibM generates the code just right for the application
First objective: address the implementation of libm on a target
  Elementary functions such as exponential or trigonometric
  High-speed, correctly rounded, target-specific implementations
Second objective: open the set of functions and contexts
  Mathematical expressions such as sin(omega*t+phi)
  Code specified as an arbitrary expression + interval + range
  Multiple Input Multiple Output including vectorized variants
  Filters specified by transfer function or frequency response
  Solutions to differential equations

40 Prototype MetaLibM Code Generator for Kalray
Generate C source code with calls to Kalray FPU intrinsic functions
  Intrinsics for FMA with extended precision and fixed rounding
  Intrinsics for scaled conversions between integer and floating-point
  Balance polynomial evaluation for instruction-level parallelism
  Let the C compiler do the register allocation and instruction scheduling
Sample use case: variant of IEEE sincospio2f
  Sincospio2f computes sinf and cosf of the argument scaled by π/2
  Here the argument is in [0,1] and only 16 bits of accuracy are required

                   Clock Cycles  Speedup
  Newlib
  Metalibm scalar  86            3
  Metalibm paired
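One well-known way to balance polynomial evaluation for instruction-level parallelism is Estrin's scheme, which replaces Horner's single dependency chain with independent sub-expressions. The sketch below contrasts the two for a degree-7 polynomial. It is an illustration of the idea only: the slides do not say which scheme the generator emits, the function names are hypothetical, and plain C multiply-adds stand in for the Kalray FMA intrinsics.

```c
/* Horner evaluation of c[0] + c[1]*x + ... + c[7]*x^7:
 * seven multiply-adds, each waiting for the previous one. */
float poly7_horner(const float c[8], float x)
{
    float r = c[7];
    for (int i = 6; i >= 0; i--)
        r = r * x + c[i];
    return r;
}

/* Estrin evaluation of the same polynomial. The four pair terms are
 * independent of each other and of x2 and x4, so a multi-issue core
 * can overlap them; the dependency chain drops to three multiply-add
 * levels. */
float poly7_estrin(const float c[8], float x)
{
    float x2 = x * x;
    float x4 = x2 * x2;

    float p01 = c[1] * x + c[0];
    float p23 = c[3] * x + c[2];
    float p45 = c[5] * x + c[4];
    float p67 = c[7] * x + c[6];

    float q0 = p23 * x2 + p01;  /* c0 + c1 x + c2 x^2 + c3 x^3 */
    float q1 = p67 * x2 + p45;  /* c4 + c5 x + c6 x^2 + c7 x^3 */

    return q1 * x4 + q0;
}
```

The two functions compute the same polynomial but round in a different order, so their results can differ in the last bits for general inputs; a generator that targets a given accuracy has to account for that when it picks an evaluation scheme.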


42 Finance Customer Application Kernel
[Bar charts: time-to-completion (s) and energy-to-completion (J) for GPU (C2075), X86 (Core, MPPA V1 (4 x MPPA-256) and MPPA V1.2 (4 x MPPA-256); annotations "X10 better" and "X21 better"; surviving data labels: 13,86; 1,44; 86,27; 19,17]

43 Black-Scholes Benchmark
[Bar chart in Gexp/s: 4xMPPAv1 (label truncated to ",17") and 4xMPPAv1.2 (171,77)]
1 hour for the code to run on MPPA
  Single cluster, single core
  Need to replace the GNU Scientific Library RNG
  Less than 6 LOC to be changed
2 hours to parallelize
  Single cluster, 16 cores
  Same effort needed on x

44 HEVC Software Encoder
Real time HEVC 1080p encoder on 1 MPPA-256

HEVC Encode on MPPA:
Resolution        # cores               Power
SD                48 (1/5 x MPPA-256)   <6W
1080p (HD) 30fps  256 (1x MPPA-256)     7W
4K (UHD) 30fps    1024 (4x MPPA-256)    30W

Other solutions (results found on the web)
Platforms: Intel i7 3.33GHz 4-Core (8 thread), Intel Xeon 2.66GHz 8-Core (16 thread)
Resolution      Frame Rate  CPU Load
720x480         30fps       45%
1280x720        30fps       65%
1920x1080 (HD)  30fps       130%
Source: ACE Thought

45 Audio Processing Application
Increase performance and reduce total system cost
Multi Channel processing
  256 VLIW cores ~ 500 Low End DSPs
Channel routing and control
  High performance NoC + 32 integrated DMAs
System integration
  Up to 8 x Ethernet 1GbE
Low Latency audio processing
  500µs latency from input to output samples
Cost effective
  Equivalent to complex multi DSPs + FPGAs system


47 MPPA Unique Blend of Computing Technologies
Manycore Architecture Design
  VLIW core architecture design and implementation (STMicro, HP Labs)
  Floating-point unit design and validation (INRIA AriC / INSA Lyon)
  Time-predictable architecture principles (VERIMAG, U. Saarland)
  Network-on-Chip with service guarantees (CEA LETI)
Tools and run-time support
  Cyclostatic Dataflow language and compiler (CEA LIST, UPMC LIP6)
  GCC compiler suite with OpenMP (INRIA Parkas, ARMINES)
  VLIW code optimizations (CEA DAM, STMicro, INRIA Compsys)
  Platform simulators, para-virtualization (TIMA laboratory)
  MetaLibM code generator for Kalray VLIW cores (INRIA AriC)
  OpenCL framework (Tampere University of Technology)
Implementation
  CMOS 28nm TSMC technology node, verification & validation flows

48 MPPA Platform Target Markets
MPPA-256 processors in large-scale systems
  Kalray 80W for a 4xMPPA-256, 8xDDR3, 8x10Gb/s Ethernet
  Geophysics for oil & gas
  Video broadcasting & transcoding
  Homomorphic cryptography
  Biometric database acceleration
MPPA-64 processors for cyber-physical computing
  VLIW cores, multi-banked clustered memory, NoC guaranteed services
  Software Defined Radio (SDR)
  Advanced Driver Assistance Systems (ADAS)
Next steps (2014)
  Improve the micro-architecture for the MPPA-256 processor v1.2
  Extend the Cyclostatic Dataflow programming model for latency control
  Release a self-hosted OpenCL implementation for real-time processing


More information

The Challenges of System Design. Raising Performance and Reducing Power Consumption

The Challenges of System Design. Raising Performance and Reducing Power Consumption The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Design Space Exploration for Memory Subsystems of VLIW Architectures

Design Space Exploration for Memory Subsystems of VLIW Architectures E University of Paderborn Dr.-Ing. Mario Porrmann Design Space Exploration for Memory Subsystems of VLIW Architectures Thorsten Jungeblut 1, Gregor Sievers, Mario Porrmann 1, Ulrich Rückert 2 1 System

More information

Copyright 2014 Xilinx

Copyright 2014 Xilinx IP Integrator and Embedded System Design Flow Zynq Vivado 2014.2 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able

More information

Multimedia in Mobile Phones. Architectures and Trends Lund

Multimedia in Mobile Phones. Architectures and Trends Lund Multimedia in Mobile Phones Architectures and Trends Lund 091124 Presentation Henrik Ohlsson Contact: henrik.h.ohlsson@stericsson.com Working with multimedia hardware (graphics and displays) at ST- Ericsson

More information

EyeCheck Smart Cameras

EyeCheck Smart Cameras EyeCheck Smart Cameras 2 3 EyeCheck 9xx & 1xxx series Technical data Memory: DDR RAM 128 MB FLASH 128 MB Interfaces: Ethernet (LAN) RS422, RS232 (not EC900, EC910, EC1000, EC1010) EtherNet / IP PROFINET

More information

Designing with NXP i.mx8m SoC

Designing with NXP i.mx8m SoC Designing with NXP i.mx8m SoC Course Description Designing with NXP i.mx8m SoC is a 3 days deep dive training to the latest NXP application processor family. The first part of the course starts by overviewing

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics Overcoming the Memory System Challenge in Dataflow Processing Darren Jones, Wave Computing Drew Wingard, Sonics Current Technology Limits Deep Learning Performance Deep Learning Dataflow Graph Existing

More information

Tile Processor (TILEPro64)

Tile Processor (TILEPro64) Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware

More information

Freescale i.mx6 Architecture

Freescale i.mx6 Architecture Freescale i.mx6 Architecture Course Description Freescale i.mx6 architecture is a 3 days Freescale official course. The course goes into great depth and provides all necessary know-how to develop software

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

UFCETW-20-2 Examination Answer all questions in Section A (60 marks) and 2 questions from Section B (40 marks)

UFCETW-20-2 Examination Answer all questions in Section A (60 marks) and 2 questions from Section B (40 marks) Embedded Systems Programming Exam 20010-11 Answer all questions in Section A (60 marks) and 2 questions from Section B (40 marks) Section A answer all questions (60%) A1 Embedded Systems: ARM Appendix

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

The RM9150 and the Fast Device Bus High Speed Interconnect

The RM9150 and the Fast Device Bus High Speed Interconnect The RM9150 and the Fast Device High Speed Interconnect John R. Kinsel Principal Engineer www.pmc -sierra.com 1 August 2004 Agenda CPU-based SOC Design Challenges Fast Device (FDB) Overview Generic Device

More information

Overview. Think Silicon is a privately held company founded in 2007 by the core team of Atmel MMC IC group

Overview. Think Silicon is a privately held company founded in 2007 by the core team of Atmel MMC IC group Nema An OpenGL & OpenCL Embedded Programmable Engine Georgios Keramidas & Iakovos Stamoulis Think Silicon mobile GRAPHICS Overview Think Silicon is a privately held company founded in 2007 by the core

More information

Department of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware.

Department of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware. Department of Computer Science, Institute for System Architecture, Operating Systems Group Real-Time Systems '08 / '09 Hardware Marcus Völp Outlook Hardware is Source of Unpredictability Caches Pipeline

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Multi-core microcontroller design with Cortex-M processors and CoreSight SoC

Multi-core microcontroller design with Cortex-M processors and CoreSight SoC Multi-core microcontroller design with Cortex-M processors and CoreSight SoC Joseph Yiu, ARM Ian Johnson, ARM January 2013 Abstract: While the majority of Cortex -M processor-based microcontrollers are

More information

Welcome. Altera Technology Roadshow 2013

Welcome. Altera Technology Roadshow 2013 Welcome Altera Technology Roadshow 2013 Altera at a Glance Founded in Silicon Valley, California in 1983 Industry s first reprogrammable logic semiconductors $1.78 billion in 2012 sales Over 2,900 employees

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

NanoMind Z7000. Datasheet On-board CPU and FPGA for space applications

NanoMind Z7000. Datasheet On-board CPU and FPGA for space applications NanoMind Z7000 Datasheet On-board CPU and FPGA for space applications 1 Table of Contents 1 TABLE OF CONTENTS... 2 2 OVERVIEW... 3 2.1 HIGHLIGHTED FEATURES... 3 2.2 BLOCK DIAGRAM... 4 2.3 FUNCTIONAL DESCRIPTION...

More information

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014 Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline

More information

Sony/Toshiba/IBM (STI) CELL Processor. Scientific Computing for Engineers: Spring 2008

Sony/Toshiba/IBM (STI) CELL Processor. Scientific Computing for Engineers: Spring 2008 Sony/Toshiba/IBM (STI) CELL Processor Scientific Computing for Engineers: Spring 2008 Nec Hercules Contra Plures Chip's performance is related to its cross section same area 2 performance (Pollack's Rule)

More information

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006 The Use Of Virtual Platforms In MP-SoC Design Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006 1 MPSoC Is MP SoC design happening? Why? Consumer Electronics Complexity Cost of ASIC Increased SW Content

More information

AMD HD5450 PCIe ADD-IN BOARD. Datasheet AEGX-A3T5-01FST1

AMD HD5450 PCIe ADD-IN BOARD. Datasheet AEGX-A3T5-01FST1 AMD HD5450 PCIe ADD-IN BOARD Datasheet AEGX-A3T5-01FST1 CONTENTS 1. Feature... 3 2. Functional Overview... 4 2.1. Memory Interface... 4 2.2. Acceleration Features... 4 2.3. Avivo Display System... 5 2.4.

More information

Introduction to Sitara AM437x Processors

Introduction to Sitara AM437x Processors Introduction to Sitara AM437x Processors AM437x: Highly integrated, scalable platform with enhanced industrial communications and security AM4376 AM4378 Software Key Features AM4372 AM4377 High-performance

More information

FCQ2 - P2020 QorIQ implementation

FCQ2 - P2020 QorIQ implementation Formation P2020 QorIQ implementation: This course covers NXP QorIQ P2010 and P2020 - Processeurs PowerPC: NXP Power CPUs FCQ2 - P2020 QorIQ implementation This course covers NXP QorIQ P2010 and P2020 Objectives

More information

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview

More information

Altera SDK for OpenCL

Altera SDK for OpenCL Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group

More information

Chapter 7. Hardware Implementation Tools

Chapter 7. Hardware Implementation Tools Hardware Implementation Tools 137 The testing and embedding speech processing algorithm on general purpose PC and dedicated DSP platform require specific hardware implementation tools. Real time digital

More information

Lesson 6 Intel Galileo and Edison Prototype Development Platforms. Chapter-8 L06: "Internet of Things ", Raj Kamal, Publs.: McGraw-Hill Education

Lesson 6 Intel Galileo and Edison Prototype Development Platforms. Chapter-8 L06: Internet of Things , Raj Kamal, Publs.: McGraw-Hill Education Lesson 6 Intel Galileo and Edison Prototype Development Platforms 1 Intel Galileo Gen 2 Boards Based on the Intel Pentium architecture Includes features of single threaded, single core and 400 MHz constant

More information

A ONE CHIP HARDENED SOLUTION FOR HIGH SPEED SPACEWIRE SYSTEM IMPLEMENTATIONS

A ONE CHIP HARDENED SOLUTION FOR HIGH SPEED SPACEWIRE SYSTEM IMPLEMENTATIONS A ONE CHIP HARDENED SOLUTION FOR HIGH SPEED SPACEWIRE SYSTEM IMPLEMENTATIONS Joseph R. Marshall, Richard W. Berger, Glenn P. Rakow Conference Contents Standards & Topology ASIC Program History ASIC Features

More information

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small

More information

Techniques for Optimizing Performance and Energy Consumption: Results of a Case Study on an ARM9 Platform

Techniques for Optimizing Performance and Energy Consumption: Results of a Case Study on an ARM9 Platform Techniques for Optimizing Performance and Energy Consumption: Results of a Case Study on an ARM9 Platform BL Standard IC s, PL Microcontrollers October 2007 Outline LPC3180 Description What makes this

More information

The World Leader in High Performance Signal Processing Solutions. DSP Processors

The World Leader in High Performance Signal Processing Solutions. DSP Processors The World Leader in High Performance Signal Processing Solutions DSP Processors NDA required until November 11, 2008 Analog Devices Processors Broad Choice of DSPs Blackfin Media Enabled, 16/32- bit fixed

More information

LEON4: Fourth Generation of the LEON Processor

LEON4: Fourth Generation of the LEON Processor LEON4: Fourth Generation of the LEON Processor Magnus Själander, Sandi Habinc, and Jiri Gaisler Aeroflex Gaisler, Kungsgatan 12, SE-411 19 Göteborg, Sweden Tel +46 31 775 8650, Email: {magnus, sandi, jiri}@gaisler.com

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts Computer Architectures Advance CPU Design Tien-Fu Chen National Chung Cheng Univ. Adv CPU-0 MMX technology! Basic concepts " small native data types " compute-intensive operations " a lot of inherent parallelism

More information

Next Generation Multi-Purpose Microprocessor

Next Generation Multi-Purpose Microprocessor Next Generation Multi-Purpose Microprocessor Presentation at MPSA, 4 th of November 2009 www.aeroflex.com/gaisler OUTLINE NGMP key requirements Development schedule Architectural Overview LEON4FT features

More information

ARMv8-A Software Development

ARMv8-A Software Development ARMv8-A Software Development Course Description ARMv8-A software development is a 4 days ARM official course. The course goes into great depth and provides all necessary know-how to develop software for

More information

Configurable Processors for SOC Design. Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc.

Configurable Processors for SOC Design. Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc. Configurable s for SOC Design Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc. Why Listen to This Presentation? Understand how SOC design techniques, now nearly 20 years old, are

More information

ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7

ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7 ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7 Scherer Balázs Budapest University of Technology and Economics Department of Measurement and Information Systems BME-MIT 2018 Trends of 32-bit microcontrollers

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

IGLOO2 Evaluation Kit Webinar

IGLOO2 Evaluation Kit Webinar Power Matters. IGLOO2 Evaluation Kit Webinar Jamie Freed jamie.freed@microsemi.com August 29, 2013 Overview M2GL010T- FG484 $99* LPDDR 10/100/1G Ethernet SERDES SMAs USB UART Available Demos Small Form

More information

BittWare s XUPP3R is a 3/4-length PCIe x16 card based on the

BittWare s XUPP3R is a 3/4-length PCIe x16 card based on the FPGA PLATFORMS Board Platforms Custom Solutions Technology Partners Integrated Platforms XUPP3R Xilinx UltraScale+ 3/4-Length PCIe Board with Quad QSFP and 512 GBytes DDR4 Xilinx Virtex UltraScale+ VU7P/VU9P/VU11P

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

High Performance Computing. University questions with solution

High Performance Computing. University questions with solution High Performance Computing University questions with solution Q1) Explain the basic working principle of VLIW processor. (6 marks) The following points are basic working principle of VLIW processor. The

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

How to build a Megacore microprocessor. by Andreas Olofsson (MULTIPROG WORKSHOP 2017)

How to build a Megacore microprocessor. by Andreas Olofsson (MULTIPROG WORKSHOP 2017) How to build a Megacore microprocessor by Andreas Olofsson (MULTIPROG WORKSHOP 2017) 1 Disclaimers 2 This presentation summarizes work done by Adapteva from 2008-2016. Statements and opinions are my own

More information

Design Choices for FPGA-based SoCs When Adding a SATA Storage }

Design Choices for FPGA-based SoCs When Adding a SATA Storage } U4 U7 U7 Q D U5 Q D Design Choices for FPGA-based SoCs When Adding a SATA Storage } Lorenz Kolb & Endric Schubert, Missing Link Electronics Rudolf Usselmann, ASICS World Services Motivation for SATA Storage

More information

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform

More information

借助 SDSoC 快速開發複雜的嵌入式應用

借助 SDSoC 快速開發複雜的嵌入式應用 借助 SDSoC 快速開發複雜的嵌入式應用 May 2017 What Is C/C++ Development System-level Profiling SoC application-like programming Tools and IP for system-level profiling Specify C/C++ Functions for Acceleration Full System

More information

Product Technical Brief S3C2413 Rev 2.2, Apr. 2006

Product Technical Brief S3C2413 Rev 2.2, Apr. 2006 Product Technical Brief Rev 2.2, Apr. 2006 Overview SAMSUNG's is a Derivative product of S3C2410A. is designed to provide hand-held devices and general applications with cost-effective, low-power, and

More information

Doing more with multicore! Utilizing the power-efficient, high-performance KeyStone multicore DSPs. November 2012

Doing more with multicore! Utilizing the power-efficient, high-performance KeyStone multicore DSPs. November 2012 Doing more with multicore! Utilizing the power-efficient, high-performance KeyStone multicore DSPs November 2012 How the world is doing more with TI s multicore Using TI multicore for wide variety of applications

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

MIL-STD-1553 (T4240/T4160/T4080) 12/8/4 2 PMC/XMC 2.0 WWDT, ETR, RTC, 4 GB DDR3

MIL-STD-1553 (T4240/T4160/T4080) 12/8/4 2 PMC/XMC 2.0 WWDT, ETR, RTC, 4 GB DDR3 Rugged 6U VME Single-Slot SBC Freescale QorIQ Multicore SOC 1/8/4 e6500 Dual Thread Cores (T440/T4160/T4080) Altivec Unit Secure Boot and Trust Architecture.0 4 GB DDR3 with ECC 56 MB NOR Flash Memory

More information

. SMARC 2.0 Compliant

. SMARC 2.0 Compliant MSC SM2S-IMX8 NXP i.mx8 ARM Cortex -A72/A53 Description The new MSC SM2S-IMX8 module offers a quantum leap in terms of computing and graphics performance. It integrates the currently most powerful i.mx8

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects Overview of Interconnects Myrinet and Quadrics Leading Modern Interconnects Presentation Outline General Concepts of Interconnects Myrinet Latest Products Quadrics Latest Release Our Research Interconnects

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

Network-on-Chip Architecture

Network-on-Chip Architecture Multiple Processor Systems(CMPE-655) Network-on-Chip Architecture Performance aspect and Firefly network architecture By Siva Shankar Chandrasekaran and SreeGowri Shankar Agenda (Enhancing performance)

More information

Early Software Development Through Emulation for a Complex SoC

Early Software Development Through Emulation for a Complex SoC Early Software Development Through Emulation for a Complex SoC FTF-NET-F0204 Raghav U. Nayak Senior Validation Engineer A P R. 2 0 1 4 TM External Use Session Objectives After completing this session you

More information