Offloading Collective Operations to Programmable Logic

Size: px
Start display at page:

Download "Offloading Collective Operations to Programmable Logic"

Transcription

1 Offloading Collective Operations to Programmable Logic Martin Swany (With thanks to Omer Arap) Center for Research in Extreme Scale Computing (CREST) Department of Intelligent Systems Engineering Indiana University, Bloomington, IN

2 FPGAs Should Be Used As MPI Accelerators I know MPI accelerators. We have the best MPI acclerators. - DT Martin Swany Center for Research in Extreme Scale Computing (CREST) Department of Intelligent Systems Engineering Indiana University, Bloomington, IN

3 Overview We have developed an architecture and implementation for offloading collective operations to programmable logic in the communication substrate Not the first to use FPGAs for collectives It is clear that there is a tradeoff between performance and generality FPGAs are at one end of that spectrum Some folks live at the other end Programmable logic is burgeoning Xilinx Zynq Intel Xeon+Altera (HARP) Mellanox ConnectX-4 + FPGA FPGAs on other NICs and in switches

4 Programming FPGAs The programmable logic provided by FPGAs is a powerful option for creating task-specific functionality for applications Programming with FPGAs is not easy Many agree: Accelerate common functionality Others want to make it easy to use generalpurpose toolchains to deploy arbitrary kernels Jeff and OpenACC Franck s workshop New NSF call for ways to make it easy It may not ever be easy

5 Things that go bump in the wire We have been exploring a (relatively) generic approach for scenarios in which there is programmable logic in the communication pipeline Addresses the problem of how to get data to and from the device A bump in the wire Rich Graham clearly got this phrase from us In teaching how to use FPGAs, we have found this model to be quite useful Students who haven t done Verilog can get to this in a semester Important topic in Indiana U. s new engineering program This also maps to a growing use case

6 Collective Offload Offloading collective logic is a common technique in various platforms to improve performance Mellanox s CORE-Direct is one such state of the art collective operation offload framework. Defines primitive tasks: send, receive, wait, binary calculations. Collective algorithm defines list of tasks and tasks are performed by the NIC without CPU involvement Collective offload systems have been done many times, but ours is easy to use and is a starting point

7 Goals of FPGA-based Offload Reduce collective operation latency and variance Provide selection of collective algorithms that are implemented in hardware Implement collective algorithms independent of end-to-end communication protocols Utilize hardware-level, protocol-independent multicasting to optimize, and when possible, redesign major collective operation algorithms Support collective operation offload for different classes of collective operations

8 NetFPGA A line-rate, flexible, open networking platform for teaching and research (a) NetFPGA SUME Board Fig. 1. NetFPGA SUME Board and Block Diagram

9 Xilinx Zynq Processing System Flash Controller NOR, NAND, SRAM, Quad SPI Multiport DRAM Controller DDR3, DDR3L, DDR2 2x SPI AMBA Interconnect AMBA Interconnect Processor I/O Mux EMIO 2x I2C 2x CAN 2x UART GPIO 2x SDIO with DMA 2x USB with DMA 2x GigE with DMA XADC 2x ADC, Mux, Thermal Sensor ARM CoreSight Multi-Core Debug and Trace NEON DSP/FPU Engine Cortex - A9 MPCore 32/32 KB I/D Caches General Purpose AXI Ports 512 Kbyte L2 Cache General Interrupt Controller Configuration Timers AMBA Interconnect Watchdog Timer DMA Snoop Control Unit Security AES, SHA, RSA NEON DSP/FPU Engine Cortex- A9 MPCore 32/32 KB I/D Caches ACP Programmable Logic (System Gates, DSP, RAM) 256 Kbyte On-Chip Memory AMBA Interconnect High Performance AXI Ports PCIe Gen2 1-8 Lanes Multi-Standard I/Os (3.3V & High-Speed 1.8V) Multi-Gigabit Transceivers

10 Zedboard & Ethernet FMC

11 Collective Offload Design Ease implementation with the bump in the wire model Host offloads collective to the Programmable Logic (PL) by sending a specially crafted UDP packet Simplest solution Host blocks until the PL generates a release message, which is also a UDP packet

12 Collective Offload Packet dst_mac src_mac_1 src_mac_2 type ver IHL Diff_Serv Total_Length Identification flags fragment_offset TTL Protocol Hdr_Cksum src_ip dst_ip_1 dst_ip_2 UDP_Source_Port UDP_Dest_POrt Length UDP_checksum coll_id comm_size coll_type algo_type node_type msg_type rank local_rank_count operation data_type count

13 AXI-Based Platforms Advanced extensible Interface (AXI) standards define intermodule communication in programmable logic Widely utilized in recent Xilinx FPGA designs Recent designs depend on AXI-based protocols AXIS (AXI-Stream) is for streaming data between modules. We employ the AXIS protocol in our design Support for multiple ranks per host Provides insight how our design scales

14 Implementation Modules inherited from the NetFPGA package Input Arbiter, Output Queues Modules from Xilinx IP catalog AXI Ethernet, AXI DMA, AXI Width Converter, FPU, AXIS Data FIFO New Modules Stream Separator Stream Aggregator Collective Forwarding Engine (CFE) Collective Instruction Processing Engine (CIPE) Implements MPI_Allreduce, MPI_Reduce, MPI_Barrier, MPI_Scan, MPI_Alltoall

15 Data Path AXI DMA AXI Ethernet AXI Ethernet AXI Ethernet AXI Ethernet 32-b ctrl 32-b data 32-b ctrl 32-b data 32-b ctrl 32-b data 32-b ctrl 32-b data 32-b ctrl 32-b data Stream Aggregator Stream Aggregator Stream Aggregator Stream Aggregator Stream Aggregator 32-bit 32-bit 32-bit 32-bit 32-bit Width Converter Width Converter Width Converter Width Converter Width Converter 64-bit 64-bit 64-bit 64-bit 64-bit Input Arbiter 64-bit CFE (Collective Forwarding Engine) 64-bit CIPE (Collective Instruction Processing Engine) 64-bit Output Queues 64-bit 64-bit 64-bit 64-bit 64-bit Width Converter Width Converter Width Converter Width Converter Width Converter 32-bit 32-bit 32-bit 32-bit 32-bit Stream Separator Stream Separator Stream Separator Stream Separator Stream Separator 32-b ctrl 32-b data 32-b ctrl 32-b data 32-b ctrl 32-b data 32-b ctrl 32-b data 32-b ctrl 32-b data AXI DMA AXI Ethernet AXI Ethernet AXI Ethernet AXI Ethernet

16 Collective Forwarding Engine (CFE) Heart of the design Responsible for: Assessing the offload request header fields. Detecting collective operation and algorithm to run Perform state transitions and save the state for algorithms Based on the state, make forwarding decisions Determine the instructions which the CIPE will apply

17 Collective Instruction Processing Engine (CIPE) CFE attaches 64-bit instruction header CIPE processes the instruction Example Instructions Forward packet, do nothing Buffer packet in specified buffer Perform reduction operations on incoming packet and specified buffers Buffer or not buffer the result Perform reduction operations on specified buffers and ignore incoming packet Buffer or not buffer the result Ignore incoming packet and forward data on specified buffer

18 Collective Instruction Processing Engine (CIPE) Interfaces with FPU and FIFO cores FPU cores for streaming and applying floating point arithmetic FIFO cores for streaming intermediate results of single cycle arithmetic operations MIN, MAX, BOR, LOR, BAND, LAND, BXOR, LXOR, Integer SUM Internal BRAM buffers

19 Collective Instruction Processing Engine (CIPE) Op1_1 Op1_2 FPU 1 Op2_1 Op2_2 FPU 2 Op3_1 Op3_2 FPU 3 Op4_1 CFE Instruction Processor Op4_2 RSLT1 RSLT2 RSLT3 RSLT4 FPU 4 FIFO 1 FIFO 2 FIFO 3 FIFO 4 Output Queues Buffer 1 Buffer 2 Buffer 3 Buffer 4

20 Collective Instruction Processing Engine (CIPE) Fully pipelined design FPU cores can be configured to produce results in 1 cycle Due to observed instabilities, we increased the first result cycle to 11 (the maximum) Each extra FPU core introduces extra cycle Each FIFO core employed introduces an extra cycle

21 Evaluation Experimental Setup 8 Zynq Zedboards with EthernetFMC adapters Directly connected Come see it at SC

22 Zynq Cluster MPI_Allreduce ARM&Host&vs&PL&Offloaded&(Msg&Size&=&8&byte)& Speedup"Ave" Speedup"Min" ARM"Ave" ARM"Min" PL"Ave" PL"Min" 12000" 12" ARM&Host&vs&PL&Offloaded&(Msg&Size&=&128&byte)& Speedup"Ave" Speedup"Min" ARM"Ave" ARM"Min" PL"Ave" PL"Min" 12000" 12" 10000" 10" 10000" 10" Time&(usecs)& 8000" 6000" 4000" 8" 6" 4" Speedup& Time&(usecs)& 8000" 6000" 4000" 8" 6" 4" Speedup& 2000" 2" 2000" 2" 0" 8" 16" 32" 64" 128" Communicator&Size& 0" 0" 8" 16" 32" 64" 128" Communicator&Size& 0" ARM&Host&vs&PL&Offloaded&(Msg&Size&=&1024&Byte)& Speedup"Ave" Speedup"Min" ARM"Ave" ARM"Min" PL"Ave" PL"Min" 12000" 12" 10000" 10" Time&(usecs)& 8000" 6000" 4000" 8" 6" 4" Speedup& 2000" 2" 0" 8" 16" 32" 64" 128" Communicator&Size& 0"

23 Zynq Cluster MPI_Allreduce Ave,)Min)Difference:)ARM)vs)PL)Msg)(Size)=)8B)) Ave,)Min)Difference:)ARM)vs)PL)Msg)(Size)=)128B)) ARM"%"Diff" PL"%"Diff" ARM"Diff" PL"Diff" ARM"%"Diff" PL"%"Diff" ARM"Diff" PL"Diff" 1800" 30" 1800" 30" 1600" 1400" 25" 1600" 1400" 25" Time)(usecs)) 1200" 1000" 800" 600" 20" 15" 10" Percentage) Time)(usecs)) 1200" 1000" 800" 600" 20" 15" 10" Percentage) 400" 200" 5" 400" 200" 5" 0" 8" 16" 32" 64" 128" Communicator)Size) 0" 0" 8" 16" 32" 64" 128" Communicator)Size) 0" Time)(usecs)) 1800" 1600" 1400" 1200" 1000" 800" 600" 400" 200" 0" Ave,)Min)Difference:)ARM)vs)PL)Msg)(Size)=)1024B)) ARM"%"Diff" PL"%"Diff" ARM"Diff" PL"Diff" 8" 16" 32" 64" 128" Communicator)Size) 30" 25" 20" 15" 10" 5" 0" Percentage)

24 Zynq Cluster MILC Perentage( 20" 18" 16" 14" 12" 10" 8" 6" 4" 2" 0" MILC(4(Clover(Dynamical( %"speedup"(min"results)" %"speedup"(average"results)" 8" 16" 32" 64" Communicator(Size( 45.00# %#speedup#(min#results)# MILC(4(Schroed_cl_inv( %#speedup#(average#results)# Perentage( 40.00# 35.00# 30.00# 25.00# 20.00# 15.00# 10.00# 5.00# 0.00# 8# 16# 32# 64# Communicator(Size(

25 Conclusions and Future Work Empirical validation of the efficacy of collective offload Promising performance Increasing availability of programmable logic stands to increase impact The use of UDP allowed straightforward implementation, but is limiting Clearly, this the right way to use FPGAs 25

26

Zynq-7000 All Programmable SoC Product Overview

Zynq-7000 All Programmable SoC Product Overview Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform

More information

Design Choices for FPGA-based SoCs When Adding a SATA Storage }

Design Choices for FPGA-based SoCs When Adding a SATA Storage } U4 U7 U7 Q D U5 Q D Design Choices for FPGA-based SoCs When Adding a SATA Storage } Lorenz Kolb & Endric Schubert, Missing Link Electronics Rudolf Usselmann, ASICS World Services Motivation for SATA Storage

More information

Offloading MPI Parallel Prefix Scan (MPI Scan) with the NetFPGA

Offloading MPI Parallel Prefix Scan (MPI Scan) with the NetFPGA 1 Offloading MPI Parallel Prefix Scan (MPI Scan) with the NetFPGA Omer Arap Martin Swany Center for Research in Extreme Scale Technology Indiana University Bloomington, IN 4745 {omerarap,swany}@crest.iu.edu

More information

Copyright 2016 Xilinx

Copyright 2016 Xilinx Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building

More information

Zynq Architecture, PS (ARM) and PL

Zynq Architecture, PS (ARM) and PL , PS (ARM) and PL Joint ICTP-IAEA School on Hybrid Reconfigurable Devices for Scientific Instrumentation Trieste, 1-5 June 2015 Fernando Rincón Fernando.rincon@uclm.es 1 Contents Zynq All Programmable

More information

MYC-C7Z010/20 CPU Module

MYC-C7Z010/20 CPU Module MYC-C7Z010/20 CPU Module - 667MHz Xilinx XC7Z010/20 Dual-core ARM Cortex-A9 Processor with Xilinx 7-series FPGA logic - 1GB DDR3 SDRAM (2 x 512MB, 32-bit), 4GB emmc, 32MB QSPI Flash - On-board Gigabit

More information

SoC Platforms and CPU Cores

SoC Platforms and CPU Cores SoC Platforms and CPU Cores COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University

More information

MYD-C7Z010/20 Development Board

MYD-C7Z010/20 Development Board MYD-C7Z010/20 Development Board MYC-C7Z010/20 CPU Module as Controller Board Two 0.8mm pitch 140-pin Connectors for Board-to-Board Connections 667MHz Xilinx XC7Z010/20 Dual-core ARM Cortex-A9 Processor

More information

ARM Cortex-A9 ARM v7-a. A programmer s perspective Part1

ARM Cortex-A9 ARM v7-a. A programmer s perspective Part1 ARM Cortex-A9 ARM v7-a A programmer s perspective Part1 ARM: Advanced RISC Machine First appeared in 1985 as Acorn RISC Machine from Acorn Computers in Manchester England Limited success outcompeted by

More information

S2C K7 Prodigy Logic Module Series

S2C K7 Prodigy Logic Module Series S2C K7 Prodigy Logic Module Series Low-Cost Fifth Generation Rapid FPGA-based Prototyping Hardware The S2C K7 Prodigy Logic Module is equipped with one Xilinx Kintex-7 XC7K410T or XC7K325T FPGA device

More information

SoC FPGAs. Your User-Customizable System on Chip Altera Corporation Public

SoC FPGAs. Your User-Customizable System on Chip Altera Corporation Public SoC FPGAs Your User-Customizable System on Chip Embedded Developers Needs Low High Increase system performance Reduce system power Reduce board size Reduce system cost 2 Providing the Best of Both Worlds

More information

Designing Multi-Channel, Real-Time Video Processors with Zynq All Programmable SoC Hyuk Kim Embedded Specialist Jun, 2014

Designing Multi-Channel, Real-Time Video Processors with Zynq All Programmable SoC Hyuk Kim Embedded Specialist Jun, 2014 Designing Multi-Channel, Real-Time Video Processors with Zynq All Programmable SoC Hyuk Kim Embedded Specialist Jun, 2014 Broadcast & Pro A/V Landscape Xilinx Smarter Vision in action across the entire

More information

Simplify System Complexity

Simplify System Complexity Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint

More information

XMC-ZU1. XMC Module Xilinx Zynq UltraScale+ MPSoC. Overview. Key Features. Typical Applications

XMC-ZU1. XMC Module Xilinx Zynq UltraScale+ MPSoC. Overview. Key Features. Typical Applications XMC-ZU1 XMC Module Xilinx Zynq UltraScale+ MPSoC Overview PanaTeQ s XMC-ZU1 is a XMC module based on the Zynq UltraScale+ MultiProcessor SoC device from Xilinx. The Zynq UltraScale+ integrates a Quad-core

More information

Copyright 2017 Xilinx.

Copyright 2017 Xilinx. All Programmable Automotive SoC Comparison XA Zynq UltraScale+ MPSoC ZU2/3EG, ZU4/5EV Devices XA Zynq -7000 SoC Z-7010/7020/7030 Devices Application Processor Real-Time Processor Quad-core ARM Cortex -A53

More information

THE FIRST GENERATION OF EXTENSIBLE PROCESSING PLATFORMS: A NEW LEVEL OF PERFORMANCE, FLEXIBILITY AND SCALABILITY

THE FIRST GENERATION OF EXTENSIBLE PROCESSING PLATFORMS: A NEW LEVEL OF PERFORMANCE, FLEXIBILITY AND SCALABILITY PROCESSOR-CENTRIC EXTENSIBLE PLATFORMS FOR POWERFUL, SCALABLE, COST-EFFICIENT EMBEDDED DESIGNS THE FIRST GENERATION OF S: A NEW LEVEL OF PERFORMANCE, FLEXIBILITY AND SCALABILITY Embedded Systems Challenges

More information

Zynq AP SoC Family

Zynq AP SoC Family Programmable Logic (PL) Processing System (PS) Zynq -7000 AP SoC Family Cost-Optimized Devices Mid-Range Devices Device Name Z-7007S Z-7012S Z-7014S Z-7010 Z-7015 Z-7020 Z-7030 Z-7035 Z-7045 Z-7100 Part

More information

XMC-RFSOC-A. XMC Module Xilinx Zynq UltraScale+ RFSOC. Overview. Key Features. Typical Applications. Advanced Information Subject To Change

XMC-RFSOC-A. XMC Module Xilinx Zynq UltraScale+ RFSOC. Overview. Key Features. Typical Applications. Advanced Information Subject To Change Advanced Information Subject To Change XMC-RFSOC-A XMC Module Xilinx Zynq UltraScale+ RFSOC Overview PanaTeQ s XMC-RFSOC-A is a XMC module based on the Zynq UltraScale+ RFSoC device from Xilinx. The Zynq

More information

Simplify System Complexity

Simplify System Complexity 1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller

More information

Defense-grade Zynq-7000Q SoC Data Sheet: Overview

Defense-grade Zynq-7000Q SoC Data Sheet: Overview Defense-grade Zynq-7000Q SoC Data Sheet: Overview Product Specification Defense-grade Zynq-7000Q SoC First Generation Architecture The Defense-grade Zynq -7000Q family is based on the Xilinx SoC architecture.

More information

Zynq-7000 Extensible Processing Platform Overview

Zynq-7000 Extensible Processing Platform Overview Advance Product Specification Zynq-7000 Extensible Processing Platform (EPP) First Generation Architecture The Zynq -7000 family is based on the Xilinx Extensible Processing Platform (EPP) architecture.

More information

6.9. Communicating to the Outside World: Cluster Networking

6.9. Communicating to the Outside World: Cluster Networking 6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and

More information

Ettus Research Update

Ettus Research Update Ettus Research Update Matt Ettus Ettus Research GRCon13 Outline 1 Introduction 2 Recent New Products 3 Third Generation Introduction Who am I? Core GNU Radio contributor since 2001 Designed

More information

Designing a Multi-Processor based system with FPGAs

Designing a Multi-Processor based system with FPGAs Designing a Multi-Processor based system with FPGAs BRINGING BRINGING YOU YOU THE THE NEXT NEXT LEVEL LEVEL IN IN EMBEDDED EMBEDDED DEVELOPMENT DEVELOPMENT Frank de Bont Trainer / Consultant Cereslaan

More information

«Real Time Embedded systems» Cyclone V SOC - FPGA

«Real Time Embedded systems» Cyclone V SOC - FPGA «Real Time Embedded systems» Cyclone V SOC - FPGA Ref: http://www.altera.com rene.beuchat@epfl.ch LAP/ISIM/IC/EPFL Chargé de cours rene.beuchat@hesge.ch LSN/hepia Prof. HES 1 SOC + FPGA (ex. Cyclone V,

More information

Zynq-7000 All Programmable SoC Data Sheet: Overview

Zynq-7000 All Programmable SoC Data Sheet: Overview Zynq-7000 All Programmable SoC Data Sheet: Overview Product Specification Zynq-7000 All Programmable SoC First Generation Architecture The Zynq -7000 family is based on the Xilinx All Programmable SoC

More information

All Programmable: from Silicon to System

All Programmable: from Silicon to System All Programmable: from Silicon to System Ivo Bolsens, Senior Vice President & CTO Page 1 Moore s Law: The Technology Pipeline Page 2 Industry Debates Variability Page 3 Industry Debates on Cost Page 4

More information

NetFPGA Hardware Architecture

NetFPGA Hardware Architecture NetFPGA Hardware Architecture Jeffrey Shafer Some slides adapted from Stanford NetFPGA tutorials NetFPGA http://netfpga.org 2 NetFPGA Components Virtex-II Pro 5 FPGA 53,136 logic cells 4,176 Kbit block

More information

KeyStone C66x Multicore SoC Overview. Dec, 2011

KeyStone C66x Multicore SoC Overview. Dec, 2011 KeyStone C66x Multicore SoC Overview Dec, 011 Outline Multicore Challenge KeyStone Architecture Reminder About KeyStone Solution Challenge Before KeyStone Multicore performance degradation Lack of efficient

More information

Introduction to Sitara AM437x Processors

Introduction to Sitara AM437x Processors Introduction to Sitara AM437x Processors AM437x: Highly integrated, scalable platform with enhanced industrial communications and security AM4376 AM4378 Software Key Features AM4372 AM4377 High-performance

More information

Effective System Design with ARM System IP

Effective System Design with ARM System IP Effective System Design with ARM System IP Mentor Technical Forum 2009 Serge Poublan Product Marketing Manager ARM 1 Higher level of integration WiFi Platform OS Graphic 13 days standby Bluetooth MP3 Camera

More information

VPX3-ZU1. 3U OpenVPX Module Xilinx Zynq UltraScale+ MPSoC with FMC HPC Site. Overview. Key Features. Typical Applications

VPX3-ZU1. 3U OpenVPX Module Xilinx Zynq UltraScale+ MPSoC with FMC HPC Site. Overview. Key Features. Typical Applications VPX3-ZU1 3U OpenVPX Module Xilinx Zynq UltraScale+ MPSoC with FMC HPC Site Overview PanaTeQ s VPX3-ZU1 is a 3U OpenVPX module based on the Zynq UltraScale+ MultiProcessor SoC device from Xilinx. The Zynq

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

Each Milliwatt Matters

Each Milliwatt Matters Each Milliwatt Matters Ultra High Efficiency Application Processors Govind Wathan Product Manager, CPG ARM Tech Symposia China 2015 November 2015 Ultra High Efficiency Processors Used in Diverse Markets

More information

VPX3-ZU1. 3U OpenVPX Module Xilinx Zynq UltraScale+ MPSoC with FMC HPC Site. Overview. Key Features. Typical Applications

VPX3-ZU1. 3U OpenVPX Module Xilinx Zynq UltraScale+ MPSoC with FMC HPC Site. Overview. Key Features. Typical Applications VPX3-ZU1 3U OpenVPX Module Xilinx Zynq UltraScale+ MPSoC with FMC HPC Site Overview PanaTeQ s VPX3-ZU1 is a 3U OpenVPX module based on the Zynq UltraScale+ MultiProcessor SoC device from Xilinx. The Zynq

More information

CoreTile Express for Cortex-A5

CoreTile Express for Cortex-A5 CoreTile Express for Cortex-A5 For the Versatile Express Family The Versatile Express family development boards provide an excellent environment for prototyping the next generation of system-on-chip designs.

More information

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Maximizing heterogeneous system performance with ARM interconnect and CCIX Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable

More information

Renesas Synergy MCUs Build a Foundation for Groundbreaking Integrated Embedded Platform Development

Renesas Synergy MCUs Build a Foundation for Groundbreaking Integrated Embedded Platform Development Renesas Synergy MCUs Build a Foundation for Groundbreaking Integrated Embedded Platform Development New Family of Microcontrollers Combine Scalability and Power Efficiency with Extensive Peripheral Capabilities

More information

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013 A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company

More information

Extending Fixed Subsystems at the TLM Level: Experiences from the FPGA World

Extending Fixed Subsystems at the TLM Level: Experiences from the FPGA World I N V E N T I V E Extending Fixed Subsystems at the TLM Level: Experiences from the FPGA World Frank Schirrmeister, Steve Brown, Larry Melling (Cadence) Dave Beal (Xilinx) Agenda Virtual Platforms Xilinx

More information

Designing with NXP i.mx8m SoC

Designing with NXP i.mx8m SoC Designing with NXP i.mx8m SoC Course Description Designing with NXP i.mx8m SoC is a 3 days deep dive training to the latest NXP application processor family. The first part of the course starts by overviewing

More information

The Challenges of System Design. Raising Performance and Reducing Power Consumption

The Challenges of System Design. Raising Performance and Reducing Power Consumption The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software

More information

New System Solutions for Laser Printer Applications by Oreste Emanuele Zagano STMicroelectronics

New System Solutions for Laser Printer Applications by Oreste Emanuele Zagano STMicroelectronics New System Solutions for Laser Printer Applications by Oreste Emanuele Zagano STMicroelectronics Introduction Recently, the laser printer market has started to move away from custom OEM-designed 1 formatter

More information

Programmable Logic Design Grzegorz Budzyń Lecture. 15: Advanced hardware in FPGA structures

Programmable Logic Design Grzegorz Budzyń Lecture. 15: Advanced hardware in FPGA structures Programmable Logic Design Grzegorz Budzyń Lecture 15: Advanced hardware in FPGA structures Plan Introduction PowerPC block RocketIO Introduction Introduction The larger the logical chip, the more additional

More information

NetFPGA Update at GEC4

NetFPGA Update at GEC4 NetFPGA Update at GEC4 http://netfpga.org/ NSF GENI Engineering Conference 4 (GEC4) March 31, 2009 John W. Lockwood http://stanford.edu/~jwlockwd/ jwlockwd@stanford.edu NSF GEC4 1 March 2009 What is the

More information

Experience with the NetFPGA Program

Experience with the NetFPGA Program Experience with the NetFPGA Program John W. Lockwood Algo-Logic Systems Algo-Logic.com With input from the Stanford University NetFPGA Group & Xilinx XUP Program Sunday, February 21, 2010 FPGA-2010 Pre-Conference

More information

P51: High Performance Networking

P51: High Performance Networking P51: High Performance Networking Lecture 6: Programmable network devices Dr Noa Zilberman noa.zilberman@cl.cam.ac.uk Lent 2017/18 High Throughput Interfaces Performance Limitations So far we discussed

More information

A Generation Ahead Designing Advanced Embedded Systems with Xilinx Zynq All Programmable SoCs

A Generation Ahead Designing Advanced Embedded Systems with Xilinx Zynq All Programmable SoCs A Generation Ahead Designing Advanced Embedded Systems with Xilinx Zynq All Programmable SoCs 김혁 Embedded Processor Specialist, Xilinx 1 st - Oct, 2013 Market Challenges for Embedded Solutions Integrating

More information

D Demonstration of disturbance recording functions for PQ monitoring

D Demonstration of disturbance recording functions for PQ monitoring D6.3.7. Demonstration of disturbance recording functions for PQ monitoring Final Report March, 2013 M.Sc. Bashir Ahmed Siddiqui Dr. Pertti Pakonen 1. Introduction The OMAP-L138 C6-Integra DSP+ARM processor

More information

AT-501 Cortex-A5 System On Module Product Brief

AT-501 Cortex-A5 System On Module Product Brief AT-501 Cortex-A5 System On Module Product Brief 1. Scope The following document provides a brief description of the AT-501 System on Module (SOM) its features and ordering options. For more details please

More information

Chapter 5. Introduction ARM Cortex series

Chapter 5. Introduction ARM Cortex series Chapter 5 Introduction ARM Cortex series 5.1 ARM Cortex series variants 5.2 ARM Cortex A series 5.3 ARM Cortex R series 5.4 ARM Cortex M series 5.5 Comparison of Cortex M series with 8/16 bit MCUs 51 5.1

More information

Motivation to Teach Network Hardware

Motivation to Teach Network Hardware NetFPGA: An Open Platform for Gigabit-rate Network Switching and Routing John W. Lockwood, Nick McKeown Greg Watson, Glen Gibb, Paul Hartke, Jad Naous, Ramanan Raghuraman, and Jianying Luo JWLockwd@stanford.edu

More information

CONTACT: ,

CONTACT: , S.N0 Project Title Year of publication of IEEE base paper 1 Design of a high security Sha-3 keccak algorithm 2012 2 Error correcting unordered codes for asynchronous communication 2012 3 Low power multipliers

More information

STM32F7 series ARM Cortex -M7 powered Releasing your creativity

STM32F7 series ARM Cortex -M7 powered Releasing your creativity STM32F7 series ARM Cortex -M7 powered Releasing your creativity STM32 high performance Very high performance 32-bit MCU with DSP and FPU The STM32F7 with its ARM Cortex -M7 core is the smartest MCU and

More information

5. ARM 기반모니터프로그램사용. Embedded Processors. DE1-SoC 보드 (IntelFPGA) Application Processors. Development of the ARM Architecture.

5. ARM 기반모니터프로그램사용. Embedded Processors. DE1-SoC 보드 (IntelFPGA) Application Processors. Development of the ARM Architecture. Embedded Processors 5. ARM 기반모니터프로그램사용 DE1-SoC 보드 (IntelFPGA) 2 Application Processors Development of the ARM Architecture v4 v5 v6 v7 Halfword and signed halfword / byte support System mode Thumb instruction

More information

Implementing MPI_Barrier with the NetFPGA

Implementing MPI_Barrier with the NetFPGA Implementing MPI_Barrier with the NetFPGA O. Arap 1, G. Brown 2, B. Himebaugh 2, and M. Swany 1 1 enter for Research in Extreme Scale Technologies, Indiana University, Bloomington, IN, USA 2 School of

More information

ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7

ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7 ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7 Scherer Balázs Budapest University of Technology and Economics Department of Measurement and Information Systems BME-MIT 2018 Trends of 32-bit microcontrollers

More information

STM32 Journal. In this Issue:

STM32 Journal. In this Issue: Volume 1, Issue 2 In this Issue: Bringing 32-bit Performance to 8- and 16-bit Applications Developing High-Quality Audio for Consumer Electronics Applications Bringing Floating-Point Performance and Precision

More information

FPGA Adaptive Software Debug and Performance Analysis

FPGA Adaptive Software Debug and Performance Analysis white paper Intel Adaptive Software Debug and Performance Analysis Authors Javier Orensanz Director of Product Management, System Design Division ARM Stefano Zammattio Product Manager Intel Corporation

More information

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan Processors Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan chanhl@maili.cgu.edu.twcgu General-purpose p processor Control unit Controllerr Control/ status Datapath ALU

More information

SoC Systeme ultra-schnell entwickeln mit Vivado und Visual System Integrator

SoC Systeme ultra-schnell entwickeln mit Vivado und Visual System Integrator SoC Systeme ultra-schnell entwickeln mit Vivado und Visual System Integrator FPGA Kongress München 2017 Martin Heimlicher Enclustra GmbH Agenda 2 What is Visual System Integrator? Introduction Platform

More information

New STM32 F7 Series. World s 1 st to market, ARM Cortex -M7 based 32-bit MCU

New STM32 F7 Series. World s 1 st to market, ARM Cortex -M7 based 32-bit MCU New STM32 F7 Series World s 1 st to market, ARM Cortex -M7 based 32-bit MCU 7 Keys of STM32 F7 series 2 1 2 3 4 5 6 7 First. ST is first to sample a fully functional Cortex-M7 based 32-bit MCU : STM32

More information

Product Technical Brief S3C2416 May 2008

Product Technical Brief S3C2416 May 2008 Product Technical Brief S3C2416 May 2008 Overview SAMSUNG's S3C2416 is a 32/16-bit RISC cost-effective, low power, high performance micro-processor solution for general applications including the GPS Navigation

More information

FPGA Entering the Era of the All Programmable SoC

FPGA Entering the Era of the All Programmable SoC FPGA Entering the Era of the All Programmable SoC Ivo Bolsens, Senior Vice President & CTO Page 1 Moore s Law: The Technology Pipeline Page 2 Industry Debates on Cost Page 3 Design Cost Estimated Chip

More information

KeyStone II. CorePac Overview

KeyStone II. CorePac Overview KeyStone II ARM Cortex A15 CorePac Overview ARM A15 CorePac in KeyStone II Standard ARM Cortex A15 MPCore processor Cortex A15 MPCore version r2p2 Quad core, dual core, and single core variants 4096kB

More information

Qsys and IP Core Integration

Qsys and IP Core Integration Qsys and IP Core Integration Stephen A. Edwards (after David Lariviere) Columbia University Spring 2016 IP Cores Altera s IP Core Integration Tools Connecting IP Cores IP Cores Cyclone V SoC: A Mix of

More information

FPQ6 - MPC8313E implementation

FPQ6 - MPC8313E implementation Formation MPC8313E implementation: This course covers PowerQUICC II Pro MPC8313 - Processeurs PowerPC: NXP Power CPUs FPQ6 - MPC8313E implementation This course covers PowerQUICC II Pro MPC8313 Objectives

More information

Midterm Exam. Solutions

Midterm Exam. Solutions Midterm Exam Solutions Problem 1 List at least 3 advantages of implementing selected portions of a complex design in software Software vs. Hardware Trade-offs Improve Performance Improve Energy Efficiency

More information

Intelop. *As new IP blocks become available, please contact the factory for the latest updated info.

Intelop. *As new IP blocks become available, please contact the factory for the latest updated info. A FPGA based development platform as part of an EDK is available to target intelop provided IPs or other standard IPs. The platform with Virtex-4 FX12 Evaluation Kit provides a complete hardware environment

More information

Designing with STM32F2x & STM32F4

Designing with STM32F2x & STM32F4 Designing with STM32F2x & STM32F4 Course Description Designing with STM32F2x & STM32F4 is a 3 days ST official course. The course provides all necessary theoretical and practical know-how for start developing

More information

GigaX API for Zynq SoC

GigaX API for Zynq SoC BUM002 v1.0 USER MANUAL A software API for Zynq PS that Enables High-speed GigaE-PL Data Transfer & Frames Management BERTEN DSP S.L. www.bertendsp.com gigax@bertendsp.com +34 942 18 10 11 Table of Contents

More information

FCQ2 - P2020 QorIQ implementation

FCQ2 - P2020 QorIQ implementation Formation P2020 QorIQ implementation: This course covers NXP QorIQ P2010 and P2020 - Processeurs PowerPC: NXP Power CPUs FCQ2 - P2020 QorIQ implementation This course covers NXP QorIQ P2010 and P2020 Objectives

More information

MYD-Y6ULX Development Board

MYD-Y6ULX Development Board MYD-Y6ULX Development Board MYC-Y6ULX CPU Module as Controller Board 528Hz NXP i.mx 6UL/6ULL ARM Cortex-A7 Processors 1.0mm pitch 140-pin Stamp Hole Expansion Interface for Board-to-Board Connections 256MB

More information

RiceNIC. Prototyping Network Interfaces. Jeffrey Shafer Scott Rixner

RiceNIC. Prototyping Network Interfaces. Jeffrey Shafer Scott Rixner RiceNIC Prototyping Network Interfaces Jeffrey Shafer Scott Rixner RiceNIC Overview Gigabit Ethernet Network Interface Card RiceNIC - Prototyping Network Interfaces 2 RiceNIC Overview Reconfigurable and

More information

SDSoC: Session 1

SDSoC: Session 1 SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the

More information

Interconnects, Memory, GPIO

Interconnects, Memory, GPIO Interconnects, Memory, GPIO Dr. Francesco Conti f.conti@unibo.it Slide contributions adapted from STMicroelectronics and from Dr. Michele Magno, others Processor vs. MCU Pipeline Harvard architecture Separate

More information

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY

More information

Using partial reconfiguration in an embedded message-passing system. FPGA UofT March 2011

Using partial reconfiguration in an embedded message-passing system. FPGA UofT March 2011 Using partial reconfiguration in an embedded message-passing system FPGA Seminar @ UofT March 2011 Manuel Saldaña, Arun Patel ArchES Computing Hao Jun Liu, Paul Chow University of Toronto ArchES-MPI framework

More information

INT 1011 TCP Offload Engine (Full Offload)

INT 1011 TCP Offload Engine (Full Offload) INT 1011 TCP Offload Engine (Full Offload) Product brief, features and benefits summary Provides lowest Latency and highest bandwidth. Highly customizable hardware IP block. Easily portable to ASIC flow,

More information

Yet Another Implementation of CoRAM Memory

Yet Another Implementation of CoRAM Memory Dec 7, 2013 CARL2013@Davis, CA Py Yet Another Implementation of Memory Architecture for Modern FPGA-based Computing Shinya Takamaeda-Yamazaki, Kenji Kise, James C. Hoe * Tokyo Institute of Technology JSPS

More information

Product Technical Brief S3C2413 Rev 2.2, Apr. 2006

Product Technical Brief S3C2413 Rev 2.2, Apr. 2006 Product Technical Brief Rev 2.2, Apr. 2006 Overview SAMSUNG's is a Derivative product of S3C2410A. is designed to provide hand-held devices and general applications with cost-effective, low-power, and

More information

Copyright 2014 Xilinx

Copyright 2014 Xilinx IP Integrator and Embedded System Design Flow Zynq Vivado 2014.2 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able

More information

Lesson 6 Intel Galileo and Edison Prototype Development Platforms. Chapter-8 L06: "Internet of Things ", Raj Kamal, Publs.: McGraw-Hill Education

Lesson 6 Intel Galileo and Edison Prototype Development Platforms. Chapter-8 L06: Internet of Things , Raj Kamal, Publs.: McGraw-Hill Education Lesson 6 Intel Galileo and Edison Prototype Development Platforms 1 Intel Galileo Gen 2 Boards Based on the Intel Pentium architecture Includes features of single threaded, single core and 400 MHz constant

More information

First hour Zynq architecture

First hour Zynq architecture Introduction to the Zynq SOC INF3430/INF4431 Tønnes Nygaard tonnesfn@ifi.uio.no First hour Zynq architecture Computational platforms Design flow System overview PS APU IOP MIO EMIO Datapath PS/PL interconnect

More information

Protecting Embedded Systems from Zero-Day Attacks

Protecting Embedded Systems from Zero-Day Attacks Protecting Embedded Systems from Zero-Day Attacks Professor Stephen Taylor Thayer School of Engineering at Dartmouth stnh.email@icloud.com (603) 727-8945 MicroArx.com Apiotics.com 1 Research Support Current

More information

Vivado Design Suite User Guide: Embedded Processor Hardware Design

Vivado Design Suite User Guide: Embedded Processor Hardware Design Vivado Design Suite User Guide: Embedded Processor Hardware Design Notice of Disclaimer The information disclosed to you hereunder (the "Materials") is provided solely for the selection and use of Xilinx

More information

XMC-SDR-A. XMC Zynq MPSoC + Dual ADRV9009 module. Preliminary Information Subject To Change. Overview. Key Features. Typical Applications

XMC-SDR-A. XMC Zynq MPSoC + Dual ADRV9009 module. Preliminary Information Subject To Change. Overview. Key Features. Typical Applications Preliminary Information Subject To Change XMC-SDR-A XMC Zynq MPSoC + Dual ADRV9009 module Overview PanaTeQ s XMC-SDR-A is a XMC module based on the Zynq UltraScale+ MultiProcessor SoC device from Xilinx

More information

RISC-V based core as a soft processor in FPGAs Chowdhary Musunuri Sr. Director, Solutions & Applications Microsemi

RISC-V based core as a soft processor in FPGAs Chowdhary Musunuri Sr. Director, Solutions & Applications Microsemi Power Matters. TM RISC-V based core as a soft processor in FPGAs Chowdhary Musunuri Sr. Director, Solutions & Applications Microsemi chowdhary.musunuri@microsemi.com RIC217 1 Agenda A brief introduction

More information

Freescale i.mx6 Architecture

Freescale i.mx6 Architecture Freescale i.mx6 Architecture Course Description Freescale i.mx6 architecture is a 3 days Freescale official course. The course goes into great depth and provides all necessary know-how to develop software

More information

Product Series SoC Solutions Product Series 2016

Product Series SoC Solutions Product Series 2016 Product Series Why SPI? or We will discuss why Serial Flash chips are used in many products. What are the advantages and some of the disadvantages. We will explore how SoC Solutions SPI and QSPI IP Cores

More information

Product Technical Brief S3C2412 Rev 2.2, Apr. 2006

Product Technical Brief S3C2412 Rev 2.2, Apr. 2006 Product Technical Brief S3C2412 Rev 2.2, Apr. 2006 Overview SAMSUNG's S3C2412 is a Derivative product of S3C2410A. S3C2412 is designed to provide hand-held devices and general applications with cost-effective,

More information

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink Robert Kaye 1 Agenda Once upon a time ARM designed systems Compute trends Bringing it all together with CoreLink 400

More information

MYD-JA5D2X Development Board

MYD-JA5D2X Development Board MYD-JA5D2X Development Board MYC-JA5D2X CPU Module as Controller Board 500MHz Atmel SAMA5D26/27 ARM Cortex-A5 Processor 256MB DDR3 SDRAM, 256MB Nand Flash, 4MB Data FLASH, 64KB EEPROM Serial ports, USB,

More information

Mapping applications into MPSoC

Mapping applications into MPSoC Mapping applications into MPSoC concurrency & communication Jos van Eijndhoven jos@vectorfabrics.com March 12, 2011 MPSoC mapping: exploiting concurrency 2 March 12, 2012 Computation on general purpose

More information

How to Efficiently Implement Flexible and Full-Featured Digital Radio Solutions Using All Programmable SoCs

How to Efficiently Implement Flexible and Full-Featured Digital Radio Solutions Using All Programmable SoCs Delivering a Generation Ahead How to Efficiently Implement Flexible and Full-Featured Digital Radio Solutions Using All Programmable SoCs Agenda Introduction to Mobile Network Introduction to Xilinx Solution

More information

Hello, and welcome to this presentation of the STM32L4 System Configuration Controller.

Hello, and welcome to this presentation of the STM32L4 System Configuration Controller. Hello, and welcome to this presentation of the STM32L4 System Configuration Controller. 1 Please note that this presentation has been written for STM32L47x/48x devices. The key differences with other devices

More information

Overview. Implementing Gigabit Routers with NetFPGA. Basic Architectural Components of an IP Router. Per-packet processing in an IP Router

Overview. Implementing Gigabit Routers with NetFPGA. Basic Architectural Components of an IP Router. Per-packet processing in an IP Router Overview Implementing Gigabit Routers with NetFPGA Prof. Sasu Tarkoma The NetFPGA is a low-cost platform for teaching networking hardware and router design, and a tool for networking researchers. The NetFPGA

More information

The ARM Cortex-A9 Processors

The ARM Cortex-A9 Processors The ARM Cortex-A9 Processors This whitepaper describes the details of the latest high performance processor design within the common ARM Cortex applications profile ARM Cortex-A9 MPCore processor: A multicore

More information

Introduction to Embedded System Design using Zynq

Introduction to Embedded System Design using Zynq Introduction to Embedded System Design using Zynq Zynq Vivado 2015.2 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able

More information

IBM "Broadway" 512Mb GDDR3 Qimonda

IBM Broadway 512Mb GDDR3 Qimonda ffl Wii architecture ffl Wii components ffl Cracking Open" the Wii 20 1 CMPE112 Spring 2008 A. Di Blas 112 Spring 2008 CMPE Wii Nintendo ffl Architecture very similar to that of the ffl Fully backwards

More information