Hoplite-DSP Harnessing the Xilinx DSP48 Multiplexers to efficiently support NoCs on FPGAs. Chethan Kumar H B and Nachiket Kapre

1 Hoplite-DSP: Harnessing the Xilinx DSP48 Multiplexers to efficiently support NoCs on FPGAs. Chethan Kumar H B and Nachiket Kapre, nachiket@ieee.org

2 Hoplite: FPL 2015 paper, Jan Gray co-author. Specs: 60 LUTs + 100 FFs, 2.9ns clock. Smallest FPGA router available + RTL code.

6 Router comparison, 32b payload + Virtex-6 240T. Columns: Router, LUTs, FFs, Clock. Rows: Penn 1.7K 41 4.ns; CMU 1.K ns; FPL ns. Ratios: 2x x 1.x

12 Router comparison, 47b payload + Virtex-7 485T. Columns: Router, LUTs, FFs, Clock. Rows: Hoplite (FPL 2015); Hoplite-DSP (FPL) ns ns. Ratios: x 8x ~, + 1 DSP

14 Motivation. Close the gap vs. embedded NoCs: do we really want clean-slate hard NoCs? Return resources to the FPGA application: reduce NoC overheads. Find clever ways to reuse existing FPGA elements.

15 Outline. (1) Adapting the Hoplite arch. to the DSP. (2) Scaling to 2D layouts using DSP carry chains. (3) Performance and resource evaluation.

17 Overview of Hoplite switch organization. NoC organised as a unidirectional torus. Each switch has 2 inputs and 2 outputs into the network, plus a PE connection. Uses deflection routing: no buffering, no allocation, etc. Figure from: Jan Gray.
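
The deflection-routing behaviour summarised on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' RTL. The `Packet` type, the `route` function and the priority order (a packet already travelling south keeps the shared S/PE port, a turning west packet is deflected east, the PE injects only into a free port) are assumptions made for the example and are not stated on the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Packet:
    dst_x: int  # destination column
    dst_y: int  # destination row (checked by the client on the shared S/PE port)


def route(x: int,
          west: Optional[Packet],
          north: Optional[Packet],
          pe: Optional[Packet]) -> Tuple[Optional[Packet], Optional[Packet], bool]:
    """One bufferless routing decision for the switch in column x.

    Returns (east_out, south_out, pe_accepted). Hypothetical priority order:
    north first, then west, then the local PE.
    """
    south = north  # assumed: a packet already heading south keeps the S/PE port
    east = None
    if west is not None:
        if west.dst_x == x and south is None:
            south = west  # dimension-ordered turn at the destination column
        else:
            east = west   # still travelling east, or deflected around the ring
    accepted = False
    if pe is not None:
        if pe.dst_x == x and south is None:
            south, accepted = pe, True
        elif pe.dst_x != x and east is None:
            east, accepted = pe, True
    return east, south, accepted
```

Because nothing is buffered, a packet that loses arbitration is never stored: it either leaves on the other output (the west packet) or is simply not injected this cycle (the PE packet).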

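18 Hoplite internals (figure: inputs W, N and PE; outputs E and the shared S/PE port; packet multiplexers driven by select signals sel0 and sel1,2 from the DOR logic)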

19 Hoplite summary. Bulk of the LUT footprint comes from the LUT blocks that implement the packet multiplexers. DOR logic: a handful of LUTs only; it reads the address fields and valid signals. Inter-router links: pipelined registers. Idea: move (1) multiplexers + (2) registers into the Xilinx DSP block.

20 Xilinx DSP block (figure: inputs A, B, C, D and PCIN; X, Y and Z multiplexers feeding the ALU; outputs P and PCOUT; control inputs INMODE, OPMODE and ALUMODE)

22 Xilinx DSP block: very versatile! Typical use case: signal processing, streaming computations => mainly arithmetic. Programmable elements: INMODE controls a 27b multiplexer between A and D; OPMODE controls 48b multiplexers between A:B, C. Exploit cascade links PCIN/PCOUT!

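24 Input + multiplexer mapping (figure: the Hoplite switch inputs W, PE and N and outputs E and S/PE, with its DOR logic and sel0 / sel1,2 selects, mapped onto the DSP block ports A, B, C, D and PCIN, the X, Y and Z multiplexers, the ALU, and the P and PCOUT outputs; the DSP ports are labelled WEST, PE, N, SPE and EAST)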
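
How a DSP block can act as a packet multiplexer is easier to see with a small behavioural model. The sketch below assumes the OPMODE encodings of the 7-series DSP48E1 as commonly documented (X from bits 1:0, Y from bits 3:2, Z from bits 6:4, with the ALU adding the three); verify them against the vendor user guide before relying on them. Only the encodings that pass one input through are modelled, and `dsp_select` and the `SEL_*` constants are hypothetical names. Which switch input sits on which DSP port is not recoverable from the transcript, so the model uses the port names only.

```python
MASK48 = (1 << 48) - 1  # the X, Y, Z multiplexers and the P output are 48 bits wide

# OPMODE values that let exactly one input through (bit fields: Z[6:4], Y[3:2], X[1:0]).
SEL_AB = 0b000_00_11    # X = A:B, Y = 0, Z = 0
SEL_C = 0b000_11_00     # X = 0,   Y = C, Z = 0
SEL_PCIN = 0b001_00_00  # X = 0,   Y = 0, Z = PCIN


def dsp_select(opmode: int, ab: int, c: int, pcin: int) -> int:
    """Simplified DSP48-style datapath: P = X + Y + Z with OPMODE-selected operands."""
    x_sel = opmode & 0b11
    y_sel = (opmode >> 2) & 0b11
    z_sel = (opmode >> 4) & 0b111
    x_choices = {0b00: 0, 0b11: ab}
    y_choices = {0b00: 0, 0b11: c}
    z_choices = {0b000: 0, 0b001: pcin, 0b011: c}
    try:
        x, y, z = x_choices[x_sel], y_choices[y_sel], z_choices[z_sel]
    except KeyError:
        raise ValueError(f"OPMODE {opmode:#09b} is outside this simplified model") from None
    # With two operands forced to zero, the adder output equals the selected input.
    return (x + y + z) & MASK48
```

Since OPMODE is a dynamic control input, a few LUTs of routing logic can change the selection every cycle, for example `dsp_select(SEL_PCIN, ab, c, pcin)` to forward the cascade input.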

27 Multi-cycling. Problem: Hoplite has two outputs (three in fact, with the S/PE output port shared). Solution: must multi-pump; the DSP block runs at 2x the frequency of the PEs. First sub-cycle: resolve the EAST output. Second sub-cycle: resolve the SOUTH/PE output.

28 First cycle (figure: DSP block diagram labelled with PE Input, West Input and East Output)

29 Second cycle (figure: DSP block diagram labelled with North Input, PE Input, West Input and South/PE Output)
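
The two sub-cycles can be mimicked with a tiny model of one shared multiplexer that is time-multiplexed between the two outputs. This is only a sketch of the multi-pumping idea; the class name, the dictionary of inputs and the 600 MHz figure in the usage note are hypothetical and do not come from the slides.

```python
from typing import Dict, Optional


class MultiPumpedMux:
    """One shared multiplexer that resolves two outputs in two fast-clock ticks."""

    def __init__(self) -> None:
        self.phase = 0                        # 0: East sub-cycle, 1: South/PE sub-cycle
        self.east: Optional[str] = None       # output register captured in phase 0
        self.south_pe: Optional[str] = None   # output register captured in phase 1

    def tick(self, inputs: Dict[str, str],
             sel_east: Optional[str], sel_south_pe: Optional[str]) -> bool:
        """Advance one fast-clock tick; return True when a full PE cycle is done."""
        if self.phase == 0:
            self.east = inputs.get(sel_east)          # None select means no packet
        else:
            self.south_pe = inputs.get(sel_south_pe)
        self.phase ^= 1
        return self.phase == 0


def pe_clock_mhz(dsp_clock_mhz: float, pump_factor: int = 2) -> float:
    """The PEs see one result pair per pump_factor ticks of the DSP clock."""
    return dsp_clock_mhz / pump_factor
```

For instance, a DSP clocked at a hypothetical 600 MHz would present a 300 MHz interface to the PEs, which is the price paid for sharing one multiplexer between both outputs.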

31 DSP columnar layout (figure: a DSP column whose slices are chained PCOUT to PCIN by dedicated cascade routes, while the P, A:B and C ports connect to the DOR logic and user logic over the programmable FPGA interconnect)

32 Layout considerations. FPGA DSPs are organised into vertical columns: ~100s of DSPs in a column, ~10s of columns. Restrictions: 1. Cascade links only extend within a column. 2. Horizontal links must use general interconnect. Key question: adjusting NoC size vs. DSP count; use pass-through DSPs.

33 Embedded layout (figure: a DSP column containing Top-Turn DSPs (PCIN to P), Router DSPs connected to the fabric, Pass-thru DSPs (PCOUT to PCIN) along the cascade, and Bottom-Turn DSPs (A:B to PCOUT))
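
One way to picture how pass-through DSPs reconcile the NoC size with the DSP count of a column is to assign a role to every slice. The helper below is a hypothetical sketch: the function name, the even spacing of routers and the single corner-turn slice at each end are assumptions for illustration, not the placement used by the authors.

```python
from typing import List


def plan_column(height: int, routers: int) -> List[str]:
    """Assign a role to each DSP slice in one column, listed bottom to top.

    Assumed structure: a bottom-turn slice (A:B to PCOUT) feeds the cascade,
    a top-turn slice (PCIN to P) returns it to the fabric, routers are spread
    evenly in between, and every remaining slice is a pass-thru (PCOUT to PCIN).
    """
    if routers < 1 or height < routers + 2:
        raise ValueError("need one slice per router plus two corner-turn slices")
    interior = height - 2
    roles = ["pass-thru"] * height
    roles[0] = "bottom-turn"
    roles[-1] = "top-turn"
    for i in range(routers):
        # Integer spacing keeps the router positions distinct when interior >= routers.
        roles[1 + (i * interior) // routers] = "router"
    return roles
```

Calling `plan_column(6, 2)` places two routers in a six-slice column with a pass-thru slice after each one; a taller column with the same router count simply gains more pass-thru slices.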

34 Comparing Xilinx Virtex-6 and Virtex-7 layouts (figure: 8x8 NoC on the ML605 board; 16x16 NoC on the VC707 board)

36 LUTs vs DSPs. Simple tradeoff: substantially fewer LUTs vs. DSPs. Importantly, FFs are absorbed into the DSP. Power and effective BW for random traffic are mostly identical.

38 Commentary on hard NoCs. Area: hard router = 12.4 LABs; 1 Altera DSP block = 11.9 LABs (Stratix-III); Hoplite-DSP marginally smaller. Speed: hard router ~996 MHz; Hoplite-DSP ~60 MHz (multi-pumped); Hoplite-DSP limits freq advantage to 3x. Power: hard router ~1.8 W; Hoplite-DSP model ~1.1W at 1% activity; Hoplite-DSP uses ~0% less power. Hard NoC figures: Abdelfattah + Betz [TRETS2014] (extrapolated results for b-wide 1VC).

39 Wish-list for DSPs Gen2. Configurable cascades: b switched bidirectional routing instead of just cascades (approach hard NoC wiring); option to skip DSP blocks (segment lengths). DOR routing: pattern detection logic with multiple masks (similar to Altera DSP units). SIMD multiplexing: fracturing b-wide lanes into multiple lanes.

40 Conclusions. Hoplite muxes mapped to DSP blocks: use the dynamic OPMODE feature. Reduce cost by x LUTs, 8x FFs per router. Exploit cascade links to absorb NoC wiring. Significantly close the gap with hard NoCs.

41 Embedded layout: three kinds of DSPs. Router DSPs: small fraction of DSPs for the switching fabric. Pass-through DSPs (PCOUT to PCIN): glorified pipelined wires; multi-pumping 0% back to user. Corner-turn DSPs (Top-Turn: PCIN to P; Bottom-Turn: A:B to PCOUT): connect cascades to fabric.

42 Physical FPGA layout (figure: 2x2 NoC on the ML605 board, showing Corner-Turn and Pass-Thru DSPs with fabric and cascade links)

47 Efficiency: DSPs less efficient than LUT-based!

Implementing FPGA overlay NoCs using the Xilinx UltraScale memory cascades

Implementing FPGA overlay NoCs using the Xilinx UltraScale memory cascades Implementing FPGA overlay NoCs using the Xilinx UltraScale memory cascades Nachiket Kapre University of Waterloo Waterloo, Ontario, Canada Email: nachiket@uwaterloo.ca Abstract We can enhance the performance

More information

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs 1/29 FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs Nachiket Kapre + Tushar Krishna nachiket@uwaterloo.ca, tushar@ece.gatech.edu 2/29 Claim FPGA overlay NoCs

More information

DSP Resources. Main features: 1 adder-subtractor, 1 multiplier, 1 add/sub/logic ALU, 1 comparator, several pipeline stages

DSP Resources. Main features: 1 adder-subtractor, 1 multiplier, 1 add/sub/logic ALU, 1 comparator, several pipeline stages DSP Resources Specialized FPGA columns for complex arithmetic functionality DSP48 Tile: two DSP48 slices, interconnect Each DSP48 is a self-contained arithmeticlogical unit with add/sub/multiply/logic

More information

INTRODUCTION TO FPGA ARCHITECTURE

INTRODUCTION TO FPGA ARCHITECTURE 3/3/25 INTRODUCTION TO FPGA ARCHITECTURE DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) a b a y 2input Black Box y b Functional Schematic a b y a b y a b y 2 Truth Table (AND) Truth Table (OR) Truth Table (XOR)

More information

Parallel FIR Filters. Chapter 5

Parallel FIR Filters. Chapter 5 Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture

More information

FPGA Architecture Overview. Generic FPGA Architecture (1) FPGA Architecture

FPGA Architecture Overview. Generic FPGA Architecture (1) FPGA Architecture FPGA Architecture Overview dr chris dick dsp chief architect wireless and signal processing group xilinx inc. Generic FPGA Architecture () Generic FPGA architecture consists of an array of logic tiles

More information

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor

More information

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard

More information

Ultra-Fast NoC Emulation on a Single FPGA

Ultra-Fast NoC Emulation on a Single FPGA The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo

More information

Low-Power Interconnection Networks

Low-Power Interconnection Networks Low-Power Interconnection Networks Li-Shiuan Peh Associate Professor EECS, CSAIL & MTL MIT 1 Moore s Law: Double the number of transistors on chip every 2 years 1970: Clock speed: 108kHz No. transistors:

More information

Fast Flexible FPGA-Tuned Networks-on-Chip

Fast Flexible FPGA-Tuned Networks-on-Chip This work was funded by NSF. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations. Fast Flexible FPGA-Tuned Networks-on-Chip Michael K. Papamichael, James C. Hoe

More information

Fast Scalable FPGA-Based Network-on-Chip Simulation Models

Fast Scalable FPGA-Based Network-on-Chip Simulation Models We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations and support. Computer Architecture Lab at Carnegie Mellon Fast Scalable FPGA-Based Network-on-Chip Simulation

More information

FPGA architecture and design technology

FPGA architecture and design technology CE 435 Embedded Systems Spring 2017 FPGA architecture and design technology Nikos Bellas Computer and Communications Engineering Department University of Thessaly 1 FPGA fabric A generic island-style FPGA

More information

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool Jin Hee Kim and Jason Anderson FPL 2015 London, UK September 3, 2015 2 Motivation for Synthesizable FPGA Trend towards ASIC design flow Design

More information

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Fall 2002 EECS150 - Lec06-FPGA Page 1 Outline What are FPGAs? Why use FPGAs (a short history

More information

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs? EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Outline What are FPGAs? Why use FPGAs (a short history lesson). FPGA variations Internal logic

More information

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices 3 Digital Systems Implementation Programmable Logic Devices Basic FPGA Architectures Why Programmable Logic Devices (PLDs)? Low cost, low risk way of implementing digital circuits as application specific

More information

An FPGA Architecture Supporting Dynamically-Controlled Power Gating

An FPGA Architecture Supporting Dynamically-Controlled Power Gating An FPGA Architecture Supporting Dynamically-Controlled Power Gating Altera Corporation March 16 th, 2012 Assem Bsoul and Steve Wilton {absoul, stevew}@ece.ubc.ca System-on-Chip Research Group Department

More information

Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs

Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs This work was funded by NSF. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations. Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs

More information

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs Nachiket Kapre University of Waterloo Ontario, Canada nachiket@uwaterloo.ca Tushar Krishna Georgia Institute

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I ECE 636 Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays I Overview Anti-fuse and EEPROM-based devices Contemporary SRAM devices - Wiring - Embedded New trends - Single-driver wiring -

More information

3 Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs

3 Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs 3 Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs NACHKET KAPRE, University of Waterloo JAN GRA, Gray Research LLC We can design an FPGA-optimized lightweight network-on-chip (NoC) router

More information

Marathon: Statically-Scheduled Conflict-Free Routing on FPGA Overlay NoCs

Marathon: Statically-Scheduled Conflict-Free Routing on FPGA Overlay NoCs Marathon: Statically-Scheduled Conflict-Free Routing on FPGA Overlay NoCs Nachiket Kapre Nanyang Technological University 5 Nanyang Avenue, Singapore 639798 Email: nachiket@ieee.org Abstract We can improve

More information

High Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx

High Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx High Capacity and High Performance 20nm FPGAs Steve Young, Dinesh Gaitonde August 2014 Not a Complete Product Overview Page 2 Outline Page 3 Petabytes per month Increasing Bandwidth Global IP Traffic Growth

More information

Reconfigurable Cell Array for DSP Applications

Reconfigurable Cell Array for DSP Applications Outline econfigurable Cell Array for DSP Applications Chenxin Zhang Department of Electrical and Information Technology Lund University, Sweden econfigurable computing Coarse-grained reconfigurable cell

More information

AUGMENTING FPGAS WITH EMBEDDED NETWORKS-ON-CHIP

AUGMENTING FPGAS WITH EMBEDDED NETWORKS-ON-CHIP AUGMENTING FPGAS WITH EMBEDDED NETWORKS-ON-CHIP Mohamed S. Abdelfattah and Vaughn Betz Department of Electrical and Computer Engineering University of Toronto, Toronto, ON, Canada {mohamed,vaughn}@eecg.utoronto.ca

More information

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes.

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes. Topics! SRAM-based FPGA fabrics:! Xilinx.! Altera. SRAM-based FPGAs! Program logic functions, using SRAM.! Advantages:! Re-programmable;! dynamically reconfigurable;! uses standard processes.! isadvantages:!

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

CPE/EE 422/522. Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices. Dr. Rhonda Kay Gaede UAH. Outline

CPE/EE 422/522. Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices. Dr. Rhonda Kay Gaede UAH. Outline CPE/EE 422/522 Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices Dr. Rhonda Kay Gaede UAH Outline Introduction Field-Programmable Gate Arrays Virtex Virtex-E, Virtex-II, and Virtex-II

More information

GRVI Phalanx. A Massively Parallel RISC-V FPGA Accelerator Accelerator. Jan Gray

GRVI Phalanx. A Massively Parallel RISC-V FPGA Accelerator Accelerator. Jan Gray GRVI Phalanx A Massively Parallel RISC-V FPGA Accelerator Accelerator Jan Gray jan@fpga.org Introduction FPGA accelerators are hot MSR Catapult. Intel += Altera. OpenPOWER + Xilinx FPGAs as computers Massively

More information

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology 1 ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology Mikkel B. Stensgaard and Jens Sparsø Technical University of Denmark Technical University of Denmark Outline 2 Motivation ReNoC Basic

More information

Altera FLEX 8000 Block Diagram

Altera FLEX 8000 Block Diagram Altera FLEX 8000 Block Diagram Figure from Altera technical literature FLEX 8000 chip contains 26 162 LABs Each LAB contains 8 Logic Elements (LEs), so a chip contains 208 1296 LEs, totaling 2,500 16,000

More information

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Maheswari Murali * and Seetharaman Gopalakrishnan # * Assistant professor, J. J. College of Engineering and Technology,

More information

Xilinx ASMBL Architecture

Xilinx ASMBL Architecture FPGA Structure Xilinx ASMBL Architecture Design Flow Synthesis: HDL to FPGA primitives Translate: FPGA Primitives to FPGA Slice components Map: Packing of Slice components into Slices, placement of Slices

More information

EN2911X: Reconfigurable Computing Topic 01: Programmable Logic

EN2911X: Reconfigurable Computing Topic 01: Programmable Logic EN2911X: Reconfigurable Computing Topic 01: Programmable Logic Prof. Sherief Reda School of Engineering, Brown University Fall 2012 1 FPGA architecture Programmable interconnect Programmable logic blocks

More information

FPGA Based Digital Design Using Verilog HDL

FPGA Based Digital Design Using Verilog HDL FPGA Based Digital Design Using Course Designed by: IRFAN FAISAL MIR ( Verilog / FPGA Designer ) irfanfaisalmir@yahoo.com * Organized by Electronics Division Integrated Circuits Uses for digital IC technology

More information

Design and Implementation of Buffer Loan Algorithm for BiNoC Router

Design and Implementation of Buffer Loan Algorithm for BiNoC Router Design and Implementation of Buffer Loan Algorithm for BiNoC Router Deepa S Dev Student, Department of Electronics and Communication, Sree Buddha College of Engineering, University of Kerala, Kerala, India

More information

Mapping a Pipelined Data Path onto a Network-on-Chip

Mapping a Pipelined Data Path onto a Network-on-Chip Mapping a Pipelined Data Path onto a Network-on-Chip Stephan Kubisch, Claas Cornelius, Ronald Hecht, Dirk Timmermann {stephan.kubisch;claas.cornelius}@uni-rostock.de University of Rostock Institute of

More information

FPGA for Software Engineers

FPGA for Software Engineers FPGA for Software Engineers Course Description This course closes the gap between hardware and software engineers by providing the software engineer all the necessary FPGA concepts and terms. The course

More information

NoC Test-Chip Project: Working Document

NoC Test-Chip Project: Working Document NoC Test-Chip Project: Working Document Michele Petracca, Omar Ahmad, Young Jin Yoon, Frank Zovko, Luca Carloni and Kenneth Shepard I. INTRODUCTION This document describes the low-power high-performance

More information

Stratix II vs. Virtex-4 Performance Comparison

Stratix II vs. Virtex-4 Performance Comparison White Paper Stratix II vs. Virtex-4 Performance Comparison Altera Stratix II devices use a new and innovative logic structure called the adaptive logic module () to make Stratix II devices the industry

More information

The Xilinx XC6200 chip, the software tools and the board development tools

The Xilinx XC6200 chip, the software tools and the board development tools The Xilinx XC6200 chip, the software tools and the board development tools What is an FPGA? Field Programmable Gate Array Fully programmable alternative to a customized chip Used to implement functions

More information

Ultra-Low Latency, Bit-Parallel Message Exchange in Optical Packet Switched Interconnection Networks

Ultra-Low Latency, Bit-Parallel Message Exchange in Optical Packet Switched Interconnection Networks Ultra-Low Latency, Bit-Parallel Message Exchange in Optical Packet Switched Interconnection Networks O. Liboiron-Ladouceur 1, C. Gray 2, D. Keezer 2 and K. Bergman 1 1 Department of Electrical Engineering,

More information

Dual Split-Merge: A High Throughput Router Architecture for FPGAs

Dual Split-Merge: A High Throughput Router Architecture for FPGAs Dual plit-erge: A High Throughput Router Architecture for FPGAs Khaled Helal a,, ameh Attia a,, Hossam Fahmy a, Tawfik Ismail a, Yehea Ismail b, Hassan ostafa a,b, Abstract a Department of Electronics

More information

An Overlay Architecture for FPGA-based Industrial Control Systems Designed with Functional Block Diagrams

An Overlay Architecture for FPGA-based Industrial Control Systems Designed with Functional Block Diagrams R2-7 SASIMI 26 Proceedings An Overlay Architecture for FPGA-based Industrial Control Systems Designed with Functional Block Diagrams Taisei Segawa, Yuichiro Shibata, Yudai Shirakura, Kenichi Morimoto,

More information

Hoplite-Q: Priority-Aware Routing in FPGA Overlay NoCs

Hoplite-Q: Priority-Aware Routing in FPGA Overlay NoCs Hoplite-Q: Priority-Aware Routing in FPGA Overlay NoCs Siddhartha Nanyang Technological University siddhart00@e.ntu.edu.sg Nachiket Kapre University of Waterloo nachiket@uwaterloo.ca Abstract The Hoplite

More information

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 Power Analysis of Embedded NoCs on FPGAs and Comparison With Custom Buses Mohamed S. Abdelfattah, Graduate Student Member, IEEE, and Vaughn

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip

Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip Nauman Jalil, Adnan Qureshi, Furqan Khan, and Sohaib Ayyaz Qazi Abstract

More information

Lecture 6: Hard vs Soft Logic. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 6: Hard vs Soft Logic. James C. Hoe Department of ECE Carnegie Mellon University 18 643 Lecture 6: Hard vs Soft Logic James C. Hoe Department of ECE Carnegie Mellon niversity 18 643 F17 L06 S1, James C. Hoe, CM/ECE/CALCM, 2017 Housekeeping Your goal today: understand the difference

More information

Qsys and IP Core Integration

Qsys and IP Core Integration Qsys and IP Core Integration Stephen A. Edwards (after David Lariviere) Columbia University Spring 2016 IP Cores Altera s IP Core Integration Tools Connecting IP Cores IP Cores Cyclone V SoC: A Mix of

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

TSEA44 - Design for FPGAs

TSEA44 - Design for FPGAs 2015-11-24 Now for something else... Adapting designs to FPGAs Why? Clock frequency Area Power Target FPGA architecture: Xilinx FPGAs with 4 input LUTs (such as Virtex-II) Determining the maximum frequency

More information

Organic Computing. Dr. rer. nat. Christophe Bobda Prof. Dr. Rolf Wanka Department of Computer Science 12 Hardware-Software-Co-Design

Organic Computing. Dr. rer. nat. Christophe Bobda Prof. Dr. Rolf Wanka Department of Computer Science 12 Hardware-Software-Co-Design Dr. rer. nat. Christophe Bobda Prof. Dr. Rolf Wanka Department of Computer Science 12 Hardware-Software-Co-Design 1 Reconfigurable Computing Platforms 2 The Von Neumann Computer Principle In 1945, the

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

EECS Components and Design Techniques for Digital Systems. Lec 20 RTL Design Optimization 11/6/2007

EECS Components and Design Techniques for Digital Systems. Lec 20 RTL Design Optimization 11/6/2007 EECS 5 - Components and Design Techniques for Digital Systems Lec 2 RTL Design Optimization /6/27 Shauki Elassaad Electrical Engineering and Computer Sciences University of California, Berkeley Slides

More information

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel Hyoukjun Kwon and Tushar Krishna Georgia Institute of Technology Synergy Lab (http://synergy.ece.gatech.edu) hyoukjun@gatech.edu April

More information

EECS150 - Digital Design Lecture 09 - Parallelism

EECS150 - Digital Design Lecture 09 - Parallelism EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization

More information

Verilog for High Performance

Verilog for High Performance Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes

More information

Power Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas

Power Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas Power Solutions for Leading-Edge FPGAs Vaughn Betz & Paul Ekas Agenda 90 nm Power Overview Stratix II : Power Optimization Without Sacrificing Performance Technical Features & Competitive Results Dynamic

More information

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture

More information

Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA

Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA Total Tranceiver BW (Gb/s) Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA Andrew Bitar, Mohamed S. Abdelfattah, Vaughn Betz Department of Electrical and Computer

More information

Messaging Overview. Introduction. Gen-Z Messaging
