SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems

1 SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems
  Zeping Xue, David Thomas (zeping.xue10@imperial.ac.uk)
  FPL 2015, London, 4 Sept

2 Single Processor System [diagram labels: Allocator, Memory, Memory Mapped Bus]

3 Allocator [diagram labels: Size, Allocator, Address]
    int *ptr; ptr = malloc(size);

4 Allocator [diagram labels: Size, Allocator, Address]
  1. Search for suitable memory block
  2. Record the memory allocation
    int *ptr; ptr = malloc(size);

5 Multi-Processors (Homogeneous) System [diagram labels: Allocator (x3), Memory, Memory Mapped Bus]

6 What about Heterogeneous Systems? [diagram labels: Allocator, Memory, Memory Mapped Bus, FPGA Accelerator 1, FPGA Accelerator 2]

7 What about Heterogeneous Systems? PushPush (Fleming et al., FPL 2015) [diagram labels: Allocator, Memory, Memory Mapped Bus, FPGA Accelerator 1, FPGA Accelerator 2]

8 Proposed Solution: Hardware Allocator [diagram labels: SysAlloc, Memory, Memory Mapped Bus, FPGA Accelerator 1, FPGA Accelerator 2]

9 Having memory management in heterogeneous systems can:
  - Improve memory efficiency
  - Reduce design effort
  - Make HLS more capable

10 Related Work
  1. Managing Shared Distributed BRAMs in NoCs [1][2][3]
  2. Crossbar connected MPSoC with shared BRAMs [4]
  3. HLS DMMs [5][6]
  [1] Anagnostopoulos et al.
  [2] Göhringer et al.
  [3] Monchiero et al.
  [4] Dessouky et al.
  [5] Winterstein et al.
  [6] Diamantopoulos et al.

11 Related Work: comparison with SysAlloc

                SysAlloc      [1]   [2]   [3]   [4]
  Connection    Bus           NoC   NoC   NoC   Crossbar
  Memory        On/Off-Chip   Shared Distributed On-Chip BRAMs
  Scalability   Unbounded     Platform-Dependent  Limited by resource
  Clock Rate    150MHz        Platform-Dependent  MHz

  [1] Anagnostopoulos et al.
  [2] Göhringer et al.
  [3] Monchiero et al.
  [4] Dessouky et al.
  [5] Winterstein et al.
  [6] Diamantopoulos et al. 2015

12 Our Proposed Hardware Allocator: SysAlloc
  - Scalable: functionality-wise, supports an arbitrary number of clients
  - Flexible: any range of memory can be managed
  - Efficient: clock rate, resource utilisation

13 Route Map
  SysAlloc:
  - System Integration
  - Allocator Algorithm
  - Evaluation Results
  - Conclusion

14 SysAlloc System Integration

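16 Allocator Access Examples: Software (ARM core) Access

    int *SysMalloc(int size) {
        int token, status, *pointer;
        token = 0;
        status = 0;
        /* Poll the token register until a non-zero token is read. */
        while (token == 0) {
            token = AXI_READ(SysA_TOKEN);
        }
        /* Issue the request by writing the size to the command register. */
        AXI_WRITE(SysA_CMD, size);
        /* Poll the status register until the allocator reports completion. */
        while (status == 0) {
            status = AXI_READ(SysA_STAT);
        }
        pointer = (int *)AXI_READ(SysA_RESULT);
        return pointer;
    }

  Memory Map 0x x (Storage): SysA_TOKEN = 1, SysA_CMD = 0, SysA_STAT = 0, SysA_RESULT = 0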

18 Allocator Access Examples: Software (ARM core) Access (code as on slide 16)
  Memory map with the token taken: SysA_TOKEN = 0, SysA_CMD = 0, SysA_STAT = 0, SysA_RESULT = 0

20 Allocator Access Examples: Software (ARM core) Access (code as on slide 16)
  Memory map after the request size is written: SysA_TOKEN = 0, SysA_CMD = size, SysA_STAT = 0, SysA_RESULT = 0

24 Allocator Access Examples: Software (ARM core) Access (code as on slide 16)
  Memory map once the allocator has produced a result: SysA_TOKEN = 0, SysA_CMD = size, SysA_STAT = 1, SysA_RESULT = 0x0000A100

29 Allocator Access Examples: Software (ARM core) Access (code as on slide 16)
  Final memory map, with the token available again: SysA_TOKEN = 1, SysA_CMD = size, SysA_STAT = 0, SysA_RESULT = 0x0000A100

30 Allocator Access Examples

  Software (ARM core) Access

    int *SysMalloc(int size) {
        int token, status, *pointer;
        token = 0;
        status = 0;
        while (token == 0) {
            token = AXI_READ(SysA_TOKEN);
        }
        AXI_WRITE(SysA_CMD, size);
        while (status == 0) {
            status = AXI_READ(SysA_STAT);
        }
        pointer = (int *)AXI_READ(SysA_RESULT);
        return pointer;
    }

  HLS Access (Vivado HLS)

    int *SysMalloc(int size, int *PortM) {
        int token, status, result;
        token = 0;
        status = 0;
        while (token == 0) {
            token = PortM[SysA_TOKEN];
        }
        PortM[SysA_CMD] = size;
        while (status == 0) {
            status = PortM[SysA_STAT];
        }
        result = PortM[SysA_RESULT];
        /* The result is applied as an offset from the port base. */
        return result + PortM;
    }
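
The handshake on slides 16 to 30 can be exercised on a host machine with a small mock of the register block. The sketch below is illustrative only: the register indices, the mock's side effects (reading the token claims it, reading the result releases it and clears the status), and the bump-pointer "allocator" are all assumptions chosen to reproduce the register values shown on the slides, not a description of the SysAlloc hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register indices; the real address map is not given in the slides. */
enum { SysA_TOKEN, SysA_CMD, SysA_STAT, SysA_RESULT, SysA_NREGS };

static uint32_t regs[SysA_NREGS] = { 1, 0, 0, 0 }; /* token starts available */
static uint32_t next_free = 0x0000A100u;           /* toy bump pointer (assumption) */

/* Mock read. Assumed side effects: reading TOKEN claims it;
 * reading RESULT clears STAT and makes the token available again. */
static uint32_t AXI_READ(int reg) {
    uint32_t value = regs[reg];
    if (reg == SysA_TOKEN) {
        regs[SysA_TOKEN] = 0;
    } else if (reg == SysA_RESULT) {
        regs[SysA_STAT] = 0;
        regs[SysA_TOKEN] = 1;
    }
    return value;
}

/* Mock write. Writing CMD makes the toy allocator answer immediately. */
static void AXI_WRITE(int reg, uint32_t value) {
    regs[reg] = value;
    if (reg == SysA_CMD) {
        regs[SysA_RESULT] = next_free;
        next_free += value;
        regs[SysA_STAT] = 1;
    }
}

/* Same polling sequence as the slide code, returning the raw address. */
static uint32_t SysMalloc(uint32_t size) {
    assert(size > 0);
    while (AXI_READ(SysA_TOKEN) == 0) { /* wait for the token */ }
    AXI_WRITE(SysA_CMD, size);
    while (AXI_READ(SysA_STAT) == 0) { /* wait for completion */ }
    return AXI_READ(SysA_RESULT);
}
```

Calling `SysMalloc(16)` against this mock walks the registers through the same sequence of values as the slides, ending with the token set and the status cleared. On real hardware the two polling loops are where a client would wait for another client's request to finish.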

31 SysAlloc Allocator Algorithm

32 Background of SysAlloc Allocator Algorithm [1]

33 Background of SysAlloc Allocator Algorithm [1]: Allocated Memory, Bit-Map

36 Limitation! For a large bit-map, searching the bit-map directly gives high resource utilisation and a low clock rate, due to the long sequence of logic gates. [diagram label: Bit-Map]
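
A software analogue shows where the long dependency chain comes from. The sketch below is a plain first-fit scan over a 32-bit bit-map; the function names and the toy bit-map width are assumptions for illustration, and it does not reproduce the hardware design cited as [1]. Each bit's contribution to the run length depends on the bits before it, which is the serial chain that grows with the bit-map size.

```c
#include <assert.h>
#include <stdint.h>

#define BITMAP_BITS 32 /* hypothetical toy size; one bit per minimum block */

/* Return the index of the first run of n free (0) bits, or -1 if none exists. */
static int bitmap_find_run(uint32_t bitmap, int n) {
    assert(n > 0 && n <= BITMAP_BITS);
    int run = 0;
    for (int i = 0; i < BITMAP_BITS; i++) {
        if ((bitmap >> i) & 1u) {
            run = 0;               /* allocated bit breaks the run */
        } else if (++run == n) {
            return i - n + 1;      /* start of the free run */
        }
    }
    return -1;
}

/* Record an allocation by setting n bits starting at start. */
static uint32_t bitmap_mark(uint32_t bitmap, int start, int n) {
    assert(start >= 0 && n > 0 && start + n <= BITMAP_BITS);
    for (int i = start; i < start + n; i++) {
        bitmap |= (1u << i);
    }
    return bitmap;
}
```

With blocks 0 to 3 already allocated (`0x0000000F`), a request for 3 blocks is placed at block 4. The scan visits every bit in the worst case, so its cost grows linearly with the bit-map.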

37 SysAlloc Allocator Algorithm: Our Proposed Solution

38 Allocation Tree: Allocation Search Binary Tree [diagram: tree above the Bit-Map]
  Node Value   Node Status
  00           Empty
  10           Partially Full
  11           Full

39 Allocation Tree Storage: the allocation search binary tree is stored in memory as groups of nodes. [diagram label: Bit-Map]

40 Allocation Tree Storage [diagram label: Bit-Map]

41 Allocation Tree Storage: Node Value Representation
  Node Value   Node Status
  00           Empty
  10           Partially Full
  11           Full
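
One convenient property of the 00/10/11 encoding is that a parent's value can be computed from its two children with two gates per node. The slides give only the encoding, so the derivation rule below is an assumption: the high bit is read as "something below is allocated" and the low bit as "everything below is allocated". The type and function names are hypothetical.

```c
#include <assert.h>

/* 2-bit node values from the slides: 00 Empty, 10 Partially Full, 11 Full. */
typedef enum { NODE_EMPTY = 0x0, NODE_PARTIAL = 0x2, NODE_FULL = 0x3 } node_status;

/* Assumed rule: parent high bit = OR of child high bits,
 * parent low bit = AND of child low bits. */
static node_status parent_status(node_status left, node_status right) {
    assert(left != 0x1 && right != 0x1); /* 01 is not a valid node value */
    unsigned any_allocated = ((unsigned)left | (unsigned)right) & 0x2u;
    unsigned all_allocated = ((unsigned)left & (unsigned)right) & 0x1u;
    return (node_status)(any_allocated | all_allocated);
}
```

Under this rule two Empty children give an Empty parent, two Full children give a Full parent, and every other combination gives Partially Full.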

42 Allocation Tree Storage. Request size = 3 blocks. Tree levels from the root down: 32 blocks, 16 blocks, 8 blocks, 4 blocks, 2 blocks, Bit-Map.

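
The walk-through above (a 3-block request against a 32-block tree) can be sketched in software as follows. This is a simplified model under stated assumptions, not the SysAlloc implementation: requests are rounded up to a power of two (so 3 blocks occupy a 4-block node), the tree is stored as a flat heap-ordered array rather than as groups of nodes, and the search is a left-first descent that may backtrack. All names are hypothetical.

```c
#include <assert.h>

#define BLOCKS 32 /* number of minimum-size blocks, as in the slide example */
enum { EMPTY = 0, PARTIAL = 2, FULL = 3 }; /* 00, 10, 11 */

/* Heap layout: node i has children 2i and 2i+1; index 0 is unused.
 * Node 1 spans all 32 blocks; nodes 32..63 span one block each. */
static unsigned char tree[2 * BLOCKS];

/* Assumed policy: round the request up to the next power of two. */
static int round_up_pow2(int n) {
    int p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* Left-first search for an EMPTY node spanning exactly `want` blocks.
 * Returns the node index, or 0 if the subtree has no such node. */
static int find_node(int node, int span, int want) {
    if (tree[node] == FULL) return 0;
    if (span == want) return tree[node] == EMPTY ? node : 0;
    int hit = find_node(2 * node, span / 2, want);
    return hit ? hit : find_node(2 * node + 1, span / 2, want);
}

/* Allocate nblocks; returns the index of the first block, or -1 on failure. */
static int tree_alloc(int nblocks) {
    assert(nblocks > 0 && nblocks <= BLOCKS);
    int want = round_up_pow2(nblocks);
    int node = find_node(1, BLOCKS, want);
    if (node == 0) return -1;
    /* Descendants of a FULL node are never consulted, so they are left as-is. */
    tree[node] = FULL;
    for (int p = node / 2; p >= 1; p /= 2) {
        tree[p] = (tree[2 * p] == FULL && tree[2 * p + 1] == FULL) ? FULL : PARTIAL;
    }
    int first_node_on_level = BLOCKS / want;
    return (node - first_node_on_level) * want;
}
```

Starting from an empty tree, two 3-block requests land at blocks 0 and 4, after which the 8-block node above them becomes Full and its ancestors Partially Full. A subsequent 16-block request skips the partially full left half and is placed at block 16.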

52 Benefits: Constant FPGA Logic Utilisation

53 Benefits: For large bit-maps, the maximum tree depth, log2(Bit-Map Size), is much smaller than the bit-map size. Using the search tree is therefore more efficient than searching the bit-map directly.
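
As a worked figure, take the configuration used in the evaluation (128MB managed with a 16B minimum block), and assume one bit-map bit per minimum block and 1MB = 2^20 bytes:

```latex
\[
N = \frac{128\,\mathrm{MB}}{16\,\mathrm{B}} = \frac{2^{27}}{2^{4}} = 2^{23} = 8\,388\,608 \ \text{bits},
\qquad
\log_2 N = 23 \ll 8\,388\,608 .
\]
```

A tree descent therefore visits on the order of 23 levels, where a direct scan may have to examine millions of bits.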

54 Evaluation Results

55 Evaluation Set-up
  Environment: Zynq-7000 SoC XC7Z MHz
  SysAlloc Configuration: Manage 128MB DDR; Minimum Block Size = 16B
  Synthetic Clients: Request rate and size are configurable

56 Scalability: Pathological Case

57 Scalability: Size & Wait Period with Exponential Growth. 1.36 million malloc/s

58 Configuration Cases
  Case A: Managing 512MB DDR with a minimum block size of 16B.
  Case B: A linked list using 256KB of on-chip memory with a minimum block size of 8B.
  Case C: Signal processing applications that need 256 1MB buffers to be managed.
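
The bit-map sizes implied by these three cases follow from simple arithmetic. The helper below is a sketch with hypothetical names; it assumes one bit per minimum block, binary units (1MB = 2^20 bytes), and, for Case C, that each 1MB buffer is a single block. The values charted in the talk may be computed differently.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-map entries when one bit tracks one minimum-size block (assumption). */
static uint64_t bitmap_bits(uint64_t managed_bytes, uint64_t min_block_bytes) {
    assert(min_block_bytes > 0 && managed_bytes % min_block_bytes == 0);
    return managed_bytes / min_block_bytes;
}

/* Smallest d such that 2^d >= n, i.e. the depth of a binary tree over n leaves. */
static unsigned tree_depth(uint64_t n) {
    assert(n >= 1);
    unsigned d = 0;
    while ((1ull << d) < n) d++;
    return d;
}
```

Under these assumptions Case A needs 2^25 bits (depth 25), Case B needs 2^15 bits (depth 15), and Case C needs 256 bits (depth 8), so the three cases span five orders of magnitude in bit-map size while the tree depth changes by roughly a factor of three.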

59 Configuration Cases: Bit-Map Size

60 Configuration Cases: FPGA LEs Utilisation

61 Configuration Cases: Memory Bits Required

62 Configuration Cases: Latency

63 Conclusion

64 Conclusion: SysAlloc is Scalable, Efficient, and Flexible. Peak system allocation rate = 1.36 million malloc/s when managing 128MB DDR for 4 clients.

65 Future Work: Faster Allocator

66 Future Work: More Efficient Communication Protocol

67 Future Work: Garbage Collection

68 Future Work: HLS Access

69 Thank You & Questions? Source code can be found at


More information

FPGA Implementation of a Memory-Efficient Hough Parameter Space for the Detection of Lines

FPGA Implementation of a Memory-Efficient Hough Parameter Space for the Detection of Lines FPGA Implementation of a Memory-Efficient Hough Parameter Space for the Detection of Lines David Northcote*, Louise H. Crockett, Paul Murray Department of Electronic and Electrical Engineering, University

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and

More information

EFFICIENT AUTOMATED SYNTHESIS, PROGRAMING, AND IMPLEMENTATION OF MULTI-PROCESSOR PLATFORMS ON FPGA CHIPS. Hristo Nikolov Todor Stefanov Ed Deprettere

EFFICIENT AUTOMATED SYNTHESIS, PROGRAMING, AND IMPLEMENTATION OF MULTI-PROCESSOR PLATFORMS ON FPGA CHIPS. Hristo Nikolov Todor Stefanov Ed Deprettere EFFICIENT AUTOMATED SYNTHESIS, PROGRAMING, AND IMPLEMENTATION OF MULTI-PROCESSOR PLATFORMS ON FPGA CHIPS Hristo Nikolov Todor Stefanov Ed Deprettere Leiden Embedded Research Center Leiden Institute of

More information

Protecting Embedded Systems from Zero-Day Attacks

Protecting Embedded Systems from Zero-Day Attacks Protecting Embedded Systems from Zero-Day Attacks Professor Stephen Taylor Thayer School of Engineering at Dartmouth stnh.email@icloud.com (603) 727-8945 MicroArx.com Apiotics.com 1 Research Support Current

More information

Third Genera+on USRP Devices and the RF Network- On- Chip. Leif Johansson Market Development RF, Comm and SDR

Third Genera+on USRP Devices and the RF Network- On- Chip. Leif Johansson Market Development RF, Comm and SDR Third Genera+on USRP Devices and the RF Network- On- Chip Leif Johansson Market Development RF, Comm and SDR About Ettus Research Leader in soeware defined radio and signals intelligence Maker of USRP

More information

High Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS

High Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS High Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS Yury Rumyantsev rumyantsev@rosta.ru 1 Agenda 1. Intoducing Rosta and Hardware overview 2. Microtubule Modeling Problem 3.

More information

Requirement ZYNQ SOC Development Board: Z-Turn by MYiR ZYNQ-7020 (XC7Z020-1CLG400C) Vivado and Xilinx SDK TF Card Reader (Micro SD) Windows 7

Requirement ZYNQ SOC Development Board: Z-Turn by MYiR ZYNQ-7020 (XC7Z020-1CLG400C) Vivado and Xilinx SDK TF Card Reader (Micro SD) Windows 7 Project Description The ARM CPU is configured to perform read and write operations on the Block Memory. The Block Memory is created in the PL side of the ZYNQ device. The ARM CPU is configured as Master

More information

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers

More information

S2C K7 Prodigy Logic Module Series

S2C K7 Prodigy Logic Module Series S2C K7 Prodigy Logic Module Series Low-Cost Fifth Generation Rapid FPGA-based Prototyping Hardware The S2C K7 Prodigy Logic Module is equipped with one Xilinx Kintex-7 XC7K410T or XC7K325T FPGA device

More information

Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques

Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques Nandini Sultanpure M.Tech (VLSI Design and Embedded System), Dept of Electronics and Communication Engineering, Lingaraj

More information

High-Speed NAND Flash

High-Speed NAND Flash High-Speed NAND Flash Design Considerations to Maximize Performance Presented by: Robert Pierce Sr. Director, NAND Flash Denali Software, Inc. History of NAND Bandwidth Trend MB/s 20 60 80 100 200 The

More information

3D WiNoC Architectures

3D WiNoC Architectures Interconnect Enhances Architecture: Evolution of Wireless NoC from Planar to 3D 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan Sep 18th, 2014 Hiroki Matsutani, "3D WiNoC Architectures",

More information

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email

More information

Improving Energy Efficiency with Special-Purpose Accelerators

Improving Energy Efficiency with Special-Purpose Accelerators Improving Energy Efficiency with Special-Purpose Accelerators Alexandru Fiodorov Embedded Computing Systems Submission date: June 2013 Supervisor: Magnus Jahre, IDI Norwegian University of Science and

More information

Implementation of Hardware Accelerators on Zynq

Implementation of Hardware Accelerators on Zynq Downloaded from orbit.dtu.dk on: Dec 29, 2018 Implementation of Hardware Accelerators on Zynq Toft, Jakob Kenn; Nannarelli, Alberto Publication date: 2016 Document Version Publisher's PDF, also known as

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder ESE532: System-on-a-Chip Architecture Day 8: September 26, 2018 Spatial Computations Today Graph Cycles (from Day 7) Accelerator Pipelines FPGAs Zynq Computational Capacity 1 2 Message Custom accelerators

More information

Are Your Passwords Safe: Energy-Efficient Bcrypt Cracking with Low-Cost Parallel Hardware

Are Your Passwords Safe: Energy-Efficient Bcrypt Cracking with Low-Cost Parallel Hardware Are Your Passwords Safe: Energy-Efficient Bcrypt Cracking with Low-Cost Parallel Hardware Katja Malvoni (kmalvoni at openwall.com) Solar Designer (solar at openwall.com) Josip Knezovic (josip.knezovic

More information

SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS

SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS 1 SARAVANAN.K, 2 R.M.SURESH 1 Asst.Professor,Department of Information Technology, Velammal Engineering College, Chennai, Tamilnadu,

More information

NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES

NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES Design: Part 1 High Level Synthesis (Xilinx Vivado HLS) Part 2 SDSoC (Xilinx, HLS + ARM) Part 3 OpenCL (Altera OpenCL SDK) Verification:

More information

Clock Speed Optimization of Runtime Reconfigurable Systems by Signal Latency Measurement

Clock Speed Optimization of Runtime Reconfigurable Systems by Signal Latency Measurement Department of Electrical Engineering Computer Engineering Helmut Schmidt University, Hamburg University of the Federal Armed Forces of Germany Meyer, Haase, Eckert, Klauer Clock Speed Optimization of Runtime

More information

Exploring Automatically Generated Platforms in High Performance FPGAs

Exploring Automatically Generated Platforms in High Performance FPGAs Exploring Automatically Generated Platforms in High Performance FPGAs Panagiotis Skrimponis b, Georgios Zindros a, Ioannis Parnassos a, Muhsen Owaida b, Nikolaos Bellas a, and Paolo Ienne b a Electrical

More information

Porting VME-Based Optical-Link Remote I/O Module to a PLC Platform - an Approach to Maximize Cross-Platform Portability Using SoC

Porting VME-Based Optical-Link Remote I/O Module to a PLC Platform - an Approach to Maximize Cross-Platform Portability Using SoC Porting VME-Based Optical-Link Remote I/O Module to a PLC Platform - an Approach to Maximize Cross-Platform Portability Using SoC T. Masuda, A. Kiyomichi Japan Synchrotron Radiation Research Institute

More information

High Level Synthesis with a Dataflow Architectural Template

High Level Synthesis with a Dataflow Architectural Template High Level Synthesis with a Dataflow Architectural Template Shaoyi Cheng and John Wawrzynek Department of EECS, UC Berkeley, California, USA 94720 Email: sh cheng@berkeley.edu, johnw@eecs.berkeley.edu

More information

Chapter 2 Designing Crossbar Based Systems

Chapter 2 Designing Crossbar Based Systems Chapter 2 Designing Crossbar Based Systems Over the last decade, the communication architecture of SoCs has evolved from single shared bus systems to multi-bus systems. Today, state-of-the-art bus based

More information

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers Young Hoon Kang, Taek-Jun Kwon, and Jeff Draper {youngkan, tjkwon, draper}@isi.edu University of Southern California

More information

Data Storage and Query Answering. Data Storage and Disk Structure (2)

Data Storage and Query Answering. Data Storage and Disk Structure (2) Data Storage and Query Answering Data Storage and Disk Structure (2) Review: The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM @200MHz) 6,400

More information

A So%ware Developer's Journey into a Deeply Heterogeneous World. Tomas Evensen, CTO Embedded So%ware, Xilinx

A So%ware Developer's Journey into a Deeply Heterogeneous World. Tomas Evensen, CTO Embedded So%ware, Xilinx A So%ware Developer's Journey into a Deeply Heterogeneous World Tomas Evensen, CTO Embedded So%ware, Xilinx Embedded Development: Then Simple single CPU Most code developed internally 10 s of thousands

More information

IMPROVES. Initial Investment is Low Compared to SoC Performance and Cost Benefits

IMPROVES. Initial Investment is Low Compared to SoC Performance and Cost Benefits NOC INTERCONNECT IMPROVES SOC ECONO CONOMICS Initial Investment is Low Compared to SoC Performance and Cost Benefits A s systems on chip (SoCs) have interconnect, along with its configuration, verification,

More information

Neural Computer Architectures

Neural Computer Architectures Neural Computer Architectures 5kk73 Embedded Computer Architecture By: Maurice Peemen Date: Convergence of different domains Neurobiology Applications 1 Constraints Machine Learning Technology Innovations

More information

The RM9150 and the Fast Device Bus High Speed Interconnect

The RM9150 and the Fast Device Bus High Speed Interconnect The RM9150 and the Fast Device High Speed Interconnect John R. Kinsel Principal Engineer www.pmc -sierra.com 1 August 2004 Agenda CPU-based SOC Design Challenges Fast Device (FDB) Overview Generic Device

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms

Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Ruizhe Zhao 1, Xinyu Niu 1, Yajie Wu 2, Wayne Luk 1, and Qiang Liu 3 1 Imperial College London {ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk

More information

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems 1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna

More information

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain

More information

A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models

A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating Michael Price*, James Glass, Anantha Chandrakasan MIT, Cambridge, MA * now at Analog Devices, Cambridge,

More information

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs 1/29 FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs Nachiket Kapre + Tushar Krishna nachiket@uwaterloo.ca, tushar@ece.gatech.edu 2/29 Claim FPGA overlay NoCs

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM

ESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM ESE532: System-on-a-Chip Architecture Day 20: April 3, 2017 Pipelining, Frequency, Dataflow Today What drives cycle times Pipelining in Vivado HLS C Avoiding bottlenecks feeding data in Vivado HLS C Penn

More information

Ettus Research Update

Ettus Research Update Ettus Research Update Matt Ettus Ettus Research GRCon13 Outline 1 Introduction 2 Recent New Products 3 Third Generation Introduction Who am I? Core GNU Radio contributor since 2001 Designed

More information