SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems
|
|
- Aubrey Kelley
- 5 years ago
- Views:
Transcription
1 SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems Zeping Xue, David Thomas zeping.xue10@imperial.ac.uk FPL 2015, London 4 Sept
2 Single Processor System Allocator Memory Memory Mapped Bus 2
3 Allocator Size Allocator Address int *ptr; ptr = malloc(size); 3
4 Allocator Size 1.Search for suitable memory block 2. Record the memory allocation Address int *ptr; ptr = malloc(size); 4
5 Multi-Processors (Homogeneous) System Allocator Allocator Allocator Memory Memory Mapped Bus 5
6 What about Heterogeneous Systems? Allocator Memory Memory Mapped Bus FPGA Accelerator 1 FPGA Accelerator 2 6
7 What about Heterogeneous Systems? FPL 2015 Fleming et al. PushPush Allocator Memory Memory Mapped Bus FPGA Accelerator 1 FPGA Accelerator 2 7
8 Proposed Solution: Hardware Allocator Memory Memory Mapped Bus FPGA Accelerator 1 FPGA Accelerator 2 SysAlloc 8
9 Having memory management in heterogenous systems... Improve Memory Efficiency Reduce Design Effort Make HLS More Capable 9
10 Related Work 1. Managing Shared Distributed BRAMs in NoCs [1][2][3] 2. Crossbar connected MPSoC with shared BRAMs [4] 3. HLS DMMs [5][6] [1] Anagnostopoulos et al [2] Göhringer et al [3] Monchiero et al [4] Dessouky et al [5] Winterstein et al [6] Diamantopoulos et al
11 Related Work 1. Managing Shared Distributed BRAMs in NoCs [1][2][3] 2. Crossbar connected MPSoC with shared BRAMs [4] 3. HLS DMMs [5][6] SysAlloc [1] [2] [3] [4] Connection Bus NoC NoC NoC Crossbar Memory On/Off-Chip Shared Distributed On-Chip BRAMs [1] Anagnostopoulos et al Scalability Unbounded Platform-Dependent Limited by [2] Göhringer et al resource [3] Monchiero et al Clock Rate 150MHz Platform-Dependent [4] Dessouky et al MHz [5] Winterstein et al [6] Diamantopoulos et al. 2015
12 Our Proposed Hardware Allocator Scalable Functionality-wise, having arbitrary number of clients Flexible Any range of memory can be managed SysAlloc Efficient Clock Rate Resource Utilisation 12
13 Route Map SysAlloc: System Integration Allocator Algorithm Evaluation Results Conclusion 13
14 SysAlloc System Integration 14
15 SysAlloc System Integration 15
16 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); pointer = AXI_READ(SysA_RESULT); Memory Map 0x x SysA_TOKEN 1 SysA_CMD 0 SysA_STAT 0 SysA_RESULT 0 Storage return pointer; 16
17 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); pointer = AXI_READ(SysA_RESULT); Memory Map 0x x SysA_TOKEN 1 SysA_CMD 0 SysA_STAT 0 SysA_RESULT 0 Storage return pointer; 17
18 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); pointer = AXI_READ(SysA_RESULT); Memory Map 0x x SysA_TOKEN 0 SysA_CMD 0 SysA_STAT 0 SysA_RESULT 0 Storage return pointer; 18
19 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); pointer = AXI_READ(SysA_RESULT); Memory Map 0x x SysA_TOKEN 0 SysA_CMD 0 SysA_STAT 0 SysA_RESULT 0 Storage return pointer; 19
20 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; Memory Map 0x x Storage while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); SysA_TOKEN 0 SysA_CMD size SysA_STAT 0 SysA_RESULT 0 pointer = AXI_READ(SysA_RESULT); return pointer; 20
21 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; Memory Map 0x x Storage while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); SysA_TOKEN 0 SysA_CMD size SysA_STAT 0 SysA_RESULT 0 pointer = AXI_READ(SysA_RESULT); return pointer; 21
22 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; Memory Map 0x x Storage while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); SysA_TOKEN 0 SysA_CMD size SysA_STAT 0 SysA_RESULT 0 pointer = AXI_READ(SysA_RESULT); return pointer; 22
23 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; Memory Map 0x x Storage while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); SysA_TOKEN 0 SysA_CMD size SysA_STAT 0 SysA_RESULT 0 pointer = AXI_READ(SysA_RESULT); return pointer; 23
24 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; Memory Map 0x x Storage while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); SysA_TOKEN 0 SysA_CMD size SysA_STAT 1 SysA_RESULT 0x0000A100 pointer = AXI_READ(SysA_RESULT); return pointer; 24
25 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; Memory Map 0x x Storage while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); SysA_TOKEN 0 SysA_CMD size SysA_STAT 1 SysA_RESULT 0x0000A100 pointer = AXI_READ(SysA_RESULT); return pointer; 25
26 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; Memory Map 0x x Storage while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); SysA_TOKEN 0 SysA_CMD size SysA_STAT 1 SysA_RESULT 0x0000A100 pointer = AXI_READ(SysA_RESULT); return pointer; 26
27 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; Memory Map 0x x Storage while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); SysA_TOKEN 0 SysA_CMD size SysA_STAT 1 SysA_RESULT 0x0000A100 pointer = AXI_READ(SysA_RESULT); return pointer; 27
28 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; Memory Map 0x x Storage while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); SysA_TOKEN 0 SysA_CMD size SysA_STAT 1 SysA_RESULT 0x0000A100 pointer = AXI_READ(SysA_RESULT); return pointer; 28
29 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; Memory Map 0x x Storage while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); SysA_TOKEN 1 SysA_CMD size SysA_STAT 0 SysA_RESULT 0x0000A100 pointer = AXI_READ(SysA_RESULT); return pointer; 29
30 Allocator Access Examples Software (ARM core) Access int* SysMalloc(int size){ int token, status, *pointer; token = 0; status = 0; while(token == 0){ token = AXI_READ(SysA_TOKEN); AXI_WRITE(SysA_CMD, size); while(status == 0){ status = AXI_READ(SysA_STAT); pointer = AXI_READ(SysA_RESULT); HLS Access (Vivado HLS) int SysMalloc(int size,int *PortM){ int token, status, result; token = 0; status = 0; while(token == 0){ token = PortM[SysA_TOKEN]; PortM[SysA_CMD] = size; while(status == 0){ status = PortM[SysA_STAT]; result = PortM[SysA_RESULT]; return pointer; return result + PortM; 30
31 SysAlloc Allocator Algorithm 31
32 Background of SysAlloc Allocator Algorithm [1] 32
33 Background of SysAlloc Allocator Algorithm Allocated Memory Bit-Map [1] 33
34 Background of SysAlloc Allocator Algorithm [1] [1] 34
35 Background of SysAlloc Allocator Algorithm [1] [1] 35
36 Limitation! For large bit-map, high resource utilisation and low clock rate due to long sequence of logic gates. Bit-Map 36
37 SysAlloc Allocator Algorithm Our Proposed Solution 37
38 Allocation Tree Allocation Search Binary Tree Node Value Node Status 00 Empty 10 Partially Full 11 Full Bit-Map 38
39 Allocation Tree Storage Allocation Search Binary Tree Stored in memory as groups of nodes Bit-Map 39
40 Allocation Tree Storage Bit-Map
41 Allocation Tree Storage Node Value Representation Node Status 00 Empty 10 Partially Full 11 Full Bit-Map
42 32 blocks Allocation Tree Storage 16 blocks Request size = 3 blocks 8 blocks 4 blocks 2 blocks Bit-Map
43 32 blocks Allocation Tree Storage 16 blocks Request size = 3 blocks 8 blocks 4 blocks 2 blocks Bit-Map
44 32 blocks Allocation Tree Storage 16 blocks Request size = 3 blocks 8 blocks 4 blocks 2 blocks Bit-Map
45 32 blocks Allocation Tree Storage 16 blocks Request size = 3 blocks 8 blocks 4 blocks 2 blocks Bit-Map
46 32 blocks Allocation Tree Storage 16 blocks Request size = 3 blocks 8 blocks 4 blocks 2 blocks Bit-Map
47 32 blocks Allocation Tree Storage 16 blocks Request size = 3 blocks 8 blocks 4 blocks 2 blocks Bit-Map
48 32 blocks Allocation Tree Storage 16 blocks Request size = 3 blocks 8 blocks 4 blocks 2 blocks Bit-Map
49 32 blocks Allocation Tree Storage 16 blocks Request size = 3 blocks 8 blocks 4 blocks 2 blocks Bit-Map
50 32 blocks Allocation Tree Storage 16 blocks Request size = 3 blocks 8 blocks 4 blocks 2 blocks Bit-Map
51 32 blocks Allocation Tree Storage 16 blocks Request size = 3 blocks 8 blocks 4 blocks 2 blocks Bit-Map
52 Benefits Constant FPGA Logic Utilisation 52
53 Benefits For large Bit-Maps, Max Tree Depth Log2(Bit-Map Size) << Bit-Map Size Using the search tree is more efficient than directly searching in bit-map. 53
54 Evaluation Results 54
55 Evaluation Set-up Environment SysAlloc Configuration Synthetic Clients Zynq-7000 SoC XC7Z MHz Manage 128MB DDR Minimum Block Size = 16B Request rate and size are configurable 55
56 Scalability Pathological Case 56
57 Scalability Size & Wait Period with Exponential Growth 1.36 million malloc/s 57
58 Configuration Cases Case A Case B Case C Managing 512 MB DDR with minimum block in size of 16B. A linked-list using 256KB on-chip memory with minimum block in size of 8B. Signal processing applications need 256 1MB buffers to be managed. 58
59 Configuration Cases Bit-Map Size 59
60 Configuration Cases FPGA LEs Utilisation 60
61 Configuration Cases Memory Bits Required 61
62 Configuration Cases Latency 62
63 Conclusion 63
64 Conclusion Scalable Efficient SysAlloc Flexible Peak system allocation rate = 1.36 million malloc/s, when managing 128MB DDR for 4 Clients 64
65 Future Work Faster Allocator 65
66 Future Work More Efficient Communication Protocol 66
67 Future Work Garbage Collection 67
68 Future Work HLS Access 68
69 Thank You & Questions? Source code can be found at 69
FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture
More informationA Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on
A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on on-chip Donghyun Kim, Kangmin Lee, Se-joong Lee and Hoi-Jun Yoo Semiconductor System Laboratory, Dept. of EECS, Korea Advanced
More informationExploring OpenCL Memory Throughput on the Zynq
Exploring OpenCL Memory Throughput on the Zynq Technical Report no. 2016:04, ISSN 1652-926X Chalmers University of Technology Bo Joel Svensson bo.joel.svensson@gmail.com Abstract The Zynq platform combines
More informationUltra-Fast NoC Emulation on a Single FPGA
The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo
More informationFPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS
FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS Mike Ashworth, Graham Riley, Andrew Attwood and John Mawer Advanced Processor Technologies Group School
More informationFPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS
FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS Mike Ashworth, Graham Riley, Andrew Attwood and John Mawer Advanced Processor Technologies Group School
More informationMidterm Exam. Solutions
Midterm Exam Solutions Problem 1 List at least 3 advantages of implementing selected portions of a complex design in software Software vs. Hardware Trade-offs Improve Performance Improve Energy Efficiency
More informationScalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA
Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089
More informationSDSoC: Session 1
SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the
More informationThe Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006
The Use Of Virtual Platforms In MP-SoC Design Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006 1 MPSoC Is MP SoC design happening? Why? Consumer Electronics Complexity Cost of ASIC Increased SW Content
More informationComputing to the Energy and Performance Limits with Heterogeneous CPU-FPGA Devices. Dr Jose Luis Nunez-Yanez University of Bristol
Computing to the Energy and Performance Limits with Heterogeneous CPU-FPGA Devices Dr Jose Luis Nunez-Yanez University of Bristol Power and energy savings at run-time Power = α.c.v 2.f+g1.V 3 Energy =
More informationHEAD HardwarE Accelerated Deduplication
HEAD HardwarE Accelerated Deduplication Final Report CS710 Computing Acceleration with FPGA December 9, 2016 Insu Jang Seikwon Kim Seonyoung Lee Executive Summary A-Z development of deduplication SW version
More informationSimplifying FPGA Design for SDR with a Network on Chip Architecture
Simplifying FPGA Design for SDR with a Network on Chip Architecture Matt Ettus Ettus Research GRCon13 Outline 1 Introduction 2 RF NoC 3 Status and Conclusions USRP FPGA Capability Gen
More informationPartitioning of computationally intensive tasks between FPGA and CPUs
Partitioning of computationally intensive tasks between FPGA and CPUs Tobias Welti, MSc (Author) Institute of Embedded Systems Zurich University of Applied Sciences Winterthur, Switzerland tobias.welti@zhaw.ch
More informationA Unified OpenCL-flavor Programming Model with Scalable Hybrid Hardware Platform on FPGAs
A Unified OpenCL-flavor Programming Model with Scalable Hybrid Hardware Platform on FPGAs Hongyuan Ding and Miaoqing Huang Department of Computer Science and Computer Engineering University of Arkansas
More informationExploration of Cache Coherent CPU- FPGA Heterogeneous System
Exploration of Cache Coherent CPU- FPGA Heterogeneous System Wei Zhang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology 1 Outline ointroduction to FPGA-based
More informationIntegrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim
Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim Farzad Farshchi, Qijing Huang, Heechul Yun University of Kansas, University of California, Berkeley SiFive Internship Rocket
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More information(FPL 2015) Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies
(FPL 2015) Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies HSIN-JUNG YANG, CSAIL, Massachusetts Institute of Technology KERMIN FLEMING, SSG, Intel Corporation FELIX WINTERSTEIN,
More informationPushPush: Seamless Integration of Hardware and Software Objects Via Function Calls over AXI
1 PushPush: Seamless Integration of Hardware and Software Objects Via Function Calls over AXI Shane T. Fleming, Ivan Beretta, David B. Thomas, George A. Constantinides, and Dan R. Ghica Dept. of Electrical
More informationDNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei
More informationOptimizing Performance and Scalability on Hybrid MPSoCs
University of Arkansas, Fayetteville ScholarWorks@UARK Theses and Dissertations 12-2014 Optimizing Performance and Scalability on Hybrid MPSoCs Hongyuan Ding University of Arkansas, Fayetteville Follow
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationGRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens. Jan Gray
If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens? Seymour Cray GRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens Jan Gray jan@fpga.org http://fpga.org
More informationAn Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware
An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware Tao Chen, Shreesha Srinath Christopher Batten, G. Edward Suh Computer Systems Laboratory School of Electrical
More informationZynq-7000 All Programmable SoC Product Overview
Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform
More informationSupport Triangle rendering with texturing: used for bitmap rotation, transformation or scaling
logibmp Bitmap 2.5D Graphics Accelerator March 12 th, 2015 Data Sheet Version: v2.2 Xylon d.o.o. Fallerovo setaliste 22 10000 Zagreb, Croatia Phone: +385 1 368 00 26 Fax: +385 1 365 51 67 E-mail: support@logicbricks.com
More informationMotor Control: Model-Based Design from Concept to Implementation on heterogeneous SoC FPGAs Alexander Schreiber, MathWorks
Motor Control: Model-Based Design from Concept to Implementation on heterogeneous SoC FPGAs Alexander Schreiber, MathWorks 2014 The MathWorks, Inc. 1 Some components of a production application Production
More informationA Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013
A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company
More informationEnergy Optimization of FPGA-Based Stream- Oriented Computing with Power Gating
Energy Optimization of FPGA-Based Stream- Oriented Computing with Power Gating Mohammad Hosseinabady and Jose Luis Nunez-Yanez Department of Electrical and Electronic Engineering University of Bristol,
More informationTime-Shared Execution of Realtime Computer Vision Pipelines by Dynamic Partial Reconfiguration
Time-Shared Execution of Realtime Computer Vision Pipelines by Dynamic Partial Reconfiguration Marie Nguyen Carnegie Mellon University Pittsburgh, Pennsylvania James C. Hoe Carnegie Mellon University Pittsburgh,
More informationA flexible memory shuffling unit for image processing accelerators
Eindhoven University of Technology MASTER A flexible memory shuffling unit for image processing accelerators Xie, R.Z. Award date: 2013 Disclaimer This document contains a student thesis (bachelor's or
More informationFPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)
FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),
More information借助 SDSoC 快速開發複雜的嵌入式應用
借助 SDSoC 快速開發複雜的嵌入式應用 May 2017 What Is C/C++ Development System-level Profiling SoC application-like programming Tools and IP for system-level profiling Specify C/C++ Functions for Acceleration Full System
More informationZynq Ultrascale+ Architecture
Zynq Ultrascale+ Architecture Stephanie Soldavini and Andrew Ramsey CMPE-550 Dec 2017 Soldavini, Ramsey (CMPE-550) Zynq Ultrascale+ Architecture Dec 2017 1 / 17 Agenda Heterogeneous Computing Zynq Ultrascale+
More informationGRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens. Jan Gray
If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens? Seymour Cray GRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens Jan Gray jan@fpga.org http://fpga.org
More informationECEN 449: Microprocessor System Design Department of Electrical and Computer Engineering Texas A&M University. Laboratory Exercise #1 Using the Vivado
ECEN 449: Microprocessor System Design Department of Electrical and Computer Engineering Texas A&M University Prof. Sunil P Khatri (Lab exercise created and tested by Ramu Endluri, He Zhou, Andrew Douglass
More informationMidterm Exam. Solutions
Midterm Exam Solutions Problem 1 List at least 3 advantages of implementing selected portions of a design in hardware, and at least 3 advantages of implementing the remaining portions of the design in
More informationCHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP
133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located
More informationVersal: AI Engine & Programming Environment
Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY
More informationFCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA
1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,
More informationAn Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection
An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,
More informationField Programmable Gate Array (FPGA) Devices
Field Programmable Gate Array (FPGA) Devices 1 Contents Altera FPGAs and CPLDs CPLDs FPGAs with embedded processors ACEX FPGAs Cyclone I,II FPGAs APEX FPGAs Stratix FPGAs Stratix II,III FPGAs Xilinx FPGAs
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationSri Vidya College of Engineering and Technology. EC6703 Embedded and Real Time Systems Unit IV Page 1.
Sri Vidya College of Engineering and Technology ERTS Course Material EC6703 Embedded and Real Time Systems Page 1 Sri Vidya College of Engineering and Technology ERTS Course Material EC6703 Embedded and
More informationHigh Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx
High Capacity and High Performance 20nm FPGAs Steve Young, Dinesh Gaitonde August 2014 Not a Complete Product Overview Page 2 Outline Page 3 Petabytes per month Increasing Bandwidth Global IP Traffic Growth
More informationCopyright 2017 Xilinx.
All Programmable Automotive SoC Comparison XA Zynq UltraScale+ MPSoC ZU2/3EG, ZU4/5EV Devices XA Zynq -7000 SoC Z-7010/7020/7030 Devices Application Processor Real-Time Processor Quad-core ARM Cortex -A53
More informationECE 5775 Student-Led Discussions (10/16)
ECE 5775 Student-Led Discussions (10/16) Talks: 18-min talk + 2-min Q&A Adam Macioszek, Julia Currie, Nick Sarkis Sparse Matrix Vector Multiplication Nick Comly, Felipe Fortuna, Mark Li, Serena Krech Matrix
More informationNetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013
NetSpeed ORION: A New Approach to Design On-chip Interconnects August 26 th, 2013 INTERCONNECTS BECOMING INCREASINGLY IMPORTANT Growing number of IP cores Average SoCs today have 100+ IPs Mixing and matching
More informationCOMMUNICATION AND I/O ARCHITECTURES FOR HIGHLY INTEGRATED MPSoC PLATFORMS OUTLINE
COMMUNICATION AND I/O ARCHITECTURES FOR HIGHLY INTEGRATED MPSoC PLATFORMS Martino Ruggiero Luca Benini University of Bologna Simone Medardoni Davide Bertozzi University of Ferrara In cooperation with STMicroelectronics
More informationTutorial on Software-Hardware Codesign with CORDIC
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Tutorial on Software-Hardware Codesign with CORDIC 1 Introduction So far in ECE5775
More informationDeveloping a Prototyping Board for Emerging Memory
Developing a Prototyping Board for Emerging Memory 2013. 10. 25 Sungjoo Yoo Embedded System Architecture Lab. POSTECH Introduction scaling problem [ITRS, 2012] Year 2012 2013 2014 2015 2016 2017 2018 2019
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationGeneration of Multigrid-based Numerical Solvers for FPGA Accelerators
Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler Hardware/Software Co-Design, System
More informationMYC-C7Z010/20 CPU Module
MYC-C7Z010/20 CPU Module - 667MHz Xilinx XC7Z010/20 Dual-core ARM Cortex-A9 Processor with Xilinx 7-series FPGA logic - 1GB DDR3 SDRAM (2 x 512MB, 32-bit), 4GB emmc, 32MB QSPI Flash - On-board Gigabit
More informationMYD-C7Z010/20 Development Board
MYD-C7Z010/20 Development Board MYC-C7Z010/20 CPU Module as Controller Board Two 0.8mm pitch 140-pin Connectors for Board-to-Board Connections 667MHz Xilinx XC7Z010/20 Dual-core ARM Cortex-A9 Processor
More informationXilinx ML Suite Overview
Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame
More informationGigaX API for Zynq SoC
BUM002 v1.0 USER MANUAL A software API for Zynq PS that Enables High-speed GigaE-PL Data Transfer & Frames Management BERTEN DSP S.L. www.bertendsp.com gigax@bertendsp.com +34 942 18 10 11 Table of Contents
More informationINT-1010 TCP Offload Engine
INT-1010 TCP Offload Engine Product brief, features and benefits summary Highly customizable hardware IP block. Easily portable to ASIC flow, Xilinx or Altera FPGAs INT-1010 is highly flexible that is
More informationYet Another Implementation of CoRAM Memory
Dec 7, 2013 CARL2013@Davis, CA Py Yet Another Implementation of Memory Architecture for Modern FPGA-based Computing Shinya Takamaeda-Yamazaki, Kenji Kise, James C. Hoe * Tokyo Institute of Technology JSPS
More informationFPGA Implementation of a Memory-Efficient Hough Parameter Space for the Detection of Lines
FPGA Implementation of a Memory-Efficient Hough Parameter Space for the Detection of Lines David Northcote*, Louise H. Crockett, Paul Murray Department of Electronic and Electrical Engineering, University
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and
More informationEFFICIENT AUTOMATED SYNTHESIS, PROGRAMING, AND IMPLEMENTATION OF MULTI-PROCESSOR PLATFORMS ON FPGA CHIPS. Hristo Nikolov Todor Stefanov Ed Deprettere
EFFICIENT AUTOMATED SYNTHESIS, PROGRAMING, AND IMPLEMENTATION OF MULTI-PROCESSOR PLATFORMS ON FPGA CHIPS Hristo Nikolov Todor Stefanov Ed Deprettere Leiden Embedded Research Center Leiden Institute of
More informationProtecting Embedded Systems from Zero-Day Attacks
Protecting Embedded Systems from Zero-Day Attacks Professor Stephen Taylor Thayer School of Engineering at Dartmouth stnh.email@icloud.com (603) 727-8945 MicroArx.com Apiotics.com 1 Research Support Current
More informationThird Genera+on USRP Devices and the RF Network- On- Chip. Leif Johansson Market Development RF, Comm and SDR
Third Genera+on USRP Devices and the RF Network- On- Chip Leif Johansson Market Development RF, Comm and SDR About Ettus Research Leader in soeware defined radio and signals intelligence Maker of USRP
More informationHigh Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS
High Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS Yury Rumyantsev rumyantsev@rosta.ru 1 Agenda 1. Intoducing Rosta and Hardware overview 2. Microtubule Modeling Problem 3.
More informationRequirement ZYNQ SOC Development Board: Z-Turn by MYiR ZYNQ-7020 (XC7Z020-1CLG400C) Vivado and Xilinx SDK TF Card Reader (Micro SD) Windows 7
Project Description The ARM CPU is configured to perform read and write operations on the Block Memory. The Block Memory is created in the PL side of the ZYNQ device. The ARM CPU is configured as Master
More informationIntegrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali
Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers
More informationS2C K7 Prodigy Logic Module Series
S2C K7 Prodigy Logic Module Series Low-Cost Fifth Generation Rapid FPGA-based Prototyping Hardware The S2C K7 Prodigy Logic Module is equipped with one Xilinx Kintex-7 XC7K410T or XC7K325T FPGA device
More informationDesign of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques
Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques Nandini Sultanpure M.Tech (VLSI Design and Embedded System), Dept of Electronics and Communication Engineering, Lingaraj
More informationHigh-Speed NAND Flash
High-Speed NAND Flash Design Considerations to Maximize Performance Presented by: Robert Pierce Sr. Director, NAND Flash Denali Software, Inc. History of NAND Bandwidth Trend MB/s 20 60 80 100 200 The
More information3D WiNoC Architectures
Interconnect Enhances Architecture: Evolution of Wireless NoC from Planar to 3D 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan Sep 18th, 2014 Hiroki Matsutani, "3D WiNoC Architectures",
More informationComputer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email
More informationImproving Energy Efficiency with Special-Purpose Accelerators
Improving Energy Efficiency with Special-Purpose Accelerators Alexandru Fiodorov Embedded Computing Systems Submission date: June 2013 Supervisor: Magnus Jahre, IDI Norwegian University of Science and
More informationImplementation of Hardware Accelerators on Zynq
Downloaded from orbit.dtu.dk on: Dec 29, 2018 Implementation of Hardware Accelerators on Zynq Toft, Jakob Kenn; Nannarelli, Alberto Publication date: 2016 Document Version Publisher's PDF, also known as
More informationESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder
ESE532: System-on-a-Chip Architecture Day 8: September 26, 2018 Spatial Computations Today Graph Cycles (from Day 7) Accelerator Pipelines FPGAs Zynq Computational Capacity 1 2 Message Custom accelerators
More informationAre Your Passwords Safe: Energy-Efficient Bcrypt Cracking with Low-Cost Parallel Hardware
Are Your Passwords Safe: Energy-Efficient Bcrypt Cracking with Low-Cost Parallel Hardware Katja Malvoni (kmalvoni at openwall.com) Solar Designer (solar at openwall.com) Josip Knezovic (josip.knezovic
More informationSERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS
SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS 1 SARAVANAN.K, 2 R.M.SURESH 1 Asst.Professor,Department of Information Technology, Velammal Engineering College, Chennai, Tamilnadu,
More informationNEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES
NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES Design: Part 1 High Level Synthesis (Xilinx Vivado HLS) Part 2 SDSoC (Xilinx, HLS + ARM) Part 3 OpenCL (Altera OpenCL SDK) Verification:
More informationClock Speed Optimization of Runtime Reconfigurable Systems by Signal Latency Measurement
Department of Electrical Engineering Computer Engineering Helmut Schmidt University, Hamburg University of the Federal Armed Forces of Germany Meyer, Haase, Eckert, Klauer Clock Speed Optimization of Runtime
More informationExploring Automatically Generated Platforms in High Performance FPGAs
Exploring Automatically Generated Platforms in High Performance FPGAs Panagiotis Skrimponis b, Georgios Zindros a, Ioannis Parnassos a, Muhsen Owaida b, Nikolaos Bellas a, and Paolo Ienne b a Electrical
More informationPorting VME-Based Optical-Link Remote I/O Module to a PLC Platform - an Approach to Maximize Cross-Platform Portability Using SoC
Porting VME-Based Optical-Link Remote I/O Module to a PLC Platform - an Approach to Maximize Cross-Platform Portability Using SoC T. Masuda, A. Kiyomichi Japan Synchrotron Radiation Research Institute
More informationHigh Level Synthesis with a Dataflow Architectural Template
High Level Synthesis with a Dataflow Architectural Template Shaoyi Cheng and John Wawrzynek Department of EECS, UC Berkeley, California, USA 94720 Email: sh cheng@berkeley.edu, johnw@eecs.berkeley.edu
More informationChapter 2 Designing Crossbar Based Systems
Chapter 2 Designing Crossbar Based Systems Over the last decade, the communication architecture of SoCs has evolved from single shared bus systems to multi-bus systems. Today, state-of-the-art bus based
More informationDynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers
Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers Young Hoon Kang, Taek-Jun Kwon, and Jeff Draper {youngkan, tjkwon, draper}@isi.edu University of Southern California
More informationData Storage and Query Answering. Data Storage and Disk Structure (2)
Data Storage and Query Answering Data Storage and Disk Structure (2) Review: The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM @200MHz) 6,400
More informationA So%ware Developer's Journey into a Deeply Heterogeneous World. Tomas Evensen, CTO Embedded So%ware, Xilinx
A So%ware Developer's Journey into a Deeply Heterogeneous World Tomas Evensen, CTO Embedded So%ware, Xilinx Embedded Development: Then Simple single CPU Most code developed internally 10 s of thousands
More informationIMPROVES. Initial Investment is Low Compared to SoC Performance and Cost Benefits
NOC INTERCONNECT IMPROVES SOC ECONO CONOMICS Initial Investment is Low Compared to SoC Performance and Cost Benefits A s systems on chip (SoCs) have interconnect, along with its configuration, verification,
More informationNeural Computer Architectures
Neural Computer Architectures 5kk73 Embedded Computer Architecture By: Maurice Peemen Date: Convergence of different domains Neurobiology Applications 1 Constraints Machine Learning Technology Innovations
More informationThe RM9150 and the Fast Device Bus High Speed Interconnect
The RM9150 and the Fast Device High Speed Interconnect John R. Kinsel Principal Engineer www.pmc -sierra.com 1 August 2004 Agenda CPU-based SOC Design Challenges Fast Device (FDB) Overview Generic Device
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationOptimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms
Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Ruizhe Zhao 1, Xinyu Niu 1, Yajie Wu 2, Wayne Luk 1, and Qiang Liu 3 1 Imperial College London {ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk
More informationSwizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems
1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationA Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models
A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating Michael Price*, James Glass, Anantha Chandrakasan MIT, Cambridge, MA * now at Analog Devices, Cambridge,
More informationFastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs
1/29 FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs Nachiket Kapre + Tushar Krishna nachiket@uwaterloo.ca, tushar@ece.gatech.edu 2/29 Claim FPGA overlay NoCs
More informationESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM
ESE532: System-on-a-Chip Architecture Day 20: April 3, 2017 Pipelining, Frequency, Dataflow Today What drives cycle times Pipelining in Vivado HLS C Avoiding bottlenecks feeding data in Vivado HLS C Penn
More informationEttus Research Update
Ettus Research Update Matt Ettus Ettus Research GRCon13 Outline 1 Introduction 2 Recent New Products 3 Third Generation Introduction Who am I? Core GNU Radio contributor since 2001 Designed
More information