Broadening the Exploration of the Accelerator Design Space in Embedded Scalable Platforms
|
|
- Eileen Nelson
- 5 years ago
- Views:
Transcription
1 IEEE High Performance Extreme Computing Conference (HPEC), 2017 Broadening the Exploration of the Design Space in Embedded Scalable Platforms Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni Columbia University, New York, NY, USA
2 Why s? High-performance embedded systems are heterogeneous: they include multiple general-purpose processor cores they include special-function hardware accelerators Generality Processor Cores s Efficiency Processor Core #1 Processor Core #3 Processor Core #2 Processor Core #4 IEEE High Performance Extreme Computing Conference (HPEC), / 15
3 Embedded Scalable Platforms (ESP) To balance the demand for hardware specialization with the need of maintaining helpful degrees of regularity and modularity we proposed: Embedded Scalable Platforms [L. Carloni, The Case for Embedded Scalable Platforms, DAC 2016] Memory Controller Processor Core I/O Misc. Channels, etc. Memory Controller IEEE High Performance Extreme Computing Conference (HPEC), / 15
4 Embedded Scalable Platforms (ESP) To balance the demand for hardware specialization with the need of maintaining helpful degrees of regularity and modularity we proposed: Embedded Scalable Platforms ESP instance for WAMI (Wide-Area Motion Imagery) Memory Controller Proc. core LEON3 CPU I/O Misc. Channels, etc. WARP GRAYSCALE MATRIX-SUB MATRIX-RES SD-UPDATE GRADIENT MATRIX-ADD CHANGE-DET HESSIAN DEBAYER MATRIX-MUL STEEP-DESC. Memory Controller IEEE High Performance Extreme Computing Conference (HPEC), / 15
5 Embedded Scalable Platforms (ESP) To balance the demand for hardware specialization with the need of maintaining helpful degrees of regularity and modularity we proposed: Embedded Scalable Platforms System-Level Design with High-Level Synthesis (HLS) Memory Controller Proc. core LEON3 CPU I/O Misc. Channels, etc. HLS WARP HLS HLS HLS GRAYSCALE MATRIX-SUB MATRIX-RES HLS HLS HLS HLS GRADIENT MATRIX-ADD CHANGE-DET HLS SD-UPDATE HESSIAN HLS HLS HLS DEBAYER MATRIX-MUL STEEP-DESC. Memory Controller IEEE High Performance Extreme Computing Conference (HPEC), / 15
6 Embedded Scalable Platforms (ESP) To balance the demand for hardware specialization with the need of maintaining helpful degrees of regularity and modularity we proposed: Embedded Scalable Platforms System-Level Design with High-Level Synthesis (HLS) Memory Controller GRAYSCALE Proc. core LEON3 CPU MATRIX-SUB I/O Misc. Channels, etc. MATRIX-RES WARP SD-UPDATE rapid integration and prototyping GRADIENT MATRIX-ADD CHANGE-DET HESSIAN DEBAYER MATRIX-MUL STEEP-DESC. Memory Controller ESP instance for WAMI (Wide-Area Motion Imagery) IEEE High Performance Extreme Computing Conference (HPEC), / 15
7 s with HLS SystemC Specification GRAYSCALE Interface Memory Controller GRAYSCALE Proc. core LEON3 CPU MATRIX-SUB I/O Misc. Channels, etc. MATRIX-RES WARP SD-UPDATE GRAYSCALE Logic GRADIENT MATRIX-ADD CHANGE-DET HESSIAN DEBAYER MATRIX-MUL STEEP-DESC. Memory Controller load compute store bank bank bank bank bank bank Input PLM bank bank Output PLM Private Local Memories (PLMs) IEEE High Performance Extreme Computing Conference (HPEC), / 15
8 s with HLS SystemC Specification GRAYSCALE Interface GRAYSCALE Logic High-Level Synthesis (HLS) load compute store bank bank bank bank bank bank bank bank Input PLM Output PLM Private Local Memories (PLMs) Cost (Area) RTL knob conf. #1 Performance (Latency) IEEE High Performance Extreme Computing Conference (HPEC), / 15
9 s with HLS SystemC Specification GRAYSCALE Interface GRAYSCALE Logic High-Level Synthesis (HLS) load compute store bank bank bank bank bank bank bank bank Input PLM Output PLM Private Local Memories (PLMs) Cost (Area) RTL knob conf. #2 Performance (Latency) IEEE High Performance Extreme Computing Conference (HPEC), / 15
10 s with HLS SystemC Specification GRAYSCALE Interface GRAYSCALE Logic High-Level Synthesis (HLS) load compute store bank bank bank bank bank bank bank bank Input PLM Output PLM Private Local Memories (PLMs) Cost (Area) RTL knob conf. #3 Performance (Latency) IEEE High Performance Extreme Computing Conference (HPEC), / 15
11 s with HLS SystemC Specification GRAYSCALE Interface GRAYSCALE Logic High-Level Synthesis (HLS) load compute store bank bank bank bank bank bank bank bank Input PLM Output PLM Private Local Memories (PLMs) Cost (Area) RTL knob conf. #4 Performance (Latency) IEEE High Performance Extreme Computing Conference (HPEC), / 15
12 s with HLS SystemC Specification GRAYSCALE Interface GRAYSCALE Logic load compute store bank bank bank bank bank bank bank bank Input PLM Output PLM Private Local Memories (PLMs) Cost (Area) High-Level Synthesis (HLS) Pareto Optimal Pareto Dominated RTL Performance (Latency) IEEE High Performance Extreme Computing Conference (HPEC), / 15
13 Standard HLS Knobs Standard knobs provided by the current HLS tools Knob Loop manipulations Array mappings Clock period Settings and Effects Unrolls, pipelines or breaks the body of loops Maps arrays to registers or on-chip memories Sets the target clock period for synthesis These knobs enable already a rich design-space exploration However, they are not sufficient for exploring accelerators We need other knobs to broaden the exploration IEEE High Performance Extreme Computing Conference (HPEC), / 15
14 Motivational Example #1 synthesized with the standard knobs synthesized with the proposed knobs Normalized Area DEBAYER Bounded by on-chip memory bandwidth Normalized Effective Latency Limiting factor: limited bandwidth to the on-chip memory We need knobs to tailor the PLM to the accelerator needs IEEE High Performance Extreme Computing Conference (HPEC), / 15
15 Motivational Example #2 synthesized with the standard knobs synthesized with the proposed knobs Normalized Area GRAYSCALE Bounded by off-chip memory bandwidth Normalized Effective Latency Limiting factor: limited bandwidth to the off-chip memory We need knobs to operate on the communication interfaces IEEE High Performance Extreme Computing Conference (HPEC), / 15
16 Contributions: Xknobs extended Knobs for High-Level Synthesis XKnob PLM PORTS DMA WIDTH DMA CHUNK Settings and Effects Sets the on-chip memory bandwidth Sets the off-chip memory bandwidth Sets the size of the input and output PLM IEEE High Performance Extreme Computing Conference (HPEC), / 15
17 Xknob #1: PLM PORTS Sets the number of read/write ports of input/output PLMs Higher values of PLM PORTS more read/write accesses Higher values of PLM PORTS higher area (more banks) Normalized Area PLM PORTS = 1 PLM PORTS = 2 PLM PORTS = DEBAYER Normalized Effective Latency IEEE High Performance Extreme Computing Conference (HPEC), / 15
18 Xknob #2: DMA WIDTH Set the size in bits of the DMA communication channels Higher values of DMA WIDTH higher mem. throughput Higher values of DMA WIDTH higher area (more banks) (higher number of write/read ports of input/output PLMs) DMA WIDTH = 64 DMA WIDTH = 128 DMA WIDTH = 256 DMA WIDTH = 512 Normalized Area GRAYSCALE Normalized Effective Latency IEEE High Performance Extreme Computing Conference (HPEC), / 15
19 Xknob #3: DMA CHUNK Set the size of the PLM in multiple of the stored data type Higher values of DMA CHUNK optimized communication Higher values of DMA CHUNK higher area (for the PLM) DMA CHUNK = 256 DMA CHUNK = 512 DMA CHUNK = 1024 DMA CHUNK = 2048 without contention with contention Normalized Area GRAYSCALE DMA WIDTH = 256 PLM PORTS = 4/8 Normalized Area GRAYSCALE DMA WIDTH = 256 PLM PORTS = 4/ Normalized Effective Latency Normalized Effective Latency IEEE High Performance Extreme Computing Conference (HPEC), / 15
20 Experimental Results We evaluate the combined effects of the XKnobs by using: GRAYSCALE accelerator limited by communication DEBAYER accelerator limited by computation The other WAMI accelerators behave similarly to either the GRAYSCALE accelerator or the DEBAYER accelerator IEEE High Performance Extreme Computing Conference (HPEC), / 15
21 Experiment #1 We consider two XKnobs: PLM PORTS and DMA WIDTH DMA WIDTH = 32 DMA WIDTH = 64 DMA WIDTH = 128 DMA WIDTH = GRAYSCALE Normalized Area PLM PORTS = 8 PLM PORTS = 4 PLM PORTS = 2 PLM PORTS = Normalized Effective Latency GRAYSCALE accelerator limited by communication IEEE High Performance Extreme Computing Conference (HPEC), / 15
22 Experiment #1 We consider two XKnobs: PLM PORTS and DMA WIDTH DMA WIDTH = 32 DMA WIDTH = 64 DMA WIDTH = 128 DMA WIDTH = PLM PORTS = 4 DEBAYER Normalized Area PLM PORTS = 2 PLM PORTS = Normalized Effective Latency DEBAYER accelerator limited by computation IEEE High Performance Extreme Computing Conference (HPEC), / 15
23 Experiment #2 We consider two XKnobs: PLM PORTS and DMA CHUNK DMA CHUNK = 256 DMA CHUNK = 512 DMA CHUNK = 1024 DMA CHUNK = 2048 Normalized Area PLM PORTS = 8 PLM PORTS = 4 PLM PORTS = 2 DMA WIDTH = 256 CONTENTION GRAYSCALE PLM PORTS = Normalized Effective Latency GRAYSCALE accelerator limited by communication IEEE High Performance Extreme Computing Conference (HPEC), / 15
24 Experiment #2 We consider two XKnobs: PLM PORTS and DMA CHUNK DMA CHUNK = 256 DMA CHUNK = 512 DMA CHUNK = 1024 DMA CHUNK = 2048 Normalized Area PLM PORTS = 4 DMA WIDTH = 256 CONTENTION DEBAYER PLM PORTS = 2 PLM PORTS = Normalized Effective Latency DEBAYER accelerator limited by computation IEEE High Performance Extreme Computing Conference (HPEC), / 15
25 Concluding Remarks We presented the XKnobs a set of knobs that aims at extending the standard knobs used in current HLS tools For WAMI, the Xknobs broaden the design space by up to 8.5x for performance and 3.5x for cost The XKnobs can be integrated in any HLS tools and design-space exploration methodologies to enrich the set of Pareto-optimal implementations of hardware accelerators IEEE High Performance Extreme Computing Conference (HPEC), / 15
26 IEEE High Performance Extreme Computing Conference (HPEC), 2017 Broadening the Exploration of the Design Space in Embedded Scalable Platforms Thank you for the attention! Speaker: Luca Piccolboni Columbia University, NY
Optimization of Behavioral IPs in Multi-Processor System-on- Chips
Optimization of Behavioral IPs in Multi-Processor System-on- Chips Yidi Liu and Benjamin Carrion Schafer # Department of Electronic and Information Engineering b.carrionschafer@polyu.edu.hk # Outline High-Level
More informationDarkMem: Fine-Grained Power Management of Local Memories for Accelerators in Embedded Systems
DarkMem: Fine-Grained Power Management of Local Memories for Accelerators in Embedded Systems Christian Pilato Università della Svizzera italiana Lugano, Switzerland e-mail: christian.pilato@usi.ch Luca
More informationRuntime Reconfigurable Memory Hierarchy in Embedded Scalable Platforms
Runtime Reconfigurable Memory Hierarchy in Embedded Scalable Platforms Davide Giri Columbia University New York, USA davide_giri@cs.columbia.edu ABSTRACT In heterogeneous systems-on-chip, the optimal choice
More informationEfficient Control-Flow Subgraph Matching for Detecting Hardware Trojans in RTL Models
ACM/IEEE CODES + ISSS 2017, Seoul, South Korea Efficient Control-Flow Subgraph Matching for Detecting Hardware Trojans in RTL Models L. Piccolboni 1,2, A. Menon 2, and G. Pravadelli 2 1 Columbia University,
More informationSystem-Level Design of Networks-on-Chip for Heterogeneous Systems-on-Chip
System-Level Design of Networks-on-Chip for Heterogeneous Systems-on-Chip Young Jin Yoon Department of Computer Science Columbia University New York,New York youngjin@cs.columbia.edu ABSTRACT The network-on-chip()
More informationSODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou
SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou University of California, Los Angeles 1 What is stencil computation? 2 What is Stencil Computation? A sliding
More informationDesign methodology for multi processor systems design on regular platforms
Design methodology for multi processor systems design on regular platforms Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationEfficient Hardware Acceleration on SoC- FPGA using OpenCL
Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA
More informationDesign and Optimization of Networks-on-Chip for Future Heterogeneous Systems-on-Chip. Young Jin Yoon
Design and Optimization of Networks-on-Chip for Future Heterogeneous Systems-on-Chip Young Jin Yoon Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate
More informationINPUT/OUTPUT DEVICES Dr. Bill Yi Santa Clara University
INPUT/OUTPUT DEVICES Dr. Bill Yi Santa Clara University (Based on text: David A. Patterson & John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3 rd Ed., Morgan Kaufmann,
More informationSDSoC: Session 1
SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the
More informationGlue: A Scalable Memory Hierarchy with HLS
Glue: A Scalable Memory Hierarchy with HLS ESP Cache Coherence Chae Jubb ecj2122@columbia.edu 1. INTRODUCTION As we construct more complex heterogeneous architectures, we must ensure that our development
More informationNetwork-on-Chip Micro-Benchmarks
Network-on-Chip Micro-Benchmarks Zhonghai Lu *, Axel Jantsch *, Erno Salminen and Cristian Grecu * Royal Institute of Technology, Sweden Tampere University of Technology, Finland Abstract University of
More informationEE 457 Unit 7b. Main Memory Organization
1 EE 457 Unit 7b Main Memory Organization 2 Motivation Organize main memory to Facilitate byte-addressability while maintaining Efficient fetching of the words in a cache block Low order interleaving (L.O.I)
More informationHandling Large Data Sets for High-Performance Embedded Applications in Heterogeneous Systems-on-Chip
Handling Large Data Sets for High-Performance Embedded Applications in Heterogeneous Systems-on-Chip Paolo Mantovani Emilio G. Cota Christian Pilato Giuseppe Di Guglielmo Luca P. Carloni Department of
More informationDesign AXI Master IP using Vivado HLS tool
W H I T E P A P E R Venkatesh W VLSI Design Engineer and Srikanth Reddy Sr.VLSI Design Engineer Design AXI Master IP using Vivado HLS tool Abstract Vivado HLS (High-Level Synthesis) tool converts C, C++
More informationExploring Automatically Generated Platforms in High Performance FPGAs
Exploring Automatically Generated Platforms in High Performance FPGAs Panagiotis Skrimponis b, Georgios Zindros a, Ioannis Parnassos a, Muhsen Owaida b, Nikolaos Bellas a, and Paolo Ienne b a Electrical
More informationESE Back End 2.0. D. Gajski, S. Abdi. (with contributions from H. Cho, D. Shin, A. Gerstlauer)
ESE Back End 2.0 D. Gajski, S. Abdi (with contributions from H. Cho, D. Shin, A. Gerstlauer) Center for Embedded Computer Systems University of California, Irvine http://www.cecs.uci.edu 1 Technology advantages
More informationHigh-Level Synthesis: Accelerating Alignment Algorithm using SDSoC
High-Level Synthesis: Accelerating Alignment Algorithm using SDSoC Steven Derrien & Simon Rokicki The objective of this lab is to present how High-Level Synthesis (HLS) can be used to accelerate a given
More informationSystemC Synthesis Standard: Which Topics for Next Round? Frederic Doucet Qualcomm Atheros, Inc
SystemC Synthesis Standard: Which Topics for Next Round? Frederic Doucet Qualcomm Atheros, Inc 2/29/2016 Frederic Doucet, Qualcomm Atheros, Inc 2 What to Standardize Next Benefit of current standard: Provides
More informationUnit 2: High-Level Synthesis
Course contents Unit 2: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 2 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis
More informationIntroduction to High level. Synthesis
Introduction to High level Synthesis LISHA/UFSC Prof. Dr. Antônio Augusto Fröhlich Tiago Rogério Mück http://www.lisha.ufsc.br/~guto June 2007 http://www.lisha.ufsc.br/ 1 What is HLS? Example: High level
More informationInterconnecting Components
Interconnecting Components Need interconnections between CPU, memory, controllers Bus: shared communication channel Parallel set of wires for data and synchronization of data transfer Can become a bottleneck
More informationHardware Acceleration in Computer Networks. Jan Kořenek Conference IT4Innovations, Ostrava
Hardware Acceleration in Computer Networks Outline Motivation for hardware acceleration Longest prefix matching using FPGA Hardware acceleration of time critical operations Framework and applications Contracted
More informationEarly Models in Silicon with SystemC synthesis
Early Models in Silicon with SystemC synthesis Agility Compiler summary C-based design & synthesis for SystemC Pure, standard compliant SystemC/ C++ Most widely used C-synthesis technology Structural SystemC
More informationOptimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications
Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu He, Pei Zhang, and Huazhong Yang 2013/01/23 Outline
More informationArea/Delay Estimation for Digital Signal Processor Cores
Area/Delay Estimation for Digital Signal Processor Cores Yuichiro Miyaoka Yoshiharu Kataoka, Nozomu Togawa Masao Yanagisawa Tatsuo Ohtsuki Dept. of Electronics, Information and Communication Engineering,
More informationInstruction Set Extensions for Photonic Synchronous Coalesced Access
Instruction Set Extensions for Photonic Synchronous Coalesced Access Paul Keltcher, David Whelihan, Jeffrey Hughes September 12, 2013 This work is sponsored by Defense Advanced Research Projects Agency
More informationCor Meenderinck, Ben Juurlink Nexus: hardware support for task-based programming
Cor Meenderinck, Ben Juurlink Nexus: hardware support for task-based programming Conference Object, Postprint version This version is available at http://dx.doi.org/0.479/depositonce-577. Suggested Citation
More informationChapter 2 Designing Crossbar Based Systems
Chapter 2 Designing Crossbar Based Systems Over the last decade, the communication architecture of SoCs has evolved from single shared bus systems to multi-bus systems. Today, state-of-the-art bus based
More informationSystem Level Design with IBM PowerPC Models
September 2005 System Level Design with IBM PowerPC Models A view of system level design SLE-m3 The System-Level Challenges Verification escapes cost design success There is a 45% chance of committing
More informationMinje Jun. Eui-Young Chung
Mixed Integer Linear Programming-based g Optimal Topology Synthesis of Cascaded Crossbar Switches Minje Jun Sungjoo Yoo Eui-Young Chung Contents Motivation Related Works Overview of the Method Problem
More informationLecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)
Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material
More informationQuality-of-Service for a High-Radix Switch
Quality-of-Service for a High-Radix Switch Nilmini Abeyratne, Supreet Jeloka, Yiping Kang, David Blaauw, Ronald G. Dreslinski, Reetuparna Das, and Trevor Mudge University of Michigan 51 st DAC 06/05/2014
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),
More informationDesign Space Explorer for FPGAs
Efficient and Reliable High-Level Synthesis Design Space Explorer for FPGAs Dong Liu 1, Benjamin Carrion Schafer 2 Department of Electronic and Information Engineering The Hong Kong Polytechnic University
More informationA Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on
A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on on-chip Donghyun Kim, Kangmin Lee, Se-joong Lee and Hoi-Jun Yoo Semiconductor System Laboratory, Dept. of EECS, Korea Advanced
More informationNEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES
NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES Design: Part 1 High Level Synthesis (Xilinx Vivado HLS) Part 2 SDSoC (Xilinx, HLS + ARM) Part 3 OpenCL (Altera OpenCL SDK) Verification:
More informationCo-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17,
Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms SAMOS XIV July 14-17, 2014 1 Outline Introduction + Motivation Design requirements for many-accelerator SoCs Design problems
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationDNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei
More informationConquering Memory Bandwidth Challenges in High-Performance SoCs
Conquering Memory Bandwidth Challenges in High-Performance SoCs ABSTRACT High end System on Chip (SoC) architectures consist of tens of processing engines. In SoCs targeted at high performance computing
More informationAn FPGA-Based Infrastructure for Fine-Grained DVFS Analysis in High-Performance Embedded Systems
An FPGA-Based Infrastructure for Fine-Grained DVFS Analysis in High-Performance Embedded Systems Paolo Mantovani Emilio G. Cota Kevin Tien $ Christian Pilato Giuseppe Di Guglielmo Ken Shepard $ Luca P.
More informationScalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA
Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089
More informationFCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture
More informationFast and Accurate Source-Level Simulation Considering Target-Specific Compiler Optimizations
FZI Forschungszentrum Informatik at the University of Karlsruhe Fast and Accurate Source-Level Simulation Considering Target-Specific Compiler Optimizations Oliver Bringmann 1 RESEARCH ON YOUR BEHALF Outline
More informationArchitecture and Implementation of Database Systems (Winter 2014/15)
Jens Teubner Architecture & Implementation of DBMS Winter 2014/15 1 Architecture and Implementation of Database Systems (Winter 2014/15) Jens Teubner, DBIS Group jens.teubner@cs.tu-dortmund.de Winter 2014/15
More informationImproving Area and Resource Utilization Lab
Lab Workbook Introduction This lab introduces various techniques and directives which can be used in Vivado HLS to improve design performance as well as area and resource utilization. The design under
More informationA Thermal-aware Application specific Routing Algorithm for Network-on-chip Design
A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design Zhi-Liang Qian and Chi-Ying Tsui VLSI Research Laboratory Department of Electronic and Computer Engineering The Hong Kong
More informationTransaction Level Model Simulator for NoC-based MPSoC Platform
Proceedings of the 6th WSEAS International Conference on Instrumentation, Measurement, Circuits & Systems, Hangzhou, China, April 15-17, 27 17 Transaction Level Model Simulator for NoC-based MPSoC Platform
More informationCosimulation of ITRON-Based Embedded Software with SystemC
Cosimulation of ITRON-Based Embedded Software with SystemC Shin-ichiro Chikada, Shinya Honda, Hiroyuki Tomiyama, Hiroaki Takada Graduate School of Information Science, Nagoya University Information Technology
More informationMOJTABA MAHDAVI Mojtaba Mahdavi DSP Design Course, EIT Department, Lund University, Sweden
High Level Synthesis with Catapult MOJTABA MAHDAVI 1 Outline High Level Synthesis HLS Design Flow in Catapult Data Types Project Creation Design Setup Data Flow Analysis Resource Allocation Scheduling
More informationA flexible memory shuffling unit for image processing accelerators
Eindhoven University of Technology MASTER A flexible memory shuffling unit for image processing accelerators Xie, R.Z. Award date: 2013 Disclaimer This document contains a student thesis (bachelor's or
More informationA Lost Cycles Analysis for Performance Prediction using High-Level Synthesis
A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,
More informationHardware/Software Co-design
Hardware/Software Co-design Zebo Peng, Department of Computer and Information Science (IDA) Linköping University Course page: http://www.ida.liu.se/~petel/codesign/ 1 of 52 Lecture 1/2: Outline : an Introduction
More informationComputer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationPlace Your Logo Here. K. Charles Janac
Place Your Logo Here K. Charles Janac President and CEO Arteris is the Leading Network on Chip IP Provider Multiple Traffic Classes Low Low cost cost Control Control CPU DSP DMA Multiple Interconnect Types
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationEfficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford
Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable
More informationA unified multicore programming model
A unified multicore programming model Simplifying multicore migration By Sven Brehmer Abstract There are a number of different multicore architectures and programming models available, making it challenging
More informationV ENTTI: a Vertically Integrated Framework for Simulation and Optimization of Networks-On-Chip
V ENTTI: a Vertically Integrated Framework for Simulation and Optimization of Networks-On-Chip Young Jin Yoon, Nicola Concer, and Luca Carloni Department of Computer Science, Columbia University {youngjin,
More informationFrom Concept to Silicon
From Concept to Silicon How an idea becomes a part of a new chip at ATI Richard Huddy ATI Research From Concept to Silicon Creating a new Visual Processing Unit (VPU) is a complex task involving many people
More informationThe RM9150 and the Fast Device Bus High Speed Interconnect
The RM9150 and the Fast Device High Speed Interconnect John R. Kinsel Principal Engineer www.pmc -sierra.com 1 August 2004 Agenda CPU-based SOC Design Challenges Fast Device (FDB) Overview Generic Device
More informationDeveloping Dynamic Profiling and Debugging Support in OpenCL for FPGAs
Developing Dynamic Profiling and Debugging Support in OpenCL for FPGAs ABSTRACT Anshuman Verma Virginia Tech, Blacksburg, VA anshuman@vt.edu Skip Booth, Robbie King, James Coole, Andy Keep, John Marshall
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationAddress InterLeaving for Low- Cost NoCs
Address InterLeaving for Low- Cost NoCs Miltos D. Grammatikakis, Kyprianos Papadimitriou, Polydoros Petrakis, Marcello Coppola, and Michael Soulie Technological Educational Institute of Crete, GR STMicroelectronics,
More informationThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
More informationCadence SystemC Design and Verification. NMI FPGA Network Meeting Jan 21, 2015
Cadence SystemC Design and Verification NMI FPGA Network Meeting Jan 21, 2015 The High Level Synthesis Opportunity Raising Abstraction Improves Design & Verification Optimizes Power, Area and Timing for
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More informationJakub Cabal et al. CESNET
CONFIGURABLE FPGA PACKET PARSER FOR TERABIT NETWORKS WITH GUARANTEED WIRE- SPEED THROUGHPUT Jakub Cabal et al. CESNET 2018/02/27 FPGA, Monterey, USA Packet parsing INTRODUCTION It is among basic operations
More informationTowards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.
Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing
More informationIntroduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation
Introduction to Electronic Design Automation Model of Computation Jie-Hong Roland Jiang 江介宏 Department of Electrical Engineering National Taiwan University Spring 03 Model of Computation In system design,
More informationEmerging Platforms, Emerging Technologies, and the Need for Crosscutting Tools Luca Carloni
Emerging Platforms, Emerging Technologies, and the Need for Crosscutting Tools Luca Carloni Department of Computer Science Columbia University in the City of New York NSF Workshop on Emerging Technologies
More informationPrototyping Architectural Support for Program Rollback Using FPGAs
Prototyping Architectural Support for Program Rollback Using FPGAs Radu Teodorescu and Josep Torrellas http://iacoma.cs.uiuc.edu University of Illinois at Urbana-Champaign Motivation Problem: Software
More informationA framework for optimizing OpenVX Applications on Embedded Many Core Accelerators
A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators Giuseppe Tagliavini, DEI University of Bologna Germain Haugou, IIS ETHZ Andrea Marongiu, DEI University of Bologna & IIS
More informationArchitectural-Level Synthesis. Giovanni De Micheli Integrated Systems Centre EPF Lausanne
Architectural-Level Synthesis Giovanni De Micheli Integrated Systems Centre EPF Lausanne This presentation can be used for non-commercial purposes as long as this note and the copyright footers are not
More informationA New Electronic System Level Methodology for Complex Chip Designs
A New Electronic System Level Methodology for Complex Chip Designs Chad Spackman President, Co-Founder 1 Copyright 2006. All rights reserved. We are an EDA Tool Company: C2R Compiler, Inc. General purpose
More informationAn Efficient AXI Read and Write Channel for Memory Interface in System-on-Chip
An Efficient AXI Read and Write Channel for Memory Interface in System-on-Chip Abhinav Tiwari M. Tech. Scholar, Embedded System and VLSI Design Acropolis Institute of Technology and Research, Indore (India)
More informationA Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms
A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms Jingzhao Ou and Viktor K. Prasanna Department of Electrical Engineering, University of Southern California Los Angeles, California,
More informationAn Analysis of Blocking vs Non-Blocking Flow Control in On-Chip Networks
An Analysis of Blocking vs Non-Blocking Flow Control in On-Chip Networks ABSTRACT High end System-on-Chip (SoC) architectures consist of tens of processing engines. These processing engines have varied
More informationPower dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem.
The VLSI Interconnect Challenge Avinoam Kolodny Electrical Engineering Department Technion Israel Institute of Technology VLSI Challenges System complexity Performance Tolerance to digital noise and faults
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationARM processor organization
ARM processor organization P. Bakowski bako@ieee.org ARM register bank The register bank,, which stores the processor state. r00 r01 r14 r15 P. Bakowski 2 ARM register bank It has two read ports and one
More informationBasics DRAM ORGANIZATION. Storage element (capacitor) Data In/Out Buffers. Word Line. Bit Line. Switching element HIGH-SPEED MEMORY SYSTEMS
Basics DRAM ORGANIZATION DRAM Word Line Bit Line Storage element (capacitor) In/Out Buffers Decoder Sense Amps... Bit Lines... Switching element Decoder... Word Lines... Memory Array Page 1 Basics BUS
More informationEliminating Dark Bandwidth
Eliminating Dark Bandwidth Jonathan Beard (twitter: @jonathan_beard) Staff Research Engineer Memory Systems, HPC HCPM 2017 22 June 2017 Dark Bandwidth: Data moved throughout the memory hierarchy that is
More informationENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT
ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT THE FREE AND OPEN RISC INSTRUCTION SET ARCHITECTURE Codasip is the leading provider of RISC-V processor IP Codasip Bk: A portfolio of RISC-V processors Uniquely
More informationDesign Space Exploration Using Parameterized Cores
RESEARCH CENTRE FOR INTEGRATED MICROSYSTEMS UNIVERSITY OF WINDSOR Design Space Exploration Using Parameterized Cores Ian D. L. Anderson M.A.Sc. Candidate March 31, 2006 Supervisor: Dr. M. Khalid 1 OUTLINE
More informationEfficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip
ASP-DAC 2010 20 Jan 2010 Session 6C Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip Jonas Diemer, Rolf Ernst TU Braunschweig, Germany diemer@ida.ing.tu-bs.de Michael Kauschke Intel,
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationComputer Systems Laboratory Sungkyunkwan University
I/O System Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction (1) I/O devices can be characterized by Behavior: input, output, storage
More informationExample Networks on chip Freescale: MPC Telematics chip
Lecture 22: Interconnects & I/O Administration Take QUIZ 16 over P&H 6.6-10, 6.12-14 before 11:59pm Project: Cache Simulator, Due April 29, 2010 NEW OFFICE HOUR TIME: Tuesday 1-2, McKinley Exams in ACES
More informationEarly Performance-Cost Estimation of Application-Specific Data Path Pipelining
Early Performance-Cost Estimation of Application-Specific Data Path Pipelining Jelena Trajkovic Computer Science Department École Polytechnique de Montréal, Canada Email: jelena.trajkovic@polymtl.ca Daniel
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationA Process Model suitable for defining and programming MpSoCs
A Process Model suitable for defining and programming MpSoCs MpSoC-Workshop at Rheinfels, 29-30.6.2010 F. Mayer-Lindenberg, TU Hamburg-Harburg 1. Motivation 2. The Process Model 3. Mapping to MpSoC 4.
More informationTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
More informationAnalyzing and Debugging Performance Issues with Advanced ARM CoreLink System IP Components
Analyzing and Debugging Performance Issues with Advanced ARM CoreLink System IP Components By William Orme, Strategic Marketing Manager, ARM Ltd. and Nick Heaton, Senior Solutions Architect, Cadence Finding
More information