ScalaPipe: A Streaming Application Generator
1 ScalaPipe: A Streaming Application Generator
Joseph G. Wingbermuehle, Roger D. Chamberlain, Ron K. Cytron
This work is supported by the National Science Foundation under grants CNS and CNS.

2 Streaming
Computation kernels, or blocks, connected by explicit communication channels
Advantages: performance, reuse, abstraction
Systems: Auto-Pipe [Fr06], Streams-C [Go00], StreamIt [Th02]
Figure: Stage 1 → Stage 2 → Stage 3

3 Example: Solution to Laplace's Equation
A PDE with several uses, including steady-state heat diffusion
Solvable using a Monte Carlo technique

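The slides name the technique but do not show it. Below is a minimal sketch of the standard random-walk estimator. It assumes a square grid, fixed (Dirichlet) boundary values, and unit steps. `LaplaceMC` and `estimate` are hypothetical names, not ScalaPipe code.

Each walk starts at the point of interest and moves to a random neighbour until it reaches the boundary. The mean of the boundary values reached estimates the solution at that point. Because the walks are independent, they can be spread across several Walk blocks.

```scala
import scala.util.Random

object LaplaceMC {
  // Estimate u(x, y) on an n-by-n grid, where u satisfies the discrete Laplace
  // equation in the interior and equals `boundary` on the edge. Each walk moves
  // to one of the four neighbours with equal probability until it hits the edge;
  // the estimate is the mean boundary value reached.
  def estimate(n: Int, x: Int, y: Int, walks: Int,
               boundary: (Int, Int) => Double, rng: Random): Double = {
    require(n >= 3 && walks > 0, "need an interior and at least one walk")
    require(x >= 0 && x < n && y >= 0 && y < n, "start must lie on the grid")
    def onEdge(i: Int, j: Int): Boolean = i == 0 || j == 0 || i == n - 1 || j == n - 1

    var total = 0.0
    for (_ <- 1 to walks) {
      var i = x
      var j = y
      while (!onEdge(i, j)) {
        rng.nextInt(4) match {
          case 0 => i += 1
          case 1 => i -= 1
          case 2 => j += 1
          case _ => j -= 1
        }
      }
      total += boundary(i, j)
    }
    total / walks
  }
}
```

As a check, take `boundary = (i, _) => i / 10.0` on an 11-by-11 grid. The exact solution is linear in `i`, so the estimate at any point with `x = 5` should approach 0.5 as the number of walks grows.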
4 Streaming Implementation
Figure: Random → Walk → Print

5 Parallel Walks
Figure: Random → Split → two Walk blocks → Average → Print

6 Auto-Pipe & X
Figure: an X description is compiled by the X compiler into an application built from C blocks and VHDL blocks.

7 Laplace Application in X
Figure: rand → split → walk1, walk2 → avg → print, with the edges labeled e1 through e6.
The top-level block declares the block instances, then the edge connections:

    block top {
        Random rand;
        Split split;
        Walk walk1;
        Walk walk2;
        Average avg;
        Print print;
        e1: rand -> split;
        e2: split.y0 -> walk1;
        e3: split.y1 -> walk2;
        e4: walk1 -> avg.x0;
        e5: walk2 -> avg.x1;
        e6: avg -> print;
    };

Blocks are implemented externally in C or an HDL.

8 Observation 1
As the number of Walk blocks increases, the amount of configuration code increases.
Figure: lines of X versus number of Walk blocks; one configuration requires 896 lines of X.
Two Walk blocks:

    e1: rand -> split;
    e2: split.y0 -> walk1;
    e3: split.y1 -> walk2;
    e4: walk1 -> avg.x0;
    e5: walk2 -> avg.x1;
    e6: avg -> print;

Four Walk blocks:

    e1: rand -> split1;
    e2: split1.y0 -> split2;
    e3: split1.y1 -> split3;
    e4: split2.y0 -> walk1;
    e5: split2.y1 -> walk2;
    e6: split3.y0 -> walk3;
    e7: split3.y1 -> walk4;
    e8: walk1 -> avg1.x0;
    e9: walk2 -> avg1.x1;
    e10: walk3 -> avg2.x0;
    e11: walk4 -> avg2.x1;
    e12: avg1 -> avg3.x0;
    e13: avg2 -> avg3.x1;
    e14: avg3 -> print;

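The growth can be checked by generating the X text programmatically. This sketch assumes one declaration or edge per line, plus the two lines that open and close the block. `XGen` and `laplaceX` are hypothetical names.

For two and four Walk blocks the output matches the listings above, except that `split` and `avg` are always numbered. For n Walk blocks it emits 3n declarations and 4n - 2 edges, or 7n lines in total. That gives 896 lines for n = 128, the largest configuration mentioned on the next slide.

```scala
object XGen {
  // Emits an X description of the Laplace application with n Walk blocks
  // (n a power of two, n >= 2): a binary tree of Split blocks fans the random
  // stream out to the walks, and a binary tree of Average blocks combines them.
  def laplaceX(n: Int): Seq[String] = {
    require(n >= 2 && (n & (n - 1)) == 0, "n must be a power of two >= 2")
    val decls = Vector.newBuilder[String]
    val edges = Vector.newBuilder[String]

    decls += "Random rand;"
    for (k <- 1 until n) decls += s"Split split$k;"
    for (w <- 1 to n) decls += s"Walk walk$w;"
    for (a <- 1 until n) decls += s"Average avg$a;"
    decls += "Print print;"

    // Split k feeds splits 2k and 2k+1, or two Walk blocks at the last level.
    edges += "rand -> split1"
    for (k <- 1 until n; port <- 0 to 1) {
      val child = 2 * k + port
      val target = if (child < n) s"split$child" else s"walk${child - n + 1}"
      edges += s"split$k.y$port -> $target"
    }

    // Pair up walk outputs (then average outputs) until one stream remains.
    var level: Vector[String] = (1 to n).map(w => s"walk$w").toVector
    var nextAvg = 1
    while (level.size > 1) {
      level = level.grouped(2).map { pair =>
        val avg = s"avg$nextAvg"
        nextAvg += 1
        edges += s"${pair(0)} -> $avg.x0"
        edges += s"${pair(1)} -> $avg.x1"
        avg
      }.toVector
    }
    edges += s"${level.head} -> print"

    // Number the edges e1, e2, ... as in the hand-written listings.
    val numbered = edges.result().zipWithIndex.map { case (e, i) => s"e${i + 1}: $e;" }
    ("block top {" +: decls.result()) ++ numbered :+ "};"
  }
}
```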
9 Our Approach
A type-safe generator language:

    val Laplace = new AutoPipeApp {
        val random = Random()
        val splits = iteratedmap(levels, random, SplitU32)
        val walks = Array.tabulate(1 << levels) { x => Walk(splits(x))() }
        val result = iteratedfold(walks, AverageU32)
        Print(result)
    }

The same code can generate 1 Walk block (levels = 0) or 128 Walk blocks (levels = 7).

10 Observation 2
Moving blocks to a new device requires reimplementation.
Figure: separate HDL, C, and other implementations of each block.

11 Our Approach
A single language for block implementations.
Figure: one ScalaPipe block yields HDL, C, and other implementations.

12 Observation 3
Changing the data type requires new block implementations:

    module ShiftRightU32(...);
        input wire [31:0] input_x;
        output reg [31:0] output_y;
        ...
        output_y <= input_x >> 1;
        ...
    endmodule

    module ShiftRightS64(...);
        input wire signed [63:0] input_x;
        output reg signed [63:0] output_y;
        ...
        output_y <= input_x >>> 1;
        ...
    endmodule

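The two modules differ only in width, signedness, and shift operator, so a generator can emit both from parameters. This is a minimal sketch. `ShiftGen` is a hypothetical name, and it emits combinational (`assign`) modules rather than the clocked modules elided above.

```scala
object ShiftGen {
  // Emits a combinational Verilog module that shifts its input right by one.
  // Only the width, the signedness, and the shift operator vary.
  def shiftRight(width: Int, signed: Boolean): String = {
    val prefix = if (signed) "S" else "U"
    val name   = s"ShiftRight$prefix$width"
    val sgn    = if (signed) " signed" else ""
    val op     = if (signed) ">>>" else ">>" // >>> sign-extends only for signed operands
    s"""module $name(input wire$sgn [${width - 1}:0] input_x,
       |              output wire$sgn [${width - 1}:0] output_y);
       |  assign output_y = input_x $op 1;
       |endmodule""".stripMargin
  }
}
```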
13 Our Solution
Polymorphic block implementations:

    class Average(t: AutoPipeType) extends AutoPipeBlock {
        val in0 = input(t)
        val in1 = input(t)
        val out = output(t)
        out = (in0 + in1) / 2
    }

The same implementation works for integral, fixed-point, and floating-point types.

14 Observation 4
The block interface is a bottleneck for blocks on the same resource.
Figure: Block 1 implementation → block interface → runtime system → block interface → Block 2 implementation.

15 Our Approach
A single compiler for both the block language and the coordination language.
Figure: the coordination language and the block language feed one compiler.

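16 ScalaPipe
Figure: Scala source code, written with the ScalaPipe library's coordination DSL and block DSL, is compiled by the Scala compiler into a generator. The generator produces applications such as Application 1 (e.g. 2 Walks) and Application 2 (e.g. 8 Walks).
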
17 AverageU32 Block

    val AverageU32 = new AutoPipeBlock {
        val in0 = input(unsigned32)
        val in1 = input(unsigned32)
        val out = output(unsigned32)
        out = (in0 + in1) / 2
    }

Figure: inputs in0 and in1 feed AverageU32, which produces out.

18 Polymorphic Average Block

    class Average(t: AutoPipeType) extends AutoPipeBlock {
        val in0 = input(t)
        val in1 = input(t)
        val out = output(t)
        out = (in0 + in1) / 2
    }

    val AverageU32 = new Average(UNSIGNED32)

t can be any of the following:
- Signed or unsigned integer of any width
- Fixed-point type
- Floating-point type

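In plain Scala the same "one implementation, many types" idea can be expressed with a type class. This is a minimal sketch, not ScalaPipe's implementation. `BlockType` stands in for `AutoPipeType`, and the `Fixed16` Q16.16 format is hypothetical.

```scala
object Poly {
  // Hypothetical stand-in for AutoPipeType: supplies the two operations the
  // Average block needs for one concrete data type.
  trait BlockType[T] {
    def add(a: T, b: T): T
    def half(a: T): T
  }

  // Hypothetical Q16.16 fixed-point value stored in an Int.
  final case class Fixed16(raw: Int)

  object BlockType {
    implicit val signed32: BlockType[Int] = new BlockType[Int] {
      def add(a: Int, b: Int): Int = a + b
      def half(a: Int): Int = a / 2
    }
    implicit val float64: BlockType[Double] = new BlockType[Double] {
      def add(a: Double, b: Double): Double = a + b
      def half(a: Double): Double = a / 2
    }
    implicit val fixed16: BlockType[Fixed16] = new BlockType[Fixed16] {
      def add(a: Fixed16, b: Fixed16): Fixed16 = Fixed16(a.raw + b.raw)
      def half(a: Fixed16): Fixed16 = Fixed16(a.raw >> 1) // arithmetic shift
    }
  }

  // One implementation of (in0 + in1) / 2, shared by every type with an instance.
  def average[T](in0: T, in1: T)(implicit t: BlockType[T]): T = t.half(t.add(in0, in1))
}
```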
19 Language Virtualization [Ch10]

    class Repeat(v: Int, count: Int) extends AutoPipeBlock {
        val in = input(signed32)
        val out = output(signed32)
        val tmp = local(signed32)
        tmp = in
        if (tmp == v) {              // Evaluated at run time
            for (i <- 1 to count) {  // Expanded at compile time
                out = tmp
            }
        } else {
            out = tmp
        }
    }

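The split between generation time and run time can be mimicked with an ordinary Scala function that emits code as text. This is a sketch only. `RepeatGen`, `read_in`, and `write_out` are hypothetical names, and the real mechanism builds a syntax tree rather than pasting strings.

```scala
object RepeatGen {
  // Emits C-like code for the Repeat block. The Scala `for` runs while the code
  // is generated, so the write is unrolled `count` times; the comparison with v
  // is emitted as a C `if` and therefore runs at run time.
  def repeatC(v: Int, count: Int): String = {
    val writes = (1 to count).map(_ => "    write_out(tmp);").mkString("\n")
    s"""tmp = read_in();
       |if (tmp == $v) {
       |$writes
       |} else {
       |    write_out(tmp);
       |}""".stripMargin
  }
}
```

For `repeatC(7, 3)` the output contains a single run-time `if (tmp == 7)` with three copies of the write in its then-branch.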
20 External AverageU32
- Potentially more efficient
- External and internal blocks can be mixed

    val AverageU32 = new AutoPipeBlock {
        val in0 = input(unsigned32)
        val in1 = input(unsigned32)
        val out = output(unsigned32)
        external("HDL", "AverageU32")
        // Optional internal implementation
    }

21 Block Code Generation
Figure components: internal block specification, abstract syntax tree, control flow graph, and optimizer, producing C, OpenCL C, and Verilog; external block specifications are also accepted.

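The following is a toy version of the first stage: a small expression tree for the Average block's body, printed as C and as Verilog. Every name here (`BlockCodegen`, `Expr`, `toC`, `toVerilog`) is hypothetical. The Verilog printer also performs one strength reduction, turning division by a power of two into a right shift. That reduction is only valid for unsigned operands.

```scala
object BlockCodegen {
  // Hypothetical expression tree for a block body such as (in0 + in1) / 2.
  sealed trait Expr
  final case class Port(name: String) extends Expr
  final case class Const(value: Int) extends Expr
  final case class Add(a: Expr, b: Expr) extends Expr
  final case class Div(a: Expr, b: Expr) extends Expr

  // Print the tree as a C expression.
  def toC(e: Expr): String = e match {
    case Port(n)   => n
    case Const(v)  => v.toString
    case Add(a, b) => s"(${toC(a)} + ${toC(b)})"
    case Div(a, b) => s"(${toC(a)} / ${toC(b)})"
  }

  // Print the tree as a Verilog expression; division by a positive power of
  // two becomes a right shift (correct for unsigned values only).
  def toVerilog(e: Expr): String = e match {
    case Port(n)   => n
    case Const(v)  => v.toString
    case Add(a, b) => s"(${toVerilog(a)} + ${toVerilog(b)})"
    case Div(a, Const(c)) if c > 0 && (c & (c - 1)) == 0 =>
      s"(${toVerilog(a)} >> ${Integer.numberOfTrailingZeros(c)})"
    case Div(a, b) => s"(${toVerilog(a)} / ${toVerilog(b)})"
  }
}
```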
22 HDL Code Optimizer
- Common subexpression elimination
- Dead store elimination
- Dead code elimination
- Strength reduction
- Copy propagation
- ASAP scheduling

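For reference, textbook ASAP ("as soon as possible") scheduling under a unit-latency assumption fits in a few lines. This is a sketch, not ScalaPipe's optimizer, and `Asap.schedule` is a hypothetical name. Operations assigned to the same cycle can execute in parallel in the generated hardware.

```scala
object Asap {
  // ASAP scheduling on a dependence DAG, assuming every operation takes one
  // cycle: operations with no predecessors start in cycle 0, and every other
  // operation starts one cycle after its latest predecessor.
  def schedule(deps: Map[String, Seq[String]]): Map[String, Int] = {
    val memo = scala.collection.mutable.Map.empty[String, Int]
    def cycle(op: String): Int = memo.get(op) match {
      case Some(c) => c
      case None =>
        val c = deps.getOrElse(op, Nil).map(p => cycle(p) + 1).foldLeft(0)(_ max _)
        memo(op) = c
        c
    }
    deps.keys.foreach(cycle)
    memo.toMap
  }
}
```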
23 Coordination DSL
Describes the topology and resource mapping:

    val Laplace = new AutoPipeApp {
        val random = Random()
        val splits = iteratedmap(levels, random, SplitU32)
        val walks = Array.tabulate(1 << levels) { x => Walk(splits(x))() }
        val result = iteratedfold(walks, AverageU32)
        Print(result)
    }

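This sketch assumes that `iteratedmap` applies a two-way splitter `levels` times to produce `1 << levels` outputs, and that `iteratedfold` combines its inputs pairwise. That reading is consistent with the four-Walk topology in Observation 1. Under that assumption, both can be modeled on plain values. `Coordination`, `iteratedMap`, and `iteratedFold` are hypothetical names, not ScalaPipe's API.

```scala
object Coordination {
  // Apply a two-way split `levels` times, giving 2^levels leaves in
  // left-to-right order (like split1 -> split2, split3 -> walk1..walk4).
  def iteratedMap[A](levels: Int, a: A, split: A => (A, A)): Vector[A] =
    if (levels == 0) Vector(a)
    else {
      val (l, r) = split(a)
      iteratedMap(levels - 1, l, split) ++ iteratedMap(levels - 1, r, split)
    }

  // Combine neighbours pairwise until one value remains (like avg1, avg2 -> avg3).
  // xs.size must be a power of two.
  def iteratedFold[A](xs: Vector[A], combine: (A, A) => A): A = {
    require(xs.nonEmpty, "need at least one input")
    if (xs.size == 1) xs.head
    else iteratedFold(xs.grouped(2).map { case Vector(x, y) => combine(x, y) }.toVector, combine)
  }
}
```

For example, round-robin splitting `1..8` over two levels gives the leaves `List(1, 5)`, `List(3, 7)`, `List(2, 6)`, `List(4, 8)`. Folding their sums with an integer average gives 9.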
24 Generating Pipelines
Figure: Inc → Inc → Inc → Inc
X language:

    block pipeline {
        input UNSIGNED32 source;
        output UNSIGNED32 result;
        Inc inc1;
        Inc inc2;
        Inc inc3;
        Inc inc4;
        source -> inc1;
        inc1 -> inc2;
        inc2 -> inc3;
        inc3 -> inc4;
        inc4 -> result;
    };

ScalaPipe:

    def pipeline(s: Stream, b: AutoPipeBlock, n: Int): Stream = {
        if (n > 0) {
            pipeline(b(s), b, n - 1)
        } else {
            s
        }
    }
    val result = pipeline(source, Inc, 4)

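Stripped of the DSL, the recursion is ordinary Scala. In this sketch a block is just a function; the names `Pipelines`, `pipeline`, and `pipelineEdges` are hypothetical. The second function shows how the same recursion could emit the X edge list above.

```scala
object Pipelines {
  // A "block" is a function and a "stream" is a value, so pipeline(s, b, n)
  // feeds s through n chained copies of b.
  def pipeline[A](s: A, b: A => A, n: Int): A =
    if (n > 0) pipeline(b(s), b, n - 1) else s

  // The same recursion emitting X edges: each step adds one Inc stage, and the
  // base case connects the last stage to `result`.
  def pipelineEdges(prev: String, n: Int, k: Int = 1): List[String] =
    if (n > 0) s"$prev -> inc$k;" :: pipelineEdges(s"inc$k", n - 1, k + 1)
    else List(s"$prev -> result;")
}
```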
25 Aspect-Oriented Resource Mapping

    map(random -> ANY_BLOCK, CPU2FPGA())
    map(any_block -> Print, FPGA2CPU())

Figure: Random on CPU 0; Split, the two Walk blocks, and Average on FPGA 0; Print on CPU 0.

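The point of the aspect-oriented style is that one rule applies to every edge it matches, however many there are. Below is a sketch of that matching under a hypothetical model in which rules select edges by block type. `Mapping`, `Edge`, `Rule`, and `channelFor` are not ScalaPipe names.

```scala
object Mapping {
  // Hypothetical model: an edge is identified by the block types at its ends,
  // and a rule matches edges by type, with None acting as a wildcard
  // (the role ANY_BLOCK plays on the slide).
  final case class Edge(fromType: String, toType: String)
  final case class Rule(from: Option[String], to: Option[String], channel: String)

  // The first matching rule wins; unmatched edges get the default channel.
  def channelFor(e: Edge, rules: Seq[Rule], default: String): String =
    rules
      .find(r => r.from.forall(_ == e.fromType) && r.to.forall(_ == e.toType))
      .map(_.channel)
      .getOrElse(default)
}
```

With the two rules from the slide, only the edge leaving Random and the edge entering Print cross between CPU and FPGA. Every internal edge keeps the default channel.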
26 TimeTrial [La11]
How do we find bottlenecks?

    measure(any_block -> Walk, backpressure)

Figure: the Random → Split → Walk (×2) → Average → Print topology, with a plot of % backpressure on the edges into Walk, by frame.

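TimeTrial's internals are not described here, so this is a rough illustration only. It assumes that backpressure on an edge is the share of sampled instants at which the upstream block is stalled on a full buffer. Under that assumption, the per-frame percentages in the plot could be computed as follows. `Backpressure.perFrame` is a hypothetical name.

```scala
object Backpressure {
  // Each sample records whether the block feeding an edge was stalled because
  // the edge's buffer was full. Returns the stalled percentage for each frame
  // of `frameSize` consecutive samples (the last frame may be shorter).
  def perFrame(blocked: Seq[Boolean], frameSize: Int): Vector[Double] = {
    require(frameSize > 0, "frameSize must be positive")
    blocked.grouped(frameSize).map(f => 100.0 * f.count(identity) / f.size).toVector
  }
}
```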
27 Illustration of Use
Configuration: RNG, Walk, and Print all on CPU 0.
Chart: execution time (s) for four configurations: CPU, FPGA, 16 Walks, and Custom RNG.

28 Illustration of Use
Configuration: Walk on FPGA 0; RNG and Print on the CPU.
TimeTrial reports 83% backpressure.

29 Illustration of Use
Configuration: 16 Walk blocks, fed through Split, on FPGA 0; RNG and Print on the CPU.
Execution time: 41 s, with 0% backpressure.

30 Illustration of Use
Configuration: the RNG is replaced by a custom RNG block (crng).
Execution time: 12 s, down from 41 s with 16 Walks.

31 The Current State of ScalaPipe
- Code generation for CPUs, FPGAs, and GPUs
- FPGA and GPU code generation is suboptimal
- No cross-block optimizations

32 The Future of ScalaPipe
- Improved code generation
  - Consume multiple items at a time
  - More Verilog and OpenCL C optimizations
- Support for more devices
- Library generation
- Cross-block optimizations

33 Conclusion
- ScalaPipe is a streaming application generator.
- The block DSL allows code reuse across data types and platforms.
- The coordination DSL allows easy generation of large and complex topologies.
- Keeping everything in the same language exposes optimization opportunities.
Figure: ScalaPipe, comprising the coordination DSL and the block DSL.

34 References
[Ch10] H. Chafi, Z. DeVito, A. Moors, T. Rompf, A. K. Sujeeth, P. Hanrahan, M. Odersky, and K. Olukotun, "Language virtualization for heterogeneous parallel computing," in Proc. of ACM Int'l Conf. on Object Oriented Programming Systems, Languages, and Applications, 2010.
[Fr06] M. A. Franklin, E. J. Tyson, J. Buckley, P. Crowley, and J. Maschmeyer, "Auto-Pipe and the X language: A pipeline design tool and description language," in Proc. of Int'l Parallel and Distributed Processing Symp., Apr. 2006.
[Go00] M. B. Gokhale, J. M. Stone, J. Arnold, and M. Kalinowski, "Stream-oriented FPGA computing in the Streams-C high level language," in Proc. of IEEE Symp. on Field-Programmable Custom Computing Machines, Apr. 2000.
[La11] J. M. Lancaster, J. G. Wingbermuehle, and R. D. Chamberlain, "Asking for performance: Exploiting developer intuition to guide instrumentation with TimeTrial," in Proc. of IEEE 13th Int'l Conf. on High Performance Computing and Communications, Sep. 2011.
[Th02] W. Thies, M. Karczmarek, and S. Amarasinghe, "StreamIt: A language for streaming applications," in Proc. of 11th Int'l Conf. on Compiler Construction, 2002.
More informationSimone Campanoni Loop transformations
Simone Campanoni simonec@eecs.northwestern.edu Loop transformations Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations Simple loop transformations Simple
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming
More informationProfiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency
Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu
More informationScheduling Image Processing Pipelines
Lecture 15: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];
More informationProgram Optimization. Jo, Heeseung
Program Optimization Jo, Heeseung Today Overview Generally Useful Optimizations Code motion/precomputation Strength reduction Sharing of common subexpressions Removing unnecessary procedure calls Optimization
More informationUsing Static Single Assignment Form
Using Static Single Assignment Form Announcements Project 2 schedule due today HW1 due Friday Last Time SSA Technicalities Today Constant propagation Loop invariant code motion Induction variables CS553
More informationFPGA Design Challenge :Techkriti 14 Digital Design using Verilog Part 1
FPGA Design Challenge :Techkriti 14 Digital Design using Verilog Part 1 Anurag Dwivedi Digital Design : Bottom Up Approach Basic Block - Gates Digital Design : Bottom Up Approach Gates -> Flip Flops Digital
More informationCS 240 Final Exam Review
CS 240 Final Exam Review Linux I/O redirection Pipelines Standard commands C++ Pointers How to declare How to use Pointer arithmetic new, delete Memory leaks C++ Parameter Passing modes value pointer reference
More informationMapping Vector Codes to a Stream Processor (Imagine)
Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream
More information