ScalaPipe: A Streaming Application Generator

1 ScalaPipe: A Streaming Application Generator
Joseph G. Wingbermuehle, Roger D. Chamberlain, Ron K. Cytron
This work is supported by the National Science Foundation under grants CNS and CNS

2 Streaming
Computation kernels, or blocks, connected by explicit communication channels
Advantages: performance, reuse, abstraction
Systems: Auto-Pipe [Fr06], Streams-C [Go00], StreamIt [Th02]
(Figure: three-stage pipeline, Stage 1 -> Stage 2 -> Stage 3)

3 Example: Solution to Laplace's Equation
PDE with several uses, including stationary heat diffusion
Solvable using a Monte Carlo technique
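The Monte Carlo technique mentioned on this slide can be sketched roughly as follows. This is a minimal illustration, not the code used in the talk: the grid size, the boundary values, and the names `boundary`, `walk`, and `estimate` are all assumptions. A walk starts at an interior grid point, steps to a uniformly chosen neighbour until it reaches the boundary, and returns the boundary value found there; the mean over many walks approximates the solution at the starting point.

```scala
import scala.util.Random

object LaplaceMonteCarlo {
  // Illustrative boundary condition for an n x n grid: the edge y == 0 is
  // held at 100.0 and the other three edges at 0.0 (assumed values).
  def boundary(x: Int, y: Int): Double = if (y == 0) 100.0 else 0.0

  // One random walk: step to a uniformly chosen neighbour until a boundary
  // cell is reached, then return the boundary value at that cell.
  def walk(x0: Int, y0: Int, n: Int, rng: Random): Double = {
    var x = x0
    var y = y0
    while (x > 0 && x < n - 1 && y > 0 && y < n - 1) {
      rng.nextInt(4) match {
        case 0 => x += 1
        case 1 => x -= 1
        case 2 => y += 1
        case _ => y -= 1
      }
    }
    boundary(x, y)
  }

  // Estimate the solution at (x0, y0) as the mean over `walks` random walks.
  def estimate(x0: Int, y0: Int, n: Int, walks: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val total = (1 to walks).map(_ => walk(x0, y0, n, rng)).sum
    total / walks
  }
}
```

At the centre of an odd-sized square grid each edge is reached with equal probability, so with these boundary values the estimate should settle near 25. The walks are independent of one another, which is what makes the problem a natural fit for the parallel Walk blocks used in the later slides.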

4 Streaming Implementation
(Figure: Random -> Walk -> Print)

5 Parallel Walks
(Figure: Random -> Split -> two Walk blocks -> Average -> Print)

6 Auto-Pipe & X
(Figure: the X compiler takes an X description together with C blocks and VHDL blocks and produces the application)

7 Laplace Application in X
(Figure: rand -e1-> split; split -e2-> walk1 and split -e3-> walk2; walk1 -e4-> avg and walk2 -e5-> avg; avg -e6-> print)
The block body lists the block instances first, then the labeled edge connections:
    block top {
        Random rand;
        Split split;
        Walk walk1;
        Walk walk2;
        Average avg;
        Print print;
        e1: rand -> split;
        e2: split.y0 -> walk1;
        e3: split.y1 -> walk2;
        e4: walk1 -> avg.x0;
        e5: walk2 -> avg.x1;
        e6: avg -> print;
    };
Blocks are implemented externally in C or an HDL.

8 Observation 1
As the number of Walk blocks increases, the amount of configuration code increases.
(Plot: lines of X versus number of Walk blocks. Annotation: "[...] Walk blocks requires 896 lines of X".)
Two Walk blocks:
    e1: rand -> split;
    e2: split.y0 -> walk1;
    e3: split.y1 -> walk2;
    e4: walk1 -> avg.x0;
    e5: walk2 -> avg.x1;
    e6: avg -> print;
Four Walk blocks:
    e1: rand -> split1;
    e2: split1.y0 -> split2;
    e3: split1.y1 -> split3;
    e4: split2.y0 -> walk1;
    e5: split2.y1 -> walk2;
    e6: split3.y0 -> walk3;
    e7: split3.y1 -> walk4;
    e8: walk1 -> avg1.x0;
    e9: walk2 -> avg1.x1;
    e10: walk3 -> avg2.x0;
    e11: walk4 -> avg2.x1;
    e12: avg1 -> avg3.x0;
    e13: avg2 -> avg3.x1;
    e14: avg3 -> print;
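The growth in configuration code can be made concrete with a small generator that reproduces the edge lists above for any power-of-two number of Walk blocks. This is a sketch only: `XLineCount`, `edges`, and `totalLines` are hypothetical names, and the line count assumes one line per block instance, one per edge, and two for the enclosing `block top { ... };`.

```scala
object XLineCount {
  // Emit the X edge list for a balanced tree with 2^levels Walk blocks,
  // following the naming of the two- and four-block listings.
  def edges(levels: Int): Seq[String] = {
    require(levels >= 1, "this sketch assumes at least two Walk blocks")
    val n = 1 << levels   // number of Walk blocks
    val inner = n - 1     // number of Split blocks (and of Average blocks)
    def split(i: Int) = if (inner == 1) "split" else s"split$i"
    def avg(i: Int) = if (inner == 1) "avg" else s"avg$i"
    // Heap numbering: Split i feeds nodes 2i and 2i+1; indices past the
    // last Split refer to Walk blocks.
    def child(i: Int) = if (i <= inner) split(i) else s"walk${i - inner}"

    val fanOut = (1 to inner).flatMap { i =>
      Seq(s"${split(i)}.y0 -> ${child(2 * i)}", s"${split(i)}.y1 -> ${child(2 * i + 1)}")
    }

    // Pair up producers level by level, numbering Average blocks as created.
    val fanIn = scala.collection.mutable.ArrayBuffer.empty[String]
    var level: Seq[String] = (1 to n).map(i => s"walk$i")
    var next = 1
    while (level.length > 1) {
      level = level.grouped(2).map { pair =>
        val a = avg(next)
        next += 1
        fanIn += s"${pair(0)} -> $a.x0"
        fanIn += s"${pair(1)} -> $a.x1"
        a
      }.toList
    }

    val all = Seq(s"rand -> ${split(1)}") ++ fanOut ++ fanIn ++ Seq(s"${level.head} -> print")
    all.zipWithIndex.map { case (e, k) => s"e${k + 1}: $e;" }
  }

  // Assumed counting: 3n instance declarations (rand, print, n Walks,
  // n-1 Splits, n-1 Averages), 4n-2 edges, and 2 lines for the block wrapper.
  def totalLines(levels: Int): Int = {
    val n = 1 << levels
    3 * n + edges(levels).length + 2
  }
}
```

Under this counting the total is 7n lines for n Walk blocks, so `XLineCount.totalLines(7)` (128 Walk blocks) evaluates to 896, the same figure as the annotation on the plot.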

9 Our Approach
Type-safe generator language
    val Laplace = new AutoPipeApp {
        val random = Random()
        val splits = iteratedmap(levels, random, SplitU32)
        val walks = Array.tabulate(1 << levels) {
            x => Walk(splits(x))()
        }
        val result = iteratedfold(walks, AverageU32)
        Print(result)
    }
The same code can generate 1 Walk block or 128 Walk blocks.
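The slide does not show how `iteratedmap` and `iteratedfold` are defined. One plausible reading, sketched here over ordinary Scala values rather than ScalaPipe streams, is a fan-out that doubles the number of streams at each level and a fan-in that combines neighbours pairwise. The names `TreeCombinators`, `iteratedMap`, and `iteratedFold`, and their signatures, are assumptions made for illustration.

```scala
object TreeCombinators {
  // Apply a one-to-two function `levels` times, doubling the number of
  // values at each level: 1 -> 2 -> 4 -> ... -> 2^levels.
  def iteratedMap[A](levels: Int, seed: A)(split: A => (A, A)): Vector[A] =
    (0 until levels).foldLeft(Vector(seed)) { (streams, _) =>
      streams.flatMap { s =>
        val (a, b) = split(s)
        Vector(a, b)
      }
    }

  // Combine neighbours pairwise, level by level, until one value remains.
  def iteratedFold[A](items: Seq[A])(combine: (A, A) => A): A = {
    require(items.nonEmpty && (items.length & (items.length - 1)) == 0,
      "length must be a power of two")
    var level = items.toVector
    while (level.length > 1)
      level = level.grouped(2).map(p => combine(p(0), p(1))).toVector
    level.head
  }
}
```

For example, `TreeCombinators.iteratedMap(3, "r")(s => (s + "0", s + "1"))` yields eight labels, one per leaf of a three-level split tree, and `iteratedFold` over eight values performs seven pairwise combinations. Changing `levels` is the only edit needed to change the size of the topology.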

10 Observation 2
Moving blocks to a new device requires reimplementation
(Figure: separate HDL implementation, C implementation, and others)

11 Our Approach
A single language for block implementations
(Figure: one ScalaPipe block yields the HDL implementation, the C implementation, and others)

12 Observation 3
Changing the data type requires new block implementations
    module ShiftRightU32(...);
        input wire[31:0] input_x;
        output wire[31:0] output_y;
        ...
        output_y <= input_x >> 1;
        ...
    endmodule

    module ShiftRightS64(...);
        input wire[63:0] input_x;
        output wire[63:0] output_y;
        ...
        output_y <= input_x >>> 1;
        ...
    endmodule

13 Our Solution
Polymorphic block implementations
    class Average(t: AutoPipeType) extends AutoPipeBlock {
        val in0 = input(t)
        val in1 = input(t)
        val out = output(t)
        out = (in0 + in1) / 2
    }
The same implementation works for integral, fixed point, and floating point types.
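To illustrate why a type parameter removes the duplication seen in the two ShiftRight modules of Observation 3, the following sketch generates the distinguishing lines of both modules from one description. It is not ScalaPipe's generator: `ShiftRightGen`, `IntType`, and `module` are hypothetical names, and the emitted lines simply mirror the fragments on the slide, elisions included.

```scala
object ShiftRightGen {
  // A minimal integer type descriptor (hypothetical, for illustration).
  case class IntType(signed: Boolean, bits: Int) {
    def suffix: String = (if (signed) "S" else "U") + bits
  }

  // Emit the lines that differ between data types: the module name, the
  // port widths, and the shift operator used for signed versus unsigned.
  def module(t: IntType): Seq[String] = {
    val op = if (t.signed) ">>>" else ">>"
    Seq(
      s"module ShiftRight${t.suffix}(...);",
      s"input wire[${t.bits - 1}:0] input_x;",
      s"output wire[${t.bits - 1}:0] output_y;",
      s"output_y <= input_x $op 1;",
      "endmodule"
    )
  }
}
```

Calling `ShiftRightGen.module(ShiftRightGen.IntType(false, 32))` and `ShiftRightGen.module(ShiftRightGen.IntType(true, 64))` reproduces the two variants, so supporting a new width or signedness becomes a new argument rather than a new hand-written module.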

14 Observation 4
The block interface for blocks on the same resource is a bottleneck
(Figure: Block 1 implementation and Block 2 implementation each reach the runtime system through a block interface)

15 Our Approach
Single compiler for both the block language and the coordination language
(Figure: one compiler covering the coordination language and the block language)

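16 ScalaPipe
(Figure: Scala source code, written against the ScalaPipe library's coordination DSL and block DSL, goes through the Scala compiler to produce a generator; the generator emits applications such as Application 1 (e.g. 2 Walks) and Application 2 (e.g. 8 Walks))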

17 AverageU32 Block
    val AverageU32 = new AutoPipeBlock {
        val in0 = input(unsigned32)
        val in1 = input(unsigned32)
        val out = output(unsigned32)
        out = (in0 + in1) / 2
    }
(Figure: inputs in0 and in1 feed the AverageU32 block, which produces out)

18 Polymorphic Average Block
    class Average(t: AutoPipeType) extends AutoPipeBlock {
        val in0 = input(t)
        val in1 = input(t)
        val out = output(t)
        out = (in0 + in1) / 2
    }
    val AverageU32 = new Average(UNSIGNED32)
t can be any of the following:
  - Signed or unsigned integer of any width
  - Fixed point type
  - Floating point type

19 Language Virtualization [Ch10]
    class Repeat(v: Int, count: Int) extends AutoPipeBlock {
        val in = input(signed32)
        val out = output(signed32)
        val tmp = local(signed32)
        tmp = in
        if (tmp == v) {                // Evaluated at run time
            for (i <- 1 to count) {    // Expanded at compile time
                out = tmp
            }
        } else {
            out = tmp
        }
    }
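The distinction between "evaluated at run time" and "expanded at compile time" in the Repeat block can be illustrated without the ScalaPipe library. In this sketch the block body is recorded as a small statement tree: an ordinary Scala loop runs while the tree is built and leaves only unrolled copies behind, whereas the comparison on a stream value is kept as a node for the generated code. The types `Stmt`, `Assign`, and `If`, and the C-like output of `emit`, are assumptions for illustration and not ScalaPipe's internal representation.

```scala
object StagingSketch {
  sealed trait Stmt
  case class Assign(dst: String, src: String) extends Stmt
  case class If(cond: String, thenPart: Seq[Stmt], elsePart: Seq[Stmt]) extends Stmt

  // Build the body of a Repeat-like block for fixed `v` and `count`.
  def repeatBody(v: Int, count: Int): Seq[Stmt] = {
    // This Scala loop runs now, during generation, so the result holds
    // `count` copies of the assignment and no loop node at all.
    val unrolled = for (_ <- 1 to count) yield Assign("out", "tmp")
    Seq(
      Assign("tmp", "in"),
      // The condition depends on a stream value, so it stays in the tree
      // and is evaluated when the generated code runs.
      If(s"tmp == $v", unrolled, Seq(Assign("out", "tmp")))
    )
  }

  // Render the statement tree as C-like text.
  def emit(stmts: Seq[Stmt], indent: String = ""): String = {
    val inner = indent + "  "
    stmts.map {
      case Assign(d, s) => s"$indent$d = $s;"
      case If(c, t, e) =>
        Seq(s"${indent}if ($c) {", emit(t, inner), s"$indent} else {", emit(e, inner), s"$indent}")
          .mkString("\n")
    }.mkString("\n")
  }
}
```

`StagingSketch.emit(StagingSketch.repeatBody(7, 3))` produces an `if (tmp == 7)` whose first branch contains three consecutive `out = tmp;` statements: the constant 7 and the loop count were fixed at generation time, while the branch itself survives into the output.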

20 External AverageU32
Potentially more efficient
External and internal blocks can be mixed
    val AverageU32 = new AutoPipeBlock {
        val in0 = input(unsigned32)
        val in1 = input(unsigned32)
        val out = output(unsigned32)
        external("HDL", "AverageU32")
        // Optional internal implementation
    }

21 Block Code Generation
(Figure: code generation flow. Inputs: internal block specification, external block specification. Stages: abstract syntax tree, control flow graph, optimizer. Outputs: C, OpenCL C, Verilog.)

22 HDL Code Optimizer
  - Common subexpression elimination
  - Dead store elimination
  - Dead code elimination
  - Strength reduction
  - Copy propagation
  - ASAP scheduling
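Two of the listed passes can be sketched on a toy expression tree. This is an illustration of the general techniques, not ScalaPipe's optimizer: the `Expr` types and the functions `reduce` and `common` are hypothetical. Strength reduction rewrites a multiply or divide by a power of two as a shift (valid as written only for unsigned or non-negative integer operands), and the second function merely finds the repeated subexpressions that common subexpression elimination would compute once.

```scala
object OptimizerSketch {
  sealed trait Expr
  case class Var(name: String) extends Expr
  case class Const(value: Long) extends Expr
  case class BinOp(op: String, lhs: Expr, rhs: Expr) extends Expr

  // Return k when v == 2^k, otherwise None.
  def log2Exact(v: Long): Option[Int] =
    if (v > 0 && (v & (v - 1)) == 0) Some(java.lang.Long.numberOfTrailingZeros(v))
    else None

  // Strength reduction, assuming unsigned integer operands.
  def reduce(e: Expr): Expr = e match {
    case BinOp(op, l, r) =>
      (op, reduce(l), reduce(r)) match {
        case ("*", x, Const(c)) if log2Exact(c).isDefined =>
          BinOp("<<", x, Const(log2Exact(c).get))
        case ("/", x, Const(c)) if log2Exact(c).isDefined =>
          BinOp(">>", x, Const(log2Exact(c).get))
        case (o, x, y) => BinOp(o, x, y)
      }
    case other => other
  }

  // All operator nodes in the tree, including the root.
  def subexprs(e: Expr): List[Expr] = e match {
    case b @ BinOp(_, l, r) => b :: (subexprs(l) ++ subexprs(r))
    case _ => Nil
  }

  // Subexpressions that occur more than once: candidates for CSE.
  def common(e: Expr): Set[Expr] =
    subexprs(e).groupBy(identity).collect { case (k, v) if v.size > 1 => k }.toSet
}
```

Applied to the averaging expression from the earlier slides, `reduce` turns `(in0 + in1) / 2` into `(in0 + in1) >> 1`, which avoids a divider when the block is mapped to hardware.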

23 Coordination DSL
Describes the topology and resource mapping
    val Laplace = new AutoPipeApp {
        val random = Random()
        val splits = iteratedmap(levels, random, SplitU32)
        val walks = Array.tabulate(1 << levels) {
            x => Walk(splits(x))()
        }
        val result = iteratedfold(walks, AverageU32)
        Print(result)
    }

24 Generating Pipelines
(Figure: Inc -> Inc -> Inc -> Inc)
X language:
    block pipeline {
        input UNSIGNED32 source;
        output UNSIGNED32 result;
        Inc inc1;
        Inc inc2;
        Inc inc3;
        Inc inc4;
        source -> inc1;
        inc1 -> inc2;
        inc2 -> inc3;
        inc3 -> inc4;
        inc4 -> result;
    };
ScalaPipe:
    def pipeline(s: Stream, b: AutoPipeBlock, n: Int): Stream = {
        if (n > 0) {
            pipeline(b(s), b, n - 1)
        } else {
            s
        }
    }
    val result = pipeline(source, Inc, 4)

25 Aspect-Oriented Resource Mapping
    map(random -> ANY_BLOCK, CPU2FPGA())
    map(any_block -> Print, FPGA2CPU())
(Figure: Random and Print on CPU 0; Split, both Walk blocks, and Average on FPGA 0)
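The idea behind the two `map` calls is that a mapping is stated as a pattern over edges, separately from the topology, and applies to every edge that matches. A rough sketch of such pattern matching in plain Scala follows. The names `Edge`, `Rule`, and `channelFor`, the use of `None` as the wildcard, and the convention that a later rule overrides an earlier one are all assumptions made for illustration, not ScalaPipe's actual behaviour.

```scala
object MappingSketch {
  case class Edge(src: String, dst: String)

  // A side set to None plays the role of the wildcard (any block).
  case class Rule(src: Option[String], dst: Option[String], channel: String)

  // Pick the channel for an edge: the last matching rule wins (assumed
  // convention), and unmatched edges fall back to `default`.
  def channelFor(e: Edge, rules: Seq[Rule], default: String): String =
    rules
      .filter(r => r.src.forall(_ == e.src) && r.dst.forall(_ == e.dst))
      .lastOption
      .map(_.channel)
      .getOrElse(default)

  // The two rules from the slide, expressed in this sketch's vocabulary.
  val laplaceRules: Seq[Rule] = Seq(
    Rule(Some("Random"), None, "CPU2FPGA"),
    Rule(None, Some("Print"), "FPGA2CPU")
  )
}
```

With these rules the edge from Random to Split is assigned the CPU-to-FPGA channel and the edge from Average to Print the FPGA-to-CPU channel, while the Split-to-Walk and Walk-to-Average edges keep the default. The topology code itself never mentions devices.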

26 TimeTrial [La11]
How do we find bottlenecks?
    measure(any_block -> Walk, backpressure)
(Figure: the Laplace topology with a plot of % backpressure per frame for the edges into Walk)

27 Illustration of Use
(Figure: RNG -> Walk -> Print, all on CPU 0. Bar chart of time (s) for the configurations CPU, FPGA, 16 Walks, and Custom RNG.)

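28 Illustration of Use
(Figure: RNG -> Walk -> Print with Walk on FPGA 0 and the other blocks on the CPU. Annotation: 83% backpressure. Bar chart of time (s) for the configurations CPU, FPGA, 16 Walks, and Custom RNG.)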

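29 Illustration of Use
(Figure: RNG -> Split -> parallel Walk blocks -> Print, with the Walk blocks on FPGA 0. Annotations: 41 s, 0% backpressure. Bar chart of time (s) for the configurations CPU, FPGA, 16 Walks, and Custom RNG.)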

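30 Illustration of Use
(Figure: crng -> Split -> parallel Walk blocks -> Print, with the Walk blocks on FPGA 0. Annotations: 41 s, 12 s. Bar chart of time (s) for the configurations CPU, FPGA, 16 Walks, and Custom RNG.)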

31 The Current State of ScalaPipe
Code generation for CPUs, FPGAs, and GPUs
FPGA and GPU code generation is suboptimal
No cross-block optimizations

32 The Future of ScalaPipe
Improved code generation
  - Consume multiple items at a time
  - More Verilog and OpenCL C optimizations
Support for more devices
Library generation
Cross-block optimizations

33 Conclusion
ScalaPipe is a streaming application generator
The block DSL allows code reuse across data types and platforms
The coordination DSL allows easy generation of large and complex topologies
Keeping everything in the same language exposes optimization opportunities
(Figure: ScalaPipe comprises the coordination DSL and the block DSL)

34 References
[Ch10] H. Chafi, Z. DeVito, A. Moors, T. Rompf, A. K. Sujeeth, P. Hanrahan, M. Odersky, and K. Olukotun, Language virtualization for heterogeneous parallel computing, in Proc. of ACM Int'l Conf. on Object Oriented Programming Systems, Languages, and Applications, 2010.
[La11] J. M. Lancaster, J. G. Wingbermuehle, and R. D. Chamberlain, Asking for performance: Exploiting developer intuition to guide instrumentation with TimeTrial, in Proc. of IEEE 13th Int'l Conf. on High Performance Computing and Communications, Sep. 2011.
[Fr06] M. A. Franklin, E. J. Tyson, J. Buckley, P. Crowley, and J. Maschmeyer, Auto-Pipe and the X language: A pipeline design tool and description language, in Proc. of Int'l Parallel and Distributed Processing Symp., Apr.
[Go00] M. B. Gokhale, J. M. Stone, J. Arnold, and M. Kalinowski, Stream-oriented FPGA computing in the Streams-C high level language, in Proc. of IEEE Symp. on Field-Programmable Custom Computing Machines, Apr. 2000.
[Th02] W. Thies, M. Karczmarek, and S. Amarasinghe, StreamIt: A language for streaming applications, in Proc. of 11th Int'l Conf. on Compiler Construction, 2002.


More information

[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개

[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개 [Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개 정승혁과장 Senior Application Engineer MathWorks Korea 2015 The MathWorks, Inc. 1 Outline When FPGA, ASIC, or System-on-Chip (SoC) hardware is needed Hardware

More information

A Simple Path to Parallelism with Intel Cilk Plus

A Simple Path to Parallelism with Intel Cilk Plus Introduction This introductory tutorial describes how to use Intel Cilk Plus to simplify making taking advantage of vectorization and threading parallelism in your code. It provides a brief description

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 19: Verilog and Processor Performance Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Verilog Basics Hardware description language

More information

Introduction to Verilog

Introduction to Verilog Introduction to Verilog Synthesis and HDLs Verilog: The Module Continuous (Dataflow) Assignment Gate Level Description Procedural Assignment with always Verilog Registers Mix-and-Match Assignments The

More information

What is a compiler? Xiaokang Qiu Purdue University. August 21, 2017 ECE 573

What is a compiler? Xiaokang Qiu Purdue University. August 21, 2017 ECE 573 What is a compiler? Xiaokang Qiu Purdue University ECE 573 August 21, 2017 What is a compiler? What is a compiler? Traditionally: Program that analyzes and translates from a high level language (e.g.,

More information

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler Hardware/Software Co-Design, System

More information

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips Overview CSE372 Digital Systems Organization and Design Lab Prof. Milo Martin Unit 5: Hardware Synthesis CAD (Computer Aided Design) Use computers to design computers Virtuous cycle Architectural-level,

More information

RTL Coding General Concepts

RTL Coding General Concepts RTL Coding General Concepts Typical Digital System 2 Components of a Digital System Printed circuit board (PCB) Embedded d software microprocessor microcontroller digital signal processor (DSP) ASIC Programmable

More information

High Level Synthesis

High Level Synthesis High Level Synthesis Design Representation Intermediate representation essential for efficient processing. Input HDL behavioral descriptions translated into some canonical intermediate representation.

More information

HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1

HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1 April 4-7, 2016 Silicon Valley HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1 Max Lv, NVIDIA Brant Zhao, NVIDIA April 7 mlv@nvidia.com https://github.com/madeye Histogram of Oriented Gradients on GPU

More information

High Level Programming for GPGPU. Jason Yang Justin Hensley

High Level Programming for GPGPU. Jason Yang Justin Hensley Jason Yang Justin Hensley Outline Brook+ Brook+ demonstration on R670 AMD IL 2 Brook+ Introduction 3 What is Brook+? Brook is an extension to the C-language for stream programming originally developed

More information

Programming in C++ 6. Floating point data types

Programming in C++ 6. Floating point data types Programming in C++ 6. Floating point data types! Introduction! Type double! Type float! Changing types! Type promotion & conversion! Casts! Initialization! Assignment operators! Summary 1 Introduction

More information

Loop Optimizations. Outline. Loop Invariant Code Motion. Induction Variables. Loop Invariant Code Motion. Loop Invariant Code Motion

Loop Optimizations. Outline. Loop Invariant Code Motion. Induction Variables. Loop Invariant Code Motion. Loop Invariant Code Motion Outline Loop Optimizations Induction Variables Recognition Induction Variables Combination of Analyses Copyright 2010, Pedro C Diniz, all rights reserved Students enrolled in the Compilers class at the

More information

Visions for Application Development on Hybrid Computing Systems

Visions for Application Development on Hybrid Computing Systems Visions for Application Development on Hybrid Computing Systems Roger D. Chamberlain, Joseph Lancaster, Ron K. Cytron Dept. of Computer Science and Engineering Washington University in St. Louis Abstract

More information

Implementation of DSP Algorithms

Implementation of DSP Algorithms Implementation of DSP Algorithms Main frame computers Dedicated (application specific) architectures Programmable digital signal processors voice band data modem speech codec 1 PDSP and General-Purpose

More information

NETWORK ON CHIP TO IMPLEMENT THE SYSTEM-LEVEL COMMUNICATION SIMPLIFIES THE DISTRIBUTION OF I/O DATA THROUGHOUT THE CHIP, AND IS ALWAYS

NETWORK ON CHIP TO IMPLEMENT THE SYSTEM-LEVEL COMMUNICATION SIMPLIFIES THE DISTRIBUTION OF I/O DATA THROUGHOUT THE CHIP, AND IS ALWAYS ... THE CASE FOR EMBEDDED NETWORKS ON CHIP ON FIELD-PROGRAMMABLE GATE ARRAYS... THE AUTHORS PROPOSE AUGMENTING THE FPGA ARCHITECTURE WITH AN EMBEDDED NETWORK ON CHIP TO IMPLEMENT THE SYSTEM-LEVEL COMMUNICATION

More information

Lecture 15: System Modeling and Verilog

Lecture 15: System Modeling and Verilog Lecture 15: System Modeling and Verilog Slides courtesy of Deming Chen Intro. VLSI System Design Outline Outline Modeling Digital Systems Introduction to Verilog HDL Use of Verilog HDL in Synthesis Reading

More information

StreamIt on Fleet. Amir Kamil Computer Science Division, University of California, Berkeley UCB-AK06.

StreamIt on Fleet. Amir Kamil Computer Science Division, University of California, Berkeley UCB-AK06. StreamIt on Fleet Amir Kamil Computer Science Division, University of California, Berkeley kamil@cs.berkeley.edu UCB-AK06 July 16, 2008 1 Introduction StreamIt [1] is a high-level programming language

More information

Simone Campanoni Loop transformations

Simone Campanoni Loop transformations Simone Campanoni simonec@eecs.northwestern.edu Loop transformations Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations Simple loop transformations Simple

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Scheduling Image Processing Pipelines

Scheduling Image Processing Pipelines Lecture 15: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];

More information

Program Optimization. Jo, Heeseung

Program Optimization. Jo, Heeseung Program Optimization Jo, Heeseung Today Overview Generally Useful Optimizations Code motion/precomputation Strength reduction Sharing of common subexpressions Removing unnecessary procedure calls Optimization

More information

Using Static Single Assignment Form

Using Static Single Assignment Form Using Static Single Assignment Form Announcements Project 2 schedule due today HW1 due Friday Last Time SSA Technicalities Today Constant propagation Loop invariant code motion Induction variables CS553

More information

FPGA Design Challenge :Techkriti 14 Digital Design using Verilog Part 1

FPGA Design Challenge :Techkriti 14 Digital Design using Verilog Part 1 FPGA Design Challenge :Techkriti 14 Digital Design using Verilog Part 1 Anurag Dwivedi Digital Design : Bottom Up Approach Basic Block - Gates Digital Design : Bottom Up Approach Gates -> Flip Flops Digital

More information

CS 240 Final Exam Review

CS 240 Final Exam Review CS 240 Final Exam Review Linux I/O redirection Pipelines Standard commands C++ Pointers How to declare How to use Pointer arithmetic new, delete Memory leaks C++ Parameter Passing modes value pointer reference

More information

Mapping Vector Codes to a Stream Processor (Imagine)

Mapping Vector Codes to a Stream Processor (Imagine) Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream

More information