Compiling Parallel Algorithms to Memory Systems

Compiling Parallel Algorithms to Memory Systems. Stephen A. Edwards, Columbia University. Presented at Jane Street, April 16, 2012. (λx.?)f = FPGA

Parallelism is the Big Question

Massive On-Chip Parallelism is Inevitable: Intel's 48-core Single Chip Cloud Computer

The Future is Wires and Memory

...and it's Already Here: Altera Stratix IV FPGA

What We are Doing About It [Figure: abstraction vs. time; today's stack is C et al., compiled by gcc et al., running on x86 et al.]

What We are Doing About It [Figure, second build: higher-level Future Languages appear above C et al., and Future ISAs with more hardware appear below x86 et al., tracking reality vs. today over time]

What We are Doing About It [Figure, third build: the proposed path runs from Future Languages through A Functional IR down to FPGAs]

Why Functional Specifications? Referential transparency and freedom from side effects make formal reasoning about programs vastly easier. Functional programs are inherently concurrent and race-free (thank Church and Rosser); if you want races and deadlocks, you have to add constructs for them. Immutable data structures make it vastly easier to reason about memory in the presence of concurrency.

Why FPGAs? We do not know the structure of future memory systems: homogeneous or heterogeneous? How many levels of hierarchy? What communication mechanisms? We do not know the architecture of future multi-cores: programmable in assembly or C? Single- or multi-threaded? So we use FPGAs as a surrogate. They are ultimately too flexible, but they are representative of the long-term solution.

The Memory Hierarchy is the Interesting Part

Multiprocessor Memory is a Headache: cache coherency, write buffers, sequential memory consistency, memory barriers, data races, atomic operations. Immutable data structures simplify these.

The Practical Question: How do we synthesize hardware from pure functional languages for FPGAs? Control and datapath are easy; the memory system is the interesting part.

To Implement Real Algorithms in Hardware, We Need: structured, recursive data types; recursion to handle recursive data types; memories; and a memory hierarchy.

Example: Huffman Decoder in Haskell

data HTree = Branch HTree HTree
           | Leaf Char

decode :: HTree -> [Bool] -> [Char]        -- Huffman tree & bitstream to symbols
decode table str = decoder table str
  where
    decoder (Leaf s)     i          = s : decoder table i   -- identified symbol; start again
    decoder _            []         = []
    decoder (Branch f _) (False:xs) = decoder f xs           -- 0: follow left branch
    decoder (Branch _ t) (True:xs)  = decoder t xs           -- 1: follow right branch

Three data types: the input bitstream, the output character stream, and the Huffman tree.
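
As a quick usage sketch (mine, not from the slides), the following builds a three-symbol tree in which 'a' encodes as 0, 'b' as 10, and 'c' as 11, and runs the decode definition above on a short bitstream:

-- Hand-built tree and bitstream for the decoder above.
table :: HTree
table = Branch (Leaf 'a') (Branch (Leaf 'b') (Leaf 'c'))

bits :: [Bool]
bits = [False, True, False, True, True]   -- 0 10 11

main :: IO ()
main = putStrLn (decode table bits)       -- prints "abc"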

Planned Optimizations:
- Split memories [diagram: separate Input, Output, and HTree memories]
- Use streams [diagram: In FIFO, HTree memory, Out FIFO]
- Unroll for locality [diagram: In FIFO, unrolled HTree memory, Out FIFO with memory]
- Speculate [diagram: In FIFO, duplicated HTree memories, Out FIFO with memory]

One Way to Encode the Types

Huffman tree nodes (19 bits):
  Leaf Char          tag 0, 8-bit character, rest unused
  Branch Tree Tree   tag 1, 9-bit tree pointer, 9-bit tree pointer

Boolean input stream (10 bits):
  Nil                tag 0, rest unused
  Cons Bool List     tag 1, 1-bit value, 8-bit tail pointer

Character output stream (19 bits):
  Nil                tag 0, rest unused
  Cons Char List     tag 1, 8-bit character, 10-bit tail pointer
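
To make the node encoding concrete, here is a small packing/unpacking sketch of my own (placing the tag in bit 18 and the exact field order are assumptions; the slide fixes only the field widths):

import Data.Bits ((.&.), (.|.), shiftL, shiftR, testBit)
import Data.Char (chr, ord)
import Data.Word (Word32)

-- 19-bit HTree node held in the low bits of a Word32.
encodeLeaf :: Char -> Word32                 -- tag 0, character in bits 7..0
encodeLeaf c = fromIntegral (ord c) .&. 0xFF

encodeBranch :: Word32 -> Word32 -> Word32   -- tag 1, two 9-bit tree pointers
encodeBranch l r = (1 `shiftL` 18) .|. ((l .&. 0x1FF) `shiftL` 9) .|. (r .&. 0x1FF)

decodeNode :: Word32 -> Either Char (Word32, Word32)
decodeNode w
  | testBit w 18 = Right ((w `shiftR` 9) .&. 0x1FF, w .&. 0x1FF)   -- Branch: left and right pointers
  | otherwise    = Left (chr (fromIntegral (w .&. 0xFF)))           -- Leaf: the character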

Intermediate Representation Desiderata: a mathematical formalism convenient for performing parallelizing transformations, a.k.a. parallel design patterns: pipelining, speculation, multiple workers, map-reduce.

Intermediate Representation: Recursive Islands

program ::= island+
island  ::= island name arg* = expr state*         Group of states w/ stack
state   ::= label arg* = expr                      Arguments & expression
expr    ::= name var*                              Apply a function
          | let (var = expr)+ in expr              Parallel evaluation
          | case var of (pattern -> expr)+         Multiway conditional
          | var
          | literal
          | recurse label var (var*)               Explicit continuation
          | return var
          | goto label var*                        Branch to another state
pattern ::= name var* | literal | _                Constructor/literal/default
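
Read as an algebraic data type, the grammar might look like the following sketch (the constructor names and the choice of Int literals are mine, not the compiler's actual representation):

data Program = Program [Island]
data Island  = Island Name [Arg] Expr [State]   -- group of states with its own stack
data State   = State Label [Arg] Expr
data Expr
  = Apply Name [Var]                 -- apply a function
  | Let [(Var, Expr)] Expr           -- parallel evaluation of the bindings
  | Case Var [(Pattern, Expr)]       -- multiway conditional
  | VarE Var
  | Lit Literal
  | Recurse Label Var [Var]          -- explicit continuation
  | Return Var
  | Goto Label [Var]                 -- branch to another state
data Pattern = PCon Name [Var] | PLit Literal | PWild

type Name    = String
type Label   = String
type Var     = String
type Arg     = String
type Literal = Int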

Huffman as a Recursive Island

The Haskell source:

data HTree = Branch HTree HTree
           | Leaf Char

decode :: HTree -> [Bool] -> [Char]
decode table str = decoder table str
  where
    decoder (Leaf s)     i          = s : decoder table i
    decoder _            []         = []
    decoder (Branch f _) (False:xs) = decoder f xs
    decoder (Branch _ t) (True:xs)  = decoder t xs

The corresponding recursive islands:

island decoder treep ip =
  let r = dec treep treep ip
  in return r

island dec treep statep ip =
    let i     = fetchi ip
        state = fetcht statep
    in case state of
         Leaf a     -> recurse s1 a (treep treep ip)
         Branch f t ->
           case i of
             Nil        -> let np = Nil in return np
             Cons x xsp ->
               case x of
                 True  -> goto dec treep t xsp
                 False -> goto dec treep f xsp
  s1 a rp =
    let rrp = Cons a rp
    in return rrp

The Basic Translation Template [Figure: a block with go and input ports on the left, ready and result ports on the right]. Strobe-based interface: go indicates the inputs are valid; ready pulses once when the result is valid.
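
As a purely software illustration (my own model, not part of the compiler), a one-cycle-latency block with this interface can be simulated as a function on per-cycle streams: when go is high the input is accepted, and ready pulses with the result on the following cycle.

-- Cycle-level model: each list element is one clock cycle.
strobe1 :: (a -> b) -> [(Bool, a)] -> [(Bool, Maybe b)]
strobe1 f cycles = (False, Nothing) : map fire cycles
  where
    fire (go, x)
      | go        = (True,  Just (f x))   -- ready pulses one cycle after go
      | otherwise = (False, Nothing)

-- e.g. strobe1 (+1) [(True, 41), (False, 0)]
--      == [(False, Nothing), (True, Just 42), (False, Nothing)]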

Translating Let and Case [Figures: a let block in which sub-expressions e1 ... en produce new values v1, v2 feeding expression e; a case block whose go signal fans out to sub-expressions e1 ... en, with their ready/result outputs merged]. Let makes new values available to an expression. Case invokes one of its sub-expressions, then synchronizes.

Translating an Island [Figure: a controller with a go/ready handshake and arguments arg1 ... argn, a stack with push/pop/write and stack-pointer controls, and one expression block per state (e1 ... en) selected by go1 ... gon]. Each island consists of an expression for each state, its own stack, and a controller that manages the stack and invokes the states.

Constructors and Memory. A constructor is a function that stores data in memory: constructor_α :: α -> Ptr α. Memory access functions turn pointers back into data: fetch_α :: Ptr α -> α. Memory stores return an address rather than taking one as an argument. The constructor is responsible for memory management. By default, each data type gets its own memory.
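
As a software analogy (my sketch, not the compiler's actual memory system), constructor and fetch can be modeled as operations on an explicit per-type heap, which makes it visible that a store returns a fresh address rather than consuming one:

import qualified Data.IntMap.Strict as M

type Ptr    = Int
type Heap a = (Ptr, M.IntMap a)              -- next free address and contents

emptyHeap :: Heap a
emptyHeap = (0, M.empty)

construct :: a -> Heap a -> (Ptr, Heap a)    -- store returns a fresh address
construct x (next, mem) = (next, (next + 1, M.insert next x mem))

fetch :: Heap a -> Ptr -> a                  -- turn a pointer back into data
fetch (_, mem) p = mem M.! p

For example, let (p, h) = construct (Leaf 'a') emptyHeap in fetch h p returns the stored node; giving each data type its own Heap mirrors the default of one memory per type.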

Duplication for Performance

fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

Duplication for Performance. After duplicating functions:

fib    0 = 0
fib    1 = 1
fib    n = fib'  (n-1) + fib''  (n-2)

fib'   0 = 0
fib'   1 = 1
fib'   n = fib'  (n-1) + fib'   (n-2)

fib''  0 = 0
fib''  1 = 1
fib''  n = fib'' (n-1) + fib''  (n-2)

fib''' 0 = 0
fib''' 1 = 1
fib''' n = fib''' (n-1) + fib''' (n-2)

Here, fib' and fib'' may run in parallel.
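
A rough software analogy (my own, using GHC's parallel package; on the FPGA the duplicated copies occupy physically separate logic and memories) expresses the same opportunity with par:

import Control.Parallel (par, pseq)

-- Evaluate the two recursive calls in parallel, then add them.
fibPar :: Int -> Int
fibPar 0 = 0
fibPar 1 = 1
fibPar n = a `par` (b `pseq` (a + b))
  where
    a = fibPar (n - 1)
    b = fibPar (n - 2)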

Unrolling Recursive Data Structures. Like a blocking factor, but more general. The idea is to create larger memory blocks that can be operated on in parallel.

Original Huffman tree type:

data HTree = Branch HTree HTree
           | Leaf Char

Unrolled Huffman tree type:

data HTree   = Branch HTree' HTree'
             | Leaf Char
data HTree'  = Branch' HTree'' HTree''
             | Leaf' Char
data HTree'' = Branch'' HTree HTree
             | Leaf'' Char

Recursive instances must be pointers; others can be explicit. Functions must be similarly modified to work with the new types.

Acknowledgements. The project started while at MSR Cambridge. Satnam Singh (now at Google), Simon Peyton Jones (MSR), Martha Kim (Columbia).