Functioning Hardware from Functional Specifications

Size: px

Start display at page:

Download "Functioning Hardware from Functional Specifications"

Allyson Patterson
5 years ago
Views:

1 Functioning Hardware from Functional Specifications Stephen A. Edwards Columbia University DIMACS Workshop, July 22, 2014

2 Where s my 10 GHz processor?

3 Moore s Law: Transistors Shrink and Get Cheaper The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Closer to every 24 months Gordon Moore, Cramming More Components onto Integrated Circuits, Electronics, 38(8) April 19, 1965.

4 Intel CPU Trends Sutter, The Free Lunch is Over, DDJ Data: Intel, Wikipedia, K. Olukotun

5 Intel CPU Trends Sutter, The Free Lunch is Over, DDJ Data: Intel, Wikipedia, K. Olukotun

6 Pollack s Rule: Diminishing Returns for Processors Integer Perf (X) 10 Performance ~ Sqrt(Area) Slope = ~ Area (X) Single-threaded processor performance grows with the square root of area. It takes 4 the transistors to give 2 the performance. Fred J. Pollack, MICRO 1999 keynote. Graph from Borkar, DAC 2007.

7 Intel CPU Trends Sutter, The Free Lunch is Over, DDJ Data: Intel, Wikipedia, K. Olukotun

8 Intel Processors to Scale Pentium P II P III P IV Core

9 What Happened in 2005? Pentium core Transistors: 42 M Core 2 Duo cores 291 M Xeon E cores 2.3 G

10 Intel CPU Trends Sutter, The Free Lunch is Over, DDJ Data: Intel, Wikipedia, K. Olukotun

11 The Cray-2: Immersed in Fluorinert 1985 ECL 150 kw

12 Liquid Cooled Apple Power Mac G CMOS 1.2 kw

13 Where s all that power going? What can we do about it?

14 Dally: Calculation Cheap; Communication Costly 64b FPU 0.1mm 2 50pJ/op 1.5GHz 64b 1mm Channel 25pJ/word 10mm 250pJ, 4 cycles 20mm Chips are power limited and most power is spent moving data Performance = Parallelism 64b Off-Chip Channel 1nJ/word Efficiency = Locality 64b Floating Point Bill Dally s 2009 DAC Keynote, The End of Denial Architecture

15 Parallelism for Performance; Locality for Efficiency Dally: Single-thread processors are in denial about these two facts We need different programming paradigms and different architectures on which to run them.

16 Massive On-Chip Parallelism: The NVIDIA GTX Titan

17 The NVIDIA GTX Titan/GK110 GPU Speed 4.5 TFLOP/s Frequency 876 MHz Power 250 W Transistors 7 G Area 561 mm 2 Cores 2688 Size Bus width Clock Bandwidth Memory Price 6 Gb 384 bits 1.5 GHz 288 Gb/s $1000 e740

18 The Future is Wires and Memory

19 How best to use all those transistors?

20 Dark Silicon

21 Taylor and Swanson s Conservation Cores BB0 BB1 BB2 CFG Datapath LD + LD Inter-BB State Machine LD * ST <N?.V.V C-core Generation Code to Stylized Verilog and through a CAD flow. Synopsys IC Compiler, P&R, CTS 0.01 mm 2 in 45 nm TSMC runs at 1.4 GHz Custom datapaths, controllers for loop kernels; uses existing memory hierarchy Swanson, Taylor, et al. Conservation Cores. ASPLOS 2010.

22 Bacon et al. s Liquid Metal Fig.2. Block level diagram of DES and Lime code snippet JITting Lime (Java-like, side-effect-free, streaming) to FPGAs Huang, Hormati, Bacon, and Rabbah, Liquid Metal, ECOOP 2008.

23 Arvind, Hoe, et al. s Bluespec GCDModRule Gcd(a, b) if(ab)^(b6= 0)! Gcd(a;b, b) GCD Flip Rule Gcd(a,b)ifa<b!Gcd(b,a) π π Flip+ π Mod Mod δ Mod,a δ Flip,a π Flip δ Flip,b ce a b ce π Flip δ Flip,b δ Flip,a δ Mod,a =0 π Flip π Mod Figure 1.3 Circuit for computing Gcd(a, b) from Example 1. Guarded commands and functions to synchronous logic Hoe and Arvind, Term Rewriting, VLSI 1999

24 Kuper et al. s CλaSH fir (State (xs,hs)) x = (State (shiftinto x xs,hs),(x xs) hs) Fig taps FIR Filter More operational Haskell specifications of regular structures Baaij, Kooijman, Kuper, Boeijink, and Gerards. Cλash, DSD 2010

25 What am I doing about it?

26 Functional Programs to FPGAs

27 Functional Programs to FPGAs

28 Functional Programs to FPGAs

29 Functional Programs to FPGAs

30 Functional Programs to FPGAs

31 Functional Programs to FPGAs

32 Functional Programs to FPGAs

reasoning about programs vastly easier Inherently

If you want races and deadlocks, you need to add

33 Why Functional Specifications? Referential transparency/side-effect freedom make formal reasoning about programs vastly easier Inherently concurrent and race-free (Thank Church and Rosser). If you want races and deadlocks, you need to add constructs. Immutable data structures makes it vastly easier to reason about memory in the presence of concurrency

We do not know the architecture of future multi-cores

34 Why FPGAs? We do not know the structure of future memory systems Homogeneous/Heterogeneous? Levels of Hierarchy? Communication Mechanisms? We do not know the architecture of future multi-cores Programmable in Assembly/C? Single- or multi-threaded? Use FPGAs as a surrogate. Ultimately too flexible, but representative of the long-term solution.

5KB 600 MHz memory blocks; 6 Mb total 350

35 A High-End FPGA: Altera s Stratix V 2500 dual-ported 2.5KB 600 MHz memory blocks; 6 Mb total bit 500 MHz DSP blocks (MAC-oriented datapaths) input LUTs; 28 nm feature size

36 The Practical Question How do we synthesize hardware from pure functional languages for FPGAs? Control and datapath are easy; the memory system is interesting.

37 To Implement Real Algorithms, We Need Structured, recursive data types Recursion to handle recursive data types Memories Memory Hierarchy

38 The Type System: Algebraic Data Types Types are primitive (Boolean, Integer, etc.) or other ADTs: type ::= Type Primitive type Constr Type Constr Type Tagged union Subsume C structs, unions, and enums Comparable power to C++ objects with virtual methods Algebraic because they are sum-of-product types.

39 The Type System: Algebraic Data Types Types are primitive (Boolean, Integer, etc.) or other ADTs: type ::= Type Primitive type Constr Type Constr Type Tagged union Examples: data Intlist = Nil Linked list of integers Cons Int Intlist data Bintree = Leaf Int Binary tree of integers Branch Bintree Bintree data Expr = Literal Int Arithmetic expression Var String Binop Expr Op Expr data Op = Add Sub Mult Div

40 Example: Huffman Decoder in Haskell data HTree = Branch HTree HTree Leaf Char decode :: HTree [Bool] [Char] decode table str = bit str table where bit (False:xs) (Branch l _) = bit xs l 0: left bit (True:xs) (Branch _ r) = bit xs r 1: right bit x (Leaf c) = c : bit x table leaf bit [] _ = [] done Three data types: Input bitstream Output character stream Huffman tree [Bool] (list of Booleans) [Char] (list of Characters) HTree

41 Encoding the Types Huffman tree nodes: (19 bits) 8-bit char 1 Leaf 9-bit pointer 9-bit pointer 0 Branch Boolean input stream: (14 bits) 12-bit pointer B 1 Cons 0 Nil Character output stream: (19 bits) 10-bit pointer 8-bit char 1 Cons 0 Nil

42 Memory Optimizations

43 Memory Split Memories Input HTree Output Optimizations

44 Split Memories Input Output Memory HTree Use Streams Optimizations In FIFO Mem HTree Out FIFO Mem

45 Split Memories Input Output Memory HTree Use Streams Optimizations In FIFO Mem HTree Out FIFO Mem Unroll for locality In FIFO Out FIFO Mem HTree Mem

46 Split Memories Input Output Memory HTree Use Streams Optimizations In FIFO Mem HTree Out FIFO Mem Unroll for locality In FIFO Out FIFO Mem HTree Mem Speculate In FIFO Mem HTree Mem Out FIFO Mem

47 Hardware Synthesis: Semantics-preserving steps to a low-level dialect

48 Removing Recursion: The Fib Example fib n = case n of n fib (n 1) + fib (n 2)

49 Transform to Continuation-Passing Style fibk n k = case n of 1 k 1 2 k 1 n fibk (n 1) (λn1 fibk (n 2) (λn2 k (n1 + n2))) fib n = fibk n (λx x)

50 Name Lambda Expressions (Lambda Lifting) fibk n k = case n of 1 k 1 2 k 1 n fibk (n 1) (k1 n k) k1 n k n1 = fibk (n 2) (k2 n1 k) k2 n1 k n2 = k (n1 + n2) k0 x = x fib n = fibk n k0

51 Represent Continuations with a Type data Cont = K0 K1 Int Cont K2 Int Cont fibk n k = case (n,k) of (1, k) kk k 1 (2, k) kk k 1 (n, k) fibk (n 1) (K1 n k) kk k a = case (k, a) of ((K1 n k), n1) fibk (n 2) (K2 n1 k) ((K2 n1 k), n2) kk k (n1 + n2) (K0, x ) x fib n = fibk n K0

52 Merge Functions data Cont = K0 K1 Int Cont K2 Int Cont data Call = Fibk Int Cont KK Cont Int fibk z = case z of (Fibk 1 k) fibk (KK k 1) (Fibk 2 k) fibk (KK k 1) (Fibk n k) fibk (Fibk (n 1) (K1 n k)) (KK (K1 n k) n1) fibk (Fibk (n 2) (K2 n1 k)) (KK (K2 n1 k) n2) fibk (KK k (n1 + n2)) (KK K0 x ) x fib n = fibk (Fibk n K0)

53 Add Explicit Memory Operations load :: CRef Cont store :: Cont CRef data Cont = K0 K1 Int CRef K2 Int CRef data Call = Fibk Int CRef KK Cont Int fibk z = case z of (Fibk 1 k) fibk (KK (load k) 1) (Fibk 2 k) fibk (KK (load k) 1) (Fibk n k) fibk (Fibk (n 1) (store (K1 n k))) (KK (K1 n k) n1) fibk (Fibk (n 2) (store (K2 n1 k))) (KK (K2 n1 k) n2) fibk (KK (load k) (n1 + n2)) (KK K0 x ) x fib n = fibk (Fibk n (store K0))

54 Syntax-Directed Translation to Hardware n fib Call CRef fibk Cont CRef store load we di a mem do load Cont result

55 Duplication Can Increase Parallelism fib 0 = 0 fib 1 = 1 fib n = fib (n 1) + fib (n 2)

56 Duplication Can Increase Parallelism After duplicating functions: fib 0 = 0 fib 1 = 1 fib n = fib (n 1) + fib (n 2) fib 0 = 0 fib 1 = 1 fib n = fib (n 1) + fib (n 2) fib 0 = 0 fib 1 = 1 fib n = fib (n 1) + fib (n 2) fib 0 = 0 fib 1 = 1 fib n = fib (n 1) + fib (n 2) Here, fib and fib may run in parallel.

57 Unrolling Recursive Data Structures Original Huffman tree type: data Htree = Branch Htree HTree Leaf Char Unrolled Huffman tree type: data Htree = Branch Htree HTree Leaf Char data Htree = Branch Htree HTree Leaf Char data Htree = Branch Htree HTree Leaf Char Increases locality: larger data blocks. A type-aware cache line

58 I Moore s Law is alive and well I But we hit a power wall in Massive parallelism now mandatory 64b FPU 0.1mm2 50pJ /op 1.5GHz I Communication is the culprit 64b Off-Chip Channel 1nJ /word 64b Floating Point 10mm 250pJ, 4 cycles 64b 1mm Channel 25pJ /word 20mm

59 * Dark Silicon is the future: faster transistors; most must remain off BB0 BB1 C-core Generation Custom accelerators are the future; many approaches BB2 CFG Datapath + LD + LD Inter-BB State Machine + Code to Stylized Verilog and through a CAD flow..v Synopsys IC Compiler, P&R, CTS + +.V + LD + ST +1 <N? 0.01 mm 2 in 45 nm TSMC runs at 1.4 GHz My project: A Pure Functional Language to FPGAs abstraction C et al. gcc et al. x86 et al. Future Languages Future ISAs A Functional IR Higher-level languages More FPGAs hardware reality today time

60 Encoding the Types Huffman tree nodes: (19 bits) 8-bit char 1 Leaf 9-bit pointer 9-bit pointer 0 Branch Boolean input stream: (14 bits) Algebraic Data Types in Hardware 12-bit pointer B 1 Cons 0 Nil Character output stream: (19 bits) 10-bit pointer 8-bit char 1 Cons 0 Nil Split Memories Input Output Memory HTree Use Streams Optimizations In FIFO Out FIFO Optimizations Mem HTree Mem In FIFO Out FIFO Mem Mem HTree Unroll for locality Speculate In FIFO Mem HTree Mem Out FIFO Mem Syntax-Directed Translation to Hardware Removing recursion n fib fibk Cont store CRef load Call CRef we di a mem do load Cont result

Functioning Hardware from Functional Specifications

Functioning Hardware from Functional Specifications Stephen A. Edwards Martha A. Kim Richard Townsend Kuangya Zhai Lianne Lairmore Columbia University IBM PL Day, November 18, 2014 Where s my 10 GHz processor?