VIRTUALIZATION IN THE AGE OF HETEROGENEOUS MACHINES

1 VIRTUALIZATION IN THE AGE OF HETEROGENEOUS MACHINES. David F. Bacon. The ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '11). Keynote, March 9, 2011

4 WHAT HETEROGENEOUS MACHINES? It's The Multicore Era!

5 THE PATH OF LEAST IMAGINATION "For years, computer architects have been saying that a big new idea in computing was needed. Indeed, as transistors have continued to shrink, rather than continuing to innovate, computer designers have simply adopted a so-called multicore approach, where multiple processors are added as more chip real estate became available." Source: "Remapping Computer Circuitry to Avert Impending Bottlenecks", February 28, 2011

6 MEANWHILE... GPU, FPGA, Tilera 64, Cell BE, IBM WSP

7 WHAT IS PERFORMANCE? Peak Performance. Source: Brodtkorb et al., The State-of-the-Art in Heterogeneous Computing, 2010

8 PERFORMANCE, TAKE 2: Actual Performance (Random Number Generation). Source: Thomas et al., A comparison of CPUs, GPUs, FPGAs and massively parallel processor arrays for random number generation, 2009

10 WHAT'S THE BEST USE OF TRANSISTORS? Homogeneous vs. Heterogeneous. Keep Transistors Busy vs. Turn Transistors Off.

11 Should We Virtualize Heterogeneous Architectures?

12 How?

13 At the Language or System Level? Or Both?

14 What is Virtualization Anyway?

18 THE ONLY 2 IDEAS IN COMPUTER SCIENCE: Hashing f(x); Indirection

22 SYSTEM VS. LANGUAGE VMS [Diagram labels: User-Mode ISA, Supervisor-Mode ISA, Device Drivers; hardware: x86, ARM, POWER]

23 SYSTEM VS. LANGUAGE VMS System VMs: virtualize the environment, hold the ISA constant. Language VMs: virtualize the processor, vary the ISA.

27 OVERLAP: SYNERGY AND CONFLICT: Memory; Threads; User-Mode ISA; Supervisor-Mode ISA

28 DEVICE OR PROCESSOR? GPU, FPGA

31 THE ACCELERATOR MODEL [Diagram labels: Force Calc, Main Program, FFT; GPU, CPU, FPGA]

32 THE REALITY

40 Language VMs for Heterogeneous Systems

41 Liquid Metal: Joshua Auerbach, Perry Cheng, Rodric Rabbah, David F. Bacon, Stephen Fink, Sunil Shukla, Christophe Dubach, Yu Zhang

42 HETEROGENEOUS PROGRAMMING

    Target  Languages                Toolchain      Output
    CPU     Java, C++, Python        CPU Compiler   binary
    GPU     CUDA, OpenCL             GPU Compiler   binary
    WSP     C+Intrinsics             Node Compiler  binary
    FPGA    VHDL, Verilog, SystemC   Synthesis      bitfile

45 THE LIQUID METAL PROGRAMMING LANGUAGE

    Source  Lime Compiler backend  Output    Target
    Lime    CPU Backend            bytecode  CPU
    Lime    GPU Backend            binary    GPU
    Lime    Node Backend           binary    WSP
    Lime    Verilog Backend        bitfile   FPGA

50 THE ARTIFACT STORE & EXCLUSION [Diagram: Lime -> Lime Compiler -> artifacts -> Artifact Store; targets: CPU, GPU, WSP, FPGA. The artifacts start as bytecode, binary, binary, bitfile; successive builds add two more bitfiles and drop one of the two binaries.]

55 HETEROGENEOUS EXECUTION OF LIME [Diagram: Lime -> Lime Compiler -> artifacts (bytecode, binary, three bitfiles) -> Artifact Store -> Lime Virtual Machine (LmVM); targets CPU, GPU, WSP, FPGA, connected by a Bus. In the final build, bytecode is shown at the CPU and a bitfile at the FPGA.]

56 THE LIME LANGUAGE: VIRTUALIZING HETEROGENEOUS COMPUTATION

61 LIME: JAVA IS (ALMOST) A SUBSET

    % javac MyClass.java
    % java MyClass
    % mv MyClass.java MyClass.lime
    % limec MyClass.lime
    % java MyClass

INCREMENTALLY USE LIME FEATURES

62 LIME LANGUAGE OVERVIEW. Core Features: Programmable Primitives (BIT-LEVEL PARALLELISM), Stream Programming (PIPELINE PARALLELISM), Collective Operations (DATA PARALLELISM). Other features listed: Immutable Types, Bounded Types, Bounded Arrays, Primitive Supertypes, Graph Construction, Isolation Enforcement, Closed World Support, Rate Matching, Messaging, Reifiable Generics, Ranges, Bounded for, User-defined operators. Supporting Features: Typedefs, Local type inference, Tuples.

63 VIRTUALIZATION COMPARISON [Table: physical resources and their virtual counterparts, for FPGA and for CPU. Labels: Core, Thread, LUTs, bit, RAM, Heap, CLBs, User Primitives, byte, word, float, dfloat, Compare&Swap, byte, float, int, Slices, double, Block RAM, Bounded Arrays, Routing Network, Streams, Reduce, DMA, SerDes, Streams, DSP Multipliers, integer<n>, synchronized, tasks, Map]

64 STREAMING COMPUTATION PIPELINE PARALLELISM

66 TASK CREATION (STATELESS)

    var uppercaser = task work;

    local char work(char c) { return toupper(c); }

[Diagram: a primitive filter wrapping the worker method, with a char port and a char stream (port type and stream type are both char). Callouts: Task Creation, Isolation Keywords.]

68 PIPELINES

    var pipeline = task worker1 => task worker2 => task worker3;

[Diagram: a compound filter made of worker1( ) { }, worker2( ) { }, and worker3( ) { }, joined by port-to-stream connections; type labels: char, char, int, int[[5]]. Callout: Graph Construction.]
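One way to read the `=>` connection in ordinary Java is as function composition: each worker is a function, and the pipeline feeds one worker's output to the next. The sketch below is only an analogy, not how Lime implements tasks. The worker bodies are hypothetical (the slide leaves them empty); only their element types follow the char, char, int, int[[5]] labels.

```java
import java.util.function.Function;

final class PipelineSketch {
    // Hypothetical worker bodies: the slide shows only empty workers and type labels.
    static char worker1(char c) { return Character.toUpperCase(c); }  // char -> char
    static int worker2(char c) { return c - 'A'; }                    // char -> int
    static int[] worker3(int i) {                                     // int -> int[5]
        int[] out = new int[5];
        for (int k = 0; k < out.length; k++) {
            out[k] = i * k;
        }
        return out;
    }

    // Rough analogue of: task worker1 => task worker2 => task worker3
    static Function<Character, int[]> build() {
        Function<Character, Character> w1 = PipelineSketch::worker1;
        Function<Character, Integer> w2 = PipelineSketch::worker2;
        Function<Integer, int[]> w3 = PipelineSketch::worker3;
        return w1.andThen(w2).andThen(w3);
    }

    public static void main(String[] args) {
        int[] result = build().apply('c');
        System.out.println(java.util.Arrays.toString(result));
    }
}
```

Composition captures only the data flow. It says nothing about where each stage runs, which is the part the deck leaves to the compiler and runtime.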

69 SOURCES AND SINKS [Diagram: source reader( ) { } -> filter worker( ) { } -> sink writer( ) { }, connected by char streams. Example endpoints: Heap; File System (/tmp/mydata); /dev/tty.]

70 HELLO WORLD, STREAMING STYLE

    public static void main(string[[]] args) {
        char[[]] msg = { 'H','E','L','L','O',',',' ', 'W','O','R','L','D','!','\n' };
        var hello = msg.source(1) => task Character.toLowerCase(char) => task System.out.print(char);
        hello.finish();
    }
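For readers who know Java streams better than Lime, the same source, filter, sink shape can be sketched with `IntStream`. This is a plain-Java analogy under stated assumptions: `msg.source(1)` is taken to mean "emit the array one element at a time", and a `StringBuilder` stands in for `System.out.print` so the result can be inspected. The class and method names are hypothetical.

```java
final class HelloStreamSketch {
    // Runs source -> filter -> sink over msg and returns what the sink received.
    static String run(char[] msg) {
        StringBuilder sink = new StringBuilder();
        new String(msg).chars()                    // source: one char per element
            .map(Character::toLowerCase)           // filter: stateless, per element
            .forEach(c -> sink.append((char) c));  // sink: consumes every element
        return sink.toString();
    }

    public static void main(String[] args) {
        char[] msg = { 'H','E','L','L','O',',',' ','W','O','R','L','D','!','\n' };
        System.out.print(run(msg));
    }
}
```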

71 DEMO HELLO WORLD LIME/ECLIPSE ENVIRONMENT

73 STATEFUL TASKS

    var averager = task Averager().avg;

    double total;
    long count;

    double avg(double x) { total += x; return total/++count; }

[Diagram: a primitive filter from double to double; total and count are instance variables (local state).]
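The averager's logic is already valid Java, so a plain-Java rendering is nearly a transcription. The sketch below keeps the slide's field and method names and adds a hypothetical `feed` helper that pushes a whole input sequence through one instance, to show that each output depends on everything seen so far.

```java
final class Averager {
    private double total;  // running sum (local state)
    private long count;    // number of inputs seen so far

    // Same body as the slide's avg: returns the running mean after each input.
    double avg(double x) {
        total += x;
        return total / ++count;
    }

    // Hypothetical helper: feeds inputs in order through a single fresh instance.
    static double[] feed(double[] inputs) {
        Averager a = new Averager();
        double[] out = new double[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            out[i] = a.avg(inputs[i]);
        }
        return out;
    }
}
```

Unlike the stateless uppercaser, this filter must see its inputs in order on a single instance, so it cannot be freely replicated across devices.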

74 RATE MATCHING

    var matchedpipe = task AddStuff().work1 => # => task work2;

    int x;

    int[[2]] work1(int i) { int r = i+x; x = i/2; return { r, i }; }

    int work2(int i) { return i*3; }

[Diagram: a compound filter (rate 1:2) from int to int. Between work1's int[[2]] output and work2's int input sits a Buffer, the rate matcher (rate 1:2).]
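The 1:2 rate can be made concrete in plain Java. The sketch below assumes that `#` buffers each `int[[2]]` produced by `work1` and hands its elements to `work2` one at a time, so every input yields two outputs. That reading is inferred from the diagram labels rather than stated on the slide, and the class name and `run` driver are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RateMatchSketch {
    private int x;  // state carried between calls, as in AddStuff

    // Consumes one int, produces two (same body as the slide's work1).
    int[] work1(int i) {
        int r = i + x;
        x = i / 2;
        return new int[] { r, i };
    }

    // Consumes one int, produces one (same body as the slide's work2).
    static int work2(int i) { return i * 3; }

    // Assumed behavior of '#': split each int[2] into single ints for work2.
    static List<Integer> run(int[] inputs) {
        RateMatchSketch stage1 = new RateMatchSketch();
        List<Integer> out = new ArrayList<>();
        for (int i : inputs) {
            for (int v : stage1.work1(i)) {
                out.add(work2(v));
            }
        }
        return out;
    }
}
```

With inputs 4 and 10, `work1` yields {4, 4} and then {12, 10}, so `run` returns [12, 12, 36, 30]: two outputs per input.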

75 DEMO N-BODY SIMULATION: CPU VS. GPU

76 9X SPEEDUP (9.26 GFLOPS) ON LAPTOP

79 VIRTUALIZATION OF DATA MOVEMENT => + x 3 months

80 WHAT'S SO HARD? Projects routinely spend 60-80% of their time on data transfer. Verilog is hard, but driving devices and buses is harder. Motherboard chips and BIOS settings: 4x slowdown each. Signal transitions are considered an interface spec. IP is extremely sensitive to configuration and often breaks when the configuration does not match the original. Device drivers are also highly sensitive: often specialized to one I/O style, they fall off a cliff otherwise. Hardware design culture accepts this as C.O.D.B. (cost of doing business).

81 COLLECTIVE OPERATIONS DATA PARALLELISM

91 ARRAY PARALLELISM

    float[ ] a =...;
    float[ ] b =...;
    float[ ] c = b;

[Diagram: a and b are each Indexable<int,float>; their domain( ) values, 0 :: 2 and 0 :: 2, are compared (==?). An index selects elements of a and b, and a Collectable<int,float> collector gathers the results, labeled temp in the earlier builds and c in the final one.]
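The diagram's sequence (compare the two domains, index both operands, collect the results) can be sketched in plain Java. The operator that combines a and b did not survive in the slide text, so the sketch assumes element-wise addition. The class and method names are hypothetical.

```java
final class ArrayMapSketch {
    // Element-wise combination of two arrays: check domains, then index, apply, collect.
    static float[] mapAdd(float[] a, float[] b) {
        // domain() ==? domain(): both operands must cover the same index range.
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "domains differ: 0::" + (a.length - 1) + " vs 0::" + (b.length - 1));
        }
        float[] c = new float[a.length];  // the collector
        // No iteration reads another iteration's result, so the loop could run in parallel.
        for (int index = 0; index < a.length; index++) {
            c[index] = a[index] + b[index];  // assumed operator: addition
        }
        return c;
    }
}
```

The independence of the iterations is what lets one source expression map to a CPU loop, GPU threads, or replicated FPGA logic.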

92 MANY THINGS ARE COLLECTABLE

    var a = new Matrix<3,quaternion>(100,200,100);
    var b = new Matrix<3,quaternion>(100,200,100);
    var c = + b;
    [domain: ( 0::99, 0::199, 0::99 )]

    Map<string,List<int>> a =...
    Map<string,List<int>> b =...
    Map<string,List<int>> c = append(b);
    [domain: { "foo", "bar" }]

    string foo = Character.toLowerCase(@ "FOO" );
    [domain: 0 :: 2]

93 COLLECTIVE OPERATIONS IN ACTION

    package lime.util.synthesizable;

    public class FixedHashMap<K extends Value, V extends Value,
                              BUCKETS extends ordinal<buckets>,
                              LINKS extends ordinal<links>>
            extends AbstractMap<K,V> {
        protected final nodes = new Node<K,V>[BUCKETS][LINKS];

[Diagram: nodes drawn as a grid with BUCKETS = 3 rows and LINKS = 2 entries per row; each Node shows a 1 or 0 flag, K, and V.]

94 GET OPERATION, STEP 1: SELECT ROW

    public local V get(K key) {
        Node[LINKS] row = nodes[hash(key)];
        boolean[links] selections = comparekey(key);
        V[LINKS] vals = getvalueordefault(selections);
        return! vals;
    }

[Diagram: hash(key) picks one of the BUCKETS = 3 rows of nodes; row holds that bucket's LINKS = 2 Node entries.]

96 STEP 2: COMPARE ALL KEYS [Diagram: a copy of key is compared against each Node in row, producing selections; in the example, true and false.]

101 STEP 3: GET VALUES/ZEROS [Diagram: each Node contributes its value where its selection is true and 0 otherwise, producing vals; in the example, V and 0.]

104 STEP 4: OR-REDUCE FOR RESULT [Diagram: vals (V, 0) is reduced to the single result V.]
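The four steps translate into a short plain-Java sketch. It simplifies heavily and is not the Lime library code: keys and values are ints rather than generic K and V, a `used` flag stands in for the 1/0 marker drawn in each Node, the hash is a simple modulus, and `put` is a hypothetical helper (the slides show only get).

```java
final class FixedHashMapSketch {
    static final int BUCKETS = 3;
    static final int LINKS = 2;

    private final boolean[][] used = new boolean[BUCKETS][LINKS];
    private final int[][] keys = new int[BUCKETS][LINKS];
    private final int[][] vals = new int[BUCKETS][LINKS];

    private static int hash(int key) { return Math.floorMod(key, BUCKETS); }

    // Hypothetical helper: stores into the first free or matching slot of the row.
    boolean put(int key, int value) {
        int b = hash(key);
        for (int l = 0; l < LINKS; l++) {
            if (!used[b][l] || keys[b][l] == key) {
                used[b][l] = true;
                keys[b][l] = key;
                vals[b][l] = value;
                return true;
            }
        }
        return false;  // row is full; a fixed-size map cannot grow
    }

    int get(int key) {
        int b = hash(key);                          // step 1: select row
        boolean[] selections = new boolean[LINKS];
        for (int l = 0; l < LINKS; l++) {           // step 2: compare all keys
            selections[l] = used[b][l] && keys[b][l] == key;
        }
        int[] picked = new int[LINKS];
        for (int l = 0; l < LINKS; l++) {           // step 3: value or zero
            picked[l] = selections[l] ? vals[b][l] : 0;
        }
        int result = 0;
        for (int v : picked) {                      // step 4: OR-reduce
            result |= v;
        }
        return result;                              // 0 when the key is absent
    }
}
```

Each step touches every slot in the row and none exits early. That makes the sketch slower than a conventional lookup on a CPU, but it is the shape that lets all LINKS comparisons happen at once in hardware.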

108 VIRTUALIZATION OF DATA

109 CURRENT RESULTS

110 HOW DO WE EVALUATE PERFORMANCE? Speedup for naïve users: how much faster than Java? Slowdown for expert users: how much slower than hand-tuned low-level code? Our methodology: write, tune, and compare 4 versions of each benchmark: Java, Lime, OpenCL, Verilog. Doesn't address flops/watt, flops/watt/$, or productivity.

111 EXPERT VS NAÏVE SPEEDUP: END-TO-END (JAVA BASELINE) [Bar chart: Expert vs. Lime, on GPU and FPGA. Data labels: 6.67x, 136x, 1.06x, 1.0x, 6.10x, 1.06x, 66x, 0.04x]

112 EXPERT VS NAÏVE SPEEDUP: KERNEL TIME (JAVA BASELINE) [Bar chart: Expert vs. Lime, on GPU and FPGA. Data labels: 37x, 193x, 10x, 150x, 10x, 26x, 126x, 0.02x]

113 Should We Virtualize Heterogeneous Architectures?

114 Yes We Can!

115 USER VS. SUPERVISOR

116 ACCELERATOR VS. FIRST-CLASS DEVICE

117 ANOTHER UNSOLVED PROBLEM

123 SUMMARY: Heterogeneity is Here to Stay. Multicore Isn't Dead - Just Less Important. HLLs Can Insulate Programmer from Low-Level Details, But not fundamental computational structure. System-Level Virtualization is an Open (Hard) Problem.

124 REAL-TIME PHOTOMOSAIC

126 Questions?
