Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care. Mattan Erez The University of Texas at Austin


1 Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care. Mattan Erez, The University of Texas at Austin

2 Arch-focused whole-system approach
- Efficiency requirements demand crossing layers
- Algorithms are key: compute less, move less, store less
- Proportional systems: minimize waste
- Utilize and improve emerging technologies
- Explore (and act) up and down, from programming model to circuits; preferably implement at the micro-architecture/system level

3 Big problems and emerging platforms
- Memory systems: capacity, bandwidth, and efficiency are impossible to balance; adaptive and dynamic management helps; new technologies to the rescue? opportunities for in-memory computing
- GPUs, supercomputers, clouds, and more: throughput-oriented designs are a must; the centralization trend is interesting
- Reliability and resilience: more, smaller devices bring a danger of poor reliability; trimmed margins leave less room for error; hard constraints mean efficiency is a must; scale exacerbates reliability and resilience concerns

4 Lots of interesting multi-level projects [diagram: project areas spanning resilience/reliability, memory, and GPU/CPU]

5 A superscale computer is a high-performance system for solving big problems

6 A supercomputer is a high-performance system for solving big cohesive problems

7 A supercomputer is a high-performance system for solving big cohesive problems

8 Simulate. Analyze. Predict.

9 Simulate. Analyze. Predict.

10 Supercomputers are general: not domain specific; cloud computers are general; algorithms keep changing.

11 Superscale computers must balance compute, storage, and communication against constraints: OpEx and CapEx.

12 Super is never super enough

13 [Chart: sustained FLOP/s, from 1E+8 to 1E+18, of the No. 1, No. 10, and No. 500 systems over time, against idealized VLSI scaling]

14 What's keeping us from exascale?

15 The power problem [chart: total power (kW) and efficiency (GFLOPS/kW) of top systems]

16 Solution: build more efficient systems

17 Exascale computers are lunacy:
- Crazy big (1M processors, >10 MW)
- Need the efficiency of embedded devices
- Effective for many domains
- Fully programmable

18 Why should you care?
- Harbingers of things to come
- Discovery awaits
- Enable and promote open innovation

19 Where does the power go?

20 Cooling and infrastructure
- Amazing mechanical and electrical systems
- Getting close to optimal efficiency
- Example system: 17 PFLOP/s, 18,688 nodes, >8 MW, ~200 cabinets, ~400 m² of floor space

21 Actual processing: how much goes to each component? [diagram: I/O, arithmetic, communication, control, memory, idle/margin]

22 The budget: power, 20 MW

23 20 MW / 1 exaFLOP/s: energy 20 pJ/op, i.e., 50 GFLOPS/W sustained

24 20 MW / 1 exaFLOP/s: energy 20 pJ/op, i.e., 50 GFLOPS/W sustained. The best supercomputer today: ~300 pJ/op
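The arithmetic behind these targets is worth making explicit (a back-of-the-envelope check, consistent with the slide's numbers):

    \frac{20\,\mathrm{MW}}{10^{18}\,\mathrm{FLOP/s}}
      = \frac{2\times 10^{7}\,\mathrm{J/s}}{10^{18}\,\mathrm{FLOP/s}}
      = 2\times 10^{-11}\,\mathrm{J/FLOP} = 20\,\mathrm{pJ/FLOP},
    \qquad
    \frac{10^{18}\,\mathrm{FLOP/s}}{2\times 10^{7}\,\mathrm{W}} = 50\,\mathrm{GFLOPS/W}.

Against today's ~300 pJ/op, the required improvement is roughly 15x.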

25 Arithmetic: the 64-bit floating-point operation [chart: estimated energy per operation today, scaled, and researchy, with the remaining headroom; rough estimated numbers]

26 Enough headroom?

27 Unfortunately, hard tradeoffs. Rough per-operation energies at ~10 nm (approximate pairings, as laid out on the slide): a 64-bit DP operation ~15 pJ; a 256-bit access to an 8 kB SRAM ~15 pJ; moving 256 bits over a 20 mm on-chip bus ~80 pJ; an efficient off-chip link ~100-500 pJ; a DRAM read/write ~2 nJ.
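These costs explain the earlier mantra of compute less, move less, store less. A quick consequence (arithmetic on the slide's rough estimates, not a measured figure): if every one of the 10^18 operations per second touched DRAM at ~2 nJ, memory traffic alone would draw

    P_{\mathrm{DRAM}} = 10^{18}\,\mathrm{op/s} \times 2\,\mathrm{nJ/op} = 2\,\mathrm{GW},

two orders of magnitude over the 20 MW budget. Even a single efficient off-chip transfer per operation would draw 10^18 x 100 pJ = 100 MW, 5x the budget, so nearly all operands must come from registers and nearby SRAM.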

28 Need more headroom: minimize waste

29 Do we care about single-unit performance? Must all results be equally precise? Must all results be correct? Lunacy?

30 Relaxed reliability and precision
- Some lunacy: rare, easy-to-detect errors + parallelism
- Lunatic fringe: bounded imprecision
- Lunacy: live with real, unpredictable errors
[chart: estimated energy per operation today, scaled, and researchy, showing the headroom gained from some lunacy, the lunatic fringe, and full lunacy; rough estimated numbers for illustration purposes]

31 Bounded / Approximate / Duplication

32 Bounded / Approximate / Duplication

33 Supercomputers are general: dynamic adaptivity and tuning; some programmers are crazier than others; large effort in error-tolerant algorithms.

34 Proportionality and overprovisioning [chart: arithmetic, control, memory, and margin budgets for dense, latency-bound, and sparse workloads]

35 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. It's all about the algorithm.

36 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. Architecture concepts so far: proportionality, adaptivity, SW-HW co-tuning, locality, parallelism.

37 How to control a billion arithmetic units?

38 Hierarchy minimizes control cost: amortize control decisions. Program side: program, teams, processes, threads, instructions. Machine side: cabinets/modules, nodes, cores, HW units. (A sketch of the amortization idea follows below.)
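As an illustration of amortizing control (a hedged sketch of the idea, not code from the talk): make the control decision once per block of work instead of once per element, which is what wide instructions, warps, and kernels do at different levels of the hierarchy.

    #include <cmath>
    #include <iostream>
    #include <vector>

    // One control decision steers many data elements: the branch is hoisted
    // out of the hot loop, so its cost is amortized over the whole block.
    void scale_block(std::vector<float>& x, bool use_sqrt) {
        if (use_sqrt) {                              // decide once ...
            for (float& v : x) v = std::sqrt(v);     // ... apply to many elements
        } else {
            for (float& v : x) v = 2.0f * v;
        }
    }

    int main() {
        std::vector<float> x = {1.0f, 4.0f, 9.0f};
        scale_block(x, true);
        std::cout << x[2] << "\n";  // prints 3
    }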

39 Hierarchy minimizes control cost: amortize control decisions. Program, teams, processes, threads, instructions.

40 What does a HW instruction do? [diagram: CPU vs. GPU vs. embedded; amortize/specialize more]

41 Single-thread or bulk-thread: the CPU is latency oriented; the GPU is throughput oriented.

42 Heterogeneity is a necessity: disciplined specialization

43 Must balance generality and control cost. Over-specialization stifles innovation. Algorithms are more important than hardware.

44 Heterogeneity is lunacy: programming and tuning are extremely tricky; algorithms are diverse and rapidly evolving.

45 Disciplined heterogeneity: scarcity of choice. Throughput-oriented cores + latency-oriented cores + disciplined reconfigurable accelerators.

46 Heterogeneous computing is already here: GPU for throughput, CPU for latency, abstracted FPGA accelerators.

47 Heterogeneous computing is already here: a Titan node (Cray XK7) pairs an NVIDIA GPU with an AMD CPU; a Stampede node pairs an Intel Xeon CPU with a Xeon Phi. [Copyrighted pictures of the XK7 node, Stampede node, Xeon Phi card and die, Kepler die, Opteron die, and Sandy Bridge die omitted. (c) AMD, Cray, Intel, NVIDIA, XSEDE]

48 Choose the most efficient core

49 Choose the most efficient core [diagram: arithmetic, control, memory, and margin budgets]

50 Generalize the throughput core =?

51 Generalize the throughput core

52 Generalize the throughput core

53 Synchronization may impair execution. Predict when it is necessary (robust optimization). Adaptive compaction: compaction barrier.

54 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. Architecture so far: proportionality, adaptivity, heterogeneity, SW-HW co-tuning, locality, parallelism, hierarchy.

55 Heterogeneity is lunacy: programming and tuning are extremely tricky; algorithms are diverse and rapidly evolving. How can we program these machines?

56 Hierarchical languages restore some sanity: abstract the key tuning knobs of parallelism, locality, hierarchy, and throughput.

57 Hierarchical languages restore some sanity: CUDA and OpenCL, Sequoia (Stanford), Habanero (Rice), Phalanx (NVIDIA), Legion (Stanford); working into X10 and Chapel. Domain-specific languages are even better.

58 A counterexample: D. E. Shaw Research Anton [copyrighted pictures of the Anton package, a molecular-dynamics figure, and the Anton system omitted; (c) D. E. Shaw Research]

59 Another counterexample? Microsoft Catapult
- Disciplined reconfigurability for the cloud
- Distributed FPGA accelerator within form-factor and cost constraints
- Architected interfaces and compiler; a compiler for software, not (just) hardware
[copyrighted figures of the datacenter and of the Catapult diagram from the ISCA 2014 paper omitted]

60 Memory must be cheap, reliable, efficient, fast, and large

61 Fast + large: hierarchy

62 Fast + large: hierarchical [diagram: TSV-stacked memory and additional memory on a silicon interposer; (c) Micron, Intel]

63 Large + efficient: heterogeneity

64 Large + efficient: heterogeneous [diagram: TSV-stacked DRAM + NVRAM on a silicon interposer; (c) Micron, Intel]

65 Heterogeneous + hierarchical: a big mess. Research is just starting, new mechanisms are needed, and there are very interesting tradeoffs with reliability.

66 Fast + efficient control: hierarchy + parallelism

67 Fast + efficient: hierarchy + parallelism. Access a cell.

68 Fast + efficient: hierarchy + parallelism. Access many cells in parallel; many pins in parallel.

69 Fast + efficient: hierarchy + parallelism. Many pins in parallel over multiple cycles; access even more cells in parallel.

70 Fast + efficient: hierarchy + parallelism. Many pins in parallel over multiple cycles; access even more cells in parallel.
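Rough numbers behind this picture (typical DDR-style parameters, assumed here rather than taken from the slides): a x8 DRAM chip bursts 8 beats over its 8 data pins, so a rank of 8 chips delivers

    8\,\mathrm{chips} \times 8\,\mathrm{bits} \times 8\,\mathrm{beats} = 512\,\mathrm{bits} = 64\,\mathrm{B}

per access, while the row activated across the rank is on the order of 8 kB. A single 64 B request thus uses under 1% of the bits the parallel activation touched, which is the waste the next slide points to.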

71 Hierarchy + parallelism: potential waste

72 Is fine-grained access better? Wasteful when all data is needed. [diagram: a coarse-grained (CG) access fetches eight data sub-blocks C0-C7 with one shared ECC block, while a fine-grained (FG) access fetches a single sub-block D0 with its own ECC E0]
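The storage cost behind this tradeoff, with illustrative numbers consistent with the figure (eight sub-blocks per line, and per-sub-block check symbols of comparable strength in the fine-grained case):

    \text{CG: } \frac{8\,\mathrm{B\ ECC}}{64\,\mathrm{B\ data}} = 12.5\%,
    \qquad
    \text{FG: } \frac{8 \times 8\,\mathrm{B\ ECC}}{8 \times 8\,\mathrm{B\ data}} = 100\%.

Fine-grained access saves bandwidth on sparse traffic but can multiply the ECC overhead severalfold, hence the case for adapting the granularity.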

73 Dynamically adapt granularity: the best of both worlds. [diagram: a sector cache (8 x 8 B sectors per line) with a spatial-locality predictor (SLP) per core; co-scheduling of CG and FG requests, with CG splitting, at the last-level cache and memory controller; sub-ranked DRAM with 2x address-bus support for both CG and FG access]

74 Dynamically adapt granularity [chart: system throughput of CG+ECC, FG+ECC, and adaptive-granularity AG+ECC on mcf, omnetpp, mst, em3d, canneal, SSCA2, linked list, and gups (x8 each), plus mixes MIX-A through MIX-D]

75 Cost + poor reliability. Efficiency + poor reliability. New + poor reliability.

76 Compensate with error protection: ECC

77 Compensate with proportional protection

78 Adaptive reliability: adapt the mechanism; adapt the error rate vs. precision (stochastic precision)

79 Adaptive reliability: by type (application guided) and by location (machine-state guided)

80 Example: Virtualized ECC, adapting the mechanism

81 [diagram: virtual pages x, y, and z, with low, medium, and high protection levels, map through the virtual-page-to-physical-frame mapping onto page frames holding data and ECC]

82 [diagram: as above, plus a physical-frame-to-ECC-page mapping that places stronger ECC for page frames y and z in separate ECC pages]

83 [diagram: as above, with data covered by a first-tier code (T1EC) and the ECC pages holding second-tier codes: T2EC for chipkill on page y, T2EC for double chipkill on page z]
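A minimal sketch of how such a physical-frame-to-ECC-page mapping might be computed (hypothetical structures and sizes; the real Virtualized ECC design uses OS-managed tables and cacheable T2EC storage):

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    // Hypothetical per-frame metadata: protection level plus the base address
    // of the ECC page holding this frame's second-tier code (T2EC).
    enum class Level { Low, Medium, High };  // e.g., detect / chipkill / double chipkill

    struct FrameInfo {
        Level level;
        std::uint64_t t2ec_page_base;        // 0 if no second tier is kept
    };

    // Bytes of T2EC per 64 B data block, growing with protection (illustrative sizes).
    constexpr std::uint64_t t2ec_bytes(Level l) {
        return l == Level::Low ? 0 : (l == Level::Medium ? 4 : 8);
    }

    // Map a physical data address to the address of its T2EC storage.
    std::uint64_t t2ec_addr(std::uint64_t paddr,
                            const std::unordered_map<std::uint64_t, FrameInfo>& map) {
        const std::uint64_t pfn   = paddr >> 12;          // 4 kB frames
        const std::uint64_t block = (paddr & 0xFFF) / 64; // 64 B blocks within the frame
        const FrameInfo& fi = map.at(pfn);
        return fi.t2ec_page_base + block * t2ec_bytes(fi.level);
    }

    int main() {
        std::unordered_map<std::uint64_t, FrameInfo> map;
        map[0x12345] = {Level::High, 0x8000000};
        const std::uint64_t a = (0x12345ULL << 12) | 0x40;  // block 1 of frame 0x12345
        std::cout << std::hex << t2ec_addr(a, map) << "\n"; // prints 8000008
    }

The point of the indirection is that frames with weak protection carry no second-tier storage at all, so the ECC cost is proportional to the protection actually requested.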

84 Two-tiered ECC: writes encode both T1EC and T2EC; reads decode only T1EC. [diagram: data with T1EC and T2EC]

85 Virtualized T2EC: writes encode T1EC/T2EC, with the T2EC mapped into memory; reads decode only T1EC. [diagram: data with T1EC; T2EC stored via the mapping]

86 Caching works great [chart: normalized execution time for chipkill-detect, chipkill-correct, and double-chipkill-correct]

87 Bonus: flexibility of ECC [chart: normalized energy-delay product for chipkill-detect, chipkill-correct, and double-chipkill-correct]

88 Can we do even better? Hide the second tier.

89 FrugalECC = compression + VECC

90 Adapt the resilience scheme, or adapt the storage?

91 Adapt the compression scheme too! Coverage-oriented Compression (CoC)

94 Experiments are promising: match or exceed reliability, improve power, minimal impact on performance

97 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. Architecture so far: proportionality, adaptivity, SW-HW co-tuning, heterogeneity, locality, parallelism, hierarchy.

98 The reliability challenge [chart: projected MTTI in hours, on a log scale from 1000 down past 100, broken into hard, soft, intermittent, and SW errors plus the overall rate, falling steadily as expected peak performance grows from 15 PF through 40 PF (2015), 130 PF (2017), and 450 PF (2019) to 1500 PF (2021)]
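The downward trend follows from scale alone. To first order (a textbook approximation, not the slide's model), a system of N identical components fails as often as its components combined:

    \mathrm{MTTI}_{\text{system}} \approx \frac{\mathrm{MTTF}_{\text{component}}}{N},
    \qquad \text{e.g.}\ \frac{25\,\text{yr} \times 8760\,\text{h/yr}}{10^{5}\ \text{nodes}} \approx 2.2\,\text{h}.

So even very reliable nodes yield a system that is interrupted every few hours at exascale component counts.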

99 Detect. Contain. Repair. Recover.

100 Today's techniques are not good enough: hardware support is expensive, and checkpoint-restart doesn't scale.
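Why global checkpoint-restart stops scaling can be seen from the standard Young/Daly first-order model (a well-known result, not from the slides): with checkpoint cost \delta and system MTTI M, the optimal checkpoint interval is

    \tau_{\text{opt}} \approx \sqrt{2\,\delta\,M},

so the fraction of time lost to checkpointing and rework grows roughly as \sqrt{2\delta/M}. As M shrinks with scale while \delta grows with memory size, that fraction heads toward one: with \delta = 10 minutes and M = 2 hours, \tau_{\text{opt}} \approx 49 minutes, and checkpointing alone already consumes about 20% of the machine.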

101 Exascale resilience must be: proportional, parallel, adaptive, hierarchical, analyzable

102 Containment domains (CDs)
- Embed resilience within the application
- Match system hierarchy and characteristics
- Analyzable abstraction, consistent across layers
- CDs don't communicate erroneous data
- CDs have means to recover
[diagram: a root CD with child CDs]

103 Phalanx C++ program with CD annotations: a resiliency model layered on an efficiency-oriented programming model.

    int main(int argc, char **argv) {
      main_task here = phalanx::initialize(argc, argv);
      // Create test arrays here ...
      // Launch kernel on the default CPU ("host")
      openmp_event e1 = async(here, here.processor(), n)(host_saxpy, 2.0f, host_x, host_y);
      // Launch kernel on the default GPU ("device")
      cuda_event e2 = async(here, here.cuda_gpu(0), n)(device_saxpy, 2.0f, device_x, device_y);
      wait(here, e1 & e2);
      return 0;
    }

Software stack: CD API (resiliency interface) on top of the runtime library interface (CD control and persistence), over a microkernel and hardware abstraction layer, down to the machine error reporting architecture (ECC, status).

104 Mapping example: SpMV over matrix M and vector V.

    void task<inner> SpMV(in M, in V_i, out R_i) {
      forall(...) reduce(...)
        SpMV(M[...], V_i[...], R_i[...]);
    }

    void task<leaf> SpMV(...) {
      for r = 0..n
        for c = rows[r]..rows[r+1] {
          res_i[r] += data[c] * V_i[cidx[c]];
          prevc = c;
        }
    }

105 Mapping example: SpMV. The matrix M is blocked into M00, M01, M10, M11 and the vector V into V0, V1; the same inner and leaf tasks apply.

106 Mapping example: SpMV, distributed to 4 nodes holding (M00, V0), (M10, V0), (M01, V1), and (M11, V1).

107 Mapping example: SpMV, distributed to 4 nodes, with the same task code as above.

108 Mapping example: SpMV [diagram: the parent CD preserves M and V; each child CD preserves its block (or relies on the parent's preservation), detects, and recovers; detection and recovery can also escalate to the parent]

109 The CD API: begin, configure, preserve, check, complete.

    void task<inner> SpMV(in M, in V_i, out R_i) {
      cd = begin(parentcd);
      preserve_via_copy(cd, matrix, ...);
      forall(...) reduce(...)
        SpMV(M[...], V_i[...], R_i[...]);
      complete(cd);
    }

    void task<leaf> SpMV(...) {
      cd = begin(parentcd);
      preserve_via_copy(cd, matrix, ...);
      preserve_via_parent(cd, vec_i, ...);
      for r = 0..n
        for c = rows[r]..rows[r+1] {
          res_i[r] += data[c] * V_i[cidx[c]];
          check { fault<fail>(c > prevc); }
          prevc = c;
        }
      complete(cd);
    }
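A toy, self-contained C++ mock of the begin / preserve / check / complete pattern (hypothetical types and helpers; the real CD runtime additionally handles hierarchy, mapping, and hardware error reporting):

    #include <iostream>
    #include <stdexcept>
    #include <vector>

    // Thrown by a failed check; triggers rollback and re-execution of the CD body.
    struct cd_fault : std::runtime_error {
        cd_fault() : std::runtime_error("containment-domain fault") {}
    };

    // check(): fault unless the asserted condition holds.
    void check(bool ok) { if (!ok) throw cd_fault{}; }

    // run_cd(): preserve the state on entry, run the body, and on a fault
    // restore the preserved copy and re-execute (begin/preserve/complete in one).
    template <typename Body>
    void run_cd(std::vector<double>& state, Body body) {
        const std::vector<double> preserved = state;        // preserve_via_copy
        for (;;) {
            try { body(state); return; }                    // compute + check, complete
            catch (const cd_fault&) { state = preserved; }  // recover: restore, re-run
        }
    }

    int main() {
        std::vector<double> res(4, 0.0);
        int attempt = 0;
        run_cd(res, [&](std::vector<double>& r) {
            for (double& v : r) v += 1.0;
            ++attempt;
            check(attempt > 1);  // inject one detectable fault on the first try
        });
        std::cout << "completed after " << attempt << " attempts, res[0]=" << res[0] << "\n";
    }

The key property the mock preserves: a fault never escapes the domain, because the body's effects are rolled back to the preserved copy before re-execution.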

110 SpMV preservation tuning: exploit the hierarchy and the natural redundancy of the distribution (V0 and V1 each live on two nodes), so leaf CDs can preserve less than a full copy.

111 Concise abstraction for complex behavior: the same preserve calls can be satisfied by a local copy or regeneration, by a sibling, or by the parent (for unchanged data).

112 Recovery is concurrent, hierarchical, and adaptive [timeline: preserve, compute, detect, and the re-execution overhead]

113 Analyzable: an application model (tree), an overhead model, and a fault model together predict execution time.

114 Analyzable. Let $p_c$ be the failure probability of one child-CD execution, with $m$ serial CDs per child and $n$ parallel children:

$q(0, m, n) = (1 - p_c)^{mn}$: the probability that all child CDs experienced no failures.

$q(x, m, n) = \left( \sum_{i=0}^{x} \binom{i+m-1}{i} \, p_c^{\,i} (1 - p_c)^{m} \right)^{\!n}$: the probability that every child experienced at most $x$ failures across its $m$ serial CDs.

$d(0, m, n) = q(0, m, n)$: the probability that all child CDs experienced exactly 0 failures.

$d(x, m, n) = q(x, m, n) - q(x-1, m, n)$: the probability that the sibling with the most failures experiences exactly $x$ failures.

$T_{\mathrm{parent}}(m, n) = \sum_{i=0}^{\infty} (i + m) \, T_c \, d(i, m, n)$: the expected parent time, where $T_c$ is the time of one child CD.

The model extends to parallel domains, execution vs. re-execution, heterogeneous CDs, and asymmetric preservation and restoration.
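The model is easy to evaluate numerically; a small C++ sketch under the reconstruction above (parameter values are illustrative):

    #include <cmath>
    #include <cstdio>

    // ln C(n, k) via lgamma, to keep large binomials numerically stable.
    double log_binom(int n, int k) {
        return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
    }

    // q(x, m, n): probability that each of n children sees at most x failures
    // while completing its m serial CDs, each failing independently with prob p.
    double q(int x, int m, int n, double p) {
        double s = 0.0;
        for (int i = 0; i <= x; ++i)
            s += std::exp(log_binom(i + m - 1, i) + i * std::log(p) + m * std::log1p(-p));
        return std::pow(s, n);
    }

    int main() {
        const double p_c = 1e-3, T_c = 1.0;  // per-CD failure prob, child CD time
        const int m = 10, n = 100;           // serial CDs per child, parallel children
        double t = 0.0, prev_q = 0.0;
        for (int i = 0; i <= 1000; ++i) {    // truncate once d(i) is negligible
            const double qi = q(i, m, n, p_c);
            t += (i + m) * T_c * (qi - prev_q);  // (i + m) * T_c * d(i, m, n)
            prev_q = qi;
            if (1.0 - qi < 1e-12) break;
        }
        std::printf("expected parent time: %.4f (failure-free time: %d)\n", t, m);
    }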

115 Analyzable? [execution-time results chart]

116 CDs are scalable [charts: performance efficiency (0-100%) of CDs vs. hierarchical (h-cpr) and global (g-cpr) checkpoint-restart for NT, SpMV, and HPCCG, at peak system performance from 2.5 PF to 2.5 EF]

117 CDs are proportional and efficient [charts: energy overhead (0-20%) of CDs vs. h-cpr and g-cpr for NT, SpMV, and HPCCG, at peak system performance from 2.5 PF to 2.5 EF]

118 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. It's all about the algorithm.

119 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. Architecture so far: proportionality, adaptivity, SW-HW co-tuning + analyzability, heterogeneity, locality, parallelism, hierarchy; applied across arithmetic, control, memory, reliability, and programming.

120 Exascale computers are lunacy

121 Exascale computers are science, not lunacy

122 Harbingers of things to come. Discovery awaits. Enable and promote open innovation. Math + systems + engineering + technology.
