Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care. Mattan Erez The University of Texas at Austin


1 Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care. Mattan Erez, The University of Texas at Austin

2 Arch-focused whole-system approach
- Efficiency requirements demand crossing layers
- Algorithms are key: compute less, move less, store less
- Proportional systems: minimize waste
- Utilize and improve emerging technologies
- Explore (and act) up and down, from programming model to circuits; preferably implement at the micro-architecture/system level

3 Big problems and emerging platforms
- Memory systems: capacity, bandwidth, and efficiency are impossible to balance; adaptive and dynamic management helps; new technologies to the rescue? opportunities for in-memory computing
- GPUs, supercomputers, clouds, and more: throughput-oriented designs are a must; the centralization trend is interesting
- Reliability and resilience: more, smaller devices bring a danger of poor reliability; trimmed margins leave less room for error; hard constraints mean efficiency is a must; scale exacerbates reliability and resilience concerns

4 Lots of interesting multi-level projects [diagram: project areas spanning resilience/reliability, memory, and GPU/CPU]

5 A superscale computer is a high-performance system for solving big problems

6 A supercomputer is a high-performance system for solving big cohesive problems

7 A supercomputer is a high-performance system for solving big cohesive problems

8 Simulate. Analyze. Predict.

9 Simulate. Analyze. Predict.

10 Supercomputers are general: not domain specific; cloud computers are general; algorithms keep changing.

11 Superscale computers must balance compute, storage, and communication against constraints: OpEx and CapEx.

12 Super is never super enough

13 [Chart: sustained FLOP/s, from 1E+8 to 1E+18, of the No. 1, No. 10, and No. 500 systems over time, against idealized VLSI scaling]

14 What's keeping us from exascale?

15 The power problem [chart: total power (kW) and efficiency (GFLOPS/kW) of top systems]

16 Solution: build more efficient systems

17 Exascale computers are lunacy:
- Crazy big (1M processors, >10 MW)
- Need the efficiency of embedded devices
- Effective for many domains
- Fully programmable

18 Why should you care?
- Harbingers of things to come
- Discovery awaits
- Enable and promote open innovation

19 Where does the power go?

20 Cooling and infrastructure
- Amazing mechanical and electrical systems
- Getting close to optimal efficiency
- Example system: 17 PFLOP/s, 18,688 nodes, >8 MW, ~200 cabinets, ~400 m² of floor space

21 Actual processing: how much goes to each component? [diagram: I/O, arithmetic, communication, control, memory, idle/margin]

22 The budget: power, 20 MW

23 20 MW / 1 exaFLOP/s: energy 20 pJ/op, i.e., 50 GFLOPS/W sustained

24 20 MW / 1 exaFLOP/s: energy 20 pJ/op, i.e., 50 GFLOPS/W sustained. The best supercomputer today: ~300 pJ/op
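The arithmetic behind these targets is worth making explicit (a back-of-the-envelope check, consistent with the slide's numbers):

    \frac{20\,\mathrm{MW}}{10^{18}\,\mathrm{FLOP/s}}
      = \frac{2\times 10^{7}\,\mathrm{J/s}}{10^{18}\,\mathrm{FLOP/s}}
      = 2\times 10^{-11}\,\mathrm{J/FLOP} = 20\,\mathrm{pJ/FLOP},
    \qquad
    \frac{10^{18}\,\mathrm{FLOP/s}}{2\times 10^{7}\,\mathrm{W}} = 50\,\mathrm{GFLOPS/W}.

Against today's ~300 pJ/op, the required improvement is roughly 15x.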

25 Arithmetic: the 64-bit floating-point operation [chart: estimated energy per operation today, scaled, and researchy, with the remaining headroom; rough estimated numbers]

26 Enough headroom?

27 Unfortunately, hard tradeoffs. Rough per-operation energies at ~10 nm (approximate pairings, as laid out on the slide): a 64-bit DP operation ~15 pJ; a 256-bit access to an 8 kB SRAM ~15 pJ; moving 256 bits over a 20 mm on-chip bus ~80 pJ; an efficient off-chip link ~100-500 pJ; a DRAM read/write ~2 nJ.
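These costs explain the earlier mantra of compute less, move less, store less. A quick consequence (arithmetic on the slide's rough estimates, not a measured figure): if every one of the 10^18 operations per second touched DRAM at ~2 nJ, memory traffic alone would draw

    P_{\mathrm{DRAM}} = 10^{18}\,\mathrm{op/s} \times 2\,\mathrm{nJ/op} = 2\,\mathrm{GW},

two orders of magnitude over the 20 MW budget. Even a single efficient off-chip transfer per operation would draw 10^18 x 100 pJ = 100 MW, 5x the budget, so nearly all operands must come from registers and nearby SRAM.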

28 Need more headroom: minimize waste

29 Do we care about single-unit performance? Must all results be equally precise? Must all results be correct? Lunacy?

30 Relaxed reliability and precision
- Some lunacy: rare, easy-to-detect errors + parallelism
- Lunatic fringe: bounded imprecision
- Lunacy: live with real, unpredictable errors
[chart: estimated energy per operation today, scaled, and researchy, showing the headroom gained from some lunacy, the lunatic fringe, and full lunacy; rough estimated numbers for illustration purposes]

31 Bounded / Approximate / Duplication

32 Bounded / Approximate / Duplication

33 Supercomputers are general: dynamic adaptivity and tuning; some programmers are crazier than others; large effort in error-tolerant algorithms.

34 Proportionality and overprovisioning [chart: arithmetic, control, memory, and margin budgets for dense, latency-bound, and sparse workloads]

35 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. It's all about the algorithm.

36 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. Architecture concepts so far: proportionality, adaptivity, SW-HW co-tuning, locality, parallelism.

37 How to control a billion arithmetic units?

38 Hierarchy minimizes control cost: amortize control decisions. Program side: program, teams, processes, threads, instructions. Machine side: cabinets/modules, nodes, cores, HW units. (A sketch of the amortization idea follows below.)
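As an illustration of amortizing control (a hedged sketch of the idea, not code from the talk): make the control decision once per block of work instead of once per element, which is what wide instructions, warps, and kernels do at different levels of the hierarchy.

    #include <cmath>
    #include <iostream>
    #include <vector>

    // One control decision steers many data elements: the branch is hoisted
    // out of the hot loop, so its cost is amortized over the whole block.
    void scale_block(std::vector<float>& x, bool use_sqrt) {
        if (use_sqrt) {                              // decide once ...
            for (float& v : x) v = std::sqrt(v);     // ... apply to many elements
        } else {
            for (float& v : x) v = 2.0f * v;
        }
    }

    int main() {
        std::vector<float> x = {1.0f, 4.0f, 9.0f};
        scale_block(x, true);
        std::cout << x[2] << "\n";  // prints 3
    }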

39 Hierarchy minimizes control cost: amortize control decisions. Program, teams, processes, threads, instructions.

40 What does a HW instruction do? [diagram: CPU vs. GPU vs. embedded; amortize/specialize more]

41 Single-thread or bulk-thread: the CPU is latency oriented; the GPU is throughput oriented.

42 Heterogeneity is a necessity: disciplined specialization

43 Must balance generality and control cost. Over-specialization stifles innovation. Algorithms are more important than hardware.

44 Heterogeneity is lunacy: programming and tuning are extremely tricky; algorithms are diverse and rapidly evolving.

45 Disciplined heterogeneity: scarcity of choice. Throughput-oriented cores + latency-oriented cores + disciplined reconfigurable accelerators.

46 Heterogeneous computing is already here: GPU for throughput, CPU for latency, abstracted FPGA accelerators.

47 Heterogeneous computing is already here: a Titan node (Cray XK7) pairs an NVIDIA GPU with an AMD CPU; a Stampede node pairs an Intel Xeon CPU with a Xeon Phi. [Copyrighted pictures of the XK7 node, Stampede node, Xeon Phi card and die, Kepler die, Opteron die, and Sandy Bridge die omitted. (c) AMD, Cray, Intel, NVIDIA, XSEDE]

48 Choose the most efficient core

49 Choose the most efficient core [diagram: arithmetic, control, memory, and margin budgets]

50 Generalize the throughput core =?

51 Generalize the throughput core

52 Generalize the throughput core

53 Synchronization may impair execution. Predict when it is necessary (robust optimization). Adaptive compaction: compaction barrier.

54 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. Architecture so far: proportionality, adaptivity, heterogeneity, SW-HW co-tuning, locality, parallelism, hierarchy.

55 Heterogeneity is lunacy: programming and tuning are extremely tricky; algorithms are diverse and rapidly evolving. How can we program these machines?

56 Hierarchical languages restore some sanity: abstract the key tuning knobs of parallelism, locality, hierarchy, and throughput.

57 Hierarchical languages restore some sanity: CUDA and OpenCL, Sequoia (Stanford), Habanero (Rice), Phalanx (NVIDIA), Legion (Stanford); working into X10 and Chapel. Domain-specific languages are even better.

58 A counterexample: D. E. Shaw Research Anton [copyrighted pictures of the Anton package, a molecular-dynamics figure, and the Anton system omitted; (c) D. E. Shaw Research]

59 Another counterexample? Microsoft Catapult
- Disciplined reconfigurability for the cloud
- Distributed FPGA accelerator within form-factor and cost constraints
- Architected interfaces and compiler; a compiler for software, not (just) hardware
[copyrighted figures of the datacenter and of the Catapult diagram from the ISCA 2014 paper omitted]

60 Memory must be cheap, reliable, efficient, fast, and large

61 Fast + large: hierarchy

62 Fast + large: hierarchical [diagram: TSV-stacked memory and additional memory on a silicon interposer; (c) Micron, Intel]

63 Large + efficient: heterogeneity

64 Large + efficient: heterogeneous [diagram: TSV-stacked DRAM + NVRAM on a silicon interposer; (c) Micron, Intel]

65 Heterogeneous + hierarchical: a big mess. Research is just starting, new mechanisms are needed, and there are very interesting tradeoffs with reliability.

66 Fast + efficient control: hierarchy + parallelism

67 Fast + efficient: hierarchy + parallelism. Access a cell.

68 Fast + efficient: hierarchy + parallelism. Access many cells in parallel; many pins in parallel.

69 Fast + efficient: hierarchy + parallelism. Many pins in parallel over multiple cycles; access even more cells in parallel.

70 Fast + efficient: hierarchy + parallelism. Many pins in parallel over multiple cycles; access even more cells in parallel.
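Rough numbers behind this picture (typical DDR-style parameters, assumed here rather than taken from the slides): a x8 DRAM chip bursts 8 beats over its 8 data pins, so a rank of 8 chips delivers

    8\,\mathrm{chips} \times 8\,\mathrm{bits} \times 8\,\mathrm{beats} = 512\,\mathrm{bits} = 64\,\mathrm{B}

per access, while the row activated across the rank is on the order of 8 kB. A single 64 B request thus uses under 1% of the bits the parallel activation touched, which is the waste the next slide points to.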

71 Hierarchy + parallelism: potential waste

72 Is fine-grained access better? Wasteful when all data is needed. [diagram: a coarse-grained (CG) access fetches eight data sub-blocks C0-C7 with one shared ECC block, while a fine-grained (FG) access fetches a single sub-block D0 with its own ECC E0]
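The storage cost behind this tradeoff, with illustrative numbers consistent with the figure (eight sub-blocks per line, and per-sub-block check symbols of comparable strength in the fine-grained case):

    \text{CG: } \frac{8\,\mathrm{B\ ECC}}{64\,\mathrm{B\ data}} = 12.5\%,
    \qquad
    \text{FG: } \frac{8 \times 8\,\mathrm{B\ ECC}}{8 \times 8\,\mathrm{B\ data}} = 100\%.

Fine-grained access saves bandwidth on sparse traffic but can multiply the ECC overhead severalfold, hence the case for adapting the granularity.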

73 Dynamically adapt granularity: the best of both worlds. [diagram: a sector cache (8 x 8 B sectors per line) with a spatial-locality predictor (SLP) per core; co-scheduling of CG and FG requests, with CG splitting, at the last-level cache and memory controller; sub-ranked DRAM with 2x address-bus support for both CG and FG access]

74 Dynamically adapt granularity [chart: system throughput of CG+ECC, FG+ECC, and adaptive-granularity AG+ECC on mcf, omnetpp, mst, em3d, canneal, SSCA2, linked list, and gups (x8 each), plus mixes MIX-A through MIX-D]

75 Cost + poor reliability. Efficiency + poor reliability. New + poor reliability.

76 Compensate with error protection: ECC

77 Compensate with proportional protection

78 Adaptive reliability: adapt the mechanism; adapt the error rate vs. precision (stochastic precision)

79 Adaptive reliability: by type (application guided) and by location (machine-state guided)

80 Example: Virtualized ECC, adapting the mechanism

81 [diagram: virtual pages x, y, and z, with low, medium, and high protection levels, map through the virtual-page-to-physical-frame mapping onto page frames holding data and ECC]

82 [diagram: as above, plus a physical-frame-to-ECC-page mapping that places stronger ECC for page frames y and z in separate ECC pages]

83 [diagram: as above, with data covered by a first-tier code (T1EC) and the ECC pages holding second-tier codes: T2EC for chipkill on page y, T2EC for double chipkill on page z]
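A minimal sketch of how such a physical-frame-to-ECC-page mapping might be computed (hypothetical structures and sizes; the real Virtualized ECC design uses OS-managed tables and cacheable T2EC storage):

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    // Hypothetical per-frame metadata: protection level plus the base address
    // of the ECC page holding this frame's second-tier code (T2EC).
    enum class Level { Low, Medium, High };  // e.g., detect / chipkill / double chipkill

    struct FrameInfo {
        Level level;
        std::uint64_t t2ec_page_base;        // 0 if no second tier is kept
    };

    // Bytes of T2EC per 64 B data block, growing with protection (illustrative sizes).
    constexpr std::uint64_t t2ec_bytes(Level l) {
        return l == Level::Low ? 0 : (l == Level::Medium ? 4 : 8);
    }

    // Map a physical data address to the address of its T2EC storage.
    std::uint64_t t2ec_addr(std::uint64_t paddr,
                            const std::unordered_map<std::uint64_t, FrameInfo>& map) {
        const std::uint64_t pfn   = paddr >> 12;          // 4 kB frames
        const std::uint64_t block = (paddr & 0xFFF) / 64; // 64 B blocks within the frame
        const FrameInfo& fi = map.at(pfn);
        return fi.t2ec_page_base + block * t2ec_bytes(fi.level);
    }

    int main() {
        std::unordered_map<std::uint64_t, FrameInfo> map;
        map[0x12345] = {Level::High, 0x8000000};
        const std::uint64_t a = (0x12345ULL << 12) | 0x40;  // block 1 of frame 0x12345
        std::cout << std::hex << t2ec_addr(a, map) << "\n"; // prints 8000008
    }

The point of the indirection is that frames with weak protection carry no second-tier storage at all, so the ECC cost is proportional to the protection actually requested.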

84 Two-tiered ECC: writes encode both T1EC and T2EC; reads decode only T1EC. [diagram: data with T1EC and T2EC]

85 Virtualized T2EC: writes encode T1EC/T2EC, with the T2EC mapped into memory; reads decode only T1EC. [diagram: data with T1EC; T2EC stored via the mapping]

86 Caching works great [chart: normalized execution time for chipkill-detect, chipkill-correct, and double-chipkill-correct]

87 Bonus: flexibility of ECC [chart: normalized energy-delay product for chipkill-detect, chipkill-correct, and double-chipkill-correct]

88 Can we do even better? Hide the second tier.

89 FrugalECC = compression + VECC

90 Adapt the resilience scheme, or adapt the storage?

91 Adapt the compression scheme too! Coverage-oriented Compression (CoC)

94 Experiments are promising: match or exceed reliability, improve power, minimal impact on performance

97 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. Architecture so far: proportionality, adaptivity, SW-HW co-tuning, heterogeneity, locality, parallelism, hierarchy.

98 The reliability challenge [chart: projected MTTI in hours, on a log scale from 1000 down past 100, broken into hard, soft, intermittent, and SW errors plus the overall rate, falling steadily as expected peak performance grows from 15 PF through 40 PF (2015), 130 PF (2017), and 450 PF (2019) to 1500 PF (2021)]
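The downward trend follows from scale alone. To first order (a textbook approximation, not the slide's model), a system of N identical components fails as often as its components combined:

    \mathrm{MTTI}_{\text{system}} \approx \frac{\mathrm{MTTF}_{\text{component}}}{N},
    \qquad \text{e.g.}\ \frac{25\,\text{yr} \times 8760\,\text{h/yr}}{10^{5}\ \text{nodes}} \approx 2.2\,\text{h}.

So even very reliable nodes yield a system that is interrupted every few hours at exascale component counts.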

99 Detect. Contain. Repair. Recover.

100 Today's techniques are not good enough: hardware support is expensive, and checkpoint-restart doesn't scale.
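Why global checkpoint-restart stops scaling can be seen from the standard Young/Daly first-order model (a well-known result, not from the slides): with checkpoint cost \delta and system MTTI M, the optimal checkpoint interval is

    \tau_{\text{opt}} \approx \sqrt{2\,\delta\,M},

so the fraction of time lost to checkpointing and rework grows roughly as \sqrt{2\delta/M}. As M shrinks with scale while \delta grows with memory size, that fraction heads toward one: with \delta = 10 minutes and M = 2 hours, \tau_{\text{opt}} \approx 49 minutes, and checkpointing alone already consumes about 20% of the machine.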

101 Exascale resilience must be: proportional, parallel, adaptive, hierarchical, analyzable

102 Containment domains (CDs)
- Embed resilience within the application
- Match system hierarchy and characteristics
- Analyzable abstraction, consistent across layers
- CDs don't communicate erroneous data
- CDs have means to recover
[diagram: a root CD with child CDs]

103 Phalanx C++ program with CD annotations: a resiliency model layered on an efficiency-oriented programming model.

    int main(int argc, char **argv) {
      main_task here = phalanx::initialize(argc, argv);
      // Create test arrays here ...
      // Launch kernel on the default CPU ("host")
      openmp_event e1 = async(here, here.processor(), n)(host_saxpy, 2.0f, host_x, host_y);
      // Launch kernel on the default GPU ("device")
      cuda_event e2 = async(here, here.cuda_gpu(0), n)(device_saxpy, 2.0f, device_x, device_y);
      wait(here, e1 & e2);
      return 0;
    }

Software stack: CD API (resiliency interface) on top of the runtime library interface (CD control and persistence), over a microkernel and hardware abstraction layer, down to the machine error reporting architecture (ECC, status).

104 Mapping example: SpMV over matrix M and vector V.

    void task<inner> SpMV(in M, in V_i, out R_i) {
      forall(...) reduce(...)
        SpMV(M[...], V_i[...], R_i[...]);
    }

    void task<leaf> SpMV(...) {
      for r = 0..n
        for c = rows[r]..rows[r+1] {
          res_i[r] += data[c] * V_i[cidx[c]];
          prevc = c;
        }
    }

105 Mapping example: SpMV. The matrix M is blocked into M00, M01, M10, M11 and the vector V into V0, V1; the same inner and leaf tasks apply.

106 Mapping example: SpMV, distributed to 4 nodes holding (M00, V0), (M10, V0), (M01, V1), and (M11, V1).

107 Mapping example: SpMV, distributed to 4 nodes, with the same task code as above.

108 Mapping example: SpMV [diagram: the parent CD preserves M and V; each child CD preserves its block (or relies on the parent's preservation), detects, and recovers; detection and recovery can also escalate to the parent]

109 The CD API: begin, configure, preserve, check, complete.

    void task<inner> SpMV(in M, in V_i, out R_i) {
      cd = begin(parentcd);
      preserve_via_copy(cd, matrix, ...);
      forall(...) reduce(...)
        SpMV(M[...], V_i[...], R_i[...]);
      complete(cd);
    }

    void task<leaf> SpMV(...) {
      cd = begin(parentcd);
      preserve_via_copy(cd, matrix, ...);
      preserve_via_parent(cd, vec_i, ...);
      for r = 0..n
        for c = rows[r]..rows[r+1] {
          res_i[r] += data[c] * V_i[cidx[c]];
          check { fault<fail>(c > prevc); }
          prevc = c;
        }
      complete(cd);
    }
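A toy, self-contained C++ mock of the begin / preserve / check / complete pattern (hypothetical types and helpers; the real CD runtime additionally handles hierarchy, mapping, and hardware error reporting):

    #include <iostream>
    #include <stdexcept>
    #include <vector>

    // Thrown by a failed check; triggers rollback and re-execution of the CD body.
    struct cd_fault : std::runtime_error {
        cd_fault() : std::runtime_error("containment-domain fault") {}
    };

    // check(): fault unless the asserted condition holds.
    void check(bool ok) { if (!ok) throw cd_fault{}; }

    // run_cd(): preserve the state on entry, run the body, and on a fault
    // restore the preserved copy and re-execute (begin/preserve/complete in one).
    template <typename Body>
    void run_cd(std::vector<double>& state, Body body) {
        const std::vector<double> preserved = state;        // preserve_via_copy
        for (;;) {
            try { body(state); return; }                    // compute + check, complete
            catch (const cd_fault&) { state = preserved; }  // recover: restore, re-run
        }
    }

    int main() {
        std::vector<double> res(4, 0.0);
        int attempt = 0;
        run_cd(res, [&](std::vector<double>& r) {
            for (double& v : r) v += 1.0;
            ++attempt;
            check(attempt > 1);  // inject one detectable fault on the first try
        });
        std::cout << "completed after " << attempt << " attempts, res[0]=" << res[0] << "\n";
    }

The key property the mock preserves: a fault never escapes the domain, because the body's effects are rolled back to the preserved copy before re-execution.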

110 SpMV preservation tuning: exploit the hierarchy and the natural redundancy of the distribution (V0 and V1 each live on two nodes), so leaf CDs can preserve less than a full copy.

111 Concise abstraction for complex behavior: the same preserve calls can be satisfied by a local copy or regeneration, by a sibling, or by the parent (for unchanged data).

112 Recovery is concurrent, hierarchical, and adaptive [timeline: preserve, compute, detect, and the re-execution overhead]

113 Analyzable: an application model (tree), an overhead model, and a fault model together predict execution time.

114 Analyzable. Let $p_c$ be the failure probability of one child-CD execution, with $m$ serial CDs per child and $n$ parallel children:

$q(0, m, n) = (1 - p_c)^{mn}$: the probability that all child CDs experienced no failures.

$q(x, m, n) = \left( \sum_{i=0}^{x} \binom{i+m-1}{i} \, p_c^{\,i} (1 - p_c)^{m} \right)^{\!n}$: the probability that every child experienced at most $x$ failures across its $m$ serial CDs.

$d(0, m, n) = q(0, m, n)$: the probability that all child CDs experienced exactly 0 failures.

$d(x, m, n) = q(x, m, n) - q(x-1, m, n)$: the probability that the sibling with the most failures experiences exactly $x$ failures.

$T_{\mathrm{parent}}(m, n) = \sum_{i=0}^{\infty} (i + m) \, T_c \, d(i, m, n)$: the expected parent time, where $T_c$ is the time of one child CD.

The model extends to parallel domains, execution vs. re-execution, heterogeneous CDs, and asymmetric preservation and restoration.
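The model is easy to evaluate numerically; a small C++ sketch under the reconstruction above (parameter values are illustrative):

    #include <cmath>
    #include <cstdio>

    // ln C(n, k) via lgamma, to keep large binomials numerically stable.
    double log_binom(int n, int k) {
        return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
    }

    // q(x, m, n): probability that each of n children sees at most x failures
    // while completing its m serial CDs, each failing independently with prob p.
    double q(int x, int m, int n, double p) {
        double s = 0.0;
        for (int i = 0; i <= x; ++i)
            s += std::exp(log_binom(i + m - 1, i) + i * std::log(p) + m * std::log1p(-p));
        return std::pow(s, n);
    }

    int main() {
        const double p_c = 1e-3, T_c = 1.0;  // per-CD failure prob, child CD time
        const int m = 10, n = 100;           // serial CDs per child, parallel children
        double t = 0.0, prev_q = 0.0;
        for (int i = 0; i <= 1000; ++i) {    // truncate once d(i) is negligible
            const double qi = q(i, m, n, p_c);
            t += (i + m) * T_c * (qi - prev_q);  // (i + m) * T_c * d(i, m, n)
            prev_q = qi;
            if (1.0 - qi < 1e-12) break;
        }
        std::printf("expected parent time: %.4f (failure-free time: %d)\n", t, m);
    }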

115 Analyzable? [execution-time results chart]

116 CDs are scalable [charts: performance efficiency (0-100%) of CDs vs. hierarchical (h-cpr) and global (g-cpr) checkpoint-restart for NT, SpMV, and HPCCG, at peak system performance from 2.5 PF to 2.5 EF]

117 CDs are proportional and efficient [charts: energy overhead (0-20%) of CDs vs. h-cpr and g-cpr for NT, SpMV, and HPCCG, at peak system performance from 2.5 PF to 2.5 EF]

118 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. It's all about the algorithm.

119 Architecture goals: balance possible and practical; enable generality; don't hurt the common case. Architecture so far: proportionality, adaptivity, SW-HW co-tuning + analyzability, heterogeneity, locality, parallelism, hierarchy; applied across arithmetic, control, memory, reliability, and programming.

120 Exascale computers are lunacy

121 Exascale computers are science, not lunacy

122 Harbingers of things to come. Discovery awaits. Enable and promote open innovation. Math + systems + engineering + technology.
