Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care
Mattan Erez, The University of Texas at Austin
2. Arch-focused whole-system approach
- Efficiency requirements demand crossing layers
- Algorithms are key: compute less, move less, store less
- Proportional systems: minimize waste
- Utilize and improve emerging technologies
- Explore (and act) up and down the stack, from programming model to circuits; preferably implement at the micro-architecture/system level

3. Big problems and emerging platforms
- Memory systems: capacity, bandwidth, and efficiency are impossible to balance; adaptive and dynamic management helps; new technologies to the rescue? Opportunities for in-memory computing
- GPUs, supercomputers, clouds, and more: throughput-oriented designs are a must; the centralization trend is interesting
- Reliability and resilience: more, smaller devices risk poor reliability; trimmed margins leave less room for error; hard constraints make efficiency a must; scale exacerbates reliability and resilience concerns

4. Lots of interesting multi-level projects: resilience/reliability, memory, GPU/CPU

5. A superscale computer is a high-performance system for solving big problems
6-7. A supercomputer is a high-performance system for solving big, cohesive problems

8-9. Simulate, analyze, predict
10. Supercomputers are general, not domain-specific; cloud computers are general too; algorithms keep changing

11. Superscale computers must balance compute, storage, and communication against OpEx and CapEx constraints

12. Super is never super enough
13. [Chart: sustained FLOP/s over time, from 1E+8 through 1E+18, for the number 1, number 10, and number 500 systems, against idealized VLSI scaling]
14. What's keeping us from exascale?

15. The power problem. [Chart: total power (kW) and efficiency (GFLOPS/kW) of top systems]

16. Solution: build more efficient systems
17. Exascale computers are a lunacy:
- Crazy big (~1M processors, >10 MW)
- Need the efficiency of embedded devices
- Effective for many domains
- Fully programmable

18. Why should you care? Harbingers of things to come; discovery awaits; enable and promote open innovation

19. Where does the power go?

20. Cooling and infrastructure: amazing mechanical and electrical systems, getting close to optimal efficiency. Example (Titan): 17 PFLOP/s, 18,688 nodes, >8 MW, ~200 cabinets, ~400 m^2 of floor space

21. Actual processing: arithmetic, memory, communication, control, I/O, idle/margin. How much goes to each component?

22. The budget: power, 20 MW
23. 20 MW at 1 exaFLOP/s means an energy budget of 20 pJ/op, or 50 GFLOPS/W sustained

24. For comparison, the best supercomputer today runs at ~300 pJ/op
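Making the slide's arithmetic explicit:

\frac{20\ \mathrm{MW}}{10^{18}\ \mathrm{FLOP/s}} = \frac{2\times 10^{7}\ \mathrm{J/s}}{10^{18}\ \mathrm{FLOP/s}} = 20\ \mathrm{pJ/FLOP}, \qquad \frac{10^{18}\ \mathrm{FLOP/s}}{2\times 10^{7}\ \mathrm{W}} = 50\ \mathrm{GFLOPS/W}

so today's ~300 pJ/op must improve by roughly 15x.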
25. Arithmetic: the 64-bit floating-point operation. [Chart: rough estimated energy per op today, with technology scaling, and with research techniques, versus the headroom needed]

26. Enough headroom?
27. Unfortunately, hard tradeoffs. Rough energies at a 10 nm process:
- 64-bit DP floating-point op: 15 pJ
- 256-bit access to an 8 KB SRAM: 15 pJ
- 256-bit bus across 20 mm of chip: 80 pJ
- Efficient off-chip link: 100-500 pJ
- DRAM Rd/Wr: 2 nJ
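To see what these costs mean for a whole operation, here is a back-of-envelope sketch (my own illustration using the rough energies above, not a measurement): the energy of one double-precision y[i] += a*x[i] element when operands sit in SRAM versus DRAM.

#include <cstdio>

int main() {
    // Rough per-operation energies from the slide above (10 nm class).
    const double flop_pj = 15.0;    // 64-bit DP floating-point op
    const double sram_pj = 15.0;    // 256-bit access to a small SRAM
    const double dram_pj = 2000.0;  // DRAM read/write (2 nJ)

    // y[i] += a * x[i]: 2 FLOPs; read x[i] and y[i], write y[i].
    double from_sram = 2 * flop_pj + 3 * sram_pj;
    double from_dram = 2 * flop_pj + 3 * dram_pj;

    std::printf("SRAM-resident: %4.0f pJ/element (%6.1f pJ/FLOP)\n",
                from_sram, from_sram / 2);
    std::printf("DRAM-resident: %4.0f pJ/element (%6.1f pJ/FLOP)\n",
                from_dram, from_dram / 2);
    // Against the 20 pJ/op exascale budget, DRAM-resident data overshoots
    // by roughly 150x; locality is not optional.
    return 0;
}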
28. Need more headroom: minimize waste

29. Do we care about single-unit performance? Must all results be equally precise? Must all results be correct? Lunacy?
30. Relaxed reliability and precision:
- Some lunacy: rare, easy-to-detect errors + parallelism
- Lunatic fringe: bounded imprecision
- Lunacy: live with real, unpredictable errors
[Chart: rough estimated arithmetic energies for illustration purposes, with each relaxation buying additional headroom]
31-32. Bounded approximate duplication
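A minimal sketch of how bounded-approximate duplication might look in software (my own illustration of the concept; the names and the tolerance are assumptions, not the talk's design): duplicate the operation in cheap low-precision arithmetic and flag only divergence larger than low-precision rounding can explain.

#include <cmath>
#include <cstdio>

// Full-precision op plus a cheap low-precision duplicate; an error is
// flagged only when the results diverge beyond the approximation bound.
double fma_checked(double a, double x, double y) {
    double r = a * x + y;                                // full-precision result
    float r_dup = (float)a * (float)x + (float)y;        // approximate duplicate
    double bound = 1e-5 * (std::fabs(r) + 1.0);          // tolerance for float rounding
    if (std::fabs(r - (double)r_dup) > bound)
        std::fprintf(stderr, "possible hardware error detected\n");
    return r;
}

int main() {
    std::printf("%f\n", fma_checked(2.0, 3.0, 4.0));     // 10.0, no alarm
}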
33. Supercomputers are general: dynamic adaptivity and tuning; some programmers are crazier than others; a large effort in error-tolerant algorithms

34. Proportionality and overprovisioning. [Diagram: arithmetic/control/memory/margin mix for dense, latency-bound, and sparse workloads]

35. Architecture goals: balance the possible and the practical; enable generality; don't hurt the common case. It's all about the algorithm

36. Architecture goals (as above). Architecture concepts so far: proportionality, adaptivity, SW-HW co-tuning, locality, parallelism

37. How to control a billion arithmetic units?
38-39. Hierarchy minimizes control cost by amortizing control decisions. Program hierarchy: program, teams, processes, threads, instructions. Machine hierarchy: machine, cabinets/modules, nodes, cores, HW units
40. What does a HW instruction do? [Diagram: CPU, GPU, and embedded designs; each step amortizes/specializes more per instruction]

41. Single-thread or bulk-thread: the latency-oriented CPU versus the throughput-oriented GPU
42. Heterogeneity is a necessity: disciplined specialization

43. Must balance generality and control cost: over-specialization stifles innovation, and algorithms are more important than hardware

44. Heterogeneity is lunacy: programming and tuning are extremely tricky, and algorithms are diverse and rapidly evolving

45. Disciplined heterogeneity, scarcity of choice: throughput-oriented cores, latency-oriented cores, and disciplined reconfigurable accelerators

46. Heterogeneous computing is already here: GPU for throughput, CPU for latency, abstracted FPGA accelerators
47. Heterogeneous computing is already here: a Titan node (Cray XK7) pairs an NVIDIA GPU (Kepler) with an AMD CPU (Opteron); a Stampede node pairs an Intel Xeon CPU (Sandy Bridge) with a Xeon Phi. [Pictures of the XK7 and Stampede nodes and of the Kepler, Opteron, Sandy Bridge, and Xeon Phi dies; (c) AMD, Cray, Intel, NVIDIA, XSEDE]
48-49. Choose the most efficient core. [Diagram: arithmetic/control/memory/margin breakdown per core type]

50-52. Generalize the throughput core
53. Synchronization may impair execution: predict when it is necessary (a robust optimization). Adaptive compaction: the compaction barrier
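To make the divergence problem and the payoff of compaction concrete, here is a toy calculation (my own, with an assumed 32-wide warp and branch pattern, not the talk's mechanism): count the per-path passes a SIMT machine executes with masked divergence versus after regrouping threads by branch direction at a compaction barrier.

#include <cstdio>
#include <vector>

int main() {
    const int warp = 32;
    // 1024 threads; a thread takes the 'then' path unless its index is
    // divisible by 3, so every warp contains both directions.
    std::vector<bool> taken(1024);
    for (int i = 0; i < 1024; ++i) taken[i] = (i % 3) != 0;

    // Without compaction: each warp serially executes every path that is
    // present in it, masking off the inactive lanes.
    int passes_div = 0;
    for (int w = 0; w < 1024 / warp; ++w) {
        bool any_t = false, any_f = false;
        for (int l = 0; l < warp; ++l)
            (taken[w * warp + l] ? any_t : any_f) = true;
        passes_div += any_t + any_f;    // one pass per path present
    }

    // With compaction: regroup threads by branch direction at the barrier,
    // so warps are (mostly) uniform and execute a single path each.
    int t = 0;
    for (bool b : taken) t += b;
    int passes_cmp = (t + warp - 1) / warp + (1024 - t + warp - 1) / warp;

    std::printf("divergent: %d passes, compacted: %d passes\n",
                passes_div, passes_cmp);
}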
54. Architecture goals (as above). Architecture so far: proportionality, adaptivity, heterogeneity, SW-HW co-tuning, locality, parallelism, hierarchy

55. Heterogeneity is lunacy: programming and tuning are extremely tricky, and algorithms are diverse and rapidly evolving. How can we program these machines?

56. Hierarchical languages restore some sanity: abstract the key tuning knobs of parallelism, locality, hierarchy, and throughput
57. Hierarchical languages restore some sanity: CUDA and OpenCL; Sequoia (Stanford); Habanero (Rice); Phalanx (NVIDIA); Legion (Stanford); the ideas are working into X10 and Chapel. Domain-specific languages are even better
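The flavor of these languages, reduced to plain C++ (a sketch of the common pattern, not any one language's API): the program is a hierarchy of tasks, and the per-level fanout and leaf grain size are the exposed tuning knobs that map the program hierarchy onto the machine hierarchy.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Tuning knobs a hierarchical language would expose per machine level.
struct Level { int fanout; };
const Level machine[] = {{4}, {8}};   // hypothetical: 4 nodes x 8 cores

double hsum(const double* x, size_t n, int level) {
    if (level == 2 || n < 1024)       // leaf task: runs on one core
        return std::accumulate(x, x + n, 0.0);
    int k = machine[level].fanout;    // inner task: spawn k children,
    size_t chunk = (n + k - 1) / k;   // like forall(...) reduce(...) in
    double s = 0.0;                   // the slides' SpMV example
    for (int i = 0; i < k; ++i) {
        size_t b = (size_t)i * chunk;
        if (b < n) s += hsum(x + b, std::min(chunk, n - b), level + 1);
    }
    return s;
}

int main() {
    std::vector<double> x(1 << 20, 0.5);
    std::printf("sum = %.1f\n", hsum(x.data(), x.size(), 0));
}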
58. A counterexample: D. E. Shaw Research's Anton, a machine specialized for molecular-dynamics simulation. [Pictures of the Anton package and system and a molecular-dynamics figure]

59. Another counterexample? Microsoft Catapult: disciplined reconfigurability for the cloud; a distributed FPGA accelerator within form-factor and cost constraints; architected interfaces and a compiler aimed at software, not (just) hardware. [Datacenter photo and Catapult diagram from the ISCA 2014 paper]
60. Memory must be cheap, reliable, efficient, fast, and large

61. Fast + large: hierarchy

62. Fast + large: a hierarchical memory. [Diagram: TSV-stacked memory on a silicon interposer alongside conventional memory]

63. Large + efficient: heterogeneity

64. Large + efficient: heterogeneous memory. [Diagram: TSV stacks on a silicon interposer combining DRAM + NVRAM]

65. Heterogeneous + hierarchical: a big mess. Research is just starting, new mechanisms are needed, and there are very interesting tradeoffs with reliability

66. Fast + efficient: control hierarchy + parallelism
67-70. Fast + efficient: hierarchy + parallelism. An access reads a cell; for efficiency it reads many cells in parallel and moves them over many pins in parallel; and by bursting over multiple cycles per access, it reads even more cells in parallel
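For scale, the standard arithmetic behind a modern (DDR3-style) access, which is background knowledge rather than from the slide: a 64-bit channel bursting over 8 beats moves

64\ \mathrm{bits/beat} \times 8\ \mathrm{beats} = 512\ \mathrm{bits} = 64\ \mathrm{B}

per access, regardless of how few of those bytes the program actually needs.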
71. Hierarchy + parallelism: potential waste

72. Is fine-grained access better? It is wasteful when all the data is needed. [Diagram: a coarse-grained access issues one command and moves the full data block plus one ECC word; a fine-grained access issues a command per 8 B sub-block (C0..C7), each with its own ECC word (D0 E0 ... D7 E7)]
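Putting rough numbers on both directions of waste (my arithmetic over the slide's layout of eight 8 B sub-blocks per 64 B line): a coarse-grained access that needs only one 8 B word still moves the full line plus its ECC word,

\frac{8\ \mathrm{B\ useful}}{64\ \mathrm{B\ data} + 8\ \mathrm{B\ ECC}} \approx 11\%\ \text{useful traffic},

while a fine-grained pattern that turns out to need the whole line pays command and ECC overhead eight times, once per sub-block, instead of once per line. Neither granularity wins everywhere, which is the case for adapting dynamically (next slide).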
73. Dynamically adapt granularity for the best of both worlds. [Diagram: cores with sector caches (eight 8 B sectors per line) and a spatial locality predictor (SLP); the last-level cache co-schedules CG/FG requests with split CG; the memory controller drives sub-ranked DRAM with a 2x address bus, supporting both CG and FG access]
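A minimal sketch of the predictive half (my own toy; the real design pairs a sector cache with a spatial locality predictor, per the slide): track per load PC how many 8 B sectors of each line were touched before eviction, and choose coarse or fine fetches for that PC's future misses.

#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct SLP {
    // Per-PC estimate of sectors used per line (1..8).
    std::unordered_map<uint64_t, int> used;

    bool fetch_coarse(uint64_t pc) {                // decision on a miss
        auto it = used.find(pc);
        return it == used.end() || it->second >= 4; // default to coarse
    }
    void train(uint64_t pc, int sectors_touched) {  // on line eviction
        int& u = used.emplace(pc, 8).first->second;
        u += (sectors_touched > u) - (sectors_touched < u); // drift toward observed
    }
};

int main() {
    SLP slp;
    uint64_t pc = 0x400123;
    for (int i = 0; i < 8; ++i) slp.train(pc, 1);   // sparse accesses observed
    std::printf("fetch coarse? %s\n", slp.fetch_coarse(pc) ? "yes" : "no");
}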
74. Dynamically adapting granularity works. [Chart: system throughput of CG+ECC, FG+ECC, and adaptive AG+ECC on mcf, omnetpp, mst, em3d, canneal, SSCA2, linked list, and gups (x8 each) plus mixes MIX-A through MIX-D]
75. Low cost, efficiency, and new technology each come with poor reliability

76. Compensate with error protection: ECC

77. Compensate with proportional protection

78. Adaptive reliability: adapt the mechanism; adapt the error rate and precision (stochastic precision)

79. Adaptive reliability: by type (application-guided) and by location (machine-state-guided)

80. Example: Virtualized ECC, adapting the mechanism
81-83. [Diagrams: virtual pages x, y, and z carry low, medium, and high protection levels and map to physical frames. A low-protection page stores data only. For higher levels, a physical-frame-to-ECC-page mapping adds redundancy: page y pairs its data with stronger ECC in a separate ECC page; page z keeps a first-tier code (T1EC) for chipkill with its data and virtualizes a second-tier code (T2EC) for chipkill or double-chipkill correction in its ECC page]
84. Two-tiered ECC: writes encode both T1EC and T2EC; reads decode only the T1EC in the common case. [Diagram: data stored with T1EC and T2EC]

85. Virtualized T2EC: the second-tier code moves out of the data line and is reached through the T2EC mapping, off the common read path. [Diagram: data + T1EC, with T2EC located via the mapping]
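A toy of the two-tier structure (my own sketch; real Virtualized ECC uses proper codes such as chipkill symbols, while this stand-in uses a checksum for T1 and a replica for T2): the tier-1 code travels with the data and only detects; the tier-2 code lives in a separate ECC page found through a frame-to-ECC-page map and is touched only on the rare detected error.

#include <array>
#include <cstdint>
#include <cstdio>
#include <map>

using Line = std::array<uint8_t, 64>;

uint8_t t1ec(const Line& l) {              // tier-1: cheap detect-only code
    uint8_t x = 0;
    for (uint8_t b : l) x ^= b;
    return x;
}

struct Memory {
    std::map<uint64_t, Line> data;         // data lines
    std::map<uint64_t, uint8_t> t1;        // T1EC stored alongside the data
    std::map<uint64_t, uint64_t> ecc_page; // frame -> ECC page (OS-managed map)
    std::map<uint64_t, Line> t2;           // tier-2 storage in ECC pages

    void write(uint64_t a, const Line& l) {
        data[a] = l;
        t1[a] = t1ec(l);
        t2[ecc_page.at(a >> 12) * 4096 + (a & 4095)] = l;  // update T2EC
    }
    bool read(uint64_t a, Line& out) {
        out = data[a];
        if (t1ec(out) == t1[a]) return true;    // common case: no extra access
        // Rare detected error: fetch tier-2 through the mapping and correct.
        out = t2.at(ecc_page.at(a >> 12) * 4096 + (a & 4095));
        return t1ec(out) == t1[a];
    }
};

int main() {
    Memory m;
    m.ecc_page[0] = 7;                     // map frame 0 to ECC page 7
    Line l{};
    l[0] = 42;
    m.write(0x40, l);
    m.data[0x40][0] ^= 0xFF;               // inject an error
    Line r{};
    std::printf("recovered: %s, r[0]=%d\n", m.read(0x40, r) ? "yes" : "no", r[0]);
}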
86. Caching the second tier works great. [Chart: normalized execution time for chipkill detect, chipkill correct, and double-chipkill correct]

87. Bonus: flexibility of ECC. [Chart: normalized energy-delay product for chipkill detect, chipkill correct, and double-chipkill correct]

88. Can we do even better? Hide the second tier

89. FrugalECC = compression + VECC
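A sketch of the idea as I read it (my own toy with a trivial stand-in compressor, not the FrugalECC design): compress each line, and when the compressed data leaves enough room, let the ECC ride along in the same access; only incompressible lines fall back to the virtualized second tier.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Trivial run-length size estimate standing in for a real compressor.
size_t compressed_size(const std::vector<uint8_t>& line) {
    size_t sz = 2;                              // first (value, run) pair
    for (size_t i = 1; i < line.size(); ++i)
        if (line[i] != line[i - 1]) sz += 2;    // each new (value, run) pair
    return sz < line.size() ? sz : line.size();
}

int main() {
    const size_t ecc_bytes = 8;                 // ECC we want to hide in the line
    std::vector<uint8_t> zeros(64, 0), ramp(64);
    for (int i = 0; i < 64; ++i) ramp[i] = (uint8_t)i;

    for (const auto* l : {&zeros, &ramp}) {
        size_t c = compressed_size(*l);
        if (c + ecc_bytes <= 64)                // ECC rides along for free
            std::printf("compressed to %zu B: ECC inline, no extra access\n", c);
        else                                    // rare incompressible line
            std::printf("incompressible: ECC via the virtualized second tier\n");
    }
}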
90. Adapt the resilience scheme, or adapt the storage?

91. Adapt the compression scheme too: Coverage-oriented Compression (CoC)
94. Experiments are promising: match or exceed reliability, improve power, minimal impact on performance
97. Architecture goals (as above). Architecture so far: proportionality, adaptivity, SW-HW co-tuning, heterogeneity, locality, parallelism, hierarchy

98. The reliability challenge. [Chart: projected MTTI in hours for hard, soft, intermittent, and SW faults and overall, falling as expected system performance grows from 15 PF through 40 PF (2015), 130 PF (2017), 450 PF (2019), and 1500 PF (2021)]

99. Detect, contain, repair, recover
100. Today's techniques are not good enough: hardware support is expensive, and checkpoint-restart doesn't scale
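One standard first-order model (Young's approximation; textbook background, not from the slides) quantifies the scaling problem. With checkpoint cost \delta and system mean time between failures M, the optimal checkpoint interval is

\tau_{opt} \approx \sqrt{2 \delta M}

and the fraction of time lost to checkpointing plus re-executed work grows roughly as \sqrt{2\delta / M}. Scale pushes M down while \delta (the time to drain memory to stable storage) stays large, so an exascale machine under global checkpoint-restart would spend much of its time preserving and recovering state rather than computing.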
101. Exascale resilience must be proportional, parallel, adaptive, hierarchical, and analyzable

102. Containment domains (CDs): embed resilience within the application; match the system hierarchy and characteristics; provide an analyzable abstraction consistent across layers. CDs don't communicate erroneous data, and CDs have the means to recover. [Diagram: a root CD with nested child CDs]
103. A Phalanx C++ program with CD annotations. The stack pairs an efficiency-oriented programming model with a resiliency model: the CD API (resiliency interface), a runtime library interface (CD control and persistence), a microkernel and hardware abstraction layer (ECC, status), and the machine's error-reporting architecture.

int main(int argc, char **argv) {
  main_task here = phalanx::initialize(argc, argv);
  // Create test arrays here
  // Launch kernel on default CPU ("host")
  openmp_event e1 = async(here, here.processor(), n)
                         (host_saxpy, 2.0f, host_x, host_y);
  // Launch kernel on default GPU ("device")
  cuda_event e2 = async(here, here.cuda_gpu(0), n)
                       (device_saxpy, 2.0f, device_x, device_y);
  wait(here, e1 & e2);
  return 0;
}
104-107. Mapping example: SpMV. The matrix M is blocked into M00, M01, M10, and M11 against vector pieces V0 and V1, and the blocks are distributed to four nodes as (M00, V0), (M10, V0), (M01, V1), and (M11, V1). An inner task recursively spawns children over the blocks; a leaf task performs the local sparse multiply:

void task<inner> SpMV(in M, in V_i, out R_i) {
  forall(...) reduce(...)
    SpMV(M[...], V_i[...], R_i[...]);
}

void task<leaf> SpMV(...) {
  for (r = 0..N)
    for (c = rows[r]..rows[r+1]) {
      res_i[r] += data[c] * v_i[cidx[c]];
      prevc = c;
    }
}
108. Mapping SpMV onto CDs. [Diagram: the parent CD preserves M and V; each child CD preserves its blocks (or defers to the parent's copy), detects, and recovers locally; detection and recovery can also escalate to the parent]
109. The CD API: begin, configure, preserve, check, complete.

void task<inner> SpMV(in M, in V_i, out R_i) {
  cd = begin(parentcd);
  preserve_via_copy(cd, matrix, ...);
  forall(...) reduce(...)
    SpMV(M[...], V_i[...], R_i[...]);
  complete(cd);
}

void task<leaf> SpMV(...) {
  cd = begin(parentcd);
  preserve_via_copy(cd, matrix, ...);
  preserve_via_parent(cd, vec_i, ...);
  for (r = 0..N)
    for (c = rows[r]..rows[r+1]) {
      res_i[r] += data[c] * v_i[cidx[c]];
      check { fault<fail>(c > prevc); }
      prevc = c;
    }
  complete(cd);
}
110. SpMV preservation tuning: the hierarchy and the natural redundancy of the distribution (each V_i block lives on two nodes) let a leaf preserve less and restore from elsewhere. [Diagram: blocked M and V as above; leaf code as in slide 109]

111. A concise abstraction for complex behavior: the same preserve calls can be satisfied from a local copy (or regeneration), from a sibling, or from the parent (unchanged). [Leaf code as in slide 109]

112. Recovery is concurrent, hierarchical, and adaptive. [Timeline: preserve, compute, detect; on a fault, only the affected CD pays the re-execution overhead]

113. Analyzable: an application model (the CD tree), an overhead model, and a fault model together predict execution time
114. Analyzable re-execution time. For n parallel children, each running m serial CDs with per-CD failure probability p_c and per-CD time T_c:

q(0, m, n) = (1 - p_c)^{mn}  (probability that no child CD experiences a failure)

q(x, m, n) = \left( \sum_{i=0}^{x} \binom{i+m-1}{i} \, p_c^{\,i} \, (1 - p_c)^{m} \right)^{n}  (probability that every child experiences at most x failures in its m serial CDs)

d(0, m, n) = q(0, m, n), \quad d(x, m, n) = q(x, m, n) - q(x-1, m, n)  (probability that the worst-off sibling experiences exactly x failures)

T_{parent}(m, n) = \sum_{i \ge 0} (i + m) \, T_c \, d(i, m, n)

The model extends to parallel domains, heterogeneous CDs, and asymmetric preservation and restoration.
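A small numeric evaluation of the model (my own sketch; the parameters are illustrative assumptions, and the infinite sum is truncated once its tail is negligible):

#include <cmath>
#include <cstdio>

// q(x,m,n): probability that every one of n children sees at most x
// failures across its m serial CD executions.
double q(int x, int m, int n, double pc) {
    double sum = 0.0, comb = 1.0;                 // comb = C(i+m-1, i), i = 0
    for (int i = 0; i <= x; ++i) {
        sum += comb * std::pow(pc, i) * std::pow(1 - pc, m);
        comb = comb * (i + m) / (i + 1);          // next binomial coefficient
    }
    return std::pow(sum, n);
}

int main() {
    const int m = 100, n = 1000;                  // serial CDs, parallel children
    const double pc = 1e-4, Tc = 1.0;             // per-CD failure prob and time
    double t = 0.0;
    for (int i = 0; i < 200; ++i) {               // T_parent = sum (i+m) Tc d(i,m,n)
        double d = (i == 0) ? q(0, m, n, pc)
                            : q(i, m, n, pc) - q(i - 1, m, n, pc);
        t += (i + m) * Tc * d;
    }
    std::printf("expected parent time: %.2f (vs %.0f failure-free)\n",
                t, (double)m * Tc);
}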
115. Analyzable? [Chart: execution time]
116. CDs are scalable. [Charts: performance efficiency of NT, SpMV, and HPCCG from 2.5 PF to 2.5 EF peak system performance; CDs sustain high efficiency while hierarchical (h-cpr) and global (g-cpr) checkpoint-restart fall off]

117. CDs are proportional and efficient. [Charts: energy overhead (0-20%) of NT, SpMV, and HPCCG from 2.5 PF to 2.5 EF for CDs versus h-cpr and g-cpr]
118. Architecture goals: balance the possible and the practical; enable generality; don't hurt the common case. It's all about the algorithm

119. Architecture so far: proportionality, adaptivity, SW-HW co-tuning + analyzability, heterogeneity, locality, parallelism, hierarchy, applied across arithmetic, control, memory, reliability, and programming

120-121. Exascale computers are lunacy... and science

122. Harbingers of things to come; discovery awaits; enable and promote open innovation. Math + systems + engineering + technology