Steve Scott, Tesla CTO. SC11, November 15, 2011
What goal do these products have in common? Performance / W
Exaflop Expectations

[Chart: power of landmark systems over time, from the CM5 (~200 kW) to the K Computer (~10 MW) to a projected first exaflop computer. Note: not constant size, cost, or power.]
The Road Ahead is Steep

In the good old days, leakage was not important and voltage scaled with feature size. Halving the feature size L meant:

    L' = L/2
    V' = V/2
    E' = C'V'^2 = E/8   (capacitance C also halves)
    f' = 2f
    D' = 1/L'^2 = 4D
    P' = P

Halve L and get 4x the transistors and 8x the capability for the same power! That carried us from MF to GF to TF, and almost to PF. Technology was giving us 68% per year in perf/W; processors realized ~50% per year in perf/W (the rest was spent on single-thread performance).

The new reality: leakage has put a floor under the threshold voltage, largely ending voltage scaling. Now, halve L and get only 2x the capability for the same power. Technology will give us only 19% per year in perf/W.
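Where the 68% and 19% figures come from is worth one reasoning step. Assuming (my assumption, not stated on the slide) that L shrinks by a factor of 1/sqrt(2) every two years per Moore's law, a back-of-the-envelope check in LaTeX:

    \documentclass{article}
    \usepackage{amsmath}
    \begin{document}
    % Dennard era: energy/op E = CV^2 scales as L^3 (C and V both scale with L).
    \[
      \frac{E'}{E} = \left(\frac{1}{\sqrt{2}}\right)^{3} \approx \frac{1}{2.83}
      \quad\Rightarrow\quad \text{perf/W grows } 2.83\times \text{ per 2 years} = 1.68^{2}
      \quad\Rightarrow\quad \approx 68\%\ \text{per year}.
    \]
    % Post-Dennard: V is stuck, so E = CV^2 scales only as L.
    \[
      \frac{E'}{E} = \frac{1}{\sqrt{2}}
      \quad\Rightarrow\quad \text{perf/W grows } \sqrt{2}\times \text{ per 2 years} = 1.19^{2}
      \quad\Rightarrow\quad \approx 19\%\ \text{per year}.
    \]
    \end{document}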
The Future Belongs to the Efficient

Chips have become power (not area) constrained: density increases quadratically as feature size shrinks, but energy/op decreases only linearly.
Which Takes More Energy?

Performing a 64-bit floating-point FMA:

    893,500.288914668 x 43.90230564772498 + 2.02789331400154
      = 39,226,722.78026233027699 + 2.02789331400154
      = 39,226,724.80815564

Or moving the three 64-bit operands 18 mm across the die?

The data movement takes over 4.2x the energy (at 40nm), and it's getting worse: at 10nm, the relative cost will be 15x. Loading the data from off chip takes >>100x the energy.

Flops are cheap. Communication is expensive.
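For concreteness, the arithmetic on the slide is a single fused multiply-add; a minimal C check using the standard library's fma() (illustrative only, since the slide's point is the hardware energy cost, not the code):

    #include <math.h>   /* fma(): a*b + c with a single rounding step */
    #include <stdio.h>

    int main(void) {
        double a = 893500.288914668;
        double b = 43.90230564772498;
        double c = 2.02789331400154;
        double r = fma(a, b, c);
        printf("fma = %.8f\n", r);   /* ~39226724.80815564 */
        return 0;
    }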
Achieving Energy Efficiency

Reduce overhead. Minimize data motion.
Multi-core CPUs

The industry has gone multi-core as a first response to power issues: performance through parallelism, dialing back complexity and clock rate, exploiting locality. But CPUs are fundamentally designed for single-thread performance rather than energy efficiency:

- Fast clock rates with deep pipelines
- Data and instruction caches optimized for latency
- Superscalar issue with out-of-order execution
- Dynamic conflict detection
- Lots of prediction and speculative execution
- Lots of instruction overhead per operation

Less than 2% of chip power today goes to flops.
CPU vs. GPU (energy per flop)

- CPU (Westmere, 32nm): 1690 pJ/flop. Optimized for latency; caches.
- GPU (Fermi, 40nm): 225 pJ/flop. Optimized for throughput; explicit management of on-chip memory.
Growing Momentum for GPUs in Supercomputing

Tesla powers 3 of the top 5 systems (November 2011):

- #1: K Computer, 88K Fujitsu SPARC CPUs, 10.5 PFLOPS
- #2: Tianhe-1A, 7168 Tesla GPUs, 2.6 PFLOPS
- #3: Jaguar, 36K AMD Opteron CPUs, 1.8 PFLOPS
- #4: Nebulae, 4650 Tesla GPUs, 1.3 PFLOPS
- #5: Tsubame 2.0, 4224 Tesla GPUs, 1.2 PFLOPS (most efficient PF system)

Coming: Titan, 18,000 Tesla GPUs, >20 PFLOPS.
NVIDIA GPU Computing Uptake

- >300,000,000 compute-capable NVIDIA GPUs
- >500,000 NVIDIA SW development toolkit downloads
- >100,000 active NVIDIA GPU computing developers
- >450 universities teaching GPU computing
- Widespread adoption by HPC OEMs
ORNL Adopts GPUs for Next-Gen Supercomputer

Titan (Cray XK6): 18,000 Tesla GPUs, 20+ PFLOPS, with ~90% of flops from GPUs. 2x faster and 3x more energy efficient (and much smaller!) than the current #1, the K Computer.
NCSA Mixes GPUs into Blue Waters

"By incorporating a future version of the XK6 system, Blue Waters will provide a bridge to the future of scientific computing." - NCSA Director Thom Dunning

Blue Waters contains over 30 Cray XK cabinets with over 3000 NVIDIA Tesla GPUs.
What About Programming?
GPU Libraries: Plug In & Play

- cuBLAS: dense linear algebra
- Parallel algorithms libraries
- QUDA: lattice QCD
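As a sketch of what "plug in and play" looks like with cuBLAS (my illustration, not from the slides; sizes and values are arbitrary placeholders, error checking omitted), a double-precision GEMM where the library supplies the GPU kernel:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void) {
        const int n = 512;                    /* arbitrary square-matrix size */
        size_t bytes = (size_t)n * n * sizeof(double);

        /* Host matrices, initialized to something easy to check. */
        double *A = malloc(bytes), *B = malloc(bytes), *C = malloc(bytes);
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* Device copies. */
        double *dA, *dB, *dC;
        cudaMalloc((void**)&dA, bytes);
        cudaMalloc((void**)&dB, bytes);
        cudaMalloc((void**)&dC, bytes);
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dC, C, bytes, cudaMemcpyHostToDevice);

        /* C = alpha*A*B + beta*C on the GPU; no kernel code written by hand. */
        cublasHandle_t h;
        cublasCreate(&h);
        double alpha = 1.0, beta = 0.0;
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);

        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[0] = %f\n", C[0]);          /* expect 1024.0 = n * 1.0 * 2.0 */

        cublasDestroy(h);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(A); free(B); free(C);
        return 0;
    }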
Directives: Ease of Programming and Portability

Available from PGI, CAPS, and soon Cray.

OpenMP (CPU):

    main() {
        double pi = 0.0; long i;
        #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < n; i++) {
            double t = (double)((i+0.05)/n);
            pi += 4.0/(1.0+t*t);
        }
        printf("pi = %f\n", pi/n);
    }

Cray directives (CPU + GPU), the same loop offloaded:

    main() {
        double pi = 0.0; long i;
        #pragma omp acc_region_loop
        #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < n; i++) {
            double t = (double)((i+0.05)/n);
            pi += 4.0/(1.0+t*t);
        }
        #pragma omp end acc_region_loop
        printf("pi = %f\n", pi/n);
    }
OpenACC: Open Parallel Programming Standard. Easy, Fast, Portable.

"OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan." - Buddy Bland, Titan Project Director, Oak Ridge National Lab

"OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP." - Michael Wong, CEO, OpenMP Directives Board
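For comparison with the slide's directive example, the same pi loop under the OpenACC standard would look roughly like this (my sketch, not from the slides; n is an assumed problem size):

    #include <stdio.h>

    int main(void) {
        const long n = 1000000;   /* assumed problem size, not from the slide */
        double pi = 0.0;
        long i;
        /* OpenACC: offload the loop to an accelerator, reducing into pi. */
        #pragma acc parallel loop reduction(+:pi)
        for (i = 0; i < n; i++) {
            double t = (double)((i + 0.05) / n);
            pi += 4.0 / (1.0 + t * t);
        }
        printf("pi = %f\n", pi / n);
        return 0;
    }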
Focus on Exposing Parallelism

With directives, tuning work focuses on exposing parallelism, which makes codes inherently better. Example: application tuning work using directives for the new Titan system at ORNL (see the sketch after this list).

- S3D (research into more efficient combustion with next-generation fuels): tuning the top 3 kernels (90% of runtime) made the code 3 to 6x faster on CPU+GPU vs. CPU+CPU, and also improved the all-CPU version by 50%.
- CAM-SE (answers questions about specific climate change adaptation and mitigation scenarios): tuning the top key kernel (50% of runtime) made it 6.5x faster on CPU+GPU vs. CPU+CPU, and improved the CPU version by 100%.
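As a hedged illustration of what "exposing parallelism" means in practice (not the actual S3D or CAM-SE code; names and sizes are invented): collapsing a loop nest hands the compiler one large iteration space to spread across the GPU, and the same restructuring often speeds up the CPU build as well.

    #include <stdio.h>

    #define NI 1024
    #define NJ 1024

    /* Illustrative grid kernel. collapse(2) exposes NI*NJ independent
       iterations instead of NI, giving the accelerator far more
       parallelism to work with. */
    void scale_grid(double a[NI][NJ], double s) {
        #pragma acc parallel loop collapse(2) copy(a[0:NI][0:NJ])
        for (int i = 0; i < NI; i++)
            for (int j = 0; j < NJ; j++)
                a[i][j] *= s;
    }

    int main(void) {
        static double a[NI][NJ];
        for (int i = 0; i < NI; i++)
            for (int j = 0; j < NJ; j++)
                a[i][j] = 1.0;
        scale_grid(a, 2.0);
        printf("a[0][0] = %f\n", a[0][0]);   /* expect 2.0 */
        return 0;
    }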
The Future of HPC is Green

- We're constrained by power.
- You can't simultaneously optimize for single-thread performance and power efficiency.
- Most work must be done by cores designed for throughput and efficiency.
- Locality is key: explicit control of the memory hierarchy.
- GPUs are the path to our tightly-coupled, energy-efficient, hybrid processor future.
Thank You. Questions?