Supercomputing at 1/10th the Cost


1 Supercomputing at 1/10th the Cost

2 GPU Computing (aka GPGPU): CPU + GPU co-processing. CPU: 4 cores, 48 GigaFlops (DP). GPU: 515 GigaFlops (DP).

3 [Chart: GPU speed-ups across application domains, ranging from roughly 15x to 150x.] Examples: Medical Imaging (U of Utah), Molecular Dynamics (U of Illinois, Urbana), Video Transcoding (Elemental Tech), Matlab Computing (AccelerEyes), Astrophysics (RIKEN), Financial Simulation (Oxford), Linear Algebra (Universidad Jaime), 3D Ultrasound (Techniscan), Quantum Chemistry (U of Illinois, Urbana), Gene Sequencing (U of Maryland).

4 Tesla Data Center & Workstation GPU Solutions. Tesla M-series GPUs (M2070, M2050, M1060) for integrated CPU-GPU servers and blades (example: Nitro G5 1U). Tesla S-series 1U systems (S2050, S1070), paired with an OEM CPU server. Tesla C-series GPUs (C2070, C2050, C1060) for workstations with 2 to 4 Tesla GPUs.

5 Fermi: The Computational GPU. Performance: 7x the double-precision performance of CPUs; IEEE SP & DP floating point. Flexibility: shared memory increased from 16 KB to 48 KB; L1 and L2 caches added; ECC on all internal and external memories; up to 1 TeraByte of GPU memory; high-speed GDDR5 memory interface. Usability: multiple simultaneous tasks on the GPU; 10x faster atomic operations; C++ support; system calls and printf support.

6 NVIDIA Tesla GPU Computing Products

Data center server modules:
- Tesla M2070 / M2050: 1 T20 GPU; 1030 GFlops single precision; 515 GFlops double precision; 6 GB / 3 GB memory.
- Tesla M1060: 1 T10 GPU; 933 GFlops single precision; 78 GFlops double precision; 4 GB memory; 102 GB/s memory bandwidth.

Data center 1U systems:
- Tesla S2050: 4 T20 GPUs; 4120 GFlops single precision; 2060 GFlops double precision; 12 GB memory.
- Tesla S1070: 4 T10 GPUs; 4140 GFlops single precision; 346 GFlops double precision; 16 GB memory (4 GB per GPU); 102 GB/s memory bandwidth per GPU.

Workstation boards:
- Tesla C2070 / C2050: 1 T20 GPU; 1030 GFlops single precision; 515 GFlops double precision; 6 GB / 3 GB memory; 144 GB/s memory bandwidth.
- Tesla C1060: 1 T10 GPU; 933 GFlops single precision; 78 GFlops double precision; 4 GB memory; 102 GB/s memory bandwidth.

7 NVIDIA GPGPU Developer Eco-System. Numerical packages: MATLAB, Mathematica, NI LabView, PyCUDA. Debuggers & profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView. GPU compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python. Parallelizing compilers: PGI Accelerator, CAPS HMPP, MCUDA, OpenMP. Libraries: BLAS, FFT, LAPACK, NPP, video/imaging, GPULib. GPGPU consultants & training and OEM solution providers: ANEO, GPU Tech.

8 World's Fastest GPU Supercomputer: 1.27 Petaflops. Top500: #1 in Asia, #2 in the World.

9 #2: Dawning Nebulae. 1.27 Petaflops, 4640 GPUs, 2.55 MWatts.

10 [Chart: power (MegaWatts) vs. Linpack performance (Teraflops).] 2x performance/watt for Tesla GPU systems (Nebulae; IPE, CAS) compared with Jaguar (x86 CPU), JUGENE (BlueGene), and Roadrunner (Cell).

11 8x Higher Linpack Performance. [Charts comparing a CPU server with a GPU-CPU server: performance (Gflops), performance/$ (Gflops/$K), and performance/watt (Gflops/kwatt).] CPU 1U server: 2x Intel Xeon X5550 (Nehalem), 2.66 GHz, 48 GB memory, $7K, 0.55 kW. GPU-CPU 1U server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW.

12 20+ Oil & Gas Companies Porting to CUDA. GPU vs. CPU improvements for successful customers: performance/watt 18x-27x, performance/space 20x-31x, performance/cost 15x-20x. For Oil & Gas ISVs: performance/watt 12x-17x, performance/space 15x-20x, performance/cost 10x-12x.

13 Oil & Gas: Seismic Processing. Equal performance: 32 Tesla S1070s vs. 2000 CPU servers. 31x less space. 20x lower cost: ~$400K vs. ~$8M. 27x lower power: 45 kWatts vs. 1200 kWatts.

14 Finance: 10+ Banks Porting to CUDA. Successful customers include several unannounced finance ISVs and UnRisk. Derivative pricing using SciFinance (basket equity-linked structured note), versus an Intel Xeon (2.6 GHz) baseline: 2 Tesla C1060s, 0.25 secs (124x speed-up); 1 Tesla C1060 (77x speed-up).

15 Bloomberg: Bond Pricing. 48 GPUs vs. 2000 CPUs: 42x lower space. $144K vs. $4 Million: 28x lower cost. $31K/year vs. $1.2 Million/year: 38x lower power cost.

16 Tesla Bio WorkBench: Bio-Chemistry & Bio-Informatics. Applications: TeraChem, Hex (docking), LAMMPS, CUDA-BLASTP, CUDA-MEME, MUMmerGPU, CUDA-EC. Community: downloads, documentation, technical papers, discussion forums, benchmarks & configurations. Platforms: Tesla Personal Supercomputer, Tesla GPU clusters.

17 Finding a Better Shampoo. One Tesla Personal Supercomputer vs. 32 CPU servers at equal performance: no data center required (1 kWatt vs. 21 kWatts); 13x lower system cost ($7K vs. $128K); 19x lower power & cooling cost ($2K vs. $37K).

18 The Tesla Advantage

19 Parallel Computing on All GPUs: 100+ Million CUDA GPUs Deployed. GeForce: entertainment. Tesla: high-performance computing. Quadro: design & creation.

20 Tesla: Built for Professional Computing

Features and benefits:
- 4x higher double precision (on 20-series): higher performance for scientific CUDA applications.
- ECC, available only on Tesla & Quadro (on 20-series): data reliability inside the GPU and on DRAM memories.
- Bi-directional PCI-E communication (Tesla has dual DMA engines; GeForce has only one): higher performance for CUDA applications by overlapping communication and computation.
- Larger memory for larger data sets (3 GB and 6 GB products): higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE).
- Cluster management software tools available on Tesla only: needed for GPU monitoring and job scheduling in data center deployments.
- TCC (Tesla Compute Cluster) driver for Windows, supported only on Tesla: higher performance for CUDA applications due to lower kernel launch overhead; TCC adds support for RDP and Services.
- Integrated OEM workstations and servers: trusted, reliable systems built for Tesla products.
- Professional ISVs certify CUDA applications only on Tesla: bug reproduction, support, and feature requests handled for Tesla.

Quality & warranty:
- 2 to 4 days of stress testing and memory burn-in, plus added margin in memory and core clocks: built for 24/7 computing in data center and workstation environments.
- Manufactured and guaranteed by NVIDIA, with a 3-year NVIDIA warranty: reliable, long-life products.

Support & lifecycle:
- Enterprise support and higher priority for CUDA bugs and requests: ability to influence the CUDA and GPU roadmap and get early access to feature requests.
- Extended product availability with a 6-month EOL notice, and no changes in key components such as GPU and memory without notice (always the same clocks for known, reliable performance): reliable product supply.

21 Programming GPUs

22 Doing GPGPU Right: a Combination of Hardware and Software. GPU computing applications sit on libraries and middleware: cuFFT, cuBLAS, CULA (LAPACK), NPP & CUDPP, video libraries, PhysX (physics), OptiX (ray tracing), mental ray and iray (rendering), RealityServer (3D web services). Languages and APIs: C, C++, Fortran, OpenCL, DirectCompute, Java and Python. Underneath is the NVIDIA GPU with the CUDA parallel computing architecture. (OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.)

23 CUDA C/C++: C with a few new keywords

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel CUDA C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
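The slide shows only the kernels and their invocation. To run the parallel version end to end, the host also has to allocate device memory and copy data across PCI-E. Below is a minimal host-side sketch assuming the saxpy_parallel kernel above; the problem size and initialization values are arbitrary examples, not part of the original slide.

    // Minimal host-side driver for the saxpy_parallel kernel shown above.
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 1 << 20;                       // example problem size (assumption)
        size_t bytes = n * sizeof(float);
        float *x = (float*)malloc(bytes);
        float *y = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        float *d_x, *d_y;                      // device copies of x and y
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

        int nblocks = (n + 255) / 256;         // same launch configuration as the slide
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

        cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", y[0]);           // expect 4.0 = 2*1 + 2

        cudaFree(d_x); cudaFree(d_y);
        free(x); free(y);
        return 0;
    }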

24 CUDA Programming Effort / Performance. (Source: MIT CUDA course.)

25 Fermi: Next Generation Architecture

26 Introducing the Fermi Architecture: 3 billion transistors; 512 cores (max); DP performance 50% of SP; ECC; L1 and L2 caches; GDDR5 memory; up to 1 Terabyte of GPU memory; concurrent kernels; C++.

27 Fermi SM Architecture: 32 CUDA cores per SM (512 total); double precision at 50% of single precision, an 8x improvement over GT200; dual thread scheduler; 64 KB of RAM per SM, configurable between shared memory and L1 cache.
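Because the 64 KB of per-SM RAM is configurable, the CUDA runtime lets a program state a per-kernel preference for how it is split. The sketch below uses the standard cudaFuncSetCacheConfig call; the kernel my_kernel is a hypothetical placeholder.

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data)
    {
        // placeholder kernel body; a real kernel would stage data in shared memory
    }

    void prefer_shared_memory(void)
    {
        // Request 48 KB shared memory / 16 KB L1 for this kernel;
        // cudaFuncCachePreferL1 would request the opposite split.
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    }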

28 CUDA Core Architecture: new IEEE floating-point standard, surpassing even the most advanced CPUs; fused multiply-add (FMA) instruction for both single and double precision; newly designed integer ALU optimized for 64-bit and extended-precision operations.

29 Cached Memory Hierarchy: the first GPU architecture to support a true cache hierarchy in combination with on-chip shared memory. An L1 cache per SM (per 32 cores) improves bandwidth and reduces latency; a unified 768 KB L2 cache provides fast, coherent data sharing across all cores in the GPU (the Parallel DataCache memory hierarchy).

30 ECC: ECC protection for DRAM (supported for GDDR5 memory) and for all major internal memories: register file, shared memory, L1 cache, and L2 cache. Detects 2-bit errors and corrects 1-bit errors (per word).

31 GigaThread Hardware Thread Scheduler: hierarchically manages thousands of simultaneously active threads; 10x faster application context switching; concurrent kernel execution.

32 GigaThread Hardware Thread Scheduler: concurrent kernel execution plus faster context switching (parallel rather than serial kernel execution).

33 GigaThread Streaming Data Transfer Engine: dual DMA engines allow simultaneous CPU-to-GPU and GPU-to-CPU data transfers, fully overlapped with CPU and GPU processing time. [Activity snapshot figure.]
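Exploiting the dual DMA engines from CUDA requires pinned host memory, asynchronous copies, and multiple streams, so that an upload in one stream can overlap a download in another and both can overlap kernel execution. A minimal two-stream pipeline sketch under those assumptions follows; process_chunk is a hypothetical kernel standing in for real work, and h_data is assumed to be allocated with cudaHostAlloc.

    #include <cuda_runtime.h>

    __global__ void process_chunk(float *buf, int n)        // hypothetical kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;
    }

    void pipeline(float *h_data, int n_chunks, int chunk)   // h_data: pinned host memory
    {
        cudaStream_t stream[2];
        float *d_buf[2];
        for (int s = 0; s < 2; ++s) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc(&d_buf[s], chunk * sizeof(float));
        }

        for (int c = 0; c < n_chunks; ++c) {
            int s = c % 2;                                   // alternate streams and buffers
            float *h_chunk = h_data + (size_t)c * chunk;
            // Operations in one stream run in order; operations in different streams
            // may overlap, so one DMA engine can upload while the other downloads.
            cudaMemcpyAsync(d_buf[s], h_chunk, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, stream[s]);
            process_chunk<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk);
            cudaMemcpyAsync(h_chunk, d_buf[s], chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[s]);
        }

        for (int s = 0; s < 2; ++s) {
            cudaStreamSynchronize(stream[s]);
            cudaStreamDestroy(stream[s]);
            cudaFree(d_buf[s]);
        }
    }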

34 Enhanced Software Support: full C++ support (virtual functions, try/catch hardware support); system call support (pipes, semaphores, printf, etc.); unified 64-bit memory addressing.
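Device-side printf, one of the system calls listed above, is available on Fermi-class (compute capability 2.0) GPUs. A minimal sketch, assuming the file is compiled for that architecture:

    #include <cstdio>                 // device printf is declared through stdio

    __global__ void hello(void)
    {
        // Each thread prints its own coordinates; output is flushed on synchronization.
        printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main(void)
    {
        hello<<<2, 4>>>();            // e.g. nvcc -arch=sm_20 hello.cu
        cudaDeviceSynchronize();      // wait for the kernel and flush the print buffer
        return 0;
    }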

35 Performance Benchmarks

36 8x Higher Linpack. [Charts comparing a CPU server with a GPU-CPU server:] performance (Gflops): 8x higher; performance/$ (Gflops/$K): about 5x higher (11 vs. 60); performance/watt (Gflops/kwatt): 4.5x higher. CPU 1U server: 2x Intel Xeon X5550 (Nehalem), 2.66 GHz, 48 GB memory, $7K, 0.55 kW. GPU-CPU 1U server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW.

37 Performance Summary (speed-up charts, preliminary data): MIDG (discontinuous Galerkin solvers for PDEs), AMBER molecular dynamics (mixed precision), lock-exchange problem in OpenCurrent, OpenEye ROCS virtual drug screening, and radix sort from the CUDA SDK.

38 Standard FFT Library: CUFFT 3.1. [Charts: single-precision and double-precision FFT performance (Gflops) vs. transform size.] CUFFT 3.1 on NVIDIA Tesla C1060 and Tesla C2050 (Fermi); MKL on a quad-core Intel Xeon 5550, 2.67 GHz.
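For reference, a minimal CUFFT call sequence for an in-place 1D single-precision complex-to-complex transform looks like the sketch below; the transform size is an arbitrary example and error checking is omitted (link with -lcufft).

    #include <cufft.h>
    #include <cuda_runtime.h>

    void fft_1d_example(void)
    {
        const int N = 1 << 20;                    // example transform size (assumption)
        cufftComplex *d_data;
        cudaMalloc(&d_data, N * sizeof(cufftComplex));
        // ... fill d_data with input samples, e.g. via cudaMemcpy from the host ...

        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);      // 1D complex-to-complex plan, batch of 1
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place forward FFT

        cufftDestroy(plan);
        cudaFree(d_data);
    }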

39 Standard BLAS Library: CUBLAS 3.1. [Charts: single-precision SGEMM and double-precision DGEMM performance (Gflops) vs. matrix size, from 256x256 up to 8192x8192.] CUBLAS 3.1 on NVIDIA Tesla C1060 and Tesla C2050 (Fermi); MKL on a quad-core Intel Xeon 5550, 2.67 GHz.
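A minimal SGEMM call through the legacy CUBLAS API (cublas.h), which is the interface of the CUBLAS 3.1 era, is sketched below; the matrices are assumed to be column-major and square, and error checking is omitted.

    #include <cublas.h>

    // Computes C = A * B for column-major n x n single-precision matrices.
    void sgemm_example(int n, const float *A, const float *B, float *C)
    {
        float *d_A, *d_B, *d_C;
        cublasInit();
        cublasAlloc(n * n, sizeof(float), (void**)&d_A);
        cublasAlloc(n * n, sizeof(float), (void**)&d_B);
        cublasAlloc(n * n, sizeof(float), (void**)&d_C);

        cublasSetMatrix(n, n, sizeof(float), A, n, d_A, n);   // host -> device
        cublasSetMatrix(n, n, sizeof(float), B, n, d_B, n);

        // C = 1.0 * A * B + 0.0 * C, no transposes
        cublasSgemm('n', 'n', n, n, n, 1.0f, d_A, n, d_B, n, 0.0f, d_C, n);

        cublasGetMatrix(n, n, sizeof(float), d_C, n, C, n);   // device -> host

        cublasFree(d_A); cublasFree(d_B); cublasFree(d_C);
        cublasShutdown();
    }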

40 Matrix Size for Best CUBLAS Performance. [Charts: SGEMM and DGEMM Gflops across matrix sizes up to roughly 8192, showing that sizes which are multiples of the library's preferred block size achieve the best performance.] CUBLAS 3.1 on NVIDIA Tesla C1060 and Tesla C2050 (Fermi); MKL on a quad-core Intel Xeon 5550, 2.67 GHz.

41 CULA 1.3: LAPACK Library from EM Photonics. [Charts: GFlops vs. matrix size on Core i7-920, Tesla C1060, and Tesla C2050; double-precision results.] QR decomposition (DGEQRF), Householder method, operation count estimated as 1.33N^3. LU decomposition (DGETRF), partial pivoting, operation count estimated as 0.66N^3. Singular value decomposition (DGESVD), left and right singular vectors, operation count estimated as 21N^3. Data courtesy of EM Photonics.

42 Sparse Matrix-Vector Multiplication (SpMV). [Charts: single- and double-precision Gflops for Tesla C2050 (ECC off), Tesla C2050 (ECC on), Tesla C1060, and Intel Xeon 5550.] SpMV with CUDA 3.0 on Tesla C1060 and Tesla C2050; MKL 10.2 on Intel Xeon 5550, 2.67 GHz. Preliminary data.
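As a point of reference for what these benchmarks measure, the simplest CUDA formulation of SpMV over a CSR matrix assigns one thread per row. The sketch below is that textbook scalar-CSR kernel, not the tuned code behind the numbers above.

    // Scalar CSR SpMV: y = A*x, one thread per row.
    __global__ void spmv_csr_scalar(int num_rows,
                                    const int *row_ptr,   // CSR row offsets, length num_rows + 1
                                    const int *col_idx,   // column index of each nonzero
                                    const float *values,  // value of each nonzero
                                    const float *x,
                                    float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            float dot = 0.0f;
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                dot += values[j] * x[col_idx[j]];
            y[row] = dot;
        }
    }

    // Launch example: spmv_csr_scalar<<<(num_rows + 255) / 256, 256>>>(...);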

43 Australia GPU Users Groups: informal local special-interest groups that bring together GPU users from all fields and experience levels. Groups in Brisbane, Sydney, Perth, and soon Melbourne. First meetings were held in June 2010; second meetings are likely in August. Also look for the Australia GPU Users group on linkedin.com.

44 NVIDIA Academic Partnership: support for faculty research and teaching efforts. Easy: small equipment gifts (1-2 GPUs) and significant discounts on GPU purchases, especially Quadro and Tesla equipment; useful for cost matching. Competitive: research contracts, small cash grants (typically ~$25K gifts), and medium-scale equipment donations (10-30 GPUs); informal proposals, reviewed quarterly; focus areas: GPU computing, especially with an educational mission or component. NVIDIA PhD Fellowship: accepting applications in November!

45 GPU Technology Conference 2010: Monday, Sept. 20 through Thursday, Sept. 23, 2010, San Jose Convention Center, San Jose, California. The most important event in the GPU ecosystem: learn about seismic shifts in GPU computing; preview disruptive technologies and emerging applications; get tools and techniques to impact mission-critical projects; network with experts, colleagues, and peers across industries. "I consider the GPU Technology Conference to be the single best place to see the amazing work enabled by the GPU. It's a great venue for meeting researchers, developers, scientists, and entrepreneurs from around the world." -- Professor Hanspeter Pfister, Harvard University and GTC 2009 keynote speaker. Opportunities: call for submissions (sessions and posters); sponsors and exhibitors (reach decision makers); CEO on Stage showcase for startups (tell your story to VCs and analysts).

46 Tesla GPU Computing Questions?

47 Tesla GPU Computing Products

48 NVIDIA Tesla GPU Computing Products

Data center server modules:
- Tesla M2070 / M2050: 1 T20 GPU; 1030 GFlops single precision; 515 GFlops double precision; 6 GB / 3 GB memory.
- Tesla M1060: 1 T10 GPU; 933 GFlops single precision; 78 GFlops double precision; 4 GB memory; 102 GB/s memory bandwidth.

Data center 1U systems:
- Tesla S2050: 4 T20 GPUs; 4120 GFlops single precision; 2060 GFlops double precision; 12 GB memory.
- Tesla S1070: 4 T10 GPUs; 4140 GFlops single precision; 346 GFlops double precision; 16 GB memory (4 GB per GPU); 102 GB/s memory bandwidth per GPU.

Workstation boards:
- Tesla C2070 / C2050: 1 T20 GPU; 1030 GFlops single precision; 515 GFlops double precision; 6 GB / 3 GB memory; 144 GB/s memory bandwidth.
- Tesla C1060: 1 T10 GPU; 933 GFlops single precision; 78 GFlops double precision; 4 GB memory; 102 GB/s memory bandwidth.

49 Announced OEM Servers with Tesla M-series GPUs

50 OEM Servers with Tesla M2050 GPUs: SuperServer (2 CPUs + 2 GPUs in 1U); Tetra (2 CPUs + 4 GPUs in 1U); GreenBlade (10 CPUs + 10 GPUs in 5U); a 4U blade system (CPUs + 8 GPUs in 4U).

51 OEM Servers with Tesla M1060 GPUs: SuperServer 6016GT-TF (2 CPUs + 2 GPUs in 1U); Cray CX1000 and Bull Bullx (36 CPUs + up to 18 GPUs in 7U); Tyan (CPUs + 8 GPUs in 4U).

52 "In testing our key applications, the Tesla GPUs delivered speed-ups that we had never seen before, sometimes even orders of magnitude." -- Satoshi Matsuoka, Professor, Tokyo Institute of Technology. "I believe history will record Fermi as a significant milestone." -- Dave Patterson, Director, Parallel Computing Research Laboratory, U.C. Berkeley; co-author of Computer Architecture: A Quantitative Approach. "Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs." -- Jack Dongarra, Professor, University of Tennessee; author of Linpack.

53 $1M Budget

54 4.5x Lower Power & Cooling Costs. A 37-TeraFlop system (a Top 150 system): 2 racks of GPUs+CPUs vs. 15 racks of CPUs (7x less space required); $740K vs. $3.8M (5x lower cost); $117K vs. $524K in power every year (4.5x power savings). As per the November 2009 Top 500 list.

55 Scaling to a 5-PetaFlop Cluster. [Chart: power (MegaWatts) vs. Linpack performance (Teraflops) for Jaguar (x86 CPU), JUGENE (BlueGene), Roadrunner (Cell), and Nebulae (Tesla GPU), extrapolated to 5 PetaFlops for an x86 CPU cluster vs. a Tesla GPU cluster.]

56 What if Every Supercomputer Had Fermi? [Chart: Linpack Teraflops of the Top 500 supercomputers (Nov 2009), with GPU-based configurations marked against the list: 110 TeraFlops for $2.2M, 55 TeraFlops for $1.1M, and 37 TeraFlops for $740K.]

57 Defense / Federal Agencies. Opportunities: defense contractors, federal agencies, and defense services, with speed-ups of 10x-50x. Software available: GIS (Manifold, PCI Geomatics, DigitalGlobe); signal processing (GPU VSIPL, MATLAB GPU plugin available); UAV video analysis (MotionDSP Ikena); virtual prototyping (RealityServer); surveillance and cryptography.

58 Targeting Multiple Platforms with CUDA. CUDA C/C++ is compiled by NVCC (NVIDIA CUDA Toolkit) to PTX for NVIDIA GPUs; Ocelot translates PTX to multi-core CPUs; MCUDA translates CUDA to multi-core CPU code; SWAN translates CUDA to OpenCL for other GPUs.

59 Soul of a Supercomputer. [Chart: Linpack Teraflops per rack for a Nehalem 4-core CPU, a Westmere 6-core CPU, Tesla 10-series, and Tesla 20-series.]

60 The Tesla Advantage

61 Parallel Computing on All GPUs: 100+ Million CUDA GPUs Deployed. GeForce: entertainment. Tesla: high-performance computing. Quadro: design & creation.

62 Tesla: Built for Professional Computing

Features and benefits:
- 4x higher double precision (on 20-series): higher performance for scientific CUDA applications.
- ECC, available only on Tesla & Quadro (on 20-series): data reliability inside the GPU and on DRAM memories.
- Bi-directional PCI-E communication (Tesla has dual DMA engines; GeForce has only one): higher performance for CUDA applications by overlapping communication and computation.
- Larger memory for larger data sets (3 GB and 6 GB products): higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE).
- Cluster management software tools available on Tesla only: needed for GPU monitoring and job scheduling in data center deployments.
- TCC (Tesla Compute Cluster) driver for Windows, supported only on Tesla: higher performance for CUDA applications due to lower kernel launch overhead; TCC adds support for RDP and Services.
- Integrated OEM workstations and servers: trusted, reliable systems built for Tesla products.
- Professional ISVs certify CUDA applications only on Tesla: bug reproduction, support, and feature requests handled for Tesla.

Quality & warranty:
- 2 to 4 days of stress testing and memory burn-in, plus added margin in memory and core clocks: built for 24/7 computing in data center and workstation environments.
- Manufactured and guaranteed by NVIDIA, with a 3-year NVIDIA warranty: reliable, long-life products.

Support & lifecycle:
- Enterprise support and higher priority for CUDA bugs and requests: ability to influence the CUDA and GPU roadmap and get early access to feature requests.
- Extended product availability with a 6-month EOL notice, and no changes in key components such as GPU and memory without notice (always the same clocks for known, reliable performance): reliable product supply.

63 Benefits of Larger Tesla Memory. [Chart: relative performance of a 1 GB GPU board vs. a Tesla C1060 with 4 GB on seismic processing and AMBER molecular dynamics; the application runs out of memory with 1 GB of DRAM.] Larger GPU memory enables operating on larger data sets; applications that run out of memory include molecular dynamics, medical imaging, seismic processing, CFD, electro-magnetics, finite element analysis, data analytics, and database search.

64 Tesla Data Center Software Tools. System management tools: nvidia-smi for thermal, power, and system monitoring; integration of nvidia-smi with the Ganglia system monitoring software; IPMI support for Tesla M-based hybrid CPU-GPU servers. Clustering software: Platform Computing Cluster Manager, Penguin Computing Scyld, Cluster Corp Rocks Roll, Altair PBS Works.

65 Tesla Compute Cluster (TCC) Driver Available: enables Windows HPC on Tesla. Tesla driver for Windows 7, Server 2008 R2, Windows Vista, and Server 2008; Tesla 8-series, 10-series, and 20-series GPUs supported. Works with CUDA C/C++ and Fortran; does not support OpenCL, OpenGL, or DirectX. Enables the following under Windows with CUDA: RDP (Remote Desktop); launching CUDA applications via Windows Services; no Windows timeout issues; no penalty on launch overhead; KVM-over-IP enabled (CPU server on-board graphics chipset enabled).

66 Tesla C-series Display Features. Why does the Tesla C2050 have DVI out? Even researchers and engineers need to see their work! Tesla targets high-performance computing, Quadro targets visual computing; compared with Quadro, the Tesla C2050 offers the same double-precision FP performance, much lower geometry performance (Tris/sec), lower Viewperf performance (geomean), a baseline graphics feature level, no 12-bit/30-bit or quad-buffered stereo display, no SLI MultiOS, a single dual-link DVI-I output, no workstation ISV certifications, and Tesla ISV certifications.

67 Increasing Number of CUDA Applications (already available plus 2010 Q2-Q4 releases; a mix of released, announced, and unannounced products). Tools & libraries: CULA LAPACK library, Thrust C++ template library, CUDA C/C++, PGI Fortran, Jacket MATLAB plugin, Nsight Visual Studio IDE, NPP performance primitives, Allinea debugger, Platform cluster management, TotalView debugger, Bright Cluster Management, Mathematica, PGI Accelerator enhancements, CAPS HMPP enhancements, MathWorks MATLAB. Oil & gas: seismic analysis (ffA, HeadWave, Geostar, Seismic City, Acceleware), seismic interpretation, reservoir simulation. Bio-chemistry: AMBER, GROMACS, GROMOS, HOOMD, LAMMPS, NAMD, VMD, BigDFT, ABINIT, TeraChem, further quantum chemistry and molecular dynamics codes. Bio-informatics: Hex protein docking, CUDA-BLASTP, GPU-HMMER, MUMmerGPU, MEME, CUDA-EC, protein docking, short-read sequence analysis. Video & rendering: Fraunhofer JPEG2K, OptiX ray tracing engine, mental ray with iray, MainConcept video encoder, Elemental Video Live, 3D CAD software with iray. Finance: NAG RNGs, NumeriX counterparty risk, SciComp SciFinance, risk analysis products, credit risk analysis, an ISV trading platform. CAE: Autodesk Moldflow, OpenCurrent CFD/PDE library, Moldex3D, ACUSIM AcuSolve CFD, structural mechanics ISVs, MSC MARC, MSC Nastran, several CAE ISVs. EDA: electro-magnetics (Agilent, CST, Remcom, SPEAG), Agilent ADS, SPICE simulators, a Verilog simulator, lithography products.

68 Explosive Growth in CUDA Developers: 200,000 CUDA developer downloads; universities teaching parallel programming using NVIDIA GPUs.

69 NVIDIA's Leadership in GPU Computing: applications and papers using NVIDIA GPUs; 10 GPGPU books using NVIDIA GPUs.
