The success of multipole methods: Are we there yet?


1 King Abdullah University of Science and Technology, Saudi Arabia, Apr. 28-30, 2012. The success of multipole methods: Are we there yet? Lorena A Barba, Boston University

2 1948 The world's first computation by a stored-program computer: The Manchester Baby

3

4 50th Anniversary Celebration at the University of Manchester

5 Professor Lloyd N Trefethen FRS, Head of Oxford's Numerical Analysis Group, President of SIAM (2011-2012)

6 1998 Lloyd Trefethen: Predictions for Scientific Computing 50 Years From Now

7 #1. We may not be here

8 #2. We'll talk to computers more often than type to them, and they'll respond with pictures more often than numbers.

9 [FMM tree diagram: P2M at the leaf level, M2M and M2L at level 1, L2L and L2P back down to the leaves.] #7. Multipole methods and their descendants will be ubiquitous

10 "... the fundamental law of computer science [is]: the faster the computer, the greater the importance of speed of algorithms." Trefethen & Bau, Numerical Linear Algebra, SIAM

11 Fast multipole method: O(N) solution of the N-body problem. A Top 10 Algorithm of the 20th century.

12 1946 The Monte Carlo Method; the Simplex Method for Linear Programming; the Krylov Subspace Iteration Method; the Decompositional Approach to Matrix Computations; the Fortran Compiler; the QR Algorithm for Computing Eigenvalues; Quicksort Algorithms for Sorting; the Fast Fourier Transform; Integer Relation Detection; the Fast Multipole Method. Dongarra & Sullivan, IEEE Comput. Sci. Eng., Vol. 2(1):22-23 (2000)

13 N-body Problem: updates to a system where each element of the system rigorously depends on the state of every other element of the system.

14 N-body Problem: updates to a system where each element of the system rigorously depends on the state of every other element of the system.

15 Credit: Mark Stock

16

17 M31, the Andromeda galaxy. # stars: 10^12

18

19 Fast N-body method stars of the Andromeda galaxy Earth

20 Fast N-body method stars of the Andromeda galaxy

21 Fast N-body method stars of the Andromeda galaxy O(N)

22 Treecode & fast multipole method: reduces the operation count from O(N^2) to O(N log N) or O(N). f(y_j) = Σ_{i=1}^{N} c_i K(y_j − x_i), for j ∈ [1...N]. [Tree diagram: root, level 1, ..., leaf level.] Image: A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems, Rio Yokota, L A Barba. Int. J. High-perf. Comput. Accepted (2011), to appear; preprint arXiv:
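As a point of reference for the operation counts above, here is a minimal sketch (Python/NumPy, assuming an illustrative softened 1/r kernel; not code from the talk) of the direct O(N^2) evaluation that treecodes and the FMM approximate:

```python
import numpy as np

def direct_sum(targets, sources, charges, eps=1e-6):
    """Direct O(N^2) evaluation of f(y_j) = sum_i c_i K(y_j - x_i).

    K is an illustrative softened 1/r kernel; the treecode/FMM replace this
    double loop with tree traversals costing O(N log N) or O(N).
    """
    f = np.zeros(len(targets))
    for j, y in enumerate(targets):                 # loop over N targets
        r = np.linalg.norm(y - sources, axis=1)     # distances to all N sources
        f[j] = np.sum(charges / np.sqrt(r**2 + eps**2))
    return f

# Example: 1,000 random particles acting on themselves.
rng = np.random.default_rng(0)
x = rng.random((1000, 3))
c = rng.random(1000)
print(direct_sum(x, x, c)[:3])
```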

23 Information moves from red (source particles) to blue (target particles): M2M, multipole to multipole (treecode & FMM); P2M, particle to multipole (treecode & FMM); M2L, multipole to local (FMM); M2P, multipole to particle (treecode); L2L, local to local (FMM); L2P, local to particle (FMM); P2P, particle to particle (treecode & FMM). Image: Treecode and fast multipole method for N-body simulation with CUDA, Rio Yokota, Lorena A Barba, Ch. 9 in GPU Computing Gems Emerald Edition, Wen-mei Hwu, ed.; Morgan Kaufmann/Elsevier (2011)

24 [FMM tree diagram: root, level 1, ..., leaf level; P2M and M2M on the way up, M2L across, L2L and L2P on the way down.] A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems, Rio Yokota, L A Barba. Int. J. High-perf. Comput., Jan. 2012. This figure shared under CC-BY on figshare.com
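To make the operator flow on slides 23-24 concrete, here is a structural sketch (a hypothetical Cell class and stub operators, not the ExaFMM code) that only prints the order in which an FMM visits a tiny two-level tree:

```python
class Cell:
    """Minimal tree cell: a level and a list of child cells."""
    def __init__(self, level, children=()):
        self.level = level
        self.children = list(children)

def upward_pass(cell):
    """P2M at the leaves, then M2M towards the root (treecode & FMM)."""
    if not cell.children:
        print(f"P2M  at leaf level {cell.level}")
    else:
        for child in cell.children:
            upward_pass(child)
        print(f"M2M  level {cell.level + 1} -> {cell.level}")

def interaction_pass(cells_by_level):
    """M2L between well-separated cells on the same level (FMM only)."""
    for level, cells in cells_by_level.items():
        print(f"M2L  among {len(cells)} cells at level {level}")

def downward_pass(cell):
    """L2L towards the leaves, then L2P back to the particles (FMM only)."""
    if not cell.children:
        print(f"L2P  at leaf level {cell.level}")
    else:
        print(f"L2L  level {cell.level} -> {cell.level + 1}")
        for child in cell.children:
            downward_pass(child)

# Tiny example tree: one root (level 0) with two leaf children (level 1).
leaves = [Cell(level=1), Cell(level=1)]
root = Cell(level=0, children=leaves)

upward_pass(root)
interaction_pass({1: leaves})
downward_pass(root)
print("P2P  between neighbouring leaf cells (near field, treecode & FMM)")
```

A treecode stops after the upward pass and evaluates M2P/P2P directly at the targets; the FMM adds the M2L interaction pass and the downward pass.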

25 Diversity of N-body problems: e.g., atoms/ions in electrostatic or van der Waals forces; integral formulation of elliptic PDEs: ∇²u = f, u = ∫ G f dΩ; numerical integration.

26 Applications of the FMM (∇²u = f, u = ∫ G f dΩ). Poisson, ∇²u = f: astrophysics, electrostatics, fluid mechanics. Helmholtz, ∇²u + k²u = f: acoustics, electromagnetics. Poisson-Boltzmann, ∇·(ε∇u) + k²u = f: geophysics, biophysics. As a fast mat-vec: accelerates the iterations of Krylov solvers; speeds up Boundary Element Method (BEM) solvers.

27 Progress: 25 years since Greengard & Rokhlin '87. FMM matures in the late '90s and early 2000s. Fourteen Gordon Bell awards for N-body. Variants (kifmm) and several optimizations. Last 5 years: N-body simulation on GPU hardware.

28 FMM matures in the late '90s and early 2000s: first parallel implementations; application to acoustics, scattering; translation theory. Meanwhile, success in treecodes.

29 Parallel FMM. 1990: Greengard & Gropp, shared memory. A parallel adaptive fast multipole method, J. P. Singh, C. Holt, J. L. Hennessy, A. Gupta, Supercomputing '93, Proceedings of the 1993 ACM/IEEE Conference. Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods..., J. P. Singh, C. Holt, T. Totsuka, A. Gupta, J. Hennessy, Journal of Parallel and Distributed Computing, 27(2) (1995). A new parallel kernel-independent fast multipole method, L. Ying, G. Biros, D. Zorin, H. Langston, Supercomputing 2003, ACM/IEEE Conference (Best Student Paper Award).

30 Acoustics, Helmholtz & Maxwell equations. References below by N. Gumerov & R. Duraiswami, Univ. of Maryland: Computation of scattering from N spheres using multipole reexpansion, J. Acoust. Soc. Am., 112(6) (2002); Efficient computation of acoustical scattering from N spheres via the fast multipole method..., J. Acoust. Soc. Am., 116(4): 2528 (2004); Computation of scattering from clusters of spheres using the fast multipole method, J. Acoust. Soc. Am., 117(4) (2005); A broadband fast multipole accelerated boundary element method for the three dimensional Helmholtz equation, J. Acoust. Soc. Am., 125(1) (2009).

31

32 Helmholtz: hard life at high frequencies (Rokhlin): classical translation operators are of no use. Summary of Duraiswami group techniques: level-dependent truncation number; use of rotation-based symmetries and local expansion with special functions; switch to diagonal representations later.

33 14 Gordon Bell awards for N-body Performance 1992 Warren & Salmon, 5 Gflop/s Price/performance 1997 Warren et al., 18 Gflop/s / $1 M Price/performance 2009 Hamada et al., 124 Mflop/s / $1 Performance 2010 Rahimian et al., 0.7 Pflop/s on Jaguar

34 14 Gordon Bell awards for N-body Performance 1992 Warren & Salmon, 5 Gflop/s Price/performance 1997 Warren et al., 18 Gflop/s / $1 M 6200x cheaper Price/performance 2009 Hamada et al., 124 Mflop/s / $1 Performance 2010 Rahimian et al., 0.7 Pflop/s on Jaguar

35 14 Gordon Bell awards for N-body Performance 1992 Warren & Salmon, 5 Gflop/s Price/performance 1997 Warren et al., 18 Gflop/s / $1 M 34x more than Moore's law 6200x cheaper Price/performance 2009 Hamada et al., 124 Mflop/s / $1 Performance 2010 Rahimian et al., 0.7 Pflop/s on Jaguar

36 Largest simulation: 90 billion unknowns. Scale: 256 GPUs of the Lincoln cluster / 196,608 cores of Jaguar. Numerical engine: FMM (kernel-independent version, kifmm).

37 Most recent progress on FMM. The black-box fast multipole method, W. Fong, E. Darve, J. Comput. Phys., 228(23) (2009): separation of variables via Chebyshev interpolation of the kernel; M2L compressed with SVD; any non-oscillatory kernel. Generalized fast multipole method, P.-D. Létourneau, C. Cecka, E. Darve, IOP Conference Series: Materials Science and Engineering, 10(1) (2010): separation of variables via Laplace transform; diagonal M2L; any analytic kernel.
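A minimal numerical illustration (NumPy, a 1/r kernel between two hypothetical well-separated point clusters; not the bbFMM code itself) of why the M2L block compresses so well with an SVD:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two unit clusters of points whose centres are 5 units apart.
sources = rng.random((200, 3))
targets = rng.random((200, 3)) + np.array([5.0, 0.0, 0.0])

# Dense kernel block K[j, i] = 1 / |y_j - x_i| coupling the two clusters.
diff = targets[:, None, :] - sources[None, :, :]
K = 1.0 / np.linalg.norm(diff, axis=2)

# The singular values decay rapidly, so a low-rank factorization of this
# 200 x 200 block captures it to high accuracy with few terms.
s = np.linalg.svd(K, compute_uv=False)
rank = int(np.sum(s / s[0] > 1e-6))
print(f"numerical rank at 1e-6 relative tolerance: {rank} of {len(s)}")
```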

38 N-body simulation on GPU hardware: the algorithmic and hardware speed-ups multiply.

39 Early application of GPUs: direct N-body. 2007, Hamada & Iitaka, CUNbody: distributed source particles among thread blocks, requiring a reduction. 2007, Nyland et al., GPU Gems 3: target particles were distributed, no reduction necessary. 2008, Belleman et al., Kirin code. 2009, Gaburov et al., Sapporo code.

40 FMM on GPU 2008

41 FMM on GPU: multiplying speed-ups (repeated on slides 42-44 as a build). [Plot: time (s) vs. N for Direct (CPU), Direct (GPU), FMM (CPU), and FMM (GPU); L2-norm error (normalized): 10^-4.] Treecode and fast multipole method for N-body simulation with CUDA, R Yokota & L A Barba, Ch. 9 in GPU Computing Gems Emerald Edition, Elsevier/Morgan Kaufmann (2011)

45 New M2L optimizations for GPU. Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units, T. Takahashi, C. Cecka, W. Fong, E. Darve, Int. J. Num. Meth. Eng., 89(1) (Jan. 2012): uses the structure of the FMM to increase the flop:byte ratio; the result is GPU compute-bound rather than memory-bound; full implementation of bbfmm on GPU; M2L achieves ~6x over a 12-core CPU implementation; ~3.2x total over a 12-core CPU implementation. Code available:

46 Advantage of N-body algorithms on GPUs: quantify using the Roofline Model, which shows hardware barriers ("ceilings") on a computational kernel. Components of performance: Computation, Communication, Locality.

47 Performance: Computation. Metric: Gflop/s (double / single precision). Peak achievable if: exploit FMA, etc.; non-divergence (GPU); intra-node parallelism explicit in the algorithm and explicit in the code. Source: ParLab, UC Berkeley

48 Performance: Communication. Metric: GB/s. Peak achievable if optimizations are explicit: prefetching, allocation/usage, stride streams, coalescing on GPU. Source: ParLab, UC Berkeley

49 Performance: Locality. Computation is free; maximize locality to minimize communication (communication has a lower bound). Minimize capacity misses: hardware aids (cache size), optimizations via software (blocking). Minimize conflict misses: associativities, padding. Source: ParLab, UC Berkeley

50 Roofline model. Roofline: An Insightful Visual Performance Model for Multicore Architectures, S. Williams, A. Waterman, D. Patterson, Communications of the ACM, April 2009. Operational intensity = total flop / total byte = Gflop/s / GB/s. [Roofline plot for the NVIDIA C2050 on a log/log scale, built up over slides 50-52: attainable flop/s (Gflop/s) vs. operational intensity (flop/byte), with the peak memory bandwidth slope and the peak floating-point ceilings (single-precision peak with SFU + FMA, and without SFU or FMA).]
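A small helper (Python; the peak numbers below are approximate assumptions for an NVIDIA C2050, not values read off the slide) that evaluates the roofline bound, attainable Gflop/s = min(peak Gflop/s, operational intensity × peak GB/s):

```python
def roofline(op_intensity, peak_gflops=1030.0, peak_gbs=144.0):
    """Attainable Gflop/s for a kernel under the roofline model.

    op_intensity: flop per byte moved to/from DRAM (flop/byte).
    peak_gflops, peak_gbs: assumed single-precision peak and memory
    bandwidth, roughly those of an NVIDIA C2050.
    """
    return min(peak_gflops, op_intensity * peak_gbs)

# A memory-bound kernel (SpMV-like, ~0.25 flop/byte) versus kernels with
# higher arithmetic intensity, such as the fast N-body kernels below.
for oi in (0.25, 4.0, 32.0):
    print(f"OI = {oi:5.2f} flop/byte -> {roofline(oi):7.1f} Gflop/s attainable")
```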

53 Advantage of N-body algorithms on GPUs. [Roofline plot for the NVIDIA C2050 placing kernels by operational intensity: SpMV, stencil and 3-D FFT sit under the memory-bandwidth slope, while the fast N-body cell-cell and particle-particle kernels reach the compute-bound region near the single-precision peak (with SFU + FMA).] Image: Hierarchical N-body simulations with auto-tuning for heterogeneous systems, Rio Yokota, L A Barba. Computing in Science and Engineering (CiSE), 3 January 2012, IEEE Computer Society, doi: /mcse

54 Scalability in many-GPU & many-CPU systems. Our own progress so far: 1) 1 billion unknowns on 512 GPUs (Degima); 2) 32 billion on 32,768 processors of Kraken; 3) 69 billion on 4096 GPUs of Tsubame 2.0, achieving 1 petaflop/s on a turbulence simulation.

55 Degima cluster, Nagasaki Advanced Computing Center

56

57 Lysozyme molecule: charges; mesh discretized with 102,486 boundary elements

58 1000 Lysozyme molecules. Largest calculation: 10,648 molecules, each discretized with 102,486 boundary elements; more than 20 million atoms; 1 billion unknowns; one minute per iteration on 512 GPUs of Degima.

59 Kraken: Cray XT5 system at NICS, Tennessee: 9,408 nodes with 12 CPU cores each and 16 GB memory; peak performance is 1.17 Petaflop/s. #11 in the Top500 (Jun '11 & Nov '11)

60 Weak scaling on Kraken: p = 3, N = 10^6 points per process; parallel efficiency 72% at 32,768 processors; largest run: 32 billion points; time to solution: < 40 s. [Stacked-bar plot of time (s) vs. number of processes, broken down into tree construction, MPI send P2P, MPI send M2L, and the P2P, P2M, M2M, M2L, L2L and L2P kernels.] Image: A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems, Rio Yokota, L A Barba. Int. J. High-perf. Comput. (Jan. 2012)
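For reference, weak-scaling parallel efficiency is the single-process time divided by the time on P processes for fixed work per process; a short sketch (the timings are made-up examples, not the measured Kraken data):

```python
def weak_scaling_efficiency(t_one, t_p):
    """Parallel efficiency under weak scaling: E(P) = T(1) / T(P)."""
    return t_one / t_p

# Hypothetical numbers: 28.8 s on 1 process and 40 s on 32,768 processes
# would correspond to the ~72% efficiency quoted on this slide.
print(f"{weak_scaling_efficiency(28.8, 40.0):.0%}")
```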

61 Tsubame 2.0: 1,408 nodes with 12 CPU cores each, 3 NVIDIA M2050 GPUs, and 54 GB of RAM; a total of 4,224 GPUs; peak performance 2.4 Petaflop/s. #5 in the Top500 (Jun '11 & Nov '11)

62 Weak scaling on Tsubame: 4 million points per process; largest run ~9 billion points, 30 s on 2048 GPUs (slide 63 repeats the plot with this annotation). [Plot of time (sec) vs. number of processes (GPUs), broken down into tree construction, GPU communication, MPI communication, FMM evaluation and local evaluation.] Petascale turbulence simulation using a highly parallel fast multipole method, Rio Yokota, L A Barba, Tetsu Narumi, Kenji Yasuoka. Comput. Phys. Commun., under revision (minor). Preprint arXiv:

64 FMM vs. FFT, weak scaling. [Plot of parallel efficiency vs. number of processes, comparing the FMM with a spectral (FFT-based) method.]

65 Petascale turbulence simulation using the vortex method: 4096^3 grid, 69 billion points, 1 Pflop/s. Energy spectrum well-matched between the spectral method and the vortex method. [Plot of energy spectrum E(k) vs. wavenumber k for the two methods.] Petascale turbulence simulation using a highly parallel fast multipole method, Rio Yokota, L A Barba, Tetsu Narumi, Kenji Yasuoka. Comput. Phys. Commun., under revision (minor). Preprint arXiv:

66 To conclude: hierarchical N-body algorithms are well-suited for achieving exascale. The FMM is a particularly favorable algorithm for the emerging heterogeneous, many-core architectural landscape.

More information