Scripting CUDA (using python, R and MATLAB)

1 Scripting CUDA (using python, R and MATLAB) Ferdinand Jamitzky jamitzky@lrz.de

2 Why parallel programming?
End of the free lunch: Moore's law no longer means faster processors, only more of them.
But beware! 2 x 3 GHz < 6 GHz (cache consistency, multi-threading, etc.)

3 The future is parallel
Moore's law is still valid: the number of transistors doubles every 2 years.
Clock speed saturates at 3 to 4 GHz.
- multi-core processors vs many-core processors
- grid/cloud computing
- clusters
- GPGPUs
(Intel 2005)

4 Supercomputer scaling

5 Supercomputer: SMP
SMP Machine:
- shared memory
- typically 10s of cores
- threaded programs
- bus interconnect
- in R: library(multicore) and inlined code
Example: gvs1: 128 GB RAM, 16 cores
Example: uv2/3: GB RAM, cores

6 Supercomputer: MPI
Cluster of machines:
- distributed memory
- typically 100s of cores
- message passing interface
- infiniband interconnect
- in R: library(Rmpi) and inlined code
Example: linux MPP cluster: 2752 GB RAM, 2752 cores
Example: SuperMUC: 340,000 GB RAM, 155,656 Intel cores

7 Supercomputer: GPGPU
Graphics Card:
- shared memory
- typically 1000s of cores
- CUDA or OpenCL
- on-chip interconnect
- in R: library(gputools) and inlined code
Example: Tesla K20X: 6 GB RAM, 2688 Threads
Example: Titan ORNL: GB RAM, 18,688 GPU Cards, 50,233,344 Threads

8 The future is massively parallel Connection Machine CM-1 (1983) 12-D Hypercube bit cores (AND, OR, NOT) Rmax: 20 GFLOP/s

9 The future is massively parallel JUGENE Blue Gene/P (2007) 3-D Torus or Tree bit cores (PowerPC 450) Rmax: 222 TFLOP/s now: 1 PFLOP/s cores

10 Levels of Parallelism
- Node Level (e.g. SuperMUC has approx nodes): each node has 2 sockets
- Socket Level: each socket contains 8 cores
- Core Level: each core has 16 vector registers
- Vector Level (e.g. lxgp1 GPGPU has 480 vector registers)
- Pipeline Level (how many simultaneous pipelines): hyperthreading
- Instruction Level (instructions per cycle): out-of-order execution, branch prediction

11 Problems: Access Times
Getting data from: / Getting some food from:
- CPU register: 1 ns / fridge: 10 s
- L2 cache: 10 ns / microwave: 100 s ~ 2 min
- memory: 80 ns / pizza service: 800 s ~ 15 min
- network (ib): 200 ns / city mall: 2000 s ~ 0.5 h
- GPU (PCIe): ns / mum sends cake: s ~ 1 week
- harddisk: ns / grown in own garden: 5 Ms ~ 2 months
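The analogy maps machine time to human time with a fixed factor: 1 ns becomes 10 s, i.e. a factor of 10^10. A minimal sketch of that arithmetic follows, using only the four rows whose values survive in the transcript. The names SCALE, access_ns and human_seconds are illustrative, not from the slides.

```python
# Scale factor implied by the first row of the slide: 1 ns <-> 10 s.
SCALE = 10 ** 10

# Access times in nanoseconds, taken from the rows with both values present.
access_ns = {
    "CPU register": 1,
    "L2 cache": 10,
    "memory": 80,
    "network (ib)": 200,
}


def human_seconds(ns):
    """Convert nanoseconds of machine time to seconds of 'human' time."""
    # Integer arithmetic avoids floating-point rounding: ns * 1e-9 * 1e10.
    return ns * SCALE // 10 ** 9


for name, ns in access_ns.items():
    print(f"{name}: {ns} ns -> {human_seconds(ns)} s")
```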

12 Amdahl's law
Computing time for N processors:
    T(N) = T(1)/N + Tserial + Tcomm * N
Acceleration factor:
    T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2)
small N: T(1)/T(N) ~ N
large N: T(1)/T(N) ~ 1/N
saturation point!
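The saturation point can be made concrete with a short sketch. It assumes Tserial and Tcomm are expressed as fractions of T(1), and uses the values 0.01 and 0.001 from the plot on the following slide. Setting the derivative of the acceleration factor to zero gives Tcomm/T(1) * N^2 = 1, so the peak sits at N = 1/sqrt(Tcomm/T(1)) regardless of Tserial. The function names are illustrative.

```python
import math


def speedup(n, t_serial=0.01, t_comm=0.001):
    """Acceleration factor T(1)/T(N); t_serial and t_comm are fractions of T(1)."""
    return n / (1.0 + t_serial * n + t_comm * n ** 2)


def saturation_point(t_comm=0.001):
    """N where the speedup peaks: the derivative vanishes when t_comm * N^2 = 1."""
    return 1.0 / math.sqrt(t_comm)


n_star = saturation_point()
for n in (1, 8, 32, 128, 1024):
    print(n, round(speedup(n), 2))
print("speedup peaks near N =", round(n_star, 1))
```

Beyond the peak, adding processors makes the run slower, which is the 1/N regime named on the slide.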

13 Amdahl's law III
Tserial=0.01, Tcomm=0.001
    > plot(n,type="l")
    > lines(n/(1+0.01*n),col="red")
    > lines(n/(1+0.01*n+0.001*n**2),col="green")

14 How are High-Performance Codes constructed?
Traditional Construction of High-Performance Codes:
- C/C++/Fortran
- Libraries
Alternative Construction of High-Performance Codes:
- Scripting for brains
- GPUs for inner loops
Play to the strengths of each programming environment.

15 Hierarchical architecture of hardware vs software
- accelerators (gpus, xeon phi): Cuda, intrinsics
- in-core vectorisation (avx): vectorisation pragmas
- multicore nodes (qpi, pci bus): openmp
- strongly coupled nodes (infiniband, 10GE): MPI
- weakly coupled clusters (cloud): workflow middleware

16 Why Scripting?
Do you:
- want to reuse CUDA code easily (e.g. as a library)?
- want to dynamically determine whether CUDA is available?
- want to use multi-threading (painlessly)?
- want to use MPI (painlessly)?
- want to use loose coupling (grid computing)?
- want dynamic exception handling and fallbacks?
- want dynamic compilation of CUDA code?
If you answered "yes" to one of these questions, you should consider a scripting language.
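Two of the questions above (detecting CUDA at run time, falling back on failure) can be sketched in a few lines of Python. This is a minimal sketch, assuming pycuda and numpy are the GPU route. The helper names cuda_available, double_cpu, double_gpu and double are hypothetical, and the GPU branch only runs when pycuda is installed and a device is present.

```python
import importlib.util


def cuda_available():
    """True if pycuda is installed and a context can be created on a device."""
    if importlib.util.find_spec("pycuda") is None:
        return False
    try:
        import pycuda.autoinit  # noqa: F401  (creates a context on import)
    except Exception:
        return False
    return True


def double_cpu(values):
    """Plain-Python fallback."""
    return [2.0 * v for v in values]


def double_gpu(values):
    """GPU route, same operation as the pycuda 'Simple Example' slide."""
    import numpy
    import pycuda.gpuarray as gpu
    a_gpu = gpu.to_gpu(numpy.asarray(values, dtype=numpy.float32))
    return (2 * a_gpu).get().tolist()


def double(values):
    """Use the GPU when possible, otherwise fall back to the CPU."""
    if cuda_available():
        try:
            return double_gpu(values)
        except Exception:
            pass  # any GPU failure falls through to the CPU path
    return double_cpu(values)


print(double([1.0, 2.5]))
```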

17 Parallel Tools in python, R and MATLAB
R:
- SMP (multicore parallelism): doMC, doSMP, pnmath, BLAS (no max cores)
- MPP (massive parallel processing): doSNOW, doMPI, doRedis
- GPGPU (CUDA, OpenCL): rgpu, gputools
python:
- SMP: multiprocessing, futures
- MPP: parallel python, mpi4py
- GPGPU: pycuda, pyopencl
MATLAB:
- SMP: parfor, spmd (max 8 cores)
- MPP: jobs, pmode
- GPGPU: gpuArray

18 Scripting CUDA
[Diagram: routes to CUDA. Compiler: PGI Fortran, NumbaPro (python). Interpreter: pycuda (python), rgpu (R), MATLAB.]

19 MATLAB GPU

20 MATLAB GPU
    # load matlab module and start command line version
    module load cuda
    module load matlab/r2011a
    matlab -nodesktop

21 MATLAB gpuArray
Copy data to the GPGPU and return a handle on the object. All operations on the handle are performed on the GPGPU:
    x=rand(100); gx=gpuArray(x);
How to compute the GFlop/s (the matrix dimension is np*1000, so the 2*(np*1000)^3 operations of the product amount to 2*np^3 GFlop):
    tic; M=gpuArray(rand(np*1000)); gather(sum(sum(M*M))); 2*np^3/toc

22 pycuda
Gives you the following advantages:
1. Combining Two Strong Tools
2. Scripting CUDA
3. Run-Time Code Generation
Special thanks to A. Klöckner

23 LRZ
Log in to lxgp1:
    $ module load python
    $ module load cuda
    $ module load boost
    $ python
    Python (r261:67515, Apr , 17:25:25) [GCC (SUSE Linux)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

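24 Simple Example
    from numpy import *
    import pycuda.autoinit
    import pycuda.gpuarray as gpu
    # copy a 4x4 single-precision array to the GPU
    a_gpu = gpu.to_gpu(random.randn(4,4).astype(float32))
    # multiply on the GPU, then copy the result back to the host
    a_doubled = (2 * a_gpu).get()
    print(a_doubled)
    print(a_gpu)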

25 gpuarray class
pycuda.gpuarray: meant to look and feel just like numpy.
- gpuarray.to_gpu(numpy_array)
- numpy_array = gpuarray.get()
- +, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product
- Mixed types (int32 + float32 = float64)
- print gpuarray for debugging
- Allows access to raw bits: use as kernel arguments, textures, etc.

26 gpuarray: Elementwise expressions
Avoiding extra store-fetch cycles for elementwise math:
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    import numpy.linalg as la
    from pycuda.curandom import rand as curand
    a_gpu = curand((50,))
    b_gpu = curand((50,))
    from pycuda.elementwise import ElementwiseKernel
    # argument list and per-element operation are passed as C source strings
    lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]")
    c_gpu = gpuarray.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
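What "extra store-fetch cycles" means can be illustrated on the CPU. The sketch below is plain Python, not PyCUDA, and contrasts a version that materialises two temporaries with a fused version that touches each element once, which is the pattern the elementwise kernel expresses. Both function names are illustrative.

```python
def lin_comb_unfused(a, x, b, y):
    """Three passes: each temporary is stored and then fetched again."""
    ax = [a * xi for xi in x]            # temporary 1
    by = [b * yi for yi in y]            # temporary 2
    return [p + q for p, q in zip(ax, by)]


def lin_comb_fused(a, x, b, y):
    """One pass: z[i] = a*x[i] + b*y[i], no intermediate arrays."""
    return [a * xi + b * yi for xi, yi in zip(x, y)]


x = [0.5, 1.0, 1.5]
y = [2.0, 4.0, 6.0]
print(lin_comb_fused(5, x, 6, y))
```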

27 gpuarray: Reduction made easy
Example: A scalar product calculation
    import numpy
    import pycuda.autoinit
    from pycuda.reduction import ReductionKernel
    # map_expr is applied per element, reduce_expr combines the mapped values
    dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
                          reduce_expr="a+b", map_expr="x[i]*y[i]",
                          arguments="const float *x, const float *y")
    from pycuda.curandom import rand as curand
    x = curand(( ), dtype=numpy.float32)
    y = curand(( ), dtype=numpy.float32)
    x_dot_y = dot(x, y).get()
    x_dot_y_cpu = numpy.dot(x.get(), y.get())
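The roles of neutral, map_expr and reduce_expr can be shown with a CPU-only emulation. This is a sketch of the semantics in plain Python, not of how PyCUDA implements the reduction on the device. The helper reduction_kernel is hypothetical and takes Python callables where the real class takes C source strings.

```python
from functools import reduce


def reduction_kernel(neutral, reduce_expr, map_expr):
    """Build a function that maps every index i, then folds the mapped values."""
    def run(*arrays):
        n = len(arrays[0])
        mapped = (map_expr(i, *arrays) for i in range(n))
        # neutral is the starting value and also the result for empty input
        return reduce(reduce_expr, mapped, neutral)
    return run


dot = reduction_kernel(
    neutral=0.0,
    reduce_expr=lambda a, b: a + b,          # corresponds to "a+b"
    map_expr=lambda i, x, y: x[i] * y[i],    # corresponds to "x[i]*y[i]"
)

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
x_dot_y = dot(x, y)
print(x_dot_y)
```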

28 CUDA Kernels in pycuda
    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy
    from pycuda.compiler import SourceModule
    # the CUDA C source is compiled at run time
    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
      const int i = threadIdx.x;
      dest[i] = a[i] * b[i];
    }
    """)
    multiply_them = mod.get_function("multiply_them")
    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.zeros_like(a)
    # drv.In / drv.Out copy the host arrays to and from the device
    multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400,1,1))
    print(dest - a*b)

29 Completeness
PyCUDA exposes all of CUDA. For example:
- Arrays and Textures
- Pagelocked host memory
- Memory transfers (asynchronous, structured)
- Streams and Events
- Device queries
- GL Interop
And furthermore:
- Allow interactive use
- Integrate tightly with numpy

30 pycuda showcase
- Agent-based Models
- Computational Visual Neuroscience
- Discontinuous Galerkin Finite Element PDE Solvers
- Estimating the Entropy of Natural Scenes
- Facial Image Database Search
- Filtered Backprojection for Radar Imaging
- LINGO Chemical Similarities
- Recurrence Diagrams
- Sailfish: Lattice Boltzmann Fluid Dynamics
- Selective Embedded Just In Time Specialization
- Simulation of spiking neural networks

31 NumbaPro
Generate CUDA kernels using a just-in-time compiler:
    from numbapro import cuda
    @cuda.jit('void(float32[:], float32[:], float32[:])')
    def sum(a, b, result):
        i = cuda.grid(1)   # equals to threadIdx.x + blockIdx.x * blockDim.x
        result[i] = a[i] + b[i]
    # Invoke like:
    sum[grid_dim, block_dim](big_input_1, big_input_2, result_array)
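The index formula in the comment, and the choice of grid_dim for a given array length, can be checked with a CPU-only emulation. The sketch below runs the "threads" one after another in plain Python. The functions global_index, grid_dim_for, launch and add_kernel are illustrative, and the bounds guard in the kernel is an addition that the slide's kernel does not have.

```python
def global_index(thread_idx, block_idx, block_dim):
    """1-D global thread index: threadIdx.x + blockIdx.x * blockDim.x."""
    return thread_idx + block_idx * block_dim


def grid_dim_for(n, block_dim):
    """Smallest number of blocks with grid_dim * block_dim >= n."""
    return (n + block_dim - 1) // block_dim


def launch(kernel, grid_dim, block_dim, *args):
    """Serial stand-in for a kernel launch: visit every (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(global_index(thread_idx, block_idx, block_dim), *args)


def add_kernel(i, a, b, result):
    # Guard: the last block may contain threads past the end of the array.
    if i < len(result):
        result[i] = a[i] + b[i]


n, block_dim = 1000, 256
a = list(range(n))
b = [2 * v for v in a]
result = [0] * n
launch(add_kernel, grid_dim_for(n, block_dim), block_dim, a, b, result)
print(grid_dim_for(n, block_dim), result[:3], result[-1])
```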

32 The Language R

33 R in a nutshell
    module load cuda/2.3
    module load R/serial/2.13
    > x=1:10
    > y=x**2
    > str(y)
    > print(x)
    > times2 = function(x) 2*x
graphics!
    > plot(x,y)
= and <- are interchangeable

34 rgpu
A set of functions for loading data to a GPU and manipulating the data there:
- exportgpu(x)
- evalgpu(x+y)
- lsgpu()
- rmgpu("x")
- sumgpu(x), meangpu(x), gemmgpu(a,b)
- cos, sin, ..., +, -, *, /, **, %*%

35 Example
Load the correct R module:
    $ module load R/serial/2.13
Start R:
    $ R
    R version ( )
    Copyright (C) 2011 The R Foundation for Statistical Computing
    ISBN
Load the rgpu library:
    > library(rgpu)
    > help(package="rgpu")
    > rgpudetails()

36 Data on the GPGPU
One million random uniform numbers:
    > x=runif( )
Send data to the GPU:
    > exportgpu(x)
Do some calculations:
    > evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x)))
Do some timing comparisons (GPU vs CPU):
    > system.time(evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x))))
    > system.time(sum(sin(x)+cos(x)+tan(x)+exp(x)))

37 real world examples: gputools
gputools is a package of precompiled CUDA functions for statistics, linear algebra and machine learning:
- chooseGpu, getGpuId()
- gpuCor, gpuaucestimate
- gpuDist, gpuDistClust, gpuHclust, gpuFastICA
- gpuGlm, gpuLm
- gpuGranger, gpuMi
- gpuMatMult, gpuQr, gpuSvd, gpuSolve
- gpuLsfit
- gpuSvmPredict, gpuSvmTrain
- gpuTtest

38 Example: Matrix Inversion
    np <-
    x <- matrix(runif(np**2), np, np)
    system.time(gpuSolve(x))
    system.time(solve(x))

39 Example: Hierarchical Clustering
    numvectors <- 5
    dimension <- 10
    vectors <- matrix(runif(numvectors*dimension), numvectors, dimension)
    distmat <- gpuDist(vectors, "euclidean")
    myclust <- gpuHclust(distmat, "single")
    plot(myclust)
For other examples try: example(hclust)

40 Fortran 90 Example
    program myprog
    ! simulate harmonic oscillator
      integer, parameter :: np=1000, nstep=1000
      real :: x(np), v(np), dx(np), dv(np), dt=0.01
      integer :: i,j
      forall(i=1:np) x(i)=i
      forall(i=1:np) v(i)=i
      do j=1,nstep
        dx=v*dt; dv=-x*dt
        x=x+dx; v=v+dv
      end do
      print*, " total energy: ",sum(x**2+v**2)
    end program

41 PGI Compiler
Log in to lxgp1:
    $ module load fortran/pgi/11.8
    $ pgf90 -o myprog.exe myprog.f90
    $ time ./myprog.exe
Exercise for you:
- compute MFlop/s (Floating Point Operations: 4 * np * nstep)
- optimize (hint: -Minfo, -fast, -O3)
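The MFlop/s exercise can be tried without a compiler. The sketch below ports the oscillator loop to plain Python and applies the operation count 4 * np * nstep from the exercise. The names oscillate and mflops are illustrative, the problem size is reduced to keep the run short, and the rate it prints reflects the Python interpreter, not the Fortran program. One useful check: each explicit Euler step multiplies x**2 + v**2 by exactly (1 + dt**2), so the final energy is known in closed form.

```python
import time


def oscillate(np_=1000, nstep=1000, dt=0.01):
    """Explicit Euler steps for np_ independent oscillators; returns total energy."""
    x = [float(i) for i in range(1, np_ + 1)]
    v = [float(i) for i in range(1, np_ + 1)]
    for _ in range(nstep):
        dx = [vi * dt for vi in v]
        dv = [-xi * dt for xi in x]
        x = [xi + d for xi, d in zip(x, dx)]
        v = [vi + d for vi, d in zip(v, dv)]
    return sum(xi * xi + vi * vi for xi, vi in zip(x, v))


def mflops(np_, nstep, seconds):
    """4 floating point operations per element and step, as in the exercise."""
    return 4.0 * np_ * nstep / seconds / 1.0e6


start = time.perf_counter()
energy = oscillate(np_=100, nstep=100)
elapsed = max(time.perf_counter() - start, 1e-9)
print("total energy:", energy)
print("MFlop/s:", round(mflops(100, 100, elapsed), 2))
```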

42 Fortran 90 Example
    program myprog
    ! simulate harmonic oscillator
      integer, parameter :: np=1000, nstep=1000
      real :: x(np), v(np), dx(np), dv(np), dt=0.01
      integer :: i,j
      forall(i=1:np) x(i)=i
      forall(i=1:np) v(i)=i
      do j=1,nstep
    !$acc region
        dx=v*dt; dv=-x*dt
        x=x+dx; v=v+dv
    !$acc end region
      end do
      print*, " total energy: ",sum(x**2+v**2)
    end program

43 PGI Compiler accelerator
    module load fortran/pgi
    pgf90 -ta=nvidia -o myprog.exe myprog.f90
    time ./myprog.exe
Exercise for you:
- compute MFlop/s (Floating Point Operations: 4 * np * nstep)
- optimize (hint: change acc region)

44 Use R as scripting language
R can dynamically load shared objects:
    dyn.load("lib.so")
These functions can then be called via
    .C("fname", args)
    .Fortran("fname", args)

45 R subroutine
    subroutine mysub_cuda(x,v,nstep)
    ! simulate harmonic oscillator
      integer, parameter :: np=
      real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
      integer :: i,j, nstep
      forall(i=1:np) x(i)=real(i)/np
      forall(i=1:np) v(i)=real(i)/np
      do j=1,nstep
        dx=v*dt; dv=-x*dt
        x=x+dx; v=v+dv
      end do
      return
    end subroutine

46 Compile two versions
Don't forget to load the modules!
    module unload ccomp fortran
    module load ccomp/pgi/11.8
    module load fortran/pgi/11.8
    module load R/serial/2.13
    pgf90 -shared -fpic -o mysub_host.so mysub_host.f90
    pgf90 -ta=nvidia -shared -fpic -o mysub_cuda.so mysub_cuda.f90

47 Load and run
Load dynamic libraries:
    > dyn.load("mysub_host.so"); dyn.load("mysub_cuda.so"); np=
Benchmark:
    > system.time(str(.Fortran("mysub_host",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
    total energy:
    total energy:
    List of 3
     $ x : num [1: ] -3.01e e e e e
     $ v : num [1: ] 1.38e e e e e
     $ nstep: int 1000
    user system elapsed
    > system.time(str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
    total energy:
    total energy:
    List of 3
     $ x : num [1: ] -3.01e e e e e
     $ v : num [1: ] 1.38e e e e e
     $ nstep: int 1000
    user system elapsed
Acceleration Factor:
    > 26.9/0.83
    [1]

48 Matrix Multipl. in FORTRAN
    subroutine mmult(a,b,c,np)
      integer np
      real*8 a(np,np), b(np,np), c(np,np)
      integer i,j, k
      do k=1, np
        forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
      end do
      return
    end subroutine

49 Call FORTRAN from R
    # compile f90 to shared object library
    system("pgf90 -shared -fpic -o mmult.so mmult.f90");
    # dynamically load library
    dyn.load("mmult.so")
    # define multiplication function
    mmult.f <- function(a,b,c) .Fortran("mmult", a=a,b=b,c=c, np=as.integer(dim(a)[1]))

50 Call FORTRAN binary
    np=100
    system.time(
      mmult.f(
        a = matrix(numeric(np*np),np,np),
        b = matrix(numeric(np*np)+1.,np,np),
        c = matrix(numeric(np*np)+1.,np,np)
      )
    )
Exercise: make a plot of system time vs matrix dimension

51 PGI accelerator directives
    subroutine mmult(a,b,c,np)
      integer np
      real*8 a(np,np), b(np,np), c(np,np)
      integer i,j, k
      do k=1, np
    !$acc region
        forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
    !$acc end region
      end do
      return
    end subroutine

52 Call FORTRAN from R
    # compile f90 to shared object library
    system("pgf90 -ta=nvidia -shared -fpic -o mmult.so mmult.f90");
    # dynamically load library
    dyn.load("mmult.so")
    # define multiplication function
    mmult.f <- function(a,b,c) .Fortran("mmult", a=a,b=b,c=c, np=as.integer(dim(a)[1]))

53 Compute MFlop/s
    print(paste(2.*2.*np**3/ /system.time( str(mmult.f(...)) )[[3]]," MFlop/s"))
Exercise: Compare MFlop/s vs dimension for serial and accelerated code
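The flop-rate measurement can be sketched in plain Python, with the same loop order as the Fortran mmult subroutine (k outermost, a accumulated in place). The sketch assumes the usual count of 2 * n**3 operations for an n x n product (one multiply and one add per inner iteration), which is also the count used on the MATLAB gpuArray slide. The names mmult and mflops_matmul are illustrative, and the rate printed is that of the Python interpreter.

```python
import time


def mmult(a, b, c):
    """In-place a = a + b*c for square lists of lists, k loop outermost."""
    n = len(a)
    for k in range(n):
        for i in range(n):
            bik = b[i][k]
            row_a, row_c = a[i], c[k]
            for j in range(n):
                row_a[j] += bik * row_c[j]
    return a


def mflops_matmul(n, seconds):
    """Assumed operation count: 2 * n**3 for an n x n matrix product."""
    return 2.0 * n ** 3 / seconds / 1.0e6


n = 40
a = [[0.0] * n for _ in range(n)]
b = [[1.0] * n for _ in range(n)]
c = [[1.0] * n for _ in range(n)]

start = time.perf_counter()
mmult(a, b, c)
elapsed = max(time.perf_counter() - start, 1e-9)
print(round(mflops_matmul(n, elapsed), 2), "MFlop/s")
```

Repeating the measurement for several values of n gives the data for the rate-vs-dimension plot the exercise asks for.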

54 Scripting Parallel Execution
R, implicit: rgpu, jit, pnmath, MKL
R, explicit: doMC, doMPI, doSNOW, doRedis
hierarchical parallelisation:
- accelerator: rgpu, pnmath, MKL
- intra-node: jit, doMC, MKL
- intra-cluster: SNOW, MPI, pbdMPI
- inter-cluster: Redis, SNOW

55 foreach package
    # new R foreach
    library(foreach)
    alist <- foreach(i=1:N) %do% call(i)

    # old R code
    alist=list()
    for(i in 1:N) alist[i] <- call(i)
foreach is a function; for is a language keyword.

56 multithreading with R
    # serial execution
    library(foreach)
    foreach(i=1:N) %do% { mmult.f() }

    # thread execution
    library(foreach)
    library(doMC)
    registerDoMC()
    foreach(i=1:N) %dopar% { mmult.f() }
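The same serial-vs-parallel switch exists in Python's standard library, which the tools overview lists under "futures". The sketch below is an analogue of the %do% / %dopar% pair, not a translation of doMC. The names task, run_serial and run_parallel are illustrative. It uses threads for portability; for CPU-bound work in CPython a ProcessPoolExecutor would typically be needed to see a speedup, because threads share the interpreter lock.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def task(i):
    """Stand-in workload: sum of 10000 uniform random numbers, seeded by i."""
    rng = random.Random(i)
    return sum(rng.random() for _ in range(10000))


def run_serial(n):
    """Analogue of foreach(i=1:n) %do% task(i)."""
    return [task(i) for i in range(1, n + 1)]


def run_parallel(n, workers=4):
    """Analogue of foreach(i=1:n) %dopar% task(i); map keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, range(1, n + 1)))


serial = run_serial(10)
parallel = run_parallel(10)
print(serial == parallel)
```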

57 MPI with R
    # serial execution
    library(foreach)
    foreach(i=1:N) %do% { mmult.f() }

    # MPI execution
    library(foreach)
    library(doSNOW)
    registerDoSNOW()
    foreach(i=1:N) %dopar% { mmult.f() }

58 doSNOW
    # R
    > library(doSNOW)
    > cl <- makeSOCKcluster(4)
    > registerDoSNOW(cl)
    > system.time(foreach(i=1:10) %do% sum(runif( )))
    user system elapsed
    > system.time(foreach(i=1:10) %dopar% sum(runif( )))
    user system elapsed

59 doMC
    # R
    > library(doMC)
    > registerDoMC(cores=4)
    > system.time(foreach(i=1:10) %do% sum(runif( )))
    user system elapsed
    > system.time(foreach(i=1:10) %dopar% sum(runif( )))
    user system elapsed

60 NoSQL databases
Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
Clients are available for C, C++, C#, Objective-C, Clojure, Common Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala, Smalltalk, Tcl.

61 doRedis / workers
Start a redis worker:
    > echo "require('doRedis');redisWorker('jobs')" | R --no-save
The workers can be distributed over the internet:
    > startredisworkers(100)

62 doRedis
    # R
    > library(doRedis)
    > registerDoRedis("jobs")
    > system.time(foreach(i=1:10) %do% sum(runif( )))
    user system elapsed
    > system.time(foreach(i=1:10) %dopar% sum(runif( )))
    user system elapsed

63 MPI-CUDA with R
Using doSNOW and dyn.load with PGI Fortran:
    library(doSNOW)
    cl=makeCluster(c("gvs1","gvs2"),type="sock")
    registerDoSNOW(cl)
    foreach(i=1:2) %dopar% setwd("~/kurse/r_cuda")
    foreach(i=1:2) %dopar% dyn.load("mysub_cuda.so")
    system.time(
      foreach(i=1:4) %dopar%
        str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))

64 Big Memory
[Figure: logical setup of a node in four variants: without shared memory; with shared memory; with file-backed memory; with network-attached file-backed memory. Each panel shows R processes (R), memory (MEM), and, where applicable, disk and network.]

65 library(bigmemory)
- shared memory regions for several processes in SMP
- file-backed arrays for several nodes over network file systems
    library(bigmemory)
    x <- as.big.matrix(matrix(runif( ), 1000, 1000))
    sum(x[1,1:1000])
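The idea behind a file-backed array can be sketched with Python's standard mmap module: the data lives in a file, and any process that maps the same file sees the same values. This is an analogue of the concept, not of the bigmemory API. The helper names and the file name x.bin are illustrative, and the sketch maps the file twice from one process rather than from several.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("d")  # one 8-byte double per array element


def create_backing_file(path, n):
    """Create a zero-filled file large enough for n doubles."""
    with open(path, "wb") as f:
        f.write(b"\x00" * (n * ITEM.size))


def write_value(path, index, value):
    """Map the file read-write and store one element."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
        ITEM.pack_into(m, index * ITEM.size, value)


def read_value(path, index):
    """Map the file read-only and fetch one element."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return ITEM.unpack_from(m, index * ITEM.size)[0]


path = os.path.join(tempfile.mkdtemp(), "x.bin")
create_backing_file(path, 1000)
write_value(path, 3, 2.5)       # one mapping writes ...
value = read_value(path, 3)     # ... a second mapping reads it back
print(value)
```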


More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

NumbaPro CUDA Python. Square matrix multiplication

NumbaPro CUDA Python. Square matrix multiplication NumbaPro Enables parallel programming in Python Support various entry points: Low-level (CUDA-C like) programming language High-level array oriented interface CUDA library bindings Also support multicore

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

Advances of parallel computing. Kirill Bogachev May 2016

Advances of parallel computing. Kirill Bogachev May 2016 Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

High Performance Computing with Python

High Performance Computing with Python High Performance Computing with Python Pawel Pomorski SHARCNET University of Waterloo ppomorsk@sharcnet.ca March 15,2017 Outline Speeding up Python code with NumPy Speeding up Python code with Cython Speeding

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Jos Martin Principal Architect, Parallel Computing Tools jos.martin@mathworks.co.uk 1 2013 The MathWorks, Inc. www.matlabexpo.com Code used in this presentation can be found

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Electronic structure calculations on Thousands of CPU's and GPU's

Electronic structure calculations on Thousands of CPU's and GPU's Electronic structure calculations on Thousands of CPU's and GPU's Emil Briggs, North Carolina State University 1. Outline of real-space Multigrid (RMG) 2. Trends in high performance computing 3. Scalability

More information

High Performance Computing Course Notes HPC Fundamentals

High Performance Computing Course Notes HPC Fundamentals High Performance Computing Course Notes 2008-2009 2009 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Lect. 2: Types of Parallelism

Lect. 2: Types of Parallelism Lect. 2: Types of Parallelism Parallelism in Hardware (Uniprocessor) Parallelism in a Uniprocessor Pipelining Superscalar, VLIW etc. SIMD instructions, Vector processors, GPUs Multiprocessor Symmetric

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

Turbostream: A CFD solver for manycore

Turbostream: A CFD solver for manycore Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware OpenACC Standard Directives for Accelerators Credits http://www.openacc.org/ o V1.0: November 2011 Specification OpenACC, Directives for Accelerators, Nvidia Slideware CAPS OpenACC Compiler, HMPP Workbench

More information

High Performance Computing (HPC) Introduction

High Performance Computing (HPC) Introduction High Performance Computing (HPC) Introduction Ontario Summer School on High Performance Computing Scott Northrup SciNet HPC Consortium Compute Canada June 25th, 2012 Outline 1 HPC Overview 2 Parallel Computing

More information

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Intel Performance Libraries

Intel Performance Libraries Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

GPU Acceleration of Matrix Algebra. Dr. Ronald C. Young Multipath Corporation. fmslib.com

GPU Acceleration of Matrix Algebra. Dr. Ronald C. Young Multipath Corporation. fmslib.com GPU Acceleration of Matrix Algebra Dr. Ronald C. Young Multipath Corporation FMS Performance History Machine Year Flops DEC VAX 1978 97,000 FPS 164 1982 11,000,000 FPS 164-MAX 1985 341,000,000 DEC VAX

More information

Introduction. CSCI 4850/5850 High-Performance Computing Spring 2018

Introduction. CSCI 4850/5850 High-Performance Computing Spring 2018 Introduction CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University What is Parallel

More information

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

Performance of deal.ii on a node

Performance of deal.ii on a node Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions

More information

Software and Performance Engineering for numerical codes on GPU clusters

Software and Performance Engineering for numerical codes on GPU clusters Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3

More information

Numerical Algorithms on Multi-GPU Architectures

Numerical Algorithms on Multi-GPU Architectures Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications

More information

Intel Xeon Phi Coprocessors

Intel Xeon Phi Coprocessors Intel Xeon Phi Coprocessors Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013 Ring Bus on Intel Xeon Phi Example with 8 cores Xeon

More information

OP2 FOR MANY-CORE ARCHITECTURES

OP2 FOR MANY-CORE ARCHITECTURES OP2 FOR MANY-CORE ARCHITECTURES G.R. Mudalige, M.B. Giles, Oxford e-research Centre, University of Oxford gihan.mudalige@oerc.ox.ac.uk 27 th Jan 2012 1 AGENDA OP2 Current Progress Future work for OP2 EPSRC

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

Multi-core Programming: Introduction

Multi-core Programming: Introduction Multi-core Programming: Introduction Timo Lilja January 22, 2009 1 Outline Outline Contents 1 Practical Arrangements 1 2 Multi-core processors 1 2.1 CPUs.................................. 1 2.2 GPUs..................................

More information

Lecture 3: Intro to parallel machines and models

Lecture 3: Intro to parallel machines and models Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University

More information

INTRODUCTION TO OPENACC

INTRODUCTION TO OPENACC INTRODUCTION TO OPENACC Hossein Pourreza hossein.pourreza@umanitoba.ca March 31, 2016 Acknowledgement: Most of examples and pictures are from PSC (https://www.psc.edu/images/xsedetraining/openacc_may2015/

More information

CDA3101 Recitation Section 13

CDA3101 Recitation Section 13 CDA3101 Recitation Section 13 Storage + Bus + Multicore and some exam tips Hard Disks Traditional disk performance is limited by the moving parts. Some disk terms Disk Performance Platters - the surfaces

More information

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D.

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D. Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic

More information

GPU Programming. Ringberg Theorie Seminar 2010

GPU Programming. Ringberg Theorie Seminar 2010 or How to tremendously accelerate your code? Michael Kraus, Christian Konz Max-Planck-Institut für Plasmaphysik, Garching Ringberg Theorie Seminar 2010 Introduction? GPU? GPUs can do more than just render

More information

The Starving CPU Problem

The Starving CPU Problem Or Why Should I Care About Memory Access? Software Architect Continuum Analytics Outline Motivation 1 Motivation 2 3 Computing a Polynomial We want to compute the next polynomial: y = 0.25x 3 + 0.75x²

More information

Introduction to Parallel Computing with CUDA. Oswald Haan

Introduction to Parallel Computing with CUDA. Oswald Haan Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries

More information