Scripting CUDA (using python, R and MATLAB)
1 Scripting CUDA (using python, R and MATLAB) Ferdinand Jamitzky jamitzky@lrz.de
2 Why parallel programming?
End of the free lunch: Moore's law no longer means faster processors, only more of them.
But beware! 2 x 3 GHz < 6 GHz (cache consistency, multi-threading, etc.)
3 The future is parallel
- Moore's law is still valid: the number of transistors doubles every 2 years
- Clock speed saturates at 3 to 4 GHz
- multi-core processors vs many-core processors
- grid/cloud computing
- clusters
- GPGPUs
(intel 2005)
4 Supercomputer scaling
5 Supercomputer: SMP
SMP machine:
- shared memory
- typically 10s of cores
- threaded programs
- bus interconnect
- in R: library(multicore) and inlined code
Example: gvs1 (128 GB RAM, 16 cores)
Example: uv2/3 ([...] GB RAM, [...] cores)
6 Supercomputer: MPI
Cluster of machines:
- distributed memory
- typically 100s of cores
- message passing interface
- infiniband interconnect
- in R: library(Rmpi) and inlined code
Example: linux MPP cluster (2752 GB RAM, 2752 cores)
Example: SuperMUC (340,000 GB RAM, 155,656 Intel cores)
7 Supercomputer: GPGPU
Graphics card:
- shared memory
- typically 1000s of cores
- CUDA or OpenCL
- on-chip interconnect
- in R: library(gputools) and inlined code
Example: Tesla K20X (6 GB RAM, 2688 threads)
Example: Titan, ORNL ([...] GB RAM, 18,688 GPU cards, 50,233,344 threads)
8 The future is massively parallel
Connection Machine CM-1 (1983)
- 12-D hypercube
- [...] bit cores (AND, OR, NOT)
- Rmax: 20 GFLOP/s
9 The future is massively parallel
JUGENE Blue Gene/P (2007)
- 3-D torus or tree
- [...] bit cores (PowerPC 450)
- Rmax: 222 TFLOP/s
- now: 1 PFLOP/s, [...] cores
10 Levels of Parallelism
- Node Level (e.g. SuperMUC has approx. [...] nodes)
  each node has 2 sockets
- Socket Level
  each socket contains 8 cores
- Core Level
  each core has 16 vector registers
- Vector Level (e.g. lxgp1 GPGPU has 480 vector registers)
- Pipeline Level (how many simultaneous pipelines)
  hyperthreading
- Instruction Level (instructions per cycle)
  out-of-order execution, branch prediction
11 Problems: Access Times
Getting data from:            Getting some food from:
CPU register    1 ns          fridge                10 s
L2 cache        10 ns         microwave             100 s ~ 2 min
memory          80 ns         pizza service         800 s ~ 15 min
network (IB)    200 ns        city mall             2000 s ~ 0.5 h
GPU (PCIe)      [...] ns      mum sends cake        [...] s ~ 1 week
hard disk       [...] ns      grown in own garden   5 Ms ~ 2 months
12 Amdahl's law
Computing time for N processors:
    T(N) = T(1)/N + Tserial + Tcomm * N
Acceleration factor:
    T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2)
small N: T(1)/T(N) ~ N
large N: T(1)/T(N) ~ 1/N
saturation point!
13 Amdahl's law III
    > plot(n,type="l")
    > lines(n/(1+0.01*n),col="red")
    > lines(n/(1+0.01*n+0.001*n**2),col="green")
Tserial = 0.01, Tcomm = 0.001
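The acceleration factor from the Amdahl slides can also be evaluated numerically. The following is a minimal Python sketch, assuming T(1) = 1 and the slide's values Tserial = 0.01 and Tcomm = 0.001; the function names are illustrative and not part of the talk. Setting the derivative of the speedup to zero gives the saturation point N = sqrt(T(1)/Tcomm).

```python
import math


def speedup(n, t_serial=0.01, t_comm=0.001, t1=1.0):
    """Speedup T(1)/T(N) for T(N) = T(1)/N + Tserial + Tcomm*N."""
    return n / (1.0 + (t_serial / t1) * n + (t_comm / t1) * n ** 2)


def saturation_point(t_comm=0.001, t1=1.0):
    """N that maximises the speedup: d/dN = 0 gives N = sqrt(T(1)/Tcomm)."""
    return math.sqrt(t1 / t_comm)


if __name__ == "__main__":
    print("saturation near N = %.1f" % saturation_point())
    for n in (1, 8, 32, 128, 1024):
        # Speedup grows roughly like N at first, then falls off like 1/N.
        print(n, round(speedup(n), 2))
```

Beyond the saturation point, adding processors makes the run slower, because the communication term Tcomm * N dominates.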
14 How are High-Performance Codes constructed?
Traditional Construction of High-Performance Codes:
- C/C++/Fortran
- Libraries
Alternative Construction of High-Performance Codes:
- Scripting for brains
- GPUs for inner loops
Play to the strengths of each programming environment.
15 Hierarchical architecture of hardware vs software
Hardware                                      Software
accelerators (gpus, xeon phi)                 Cuda, intrinsics
in-core vectorisation (avx)                   vectorisation pragmas
multicore nodes (qpi, pci bus)                openmp
strongly coupled nodes (infiniband, 10GE)     MPI
weakly coupled clusters (cloud)               workflow middleware
16 Why Scripting?
Do you:
- want to reuse CUDA code easily (e.g. as a library)?
- want to dynamically determine whether CUDA is available?
- want to use multi-threading (painlessly)?
- want to use MPI (painlessly)?
- want to use loose coupling (grid computing)?
- want dynamic exception handling and fallbacks?
- want dynamic compilation of CUDA code?
If you answered "yes" to one of these questions, you should consider a scripting language.
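Two of the questions above (detecting CUDA at run time and falling back when it is missing) can be sketched in a few lines of Python. This is a minimal sketch, not code from the talk: `load_backend` and `double_all` are hypothetical helper names, and the GPU branch assumes PyCUDA and NumPy are installed and a CUDA device is usable.

```python
def load_backend():
    """Return ("gpu", gpuarray module) if PyCUDA initialises, else ("cpu", None)."""
    try:
        import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
        import pycuda.gpuarray as gpuarray
        return "gpu", gpuarray
    except Exception:
        # ImportError (PyCUDA not installed) or a driver error (no usable device).
        return "cpu", None


def double_all(values):
    """Double every value, on the GPU when available, otherwise on the CPU."""
    backend, gpuarray = load_backend()
    if backend == "gpu":
        import numpy
        data = numpy.asarray(values, dtype=numpy.float32)
        return (2 * gpuarray.to_gpu(data)).get().tolist()
    # Fallback path: plain Python, no CUDA required.
    return [2.0 * v for v in values]


if __name__ == "__main__":
    print(load_backend()[0], double_all([1.0, 2.5]))
```

The same script runs unchanged on a laptop without a GPU and on a GPU node, which is the kind of dynamic fallback the slide has in mind.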
17 Parallel Tools in python, R and MATLAB
          SMP (multicore parallelism)     MPP (massive parallel processing)   GPGPU (CUDA, OpenCL)
R         doMC, doSMP, pnmath, BLAS       doSNOW, doMPI, doRedis              rgpu, gputools
          (no max cores)
python    multiprocessing, futures        parallel python, mpi4py             pycuda, pyopencl
MATLAB    parfor, spmd (max 8 cores)      jobs, pmode                         gpuArray
18 Scripting CUDA CUDA Compiler PGI Fortran python NumbaPro R Interpreter pycuda rgpu MATLAB
19 MATLAB GPU
20 MATLAB GPU
    # load matlab module and start command line version
    module load cuda
    module load matlab/r2011a
    matlab -nodesktop
21 MATLAB gpuArray
Copy data to the GPGPU and return a handle on the object. All operations on the handle are performed on the GPGPU.
    x=rand(100);
    gx=gpuArray(x);
How to compute the GFlop/s:
    tic;
    M=gpuArray(rand(np*1000));
    gather(sum(sum(M*M)));
    2*np^3/toc
22 pycuda
Gives you the following advantages:
1. Combining Two Strong Tools
2. Scripting CUDA
3. Run-Time Code Generation
Special thanks to A. Klöckner
23 LRZ
log in to lxgp1
    $ module load python
    $ module load cuda
    $ module load boost
    $ python
    Python (r261:67515, Apr , 17:25:25) [GCC (SUSE Linux)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
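24 Simple Example
    from numpy import *
    import pycuda.autoinit
    import pycuda.gpuarray as gpu

    # copy a random 4x4 float32 matrix to the GPU
    a_gpu = gpu.to_gpu(random.randn(4,4).astype(float32))
    # multiply on the GPU, then copy the result back to the host
    a_doubled = (2 * a_gpu).get()
    print(a_doubled)
    print(a_gpu)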
25 gpuarray class
pycuda.gpuarray: Meant to look and feel just like numpy.
- gpuarray.to_gpu(numpy_array)
- numpy_array = gpuarray.get()
- +, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product
- Mixed types (int32 + float32 = float64)
- print gpuarray for debugging
- Allows access to raw bits: use as kernel arguments, textures, etc.
26 gpuarray: Elementwise expressions
Avoiding extra store-fetch cycles for elementwise math:
    import numpy.linalg as la
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.curandom import rand as curand

    a_gpu = curand((50,))
    b_gpu = curand((50,))

    from pycuda.elementwise import ElementwiseKernel
    # arguments and operation are passed as C source strings
    lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]")

    c_gpu = gpuarray.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
    assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
27 gpuarray: Reduction made easy
Example: A scalar product calculation
    import numpy
    import pycuda.autoinit
    from pycuda.reduction import ReductionKernel

    dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
                          reduce_expr="a+b", map_expr="x[i]*y[i]",
                          arguments="const float *x, const float *y")

    from pycuda.curandom import rand as curand
    # n: vector length (hypothetical name; the size is missing in this transcription)
    x = curand((n,), dtype=numpy.float32)
    y = curand((n,), dtype=numpy.float32)

    x_dot_y = dot(x, y).get()
    x_dot_y_cpu = numpy.dot(x.get(), y.get())
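What the three expressions of the reduction slide mean can be shown without a GPU. The following is a sequential Python sketch of the map/reduce semantics only; `reduction` is a hypothetical helper, not the PyCUDA API, and a real GPU reduction combines partial results in parallel, which is why the reduce expression should be associative.

```python
from functools import reduce


def reduction(map_expr, reduce_expr, neutral, *arrays):
    """Apply map_expr at every index i, then fold the results with reduce_expr."""
    mapped = (map_expr(i, *arrays) for i in range(len(arrays[0])))
    return reduce(reduce_expr, mapped, neutral)


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# map_expr "x[i]*y[i]", reduce_expr "a+b", neutral "0" give the scalar product.
dot = reduction(lambda i, x, y: x[i] * y[i], lambda a, b: a + b, 0.0, x, y)
print(dot)  # 32.0
```

The neutral element is what an empty reduction returns and what partial sums start from, so it must not change the result (0 for a sum, 1 for a product).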
28 CUDA Kernels in pycuda
    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy
    from pycuda.compiler import SourceModule

    # CUDA C source, compiled at run time
    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
      const int i = threadIdx.x;
      dest[i] = a[i] * b[i];
    }
    """)

    multiply_them = mod.get_function("multiply_them")

    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.zeros_like(a)

    # drv.In / drv.Out copy the host arrays to and from the device
    multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400,1,1))
    print(dest - a*b)
29 Completeness
PyCUDA exposes all of CUDA. For example:
- Arrays and Textures
- Pagelocked host memory
- Memory transfers (asynchronous, structured)
- Streams and Events
- Device queries
- GL Interop
And furthermore:
- Allow interactive use
- Integrate tightly with numpy
30 pycuda showcase
- Agent-based Models
- Computational Visual Neuroscience
- Discontinuous Galerkin Finite Element PDE Solvers
- Estimating the Entropy of Natural Scenes
- Facial Image Database Search
- Filtered Backprojection for Radar Imaging
- LINGO Chemical Similarities
- Recurrence Diagrams
- Sailfish: Lattice Boltzmann Fluid Dynamics
- Selective Embedded Just In Time Specialization
- Simulation of spiking neural networks
31 NumbaPro
Generate CUDA Kernels using a Just-in-time compiler
    from numbapro import cuda

    @cuda.jit('void(float32[:], float32[:], float32[:])')
    def sum(a, b, result):
        i = cuda.grid(1)   # equals to threadIdx.x + blockIdx.x * blockDim.x
        result[i] = a[i] + b[i]

    # Invoke like:
    sum[grid_dim, block_dim](big_input_1, big_input_2, result_array)
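The index arithmetic in the kernel comment (threadIdx.x + blockIdx.x * blockDim.x) can be checked on the CPU. The sketch below enumerates the index each thread of a 1-D launch would receive and picks a grid size that covers n elements; the helper names and the block size of 256 are assumptions for illustration, not values from the talk.

```python
def global_indices(grid_dim, block_dim):
    """Enumerate the index each thread would get from a 1-D grid."""
    return [thread_idx + block_idx * block_dim
            for block_idx in range(grid_dim)
            for thread_idx in range(block_dim)]


def launch_config(n, block_dim=256):
    """Smallest 1-D grid that covers n elements (ceiling division)."""
    return (n + block_dim - 1) // block_dim, block_dim


grid_dim, block_dim = launch_config(1000)
idx = global_indices(grid_dim, block_dim)
print(grid_dim, len(idx), idx[-1])  # 4 1024 1023
```

Because the grid is rounded up, 24 of the 1024 threads here have an index beyond the 1000 elements; a kernel launched this way needs a bounds check such as `if i < n` before touching the arrays.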
32 The Language R
33 R in a nutshell
    module load cuda/2.3
    module load R/serial/2.13
    > x=1:10
    > y=x**2
    > str(y)
    > print(x)
    > times2 = function(x) 2*x
graphics!
    > plot(x,y)
= and <- are interchangeable
34 rgpu
A set of functions for loading data to a GPU and manipulating the data there:
- exportgpu(x)
- evalgpu(x+y)
- lsgpu()
- rmgpu("x")
- sumgpu(x), meangpu(x), gemmgpu(a,b)
- cos, sin, .., +, -, *, /, **, %*%
35 Example
load the correct R module
    $ module load R/serial/2.13
start R
    $ R
    R version ( )
    Copyright (C) 2011 The R Foundation for Statistical Computing
    ISBN
load rgpu library
    > library(rgpu)
    > help(package="rgpu")
    > rgpudetails()
36 Data on the GPGPU
one million random uniform numbers
    > x=runif(1000000)
send data to gpu
    > exportgpu(x)
do some calculations
    > evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x)))
do some timing comparisons (GPU vs CPU):
    > system.time(evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x))))
    > system.time(sum(sin(x)+cos(x)+tan(x)+exp(x)))
37 real world examples: gputools
gputools is a package of precompiled CUDA functions for statistics, linear algebra and machine learning
- chooseGpu, getGpuId()
- gpuCor, gpuaucestimate
- gpuDist, gpuDistClust, gpuHclust, gpuFastICA
- gpuGlm, gpuLm
- gpuGranger, gpuMi
- gpuMatMult, gpuQr, gpuSvd, gpuSolve
- gpuLsfit
- gpuSvmPredict, gpuSvmTrain
- gpuTtest
38 Example: Matrix Inversion
    np <- [...]
    x <- matrix(runif(np**2), np,np)
    system.time(gpuSolve(x))
    system.time(solve(x))
39 Example: Hierarchical Clustering
    numVectors <- 5
    dimension <- 10
    Vectors <- matrix(runif(numVectors*dimension), numVectors, dimension)
    distMat <- gpuDist(Vectors, "euclidean")
    myClust <- gpuHclust(distMat, "single")
    plot(myClust)
for other examples try: example(hclust)
40 Fortran 90 Example
    program myprog
      ! simulate harmonic oscillator
      integer, parameter :: np=1000, nstep=1000
      real :: x(np), v(np), dx(np), dv(np), dt=0.01
      integer :: i,j
      forall(i=1:np) x(i)=i
      forall(i=1:np) v(i)=i
      do j=1,nstep
        dx=v*dt; dv=-x*dt
        x=x+dx; v=v+dv
      end do
      print*, " total energy: ",sum(x**2+v**2)
    end program
41 PGI Compiler
log in to lxgp1
    $ module load fortran/pgi/11.8
    $ pgf90 -o myprog.exe myprog.f90
    $ time ./myprog.exe
exercise for you:
- compute MFlop/s (Floating Point Operations: 4 * np * nstep)
- optimize (hint: -Minfo, -fast, -O3)
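The MFlop/s figure asked for in the exercise is the operation count (4 * np * nstep) divided by the elapsed time in seconds and by 10^6. As a rough sketch of that bookkeeping, the following Python code times a pure-Python version of the same update loop; `mflops` and `oscillator` are illustrative names, and the rate it reports is for interpreted Python, not for the compiled Fortran program.

```python
import time


def mflops(flop_count, seconds):
    """Floating point operations per second, in units of 10^6."""
    return flop_count / seconds / 1.0e6


def oscillator(np_=1000, nstep=1000, dt=0.01):
    """Pure-Python version of the harmonic oscillator update loop."""
    x = [float(i) for i in range(1, np_ + 1)]
    v = list(x)
    for _ in range(nstep):
        dx = [vi * dt for vi in v]      # 1 multiplication per element
        dv = [-xi * dt for xi in x]     # 1 multiplication per element
        x = [xi + d for xi, d in zip(x, dx)]   # 1 addition per element
        v = [vi + d for vi, d in zip(v, dv)]   # 1 addition per element
    return sum(xi * xi + vi * vi for xi, vi in zip(x, v))


start = time.perf_counter()
energy = oscillator()
elapsed = time.perf_counter() - start
print("total energy: %g" % energy)
print("%.1f MFlop/s" % mflops(4 * 1000 * 1000, elapsed))
```

Comparing this number with the one measured for the compiled program gives a feeling for why the inner loop belongs in Fortran or CUDA while the scripting language only orchestrates.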
42 Fortran 90 Example
    program myprog
      ! simulate harmonic oscillator
      integer, parameter :: np=1000, nstep=1000
      real :: x(np), v(np), dx(np), dv(np), dt=0.01
      integer :: i,j
      forall(i=1:np) x(i)=i
      forall(i=1:np) v(i)=i
      do j=1,nstep
    !$acc region
        dx=v*dt; dv=-x*dt
        x=x+dx; v=v+dv
    !$acc end region
      end do
      print*, " total energy: ",sum(x**2+v**2)
    end program
43 PGI Compiler accelerator
    module load fortran/pgi
    pgf90 -ta=nvidia -o myprog.exe myprog.f90
    time ./myprog.exe
exercise for you:
- compute MFlop/s (Floating Point Operations: 4 * np * nstep)
- optimize (hint: change acc region)
44 Use R as scripting language
R can dynamically load shared objects:
    dyn.load("lib.so")
these functions can then be called via
    .C("fname", args)
    .Fortran("fname", args)
45 R subroutine
    subroutine mysub_cuda(x,v,nstep)
      ! simulate harmonic oscillator
      integer, parameter :: np=[...]
      real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
      integer :: i,j, nstep
      forall(i=1:np) x(i)=real(i)/np
      forall(i=1:np) v(i)=real(i)/np
      do j=1,nstep
        dx=v*dt; dv=-x*dt
        x=x+dx; v=v+dv
      end do
      return
    end subroutine
46 Compile two versions
don't forget to load the modules!
    module unload ccomp fortran
    module load ccomp/pgi/11.8
    module load fortran/pgi/11.8
    module load R/serial/2.13
    pgf90 -shared -fpic -o mysub_host.so mysub_host.f90
    pgf90 -ta=nvidia -shared -fpic -o mysub_cuda.so mysub_cuda.f90
47 Load and run
Load dynamic libraries
    > dyn.load("mysub_host.so"); dyn.load("mysub_cuda.so"); np=[...]
Benchmark
    > system.time(str(.Fortran("mysub_host",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
    total energy: [...]
    List of 3
     $ x    : num [...]
     $ v    : num [...]
     $ nstep: int 1000
    user system elapsed
    > system.time(str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
    total energy: [...]
    List of 3
     $ x    : num [...]
     $ v    : num [...]
     $ nstep: int 1000
    user system elapsed
Acceleration Factor:
    > 26.9/0.83
    [1] 32.40964
48 Matrix Multipl. in FORTRAN
    subroutine mmult(a,b,c,np)
      integer np
      real*8 a(np,np), b(np,np), c(np,np)
      integer i,j, k
      do k=1, np
        forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
      end do
      return
    end subroutine
49 Call FORTRAN from R
    # compile f90 to shared object library
    system("pgf90 -shared -fpic -o mmult.so mmult.f90");
    # dynamically load library
    dyn.load("mmult.so")
    # define multiplication function
    mmult.f <- function(a,b,c)
      .Fortran("mmult", a=a,b=b,c=c, np=as.integer(dim(a)[1]))
50 Call FORTRAN binary
    np=100
    system.time(
      mmult.f(
        a = matrix(numeric(np*np),np,np),
        b = matrix(numeric(np*np)+1.,np,np),
        c = matrix(numeric(np*np)+1.,np,np)
      )
    )
Exercise: make a plot of system time vs matrix dimension
51 PGI accelerator directives
    subroutine mmult(a,b,c,np)
      integer np
      real*8 a(np,np), b(np,np), c(np,np)
      integer i,j, k
      do k=1, np
    !$acc region
        forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
    !$acc end region
      end do
      return
    end subroutine
52 Call FORTRAN from R
    # compile f90 to shared object library
    system("pgf90 -ta=nvidia -shared -fpic -o mmult.so mmult.f90");
    # dynamically load library
    dyn.load("mmult.so")
    # define multiplication function
    mmult.f <- function(a,b,c)
      .Fortran("mmult", a=a,b=b,c=c, np=as.integer(dim(a)[1]))
53 Compute MFlop/s
    print(paste(2.*2.*np**3/[...]/system.time( str(mmult.f(...)) )[[3]]," MFlop/s"))
Exercise: Compare MFlop/s vs dimension for serial and accelerated code
54 Scripting Parallel Execution
[Diagram: implicit vs explicit parallelisation in R (rgpu, jit, pnmath, MKL, doMC, doMPI, doSNOW, doRedis)]
hierarchical parallelisation:
- accelerator: rgpu, pnmath, MKL
- intra-node: jit, doMC, MKL
- intra-cluster: SNOW, MPI, pbdMPI
- inter-cluster: Redis, SNOW
55 foreach package
    # old R code
    alist=list()
    for(i in 1:N) alist[i] <- call(i)
    # for is a language keyword

    # new R foreach
    library(foreach)
    alist <- foreach(i=1:N) %do% call(i)
    # foreach is a function
56 multithreading with R
    # serial execution
    library(foreach)
    foreach(i=1:N) %do% { mmult.f() }

    # thread execution
    library(foreach)
    library(doMC)
    registerDoMC()
    foreach(i=1:N) %dopar% { mmult.f() }
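The Python tools listed earlier in the talk (multiprocessing, futures) follow the same pattern as %do% versus %dopar%: the loop body stays the same and only the executor changes. The sketch below is a Python analogue, not a translation of the R code; `call` is a hypothetical stand-in for the real work, and the worker count of 4 is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def call(i):
    """Stand-in for the real work, e.g. one matrix multiplication."""
    return i * i


def run_serial(n):
    # Analogue of foreach(i=1:N) %do% call(i)
    return [call(i) for i in range(1, n + 1)]


def run_parallel(n, workers=4):
    # Analogue of foreach(i=1:N) %dopar% call(i); map keeps the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call, range(1, n + 1)))


print(run_serial(5) == run_parallel(5))  # True
```

Threads are used here only to keep the sketch runnable anywhere. For CPU-bound pure-Python work the global interpreter lock prevents a speedup, so `ProcessPoolExecutor` (same interface) is the closer counterpart to doMC.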
57 MPI with R
    # serial execution
    library(foreach)
    foreach(i=1:N) %do% { mmult.f() }

    # MPI execution
    library(foreach)
    library(doSNOW)
    registerDoSNOW()
    foreach(i=1:N) %dopar% { mmult.f() }
58 doSNOW
    # R
    > library(doSNOW)
    > cl <- makeSOCKcluster(4)
    > registerDoSNOW(cl)
    > system.time(foreach(i=1:10) %do% sum(runif([...])))
    user system elapsed
    > system.time(foreach(i=1:10) %dopar% sum(runif([...])))
    user system elapsed
59 doMC
    # R
    > library(doMC)
    > registerDoMC(cores=4)
    > system.time(foreach(i=1:10) %do% sum(runif([...])))
    user system elapsed
    > system.time(foreach(i=1:10) %dopar% sum(runif([...])))
    user system elapsed
60 NoSQL databases
Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
Clients are available for C, C++, C#, Objective-C, Clojure, Common Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala, Smalltalk, Tcl
61 doRedis / workers
start redis worker:
    > echo "require('doRedis');redisWorker('jobs')" | R
The workers can be distributed over the internet
    > startredisworkers(100)
62 doRedis
    # R
    > library(doRedis)
    > registerDoRedis("jobs")
    > system.time(foreach(i=1:10) %do% sum(runif([...])))
    user system elapsed
    > system.time(foreach(i=1:10) %dopar% sum(runif([...])))
    user system elapsed
63 MPI-CUDA with R
Using doSNOW and dyn.load with pgifortran:
    library(doSNOW)
    cl=makeCluster(c("gvs1","gvs2"),type="SOCK")
    registerDoSNOW(cl)
    foreach(i=1:2) %dopar% setwd("~/kurse/r_cuda")
    foreach(i=1:2) %dopar% dyn.load("mysub_cuda.so")
    system.time(
      foreach(i=1:4) %dopar%
        str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
64 Big Memory
[Figure: logical setup of a node without shared memory, with shared memory, with file-backed memory, and with network-attached file-backed memory]
65 library(bigmemory)
- shared memory regions for several processes in SMP
- file-backed arrays for several nodes over network file systems
    library(bigmemory)
    x <- as.big.matrix(matrix(runif(1000000), 1000, 1000))
    sum(x[1,1:1000])
PyCUDA and PyUblas: Hybrid HPC in Python made easy
PyCUDA and PyUblas: Hybrid HPC in Python made easy Applied Mathematics, Brown University March 5, 2009 Thanks Jan Hesthaven (Brown) Tim Warburton (Rice) Lucas Wilcox (UT Austin) Akil Narayan (Brown) PyCUDA
More informationPyCUDA. An Introduction
PyCUDA An Introduction Scripting GPUs with PyCUDA Why do Scripting for GPUs? GPUs are everything that scripting languages are not: Highly parallel Very architecture-sensitive Built for maximum FP/memory
More informationPyCUDA. Continued...
PyCUDA Continued... gpuarray Vector Types pycuda.gpuarray.vec All CUDA vector types are supported: float3, int3, long4, etc, Available as numpy data types Field names x, y, z, and w as in CUDA Construct
More informationCSC573: TSHA Introduction to Accelerators
CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures
More informationGPU Programming Languages
GPU Programming Languages Vilhelm Sjöberg April 5, 2010 What s wrong with CUDA? Low-level programs structured by kernels, not data flow. Limited metaprogramming features It s just not Haskell! (Or Python,
More informationMapReduce Locality Sensitive Hashing GPUs. NLP ML Web Andrew Rosenberg
MapReduce Locality Sensitive Hashing GPUs NLP ML Web Andrew Rosenberg Big Data What is Big Data? Data Analysis based on more data than previously considered. Analysis that requires new or different processing
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationGPU Metaprogramming using PyCUDA: Methods & Applications
GPU Metaprogramming using PyCUDA: Methods & Applications Division of Applied Mathematics Brown University Nvidia GTC October 2, 2009 Thanks Tim Warburton (Rice) Jan Hesthaven (Brown) Nicolas Pinto (MIT)
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationArchitecture, Programming and Performance of MIC Phi Coprocessor
Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationIt s a Multicore World. John Urbanic Pittsburgh Supercomputing Center
It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all
More informationIt s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist
It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Moore's Law abandoned serial programming around 2004 Courtesy Liberty Computer Architecture Research Group
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationProgramming Parallel Computers
ICS-E4020 Programming Parallel Computers Jukka Suomela Jaakko Lehtinen Samuli Laine Aalto University Spring 2015 users.ics.aalto.fi/suomela/ppc-2015/ Introduction Modern computers have high-performance
More informationAccelerating Implicit LS-DYNA with GPU
Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,
More informationIntroduction to OpenACC. Shaohao Chen Research Computing Services Information Services and Technology Boston University
Introduction to OpenACC Shaohao Chen Research Computing Services Information Services and Technology Boston University Outline Introduction to GPU and OpenACC Basic syntax and the first OpenACC program:
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationVSC Users Day 2018 Start to GPU Ehsan Moravveji
Outline A brief intro Available GPUs at VSC GPU architecture Benchmarking tests General Purpose GPU Programming Models VSC Users Day 2018 Start to GPU Ehsan Moravveji Image courtesy of Nvidia.com Generally
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationLecture 11: GPU programming
Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationSpeeding up MATLAB Applications Sean de Wolski Application Engineer
Speeding up MATLAB Applications Sean de Wolski Application Engineer 2014 The MathWorks, Inc. 1 Non-rigid Displacement Vector Fields 2 Agenda Leveraging the power of vector and matrix operations Addressing
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationIt s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist
It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Moore's Law abandoned serial programming around 2004 Courtesy Liberty Computer Architecture Research Group
More informationSupporting Data Parallelism in Matcloud: Final Report
Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationProgramming Parallel Computers
ICS-E4020 Programming Parallel Computers Jukka Suomela Jaakko Lehtinen Samuli Laine Aalto University Spring 2016 users.ics.aalto.fi/suomela/ppc-2016/ New code must be parallel! otherwise a computer from
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationAn Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters
An Extension of XcalableMP PGAS Lanaguage for Multi-node Clusters Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku and Mitsuhisa Sato University of Tsukuba 1 Presentation Overview l Introduction
More informationPreparing for Highly Parallel, Heterogeneous Coprocessing
Preparing for Highly Parallel, Heterogeneous Coprocessing Steve Lantz Senior Research Associate Cornell CAC Workshop: Parallel Computing on Ranger and Lonestar May 17, 2012 What Are We Talking About Here?
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationProfiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015
Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationBring your application to a new era:
Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationAn Introduc+on to OpenACC Part II
An Introduc+on to OpenACC Part II Wei Feinstein HPC User Services@LSU LONI Parallel Programming Workshop 2015 Louisiana State University 4 th HPC Parallel Programming Workshop An Introduc+on to OpenACC-
More informationHOKUSAI System. Figure 0-1 System diagram
HOKUSAI System October 11, 2017 Information Systems Division, RIKEN 1.1 System Overview The HOKUSAI system consists of the following key components: - Massively Parallel Computer(GWMPC,BWMPC) - Application
More informationProgramming Models for Multi- Threading. Brian Marshall, Advanced Research Computing
Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows