GPUBenchmark results for tesla2
|
|
- Stewart Bell
- 6 years ago
- Views:
Transcription
1 Benchmark results for tesla2 May 4, 202 Abstract This report shows the Benchmark results obtained on tesla2 on May 4, 202. Contents Introduction 2 Hardware description 3 Transfer speed between hard disk and main memory 3 4 Transfer speed between and main memory 3 5 Transfer speed between two s 5 6 Matrix-matrix multiplication performance 7 7 Matrix-vector multiplication performance 9 Introduction Benchmark has been used to evaluate different aspects of the tesla2 computer. Depending on its hardware architecture and the libraries available when Benchmark was run, some or all of the following aspects will be reported in this document: Transfer speed between hard disk and main memory. Transfer speed between and main memory. Transfer speed between two s. Matrix-matrix multiplication performance. Matrix-vector multiplication performance. The next section describes the hardware characteristics of tesla2. Each one of the remainder sections will focus in one of the previously enumerated performance aspects.
2 Executed on May 4, 202 Benchmark results for tesla2 2 Hardware description This section shows the characteristics of the s and of tesla2. The s available at tesla2 have the next characteristics: cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB bogomips : cache size : 644 KB 2
3 Benchmark results for tesla2 Executed on May 4, 202 bogomips : The Device installed on tesla2 that has been used to perform some of the tests has the next characteristics: Tesla T20 Processor CUDA driver version : 4000 CUDA Runtime version : 4000 Multiprocessors : 4 Global memory (total) : bytes Constant memory (total) : bytes Shared memory per block (total) : 4952 bytes Available registers per block : Threads per block : 24 Max. dimension of a block : 24 x 24 x 64 Max. dimension of a grid : x x Clock rate :.5 GHz 3 Transfer speed between hard disk and main memory To obtain the transfer speed from hard disk to main memory, and from main memory to hard disk, several matrices of floats with different number of rows and columns have been created, and the time required to transfer these matrices from hard disk to main memory, and viceversa, has been measured. In order to obtain a more accurate measure, for each number of rows and columns, ten transfers were carried out in both directions. The median of the results for each case is reported. Table shows the transfer speed obtained for several numbers of rows and columns, from hard disk to main memory, and from main memory to hard disk. Note that the transfer speed is given in Mebibytes per second (MiB/s). These results are reported graphically in Figure. Rows Columns Size Hard disk Main memory (MiB) Main memory (MiB/s) Hard disk (MiB/s) Table : Transfer speed from hard disk to main memory, and from main memory to hard disk, for different matrix sizes A Mebibyte is defined as 2 20 bytes 3
4 Executed on May 4, 202 Benchmark results for tesla Transfer speed (MiB/s) Hard disk to memory Memory to hard disk Size of matrix (MiB) Figure : Transfer speed from hard disk to main memory, and from main memory to hard disk, for different matrix sizes 4 Transfer speed between and main memory To obtain the transfer speed from internal memory to main memory, and from main memory to internal memory, several matrices of floats with different number of rows and columns have been created, and the time required to transfer these matrices from internal memory to main memory, and viceversa, has been measured. In order to obtain a more accurate measure, for each number of rows and columns, ten transfers were carried out in both directions. The median of the results for each case is reported. Table 2 shows the transfer speed obtained for several numbers of rows and columns, from to main memory, and from main memory to. Note that the transfer speed is given in Mebibytes per second (MiB/s) 2. Rows Columns Size Main memory (MiB) Main memory (MiB/s) (MiB/s) Table 2: Transfer speed from to main memory, and from main memory to, for different matrix sizes 2 A Mebibyte is defined as 2 20 bytes 4
5 Benchmark results for tesla2 Executed on May 4, 202 The same tests have been done using padding to allocate the matrices in the internal memory. When padding is applied, it may be necessary to reserve additional storage to ensure that corresponding pointers in any given row will continue to meet the alignment requirements for coalescing. Padding is the recommended allocation method for 2D arrays. Table 3 shows the transfer speed obtained for several numbers of rows and columns, from to main memory, and from main memory to, when padding is in use. Note that the transfer speed is given in Mebibytes per second. Rows Columns Size Main memory (MiB) Main memory (MiB/s) (MiB/s) Table 3: Transfer speed from to main memory, and from main memory to, when using padding, for different matrix sizes Figure 2 shows the transfer speed from to main memory, and from main memory to, with and without padding Transfer speed (MiB/s) > main memory (with padding) Main memory -> (with padding) -> main memory Main memory -> Size of matrix (MiB) Figure 2: Transfer speed from to main memory, and from main memory to, with and without padding, for different matrix sizes 5
6 Executed on May 4, 202 Benchmark results for tesla2 5 Transfer speed between two s To obtain the transfer speed between two s using Direct or via main memory, several matrices of floats with different number of rows and columns have been created, and the time required to transfer these matrices from one to another, both using Direct and via main memory, has been measured. In order to obtain a more accurate measure, for each number of rows and columns, ten transfers were carried out in both directions. The median of the results for each case is reported. Table 4 shows the transfer speed obtained for several numbers of rows and columns, from one to another both using Direct and via main memory. Note that the transfer speed is given in Mebibytes per second (MiB/s) 3. These results are reported graphically in Figure 3. Rows Columns Size (MiB) 2 (MiB/s) MM 2 (MiB/s) Table 4: Transfer speed from one to another using Direct and via main memory Transfer speed (MiB/s) >2 ->MM-> Size of matrix (MiB) Figure 3: Transfer speed from one to another using Direct and via main memory 3 A Mebibyte is defined as 2 20 bytes 6
7 Benchmark results for tesla2 Executed on May 4, Matrix-matrix multiplication performance This section shows both the time required and the GFLOPS obtained by the and the on tesla2 when computing the operation: C = αab + βc, where A R m k, B R k n, C R m n, and α and β are scalars. This operation has been performed on the by calling the CBLAS cblas_sgemm() function, and on the by calling the CUBLAS cublassgemm() function. Different m, n, and k values have been used. For each value of m, the n and k values have been obtained as 25, 50, 75, and 0% of m. In order to obtain a more accurate measure, for each combination of m, n, and k, ten operations have been carried on in both and. Then, the median of the outcomes for each case has been computed. The GFLOPS for each combination of m, n, and k has been computed as: 2mnk 9 GF LOP S = time Table 5 shows the time and GFLOPS results obtained for the given m, n, and k values. m n k (ms) (ms) GFLOPS GFLOPS Table 5: Time and GFLOPS for the operation C = αab + βc on the and on the for different m, n and k values Figure 4 shows the time to compute the operation C = αab + βc on the and on the (notice that n and k are 25, 50, 75, and 0% of m). Figure 5 shows the and GFLOPS for these values of m, n and k. 7
8 Executed on May 4, 202 Benchmark results for tesla (a) n and k are 25% of m (b) n and k are 50% of m 0000 e (c) n and k are 75% of m (d) n and k are 0% of m Figure 4: Time to compute the operation C = αab + βc on the and on the 00 GFLOPS GFLOPS n=k=25% m n=k=50% m n=k=75% m n=k=0% m n=k=25% m n=k=50% m n=k=75% m n=k=0% m Figure 5: and GFLOPS for C = αab + βc with several m, n and k values (n and k are 25, 50, 75, and 0% of m) 8
9 Benchmark results for tesla2 Executed on May 4, Matrix-vector multiplication performance This section shows both the time required and the GFLOPS obtained by the and the on tesla2 when computing the operation: y = αax + βy, where A R m n, x R n, y R m, and α and β are scalars. This operation has been performed on the by calling the CBLAS cblas_sgemv() function, and on the by calling the CUBLAS cublassgemv function. Different m and n values have been used. For each value of m, the n values have been obtained as 25, 50, 75, and 0% of m. In order to obtain a more accurate measure, for each combination of m and n, ten operations have been carried on in both and. Then, the median of the outcomes for each case has been computed. The GFLOPS for each combination of m and n has been computed as: GF LOP S = 2mn 9 time Table 6 shows the time and GFLOPS results obtained for the given m and n values. Figure 6 shows the time to compute the operation y = αax + βy on the and on the (notice that n is 25, 50, 75, and 0% of m). Figure 7 shows the and GFLOPS for these values of m and n. 9
10 Executed on May 4, 202 Benchmark results for tesla2 m n (ms) (ms) GFLOPS GFLOPS Table 6: Time and GFLOPS for the operation y = αax + βy on the and on the for different m and n values
11 Benchmark results for tesla2 Executed on May 4, (a) n is 25% of m (b) n is 50% of m (c) n is 75% of m (d) n is 0% of m Figure 6: Time to compute the operation y = αax + βy on the and on the 0 GFLOPS GFLOPS n=25% m n=50% m n=75% m n=0% m n=25% m n=50% m n=75% m n=0% m Figure 7: and GFLOPS for y = αax + βy with several m and n values (n is 25, 50, 75, and 0% of m)
Accelerating GPU Kernels for Dense Linear Algebra
Accelerating GPU Kernels for Dense Linear Algebra Rajib Nath, Stan Tomov, and Jack Dongarra Innovative Computing Lab University of Tennessee, Knoxville July 9, 21 xgemm performance of CUBLAS-2.3 on GTX28
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationAuto-tunable GPU BLAS
Auto-tunable GPU BLAS Jarle Erdal Steinsland Master of Science in Computer Science Submission date: June 2011 Supervisor: Anne Cathrine Elster, IDI Norwegian University of Science and Technology Department
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationHIGH PERFORMANCE NUMERICAL LINEAR ALGEBRA. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA
1 HIGH PERFORMANCE NUMERICAL LINEAR ALGEBRA Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 BLAS BLAS 1, 2, 3 Performance GEMM Optimized BLAS Parallel
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationIntroduction to CUDA
Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations
More informationGREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES. Nikolay Markovskiy Peter Messmer
GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES Nikolay Markovskiy Peter Messmer ABOUT CP2K Atomistic and molecular simulations of solid state From ab initio DFT and Hartree-Fock
More informationDevice Memories and Matrix Multiplication
Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationEFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT
EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time
More informationGPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum
GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationCUDA Memory Model. Monday, 21 February Some material David Kirk, NVIDIA and Wen-mei W. Hwu, (used with permission)
CUDA Memory Model Some material David Kirk, NVIDIA and Wen-mei W. Hwu, 2007-2009 (used with permission) 1 G80 Implementation of CUDA Memories Each thread can: Grid Read/write per-thread registers Read/write
More informationCS 179 Lecture 4. GPU Compute Architecture
CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at
More informationReview. Lecture 10. Today s Outline. Review. 03b.cu. 03?.cu CUDA (II) Matrix addition CUDA-C API
Review Lecture 10 CUDA (II) host device CUDA many core processor threads thread blocks grid # threads >> # of cores to be efficient Threads within blocks can cooperate Threads between thread blocks cannot
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationSparse Linear Algebra in CUDA
Sparse Linear Algebra in CUDA HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 22 nd 2017 Table of Contents Homework - Worksheet 2
More informationAutomatic Development of Linear Algebra Libraries for the Tesla Series
Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source
More informationCSE 160 Lecture 24. Graphical Processing Units
CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC
More informationA study on linear algebra operations using extended precision floating-point arithmetic on GPUs
A study on linear algebra operations using extended precision floating-point arithmetic on GPUs Graduate School of Systems and Information Engineering University of Tsukuba November 2013 Daichi Mukunoki
More informationAutomated Finite Element Computations in the FEniCS Framework using GPUs
Automated Finite Element Computations in the FEniCS Framework using GPUs Florian Rathgeber (f.rathgeber10@imperial.ac.uk) Advanced Modelling and Computation Group (AMCG) Department of Earth Science & Engineering
More informationBLAS: Basic Linear Algebra Subroutines I
BLAS: Basic Linear Algebra Subroutines I Most numerical programs do similar operations 90% time is at 10% of the code If these 10% of the code is optimized, programs will be fast Frequently used subroutines
More informationCUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.
Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication
More informationGPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34
1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions
More informationEffect of memory latency
CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationBLAS: Basic Linear Algebra Subroutines I
BLAS: Basic Linear Algebra Subroutines I Most numerical programs do similar operations 90% time is at 10% of the code If these 10% of the code is optimized, programs will be fast Frequently used subroutines
More informationHiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation
More informationMatrix-Matrix Multiplications on GPUs for Accelerating a Parallel Fluid Dynamics Code
Matrix-Matrix Multiplications on GPUs for Accelerating a Parallel Fluid Dynamics Code by Kenneth Webster A research paper presented to the University of Waterloo in partial fulfillment of the requirement
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationPerformance and power analysis for high performance computation benchmarks
Cent. Eur. J. Comp. Sci. 3(1) 2013 1-16 DOI: 10.2478/s13537-013-0101-5 Central European Journal of Computer Science Performance and power analysis for high performance computation benchmarks Research Article
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance
More informationUniversity of Bielefeld
Geistes-, Natur-, Sozial- und Technikwissenschaften gemeinsam unter einem Dach Introduction to GPU Programming using CUDA Olaf Kaczmarek University of Bielefeld STRONGnet Summerschool 2011 ZIF Bielefeld
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationarxiv: v1 [physics.comp-ph] 25 Feb 2016
GPU performance analysis of a nodal discontinuous Galerin method for acoustic and elastic models arxiv:1602.07997v1 [physics.comp-ph] 25 Feb 2016 A. Modave 1, A. St-Cyr 2, and T. Warburton 1 1 Virginia
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing
More informationCUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer
CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs
More informationCS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter
More informationGPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N.
GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran G. Ruetsch, M. Fatica, E. Phillips, N. Juffa Outline WRF and RRTM Previous Work CUDA Fortran Features RRTM in CUDA
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationHigh-Performance Computing Using GPUs
High-Performance Computing Using GPUs Luca Caucci caucci@email.arizona.edu Center for Gamma-Ray Imaging November 7, 2012 Outline Slide 1 of 27 Why GPUs? What is CUDA? The CUDA programming model Anatomy
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationParallelising Pipelined Wavefront Computations on the GPU
Parallelising Pipelined Wavefront Computations on the GPU S.J. Pennycook G.R. Mudalige, S.D. Hammond, and S.A. Jarvis. High Performance Systems Group Department of Computer Science University of Warwick
More informationAn Evaluation of Unified Memory Technology on NVIDIA GPUs
An Evaluation of Unified Memory Technology on NVIDIA GPUs Wenqiang Li 1, Guanghao Jin 2, Xuewen Cui 1, Simon See 1,3 Center for High Performance Computing, Shanghai Jiao Tong University, China 1 Tokyo
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationParallel Numerics. 1 Data Dependency Graphs & DAGs. Exercise 3: Vector/Vector Operations & P2P Communication II
Technische Universität München WiSe 2014/15 Institut für Informatik Prof. Dr. Thomas Huckle Dipl.-Inf. Christoph Riesinger Sebastian Rettenberger, M.Sc. Parallel Numerics Exercise 3: Vector/Vector Operations
More informationAccelerating Linpack with CUDA on heterogeneous clusters
Accelerating Linpack with CUDA on heterogeneous clusters Massimiliano Fatica NVIDIA Corporation 2701 San Tomas Expressway Santa Clara CA 95050 mfatica@nvidia.com ABSTRACT This paper describes the use of
More informationGPU Profiling and Optimization. Scott Grauer-Gray
GPU Profiling and Optimization Scott Grauer-Gray Benefits of GPU Programming "Free" speedup with new architectures More cores in new architecture Improved features such as L1 and L2 cache Increased shared/local
More informationIntroduction to CUDA (1 of n*)
Administrivia Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Paper presentation due Wednesday, 02/23 Topics first come, first serve Assignment 4 handed today
More informationE6895 Advanced Big Data Analytics Lecture 8: GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 8: GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationTHE USE of graphics processing units (GPUs) in scientific
Proceedings of the International Multiconference on Computer Science and Information Technology pp. 285 291 ISBN 978-83-60810-14-9 ISSN 1896-7094 Testing Tesla Architecture for Scientific Computing: the
More informationImplementing a Speech Recognition System on a GPU using CUDA. Presented by Omid Talakoub Astrid Yi
Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi Outline Background Motivation Speech recognition algorithm Implementation steps GPU implementation strategies
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationGPU Memory Memory issue for CUDA programming
Memory issue for CUDA programming Variable declaration Memory Scope Lifetime device local int LocalVar; local thread thread device shared int SharedVar; shared block block device int GlobalVar; global
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationarxiv: v1 [hep-lat] 29 Oct 2008
arxiv:0810.5365v1 [hep-lat] 29 Oct 2008 Kipton Barros E-mail: kbarros@bu.edu Ronald Babich E-mail: rbabich@bu.edu Richard Brower E-mail: brower@bu.edu Michael A. Clark Center for Computational Science,
More informationAccelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an Autotuned and Algorithmic Approach
University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Masters Theses Graduate School 8-21 Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an
More informationCUDA Architecture & Programming Model
CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New
More informationStudy and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou
Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
More informationHand-Tuned SGEMM on GT200 GPU
Hand-Tuned SGEMM on GT200 GPU Lung-Sheng Chien Department of Mathematics, Tsing Hua university, R.O.C. (Taiwan) d947207@oz.nthu.edu.tw February 2010, version 1.1 Abstract: in this work, we improve SGEMM
More informationCuda Compilation Utilizing the NVIDIA GPU Oct 19, 2017
Cuda Compilation Utilizing the NVIDIA GPU Oct 19, 2017 This document will essentially provide the reader with the understanding on how to use the CUDA 7.0 environment within the Electrical and Computer
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationMulti GPU Performance of Conjugate Gradient Algorithm with Staggered Fermions
of Conjugate Gradient Algorithm with Staggered Fermions, Weonjong Lee Lattice Gauge Theory Research Center, FPRD, and CTP Department of Physics and Astronomy, Seoul National University, Seoul, 151-747,
More informationCUDA Toolkit 4.1 CUSPARSE Library. PG _v01 January 2012
CUDA Toolkit 4.1 CUSPARSE Library PG-05329-041_v01 January 2012 Contents 1 Introduction 2 1.1 New and Legacy CUSPARSE API........................ 2 1.2 Naming Convention................................
More informationS4289: Efficient solution of multiple scalar and block-tridiagonal equations
S4289: Efficient solution of multiple scalar and block-tridiagonal equations Endre László endre.laszlo [at] oerc.ox.ac.uk Oxford e-research Centre, University of Oxford, UK Pázmány Péter Catholic University,
More informationLecture 6 CSE 260 Parallel Computation (Fall 2015) Scott B. Baden. Computing with Graphical Processing Units CUDA Programming Matrix multiplication
Lecture 6 CSE 260 Parallel Computation (Fall 2015) Scott B. Baden Computing with Graphical Processing Units CUDA Programming Matrix multiplication Announcements A2 has been released: Matrix multiplication
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationMultithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University
Multithreaded Architectures and The Sort Benchmark Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University About our Sort Benchmark Based on the benchmark proposed in A measure
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationCUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17
CUDA Lecture 2 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de December 15, 2015 CUDA Programming Fundamentals CUDA
More informationMemory. Lecture 2: different memory and variable types. Memory Hierarchy. CPU Memory Hierarchy. Main memory
Memory Lecture 2: different memory and variable types Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Key challenge in modern computer architecture
More informationHomework 3 (r1.2) Due: Part (A) -- Apr 28, 2017, 11:55pm Part (B) -- Apr 28, 2017, 11:55pm Part (C) -- Apr 28, 2017, 11:55pm
Second Semester, 2016 17 Homework 3 (r1.2) Due: Part (A) -- Apr 28, 2017, 11:55pm Part (B) -- Apr 28, 2017, 11:55pm Part (C) -- Apr 28, 2017, 11:55pm Instruction: Submit your answers electronically through
More informationCSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591: GPU Programming Programmer Interface Klaus Mueller Computer Science Department Stony Brook University Compute Levels Encodes the hardware capability of a GPU card newer cards have higher compute
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationCode Optimizations for High Performance GPU Computing
Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate
More informationLecture 2: different memory and variable types
Lecture 2: different memory and variable types Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 2 p. 1 Memory Key challenge in modern
More informationMassively Parallel Architectures
Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger
More informationComputer Graphics. Coordinate Systems and Change of Frames. Based on slides by Dianna Xu, Bryn Mawr College
Computer Graphics Coordinate Systems and Change of Frames Based on slides by Dianna Xu, Bryn Mawr College Linear Independence A set of vectors independent if is linearly If a set of vectors is linearly
More informationPerformance and accuracy of the matrix multiplication routines : CUBLAS on Nvidia Tesla versus MKL and ATLAS on Intel Nehalem
Performance and accuracy of the matrix multiplication routines : on Nvidia Tesla versus and on Intel Nehalem Philippe Estival, Luc Giraud To cite this version: Philippe Estival, Luc Giraud. Performance
More informationThis is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.
David Kirk/NVIDIA and Wen-mei Hwu, 2006-2008 This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. Please send any comment to dkirk@nvidia.com
More informationIntroduction to GPGPUs and to CUDA programming model
Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries
More informationEvaluation and Tuning of the Level 3 CUBLAS for Graphics Processors
Evaluation and Tuning of the Level 3 CUBLAS for Graphics Processors Sergio Barrachina Maribel Castillo Francisco D. Igual Rafael Mayo Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores
More informationSubmission instructions (read carefully): SS17 / Assignment 4 Instructor: Markus Püschel. ETH Zurich
263-2300-00: How To Write Fast Numerical Code Assignment 4: 120 points Due Date: Th, April 13th, 17:00 http://www.inf.ethz.ch/personal/markusp/teaching/263-2300-eth-spring17/course.html Questions: fastcode@lists.inf.ethz.ch
More informationState-of-the-art in Heterogeneous Computing
State-of-the-art in Heterogeneous Computing Guest Lecture NTNU Trond Hagen, Research Manager SINTEF, Department of Applied Mathematics 1 Overview Introduction GPU Programming Strategies Trends: Heterogeneous
More informationGeneral Purpose GPU programming (GP-GPU) with Nvidia CUDA. Libby Shoop
General Purpose GPU programming (GP-GPU) with Nvidia CUDA Libby Shoop 3 What is (Historical) GPGPU? General Purpose computation using GPU and graphics API in applications other than 3D graphics GPU accelerates
More information