Improving Host-GPU Communication with Buffering Schemes
1 Improving Host-GPU Communication with Buffering Schemes Guillermo Marcus University of Heidelberg
2 Overview
- Motivation
- Buffering Schemes
- Converting data in the loop
3 Why
- We know the benefits of double and pooled buffers in DMA transactions. Why not use them with GPUs?
- When using an accelerator, the data format in the GPU and the data format in the application usually do not match.
- For some applications, we do not want to reserve multi-gigabyte buffers of host memory for transfers.
4 Transfers in CUDA
Running on... Device : GeForce GTX 48
Quick Mode
Host to Device Bandwidth, 1 Device(s), Pinned memory
Device to Host Bandwidth, 1 Device(s), Pinned memory
Device to Device Bandwidth, 1 Device(s)
[Table columns: Transfer Size (Bytes), Bandwidth(MB/s); values not preserved]
[bandwidthtest] test results... PASSED
[Figure: CUDA Performance Reference; series: read, write]
5 With data conversions
[Figure: CUDA Performance Reference; series: read, read DP-SP, read AOS-SOA, write, write DP-SP, write AOS-SOA]
- Convert data from double to single precision float (DP-SP)
- Convert data from AOS to SOA
- Now both need to pass the data through the CPU
6 Using Buffering Schemes
[Diagram labels: App, Buffer Manager, Memory, DMA Buffer, Queue, Board, Copy, Translation]
- Provides one or more memory buffers paired with a GPU buffer.
- Implements typical schemes: Chunk, Double, Pooled
7-10 Chunk Buffer
    DMAopsCUDA::board_type device(); // select CUDA device
    // Buffer, including the device memory
    DMAopsCUDA::buffer_type buf1(CHUNK, MAX, sizeof(int));
    // Create the buffer manager
    ChunkBuffer< int, DMAopsCUDA > chunk_buffer_test(device, buf1, buf1.getbuffersize());

    int * data = (int*) malloc(sizeof(int)*MAX);

    chunk_buffer_test.write(data, MAX,,, true, false);
    ...
    chunk_buffer_test.read(check, MAX,,, true, false);
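To make the chunk scheme concrete, here is a minimal host-only sketch of what a chunked write might do internally. It is not the library's implementation: `chunked_write` and `fake_device_copy` are hypothetical names, the staging vector stands in for a small pinned buffer, and `std::memcpy` stands in for the real host-to-device copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a host-to-device copy. Real code would issue a
// CUDA copy from pinned memory here; memcpy keeps the sketch host-only.
inline void fake_device_copy(int* dst, const int* src, std::size_t n) {
    std::memcpy(dst, src, n * sizeof(int));
}

// Moves `count` ints to `device` through one staging buffer of `chunk`
// elements, so only `chunk` elements of staging memory are ever reserved.
// Returns the number of chunk transfers issued.
inline std::size_t chunked_write(const int* host, int* device,
                                 std::size_t count, std::size_t chunk) {
    assert(chunk > 0 && "staging buffer must hold at least one element");
    std::vector<int> staging(chunk);  // stands in for the pinned buffer
    std::size_t transfers = 0;
    for (std::size_t done = 0; done < count; done += chunk) {
        const std::size_t n = std::min(chunk, count - done);
        std::copy(host + done, host + done + n, staging.begin());  // stage
        fake_device_copy(device + done, staging.data(), n);        // transfer
        ++transfers;
    }
    return transfers;
}

// Self-check: 10 elements through a 4-element staging buffer.
inline bool chunked_write_demo() {
    std::vector<int> src(10), dst(10, -1);
    for (std::size_t i = 0; i < src.size(); ++i) src[i] = static_cast<int>(i * i);
    const std::size_t transfers =
        chunked_write(src.data(), dst.data(), src.size(), 4);
    return transfers == 3 && src == dst;
}
```

With 10 elements and a 4-element staging buffer the loop issues three transfers (4, 4 and 2 elements). Staging and transfer are strictly sequential here, which is the limitation the double and pooled schemes address.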
11 Chunk Buffer
[Figures: Chunk Buffer Read Performance; Chunk Buffer Read Performance; series: read, write]
12-13 Double Buffer
    DMAopsCUDA::board_type device(); // CUDA device
    DMAopsCUDA::buffer_type buf2_1(CHUNK, MAX, sizeof(HT));
    // Second buffer, linked to the device memory of buffer 1
    DMAopsCUDA::buffer_type buf2_2(buf2_1);

    DoubleBuffer<HT, DMAopsCUDA > double_buffer_test(device, buf2_1, buf2_2, buf2_1.getbuffersize());
    ...
    double_buffer_test.write(data2, MAX,,, true, false);
    double_buffer_test.read(check2, MAX,,, true, false);
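The idea behind double buffering can be sketched as follows: while one staging buffer is being transferred, the CPU fills the other. This is an illustration under stated assumptions, not the DoubleBuffer class itself: `double_buffered_write` and `fake_device_copy` are hypothetical names, and a `std::async` task stands in for an asynchronous DMA transfer (some toolchains need thread support enabled at link time, e.g. `-pthread`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

// Hypothetical stand-in for an asynchronous host-to-device copy.
inline void fake_device_copy(float* dst, const float* src, std::size_t n) {
    std::memcpy(dst, src, n * sizeof(float));
}

// Ping-pongs between two staging buffers: buffer `cur` is filled while the
// transfer of the other buffer may still be in flight.
inline void double_buffered_write(const float* host, float* device,
                                  std::size_t count, std::size_t chunk) {
    assert(chunk > 0 && "staging buffers must hold at least one element");
    std::vector<float> staging[2] = {std::vector<float>(chunk),
                                     std::vector<float>(chunk)};
    std::future<void> in_flight[2];
    int cur = 0;
    for (std::size_t done = 0; done < count; done += chunk, cur ^= 1) {
        const std::size_t n = std::min(chunk, count - done);
        // A buffer may only be refilled once its previous transfer finished.
        if (in_flight[cur].valid()) in_flight[cur].wait();
        std::copy(host + done, host + done + n, staging[cur].begin());
        float* dst = device + done;
        const float* src = staging[cur].data();
        // Start the transfer and immediately move on to the other buffer.
        in_flight[cur] = std::async(std::launch::async, [dst, src, n] {
            fake_device_copy(dst, src, n);
        });
    }
    for (auto& f : in_flight)
        if (f.valid()) f.wait();  // drain outstanding transfers
}

// Self-check: 1000 floats through two 64-element staging buffers.
inline bool double_buffered_write_demo() {
    std::vector<float> src(1000), dst(1000, -1.0f);
    for (std::size_t i = 0; i < src.size(); ++i) src[i] = 0.5f * static_cast<float>(i);
    double_buffered_write(src.data(), dst.data(), src.size(), 64);
    return src == dst;
}
```

The only synchronization point is the wait before a buffer is reused, so staging chunk k+1 can overlap the transfer of chunk k.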
14 Double Buffer
[Figures: Double Buffer Read Performance; Double Buffer Write Performance; series: read, write]
15-16 Pooled Buffer
    DMAopsCUDA::board_type device(); // CUDA device

    // Create a pool of buffers
    DMAPool< HT, DMAopsCUDA > pool(nbuf, CHUNK, MAX, sizeof(HT));
    PooledBuffer< HT, DMAopsCUDA> pooled_buffer_test(sizeof(HT), board, pool);
    pooled_buffer_test.write(data3, MAX,,, true, false);
    ...
    pooled_buffer_test.read(check3, MAX,,, true, false);
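A pooled scheme generalizes the two-buffer case to a set of staging buffers plus a queue of pending transfers. The sketch below is single-threaded and purely illustrative: `BufferPool`, `Pending` and `pooled_write` are hypothetical names rather than the DMAPool or PooledBuffer API, and a transfer "completes" when it is popped from the queue and copied with `std::memcpy`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <deque>
#include <vector>

// Hypothetical pool of equally sized staging buffers with a free list.
class BufferPool {
public:
    BufferPool(std::size_t nbuf, std::size_t chunk)
        : buffers_(nbuf, std::vector<int>(chunk)) {
        for (std::size_t i = 0; i < nbuf; ++i) free_.push_back(i);
    }
    // Hands out a free buffer id; returns false when all buffers are in use.
    bool acquire(std::size_t& id) {
        if (free_.empty()) return false;
        id = free_.back();
        free_.pop_back();
        return true;
    }
    void release(std::size_t id) { free_.push_back(id); }
    std::vector<int>& at(std::size_t id) { return buffers_[id]; }
    std::size_t chunk() const { return buffers_.empty() ? 0 : buffers_[0].size(); }

private:
    std::vector<std::vector<int>> buffers_;
    std::vector<std::size_t> free_;
};

struct Pending { std::size_t id, offset, n; };  // one queued transfer

// Stages chunks into free pool buffers; when none is free, the oldest
// queued transfer is completed and its buffer returned to the pool.
inline void pooled_write(BufferPool& pool, const int* host, int* device,
                         std::size_t count) {
    assert(pool.chunk() > 0 && "pool needs at least one non-empty buffer");
    std::deque<Pending> queue;
    std::size_t done = 0;
    while (done < count || !queue.empty()) {
        std::size_t id = 0;
        if (done < count && pool.acquire(id)) {
            const std::size_t n = std::min(pool.chunk(), count - done);
            std::copy(host + done, host + done + n, pool.at(id).begin());
            queue.push_back({id, done, n});
            done += n;
        } else {
            const Pending p = queue.front();
            queue.pop_front();
            std::memcpy(device + p.offset, pool.at(p.id).data(), p.n * sizeof(int));
            pool.release(p.id);
        }
    }
}

// Self-check: 10 elements through a pool of two 3-element buffers.
inline bool pooled_write_demo() {
    BufferPool pool(2, 3);
    std::vector<int> src(10), dst(10, -1);
    for (std::size_t i = 0; i < src.size(); ++i) src[i] = static_cast<int>(100 + i);
    pooled_write(pool, src.data(), dst.data(), src.size());
    return src == dst;
}
```

The pinned-memory footprint in this model is the number of buffers times the chunk size, independent of the total transfer size.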
17 Pooled Buffer
[Figures: Pooled Buffers Read Performance; Pooled Buffer Write Performance; series: read, write]
18 Translators
Define how to convert the data types back and forth between the host and the GPU.
    template<class T> class TrNOP {
    public:
        typedef T host_type;
        typedef T board_type;
        inline static void host2board(unsigned int const count, T *in, T *out,
            unsigned int const in_offset, unsigned int const out_offset);
        inline static void board2host(unsigned int const count, T *in, T *out,
            unsigned int const in_offset, unsigned int const out_offset);
    }; // end template
19 Translator DP-SP
    template<typename T1, typename T2> class TrTemplate {
    public:
        typedef T1 host_type;
        typedef T2 board_type;
        inline static void host2board(unsigned int const count, T1 *in, T2 *out,
            unsigned int const in_offset, unsigned int const out_offset);
        inline static void board2host(unsigned int const count, T2 *in, T1 *out,
            unsigned int const in_offset, unsigned int const out_offset);
    }; // end template

    // Implementation of the template
    template<typename T1, typename T2>
    void TrTemplate<T1, T2>::host2board(
        unsigned int const count, T1 *in, T2 *out,
        unsigned int const in_offset, unsigned int const out_offset )
    {
        for(unsigned int i=0;i<count;++i)
            out[out_offset+i] = static_cast<T2>(in[in_offset+i]);
    }

    template<typename T1, typename T2>
    void TrTemplate<T1, T2>::board2host(
        unsigned int const count, T2 *in, T1 *out,
        unsigned int const in_offset, unsigned int const out_offset )
    {
        for(unsigned int i=0;i<count;++i)
            out[out_offset+i] = static_cast<T1>(in[in_offset+i]);
    }
- Performs a static cast between T1 and T2
- There is also an SSE-optimized version for double-float conversion
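As a usage illustration, the following sketch round-trips a value through a cast-based translator of the same shape. `CastTranslator` and `dp_sp_roundtrip_error` are hypothetical names introduced here; the sketch only assumes the static-cast behaviour described on the slide.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical translator with the same shape as the slide's TrTemplate:
// element-wise static_cast between a host type and a board (GPU) type.
template <typename HostT, typename BoardT>
struct CastTranslator {
    typedef HostT host_type;
    typedef BoardT board_type;

    static void host2board(unsigned int count, const HostT* in, BoardT* out,
                           unsigned int in_offset, unsigned int out_offset) {
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset + i] = static_cast<BoardT>(in[in_offset + i]);
    }
    static void board2host(unsigned int count, const BoardT* in, HostT* out,
                           unsigned int in_offset, unsigned int out_offset) {
        for (unsigned int i = 0; i < count; ++i)
            out[out_offset + i] = static_cast<HostT>(in[in_offset + i]);
    }
};

// Converts one double to float and back, returning the absolute error
// introduced by the double -> single precision step.
inline double dp_sp_roundtrip_error(double value) {
    double host_in[1] = {value};
    float board[1] = {0.0f};
    double host_out[1] = {0.0};
    CastTranslator<double, float>::host2board(1, host_in, board, 0, 0);
    CastTranslator<double, float>::board2host(1, board, host_out, 0, 0);
    return std::fabs(host_out[0] - value);
}
```

A value such as 1.5 is exactly representable in single precision and round-trips without error, while 0.1 comes back with a small rounding error. That is the trade-off a DP-SP translator accepts in exchange for halving the transferred bytes.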
20 Double Buffer DP-SP
[Figures: Double Buffer DP-SP Read Performance; Double Buffer DP-SP Write Performance; series: read, read DP-SP, write, write DP-SP]
21 Pooled Buffer DP-SP
[Figures: Pooled Buffer DP-SP Read Performance; Pooled Buffer DP-SP Write Performance; series: read, read DP-SP, write, write DP-SP]
22 Translator AOS-SOA
    // Implementation of the template
    template<typename T1, typename T2>
    void TrAoStoSoA<T1, T2>::host2board(
        unsigned int const count,
        void *in,
        void *out,
        unsigned int const in_offset,
        unsigned int const out_offset )
    {
        //implementation
        host_type * input = static_cast<host_type *>(in);
        board_type * output = static_cast<board_type *>(out);
        unsigned int i = 0;
        while (i<count) {
            output[out_offset+i] = static_cast<T2>(input[in_offset+i].x);
            output[count+out_offset+i] = static_cast<T2>(input[in_offset+i].y);
            output[count*2+out_offset+i] = static_cast<T2>(input[in_offset+i].z);
            output[count*3+out_offset+i] = static_cast<T2>(input[in_offset+i].a);
            ++i;
        }
    }

    template<typename T1, typename T2>
    void TrAoStoSoA<T1, T2>::board2host(
        unsigned int const count,
        void *in,
        void *out,
        unsigned int const in_offset,
        unsigned int const out_offset )
    {
        //implementation
        board_type * input = static_cast<board_type *>(in);
        host_type * output = static_cast<host_type *>(out);
        unsigned int i = 0;
        while (i<count) {
            output[out_offset+i].x = static_cast<T1>(input[in_offset+i]);
            output[out_offset+i].y = static_cast<T1>(input[count + in_offset+i]);
            output[out_offset+i].z = static_cast<T1>(input[count * 2 + in_offset+i]);
            output[out_offset+i].a = static_cast<T1>(input[count * 3 + in_offset+i]);
            ++i;
        }
    }
In our example, each host element holds 4 values of type T1 (floats); they are converted into 4 blocks of floats, one block per field.
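A small worked example may help to see the resulting layout. The sketch below applies the same indexing (field k of element i goes to position k*count + i) to two elements. `Vec4`, `aos_to_soa` and `soa_to_aos` are hypothetical names, and offsets are fixed at zero for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host element with the four fields used on the slide.
struct Vec4 { double x, y, z, a; };

// Array of structures -> structure of arrays: the output holds four
// consecutive blocks of `count` floats (all x, then all y, z and a).
inline std::vector<float> aos_to_soa(const std::vector<Vec4>& in) {
    const std::size_t count = in.size();
    std::vector<float> out(4 * count);
    for (std::size_t i = 0; i < count; ++i) {
        out[i]             = static_cast<float>(in[i].x);
        out[count + i]     = static_cast<float>(in[i].y);
        out[2 * count + i] = static_cast<float>(in[i].z);
        out[3 * count + i] = static_cast<float>(in[i].a);
    }
    return out;
}

// Inverse mapping: gathers field k of element i from position k*count + i.
inline std::vector<Vec4> soa_to_aos(const std::vector<float>& in) {
    const std::size_t count = in.size() / 4;
    std::vector<Vec4> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        out[i].x = static_cast<double>(in[i]);
        out[i].y = static_cast<double>(in[count + i]);
        out[i].z = static_cast<double>(in[2 * count + i]);
        out[i].a = static_cast<double>(in[3 * count + i]);
    }
    return out;
}
```

For the two elements {1, 2, 3, 4} and {5, 6, 7, 8}, `aos_to_soa` yields 1, 5, 2, 6, 3, 7, 4, 8: the x block, then the y, z and a blocks. Note that the block stride is the `count` of the translated chunk, so the layout is per chunk rather than per whole array.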
23 Double Buffer AOS-SOA
[Figures: Double Buffer AOS-SOA Read Performance; Double Buffer AOS-SOA Write Performance; series: read, read AOS, write, write AOS]
24 Pooled Buffer AOS-SOA
[Figures: Pooled Buffer AOS-SOA Read Performance; Pooled Buffer DP-SP Write Performance; series: read, read DP-SP, write, write DP-SP]
25 Conclusions
- We present a way to compose buffering schemes with data transformations using templates
- We reduce the pinned memory needed to perform transfers
- We improve the performance of the transfers in comparison to simple CUDA implementations
Questions?
This work was done with support from the Volkswagen Foundation under the GRACE project.
More informationC++/Java. C++: Hello World. Java : Hello World. Expression Statement. Compound Statement. Declaration Statement
C++/Java HB/C2/KERNELC++/1 Statements HB/C2/KERNELC++/2 Java: a pure object-oriented language. C++: a hybrid language. Allows multiple programming styles. Kernel C++/ a better C: the subset of C++ without
More informationGridKa School 2013: Effective Analysis C++ Templates
GridKa School 2013: Effective Analysis C++ Templates Introduction Jörg Meyer, Steinbuch Centre for Computing, Scientific Data Management KIT University of the State of Baden-Wuerttemberg and National Research
More informationCompile-time factorization
Compile-time factorization [crazy template meta-programming] Vladimir Mirnyy C++ Meetup, 15 Sep. 2015, C Base, Berlin blog.scientificcpp.com Compile-time factorization C++ Meetup, Sep. 2015 1 / 20 The
More informationChapter 3. Fundamental Data Types
Chapter 3. Fundamental Data Types Byoung-Tak Zhang TA: Hanock Kwak Biointelligence Laboratory School of Computer Science and Engineering Seoul National Univertisy http://bi.snu.ac.kr Variable Declaration
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationStructures for Heterogeneous Computing. The Best of Both Worlds: Flexible Data. Integrative Scientific Computing Max Planck Institut Informatik
The Best of Both Worlds: Flexible Data Structures for Heterogeneous Computing Robert Strzodka Integrative Scientific Computing Max Planck Institut Informatik My GPU Programming Challenges 2000: Conjugate
More informationDesign Patterns in C++
Design Patterns in C++ Template metaprogramming Giuseppe Lipari http://retis.sssup.it/~lipari Scuola Superiore Sant Anna Pisa April 6, 2011 G. Lipari (Scuola Superiore Sant Anna) Template metaprogramming
More informationCSE 599 I Accelerated Computing - Programming GPUS. Memory performance
CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth
More informationMartin Kruliš, v
Martin Kruliš 1 GPGPU History Current GPU Architecture OpenCL Framework Example Optimizing Previous Example Alternative Architectures 2 1996: 3Dfx Voodoo 1 First graphical (3D) accelerator for desktop
More information